SparkCRF: A Parallel Implementation of CRFs Algorithm with Spark

Zhu Jizhao; Jia Yantao; Xu Jun; Qiao Jianzhong; Wang Yuanzhuo; Cheng Xueqi

doi:10.7544/issn1000-1239.2016.20160197

Zhu Jizhao, Jia Yantao, Xu Jun, Qiao Jianzhong, Wang Yuanzhuo, Cheng Xueqi. SparkCRF: A Parallel Implementation of CRFs Algorithm with Spark[J]. Journal of Computer Research and Development, 2016, 53(8): 1819-1828. DOI: 10.7544/issn1000-1239.2016.20160197

Abstract

Conditional random fields (CRFs) have been successfully applied to a range of text analysis tasks in natural language processing, such as sequence labeling, Chinese word segmentation, named entity recognition, and relation extraction. Traditional CRF tools that run on a single node face several challenges when processing large-scale text. On the one hand, a personal computer hits a performance bottleneck; on the other, a single server cannot complete the analysis efficiently, and upgrading the server's hardware to increase its computing capacity is not always feasible because of cost constraints. To address these problems, we follow the idea of “divide and conquer” and design and implement SparkCRF, a distributed CRF implementation that runs in a cluster environment on Apache Spark. We conduct three experiments on the NLPCC2015 and the 2nd International Chinese Word Segmentation Bakeoff datasets to evaluate SparkCRF in terms of performance, scalability, and accuracy. The results show that: 1) compared with CRF++, SparkCRF runs almost 4 times faster on our cluster in the sequence labeling task; 2) it scales well as the number of working cores is adjusted; 3) SparkCRF achieves accuracy comparable to state-of-the-art CRF tools such as CRF++ in text analysis tasks.
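The “divide and conquer” idea can be illustrated with a small sketch. The training objective of a linear-chain CRF is a sum of per-sequence log-likelihoods, so each data partition can compute its partial sum independently and the partial results can then be added together. The code below is not taken from SparkCRF: the function names, the toy data, and the score tables `emit` and `trans` are hypothetical, and plain Python `map` and `reduce` stand in for the corresponding operations on a cluster.

```python
import math
from functools import reduce

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_log_likelihood(tokens, labels, emit, trans, label_set):
    """log p(labels | tokens) for a linear-chain CRF with emission and transition scores."""
    # Unnormalized score of the gold label path.
    gold = sum(emit.get((t, y), 0.0) for t, y in zip(tokens, labels))
    gold += sum(trans.get((a, b), 0.0) for a, b in zip(labels, labels[1:]))
    # Forward recursion in log space for the log partition function.
    alpha = {y: emit.get((tokens[0], y), 0.0) for y in label_set}
    for t in tokens[1:]:
        alpha = {
            y: emit.get((t, y), 0.0)
            + logsumexp([alpha[p] + trans.get((p, y), 0.0) for p in label_set])
            for y in label_set
        }
    return gold - logsumexp(list(alpha.values()))

def partition_log_likelihood(partition, emit, trans, label_set):
    """Partial objective computed by one worker over its own partition."""
    return sum(sequence_log_likelihood(x, y, emit, trans, label_set) for x, y in partition)

def split(data, n):
    """Round-robin split of the data set into n partitions."""
    return [data[i::n] for i in range(n)]

# Hypothetical toy data: characters labeled B (word begin) or I (word inside).
label_set = ["B", "I"]
emit = {("a", "B"): 0.5, ("b", "I"): 0.7, ("c", "B"): 0.2}
trans = {("B", "I"): 0.3, ("I", "B"): 0.4, ("I", "I"): -0.1}
data = [
    (["a", "b", "c"], ["B", "I", "B"]),
    (["c", "b"], ["B", "I"]),
    (["a", "c", "b", "b"], ["B", "B", "I", "I"]),
    (["b", "a"], ["I", "B"]),
]

# Single-node objective: one pass over all sequences.
single = partition_log_likelihood(data, emit, trans, label_set)

# Divide: each partition is processed independently. Conquer: partial sums are added.
partials = map(lambda p: partition_log_likelihood(p, emit, trans, label_set), split(data, 3))
distributed = reduce(lambda a, b: a + b, partials, 0.0)

print(single, distributed)
```

Both values agree up to floating-point rounding, because addition over sequences does not depend on how the sequences are grouped. The gradient of the objective decomposes over sequences in the same way, which is what makes this kind of data-parallel training possible in principle.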
