ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2016, Vol. 53, Issue (8): 1819-1828. doi: 10.7544/issn1000-1239.2016.20160197

Special Issue: 2016 Special Topic on Frontier Technologies in Data Mining


SparkCRF: A Parallel Implementation of CRFs Algorithm with Spark

Zhu Jizhao 1,2, Jia Yantao 2, Xu Jun 2, Qiao Jianzhong 1, Wang Yuanzhuo 2, Cheng Xueqi 2

  1 (College of Computer Science and Engineering, Northeastern University, Shenyang 110819); 2 (Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190)
  • Online: 2016-08-01

Abstract: Conditional random fields (CRFs) have been successfully applied to many text analysis tasks in natural language processing, such as sequence labeling, Chinese word segmentation, named entity recognition, and relation extraction. Traditional single-node CRF tools face many challenges when processing large-scale texts: a personal computer hits a performance bottleneck, and a single server cannot complete the analysis efficiently. Upgrading server hardware to increase computing capacity is not always feasible because of cost constraints. To tackle these problems, following the idea of "divide and conquer", we design and implement SparkCRF, a distributed CRF implementation that runs in a cluster environment on Apache Spark. We perform three experiments on the NLPCC2015 and the 2nd International Chinese Word Segmentation Bakeoff datasets to evaluate SparkCRF in terms of performance, scalability, and accuracy. The results show that: 1) compared with CRF++, SparkCRF runs almost 4 times faster on our cluster in the sequence labeling task; 2) it scales well as the number of working cores is adjusted; 3) SparkCRF achieves accuracy comparable to state-of-the-art CRF tools such as CRF++ in text analysis tasks.
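The abstract does not spell out how SparkCRF divides the work, so the following is only a minimal sketch of the general "divide and conquer" idea, not the authors' implementation. It assumes a linear-chain CRF whose training objective is a sum over independent sequences, so each partition can compute a partial negative log-likelihood that is then summed. All names (`sequence_nll`, `partition_nll`, `distributed_nll`, the toy `DATA`, `EMIT`, `TRANS`) are hypothetical, and plain Python `map`/`reduce` stands in for a cluster.

```python
import math
from functools import reduce


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def sequence_nll(seq, emit, trans, n_labels):
    """Negative log-likelihood of one labeled sequence under a linear-chain CRF.

    seq is a list of (observation, label) pairs; emit[(obs, label)] and
    trans[(prev_label, label)] are feature weights (missing keys count as 0).
    """
    # Unnormalized score of the gold label path.
    gold, prev = 0.0, None
    for obs, lab in seq:
        gold += emit.get((obs, lab), 0.0)
        if prev is not None:
            gold += trans.get((prev, lab), 0.0)
        prev = lab
    # Forward algorithm in log space to obtain log Z.
    alpha = [emit.get((seq[0][0], y), 0.0) for y in range(n_labels)]
    for obs, _ in seq[1:]:
        alpha = [
            emit.get((obs, y), 0.0)
            + logsumexp([alpha[p] + trans.get((p, y), 0.0) for p in range(n_labels)])
            for y in range(n_labels)
        ]
    return logsumexp(alpha) - gold


def partition(data, k):
    """Divide: split the corpus into k chunks (round-robin)."""
    return [data[i::k] for i in range(k)]


def partition_nll(chunk, emit, trans, n_labels):
    """Work done independently on one partition."""
    return sum(sequence_nll(s, emit, trans, n_labels) for s in chunk)


def distributed_nll(data, emit, trans, n_labels, k):
    """Conquer: map over partitions, then reduce the partial sums."""
    partials = map(lambda c: partition_nll(c, emit, trans, n_labels), partition(data, k))
    return reduce(lambda a, b: a + b, partials, 0.0)


# Toy segmentation-style data (assumption): label 0 = word start, 1 = word continuation.
DATA = [
    [("a", 0), ("b", 1), ("c", 0)],
    [("b", 0), ("a", 1)],
    [("c", 0), ("c", 1), ("a", 0), ("b", 1)],
    [("a", 0)],
]
EMIT = {("a", 0): 0.8, ("b", 1): 0.5, ("c", 0): 0.3, ("c", 1): -0.2}
TRANS = {(0, 1): 0.6, (1, 0): 0.4, (0, 0): -0.1, (1, 1): -0.3}

if __name__ == "__main__":
    print(partition_nll(DATA, EMIT, TRANS, 2))       # single-node total
    print(distributed_nll(DATA, EMIT, TRANS, 2, 3))  # same total from 3 partitions
```

Because the objective decomposes over sequences, the partitioned total matches the single-node total up to floating-point rounding; the same decomposition applies to the gradient. On Spark, the map step would presumably run once per RDD partition, with the reduce step aggregating the partial results on the driver.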

Key words: big data, machine learning, distributed computing, Spark, conditional random fields (CRFs)
