ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2015, Vol. 52, Issue (2): 295-308. doi: 10.7544/issn1000-1239.2015.20140224

Special Issue: 2015 Big Data Management


Automatically Discovering of Inconsistency Among Cross-Source Data Based on Web Big Data

Yu Wei1, Li Shijun1, Yang Sha1,2, Hu Yahui1,3, Liu Jing1, Ding Yonggang1, Wang Qian1

  1 (Computer School, Wuhan University, Wuhan 430079); 2 (College of Computer Science and Technology, Hankou University, Wuhan 430212); 3 (Air Force Early Warning Academy, Wuhan 430070)
  • Online: 2015-02-01

Abstract: Data inconsistency is pervasive on the Web and seriously degrades the quality of Web information. Existing research on data inconsistency has focused mainly on traditional database applications; consistency research is lacking for Web big data, which is diverse, complex, rapidly changing, and abundant. Considering the multi-source, heterogeneous nature of Web data and the 5V features of big data, we present a unified data extraction algorithm and a Web object data model based on three aspects: website structure, characteristic data, and knowledge rules. We study and classify the features of data inconsistency, and establish an inconsistency classifier model, an inconsistency constraint mechanism, and an algebraic computing system for inconsistency inference. Then, building on this theoretical system for cross-source Web data consistency, we investigate methods for automatically discovering inconsistent Web data via constraint rule detection and statistical deviation analysis. Combining the characteristics of the two methods, we propose an algorithm for automatically discovering inconsistent Web data that uses hierarchical probabilistic judgment and runs on the Hadoop MapReduce architecture. The framework is applied to big data from multiple B2C e-commerce sites on the Hadoop platform and compared with a traditional architecture and other methods. The experimental results demonstrate the accuracy and efficiency of the method.
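
As a rough illustration of how constraint rule detection and statistical deviation analysis might be combined in a map/reduce style, the following is a minimal sketch and not the algorithm of the paper. The record layout, the source names, the `price > 0` rule, the use of a median-based (modified z-score) deviation test, and its 3.5 threshold are all assumptions made for the example; the hierarchical probabilistic judgment and the Hadoop deployment are not modeled.

```python
from collections import defaultdict
from statistics import median

# Hypothetical cross-source records: (source, object_id, attribute, value).
RECORDS = [
    ("shopA", "sku-1", "price", 99.0),
    ("shopB", "sku-1", "price", 101.0),
    ("shopC", "sku-1", "price", 100.0),
    ("shopD", "sku-1", "price", 180.0),
    ("shopE", "sku-1", "price", -5.0),
]

# Hypothetical constraint rules, one predicate per attribute.
RULES = {"price": lambda v: v > 0}


def map_phase(records):
    """Group observations by (object, attribute), as a mapper's keys would."""
    groups = defaultdict(list)
    for source, obj, attr, value in records:
        groups[(obj, attr)].append((source, value))
    return groups


def reduce_phase(key, observations, threshold=3.5):
    """Flag sources that break a rule or deviate strongly from the others."""
    _, attr = key
    rule = RULES.get(attr, lambda v: True)
    flagged = {}
    valid = []
    # Step 1: constraint rule detection.
    for source, value in observations:
        if rule(value):
            valid.append((source, value))
        else:
            flagged[source] = "constraint"
    # Step 2: statistical deviation analysis on the rule-abiding values,
    # using the median absolute deviation (assumed test, not from the paper).
    if len(valid) >= 3:
        values = [v for _, v in valid]
        med = median(values)
        mad = median(abs(v - med) for v in values)
        for source, value in valid:
            if mad > 0 and 0.6745 * abs(value - med) / mad > threshold:
                flagged[source] = "deviation"
    return flagged


def discover(records):
    """Run both phases and return flagged sources per (object, attribute)."""
    return {key: reduce_phase(key, obs) for key, obs in map_phase(records).items()}


if __name__ == "__main__":
    print(discover(RECORDS))
```

On the sample data, `shopE` is flagged by the rule check and `shopD` by the deviation test. Running the rule check first keeps clearly invalid values from distorting the median that the deviation test relies on.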

Key words: Web big data, Web data mining, data consistency, Web data management, data quality assessment, cross-source analysis
