Abstract:
Data inconsistency is pervasive on the Web and seriously degrades the quality of Web information. Existing research on data inconsistency has focused mainly on traditional database applications; the consistency of Web big data, which is diverse, complex, rapidly changing and abundant, has received little attention. To address multi-source heterogeneous Web data and the 5V features of big data, we present a unified data extraction algorithm and a Web object data model based on three aspects: website structure, characteristic data and knowledge rules. We study and categorize the features of data inconsistency, and establish an inconsistency classifier model, an inconsistency constraint mechanism and an inconsistency inference algebra computing system. Building on this theoretical system for cross-source Web data consistency, we investigate methods for automatically discovering inconsistent Web data via constraint rule detection and statistical deviation analysis. Combining the characteristics of the two methods, we propose an automatic discovery algorithm for inconsistent Web data that uses hierarchical probabilistic judgment and is built on the Hadoop MapReduce architecture. The framework is applied to big data from multiple B2C e-commerce sources on the Hadoop platform and compared with a traditional architecture and other methods. The experimental results demonstrate the accuracy and efficiency of the method.
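
As a rough illustration of the two detection ideas named above (constraint rule detection and statistical deviation analysis), the following is a minimal single-machine sketch in Python. It is not the algorithm of this work and does not use Hadoop MapReduce. The records, field names, rule names and the modified z-score threshold of 3.5 are all hypothetical choices made for the example.

```python
from statistics import median

# Hypothetical cross-source records: one product as listed by several B2C sites.
records = [
    {"source": "site_a", "price": 199.0, "stock": 12},
    {"source": "site_b", "price": 201.0, "stock": 8},
    {"source": "site_c", "price": 198.0, "stock": -5},
    {"source": "site_d", "price": 950.0, "stock": 3},
    {"source": "site_e", "price": 200.0, "stock": 20},
]

# Hypothetical constraint rules: each maps a rule name to a predicate
# that returns True when the record satisfies the rule.
RULES = {
    "price_positive": lambda r: r["price"] > 0,
    "stock_non_negative": lambda r: r["stock"] >= 0,
}


def violated_rules(record):
    """Constraint rule detection: names of the rules this record breaks."""
    return [name for name, holds in RULES.items() if not holds(record)]


def deviation_outliers(recs, field="price", threshold=3.5):
    """Statistical deviation analysis with the modified z-score.

    A value is flagged when 0.6745 * |x - median| / MAD exceeds the threshold,
    where MAD is the median absolute deviation from the median.
    """
    values = [r[field] for r in recs]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All values (or most of them) coincide; no deviation scale is available.
        return []
    return [
        r["source"]
        for r in recs
        if 0.6745 * abs(r[field] - med) / mad > threshold
    ]


# Sources that break at least one rule, with the rules they break.
rule_hits = {r["source"]: violated_rules(r) for r in records if violated_rules(r)}
# Sources whose price deviates strongly from the cross-source consensus.
price_hits = deviation_outliers(records)
```

In this sketch the rule check flags records that are invalid on their own, while the deviation check flags records that are individually plausible but disagree with the other sources. The two checks are kept separate here; how their outputs are combined into a hierarchical probabilistic judgment is not specified in the abstract and is left out.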