ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2015, Vol. 52, Issue (8): 1794-1805. doi: 10.7544/issn1000-1239.2015.20150252

Special Issue: 2015 Artificial Intelligence Technology for Big Data


Recognizing the Same Commodity Entities in Big Data

Hu Yahui1,2, Li Shijun1, Yu Wei1, Yang Sha1,3, Gan Lin1, Wang Kai1, Fang Qiqing2

  1(Computer School, Wuhan University, Wuhan 430079); 2(Air Force Early Warning Academy, Wuhan 430019); 3(School of Computer Science and Technology, Hankou University, Wuhan 430212)
  • Online: 2015-08-01

Abstract: The recent boom in big data and e-commerce has changed daily life by offering unprecedented convenience. Identifying the same commodity entities across multi-source, heterogeneous, fragmented, and inconsistent e-commerce data, in order to support better business intelligence, is a valuable and challenging problem. To address it, we analyze the characteristics of Web big data and collect raw commodity information crawled from different e-commerce platforms; this data is multi-source, heterogeneous, and massive in scale. We then build an index model based on commodity attributes and values, and construct a global model map that records each commodity's attributes and values, yielding a unified model and efficiently accessible commodity information for the next step. Next, we measure the similarity of commodity identities with a multilayer hierarchical probabilistic model in three stages: identifying the possible candidate commodity set, filtering the candidate set by similarity, and filtering it further by similarity on the special items of the candidate commodities. Finally, we output the set of identical commodities in an inverted index list. We evaluate our method with the Hadoop framework on datasets collected from three mainstream Chinese B2C e-commerce platforms. Experimental results show the accuracy and effectiveness of our method.

Key words: Web big data, e-commerce, hierarchical probabilistic model, commodity, Hadoop
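
The layered pipeline summarized in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract does not specify the probabilistic model, so Jaccard overlap of attribute-value pairs stands in for the similarity score, and the names `build_index`, `candidate_pairs`, `jaccard`, `same_commodities`, the `key_attrs` parameter, the threshold, and the sample records are all hypothetical. The Hadoop distribution used in the paper is omitted.

```python
from collections import defaultdict
from itertools import combinations


def build_index(items):
    """Inverted index: (attribute, value) pair -> ids of items that carry it."""
    index = defaultdict(set)
    for item_id, attrs in items.items():
        for pair in attrs.items():
            index[pair].add(item_id)
    return index


def candidate_pairs(index):
    """Layer 1: items sharing at least one attribute-value pair are candidates."""
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Jaccard overlap of two attribute dicts (assumed stand-in for the paper's score)."""
    sa, sb = set(a.items()), set(b.items())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def same_commodities(items, key_attrs, threshold=0.5):
    """Layers 2 and 3: similarity filter, then exact match on 'special' attributes."""
    matches = []
    for x, y in sorted(candidate_pairs(build_index(items))):
        if jaccard(items[x], items[y]) < threshold:
            continue  # layer 2: overall similarity too low
        if all(items[x].get(k) == items[y].get(k) for k in key_attrs):
            matches.append((x, y))  # layer 3: special items agree
    return matches


# Hypothetical records from three platforms.
items = {
    "shopA:1": {"brand": "acme", "model": "x100", "color": "black"},
    "shopB:7": {"brand": "acme", "model": "x100", "color": "black", "size": "m"},
    "shopC:3": {"brand": "acme", "model": "x200", "color": "black"},
}
print(same_commodities(items, key_attrs=("model",)))
```

In this sketch the inverted index keeps the candidate stage from comparing items that share no attribute values at all, and the final exact match on `key_attrs` plays the role of the "special items" filter, separating listings that look similar overall but differ on a decisive attribute such as the model number.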
