ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2015, Vol. 52 ›› Issue (8): 1794-1805. doi: 10.7544/issn1000-1239.2015.20150252

Special topic: 2015 Artificial Intelligence Techniques for Big Data

• Artificial Intelligence •

Identity Recognition of E-Commerce Commodity Entities in a Big Data Environment

胡亚慧1,2,李石君1,余伟1,杨莎1,3,甘琳1,王凯1,方其庆2   

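  1(Computer School, Wuhan University, Wuhan 430079); 2(Air Force Early Warning Academy, Wuhan 430019); 3(School of Computer Science and Technology, Hankou University, Wuhan 430212) (hyh5800@163.com)
  • Published: 2015-08-01
  • Supported by:
    National Natural Science Foundation of China (61272109); Fundamental Research Funds for the Central Universities (2042014KF0057); Natural Science Foundation of Hubei Province (2014CFB289); Youth Innovation Fund of the Air Force Early Warning Academy (2013ZDJC0101)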

Recognizing the Same Commodity Entities in Big Data

Hu Yahui1,2, Li Shijun1, Yu Wei1, Yang Sha1,3, Gan Lin1, Wang Kai1, Fang Qiqing2

  1(Computer School, Wuhan University, Wuhan 430079); 2(Air Force Early Warning Academy, Wuhan 430019); 3(School of Computer Science and Technology, Hankou University, Wuhan 430212)
  • Online: 2015-08-01

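Abstract (translated from the Chinese original): Finding the same commodity entity in e-commerce data that are multi-source, heterogeneous, autonomous, diverse, and inconsistent is a major current challenge. By analyzing the data characteristics of different platforms, we first build an index model based on commodity attributes and values, construct a global schema graph of commodity attribute-value pairs, and perform schema integration, producing commodity information with a unified schema and high quality. We then apply a hierarchical probabilistic model to measure commodity identity through multi-layer similarity. Finally, we complete commodity entity recognition and output, in normalized and ranked form, the set of commodities that satisfy identity together with their associated attributes. We ran experiments on the Hadoop platform with commodities from 3 B2C e-commerce data sources and compared the results with traditional methods and products; the experimental results demonstrate the feasibility, accuracy, and efficiency of the framework.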
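The schema integration and attribute/value indexing steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the synonym table `ATTRIBUTE_SYNONYMS`, the function names, and the sample records are all hypothetical, and a plain dictionary stands in for the global schema graph.

```python
from collections import defaultdict

# Hypothetical synonym table: platform-specific attribute names mapped to
# one global attribute name. A stand-in for the global schema graph.
ATTRIBUTE_SYNONYMS = {
    "brand": "brand", "manufacturer": "brand",
    "model": "model", "model no.": "model",
    "colour": "color", "color": "color",
}

def normalize_record(record):
    """Map one platform's attribute names and values onto the global schema."""
    normalized = {}
    for name, value in record.items():
        key = ATTRIBUTE_SYNONYMS.get(name.strip().lower())
        if key is not None:  # attributes unknown to the schema are dropped
            normalized[key] = str(value).strip().lower()
    return normalized

def build_index(records):
    """Build an inverted index from (attribute, value) pairs to record ids."""
    index = defaultdict(set)
    for record_id, record in records.items():
        for pair in normalize_record(record).items():
            index[pair].add(record_id)
    return index

# Made-up records from three imaginary platforms (A, B, C).
records = {
    "A-1": {"Brand": "Acme", "Model": "X100", "Colour": "Black"},
    "B-7": {"Manufacturer": "acme", "Model No.": "X100"},
    "C-3": {"brand": "Zenith", "model": "Q7"},
}
index = build_index(records)
```

Looking up `index[("brand", "acme")]` returns the ids of all records that agree on that attribute-value pair after normalization, regardless of how each platform named the attribute.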


Abstract: The rapid growth of big data and e-commerce has changed everyday life, bringing unprecedented convenience. Identifying the same commodity entity across multi-source, heterogeneous, fragmented, diverse, and inconsistent e-commerce data, in order to support better business intelligence, is a valuable and challenging problem. To address it, we analyze the characteristics of Web big data and crawl raw commodity information from different e-commerce platforms; the resulting data are multi-source, heterogeneous, and massive in scale. We then build an index model based on commodity attributes and values, and construct a global schema graph that records each commodity's attributes and values, producing unified, high-quality commodity information for the next step. Next, we measure commodity identity with a multilayer hierarchical probabilistic model in three stages: identifying the candidate commodity set, filtering that set by similarity, and filtering it again by similarity on the special items of the candidate commodities. Finally, we output the set of identical commodities as an inverted index list. We evaluate the method with the Hadoop framework on datasets collected from three mainstream Chinese B2C e-commerce platforms. Experimental results demonstrate the accuracy and effectiveness of our method.

Key words: Web big data, e-commerce, hierarchical probabilistic model, commodity, Hadoop
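The three-stage matching outlined in the abstract (candidate identification, similarity filtering, and filtering on special items) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's model: Jaccard similarity over attribute-value pairs stands in for the hierarchical probabilistic model, and the blocking key, the `special_keys` tuple, the threshold, and the sample records are all hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def candidate_pairs(records, blocking_key):
    """Layer 1: pair up records that share the same value for blocking_key."""
    blocks = {}
    for record_id, record in records.items():
        if blocking_key in record:
            blocks.setdefault(record[blocking_key], []).append(record_id)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)

def same_commodity_pairs(records, blocking_key="brand",
                         special_keys=("model",), threshold=0.5):
    """Return (id, id, score) for pairs judged identical, best score first."""
    matches = []
    for left, right in candidate_pairs(records, blocking_key):
        a, b = records[left], records[right]
        # Layer 2: overall similarity of attribute-value pairs
        # (Jaccard here is an assumed stand-in for the probabilistic score).
        score = jaccard(set(a.items()), set(b.items()))
        if score < threshold:
            continue
        # Layer 3: the distinguishing ("special") attributes must agree.
        if any(a.get(key) != b.get(key) for key in special_keys):
            continue
        matches.append((left, right, score))
    return sorted(matches, key=lambda match: match[2], reverse=True)

# Made-up, already normalized records from imaginary platforms.
records = {
    "A-1": {"brand": "acme", "model": "x100", "color": "black"},
    "B-7": {"brand": "acme", "model": "x100"},
    "C-3": {"brand": "acme", "model": "x200", "color": "black"},
    "D-9": {"brand": "zenith", "model": "x100"},
}
result = same_commodity_pairs(records)
```

In this toy data, `A-1` and `C-3` share two of four attribute-value pairs and so pass the overall similarity filter, but they are rejected in the last layer because their `model` values differ. Only `A-1` and `B-7` remain as the same commodity.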

CLC number: