
    Mining Translations of Out-of-Vocabulary Terms Based on Wikipedia

    • Abstract: Query translation of out-of-vocabulary (OOV) terms is one of the key factors affecting the performance of cross-language information retrieval (CLIR), so mining translations for OOV terms is important for improving CLIR. An OOV term is a word or phrase that cannot be found in the dictionary. In this paper, based on the data structure and language features of Wikipedia, we divide the translation environment into a target-existence environment and a target-deficit environment. To address the difficulty of translation mining in the target-deficit environment, we use frequency-change information and adjacency information to extract candidate units, and we compare this approach with common unit extraction methods; the results show that it is more effective. We then build a hybrid translation mining strategy based on a frequency-distance model, a surface pattern matching model, and a summary-score model, adding the models one at a time to measure the contribution of each. The experiments use a search-engine-based OOV mining technique as the baseline and evaluate results by Top-1 accuracy. The hybrid Wikipedia-based method achieves a translation accuracy of 0.6822, an improvement of 6.98% over the baseline.
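The Top-1 evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each OOV term has a ranked candidate list and a set of accepted reference translations, and that a term counts as correct only when its first-ranked candidate is in that set. The function name `top1_accuracy` and the toy data are hypothetical and are not taken from the paper.

```python
def top1_accuracy(ranked_candidates, references):
    """Return the fraction of terms whose first-ranked candidate is accepted.

    ranked_candidates: dict mapping a term to its candidate translations,
        best candidate first (hypothetical input format).
    references: dict mapping a term to the set of accepted translations.
    """
    if not references:
        raise ValueError("references must not be empty")
    correct = 0
    for term, accepted in references.items():
        candidates = ranked_candidates.get(term, [])
        # Only the top-ranked candidate is considered under Top-1.
        if candidates and candidates[0] in accepted:
            correct += 1
    return correct / len(references)


# Invented toy data, for illustration only.
ranked = {
    "term_a": ["translation_a1", "translation_a2"],
    "term_b": ["wrong_b", "translation_b1"],
    "term_c": ["translation_c1"],
}
refs = {
    "term_a": {"translation_a1"},
    "term_b": {"translation_b1"},
    "term_c": {"translation_c1", "translation_c2"},
}

score = top1_accuracy(ranked, refs)
```

Here `term_b` is counted as wrong even though the accepted translation appears second in its list, which is what distinguishes Top-1 from more lenient Top-N measures.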

       
