ISSN 1000-1239 CN 11-1777/TP

The Translation Mining of the Out of Vocabulary Based on Wikipedia

Sun Changlong, Hong Yu, Ge Yundong, Yao Jianmin, and Zhu Qiaoming   

  1. (Jiangsu Province Key Laboratory of Computer Information Processing, Soochow University, Suzhou, Jiangsu 215006)
  • Online: 2011-06-15

Abstract: Query translation is one of the key factors affecting the performance of cross-language information retrieval (CLIR). Mining translations for out-of-vocabulary (OOV) terms in queries is therefore important for improving CLIR. OOV terms are words or phrases that cannot be found in the dictionary. In this paper, based on the data structure and language features of Wikipedia, we divide the translation environment into a target-existence environment and a target-deficit environment. To address the difficulty of translation mining in the target-deficit environment, we use frequency change information and adjacency information to extract candidate units, and compare this approach with common unit extraction methods. The results verify that our method is more effective. We then build a mixed translation mining strategy from a frequency-distance model, a surface pattern matching model, and a summary-score model, adding the models one at a time to measure the contribution of each. The experiments use an OOV mining technique based on search engines as the baseline and evaluate the results by Top-1 accuracy. The results show that the mixed translation mining method based on Wikipedia achieves a correct translation rate of 0.6822, an improvement of 6.98% over the baseline.

Key words: out of vocabulary (OOV), Wikipedia, cross-language information retrieval (CLIR), translation mining, target-deficit environment
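
Illustrative sketch: the abstract names a frequency-distance model and a Top-1 evaluation but does not give their formulas. The Python below is only one plausible reading, not the authors' implementation. It assumes that a candidate translation scores higher the more often it occurs and the closer it sits to the OOV term in a text snippet. The function names, the 1/distance weighting, and the sample snippets are all hypothetical.

```python
from collections import defaultdict


def frequency_distance_scores(snippets, oov_term, candidates):
    """Score candidates by frequency and proximity to the OOV term.

    Assumption (not from the paper): each occurrence of a candidate
    contributes 1 / character-distance to the first occurrence of the
    OOV term in the same snippet.
    """
    scores = defaultdict(float)
    for text in snippets:
        oov_pos = text.find(oov_term)
        if oov_pos < 0:
            continue  # snippet does not mention the OOV term
        for cand in candidates:
            start = text.find(cand)
            while start >= 0:
                distance = abs(start - oov_pos) or 1  # avoid division by zero
                scores[cand] += 1.0 / distance
                start = text.find(cand, start + 1)
    return dict(scores)


def top1_accuracy(predictions, gold):
    """Fraction of OOV terms whose top-ranked candidate matches the gold translation."""
    if not gold:
        return 0.0
    correct = sum(1 for term, answer in gold.items() if predictions.get(term) == answer)
    return correct / len(gold)


# Hypothetical snippets mentioning the OOV term "Wikipedia".
snippets = [
    "Wikipedia 维基百科 is a free encyclopedia",
    "the encyclopedia Wikipedia 维基百科",
]
scores = frequency_distance_scores(snippets, "Wikipedia", ["维基百科", "encyclopedia"])
best = max(scores, key=scores.get)
```

In this toy input the candidate that appears both frequently and next to the OOV term ranks first. A mixed strategy such as the one described in the abstract would combine a score of this kind with the pattern-matching and summary-based scores before taking the Top-1 candidate.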