Abstract:
The query translation is one of the key factors that affect the performance of cross-language information retrieval (CLIR). In the process of querying, the excavation of the out of vocabulary (OOV) has the important significance to improve CLIRT. Out of Vocabulary means the words or phrase which cant be found in the dictionary. In this paper, according to Wikipedia data structure and language features, we divide translation environment into target-existence environment and target-deficit environment. Depending on the difficulty of translation mining in the target-deficit environment, we adopt the frequency change information and adjacency information to realize the extraction of candidate units, and compare common extraction methods of units. The results verify that our methods are more effective. We establish the strategy of mixed translation mining based on the frequency-distance model, surface pattern matching model and summary-score model, and add the model one by one, and then verify the function influence of each model. The experiments use the mining technique of OOV in search engine as baseline and then evaluate the results with TOP1. The results verify that the mixed translation mining method based on Wikipedia can achieve the correct translation rate of 0.6822, and the improvements on this method are 6.98% over the baseline.