Abstract:
Identifying new technical terms and their definitions is an important research topic in information extraction. Providing a scalable solution for large-scale term extraction remains a great challenge, because most previous approaches fail to explicitly define the linguistic constituents of terms and the function of their definition patterns. The authors' research shows that, in real corpora, occurrences of new technical terms are in most cases accompanied by descriptions of their definitions. Based on this observation, the linguistic constituents of technical terms and the numerical function of their definitions are defined explicitly. A novel statistical approach based on the linguistic pattern of terminology definition (LPTD) is also presented to extract new Chinese technical terms and their definitions. LPTD is first proposed in this paper to delimit the boundaries of technical terms. In the identification phase, both the statistical information of terms and the LPTD features obtained from the preceding filtering process are taken into account in an SVM classifier, so that they are integrated into one unified framework. The idea can also serve as a reference for collocation extraction (CE) and can be easily extended to other languages. Compared with previously reported results, this approach achieves competitive performance on real large-scale corpora, with 90.5% precision and 78.1% recall.
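The filtering idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the two English definition patterns, the function names, and the choice of features (corpus frequency and pattern-hit count) are all assumptions made for illustration, whereas the paper targets Chinese text and its actual LPTD inventory is not given here. The sketch only shows how a definition pattern can delimit a candidate term and how pattern and frequency counts could be paired into a feature vector for a downstream classifier such as an SVM.

```python
import re
from collections import Counter

# Hypothetical definition patterns (illustrative English stand-ins,
# not the paper's LPTD inventory). Each captures a term and its definition.
DEFINITION_PATTERNS = [
    re.compile(r"(?P<term>[A-Za-z][A-Za-z\- ]*?) is defined as (?P<definition>[^.]+)\."),
    re.compile(r"(?P<term>[A-Za-z][A-Za-z\- ]*?) refers to (?P<definition>[^.]+)\."),
]


def extract_candidates(sentences):
    """Return (term, definition) pairs whose boundaries are set by a pattern."""
    pairs = []
    for sentence in sentences:
        for pattern in DEFINITION_PATTERNS:
            match = pattern.search(sentence)
            if match:
                pairs.append((match.group("term").strip(),
                              match.group("definition").strip()))
                break  # keep at most one pair per sentence
    return pairs


def build_features(sentences):
    """Map each candidate term to (corpus frequency, pattern-hit count).

    These two numbers are assumed features; a real system would pass
    richer vectors to the classifier.
    """
    pairs = extract_candidates(sentences)
    pattern_hits = Counter(term.lower() for term, _ in pairs)
    features = {}
    for term, hits in pattern_hits.items():
        corpus_freq = sum(s.lower().count(term) for s in sentences)
        features[term] = (corpus_freq, hits)
    return features


corpus = [
    "Hash join is defined as a join algorithm that builds a hash table.",
    "The optimizer picked hash join for the large tables.",
    "Bloom filter refers to a probabilistic set membership structure.",
]
candidates = extract_candidates(corpus)
features = build_features(corpus)
```

Here `candidates` holds the term and definition pairs found by the patterns, and `features` holds one small vector per candidate term. The classification step itself is left out of the sketch.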