Abstract:
A semantic class is a collection of terms that share a similar meaning. Knowing the semantic classes of words can be extremely valuable for many natural language processing tasks. This paper investigates the use of linguistic knowledge in the graph-based acquisition of Chinese semantic classes and demonstrates that such knowledge indeed improves the graph-based method. The corpus is the Xinhua News portion of LDC Chinese Gigaword. A graph is built by extracting word pairs in coordination structures from the corpus: the co-occurring words are the nodes, and the co-occurrence frequency of two words is the weight of the edge between them. The Newman algorithm is then used to cluster the words in the graph. This paper focuses on transforming the edge weights, motivated by the properties of coordinate structures and of the Chinese language. We present six methods: dividing the whole corpus into small parts, cutting low-frequency edges, increasing the weight of bidirectional edges, increasing the weight of edges within cliques, increasing the weight of edges whose two nodes share the same last character, and reducing the weight of edges whose two nodes have different numbers of characters. With all six methods combined, the experiments yield a promising precision of 53.12%, which outperforms the baseline Newman algorithm by 29.84%.
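To make the graph construction and edge-weight transformation concrete, the following is a minimal Python sketch of four of the six methods (cutting low-frequency edges, boosting bidirectional edges, boosting edges between words with the same last character, and penalizing edges between words of different lengths). It is an illustration under stated assumptions, not the implementation used in the paper: the threshold and multipliers are hypothetical values, the function names `build_graph` and `transform_weights` are invented here, "bidirectional" is read as "the pair was observed in both orders", and the toy word pairs are made up. Corpus splitting, the clique-based boost, and the Newman clustering step are omitted.

```python
from collections import Counter

# Hypothetical parameters; the abstract does not state the actual values.
MIN_FREQ = 2          # undirected edges with a lower total count are cut
BIDIR_BOOST = 2.0     # multiplier for pairs observed in both orders
SUFFIX_BOOST = 1.5    # multiplier when both words end in the same character
LENGTH_PENALTY = 0.5  # multiplier when the words differ in character count


def build_graph(pairs):
    """Count ordered coordination pairs (left, right) as directed edge frequencies."""
    return Counter(pairs)


def transform_weights(directed):
    """Merge directed counts into undirected edges, then apply the heuristics."""
    merged = {}
    for (a, b), freq in directed.items():
        if a == b:
            continue  # ignore self-loops
        key = tuple(sorted((a, b)))
        merged[key] = merged.get(key, 0) + freq

    weights = {}
    for (a, b), w in merged.items():
        if w < MIN_FREQ:
            continue  # cut low-frequency edges
        if (a, b) in directed and (b, a) in directed:
            w *= BIDIR_BOOST  # pair seen as "a and b" and as "b and a"
        if a[-1] == b[-1]:
            w *= SUFFIX_BOOST  # same last character
        if len(a) != len(b):
            w *= LENGTH_PENALTY  # different number of characters
        weights[(a, b)] = w
    return weights


# Toy coordination pairs, invented for illustration only.
pairs = [
    ("工业", "农业"), ("工业", "农业"), ("农业", "工业"),
    ("中国", "俄罗斯"), ("中国", "俄罗斯"),
    ("北京", "上海"),
]
weights = transform_weights(build_graph(pairs))
```

The resulting `weights` dictionary maps each surviving undirected edge to its transformed weight and could then be passed to a community-detection routine such as a modularity-based Newman-style clustering.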