ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2021, Vol. 58 ›› Issue (8): 1751-1760. doi: 10.7544/issn1000-1239.2021.20210323

Special Issue: 2021 Special Topic on Frontier Advances in Artificial Intelligence


Siamese BERT-Networks Based Classification Mapping of Scientific and Technological Literature

He Xianmin (1), Li Maoxi (1), He Yanqing (2)

  1. School of Computer Information and Engineering, Jiangxi Normal University, Nanchang 330022
  2. Institute of Scientific and Technical Information of China, Beijing 100038
  • Online: 2021-08-01
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (61662031) and the Fund of the Institute of Scientific and Technical Information of China (ZD2020-18).

Abstract: The International Patent Classification (IPC) and the Chinese Library Classification (CLC) are the principal classification schemes used to organize and manage patent information and journal literature, respectively. An accurate mapping between the two schemes is essential for cross-browsing and cross-retrieval of patent and journal resources. This paper proposes a siamese network built on the pre-trained contextual language model BERT to establish the mapping between IPC and CLC. The siamese network encodes the description texts of the two classification categories separately; average pooling over the resulting word representations yields sentence vectors of the same dimension, and the cosine similarity between the two sentence vectors gives the similarity score used to complete the classification mapping. A mapping corpus between IPC categories and CLC categories was manually annotated. Experimental results on this corpus show that the proposed method significantly outperforms the rule-based method and other deep neural network methods such as Sia-Multi, Bi-TextCNN, and Bi-LSTM. The code, models, and manually annotated corpus are publicly released.
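
The scoring step described in the abstract (average pooling of word representations, then cosine similarity between sentence vectors) can be illustrated with a minimal sketch. This is not the authors' released code: the function names `mean_pool`, `cosine_similarity`, and `best_match`, the 3-dimensional toy vectors, and the labels `CLC-A` and `CLC-B` are all hypothetical, and the hand-written vectors stand in for the token outputs that a BERT encoder would produce.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(query_tokens, candidates):
    """Return (label, score) for the candidate most similar to the query.

    `candidates` maps a category label to that category's token vectors.
    """
    query_vec = mean_pool(query_tokens)
    scored = [
        (label, cosine_similarity(query_vec, mean_pool(tokens)))
        for label, tokens in candidates.items()
    ]
    return max(scored, key=lambda pair: pair[1])


# Toy token vectors (assumption: placeholders for encoder outputs).
ipc_description = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
clc_candidates = {
    "CLC-A": [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]],  # hypothetical label
    "CLC-B": [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9]],  # hypothetical label
}

label, score = best_match(ipc_description, clc_candidates)
print(label, round(score, 3))
```

In a siamese setup both description texts pass through the same encoder with shared weights, which is why a single `mean_pool` function serves both the query and the candidates here.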

Key words: international patent classification, Chinese library classification, siamese BERT-networks, classification mapping, contrastive loss

CLC Number: