ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development (计算机研究与发展), 2021, Vol. 58, Issue (8): 1751-1760. doi: 10.7544/issn1000-1239.2021.20210323

Special topic: 2021 Frontier Advances in Artificial Intelligence

• Artificial Intelligence •

基于孪生BERT网络的科技文献类目映射

何贤敏1,李茂西1,何彦青2   

  1. 1(江西师范大学计算机信息工程学院 南昌 330022);2(中国科学技术信息研究所 北京 100038) (xianminhe@jxnu.edu.cn)
  • 出版日期: 2021-08-01
  • 基金资助: 
    国家自然科学基金项目(61662031);中国科学技术信息研究所重点工作项目(ZD2020-18)

Siamese BERT-Networks Based Classification Mapping of Scientific and Technological Literature

He Xianmin1, Li Maoxi1, He Yanqing2   

  1. School of Computer Information and Engineering, Jiangxi Normal University, Nanchang 330022; 2. Institute of Scientific and Technical Information of China, Beijing 100038
  • Online: 2021-08-01
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (61662031) and the Fund of the Institute of Scientific and Technical Information of China (ZD2020-18).

摘要: 国际专利分类法(international patent classification, IPC)和中国图书馆分类法(Chinese library classification, CLC)作为重要分类标识,分别在专利信息和期刊文献的组织以及管理中发挥着重要作用.如何准确地建立它们之间的映射关系对实现专利信息、期刊资源交叉浏览和检索有着重要的意义.提出了基于BERT预训练上下文语言模型的孪生网络用于建立IPC类目和CLC类目之间的映射关系,利用孪生网络模型分别抽象这2个分类法类目描述文本,通过平均池化抽象后的向量表示计算得到它们相同维度的句子向量,基于余弦相似度计算句子之间的相似度得分,完成类目映射.在人工标注一定规模的IPC类目和CLC类目之间的映射语料库上进行实验验证,结果表明提出的方法显著优于基于规则的方法和Sia-Multi,Bi-TextCNN,Bi-LSTM等深度神经网络的方法.相关的代码、模型和人工标注语料库已经公开发布.

关键词: 国际专利分类法, 中国图书馆分类法, 基于孪生BERT网络, 类目映射, 对比损失

Abstract: International Patent Classification (IPC) and Chinese Library Classification (CLC) are important classification schemes that play key roles in organizing and managing patent information and journal literature, respectively. Accurately mapping between the two classifications is essential for cross-browsing and cross-retrieval of patent information and journal resources. This paper proposes a siamese network based on the BERT pre-trained contextual language model to establish the mapping between IPC categories and CLC categories. The siamese network encodes the category description texts of the two classifications separately; average pooling over the encoded word representations yields sentence vectors of the same dimension, and the cosine similarity between the sentence vectors serves as the similarity score used to complete the category mapping. Experiments on a manually annotated corpus of mappings between IPC categories and CLC categories show that the proposed method significantly outperforms the rule-based method and deep neural network methods such as Sia-Multi, Bi-TextCNN, and Bi-LSTM. The code, models, and manually annotated corpus have been publicly released.

Key words: international patent classification, Chinese library classification, siamese BERT-networks, classification mapping, contrastive loss
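
The scoring steps named in the abstract (average pooling of word representations, cosine similarity between the two sentence vectors, and the contrastive loss listed in the key words) can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the toy word vectors stand in for BERT outputs, the function names are hypothetical, and the specific loss form (squared cosine distance with a margin of 0.5) is an assumption, since the abstract does not define it.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(word_vectors: Sequence[Sequence[float]], mask: Sequence[int]) -> Vector:
    """Average the word vectors whose mask entry is 1, skipping padding."""
    kept = [vec for vec, m in zip(word_vectors, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def contrastive_loss(u: Sequence[float], v: Sequence[float],
                     label: int, margin: float = 0.5) -> float:
    """Assumed loss form on cosine distance d = 1 - cos(u, v).

    label 1 (categories map to each other): penalize distance, 0.5 * d^2.
    label 0 (no mapping): penalize only pairs closer than the margin.
    """
    d = 1.0 - cosine_similarity(u, v)
    if label == 1:
        return 0.5 * d * d
    return 0.5 * max(0.0, margin - d) ** 2


# Toy stand-ins for the encoder outputs of one IPC and one CLC description.
# The third IPC row is padding and is masked out before pooling.
ipc_words = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
ipc_mask = [1, 1, 0]
clc_words = [[2.0, 2.0]]
clc_mask = [1]

ipc_vec = mean_pool(ipc_words, ipc_mask)
clc_vec = mean_pool(clc_words, clc_mask)
score = cosine_similarity(ipc_vec, clc_vec)
```

In a siamese setup both descriptions pass through the same encoder weights, so the two pooled vectors live in one space and a single similarity threshold or ranking over `score` can decide the mapping.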

CLC number: