Siamese BERT-Networks Based Classification Mapping of Scientific and Technological Literature

He Xianmin (何贤敏); Li Maoxi (李茂西); He Yanqing (何彦青)

doi:10.7544/issn1000-1239.2021.20210323

Abstract

International Patent Classification (IPC) and Chinese Library Classification (CLC) are important classification schemes that play a central role in organizing and managing patent information and journal literature, respectively. Accurately establishing a mapping between the two is essential for cross-browsing and cross-retrieval of patent information and journal resources. This paper proposes a siamese network built on the pre-trained contextual language model BERT to establish the mapping between IPC categories and CLC categories. The siamese network encodes the description texts of categories from each of the two schemes; the resulting token representations are average-pooled into sentence vectors of the same dimension, and the cosine similarity between the sentence vectors gives the similarity score used to complete the category mapping. Experiments on a manually annotated corpus of mappings between IPC categories and CLC categories show that the proposed method significantly outperforms rule-based methods and deep neural network methods such as Sia-Multi, Bi-TextCNN, and Bi-LSTM. The code, models, and manually annotated corpus have been publicly released.
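The scoring step described in the abstract (average pooling of token representations, then cosine similarity between the two sentence vectors) can be illustrated with a minimal sketch. It is not the authors' released code: the BERT encoder is left out entirely, the token vectors are toy values standing in for encoder output, and the names `mean_pool`, `cosine_similarity`, `rank_candidates`, and the `clc_a`/`clc_b` labels are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mean_pool(token_vectors: Sequence[Sequence[float]], mask: Sequence[int]) -> Vector:
    """Average the token vectors whose mask entry is 1, skipping padding."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query: Sequence[float],
                    candidates: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Score every candidate sentence vector against the query, best first."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Assumption: toy 2-dimensional "token vectors" for one IPC description;
    # the third row is padding and is excluded by the mask.
    ipc_tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
    ipc_mask = [1, 1, 0]
    ipc_vec = mean_pool(ipc_tokens, ipc_mask)

    # Assumption: already-pooled sentence vectors for two hypothetical CLC categories.
    clc_vecs = {"clc_a": [1.0, 1.0], "clc_b": [1.0, -1.0]}
    for label, score in rank_candidates(ipc_vec, clc_vecs):
        print(label, round(score, 3))
```

In a siamese setup both descriptions pass through the same encoder with shared weights, so the two pooled vectors live in the same space and can be compared directly; the sketch only covers what happens after that encoding step.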
