A Cross-Domain Heterogeneous Data Query Framework Based on the Synergy of Large Language Models and Knowledge Graphs

吴文隆; 尹海莲; 王宁; 徐梦飞; 赵鑫喆; 殷崭祚; 刘元睿; 王昊奋; 丁岩; 李博涵

doi:10.7544/issn1000-1239.202440634

A Synergetic LLM-KG Framework for Cross-Domain Heterogeneous Data Query

Abstract: The surge of interest in large language models (LLMs) has raised the bar for data quality. In real-world scenarios, data typically come from different sources yet are highly correlated. Because of privacy and security concerns, however, cross-domain heterogeneous data often cannot be shared centrally, which makes it hard for LLMs to use such data efficiently. To address this, we propose a framework in which LLMs and knowledge graphs (KGs) work together to query cross-domain heterogeneous data, offering a governance solution for such queries under the LLM+KG paradigm. First, so that the LLM can adapt to cross-domain heterogeneous data in multiple scenarios, we use adapters to fuse the data and construct the corresponding KGs. Second, to improve query efficiency, we introduce knowledge line graphs and propose a homogeneous knowledge graph extraction algorithm, HKGE, to reconstruct the KGs, which significantly improves query performance and keeps cross-domain data governance efficient. Third, to make multi-domain queries highly trustworthy, we propose a trusted candidate subgraph matching algorithm, TrustHKGM, which computes confidence for cross-domain data of the same origin, matches trusted candidate subgraphs, and removes low-quality nodes. Finally, we propose a multi-domain query algorithm based on knowledge line graph prompting, MKLGP, which enables efficient and trustworthy cross-domain queries under the LLM+KG paradigm. Extensive experiments on multiple real-world datasets demonstrate the effectiveness and efficiency of our approach compared with state-of-the-art solutions.
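The abstract mentions knowledge line graphs without defining them. As a rough illustration only, the sketch below applies the standard graph-theoretic line graph construction to KG triples: each triple becomes a node, and two triples are linked when they share an entity. Whether the paper's construction matches this exactly is an assumption; the function name knowledge_line_graph and the sample triples are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def knowledge_line_graph(triples):
    """Build a line graph over (head, relation, tail) triples.

    Assumption: this is the textbook line graph, not necessarily the
    paper's definition. Nodes are triple indices; an edge (a, b) means
    triples a and b share at least one entity.
    """
    # Map every entity to the indices of the triples that mention it.
    by_entity = defaultdict(set)
    for idx, (head, _relation, tail) in enumerate(triples):
        by_entity[head].add(idx)
        by_entity[tail].add(idx)

    # Any two triples touching the same entity are adjacent.
    edges = set()
    for members in by_entity.values():
        for a, b in combinations(sorted(members), 2):
            edges.add((a, b))
    return sorted(edges)


# Hypothetical sample triples, for illustration only.
triples = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("warfarin", "treats", "thrombosis"),
]
line_edges = knowledge_line_graph(triples)
```

In a representation like this, each node carries a complete fact, so a walk over the line graph yields a chain of related facts that could be serialized into a prompt; how MKLGP actually builds its prompts is not described in the abstract.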
