Citation: Chen Ziyang, Zhao Xiang, Zhao Runhao, Ni Ziqi, Ye Yicong. Multi-Agent Scientific Hypothesis Generation Based on Human-Machine Collaboration[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440552
With the explosive growth of scientific literature and the continuous deepening of research fields, researchers face heavy information-processing burdens when attempting to formulate novel scientific hypotheses. Although large language models (LLMs) show considerable potential for data processing and knowledge integration, they remain limited in their ability to generate original and insightful scientific hypotheses. Existing research predominantly applies LLMs to accelerate and refine established theories and technologies, overlooking the initial stage of scientific inquiry in which novel hypotheses are proposed and new theories are developed, a stage vital to scientific advancement. Grounded in the principles of divergent and convergent thinking from Guilford's Structure of Intellect theory, this study proposes an innovative Human-in-the-loop Multi-agent Framework (HILMA) for the reliable generation of scientific hypotheses. HILMA incorporates a real-time, systematic retrieval-augmentation mechanism that dynamically integrates the latest research advances to construct citation-network subgraphs, providing LLMs with comprehensive and up-to-date surveys of scientific knowledge. In addition, the framework strengthens hypothesis generation through a multi-agent argumentation procedure that simulates scientific peer review, while leveraging the intuition and expertise of human experts to further refine and diversify the generated hypotheses. A series of human-machine evaluations shows that the method offers significant advantages over existing baselines in generating high-quality scientific hypotheses and holds promise as a key enabler of technological innovation.
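To make the HILMA workflow described above concrete, the minimal Python sketch below outlines one possible wiring of its three stages: citation-subgraph retrieval, peer-review-style multi-agent argumentation, and human-expert filtering. It is an illustrative sketch only; search_papers, llm, ask_expert, and the paper fields are hypothetical placeholders, not the authors' implementation or any specific API.

def build_citation_subgraph(topic, search_papers, max_papers=50):
    """Retrieve recent papers on the topic and keep only citation links among
    the retrieved set, yielding a small citation-network subgraph."""
    papers = search_papers(topic)[:max_papers]  # each paper: {"id", "title", "citations"}
    ids = {p["id"] for p in papers}
    edges = [(p["id"], c) for p in papers for c in p.get("citations", []) if c in ids]
    return papers, edges

def generate_hypotheses(topic, papers, llm, n=5):
    """Divergent step: draft candidate hypotheses grounded in the retrieved survey."""
    survey = "\n".join(p["title"] for p in papers)
    prompt = (f"Recent studies on {topic}:\n{survey}\n"
              f"Propose {n} novel, testable scientific hypotheses, one per line.")
    return llm(prompt).splitlines()[:n]

def peer_review_debate(hypotheses, llm, rounds=2):
    """Convergent step: a reviewer agent critiques and a proposer agent revises."""
    refined = []
    for h in hypotheses:
        current = h
        for _ in range(rounds):
            critique = llm(f"As a skeptical peer reviewer, critique this hypothesis: {current}")
            current = llm(f"Revise the hypothesis to address the critique.\n"
                          f"Hypothesis: {current}\nCritique: {critique}")
        refined.append(current)
    return refined

def human_in_the_loop(hypotheses, ask_expert):
    """Expert feedback step: keep only hypotheses the human expert accepts."""
    return [h for h in hypotheses if ask_expert(h) == "keep"]

# Example wiring with hypothetical stubs for search_papers, llm, ask_expert:
# papers, edges = build_citation_subgraph("solid-state electrolytes", search_papers)
# drafts = generate_hypotheses("solid-state electrolytes", papers, llm)
# refined = peer_review_debate(drafts, llm)
# final = human_in_the_loop(refined, ask_expert)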
[1] Bornmann L, Haunschild R, Mutz R. Growth rates of modern science: A latent piecewise growth curve approach to model publication numbers from established and new literature databases[J]. Humanities and Social Sciences Communications, 2021, 8(1): 1−5 doi: 10.1057/s41599-020-00684-8
[2] Rothenberg A. The Janusian process in scientific creativity[J]. Creativity Research Journal, 1996, 9(2/3): 207−231
[3] Birhane A, Kasirzadeh A, Leslie D, et al. Science in the age of large language models[J]. Nature Reviews Physics, 2023, 5(5): 277−280 doi: 10.1038/s42254-023-00581-4
[4] Fakhoury S, Naik A, Sakkas G, et al. LLM-based test-driven interactive code generation: User study and empirical evaluation[J]. arXiv preprint, arXiv: 2404.10100, 2024
[5] Wu Yiquan, Zhou Siying, Liu Yifei, et al. Precedent-enhanced legal judgment prediction with LLM and domain-model collaboration[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 12060−12075
[6] Thirunavukarasu A J, Ting D S, Elangovan K, et al. Large language models in medicine[J]. Nature Medicine, 2023, 29(8): 1930−1940 doi: 10.1038/s41591-023-02448-8
[7] Liu Yiheng, Han Tianle, Ma Siyuan, et al. Summary of ChatGPT-related research and perspective towards the future of large language models[J]. Meta-Radiology, 2023, 1(2): 1−14
[8] Meyer J G, Urbanowicz R J, Martin P C, et al. ChatGPT and large language models in academia: Opportunities and challenges[J]. BioData Mining, 2023, 16(1): 20−31 doi: 10.1186/s13040-023-00339-9
[9] Walsh E, Anders K, Hancock S, et al. Reclaiming creativity in the era of impact: Exploring ideas about creative research in science and engineering[J]. Studies in Higher Education, 2013, 38(9): 1259−1273 doi: 10.1080/03075079.2011.620091
[10] Chen Ziyang, Li Dongfang, Zhao Xiang, et al. Temporal knowledge question answering via abstract reasoning induction[J]. arXiv preprint, arXiv: 2311.09149, 2023
[11] Ziems C, Held W, Shaikh O, et al. Can large language models transform computational social science?[J]. Computational Linguistics, 2024, 50(1): 237−291 doi: 10.1162/coli_a_00502
[12] Guilford J P. The structure of intellect[J]. Psychological Bulletin, 1956, 53(4): 267−293 doi: 10.1037/h0040755
[13] Wang Hanchen, Fu Tianfan, Du Yuanqi, et al. Scientific discovery in the age of artificial intelligence[J]. Nature, 2023, 620(7972): 47−60 doi: 10.1038/s41586-023-06221-2
[14] Baek J, Jauhar S K, Cucerzan S, et al. ResearchAgent: Iterative research idea generation over scientific literature with large language models[J]. arXiv preprint, arXiv: 2404.07738, 2024
[15] Microsoft Research AI4Science and Microsoft Azure Quantum. The impact of large language models on scientific discovery: A preliminary study using GPT-4[J]. arXiv preprint, arXiv: 2311.07361, 2023
[16] Majumder B P, Surana H, Agarwal D, et al. Data-driven discovery with large generative models[J]. arXiv preprint, arXiv: 2402.13610, 2024
[17] Qi Biqing, Zhang Kaiyan, Li Haoxiang, et al. Large language models are zero shot hypothesis proposers[J]. arXiv preprint, arXiv: 2311.05965, 2023
[18] Shojaee P, Meidani K, Gupta S, et al. LLM-SR: Scientific equation discovery via programming with large language models[J]. arXiv preprint, arXiv: 2404.18400, 2024
[19] Lu C, Lu Cong, Lange R T, et al. The AI scientist: Towards fully automated open-ended scientific discovery[J]. arXiv preprint, arXiv: 2408.06292, 2024
[20] Li Yunxin, Hu Baotian, Shi Haoyuan, et al. VisionGraph: Leveraging large multimodal models for graph theory problems in visual context[J]. arXiv preprint, arXiv: 2405.04950, 2024
[21] Ji Ziwei, Lee N, Frieske R, et al. Survey of hallucination in natural language generation[J]. ACM Computing Surveys, 2023, 55(12): 1−38
[22] Shuster K, Poff S, Chen M, et al. Retrieval augmentation reduces hallucination in conversation[C]//Proc of the 2021 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2021: 3784−3803
[23] 王梦如,姚云志,习泽坤,等. 基于知识编辑的大语言模型内容生成安全分析[J]. 计算机研究与发展,2024,61(5):1143−1155 doi: 10.7544/issn1000-1239.202330965
Wang Mengru, Yao Yunzhi, Xi Zekun, et al. Safety analysis of large model content generation based on knowledge editing[J]. Journal of Computer Research and Development, 2024, 61(5): 1143−1155 (in Chinese) doi: 10.7544/issn1000-1239.202330965
[24] Wang Cunxiang, Liu Xiaoze, Yue Yuanhao, et al. Survey on factuality in large language models: Knowledge, retrieval and domain-specificity[J]. arXiv preprint, arXiv: 2310.07521, 2023
[25] Xu Xinchao, Gou Zhibin, Wu Wenquan, et al. Long time no see! Open-domain conversation with long-term persona memory[C]//Proc of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2022: 2639−2650
[26] Gao Yunfan, Xiong Yun, Gao Xinyu, et al. Retrieval-augmented generation for large language models: A survey[J]. arXiv preprint, arXiv: 2312.10997, 2023
[27] Lewis P, Perez E, Piktus A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks[C]//Proc of the 34th Conf on Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2020: 9459−9474
[28] Robertson S E, Zaragoza H. The probabilistic relevance framework: BM25 and beyond[J]. Foundations and Trends in Information Retrieval, 2009, 3(4): 333−389 doi: 10.1561/1500000019
[29] Wu H, Luk P W P, Wong K F, et al. Interpreting TF-IDF term weights as making relevance decisions[J]. ACM Transactions on Information Systems, 2008, 26(3): 1−37
[30] Guo Jiafeng, Cai Yinqiong, Fan Yixing, et al. Semantic models for the first-stage retrieval: A comprehensive review[J]. ACM Transactions on Information Systems, 2022, 40(4): 1−42
[31] Bruch S, Gai S, Ingber A. An analysis of fusion functions for hybrid retrieval[J]. ACM Transactions on Information Systems, 2023, 42(1): 1−35
[32] Li Hang, Mourad A, Zhuang Shengyao, et al. Pseudo relevance feedback with deep language models and dense retrievers: Successes and pitfalls[J]. ACM Transactions on Information Systems, 2023, 41(3): 1−40
[33] Shen Tao, Long Guodong, Geng Xiubo, et al. Large language models are strong zero-shot retriever[J]. arXiv preprint, arXiv: 2304.14233, 2023
[34] Ma Xueguang, Zhang Xinyu, Pradeep R, et al. Zero-shot listwise document reranking with a large language model[J]. arXiv preprint, arXiv: 2305.02156, 2023
[35] Sun Weiwei, Yan Lingyong, Ma Xinyu, et al. Is ChatGPT good at search? Investigating large language models as re-ranking agents[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 14918−14937
[36] Jeong M, Sohn J, Sung M, et al. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models[J]. arXiv preprint, arXiv: 2401.15269, 2024
[37] Mousavi S M, Alghisi S, Riccardi G. Is your LLM outdated? Benchmarking LLMs & alignment algorithms for time-sensitive knowledge[J]. arXiv preprint, arXiv: 2404.08700, 2024
[38] Wang Z, Choi D, Xu Shenyu, et al. Putting humans in the natural language processing loop: A survey[J]. arXiv preprint, arXiv: 2103.04044, 2021
[39] Wu Xingjiao, Xiao Luwei, Sun Yixuan, et al. A survey of human-in-the-loop for machine learning[J]. Future Generation Computer Systems, 2022, 135: 364−381 doi: 10.1016/j.future.2022.05.014
[40] Cai Zefan, Chang Baobao, Han Wenjuan. Human-in-the-loop through chain-of-thought[J]. arXiv preprint, arXiv: 2306.07932, 2023
[41] Mehta N, Teruel M, Sanz P F, et al. Improving grounded language understanding in a collaborative environment by interacting with agents through help feedback[C]//Proc of the 62nd Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2024: 1306−1321
[42] Huang Wenlong, Xia Fei, Xiao T, et al. Inner monologue: Embodied reasoning through planning with language models[J]. arXiv preprint, arXiv: 2207.05608, 2022
[43] Wang Xingyao, Wang Zihan, Liu Jiateng, et al. MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback[J]. arXiv preprint, arXiv: 2309.10691, 2023
[44] Feng Xueyang, Chen Zhiyuan, Qin Yujia, et al. Large language model-based human-agent collaboration for complex task solving[J]. arXiv preprint, arXiv: 2402.12914, 2024
[45] Dhillon P S, Molaei S, Li Jiaqi, et al. Shaping human-AI collaboration: Varied scaffolding levels in co-writing with language models[J]. arXiv preprint, arXiv: 2402.11723, 2024
[46] 李戈,彭鑫,王千祥,等. 大语言模型:基于自然交互的人机协同软件开发与演化工具带来的挑战[J]. 软件学报,2023,34(10):4601−4606
Li Ge, Peng Xin, Wang Qianxiang, et al. Challenges from LLMs as a natural language based human-machine collaborative tool for software development and evolution[J]. Journal of Software, 2023, 34(10): 4601−4606 (in Chinese)
[47] 靳东明,金芝,陈小红,等. ChatModeler:基于大语言模型的人机协作迭代式需求获取和建模方法[J]. 计算机研究与发展,2024,61(2):338−350 doi: 10.7544/issn1000-1239.202330746
Jin Dongming, Jin Zhi, Chen Xiaohong, et al. ChatModeler: A human-machine collaborative and iterative requirements elicitation and modeling approach via large language models[J]. Journal of Computer Research and Development, 2024, 61(2): 338−350 (in Chinese) doi: 10.7544/issn1000-1239.202330746
[48] Ai2. Semantic Scholar API[EB/OL]. [2024-05-17]. https://www.semanticscholar.org/product/api
[49] OpenAI. ChatGPT[EB/OL]. [2024-05-17]. https://chat.openai.com
[50] Wei J, Wang Xuezhi, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models[C]//Proc of the 36th Conf on Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2022: 24824−24837
[51] Min S, Lyu X, Holtzman A, et al. Rethinking the role of demonstrations: What makes in-context learning work[C]//Proc of the 2022 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2022: 11048−11064
[52] Zheng Lianmin, Chiang W, Sheng Ying, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena[J]. arXiv preprint, arXiv: 2306.05685, 2023
[53] Fu Jinlan, Ng S K, Jiang Zhengbao, et al. GPTScore: Evaluate as you desire[J]. arXiv preprint, arXiv: 2302.04166, 2023
[54] Joshi A, Kale S, Chandel S, et al. Likert scale: Explored and explained[J]. British Journal of Applied Science & Technology, 2015, 7(4): 396−403
[55] Meta. llama3-70b-instruct[EB/OL]. [2024-05-10]. https://huggingface.co/meta-llama
[56] Alibaba. Qwen1.5-72B-Chat[EB/OL]. [2024-05-10]. https://huggingface.co/Qwen/Qwen1.5-72B-Chat
[57] OpenAI. GPT-3.5-Turbo[EB/OL]. [2024-05-10]. https://platform.openai.com/docs/models/gpt-3-5-turbo
[58] Alibaba. Qwen-Max API[EB/OL]. [2024-05-10]. https://help.aliyun.com/zh/dashscope/developer-reference