Citation: Li Kunze, Zhang Yu. Adaptive Pipeline Unsupervised Question Generation Method[J]. Journal of Computer Research and Development, 2025, 62(4): 905-914. DOI: 10.7544/issn1000-1239.202330857
In traditional question-answering tasks, models generally require large amounts of training data, and annotating that data costs considerable time and labor. Unsupervised question generation is an effective way to alleviate this scarcity of training data, but the questions it currently produces are often hard to answer, lack variety, and have unclear semantics. To address these issues, we propose ADVICE, an adaptive multi-module pipeline model whose modules improve on existing methods in answerability, question diversity, and grammatical correctness. In the answerability module, we apply coreference resolution and named entity recognition to make the generated questions easier to answer. For question diversity, we design dedicated rules for different question types to broaden the range of question and answer types. In the grammatical correctness module, a grammar error correction model targeted at questions is trained on top of the T5 model, and a filtering module refines the generated question-answer pairs. Finally, a classifier is trained to automatically select which modules to apply. Experiments show that the improved question generation method boosts the performance of downstream question-answering models on the SQuAD dataset, raising the EM (exact match) score by an average of 2.9% and the F1 score by an average of 4.4%.
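To make the kind of processing described for the answerability module concrete, the sketch below shows one plausible way to extract named-entity answer candidates from a passage and rewrite the host sentence into a cloze-style question. This is a minimal illustration, not the authors' ADVICE implementation: the spaCy model name, the entity-to-question-word mapping, and the cloze rewriting rule are all assumptions made for demonstration.

```python
# Illustrative sketch (not the ADVICE pipeline): pick named-entity answer
# candidates from a passage and turn the containing sentence into a
# cloze-style question whose wh-word matches the entity type.
import spacy

# Assumed small English pipeline with NER and a parser (for sentence splitting).
nlp = spacy.load("en_core_web_sm")

# Assumed mapping from spaCy entity labels to question words.
WH_WORD = {
    "PERSON": "who",
    "ORG": "what organization",
    "GPE": "where",
    "DATE": "when",
    "CARDINAL": "how many",
}

def cloze_questions(passage: str):
    """Yield (question, answer) pairs built from named entities in the passage."""
    doc = nlp(passage)
    for sent in doc.sents:
        for ent in sent.ents:
            wh = WH_WORD.get(ent.label_)
            if wh is None:
                continue  # skip entity types without a simple question word
            # Replace the answer span with the wh-word to form a pseudo-question.
            question = sent.text.replace(ent.text, wh, 1).rstrip(".") + "?"
            yield question, ent.text

if __name__ == "__main__":
    text = "Marie Curie won the Nobel Prize in 1903."
    for q, a in cloze_questions(text):
        print(q, "->", a)
```

A real system would additionally resolve pronouns to their antecedents before forming the cloze (so the question is self-contained) and pass the candidates through diversity, grammar-correction, and filtering stages, as the abstract describes.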