Abstract:
Question-answering models typically require large amounts of training data, and annotating that data costs considerable time and labor. Unsupervised question generation is an effective way to address this scarcity of training data. However, questions generated with this approach are currently often difficult to answer, lack variety, and have unclear semantics. To address these issues, this paper proposes ADVICE, an adaptive multi-module pipeline model whose modules improve on existing methods in answerability, question diversity, and grammatical correctness. The answerability module uses coreference resolution and named entity recognition to make the generated questions easier to answer. The diversity module applies rules designed for specific question types to broaden the range of question and answer types. The grammatical correctness module trains a question-targeted grammatical error correction model based on T5 and adds a filtering module that refines the generated question-answer data. Finally, a classifier is trained to automatically select which modules to apply. Experiments demonstrate that the improved question generation method enhances the performance of downstream question-answering models on the SQuAD dataset, with the EM (exact match) score increasing by an average of 2.9% and the F1 score by an average of 4.4%.
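
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how EM and token-level F1 are commonly computed for SQuAD-style answers. It is an illustration under stated assumptions, not the evaluation code used in the experiments: the normalization steps (lowercasing, stripping punctuation and the articles a/an/the) follow the widely used SQuAD convention, and the function names `normalize`, `exact_match`, and `f1_score` are chosen here for illustration. Taking the maximum over several reference answers is omitted for brevity.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Assumed convention: if either side is empty, score 1.0 only when both are.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this sketch, a prediction that contains the reference answer plus extra tokens receives an EM of 0 but a nonzero F1, which is why the two scores can move by different amounts.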