Reinforcement Learning-Based Feature Generation Algorithm for Scientific Data

肖濛; 周骏丰; 周园春

doi:10.7544/issn1000-1239.202550306

Abstract

Feature generation (FG) aims to improve the predictive potential of raw data by constructing high-order feature combinations and removing redundant features. For tabular scientific data, it is a key preprocessing step for improving the performance of downstream machine learning models. Traditional methods face two challenges when applied to scientific data. First, constructing effective high-order feature combinations typically requires deep and broad domain knowledge. Second, the search space grows exponentially with the order of the feature combinations, which makes manual exploration prohibitively expensive. The rise of the data-centric artificial intelligence (DCAI) paradigm has opened new possibilities for automating feature generation. Motivated by this, the paper revisits the conventional feature generation workflow and proposes a multi-agent feature generation (MAFG) framework. In the iterative exploration stage, multiple agents collaboratively construct mathematical transformation expressions, identify and synthesize feature combinations with high information content, and adapt their policies through a reinforcement learning mechanism. After exploration, MAFG uses large language models (LLMs) to provide an interpretive evaluation of the features newly generated at each key performance breakthrough found during exploration. Experimental results and case studies show that MAFG effectively automates feature generation and significantly improves performance on multiple downstream scientific data mining tasks.
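The exploration loop summarized in the abstract can be illustrated with a deliberately small sketch. This is not the MAFG implementation: it assumes a single epsilon-greedy agent that picks the operation, uniform random choice of the two operand features, and absolute Pearson correlation with the target as a stand-in for downstream model performance. The operation set `OPS` and the names `abs_corr` and `generate_features` are hypothetical and chosen only for illustration.

```python
import math
import random

# Hypothetical operation set; the abstract does not list the actual operations.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def abs_corr(xs, ys):
    """Absolute Pearson correlation; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return abs(sxy) / math.sqrt(sxx * syy)


def generate_features(columns, target, steps=200, eps=0.2, seed=0):
    """Epsilon-greedy search over (operation, feature, feature) triples."""
    rng = random.Random(seed)
    feats = dict(columns)            # feature name -> list of values
    q = {op: 0.0 for op in OPS}      # running value estimate per operation
    counts = {op: 0 for op in OPS}
    best = max(abs_corr(v, target) for v in feats.values())
    for _ in range(steps):
        # Operation choice: explore with probability eps, otherwise exploit.
        if rng.random() < eps:
            op = rng.choice(list(OPS))
        else:
            op = max(q, key=q.get)
        # Operand choice: two distinct features, uniformly (a simplification).
        a, b = rng.sample(sorted(feats), 2)
        values = [OPS[op](x, y) for x, y in zip(feats[a], feats[b])]
        score = abs_corr(values, target)
        reward = score - best        # improvement over the best feature so far
        counts[op] += 1
        q[op] += (reward - q[op]) / counts[op]  # incremental mean update
        if reward > 0:               # keep only features that improve the score
            feats[f"({a} {op} {b})"] = values
            best = score
    return feats, best
```

Calling `generate_features({"x1": x1, "x2": x2, "x3": x3}, y)` returns the enlarged feature dictionary and the best score reached. Because kept features become operands in later steps, higher-order combinations can emerge, which is also why the candidate space grows so quickly with the order.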
