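Energy Consumption Prediction Method for HPC Jobs Using Temporal Convolutional Network with Cross-Attention Guidance under Retrieval Augmentation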

吴众欣; 赵涛; 高亦沁; 于洋; 史彦磊; 翟建兴; 韦建文

doi:10.7544/issn1000-1239.202550757


Abstract

Energy consumption prediction for High Performance Computing (HPC) jobs suffers from large errors when modeling long-range temporal dependencies and makes insufficient use of domain prior knowledge. To address both problems, this paper proposes a retrieval-augmented method in which a cross-attention mechanism guides a Temporal Convolutional Network (TCN) to predict HPC job energy consumption. The method builds a retrieval-augmented knowledge base (RAG KB) from historical job data and accumulated operations knowledge. The dilation rate is adjusted dynamically according to the spectral features of the job energy consumption sequence and the energy sensitivity of operators, which strengthens the model's focus on critical temporal features. Job-characteristic knowledge accumulated in the knowledge base is transferred to the prediction task through cross-attention, which adaptively reweights the time steps so that fluctuations in supercomputing job energy consumption are captured more precisely and the predictive accuracy of the TCN improves. Experimental results show that, compared with a conventional TCN model, the proposed method lowers the Mean Absolute Percentage Error (MAPE) to 8.6%-11.3% and the Symmetric Mean Absolute Percentage Error (SMAPE) to 8.7%-13.7%. The method integrates domain prior knowledge and attention guidance into the TCN, improving the model's adaptability to operational scenarios while maintaining computational efficiency. It provides an operations method for supercomputing energy efficiency management that combines theoretical depth with engineering value, and a viable route to optimizing energy consumption in HPC environments. The integration of retrievable operational knowledge is a step toward context-aware energy management systems for large-scale computing facilities.
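To make the abstract's pipeline concrete, the following is a minimal NumPy sketch, not the authors' implementation. It shows a causal dilated convolution (the building block of a TCN), a single-head scaled dot-product cross-attention in which TCN features act as queries and retrieved knowledge embeddings act as keys and values, and the two reported error metrics. Everything named here is an assumption for illustration: the function names, the tensor sizes, the random stand-in data, the fixed dilation rate of 2 (the paper adjusts it dynamically, by a rule the abstract does not specify), the residual fusion `h + context`, and the particular SMAPE variant (the common form bounded by 200%).

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """Causal dilated 1-D convolution.
    x: (T, C_in) sequence, w: (K, C_in, C_out) kernel.
    Output (T, C_out); step t depends only on x[t], x[t-d], ..., x[t-(K-1)d]."""
    T, K = x.shape[0], w.shape[0]
    pad = (K - 1) * dilation
    # Left-pad with zeros so no output step can see the future.
    xp = np.concatenate([np.zeros((pad, x.shape[1])), x], axis=0)
    out = np.zeros((T, w.shape[2]))
    for k in range(K):
        # Tap k reads the input (K-1-k)*dilation steps in the past.
        out += xp[k * dilation : k * dilation + T] @ w[k]
    return out

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.
    queries: (T, d), keys: (M, d), values: (M, d_v).
    Returns the attended context (T, d_v) and the weights (T, M)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent (y_true must be nonzero)."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def smape(y_true, y_pred):
    """Symmetric MAPE, in percent (assumed variant, range 0 to 200)."""
    denom = np.abs(y_true) + np.abs(y_pred)
    return 100.0 * np.mean(2.0 * np.abs(y_pred - y_true) / denom)

# Hypothetical sizes: time steps, feature channels, retrieved KB entries.
T, C, M = 32, 8, 5
rng = np.random.default_rng(0)
series = rng.normal(size=(T, 1))      # stand-in for a job power trace
kb = rng.normal(size=(M, C))          # stand-in for retrieved knowledge embeddings
w = rng.normal(size=(3, 1, C)) * 0.1  # kernel size 3

h = np.tanh(causal_dilated_conv(series, w, dilation=2))  # TCN features, (T, C)
context, attn = cross_attention(h, kb, kb)               # (T, C), (T, M)
fused = h + context                                      # assumed residual fusion
```

In this sketch each row of `attn` is a distribution over the retrieved entries, so every time step draws a different mixture of knowledge. A full model would stack several such convolution layers and learn the projection weights, which the sketch omits.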
