Abstract:
GPU clusters have become important high-performance parallel computing systems for large-scale stream data processing. In practice, such computing demands high speed, low power consumption, and high reliability, so a GPU cluster has three key performance indices that constrain one another: computing speed, power consumption, and reliability. During real-time computing, the system must dynamically search for the operating point that best trades off these three indices, which makes multi-index optimization in GPU cluster power control a challenging problem. To consider the three indices simultaneously, they are combined into a single comprehensive index by a maximum entropy function. An adaptive control model is then built on model predictive control theory to scale the power consumption state dynamically as the workload varies. This control model caps redundant energy consumption and keeps the power consumption of the GPU cluster under a specified set point while guaranteeing computing speed and reliability. Compared with a control scheme that does not consider reliability, the results show that the proposed scheme achieves better control stability and robustness, making it well suited to GPU cluster power management for real-time, large-scale stream data.
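
As an illustration only, the idea of merging several indices into one comprehensive index can be sketched with the maximum entropy (log-sum-exp) aggregate function, which smoothly approximates the largest of its arguments. The abstract does not give the exact formulation, so the function name `max_entropy_index`, the sharpness parameter `p`, and the convention that each index is a normalized cost in [0, 1] (lower is better) are all assumptions made for this sketch.

```python
import math


def max_entropy_index(indices, p=20.0):
    """Smooth approximation of max(indices).

    Computes (1/p) * ln(sum(exp(p * f_i))), the maximum entropy
    (log-sum-exp) aggregate. Larger p tracks the true maximum more closely.
    `indices` and `p` are hypothetical names, not taken from the paper.
    """
    if not indices:
        raise ValueError("indices must be non-empty")
    if p <= 0:
        raise ValueError("p must be positive")
    m = max(indices)
    # Subtract the maximum before exponentiating to avoid overflow.
    return m + math.log(sum(math.exp(p * (f - m)) for f in indices)) / p


# Hypothetical normalized costs: slowdown, power use, unreliability.
costs = [0.2, 0.5, 0.9]
combined = max_entropy_index(costs)
print(combined)
```

Under these assumptions the combined value always lies between `max(costs)` and `max(costs) + ln(n) / p`, so the worst of the three indices dominates while the result stays differentiable, which is convenient for a controller that optimizes a single objective.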