Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across various domains, including natural language processing, computer vision, chatbots, and content generation. Deploying LLMs on edge devices has attracted significant attention due to its advantages in service stability and data privacy. However, the decoding stage of LLM inference is dominated by memory-bound operations; as model sizes grow and complex tasks demand increasingly long context lengths, the resulting memory access overhead becomes a critical bottleneck limiting the deployment of LLMs on edge platforms. To reduce the memory access demand of long-context decoding, a locality-aware attention mechanism has been proposed. By exploiting the numerical locality of attention scores during decoding, this mechanism compresses the key-value cache (KV cache), significantly reducing redundant memory accesses without compromising model accuracy. This paper presents a locality-aware inference accelerator for LLMs with hybrid-bonded 3D DRAM. The accelerator not only reduces overall memory traffic but also significantly improves main memory bandwidth. It integrates load-balancing and prediction-based prefetching mechanisms tailored to the structural characteristics of hybrid-bonded 3D DRAM, thereby enhancing bandwidth utilization. On the evaluated models, the proposed accelerator achieves average throughputs of 1971, 3752, and 6849 tokens/s for batch sizes of 1, 2, and 4, respectively, with corresponding energy efficiencies of 0.23, 0.12, and 0.07 J/token. These throughputs are 50.5, 49.4, and 50.0 times those of an A100 GPU, and 23.7, 22.9, and 21.5 times those of the LAD-HBM accelerator. Meanwhile, the energy consumption is only 5.1%, 5.1%, and 4.9% of that of the GPU, and 46.2%, 46.9%, and 48.2% of that of LAD-HBM.