
    Locality-Aware Decoding Accelerator for Large Language Models Based on Hybrid Bonding Architecture

    • Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of domains, including natural language processing, computer vision, chatbots, and content generation. Deploying LLMs on edge devices has attracted significant attention because of its advantages in service stability and data privacy. However, the decoding phase of LLM inference is dominated by memory-bound operations; as model sizes grow and complex tasks demand ever longer context lengths, the resulting memory access overhead becomes a critical bottleneck limiting the deployment of LLMs on edge platforms. To reduce the memory access cost of long-context decoding, a locality-aware attention mechanism has been proposed. By exploiting the numerical locality of attention scores during decoding, this mechanism compresses the key-value cache (KV cache), effectively cutting redundant memory accesses without compromising model accuracy. This paper presents a locality-aware LLM inference accelerator built on hybrid-bonded 3D DRAM, which reduces the overall memory traffic while significantly increasing the main-memory bandwidth. The accelerator integrates load-balancing and prediction-based prefetching mechanisms tailored to the structural characteristics of hybrid-bonded 3D DRAM, thereby improving bandwidth utilization. On the evaluated models, the accelerator achieves average throughputs of 1971, 3752, and 6849 tokens/s at batch sizes of 1, 2, and 4, respectively, with corresponding energy costs of 0.23, 0.12, and 0.07 J/token. These throughputs are 50.5, 49.4, and 50.0 times that of the A100 GPU and 23.7, 22.9, and 21.5 times that of the LAD-HBM accelerator, while the energy consumption is only 5.1%, 5.1%, and 4.9% of the GPU's and 46.2%, 46.9%, and 48.2% of LAD-HBM's.
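
    As a rough illustration of the locality-aware KV-cache compression described in the abstract, the NumPy sketch below keeps only a recent window of cached tokens plus the older tokens with the highest attention scores for the current query. This is a minimal sketch under assumed policies: the function name prune_kv_cache, the topk/recent budgets, and the single-head layout are illustrative choices, not details taken from the paper or the accelerator.

    ```python
    # Minimal sketch (not the paper's implementation) of locality-aware
    # KV-cache pruning during single-token decoding. Policy assumed here:
    # always keep the most recent `recent` tokens, plus the `topk` older
    # tokens whose attention logits for the current query are largest.
    import numpy as np

    def prune_kv_cache(q, K, V, topk=64, recent=32):
        """Return a compressed (K, V) pair and the kept indices for one head.

        q: (d,)   query vector of the token being decoded
        K: (T, d) cached key vectors
        V: (T, d) cached value vectors
        """
        T, d = K.shape
        scores = (K @ q) / np.sqrt(d)               # attention logits over the cache
        keep = set(range(max(0, T - recent), T))    # always keep a recent window
        # add the highest-scoring older tokens (numerical locality of attention)
        for idx in np.argsort(scores)[::-1]:
            if len(keep) >= min(T, topk + recent):
                break
            keep.add(int(idx))
        keep = np.array(sorted(keep))
        return K[keep], V[keep], keep

    def attention(q, K, V):
        """Standard scaled dot-product attention for a single query."""
        logits = (K @ q) / np.sqrt(K.shape[1])
        w = np.exp(logits - logits.max())
        w /= w.sum()
        return w @ V

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        T, d = 4096, 128
        K, V = rng.standard_normal((T, d)), rng.standard_normal((T, d))
        q = rng.standard_normal(d)
        Kc, Vc, kept = prune_kv_cache(q, K, V)
        full, approx = attention(q, K, V), attention(q, Kc, Vc)
        print(f"kept {len(kept)}/{T} tokens, relative output error "
              f"{np.linalg.norm(full - approx) / np.linalg.norm(full):.3f}")
    ```

    In hardware, a selection of this kind means only the retained KV rows need to be fetched from main memory, which is where the reduction in redundant memory traffic comes from.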

       
