
    A Cross-Platform Fine-Grained Performance Analysis Technique for Redundant Zeros

    • Abstract: Software inefficiencies caused by redundant zeros lead to large numbers of zero values being loaded, and often being fed repeatedly into computations that contribute nothing to the result, which wastes memory and compute resources. However, existing compiler toolchains cannot effectively identify and eliminate zero-related redundant operations, and hardware optimizations for redundant zeros have not yet been adopted in commercial processors. Although ZeroSpy can detect redundant zeros and report enough information to guide optimization, its detection is limited to Intel platforms, and its high overhead hinders wider use. We therefore propose DrZero, a cross-platform, fine-grained performance analysis tool for redundant zeros that overcomes these limitations. DrZero supports both x86 and ARM platforms and performs online, fine-grained analysis based on buffered tracing to reduce overhead. To support ARM, we propose a data type inference method based on dataflow analysis that automatically estimates the data type (in particular, whether it is floating-point) of each value read from memory. The evaluation shows that DrZero identifies redundant zeros and gives optimization guidance with average overheads of 45.31× (code-centric analysis) and 54.20× (data-centric analysis) on x86, and 14.12× (code-centric) and 13.40× (data-centric) on ARM. Compared with the overheads reported for ZeroSpy on x86, DrZero's average overhead is 37.2% lower in code-centric analysis and 55.8% lower in data-centric analysis. Guided by DrZero's reports, the evaluated applications achieve speedups of up to 1.76× on x86 and up to 2.12× on ARM after redundant zeros are eliminated. DrZero is open source at https://github.com/buaa-hipo/zerospy-drcctprof.
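    To make the idea of buffered tracing concrete, the following is a minimal toy sketch in Python, not DrZero's actual implementation. It assumes that each memory load is recorded as a (code site, raw bytes) pair, that records are analyzed in batches whenever a fixed-size buffer fills up, and that the metric of interest is the fraction of loaded bytes that are zero per code site. The class name, method names, buffer capacity, and site labels are all hypothetical.

```python
from collections import defaultdict


class BufferedZeroProfiler:
    """Toy model of buffered tracing (hypothetical, not DrZero's code):
    loads are appended to a buffer and analyzed in batches."""

    def __init__(self, capacity=4):
        self.capacity = capacity              # assumed buffer size, in records
        self.buffer = []                      # pending (site, raw bytes) records
        self.zero_bytes = defaultdict(int)    # site -> zero bytes loaded
        self.total_bytes = defaultdict(int)   # site -> all bytes loaded

    def record_load(self, site, raw):
        """Cheap per-load step: store the record, analyze only when full."""
        self.buffer.append((site, bytes(raw)))
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        """Batch analysis: count zero bytes for every buffered load."""
        for site, raw in self.buffer:
            self.zero_bytes[site] += raw.count(0)
            self.total_bytes[site] += len(raw)
        self.buffer.clear()

    def report(self):
        """Return the fraction of loaded bytes that were zero, per site."""
        self.flush()
        return {site: self.zero_bytes[site] / self.total_bytes[site]
                for site in self.total_bytes}


# Usage with made-up load values and site labels.
profiler = BufferedZeroProfiler(capacity=2)
for value in (0, 0, 7, 0):
    profiler.record_load("loop.c:12", value.to_bytes(4, "little"))
profiler.record_load("init.c:3", (0x01020304).to_bytes(4, "little"))
fractions = profiler.report()
print(fractions)
```

    In this toy model, a site whose zero fraction is close to 1 is a candidate for optimization, for example by skipping the computation or using a sparse representation. A real binary-instrumentation tool would fill the buffer from instrumented code rather than from explicit calls, and would also need the data type of each load, which this sketch ignores.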

       
