Abstract:
Simultaneous multithreading is a latency-tolerant architecture that usually employs shared L2 cache. It can execute multiple instructions from multiple threads each cycle, thus increasing the pressure on memory hierarchy. In this paper, the problem of partitioning a shared cache between multiple concurrently executing threads in the SMT architecture, especially the issue of fairness in cache sharing and its relation to throughput are studied. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand. LRU manages the cache unfairly, and will lead to some serious problems, such as thread starvation and priority inversion. An adaptive runtime partition (ARP) mechanism is implemented to manage the shared cache. ARP takes fairness as the metric of cache partitioning, and uses dynamic partitioning algorithm to optimize fairness. The dynamic partitioning algorithm is easy to implement, requires little or no profiling. Meanwhile it uses a classical monitor circuit to collect the stack distance information of each thread, and requires less than 0.25% of storage overhead. The evaluation shows that on the average, ARP improves the fairness of a 2-way SMT by a factor of 2.26, while increasing the throughput by 14.75%, compared with LRU-based cache partitioning.