ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2015, Vol. 52 ›› Issue (7): 1522-1530. doi: 10.7544/issn1000-1239.2015.20148073

A Cache Approach for Large Scale Data-Intensive Computing

Zhou Enqiang¹, Zhang Wei¹, Lu Yutong¹, Hou Hongjun², Dong Yong¹

  1. State Key Laboratory of High Performance Computing (National University of Defense Technology), Changsha 410073
  2. Bureau of Geophysical Prospecting, China National Petroleum Corporation, Zhuozhou, Hebei 072751
  • Online: 2015-07-01

Abstract: As HPC systems become widely used in scientific computing, more data-intensive applications generate and analyze datasets of increasing scale, which poses new challenges for HPC storage systems. By comparing different storage architectures and their corresponding file system approaches, a novel cache approach, named DDCache, is proposed to improve the efficiency of data-intensive computing. DDCache uses the distributed storage architecture as a performance booster for the centralized storage architecture by exploiting the node-local storage distributed across the system. To supply a much larger cache volume than a volatile memory cache can, DDCache aggregates the node-local disks into a large non-volatile cooperative cache. A high cache hit ratio is then achieved by keeping intermediate data in DDCache for as long as possible over the whole run of an application. To make node-local storage efficient enough to act as a data cache, a locality-aware data layout keeps cached data close to the compute tasks and evenly distributed. Furthermore, concurrency control is introduced to throttle the I/O requests flowing into or out of DDCache, so that node-local storage regains its particular advantage. Evaluations on typical HPC platforms verify the effectiveness of DDCache: I/O bandwidth scales in the well-known HPC checkpoint/restart scenario, and the overall performance of a typical data-intensive application improves by up to 6 times.
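
The abstract names two mechanisms, a locality-aware data layout and concurrency control, without giving implementation details. The following is only a toy sketch of how the two ideas might fit together, not the DDCache implementation. The class name `ToyCooperativeCache`, the "writer's disk first, else least-loaded node" placement rule, and the per-node semaphore are all assumptions made for illustration.

```python
import threading


class ToyCooperativeCache:
    """Toy model: every node contributes its local disk to one shared cache.

    Hypothetical illustration only; none of these names come from the paper.
    """

    def __init__(self, nodes, capacity_per_node, max_concurrent_io=2):
        self.capacity = capacity_per_node
        self.used = {n: 0 for n in nodes}   # blocks stored on each node
        self.location = {}                  # block id -> node holding it
        self.meta_lock = threading.Lock()   # protects used/location
        # One semaphore per node caps the in-flight I/O requests on its disk.
        self.io_slots = {
            n: threading.BoundedSemaphore(max_concurrent_io) for n in nodes
        }

    def place(self, block_id, writer_node):
        """Prefer the writer's own disk; otherwise pick the least-loaded node."""
        with self.meta_lock:
            if self.used[writer_node] < self.capacity:
                target = writer_node        # locality: data stays near its task
            else:
                target = min(self.used, key=self.used.get)  # even distribution
                if self.used[target] >= self.capacity:
                    raise RuntimeError("cache full")
            self.used[target] += 1
            self.location[block_id] = target
        with self.io_slots[target]:
            pass  # a real system would write the block to the target disk here
        return target

    def lookup(self, block_id):
        """Return the node caching the block, or None on a cache miss."""
        return self.location.get(block_id)
```

In this sketch a task that fills its own disk spills to the emptiest peer, and the semaphore bounds how many requests hit one disk at a time; an actual system would also need eviction and write-back to shared storage, which are omitted here.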

Key words: data-intensive computing, cache, local storage, shared storage, seismic data processing
