ISSN 1000-1239 CN 11-1777/TP


    Cache Optimization Approaches of Emerging Non-Volatile Memory Architecture: A Survey
    He Yanxiang, Shen Fanfan, Zhang Jun, Jiang Nan, Li Qing’an, Li Jianhua
    Journal of Computer Research and Development    2015, 52 (6): 1225-1241.   DOI: 10.7544/issn1000-1239.2015.20150104
    With the development of semiconductor technology and CMOS scaling, on-chip cache capacity keeps growing in modern processor designs. The density of traditional static RAM (SRAM) is approaching its limit. Moreover, SRAM consumes a large amount of leakage power, which severely affects system performance. Designing an efficient on-chip storage architecture has therefore become increasingly challenging. To address these issues, researchers have studied many emerging non-volatile memory (NVM) technologies, which offer attractive features such as non-volatility, low leakage power and high density. To explore cache optimization approaches based on emerging NVM, including spin-transfer torque RAM (STT-RAM), phase change memory (PCM), resistive RAM (RRAM) and domain-wall memory (DWM), this paper surveys the properties of non-volatile memory in comparison with traditional memory devices. It then discusses the advantages, disadvantages and feasibility of building caches from these devices. To highlight their differences and similarities, a detailed analysis classifies and summarizes the cache optimization approaches and policies. These techniques aim to mitigate the high write power, limited write endurance and long write latency of emerging non-volatile memory. Finally, research prospects for emerging non-volatile memory in future storage architectures are discussed.
    Directory Cache Design for Multi-Core Processor
    Wang Endong, Tang Shibin, Chen Jicheng, Wang Hongwei, Ni Fan, Zhao Yaqian
    Journal of Computer Research and Development    2015, 52 (6): 1242-1253.   DOI: 10.7544/issn1000-1239.2015.20150140
    With the development of the Internet of Things, cloud computing and Internet public opinion analysis, big data applications are becoming the critical workloads in current data centers. A directory cache is used to guarantee cache coherence in chip multiprocessors, which are massively deployed in data centers. Previous research has proposed many innovations to improve the capacity utilization and scalability of the directory cache, making it more suitable for high-performance computing. Big data workloads, however, are timing-sensitive, a requirement that previous work does not satisfy. To meet the requirements of big data workloads, the master-slave directory is a novel directory cache design that optimizes the path of memory instructions. In this design, the master directory picks up private data accesses and serves them to reduce miss latency, while the slave directory provides cache coherence for the shared memory space to improve cache capacity utilization and the scalability of the chip multiprocessor. Our experimental benchmark is CloudSuite-v1.0, running on the Simics+GEMS simulator. Compared with a sparse directory with 2× capacity, the experimental results show that the master-slave directory reduces hardware overhead by 24.39%, reduces miss latency by 28.45%, and improves IPC by 3.5%. Compared with an in-cache directory, the master-slave directory sacrifices 5.14% in miss latency and 1.1% in IPC, but reduces hardware overhead by 42.59%.
    MACT: Discrete Memory Access Requests Batch Processing Mechanism for High-Throughput Many-Core Processor
    Li Wenming, Ye Xiaochun, Wang Da, Zheng Fang, Li Hongliang, Lin Han, Fan Dongrui, Sun Ninghui
    Journal of Computer Research and Development    2015, 52 (6): 1254-1265.   DOI: 10.7544/issn1000-1239.2015.20150154
    The rapid development of new high-throughput applications, such as Web services, brings huge challenges to traditional processors, which target high-performance applications. High-throughput many-core processors, as a new class of processors, have become a hotspot for high-throughput applications. However, with the dramatic increase in the number of on-chip cores, combined with the memory-intensive nature of high-throughput applications, the "memory wall" problem has intensified. An analysis of the memory access behavior of high-throughput applications shows that a large proportion of memory accesses are fine-grained, which degrades bandwidth utilization and causes unnecessary energy consumption. Based on this observation, a memory access collection table (MACT) is implemented in the design of high-throughput many-core processors to collect discrete memory access requests and handle them in batches under a deadline constraint. With the MACT hardware mechanism, both bandwidth utilization and execution efficiency are improved. QoS is also guaranteed by a time-window mechanism, which ensures that all requests are sent before the deadline. WordCount, TeraSort and Search, which are typical high-throughput application benchmarks, are used in the experiments. The experimental results show that MACT reduces the number of memory access requests by 49%, improves bandwidth efficiency by 24%, and improves the average execution speed by 89%.
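    The idea of collecting discrete requests and sending them in batches before a deadline can be illustrated with a small software model. This is only a sketch under stated assumptions, not the MACT hardware described in the paper: the class name `CollectionTable`, the 64-byte line size and the 8-cycle window are all hypothetical choices made for illustration.

```python
class CollectionTable:
    """Hypothetical software model of a request collection table.

    Small requests that fall in the same memory line are merged into one
    entry; an entry is flushed as a single batch once its deadline passes.
    """

    def __init__(self, line_size=64, window=8):
        self.line_size = line_size   # bytes per memory line (assumed)
        self.window = window         # max cycles a request may wait (assumed)
        self.entries = {}            # line address -> (deadline, byte offsets)

    def tick(self, now):
        """Flush every entry whose deadline has been reached."""
        expired = [line for line, (deadline, _) in self.entries.items()
                   if deadline <= now]
        return [(line, sorted(self.entries.pop(line)[1])) for line in expired]

    def access(self, addr, now):
        """Record one fine-grained request; return any batches flushed."""
        flushed = self.tick(now)
        line = addr - addr % self.line_size
        if line not in self.entries:
            # The deadline is fixed by the first request in the entry.
            self.entries[line] = (now + self.window, set())
        self.entries[line][1].add(addr % self.line_size)
        return flushed


table = CollectionTable()
for cycle, addr in enumerate([0, 8, 130, 16]):
    table.access(addr, cycle)
batches = table.tick(8) + table.tick(10)
```

    In this toy run, four fine-grained requests leave the table as two batched line requests, and no request waits longer than the window.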
    A Trace-Driven Simulation of Memory System in Multithread Applications
    Zhu Pengfei, Lu Tianyue, Chen Mingyu
    Journal of Computer Research and Development    2015, 52 (6): 1266-1277.   DOI: 10.7544/issn1000-1239.2015.20150160
    Chip multiprocessors (CMPs) have become important for multithreaded applications because of their high throughput in big data computing. However, growing memory latency increasingly limits system performance because of the memory wall. Two independent simulation methods, trace-driven and execution-driven, are available to system researchers for studying and evaluating the memory system. On one hand, researchers employ trace-driven simulation for speed, because it removes data processing and is faster than its execution-driven counterpart. On the other hand, the lack of data processing induces both global and local trace misplacements, which never occur when multithreaded applications run on a real machine. Analytical modeling shows remarkable variations in performance metrics due to trace misplacements. The underlying reasons are that, in trace-driven simulation: 1) locks do not prevent threads from entering a critical section non-exclusively; 2) barriers do not synchronize threads as needed; 3) the dependences among memory operations are violated. To improve the accuracy of memory system simulation for multithreaded applications, a methodology is designed to eliminate both global and local trace misplacement in trace-driven simulation. Experiments show that eliminating global trace misplacement of memory operations induces up to a 10.22% reduction in various IPC metrics, while eliminating local trace misplacement of memory operations induces at least a 50% reduction in the arithmetic mean of IPC metrics. The proposed methodology ensures the invariability of multithreaded applications in trace-driven simulation.
    A Data Deduplication-Based Primary Storage System in Cloud-of-Clouds
    Mao Bo, Ye Geyan, Lan Yanjia, Zhang Yangsong, Wu Suzhen
    Journal of Computer Research and Development    2015, 52 (6): 1278-1287.   DOI: 10.7544/issn1000-1239.2015.20150139
    With the rapid development of cloud storage technology, more and more companies are uploading data to cloud storage platforms. However, depending solely on one particular cloud storage provider raises a number of potentially serious problems, such as vendor lock-in, availability, and security issues. To address these problems, this paper proposes a deduplication-based primary storage system in cloud-of-clouds, which eliminates redundant data blocks in the cloud computing environment and distributes the data among multiple independent cloud storage providers. The data is stored across the providers by combining replication and erasure coding. Replication is easy to implement and deploy but has high storage overhead. Erasure coding has low storage overhead but incurs computational overhead for encoding and decoding. To exploit the advantages of both schemes and the reference characteristics of data deduplication, highly referenced data blocks are stored with replication and the remaining data blocks are stored with erasure coding. Experiments on our lightweight prototype implementation show that the deduplication-based primary storage system in cloud-of-clouds significantly improves performance and cost efficiency compared with existing schemes.
    A Heterogeneous Cloud Computing Architecture and Multi-Resource-Joint Fairness Allocation Strategy
    Wang Jinhai, Huang Chuanhe, Wang Jing, He Kai, Shi Jiaoli, Chen Xi
    Journal of Computer Research and Development    2015, 52 (6): 1288-1302.   DOI: 10.7544/issn1000-1239.2015.20150168
    Resource allocation strategies are currently an important research hotspot in cloud computing. The most fundamental problem is how to fairly allocate a finite amount of resources to multiple users or applications in complex scenarios under a heterogeneous cloud computing architecture while maximizing resource utilization or efficiency. In classical resource allocation problems, however, tasks or users are often greedy, so fairness of allocation is particularly important when resources are finite. To meet different task requirements and achieve fairness across multiple resource types, we design a heterogeneous cloud computing architecture and present an algorithm for maximizing multi-resource fairness based on dominant resource (MDRF). We further prove properties of our algorithm, such as Pareto efficiency, and define dominant resource entropy (DRE) and dominant resource weight (DRW). DRE accurately depicts the degree of fit between a user's resource requirements and the resource types of the server allocated to the user's tasks; it makes the system more adaptive and improves resource utilization. DRW, in cooperation with the Max-Min strategy adopted to guarantee fairness, guarantees the priority with which users obtain resources and makes resource allocation more orderly. Experimental results demonstrate that our strategy achieves higher resource utilization and a better match between resource requirements and resource provision. Furthermore, our algorithm lets users obtain more of their dominant resource and improves the quality of service.
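    For readers unfamiliar with dominant-resource fairness, the following sketch shows the generic progressive-filling form of the idea on a single resource pool. It is not the MDRF algorithm of the paper and does not model DRE or DRW; the function name `drf_allocate` and the example capacities and demands are assumptions chosen for illustration.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Progressive filling by dominant share (generic DRF sketch).

    capacity: total amount of each resource, e.g. [cpus, memory_gb]
    demands:  per-task demand vector of each user
    Returns the number of tasks granted to each user.
    """
    used = [0] * len(capacity)
    tasks = {user: 0 for user in demands}
    share = {user: Fraction(0) for user in demands}  # dominant shares
    active = list(demands)

    while active:
        # Serve the user whose dominant share is currently the smallest.
        user = min(active, key=lambda u: share[u])
        need = demands[user]
        if any(used[i] + need[i] > capacity[i] for i in range(len(capacity))):
            active.remove(user)  # this user's next task no longer fits
            continue
        for i in range(len(capacity)):
            used[i] += need[i]
        tasks[user] += 1
        # Dominant share: the largest fraction of any resource held.
        share[user] = max(Fraction(tasks[user] * need[i], capacity[i])
                          for i in range(len(capacity)))
    return tasks


# 9 CPUs and 18 GB; user A runs <1 CPU, 4 GB> tasks, user B <3 CPU, 1 GB>.
result = drf_allocate([9, 18], {"A": (1, 4), "B": (3, 1)})
```

    With these inputs both users end at a dominant share of 2/3: user A holds 12 of 18 GB and user B holds 6 of 9 CPUs.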
    EOFDM: A Search Method for Energy-Efficient Optimization in Many-Core Architecture
    Zhu Yatao, Zhang Shuai, Wang Da, Ye Xiaochun, Zhang Yang, Hu Jiuchuan, Zhang Zhimin, Fan Dongrui, Li Hongliang
    Journal of Computer Research and Development    2015, 52 (6): 1303-1315.   DOI: 10.7544/issn1000-1239.2015.20150153
    In energy-consumption optimization, "area-power" assignment is one of the research issues for many-core processors. The distribution of area-power in the space of core number and frequency level can be obtained from an energy-performance model. A progressive search for the optimal "core number and frequency level" configuration can then be carried out in two dimensions. However, existing search methods for energy-efficient optimization converge slowly and incur a large search overhead in the space of core number and frequency level. Moreover, although searching for the optimal core number and frequency level in a space built from an analytical energy-performance model reduces the overhead of real execution, the accuracy of the optimal solution depends heavily on the prediction error of the model. Therefore, a search method based on FDM (EOFDM) is developed to reduce the dimensions of core number and frequency, and to use the real energy and performance of each feasible point to correct the model computation. The experimental results show that, compared with the hill-climbing heuristic (HCH), our method reduces the execution times, the performance overhead and the energy overhead by 39.5%, 46.8% and 48.3% on average, by 48.8%, 51.6% and 50.9% when the number of cores is doubled, and by 45.5%, 49.8% and 54.4% when the number of frequency levels is doubled. Our method thus improves convergence, search cost and scalability.
    Lightweight Error Recovery Techniques of Many-Core Processor in High Performance Computing
    Zheng Fang, Shen Li, Li Hongliang, Xie Xianghui
    Journal of Computer Research and Development    2015, 52 (6): 1316-1328.   DOI: 10.7544/issn1000-1239.2015.20150119
    Owing to advances in semiconductor technology, many-core processors with a large number of cores are widely used in high-performance computing. Compared with multi-core processors, many-core processors provide higher computing density and a higher ratio of computation to power consumption. However, many-core processors need more efficient fault tolerance mechanisms to address serious reliability problems and alleviate performance degradation, while keeping the cost in chip area and power low. In this paper, we present a prototype of the home-grown many-core processor DFMC (deeply fused and heterogeneous many-core). Based on the processor's architecture and the inter-core characteristics of its applications, independent and coordinated lightweight error recovery techniques are proposed. When errors are detected, the related cores can quickly roll back to a consistent recovery line through the coordinated error recovery technique, which is controlled by a centralized unit and connected by a coordinated recovery bus. To preserve application performance, the error recovery techniques are performed by instructions and the recovery states are saved in the cores. Our experimental results show that the techniques are effective: 80% of transient errors can be corrected with a chip area increase of 1.257%. The lightweight error recovery techniques have very little influence on application performance, chip frequency and chip power consumption, and they improve the fault tolerance of the many-core processor.
    Paleyfly: A Scalable Topology in High Performance Interconnection Network
    Lei Fei, Dong Dezun, Pang Zhengbin, Liao Xiangke, Yang Mingying
    Journal of Computer Research and Development    2015, 52 (6): 1329-1340.   DOI: 10.7544/issn1000-1239.2015.20150162
    The high performance interconnection network is one of the most important parts of a high performance computing system. Topology design is the key to developing larger-scale interconnection networks. We therefore contribute a new hierarchical topology, Paleyfly (PF), which not only exploits the strongly regular graph property of the Paley graph but also supports continuous scaling like the Random Regular (RR) graph. Compared with other new high performance interconnection networks, Paleyfly addresses the scalability problem of Dragonfly (DF), the physical cost of Fat tree (Ft), and the wiring complexity and routing-table storage of Random Regular, among others. Meanwhile, based on the strongly regular graph property, which suits load-balanced routing, we propose four routing algorithms to deal with congestion. Finally, through simulation we briefly analyze the performance of Paleyfly in comparison with other topologies and under different routing algorithms. Experimental results show that our topology performs better than Random Regular under various network scales and different traffic patterns.
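    The strongly regular property that the abstract relies on can be checked directly on a small Paley graph. The sketch below builds the standard Paley graph for a prime q with q ≡ 1 (mod 4) and verifies its parameters; it does not reproduce the hierarchical Paleyfly construction or its routing algorithms, and the helper names `paley_graph` and `srg_parameters` are hypothetical.

```python
def paley_graph(q):
    """Adjacency sets of the Paley graph on q vertices.

    Assumes q is a prime with q % 4 == 1, so that -1 is a quadratic
    residue and the adjacency relation is symmetric.
    """
    residues = {(x * x) % q for x in range(1, q)}  # nonzero squares mod q
    return {v: {u for u in range(q) if u != v and (u - v) % q in residues}
            for v in range(q)}


def srg_parameters(adj):
    """Return (n, degrees, lambdas, mus) as sets of observed values.

    A graph is strongly regular when each of the three sets has exactly
    one element: every vertex has the same degree, adjacent pairs share
    lambda common neighbours, and non-adjacent pairs share mu.
    """
    degrees = {len(nb) for nb in adj.values()}
    lambdas = {len(adj[u] & adj[v]) for u in adj for v in adj[u]}
    mus = {len(adj[u] & adj[v]) for u in adj for v in adj
           if u != v and v not in adj[u]}
    return len(adj), degrees, lambdas, mus


params = srg_parameters(paley_graph(13))
```

    For q = 13 this yields the known parameters (q, (q-1)/2, (q-5)/4, (q-1)/4) = (13, 6, 2, 3), so any two switches that are not directly linked share exactly three common neighbours.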
    Ant Cluster: A Novel High-Efficiency Multipurpose Computing Platform
    Xie Xianghui, Qian Lei, Wu Dong, Yuan Hao, Li Xiang
    Journal of Computer Research and Development    2015, 52 (6): 1341-1350.   DOI: 10.7544/issn1000-1239.2015.20150201
    Driven by the demands of scientific computing and big data processing, high performance computers around the world have become more powerful and their system scales larger than ever before. However, the power consumption of the whole system is becoming a severe bottleneck for further performance improvement. In this paper, after an in-depth analysis of four types of HPC systems, we propose and study two key technologies: reconfigurable micro server (RMS) technology, and a cluster construction technology that combines node autonomy with node cooperation. RMS technology provides a new way to balance the performance, power consumption and size of computing nodes. By combining node autonomy and node cooperation, a large number of small computing nodes can be aggregated into a scalable RMS cluster. Based on these technologies, we propose a new high-efficiency multipurpose computing platform architecture called Ant Cluster and construct a prototype system consisting of 2,048 low-power, ant-like small computing nodes. On this cluster, we implement two real applications. The test results show that, for real-time large-scale fingerprint matching, a single RMS node achieves a 34-fold speed-up over a single Intel Xeon core while consuming only 5 W. The whole prototype system supports processing hundreds of queries on a database of 10 million fingerprints in real time. For data sorting, our prototype system achieves 10 times the performance per watt of a GPU platform.
    Mitigating Log Cost through Non-Volatile Memory and Checkpoint Optimization
    Wan Hu, Xu Yuanchao, Yan Junfeng, Sun Fengyun, Zhang Weigong
    Journal of Computer Research and Development    2015, 52 (6): 1351-1361.   DOI: 10.7544/issn1000-1239.2015.20150171
    A sudden power failure or system crash can leave a file system inconsistent while permanent user data or metadata is being updated at its home location on disk, an issue known as the crash-consistency problem. Most existing file systems use consistency techniques such as write-ahead logging (WAL) or copy-on-write (COW) to avoid this situation. The Ext4 file system ensures the consistency of persistent operations through transactions and a journaling mechanism. However, this requires writing file system metadata to disk twice. Metadata updates are small in granularity, large in number and highly repetitive, which degrades program performance and shortens the lifetime of flash-based SSDs. This paper proposes employing non-volatile memory (NVM) as an independent log partition that can be accessed directly through a load/store interface. Furthermore, we optimize disk write operations by using a reverse scan during checkpointing to reduce repeated metadata updates to the same data block. Preliminary experimental results show that, for write-heavy workloads, performance improves by up to 50% on HDD and 23% on SSD when NVM is used as the external journal partition device, and that the number of write operations is reduced significantly by the reverse scan checkpoint technique.
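    The effect of scanning the journal in reverse during checkpointing can be illustrated with a few lines of code. This is a minimal sketch assuming the journal is an in-memory list of (block number, payload) records in commit order; it is not the Ext4 or paper implementation, and the function name `checkpoint_reverse_scan` is hypothetical.

```python
def checkpoint_reverse_scan(journal):
    """Pick the writes needed to checkpoint a journal.

    journal: list of (block_no, payload) records, oldest first (assumed
    layout). Scanning from newest to oldest means the first record seen
    for a block is its latest version, so older records for the same
    block can be skipped instead of being written and then overwritten.
    """
    seen = set()
    writes = []
    for block_no, payload in reversed(journal):
        if block_no in seen:
            continue  # a newer version of this block is already scheduled
        seen.add(block_no)
        writes.append((block_no, payload))
    return writes


journal = [(5, "a1"), (9, "b1"), (5, "a2"), (5, "a3"), (9, "b2")]
writes = checkpoint_reverse_scan(journal)
```

    Here five journal records touch only two distinct blocks, so the checkpoint issues two writes to home locations rather than five.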
    Journal of Computer Research and Development    2015, 52 (6): 1223-1224.  