ISSN 1000-1239 CN 11-1777/TP

Table of Content

01 February 2018, Volume 55 Issue 2
NV-Shuffle: Shuffle Based on Non-Volatile Memory
Pan Fengfeng, Xiong Jin
2018, 55(2):  229-245.  doi:10.7544/issn1000-1239.2018.20170742
Asbtract ( 1091 )   HTML ( 3)   PDF (6147KB) ( 785 )  
Related Articles | Metrics
In the popular big data processing platforms like Spark, it is common to collect data in a many-to-many fashion during a stage traditionally known as the Shuffle phase. Data exchange happens across different types of tasks or stages via Shuffle phase. And during this phase, the data need to be transferred via network and persisted into traditional disk-based file system. Hence, the efficiency of Shuffle phase is one of the key factors in the performance of the big data processing. In order to reducing I/O overheads, we propose an optimized Shuffle strategy based on Non-Volatile Memory (NVM)—NV-Shuffle. Next-generation non-volatile memory (NVM) technologies, such as Phase Change Memory (PCM), Spin-Transfer Torque Magnetic Memories (STTMs) introduce new opportunities for reducing I/O overhead, due to their non-volatility, high read/write performance, low energy, etc. In the big data processing platform based on memory computing such as Spark, Shuffle data access based on disks is an important factor of application performance, NV-Shuffle uses NVM as persist memory to store Shuffle data and employs direct data accesses like memory by introducing NV-Buffer to organize data instead of traditional file system.We implemented NV-Shuffle in Spark. Our performance results show, NV-shuffle reduces job execution time by 10%~40% for Shuffle-heavy workloads.
Heterogeneous Memory Programming Framework Based on Spark for Big Data Processing
Wang Chenxi, Lü Fang, Cui Huimin, Cao Ting, John Zigman, Zhuang Liangji, Feng Xiaobing
2018, 55(2):  246-264.  doi:10.7544/issn1000-1239.2018.20170687
Asbtract ( 1238 )   HTML ( 5)   PDF (5066KB) ( 721 )  
Related Articles | Metrics
Due to the boom of big data applications, the amount of data being processed by servers is increasing rapidly. In order to improve processing and response speed, industry is deploying in-memory big data computing systems, such as Apache Spark. However, traditional DRAM memory cannot satisfy the large memory request of these systems for the following reasons: firstly, the energy consumption of DRAM can be as high as 40% of the total; secondly, the scaling of DRAM manufacturing technology is hitting the limit. As a result, heterogeneous memory integrating DRAM and NVM (non-volatile memory) is a promising candidate for future memory systems. However, because of the longer latency and lower bandwidth of NVM compared with DRAM, it is necessary to place data in appropriate memory module to achieve ideal performance. This paper analyzes the memory access behavior of Spark applications and proposes a heterogeneous memory programming framework based on Spark. It is easy to apply this framework to existing Spark applications without rewriting the code. Experiments show that for Spark benchmarks, by utilizing our framework, only placing 20%~25% data on DRAM and the remaining on NVM can reach 90% of the performance when all the data is placed on DRAM. This leads to an improved performance-dollar ratio compared with DRAM-only servers and the potential support for larger scale in-memory computing applications.
Design and Verification of NVM Control Architecture Based on High-Performance SOC FPGA Array
Liu Ke, Cai Xiaojun, Zhang Zhiyong, Zhao Mengying, Jia Zhiping
2018, 55(2):  265-272.  doi:10.7544/issn1000-1239.2018.20170695
Asbtract ( 1286 )   HTML ( 9)   PDF (3108KB) ( 591 )  
Related Articles | Metrics
Emerging non-volatile memory (NVM) technologies are getting mature with lower latency and higher bandwidth. In the future, these new technologies show the potentials that not only replace the DRAM as the main memory but also serve in the external memory storage. Meanwhile, designing an efficient memory system has become popular in both the academic world and the industrial world. In this paper, we describe a high-performance NVM verification architecture based on the array of SOC FPGAs. Within the architecture, multiple levels of FPGAs are employed to connect many NVMs. Based on the architecture, we propose a novel master-slave NVM controller and then design a hardware prototype accordingly. The experiment results running on this prototype show that the architecture can not only test the performance of the homogenous NVM groups, but also verify the management scheme of hybrid NVM arrays. Moreover, the high performance of MRAM shows that MRAM has the potential to serve in both cache and main memory.
Large-Scale Graph Processing on Multi-GPU Platforms
Zhang Heng, Zhang Libo, WuYanjun
2018, 55(2):  273-288.  doi:10.7544/issn1000-1239.2018.20170697
Asbtract ( 1168 )   HTML ( 8)   PDF (5112KB) ( 1051 )  
Related Articles | Metrics
GPU-based node has emerged as a promising direction toward efficient large-scale graph processing, which is relied on the high computational power and scalable caching mechanisms of GPUs. Out-of-core graphs are the graphs that exceed main and GPU-resident memory capacity. To handle them, most existing systems using GPUs employ compact partitions of fix-sized ordered edge sets (i.e., shards) for the data movement and computation. However, when scaling to platforms with multiple GPUs, these systems have a high demand of interconnect (PCI-E) bandwidth. They suffer from GPU underutilization and represent scalability and performance bottlenecks. This paper presents GFlow, an efficient and scalable graph processing system to handle out-of-core graphs on multi-GPU nodes. In GFlow, we propose a novel 2-level streaming windows method, which stores graph’s attribute data consecutively in shared memory of multi-GPUs, and then streams graph’s topology data (shards) to GPUs. With the novel 2-level streaming windows, GFlow streams shards dynamically from SSDs to GPU devices’ memories via PCI-E fabric and applies on-the-fiy updates while processing graphs, thus reducing the amount of data movement required for computation. The detailed evaluations demonstrate that GFlow significantly outperforms most other competing out-of-core systems for a wide variety of graphs and algorithms under multi-GPUs environment, i.e., yields average speedups of 256X and 203X over CPU-based GraphChi and X-Stream respectively, and 1.3~2.5X speedup against GPU-based GraphReduce (single-GPU). Meanwhile, GFlow represents excellent scalability as we increase the number of GPUs in the node.
Partitioning Acceleration Between CPU and DRAM: A Case Study on Accelerating Hash Joins in the Big Data Era
Wu Linyang, Luo Rong, Guo Xueting, Guo Qi
2018, 55(2):  289-304.  doi:10.7544/issn1000-1239.2018.20170842
Asbtract ( 1372 )   HTML ( 5)   PDF (5194KB) ( 537 )  
Related Articles | Metrics
Hardware acceleration has been very effective in improving energy efficiency of existing computer systems. As traditional hardware accelerator designs (e.g. GPU, FPGA and customized accelerators) remain decoupled from main memory systems, reducing the energy cost of data movement remains a challenging problem, especially in the big data era. The emergence of near-data processing enables acceleration within the 3D-stacked DRAM to greatly reduce the data movement cost. However, due to the stringent area, power and thermal constraints on the 3D-stacked DRAM, it is nearly impossible to integrate all computation units required for a sufficiently complex functionality into the DRAM. Therefore, there is a need to design the memory side accelerator with this partitioning between CPU and accelerator in mind. In this paper, we describe our experience with partitioning the acceleration of hash joins, a key functionality for databases and big data systems, using a data-movement driven approach on a hybrid system, containing both memory-side customized accelerators and processor-side SIMD units. The memory-side accelerators are designed for accelerating execution phases that are bounded by data movements, while the processor-side SIMD units are employed for accelerating execution phases with negligible data movement cost. Experimental results show that the hybrid accelerated system improves energy efficiency up to 47.52x and 19.81x, compared with the Intel Has well and Xeon Phi processor, respectively. Moreover, our data-movement driven design approach can be easily extended to guide the design decisions of accelerating other emerging applications.
Persistent Transactional Memory for Databases
Hillel Avni, Wang Peng
2018, 55(2):  305-318.  doi:10.7544/issn1000-1239.2018.20170863
Asbtract ( 1289 )   HTML ( 10)   PDF (3702KB) ( 604 )  
Related Articles | Metrics
Hardware transactional memory (HTM) and byte-addressable nonvolatile memory (NVM) are already available in new computer equipment. It is tempting, but not trivial, to combine them to implement transactions having the capabilities of ACID (atomicity, consistency, isolation and durability), by using HTM for consistency and isolation, and NVM for durability. ACID transactions are especially useful in databases but, because of the size of database transactions, the challenge is to cope with the inherent HTM limitations of size and contention level. In this paper, we first present persistent HTM (PHTM), a software-hardware solution for ACID transactions with HTM. We continue with two methods to mitigate PHTM limitations. One is a persistent hybrid TM algorithm called PHyTM, which allows PHTM transactions to execute concurrently with pure software, unbounded transactions. The other is for workloads where most transactions are too large for PHTM. For the purpose we propose a new algorithm called split transactions execution (STE), which is tailored for relational database transactions. In a nutshell, this paper discusses the extension of HTM to ACID database transactions on NVM.
X-DB: Software and Hardware Co-Designed Database System
Zhang Tieying, Huang Gui, Zhang Yingqiang, Wang Jianying, Hu Wei, Zhao Diankui, He Dengcheng
2018, 55(2):  319-326.  doi:10.7544/issn1000-1239.2018.20170868
Asbtract ( 4069 )   HTML ( 34)   PDF (2166KB) ( 1889 )  
Related Articles | Metrics
The field of database system has three stages of development. The first stage is when relational model was proposed by E.F Codd. Relational model establishes the foundation of the database theory and database system. It contributes many database market giants, like IBM DB2, Microsoft SQLServer and Oracle. The second stage is due to the rapid development of Internet, which produces NoSQL database system. NoSQL focuses on system scalability but sacrifices transactional features. The third stage is called modern database era represented by new hardware features. Alibaba X-DB is such kind of database system. X-DB fully utilizes new hardware in different areas including storage, network, multi-core, parallel and heterogeneous computing. X-DB co-designs hardware and software and is compatible with MySQL ecosystem with the goal to renovate the relational database system.
Edge Computing: Platforms, Applications and Challenges
Zhao Ziming, Liu Fang, Cai Zhiping, Xiao Nong
2018, 55(2):  327-337.  doi:10.7544/issn1000-1239.2018.20170228
Asbtract ( 5177 )   HTML ( 145)   PDF (2284KB) ( 3352 )  
Related Articles | Metrics
With the trend of Internet of Everything, the number of end devices (e.g. smartphones, smart glasses) has increased rapidly which makes the data produced by these devices have grown at rates far more than the growth rate of network bandwidth. At the same time, the emergence of novel applications demands lower latency of the network, such as augmented reality and manless driving. Edge computing integrates any computing, storage and network resources at the edge of the network into a unified platform that provides services for users. This new computing model gets around the bottleneck that is caused by network bandwidth and latency, has received widespread attention in both industrial and academic. In this survey, we first introduce the concepts of edge computing and provide the definition of it. To help the readers to better understand the characteristics of edge computing, we compare it with cloud computing. We then analyze the three representative instances of edge computing platform and make a systematic comparison of them. After that, we enumerate some typical applications based on the edge computing to describe the advantages of edge computing in mobile or Internet of Things applications. Finally, the paper lays out several grand challenges of edge computing.
Contact Duration Aware Cooperative Data Caching in Mobile Opportunistic Networks
Zheng Xiao, Gao Han, Wang Xiujun, Qin Feng
2018, 55(2):  338-345.  doi:10.7544/issn1000-1239.2018.20160929
Asbtract ( 1249 )   HTML ( 4)   PDF (1585KB) ( 535 )  
Related Articles | Metrics
How to improve the efficiency of data access is always a hot topic in the research area of mobile opportunistic networks. Traditional cooperative caching techniques are commonly used to improve the performance of data access. However, the strongly independent mobility and limited contact duration of the mobile nodes render these traditional caching schemes inefficient. Firstly, a new metric, called as node important degree, is proposed to determine which node is more important to cooperative data caching. Based on this metric, a greedy algorithm is used to select initial cache nodes, and subsequently the cache data will be redistributed among these cache nodes actively as they meet each other. A novel data fragmenting strategy is suggested to adapt to the limited contact duration between nodes with the aim to make our protocol suitable for short-duration contact between cache nodes. In order to solve the coupon collector’s problem in data recovery, a randomly linear network coding method is used to encode the data fragmentations. Moreover, we describe an adaptive caching bound calculation method for each mobile node to limit the amount of data it caches, which is helpful to the rational utilization of cache space. Experimental results show that our suggested cooperative caching protocol can significantly improve the efficiency of data access in mobile opportunistic networks.
Reducing the Southbound Interface Overhead for OpenFlow Based on the Flow Volume Characteristics
Zheng Peng, Hu Chengchen, Li Hao
2018, 55(2):  346-357.  doi:10.7544/issn1000-1239.2018.20160743
Asbtract ( 1359 )   HTML ( 11)   PDF (5655KB) ( 593 )  
Related Articles | Metrics
Software defined networking (SDN) decouples the control plane from the switch in the data plane, which forms the SDN controller. This paradigm introduces many benefits, e.g., openness, management simplicity, etc. Nevertheless, the separation of the SDN switch and the controller also leads to great communication overhead between them due to controlling the network (the number of the control message and Table-Miss packets), and the overhead becomes the major bottleneck of SDN. On the one hand, each Table-Miss event can produce multiple Flow-Mod messages which add extra bandwidth overhead as well as delay to the southbound interface. On the other hand, controller has no awareness of flow characteristic information behind the Flow-Mod messages which make the overhead worse. This paper proposes a new architecture uFlow (split up Flow) to mitigate the overhead at the controller side based on the flow volume characteristics. We implemented the prototype of uFlow system both in software-based platform mininet and hardware-based platform ONetSwitch. Experimental results driven by the real traffic show that uFlow can significantly reduce the communication overhead between control plane and data plane, the number of the control message has a decrease of 70% off on average, eliminate redundant update of flow entries in switch and reduce the transmission delay of packets.
A Tool for Automatic Service Interface Testing
Zhuo Xinxin, Bai Xiaoying, Xu Jing, Li Enpeng, Liu Yu, Kang Jiehui, Song Wenli
2018, 55(2):  358-376.  doi:10.7544/issn1000-1239.2018.20160721
Asbtract ( 949 )   HTML ( 12)   PDF (6341KB) ( 799 )  
Related Articles | Metrics
In SaaS (software-as-a-service), software functions are encapsulated as independent and self-contained services, and users can access these services through well-defined interface. The correctness and reliability of service interfaces are critical for service understanding, reuse and integration. With the increasing acceptance of SaaS, more and more software expose interfaces for Internet-based open access. API testing for service interfaces is thus getting increasing attentions. To this end, a model-driven automatic testing method is presented to facilitate efficient and effective service interfaces testing. A model called ISC (interface semantic contract) is defined for modeling services with domain knowledge. Following the model-driven approach, tests are generated from ISC at three levels: test data, test cases for individual services and for composite services. Test cases are then translated to target programming languages through a kind of meta-model defined for test cases. An automatic testing tool (AutoTest) has been designed and implemented, which integrates various algorithms to optimize test generation. What’s more, the tool supports design test plan in graphical form and generates test cases in multiple programming languages, for example, C++ or Java. Experimental results demonstrate that AutoTest can support design and generation of large quantities of test cases effectively and efficiently, and test cases generated by OED (orthogonal experimental design) algorithm have more satisfactory test coverage than those by pairwise IPO (in parameter order) algorithm.
Service Substitution Method in Distributed Virtualized Environment Based on Transaction
Zou Shichen, Wang Huiqiang, Lü Hongwu, Feng Guangsheng, Lin Junyu
2018, 55(2):  377-390.  doi:10.7544/issn1000-1239.2018.20160925
Asbtract ( 853 )   HTML ( 1)   PDF (3492KB) ( 500 )  
Related Articles | Metrics
The dynamic and heterogeneous characteristics of distributed virtualized environment can lead to the failure or error of the service composition running in the distributed virtualized environment. It can result in the disruption of the entire business process, which greatly affects the dependability of the whole software system. As the most commonly used method to cope with the service failure, the existing service substitution methods can cause that the consistency and correctness of the service composition after the substitution is destroyed due to the lack of transaction support. In this paper, we propose a service substitution method in distributed virtualized environment based on transaction compensation. The method we proposed is based on the service composition transactional attributes. Firstly, a hierarchical service composition model which supports transaction attributes is proposed. Then the scope of the service transaction is identified according to the service data dependencies. Finally, based on transaction scope identification and service compensation mechanism, we propose a service failure processing method to promote the dependability of the service composition enhanced evolution. The experimental results show that the proposed method can ensure the atomic and data consistency of the transaction in service composition, and has good scalability that can achieve good service substitution.
Dynamic Group Discovery Based on Density Peaks Clustering
Wang Haiyan, Xiao Yikang
2018, 55(2):  391-399.  doi:10.7544/issn1000-1239.2018.20160928
Asbtract ( 1166 )   HTML ( 2)   PDF (2030KB) ( 529 )  
Related Articles | Metrics
Group recommendation has recently received wide attention due to its significance in real applications. As a premier step of group recommendation, group discovery is very important and discovery results will impact a lot on the performance of group recommendation. The higher similarity the groups have, the better effectiveness and stability the recommendation results will possess. However, current group discovery methods seldom consider the dynamicity of users’ tendency with variance of time context, nor do they support the existence of groups overlapping. In order to address the problems above, a dynamic group discovery method based on density peaks clustering (DGD-BDPC) is put forward in this paper. In the proposed DGD-BDPC method, quantitative users’ dynamic tendency is firstly obtained by dynamic poisson factorization. And secondly, users’ tendency under different time nodes for various items will be predicted with the employment of high order singular value decomposition (HOSVD) and user sets with high similarity will then be built according to users’ tendency. Finally, user sets will be clustered with a modification of density peaks clustering algorithm and group discovery will be realized successfully. Experimental results show that the proposed dynamic group discovery method based on density peaks clustering has higher accuracy, lower error and better stability compared with some other methods.
Heterogeneous Parallel Optimization of an Engine Combustion Simulation Application with the OpenMP 4.0 Standard
Yang Meifang, Che Yonggang, Gao Xiang
2018, 55(2):  400-408.  doi:10.7544/issn1000-1239.2018.20160872
Asbtract ( 1138 )   HTML ( 7)   PDF (2149KB) ( 448 )  
Related Articles | Metrics
LESAP is a combustion simulation application capable of simulating the chemical reactions and supersonic flows in the scramjet engines. It can be used to solve practical engineering problems and involve a large amount of computations. In this paper, we port and optimize LESAP with the OpenMP 4.0 accelerator model, targeting the heterogeneous many-core platform composed of general CPU and Intel Many Integrated Core (MIC). Based on the application characteristics, a series of techniques are proposed, including OpenMP 4.0 based task offloading, data movement optimization, grid-partition based load-balancing and SIMD optimization. The performance evaluation is done for a real combustion simulation configuration, with 5 320 896 grid cells, on one Tianhe-2 supercomputer node. The results show that the resulting heterogenous code significantly outperforms the original CPU only code. When the heterogenous code runs on two Intel Xeon E5-2692 CPUs and three Intel Xeon Phi 31S1P coprocessors, the runtime per time-steep is reduced from 64.72 seconds to 21.06 seconds. The heterogeneous computing achieves a speedup of 3.07 times over the original code that only runs on the two Intel Xeon E5-2692 CPUs.
Optimization and Parallelization Single Particle Cryo-EM Software RELION with GPU
Su Huayou, Wen Wen, Li Dongsheng
2018, 55(2):  409-417.  doi:10.7544/issn1000-1239.2018.20160873
Asbtract ( 3309 )   HTML ( 26)   PDF (2838KB) ( 879 )  
Related Articles | Metrics
Single particle cryo-electron microscopy (cryo-EM) is one of the most important methods of macromolecular structure. RELION (regularized likelihood optimization) is an open-source computer program for the refinement of macromolecular structures by single-particle analysis of cryo-EM data. Due to its easy usage and high quality results, RELION has attracted a lot of attentions from researchers. However, the computation requirement of this program is too huge to solve some large molecular structures with CPU, which harpers the popularization of RELION. In this paper, we characterize the algorithm of RELION and parallelize it with GPU. Firstly, the mathematical theory, computer patterns and performance bottlenecks of RELION are analyzed comprehensively. Then, we optimize the program targeting at fine-grained many-core architecture processor, such as GPU. We propose an efficient multi-level parallel model to utilize the powerful computation capacity of many-core processor. In order to achieve high performance, we reconstruct the data structure for GPU continues memory access. To avoid the limitation of GPU memory size, we implement an adaptive framework. The experimental results show that the proposed GPU based algorithm can achieve good performance. When compared with the CPU implementation, the speedup ratio of the application is more than 36 times, while the speedup ratio of compute-intensive algorithm is about 75X. Moreover, the testing results on multi GPUs show that the GPU based implementation has good scalability.
A Nested Partitioning Load Balancing Algorithm for Tianhe-2
Liu Xu, Yang Zhang, Yang Yang
2018, 55(2):  418-425.  doi:10.7544/issn1000-1239.2018.20160877
Asbtract ( 1122 )   HTML ( 2)   PDF (2628KB) ( 412 )  
Related Articles | Metrics
As energy consumption becomes a major design concern of supercomputers, three design trends emerge in supercomputer architectures: massive parallelism, deep memory and network hierarchy, and heterogeneous computing. Large scale computing on such supercomputers as Tianhe-2 requires the load balancing algorithms with three properties: fast, minimal data movement cost, and load balance among heterogeneous devices such as CPU cores and accelerators. On the other hand, multi-physics and multi-scale applications are becoming ubiquitous for many challenge scientific simulations, which results in non-uniform load distribution and demands powerful load balancing algorithms. In this paper, we propose a load balancing algorithm with the above properties by combining a nested partitioning scheme, a greedy partitioning algorithm and an inner-outer subdomain partitioning algorithm. Model experiment shows our algorithm can guarantee good load balance efficiency. Furthermore, experiment on Tianhe-2 with 32 nodes shows our algorithm is able to achieve low communication cost. Finally, experiments of 5 real applications on Tianhe-2 with 936 thousand CPU and MIC cores show that, our algorithm can support large scale simulations efficiently.
RPRU: A Unified Architecture for Rotation and Bit-Extraction Operations in General-Propose Processor
Ma Chao, Dai Zibin, Li Wei, Nan Longmei, Jin Yu
2018, 55(2):  426-437.  doi:10.7544/issn1000-1239.2018.20160775
Asbtract ( 1084 )   HTML ( 3)   PDF (5075KB) ( 446 )  
Related Articles | Metrics
Parallel bit extraction and rotation-shift operations can be completed by bit level permutation. At present, they are mainly implemented independently, which results in the waste of hardware logic resources. Although some of the researches unified the two operations into a single hardware unit, it was required to design two dedicated circuits to implement the routing algorithms for each operation. Consequently, the consumption of the logic resources is still high. To solve this problem, a unified routing algorithm is proposed by studying the mapping principle of rotation-shift and parallel bit extraction operations based on one kind of dynamic multistage interconnect network named Inverse Butterfly Network. The algorithm utilizes the self-routing and recursive characteristics of the network. It not only has high parallelism, but also is simple in hardware implementation, which is conductive to integration for the general-propose processor architecture. On this basis, we also develop a reconfigurable parallel bit extraction hardware unit with rotation-shift function named RPRU, and optimize the critical path of the unit. Then, we synthesize it into CMOS 90nm process. The experimental results show that the area of our RPRU using the unified algorithm is less by 30% than that of the previous design with identical functions.
Proxy Based Metadata Optimization and Implementation in Parallel Filesystem
Yi Jianliang, Chen Zhiguang, Xiao Nong, Lu Yutong
2018, 55(2):  438-446.  doi:10.7544/issn1000-1239.2018.20160796
Asbtract ( 1099 )   HTML ( 4)   PDF (2409KB) ( 494 )  
Related Articles | Metrics
In high-performance computing environment, parallel file system faces a mega client. These clients often issue a large number of concurrent IO request to the system in the same period of time, making the metadata server under a huge pressure. On the other hand, concurrent read and write requests from these clients often relate to the same directory. It makes it difficult to schedule work load across multiple servers. Therefore, we add a proxy server between the client and the metadata server and propose corresponding optimization methods to reduce the work load of the metadata server. In this paper, we realize two aspects of optimization based on proxy server. First of all, since the high-performance computing program often access files concurrently, we consider merging the numerous requests into a big one and then sent it to metadata server. Secondly, concurrent IO from the high-performance computing program often points to the same directory. Traditional metadata load balancing mechanism commonly use sub-tree partitioning method to dispatch work load across multiple server. This method is unable to realize load balancing in the situation where all operations relate to the same directory. The paper realizes fine-grained load balancing by scheduling the operations from the same directory to the plurality of metadata servers.