ISSN 1000-1239 CN 11-1777/TP

Table of Content

01 August 2019, Volume 56 Issue 8
Data Center Energy Consumption Models and Energy Efficient Algorithms
Wang Jiye, Zhou Biyu, Zhang Fa, Shi Xiang, Zeng Nan, Liu Zhiyong
2019, 56(8):  1587-1603.  doi:10.7544/issn1000-1239.2019.20180574
Asbtract ( 1428 )   HTML ( 47)   PDF (1055KB) ( 1326 )  
Related Articles | Metrics
With the rapid development of cloud computing and virtualization technology, the number as well as the scale of data centers are growing rapidly around the world, which results in huge energy consumption of data centers. At the same time, however, the resource utilization of data centers are very low, leading to a lot of waste of energy. Due to the contradiction between huge energy consumption and extreme low resource utilization in data centers, optimizing the energy efficiency of data centers has become a hot topic in both academic and industrial fields in recent years. Aiming at the basic problem of energy efficiency in data center, we study the key technology of energy saving in data center based on resource and task scheduling. From the view of energy efficiency model and energy efficiency algorithm, the research progress and the latest achievement of energy efficiency in data center are summarized, mainly on server system and network system. This paper firstly decomposes and analyzes the data center energy consumption sources. Then the typical energy consumption models and classification standard are introduced. Based on the models and the standard, the strategy algorithms are summarized. The development trend of energy efficiency optimization in data center is also prospected.
Program Comprehension Based on Deep Learning
Liu Fang, Li Ge, Hu Xing, Jin Zhi
2019, 56(8):  1605-1620.  doi:10.7544/issn1000-1239.2019.20190185
Asbtract ( 1337 )   HTML ( 33)   PDF (1562KB) ( 1122 )  
Related Articles | Metrics
Program comprehension is the process of obtaining relevant information in programs by analyzing, abstracting, and reasoning the programs. It plays an important role in software development, maintenance, migration, and other processes. It has received extensive attention in academia and industry. Traditional program comprehension relies heavily on the experience of developers. However, as the scale and complexity of software continue to grow, it is time-consuming and laborious to rely solely on the developer’s prior knowledge to extract program features, and it is difficult to fully exploit the hidden features in the program. Deep learning is a data-driven end-to-end method. It builds deep neural networks based on existing data to mine the hidden features in data, and has been successfully applied in many fields. By applying deep learning technology to program comprehension, we can automatically learn the features implied in programs, which can fully exploit the knowledge implied in the program and improve the efficiency of program comprehension. This paper surveys the research work of program comprehension based on deep learning in recent years. Firstly, we analyze the properties of the program, and then introduce mainstream program comprehension models, including sequential models, structural models, and execution traces based models. Furthermore, the applications of deep learning-based program comprehension in program analysis are introduced, which mainly focus on code completion, code summarization and code search, etc. Finally, we summarize the challenges in program comprehension research.
Predicting the Dynamics in Internet Finance Based on Deep Neural Network Structure
Zhao Hongke, Wu Likang, Li Zhi, Zhang Xi, Liu Qi, Chen Enhong
2019, 56(8):  1621-1631.  doi:10.7544/issn1000-1239.2019.20190330
Asbtract ( 795 )   HTML ( 25)   PDF (1261KB) ( 605 )  
Related Articles | Metrics
In recent years, the Internet financial market has achieved rapid development across the globe. In the meantime, Internet finance has become a hot topic in academia. Compared with traditional financial markets, the Internet financial market has higher liquidity and volatility. In this paper, the dynamics (daily trading amount and count) of the Internet financial market is studied and a prediction model is proposed based on deep neural network for fusion hierarchical time series learning. Firstly, the model can process the multiple sequence (macro dynamic sequence and multiple subsequences) feature as the input variables. And then, an attention mechanism is proposed to fuse the input variables from both the time and subsequence feature dimensions. Next, the model designs an optimization function based on the stability constraint of the sequence prediction, which makes the model have better robustness. Finally, a large number of experiments have been carried out on real large-scale data sets, and the results have fully proved the effectiveness and robustness of the proposed model in the dynamic prediction of Internet finance market.
Person Re-Identification Based on Deep Convolutional Generative Adversarial Network and Expanded Neighbor Reranking
Dai Chenchao, Wang Hongyuan, Ni Tongguang, Chen Shoubing
2019, 56(8):  1632-1641.  doi:10.7544/issn1000-1239.2019.20190195
Asbtract ( 759 )   HTML ( 15)   PDF (3089KB) ( 427 )  
Related Articles | Metrics
Person Re-Identification (Re-ID) focuses on identifying the same person among disjoint camera views. This task is highly challenging, especially when there exists only several images per person in the database. Aiming at the problem of insufficient number of person images in person re-identification dataset, a method that generates extra training data from the original dataset is proposed. There are two challenges in this work, one is how to get more training data from the original training set, and the other is how to deal with these newly generated training data. The deep convolutional generative adversarial network is used to generate extra unlabeled person images and label smoothing regularization is used to process these newly generated unlabeled person images. In order to further improve the accuracy of person re-identification, a new unsupervised reranking framework is proposed. This framework neither requires to recalculate a new sorted list for each image pairs nor requires any human interaction or label information. Experiments on the datasets Market-1501, CUHK03, and DukeMTMC-reID verify the effectiveness of the proposed method.
Non-Stationary Multivariate Time Series Prediction with MIX Gated Unit
Liu Jiexi, Chen Songcan
2019, 56(8):  1642-1651.  doi:10.7544/issn1000-1239.2019.20190326
Asbtract ( 752 )   HTML ( 6)   PDF (2185KB) ( 506 )  
Related Articles | Metrics
Non-stationary multivariate time series (NSMTS) forecasting is still a challenging issue nowadays. The existing deep learning models based on recurrent neural networks (RNNs), especially long short-term memory (LSTM) and gated recurrent unit (GRU) neural networks, have received impressive performance in prediction. Although the architecture of the LSTM is relatively complex, it cannot always dominate in performance. Latest researches show that with a simpler gated unit structure, the minimal gated unit (MGU) can not only simplify the network architecture, but also improve the training efficiency in computer vision and some sequence problems. Most importantly, our experiments show that this kind of unit can be effectively applied to the NSMTS predictions and achieve comparable results with LSTM and MGU neural networks. However, none of the three gated unit based neural networks can always dominate in performance over all the NSMTS. Therefore, in this paper we propose a novel linear MIX gated unit (MIXGU). This gated unit can adjust the importance weights of GRU and MGU dynamically to achieve a better hybrid structure for each MIXGU in the network during training. The experimental results show that this MIXGU neural network has higher prediction performance than other state-of-the-art one gated unit neural network models.
Prediction of miRNA-lncRNA Interaction by Combining CNN and Bi-LSTM
Shi Wenhao, Meng Jun, Zhang Peng, Liu Chanjuan
2019, 56(8):  1652-1660.  doi:10.7544/issn1000-1239.2019.20190128
Asbtract ( 1378 )   HTML ( 19)   PDF (2142KB) ( 564 )  
Related Articles | Metrics
Non-coding RNA (ncRNA) plays an important regulatory role in many animal and plant life activities, and the interaction of microRNA (miRNA) and long non-coding RNA (lncRNA) is more important. The study of their interaction not only helps to analyze the biological functions of genes, but also provides new ideas for disease diagnosis and treatment and plant genetic breeding. At present, biological experiments and machine learning methods are mostly used to predict miRNA-lncRNA interaction. Due to high cost and time consuming of biological identification and the excessive manual intervention of machine learning and the complex feature extraction process, a deep learning model combining convolutional neural network (CNN) and bidirectional long short-term memory network (Bi-LSTM) is proposed. It combines the advantages of two models, considering the information correlation between sequences and combining context information, and fully extracting features between sequence data. In the experiment, the performance of model is evaluated by cross-validation, compared with the traditional machine learning methods and single model on zea mays dataset, and the superior classification effect is obtained. In addition, the model tests of solanum tuberosum and triticum aestivum species are carried out, and the accuracy rates are up to 95% and 93%, respectively, which verifies good generalization ability of the model.
An Integrated Recommendation Model Based on Two-stage Deep Learning
Wang Ruiqin, Wu Zongda, Jiang Yunliang, Lou Jungang
2019, 56(8):  1661-1669.  doi:10.7544/issn1000-1239.2019.20190178
Asbtract ( 979 )   HTML ( 14)   PDF (1082KB) ( 494 )  
Related Articles | Metrics
In recent years, deep learning technology has been widely used in the field of recommendation systems and has achieved great success. However, the input quality of the deep learning models has a great influence on the learning results. A sparse input feature vector will not only increase the difficulty of subsequent model training, but also will lead to the learning results falling into local optimum. In this article, an integrated recommendation model based on two-stage deep learning is proposed. Firstly, two individual marginal stacked denoising auto-encoders (mSDA) models with closed-form parameter calculation are used to extract the high-level abstract features of the users and the items. Then the resulted user abstract feature and the item abstract feature are connected as the input vector of the deep neural network (DNN) model, and the parameter learning and model optimization are performed through joint training. In addition, in order to model low-order feature interactions, a logistic regression model based on original feature vector is also integrated into the recommendation model. Extensive experiments with two real-world datasets indicate that the proposed recommendation model shows excellent recommendation performance compared with the state-of-the-art methods, especially in the data sparse and the cold start environments.
Deep Forest for Multiple Instance Learning
Ren Jie, Hou Bojian, Jiang Yuan
2019, 56(8):  1670-1676.  doi:10.7544/issn1000-1239.2019.20190332
Asbtract ( 719 )   HTML ( 6)   PDF (1048KB) ( 479 )  
Related Articles | Metrics
Multi-instance learning has been applied to various tasks, such as image retrieval, text classification, face recognition, etc. Deep neural network has also been successfully applied to plenty of tasks and problems. MI-Nets are one of the successful applications to multi-instance learning of deep neural network. Although MI-Nets have obtained achievements and the main task they are good at is image task, while on non-image tasks, they show mediocre performance. Over the last two years, deep forest has achieved good performance on non-image tasks and is favored for its less parameters and steady performance compared with deep neural network. Thus it is urgent and necessary to apply deep forest to multi-instance learning. However, due to the limitation of the structure of deep forest, we cannot simply substitute the bag-level forest for each forest of deep forest. Therefore, we need to change the structure of deep forest to achieve our purpose. In this paper, we provide a new structure of deep forest, that is multiple instance deep forest (MIDF). We regard each instance from a bag as a new bag, and thus the distribution output from the middle level can concatenate the original bag to make the cascade structure valid. We can also assure the number of layers of MIDF. Experimental results show that our method has comparable performance with MI-Nets on image task, while on non-image tasks, our method outperforms MI-Nets and other baseline methods.
Sample-Weighted Multi-View Clustering
Hong Min, Jia Caiyan, Li Yafang, Yu Jian
2019, 56(8):  1677-1685.  doi:10.7544/issn1000-1239.2019.20190150
Asbtract ( 1001 )   HTML ( 20)   PDF (978KB) ( 615 )  
Related Articles | Metrics
In the era of big data, the ability of humans to collect, store, transmit and manage data has been increasingly improved. Various industries have accumulated a large amount of data resources, which are often multi-source and heterogeneous. How to effectively cluster these multi-source data (also known as multi-view clustering) has become one of the focuses of today’s machine learning research. The existing multi-view clustering algorithms mainly pay attention to the contribution of different views and features to the cluster structure from the “global” perspective, without considering the “local” information complementary differences between different samples. Therefore, this paper proposes a new sample-weighted multi-view clustering (SWMVC). The method weights each sample with different views and adopts alternating direction method of multipliers (ADMM) to learn sample weight, which can not only learn the “local” difference of weights among multiple views in different sample points, but also reflect the “global” difference of the contribution of different views to the cluster structure, and has better flexibility. Experiments on multiple datasets show that the SWMVC method has a better clustering effect on heterogeneous view data.
Optimal Individual Convergence Rate of the Heavy-Ball-Based Momentum Methods
Cheng Yujia, Tao Wei, Liu Yuxiang, Tao Qing
2019, 56(8):  1686-1694.  doi:10.7544/issn1000-1239.2019.20190167
Asbtract ( 710 )   HTML ( 4)   PDF (2703KB) ( 314 )  
Related Articles | Metrics
The momentum method is widely used as an acceleration technique to improve the convergence of the first-order gradient algorithms. So far, the momentum methods discussed in most literatures are only limited to the accelerated method proposed by Nesterov, but the Heavy-ball momentum method proposed by Polyak is seldom studied. In particular, in the case of non-smooth objective functions, the individual optimal convergence of Nesterov accelerated methods has been derived, and it has high performance in solving sparse optimization problems. In contrast, while it has been proved that the Heavy-ball momentum method has an optimal convergence rate,it is only in terms of the averaged outputs. To our best knowledge, whether it has optimal individual convergence or not still remains unknown. In this paper, we focus on the non-smooth optimizations. We prove that the Heavy-ball momentum method achieves the optimal individual convergence by skillfully selecting the time-varying step-size, which indicates that Heavy-ball momentum is an efficient acceleration strategy for the individual convergence of the projected subgradient methods. As an application, the constrained hinge loss function optimization problems within an l\-1-norm ball are considered. In comparison with other optimization algorithms, the experiments demonstrate the correctness of our theoretical analysis and performance of the proposed algorithms in keeping the sparsity.
An Adaptive Regression Feature Selection Method for Datasets with Outliers
Guo Yaqing, Wang Wenjian, Su Meihong
2019, 56(8):  1695-1707.  doi:10.7544/issn1000-1239.2019.20190313
Asbtract ( 588 )   HTML ( 6)   PDF (1355KB) ( 409 )  
Related Articles | Metrics
Irrelevant and redundant features embedded in data will raise the difficulty for learning tasks, and feature selection can solve this problem effectively and improve learning efficiency and learner performance. Most of existing feature selection approaches are proposed for classification problems, while there are few studies on regression problems. Eespecially in presence of outliers, the present methods do not perform well. Although some methods can increase their robustness by weighting sample loss functions, the weights are set in advance and fixed throughout feature selection and learner training, which leads to bad adaptability. This paper proposes a regression feature selection method named adaptive weight LASSO (AWLASSO) for outliers. Firstly, it updates sample errors according to regression coefficients. Then the weights for loss functions of all samples are set according to the adaptive regularization term, i.e., the loss functions of samples whose errors are larger than current threshold are set smaller weights and loss functions of samples whose errors are less than threshold are set larger weights. The regression coefficient will be estimated iteratively under weighted loss function whose weights are updated. AWLASSO controls whether samples participate in regression coefficient estimation by the threshold. Only those samples with small errors participate in estimation, so a better regression coefficient estimation may be obtained in the end. In addition, the error threshold of AWLASSO algorithm is not fixed but increasing(To make initial regression coefficient estimation be accurate, initial threshold is often smaller). So some samples which are misjudged as outliers will have chance to be added again in training set. The AWLASSO regards samples whose errors are larger than the maximum threshold as outliers for their learning cost is bigger, and the weights of their loss functions are set to 0. Hence, the influence of outliers will be reduced. Experiment results on artificial data and benchmark datasets demonstrate that the proposed AWLASSO has better robustness and sparsity specially for datasets with outliers in comparison with classical methods.
An Experience-Guided Deep Deterministic Actor-Critic Algorithm with Multi-Actor
Chen Hongming, Liu Quan, Yan Yan, He Bin, Jiang Yubin, Zhang Linlin
2019, 56(8):  1708-1720.  doi:10.7544/issn1000-1239.2019.20190155
Asbtract ( 550 )   HTML ( 6)   PDF (4919KB) ( 385 )  
Related Articles | Metrics
The continuous control task has always been an important research direction in reinforce-ment learning. In recent years, the development of deep learning (DL) and the advent of deterministic policy gradients algorithm (DPG), provide many good ideas for solving continuous control problems. The main difficulty faced by these methods is the exploration in the continuous action space. And some of them engage in exploratory behavior through external noise injection in the action space. However, this exploration method does not perform well in some continuous control tasks. This paper proposes an experience-guided deep deterministic actor-critic algorithm with multi-actor (EGDDAC-MA) without external noise, which learns a guiding network from excellent experiences to guide the updates of the actor network and the critic network. Besides, it uses a multi-actor actor-critic (AC) model which configures different actors for each phase in an episode. These actors are independent of each other and do not interfere with each other. Finally, the experimental results show that compared with DDPG, TRPO and PPO algorithms, the proposed algorithm has better performance in most continuous tasks in GYM simulation platform.
Scene Graph Generation Based on Shuffle Residual Context Information
Lin Xin, Tian Xin, Ji Yi, Xu Yunlong, Liu Chunping
2019, 56(8):  1721-1730.  doi:10.7544/issn1000-1239.2019.20190329
Asbtract ( 398 )   HTML ( 2)   PDF (4604KB) ( 259 )  
Related Articles | Metrics
Scene graphs play an important role in visual understanding. Existing scene graph generation methods focus on the research of the subjects, the objects as well as the predicates between them. However, human being abstracts the relationships using spatial relation context, semantic context and interaction between scene objects for better understanding and reasoning as whole. In order to obtain the better global context representation and reduce the impact of dataset bias, we propose a new framework of scene graph generation, called as residual shuffle sequence model (RSSQ). Our method is made up of object decoding, residual shuffle and position embedding modules. Residual shuffle module is stacked with two basic structures including the random shuffle operation and the residual bidirectional LSTM. We implement the random shuffle on the hidden state of bidirectional LSTM by the process of iterative operation to reduce the impact of dataset bias, and extract the shared global context information by the residual connection structure. To strengthen the spatial relationship between pair-wise objects, the encoding is achieved using the relative position and area ratio of objects in position embedding module. The experimental results of three sub-tasks of different difficulty performed on Visual Genome dataset, demonstrate that the poposed method can generate better scene graphs under Recall@50 and Recall@100 settings due to better global context and spatial information.
Aspect-Level Sentiment Classification for Sentences Based on Dependency Tree and Distance Attention
Su Jindian, Ouyang Zhifan, Yu Shanshan
2019, 56(8):  1731-1745.  doi:10.7544/issn1000-1239.2019.20190102
Asbtract ( 949 )   HTML ( 16)   PDF (1527KB) ( 449 )  
Related Articles | Metrics
Current attention-based approaches for aspect-level sentiment classification usually neglect the contexts of aspects and the distance feature between words and aspects, which as a result make it difficult for attention mechanism to learn suitable attention weights. To address this problem, a dependency tree and distance attention-based model DTDA for aspect-level sentiment classification is proposed. Firstly, DTDA extracts dependency subtree (aspect sub-sentence) that contains the modification information of the aspect with the help of dependency tree of sentences, and then uses bidirectional GRU networks to learn the contexts of sentence and aspects. After that, the position weights are determined according to the syntactic distance between words and aspect along their path on the dependency tree, which are then further combined with relative distance to build sentence representations that contain semantic and distance information. The aspect-related sentiment feature representations are finally generated via attention mechanism and merged with sentence-related contexts, which are fed to a softmax layer for classification. Experimental results show that DTDA achieves comparable results with those current state-of-the-art methods on the two benchmark datasets of SemEval 2014, Laptop and Restaurant. When using word vectors pre-trained on domain-relative data, DTDA achieves the results with the precision of 77.01% on Laptop and 81.68% on Restaurant.
Negatively Correlated Search with Asymmetry for Real-Parameter Optimization Problems
Yu Runlong, Zhao Hongke, Wang Zhong, Ye Yuyang, Zhang Peining, Liu Qi, Chen Enhong
2019, 56(8):  1746-1757.  doi:10.7544/issn1000-1239.2019.20190198
Asbtract ( 620 )   HTML ( 6)   PDF (4572KB) ( 250 )  
Related Articles | Metrics
As many real-world applications are closely related to complex real-parameter optimization problems, some metaheuristic assumptions are employed to help design search strategies and have been shown to be powerful tools. The balance between exploration (diversification) of new areas of the search space and exploitation (intensification) of good solutions accomplished by this kind of algorithms is one of the key factors for their high performance with respect to other metaheuristics. In particular, negatively correlated search (NCS) improves the search performance of parallel hill climbing by introducing negative correlation of search trends between search processes, which contributes greatly to the diversity maintenance of solutions. NCS models the search behaviors of individual search processes as probability distributions. On this basis, we further divide the search behaviors of a couple of search processes into global search behavior and local search behavior according to the size of the coverage of each search process. Then we present a new metaheuristic, namely negatively correlated search with asymmetry (NSA), which assumes that the search process with global search behavior should be away from the search process with local search behavior. Due to the asymmetry of the negative correlation between search processes, the efficiency of NSA has been greatly improved compared with NCS. The experimental results show that NSA is competitive to well-established search methods in the sense that NSA achieves the best overall performance on 20 multimodal real-parameter optimization problems.
Low-Redundancy Knowledge Graph Management with Range Query Support
Wang Fei, Qian Tieyun, Liu Bin, Peng Zhiyong
2019, 56(8):  1758-1771.  doi:10.7544/issn1000-1239.2019.20190169
Asbtract ( 586 )   HTML ( 5)   PDF (1564KB) ( 510 )  
Related Articles | Metrics
As more and more data is published in the form of knowledge graph, the management of which attracts a lot of attention. Existing approaches for knowledge graph management have two drawbacks: 1) logical storage modeling generates lots of redundancy and ineffectively supports range queries on continuous attributes; 2) semantic storage modeling costs much and inefficiently adapts to the dynamic evolution of knowledge graph. In this paper, we propose a novel method called cluster object deputy model (CODM) to manage knowledge and metadata. The model has two key properties, namely logical storage modeling of schema and semantic storage modeling of lightweight. To this end, we design a schema cluster algorithm based on the set editing distance to convert knowledge graph into schema data, which realizes schema storage of data and supports index specification of attribute type. Besides, CODM constructs a class hierarchical system to model different associations among entities. It adopts object pointers to achieve the lightweight materialization of generalized semantic association. Experimental results show that CODM can tremendously reduce the data redundancy and outperforms the state-of-the-art methods in terms of range queries. And those results also indicate that CODM can accelerate the processing of complex queries.
SSD Power Modeling Method Based on the Gradient of Energy Consumption
Sun Jian, Li Zhanhuai, Li Qiang, Zhang Xiao, Zhao Xiaonan
2019, 56(8):  1772-1782.  doi:10.7544/issn1000-1239.2019.20170694
Asbtract ( 474 )   HTML ( 7)   PDF (2526KB) ( 224 )  
Related Articles | Metrics
In recent years, flash memory chips (NANDFLASH) production technology has been improved, the storage capacity and data throughput enhances unceasingly. NAND flash SSDs have become the preferred storage device in both consumer electronics and datacenters. Flash has superior random access characteristics to speed up many applications and consumes less power than HDDs. Flash chips has become a mainstream storage components in the field of mobile terminal. But with the reduce of the flash memory cost, its application range is gradually extended to mass data storage system. In view of the lower forecast accuracy for flash drives energy consumption in the storage system. We propose a gradient-based SSD power modeling method that estimates the power consumption of storage workloads, and it effectively enhances the energy consumption prediction precision. This method is based on the hierarchical structure and working principle of NAND flash chips, and analyzes the energy consumption in the process of reading and writing. we build energy consumption gradient list with alternation and parallelism operation, and predict energy consumption of SSD. This kind of modeling method will not bring additional performance overhead to the system. The experimental results show the prediction accuracy of gradient-based modeling method has been improved significantly compared with the traditional linear model.
Optimization of Library Function Disposing in Dynamic Binary Translation
Fu Liguo, Pang Janming, Wang Jun, Zhang Jiahao, Yue Feng
2019, 56(8):  1783-1791.  doi:10.7544/issn1000-1239.2019.20170871
Asbtract ( 379 )   HTML ( 2)   PDF (1043KB) ( 187 )  
Related Articles | Metrics
In the research of cross-platform migration without source code, efficiency is the main bottleneck restricting the development of dynamic binary translation technology. Using the method of disposing the local function can effectively improve the performance of binary translation by jacketing and replacing. However, in practical applications, as the number of library function calls in the source program or the number of the library function supported by translators increasing, the benefit of using disposing the local function is on decrease. The querying useless overhead in library function disposing grows which weakens and reduces the optimization effect of the method of disposing the local function. In order to address this kind of problem, a method is proposed based on the properties combined with dynamic and static translation. It is based on the characteristic of dynamic binary translation and the using of disposing the local function. The overhead for the query decreases with the method by preprocessing the query table and realizing the query with Hash function. It can map the source program addresses to corresponding processing function rapidly. Realized on a dynamic translator QEMU, the optimization method is implemented and tested. Experiments verify the effectiveness of this method to reduce query overhead in the process of using disposing the local function with dynamic translation.
A Performance Optimization Method for Key-Value Store Based on LSM-tree
Wang Haitao, Li Zhanhuai, Zhang Xiao, Zhao Xiaonan
2019, 56(8):  1792-1802.  doi:10.7544/issn1000-1239.2019.20190110
Asbtract ( 677 )   HTML ( 15)   PDF (1650KB) ( 765 )  
Related Articles | Metrics
Nowadays, persistent key-value (KV) stores play a critical role in a variety of modern data-intensive applications, such as Web indexing, e-commerce, and cloud data storage systems, etc. KV stores that are based on log-structured merge tree (LSM-tree) have attracted growing attention because of their ability to eliminate random writes and maintain acceptable read performance. However, they also suffer from some performance issues. On one hand, they need to leverage write-ahead log (WAL) files to guarantee the atomicity and safety of write operations to enable recovery in case of a crash. This will result in severe write amplification and metadata overhead because of frequent WAL file update, leading to performance degradation. On the other hand, these KV stores usually use a conventional local filesystem to store KV data, which can harm the performance due to unnecessary operations in the filesystem. In this paper, we present RocksFS, an optimized filesystem for KV stores based on LSM-tree. We simplify the filesystem to remove unnecessary functions and attributes to reduce filesystem overhead and redesign the format and I/O path of WAL file to decrease metadata overhead. We compare RocksFS with conventional filesystems in the environment of RocksDB, a popular LSM-tree-based KV store. The experimental results demonstrate that RocksFS can observably improve the small key-value data write performance of RocksDB by 8x at most compared with traditional filesystems on both hard disk drive and solid state disk.