ISSN 1000-1239 CN 11-1777/TP

Table of Content

10 September 2019, Volume 56 Issue 9
Survey on Machine Learning for Database Systems
Meng Xiaofeng, Ma Chaohong, Yang Chen
2019, 56(9):  1803-1820.  doi:10.7544/issn1000-1239.2019.20190446
Asbtract ( 1429 )   HTML ( 50)   PDF (1227KB) ( 1550 )  
Related Articles | Metrics
As one of the most popular technologies, database systems have been developed for more than 50 years, and are mature enough to support many real scenarios. Although many researches still focus on the traditional database optimization tasks, the performance improvement is little. Actually, with the advent of big data, we have met the new gap obstructing the further performance improvement of database systems. The database systems face challenges in two aspects. Firstly, the increase of data volume requires the database system to process tasks more quickly. Secondly, the rapid change of query workload and its diversity make database systems impossible to adjust the system knobs to the optimal configuration in real time. Fortunately, machine learning may be the dawn bringing an unprecedented opportunity for the traditional database systems to lead us to the new optimization direction. In this paper, we introduce how to combine machine learning into the further development of database management systems. We focus on the current research work of machine learning for database systems, mainly including the machine learning for storage management and query optimization, as well as automatic database management systems. This area has also opened various challenges and problems to be solved. Thus, based on the analysis of existing technologies, the future challenges, which may be encountered in machine learning for database systems, are pointed out.
Energy-Efficiency Query Optimization for Green Datacenters
Xing Baoping, Lü Mengyuan, Jin Peiquan, Huang Guorui, Yue Lihua
2019, 56(9):  1821-1831.  doi:10.7544/issn1000-1239.2019.20180670
Asbtract ( 574 )   HTML ( 3)   PDF (1627KB) ( 291 )  
Related Articles | Metrics
Reducing energy consumption and building green datacenters has been one of the major needs of modern large-scale datacenters. In green datacenters, a key research issue is how to lower the energy consumption of database systems while keeping stable performance. This issue is called energy efficiency, and has become a new research frontier recently. Energy efficiency of database systems is defined as using little energy to accomplish as many operations as possible. High energy efficiency means that database systems can use less energy while processing a fixed number of operations. In other words, it uses less energy but achieves the same performance. In this paper, we propose a method for energy-efficient query optimization. First, an operator-level power model is established based on the regression analysis method, which can accurately predict average power consumption during query execution for a given query. Next, a new cost model is proposed for query optimizer, which considers both energy and performance aspects. The new cost model uses a new factor to obtain a better tradeoff between performance and energy costs. A testbed is built for measuring energy consumption of database systems, and the TPC-H and TPC-C benchmarks are used to evaluate the performance of our proposal. The results show that the proposed power model achieves higher precision than existing methods. In addition, the proposed performance-degrade factor can provide flexible trade-offs between performance and energy. Moreover, by setting up an appropriate performance-degrade factor, better energy efficiency can be achieved than the original PostgreSQL.
A Generative Model for Synthesizing Structured Datasets Based on GAN
Song Kehui, Zhang Ying, Zhang Jiangwei, Yuan Xiaojie
2019, 56(9):  1832-1842.  doi:10.7544/issn1000-1239.2019.20180353
Asbtract ( 1178 )   HTML ( 29)   PDF (2738KB) ( 642 )  
Related Articles | Metrics
Synthesizing high quality dataset has been a long-standing challenge in both machine learning and database community. One of the applications of high quality dataset synthesis is to improve the model training, especially deep learning models. A robust model training process requires a large annotated dataset. One way of acquiring a large annotated training set is via the domain experts manual annotation, which is expensive and prone to mistakes. Therefore, as an alternative, automatic synthesis of high quality and similar dataset is much more plausible. Some efforts have been devoted for synthesizing image dataset due to the rapid development of computer vision. However, those models can not be applied to the structured data (numeric & categorical table) directly. Moreover, little efforts have been payed to the numeric & categorical table. Therefore, we propose TableGAN, the first generative model from GAN family, which improves the performance of the generative model with adversarial learning mechanism. TableGAN modifies the internal structure of traditional GAN targeting numeric & categorical table, including the optimization function, to synthesize more high-quality training dataset samples for improving the effectiveness of the training models. Extensive experiments on real datasets show significant performance improvement for those models trained on the enlarged training datasets, and thus verify the effectiveness of our TableGAN.
Positive and Unlabeled Generative Adversarial Network on POI Positioning
Tian Jiwei, Wang Jinsong, Shi Kai
2019, 56(9):  1843-1850.  doi:10.7544/issn1000-1239.2019.20180847
Asbtract ( 606 )   HTML ( 7)   PDF (1753KB) ( 252 )  
Related Articles | Metrics
With the rapid popularization of smart mobile devices, people rely more and more on location-based social networking service (LBSNS). Due to the high cost of data acquisition, point of interest (POI) positioning based on small data collection has become a big challenge. Recent research focuses on received signal strength (RSS) and simultaneous localization methods. Although there has been some research on POI positioning, the existing approaches do not discuss the problem of insufficient positive training samples. Based on the truthful positive data and a large amount of unlabeled data, a novel approach, called positive and unlabeled generative adversarial network (puGAN), is proposed. Firstly, we use positive and unlabeled method along with the generative adversarial network to effectively mine the hidden features of data. Secondly, based on the hidden features, we calibrate the positive data and unlabeled data, then treat them as the input of the discriminator. Finally, with the minimax of generator and discriminator, a POI-discriminator model is obtained. We evaluate the new method by analyzing ROC curve and the relationship between training error and testing error. The results of experiments show that the method we proposed can effectively solve the problem of insufficient positive samples and outperforms the traditional models of POI positioning, including one-class classifier, SVM and neural network.
Domain Named Entity Recognition Combining GAN and BiLSTM-Attention-CRF
Zhang Han, Guo Yuanbo, Li Tao
2019, 56(9):  1851-1858.  doi:10.7544/issn1000-1239.2019.20180733
Asbtract ( 2135 )   HTML ( 30)   PDF (765KB) ( 954 )  
Related Articles | Metrics
Domain named entity recognition usually faces the lack of domain annotation data and the inconsistency of entity annotation in the same document due to the diversity of entity names in the domain. To issue the above problems, our work draws on the combination of the generative adversarial network (GAN) which can generate data and the BiLSTM-Attention-CRF model. Firstly, BiLSTM-Attention is used as the generator model of GAN, and CNN is used as the discriminant model. The two models use the crowd annotations and the expert annotations to train respectively, and integrate the positive annotation data consistent with the expert annotation data distribution from the crowd annotations to solve the problem of lack of annotation data in the domain; then we also introduce a new method to obtain the new feature representation of each word in the document through introducing a document-level global feature in the BiLSTM-Attention-CRF model in order to solve the problem of inconsistency of the entity in the same document caused by the diversification of the entity name. Finally, taking the crowd annotations in the information security field as a sample, a comprehensive horizontal evaluation experiment is carried out by learning the common features and applying them to the training BiLSTM-Attention-CRF model for the identification of named entities in the information security field. The results show that compared with the existing models and methods, the model we proposed has made great progress on various indicators, reflecting its superiority.
Feature Extraction and Recognition of Vibration Signals in Optical Fiber Security System
Zou Baixian, Xu Shaowu, Miao Jun, Lu Yanling
2019, 56(9):  1859-1871.  doi:10.7544/issn1000-1239.2019.20180382
Asbtract ( 448 )   HTML ( 1)   PDF (4560KB) ( 272 )  
Related Articles | Metrics
Optical fiber vibration sensor is widely used in the new generation of security monitoring system. The feature extraction and recognition methods of optical fiber vibration signal have become a research hotspot in the field of pattern recognition. The feature extraction and recognition methods of various optical fiber signals are summarized. These feature extraction methods decompose optical fiber vibration signals from the perspective of time domain, so different attribute characteristics of signals can be extracted. The empirical thresholds, neural networks and support vector machines are used to identify optical fiber vibration signals. Up to now, there is still a problem that the correct recognition rate of optical fiber intrusion events is not high. Vibration signal data of five kinds of optical fiber vibration signals, such as excavator mining, artificial digging, vehicle walking, personnel walking and noise, are visually analyzed. An effective method for feature selection of optical fiber vibration signal is proposed. According to the importance of optical fiber vibration intrusion events, identification tasks are completed in four stages, and the two-class task decision tree model and the constrained extreme learning machine algorithm are used to identify the type of intrusion events,which improves the correct recognition rate of all kinds of events.
A Novel Algorithm for Identifying Critical Nodes in Networks Based on Local Centrality
Zheng Wenping, Wu Zhikang, Yang Gui
2019, 56(9):  1872-1880.  doi:10.7544/issn1000-1239.2019.20180831
Asbtract ( 611 )   HTML ( 10)   PDF (925KB) ( 515 )  
Related Articles | Metrics
Identifying critical nodes in a network is important to understand the connectivity property and dynamic characteristic of the network. Critical node problem has wide applications in field of social network, terrorist network, immunization network, etc. The removal of critical nodes might degrade network connectivity significantly. An algorithm, named as GCNP (greedy algorithm for critical node problem), is proposed based on the centrality of nodes. Firstly, GCNP selects nodes according to a kind of centrality measure, such as degree, betweenness, LocalRank, and so on, to obtain a vertex cover of the interesting network. A residual graph is obtained after deleting the vertex cover set from the network. Since the size of selected vertex cover set is always larger than the preset size of critical node set. Then, GCNP would choose nodes greedily from the vertex cover to add back to the residual network iteratively, until the number of deleted nodes satisfies preset size of critical node set. The node whose replacement would lead to minimum increment of the pairwise connectivity in the residual graph would be put back first. A centrality measure LNC (local neighbor centrality) based on local topological characteristics is defined to select nodes of initial vertex cover. Experimental results on 16 artificial networks and 9 real networks show that the proposed algorithm framework GCNP could improve the performance of some classical centrality measures for identifying critical nodes in a complex network. And the proposed centrality measure LNC is more effective in evaluating the importance of nodes than degree centrality, LocalRank centrality, K-Shell centrality and local degree sum centrality.
Predicted Results Evaluation and Query Verification of Drug-Target Interaction
Yu Donghua, Guo Maozu, Liu Xiaoyan, Cheng Shuang
2019, 56(9):  1881-1888.  doi:10.7544/issn1000-1239.2019.20180830
Asbtract ( 614 )   HTML ( 1351)   PDF (781KB) ( 364 )  
Related Articles | Metrics
The drug-target interaction prediction is one of the important assistant approaches in drug discovery and design, however, experimental identification and validation of potential drug-target encoded by the human genome is both costly and time-consuming. Therefore, querying and validating the predicted drug-target interaction in databases is an important assessment of prediction methods. In this paper, the query and validation method of drug-target interaction named as DTcheck (drug-target check) is developed and designed with Web spider based on KEGG, DrugBank, ChEMBL databases, which realizes efficient query and validation function for drug-target pair providing both KEGG DRUG ID and KEGG GENES ID. ID mapping function is also designed in DTcheck, which can map Uniprot ID from DrugBank and ChEMBL into KEGG GENE ID. DTcheck expands 907, 766, 458, 40 pairs of new drug-target interaction for Enzyme, IC (ion channel), GPCR (G-protein-coupled receptor), NR (nuclear receptor) standard datasets, respectively. Moreover, combined with query and validation result, the analysis of the prediction results of the BLM (bipartite local models) method shows that evaluation of Top N is more reasonable than AUC (area under curve) value for the prediction method of drug-target interaction. It also shows that the BLMd method with low AUC value is superior to the BLMmax method with high AUC value in predicting the drug-target interaction.
Prediction of Disease Associated Long Non-Coding RNA Based on HeteSim
Ma Yi, Guo Xingli, Sun Yutong, Yuan Qianqian, Ren Yang, Duan Ran, Gao Lin
2019, 56(9):  1889-1896.  doi:10.7544/issn1000-1239.2019.20180834
Asbtract ( 593 )   HTML ( 7)   PDF (1467KB) ( 393 )  
Related Articles | Metrics
A growing number of evidences indicate that long non-coding RNAs (lncRNAs) play important roles in many biological processes, and mutations or dysfunction in these long non-coding RNAs can cause serious diseases in human bodies, such as various cancers. Biological methods have been exploited to predict potential associations between diseases and long non-coding RNAs, which are of great significance for the exploration of pathogenesis, diagnosis, treatment, prognosis and prevention of complex diseases. Heterogeneous information network is constructed based on the known disease-gene associations. The association strength between lncRNAs and diseases can be measured by an association score in the heterogeneous network. A simple method called HeteSim is applied to calculate the association scores between lncRNAs and diseases. The method used in this paper is based on all paths existing between a given disease and a given lncRNA. The experiments show that our method can achieve superior performance than state-of-art methods.Our predictions for ovarian cancer and gastric cancer have been verified by biological experiments, indicating the effectiveness of this method. The case studies indicate that our method can give informative clues for further investigation. In conclusion, the only paths based on known disease-gene associations are exploited, and it is can be expected that other disease associated information can also be integrated into our method, and better performance can be available.
Lab Indicator Standardization in a Regional Medical Health Platform
Zhang Jiaying, Wang Qi, Zhang Zhixing, Ruan Tong, Zhang Huanhuan, He Ping
2019, 56(9):  1897-1906.  doi:10.7544/issn1000-1239.2019.20180729
Asbtract ( 420 )   HTML ( 4)   PDF (1124KB) ( 219 )  
Related Articles | Metrics
Due to the lack of a complete synonym list for indicator mapping, different hospitals may use different names for the same lab indicator. Lab indicator name discrepancy has greatly affected the medical information sharing and exchange among hospitals. It is becoming increasingly important to standardize the lab indicators. Such a problem can be seen as an entity alignment task to map different indicators into standard ones. However, a lab indicator only involves its name and value, not including any extra properties or contexts which is needed by existing knowledge base (KB) alignment or entity linking methods. More importantly, there exist no available standard KBs to provide standard indicator terms. Therefore, we cannot implement these existing methods directly. To solve the problem, in this paper, we present the first effort to work on lab indicator standardization. We propose a novel standardization method, which firstly clusters the indicators based on their names and abbreviations, and then iteratively employs a binary classification algorithm based on similarity features and partition score features for indicator mapping. Experimental results on the real-world medical data show that the final classification achieves a F1-score of 85.27%, which indicates that our method improves the quality and outperforms state-of-the-art approaches.
Design and Implementation of Pairwise Sequence Alignment Algorithm Components Based on Dynamic Programming
Shi Haihe, Zhou Weixing
2019, 56(9):  1907-1917.  doi:10.7544/issn1000-1239.2019.20180835
Asbtract ( 550 )   HTML ( 1356)   PDF (1776KB) ( 381 )  
Related Articles | Metrics
Pairwise sequence alignment algorithm is a key algorithm in bioinformatics, and it is widely used in sequence similarity analysis and genomic sequence database searching. The existing study mainly focuses on the optimization and use of relative alignment algorithms for specific application problems. To some extent, those studies lack a high-level algorithm framework that not only has led to the redundancy of the sequence alignment algorithms and the possible errors caused by the artificial selection algorithm, but also made the structure of algorithm difficult to be understood effectively. Through in-depth analysis of the dynamic programming-based pairwise sequence alignment algorithms domain(DPPSAA), a domain feature model and the corresponding algorithm component interactive model have been established, a DPPSAA component library has been formally implemented by the PAR platform, and a concrete algorithm has been assembled, thus the reliability of the algorithm for formal assembly is guaranteed, moreover a valuable reference for the application of sequence similarity analysis algorithms is provided. Finally, the C++ program generation system of PAR platform is used to transform the assembly alignment algorithm into C++ program and the running results show that the dynamic programming-based pairwise sequence alignment algorithm component library has certain practicability.
BIRI: A BBO-Inspired MSN Routing Algorithm with Information-Centric Paradigm Support
Tu Panpeng, Wang Xingwei, Li Jie, Huang Min
2019, 56(9):  1918-1926.  doi:10.7544/issn1000-1239.2019.20180861
Asbtract ( 404 )   HTML ( 1)   PDF (913KB) ( 268 )  
Related Articles | Metrics
The popularity of intelligent mobile terminals has greatly promoted the development of mobile social networks (MSNs). As the carrier of the terminal equipment, the human being has the feature of constantly moving which leads to dynamic changes of the network topology and brings many serious problems to MSN routing, such as high latency, low delivery rate and high overhead. In order to promote routing efficiency, based on the content-centric idea in information-centric networking (ICN) and the biogeography-based optimization (BBO) algorithm, an efficient BBO-inspired MSN routing algorithm with information-centric paradigm support (BIRI) is designed. Firstly, social metrics, social relationship strength and centrality, are redefined to direct BBO algorithm for community detection. Secondly, the novel strategies of content aggregation, data caching and bridge node selection are designed to support the efficient content retrieval and access. Based on these strategies, the enhanced intra-community and inter-community routing processes are proposed to release the interference caused by the mobility of the terminal equipment on data transmission. The proposed BIRI routing algorithm is simulated on the opportunistic network environment (ONE), and compared with other three baseline MSN routing algorithms and analyzed from three aspects of delivery rate, average latency and network overhead ratio. Experimental results show that the proposed BIRI mechanism is feasible and effective.
Product Category Mining Associated with Weibo Hot Topics
Zuo Xiaochen, Dou Zhicheng, Huang Zhen, Lu Shuqi, Wen Jirong
2019, 56(9):  1927-1938.  doi:10.7544/issn1000-1239.2019.20180723
Asbtract ( 806 )   HTML ( 13)   PDF (1101KB) ( 357 )  
Related Articles | Metrics
Weibo is one of the widely used social media platforms for online sharing and communication. Some widely-received topics have been formed into Weibo hot topics by being forwarded, reviewed, and searched by a large number of users in Weibo. And the widespread dissemination of these hot topics may further stimulate and promote users offline behaviors. As a typical representative of it, some hot topics on Weibo may stimulate sales of products related to the topics under the e-commerce platform. Mining out the relevant product categories of Weibos hot topics in advance can help e-commerce platforms and sellers to do a good job of commodity operation and inventory deployment as well as promote the search conversion rate of users and bring about an increase in the sales of corresponding products. This paper proposes a method of mining potential shopping categories associated with hot topics of Weibo. First, the method builds a product knowledge map, and then uses a variety of in-depth network models to perform textual matching between the information of the associated knowledge of product categories and the content of the Weibo topics. The strength of association of each hot topic and product category is identified. Experiments show that the method can effectively identify the relationship between hot topics and shopping categories, and most of the hot topics of Weibo can be associated with at least one product category in the e-commerce platform.
Evaluation of Content Credibility in Social Media
Liu Bo, Li Yang, Meng Qing, Tang Xiaohu, Cao Jiuxin
2019, 56(9):  1939-1952.  doi:10.7544/issn1000-1239.2019.20180624
Asbtract ( 671 )   HTML ( 4)   PDF (1088KB) ( 441 )  
Related Articles | Metrics
With the rapid development of social media in recent years, the access to information has been broadened, but the spreading of incredible information has been facilitated at the same time, which brings a series of negative impacts to cyber security. Compared with the traditional online media, the information in social media is more open and complicated, giving rise to great challenges to judge online information credibility for individuals. How to filter the incredible information becomes an urgent problem. In the existing research on the assessment of information credibility in social media, lots of effort has been involved in extracting the useful factors for credibility assessment, but the processing of noisy data is neglected, and a large number of useless tweets can be included in the evaluation process, resulting in the deviation of the information credibility assessment. So it is particularly important to select the significant tweets for information credibility assessment. This paper takes the topic factor and conformity of users into consideration to relieve the impact of noisy data, such as conformity retweeting, on information credibility assessment, and uses Bayesian network to establish an evaluation model for information credibility in social media. Then we verify the effectiveness of our model using a real dataset.
A Link Prediction Approach in Temporal Networks Based on Game Theory
Liu Liu, Wang Yuyao, Ni Qixuan, Cao Jie, Bu Zhan
2019, 56(9):  1953-1964.  doi:10.7544/issn1000-1239.2019.20180842
Asbtract ( 602 )   HTML ( 13)   PDF (2234KB) ( 297 )  
Related Articles | Metrics
Link prediction is an important task in complex network analysis, which can be applied to many real-world practical scenarios such as recommender systems, information retrieval, and marketing analysis. Different from the traditional link prediction problem, this paper predicts the existence of the link at any time in the future based on the set of temporal links in a given time window, that is, the evolution mechanism of the temporal network. To explore this question, we propose a novel semi-supervised learning framework, which integrates both survival analysis and game theory. First, we carefully define the ε-adjacent network sequence, and make use of time stamp on each link to generate the ground-truth network evolution sequence. Next, to capture the law of network evolution, we employ the Cox proportional hazard model to study the relative hazard associated with each temporal link, so as to estimate the covariate’s coefficient associated with a set of neighborhood-based proximity features. To compress the searching space, we further propose a game theory based two-way selection mechanism to inference the future network topology. We finally propose a network evolution prediction algorithm based on autonomy-oriented computing, and demonstrate both the effectiveness and the efficiency of the proposed algorithm on real-world temporal networks.
Optimization of TCP Congestion Control Algorithm in Dynamic Adaptive Streaming over HTTP
Wu Hua, Wang Ling, Cheng Guang
2019, 56(9):  1965-1976.  doi:10.7544/issn1000-1239.2019.20180752
Asbtract ( 491 )   HTML ( 11)   PDF (4437KB) ( 498 )  
Related Articles | Metrics
With the rapid growth of Internet video traffic, streaming media transmission technologies are also developed rapidly. From the traditional real-time transport protocol (RTP) over user datagram protocol (UDP) to hyper text transfer protocol (HTTP) over transmission control protocol (TCP), major video service providers are constantly developing new media transmission technologies to attract more users. As one of the most popular adaptive streaming media transmission technologies, dynamic adaptive streaming over HTTP (DASH) has many advantages in improving users’ experience. However, due to its fragmented transmission, the resulting ON-OFF transmission mode causes bursts of TCP traffic. The intermittent traffic bursts have an impact on other applications. When multiple users compete for the bandwidth at the same time, the players may estimate the network bandwidth incorrectly. This will cause the players switch among the different video resolutions frequently, and the users experience bad QoE (quality of experience). As the transport layer protocol, TCP congestion control algorithm plays a decisive role in the video transmission efficiency, but the traditional congestion control algorithm is not customized for the DASH streaming media transmission. This paper proposes a TCP congestion control algorithm TCP-HAS which adapts TCP Vegas to the DASH streaming media transmission. More specifically, bandwidth and video’s bitrate are also used to set TCP congestion control parameters in TCP-HAS. Experimental results demonstrate that TCP-HAS can improve network QoS (quality of service) as well as QoE of the users when multiple users share the bottleneck link bandwidth.
Machine Learning Inference Framework on Multi-Core Processor
Zhang Xiao, Zhi Tian
2019, 56(9):  1977-1987.  doi:10.7544/issn1000-1239.2019.20180786
Asbtract ( 655 )   HTML ( 11)   PDF (1831KB) ( 689 )  
Related Articles | Metrics
In recent years, deep neural network has been widely used in many domains and got huge success. Since the size and computation workload for neural network model is increasing rapidly, GPU and many new-designed domain-specific accelerators have been used in order to complete computing neural networks as soon as possible. However, the traditional general-purpose processor should not be ignored. Considering it is common and easy to get, exploring efficient way for using general-purpose processor in deep learning is meaningful. In training phase, the multi-core architecture is suitable for data parallelism which helps to increase system throughput. However, in inference phase, end-to-end latency is much more important than throughput, and traditional data parallelism could not fulfill the requirement of small batch and low latency. In order to utilize hardware resource of multi-core architecture, it is necessary to split the computation task into smaller parts which can be executed on multi-core processor in parallel. Besides, a sophisticated strategy is necessary to make sure the split plan will not affect computing efficiency on each core. In this paper, we propose a parallel framework for the multi-core general-purpose processor. It divides each operation in the neural network into smaller ones and executes them on the multiple cores in parallel. By offering some necessary assistant operations, this framework can be easily transplanted to support potential multi-core processors. Also, the framework can automatically generate an effective splitting plan for the given neural networks. The plan is designed with enough consideration of both network architecture and low-level hardware. The experimental results show that this framework can give an efficient splitting plan which substantially reduces the end-to-end latency of inference task on multi-core processor.
Proactive Locally Repairable Codes for Cloud Storage Systems
Zhang Xiaoyang, Xu Jiahao, Hu Yuchong
2019, 56(9):  1988-2000.  doi:10.7544/issn1000-1239.2019.20190048
Asbtract ( 546 )   HTML ( 4)   PDF (1157KB) ( 371 )  
Related Articles | Metrics
Cloud storage systems, which provide customers the ability to access their data reliably, start to adopt a novel family of codes called locally reparable codes (LRC), e.g., Windows Azure Storage and Facebook’ HDFS RAID. Compared with Reed-Solomon codes, LRC is efficiently repairable since it divides the data blocks of each stripe into groups, each of which has an additional local parity block such that a failed block can be repaired locally in one group. LRC assumes that each group is equal-size which implies that each failed block is repaired from the same amount of data of a group. However, the blocks in the disks which are more likely to fail should be repaired more efficiently. In this paper, we present a proactive LRC (pLRC) via predicting disk failures and resizing the groups such that the recent failed disks can be repaired faster while maintaining the same storage overhead and code construction relative to LRC. We analyze pLRC through the reliability modeling of mean-time-to-data-loss (MTTDL) and also implement pLRC in Facebook’s HDFS. The results show that compared with LRC, pLRC’s reliability can be improved by up to 113%, and its degraded read and disk repair performance can be improved by up to 46.8% and 47.5%, respectively.
Formal Model of Correctness and Optimization on Binary Translation
Fu Liguo, Pang Jianmin, Wang Jun, Zhang Jiahao, Yue Feng
2019, 56(9):  2001-2011.  doi:10.7544/issn1000-1239.2019.20180513
Asbtract ( 400 )   HTML ( 1)   PDF (888KB) ( 372 )  
Related Articles | Metrics
Binary translation has attracted more and more attention in the fields of architecture optimization, program performance optimization, security analysis, software transplantation and so on. Although there are totally different requirements on the application of binary translation techniques in these various fields, the researchers are always focusing their attention on two aspects: the correctness of translation and the efficiency of translation. The correctness of translation refers to the equivalent logical functions between the program before and after the translation, so an appropriate formal model for the translation needs to be built at first. According to the current research needs on correctness and optimization of binary translation technique, twice mapping model based on subsequent relations could have been constructed firstly; then the properties and construction methods of a correct process on translation should have been described formally; based on the formal description, the optimization methods of translation would be classified and analyzed on their diverse characteristics and various properties finally. This paper provides a more powerful theoretical support for the further research on the strategy pattern combination of translation methods and optimization methods in the binary translation technology by constructing a formal model of binary translation based on correctness and optimization.
Compiler Optimization Sequence Selection Method Based on Learning Model
Liu Hui, Xu Jinlong, Zhao Rongcai, Yao Jinyang
2019, 56(9):  2012-2026.  doi:10.7544/issn1000-1239.2019.20180789
Asbtract ( 538 )   HTML ( 7)   PDF (3817KB) ( 291 )  
Related Articles | Metrics
For new applications and target platforms, it is often necessary to use the compiler for optimization sequence selection to improve the performance of target code. Iterative compilation enables the optimization sequence selection process automatically, with as many different versions of the program as possible within the allowable time and space range. However, iterative compilation method is a mechanical search that lacks the utilization of previous experience and requires large execution overheads. Therefore, an optimized compilation method is needed to automatically predict the performance of the transformed program without actually running. This paper presents Features ANN to select the optimization sequence of compiler. Features ANN is based on the supervised learning model. Firstly, the program feature is extracted by the program feature representation technique through a combination of dynamic and static feature. Then, the compiler optimization space is searched based on the program features, and it finds the best optimization of the current version of the program. Finally, training samples are formed by program features and optimal optimization, and an artificial neural network (ANN) is used to construct a learning model to predict the optimal optimization sequence of the new program. Experimental results show that, Features ANN can get the best performance compared with the existing iterative compilation and non-iterative compilation methods.