ISSN 1000-1239 CN 11-1777/TP

Table of Contents

01 January 2018, Volume 55 Issue 1
Survey on Single Disk Failure Recovery Methods for Erasure Coded Storage Systems
Fu Yingxun, Wen Shilin, Ma Li, Shu Jiwu
2018, 55(1):  1-13.  doi:10.7544/issn1000-1239.2018.20160506
With the rapid development of cloud storage, erasure codes, which can tolerate multiple disk failures with low storage overhead, have attracted considerable attention. Storage systems built on erasure codes are called erasure coded storage systems. When a disk fails, such a system must read the information stored on the surviving disks and then reconstruct the lost information with a recovery algorithm. As storage systems scale up, disk failures happen very frequently, and most of them are single disk failures. Therefore, how to quickly recover the lost data after a single disk failure has become a key problem for erasure coded storage systems. In this paper, we first introduce the background and significance of single disk failure recovery, and then give some fundamental terms and principles of erasure codes. Afterward, we illustrate the hybrid recovery principle, elaborate on the key ideas of current construction-based and search-based recovery methods, and summarize their typical application scenarios. We also summarize some new erasure coding techniques for optimizing single disk failure recovery efficiency. At the end of the paper, we discuss future research directions for disk failure recovery in erasure coded storage systems.
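As a rough illustration of what reconstructing lost information from surviving disks means, the sketch below uses single XOR parity, the simplest erasure code. This is not one of the recovery methods the survey covers; the function names and the three-block layout are assumptions made for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode_parity(data_blocks):
    """Compute one parity block over all data blocks (hypothetical helper)."""
    return xor_blocks(data_blocks)


def recover_lost_block(surviving_blocks):
    """Rebuild the single missing block from every surviving block,
    where the survivors include the parity block."""
    return xor_blocks(surviving_blocks)


# Three data "disks" plus one parity "disk"; disk 1 is assumed to fail.
data = [b"disk", b"fail", b"ures"]
parity = encode_parity(data)
lost_index = 1
survivors = [blk for i, blk in enumerate(data) if i != lost_index] + [parity]
recovered = recover_lost_block(survivors)
```

With XOR parity every surviving block must be read to rebuild one lost block, which is the kind of recovery I/O cost that single disk failure recovery methods try to reduce for more general codes.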
Data Security and Privacy Preserving Techniques for Wearable Devices: A Survey
Liu Qiang, Li Tong, Yu Yang, Cai Zhiping, Zhou Tongqing
2018, 55(1):  14-29.  doi:10.7544/issn1000-1239.2018.20160765
Mobile computing based on wearable devices is considered an important technology for supporting ubiquitous perceptual applications. It uses widespread sensors to continuously sense environmental information, and it adopts short-range communication and data mining/machine learning to transmit and process the sensed data, respectively. Current work mainly focuses on designing and implementing new mobile applications, information gathering, product modality and friendly user interfaces. However, research on data security and privacy technology for wearable devices is still in its infancy. From the perspective of data analysts, researchers analyze the characteristics of the diverse data in wearable devices and the privacy threats targeting these devices. They are particularly interested in human activity recognition techniques and data mining mechanisms based on multi-source sensing data. On the other hand, it is vital for privacy protectors of wearable devices to study privacy preservation techniques in the following three aspects: cloud-assisted privacy preserving mechanisms, privacy-aware personal data publishing and policy-based access control. A case study on security and privacy for Fitbit, a wearable device for health tracking, is presented. Finally, the technological approaches to preserving data security and privacy for wearable devices are summarized, and some open issues for further study are raised.
Text Emotion Analysis: A Survey
Li Ran, Lin Zheng, Lin Hailun, Wang Weiping, Meng Dan
2018, 55(1):  30-52.  doi:10.7544/issn1000-1239.2018.20170055
With the rapid development of social networks, electronic commerce, the mobile Internet and other technologies, all kinds of Web data are expanding rapidly. There are a large number of emotional texts on the Internet, and, if fully explored, they are very helpful for understanding netizens’ opinions and viewpoints. The aim of emotion classification, which is the core of emotion analysis, is to predict the emotion categories of emotive texts. In this paper, we first introduce the background knowledge of emotion analysis, including different emotion classification systems and its application scenarios in public opinion management and control, business decisions, opinion search, information prediction and emotion management. Then we summarize the mainstream approaches to emotion classification and describe and analyze them in detail. Finally, we expound the problems of data sparsity, class imbalance learning, strong dependence on domain knowledge and language imbalance that exist in emotion analysis work. The research progress of text emotion analysis is summarized, and its prospects are discussed in combination with big data processing, the mixing of multiple media, the development of deep learning, mining on specific topics and multilingual synergy.
Structures and State-of-Art Research of Cluster Scheduling in Big Data Background
Hao Chunliang, Shen Jie, Zhang Heng, Wu Yanjun, Wang Qing, Li Mingshu
2018, 55(1):  53-70.  doi:10.7544/issn1000-1239.2018.20170051
Cluster scheduling is one of the most investigated topics in big data environments. The main problem it aims to solve is how to efficiently fulfill the requirements of data analytic workloads using a finite amount of cluster resources. Along with the rapid development of big data applications over the past decade, the context and goals of cluster scheduling have also grown significantly in complexity. As the drawbacks of traditional centralized scheduling methods have become increasingly apparent in modern clusters, many alternative scheduling structures, including two-level scheduling, distributed scheduling and hybrid scheduling, have been proposed in recent years. Unfortunately, as each of these methods embodies a distinct set of advantages and limitations, there is not yet a simple one-size-fits-all answer that can overcome all scheduling challenges simultaneously in big data environments. Therefore, this work aims to provide a comprehensive survey of the various families of mainstream scheduling methods, focusing on their motivation, strengths and weaknesses, and suitability for different application scenarios. Seminal works on each scheduling structure are analyzed in depth to bring insights into the current state of development. Last but not least, we try to extrapolate the current trend in cluster scheduling and highlight the challenges to be tackled in future work.
A Task Migration Strategy in Big Data Stream Computing with Storm
Lu Liang, Yu Jiong, Bian Chen, Liu Yuechao, Liao Bin, Li Huijuan
2018, 55(1):  71-92.  doi:10.7544/issn1000-1239.2018.20160812
As one of the most representative platforms in stream computing, Apache Storm has become the first choice for real-time big data processing scenarios because it is open source, simple and high-performing. Storm’s default scheduler uses a round-robin strategy that considers neither the differences in performance and workload among work nodes nor the different overheads of inter-node, inter-process and inter-executor communication in a heterogeneous environment, so it cannot fully exploit the performance of a Storm cluster. In order to minimize the communication overhead subject to all kinds of resource constraints, a task migration strategy for heterogeneous Storm clusters (TMSH-Storm) is proposed on the basis of a resource-constrained model, an optimal communication overhead model and a task migration model. It comprises two algorithms: a source node selection algorithm and a task migration algorithm. The source node selection algorithm adds work nodes that exceed the threshold to a set of source nodes according to the workload and priority of CPU, memory and network bandwidth in each work node. The task migration algorithm takes into account factors such as the migration overhead, communication overhead, resource constraints and the load of each node and each task, and migrates tasks from source nodes to proper destination nodes successively and asynchronously. Experimental results show that the proposed strategy can reduce latency and the overhead of inter-node communication, and that its implementation cost is lower than that of existing research.
A Deep Learning Model for Predicting RNA-Binding Proteins Only from Primary Sequences
Li Hongshun, Yu Hua, Gong Xiujun
2018, 55(1):  93-101.  doi:10.7544/issn1000-1239.2018.20160508
RNA-binding proteins (RNA-BPs) play pivotal roles in alternative splicing, RNA editing, methylation and many other biological functions. Predicting the functions of these proteins from primary amino acid sequences is becoming one of the major challenges in the functional annotation of genomes. Traditional prediction methods often focus on extracting physicochemical features from sequences but ignore motif information and the location information between motifs. Meanwhile, the small scale and high noise of the training data result in lower accuracy and reliability of predictions. In this paper, we propose a new deep learning based model to predict RNA-binding proteins from primary sequences. The model utilizes two stages of convolutional neural network (CNN) to detect the function domains of protein sequences, and a long short-term memory neural network (LSTM) to obtain a fixed-length feature representation of sequences and learn long short-term dependencies between the function domains of protein sequences. It requires less human intervention in the feature selection procedure than traditional machine learning methods, since all features are learned automatically. The experimental results show its superiority in processing large-scale sequence data.
A New Documents Clustering Method Based on Frequent Itemsets
Zhang Xuesong, Jia Caiyan
2018, 55(1):  102-112.  doi:10.7544/issn1000-1239.2018.20160662
Traditional document clustering methods use the vector space model (VSM) of words to represent documents. This VSM representation only measures the importance of single words, ignores the semantic relationships between words, and has high dimensionality. In this study, we propose a new document clustering method: FIC (frequent itemsets based document clustering method). In this method, we represent each document using frequent itemsets (where a frequent itemset is a set of frequently co-occurring words) mined from the documents by the FP-Growth algorithm. We then construct a document-document relationship network based on the similarity between pairs of documents under this new representation. Finally, we divide the network into communities using a given community detection method to complete the document clustering. Thus, FIC can not only overcome the high dimensionality of VSM, but also make full use of the topological relationships among documents. The experimental results on two English corpora (Reuters-21578 and 20Newsgroup) and one Chinese corpus (Sougou-News) demonstrate that the proposed method FIC is superior to other existing frequent itemsets based methods and other classical state-of-the-art document clustering methods, and that the top K words identified by FIC for characterizing each topic of documents are more meaningful than those of the classical topic model LDA (latent Dirichlet allocation).
Integrating User Social Status and Matrix Factorization for Item Recommendation
Yu Yonghong, Gao Yang, Wang Hao, Sun Shuanzhu
2018, 55(1):  113-124.  doi:10.7544/issn1000-1239.2018.20160704
With the increasing popularity of online social network services, social network platforms provide rich information for recommender systems. Based on the assumption that friends share more common interests than non-friends and that users tend to accept item recommendations from friends, more and more recommender systems utilize trust relationships between users to improve the performance of recommendation algorithms. However, most existing social-network-based recommendation algorithms ignore the following problems: 1) in different domains, users tend to trust different friends; 2) the degree to which a user is influenced by their trusted friends differs across domains, since the user has a different social status in each domain. In this paper, we first infer domain-specific social trust relation networks based on users’ original rating information and social network information, and then compute each user’s social status in each specific domain by leveraging the PageRank algorithm. Finally, we propose a novel recommendation algorithm by integrating users’ social status with the matrix factorization model. Experimental results on a real-world dataset show that our proposed approach outperforms traditional social-network-based recommendation algorithms.
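To make the social status step concrete, here is a minimal sketch of PageRank by power iteration over a small trust network. The toy graph, the damping factor of 0.85 and the function name are assumptions for illustration; the abstract does not give the authors' parameters or their domain-specific trust inference.

```python
def pagerank(trust_edges, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    trust_edges maps each user to the list of users they trust, so a
    user who is trusted by many others ends up with a higher score.
    """
    users = set(trust_edges)
    for trusted in trust_edges.values():
        users.update(trusted)
    users = sorted(users)
    n = len(users)
    rank = {u: 1.0 / n for u in users}

    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in users}
        for u in users:
            out_links = trust_edges.get(u, [])
            if out_links:
                share = damping * rank[u] / len(out_links)
                for v in out_links:
                    new_rank[v] += share
            else:
                # A user who trusts nobody spreads their score evenly.
                share = damping * rank[u] / n
                for v in users:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical trust network for one domain.
trust = {"alice": ["carol"], "bob": ["carol"], "carol": ["alice"]}
status = pagerank(trust)
```

In this toy network carol is trusted by two users and receives the highest score, while bob, whom nobody trusts, receives the lowest.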
Personalized Knowledge Recommendation Model Based on Constructivist Learning Theory
Xie Zhenping, Jin Chen, Liu Yuan
2018, 55(1):  125-138.  doi:10.7544/issn1000-1239.2018.20160547
Personalized recommendation is becoming a basic form of information network services in the era of Internet+ and big data. Its wide use in e-commerce and social media has produced huge commercial value. However, there has been only limited research and application in the field of personalized knowledge learning, which may have tremendous potential social value for public education and personalized information selection. This study proposes a novel personalized knowledge recommendation method, the constructive recommendation model, based on constructivist learning theory. The new model uses knowledge networks to represent expected knowledge systems, uses the nearest neighbor priority strategy to select knowledge item candidates, and introduces a top-K unstudied knowledge recommendation algorithm that sorts candidate knowledge items by their learnable constructive degrees. The proposed constructive recommendation model can uncover users’ potential knowledge demands by comparing the structure of the domain knowledge network with that of the users’ learnt knowledge network. It can then recommend the most needed knowledge items to users in order, so as to achieve the greatest constructive learning effect. We choose a healthy diet knowledge system as the experimental problem, in which 14600 knowledge documents are collected from public Internet websites in China with knowledge subjects such as ‘health knowledge’, ‘dietary nutrition’ and ‘dietary misconceptions’. Experimental analyses are carried out in this paper, and the corresponding results demonstrate that the recommended knowledge sequences given by our model achieve stronger knowledge continuity and higher knowledge learning efficiency than existing related methods.
A Revised Translation-Based Method for Knowledge Graph Representation
Fang Yang, Zhao Xiang, Tan Zhen, Yang Shiyu, Xiao Weidong
2018, 55(1):  139-150.  doi:10.7544/issn1000-1239.2018.20160723
Knowledge graphs are of great research value to artificial intelligence and have been extensively applied in fields such as semantic search and question answering. Knowledge graph representation embeds a large-scale knowledge graph comprising entities and relations into a continuous vector space. To this end, a number of models and methods have been proposed for knowledge embedding. Among them, TransE is a classic translation-based method with low model complexity, high computational efficiency and good capability of expressing knowledge. However, TransE still has two flaws. One is that it uses the inflexible Euclidean distance as its metric and treats each feature dimension identically, so the model accuracy may be affected by irrelevant dimensions. The other is that it has limitations in dealing with complex relations, including reflexive, one-to-many, many-to-one and many-to-many relations. Currently, no single method resolves both flaws simultaneously, and thus we propose a revised translation-based method for knowledge graph representation, namely TransAH. For the first flaw, TransAH adopts an adaptive metric, replacing the Euclidean distance with a weighted Euclidean distance by adding a diagonal weight matrix, which assigns a different weight to each feature dimension. For the second, inspired by TransH, it introduces a relation-oriented hyperspace model, projecting head and tail entities onto the hyperspace of a given relation for distinction. Finally, empirical studies on public real-world knowledge graph datasets analyze and verify the effectiveness of the proposed method. Comprehensive comparative experiments on two tasks, link prediction and triplet classification, show that, compared with existing models and methods, TransAH achieves remarkable improvement in various aspects and demonstrates its superiority.
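The two ingredients named in the abstract can be combined into a score of the form (h⊥ + d_r − t⊥)ᵀ D_r (h⊥ + d_r − t⊥), where h⊥ and t⊥ are the projected head and tail, d_r is the relation's translation vector and D_r is the diagonal weight matrix. The sketch below is one possible reading, not the paper's exact formulation: it assumes the projection is the hyperplane projection known from TransH (with a unit normal vector) and uses the squared weighted distance. All names and numbers are hypothetical.

```python
def project_to_hyperplane(entity, normal):
    """Project an entity vector onto the hyperplane whose unit normal
    is `normal`, as TransH does (assumed here for TransAH)."""
    dot = sum(e * w for e, w in zip(entity, normal))
    return [e - dot * w for e, w in zip(entity, normal)]


def transah_style_score(head, tail, translation, normal, weights):
    """Squared weighted Euclidean distance between the translated head
    and the tail after projection; lower means more plausible.

    `weights` holds the diagonal of the per-relation weight matrix.
    """
    h_proj = project_to_hyperplane(head, normal)
    t_proj = project_to_hyperplane(tail, normal)
    diff = [h + d - t for h, d, t in zip(h_proj, translation, t_proj)]
    return sum(w * x * x for w, x in zip(weights, diff))


# Toy 3-dimensional embeddings (illustrative values only).
normal = [0.0, 0.0, 1.0]        # unit normal of the relation hyperplane
translation = [1.0, 0.0, 0.0]   # relation translation vector
weights = [1.0, 0.5, 2.0]       # diagonal weights, one per dimension
head = [1.0, 2.0, 5.0]
good_tail = [2.0, 2.0, -3.0]
bad_tail = [2.0, 4.0, 7.0]

good_score = transah_style_score(head, good_tail, translation, normal, weights)
bad_score = transah_style_score(head, bad_tail, translation, normal, weights)
```

The third coordinate is removed by the projection, so two tails that differ only along the normal score identically, and the mismatch in the second dimension is scaled by its weight of 0.5.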
Adaptive Multibiometric Feature Fusion Based on Classification Distance Score
Zhang Lu, Wang Huabin, Tao Liang, Zhou Jian
2018, 55(1):  151-162.  doi:10.7544/issn1000-1239.2018.20160675
Matching score is one of the traditional fusion score metrics, but it is not a good metric for separating data with intra-class scores from data with inter-class scores. The classification confidence score can separate these two kinds of data well, but it does not work well for data whose matching scores are close to the classification threshold. Therefore, this paper proposes a new score metric based on the classification distance score. It contains not only the first-level classification information but also the distance between the matching score and the classification threshold, and it increases the distance between the fusion scores of intra-class and inter-class data. The classification distance score thus provides the fusion algorithm with an effective set of discriminative information, which improves the utilization rate of the score metric. Furthermore, since information entropy indicates the information value of features, it is used to define the feature correlation coefficient and the feature weight coefficient, and the weighted fusion rule and the traditional SUM rule are then unified in an adaptive algorithm framework, which improves the fusion recognition rate. The experimental results indicate the validity of the proposed method.
Fast Self-Adaptive Clustering Algorithm Based on Exemplar Score Strategy
Zhang Yuanpeng, Deng Zhaohong, Chung Fu-lai, Hang Wenlong, Wang Shitong
2018, 55(1):  163-178.  doi:10.7544/issn1000-1239.2018.20160937
To improve the efficiency of exemplar-based clustering algorithms and make them self-adaptive, a fast self-adaptive clustering algorithm based on exemplar score (ESFSAC) is proposed, building on our previous work, the fast reduced set density estimator (FRSDE). The proposed ESFSAC is based on three assumptions: 1) exemplars should come from high-density samples; 2) exemplars should be either components of the reduced set or their neighbors with high similarities; 3) clusters can be diffused around both exemplars and their labeled reduced set. Based on the first two assumptions, a quantity called the exemplar score is proposed to estimate the possibility of a sample being an exemplar, and its rationale is theoretically analyzed. With the exemplar score and the third assumption, a fast self-adaptive clustering algorithm is proposed. In this algorithm, firstly, all samples are sorted in descending order of their exemplar scores and stored in a set called the exemplar candidate set. Secondly, exemplars in the candidate set are selected one by one and their labels are propagated to their neighbors in the reduced set. Thirdly, with the same strategy, the unlabeled samples obtain their labels from the samples in the reduced set. To speed up this process, a sampling algorithm is introduced. The power of the proposed algorithm is demonstrated on several synthetic and real-world datasets. The experimental results show that the proposed algorithm can handle datasets with different shapes as well as large-scale datasets without presetting the number of clusters.
Sentence Classification Model Based on Sparse and Self-Taught Convolutional Neural Networks
Gao Yunlong, Zuo Wanli, Wang Ying, Wang Xin
2018, 55(1):  179-187.  doi:10.7544/issn1000-1239.2018.20160784
The study and establishment of sentence classification models have an important impact on research in natural language processing and understanding. In this paper, we propose a sentence classification model named SCNN, based on sparse and self-taught convolutional neural networks, for extracting the characteristics of data features in the CNN model. Firstly, the convolutional layer itself learns effective combinations of the feature matrices of the previous layers in order to dynamically learn the relationships among data features within the scope of the sentence, eliminating the user-defined feature-map input of the convolutional layers. Secondly, during the unsupervised training process, an L1-norm is used to add sparsity constraints, so that the complexity of the proposed model is effectively decreased while the accuracy of the SCNN model is effectively increased. Finally, by employing K-Max Pooling in the feature extraction layer, the maximal feature sequence can be selected and the relative order among features can be effectively preserved. SCNN can cope with sentences of variable length, and furthermore, the model can be applied to any language because it does not depend on linguistic features such as syntax and parse trees. Experiments on a standard corpus dataset show that the proposed model is effective for the task of sentence classification.
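The K-Max Pooling step can be shown on a single feature sequence: keep the k largest activations but return them in their original order, which is what preserves the relative order among features. The function name and the sample values below are assumptions for illustration; the SCNN model itself applies this inside a neural network layer.

```python
def k_max_pooling(values, k):
    """Return the k largest values of a sequence in their original order."""
    # Indices of the k largest activations (ties go to the earlier index).
    top_indices = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    # Re-sort the chosen indices by position to restore sequence order.
    return [values[i] for i in sorted(top_indices)]


# Activations of one feature map across a five-word sentence (made-up numbers).
activations = [0.2, 0.8, 0.1, 0.9, 0.5]
pooled = k_max_pooling(activations, 3)
```

Because the output length depends only on k, sentences of different lengths map to a fixed-size representation, which fits the abstract's claim that SCNN handles sentences of variable length.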
Topic Augmented Convolutional Neural Network for User Interest Recognition
Du Yumeng, Zhang Weinan, Liu Ting
2018, 55(1):  188-197.  doi:10.7544/issn1000-1239.2018.20160892
With the development of mobile Internet technology and the popularity of mobile terminals, many social websites and applications have appeared on the Internet. As a social application, microblog has attracted a large number of users with its convenient operation and rapid propagation. A user receives hundreds of microblogs every day, which leads to information overload and makes it harder for the user to acquire information and knowledge. On the other hand, more and more merchants treat microblog as a marketing platform, which makes the targeted delivery of advertisements a problem of high commercial value. Microblog user interest recognition can help solve the problems discussed above. This paper proposes a topic augmented convolutional neural network approach to recognize user interest. By integrating continuous semantic information and discrete topic information, the proposed approach first obtains the category distribution of users’ microblogs. It then recognizes users’ interests through maximum likelihood estimation over the category distribution of users’ microblogs. Experimental results show that the proposed topic augmented convolutional neural network approach significantly outperforms the labeled LDA based approach and the traditional convolutional neural network approach on microblog classification and user interest recognition.
Chemical-Induced Disease Relation Extraction Based on Biomedical Literature
Li Zhiheng, Gui Yingyi, Yang Zhihao, Lin Hongfei, Wang Jian
2018, 55(1):  198-206.  doi:10.7544/issn1000-1239.2018.20160893
Drug reactions between chemicals and diseases have made chemical-disease relations (CDRs) a topic of considerable concern. Automatic extraction of chemical-induced disease (CID) relations from the biomedical literature can be used to support biocuration, new drug discovery and drug safety surveillance. In this paper, we present a CID relation extraction system, called CDRExtractor, to extract CID relations from biomedical literature at both sentence and document levels. To extract the CID relations located in the same sentence, we first manually annotate a sentence-level training set, which is used to train the sentence-level classifier. To improve the performance of the classifier, the Co-training algorithm is used to exploit the unlabeled data, with the feature kernel and the graph kernel as two independent views. Then CDRExtractor uses a document-level classifier to extract the CID relations that span sentences. The classifier utilizes the document-level information (features) of the chemical and disease pair and returns the CID relations at the document level. Finally, post-processing rules are applied to the union of the two classifiers’ results to generate the final outputs. Experimental results show that CDRExtractor achieves an F-score of 67.72% on the test set of the BioCreative V CDR CID subtask.
Defending Against SDN Network Topology Poisoning Attacks
Zheng Zheng, Xu Mingwei, Li Qi, Zhang Yun
2018, 55(1):  207-215.  doi:10.7544/issn1000-1239.2018.20160740
Software-defined networking (SDN) is a new network paradigm. Unlike conventional networks, SDN separates the control plane from the data plane. The functions of the data plane are implemented in switches, while only the controller provides the functions of the control plane. The controller learns the topology of the whole network and makes the traffic forwarding decisions. However, recent studies show that there are some serious vulnerabilities in the topology management services of current SDN controller designs, mainly in the host tracking service and the link discovery service. Attackers can exploit these vulnerabilities to poison the network topology information in SDN controllers and can even bring the whole network down. Researchers have paid some attention to this serious problem and proposed defense solutions. However, the existing countermeasures can be easily evaded by attackers. In this paper, we propose an effective approach called SecTopo to defend against network topology poisoning attacks. Our evaluation of SecTopo in the Floodlight controller shows that the defense solution can effectively secure the network topology with a minor impact on the normal operations of OpenFlow controllers.
Online/Offline Traceable Attribute-Based Encryption
Zhang Kai, Ma Jianfeng, Zhang Junwei, Ying Zuobin, Zhang Tao, Liu Ximeng
2018, 55(1):  216-224.  doi:10.7544/issn1000-1239.2018.20160799
Attribute-based encryption (ABE), as a form of public key encryption, can be utilized for fine-grained access control. However, there are two main drawbacks that limit the applications of attribute-based encryption. First, as different users may have the same decryption privileges in ciphertext-policy attribute-based encryption, it is difficult to catch users who sell their secret keys for financial benefit. Second, the number of resource-consuming exponentiation operations required to encrypt a message in ciphertext-policy attribute-based encryption grows with the complexity of the access policy, which presents a significant challenge for users who encrypt data on mobile devices. To this end, after proposing the security model for online/offline traceable attribute-based encryption, we present an online/offline traceable ciphertext-policy attribute-based encryption scheme in prime order bilinear groups, and further prove that it is selectively secure in the standard model. If a malicious user leaks his/her secret key to others for benefit, he/she will be caught by the tracing algorithm of our proposed scheme. Extensive efficiency analysis indicates that the proposed scheme moves the majority of the encryption cost into the offline encryption phase and is suitable for user encryption on mobile devices. In addition, the proposed scheme supports a large universe of attributes, which makes it more flexible for practical applications.