ISSN 1000-1239 CN 11-1777/TP

Table of Contents

01 March 2015, Volume 52 Issue 3
TML: A General High-Performance Text Mining Language
Li Jiajing, Li Xiaoming, Meng Tao
2015, 52(3):  553-560.  doi:10.7544/issn1000-1239.2015.20131546
This paper proposes a general-purpose programming language named TML for text mining. TML is the abbreviation of "text mining language", and it aims at turning complicated text mining tasks into easy jobs. The implementation of TML includes a compiler, a runtime virtual machine (interpreter), and an IDE. TML supplies most common text mining techniques, which are implemented as grammar constructs and reserved words. Users program in TML, and the code is compiled into bytecodes, which are then interpreted by the runtime virtual machine. TML has the following characteristics: 1) it supplies a formal way to model the search area, object definitions and mining methods of text mining jobs, so users can easily write declarative text mining programs; 2) the TML runtime machine implements common text mining techniques and organizes them into an efficient text analysis pipeline; 3) the TML compiler fully explores the possibility of executing its bytecodes concurrently, and the execution performs well on very large collections of documents and user-written rules. TML has been used in several large-scale online data analysis applications, including commodity purchase intention analysis, fine-grained reputation analysis of brands and products, and legal document analysis.
Smooth CHKS Twin Support Vector Regression
Huang Huajuan, Ding Shifei, Shi Zhongzhi
2015, 52(3):  561-568.  doi:10.7544/issn1000-1239.2015.20131444
Twin support vector regression (TSVR) was proposed recently as a novel regressor that tries to find a pair of nonparallel planes, i.e., ε-insensitive down- and up-bounds, by solving two related SVM-type problems. However, it may yield suboptimal solutions, since its objective function is only positive semi-definite and it lacks complexity control. To address this shortcoming, smooth twin support vector regression (STSVR) was introduced, using the sigmoid function as a smoothing technique to convert the original problems into unconstrained minimizations, which improves the training speed. However, its accuracy still needs improvement. In this paper, aiming at the low approximation ability of the sigmoid function used in STSVR, a new smooth TSVR model called smooth CHKS twin support vector regression (SCTSVR) is proposed, which adopts the CHKS (Chen-Harker-Kanzow-Smale) function, with better approximation ability, as the smoothing function. In SCTSVR, the CHKS function is used to approximate the non-differentiable term of twin support vector regression, and the Newton-Armijo algorithm is then used to solve the corresponding model. We prove that SCTSVR is not only strictly convex but also smooth to arbitrary order. Meanwhile, experimental results on several artificial and benchmark datasets show that SCTSVR has better regression performance than STSVR.
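As an illustration of the smoothing idea behind the abstract, the sketch below compares the CHKS function with the sigmoid-based smoothing used by STSVR, both as approximations of the plus function max(x, 0). The definitions follow the standard forms in the smoothing-SVM literature and are not code from the paper; the smoothing parameters mu and alpha are illustrative.

```python
import numpy as np

def chks(x, mu=0.05):
    """CHKS (Chen-Harker-Kanzow-Smale) smoothing function:
    phi(x, mu) = (x + sqrt(x^2 + 4*mu^2)) / 2,
    a smooth (arbitrarily differentiable) approximation of max(x, 0);
    the approximation error vanishes as mu -> 0."""
    return (x + np.sqrt(x * x + 4.0 * mu * mu)) / 2.0

def sigmoid_plus(x, alpha=10.0):
    """Sigmoid-based smoothing of the plus function used in STSVR-style
    models: p(x, alpha) = x + (1/alpha) * log(1 + exp(-alpha * x))."""
    return x + np.log1p(np.exp(-alpha * x)) / alpha

x = np.linspace(-1.0, 1.0, 9)
print(np.max(np.abs(chks(x) - np.maximum(x, 0.0))))          # CHKS error
print(np.max(np.abs(sigmoid_plus(x) - np.maximum(x, 0.0))))  # sigmoid error
```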
Incremental FP_GROWTH Algorithm Based on Disk-resident 1-itemsets Counting
Shen Yan, Zhu Yuquan, Liu Chunhua
2015, 52(3):  569-578.  doi:10.7544/issn1000-1239.2015.20131436
As the size of data sets increases constantly, improving the mining efficiency of frequent itemsets has become a key research topic in the data mining field. Incremental update algorithms for frequent itemsets are a main research target, because they can reuse already discovered information to improve mining efficiency on new data sets. However, current incremental update algorithms are mostly based on the APRIORI framework and achieve only limited efficiency improvements. Some recent incremental update algorithms are based on FP-TREE and similar data structures, but they suffer from other problems such as inefficient adjustment of the tree structure and inefficient storage of the tree and of the discovered frequent itemsets, so their performance still needs to be improved. Therefore, after analyzing the key information in the incremental update mining process, this paper presents a novel incremental FP_GROWTH algorithm based on disk-resident 1-itemset counting (IU_FPGROWTH_1COUNTING). The algorithm does not need to store temporary tree structures or temporary mining results. When the original data set and the support threshold change at the same time, IU_FPGROWTH_1COUNTING improves the mining efficiency of frequent itemsets by reducing the number of data set scans required by FP_GROWTH, which are very time-consuming. Verification experiments and performance analysis on a synthetic data set and a real data set are presented. The experimental results show that IU_FPGROWTH_1COUNTING is an effective incremental update algorithm for frequent itemsets.
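A minimal sketch of the 1-itemset counting idea the abstract builds on: keep per-item counts that can be folded forward when new transactions arrive, so frequent 1-itemsets under a changed support threshold are obtained without rescanning the whole data set. This is only the counting building block, not the paper's full IU_FPGROWTH_1COUNTING algorithm, and the disk-resident store is stood in for by a plain dictionary.

```python
from collections import Counter

def update_item_counts(counts, new_transactions):
    """Incrementally fold a batch of new transactions into the stored
    1-itemset counts (kept on disk in the paper; a Counter stands in)."""
    for tx in new_transactions:
        counts.update(set(tx))          # count each item once per transaction
    return counts

def frequent_items(counts, n_transactions, min_support):
    """Frequent 1-itemsets under a (possibly changed) support threshold,
    computed from the stored counts without a full data set scan."""
    threshold = min_support * n_transactions
    return {item for item, c in counts.items() if c >= threshold}

# Toy usage: counts from 100 old transactions, plus 3 new ones,
# then a changed support threshold of 0.3.
counts = Counter({"a": 40, "b": 35, "c": 5})
counts = update_item_counts(counts, [["a", "c"], ["b", "c"], ["a", "b"]])
print(frequent_items(counts, n_transactions=103, min_support=0.3))
```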
A Heuristic Two-layer Reinforcement Learning Algorithm Based on BP Neural Networks
Liu Zhibin, Zeng Xiaoqin, Liu Huiyi, Chu Rong
2015, 52(3):  579-587.  doi:10.7544/issn1000-1239.2015.20131270
Reinforcement learning is a promising approach for an agent to learn from repeated interaction with its environment. However, it is bedeviled by the curse of dimensionality, so it can hardly be applied to large-scale problems due to its low efficiency. Embedding static prior knowledge can improve the learning performance of reinforcement learning, but inappropriate knowledge often misguides the learning process or reduces the learning speed. In this paper, an online heuristic two-layer reinforcement learning algorithm based on BP neural networks, named NNH-QL, is proposed to avoid the blindness and limitations of previous learning methods. The top layer, serving as a reward shaping function, is constituted by BP neural networks. Through shaping, this top layer provides dynamically acquired online knowledge to guide the table-based Q-learning in the lower layer. In order to improve the learning efficiency of the shaping layer, eligibility traces are incorporated into the BP neural network training process. The NNH-QL method combines the flexibility of standard Q-learning with the generalization ability of BP neural networks, and offers a feasible way to solve reinforcement learning problems with larger state spaces. For testing, the NNH-QL algorithm is applied to an optimal path search problem. The results show that the algorithm can improve the learning performance and accelerate the learning process noticeably.
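To make the two-layer idea concrete, here is a rough sketch of tabular Q-learning whose reward is augmented by a potential-based shaping term. In NNH-QL the shaping layer is a BP neural network trained online; in this sketch any callable `phi` stands in for that layer, and the toy `LineWorld` environment is purely illustrative.

```python
import random
from collections import defaultdict

class LineWorld:
    """Toy 1-D corridor: states 0..n-1, goal at n-1, reward 1 at the goal."""
    def __init__(self, n=10):
        self.n, self.actions = n, [-1, +1]
    def reset(self):
        return 0
    def step(self, s, a):
        s2 = min(max(s + a, 0), self.n - 1)
        done = s2 == self.n - 1
        return s2, (1.0 if done else 0.0), done

def q_learning_with_shaping(env, phi, episodes=200, alpha=0.1,
                            gamma=0.95, epsilon=0.1):
    """Tabular Q-learning where the immediate reward is augmented by the
    potential-based shaping term F = gamma*phi(s') - phi(s).  `phi` is a
    stand-in heuristic, not the paper's BP network."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(s, a)
            shaped = r + gamma * phi(s2) - phi(s)            # shaping term
            best_next = 0.0 if done else max(Q[(s2, act)] for act in env.actions)
            Q[(s, a)] += alpha * (shaped + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

env = LineWorld()
Q = q_learning_with_shaping(env, phi=lambda s: s / (env.n - 1))
print(Q[(0, +1)] > Q[(0, -1)])   # shaping should favor moving toward the goal
```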
A Method of Computing Minimal Hitting Sets Using CSP
Wang Yiyuan, Ouyang Dantong, Zhang Liming, Zhang Yonggang
2015, 52(3):  588-595.  doi:10.7544/issn1000-1239.2015.20131478
Model-based diagnosis (MBD) is an important challenge and touchstone for artificial intelligence (AI) and plays an important role in studying the whole field of AI, since it reveals many AI technical problems. In MBD, candidate diagnoses are generally described by all minimal hitting sets of all minimal conflict sets, and computing the minimal hitting sets is one of the core problems in this process. In addition, many practical problems, such as the student course selection problem, can be converted into minimal hitting set problems. In this paper, a new method is proposed that converts minimal hitting set problems into constraint satisfaction problems (CSPs) and then calls a state-of-the-art CSP solver to compute them, which extends the application areas of constraint satisfaction problems. Moreover, the concepts of hard-conflict sets and soft-conflict sets are proposed for the first time. The new method is then applied to compute minimal hitting sets with certain features: shorter than a fixed length, not including specific elements, and including hard-conflict sets and soft-conflict sets. Compared with an efficient existing algorithm, experimental results show that the newly proposed method has several advantages: it is easy to implement, highly scalable, and efficient for some types of minimal hitting sets.
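For readers unfamiliar with the underlying notion, the following sketch enumerates minimal hitting sets of a small family of conflict sets by brute force. It only illustrates the hitting-set condition (every conflict set must be intersected) and the length bound mentioned in the abstract; the paper's CSP encoding and solver are not reproduced here.

```python
from itertools import combinations

def is_hitting_set(candidate, conflict_sets):
    """A hitting set must intersect every conflict set."""
    return all(candidate & c for c in conflict_sets)

def minimal_hitting_sets(conflict_sets, max_len=3):
    """Enumerate hitting sets of size <= max_len and keep only the minimal
    ones (no proper subset is itself a hitting set).  Brute force stands in
    for the CSP-based computation described in the paper."""
    universe = sorted(set().union(*conflict_sets))
    results = []
    for k in range(1, max_len + 1):
        for combo in combinations(universe, k):
            s = set(combo)
            if is_hitting_set(s, conflict_sets) and \
               not any(r < s for r in results):
                results.append(s)
    return results

# Minimal conflict sets of a tiny diagnosis instance (illustrative only).
print(minimal_hitting_sets([{"a", "b"}, {"b", "c"}, {"a", "c"}]))
```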
Patch Matching with Global Spatial Constraints for Person Re-Identification
Chen Puqiang, Guo Lijun, Zhang Rong, Zhao Jieyu
2015, 52(3):  596-605.  doi:10.7544/issn1000-1239.2015.20131481
Recognizing a target person across multiple non-overlapping camera views is the problem of person re-identification. Existing methods mostly extract human appearance features and re-identify target pedestrians through feature similarity. For pedestrians who look very similar overall and differ only in small regions, these methods still cannot give accurate recognition results. In this article, we observe that the pedestrians to be recognized are almost always in a standing posture, and that the vertical structure of images of the same pedestrian is more consistent than that of different pedestrians. Therefore, on the basis of dense patch matching, we propose a matching method with spatial constraints (SCM), which not only performs local patch matching between two images but also constrains each patch in the vertical direction. In order to reduce the negative impact of the background on identification, we adopt a pose estimation method to roughly extract the foreground of the human body. In our experiments, the proposed approach has been tested on the challenging public VIPeR and CUHK02 databases, and the results show that it achieves the best recognition results reported so far.
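A rough sketch of vertically constrained patch matching, assuming each image is already represented as a grid of patch descriptors: every probe patch searches for its nearest gallery patch only within a small vertical band. The descriptors, distance and scoring below are illustrative stand-ins, not the paper's features or SCM scoring.

```python
import numpy as np

def constrained_patch_match(probe, gallery, band=1):
    """For every patch feature in the probe (rows x cols x dim), search for
    its nearest gallery patch only within +/- `band` rows, reflecting the
    assumption that standing pedestrians keep their vertical structure
    across views.  Returns the mean matching distance (smaller means the
    two images are more likely the same person)."""
    rows, cols, _ = probe.shape
    total = 0.0
    for i in range(rows):
        lo, hi = max(0, i - band), min(rows, i + band + 1)
        candidates = gallery[lo:hi].reshape(-1, gallery.shape[-1])
        for j in range(cols):
            d = np.linalg.norm(candidates - probe[i, j], axis=1)
            total += d.min()
    return total / (rows * cols)

# Toy usage with random patch descriptors.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(8, 4, 16)), rng.normal(size=(8, 4, 16))
print(constrained_patch_match(a, b, band=1))
```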
Graph Regularized Semi-Supervised Learning on Heterogeneous Information Networks
Liu Yufeng, Li Renfa
2015, 52(3):  606-613.  doi:10.7544/issn1000-1239.2015.20131147
Heterogeneous information networks, composed of multiple types of objects and links, are ubiquitous in real life, so effective analysis of large-scale heterogeneous information networks poses an interesting but critical challenge. Learning from labeled and unlabeled data via semi-supervised classification can lead to good knowledge extraction of the hidden network structure. However, although semi-supervised learning on homogeneous networks has been studied for decades, classification on heterogeneous networks has not been explored until recently. In this paper, we consider the semi-supervised classification problem on heterogeneous information networks with an arbitrary schema consisting of a number of object and link types. By applying graph regularization to preserve consistency over the relation graph corresponding to each type of link separately, we develop a classifying function that is sufficiently smooth with respect to the intrinsic structure collectively revealed by the labeled and unlabeled points. We propose an iterative framework on heterogeneous information networks in which the information of the labeled data is spread to adjacent nodes iteratively until a steady state is reached, and we infer the class memberships of unlabeled data from those of labeled data according to their proximities in the network. Experiments on the real DBLP data set clearly show that our approach outperforms classic semi-supervised learning methods.
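The single-graph building block behind such graph-regularized methods is the familiar label propagation iteration F <- alpha * S * F + (1 - alpha) * Y with a normalized affinity matrix S. The sketch below shows that update on one relation graph; the paper couples one such regularizer per link type of the heterogeneous network, which is not reproduced here.

```python
import numpy as np

def propagate_labels(W, Y, alpha=0.9, iters=100):
    """Graph-regularized label propagation on a single relation graph:
    iterate F <- alpha * S @ F + (1 - alpha) * Y, where
    S = D^{-1/2} W D^{-1/2} is the symmetrically normalized affinity."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    F = Y.copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y
    return F.argmax(axis=1)          # predicted class per node

# Toy usage: 4 nodes, the first two labeled with classes 0 and 1.
W = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], float)
Y = np.array([[1, 0], [0, 1], [0, 0], [0, 0]], float)
print(propagate_labels(W, Y))
```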
Learning Behavior Analysis and Prediction Based on MOOC Data
Jiang Zhuoxuan, Zhang Yan, Li Xiaoming
2015, 52(3):  614-628.  doi:10.7544/issn1000-1239.2015.20140491
With the booming of MOOCs (massive open online courses) in the past two years, educational data analysis has become a promising research field in which the quality of teaching and learning can be, and is being, quantified to improve educational effectiveness and even to promote modern higher education. In the autumn of 2013, Peking University released its first six courses on the Coursera platform. Through mining and analyzing the massive learning behavior data of over 80000 participants in these courses, this paper endeavors to reveal several aspects of learning activity in MOOCs. Meanwhile, according to the characteristics of learning behavior in Chinese MOOCs, learners are classified into several groups, and the relationship between their learning behavior and performance is thoroughly studied. Based on the above work, we find that a learner's performance, namely whether he or she eventually gets certificated, can be predicted from several features of learning behavior. Experimental results indicate that these features can be used to train a model that effectively estimates whether a learner is likely to complete the course successfully. Besides, this method has the potential to partially evaluate the quality of both teaching and learning in practice.
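As a rough sketch of the kind of prediction pipeline the abstract describes, the snippet below trains a plain classifier on behavior features to predict certification. Both the feature set (videos watched, forum posts, quizzes submitted, active days) and the synthetic data are hypothetical stand-ins, and logistic regression is just one convenient choice of model, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical per-learner behavior features (not the paper's feature set):
# videos watched, forum posts, quizzes submitted, active days.
rng = np.random.default_rng(1)
X = rng.poisson(lam=[20, 3, 8, 30], size=(500, 4)).astype(float)
# Synthetic "got certificate" label: more activity -> more likely certified.
y = (X @ np.array([0.05, 0.2, 0.1, 0.04]) + rng.normal(size=500) > 3.5).astype(int)

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5).mean())   # rough predictive accuracy
```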
Cross-Domain Text Sentiment Classification Based on Grouping-AdaBoost Ensemble
Zhao Chuanjun, Wang Suge, Li Deyu, Li Xin
2015, 52(3):  629-638.  doi:10.7544/issn1000-1239.2015.20140156
In cross-domain sentiment classification, labeled data in the target domain is often scarce and precious. To solve this problem, this paper proposes a grouping-AdaBoost ensemble classifier method that comprehensively uses the strategies and techniques of semi-supervised learning, bootstrapping, data grouping, AdaBoost, and ensemble learning. Firstly, we use a small amount of labeled data in the target domain to generate a number of virtual examples with the synthetic minority over-sampling technique (SMOTE). On this basis, more target-domain data with high-confidence labels is obtained by bootstrapping. For classifier construction, we first partition the labeled data in the source domain into equal-sized groups and combine each group with the labeled data in the target domain to form a combined data set. A classifier is trained on each combined data set and then boosted with AdaBoost. Finally, the classifiers corresponding to the combined data sets are linearly integrated into an ensemble classifier. The experimental results on four data sets from the Amazon online shopping reviews corpus indicate that the proposed method can effectively improve the accuracy of cross-domain sentiment classification.
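A minimal sketch of the grouping-plus-AdaBoost step: the source-domain labeled data is split into equal-sized groups, each group is joined with the small target-domain labeled set, one AdaBoost classifier is trained per group, and their predicted probabilities are averaged as a simple linear ensemble. The paper's SMOTE and bootstrapping stages are omitted, and all data below is random for illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def grouping_adaboost(X_src, y_src, X_tgt, y_tgt, n_groups=3, seed=0):
    """Train one AdaBoost classifier per (source group + target data) set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_src))
    models = []
    for part in np.array_split(idx, n_groups):
        X = np.vstack([X_src[part], X_tgt])
        y = np.concatenate([y_src[part], y_tgt])
        models.append(AdaBoostClassifier(n_estimators=50).fit(X, y))
    return models

def ensemble_predict(models, X):
    """Linear (averaging) ensemble of the per-group classifiers."""
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return probs.argmax(axis=1)

# Toy usage with random feature vectors and binary sentiment labels.
Xs, ys = np.random.randn(60, 5), np.random.randint(0, 2, 60)
Xt, yt = np.random.randn(10, 5), np.random.randint(0, 2, 10)
print(ensemble_predict(grouping_adaboost(Xs, ys, Xt, yt), Xt))
```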
Deceptive Reviews Detection Based on Positive and Unlabeled Learning
Ren Yafeng, Ji Donghong, Zhang Hongbin, Yin Lan
2015, 52(3):  639-648.  doi:10.7544/issn1000-1239.2015.20131473
Identifying deceptive reviews has important theoretical meaning and practical value, while previous work has focused on heuristic rules or traditional supervised methods. Recent research has shown that humans cannot reliably identify deceptive reviews using their prior knowledge, so human-annotated datasets inevitably contain mislabeled examples. Due to the difficulty of the human labeling needed for supervised learning, the problem remains highly challenging. There are some ambiguous reviews (we call them spy examples) that are easily mislabeled, and the key to identifying deceptive reviews is how to deal with these spy reviews. Based on some truthful reviews and a large amount of unlabeled reviews, a novel approach, called mixed population-and-individual-nature PU learning, is proposed. Firstly, some reliable negative examples are identified from the unlabeled dataset. Secondly, representative positive and negative examples are generated by integrating latent Dirichlet allocation and K-means. Thirdly, all spy examples are clustered into groups with a Dirichlet process mixture model, and two schemes (population nature and individual nature) are combined to determine the category labels of the spy examples. Finally, multiple kernel learning is used to build the final classifier. Experimental results demonstrate that the proposed methods can effectively identify deceptive reviews and outperform the current baselines.
A Latent-Citation-Network Based Patent Value Evaluation Method
Feng Ling, Peng Zhiyong, Liu Bin, Che Dunren
2015, 52(3):  649-660.  doi:10.7544/issn1000-1239.2015.20131424
Patent value refers to the exchange value of a patent in purchase and transaction. It can provide precious information for the decision-making of patent owners and buyers. Existing patent value evaluation methods are mainly based on training or on citation analysis. However, the training-based methods depend too much on experimental parameters, which results in weak credibility. On the other hand, the methods based on citation analysis only consider direct citations during evaluation, leaving indirect citations and the novelty of a patent neglected. For these reasons, this paper presents a latent-citation-network based patent value evaluation method, in which direct citations, indirect citations and novelty are all considered. First, latent citation associations are discovered using the similarity between patents, and a latent citation network is established. Then, a basic algorithm is implemented to effectively evaluate the value of each patent on the network. Further, an improved algorithm is proposed to solve the inefficiency of the basic algorithm. Finally, to handle the value variations caused by the arrival of a new patent, a dynamic patent value evaluation algorithm is designed to efficiently update the values of the original patents. As shown in the experiments, the methods proposed in this paper are effective.
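A rough sketch of the two ingredients named in the abstract: a latent citation network built by thresholding pairwise patent similarity, and a PageRank-style propagation over it so that value flows through both direct and indirect (multi-hop) links. The threshold, damping factor and propagation rule are stand-ins, not the paper's evaluation algorithm.

```python
import numpy as np

def latent_citation_values(sim, threshold=0.5, damping=0.85, iters=100):
    """Link patent pairs whose similarity exceeds `threshold`, then score
    patents with a PageRank-style iteration over the resulting network."""
    n = sim.shape[0]
    adj = (sim >= threshold).astype(float)
    np.fill_diagonal(adj, 0.0)
    out_deg = adj.sum(axis=1, keepdims=True)
    # Column-stochastic transition matrix; dangling nodes spread uniformly.
    trans = np.where(out_deg > 0, adj / np.maximum(out_deg, 1), 1.0 / n).T
    v = np.full(n, 1.0 / n)
    for _ in range(iters):
        v = damping * trans @ v + (1 - damping) / n
    return v

# Toy usage: a 4-patent similarity matrix.
sim = np.array([[1.0, 0.8, 0.2, 0.6],
                [0.8, 1.0, 0.7, 0.1],
                [0.2, 0.7, 1.0, 0.3],
                [0.6, 0.1, 0.3, 1.0]])
print(latent_citation_values(sim))
```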
TH-award: A Performance and Security Test Platform for Wireless Multi-Hop Networks
Zhu Lin, Meng Kun, Lin Chuang, Xu Kun
2015, 52(3):  661-670.  doi:10.7544/issn1000-1239.2015.20131320
Wireless multi-hop network technology paves the way for ad hoc communication among nodes and facilitates users' access to the Internet. However, due to the nature of wireless media and the distributed operation mode, wireless multi-hop networks face more performance and security challenges, and designing efficient protocols is one of the main research objectives in this field. To reduce the reliance on simulation in protocol design, this paper proposes an integrated platform, the testbed for high-level analysis of wireless ad-hoc routing design (TH-award), which helps to test and analyze the performance and security of designed protocols by implementing them on the platform. TH-award consists of commercial routers, servers and embedded development boards. Based on an analysis of the working mechanisms of protocols, TH-award is designed with a protocol rule library, a parameter library, and a test and analysis library, which allow users not only to customize and deploy their designed protocols, but also to easily obtain results on their performance and security. In the end, we conduct comprehensive experiments on TH-award to verify its feasibility.
A Selection Method for User Authentication Protocols in Wireless Networks
Zhao Jing, Li Xin, Deng Lingjuan, Li Xinghua, Ma Jianfeng
2015, 52(3):  671-680.  doi:10.7544/issn1000-1239.2015.20131376
Generally, in wireless networks there are a certain number of candidate user authentication protocols to select from, and how to select one that fulfills a user's personalized requirements is an unsolved problem. There are already some studies on authentication protocols, but most of them take the protocol designers' perspective. To the best of our knowledge, this paper is the first to study how to select authentication protocols in wireless networks while considering the user's personalized requirements. From the perspective of users, we propose a solution that guides users to select among different authentication protocols according to their personalized requirements, taking into account the factors users are most concerned about, e.g., security, energy consumption, authentication delay and their preference. The energy consumption is defined as the sum of the energy consumed by the user in transmitting and receiving messages and in the cryptographic operations involved in the interaction, including the Hash algorithm, RSA key exchange, digital signature, and symmetric encryption and decryption. Applying our solution to the EAP protocols in WLAN, we evaluate the security and performance of EAP-PEAP, EAP-TLS, EAP-TTLS/MD5 and EAP-TTLS/MSCHAPV2. The results show that, regardless of how users set the weights, EAP-TTLS/MSCHAPV2 and EAP-TTLS/MD5 are always better than EAP-PEAP and EAP-TLS.
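A plausible sketch of a user-preference-driven selection of this kind: normalize each criterion (security higher is better; energy and delay lower are better), apply user-chosen weights, and rank the candidates. The scoring rule and all per-protocol numbers below are hypothetical placeholders, not measurements or formulas from the paper.

```python
def rank_protocols(metrics, weights):
    """Score candidate protocols by a weighted sum of normalized criteria
    and return them ranked from best to worst."""
    crit = ["security", "energy", "delay"]
    lows = {c: min(m[c] for m in metrics.values()) for c in crit}
    highs = {c: max(m[c] for m in metrics.values()) for c in crit}

    def norm(c, v):
        span = highs[c] - lows[c] or 1.0
        x = (v - lows[c]) / span
        return x if c == "security" else 1.0 - x   # invert cost criteria

    scores = {p: sum(weights[c] * norm(c, m[c]) for c in crit)
              for p, m in metrics.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

metrics = {  # hypothetical per-protocol figures, for illustration only
    "EAP-TLS":           {"security": 5, "energy": 9.0, "delay": 120},
    "EAP-PEAP":          {"security": 4, "energy": 7.5, "delay": 100},
    "EAP-TTLS/MD5":      {"security": 3, "energy": 4.0, "delay": 60},
    "EAP-TTLS/MSCHAPV2": {"security": 4, "energy": 4.5, "delay": 70},
}
print(rank_protocols(metrics, {"security": 0.5, "energy": 0.3, "delay": 0.2}))
```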
Multi-Stride Regular Expression Matching Using Parallel Character Index
Ding Linxuan, Huang Kun, Zhang Dafang
2015, 52(3):  681-690.  doi:10.7544/issn1000-1239.2015.20131255
Deep packet inspection (DPI) is a key function of network intrusion detection and prevention systems (NIDPS). TCAM-based regular expression matching algorithms have been proposed as a promising approach to improve processing speed, and they are an important research direction of DPI. Ternary content addressable memory (TCAM) has the characteristics of high search speed and small storage space, and its power consumption is proportional to the storage space used. A deterministic finite automaton (DFA) requires large storage space, and the storage space of a multi-stride DFA grows exponentially with the stride, which leads to high TCAM power consumption, especially for multi-stride DFAs. This paper presents a parallel character-indexed multi-stride regular expression matching algorithm to address this limitation. The algorithm builds parallel character indexes according to the stride of the DFA and reduces the number of activated TCAM blocks by using bitmap intersection, which in turn translates into low TCAM power consumption. Experimental results show that our algorithm reduces the TCAM power by more than 99.8% and the TCAM space usage by 48.5%~65.3%, and improves the matching throughput by 1.9~2.6 times compared with previous solutions based on multi-stride DFAs.
Fully Secure Hierarchical Inner Product Encryption with Constant-Size Ciphertexts
Zhang Jie, Ge Aijun, Ma Chuangui
2015, 52(3):  691-701.  doi:10.7544/issn1000-1239.2015.20131413
Inner product encryption (IPE) is a concrete construction of predicate encryption, which represents a wide class of predicates including equality tests (for identity-based encryption and hidden vector encryption) and disjunctions or conjunctions of equality tests (for attribute-based encryption). Hierarchical inner product encryption (HIPE) provides delegation capability for inner product encryption, and it can effectively reduce the workload of the root node of the system. Aiming at the efficiency problem in hierarchical inner product encryption, we present a short-ciphertext IPE scheme with full security under asymmetric bilinear pairings. Using this IPE scheme as a building block, we then present a new HIPE scheme with a new technique for dual system encryption. The new realization of dual system encryption does not use tags, which makes the compression of ciphertexts possible. The proposed HIPE scheme achieves constant-size ciphertexts and full security in the standard model. Security is proven under three static assumptions whose size does not depend on the number of queries. Furthermore, our scheme achieves lower computational cost because decryption only needs seven pairing operations. Compared with other existing schemes, our scheme is more compact to implement and provides better efficiency in terms of communication and computation cost.
Light-Weight Integrity Monitoring Based on Hashing Time Validity
Xu Qingui, Qin Yong, Yang Taolan
2015, 52(3):  702-717.  doi:10.7544/issn1000-1239.2015.20131382
Real-time monitoring of node integrity is an effective means to protect resource-constrained nodes. By identifying the main tampering attack modes against resource-constrained nodes and analyzing their influence on hashing time, a pure-software integrity monitoring approach based on inspecting hashing time validity is suggested. On the basis of analyzing the testability condition of hashing time validity, a checksum-forging punishment coefficient is proposed to indicate the tamper-resisting performance of the monitoring mechanism, and a light-weight hashing algorithm that merges program states is put forward. By simplifying the hashing structure and integrating program states into the checksum, checksum forging is made more difficult. Damaged nodes have to spend much more time on extra work, such as restoring legal code and program states, than on hashing if they want to acquire the correct checksum. Hence, the proposed mechanism imposes a much greater checksum-forging punishment on damaged nodes than other approaches such as SWATT and Shah. In order to prevent message forging or tampering during transmission over communication networks, a monitoring protocol supporting message authentication is designed. To tolerate the influence of hashing time fluctuation and checksum guessing, the node integrity state is evaluated from the results of both checksum comparison and hashing time validity statistics. The experiments show that the proposed approach achieves high reliability in examining the validity of checksums and hashing time at small cost. The ability to tolerate fluctuation disturbances on hashing time from the node's multi-tasking environment and communication networks is improved, and hence the tamper-resisting performance of resource-constrained nodes is enhanced.
Anomaly Intrusion Behavior Detection Based on Fuzzy Clustering and Features Selection
Tang Chenghua, Liu Pengcheng, Tang Shensheng, Xie Yi
2015, 52(3):  718-728.  doi:10.7544/issn1000-1239.2015.20130601
The behaviors of network attack connections are changeable and complex, and typical behavior mining methods based on traditional clustering are not well suited for constructing anomaly intrusion detection models. According to the characteristics of network attacks, an anomaly intrusion detection model based on fuzzy clustering and feature selection is proposed. Firstly, the sensitivity of the fuzzy C-means (FCM) clustering algorithm to the initial cluster centers is alleviated using a hierarchical clustering algorithm, the tendency of FCM to fall into local optima during iteration is overcome using the global search ability of a genetic algorithm, and the two are combined into an AGFCM algorithm. Secondly, the feature attributes of network attack connections are ranked by the information gain algorithm, and the number of feature attributes to retain is determined by using the Youden index to cut the data sets. Lastly, the anomaly intrusion detection model is built from the dimensionality-reduced attribute data sets and the improved FCM clustering algorithm. Experimental results show that the anomaly intrusion detection model can effectively detect the vast majority of network attack types, which provides a feasible solution to the problems of false alarm rate and detection rate in anomaly intrusion detection.
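For reference, the plain fuzzy C-means iteration that AGFCM improves on alternates the standard membership and center updates shown below. This sketch uses random initialization; the paper seeds the centers with hierarchical clustering and refines them with a genetic algorithm, neither of which is reproduced here.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100, seed=0):
    """Plain FCM: alternate the membership update
    u_ik = d_ik^{-2/(m-1)} / sum_j d_jk^{-2/(m-1)} and the center update
    v_i = sum_k u_ik^m x_k / sum_k u_ik^m."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                       # memberships sum to 1 per point
    for _ in range(iters):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)          # centers
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)) * np.sum(d ** (-2 / (m - 1)), axis=0))
    return U, V

# Toy usage: two well-separated 2-D clusters.
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 4])
U, V = fuzzy_c_means(X, c=2)
print(V)   # two cluster centers, roughly (0,0) and (4,4)
```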
Recovering Traceability Links Using Syntactic Analysis
Wang Jinshui, Weng Wei, Peng Xin
2015, 52(3):  729-737.  doi:10.7544/issn1000-1239.2015.20131308
Software requirement traceability has been widely recognized as a key factor affecting the success of software projects. The generated traceability links are vital to many software engineering and software verification and validation (V&V) activities such as change impact analysis, software reuse and consistency checking. Since most existing requirement traceability approaches based on information retrieval are strongly affected by the quality of the documentation of different types of software artifacts, this paper presents a dynamic requirement traceability approach based on syntactic analysis. The proposed approach is able to extract the terms that are most likely to characterize a text-based software artifact, such as source code or a requirement document, and thus reduces the adverse effect of noise in artifacts on the requirement tracing process. In order to verify the effectiveness of the proposed approach, we compare the quality of the trace links produced by several dynamic requirement traceability approaches on three open source software systems and six types of software artifacts. The results suggest that the dynamic requirement traceability approach based on syntactic analysis can effectively improve the accuracy of the produced trace links.
Processing String Similarity Search in External Memory Efficiently
Wang Jinbao, Gao Hong, Li Jianzhong, Yang Donghua
2015, 52(3):  738-748.  doi:10.7544/issn1000-1239.2015.20130683
String similarity search is fundamental to various applications such as data cleaning, spell checking, bioinformatics and information integration, since users tend to issue fuzzy queries in these applications due to input errors in both queried strings and data strings. With the rapid growth of data size, big string datasets have become ubiquitous, and almost every modern information system stores data in string format, so string similarity search should be well supported. Memory-based q-gram inverted indexes fail to support string similarity search over big string datasets due to the memory constraint, and they no longer work once the data size grows larger than the memory size. The existing external memory method, Behm-Index, only supports the length filter and the prefix filter. This paper proposes LPA-Index, a disk-resident index whose data size is not limited by the memory size, to reduce I/O cost for better query response time. LPA-Index supports both the length filter and the positional filter, which prune query candidates efficiently, and it selectively reads inverted lists during query processing for better I/O performance. Experimental results on both real datasets and a synthetic dataset demonstrate the efficiency of LPA-Index and its advantages over the existing state-of-the-art disk index Behm-Index with regard to I/O cost and query response time.
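To make the two filters concrete, the sketch below applies the standard length filter and positional q-gram filter for an edit-distance threshold tau: lengths may differ by at most tau, and matching q-grams must start at positions that differ by at most tau, with the usual count bound on how many q-grams must match. This is a plain-Python illustration of the filtering idea, not LPA-Index or its I/O strategy.

```python
def positional_qgrams(s, q=2):
    """Positional q-grams of a string: (start position, q-gram)."""
    return [(i, s[i:i + q]) for i in range(len(s) - q + 1)]

def survives_filters(query, cand, tau=1, q=2):
    """Length filter plus positional count filter for edit distance <= tau.
    Strings passing both filters remain candidates for exact verification."""
    if abs(len(query) - len(cand)) > tau:
        return False
    cand_grams = positional_qgrams(cand, q)
    matches = sum(
        any(g == cg and abs(i - ci) <= tau for ci, cg in cand_grams)
        for i, g in positional_qgrams(query, q))
    # Count filter: at least this many q-grams must match when dist <= tau.
    needed = max(len(query), len(cand)) - q + 1 - q * tau
    return matches >= needed

print(survives_filters("similarity", "simularity"))   # True, one edit away
print(survives_filters("similarity", "dissimilar"))   # False
```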
Strong Neighborhood Pair Query in Dynamic Dataset
Li Song, Zhang Liping, Hao Zhongxiao
2015, 52(3):  749-759.  doi:10.7544/issn1000-1239.2015.20131390
The strong neighborhood pair (SNP) query has important significance in spatial data mining, big data processing, spatial databases, geographic information systems, and similarity analysis and reasoning over data. To remedy the deficiencies of existing work, and according to the characteristics and complexity of the strong neighborhood pair query in dynamic datasets, a strong neighborhood pair query algorithm over two data sets (VR_SNP) is proposed based on the Voronoi diagram and the R-tree. When the regions are irregular and the density of the points is uneven, the Voronoi diagram is used to answer the strong neighborhood pair query; otherwise, the R-tree is used. Based on a secondary computation and filtering of the initial neighborhood pair set, the VR_SNP_DA and VR_SNP_DE algorithms are given to handle the strong neighborhood pair query as the data sets grow and shrink dynamically, respectively. Furthermore, the VR_SNP_DL algorithm for queries over moving objects is studied. The theoretical study and the experimental results show that the algorithms have great advantages when the scale of the datasets is large and the positions of the points change frequently.
Fault-Tolerant Real-Time Scheduling Algorithm with Pre-Allocation in Primary/Alternate Model
Liu Xian, Guo Ruifeng, Deng Changyi
2015, 52(3):  760-768.  doi:10.7544/issn1000-1239.2015.20130677
A real-time system is supposed to provide fault tolerance so that it keeps meeting its stringent timeliness and reliability requirements in the presence of faults. The primary/alternate model is an effective method for enhancing the capability to tolerate faults. Most fault-tolerant real-time scheduling algorithms based on this model reserve time intervals for the alternates to provide software fault tolerance. However, these reserved time intervals need to be modified when their corresponding primaries complete successfully at run time, thereby increasing the scheduling overhead. To address this problem, a new fault-tolerant real-time scheduling algorithm, termed BCE*, is proposed in this paper. The major contribution of the algorithm is a new pre-allocation scheme that imposes restrictions on the preemption of higher priority tasks while reserving time intervals for alternates. The theoretical performance analysis presented in the paper demonstrates that the proposed algorithm avoids the modification of reserved time intervals while keeping the system schedulable. Finally, simulation experiments show that the proposed approach can effectively reduce the scheduling overhead of the system compared with the classic BCE algorithm, and the reduction is significant when both the processor utilization of the primaries and the software fault probability are low.
Balanced k-Way Partitioning for Weighted Graphs
Zheng Lili, Wu Jigang, Chen Yong, Zhu Meixia
2015, 52(3):  769-776.  doi:10.7544/issn1000-1239.2015.20131508
Balanced k-way partitioning of a weighted graph divides the vertex set of the graph into k disjoint subsets so as to minimize the difference of the sums of vertex weights between subsets, together with minimizing the sum of the weights of the edges whose incident vertices belong to different subsets. This k-way partitioning problem is widely applied in areas such as hardware/software co-design, VLSI design, and data partitioning, and it has been proved to be NP-complete. An efficient heuristic algorithm is proposed to generate a good approximate k-way partition by maximizing the sum of the weights of the edges inside the subsets while keeping the partition relatively balanced. In detail, the proposed heuristic constructs a subset S by repeatedly selecting, from its candidate set, a group of neighboring vertices with the highest gain for inclusion in S, until the sum of vertex weights of S meets the demand. Moreover, we customize an approach based on tabu search to refine the heuristic partition. Experimental results show that the proposed algorithm produces better average solutions than the state-of-the-art on 86%, 81% and 68% of the graphs in the public benchmark for k=2, 4, 8, respectively, with improvements of up to 60%.
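A simplified sketch of the construction step described above: grow one subset S by repeatedly pulling in the unassigned neighbor with the highest gain, here taken as (edge weight into S) minus (edge weight to vertices outside S), until the vertex-weight budget is met. The seed choice and gain definition are plausible assumptions for illustration; the paper's candidate-set bookkeeping and tabu-search refinement are omitted.

```python
def grow_subset(vertices, vweight, edges, target_weight):
    """Greedy growth of one subset S for balanced k-way partitioning."""
    adj = {v: {} for v in vertices}
    for u, v, w in edges:
        adj[u][v] = adj[v][u] = w
    # Seed with the vertex of largest weighted degree (an arbitrary choice).
    seed = max(vertices, key=lambda v: sum(adj[v].values()))
    S, weight = {seed}, vweight[seed]
    while weight < target_weight:
        candidates = {v for s in S for v in adj[s]} - S
        if not candidates:
            break
        gain = lambda v: (sum(w for u, w in adj[v].items() if u in S)
                          - sum(w for u, w in adj[v].items() if u not in S))
        best = max(candidates, key=gain)
        S.add(best)
        weight += vweight[best]
    return S

# Toy usage: unit vertex weights, grow a subset of total weight 3.
verts = ["a", "b", "c", "d", "e"]
vw = {v: 1 for v in verts}
es = [("a", "b", 3), ("b", "c", 3), ("c", "a", 2), ("c", "d", 1), ("d", "e", 4)]
print(grow_subset(verts, vw, es, target_weight=3))   # e.g. {'a', 'b', 'c'}
```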