ISSN 1000-1239 CN 11-1777/TP

Table of Contents

15 April 2014, Volume 51 Issue 4
Survey of Data Mining for Microblogs
Ding Zhaoyun, Jia Yan, and Zhou Bin
2014, 51(4):  691-706. 
The past few years have witnessed the rapid development and popularization of microblogs. Owing to their openness, support for mobile terminals, concise content, and low entry threshold, microblogs deeply affect our daily life by providing an important platform for publishing comments, disseminating information, and acquiring knowledge, to name just a few uses. Despite these advantages, microblogs may seriously harm national security and social development if they get out of control. Research on microblogs is therefore valuable from both theoretical and practical perspectives, especially in the Internet age. Analyzing and mining microblogs also poses great challenges. Microblogs can be viewed as a generalization and extension of human life into the virtual network world; however, unlike traditional information networks, they have unique characteristics, including noisy and diverse data, social-media properties, multiple relation types, rapid spread and evolution, nonlinearity, and large scale. These differences make analyzing and mining microblogs challenging. In this paper, we survey data mining for microblogs, analyze a Twitter dataset, and summarize the challenges of mining microblog data.
Fast Adaptive Clustering by Synchronization on Large Scale Datasets
Ying Wenhao, Xu Min, Wang Shitong, and Deng Zhaohong
2014, 51(4):  707-720. 
The existing synchronization clustering algorithm Sync regards each attribute of a sample as a phase oscillator in the synchronization process. As a result, the algorithm has high time complexity and cannot be applied effectively to large-scale datasets. To solve this problem, we propose a novel fast adaptive clustering algorithm, FAKCS. Firstly, FAKCS introduces a method based on RSDE and CCMEB techniques to extract representative samples from the original dataset. It then clusters adaptively, using the Davies-Bouldin cluster criterion together with a new order parameter that measures the degree of local synchronization. Moreover, we establish the relationship between the new order parameter and kernel density estimation (KDE), which reveals the probability-density nature of local synchronization. FAKCS can detect clusters of arbitrary shape, number and density on large-scale datasets without presetting the number of clusters. The effectiveness of the proposed method is demonstrated on image segmentation examples and in experiments on large UCI datasets.
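As a rough illustration of the synchronization idea behind Sync-style clustering, the sketch below lets each point drift toward its ε-neighbors under a Kuramoto-like coupling and tracks a local order parameter that approaches 1 as neighbors coincide. This is a generic sketch of the Sync dynamics, not the paper's FAKCS; the dataset, eps, and stopping threshold are illustrative assumptions.

```python
import numpy as np

def sync_step(X, eps):
    """One synchronization step: each point moves toward its eps-neighbors."""
    X_new = X.copy()
    r = np.zeros(len(X))
    for i, x in enumerate(X):
        d = np.linalg.norm(X - x, axis=1)
        nbr = X[d <= eps]                                 # always contains x itself
        X_new[i] = x + np.mean(np.sin(nbr - x), axis=0)   # Kuramoto-like coupling
        # local order parameter: 1 when all neighbors coincide with x
        r[i] = np.mean(np.exp(-np.linalg.norm(nbr - x, axis=1)))
    return X_new, r.mean()

rng = np.random.default_rng(0)
X = rng.random((200, 2))
for _ in range(100):                  # iterate until local synchronization
    X, r = sync_step(X, eps=0.25)
    if r > 0.999:
        break
clusters = {tuple(np.round(p, 3)) for p in X}   # synchronized points collapse together
print(f"{len(clusters)} clusters, order parameter {r:.4f}")
```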
An Adaptive and Optimized Negotiation Model Based on Bayesian Learning
Hou Wei, Dong Hongbin, and Yin Guisheng
2014, 51(4):  721-730. 
In complex automated negotiations, a challenging issue is how to design effective learning mechanisms for agents that must cope with incomplete information, where an agent does not know the opponent's private information (i.e., deadline, reservation offer, and issue weights) and such information may change over time. We present a time-dependent, bilateral, multi-issue optimized negotiation model that combines Bayesian learning with an evolutionary algorithm based on a mixed mutation strategy (BLMSEAN). The proposed model divides the opponent's possible reservation range into units and assigns each candidate reservation point a probability representing the likelihood that the true reservation point lies in the corresponding unit. A regression analysis measures the correlation between estimated and historical offers, and Bayes' rule updates the probabilities and the issue weights using the historical offers only. The evolutionary algorithm with the mixed mutation strategy enables the estimate to approximate the opponent's negotiation parameters more accurately and adjusts the concession strategy adaptively, improving the joint utility and the success rate of negotiation agreements for both partners. Empirical evaluation shows that the model enables an agent to learn the possible range of its opponent's private information and to alter its concession strategy adaptively.
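The Bayesian-learning step over candidate reservation points can be sketched as follows; this is a minimal illustration assuming a Gaussian likelihood and a simple linear concession predictor standing in for the paper's regression analysis (all names and constants are hypothetical).

```python
import numpy as np

def bayes_update(points, prior, observed_offer, predict, sigma=0.05):
    """Posterior over candidate reservation points given one observed offer.

    points  : candidate reservation points (hypotheses)
    prior   : current probability of each hypothesis
    predict : predict(points) -> offer expected under each hypothesis
    """
    # Gaussian likelihood of the observed offer under each hypothesis
    lik = np.exp(-0.5 * ((observed_offer - predict(points)) / sigma) ** 2)
    post = prior * lik
    return post / post.sum()

# candidate reservation prices with a uniform prior
points = np.linspace(0.2, 0.8, 13)
prior = np.full(len(points), 1.0 / len(points))

# assumed linear concession toward the reservation point, at normalized time 0.5
predict = lambda p: 1.0 - (1.0 - p) * 0.5
prior = bayes_update(points, prior, observed_offer=0.7, predict=predict)
print("most likely reservation point:", points[prior.argmax()])   # -> 0.4
```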
Concept Drift Detection for Data Streams Based on Mixture Model
Guo Gongde, Li Nan, and Chen Lifei
2014, 51(4):  731-742. 
Owing to applications in credit card fraud detection and many other fields, more and more scholars are paying attention to the classification of concept-drifting data streams. Most existing algorithms assume that the true labels of test instances become available right after classification, and use them to detect concept drift and adjust the current model. This assumption is impractical in the real world, because manually labeling instances that arrive continuously at high speed requires much time and effort. To address this problem, this paper proposes a concept drift detection method, KnnM-IB, based on the KNNModel algorithm and an incremental Bayes algorithm. The proposed method retains the virtue of KNNModel when classifying instances covered by the model clusters, while the incremental Bayes algorithm handles ambiguous instances and updates the model. Using changes in the window size together with a few labeled, most informative instances chosen by active learning, KnnM-IB can detect concept drift in data streams. Semi-supervised learning is also used to enlarge the set of labeled instances for model updates when the underlying concept of the data stream is stable. Experimental results show that, compared with traditional classification algorithms, the proposed method not only adapts to concept drift but also achieves comparable or better classification accuracy.
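The incremental Bayes component can be illustrated with a count-based naive Bayes that absorbs one labeled instance at a time instead of retraining from scratch; this is a generic sketch of incremental naive Bayes, not the paper's exact KnnM-IB update (the Laplace smoothing constant is an assumption).

```python
from collections import defaultdict

class IncrementalNB:
    """Categorical naive Bayes updated one labeled instance at a time."""
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feat_counts = defaultdict(int)   # (cls, feat_idx, value) -> count
        self.values = defaultdict(set)        # feat_idx -> values seen so far

    def update(self, x, y):                   # absorb one newly labeled instance
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.feat_counts[(y, i, v)] += 1
            self.values[i].add(v)

    def predict(self, x):
        n = sum(self.class_counts.values())
        best, best_p = None, -1.0
        for c, cc in self.class_counts.items():
            p = cc / n
            for i, v in enumerate(x):         # Laplace-smoothed likelihoods
                p *= (self.feat_counts[(c, i, v)] + 1) / (cc + len(self.values[i]))
            if p > best_p:
                best, best_p = c, p
        return best

nb = IncrementalNB()
nb.update(("sunny", "hot"), "no")
nb.update(("rainy", "mild"), "yes")
print(nb.predict(("sunny", "mild")))
```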
A Minimum Attribute Self-Adaptive Cooperative Co-Evolutionary Reduction Algorithm Based on Quantum Elitist Frogs
Ding Weiping, Wang Jiandong, and Guan Zhijin
2014, 51(4):  743-753. 
Attribute reduction is a key topic in rough set theory. Computing a minimum attribute reduction of a decision table has been proven to be NP-hard, and conventional evolutionary algorithms are not efficient at this task. A novel minimum attribute self-adaptive cooperative co-evolutionary reduction algorithm based on quantum elitist frogs (QEFASCR) is proposed. Firstly, evolutionary frogs are represented by multi-state quantum chromosomes, and quantum elitist frogs quickly guide the population into the optimal region, which strengthens convergence velocity and global search efficiency. Secondly, a self-adaptive cooperative co-evolutionary model for minimum attribute reduction is designed: evolutionary attribute sets are decomposed into reasonable subsets according to the best historical performance records and assigned credits, and optimal elitists in different subpopulations are selected to evolve their respective attribute subsets, which increases the cooperation and efficiency of attribute reduction. The global minimum attribute reduction set can thus be obtained quickly. Experimental results indicate that the proposed algorithm achieves higher efficiency and accuracy in minimum attribute reduction than existing algorithms.
Global Convergence and Premature Convergence of Two-Membered Evolution Strategy
Zhang Yushan, Hao Zhifeng, and Huang Han
2014, 51(4):  754-761. 
The theory of discrete-state Markov chains has been widely applied to the analysis of convergence and time complexity of evolutionary algorithms, whereas the theory of continuous-state Markov processes has seen relatively little systematic use, owing to the profound mathematical tools it requires. In this article, we introduce the theory of continuous-state Markov processes and use measure theory and conditional expectation as tools to derive a key formula for the transition probability. We analyze the convergence of continuous evolutionary algorithms, represented by (1+1)ES, and prove theoretically that a wide class of common mutation distributions, including the normal and Cauchy distributions, causes (1+1)ES with a constant mutation operator to converge in probability to the ε-vicinity of the global optimum, provided some conditions that are easy to meet in practice are satisfied. We further construct a function with a fitness plateau and prove theoretically that certain self-adaptive mutation operators can drive (1+1)ES into premature convergence even when the mutation distribution is normal or Cauchy. The theoretical analysis is validated by simulation experiments. The results show that the self-adaptive mechanism is not always effective, and that the scale factor should be selected carefully.
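A minimal (1+1)ES with the constant mutation operator analyzed here might look as follows; the sphere objective and the scale sigma are illustrative assumptions, and replacing rng.normal with rng.standard_cauchy gives the Cauchy-mutation variant.

```python
import numpy as np

def one_plus_one_es(f, x0, sigma=0.3, iters=10_000, seed=0):
    """(1+1)ES with a constant mutation operator: keep the child iff no worse."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(iters):
        child = x + sigma * rng.normal(size=x.shape)   # constant-scale Gaussian mutation
        fc = f(child)
        if fc <= fx:                                   # elitist (1+1) selection
            x, fx = child, fc
    return x, fx

sphere = lambda x: float(np.sum(np.square(x)))
x_best, f_best = one_plus_one_es(sphere, x0=np.full(5, 3.0))
print(f"best fitness {f_best:.3e}")    # approaches the global optimum at 0
```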
Road Segmentation Based on Vanishing Point and Principal Orientation Estimation
Tian Zheng, Xu Cheng, Mi Chao, Li Renfa, and Wang Xiaodong
2014, 51(4):  762-772. 
Most existing road segmentation algorithms based on vanishing point estimation require the vanishing point to lie inside the image; they are also time-consuming and cannot effectively suppress interference from noise with strong texture features. Addressing these problems, this paper proposes a road segmentation method based on principal orientation and vanishing point estimation. Firstly, valid voters are selected under the constraints of the road's principal orientation. Then a multi-dimensional voting scheme is presented, which records the voting information in different orientations for each candidate vanishing point; this information is later used to judge whether the vanishing point lies inside the image. Finally, a boundary fitting strategy based on the principal orientation is proposed, which extracts the road region from the data generated during the multi-dimensional voting stage. Quantitative and qualitative experiments show that the proposed road segmentation method is more accurate and faster than traditional algorithms, and it still works well when the vanishing point lies outside the image.
Analysis Method of Thinking Data Based on fMRI
Deng Hongxia, Xiang Jie, You Ya, and Li Haifang
2014, 51(4):  773-780. 
Recently, a growing number of studies have shown that machine learning technologies can be used to decode mental states from functional magnetic resonance imaging (fMRI) data. fMRI has made it possible to localize brain activity functionally and thus to interpret thinking data, but how the brain carries out a specific thinking process is still unknown, and analytical methods that better reveal the thinking process need further study. Using two stimulus tasks, solving 4×4 Sudoku problems and an image-based emotional reaction, we collected thinking-process data reflecting different mental states and explored different methods of analyzing them. The experiments show that t-test feature selection, peak-time feature extraction, and SVM classification are well suited to analyzing these fMRI data, especially for the two mental states above, and can reveal the correct state of mind. This work provides a theoretical basis and data for advancing fMRI-based interpretation of thinking data.
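The pipeline named in the abstract, t-test feature selection followed by an SVM, follows a standard pattern that can be sketched with SciPy and scikit-learn. This runs on synthetic data; the voxel counts, selection threshold, and linear kernel are assumptions, and the peak-time feature extraction step is omitted.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# synthetic "voxel" features: 40 trials x 500 voxels, two mental states
X = rng.normal(size=(40, 500))
y = np.repeat([0, 1], 20)
X[y == 1, :20] += 0.8                     # 20 informative voxels

# t-test feature selection: keep voxels that differ between the two states
t, p = ttest_ind(X[y == 0], X[y == 1], axis=0)
mask = p < 0.01
print(f"selected {mask.sum()} voxels")

# linear SVM on the selected voxels
scores = cross_val_score(SVC(kernel="linear"), X[:, mask], y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```

In a real study the selection step would be nested inside the cross-validation to avoid leaking label information; it is kept outside here only for brevity.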
Outliers and Change-Points Detection Algorithm for Time Series
Su Weixing, Zhu Yunlong, Liu Fang, and Hu Kunyuan
2014, 51(4):  781-788. 
Conventional change-point detection methods suffer from detection delay and are ill-suited to time series contaminated with outliers in practical applications. This paper therefore proposes an outlier and change-point detection algorithm for time series based on the wavelet transform of the efficient score vector. The algorithm introduces the efficient score vector to overcome the problem that, in conventional detection, the statistic grows without bound as data accumulate, and it analyzes the statistic with wavelets to avoid serious detection delay. To distinguish outliers from change-points during detection, we propose a detection framework based on the relationship between the Lipschitz exponent and the wavelet coefficients, with which outliers and change-points can be detected simultaneously. The detection result is not affected by outliers, so the method can handle time series containing both outliers and change-points under practical operating conditions, making it more suitable for real applications. Simulation results demonstrate the effectiveness and practicality of the proposed detection method.
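The Lipschitz-exponent criterion, under which an isolated outlier's wavelet response decays as the scale grows while a change-point's persists, can be illustrated with PyWavelets. This is a schematic sketch on a synthetic series, not the paper's efficient-score statistic; the scales and wavelet choice are assumptions.

```python
import numpy as np
import pywt

# synthetic series: an outlier at t=300 and a level shift (change-point) at t=700
x = np.random.default_rng(1).normal(0.0, 0.1, 1000)
x[300] += 3.0
x[700:] += 3.0

# continuous wavelet transform at a fine and a coarse scale
coef, _ = pywt.cwt(x, scales=[2, 16], wavelet="gaus1")
fine, coarse = np.abs(coef[0]), np.abs(coef[1])

for name, idx in [("outlier", 300), ("change-point", 700)]:
    # a change-point (Lipschitz exponent >= 0) keeps or grows its response with
    # scale; an isolated outlier (negative exponent) decays
    print(f"{name}: coarse/fine response ratio = {coarse[idx] / fine[idx]:.2f}")
```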
Ensemble Approach for Detecting User Profile Attacks Based on Bionic Pattern Recognition
Zhou Quanqiang and Zhang Fuzhi
2014, 51(4):  789-801. 
Supervised approaches suffer from low precision when detecting user profile attacks. To address this problem, an ensemble detection approach is proposed that introduces bionic pattern recognition theory and ensemble learning. Firstly, by calculating the distance between a covered line segment and its nearest genuine profile, an adaptive algorithm is proposed to assign a proper hypersphere radius to each neuron. Secondly, the assigned radius is used to redesign an existing constructive neural network so that it covers attack profiles more reasonably, improving its classification capability. Finally, an ensemble framework is proposed to detect user profile attacks. To create diverse base training sets, a generation algorithm is proposed that combines various attack types; these base training sets are used to train the redesigned neural network to produce base classifiers. Based on the ITS algorithm, a selective ensemble detection algorithm is proposed to select a subset of the base classifiers, whose outputs are integrated by majority voting. Experimental results on two real datasets of different scales, MovieLens and Netflix, show that the proposed approach effectively improves precision while maintaining a high recall.
A Robust and High Accurate 3D Facial Motion Tracking Algorithm
Yu Jun and Wang Zengfu
2014, 51(4):  802-812. 
A 3D facial motion tracking approach is proposed that incorporates an online appearance model (OAM) and a cylinder head model (CHM) within a particle filter framework. It includes: 1) In constructing the OAM, multiple measurements are fused to reduce the influence of lighting and person dependence. 2) The OAM provides detailed descriptive parameters for facial motion tracking, but it is not robust across large pose variations. To alleviate this, the OAM is combined with the CHM, which is suitable for robust global head motion tracking across large pose variations: the global head motion parameters obtained from the CHM serve as cues for fitting the OAM parameters, and the fitting result in turn initializes the CHM in the next frame. 3) Motion filtering is performed by a particle filter combined with local optimization and improved resampling. Experiments on real video sequences demonstrate accurate tracking even in the presence of perturbing factors, including significant head pose and facial expression variations, occlusions, and illumination changes. Combining the OAM and CHM is also shown to be more pose-robust than the OAM alone, with a 24% higher tracking rate and 11% wider pose coverage. A subjective experiment further indicates that video synthesized from the tracked facial motion parameters is suitable for subjective face identification.
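The particle-filter backbone of such a tracker can be sketched generically; the minimal 1D bootstrap filter below uses plain systematic resampling and Gaussian noise models as stand-ins for the paper's local optimization and improved resampling (all constants are illustrative).

```python
import numpy as np

def pf_step(particles, weights, z, motion_std, obs_std, rng):
    """One bootstrap particle-filter step: predict, reweight, resample."""
    # predict: diffuse particles through the motion model
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)
    # update: Gaussian observation likelihood for each particle
    weights = weights * np.exp(-0.5 * ((z - particles) / obs_std) ** 2)
    weights = weights / weights.sum()
    # systematic resampling when the effective sample size collapses
    n = len(particles)
    if 1.0 / np.sum(weights ** 2) < n / 2:
        positions = (rng.random() + np.arange(n)) / n
        idx = np.minimum(np.searchsorted(np.cumsum(weights), positions), n - 1)
        particles, weights = particles[idx], np.full(n, 1.0 / n)
    return particles, weights

rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, 500)          # 1D head-pose hypotheses
weights = np.full(500, 1.0 / 500)
for z in [0.2, 0.5, 0.9, 1.3]:                 # noisy pose observations
    particles, weights = pf_step(particles, weights, z, 0.1, 0.2, rng)
print("pose estimate:", np.sum(particles * weights))
```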
Multi-View Cooperative Tracking of Multiple Mobile Object Based on Dynamic Occlusion Threshold
Zhou Liangyi, Wang Zhi, and Wang Yingguan
2014, 51(4):  813-823. 
Occlusion among mobile objects during movement creates complicated spatial relationships, making multi-view information fusion, cooperative processing, and tracking difficult. This paper proposes a multi-view object tracking algorithm that combines a modified fusion feature with a dynamic occlusion threshold and an improved particle filter. To resolve the uncertainty of object features in multi-view information fusion, an occlusion variable is introduced to describe the spatial relationships among mobile objects. By analyzing each object's contribution to information fusion in different scene planes, with the help of object positions and scales, the homography transform, and the sensor model, an expression for the dynamic occlusion threshold is derived. By dynamically adjusting and comparing the occlusion threshold, an accurate occlusion state among multiple mobile objects is obtained and used in object feature fusion within the common scene. An improved particle filter tracking algorithm based on Bayesian theory is then proposed, which utilizes the modified feature fusion and makes the tracking system more robust to object occlusion. Experiments show that the proposed occlusion variable and dynamic occlusion threshold effectively solve the problems of feature uncertainty and scale variation, and good tracking precision is maintained even when objects are occluded.
Algorithm Design and Empirical Analysis for Particle Swarm Optimization-Based Test Data Generation
Mao Chengying, Yu Xinxin, and Xue Yunzhi
2014, 51(4):  824-837. 
How to generate a test dataset with high coverage and strong fault-revealing ability is a difficult problem, especially in structural software testing. Recently, meta-heuristic search has been confirmed as an effective way to generate structural test data. In this paper, a swarm-intelligence-based method is proposed to handle this problem. First, the basic framework of search-based test data generation is discussed. Then, for the branch coverage criterion, an algorithm for generating test data based on particle swarm optimization (PSO) is proposed, and a new fitness function is defined by analyzing the structure of the branch predicates in the program under test. Subsequently, ten published open programs are used for experimental evaluation. The results show that PSO outperforms the genetic algorithm (GA) and simulated annealing (SA) on all four metrics: average coverage, success rate, average convergence generation, and average time. In addition, four PSO variants are implemented and compared with the basic PSO. The results indicate that the basic PSO is the most suitable algorithm for the test data generation problem, whereas comprehensive learning PSO (CLPSO) exhibits the worst performance among the variants.
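The fitness construction, scoring an input by how close it comes to satisfying a target branch predicate, can be paired with a basic PSO as follows. This is a minimal sketch: the example predicate `a == b and b > 10`, the branch-distance rules, and the PSO constants are illustrative assumptions, not the paper's benchmark setup.

```python
import numpy as np

def branch_distance(x):
    """Distance to satisfying the target branch `a == b and b > 10` (0 = covered)."""
    a, b = x
    return abs(a - b) + max(0.0, 10.0 + 1e-3 - b)

def pso(fitness, dim=2, n=30, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-100, 100, (n, dim))      # candidate test inputs
    vel = np.zeros((n, dim))
    pbest, pbest_f = pos.copy(), np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((n, dim)), rng.random((n, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        f = np.array([fitness(p) for p in pos])
        better = f < pbest_f
        pbest[better], pbest_f[better] = pos[better], f[better]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest, pbest_f.min()

x, fx = pso(branch_distance)
print(f"input {x} reaches branch distance {fx:.4f}")   # ~0 means branch covered
```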
Predicate Constraint Oriented BPEL Modeling and Feasible Path Analysis
Wang Jin, Huang Zhiqiu, Tang Jiajun, Chen Zhe, and Xiao Fangxiong
2014, 51(4):  838-847. 
Because a single Web service is functionally too simple to accomplish complex business requirements, service composition is introduced to build enterprise solutions by combining multiple existing enterprise services. A composition of services is comparable to a traditional application whose functional scope is usually associated with the automation of a parent business process. Based on XML, BPEL uses XPath to bind variables and define expressions. Unlike WSDL, which can define only simple constraints via pre- and post-conditions, XPath expressions combined with structured activities provide greater expressive power for defining constraints. Both white-box testing and model checking rely on BPEL modeling and feasible path analysis, and for BPEL, considering only the structural activities while ignoring data manipulation and constraints has negative consequences. To address this issue, we propose an XPath-expression-oriented approach to analyzing and modeling predicate constraints, together with a feasible path analysis algorithm based on this model, which takes the effects of data manipulation on path feasibility into account. We first analyze the syntax of BPEL expressions and normalize them. Then, an activity-effect-extended variable structure tree is used to model atomic expressions and composite predicate expressions, and the feasible path analysis algorithm is formulated on the established model. Finally, a case study illustrates the feasibility of our approach and the experimental process.
Method for Modeling and Analyzing Software Energy Consumption of Embedded Real-Time System
Zhu Yi, Xiao Fangxiong, Zhou Hang, and Zhang Guangquan
2014, 51(4):  848-855. 
With the progress of low-power research on embedded real-time systems, software energy consumption has a significant effect on such systems and is moving toward quantitative analysis. Because existing modeling and analysis of embedded real-time systems has difficulty accounting for software energy consumption, this paper proposes a method for modeling and analyzing it based on process algebra. Priced timed CSP is obtained by extending timed CSP with price information, and the power consumption of instructions in embedded real-time systems is mapped onto prices in priced timed CSP, so the energy consumption of embedded real-time software can be modeled and optimized. Optimal path algorithms are proposed to check whether the power consumption of a single instruction is satisfiable and to compute the minimum-energy reachability path of an embedded real-time system. This formal method improves the accuracy and efficiency of energy calculation, and the results can be used to quantitatively analyze and optimize the energy consumption of embedded real-time systems.
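Finding a minimum-energy reachability path over states whose transitions carry energy prices reduces to a shortest-path search, sketched below with Dijkstra's algorithm; the state graph and per-transition costs are illustrative assumptions, not the paper's priced timed CSP semantics.

```python
import heapq

def min_energy_path(graph, start, goal):
    """Dijkstra over a transition graph whose edge weights are energy prices."""
    pq = [(0.0, start, [start])]      # (accumulated energy, state, path)
    seen = set()
    while pq:
        energy, state, path = heapq.heappop(pq)
        if state == goal:
            return energy, path
        if state in seen:
            continue
        seen.add(state)
        for nxt, cost in graph.get(state, []):
            if nxt not in seen:
                heapq.heappush(pq, (energy + cost, nxt, path + [nxt]))
    return float("inf"), []

# states with instruction energy prices (nJ), purely illustrative
graph = {
    "init":    [("compute", 5.0), ("sleep", 1.0)],
    "sleep":   [("compute", 2.0)],
    "compute": [("done", 3.0)],
}
energy, path = min_energy_path(graph, "init", "done")
print(f"minimum energy {energy} nJ via {' -> '.join(path)}")   # via sleep: 6.0 nJ
```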
Research on Generation System from Radl Algorithm to Apla Program
Xie Wuping and Xue Jinyun
2014, 51(4):  856-864. 
Designing algorithms is a creative activity, and traditional design and description methods make it difficult to ensure an algorithm's correctness. Using the algorithm description language Radl, which has mathematical referential transparency, in the PAR method, an algorithm described by recurrence relations can be derived formally from the problem specification. Since the core of such an algorithm is a group of recurrence relations, a Radl algorithm is easy to derive and prove formally. By deeply analyzing the characteristics of the Radl language and Radl algorithms, the essential relationship between Radl algorithms and the abstract sequential programming language Apla is revealed, and Apla program generation rules based on Radl grammar productions are defined. An automatic Apla program generation system is then implemented, its reliability is studied, and its core algorithm is verified formally. Since algorithms developed with the PAR method are correct and the generation system has been formally verified, high reliability of the resulting algorithms is ensured, and the automated development tool improves program development efficiency.
Test Case Selection for Improving the Effectiveness of Software Fault Localization
Wang Kechao, Wang Tiantian, Su Xiaohong, Ma Peijun, and Tong Zhixiang
2014, 51(4):  865-873. 
Existing approaches to test case selection usually focus on improving the efficiency of testing rather than the effectiveness of fault localization. To solve this problem, a test case selection approach is proposed. Firstly, a test case prioritization criterion based on the similarity of failure coverage vectors is proposed, which prioritizes the passed test cases whose execution paths are most similar to that of the failed test case. Secondly, a test case selection criterion based on equivalent failure coverage division is defined to select passed test cases that maximally differentiate the statements on the failed execution path. Finally, the test case selection model ES is built on these two criteria. Unlike existing approaches, ES exploits the execution path of the failed test case to improve the effectiveness of fault localization. It has been applied to the Siemens benchmark, and the selected test cases have been used as input to four popular coverage-based fault localization techniques. Experimental results on the Reduction and Expense_increase metrics show that ES outperforms the existing statement-based and vector-based test case reduction approaches: it achieves a Reduction of over 97%, greatly improving the efficiency of fault localization, while also keeping Expense_increase low, which means it significantly improves the effectiveness of fault localization.
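Prioritizing passed tests by how closely their coverage vectors match the failed run can be sketched as follows; this is a minimal illustration using Jaccard similarity over binary statement-coverage vectors (the similarity measure and the toy vectors are assumptions, not the paper's exact criterion).

```python
import numpy as np

def jaccard(u, v):
    """Jaccard similarity of two binary statement-coverage vectors."""
    inter = np.sum((u == 1) & (v == 1))
    union = np.sum((u == 1) | (v == 1))
    return inter / union if union else 0.0

# rows = test cases, columns = statements (1 = executed); toy data
failed = np.array([1, 1, 0, 1, 0, 1])
passed = np.array([
    [1, 1, 0, 1, 0, 0],   # very similar to the failed path
    [0, 0, 1, 0, 1, 0],   # disjoint from the failed path
    [1, 0, 0, 1, 0, 0],
])

# rank passed tests by similarity to the failed execution path
order = sorted(range(len(passed)),
               key=lambda i: jaccard(passed[i], failed), reverse=True)
print("selection order:", order)    # -> [0, 2, 1]
```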
MujavaX: A Distribution-Aware Mutation Generation System for Java
Sun Chang’ai and Wang Guan
2014, 51(4):  874-881. 
Mutation analysis is widely employed to evaluate the effectiveness of software testing techniques. Existing mutation analysis techniques commonly insert faults into the original program uniformly, whereas empirical studies have observed that actual faults tend to cluster. This mismatch may result in inappropriate fault simulation and hence unreliable evaluation results. To overcome this limitation, we proposed a distribution-aware mutation analysis technique in our previous work and validated that the mutation distribution affects the measured effectiveness of the software testing techniques under evaluation. In this paper, we implement a mutation system, MujavaX, to support distribution-aware mutation analysis. The system extends and improves Mujava, which has been widely used for mutation testing of Java programs. A case study validates the correctness and feasibility of MujavaX, and the experimental results show that it can generate a set of mutants for Java programs according to a given distribution model specified by testers.
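The heart of distribution-aware mutation, seeding mutants according to a non-uniform fault distribution rather than uniformly, can be sketched as follows; the Gaussian "fault cluster" model and line-level granularity are assumptions for illustration, not MujavaX's actual distribution models.

```python
import numpy as np

def sample_mutation_lines(n_lines, n_mutants, cluster_center, cluster_std, rng):
    """Sample mutant locations from a clustered (Gaussian) fault distribution."""
    lines = np.arange(n_lines)
    # clustered fault model: faults concentrate around a fault-prone region
    weights = np.exp(-0.5 * ((lines - cluster_center) / cluster_std) ** 2)
    weights /= weights.sum()
    return rng.choice(lines, size=n_mutants, p=weights)

rng = np.random.default_rng(0)
uniform = rng.choice(np.arange(400), size=50)              # classical uniform seeding
clustered = sample_mutation_lines(400, 50, cluster_center=120,
                                  cluster_std=15, rng=rng)  # distribution-aware seeding
print("uniform spread:  ", np.std(uniform).round(1))
print("clustered spread:", np.std(clustered).round(1))      # much tighter
```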
Study of Choosing the Optimal Storage Format of Sparse Matrix Vector Multiplication
Li Jiajia, Zhang Xiuxia, Tan Guangming, and Chen Mingyu
2014, 51(4):  882-894. 
Sparse matrix-vector multiplication (SpMV) is one of the most important kernels in scientific and engineering applications, and one of the most essential subprograms of the sparse BLAS library. Much work has been dedicated to optimizing SpMV, some of it achieving significant performance improvements. Since most optimization methods lack generality and suit only specific types of sparse matrices, the optimized SpMV kernels have not been widely used in real applications and numerical solvers. Moreover, a sparse matrix can be stored in many formats, and performance differs greatly across SpMV kernels for different formats. In this paper, considering different sparse matrix features, we present an SpMV auto-tuner (SMAT) that chooses the optimal storage format for a given sparse matrix on different computer architectures. The optimal storage format, which yields the highest SpMV performance, helps enhance application performance. SMAT is also extensible to new formats, making full use of the SpMV optimizations in the literature. We evaluate SMAT using 2366 sparse matrices from the University of Florida collection. SMAT achieves 9.11 GFLOPS (single precision) and 2.44 GFLOPS (double precision) on an Intel platform, and 3.36 GFLOPS (single) and 1.52 GFLOPS (double) on an AMD platform. Compared with the Intel MKL library, SMAT achieves a speedup of 1.4 to 1.5 times.
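The auto-tuning idea, choosing the storage format that runs SpMV fastest for a given matrix, can be illustrated with SciPy's sparse formats. The toy selection loop below times candidates by brute force, whereas SMAT itself selects by matrix features; the candidate list and the random test matrix are assumptions.

```python
import time
import numpy as np
import scipy.sparse as sp

def pick_fastest_format(A_coo, x, trials=20):
    """Time y = A @ x in several storage formats and return the fastest."""
    candidates = {
        "csr": A_coo.tocsr(), "csc": A_coo.tocsc(),
        "coo": A_coo.tocoo(), "lil": A_coo.tolil(),
    }
    timings = {}
    for name, A in candidates.items():
        start = time.perf_counter()
        for _ in range(trials):
            A @ x                      # the SpMV kernel under test
        timings[name] = (time.perf_counter() - start) / trials
    return min(timings, key=timings.get), timings

A = sp.random(5000, 5000, density=0.001, format="coo", random_state=0)
x = np.ones(5000)
best, timings = pick_fastest_format(A, x)
print("fastest format:", best)         # typically csr for this kind of matrix
```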
Entity Association Mining Algorithm CFRQ4A in Heterogeneous Information Spaces
Yang Dan, Shen Derong, Nie Tiezheng, Yu Ge, and Kou Yue
2014, 51(4):  895-904. 
Rich entity associations are prerequisites for, and play important roles in, many applications in heterogeneous information spaces, such as data analysis, data mining, knowledge discovery, and semantic query. However, unlike in homogeneous information networks, the complexity, diversity, and heterogeneity of entity associations in heterogeneous information spaces make entity association mining a nontrivial and challenging task. Discovering likely associations among heterogeneous entities in an author bibliographic network is taken as an example. Aiming at the characteristics of heterogeneous information spaces, a new general four-step entity association mining algorithm, CFRQ4A (clustering, filtering, reasoning, and quantifying for associations), is proposed. CFRQ4A leverages not only the attribute values of heterogeneous entities but also the structural (path) information of the heterogeneous information network, and it introduces association constraints to verify the semantic and logical correctness of entity associations during mining. The filtering step further reduces the search space of the algorithm. Moreover, a reasonable association strength quantification model is given that reflects the inherent features of entity associations. Experimental results on the DBLP dataset demonstrate the feasibility and effectiveness of the proposed algorithm.
Non-Cooperative Structured Deep Web Selection Based on Hybrid Type Keyword Retrieval
Wan Changxuan, Deng Song, Liu Dexi, Jiang Tengjiao, and Liu Xiping
2014, 51(4):  905-917. 
To utilize the resources in the deep Web efficiently, deep Web data integration has emerged as the times require. Data source selection is one of its key technologies, because it helps improve the efficiency of deep Web integration and the quality of the returned results. Most deep Web data sources are structured and non-cooperative. Existing research on non-cooperative structured deep Web selection falls into two categories: methods based on discrete keyword retrieval and methods based on character keyword retrieval; to the best of our knowledge, no data source selection method considers both keyword types. In this paper, user query keywords are divided into retrieval-type keywords and constraint-type keywords. We use the association features between subject headings, between a subject heading and a feature word, and between histograms to construct a hierarchical data source summary. The summary can handle hybrid keyword queries composed of retrieval-type and constraint-type keywords, reflecting the search intent of the former and the binding constraints of the latter. Finally, we give a corresponding data source selection strategy based on this summary. The experimental results show that our method achieves good record recall and precision.
top-k Aggregation Keyword Search over Relational Databases
Zhang Dongzhan, Su Zhifeng, Lin Ziyu, and Xue Yongsheng
2014, 51(4):  918-929. 
Structured query language (SQL) is the classical approach to querying relational databases, but it is difficult for ordinary users who are unfamiliar with the underlying database schema and with SQL. Keyword search, as used in information retrieval (IR) systems, lets users simply input a set of keywords to get the required results. It is therefore desirable to integrate DB and IR so that users can search relational databases without any knowledge of the schema or query languages. Given a keyword query, existing approaches find individual tuples that match the query keywords based on primary-foreign-key relationships in the database. In many real applications, however, users benefit more from aggregated results over tuples, which existing methods cannot provide. This paper therefore focuses on top-k aggregation keyword search over relational databases. A recursion-based full search algorithm, RFS, is proposed to compute all aggregation cells. To achieve high performance, new ranking techniques, a keyword-tuple-based two-dimensional index, and a quick search algorithm, OQS, are developed to identify top-k aggregation cells efficiently. Extensive experiments on two large real datasets show the benefits of our approach.
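The object of the search, ranking aggregation cells rather than individual tuples, can be sketched with a simple group-then-rank loop; this is a toy illustration with hypothetical matched tuples and a sum-of-relevance cell score, not the paper's RFS/OQS algorithms or index.

```python
import heapq
from collections import defaultdict

# hypothetical tuples already matched to the keyword query ("laptop"),
# each carrying the dimension attribute to aggregate on and a relevance score
matched = [
    {"brand": "A", "score": 0.9}, {"brand": "B", "score": 0.4},
    {"brand": "A", "score": 0.7}, {"brand": "C", "score": 0.5},
    {"brand": "B", "score": 0.8}, {"brand": "A", "score": 0.3},
]

# aggregate matched tuples into cells keyed by the dimension attribute
cells = defaultdict(float)
for t in matched:
    cells[t["brand"]] += t["score"]     # cell score = sum of tuple relevance

# top-k aggregation cells via a heap
k = 2
print(heapq.nlargest(k, cells.items(), key=lambda kv: kv[1]))
# the two best cells; brand "A" ranks first here
```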