ISSN 1000-1239 CN 11-1777/TP


    Default Latest Most Read
    Please wait a minute...
    For Selected: Toggle Thumbnails
    Journal of Computer Research and Development    2015, 52 (8): 1705-1706.  
    Abstract2245)   HTML17)    PDF (397KB)(1287)       Save
    Related Articles | Metrics
    Online Learning Algorithms for Big Data Analytics: A Survey
    Li Zhijie,Li Yuanxiang,Wang Feng,He Guoliang,Kuang Li
    Journal of Computer Research and Development    2015, 52 (8): 1707-1721.   DOI: 10.7544/issn1000-1239.2015.20150185
    Abstract4781)   HTML105)    PDF (1700KB)(3866)       Save
    The advent of big data has been presenting a large array of applications that require real-time processing of massive data with high velocity. How to mine big data stream in a wide range of real-world applications becomes more and more important. Conventional batch machine learning techniques suffer from many limitations when being applied to big data analytics tasks. Online learning technique with stream computing mode is a promising tool for data stream learning. In this survey, we firstly introduce the motivation and background of big data analytics, and then focus on presenting the family of classical and latest online learning methods and algorithms, which are promising to tackle the emerging challenges of mining big data in a wide range of real-world applications. The main technical content of this survey consists of three parts: 1) online learning for linear model;2) kernel-based online learning for nonlinear model;3) non-traditional online learning methods. This is followed by a discussion about some key problems of large-scale machine learning for big data analytics applications. Finally, we present a few typical scenarios of online learning for big data stream and discuss possible directions for ongoing and future research in this area.
    Related Articles | Metrics
    Generalized Kernel Polarization Criterion for Optimizing Gaussian Kernel
    Tian Meng, Wang Wenjian
    Journal of Computer Research and Development    2015, 52 (8): 1722-1734.   DOI: 10.7544/issn1000-1239.2015.20150110
    Abstract1289)   HTML0)    PDF (3966KB)(702)       Save
    The choice of kernel function is a basic and challenging problem in researches on kernel methods. Gaussian kernel is a popular and widely used one in various kernel methods, and many universal kernel selection methods have been derived for Gaussian kernel. However, these methods may have some disadvantages, such as heavy computational complexity, the difficulty of algorithm implement, and the requirement of the classes generated from underlying multivariate normal distributions. To remedy these problems, generalized kernel polarization criterion has been proposed to tune the parameter of Gaussian kernel for classification tasks. By taking the within-class local structure into account and centering the kernel matrix, the criterion does better in maximizing the class separability in the feature space. And the final optimized kernel parameter leads to a substantial improvement in the performance. Furthermore, the criterion function can be proved to have a determined approximate global minimum point. This good characteristic, coupled with its independence of the actual learning machine, makes the optimal parameter easier to find by many algorithms. Besides this, local kernel polarization criterion function, a special case of generalized kernel polarization criterion function, can also be proved to have a determined approximate global minimum point. The extensions of generalized kernel polarization criterion and local kernel polarization criterion to the multiclass domain have been proposed. Experimental results show the effectiveness and efficiency of our proposed criteria.
    Related Articles | Metrics
    Temporal Link Prediction Based on Dynamic Heterogeneous Information Network
    Zhao Zeya,Jia Yantao,Wang Yuanzhuo,Jin Xiaolong,Cheng Xueqi
    Journal of Computer Research and Development    2015, 52 (8): 1735-1741.   DOI: 10.7544/issn1000-1239.2015.20150183
    Abstract1422)   HTML8)    PDF (1251KB)(1057)       Save
    Temporal link prediction on dynamic heterogeneous information networks, aiming to predict both the building times of links and their types, has been widely studied in recent years. The dynamic heterogeneous information network is a network that has different types of vertices and time-labeled edges, and in this paper we study the temporal link prediction problem in the dynamic heterogeneous information network. Most existing studies employs the structure-based predictive methods, where the structures fails to embed the time information. Therefore, they cannot characterize the correlation between structures and time during the prediction. In this work, we firstly construct the structure called the time-difference-labeled path(TDLP) to combine the time information and structural features into a unified setting and propose TDLP, a time-difference-labeled path based temporal link prediction method, which combines the time information with the structural path features. Experiments on a real data set of a scholar bibliographic website demonstrate that the proposed TDLP method performs better than the state-of-the-art methods on predicting both whether and when a link will be built.
    Related Articles | Metrics
    Feature Selection Algorithm Based on the Multi-Colony Fairness Model
    Yang Tan,Feng Xiang,Yu Huiqun
    Journal of Computer Research and Development    2015, 52 (8): 1742-1756.   DOI: 10.7544/issn1000-1239.2015.20150245
    Abstract1271)   HTML11)    PDF (3509KB)(840)       Save
    As the world gradually transforms from the information world to the data-driven world, the areas of pattern recognition and date mining are facing more and more challenges. Feature subset selection process becomes a necessary part of big-data pattern recognition due to the data with explosive growth. Inspired by the behavior of grabbing resources in animals, the paper adds personal grabbing-resource behavior into the model of resource distribution transformed from the model of feature selection and proposes multi-colony fairness algorithm(MCFA) to deal with this behavior in order to obtain a better distribution scheme (i.e. to obtain a better feature subset). The algorithm effectively fuses the strategies of the random search and the heuristic search. In addition, it combines the methods of filter and wrapper so as to reduce the amount of calculation while improving the classification accuracy. The convergence and the effectiveness of the proposed algorithm are verified both from mathematical and experimental aspects. MCFA is compared with the other four classic feature selection algorithms SFS(sequential forward selection), SBS(sequential backward selection), SFFS(sequential floating forward selection), SBFS(sequential floating backward selection) and three mainstream feature selection algorithms RRFS(relevance-redundancy feature selection), mRMR(minimal-redundancy-maximal-relevance), ReliefF. The comparison results show that the proposed algorithm can obtain better feature subsets both in the aspects of feature subset length and the classification accuracy which indicates the efficiency and the effectiveness of the proposed algorithm.
    Related Articles | Metrics
    Manifold Clustering and Visualization with Commute Time Distance
    Shao Chao, Zhang Xiaojian
    Journal of Computer Research and Development    2015, 52 (8): 1757-1767.   DOI: 10.7544/issn1000-1239.2015.20150247
    Abstract1209)   HTML1)    PDF (5430KB)(899)       Save
    The existing manifold learning algorithms can effectively learn and visualize the low-dimensional nonlinear manifold structure of high-dimensional data. However, most efforts to date select the neighborhood size in sensitivity and difficulty, and require sampling the data from a single manifold. To reduce the sensitivity of manifold learning algorithms to the neighborhood size, and address the effective visualization and clustering of multi-manifold data, this paper employs the commute time distance to propose a novel manifold learning algorithm, called CTD-ISOMAP (commute time distance isometric mapping). Compared with Euclidean distance, commute time distance probabilistically synthesizes all the paths connecting any two points in the neighborhood graph. Consequently, it takes into account the intrinsic nonlinear geometric structure for the given data, while still providing the robust results, and then is suitable to identify the shortcut edges and the inter-manifold edges possibly existed in the neighborhood graph. CTD-ISOMAP with the commute time distance, therefore, effectively eliminates the shortcut edges in the neighborhood graph, so that each output achieves the low-dimensional nonlinear manifold structure in the much wider range of the neighborhood size, and eliminates the inter-manifold edges in the neighborhood graph to boost the clustering on multi-manifold data obtained by spectral clustering. Finally, our experimental study verifies the effectiveness of CTD-ISOMAP.
    Related Articles | Metrics
    FSMBUS: A Frequent Subgraph Mining Algorithm in Single Large-Scale Graph Using Spark
    Yan Yuliang,Dong Yihong,He Xianmang,Wang Wei
    Journal of Computer Research and Development    2015, 52 (8): 1768-1783.   DOI: 10.7544/issn1000-1239.2015.20150256
    Abstract2434)   HTML10)    PDF (6675KB)(1357)       Save
    Mining frequent subgraphs in a single large-scale graph is of huge demand with the rapid growth of the social networking. However, it is inefficient for the serial algorithms to mine frequent subgraphs in low support when mining for a single large-scale graph. Meanwhile, few existing distributed algorithms can’t support the growth pattern mining, and the Hadoop framework they worked is not suitable for iterative running. In this paper, a distributed algorithm named FSMBUS for mining frequent subgraph in a single large-scale graph under Spark framework is proposed. It constructs the parallel computing candidate subgraphs by suboptimal CAM Tree, which returns all the frequent subgraphs for given user-defined minimum support. Additionally, infrequent patterns’ test and searching order chosen is introduced to optimize the algorithm. Sorted-Greedy method is designed for data partition to balance the workload. Our experiments show that FSMBUS runs faster and more effective than the existing algorithms with real datasets,and even can run with the lower support threshold and the larger graph datasets as well. At the same time, FSMBUS runs 2~4 times faster on Spark framework than that on Hadoop framework.
    Related Articles | Metrics
    A Graph Clustering Method for Detecting Protein Complexes
    Wang Jie,Liang Jiye,Zheng Wenping
    Journal of Computer Research and Development    2015, 52 (8): 1784-1793.   DOI: 10.7544/issn1000-1239.2015.20150180
    Abstract1133)   HTML4)    PDF (1457KB)(763)       Save
    Protein-protein interaction (PPI) networks are widely present in complex biological networks. The topological features of PPI networks play an important role in analyzing the functional modules in networks. Some graph clustering methods have been successfully used to complex networks to detect protein complexes in PPI networks. Traditional graph clustering algorithms in PPI analyzing methods primarily focus on hard clustering for a network, while, nowadays soft clustering algorithms to find overlapped clusters have become one of the hotspots of current research. Existing soft clustering algorithms pay less attention on small-scale non-dense clusters, while some small-scale non-dense clusters often have important biological meaning in PPI networks. A measuring method of the association strength of edges is developed based on node neighborhoods in networks, and then a soft clustering algorithm named flow-simulation graph clustering (F-GCL) on the basis of flow simulation is presented to detect complexes in a PPI network. Experiments show that the proposed soft clustering algorithm F-GCL can simultaneously find out overlapping clusters and small-scale non-dense clusters without improving the running time. Compared with MCODE(molecular complex detection), MCL(Markov clustering), RNSC(restricted neighborhood search clustering) and CPM(clique percolation method) algorithms on six Saccharomyces cerevisiae PPI networks, the algorithm F-GCL shows considerable or better performance on three evaluating indicators: F-measure, Accuracy and Separation.
    Related Articles | Metrics
    Recognizing the Same Commodity Entities in Big Data
    Hu Yahui,Li Shijun,Yu Wei,Yang Sha,Gan Lin,Wang Kai,Fang Qiqing
    Journal of Computer Research and Development    2015, 52 (8): 1794-1805.   DOI: 10.7544/issn1000-1239.2015.20150252
    Abstract1451)   HTML2)    PDF (1811KB)(1137)       Save
    The recent blossom of big data and e-commerce has revolutionized our life by providing everyone with the ease and fun never before. How to identify the same commodity entities from these multi-source heterogeneous, fragmented, various and inconsistent e-commerce data for better business intelligence raises a very valuable and challenging topic. In this light, we analyze the characteristics of Web big data and collect the crawled original commodity information data from the different e-commerce platforms, which are the multi-source heterogeneous and mass scales of data. Then, we build an index model based on commodity’s attributes and values, and construct a global model map to record the commodity’s attribute and value, and form the unified model and high efficient commodity information for the next step. And we measure the similarity of the commodity’s identity on the multilayer hierarchical probabilistic model, including identifying the possible candidate commodity set, similarity filtering the candidate commodity set and similarity filtering based on the special items of candidate commodities set. Finally, we output the same commodity set in the inverted index list. We also evaluate our method on the datasets collected from Chinese three main-stream B2C e-commerce platforms with Hadoop framework. Experimental results show the accuracy and effectiveness of our method.
    Related Articles | Metrics
    Sentiment Uncertainty Measure and Classification of Negative Sentences
    Zhang Zhifei, Miao Duoqian, Nie Jianyun,Yue Xiaodong
    Journal of Computer Research and Development    2015, 52 (8): 1806-1816.   DOI: 10.7544/issn1000-1239.2015.20150253
    Abstract1366)   HTML1)    PDF (3554KB)(712)       Save
    Sentiment classification is a powerful technology for social media big data analysis. It is of great importance to predict the sentiment polarity of a sentence, especially a negative sentence that is often used. The negation words and sentiment words play equally important roles in the sentiment classification of negative sentences. A negation word is important when it modifies a sentiment word; but it can also have sentimental implication on its own. The existing methods only consider the negation words when they modify sentiment words. In this paper, a unified classification model based on decision-theoretic rough sets is proposed to deal with the sentiment classification of negative sentences. First, the sentiment value of each clause in a sentence is calculated by several lexicons and the inter-sentence relations. A novel measure of sentiment uncertainty for a sentence is given based on Kullback-Leibler divergence. Then, the negative sentences are represented in terms of four features (initial polarity, sentiment uncertainty, successive punctuations, and sentence type) and especially two negation-related features: single negation and salient adverb. Finally, a novel attribute reduction algorithm based on the decision correlation degree is used to generate the decision rules for sentiment classification of negative sentences. The experimental results show that this model is effective and the sentiment uncertainty measure is helpful to sentiment classification.
    Related Articles | Metrics