ISSN 1000-1239 CN 11-1777/TP


    Journal of Computer Research and Development    2016, 53 (8): 1649-1650.  
    A Ranking Based Poisson Matrix Factorization Model for Point-of-Interest Recommendation
    Yu Yonghong, Gao Yang, Wang Hao
    Journal of Computer Research and Development    2016, 53 (8): 1651-1663.   DOI: 10.7544/issn1000-1239.2016.20160202
    With the rapid growth of location-based social networks (LBSNs), point-of-interest (POI) recommendation has become an important means to meet users’ personalized demands and alleviate the information overload problem. However, traditional POI recommendation algorithms have the following problems: 1) most existing POI recommendation algorithms simplify users’ check-in frequencies at a location, i.e., regardless of how many times a user checks in at a location, they only use binary values to indicate whether a user has visited a POI; 2) matrix factorization based POI recommendation algorithms treat users’ check-in frequencies as ratings in traditional recommender systems and model users’ check-in behaviors using the Gaussian distribution; 3) traditional POI recommendation algorithms ignore that users’ check-in feedback is implicit and that only positive examples are observed in POI recommendation. In this paper, we propose a ranking based Poisson matrix factorization model for POI recommendation. Specifically, we first utilize the Poisson distribution instead of the Gaussian distribution to model users’ check-in behaviors. Then, we use the Bayesian personalized ranking metric to optimize the loss function of Poisson matrix factorization and fit the partial order of POIs. Finally, we leverage a regularization term encoding geographical influence to constrain the process of Poisson matrix factorization. Experimental results on real-world datasets show that our proposed approach outperforms traditional POI recommendation algorithms.
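As a minimal illustration of the ranking idea (not the paper's exact model, Poisson likelihood, or geographical regularizer), the sketch below applies a Bayesian-personalized-ranking style update to nonnegative latent factors, pushing a visited POI's score above an unvisited one's; all sizes, names and the learning rate are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_pois, k = 4, 6, 3
U = rng.random((n_users, k))   # user latent factors (kept nonnegative)
V = rng.random((n_pois, k))    # POI latent factors (kept nonnegative)

def bpr_step(u, i, j, lr=0.1):
    """One gradient step pushing score(u, i) above score(u, j),
    where i is a visited POI and j an unvisited one."""
    x_uij = U[u] @ V[i] - U[u] @ V[j]
    g = 1.0 / (1.0 + np.exp(x_uij))          # d(-log sigmoid(x))/dx
    U[u] += lr * g * (V[i] - V[j])
    V[i] += lr * g * U[u]
    V[j] -= lr * g * U[u]
    np.clip(U[u], 0.0, None, out=U[u])       # keep factors nonnegative,
    np.clip(V, 0.0, None, out=V)             # as a Poisson model requires

before = U[0] @ V[1] - U[0] @ V[2]
for _ in range(50):
    bpr_step(0, 1, 2)
after = U[0] @ V[1] - U[0] @ V[2]
```

After training, the visited POI is ranked higher relative to the unvisited one (`after > before`).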
    Collaborative Filtering Recommendation Algorithm Combining Community Structure and Interest Clusters
    Guo Hongyi, Liu Gongshen, Su Bo, Meng Kui
    Journal of Computer Research and Development    2016, 53 (8): 1664-1672.   DOI: 10.7544/issn1000-1239.2016.20160175
    Traditional collaborative filtering recommendation algorithms suffer from data sparsity, which results in poor recommendation accuracy. Social connections among users reflect their interactions and can be incorporated into recommendation algorithms to improve accuracy. However, most current social recommendation algorithms use only direct social connections, ignoring users’ latent interests and cluster information. In response, this paper proposes a collaborative filtering recommendation algorithm combining community structure and interest clusters. Firstly, an overlapping community detection algorithm is used to detect the community structure in the user social network, so that users in the same community share certain common characteristics. Meanwhile, we design a customized fuzzy clustering algorithm to discover users’ interest clusters, which takes the item-category relationship and users’ activity history as input. Users in the same cluster are similar in generalized interest. We quantify users’ preference for each social community and interest cluster they belong to, respectively. Then, we incorporate these two types of user group information into a matrix factorization model by adding a clustering-based regularization term to the objective function. Experiments conducted on the Yelp dataset show that, in comparison to other methods, including both traditional and social recommendation algorithms, our approach achieves better recommendation accuracy.
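A hedged sketch of the clustering-based regularization idea: a penalty added to the matrix factorization objective that pulls each user's latent vector toward the centroid of every community or interest cluster containing that user. The toy data, weight `lam` and helper name are illustrative, not the paper's exact formulation:

```python
import numpy as np

# 3 users with k=2 latent factors (toy values)
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
# cluster id -> member user ids (communities/interest clusters may overlap)
clusters = {0: [0, 2], 1: [1, 2]}

def cluster_reg(U, clusters, lam=0.5):
    """Sum of squared distances from each member to its cluster centroid,
    scaled by a regularization weight lam (hypothetical)."""
    reg = 0.0
    for members in clusters.values():
        centroid = U[members].mean(axis=0)
        reg += sum(np.sum((U[u] - centroid) ** 2) for u in members)
    return lam * reg
```

This term is added to the usual squared-error MF loss, so gradient steps shrink the gap between a user's factors and those of the groups they belong to.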
    Online Consumptions Prediction via Modeling User Behaviors and Choices
    Zeng Xianyu, Liu Qi, Zhao Hongke, Xu Tong, Wang Yijun, Chen Enhong
    Journal of Computer Research and Development    2016, 53 (8): 1673-1683.   DOI: 10.7544/issn1000-1239.2016.20160103
    The rise of e-commerce sites and the formation of users’ online shopping habits have brought a huge amount of online consumer behavioral data. Mining users’ preferences from these behavioral logs (e.g., clicking data) and then predicting their final consumption choices are of great importance for improving the conversion rate of e-commerce. Along this line, this paper proposes a way of combining users’ behavioral data with a choice model to predict which item each user will finally consume. Specifically, we first estimate the optimal substitute in each consumption session by a utility function over users’ behavioral sequences, and then we build a latent factor based choice model (LF-CM) for the consumed items and the substitutes. In this way, the preferences of users can be computed and future consumptions can be predicted. Going one step further, to make full use of users’ choice information and improve the precision of consumption prediction, we also propose a learning-to-rank model (latent factor and sequence based choice model, LFS-CM), which considers all the items in one session. By integrating latent factors and the utility function of users’ behavioral sequences, LFS-CM improves the prediction precision. Finally, using the real-world Tmall dataset, we evaluate the performance of our methods in a distributed environment. The experimental results show that both LF-CM and LFS-CM perform well in predicting online consumption behaviors.
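One generic way to read the latent-factor choice component is as a multinomial choice over the items browsed in a session: the probability that the user picks item i is a softmax over latent-factor scores. This is only an illustrative reading; the paper's utility function additionally models behavioral sequences, which this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(3)
k = 4
user = rng.normal(size=k)                  # user latent factors (toy)
session_items = rng.normal(size=(5, k))    # 5 candidate items in one session

scores = session_items @ user              # latent-factor utility per item
p = np.exp(scores - scores.max())          # stable softmax
p /= p.sum()                               # choice probabilities over session
```

The predicted consumption is simply the item with the largest probability, i.e. the largest latent-factor utility.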
    Feature Selection Based on the Measurement of Correlation Information Entropy
    Dong Hongbin, Teng Xuyang, Yang Xue
    Journal of Computer Research and Development    2016, 53 (8): 1684-1695.   DOI: 10.7544/issn1000-1239.2016.20160172
    Feature selection aims to select a smaller feature subset from the original feature set. The subset can provide approximate or better performance in data mining and machine learning. Without transforming the physical characteristics of features, fewer features give a more powerful interpretation. Traditional information-theoretic methods tend to measure feature relevance and redundancy separately and ignore the combination effect of the whole feature subset. In this paper, the correlation information entropy, a technique from data fusion, is applied to feature selection. Based on this method, we measure the degree of independence and redundancy among features. A correlation matrix is constructed from the mutual information between features and their class labels and over combinations of feature pairs. Then, taking the multivariable correlation of different features in the subset into consideration, the eigenvalues of the correlation matrix are calculated. On this basis, a feature sorting algorithm and an adaptive, parameterized feature subset selection algorithm are proposed. Experimental results show the effectiveness and efficiency of the proposed algorithms on classification tasks.
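As a sketch of the central quantity, following the standard data-fusion definition of correlation information entropy (the paper's construction of the correlation matrix from mutual information may differ), the entropy of an n×n correlation matrix is computed from its eigenvalues and ranges from 0 (fully correlated features) to 1 (fully independent features):

```python
import numpy as np

def correlation_information_entropy(R):
    """Correlation information entropy of an n x n correlation matrix R:
    H = -sum_i (lam_i / n) * log_n(lam_i / n), with lam_i the eigenvalues
    of R. H = 1 when features are independent (R = I) and H = 0 when they
    are fully correlated (R all ones)."""
    n = R.shape[0]
    lam = np.linalg.eigvalsh(R)
    p = lam / n
    p = p[p > 1e-12]                       # drop zero/negative round-off
    return float(-(p * (np.log(p) / np.log(n))).sum())
```

For example, `correlation_information_entropy(np.eye(3))` is 1, while an all-ones 3×3 matrix gives 0, matching the two extremes above.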
    Consistent Collective Entity Linking Algorithm
    Liu Qiao, Zhong Yun, Liu Yao, Wu Zufeng, Qin Zhiguang
    Journal of Computer Research and Development    2016, 53 (8): 1696-1708.   DOI: 10.7544/issn1000-1239.2016.20160192
    The goal of entity linking is to link entity mentions in a document to their corresponding entities in a knowledge base. The prevalent approaches can be divided into two categories: similarity-based approaches and graph-based collective approaches. Each has its pros and cons. Similarity-based approaches are good at distinguishing entities from the semantic perspective, but usually suffer from ignoring the relationships between entities; graph-based approaches can make better use of the relations between entities, but usually discriminate poorly between similar entities. In this work, we present a consistent collective entity linking algorithm that takes full advantage of the structured relationships between entities contained in the knowledge base to improve its discrimination capability on similar entities. We extensively evaluate the performance of our method on two public datasets, and the experimental results show that our method is effective at improving the precision and recall of the entity linking results. The overall performance of the proposed algorithm significantly outperforms other state-of-the-art algorithms.
    Influence Maximization Across Multi-Channels in Social Network
    Li Xiaokang, Zhang Xi, Sun Hao, Sun Guangzhong
    Journal of Computer Research and Development    2016, 53 (8): 1709-1718.   DOI: 10.7544/issn1000-1239.2016.20160211
    Social networks have widely attracted the interest of researchers in recent years because of their popularity. Influence maximization is one of the most studied problems in the social network field: pick k users from a social network, target them as seeds, and propagate influence via the network, with the goal of maximizing the number of users influenced by the seed nodes. The majority of previous work is based on a single channel. In the real world, however, information is propagated via multiple channels. This paper takes information spread over multiple networks into consideration, and proposes and formulates the influence maximization problem across multiple channels in social networks: pick k seed users from multiple networks and simultaneously propagate influence across these networks, maximizing the number of users influenced by the seed set. We prove the problem is NP-hard under the independent cascade model by a reduction from single-network influence maximization. According to the properties of the problem, we put forward three efficient and effective approximation methods. Experiments show our proposed methods are effective on four real social networks.
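For readers unfamiliar with the baseline, here is a minimal greedy sketch for single-network influence maximization under the independent cascade model, estimating expected spread by Monte Carlo simulation. The multi-channel extension and the paper's three approximation methods are beyond this toy; the graph, activation probability `p` and simulation count are illustrative:

```python
import random

def simulate_ic(graph, seeds, p=0.5, rng=random.Random(7)):
    """One independent-cascade run: each newly active node activates each
    inactive out-neighbor once, with probability p. Returns spread size."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def greedy_seeds(graph, k, rounds=200):
    """Greedily add the node with the largest estimated marginal spread."""
    seeds = []
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    for _ in range(k):
        best = max((n for n in nodes if n not in seeds),
                   key=lambda n: sum(simulate_ic(graph, seeds + [n])
                                     for _ in range(rounds)))
        seeds.append(best)
    return seeds
```

On a toy graph such as `{0: [1, 2, 3], 4: [5]}`, the hub node 0 has the largest expected spread, so the greedy pass selects it first.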
    Self-Adaptive Clustering Based on Local Density by Descending Search
    Xu Zhengguo, Zheng Hui, He Liang, Yao Jiaqi
    Journal of Computer Research and Development    2016, 53 (8): 1719-1728.   DOI: 10.7544/issn1000-1239.2016.20160136
    Cluster analysis is an important research domain of data mining. Under unsupervised conditions, it aims to figure out the class attributes of samples in a mixed data set automatically. For decades, a large number of clustering algorithms have been proposed, each associated with different kinds of prior knowledge. However, some knotty problems remain unsolved in clustering complex data sets, such as the unknown number and miscellaneous patterns of clusters, unbalanced numbers of samples between clusters, and varied densities within clusters. These problems have become difficult focal points of current research. Facing these challenges, a novel clustering method is introduced. Based on a definition of local density and the intuition of ordered density within clusters, the new clustering method finds natural partitions by adaptively searching for the boundaries of clusters. Furthermore, in the clustering process, it can overcome the difficulties mentioned above while avoiding noise disturbance and misclassification. The clustering method is tested on 6 typical and markedly different data sets, and the results show that it has good feasibility and performance. In addition, compared with classic clustering methods and a recently presented algorithm, the new clustering method outperforms them on 2 different evaluation indexes.
    Tensor Representation Based Dynamic Outlier Detection Method in Heterogeneous Network
    Liu Lu, Zuo Wanli, Peng Tao
    Journal of Computer Research and Development    2016, 53 (8): 1729-1739.   DOI: 10.7544/issn1000-1239.2016.20160178
    Mining the rich semantic information hidden in heterogeneous information networks is an important task in data mining. The values, data distribution and generation mechanism of outliers all differ from those of normal data, so analyzing their generation mechanism, or even eliminating outliers, is of great significance. Outlier detection in homogeneous information networks has been studied and explored for a long time, but few studies target dynamic outlier detection in heterogeneous networks, and many issues remain to be settled. Due to the dynamics of heterogeneous information networks, normal data may become outliers over time. This paper proposes a dynamic tensor representation based outlier detection method, called TRBOutlier. It constructs a tensor index tree according to the high-order data represented by the tensor, and adds features to a direct item set and an indirect item set respectively when searching the tensor index tree. Meanwhile, we describe a clustering method based on the correlation of short texts to judge whether the objects in the dataset leave their original clusters, and thereby detect outliers dynamically. This model keeps the semantic relationships in heterogeneous networks as much as possible while substantially reducing the time and space complexity. The experimental results show that our proposed method can detect outliers dynamically in heterogeneous information networks effectively and efficiently.
    Mining Patent Knowledge for Automatic Keyword Extraction
    Chen Yiqun, Zhou Ruqi, Zhu Weiheng, Li Mengting, Yin Jian
    Journal of Computer Research and Development    2016, 53 (8): 1740-1752.   DOI: 10.7544/issn1000-1239.2016.20160195
    Keywords are important clues that can help a user quickly decide whether to skip, to scan, or to read an article. Keyword extraction plays an increasingly crucial role in information retrieval, natural language processing and several other text related research areas. This paper addresses the problem of automatic keyword extraction and designs a novel approach that makes use of patent knowledge, helping the computer learn and understand a document in terms of background knowledge, as a human reader does, and finally pick out keywords automatically. The patent data set is chosen as the external knowledge repository because of its huge amount of data, rich content, accurate expression and professional authority. An algorithm is designed to construct the background knowledge repository from the patent data set, and a method for automatic keyword extraction with novel word features is provided. This paper discusses the characteristics of patent data and mines the relations between different patent files to construct a background knowledge repository for the target document. The patent files related to the target document are used to build the repository, and the information on patent inventors, assignees, citations and classifications is used to mine the hidden knowledge and relationships between different patent files; this related knowledge is then imported to extend the background knowledge repository. Novel word features are derived according to the different kinds of background knowledge supplied by the patent data. The word features reflecting the document’s background knowledge offer valuable indications of individual words’ importance in the target document. The keyword extraction problem can then be regarded as a classification problem, and a support vector machine (SVM) is used to extract the keywords. Experiments have been done using a patent data set and an open data set. The experimental results prove that, using these novel word features, the approach achieves superior performance in keyword extraction compared with other state-of-the-art approaches.
    Protein Function Prediction Using Positive and Negative Examples
    Fu Guangyuan, Yu Guoxian, Wang Jun, Guo Maozu
    Journal of Computer Research and Development    2016, 53 (8): 1753-1765.   DOI: 10.7544/issn1000-1239.2016.20160196
    Predicting protein function is one of the key challenges in the post-genome era. Functional annotation databases of proteins mainly provide the knowledge of positive examples, i.e., proteins known to carry out a given function, and rarely record negative examples, i.e., proteins known not to carry out a given function. Current computational models almost exclusively focus on utilizing the positive examples for function prediction and seldom pay attention to these scarce but informative negative examples. It is well recognized that both positive and negative examples should be used to achieve a discriminative predictor. Motivated by this recognition, in this paper we propose a protein function prediction approach using positive and negative examples (ProPN) to bridge this gap. ProPN first utilizes a directed signed hybrid graph to describe the positive examples, negative examples, interactions between proteins and correlations between functions; it then employs label propagation on the graph to predict protein function. The experimental results on several publicly available proteomic datasets demonstrate that ProPN not only performs better than state-of-the-art algorithms in predicting negative examples for proteins whose functional annotations are partially known, but also outperforms related approaches in predicting functions of proteins whose functional annotations are completely unknown.
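A minimal sketch of the propagation step, assuming a toy signed setting: labels +1 (known positive), -1 (known negative) and 0 (unknown) are spread over a row-normalized interaction graph until they stabilize. This is generic label propagation in the spirit of ProPN; the paper's directed signed hybrid graph over proteins and functions is richer than this toy:

```python
import numpy as np

# Symmetric toy interaction graph over 4 proteins (weights illustrative)
W = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
# +1: carries the function, -1: does not, 0: unknown
y = np.array([1.0, -1.0, 0.0, 0.0])

alpha = 0.8                            # propagation vs. clamping trade-off
P = np.diag(1.0 / W.sum(axis=1)) @ W   # row-normalized propagation matrix
f = y.copy()
for _ in range(100):                   # iterate to (near) convergence
    f = alpha * P @ f + (1 - alpha) * y
```

Protein 2, linked only to the positive example, converges to a positive score; protein 3, linked only to the negative example, converges to a negative one.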
    Petri Nets Based Recognition of Model Deviation Domains and Model Repair
    Du Yuyue, Sun Ya’nan, Liu Wei
    Journal of Computer Research and Development    2016, 53 (8): 1766-1780.   DOI: 10.7544/issn1000-1239.2016.20160099
    Process mining techniques can be used to discover process models from event logs. Conformance checking techniques contrast event logs with a process model and detect deviations between observed behaviors and the model. However, existing process mining techniques focus on discovering these deviations but do not support repairing the process model easily so that it corresponds more closely to the real process. In this paper we therefore propose a token-based method for identifying model deviation domains and a token-based technique for model repair (static model repair and dynamic model repair), built on conformance checking and the dynamic behaviors of workflow nets. Model deviation domains can be identified effectively through the flow direction of tokens, and the process model can then be repaired according to these domains. The approach can also accurately repair real, complex processes with complicated loop and choice structures. The effectiveness and correctness of the techniques are illustrated through contrast experiments and analysis against other techniques.
    Online Transfer Learning for Mining Recurring Concept in Data Stream Classification
    Wen Yimin, Tang Shiqi, Feng Chao, Gao Kai
    Journal of Computer Research and Development    2016, 53 (8): 1781-1791.   DOI: 10.7544/issn1000-1239.2016.20160223
    In the age of big data, data stream classification is being applied in many fields, like spam filtering, market prediction, and weather forecasting, in which recurring concepts are an important characteristic. Aiming to reduce the influence of negative transfer and to mitigate the lag in concept drift detection, RC-OTL is proposed for mining recurring concepts in data streams based on an online transfer learning strategy. When a concept drift is detected, RC-OTL selects one current base classifier to store, and then computes the domain similarities between the current training samples and the stored classifiers in order to select the most appropriate source classifier to combine with a new classifier for learning the upcoming samples, which results in knowledge transfer from the source domain to the target domain. In addition, RC-OTL can select an appropriate classifier for classification when the current classification accuracy falls below a given threshold before concept drift is detected. A preliminary theoretical analysis explains why RC-OTL can reduce negative transfer effectively, and the experimental results further illustrate that RC-OTL efficiently improves the cumulative accuracy of data stream classification and adapts faster to the samples of a new concept after concept drift takes place.
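The combination step can be sketched as a weighted ensemble of a stored source classifier and a new target classifier, with weights decayed exponentially by each model's recent loss, a common online-transfer-style update. The rule, learning rate `eta`, and function names here are illustrative, not RC-OTL's exact scheme:

```python
import math

def combine(p_src, p_new, w_src, w_new):
    """Weighted vote of the source and new classifiers' scores."""
    return (w_src * p_src + w_new * p_new) / (w_src + w_new)

def update_weights(w_src, w_new, loss_src, loss_new, eta=0.5):
    """Exponentially down-weight whichever classifier incurred more loss
    on the latest sample, shifting trust toward the better model."""
    return (w_src * math.exp(-eta * loss_src),
            w_new * math.exp(-eta * loss_new))
```

Over a stream, repeated calls to `update_weights` let the new classifier take over once the recurring source concept stops matching the data, limiting negative transfer.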
    HSSM: A Hierarchical Method for Streaming Submodular Maximization
    Zhang Fenxiang, Chen Huahui, Qian Jiangbo, Dong Yihong
    Journal of Computer Research and Development    2016, 53 (8): 1792-1805.   DOI: 10.7544/issn1000-1239.2016.20160140
    Extracting k elements from a large data stream according to some utility function can be reduced to maximizing a submodular set function. Traditional algorithms already give good solutions for summarizing a static data set by submodular methods, the best known being the standard greedy algorithm. Recent studies have also presented new algorithms, with corresponding distributed solutions, to meet streaming data access and real-time response limits, but those algorithms are not good enough at maximizing the utility gain. In this paper, we propose a new algorithm called HSSM which runs as a pipelined distributed framework and requires only a single pass over the data set; the utility gain of our solution is close to that of the standard greedy solution. We also propose using utility vectors to compress the summarization set and filtering out low-gain objects to improve the original HSSM algorithm. This approach applies to many related problems, such as representative article selection and the k-medoid selection problem. Experimental results show that the improved Spark-HSSM+ method increases the summarization speed in direct proportion to k^2 in contrast to the traditional method. Compared with other distributed algorithms, the result of the Spark-HSSM+ method is the closest to the standard greedy solution.
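A single-pass threshold rule, in the style of sieve-streaming, conveys the flavor of streaming submodular maximization: an element is kept only if its marginal gain clears a threshold derived from a guess of the optimal value. Shown here on a set-coverage utility; the data, threshold form and names are illustrative, not HSSM itself:

```python
def coverage(S, sets):
    """Submodular utility: number of distinct items covered by subset S."""
    return len(set().union(*(sets[i] for i in S))) if S else 0

def stream_select(sets, k, opt_guess):
    """One pass over the stream of sets; keep element i if its marginal
    gain is at least (opt_guess/2 - f(S)) / (k - |S|)."""
    S = []
    for i in range(len(sets)):
        if len(S) < k:
            gain = coverage(S + [i], sets) - coverage(S, sets)
            if gain >= (opt_guess / 2 - coverage(S, sets)) / (k - len(S)):
                S.append(i)
    return S
```

Running several such sieves in parallel for different `opt_guess` values, then returning the best, yields the usual constant-factor guarantee without a second pass.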
    IncPR: An Incremental Parallel PageRank Algorithm
    Jiang Shuangshuang, Liao Qun, Yang Yulu, Li Tao
    Journal of Computer Research and Development    2016, 53 (8): 1806-1818.   DOI: 10.7544/issn1000-1239.2016.20160210
    The PageRank algorithm plays an important role in various areas such as information retrieval and social network analysis. The continuously increasing size of networks and timeliness requirements make repeatedly calculating PageRank for a massive, evolving network enormously expensive. Therefore, IncPR, a PageRank algorithm based on the incremental computation model, is proposed. IncPR reuses existing results from previous computations and modifies the PageRank values according to the changed part of the data set accumulatively via an extended Monte Carlo method, which effectively avoids redundant computation and improves the performance of PageRank computing with no additional storage overhead. Theoretical analysis shows that the computational complexity of IncPR processing an evolving network with m nodes and n edges changed is O((cm+n)R/c^2), where c is the reset probability and R is the number of random walks starting from each node. The computational complexity of IncPR is lower than that of other existing related algorithms. Our evaluations with typical real-world graphs on a Hama cluster also demonstrate that IncPR obtains PageRank values with equal (or even higher) accuracy than the baseline Monte Carlo PageRank algorithm and reduces the amount of computation significantly. In experiments with 0.01% of the data changed, IncPR reduces over 99.9% of the computation.
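The Monte Carlo baseline that IncPR extends can be sketched as follows: run R random walks from each node, terminating each step with reset probability c, and read (unnormalized) PageRank mass off the visit counts. The incremental bookkeeping of stored walk segments, which is the paper's contribution, is omitted; the toy graph and parameters are illustrative, and every node is assumed to appear as a key of `graph`:

```python
import random

def mc_pagerank(graph, c=0.15, R=2000, rng=random.Random(1)):
    """Monte Carlo PageRank: visit frequencies of reset-terminated
    random walks approximate the PageRank distribution."""
    visits = {v: 0 for v in graph}
    for start in graph:
        for _ in range(R):
            v = start
            while True:
                visits[v] += 1
                if rng.random() < c or not graph[v]:
                    break                      # walk ends (reset / dangling)
                v = rng.choice(graph[v])       # follow a random out-edge
    total = sum(visits.values())
    return {v: visits[v] / total for v in visits}
```

On a graph where several nodes all link to node 0, node 0 accumulates the most visits and hence the largest estimated PageRank.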
    SparkCRF: A Parallel Implementation of CRFs Algorithm with Spark
    Zhu Jizhao, Jia Yantao, Xu Jun, Qiao Jianzhong, Wang Yuanzhuo, Cheng Xueqi
    Journal of Computer Research and Development    2016, 53 (8): 1819-1828.   DOI: 10.7544/issn1000-1239.2016.20160197
    Conditional random fields (CRFs) have been successfully applied to various applications in text analysis, such as sequence labeling, Chinese word segmentation, named entity recognition, and relation extraction in natural language processing. Traditional single-node CRFs tools meet many challenges when dealing with large-scale texts: a personal computer hits a performance bottleneck, a server cannot handle the analysis efficiently, and upgrading the server’s hardware to increase computing capability is not always feasible due to cost constraints. To tackle these problems, in light of the idea of “divide and conquer”, we design and implement SparkCRF, a distributed CRFs implementation running in a cluster environment based on Apache Spark. We perform three experiments using the NLPCC2015 and the 2nd International Chinese Word Segmentation Bakeoff datasets to evaluate SparkCRF in terms of performance, scalability and accuracy. Results show that: 1) compared with CRF++, SparkCRF runs almost 4 times faster on our cluster in the sequence labeling task; 2) it has good scalability when adjusting the number of working cores; 3) furthermore, SparkCRF has accuracy comparable to state-of-the-art CRF tools, such as CRF++, in text analysis tasks.