ISSN 1000-1239 CN 11-1777/TP


    Meng Xiaofeng
    Journal of Computer Research and Development    2015, 52 (2): 261-264.  
    Big Data Privacy Management
    Meng Xiaofeng, Zhang Xiaojian
    Journal of Computer Research and Development    2015, 52 (2): 265-281.   DOI: 10.7544/issn1000-1239.2015.20140073
    With the rapid development of information and networking technologies, big data has become a hot topic in both academic and industrial research, and is regarded as a new revolution in the field of information technology. However, it brings not only significant economic and social benefits, but also great risks and challenges for individual privacy protection and data security. Privacy related to big data is now considered one of the greatest problems in many applications. This paper analyzes and summarizes the categories of big data, the properties and types of privacy arising from different causes, and the technical, legal, and regulatory challenges of managing privacy, and compares how current technologies handle those challenges. Finally, this paper proposes an active framework for managing big data privacy grounded in actual privacy problems, and uses this framework to illustrate several privacy-preserving technology challenges in big data.
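    One concrete example of the privacy-preserving technologies this survey points to is differential privacy. The sketch below shows the classic Laplace mechanism for releasing a noisy numeric query answer; it is a generic illustration under the standard definition, not part of the framework proposed in the paper, and the function name and parameters are ours:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Perturb a numeric query answer with Laplace noise of scale
    sensitivity/epsilon, the standard epsilon-differential-privacy mechanism."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    u = rng.random() - 0.5  # uniform in [-0.5, 0.5)
    # Inverse-CDF sample from the Laplace distribution.
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise

# Example: privatize a count query, which has sensitivity 1.
noisy_count = laplace_mechanism(128, 1.0, 0.1, random.Random(7))
```

    Smaller values of epsilon give stronger privacy but noisier answers; the survey's point is that calibrating such trade-offs at big data scale remains an open challenge.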
    Functional Dependencies Discovering in Distributed Big Data
    Li Weibang, Li Zhanhuai, Chen Qun, Jiang Tao, Liu Hailong, Pan Wei
    Journal of Computer Research and Development    2015, 52 (2): 282-294.   DOI: 10.7544/issn1000-1239.2015.20140229
    Discovering functional dependencies (FDs) from relational databases is an important database analysis technique with a wide range of applications in knowledge discovery, database semantics analysis, data quality assessment, and database design. Existing FD discovery algorithms are mainly designed for centralized data and are suitable only for small data sizes; discovering FDs in distributed databases, especially over big data, is far more challenging. In this paper, we propose a novel approach for discovering FDs in distributed big data. First, we run an FD discovery algorithm in parallel on each node and prune the candidate FD set based on the discovered results. Second, we group the candidate FDs according to the features of their left-hand sides, run the discovery algorithm on each candidate group in parallel, and eventually obtain all functional dependencies. We analyze the number of candidate FDs in the different groups, and take data shipment and load balance into account during discovery. Experiments on real-world big datasets demonstrate that our approach is more efficient than previous discovery methods.
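    The core primitive underlying any FD discovery algorithm, centralized or distributed, is testing whether a candidate dependency X→A holds on a set of tuples. A minimal single-node sketch (the function name and the dict-per-row table encoding are illustrative, not from the paper):

```python
def fd_holds(rows, lhs, rhs):
    """Check whether the FD lhs -> rhs holds on rows.

    rows: iterable of dicts mapping attribute name to value.
    lhs:  tuple of left-hand-side attribute names.
    rhs:  right-hand-side attribute name.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = row[rhs]
        # setdefault returns the previously seen value for this key, if any;
        # a differing value means two tuples agree on lhs but not on rhs.
        if seen.setdefault(key, val) != val:
            return False
    return True
```

    In the distributed setting described above, each node would run such a check locally on its partition, and only candidates that survive everywhere would be regrouped and validated globally.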
    Automatically Discovering of Inconsistency Among Cross-Source Data Based on Web Big Data
    Yu Wei, Li Shijun, Yang Sha, Hu Yahui, Liu Jing, Ding Yonggang, Wang Qian
    Journal of Computer Research and Development    2015, 52 (2): 295-308.   DOI: 10.7544/issn1000-1239.2015.20140224
    Data inconsistency is a pervasive phenomenon on the Web that gravely affects the quality of Web information. Existing research on data inconsistency has mainly focused on traditional database applications; consistency over diverse, complicated, rapidly changing, and abundant Web big data remains largely unexplored. Considering multi-source heterogeneous Web data and the 5V features of big data, we present a unified data extraction algorithm and a Web object data model based on three aspects: website structure, characteristic data, and knowledge rules. We study and classify the features of data inconsistency, and establish an inconsistency classifier model, an inconsistency constraint mechanism, and an inconsistency inference algebra. Based on this theory of cross-source Web data consistency, we develop methods for automatically discovering inconsistent Web data via constraint-rule detection and statistical deviation analysis. Combining the strengths of the two methods, we propose an automatic discovery algorithm for inconsistent Web data based on hierarchical probabilistic judgment, built on the Hadoop MapReduce architecture. The framework is applied to big data from multiple B2C e-commerce sites on a Hadoop platform and compared with a traditional architecture and other methods. The experimental results demonstrate the accuracy and efficiency of the method.
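    One of the two detection strategies above, constraint-rule detection, can be pictured with a toy checker that flags entities whose sources disagree on an attribute value. This is a simplified illustration of the idea (field names are invented), not the paper's MapReduce-based algorithm:

```python
from collections import defaultdict

def find_inconsistencies(records, key_field, attr):
    """Group records from different sources by entity key and flag
    entities whose sources report more than one value for attr."""
    values = defaultdict(set)
    for rec in records:
        values[rec[key_field]].add(rec[attr])
    # An entity with two or more distinct values violates the
    # single-value constraint rule for that attribute.
    return {k: v for k, v in values.items() if len(v) > 1}
```

    A MapReduce version of the same rule would shuffle records by entity key and apply the set-cardinality test in the reducer, which is one reason this class of checks parallelizes well.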
    Theme-Aware Task Assignment in Crowd Computing on Big Data
    Zhang Xiaohang, Li Guoliang, Feng Jianhua
    Journal of Computer Research and Development    2015, 52 (2): 309-317.   DOI: 10.7544/issn1000-1239.2015.20140267
    Big data brings tremendous challenges to the traditional computing model because of its inherent characteristics: large volume, high velocity, high variety, and low-density value. On the one hand, the large volume and high velocity require techniques for massive data computation and analysis; on the other hand, the high variety and low-density value make big data computing tasks depend heavily on complex cognitive reasoning. To overcome the coexisting challenges of massive data analysis and complex cognitive reasoning, human-machine collaboration based crowd computing is an effective way to solve big data problems. In crowd computing, task assignment is one of the basic problems, yet current crowdsourcing platforms do not support active task assignment, which iteratively assigns tasks to appropriate workers based on the workers' knowledge background. To address this problem, we propose an iterative theme-aware task assignment framework that can be deployed on existing crowdsourcing platforms. The framework has two components. The first is task modeling, which models the tasks as a graph whose vertices are tasks and whose edges are task relationships. The second is an iterative task assignment algorithm, which identifies the themes of workers from their historical records, computes each worker's accuracy on different themes, and assigns tasks to the appropriate workers. Various experiments validate the effectiveness of our method.
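    The second component can be sketched in a few lines: estimate each worker's per-theme accuracy from historical records, then assign a task to the worker with the highest accuracy on the task's theme. A toy, non-iterative sketch with illustrative data structures (the paper's actual algorithm is iterative and graph-based):

```python
from collections import defaultdict

def theme_accuracy(history):
    """history: iterable of (worker, theme, correct) tuples.
    Returns {(worker, theme): fraction of correct answers}."""
    stats = defaultdict(lambda: [0, 0])  # (worker, theme) -> [correct, total]
    for worker, theme, correct in history:
        s = stats[(worker, theme)]
        s[0] += int(correct)
        s[1] += 1
    return {k: c / t for k, (c, t) in stats.items()}

def assign(task_theme, workers, acc):
    """Pick the worker with the best recorded accuracy on this theme;
    workers with no history on the theme default to 0.0."""
    return max(workers, key=lambda w: acc.get((w, task_theme), 0.0))
```

    The iterative version would re-estimate accuracies as new answers arrive, so assignments improve as the platform learns each worker's themes.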
    Distributed Stream Processing: A Survey
    Cui Xingcan, Yu Xiaohui, Liu Yang, Lü Zhaoyang
    Journal of Computer Research and Development    2015, 52 (2): 318-332.   DOI: 10.7544/issn1000-1239.2015.20140268
    The rapid growth of computing and networking technologies, along with increasingly rich ways of acquiring data, has brought forth a large array of applications that require real-time processing of massive, high-velocity data. As the processing of such data often exceeds the capacity of existing technologies, a class of approaches following the distributed stream processing paradigm has emerged. In this survey, we first review the application background of distributed stream processing and discuss how the technology has evolved to its current form. We then contrast it with other big data processing technologies to help readers better understand the characteristics of distributed stream processing. We provide an in-depth discussion of the main issues involved in distributed stream processing, such as data models, system models, storage management, semantic guarantees, load control, and fault tolerance, pointing out the pros and cons of existing solutions. This is followed by a systematic comparison of several popular distributed stream processing platforms, including S4, Storm, and Spark Streaming. Finally, we present a few typical applications of distributed stream processing and discuss possible directions for future research in this area.
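    To make the paradigm concrete, here is a toy tumbling-window aggregation, the kind of stateful operator that platforms such as Storm or Spark Streaming distribute across nodes (an illustrative sketch tied to no particular platform; the event encoding is ours):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Count events per key within non-overlapping (tumbling) time windows.

    events: iterable of (timestamp, key) pairs.
    Returns {(window_start, key): count}.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window = ts - ts % window_size  # start of the window containing ts
        counts[(window, key)] += 1
    return dict(counts)
```

    A real engine partitions the key space across workers and must additionally handle out-of-order arrival, state recovery, and delivery semantics, which are exactly the issues surveyed above.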
    Big Data Analysis and Data Velocity
    Chen Shimin
    Journal of Computer Research and Development    2015, 52 (2): 333-342.   DOI: 10.7544/issn1000-1239.2015.20140302
    Big data poses three main challenges to the underlying data management systems: volume (a huge amount of data), velocity (high speed of data generation, data acquisition, and data updates), and variety (a large number of data types and data formats). In this paper, we focus on understanding the significance of velocity and discussing how to face the challenge of velocity in the context of big data analysis systems. We compare the requirements of velocity in transaction processing, data stream, and data analysis systems. Then we describe two of our recent research studies with an emphasis on the role of data velocity in big data analysis systems: 1) MaSM, supporting online data updates in data warehouse systems; 2) LogKV, supporting high-throughput data ingestion and efficient time-window based joins in an event log processing system. Comparing the two studies, we find that storing incoming data updates is only the minimum requirement. We should consider velocity as an integral part of the data acquisition and analysis life cycle. It is important to analyze the characteristics of the desired big data analysis operations, and then to optimize data organization and data distribution schemes for incoming data updates so as to maintain or even improve the efficiency of big data analysis.
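    The time-window based join mentioned for LogKV can be pictured with a toy version that joins two event streams on a key when their timestamps fall within a window of each other. This is a quadratic-time illustration of the operation itself, not LogKV's high-throughput implementation:

```python
def time_window_join(left, right, window):
    """Join two event lists of (timestamp, key, value) tuples.

    Emits (key, left_value, right_value) whenever the keys match and the
    timestamps differ by at most `window`. O(n*m) nested-loop toy version.
    """
    out = []
    for tl, kl, vl in left:
        for tr, kr, vr in right:
            if kl == kr and abs(tl - tr) <= window:
                out.append((kl, vl, vr))
    return out
```

    The abstract's point is that a system can do far better than this nested loop by choosing data organization and distribution schemes for incoming updates with the join in mind, e.g. range-partitioning events by time so only neighboring partitions need to be joined.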
    A Survey on PCMBased Big Data Storage and Management
    Wu Zhangling, Jin Peiquan, Yue Lihua, Meng Xiaofeng
    Journal of Computer Research and Development    2015, 52 (2): 343-361.   DOI: 10.7544/issn1000-1239.2015.20140116
    Big data has become a hot topic in both academia and industry. However, due to the limitations of current computer system architectures, big data management faces many new challenges with respect to performance, energy, etc. Recently, a new kind of storage medium called phase change memory (PCM) has introduced new opportunities for advancing computer architectures and big data management, owing to its non-volatility, byte-addressability, high read speed, and low energy consumption. As a non-volatile storage medium, PCM shares some features with DRAM, such as byte-addressability and high read/write performance, and can thus be regarded as a cross-layer storage medium for redesigning the current storage architecture to realize high-performance storage. In this paper, we summarize the features of PCM and present a survey of PCM-based data management. We discuss related advances in two settings: PCM used as secondary storage, and PCM used as main memory. We also introduce current studies on the applications of PCM in various areas. Finally, we propose some future research directions for PCM-based data management, so as to provide valuable references for big data storage and management on new storage architectures.
    A GPU-Accelerated Highly Compact and Encoding Based Database System
    Luo Xinyuan, Chen Gang, Wu Sai
    Journal of Computer Research and Development    2015, 52 (2): 362-376.   DOI: 10.7544/issn1000-1239.2015.20140254
    In the big data era, business applications generate huge volumes of data, making it extremely challenging to store and manage those data. One solution adopted in previous database systems is to employ encoding techniques, which can effectively reduce the size of data and consequently improve query performance. However, existing encoding approaches still cannot achieve a good trade-off among compression ratio, import time, and query performance. To address this problem, we propose a new encoding-based database system, HEGA-STORE, which adopts a hybrid row-oriented and column-oriented storage model. In HEGA-STORE, we design a GPU-assisted encoding scheme that combines rule-based encoding with conventional compression algorithms. By exploiting the computational power of the GPU, we substantially improve the performance of the encoding and decoding algorithms. To evaluate HEGA-STORE, we deployed it at Netease to support log analysis. We compare HEGA-STORE with other database systems, and the results show that it provides better performance for data import and query processing. HEGA-STORE is thus a highly compact, encoding-based database for big data applications.
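    Rule-based encoding can be illustrated with the simplest column code, a dictionary encoding that maps repeated values to small integers; HEGA-STORE's actual GPU-assisted scheme is more elaborate, so treat this purely as background:

```python
def dict_encode(column):
    """Encode a column of values as (dictionary, list of integer codes).
    Repeated values share one dictionary entry, which shrinks the column
    when the number of distinct values is small."""
    dictionary, codes = [], []
    index = {}  # value -> position in dictionary
    for v in column:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def dict_decode(dictionary, codes):
    """Recover the original column from its dictionary encoding."""
    return [dictionary[c] for c in codes]
```

    Because the codes are fixed-width integers, queries such as equality filters can run directly on the encoded column without decoding, which is one reason encoding can improve query performance as well as storage size.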
    An Energy Efficient Algorithm for Big Data Processing in Heterogeneous Cluster
    Ding Youwei, Qin Xiaolin, Liu Liang, Wang Taochun
    Journal of Computer Research and Development    2015, 52 (2): 377-390.   DOI: 10.7544/issn1000-1239.2015.20140126
    It is reported that the electricity cost of operating a cluster may well exceed its acquisition cost, and processing big data requires large-scale clusters running for long periods. Energy-efficient processing of big data is therefore essential for data owners and users, and is also a great challenge for energy use and environmental protection. Existing methods power down some nodes to reduce energy consumption or develop new data storage strategies for the cluster. However, much energy is still wasted even when a minimal set of nodes is used to process a task, and new storage strategies do not suit already-deployed clusters because of the extra cost of data reorganization. In this paper, we propose a novel algorithm, MinBalance, for energy-efficient processing of I/O-intensive big data tasks in heterogeneous clusters. The algorithm has two steps: node selection and workload balancing. In the former, four greedy policies are used to select proper nodes, taking the heterogeneity of the cluster into account. In the latter, the workloads of the selected nodes are balanced to avoid the energy wasted in waiting. MinBalance is a universal algorithm and is unaffected by the underlying data storage strategy. Experimental results indicate that MinBalance can achieve over 60% energy reduction on large datasets compared with traditional methods that power down partial nodes.
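    The workload-balancing step can be pictured with a classical greedy heuristic: assign each data block to the node that would finish it earliest, given per-node processing speeds. This is an illustrative stand-in under our own data model, not the paper's exact MinBalance policy:

```python
def greedy_balance(blocks, speeds):
    """Assign blocks to heterogeneous nodes, minimizing idle waiting.

    blocks: list of block sizes (work units).
    speeds: {node_name: work units processed per unit time}.
    Returns (assignment plan per node, makespan).
    """
    finish = {n: 0.0 for n in speeds}  # projected finish time per node
    plan = {n: [] for n in speeds}
    for size in sorted(blocks, reverse=True):  # place largest blocks first
        # Pick the node that would complete this block the soonest.
        node = min(finish, key=lambda n: finish[n] + size / speeds[n])
        finish[node] += size / speeds[node]
        plan[node].append(size)
    return plan, max(finish.values())
```

    Balancing finish times directly targets the waste the abstract describes: a node that finishes early still draws power while it waits for the stragglers.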
    Survey on Large-Scale Graph Pattern Matching
    Yu Jing, Liu Yanbing, Zhang Yu, Liu Mengya, Tan Jianlong, Guo Li
    Journal of Computer Research and Development    2015, 52 (2): 391-409.   DOI: 10.7544/issn1000-1239.2015.20140188
    In the big data age, there exist close relationships among the huge amounts of multi-modal data. As a popular model for representing relations among different data, graphs are widely used in fields such as social network analysis, social security, and bioinformatics. Fast and accurate search over large-scale graphs is a fundamental problem in graph analysis. In this paper, we survey recent developments in graph pattern matching techniques for graph search from an application perspective. Graph pattern matching techniques are roughly classified into several categories according to the properties of the graphs and the requirements of the applications. We focus on introducing and analyzing exact pattern matching, including non-index matching and index-based matching, together with their key techniques, representative algorithms, and performance evaluations. Finally, we summarize the state-of-the-art applications, challenging issues, and research trends in graph pattern matching.
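    Exact pattern matching boils down to subgraph matching. A naive backtracking matcher over adjacency-set graphs shows the search structure that the surveyed index-based techniques accelerate (illustrative only; real systems add candidate filtering and matching-order heuristics):

```python
def subgraph_match(pattern, target):
    """Find one edge-preserving, injective mapping from pattern to target.

    Graphs are adjacency dicts: {node: set_of_neighbors}.
    Returns a {pattern_node: target_node} dict, or None if no match exists.
    """
    p_nodes = list(pattern)

    def extend(mapping):
        if len(mapping) == len(p_nodes):
            return dict(mapping)
        pn = p_nodes[len(mapping)]  # next pattern node to place
        for tn in target:
            if tn in mapping.values():  # keep the mapping injective
                continue
            # Every already-mapped neighbor of pn must map to a neighbor of tn.
            if all(mapping[q] in target[tn] for q in pattern[pn] if q in mapping):
                mapping[pn] = tn
                found = extend(mapping)
                if found:
                    return found
                del mapping[pn]  # backtrack
        return None

    return extend({})
```

    The exponential worst case of this search is exactly why the survey's index-based methods matter: indexes prune the candidate target nodes tried at each step.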
    Survey of Sign Prediction Algorithms in Signed Social Networks
    Lan Mengwei, Li Cuiping, Wang Shaoqing, Zhao Kankan, Lin Zhixia, Zou Benyou, Chen Hong
    Journal of Computer Research and Development    2015, 52 (2): 410-422.   DOI: 10.7544/issn1000-1239.2015.20140210
    According to their underlying meaning, the edges in some networks can be divided into positive and negative relationships. When these positive and negative edges are marked with plus and minus signs respectively, a signed network is formed. Signed networks are widespread in sociology, information science, biology, and other fields, and have become a research hotspot. Research on the sign prediction problem in signed social networks is valuable for personalized recommendation, abnormal node identification, and user clustering in social networks. This paper focuses on predicting positive and negative links in signed social networks, and describes the current state of research and the latest developments at home and abroad. First we introduce structural balance theory and status theory. Then we classify sign prediction algorithms into two categories according to their main ideas, matrix-based algorithms and classification-based algorithms, and introduce the basic idea of each in detail. We then compare and analyze these algorithms from multiple perspectives such as speed, accuracy, and scalability. Finally, we summarize some regular characteristics of and challenges in sign prediction, and discuss possible directions for future research on signed social networks.
    Multiple Sources Fusion for Link Prediction via Low-Rank and Sparse Matrix Decomposition
    Liu Ye, Zhu Weiheng, Pan Yan, Yin Jian
    Journal of Computer Research and Development    2015, 52 (2): 423-436.   DOI: 10.7544/issn1000-1239.2015.20140221
    In recent years, link prediction has become a popular research topic in link mining for social networks and other complex networks. In link prediction, there usually exist multiple additional sources of information that can be used to improve the prediction of link probabilities. Among all the sources, the major source usually plays the most significant role in prediction, so it is important to design a robust algorithm that makes full use of all the sources while balancing the major source against the additional ones. Meanwhile, traditional unsupervised algorithms based on topological computation are the most widely used methods for scoring candidate links, and the most important step in such methods is constructing a precise input seed matrix. Since many real-world network data are noisy, the accuracy of most link prediction methods is degraded. In this paper, we propose a novel method that takes advantage of the leading seed source matrix together with the additional sources: the seed matrix is combined with the other sources to construct a matrix with less noise and more precise structure, which is then used as the input to traditional unsupervised topological algorithms. Experimental results show that the proposed method achieves better link prediction performance on various kinds of multi-source real-world datasets.
    Event Propagation Analysis on Microblog
    Zhu Xiang, Jia Yan, Nie Yuanping, Qu Ming
    Journal of Computer Research and Development    2015, 52 (2): 437-444.   DOI: 10.7544/issn1000-1239.2015.20140187
    Event propagation analysis is one of the main research issues in social network analysis. A hotspot breaks out and spreads through a social network, making a great impact in a short period of time. Because it is easier to create and spread a hotspot in a social network than in traditional media, information diffusion can harm social security and property if exploited by criminals. Traditional influence propagation analysis methods can only analyze a single microblog (or tweet), which limits event propagation analysis in social networks. In this paper, we review existing propagation models such as the independent cascade model and the linear threshold model, and introduce some basic definitions of influence propagation analysis in social networks. We then propose a method that combines user deduplication, spammer detection, and probabilistic reading on top of the existing independent cascade model. The main idea is to deduplicate users across the key microblogs (or tweets) that make up an event, build the event propagation graph, remove spammers from that graph, and perform influence propagation analysis using a probabilistic reading model. This provides a novel way to analyze event propagation. Finally, experiments are conducted, and the results demonstrate the correctness and effectiveness of the method.
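    The independent cascade model that the method builds on is easy to state: each newly activated node gets one chance to activate each inactive neighbor, succeeding with probability p. A minimal simulation sketch (the uniform probability p and the graph encoding are our simplifications):

```python
import random

def independent_cascade(graph, seeds, p, rng):
    """Simulate one independent-cascade diffusion.

    graph: {node: [neighbors]} directed adjacency lists.
    seeds: initially active nodes.
    p:     activation probability per edge (uniform here for simplicity).
    Returns the set of all nodes activated during the cascade.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        newly_activated = []
        for u in frontier:
            for v in graph.get(u, []):
                # Each edge gets exactly one activation attempt.
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_activated.append(v)
        frontier = newly_activated
    return active
```

    Expected influence is usually estimated by averaging the cascade size over many such runs; the paper's modifications (deduplicated users, removed spammers, probabilistic reading) change which edges exist and how their probabilities are set.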