ISSN 1000-1239 CN 11-1777/TP

Table of Content

01 February 2016, Volume 53 Issue 2
Research on the Big Data Fusion: Issues and Challenges
Meng Xiaofeng and Du Zhijuan
2016, 53(2):  231-246.  doi:10.7544/issn1000-1239.2016.20150874
Asbtract ( 4551 )   HTML ( 81)   PDF (3467KB) ( 3515 )  
Related Articles | Metrics
Data characteristics and realistic demands have changed because of the large-scale data’s links and crossover. The data, which has main features of large scale, multi-source heterogeneous, cross domain, cross media, cross language, dynamic evolution and generalization, is playing an important role. And the corresponding data storage, analysis and understanding are also facing a major challenge. The immediate problem to be solved is how to use the data association, cross and integration to achieve the maximization of the value of big data. Our paper believes that the key to solve this problem lies in the integration of data, so we put forward the concept of large data fusion. We use Web data, scientific data and business data fusion as a case to analyze the demand and necessity of data fusion, and propose a new task of large data fusion, but also summarize and analyze the existing fusion technologies. Finally, we analyze the challenges that may be faced in the process of large data fusion and the problems caused by large data fusion.
Knowledge Representation Learning: A Review
Liu Zhiyuan, Sun Maosong, Lin Yankai, Xie Ruobing
2016, 53(2):  247-261.  doi:10.7544/issn1000-1239.2016.20160020
Asbtract ( 12243 )   HTML ( 296)   PDF (3333KB) ( 20731 )  
Related Articles | Metrics
Knowledge bases are usually represented as networks with entities as nodes and relations as edges. With network representation of knowledge bases, specific algorithms have to be designed to store and utilize knowledge bases, which are usually time consuming and suffer from data sparsity issue. Recently, representation learning, delegated by deep learning, has attracted many attentions in natural language processing, computer vision and speech analysis. Representation learning aims to project the interested objects into a dense, real-valued and low-dimensional semantic space, whereas knowledge representation learning focuses on representation learning of entities and relations in knowledge bases. Representation learning can efficiently measure semantic correlations of entities and relations, alleviate sparsity issues, and significantly improve the performance of knowledge acquisition, fusion and inference. In this paper, we will introduce the recent advances of representation learning, summarize the key challenges and possible solutions, and further give a future outlook on the research and application directions.
Short Text Understanding: A Survey
Wang Zhongyuan, Cheng Jianpeng, Wang Haixun, Wen Jirong
2016, 53(2):  262-269.  doi:10.7544/issn1000-1239.2016.20150742
Asbtract ( 3419 )   HTML ( 11)   PDF (1608KB) ( 3274 )  
Related Articles | Metrics
Short text understanding is an important but challenging task relevant for machine intelligence. The task can potentially benefit various online applications, such as search engines, automatic question-answering, online advertising and recommendation systems. In all these applications, the necessary first step is to transform an input text into a machine-interpretable representation, namely to “understand” the short text. To achieve this goal, various approaches have been proposed to leverage external knowledge sources as a complement to the inadequate contextual information accompanying short texts. This survey reviews current progress in short text understanding with a focus on the vector based approaches, which aim to derive the vectorial encoding for a short text. We also explore a few potential research topics in the field of short text understanding.
Graph-Based Collective Chinese Entity Linking Algorithm
Liu Qiao, Zhong Yun, Li Yang, Liu Yao, Qin Zhiguang
2016, 53(2):  270-283.  doi:10.7544/issn1000-1239.2016.20150832
Asbtract ( 2476 )   HTML ( 12)   PDF (1917KB) ( 2812 )  
Related Articles | Metrics
Entity Linking technology is a central concern of the knowledge base population research area. Traditional entity linking methods are usually limited by the immaturity of the local knowledge base, and deliberately ignore the semantic correlation between the mentions that co-occurr within a text corpus. In this work, we propose a novel graph-based collective entity linking algorithm for Chinese information processing, which not only can take full advantage of the structured relationship of the entities offered by the local knowledge base, but also can make use of the additional background information offered by external knowledge sources. Through an incremental evidence minning process, the algorithm achieves the goal of linking the mentions that are extraced from the text corpus, with their corresponding entities located in the local knowledge base in a batch manner. Experimental results on some open domain corpus demonstrate the validity of the proposed referent graph construction method, the incremental evidence minning process, and the coherence criterion between the mention-entity pairs. Experimental evidences show that the proposed entity linking algorithm consistently outperforms other state-of-the-art algorithms.
Chinese Named Entity Relation Extraction Based on Syntactic and Semantic Features
Gan Lixin, Wan Changxuan, Liu Dexi, Zhong Qing, Jiang Tengjiao
2016, 53(2):  284-302.  doi:10.7544/issn1000-1239.2016.20150842
Asbtract ( 2364 )   HTML ( 21)   PDF (2640KB) ( 2047 )  
Related Articles | Metrics
Named entity relations are a foundation of semantic networks and ontology, and are widely used in information retrieval and machine translation, as well as automatic question and answering systems. In named entity relationships, relationship feature selection and extraction are two key issues. Characteristics of Chinese long sentences with complicated sentence patterns and many entities, as well as the data sparse problem, bring challenges for Chinese entity relationship detection and extraction tasks. To deal with above problems, a novel method based on syntactic and semantic features is proposed. The feature of dependency relation composition is obtained through the combination of their respective dependency relations between two entities. And the verb feature with the nearest syntactic dependency is captured from dependency relation and POS (part of speech). The above features are incorporated into feature-based relationship detection and extraction using SVM. Evaluation on a real text corpus in tourist domain shows above two features from syntactic and semantic aspects can effectively improve the performance of entity relationship detection and extraction, and outperform previously best-reported systems in terms of precision, recall and F1 value. In addition, the verb feature with nearest syntactic dependency achieves high effectiveness for relationship detection and extraction, especially obtaining the most prominent contribution to the performance improvement of data sparse entity relationships, and significantly outperforms the state-of-the-art based on the verb feature.
A Graph-Based Approach for Query Answering Under Inconsistency-Tolerant Semantics
Fu Xuefeng, Qi Guilin, Zhang Yong
2016, 53(2):  303-315.  doi:10.7544/issn1000-1239.2016.20150839
Asbtract ( 1171 )   HTML ( 3)   PDF (2468KB) ( 719 )  
Related Articles | Metrics
Inconsistency often occurs during ontology evolution, and leads to the invalidity of standard reasoning. To tackle this problem, inconsistency-tolerant semantics can be provided for the target language. However, ill-defined inconsistency-tolerant semantics may cost massive calculation and result in losing valuable information. In this paper, a variant of classical inconsistency-tolerant semantics is proposed, named IPAR-semantics. The newly defined inconsistency-tolerant semantics can avoid computing the closure of an ABox w.r.t. the corresponding TBox, thus can reduce the computation time and reserve as much information as possible. Based on the newly defined inconsistency-tolerant semantics, we further propose an approach for consistent query answering based on graph. In our approach, the given ontology and the target query are both transformed into graphs by different rules and stored into graph database. The IPAR-semantics ensure that the inconsistent instances cannot be included in the answering of query and the new approach can avoid redundant rewritings of a user query. Finally, We conduct comparative experiments on the ontologies generated by UOBM generator. In the experiments, we implement the query answering system under IPAR-semantics using our graph-based approach and compare it with the query answering approach under ICAR-semantics. The experimental results show that our approach outperforms in both efficiency and scalability.
Semiring Provenance for Data Fusion
Xue Jianxin, Shen Derong, Kou Yue, Nie Tiezheng, Yu Ge
2016, 53(2):  316-325.  doi:10.7544/issn1000-1239.2016.20150872
Asbtract ( 1474 )   HTML ( 2)   PDF (2286KB) ( 964 )  
Related Articles | Metrics
As an important part of the Web data integration, Web data fusion is the quality assurance of integrated data and the precondition of accurate analysis and mining. However, being a uniform data fusion is treated as black box, which makes the fusion lack of interpretability and debuggable ability. Therefore, to describe fusion process and origin for solving the conflict, we should construct a provenance mechanism with data provenance. Data provenance describes about how data is generated and evolves with time going on, which can not only show which input tuples contribute to the data but also how they contribute. We study the semiring provenance for data fusion. Firstly, we propose an approximate iterative approach to optimal the computational process of semiring provenance. After, to speed up the convergence, we show a Newton-like approach. Recursion may make the situation complicated, we analysize the characteristic of semiring provenance and show that Kleene sequence and Newton-like sequence can convergent only after n step. And experiments show that the technologies in this paper are highly effective and feasible.
Energy Storage System in Green Data Centers: Architecture and Management
Ye Ran, Li Chao, Liang Xiaoyao
2016, 53(2):  326-340.  doi:10.7544/issn1000-1239.2016.20150296
Asbtract ( 1700 )   HTML ( 5)   PDF (2889KB) ( 1226 )  
Related Articles | Metrics
Modern data centers are heavily constrained by their power budget and carbon footprint. Faced with a growing concern about the skyrocketing IT energy expenditure and the looming environmental crisis, academia and industry alike are now focusing more attention than ever on non-conventional data center power provisioning and management strategies. Recently, energy storage system (ESS) has emerged as a key enabler that allows modern data centers to greatly improve energy efficiency and system sustainability. It can help reduce data center operating expenditure (OpEx) through limiting the temporary power demand spikes caused by irregular workloads. On the other hand, it also facilitates the integration of clean and renewable energy sources (i.e. solar energy and wind energy) into the energy portfolio of green data centers. Given the growth in scale and importance of energy storage system in future data center design, this study aims to give a fairly comprehensive view of its architecture design and power management strategies. This paper surveys existing research works in two primary ESS application scenarios: power demand shaving and power supply smoothing. We compare a variety of state-of-the-art proposals and discuss major design considerations (e.g. cost, efficiency, and reliability) of ESS. We conclude today and tomorrow by discussing the opportunity and challenges created by energy storage systems in data centers.
A Lightweight Fine-Grained Fault-Tolerant Scheme for 3D Networks-on-Chip
Zhou Jun, Li Huawei, Wang Tiancheng, Li Xiaowei
2016, 53(2):  341-353.  doi:10.7544/issn1000-1239.2016.20148436
Asbtract ( 1043 )   HTML ( 0)   PDF (3239KB) ( 500 )  
Related Articles | Metrics
3D network-on-chip (NoC) is one of the main trends of the communication technology for 3D integrated circuits (ICs). Each processing element (PE) in the networks communicates with its attached router node through the network interface (NI), and different nodes are connected to their neighbors with the horizontal or vertical links. With the increase of complexity and integration level, the interference to the communication of NoCs becomes more and more serious, and the probability of fault occurrence also rises up. In order to guarantee the normal operation of the circuits, effective fault-tolerant schemes need to be designed for the networks. The routers are one of the main components of NoCs. For most of the existing fault-tolerant schemes, faulty routers are usually completely replaced by the redundant ones or even deprecated. In this paper, we propose a lightweight fine-grained fault-tolerant scheme to take full advantage of the remaining valid resources of the routers in 3D NoCs. Our scheme includes a new reliable micro-architecture designed for the routers and a matching lightweight fault-tolerant routing scheme. Experimental results show that the proposed scheme possesses higher performance, improved reliability and lower overhead compared with the state-of-the-art fault-tolerant schemes for 3D NoCs.
Fault Tolerant Global Scheduling with Backup Priority Promotion
Peng Hao, Han Jianghong, Wei Zhenchun, Wei Xing
2016, 53(2):  354-361.  doi:10.7544/issn1000-1239.2016.20148380
Asbtract ( 1040 )   HTML ( 0)   PDF (1358KB) ( 585 )  
Related Articles | Metrics
Fault tolerance is of great importance in hard real-time systems due to the impossibility of eliminating faults. In such a system the fault tolerant scheduling algorithm plays a critical role for achieving fault tolerance capability. In primary-backup scheme based fault tolerant global scheduling algorithms, the execution window of backup is relatively small. When priority inheritance strategy is adopted, the response time of the backup is likely too long to guarantee deadline requirement. For improving the real time property of the backup, we propose a fault tolerant global scheduling algorithm based on backup priority promotion strategy—FTGS-BPP. In FTGS-BPP, the backup has a higher priority than its corresponding primary so that during the execution the backup suffers less interference. Consequently the response time of the backup is reduced which means better real time performance. FTGS-BPP can achieve fault tolerance with less processors than the algorithms which follow priority inheritance strategy. A backup priority searching algorithm is also proposed. The simulation result shows that, compared with the fault tolerant global scheduling algorithm based on priority inheritance strategy, FTGS-BPP is able to reduce processor requirement significantly when scheduling the same task set.
Efficient Merge Sort Algorithms on Array-Based Manycore Architectures
Shi Song, Li Hongliang, Zhu Wei
2016, 53(2):  362-373.  doi:10.7544/issn1000-1239.2016.20148246
Asbtract ( 1107 )   HTML ( 3)   PDF (2797KB) ( 663 )  
Related Articles | Metrics
Sorting is one of the most fundamental problems in the field of computer science. With the rapid development of manycore processors, it shows great importance to design efficient sort algorithms on manycore architecture. An important trend of manycore processors is array-based manycore architecture. This paper presents two efficient merge sort algorithms on array-based manycore architectures. Our algorithms achieve high performance by using DMA(direct memory access) multi-buffering strategy to improve the memory accessing efficiency, deeply balanced merging strategy to keep load-balancing between cores, SIMD(single instruction multiple data) merging method to exploit fine-grained parallelism and on-chip swap merging strategy to reuse on-chip data. Experimental results on DFMC(deeply-fused many-core) prototype system achieve a sorting rate of 647 MKeys/s(million keys per second), and the sorting efficiency(sorting rate/peak performance) is 3.3x higher than state-of-the-art GPU merge sort on GTX580, and 2.7x higher than the fastest merge sort on Intel Xeon Phi. Additionally, this paper establishes an analytical model to characterize the performance of our algorithms. Based on the analytical model, we analyze the influence of the main structural parameters to the performance of the algorithms, which will contribute to the study of many-core architecture.
Architecture and Key Technologies of In-Cloud Smart Data Appliance
Zhang Dong, Qi Kaiyuan, Wu Nan, Xin Guomao, Liu Zhengwei, Yan Bingheng, Guo Feng
2016, 53(2):  374-389.  doi:10.7544/issn1000-1239.2016.20148277
Asbtract ( 1300 )   HTML ( 2)   PDF (4177KB) ( 797 )  
Related Articles | Metrics
To make up for the gap between big data technologies and industry applications, this paper proposes the models of scalability, customizability and multi-type processing of big data appliance, based on which the in-cloud smart data appliance, i.e. iSDA, is designed and implemented. First, the iSDA is assembled by optimally developing the cooperative computing units, heterogeneous storage and high-speed switching network to take fully advantages of both scale-out and scale-up architectures. Second, iSDA is devised to satisfy diversity requirements of industry big data applications by virtue of hardware customization from light-weight to heavy-load styles, and as well as hybrid software stack including real-time, interaction, streaming and batch processing all accelerated by the in-memory computing engine. Furthermore, in the consideration of the HDFS metadata service bottleneck, MapReduce load skew and HBase cross-domain issue, this paper as well introduces the technologies of multiple metadata servers, load balance algorithm and cross-datacenter big table used in iSDA. The practical use cases in the telecommunication, finance and environmental protection industries show that the proposed architecture and key technologies are feasible and effective, and the comprehensive comparisons with traditional MPP databases and other mainstream Hadoop distributions are also given to detail the advantages of iSDA from both hardware and software aspects.
An Address Cache of Interconnect Network in Parallel Computers
Zhang Jianmin, Li Tiejun, Li Sikun
2016, 53(2):  390-398.  doi:10.7544/issn1000-1239.2016.20148039
Asbtract ( 1024 )   HTML ( 4)   PDF (2211KB) ( 482 )  
Related Articles | Metrics
Most of users are accustomed to utilize the virtual address in their parallel programs running at the scalable parallel computer systems. Therefore the virtual and physical address translation directly affects the performance of the parallel programs. Cache can strongly improve the efficiency of address translation and reduce the latency of translation. In this paper, a new address translation cache (ATC) is proposed for the interconnect network of scalable parallel computer systems. To improve the hit ratio, ATC adopts embedded dynamic random access memory (eDRAM) to store more address translation table items. A new eDRAM refresh mechanism is proposed to hide the refresh operation and avoid the performance loss introduced by refresh. In ATC, there are many reliability techniques, including error correcting code and a novel bypass module. The well-known NPB benchmarks have been run at the 32 compute nodes including ATC. The results show that the ATC has high hit ratio which the average value of 32 nodes is 95.3%. It is indicated that ATC is well designed and has high performance. It also has been compared with three types of typical cache implemented by different capacities SRAM (static random access memory), and the conclusion is the capacity of cache is key factor to improve the hit ratio.
Survey on Transactional Storage Systems Based on Non-Volatile Memory
Shi Wei, Wang Dongsheng
2016, 53(2):  399-415.  doi:10.7544/issn1000-1239.2015.20148358
Asbtract ( 1630 )   HTML ( 5)   PDF (4399KB) ( 1140 )  
Related Articles | Metrics
With the emergence and further widespread use of non-volatile memory, the storage architecture is undergoing fundamental change. Transaction processing technologies in traditional database systems and file systems are mostly designed for rotating disks while they cannot take full advantage of new features of non-volatile memory. To take full advantage of the non-volatile memory characteristics and narrow the gap between system I/O performance and CPU processing performance, transactional storage systems and technologies designed based on non-volatile memory have gained focus and great popularity. In this paper, current status of the software layer transaction processing technologies, which are used in traditional database systems and file systems, are addressed in brief firstly. Then based on the division of non-volatile memory which includes flash and phase-change memory, the existing transactional storage systems based on non-volatile memory are discussed. Finally, the research works are summarized and the possible research directions are pointed out. Among the discussion, for the research of transactional flash-based storage systems, analysis of the optimization of transaction processing technologies using traditional host interface flash storage devices is given first, followed by analysis and comparison of the characteristics of the transaction flash storage systems using dedicated transactional interfaces. For the research of transactional PCM-based transactional storage systems, using PCM in both main memory and external storage environment for transaction processing is analyzed and compared, and the key technologies including the combination of log and cache and fine-grained logging are discussed.
Problems and Optimizations of Refresh for Large-Capacity DRAM
Cui Zehan, Chen Mingyu
2016, 53(2):  416-430.  doi:10.7544/issn1000-1239.2016.20148360
Asbtract ( 2126 )   HTML ( 13)   PDF (2782KB) ( 1095 )  
Related Articles | Metrics
DRAM (dynamic random access memory) is widely used as main memory of computer system, which is of fast speed, high density and low cost. DRAM uses capacitors as basic storage cells, and uses the amount of charges to represent digital “0” and “1”. However, the capacitor charges leak over time, causing data lost. To maintain data integrity, DRAM periodically refreshes all cells-read data out before lost and rewrite into cells. Refresh operations block normal memory requests, causing performance overhead; refresh operations also consume extra power, causing energy overhead. The refresh overheads are related to DRAM density. In the past, DRAM density was relative small, and the amount of cells needing to be refreshed was not that large, so the overheads gain little attention. But as the evolving of Moore’s Law, DRAM density grows to Gigabits today, and more cells need to be refreshed during the same period, exacerbating the performance and energy overheads. The problem of refresh has now been an important concern for both industry and academia. In this paper, we first introduce how refreshes are performed, its overheads, and some improvements from industry; then we classify the many improvements from industry and academia into two categories-reducing the blocking of memory requests, and reducing the unnecessary refreshes-and give our analysis and summaries, respectively; finally, we conclude the research work and point out the possible research directions.
A Dynamic Replica Management Mechanism Based on File Support Degree
Xiao Zhongzheng, Chen Ningjiang, Jia Jionghao, Zhang Wenbo
2016, 53(2):  431-442.  doi:10.7544/issn1000-1239.2016.20148327
Asbtract ( 1067 )   HTML ( 3)   PDF (2650KB) ( 422 )  
Related Articles | Metrics
Replication-based management schema is an important fault tolerance mechanism in large scale distributed storage systems. In response to the demand of dynamic replication management in distributed storage systems, a file popularity index named file support degree and its computation model are proposed. Within this model, file’s parameters are periodically collected. By combination of self-correlation of file support degree, file hits in previous collection cycle, accessed data volume and file’s grade, a model that exactly reflects files’ replication requirement is built. To adapt to the variable system load, the model dynamically adjusts its parameters, making the replication decision-making to reflect real system status. Based on these work, some algorithms like load balancing, replication adjustment and replication clearing are designed. To avoid a single data storage node being overloaded, a data storage nodes’ load-balance strategy is proposed. In this strategy, data storage nodes are divided into 3 groups: a holding group, an acceptable group and a begging group. There are 2 periodic procedures in the system, including replication adjusting procedure and replication clearing procedure. In replication adjusting procedure, top P files are replicated to data storage nodes selected based on the load-balance strategy. Replication clearing procedure is a long-periodic procedure, because it needs many adjusting procedures to make the begging group be empty. This dynamic replication management mechanism is proven effective through the given experimentations.
Study and Implementation of Frequent Sequences Mining Based Prefetching Algorithm
Wang Fang, Wang Peiqun, Zhu Chunjie
2016, 53(2):  443-448.  doi:10.7544/issn1000-1239.2016.20148040
Asbtract ( 1349 )   HTML ( 4)   PDF (1496KB) ( 735 )  
Related Articles | Metrics
Prefetching technology is widely used as an efficient means to improve the performance of storage systems. However, traditional prefetching algorithms are mostly based on detecting sequential access features, which makes them hard to work in the environment with less or no sequential access features. Whats worse, the storage system may even suffer from negative effects with poor prefetching accuracy. Whereas the proposed prefetching algorithm based on frequent sequences mining can make some contributions to the storage system in such environment by analyzing the behavior of the data accessing to find the potential rules. Meanwhile, in some application scenarios where the cache capacity may be limited, such as the embedded system, the proposed prefetching algorithm improves the prefetching accuracy to avoid some adverse impacts which may be caused by prefetching. The new proposed prefetching algorithm is based on the frequent sequences mining technology, and the prefetching rules derived from the mined frequent sequences are organized in a Trie tree. To improve the accuracy of the prefetching, the multistep matching technology and the subtree partitioning technology are introduced, which can subtly control the using of prefetching rules, so that the prefetching algorithm with relatively high prefetching accuracy can efficiently improve the performance of the storage system.
WR Approach: Determining Accurate Attribute Values in Big Data Integration
Zhou Ningnan, Sheng Wanxing, Liu Ke-yan, Zhang Xiao, Wang Shan
2016, 53(2):  449-458.  doi:10.7544/issn1000-1239.2016.20148275
Asbtract ( 1421 )   HTML ( 0)   PDF (1931KB) ( 964 )  
Related Articles | Metrics
Big data integration lays the foundation for high quality data-driven decision. One critical section thereof is to determine the accurate attribute values from records in data pertaining to a given entity. The state-of-the-art approach R-topK argues to design rules to decide relative accuracy among the attribute values and thus obtain accurate values. Unfortunately, in cases where multiple true values or conflicted rules exist, it requires rounds of human intervention. In this paper, we propose a weighted rule (WR) approach for determining accurate attribute values in big data integration. Each rule is augmented with weight and thus avoid human intervention when conflicts occur. This paper designs a chase procedure-based inference algorithm, and proves that it can figure out weighted constraints over relative accuracy among attribute values in O(n\+2), which introduces constraints for finding accurate data values. Taking conflicts among constraints into consideration, this paper proposes an O(n) algorithm to discover accurate attribute values among the combination of data values. We conduct extensive experiments under real world and synthetic datasets, and the results demonstrate the effectiveness and efficiency of WR approach. WR approach boosts performance by factor of 3-15x and improves effectiveness by 7%-80%.
Large-Scale Heterogeneous Data Co-Clustering Based on Nonnegative Matrix Factorization
Shen Guowei, Yang Wu, Wang Wei , Yu Miao, Dong Guozhong
2016, 53(2):  459-466.  doi:10.7544/issn1000-1239.2016.20148284
Asbtract ( 1391 )   HTML ( 3)   PDF (2337KB) ( 722 )  
Related Articles | Metrics
Heterogeneous information network contains multi-typed entities and interactive relations. Some co-clustering algorithms have been proposed to mine underlying structure of different entities. However, with the increase of data scale, the scale of different class entities are growing unbalanced, and heterogeneous relational data are becoming extremely sparse. In order to solve this problem, we propose a two steps co-clustering algorithm FNMTF-CM based on correlation matrix decomposition. In the first step, the correlation matrix is built with the correlation relationship of smaller-typed entities and decomposed into indicating matrix of smaller-typed entity based on symmetric nonnegative matrix factorization. Correlation matrix has higher dense degree and smaller size compared with the original heterogeneous relationship matrix, so our algorithm can process large-scale heterogeneous data and maintain a high precision. After that, the indicating matrix of smaller-typed can be used as the input directly, so the heterogeneous relational matrix tri-factorization is very fast. Experiments on artificial and real-world heterogeneous data sets show that the accuracy and performance of FNMTF-CM algorithm are superior to the traditional co-clustering algorithms based on nonnegative matrix factorization.
Multi-Way Join Optimization Approach Based on MapReduce
Li Tiantian, Yu Ge, Guo Chaopeng, Song Jie
2016, 53(2):  467-478.  doi:10.7544/issn1000-1239.2016.20148281
Asbtract ( 1239 )   HTML ( 3)   PDF (2708KB) ( 605 )  
Related Articles | Metrics
Multi-way join is one of the most common data analysis operations, and MapReduce programming model that has been widely used to process large scale data sets has brought new challenges to multi-way join optimization. Traditional optimization approaches cannot be simply adapted to fit MapReduce feature, so there is still optimization room for MapReduce join algorithm. As to the former, we think I/O is the main cost of join. This paper first proposes an I/O cost based heuristic algorithm to initially determine a join sequence, and conducts further optimization. After the optimization, we also design a parallel execution algorithm to improve the whole performance of multi-way join. As to the latter, we think load balancing can effectively decrease the “buckets effect” of MapReduce. This paper proposes a fair task load allocation algorithm to improve the intra-join parallelism, and also analyzes the method to decide the appropriate number of Reduce tasks. Experiments verify the effectiveness of the proposed optimization approaches. This study contributes to multi-way join applications in big data environment, such as the star-join in OLAP and the chain-join in social network.
Reachability Query in Heterogeneous Information Networks
Yin Dan, Gao Hong, Zou Zhaonian, Li Jianzhong
2016, 53(2):  479-491.  doi:10.7544/issn1000-1239.2016.20148274
Asbtract ( 1388 )   HTML ( 2)   PDF (2935KB) ( 674 )  
Related Articles | Metrics
With the size of graph data increasing explosively, the form of graph data is much more complicated. Heterogeneous information networks can be modeled as graphs, which contain multiple types of nodes and multiple types of edges, e.g., bibliographic database, online shopping website and knowledge graphs. Reachability query in heterogeneous information networks is investigated in this paper. By using different types of relationships between nodes, the reachability of nodes is queried while satisfying specific path schema. This problem is polynomial. However, the time costing can’t be tolerant by scanning the big network for answering one query. The existing reachability work can be classified into two categories: K-hop reachability query and label-constraint reachability query. But these techniques can’t be used for processing path-based reachability query in heterogeneous information networks. Therefore, in order to respond online queries efficiently, a novel index structure is proposed, which decomposes path schemas and precomputes the reachability of nodes in sub-path schemas. Online query is computed efficiently by decomposing the query path schema and using the reachability of the indexes. A path schema decomposition strategy is developed by searching the partial order graph of path schemas in order to minimize the query time. Experiments on real world and synthetic data demonstrate the effectiveness of algorithms for reachability query in heterogeneous information networks.
Link Inference in Large Scale Evolutionable Knowledge Network
Zhao Zeya, Jia Yantao, Wang Yuanzhuo, Jin Xiaolong, Cheng Xueqi
2016, 53(2):  492-502.  doi:10.7544/issn1000-1239.2016.20148283
Asbtract ( 1378 )   HTML ( 9)   PDF (2219KB) ( 1246 )  
Related Articles | Metrics
In the era of network big data, the spatiotemporal information of knowledge is richly stored in knowledge networks, such as the building time of links, the lifetime of vertices, etc. Traditional knowledge network representation models are mostly blind to either the spatial or the temporal information of vertices and links in the network. It has been verified in the literature that considering the spatial or the temporal information of vertices and links can promote the performance of link inference in knowledge networks. In this paper, we propose an evolutionable knowledge network model, i.e., a heterogeneous knowledge network, in which vertices and edges are anchored in both time and space dimensions. Then based on the model, we further study the link inference problem on evolutionable knowledge networks. Specifically, we firstly define the link extendable patterns to characterize the link formation process, and then propose a knapsack constrained link inference method to turn the link inference problem into a combinatorial optimization problem with the knapsack-like constrains. The dynamic programming technique is used to solve the optimization problem in pseudo-polynomial time complexity. Experiments on real data sets suggest the better effectiveness and scalability of our proposed method over large-scale networks than the state-of-the-art methods.