ISSN 1000-1239 CN 11-1777/TP

Table of Content

15 October 2011, Volume 48 Issue 10
Topic Mining for Microblog Based on MB-LDA Model
Zhang Chenyi , Sun Jianling and Ding Yiqun
2011, 48(10):  1795-1802. 
Asbtract ( 1331 )   HTML ( 20)   PDF (2085KB) ( 1820 )  
Related Articles | Metrics
As microblog grows more popular, services like Twitter have become information providers on a web scale. Early work on microblog focused more on its user relationship and community structure, without considering the value of content. So the research on microblog requires a change from solely user’s relationship analysis to its content mining. Although traditional text mining methods have been studied well, no algorithm is designed specially for microblog data, which contain structured information on social network besides plain text. In this paper, we propose a novel probabilistic generative model based on LDA, called MB-LDA, which is suitable to model the microblog data and takes both contact relation and document relation into consideration to help topic mining in microblog. We present a Gibbs sampling implementation for inference of our model, and find not only the topics of microblog, but also the topics focused by contactors according to the final results. Besides, our model can be extended to many texts associated with social networking such as E-mails and forum posts. Experimental results on actual dataset show that MB-LDA model can offer an effective solution to topic mining for microblog.
Algorithm Based on Counting for Mining Frequent Items over Data Stream
Zhu Ranwei, Wang Peng, and Liu Majin
2011, 48(10):  1803-1811. 
Asbtract ( 585 )   HTML ( 2)   PDF (1963KB) ( 627 )  
Related Articles | Metrics
Mining frequent items over data stream has drawn great attention, and large amount of efficient algorithms have been proposed by many researchers over the past decades. Although the classical algorithms are well suited to find frequent items, usually they do not perform well when estimating items’ approximate frequency. To solve this problem, we introduce a series of counter-based algorithms called SRoEC (segment rotative efficient count), SReEC (segment reserve efficient count) and RFreq (reserve frequent). They divide the counter used in classical algorithms and define operations for counters to improve the accuracy of item frequency and avoid the effect of low frequency items. As the experience shows, these algorithms can find Top-K items above the threshold correctly and return their approximate frequency as accurate as possible. Both analysis and experiments demonstrate that under same cost of space, these algorithms return higher count accuracy rate, lower frequency error rate and higher frequency reserve rate on both simulated data set and real data set when compared with the two best classical algorithms (frequent algorithm and space saving algorithm) nowadays. Amongst them, RFreq algorithm shows obvious advantages. What’s more, the algorithms perform much better than classical ones when the data distribution is smooth.
An Algorithm on Probabilistic Frequent Nearest Neighbor Query over Snapshots of Uncertain Database with Locally Correlation
Miao Dongjing, Shi Shengfei, and Li Jianzhong
2011, 48(10):  1812-1822. 
Asbtract ( 353 )   HTML ( 0)   PDF (2885KB) ( 552 )  
Related Articles | Metrics
Research on locally correlated spatial uncertain data recently gets more and more attentions in many practical applications. This paper proposes a novel probabilistic frequent nearest neighbor query over snapshots of uncertain database with locally correlation, which retrievals those objects which can be seen as the nearest neighbor of query point with a certain probability frequently. The existing methods of processing nearest neighbor query over traditional data or uncertain data cannot be used to process this kind of query because of the expensive time consuming. In order to solve this problem, we propose a generic framework, including several filters based on chernoff bound and the dynamic programme used to compute the probability mass function. We provide two filtering methods for two phases. The extended form of chernoff upper bound is used to filter lots of candidates objects at the first phase; while at the second phase, we use the standard form of chernoff bound. Then we also discuss the filters and dynamic programme used to process the extended LC-PFNN query. At last, we conduct experiments both on real and synthetic data set, and the results demonstrate that our method is effective enough, which lays a solid foundation for further research.
TrSVM: A Transfer Learning Algorithm Using Domain Similarity
Hong Jiaming, Yin Jian, Huang Yun, Liu Yubao, and Wang Jiahai
2011, 48(10):  1823-1830. 
Asbtract ( 804 )   HTML ( 5)   PDF (1417KB) ( 1162 )  
Related Articles | Metrics
Transfer learning algorithms focus on reusing related domain data to help solving learning tasks in the target domain. In this paper, we study the problem of inductive transfer learning. Most of the existing algorithms in inductive transfer learning might suffer from the problem of sample selection bias when the number of target domain data is too small. To address this problem, we propose to utilize domain similarity in a new approach. Through detailed discussion about the similarity of related domains, we define the concept of weak domain similarity. Using this concept to give additional constraints on the target classifiers, we develop a simple but effective approach to leverage the useful knowledge from the related domain, so that related domain data can be directly used in the training process. In this way, we are able to make the target classifier less sensitive to the small amount of target training data. Furthermore, we show that a modified SMO method can be applied to optimize the objective function in the algorithm effectively. The new algorithm is referred to as TrSVM, and can be seen as extension of support vector machines for transfer learning. Experiment results on extensive datasets show that TrSVM outperforms support vector machines and the state-of-the-art TrAdaBoost algorithm, and demonstrate the effectiveness of our algorithm.
A Novel Efficient Graph Aggregation Algorithm
Yin Dan, Gao Hong, and Zou Zhaonian
2011, 48(10):  1831-1841. 
Asbtract ( 734 )   HTML ( 0)   PDF (2148KB) ( 519 )  
Related Articles | Metrics
Many real world datasets can be modeled as graphs, where nodes represent objects and edges indicate relationships between nodes. Today, large graphs are common in many domains, such as social networks and road networks. Graph aggregation is a new technique for representing a large-scale graph by a concise graph that can capture the structural and attributive information of the original large graph. Graph aggregation plays an important role in the management, analysis and visualization of graph data. However, there are very few research results in graph aggregation. Moreover, the existing results are far from systematic. The main problems are: 1)depending on specific applications; 2)only considering partial information of graph, such as structures or attributes of original graphs; 3)having strict constraints on users’ interactions and feedbacks. To this end, this paper proposes a new graph aggregation algorithm on directed graphs, which adopts a new quality measuring function for aggregation graphs, characterizing the diversity, coverage, conciseness and utility of aggregated graphs. The algorithm guarantees the quality of aggregation graphs by means of Locality Sensitive Hashing (LSH) according to the Jaccard similarity of node attributes and an entropy-based partitioning method. Experiments on real datasets demonstrate the effectiveness and efficiency of the algorithm.
Probabilistic Reverse Skyline Query Processing on Uncertain Data Streams
Bai Mei, Xin Junchang, Dong Han, and Wang Guoren
2011, 48(10):  1842-1849. 
Asbtract ( 380 )   HTML ( 1)   PDF (1541KB) ( 428 )  
Related Articles | Metrics
Reverse skyline query has played an important role in making effective market decisions. Because the flow property and uncertainty of data are more and more apparent, probabilistic reverse skyline query on uncertain data streams has become a new study task. In order to solve the problem of probabilistic reverse skyline query on uncertain data streams efficiently, firstly, through analyzing practical applications’ requirements, the definition of probabilistic reverse skyline on uncertain data streams is proposed; and then according to the relevant concepts, the index model of probabilistic reverse skyline on uncertain data streams is proposed. Next, through the detailed and in-depth analysis of probabilistic reverse skyline’s properties on uncertain data streams, a novel algorithm, probabilistic reverse skyline on uncertain data streams based on R-tree index (RT2RS), is proposed. RT2RS algorithm makes use of an efficient pruning strategy to avoid a large number of invalid operations. Finally, the performance of RT2RS algorithm is verified by a large number of simulation experiments. The experimental results show that RT2RS algorithm is an effective way to solve the problem of probabilistic reverse skyline on uncertain data streams; it could significantly reduce the execution time of probabilistic reverse skyline query on uncertain data streams and meet the requirements of practical applications.
k-Nearest Neighbors in Uncertain Graph
Zhang Yinglong, Li Cuiping, Chen Hong, and Du Lingxia
2011, 48(10):  1850-1858. 
Asbtract ( 809 )   HTML ( 0)   PDF (1428KB) ( 920 )  
Related Articles | Metrics
In many areas, a lot of data have been modeled by graphs which are subject to uncertainties, such as molecular compounds and protein interaction networks. While many real applications, for example, collaborative filtering, fraud detection, and link prediction in social networks etc, rely on efficiently answering k-nearest neighbor queries (kNN), which is the problem of computing the most “similar” k nodes to a given query node. To solve the problem, in this paper a novel method based on measurement of SimRank is proposed. However, because graphs evolve over time and are uncertainly, the computing cost can be very high in practice to solve the problem using the existing algorithms of SimRank. So the paper presents an optimization algorithm. Introducing path threshold, which is suitable in both determined graph and uncertain graph,the algorithm merely considers the local neighborhood of a given query node instead of whole graph to prune the search space. To further improving efficiency, the algorithm adopts sample technology in uncertain graph. At the same time, theory and experiments interpret and verify that the optimization algorithm is efficient and effective.
k'/k-Dominant Skyline Query over Multiple Time Series
Xu Yajun, Wang Chaokun, Shi Wei,5, Pan Peng,5, and Wei Dongmei,
2011, 48(10):  1859-1870. 
Asbtract ( 539 )   HTML ( 1)   PDF (2209KB) ( 357 )  
Related Articles | Metrics
Time series have been widely used in many fields of nature and society. And they can be divided into single time series and multiple time series. Multiple time series, consisting of interrelated single time series, can describe an object by many aspects. Therefore, multiple time series are more complex than single time series. At present the research of time series mainly focuses on single time series, and the research of multiple time series is relatively little, such as the query over multiple time series. However, multiple time series are very useful in our life. In addition, most of researches of single time series cannot be used in or extended to multiple time series directly, which makes the study of multiple time series necessary. In this paper we give the definition of the dominant relation between multiple time series, and then propose the k'/k-dominant skyline query over multiple time series. We also present the proof of correctness of algorithms in this paper. Finally a set of experiments are conducted on both synthetic and real data to verify the proposed algorithms. The experiment results prove that both of these two algorithms are effective, and GMI algorithm is much more efficient than GMS.
Processing k-Nearest Neighbors Query over Uncertain Graphs
Zhang Xu, He Xiangnan, Jin Cheqing, and Zhou Aoying
2011, 48(10):  1871-1878. 
Asbtract ( 586 )   HTML ( 2)   PDF (1308KB) ( 944 )  
Related Articles | Metrics
Complex networks, such as biological networks, social networks, and communication networks, have been widely studied,and the data extracted from those applications is inherently uncertain due to noise, incompleteness and inaccuracy,so these applications can be modeled as uncertain graphs. The k-nearest neighbors (kNN) is a fundamental query for uncertain graphs, which is to compute the k nearest nodes to some specific node in a graph. In this paper, we design a framework for processing kNN query in uncertain graphs. We firstly propose a new kNN query over uncertain graphs, following which a novel algorithm is proposed to solve the kNN query. Then we optimize this algorithm which greatly improves the efficiency of the kNN query. Theoretical analysis and experimental results show that the proposed algorithm can efficiently retrieve the answer of a kNN query for an uncertain graph.
A Graph-Based Music Data Model and Query Language
Ou Xiaoping, Wang Chaokun, Peng Zhuo, Qiu Ping, and Bai Yiyuan
2011, 48(10):  1879-1889. 
Asbtract ( 607 )   HTML ( 0)   PDF (1832KB) ( 737 )  
Related Articles | Metrics
Graph data models are widely used in various area to present, store and process the data with complicated relationships. Considering the deficiency of existing music data models and query languages, this paper firstly presents a graph-based music data model named Gra-MM to model music data with complicated relationships. The definitions of model’s logical data structure and algebraic operations are given. Then based on Gra-MM, we present a music data query language called Gra-MQL, with the BNF syntax of the language. Gra-MQL can handle the complicated relationships among music data well, and also has the ability to query music by meta data as well as music content data. Gra-MQL meets a variety of user requirements and overcomes the shortcomings of traditional graph query languages which do not have adequate expressive power when dealing with data that have complicated relationships and don’t have the ability to query music based on music content data. Finally, a brief introduction on the prototype of music database management system is given in this paper. Three queries experiments are carried out on this prototype and the performances on different datasets are given, and these experiments show that the model and query language are both feasible in practice and theory.
Keyword Queries over Relational Databases Based on Tuple Combination
Tao Yue, He Zhenying, and Zhang Jiaqi
2011, 48(10):  1890-1898. 
Asbtract ( 433 )   HTML ( 1)   PDF (1572KB) ( 537 )  
Related Articles | Metrics
Databases have been used to organize and retrieve information for many years. In traditional ways, users have to use retrieve languages like SQL to get certain information, this is unfriendly to those who don’t know such languages. So recently, researchers develop a new way to retrieve information from database by the method of keyword query, some works extend the research to aggregate query. However, most of these works by now are aimed to obtain individual tuples to answer the query. In some scenarios, people want to get multiple tuples to make the decision. In order to meet the need of returning multiple tuples, this paper firstly proposes a new concept called combination query to answer keyword query by returning tuple combinations. Some heuristic prune methods are proposed through analysis of the problem, and are integrated to an optimization algorithm. An empirical evaluation on both real data sets and synthetic data sets verifies that the optimization algorithm outperforms the initial algorithm obviously in most cases.
Cache-Based Mobile Data Query Processing Algorithm in MANET
Zhang Yanqing, Li Jinbao, and Guo Longjiang
2011, 48(10):  1899-1907. 
Asbtract ( 325 )   HTML ( 0)   PDF (1615KB) ( 443 )  
Related Articles | Metrics
Aiming at the nature of mobile Ad-Hoc networks, such as self-organization, limited communication bandwidth, multi-hop communication, limited energy, limited storage and the frequent disconnection of the links, etc, a problem of mobile data query based on caching is formulated and proved to be a NP-complete problem. Then a polynomial approximate algorithm which is called MD is given to solve the problem. The MD algorithm uses greedy strategy to query cache nodes which store the largest amount of new data packets, so as to reduce the number of query and minimize the network transmission delay. According to the approximate algorithm for mobile data query, another algorithm is proposed based on the max ratio of new data packets and unit communicate delay of cache nodes, called MDD. It not only takes the number of new data packets into account, but also considers the quality-of-service of the links. This algorithm can improve the efficiency of query, minimize network data transmission delay, reduce energy consumption and improve network throughput. Theoretical analysis and experimental results indicate that the proposed algorithms can make full use of the data of cache nodes, complete data query better, reduce data transmission delay, as well as improve the efficiency of query efficiently.
A Location Index for Range Query in Real-Time Locating System
Guo Chao, Li Kun, Wang Yongyan, Liu Shenghang, and Wang Hongan
2011, 48(10):  1908-1917. 
Asbtract ( 522 )   HTML ( 1)   PDF (1617KB) ( 353 )  
Related Articles | Metrics
The range query of moving objects’ location is very important in many mobile applications, especially in analyzing, decision making, predicting, etc. Real-time locating system (RTLS) is a mobile system using RFID technology with the feature of skew object density. There are always storage wastes or performance decline while using existing indices in real-time locating system because of the skew object density. In this paper, a novel index mechanism called RPI (region partition index) is proposed to answer the range queries in RTLS. It firstly divides the region of the RTLS into sub regions according to the object density, and then indexes the division regions with R-tree. The object locations in these division regions are indexed by grid. Furthermore, this index is optimized to be cache conscious. In the optimized index structure, the object locations in a grid cell are stored in a list of arrays. The size of each array is determined by the size of the CPU cache line. Experimental results show that the new index has better search performance than R-tree and grid, and still keeps quite prominent update performance while object density is skew. The optimized index also brings strong performance improvement because it sharply reduces the cache miss rate in range queries.
OAFTL: An Efficient Flash Translation Layer for Enterprise Application
Qi Xiaoying, Tang Xian, Liang Zhichao, and Meng Xiaofeng
2011, 48(10):  1918-1926. 
Asbtract ( 567 )   HTML ( 0)   PDF (1645KB) ( 344 )  
Related Articles | Metrics
NAND flash based devices usually introduce a software firmware called flash translation layer (FTL) to simulate the flash memory like a block device. FTL is critical to the performance of flash-based devices. Most existing FTL algorithms work normally in embedded systems. However, they behave poorly when there are frequent random accesses in enterprise applications. In this paper, we propose an operation aware flash translation layer (OAFTL) for enterprise-scale storage devices based on page-level mapping scheme. OAFTL manages the entries cached in RAM according to readwrite operations separately. Besides, OAFTL supplies a log page for translation page to relieve frequent updates of translation information to improve performance. The experiment result shows that our OAFTL algorithm works effectively for enterprise workload. In our experiments, OAFTL improves the total performance by more than 20 percent compared with the existing methods.
An Adaptive Page-Replacement Strategy for Spatial Database Systems
Chen Kunjie, Sun Weiwei, Zhu Liang, and Liu Weimo
2011, 48(10):  1927-1934. 
Asbtract ( 503 )   HTML ( 0)   PDF (1122KB) ( 440 )  
Related Articles | Metrics
With the thorough investigations and applications of the spatial database systems in recent years, a page replacement strategy, especially designed for spatial database according to the characters of data organization and queries, has become a new topic. Voronoi diagram, which is a very important spatial data organization technique, performs remarkably in kNN query processing due to its outstanding partition technique and adjacent property. Focused on the spatial database organized by Voronoi diagram, this paper firstly presents a Euclidean distance-based replacement strategy in which a page with the maximum space distance to the last access page will be evicted first. Then, considering the various range of search area in kNN query, we formulate an adaptive page replacement strategy based on the LIRS strategy, whose HIRs proportion of the buffer is self-tuning to adapt to different kinds of queries. Combing these two strategies, an adaptive Euclidean distance-based LIRS page replacement strategy named AELIRS is proposed, which uses LIRS to manage pages history information and the Euclidean distance-based replacement strategy to choose evicted page. AELIRS can balance the temporal locality and spatial locality component dynamically, adaptively and continually. Extensive experiments show that AELIRS outperforms other strategies in a wide range of the buffer size and search area.
Research and Implementation of CFTree* Index on MultiMedia Data in myBUD
Zhang Xiao, Sun Xinyun, , Liu Keyan, Ju Xingxing, and Wang Shan,
2011, 48(10):  1935-1941. 
Asbtract ( 504 )   HTML ( 0)   PDF (1633KB) ( 346 )  
Related Articles | Metrics
The ever-growing unstructure data, such as image, video, audio, web page, etc., bring the challenge of effective unstructured data management(USDM). In this work, CFTree* is proposed to index and manage the multimedia data in our USDM platform—myBUD. The CFTree* index is a hierarchical indexing structure built on the top of cluster feature tree. CFTree* can be leveraged in the approximate kNN query processing. The experimental result shows that the query performance of approximate kNN query based on CFTree* gains about 60% improvement over thaton sequence scan. The approximate kNN query result has lower average precision than exact kNN query result, but it has more diversity.
Scheduling Algorithm for the Reuse Buffers in Column-Store Data Warehouse Query Execution
Zhang Qi, Wang Mei, Le Jiajin, and Liu Guohua,
2011, 48(10):  1942-1950. 
Asbtract ( 493 )   HTML ( 0)   PDF (1438KB) ( 442 )  
Related Articles | Metrics
Reusing intermediates is an important way to improve the performance of query execution. The current column-store systems mainly focus on the reusage of the intermediates in multiple query plans, while large quantities of reusable intermediates in a single query are neglected. The intermediates of a single query are suitable for reusing during the process of execution due to the characteristics of their high certainty and the evaluable amount. To deal with this problem, a novel scheduling algorithm for the reuse buffers is proposed. Firstly, we propose a reusability estimation model based on the relative position of the given operator node in the physical execution tree as well as the estimated volume of the intermediates it produces during execution. Then, we provide the reuse buffer scheduling algorithm based on the results of the reusability estimation model. In the process of query execution for each query plan, the scheduling algorithm is executed on the basis of the results of its reusability estimation model, making the more important intermediates stay longer in the memory than the others, leading the improvement of query performance. The experimental results on the benchmark data set SSB verify the effectiveness of the proposed method.
Study and Implementation of OLAP Performance Benchmark
Zhao Bo and Ye Xiaojun
2011, 48(10):  1951-1959. 
Asbtract ( 814 )   HTML ( 5)   PDF (2614KB) ( 561 )  
Related Articles | Metrics
With the expansion of business intelligence (BI) market, usability evaluation of on-line analytical processing (OLAP) systems has become a hot issue in database industries. How to measure the performance of OLAP systems is an important efficiency aspect that needs to be resolved with an OLAP performance benchmark. This paper presents a unified testing approach for efficiency evaluation of both ROLAP and MOLAP implementation systems based on the APB-1 benchmark introduced by OLAP committee. Firstly, an adequate cube model for MOLAP systems and the corresponding MDX query templates for APB-1 benchmark have been proposed, in which the shortcomings of APB-1 benchmark and their ROLAP star model are discussed during the cube model designing process. Then, experimental results come from new MOLAP model and ROLAP model implementation are compared under the same testing dataset and parameters in order to validate the accuracy of MOLAP model and the design correctness of MDX query templates. At last, according to the purpose of performance comparisons of OLAP systems, the OLAP performance testing procedure and its testing implementation framework for ROLAPMOLAP systems are proposed, in which various modules are explained, and a set of parallel queries has been tested on commercial DBMSs to verify the effectiveness of this framework. The achievement of this study will provide the technical and implemental supports for the performance testing of ROLAPMOLAP products.
A Geographical Databases Watermarking Method Based on Quantization Modulation with Variable Steps
Wang Chuanjian, Ge Hefei, Ding Mao, Peng Zhiyong, Peng Yuwei, Song Wei, and Wang Junzhou
2011, 48(10):  1960-1971. 
Asbtract ( 448 )   HTML ( 0)   PDF (2476KB) ( 358 )  
Related Articles | Metrics
Geographical databases, as one kind of the most important infrastructure of geographical information system, are great treasure of data owners. How to protect the copyright of geographical data effectively using digital watermarking is a critical issue. In this paper, we propose a robust, shape-preserving and blind watermarking method. We compute mean feature distance for each polygon and choose h most significant bits of mean feature distance as the robust identifier of one polygon. All polygons are partitioned into several groups based on their identifiers. Quantization modulation technique with variable steps is exploited to hide watermark into all polygons whose areas are slightly modified to derive watermarked geographical database. To ensure the security of the proposed algorithm, the polygon-group assignment, the watermark bit to be embedded in each polygon and every variable step are all algorithmically determined under the control of a private key known only to the owner of the data. Experimental results show that the proposed watermarking method has good performance and is resilient to translation, rotation, simplification, noise addition, vertices interpolation, cropping, tuple alteration and insertion attacks. Moreover, within the usage range of the geographical data, the robustness of the proposed method is improved with the increase of the watermark strength.