ISSN 1000-1239 CN 11-1777/TP

Table of Contents

01 December 2016, Volume 53 Issue 12
Structured Processing for Pathological Reports Based on Dependency Parsing
Tian Chiyuan, Chen Dehua, Wang Mei, Le Jiajin
2016, 53(12):  2669-2680.  doi:10.7544/issn1000-1239.2016.20160611
Most pathological reports are unstructured text that cannot be directly analyzed by computers. Current research on text structuring mainly focuses on information extraction. However, the syntactic features of pathological reports are distinctive, which makes it more difficult to extract relations between pieces of information. To solve this problem, this paper proposes a novel method for structuring pathological reports based on syntactic and semantic features. First, we construct a synonym lexicon using neural network language models to eliminate synonymy. Then dependency trees are generated from the preprocessed pathological reports to extract medical examination indices. Meanwhile, we use short-sentence segmentation and annotation as optimization strategies to simplify the structure of the dependency trees, which makes the grammatical relations of medical texts clearer and improves the quality of the structured results. Finally, key-value pairs of medical examination indices are extracted from pathological reports in Chinese, and the structured texts are generated automatically. Experimental results on real pathological report data sets show that the proposed method achieves accuracies of 82.91% and 79.11% on the extraction of medical indices and values, respectively, which provides a solid foundation for related studies in the future.
A Method for Road Network Updating Based on Vehicle Trajectory Big Data
Yang Wei, Ai Tinghua
2016, 53(12):  2681-2693.  doi:10.7544/issn1000-1239.2016.20160610
Vehicle trajectory data has become an important source for acquiring and updating road information. However, conventional methods cannot quickly identify the road change type and extract the changed entities from crowdsourced trajectory data. To solve this problem, this paper proposes a new method that uses vehicle trajectory big data to rapidly detect and update changes in the road network. First, the road change type is identified by detecting and classifying road change information using trajectory movement geometry information (direction, turn angle) and traffic semantic information (traffic volume, speed). Through analysis of the trajectory data, real physical changes and traffic semantic changes of roads can be distinguished from each other. Then incremental information is extracted by Delaunay triangulation and traffic flow time series analysis. The method combines road change type identification with incremental data extraction by taking the road segment buffer as the basic unit. Finally, incremental information fusion is conducted according to the road change type. An experiment using taxi GPS trace data in Shenzhen verifies the validity of the method. The experimental results show that the method can identify the road change type and that the accuracy of the incremental data is improved by about 18% compared with the map matching method. Furthermore, a comparative analysis of the road network update results confirms that the method is suitable for layer-level updates.
Evaluation of GPS-Environment Friendliness of Roads Based on Bus Trajectory Data
Ma Liantao, Wang Yasha, Peng Guangju, Zhao Yuxin, He Yuanduo, Gao Jingyue
2016, 53(12):  2694-2707.  doi:10.7544/issn1000-1239.2016.20160626
GPS is the most widely used outdoor positioning system. With the advance of relevant technologies, the positioning accuracy of GPS has been increasing continuously. However, as GPS satellite signals can be blocked by buildings, multi-path error becomes the major cause of positioning error in a city. Evaluating the negative effects of urban environments on GPS error, which is referred to as environment friendliness in this paper, helps predict the range of GPS error on different road segments. Furthermore, it enhances the user experience of location-based services, reveals the relationship between environmental characteristics and multi-path error, and helps determine where to deploy supplementary positioning devices. In this paper, we propose an urban road friendliness evaluation (URFE) approach based on processing and analyzing massive historical bus GPS trajectory data. Specifically, URFE first takes full advantage of the unique features of fixed bus routes to significantly improve the efficiency of data processing. Then it adopts a fault-tolerant method to deal with possible errors in street maps. Finally, URFE completes missing data by utilizing the inherent relationship between the GPS errors of the same cars and roads, and applies an evaluation strategy that takes the differing quality of GPS terminal devices into account. We evaluate the effectiveness of our approach using one month of bus trajectory data within the second ring road of Chengdu. The environment friendliness of 5648 road segments has been evaluated, and the rationality of the results has been verified by checking real satellite maps and street views.
Investment Recommendation Based on Risk and Surplus in P2P Lending
Zhu Mengying, Zheng Xiaolin, Wang Chaohui
2016, 53(12):  2708-2720.  doi:10.7544/issn1000-1239.2016.20160608
Online peer-to-peer (P2P) lending, a new personal wealth distribution and management system, has become a new financing mode for Internet users. A P2P lending platform allows borrowers to create loan listings and investors to bid on and invest in borrowers’ listings directly. A significant issue in P2P lending is how to reasonably match borrowers with investors and then allocate investors’ funds, so as to recommend low-risk, high-return investment decisions to investors. This paper proposes a risk-based total surplus maximization recommendation framework (RTSM), which solves the problem of allocating the investment amount among borrowers’ listings. Specifically, we first adapt various regression methods to evaluate default risk. Then, based on the theory of surplus in economics, we formulate a hypothesis about the surplus of investors and borrowers under default risk. Based on this hypothesis, we combine risk assessment and investment recommendation to maximize the total surplus under default risk. We apply RTSM to two real-world datasets (Prosper and PPDai). Experiments and analysis indicate that RTSM can reduce risk and improve the overall benefits of both investors and borrowers.
A Sensor and User Behavior Data Analysis Based Method of Mobile Learning Situation Perception
Ye Shuyan, Zhang Weizhan, Qi Tianliang, Li Jing, Zheng Qinghua
2016, 53(12):  2721-2728.  doi:10.7544/issn1000-1239.2016.20160633
With the popularity of smartphones and mobile technologies, more and more people are using smartphones to learn and acquire new knowledge. Mobile learning has played a critical role in the field of education for several years. The effectiveness of mobile learning is reflected in the ability to perceive different learning contexts and then provide matching learning resources. Context awareness has become a research hotspot, and its most important aspect is learning situation perception, which allows proper learning resources to be provided according to the specific learning situation. Because of the mobility and complexity of mobile learning, it is difficult to perceive the learning situation. This paper proposes a method that perceives learning situations by combining sensor data and learning operation data, and evaluates it experimentally. Feature values computed from selected sensor data and learning operation indices serve as the inputs of the classification algorithms, and the learning situations reported by students form the training set. The results show that combining sensor data and learning operation data improves the accuracy of learning situation perception, which demonstrates the feasibility and effectiveness of learning situation perception based on sensor data and learning operations.
The Construction and Analysis of Pass Network Graph Based on GraphX
Zhang Tao, Yu Jiong, Liao Bin, Guo Binglei, Bian Chen, Wang Yuefei, Liu Yan
2016, 53(12):  2729-2752.  doi:10.7544/issn1000-1239.2016.20160568
In fields such as social networking, finance, public security, and health care, the application of big data technology is steadily maturing, but its application in competitive sports is still at an exploratory stage. Because pass data is not recorded in basketball technical statistics, statistical analysis, data mining, and applications based on pass data have not been possible. First, since the natural aggregate form of passing data is a graph, we create the pass network graph with GraphX through data acquisition, cleaning and format conversion, and construction of the Vertex and Edge tables, which lays the foundation for other applications. Second, the PlayerRank algorithm is proposed to distinguish the importance of players, and graph vertex colors are personalized by player position, among other measures, which improves the visual quality of the pass network graph. Finally, the pass network graph created with GraphX is used to analyze the effect of passing quantity and quality on the outcome of the game, and also to support analysis of team passing data, tactical player selection, on-the-spot tactical support, subgraph extraction, and improvement of the gaming experience.
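The abstract does not define PlayerRank, so the following is only an illustrative sketch. It assumes a PageRank-style iteration over a weighted pass graph, where an edge (passer, receiver) carries the number of passes. The function name `player_rank`, the damping factor, and the sample pass counts are all hypothetical, and the code is plain Python rather than GraphX.

```python
def player_rank(passes, damping=0.85, iterations=50):
    """PageRank-style importance over a weighted pass graph (assumed scheme).

    passes maps (passer, receiver) -> number of passes.
    """
    players = sorted({p for edge in passes for p in edge})
    n = len(players)

    # Total passes made by each player, used to normalize edge weights.
    out_total = {p: 0 for p in players}
    for (src, _dst), count in passes.items():
        out_total[src] += count

    rank = {p: 1.0 / n for p in players}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in players}
        # Each passer distributes its rank in proportion to pass counts.
        for (src, dst), count in passes.items():
            new_rank[dst] += damping * rank[src] * count / out_total[src]
        # Players who never pass spread their rank uniformly.
        dangling = sum(rank[p] for p in players if out_total[p] == 0)
        for p in players:
            new_rank[p] += damping * dangling / n
        rank = new_rank
    return rank


# Hypothetical pass counts between three players.
sample_passes = {("A", "B"): 10, ("B", "C"): 5, ("C", "B"): 8, ("A", "C"): 1}
ranks = player_rank(sample_passes)
```

In this sample, player B receives most of the passes and ends up with the highest score, while A, who receives none, ends up with the lowest.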
Onboard: A Data-Driven Agile Software Development Collaboration Tool
Chen Long, Ye Wei, Zhang Shikun
2016, 53(12):  2753-2767.  doi:10.7544/issn1000-1239.2016.20160625
Scrum is an agile software development process that balances schedule and flexibility, enabling software development teams to work efficiently and respond to changes quickly at the same time. Each step in the software development process can generate large amounts of data, which can further facilitate team and project management and improve development efficiency if the data are captured, analyzed, displayed, and fed back. However, these data are commonly scattered and under-utilized because project management and source code management are separated in existing software development management toolboxes. To promote a data-driven agile software development process with Scrum at its core, we create Onboard, a cloud-based agile software development collaboration tool which, by associating Git commits with tasks, incorporates agile process management, source code management, and project management into one integrated service for software development teams. Onboard supports end-to-end management of the whole software life cycle, so it can collect all the data generated throughout the development process and extract valuable information. This paper first introduces the design principles and structure of Onboard, and then gives a comprehensive survey of the data visualization and analysis applied in Onboard. In the survey, we propose solutions to a series of related problems on two topics: how to fully utilize the generated data to improve the agile development process, and how to evaluate the contribution of a team member. Finally, the paper identifies topics that need further research.
Mining Software Repositories: Contributors and Hot Topics
Jiang He, Chen Xin, Zhang Jingxuan, Han Xuejiao, Xu Xiujuan
2016, 53(12):  2768-2782.  doi:10.7544/issn1000-1239.2016.20160653
As software is updated and evolves continuously over time, software repositories accumulate massive data. How to effectively collect, organize, and make use of these data has become a key problem in software engineering. Mining Software Repositories (MSR) aims to mine useful knowledge contained in complex and diversified data to improve the quality and productivity of software. Although some studies have carefully summarized the background, history, and prospects of MSR, existing studies do not systematically present the most influential authors, institutions, and countries, or the major research topics and their transitions over time. Therefore, this study combines existing classical publication analysis frameworks and algorithms to analyze the relationships among MSR-related publications and presents important domain knowledge for researchers in detail. To tackle this task effectively, we construct a framework named MSR Publication Analysis Framework (MSR-PAF). MSR-PAF consists of three components, which are used to create a dataset for the study, conduct a bibliography analysis, and perform a collaboration pattern analysis, respectively. The results of the bibliography analysis show that the most productive author, institution, and country are Ahmed E. Hassan, University of Victoria, and USA, respectively. The most frequent keyword is software maintenance, and the most influential author is Abram Hindle. In addition, the results of the collaboration pattern analysis show that Abram Hindle is the most active author, and that open source projects and software maintenance are the most popular research topics.
Representation and Operations Research of k^2-MDD in Large-Scale Graph Data
Dong Rongsheng, Zhang Xinkai, Liu Huadong, Gu Tianlong
2016, 53(12):  2783-2792.  doi:10.7544/issn1000-1239.2016.20160589
Efficient and compact representation and operation of graph data containing hundreds of millions of vertices and edges is the basis for analyzing and processing large-scale graph data. To address this problem, this paper proposes a decision-diagram-based representation of large-scale graph data, k^2-MDD, and provides its initialization and basic operations such as edge query, inner (outer) neighbor query, out(in)-degree computation, and edge insertion (deletion). The representation is optimized and improved on the basis of the k^2 tree: after the adjacency matrix of the graph is divided into k^2 submatrices, it is stored as a multi-valued decision diagram so as to achieve a more compact storage structure. Experimental results on a series of real Web graph and social network graph data sets (cnr-2000, dewiki-2013, etc.) from the LAW laboratory at the University of Milan show that the number of nodes of the k^2-MDD is only 259%-451% of that of the k^2 tree, which achieves the desired effect. Experimental results on random graphs show that the k^2-MDD structure is suitable not only for sparse graphs but also for dense graphs. The k^2-MDD representation retains the compactness and query efficiency of the k^2 tree while supporting efficient operations on the graph model, and thus unifies descriptive and computational power.
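The core idea, recursively dividing the adjacency matrix into k^2 blocks and storing identical sub-blocks only once, can be sketched as follows. This is a minimal illustration and not the authors' data structure: it assumes a dense 0/1 matrix whose side is a power of k, and it shares nodes through a simple unique table (hash-consing). The names `build_shared` and `has_edge` are hypothetical.

```python
def build_shared(matrix, k=2):
    """Split an n x n 0/1 matrix (n a power of k) into k*k blocks recursively.

    Identical sub-blocks are interned in a unique table, so each distinct
    sub-block is stored as exactly one node. Returns (root_id, nodes).
    """
    unique = {}   # node signature -> node id
    nodes = []    # node id -> signature

    def intern(signature):
        if signature not in unique:
            unique[signature] = len(nodes)
            nodes.append(signature)
        return unique[signature]

    def build(row, col, size):
        if size == 1:
            return intern(("leaf", matrix[row][col]))
        step = size // k
        children = tuple(
            build(row + i * step, col + j * step, step)
            for i in range(k)
            for j in range(k)
        )
        return intern(("node", children))

    return build(0, 0, len(matrix)), nodes


def has_edge(root, nodes, n, k, u, v):
    """Edge query: descend one level per step, choosing the block holding (u, v)."""
    node, size, row, col = root, n, u, v
    while nodes[node][0] == "node":
        step = size // k
        i, j = row // step, col // step
        node = nodes[node][1][i * k + j]
        row, col, size = row % step, col % step, step
    return nodes[node][1] == 1


# 4 x 4 adjacency matrix with edges 0->1 and 2->3.
adjacency = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
root, nodes = build_shared(adjacency, k=2)
```

For this matrix a full tree with k = 2 would have 1 + 4 + 16 = 21 nodes, while the shared structure keeps 5, because the two non-empty quadrants are identical and the two empty quadrants are identical.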
A Method of Bayesian Probabilistic Matrix Factorization Based on Generalized Gaussian Distribution
Yan Cairong, Zhang Qinglong, Zhao Xue, Huang Yongfeng
2016, 53(12):  2793-2800.  doi:10.7544/issn1000-1239.2016.20160582
Bayesian probabilistic matrix factorization (Bayesian PMF) is widely used in personalized recommendation systems due to its high prediction accuracy and excellent scalability. However, its accuracy is greatly affected by the sparsity of the initial rating matrix. This paper proposes a new Bayesian PMF method based on the generalized Gaussian distribution, called GBPMF. In this method, the generalized Gaussian distribution (GGD) is adopted as the prior distribution model, and related parameters are adjusted automatically through machine learning to achieve the desired effect. Meanwhile, we apply the Gibbs sampling algorithm to optimize the loss function. Considering the influence of the time difference between ratings on the prediction process, a temporal factor is integrated into the sampling algorithm to optimize the method and improve its prediction accuracy. The experimental results show that our methods GBPMF and GBPMF-T obtain higher accuracy on both sparse and non-sparse matrices, with the latter performing even better. When the matrix is very sparse, the accuracy of Bayesian PMF decreases sharply, while our methods show stable performance.
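For reference, the standard density of the generalized Gaussian distribution is given below. The symbols (location \mu, scale \alpha, shape \beta) follow the common textbook parameterization and are an assumption here, since the abstract does not state the parameterization used in GBPMF.

```latex
p(x \mid \mu, \alpha, \beta)
  = \frac{\beta}{2\alpha\,\Gamma(1/\beta)}
    \exp\!\left( -\left( \frac{|x - \mu|}{\alpha} \right)^{\beta} \right),
  \qquad \alpha > 0,\ \beta > 0 .
```

Setting \beta = 2 recovers the ordinary Gaussian (with variance \alpha^2 / 2), and \beta = 1 gives the Laplace distribution, so the shape parameter lets the prior be heavier-tailed or lighter-tailed than a Gaussian.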
A Reconstruction Method of Spatial Data Using MPS and ISOMAP
Du Yi, Zhang Ting, Huang Tao
2016, 53(12):  2801-2815.  doi:10.7544/issn1000-1239.2016.20150384
Conditional data greatly influence the results when reconstructing spatial data. Reconstructed results often show considerable uncertainty when only sparse conditional data are available, so it is suitable to use indefinite interpolation to reconstruct spatial data. As one of the main indefinite interpolation methods, multiple-point statistics (MPS) can extract the intrinsic features of patterns from training images and copy them to the simulated regions. Because traditional MPS methods use linear dimensionality reduction, which is not suitable for nonlinear data, isometric mapping (ISOMAP) is combined with MPS to address this issue. A method using MPS and ISOMAP is proposed for the accurate reconstruction of unknown spatial data through four steps: constructing a pattern dataset, reducing the dimensionality of patterns, classifying patterns, and extracting patterns. It provides a new approach for dealing with nonlinear spatial data with MPS. The experimental results show that the structural characteristics of the results reconstructed by this method are similar to those of the training images.
Separable Compressive Imaging Method Based on Singular Value Decomposition
Zhang Cheng, Wang Dong, Shen Chuan, Cheng Hong, Chen Lan, Wei Sui
2016, 53(12):  2816-2823.  doi:10.7544/issn1000-1239.2016.20150414
When the measurement matrix in a compressive imaging problem is too large, separable compressive sensing (SCS) can effectively handle the problem at the cost of a certain percentage of additional measurements. However, both separable measurement matrices in existing SCS methods must be row-normalized orthogonal random matrices, which significantly limits their application. In this paper, singular value decomposition (SVD) is introduced into the separable compressive sensing measurement process, which effectively decouples the measurement matrix from the reconstruction matrix: in the sensing stage, the design of the measurement matrix mainly considers physical properties that ease implementation, such as the deterministic structure of Toeplitz or circulant matrices; in the reconstruction stage, it mainly considers the optimization of the reconstruction matrix. By using SVD to optimize the measurement matrix in the reconstruction stage, the reconstruction performance can be effectively improved, especially for Toeplitz and circulant matrices in large-scale image compressive reconstruction. Numerical results demonstrate the validity of the proposed method.
Image Retrieval Based on Texton Correlation Descriptor
Wu Jun, Liu Shenglan, Feng Lin, Yu Laihang
2016, 53(12):  2824-2835.  doi:10.7544/issn1000-1239.2016.20150711
The performance of content-based image retrieval (CBIR) depends to a great extent on the image feature descriptor. Among existing descriptors, the color difference histogram (CDH) has shown great discriminative performance in CBIR. However, it still has some limitations: 1) it only takes the color difference of pixels in the global region into account; 2) it does not consider the spatial structure among pixels. To solve these problems, this paper proposes a novel image representation for CBIR called the texton correlation descriptor (TCD). First, we define uniform regions, which contain the discriminative information of images, and detect them by analyzing the relationship among low-level features (color value and local binary patterns) of pixels. Second, to characterize the contrast and spatial structure information in uniform regions respectively, we propose the color difference feature, which fuses color difference correlation and the global color difference histogram, and the texton frequency feature, which fuses texton frequency correlation and the texton frequency histogram. Finally, by combining these feature vectors, TCD not only characterizes two orthogonal properties, spatial structure and contrast, but also takes these properties in local and global uniform regions into account simultaneously, so that TCD performs better in CBIR. The experimental results show that TCD achieves higher retrieval performance than other descriptors on image datasets, which demonstrates that TCD is more robust and discriminative in CBIR.
Concurrent In-Memory OLAP Query Optimization Techniques
Zhang Yansong, Jiao Min, Zhang Yu, Wang Shan
2016, 53(12):  2836-2846.  doi:10.7544/issn1000-1239.2016.20150613
Recent research has focused not only on query-at-a-time optimizations but also on group-at-a-time optimizations, driven by multicore hardware support and highly concurrent workload requirements. By grouping concurrent queries into a shared workload, high-latency operations such as disk I/O and cache line access can be shared among multiple queries. Existing approaches commonly share query operators such as scan, join, or predicate processing, and try to generate an optimized global execution plan for all the queries. For complex analytical workloads, generating an optimized shared execution plan is challenging. In this paper, we present a template OLAP (on-line analytical processing) execution plan for the widely adopted star schema, which simplifies the execution plan to maximize operator utilization. First, we present a surrogate-key-oriented join index that transforms the traditional key-probing join operation into an array index referencing (AIR) lookup, making the join CPU-efficient and supporting lazy aggregation. Second, the predicate processing of concurrent queries is simplified into a cache-line-conscious predicate vector to maximize concurrent predicate processing within a single cache line access. Finally, we evaluate concurrent template OLAP processing with a multicore parallel implementation under the Star Schema Benchmark (SSB), and the results show that shared scan and predicate processing can double concurrent OLAP query performance.
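The two ideas named above, array index referencing and a shared predicate vector, can be illustrated with a small sketch. It assumes that dimension rows are stored in arrays whose positions equal dense, zero-based surrogate keys, and that each dimension row carries one bit per concurrent query. The table contents, the two sample queries, and the function names are hypothetical, and the sketch ignores cache-line layout and parallelism.

```python
# Hypothetical date dimension: array position == surrogate key.
date_year = [1997, 1997, 1998, 1998]

# Two hypothetical concurrent queries, each with one predicate on the year.
query_predicates = [lambda year: year == 1997, lambda year: year >= 1998]

# Hypothetical fact table columns: foreign key into the dimension and a measure.
fact_date_key = [0, 2, 3, 1, 2]
fact_revenue = [10, 20, 30, 40, 50]


def build_predicate_vector(column, predicates):
    """For each dimension row, set bit q when the row satisfies query q."""
    vector = []
    for value in column:
        bits = 0
        for q, predicate in enumerate(predicates):
            if predicate(value):
                bits |= 1 << q
        vector.append(bits)
    return vector


def shared_scan(foreign_keys, measure, predicate_vector, n_queries):
    """One pass over the fact table answers all queries at once."""
    totals = [0] * n_queries
    for key, value in zip(foreign_keys, measure):
        bits = predicate_vector[key]  # array index referencing, no hash probe
        q = 0
        while bits:
            if bits & 1:
                totals[q] += value
            bits >>= 1
            q += 1
    return totals


predicate_vector = build_predicate_vector(date_year, query_predicates)
totals = shared_scan(fact_date_key, fact_revenue, predicate_vector, 2)
```

Here the join reduces to an array lookup with the foreign key as the index, and a single scan of the fact table produces the aggregate for every concurrent query.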
Novel MapReduce-Based Similarity Self-Join Method: Filter and In-Circle Algorithm
Bao Guanghui, Zhang Zhaogong, Li Jianzhong, Xuan Ping
2016, 53(12):  2847-2857.  doi:10.7544/issn1000-1239.2016.20150794
Similarity self-join is an important operation in many applications. For massive data sets, MapReduce provides an effective distributed computing framework on which similarity self-join can be implemented. Problems remain, however: for example, fine-grained partitioning methods are applied to clustered data areas for load balancing, but they are not easy to implement. Existing algorithms cannot effectively perform similarity self-join operations on massive data sets. In this paper, we propose two novel similarity self-join algorithms on the MapReduce framework, which use coordinate-filtering techniques to obtain valid candidate sets and use the in-circle method on hexagon-based partition areas. The coordinate-filtering techniques are based on equal-width grid partitioning and exploit the fact that the distance between two points is no less than the distance between their projections on the same axis, which prunes a considerable part of the candidate set. We also prove that the hexagon-based partition is the best among all regular partitions. Our experimental results demonstrate that the proposed method outperforms other join algorithms on clustered data areas, improving efficiency by over 80%. The algorithm can effectively solve the similarity self-join problem for massive data in clustered data areas.
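The coordinate-filtering idea on an equal-width grid can be sketched on a single machine as follows. This is an illustration under simplifying assumptions, not the paper's algorithm: it uses two-dimensional points, Euclidean distance, and square cells of width eps, and it omits both MapReduce and the hexagon-based in-circle step. The function name `similarity_self_join` is hypothetical.

```python
from collections import defaultdict
from math import floor, hypot


def similarity_self_join(points, eps):
    """Return sorted index pairs (i, j), i < j, whose distance is <= eps."""
    # Equal-width grid: points within eps of each other fall in the same
    # cell or in adjacent cells.
    grid = defaultdict(list)
    for index, (x, y) in enumerate(points):
        grid[(floor(x / eps), floor(y / eps))].append(index)

    pairs = []
    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i >= j:
                            continue  # report each pair once
                        (x1, y1), (x2, y2) = points[i], points[j]
                        # Coordinate filter: the full distance is never
                        # smaller than the distance along one axis.
                        if abs(x1 - x2) > eps or abs(y1 - y2) > eps:
                            continue
                        if hypot(x1 - x2, y1 - y2) <= eps:
                            pairs.append((i, j))
    return sorted(pairs)


sample_points = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (0.9, 0.9)]
joined = similarity_self_join(sample_points, eps=1.0)
```

The axis check is cheap and discards candidates before the exact distance is computed; the grid limits candidates to neighboring cells in the first place.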
MTruths: An Approach of Multiple Truths Finding from Web Information
Ma Ruxia, Meng Xiaofeng, Wang Lu, Shi Yingjie
2016, 53(12):  2858-2866.  doi:10.7544/issn1000-1239.2016.20150614
The Web has become a massive information repository in which information is scattered across different data sources. Different data sources commonly provide conflicting information about the same entity. Finding the truths among conflicting information is called the truth finding problem. According to the number of attribute values, object attributes can be divided into two categories: single-valued and multi-valued attributes. Most existing truth finding work is designed for single-valued attributes. In this paper, a method called MTruths is proposed to solve the truth finding problem for multi-valued attributes. We model the problem as an optimization problem whose objective is to maximize the total weighted similarity between the truths and the observations provided by data sources. Two methods are proposed to find the optimal solution in the truth finding process: an enumeration algorithm and a greedy algorithm. Experiments on two real data sets show that our approach outperforms existing state-of-the-art techniques in correctness and that the greedy algorithm outperforms them in efficiency.
Uncertainty-Aware Adaptive Service Composition in Cloud Computing
Ren Lifang, Wang Wenjian, Xu Hang
2016, 53(12):  2867-2881.  doi:10.7544/issn1000-1239.2016.20150078
Cloud computing service composition selects appropriate component services from the numerous services distributed across different clouds to build scalable, loosely coupled, value-added applications. Traditional service composition methods are usually divided into a selection stage and a composition stage. Because of the dynamic nature of the cloud computing environment and the stochastic nature of service evolution, it is hard to guarantee that the services with the best performance in the selection stage are still optimal in the execution stage. Focusing on these two characteristics of service composition in cloud computing environments, a service composition model named SC_POMDP (service composition based on POMDP) is built on the partially observable Markov decision process (POMDP), and a Q-learning algorithm is designed to solve the model. SC_POMDP can dynamically select component services with outstanding QoS (quality of service) during the execution of the service composition, aiming to ensure the adaptability of the service composition. Unlike most existing methods, SC_POMDP regards the environment of service composition as uncertain and considers the compatibility between component services, so it is more in line with real situations. Simulation experiments demonstrate that the proposed method can successfully solve service composition problems of different sizes. In particular, when a service failure occurs, SC_POMDP can still select the optimal alternative component services to ensure the successful execution of the composite service. Compared with two existing methods, the composite service selected by SC_POMDP is the best in response time and throughput, which reflects the superior adaptability of SC_POMDP.
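As a reminder of what a Q-learning step looks like, the sketch below applies the standard tabular update to a toy service-selection table. It is not the SC_POMDP algorithm: it ignores partial observability, and the state names, service names, reward, learning rate, and discount factor are all hypothetical.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Standard tabular Q-learning update:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    """
    next_values = q_table.get(next_state, {})
    # A terminal or unseen next state contributes no future value.
    best_next = max(next_values.values()) if next_values else 0.0
    old_value = q_table[state][action]
    q_table[state][action] = old_value + alpha * (
        reward + gamma * best_next - old_value
    )
    return q_table[state][action]


def best_service(q_table, state):
    """Greedy choice: the candidate service with the highest Q-value."""
    return max(q_table[state], key=q_table[state].get)


# Hypothetical composition with two tasks and two candidate services each.
q_values = {
    "task1": {"serviceA": 0.0, "serviceB": 0.0},
    "task2": {"serviceA": 1.0, "serviceB": 2.0},
}
# Suppose invoking serviceA for task1 yielded a QoS-based reward of 1.0.
updated = q_update(q_values, "task1", "serviceA", 1.0, "task2")
```

Because the values are re-estimated from observed rewards during execution, a service that starts failing or slowing down loses value over repeated updates, and the greedy choice shifts to an alternative.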
Invulnerability of Clustering Wireless Sensor Network Towards Cascading Failures
Fu Xiuwen, Li Wenfeng, Duan Ying
2016, 53(12):  2882-2892.  doi:10.7544/issn1000-1239.2016.20150455
Current research on cascading failures in wireless sensor networks (WSNs) mainly focuses on the peer-to-peer (P2P) structure. However, in real scenarios most sensor networks collect and deliver environmental data via a clustering structure. Therefore, by observing the heterogeneity of connections in clustered networks, we construct a cascading failure model of wireless sensor networks by introducing the concepts of “sensing load” and “relay load”. In addition, we discuss the relationship between key parameters of the cascading model and the invulnerability of two typical clustering topologies (i.e., scale-free topology and random topology). To constrain the scale of cascading failures, we also discuss how to select the cluster heads whose capacity is enlarged. The simulation and theoretical results show that network invulnerability is negatively correlated with the proportion of cluster heads p and positively correlated with the allocation coefficient A. When the adjustment coefficient α=1, the invulnerability of the network is optimized. When α<1, choosing cluster heads with fewer cluster-cluster connections is a more efficient way to enhance network invulnerability. When α>1, choosing cluster heads with more cluster-cluster connections is more cost-effective. When α=1, the scale of cascading failures is not related to the selection scheme of cluster heads.