ISSN 1000-1239 CN 11-1777/TP

Table of Content

15 October 2006, Volume 43 Issue 10
A Text Classification Method Based on Term Frequency Classifier Ensemble
Jiang Yuan and Zhou Zhihua
2006, 43(10):  1681-1687. 
Asbtract ( 563 )   HTML ( 3)   PDF (531KB) ( 1012 )  
Related Articles | Metrics
In this paper, a method of text classification based on term frequency classifier ensemble is proposed. Term frequency classifier is a kind of simple classifier obtained after calculating terms' frequency of texts in the corpus. Though the generalization ability of term frequency classifier is not strong enough, it is a qualified base learner for ensemble because of its low computational cost, flexibility in updating with new samples and classes, and the feasibility of improving generalization with the help of ensemble paradigms. An improved AdaBoost algorithm is used to build the ensemble, which employs a scheme of compulsive weights updating to avoid early stop. Therefore it is more suitable for text classification. Experimental results on the corpus of Reuters-21578 show that the proposed method can achieve good performance in text classification tasks.
Research on the Algorithm of Feature Selection Based on Gini Index for Text Categorization
Shang Wenqian, Huang Houkuan, Liu Yuling, Lin Yongmin, Qu Youli, and Dong Hongbin
2006, 43(10):  1688-1694. 
Asbtract ( 634 )   HTML ( 3)   PDF (485KB) ( 856 )  
Related Articles | Metrics
With the rapid development of World Wide Web, large numbers of documents are available on the Internet. Automatic text categorization becomes more and more important for dealing with massive data. Text categorization has become a key technology in organizing and processing large amount of text data. For most classifiers using vector space model (VSM), text preprocessing has become the bottleneck of categorization. High dimensionality of the feature space is impossible for many classifiers. So adopting appropriate text feature selection algorithms to reduce the dimensionality of the feature space is becoming the key role. At present, there are many text feature selection algorithms. In this paper, all these text feature selection methods are not discussed in detail. but another new text feature selection method—Gini index is presented. Improved Gini-index is used for text feature selection, constructing the measure function based on Gini-index. The experiment results show that the text feature selection based on Gini index can improve the categorization performance further, and that its complexity of computing is small.
A Fuzzy Clustering Technology Based on Hierarchical Neural Networks for Web Document
Lei Jingsheng, Ma Jun, and Jin Ting
2006, 43(10):  1695-1699. 
Asbtract ( 344 )   HTML ( 1)   PDF (372KB) ( 591 )  
Related Articles | Metrics
A multilayer vector space model is proposed in this paper. The model partitions a document into many text paragraphs, and the text weight is defined according to the text paragraphs' position. A simple and effective fuzzy clustering approach is presented. A three-layer hierarchical clustering neural network is developed to cluster the Web documents into some predefined categories or topics. The fuzzy clustering approach differs from existing clustering-based methods. First, a fuzzy competitive neural network is exploited as a data pre-processor to extract a number of subclusters which can be viewed as an initial fuzzy clustering from Web documents. Secondly, based on the initial fuzzy clustering, a fuzzy C-means (FCM) clustering algorithm is used to decide the optimal number of fuzzy clustering. The experimental results show that the Web documents focusing on a subject are rather completely and exactly clustering together.
Proximal Support Vector Machine Based on Prototypal Multiclassfication Hyperplanes
Yang Xubing and Chen Songcan
2006, 43(10):  1700-1705. 
Asbtract ( 325 )   HTML ( 0)   PDF (569KB) ( 604 )  
Related Articles | Metrics
Proximal support vector machine via generalized eigenvalues (GEPSVM) casts away the parallelism condition on the canonical planes of the traditional support vector machines (SVM) and analytically seeks two hyperplanes such that each plane is close to the samples of its class and meanwhile far away from the samples of the other classes. Compared with the SVM, GEPSVM does not need quadratic programming and can gain comparable classification performance to SVM. Despite these advantages, GEPSVM is a binary classifier and can not separate multi-class datasets directly. Moreover, it is hard to theoretically set the regularization parameter in it and the generalized eigen-equation problem may be ill-conditioned. In this paper, a novel method, proximal SVM based on prototypal multi-classification hyperplanes (MHPSVM) is proposed, which can directly obtain multi-prototypal hyperplanes for multiple-class classification. Finally, experimental results on both artificial and benchmark datasets show that the classification performance of MHPSVM can be significantly higher than that of GEPSVM, especially in multi-class classification.
An Example Selection Method for Active Learning Based on Embedded Bootstrap Algorithm
Tian Chunna, Gao Xinbo, and Li Jie
2006, 43(10):  1706-1712. 
Asbtract ( 475 )   HTML ( 2)   PDF (748KB) ( 682 )  
Related Articles | Metrics
In order to reduce the redundancy of the original Bootstrap example selection algorithm, an embedded Bootstrap (E-Bootstrap) strategy is proposed, which sieves elaborately through extremely large training data sets for more typical examples relevant to the learning problem. Through formulating, the E-Bootstrap and Bootstrap algorithms are compared from two aspects, which indicate that the E-Bootstrap algorithm with almost the same training time selects more utility examples to represent a potentially overwhelming quantity of training data. Thus computational resource constraints to the size of training example set can be handled to some degree, and more effective predictor can be trained. Furthermore, both the above algorithms are applied to the negative example selection for AdaBoost based face detection system. Two experiments are implemented according to each aspect of theoretical analysis. And the results show that the E-Bootstrap outperforms the Bootstrap in getting rid of the redundant examples in Bootsrap sampling to obtain more representative training example set. Moreover, the E-Bootstrap algorithm is applicable to many other example-based active learning methods.
SEFNN—A Feed-Forward Neural Network Design Algorithm Based on Structure Evolution
Li Ning, Xie Zhenhua, Xie Junyuan, and Chen Shifu
2006, 43(10):  1713-1718. 
Asbtract ( 413 )   HTML ( 2)   PDF (512KB) ( 636 )  
Related Articles | Metrics
Genetic algorithm is a random search algorithm that simulates natural selection and evolution. It searches through the total solution space and can find the optimal solution globally over a domain. Recently, the popular encoding scheme is to encode the structure and weights, etc. into a string, which is not easy for the reservation of sub-structure during the process of genetic evolution. Generally, BP training scheme used in feed-forward neural network is to train all the offspring equally, which obviously wastes resources. A new method named SEFNN is proposed, which uses compact matrix encoding scheme, a new crossover operator, a properly modified mutate operator and rules of training elites. The efficiency of evolutionary feed-forward neural network is improved by properly considering the relationship between genotype and phenotype, thus improving the mutation speed and adopting a scheme of selective training. Experiments show that the proposed method can get good performance in accuracy. It has also found good application in a lung cancer diagnosis system.
Research on a Covering Rough Fuzzy Set Model
Wei Lai, Miao Duoqian, Xu Feifei, and Xia Fuchun
2006, 43(10):  1719-1723. 
Asbtract ( 438 )   HTML ( 0)   PDF (339KB) ( 679 )  
Related Articles | Metrics
Based on the research of a covering rough set model, the definitions about the upper approximation of the covering rough set are found to be inconsistent. Through the analysis of the difference among the several models, a more reasonable model is adopted, and, combined with the theory about covering reduction, the covering rough set model is redefined. Its several properties are given. Meanwhile, the model is extended, and the covering rough fuzzy set model is defined. Then some of its good properties are also proved.
A Study of Constrained Layout Optimization Using Adaptive Particle Swarm Optimizer
Lei Kaiyou and Qiu Yuhui
2006, 43(10):  1724-1731. 
Asbtract ( 469 )   HTML ( 0)   PDF (741KB) ( 755 )  
Related Articles | Metrics
The optimal layout problem of circle group in a circular container with performance constraints of equilibrium belong to an NP-hard problem. Due to its complexity, the general particle swarm optimization algorithm converges slowly and easily converges to local optima. Taking the layout problem of satellite cabins as background, a novel adaptive particle swarm optimizer is presented based on multi-modified strategies, which can not only escape from the attraction of local optima of the later phase to heighten particle diversity, and avoid the premature problem, but also maintain the characteristic of fast speed search in the early convergence phase to get global optimum. Thus, the algorithm has a better search performance to deal with the constrained layout optimization problem. Experimental results on three examples show that this algorithm is feasible and efficient.
The Tight Estimation Distance Using Wavelet
Liu Bing, Wang Wei, and Shi Baile
2006, 43(10):  1732-1737. 
Asbtract ( 329 )   HTML ( 1)   PDF (499KB) ( 695 )  
Related Articles | Metrics
Time series similarity search is of growing importance in many applications. Wavelet transforms are used as a dimensionality reduction technique to permit efficient similarity search over high-dimensional time series data. Proposed in this paper are the tight upper and lower bounds on the estimation distance using wavelet transform, and it is shown that the traditional distance estimation is only a part of the lower bound. According to the lower bound, more dissimilar time series can be excluded than the traditional method. And according to the upper bound, whether two time series are similar can be directily judged, and the number of time series to process in the original time domain can be further reduced. The experiments show that using the upper and lower tight bounds can significantly improve the filter efficiency and reduce the running time than the traditional method.
Mining Frequent Closed Patterns from a Sliding Window over Data Streams
Liu Xuejun, Xu Hongbing, Dong Yisheng, Qian Jiangbo, and Wang Yongli
2006, 43(10):  1738-1743. 
Asbtract ( 507 )   HTML ( 4)   PDF (515KB) ( 658 )  
Related Articles | Metrics
The set of frequent closed patterns determines exactly the complete set of all frequent patterns and is usually much smaller than the latter. But how to mine frequent closed patterns from a sliding window is a very big challenge. According to the features of data streams, a new algorithm, call DS_CFI, is proposed to solve the problem of mining the frequent closed itemsets. A sliding window is divided into several basic windows and the basic window is served as an updating unit. Latency frequent closed itemsets of every basic window are mined by the existing frequent closed pattern algorithms. Those itemsets and their subset are stored in a new data structure called DSCFI_tree. The DSCFI_tree can be incrementally updated and the frequent closed itemsets in a sliding window can be rapidly found based on DSCFI_tree. The experimental results show the feasibility and effectiveness of the algorithm.
Online Correlation Analysis for Multiple Dimensions Data Streams
Yang Xuemei, Dong Yisheng, Xu Hongbing, Liu Xuejun, Qian Jiangbo, and Wang Yongli
2006, 43(10):  1744-1750. 
Asbtract ( 562 )   HTML ( 2)   PDF (728KB) ( 845 )  
Related Articles | Metrics
Studied in this paper is the problem of identifying correlations between two multiple-dimensions data streams under constrained resources. A novel online canonical correlation analysis (CCA) algorithm based on approximate technique, called QuickCCA, is proposed. To solve bottleneck of CCA's performance, QuickCCA uses a column-sampling with non-equal probability to compress the numbers of tuples and construct synopsis matrix first. And based on the synopsis matrix, the most k principal correlation coefficients between evolving multiple-dimensions data streams are computed rapidly. Theoretic analysis and experiments indicate that QuickCCA can accurately identify correlations between two multiple-dimensions data streams in synchronic sliding windows model. Compared with the existing correlation analysis algorithm for data streams, the QuickCCA algorithm reduces complexity of computation efficiently and trades accuracy with performance. It can be presented as a generic tool for a multitude of applications on data stream mining problems.
An Event Composite Matching Approach Based on the OBDD Graphs
Xu Gang, Ma Jiangang, and Huang Tao
2006, 43(10):  1751-1759. 
Asbtract ( 365 )   HTML ( 1)   PDF (669KB) ( 731 )  
Related Articles | Metrics
One crucial issue of content-based Pub/Sub systems is the content-based event matching. The existing approaches that utilize simple constraints to match the contents of events are not enough. For example, in the enterprise application integration scenarios, the event matching is often “many-to-one” matching or “one-to-many” matching. Therefore, traditional matching approaches should be extended to solve the complex matching problems. After analyzing information matching patterns in enterprise application integration, three matching models are proposed, the simple matching is extended to the multi-semantic matching, and the temporal constraint variable is introduced. The operations differ in accordance with the different semantics; the discrete events are supported by the temporal constraint variable. Then, OBDD graphs are extended into hierarchy colored OBDD graphs and the equality of the transformation is further proved. Based on the extended OBDD graphs, the composite matching algorithms are presented and analyzed. Experiments show that the proposed algorithms are effectual.
Detecting Out-of-Bounds Accesses with Conditional Range Constraint
Xia Yimin, Luo Jun, and Zhang Minxuan
2006, 43(10):  1760-1766. 
Asbtract ( 314 )   HTML ( 1)   PDF (568KB) ( 626 )  
Related Articles | Metrics
Out-of-bounds accesses can lead to nondeterministic behaviors. Proposed in this paper is a novel detection method based on conditional range constraint. It divides the detection process into two phrases: the constraint generation phase and the constraint resolution phase. In the phase of constraint generation, a flow-sensitive, inter-procedure algorithm is proposed to propagate range constraints and value constraints respectively. In the constraint resolution phase, a linear programming solver is used to determine the bounds of abstract memory locations and the offset. The experiment results show that the method proposed is effective, and its precision is higher than 80%.
Strategies of Regression Test Case Selection for Component-Based Software
Mao Chengying, and Lu Yansheng
2006, 43(10):  1767-1774. 
Asbtract ( 449 )   HTML ( 0)   PDF (745KB) ( 792 )  
Related Articles | Metrics
Component-based software technology has been increasingly adopted in the development of large-scale and complex software systems. However, the testing problem induced by it hasn't been settled perfectly and is still one of the open issues in component-based software engineering (CBSE). Due to the lack of information about the constructs and changes in externally-provided components, system testers (i.e., component users) generally can't perform effective regression testing on their component-based software systems (CBSs). The ultimate reason is that they aren't able to select the proper test cases to retest the modification caused by the changes in component. Through analyzing the drawbacks of the existing regression testing techniques for CBSs, two improved strategies are proposed. One is based on the enhanced representation of change information of component version, and the other is implemented via the component built-in test design. Preliminary experiments have been employed on some medium scale systems, and experiment results show that the strategies of regression test case selection are feasible and effective in practice.
A Hierarchical Group Communication Protocol for Scalable Total Ordering
Wang Wentao, Wu Junmin, Xu Yinlong, Li Huanghai, and Bao Chunjian
2006, 43(10):  1775-1781. 
Asbtract ( 489 )   HTML ( 1)   PDF (528KB) ( 433 )  
Related Articles | Metrics
In parallel and distributed systems, a great deal of members in a group are cooperating to achieve some functions. But in a traditional group communication system, there are lots of communication overheads, especially when membership of group changes frequently. These overheads will greatly degrade the efficiency of group communication system. In this paper, a novel hierarchical group communication protocol called RHGP is proposed. RHGP is the acronym of “ring-based hierarchical of group protocol”. This protocol supports total order message delivery and hierarchical group management by using token passing. It also supports dynamical changing of the membership of a group. In order to reduce communication overheads and improve reliability, the protocol decreases the number of messages exchanged during the membership changing of a group. It is proved that the proposed protocol is reliable in the sense that with high probability of 99.8646% a ring-based hierarchy with nearly 200 members can work well when member faulty probability is bounded by 0.1%; if at most 3 members faulty are allowed, reliability probability of hierarchy is 99.9999%. It is also proved that the proposed protocol is scalable in the sense that with the number of group members increasing, reliability probability of hierarchy decreases slowly.
Agent-Oriented Software Engineering: Status and Challenges
Mao Xinjun, Chang Zhiming, Wang Ji, and Wang Huaimin
2006, 43(10):  1782-1789. 
Asbtract ( 612 )   HTML ( 3)   PDF (439KB) ( 943 )  
Related Articles | Metrics
Agent-oriented software engineering (AOSE) is an emerging and leading field in the literature of software engineering, which intends to integrate agent theory and technology with the software engineering philosophy and principles together in order to provide the engineering solutions to develop agent-based systems. In recent years, great progresses have been made in the study of AOSE, with the development of Web application and the socialization of software development. In this paper, the backgrounds of AOSE are introduced from the viewpoints of application requirements and technology developments, and a comprehensive survey of research and status of AOSE is presented. The open problems and challenges are finally discussed to guide future research.
An Algorithm for Internet AS Graph Betweenness Centrality Based on Backtrack
Zhang Guoqiang, and Zhang Guoqing
2006, 43(10):  1790-1796. 
Asbtract ( 540 )   HTML ( 4)   PDF (618KB) ( 487 )  
Related Articles | Metrics
Betweenness plays a key role in the characterization of complex networks, which can be used to quantify a node or edge's importance. For Internet, betweenness directly reflects the possible load needed to carry by the node or edge, thus it can be used to predict the network's dynamic behavior. However, O(n\+3) run time is required to compute the betweenness by conventional algorithms, which becomes a bottleneck for large-scale network analysis. These algorithms are devised for weighted networks, but a lot of real network models don't take weight into account. On the other hand, neither of these algorithms take the semantics of edges into account. In this paper, an algorithm based on backtrack which calculates the betweenness for simple un-weighted network with O(nm) run time complexity is proposed, after which the unique feature of Internet AS graph, i.e, edges between two ASes are associated with some kind of commercial relationships, is considered. Grounded on the simple backtrack algorithm, an algorithm for calculating betweenness of Internet AS graph is proposed. Since routing policies and routing behaviors are also considered in this algorithm, it has great significance to apply this betweenness computation to the analysis of Internet behavior.
A Sub-Tuple-Partition-Based Fast Two-Dimensional Packet Classification Algorithm
Liu Tong, Li Huawei, Li Xiaowei, and Gong Shuguang,
2006, 43(10):  1797-1803. 
Asbtract ( 407 )   HTML ( 2)   PDF (543KB) ( 592 )  
Related Articles | Metrics
Packet classification is important for network applications such as firewalls, intrusion detection, and differentiated services. There are a lot of research papers about the packet classification. The algorithms based on tuple space firstly presented by Srinivasan have a common problem: the matched rules can not be located in which tuples they are by preprocessing. So it makes their average lookup performance unstable. For two-dimensional packet classification, a new regulation to partition the tuple into some sub-tuples and a new algorithm for two-dimensional packet classification are presented. The sub-tuple based algorithm has an advantage of locating the matched rule's tuple through the result of three independent one-dimensional lookups. So the sub-tuple-based algorithm can increase lookup speed and get stable lookup performance.
PKI-Based Mutual Connections Constrained with Discrepancy of Trust Domains
Zhu Pengfei, Dai Yingxia, and Bao Xuhua
2006, 43(10):  1804-1809. 
Asbtract ( 344 )   HTML ( 0)   PDF (428KB) ( 482 )  
Related Articles | Metrics
Mutual connections of entities within different PKI-based trust domains may fail because of the discrepancy of those trust domains. In this paper, how the discrepancy works is analyzed with the new concept, “compatibility of trust domains”. Besides, an approach is introduced that extended-in-condition trust domains can be used to avoid inter linkage from discrepancy, and an implementation is designed.
A Survey of Identity-Based Cryptography Using Pairing
Tian Ye, Zhang Yujun, and Li Zhongcheng
2006, 43(10):  1810-1819. 
Asbtract ( 504 )   HTML ( 0)   PDF (1034KB) ( 763 )  
Related Articles | Metrics
Identity-based cryptography is proposed to simplify key management which is one of the most complex problems in certificate-based cryptography. Firstly, the identity-based cryptography technologies are surveyed from three fundamental primitives (encryption, digital signature, and key agreement), and some problems in these technologies are analyzed, including security models, efficiency, and so on. Because of the lack of study for applications, how to apply identity-based cryptography to implement access authentication and data confidentiality in wireless mobile IPv6 network is discussed. Finally, the possible new directions of identity-based cryptography techniques and the applications are pointed out.
Fault Injection and Soft Error Sensitivity Characterization for Fault-Tolerant Godson-1 Processor
Huang Hailin, Tang Zhimin, and Xu Tong,
2006, 43(10):  1820-1827. 
Asbtract ( 544 )   HTML ( 0)   PDF (571KB) ( 634 )  
Related Articles | Metrics
Dependability plays an important role in the design of nanometer-scale processors and application-specific processors exposed to harsh environments such as cosmic rays. Fault injection is employed to characterize the soft error sensitivity and validate the integrated fault tolerance mechanisms in a fault-tolerant processor. With Godson-1 processor as the research prototype, a novel fault injection technique is presented, which can perform fast and continuous simulation-based fault injections to the subjected processor by running two synthesizable processor RTL models simultaneously. Based on this technique, about 30,0000 soft errors are injected into Godson-1 and the soft error sensitivity of Godson-1 is further investigated to direct the design of a fault-tolerant and dependable Godson-1 processor with good statistical significance.
Study on Modeling MIPS Processors for Static WCET Analysis
Bian Xiaofeng, and Zhou Xuehai
2006, 43(10):  1828-1834. 
Asbtract ( 423 )   HTML ( 0)   PDF (596KB) ( 581 )  
Related Articles | Metrics
To obtain safe and tight WCET estimation, it is necessary to take account of the features of a target processor architecture. New architecture mechanisms, such as cache, pipeline etc., have been widely applied in modern processors to increas their performance. Ignoring their impact will cause WCET overestimated during the analysis of execution time. For the purpose of WCET analysis, a method of modeling MIPS processor architecture is proposed using Petri nets. How to abstract common RISC processor architecture features and how to model them with Petri nets are discussed and its effectivity is verified with experiment.
Formal Verification of Godson-2 Microprocessor Floating-Point Division Unit
Chen Yunji, Ma Lin, Shen Haihua, and Hu Weiwu
2006, 43(10):  1835-1841. 
Asbtract ( 411 )   HTML ( 0)   PDF (515KB) ( 616 )  
Related Articles | Metrics
Word level model checking based on decision diagrams can verify arithmetic circuits completely, but its bug finding is time-consuming. SAT-based bounded model checking is powerful in bug finding, but it does not support specification with mathematic formula. In this paper, the SAT-based word level model checking method is presented. This method extends CNF to E-CNF which can represent both Boolean formula and mathematic formula, and extends bounded model checking and SAT solver to word level to cooperate with E-CNF. Experiments show that SAT-based model checking is powerful in bug finding for arithmetic circuit. The verification of floating-point division unit of Godson-2 microprocessor adopts both *PHDD-based and SAT-based model checking methods. These two methods are excellently complementary to each other, giving the ability to completely verify design as well as shorten design cycle.
A Persistent Out-of-Band Virtualization System
Zhang Guangyan, Shu Jiwu, Xue Wei, and Zheng Weimin
2006, 43(10):  1842-1848. 
Asbtract ( 348 )   HTML ( 0)   PDF (626KB) ( 569 )  
Related Articles | Metrics
Storage virtualization can enhance the overall quality of service in storage are a networks. Out-of-band virtualization, in contrast with in-band virtualization, has the advantages of high performance and good scalability. In this paper, a policy for guaranteeing the persistent consistency of virtualization metadata is proposed, which enables out-of-band virtualization to survive panics and power failures by means of ordered operations in a virtualization transaction, REDO logging and log integrity checking. An organizing mode of the on-disk virtualization metadata is described, which is human-readable and easy to modify due to the introduction of a relation model. An effective approach which obtains the volume layout by analyzing the partition tables on legacy disks is presented for integrating typical legacy storage systems, and little time is needed to upgrade large legacy systems by the approach. The implemented prototype based on these key technologies demonstrates its ability to provide high performance in the representative experiments.
Fault Tolerance with Virtual Disk Replicas in the Mass Storage Network
Wang Di, Xue Wei, Shu Jiwu, and Shen Meiming
2006, 43(10):  1849-1854. 
Asbtract ( 355 )   HTML ( 1)   PDF (888KB) ( 607 )  
Related Articles | Metrics
Data availability and accessing performance in large scale storage network have been more and more important. Based on the in-band virtualization system for massive storage, virtual disk replicas are proposed to improve the data availability. The replicas use selective read scheduling, asynchronous update mechanism, and a special space-adjusting algorithm to optimize the data accessing performance. The experiments with virtual disk replicas show that, when the device number is adequate, the read-performance can be increased by 26%; even if one or two disks fail, the I/O operations can be processed correctly, and the read-performance is still over 10% better than the normal accessing without replicas.