ISSN 1000-1239 CN 11-1777/TP

Table of Contents

01 November 2020, Volume 57 Issue 11
Enhancing Spatial Steganographic Algorithm Based on Multi-Scale Filters
Wu Junqi, Zhai Liming, Wang Lina, Fang Canming, Wu Tian
2020, 57(11):  2251-2259.  doi:10.7544/issn1000-1239.2020.20200441
Steganography is a covert communication technique that uses multimedia carriers such as images, videos, and audio. How to embed as much secret information as possible while minimizing the impact on the carrier has always been the central problem in designing steganographic algorithms. Since the introduction of STC (syndrome trellis codes), the embedding efficiency of steganographic algorithms can approach the theoretical upper bound, so algorithm design now focuses on the distortion function, which measures the embedding security of each image pixel. Distortion functions are crucial to content-adaptive steganography. For spatial image steganography, distortion functions are typically designed around a texture-complexity criterion: textured regions are assigned low embedding costs and flat regions are assigned high embedding costs. However, because image contents vary widely, this criterion may not be satisfied for all pixels in a given image. In this paper, we propose an enhancing spatial steganographic algorithm that refines the embedding costs with multi-scale filters, which better enhance textured regions at different scales while suppressing the enhancement of smooth regions. The refined embedding costs conform more closely to the above criterion and thus overcome the problem of improper cost assignment. Experimental results demonstrate that the proposed algorithm can be applied to existing spatial image steganographic algorithms and improves their security against image steganalysis.
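A minimal sketch of the cost-refinement idea described above: texture strength is estimated with filters at several scales, and baseline embedding costs are lowered where texture is strong and kept high where the image is flat. The filter sizes, the baseline cost, and the combination rule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def texture_map(img, scales=(3, 9, 15)):
    """Average absolute high-pass residual over several filter scales."""
    img = img.astype(np.float64)
    residuals = [np.abs(img - uniform_filter(img, size=s)) for s in scales]
    return np.mean(residuals, axis=0)

def refine_costs(base_cost, img, eps=1e-6):
    """Scale baseline embedding costs down in textured regions, up in flat ones."""
    t = texture_map(img)
    t = t / (t.max() + eps)               # normalize texture strength to [0, 1]
    return base_cost * (1.0 - 0.5 * t)    # textured pixels receive lower cost

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cover = rng.integers(0, 256, size=(64, 64))
    base = np.ones_like(cover, dtype=np.float64)   # stand-in for e.g. WOW/S-UNIWARD costs
    print(refine_costs(base, cover).shape)
```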
Evolutionary Multi-Objective Optimization Image Steganography Based on Edge Computing
Ding Xuyang, Xie Ying, Zhang Xiaosong
2020, 57(11):  2260-2270.  doi:10.7544/issn1000-1239.2020.20200437
Edge computing overcomes the limitation that terminals cannot run complex applications because of their scarce computing resources. In practice, edge computing can enable terminals with limited computing resources to carry out covert communication based on image steganography. This paper proposes an evolutionary multi-objective optimization image steganography method based on a genetic algorithm that is suitable for edge computing scenarios. First, a formal definition of image steganography is given, with steganographic imperceptibility and steganographic security as the two objective functions. Second, the image is preprocessed with multiple directional and non-directional high-pass filters, and the aggregated filter residuals are taken as candidate locations for embedding the secret information. Then, the genetic operators of the genetic algorithm iteratively search the candidate locations for individuals with high fitness, yielding embedding locations with higher fitness and the optimal solution of the evolutionary multi-objective optimization problem. Finally, the secret information is embedded in the pixel locations corresponding to the optimal solution. Simulation experiments show that the proposed algorithm maintains image quality and resists steganalysis better than existing algorithms.
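A hedged sketch of the genetic search over candidate embedding locations described above. The two objective terms (imperceptibility as low local cost, security as spreading the payload), their scalarization, and all GA parameters are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
N_CANDIDATES, PAYLOAD, POP, GENS = 200, 40, 30, 50
cost = rng.random(N_CANDIDATES)              # stand-in per-location distortion cost

def fitness(ind):
    idx = np.flatnonzero(ind)
    imperceptibility = -cost[idx].sum()               # objective 1: prefer low-cost locations
    security = np.std(idx) if idx.size > 1 else 0.0   # objective 2: spread the payload spatially
    return imperceptibility + 0.01 * security         # scalarized multi-objective fitness

def random_individual():
    ind = np.zeros(N_CANDIDATES, dtype=bool)
    ind[rng.choice(N_CANDIDATES, PAYLOAD, replace=False)] = True
    return ind

def crossover(a, b):
    child = a ^ ((a ^ b) & (rng.random(N_CANDIDATES) < 0.5))   # uniform crossover
    on, off = np.flatnonzero(child), np.flatnonzero(~child)
    if on.size > PAYLOAD:                                      # repair to exactly PAYLOAD bits
        child[rng.choice(on, on.size - PAYLOAD, replace=False)] = False
    elif on.size < PAYLOAD:
        child[rng.choice(off, PAYLOAD - on.size, replace=False)] = True
    return child

pop = [random_individual() for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[: POP // 2]                                  # keep the fitter half
    children = []
    while len(parents) + len(children) < POP:
        i, j = rng.choice(len(parents), 2, replace=False)
        children.append(crossover(parents[i], parents[j]))
    pop = parents + children

best = max(pop, key=fitness)
print("chosen embedding locations:", np.flatnonzero(best)[:10], "...")
```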
Reversible Data Hiding in JPEG Images Based on Distortion-Extension Cost
Wang Yangyang, He Hongjie, Chen Fan, Zhang Shanjun
2020, 57(11):  2271-2282.  doi:10.7544/issn1000-1239.2020.20200434
Considering the file-size increase and visual distortion of a JPEG image carrying secret data, a reversible data hiding algorithm for JPEG images based on distortion-extension cost is proposed. Histogram shifting is used to embed the secret data reversibly, with the focus on how to adaptively select the embedding frequencies and image blocks according to the embedding capacity, so as to minimize both the visual distortion and the file-size increase of the stego JPEG image. This paper analyzes the rationale for determining the frequency embedding order by simulating the unit file-size increase of each frequency, and for determining the block embedding order by the number of zero AC coefficients and the smoothness of each image block. When embedding data, frequencies with a smaller unit file-size increase and smoother image blocks are preferred. The unit file-size increase and the unit distortion-increase ratio are defined as quantitative indicators of file expansion and of the trade-off between visual quality and file expansion, respectively. Experimental results demonstrate that, compared with the latest similar algorithms, the proposed algorithm achieves a better balance between file-size increase and visual quality, reduces the file-size increase of the stego JPEG image, and lowers the average unit file-size increase by 0.15 to 0.25 at the same embedding capacity.
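A minimal illustration (not the paper's full scheme) of histogram-shifting reversible embedding on quantized AC coefficients: coefficients equal to ±1 each carry one bit, larger coefficients are shifted outward by one so the embedding is perfectly reversible, and zero coefficients are left untouched to limit the file-size increase, in the spirit of the distortion-extension cost discussed above.

```python
def embed(coeffs, bits):
    out, it = [], iter(bits)
    for c in coeffs:
        if c == 1 or c == -1:                 # carrier coefficients hold one bit each
            b = next(it, 0)
            out.append(c + b if c == 1 else c - b)
        elif c > 1:
            out.append(c + 1)                 # shift to make room, reversibly
        elif c < -1:
            out.append(c - 1)
        else:
            out.append(0)                     # zeros untouched: no file-size growth
    return out

def extract(coeffs):
    bits, restored = [], []
    for c in coeffs:
        if c in (1, 2):
            bits.append(c - 1); restored.append(1)
        elif c in (-1, -2):
            bits.append(-1 - c); restored.append(-1)
        elif c > 2:
            restored.append(c - 1)
        elif c < -2:
            restored.append(c + 1)
        else:
            restored.append(0)
    return bits, restored

ac = [0, 1, -1, 3, 0, -2, 1]
stego = embed(ac, [1, 0, 1])
print(extract(stego))   # recovers the bits and the original coefficients
```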
File Covert Transfer Strategy Based on End Hopping and Spreading
Hou Bowen, Guo Hongbin, Shi Leyi
2020, 57(11):  2283-2293.  doi:10.7544/issn1000-1239.2020.20200420
End hopping and spreading is an active defense technology that pseudorandomly changes the end information during end-to-end data transmission and uses end-spreading sequences to achieve high-speed synchronous authentication. In this paper, end hopping and spreading is applied to covert file transfer: the file covert transmission strategy in an end hopping and spreading network is studied, a multicast time correction scheme is proposed, and the synchronization problem in the communication process is solved. Two file transfer schemes, based on transfer time and on transfer size, are proposed for the end hopping and spreading network, and data migration is added to the file transfer process to achieve covert and complete transmission of files. A prototype system for covert file transfer over end hopping and spreading is designed and implemented, and its usability and security are tested. The experimental results show that the proposed strategy effectively meets the integrity and concealment requirements of file transfer.
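A hedged sketch of the time-slotted end (port) hopping idea: sender and receiver derive the current port from a shared key and a coarse time slot, so an observer sees traffic spread over pseudorandom ports. The slot length, port range, and key derivation below are illustrative assumptions; the paper's multicast time correction and end-spreading authentication are not shown.

```python
import hashlib
import hmac
import time

PORT_BASE, PORT_SPAN, SLOT_SECONDS = 20000, 4000, 5

def hopping_port(key, t=None):
    """Both communication ends compute the same port for the current time slot."""
    slot = int((time.time() if t is None else t) // SLOT_SECONDS)
    digest = hmac.new(key, slot.to_bytes(8, "big"), hashlib.sha256).digest()
    return PORT_BASE + int.from_bytes(digest[:4], "big") % PORT_SPAN

key = b"shared-secret"
print(hopping_port(key))        # port for the current slot
print(hopping_port(key, t=0))   # deterministic for a given slot, so both ends agree
```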
Android Browser Fingerprinting Identification Method Based on Bidirectional Recurrent Neural Network
Liu Qixu, Liu Xinyu, Luo Cheng, Wang Junnan, Chen Langping, Liu Jiaxi
2020, 57(11):  2294-2311.  doi:10.7544/issn1000-1239.2020.20200459
Browser fingerprinting is a user identification method that has matured steadily since the concept was proposed in 2010 and is widely used by popular commercial websites to serve targeted ads. However, traditional fingerprinting has difficulty tracing users over time, because fingerprints change subtly whenever feature values are altered by system upgrades, browser updates, or tampering by fingerprint blockers. Based on a study of browser fingerprint attributes, a large number of fingerprints are collected from volunteers using Android devices, and a supervised learning framework, RNNBF, is proposed for user identification. The robustness of RNNBF comes from both the data and the model: on the data side, fingerprint-based data augmentation is used to generate an enhanced data set; on the model side, an attention mechanism makes the model focus on invariant fingerprint features. For model evaluation, RNNBF is compared with a single-layer LSTM model and a random forest model. With F1-score as the evaluation metric, RNNBF outperforms both, demonstrating its excellent performance in dynamically linking fingerprints.
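A minimal PyTorch sketch of a bidirectional recurrent model with attention over a sequence of fingerprint attribute embeddings, in the spirit of RNNBF. The dimensions, attention form, and classification head are illustrative assumptions; the paper's data augmentation and exact architecture are not reproduced here.

```python
import torch
import torch.nn as nn

class BiRNNAttention(nn.Module):
    def __init__(self, feat_dim=32, hidden=64, n_users=100):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)       # scores each attribute position
        self.out = nn.Linear(2 * hidden, n_users)  # user identity classifier

    def forward(self, x):                 # x: (batch, n_attributes, feat_dim)
        h, _ = self.rnn(x)                # (batch, n_attributes, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)   # attention over attribute positions
        context = (weights * h).sum(dim=1)             # weighted sum of hidden states
        return self.out(context)

model = BiRNNAttention()
logits = model(torch.randn(8, 20, 32))    # 8 fingerprints, 20 attributes each
print(logits.shape)                       # torch.Size([8, 100])
```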
Formal Security Evaluation and Improvement of Industrial Ethernet EtherCAT Protocol
Feng Tao, Wang Shuaishuai, Gong Xiang, Fang Junli
2020, 57(11):  2312-2327.  doi:10.7544/issn1000-1239.2020.20200399
The EtherCAT protocol is widely used for its strong real-time capability and high performance. However, with the rapid development and openness of industrial Ethernet protocols, industrial control systems face serious risks of network attack. Many studies address the security and improvement of industrial Ethernet protocols, but they lack formal modeling and security evaluation of the protocols and focus only on implementing the protocols' own security functions, which limits their scope. To address the attacks facing industrial Ethernet, we take the widely used EtherCAT protocol as the research object and propose a model checking method based on colored Petri net theory and the Dolev-Yao attack model to evaluate and improve the security of the protocol. First, we model and verify the security mechanism of the protocol's FSoE layer using Petri net theory and the CPN Tools modeling tool; we then introduce the Dolev-Yao attacker model to evaluate the original protocol model and find that the protocol is vulnerable to three types of man-in-the-middle attacks: tampering, replay, and spoofing. Finally, a new scheme is proposed for these vulnerabilities, adding a key distribution center and a Hash function to the original protocol. The new scheme is verified again with the CPN model checking tool, and the verification shows that it effectively prevents the three types of man-in-the-middle attacks and improves the security of the protocol.
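A hedged sketch of the kind of fix described above: a session key obtained from a key distribution center is used to append a keyed hash (HMAC) and a sequence number to each frame, so tampered and replayed frames can be detected. The frame layout and field sizes are illustrative assumptions, not the EtherCAT/FSoE format.

```python
import hashlib
import hmac

def protect(payload, seq, session_key):
    """Prepend a sequence number and append an HMAC tag computed with the session key."""
    msg = seq.to_bytes(4, "big") + payload
    return msg + hmac.new(session_key, msg, hashlib.sha256).digest()

def verify(frame, expected_seq, session_key):
    msg, tag = frame[:-32], frame[-32:]
    if not hmac.compare_digest(tag, hmac.new(session_key, msg, hashlib.sha256).digest()):
        raise ValueError("tampered frame")            # integrity check failed
    if int.from_bytes(msg[:4], "big") != expected_seq:
        raise ValueError("replayed or out-of-order frame")
    return msg[4:]

key = b"key-from-kdc"                                  # hypothetical key issued by the KDC
frame = protect(b"safety data", seq=7, session_key=key)
print(verify(frame, expected_seq=7, session_key=key))
```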
MSRD: Multi-Modal Web Rumor Detection Method
Liu Jinshuo, Feng Kuo, Jeff Z. Pan, Deng Juan, Wang Lina
2020, 57(11):  2328-2336.  doi:10.7544/issn1000-1239.2020.20200413
Multi-modal web rumors that combine images and text are more confusing and inflammatory, and therefore more harmful to national security and social stability. Current web rumor detection focuses on the text of the post while ignoring the image content and the text embedded in the image. This paper therefore proposes MSRD, a multi-modal web rumor detection method based on deep neural networks that considers the image, the text embedded in the image, and the text of the post. The method uses a VGG-19 network to extract image content features, DenseNet to extract the embedded text, and an LSTM network to extract text features. After concatenation with the image features, the mean and variance vectors of the shared image-text representation are obtained through a fully connected layer, and random variables sampled from a Gaussian distribution form a reparameterized multi-modal feature that serves as the input of the rumor detector. Experiments show that the method achieves 68.5% and 79.4% accuracy on the Twitter and Weibo data sets, respectively.
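A simplified PyTorch sketch of the fusion step described above: image, embedded-text, and post-text feature vectors are concatenated, mapped to a mean and a log-variance, and a reparameterized sample of the shared representation feeds the rumor classifier. Feature extraction (VGG-19, DenseNet, LSTM) is assumed already done, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, img_dim=512, ocr_dim=128, txt_dim=128, z_dim=64):
        super().__init__()
        self.mu = nn.Linear(img_dim + ocr_dim + txt_dim, z_dim)
        self.logvar = nn.Linear(img_dim + ocr_dim + txt_dim, z_dim)
        self.classifier = nn.Linear(z_dim, 2)     # rumor / non-rumor

    def forward(self, img_feat, ocr_feat, txt_feat):
        h = torch.cat([img_feat, ocr_feat, txt_feat], dim=-1)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.classifier(z)

model = MultiModalFusion()
out = model(torch.randn(4, 512), torch.randn(4, 128), torch.randn(4, 128))
print(out.shape)    # torch.Size([4, 2])
```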
Privacy Preservation Method of Data Aggregation in Mobile Crowd Sensing
Wang Taochun, Jin Xin, Lü Chengmei, Chen Fulong, Zhao Chuanxin
2020, 57(11):  2337-2347.  doi:10.7544/issn1000-1239.2020.20190579
With the popularity of mobile smart devices and the wide application of mobile crowd sensing, privacy leakage has become a serious problem. Existing privacy protection schemes generally assume that the third-party service platform is trusted, which imposes strong requirements on the application context. This paper therefore proposes ECPPDA, a privacy-preserving data aggregation algorithm based on elliptic curve cryptography for mobile crowd sensing. The server randomly divides the participants into g clusters and forms a cluster public key for each cluster. The nodes in a cluster encrypt their data with the cluster public key and merge the encrypted aggregation results. The server obtains the aggregation result by cooperating with the cluster members. Because the server only receives the aggregated ciphertext and decryption requires all nodes in the cluster to cooperate, the server cannot obtain the data of any single participant. In addition, the server can update the cluster public key, which allows participants to join or leave dynamically. Experimental results show that ECPPDA offers high security, low computation cost, low communication overhead, and high precision.
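An illustrative sketch (not the paper's elliptic curve construction) of the two ideas behind ECPPDA: additively homomorphic encryption lets the server aggregate ciphertexts without seeing individual readings, and the decryption key is split across cluster members, so decryption needs all of them to cooperate. Lifted ElGamal over a multiplicative group stands in for elliptic-curve ElGamal, and all parameters are toy-sized.

```python
import random

P, G = 1_000_000_007, 5                      # small prime-modulus group, illustration only
shares = [random.randrange(1, P - 1) for _ in range(3)]   # one key share per cluster member
x = sum(shares) % (P - 1)                    # cluster secret key (never assembled in practice)
Y = pow(G, x, P)                             # cluster public key

def encrypt(m):                              # lifted ElGamal: the message goes in the exponent
    r = random.randrange(1, P - 1)
    return pow(G, r, P), (pow(G, m, P) * pow(Y, r, P)) % P

readings = [12, 30, 7]                       # each participant's sensed value
cts = [encrypt(m) for m in readings]

# Server multiplies ciphertexts componentwise (homomorphic addition of plaintexts).
C1 = C2 = 1
for a, b in cts:
    C1, C2 = (C1 * a) % P, (C2 * b) % P

# Each member contributes a partial decryption with its own key share.
partial = 1
for s in shares:
    partial = (partial * pow(C1, s, P)) % P
g_sum = (C2 * pow(partial, -1, P)) % P       # equals G ** sum(readings) mod P

total = next(t for t in range(10_000) if pow(G, t, P) == g_sum)
print(total)                                 # 49: the server learns only the sum
```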
Review of Automatic Image Annotation Technology
Ma Yanchun, Liu Yongjian, Xie Qing, Xiong Shengwu, Tang Lingli
2020, 57(11):  2348-2374.  doi:10.7544/issn1000-1239.2020.20190793
As one of the most effective ways to reduce the “semantic gap” between image data and its content, automatic image annotation (AIA) technology is of great significance in helping people understand image content and retrieve target information from massive image data. This paper summarizes the general framework of AIA models by surveying the image annotation literature of the past 20 years and, by combining the framework with specific works, analyzes the general problems to be solved in AIA. The main methods used in AIA models are classified into 9 types: correlation models, hidden Markov models, topic models, matrix factorization models, neighbor-based models, SVM-based models, graph-based models, CCA (KCCA) models, and deep learning models. For each type of annotation model, this paper provides a detailed study and analysis in terms of “basic principle introduction, specific model differences, and model summary”. In addition, this paper summarizes commonly used datasets and evaluation metrics, and compares the performance of important image annotation models while analyzing the advantages and disadvantages of each type of AIA model. Finally, open problems and research directions in the field of image annotation are proposed and discussed.
Deep Highly Interrelated Hashing for Fast Image Retrieval
He Zhouyu, Feng Xupeng, Liu Lijun, Huang Qingsong
2020, 57(11):  2375-2388.  doi:10.7544/issn1000-1239.2020.20190498
In recent years, with the explosive growth of image data, the combination of hashing and deep learning has shown excellent performance in large-scale image retrieval. Most mainstream deep supervised hashing methods adopt a “pairwise” strategy, constraining the hash codes with a similarity matrix. The instance-pairwise similarity matrix is an n×n matrix, where n is the number of training samples, so the computational cost of such methods is large and they do not scale to large-scale image retrieval. This paper therefore proposes deep highly interrelated hashing, a deep supervised hashing method that enables fast and accurate large-scale image retrieval and can be applied to a wide variety of deep convolutional neural networks. In particular, to make the hash codes more discriminative, this paper proposes a highly interrelated loss function to constrain the hash encoding. The highly interrelated loss function adjusts the distance between features by changing the model's sensitivity to the weight matrix, maximizing the inter-class distance while reducing the intra-class distance. Extensive experiments on the CIFAR-10, NUS-WIDE, and SVHN datasets show that the image retrieval performance of deep highly interrelated hashing is better than that of current mainstream methods.
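A hedged PyTorch sketch of a deep supervised hashing head: continuous codes are pushed toward ±1 and trained with a classification loss whose weight matrix is L2-normalized, which is one simple way to control the model's sensitivity to the weight matrix and to push classes apart while tightening them internally. This illustrates the general recipe only, not the paper's exact highly interrelated loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashHead(nn.Module):
    def __init__(self, feat_dim=512, code_bits=48, n_classes=10):
        super().__init__()
        self.hash = nn.Linear(feat_dim, code_bits)
        self.cls_weight = nn.Parameter(torch.randn(n_classes, code_bits))

    def forward(self, features, labels, scale=8.0):
        codes = torch.tanh(self.hash(features))                  # relaxed binary codes
        w = F.normalize(self.cls_weight, dim=1)                   # normalized weight matrix
        logits = scale * F.linear(F.normalize(codes, dim=1), w)   # scaled cosine logits
        cls_loss = F.cross_entropy(logits, labels)                # separates classes
        quant_loss = (codes.abs() - 1.0).pow(2).mean()            # pushes codes toward ±1
        return codes, cls_loss + 0.1 * quant_loss

head = HashHead()
codes, loss = head(torch.randn(16, 512), torch.randint(0, 10, (16,)))
print(codes.shape, float(loss))
```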
Survey on Geometric Unfolding, Folding Algorithms and Applications
Sun Xiaopeng, Liu Shihan, Wang Zhenyan, Li Jiaojiao
2020, 57(11):  2389-2403.  doi:10.7544/issn1000-1239.2020.20200126
The unfolding and folding problem is a popular research topic in computer graphics and has a wide range of applications, such as industrial manufacturing, architectural design, medical treatment, and aviation technology. In this survey, we review the basic concepts of the unfolding and folding problem and introduce research and applications in four fields: robot design, computer animation, deep learning, and others. We then discuss the research on unfolding and folding in detail. First, according to the degree of unfolding, we summarize research progress and typical algorithmic ideas from two aspects: full unfolding and approximate unfolding. Full unfolding flattens 3D objects into 2D space without overlap or deformation; however, most objects cannot be unfolded directly, and only an approximately unfolded structure can be computed. Approximate unfolding maps the surface into the planar domain without overlap but with deformation, and finding the smallest deformation is the key problem. Second, according to the folding form, the folding problem is divided into two types: Origami and Kirigami. We divide Origami into rigid folding and curved folding according to the crease form, namely straight or curved creases. Kirigami is a special folding method that combines cutting and folding, driving the folding by elastic or other external forces generated by the cuts; here we mainly consider techniques and algorithms that use Kirigami to construct auxetic structures. In addition, to compare the advantages and disadvantages of the algorithms, we summarize the commonly used evaluation indicators of unfolding and folding algorithms, evaluate typical algorithms of recent years, and analyze their strengths and weaknesses. Finally, we summarize the development trends of unfolding and folding, including algorithm accuracy and robustness, folding of volumetric objects, self-driven folding processes, and intelligent applications of Kirigami technology.
SBS: An Efficient R-Tree Query Algorithm Exploiting the Internal Parallelism of SSDs
Chen Yubiao, Li Jianzhong, Li Yingshu
2020, 57(11):  2404-2418.  doi:10.7544/issn1000-1239.2020.20190564
Flash-based SSDs have become the mainstream storage device thanks to their excellent characteristics. At the same time, with substantial improvements in the internal architecture of SSDs, more and more storage chips and hardware resources are integrated into them, giving them abundant internal parallelism, yet traditional external-memory algorithms and data-structure optimizations rarely take this internal parallelism into consideration. Range query is one of the most important basic operations on R-trees, and the R-tree is the index data structure at the core of many geographic information systems, so the efficiency of range queries plays an important role in the performance of the entire system. Almost all tree index structures find it difficult to exploit internal parallelism because of the data-loading dependency between levels. Therefore, a new stack-based range query algorithm, SBS (stack batch search), is proposed, which effectively utilizes the internal parallelism of the SSD with memory usage of O(B log N). Finally, we verify the performance of SBS through experiments on real data. The results show that SBS achieves the best range query performance under acceptable memory consumption: on two different solid-state drives, its speedup reaches 3.4 and 4.5, respectively.
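A schematic sketch of the stack-batch idea: instead of reading one R-tree node at a time, the query pops a whole batch of node ids from the stack and issues their reads together, which an SSD can serve in parallel. A dict stands in for the on-disk node store and a thread pool stands in for parallel I/O; the real algorithm's memory bound and node layout are not modeled.

```python
from concurrent.futures import ThreadPoolExecutor

# toy node store: id -> (mbr, children, is_leaf); mbr = (xmin, ymin, xmax, ymax)
NODES = {
    0: ((0, 0, 10, 10), [1, 2], False),
    1: ((0, 0, 5, 10), [3, 4], True),
    2: ((5, 0, 10, 10), [5], True),
}
LEAVES = {3: (1, 1, 2, 2), 4: (4, 8, 5, 9), 5: (7, 7, 8, 8)}

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def read_node(nid):                      # stands in for one SSD page read
    return NODES[nid]

def range_query(query, root=0, batch=8):
    stack, results = [root], []
    with ThreadPoolExecutor(max_workers=batch) as pool:
        while stack:
            batch_ids = [stack.pop() for _ in range(min(batch, len(stack)))]
            for mbr, children, is_leaf in pool.map(read_node, batch_ids):
                if not intersects(mbr, query):
                    continue
                if is_leaf:
                    results += [c for c in children if intersects(LEAVES[c], query)]
                else:
                    stack += children    # defer children to the next batched read
    return results

print(range_query((3, 6, 9, 10)))        # entries 4 and 5 fall in the query window
```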
Survey on Data Updating in Erasure-Coded Storage Systems
Zhang Yao, Chu Jiajia, Weng Chuliang
2020, 57(11):  2419-2431.  doi:10.7544/issn1000-1239.2020.20190675
In a distributed storage system, node failure has become the norm. To ensure high data availability, systems usually adopt data redundancy, mainly through two mechanisms: multiple replicas and erasure coding. As data volumes keep growing, the benefit of the multi-replica mechanism diminishes, and attention has turned to erasure codes, which offer higher storage efficiency. However, the complexity of erasure coding makes read, write, and update operations in erasure-coded distributed storage systems more expensive than with multiple replicas. Erasure coding is therefore usually used for cold or warm data, while hot data, which is frequently accessed and updated, is still stored as multiple replicas. This paper focuses on data updating in erasure-coded storage systems, summarizes current optimization work on erasure coding updates from the aspects of disk I/O, network transmission, and system-level optimization, compares the update performance of representative coding schemes, and finally looks ahead to future research trends. The analysis concludes that current erasure coding update schemes still cannot reach update performance comparable to that of multiple replicas. How to optimize erasure-coded storage systems, in terms of both update rules and system architecture, so that erasure coding can replace the multi-replica mechanism for hot data and reduce hot data storage overhead, remains a problem worthy of further study.
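A worked micro-example of the delta update that erasure-coded stores optimize: for an XOR-based parity (the simplest erasure code), updating one data block only needs the old block, the new block, and the old parity; the other data blocks are never read. RS-coded systems apply the same idea with Galois-field arithmetic in place of XOR.

```python
def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

data = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b"]          # three data blocks
parity = xor_blocks(xor_blocks(data[0], data[1]), data[2])

new_block = b"\x7f\x7f"                                  # update data block 1
delta = xor_blocks(data[1], new_block)                   # old XOR new
parity = xor_blocks(parity, delta)                       # parity patched without touching blocks 0 and 2
data[1] = new_block

assert parity == xor_blocks(xor_blocks(data[0], data[1]), data[2])
print(parity.hex())
```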
Optimization of LSM-Tree for Key-Value Stores
Wu Shangyu, Xie Jingwen, Wang Yi
2020, 57(11):  2432-2441.  doi:10.7544/issn1000-1239.2020.20190551
LSM-Tree (log-structured merge tree) is widely used in mainstream key-value storage systems to handle vast amounts of data. It converts random write requests into sequential writes by batching requests in memory, achieving high write efficiency. However, LSM-Tree still has two shortcomings. First, the flow of data is unidirectional and fixed: data stored at the bottom of the LSM-Tree remains there until it is deleted by a compaction operation, which aggravates the read amplification problem. Second, the data distribution in the LSM-Tree does not reflect access frequency: data with different access frequencies may be stored at the same physical location, and frequently accessed data in lower layers incurs higher access latency. This paper presents FloatKV (floating key-value), an access-frequency-aware key-value storage strategy. FloatKV first proposes a structure called LRFO (LRU and FIFO) to manage data stored in memory. It then introduces a floating mechanism in external storage that moves data closer to memory: FloatKV records the access frequency of data in external storage and adjusts its storage location accordingly. To verify the feasibility and performance of FloatKV, we conduct a series of experiments using standard benchmarks from YCSB (Yahoo! Cloud Serving Benchmark) and compare FloatKV with representative key-value store techniques. The experimental results show that FloatKV significantly improves read efficiency and effectively alleviates the read amplification problem.
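A hedged sketch of an LRU-plus-FIFO ("LRFO"-style) in-memory structure: new keys enter a FIFO segment, keys that are hit again are promoted to an LRU segment, and eviction drains the FIFO first. The exact promotion and eviction policy of FloatKV is not specified here, so this split is an illustrative assumption.

```python
from collections import OrderedDict

class LRFO:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.fifo = OrderedDict()   # insertion-ordered, no reordering on hit
        self.lru = OrderedDict()    # reordered on every hit

    def get(self, key):
        if key in self.lru:
            self.lru.move_to_end(key)
            return self.lru[key]
        if key in self.fifo:                      # promote on a repeated access
            self.lru[key] = self.fifo.pop(key)
            return self.lru[key]
        return None

    def put(self, key, value):
        if key in self.lru or key in self.fifo:
            self.get(key)                         # refresh / promote, then overwrite
            self.lru[key] = value
            return
        if len(self.fifo) + len(self.lru) >= self.capacity:
            victim = self.fifo if self.fifo else self.lru
            victim.popitem(last=False)            # evict the oldest, FIFO segment first
        self.fifo[key] = value

cache = LRFO()
for k in "abcd":
    cache.put(k, k.upper())
cache.get("a")           # "a" is promoted to the LRU segment
cache.put("e", "E")      # evicts the oldest FIFO entry ("b"), not "a"
print(cache.get("a"), cache.get("b"))   # A None
```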
Search of Students with Similar Lifestyle Based on Campus Behavior Information Network
Wang Xin’ao, Duan Lei, Cui Dingshan, Lu Li, Dun Yijie, Qin Ruiqi
2020, 57(11):  2442-2455.  doi:10.7544/issn1000-1239.2020.20190649
It is important to keep track of both the psychological and academic status of students on campus. Student data covers many aspects, such as interests, hobbies, and lifestyles, and many campuses collect these data through smart devices such as student e-cards. With the rapid development of a new generation of information technology, researchers have in recent years explored new ways to improve the quality of talent cultivation by using student data, such as applying big data analysis to discover subtle but meaningful information that can guide student management. Among such research, searching for students with similar lifestyles can clearly benefit student management, because potential and insightful information can be found and early warnings can be provided when anything unusual appears. Existing algorithms for searching students with similar lifestyles have two deficiencies. First, they cannot explain the similarities between students, because the related semantic information is lost in the search process. Second, they fail to integrate multiple data sources, although student behavioral data grows dynamically and using only one dataset may lead to biased results. To overcome these limitations, we first propose the concept of a campus behavior information network to represent student behaviors on campus. Then, based on the constructed campus behavior information network, we propose SCALE, an algorithm for mining similar campus lifestyles. SCALE calculates student similarity using specific meta-paths with constraints. Its strengths are that it preserves the similarity semantics of the original data and that it integrates multiple data sources in a scalable way while retaining previously computed results. Because the datasets are large, a parallel strategy is further designed and applied to SCALE for efficiency. Extensive experiments on real campus behavior datasets verify the effectiveness and execution efficiency of SCALE.
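A small sketch of meta-path-based similarity on a campus behavior information network: students are linked to behavior events (e.g., a canteen at a time slot), and a PathSim-style score over the meta-path Student-Event-Student measures lifestyle similarity. PathSim is used here as a stand-in; SCALE's constrained meta-paths and exact measure are not reproduced, and the data are made up.

```python
from collections import Counter

# student -> multiset of behavior events observed from e-card records (illustrative)
behavior = {
    "s1": Counter({"canteen_A@noon": 20, "library@evening": 15}),
    "s2": Counter({"canteen_A@noon": 18, "gym@evening": 10}),
    "s3": Counter({"canteen_B@noon": 22, "library@evening": 14}),
}

def pathsim(a, b):
    """PathSim over Student-Event-Student: shared paths normalized by self-paths."""
    paths_ab = sum(behavior[a][e] * behavior[b][e] for e in behavior[a])
    paths_aa = sum(v * v for v in behavior[a].values())
    paths_bb = sum(v * v for v in behavior[b].values())
    return 2 * paths_ab / (paths_aa + paths_bb)

for other in ("s2", "s3"):
    print("s1 vs", other, round(pathsim("s1", other), 3))
```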
Convolutional Interactive Attention Mechanism for Aspect Extraction
Wei Zhenkai, Cheng Meng, Zhou Xiabing, Li Zhifeng, Zou Bowei, Hong Yu, Yao Jianmin
2020, 57(11):  2456-2466.  doi:10.7544/issn1000-1239.2020.20190748
The attention mechanism is a common model in aspect extraction research, but it has two limitations for this task. First, existing attention mechanisms are mostly static attention or self-attention; self-attention is a global mechanism and brings irrelevant noise (words that are far away from the target word and unrelated to it) into the attention vector. Second, existing attention mechanisms are mostly single-layer and lack interactivity. To address these two limitations, a convolutional interactive attention (CIA) mechanism is proposed in this paper. A bidirectional long short-term memory network (Bi-LSTM) is used to obtain hidden representations of the words in a target sentence, and the convolutional interactive attention mechanism is then used for representation learning. The mechanism has two layers: in the first layer, the number of context words for each target word is limited by a window, and these context words are used to compute the attention vector of the target word; in the second layer, an interactive attention vector is computed from the attention distribution of the first layer and all the words in the sentence. The attention vectors of the two layers are then concatenated, and a conditional random field (CRF) is used to label aspects. Experiments on the official 2014-2016 Semantic Evaluation (SemEval) datasets demonstrate the effectiveness of the proposed method: compared with the baseline, the model increases the F1 score of aspect extraction by 2.21%, 1.35%, 2.22%, and 2.21% on the four datasets, respectively.
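A PyTorch sketch of the first (windowed) attention layer described above: each target position attends only to hidden states inside a local window, which keeps far-away, unrelated words out of its attention vector. The window size and scoring form are illustrative; the second, interactive layer and the CRF tagger are not shown.

```python
import torch

def windowed_attention(h, window=2):
    """h: (seq_len, dim) Bi-LSTM hidden states -> (seq_len, dim) local context vectors."""
    seq_len, _ = h.shape
    out = torch.empty_like(h)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        ctx = h[lo:hi]                              # local context words only
        scores = ctx @ h[i]                         # dot-product relevance to the target word
        weights = torch.softmax(scores, dim=0)
        out[i] = weights @ ctx                      # attention vector for position i
    return out

hidden = torch.randn(7, 16)                  # 7 tokens, 16-dim Bi-LSTM states
print(windowed_attention(hidden).shape)      # torch.Size([7, 16])
```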
Construction of Large-Scale Disease Terminology Graph with Common Terms
Zhang Chentong, Zhang Jiaying, Zhang Zhixing, Ruan Tong, He Ping, Ge Xiaoling
2020, 57(11):  2467-2477.  doi:10.7544/issn1000-1239.2020.20190747
The National Health Planning Commission requires medical institutions to use ICD (International Classification of Diseases) codes. However, because clinical disease descriptions contain a large number of common terms, the direct matching rate between the clinical diagnosis names in electronic medical records and ICD codes is low. Based on real diagnostic data from a regional healthcare platform, this paper constructs a disease terminology graph that incorporates common terms. Specifically, this paper proposes a relationship recognition algorithm based on data augmentation that combines a rule-based algorithm using disease components with a pre-trained BERT (bidirectional encoder representations from transformers) model. The algorithm identifies synonymy and hypernymy relations between more than 50 000 common terms and the diseases in ICD10 (International Classification of Diseases, 10th revision, Chinese version), and further fuses the hierarchical structure of ICD11 (11th revision, Chinese version). Moreover, this paper proposes a task allocation algorithm based on a disease-department association graph to organize manual verification. Finally, a large-scale disease terminology graph is formed, containing 94 478 disease entities connected by 1 460 synonymy and 46 508 hypernymy relations. Evaluation experiments show that the coverage of clinical diagnostic data based on the disease terminology graph is 75.31% higher than direct mapping based on ICD10. In addition, compared with manual coding by doctors, automatic coding using the disease terminology graph shortens the coding time by 59.75%, with an accuracy rate of 85%.
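A toy sketch of how a terminology graph like the one described can bridge clinical wording and ICD codes: a free-text diagnosis is first normalized through a synonymy edge, then, if no code is attached, hypernymy edges are climbed until a coded parent is found. The terms, codes, and graph contents below are made up for illustration only.

```python
SYNONYM = {"lacunar stroke": "lacunar infarction"}            # common term -> standard disease
HYPERNYM = {"lacunar infarction": "cerebral infarction"}      # disease -> broader disease
ICD_CODE = {"cerebral infarction": "I63.9", "lacunar infarction": "I63.8"}

def to_icd(term):
    term = SYNONYM.get(term, term)             # normalize via a synonymy edge
    while term is not None:
        if term in ICD_CODE:
            return ICD_CODE[term]
        term = HYPERNYM.get(term)              # climb hypernymy edges toward a coded parent
    return None

print(to_icd("lacunar stroke"))   # I63.8
```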