Abstract:
Data identification is a prerequisite for precise data governance and helps ensure the security of data elements during cross-domain transfer. Methods already exist for generating identifiers for individual data items, but as data scale continues to grow, these item-level identifiers cannot be applied directly at the dataset level, where identifiers become “easily damaged” and “difficult to embed”. To address these issues, we adopt the network honeypoint design concept from the “guardian” model proposed by Academician Fang Binxing. Drawing on the idea of deception defense, we propose an anti-damage data identification technology based on dataset honeypoints for cross-domain data transfer scenarios, and design a complete method for generating and embedding dataset honeypoints. First, we design dataset honeypoints for cross-domain data transfer scenarios; enhancing their concealment and increasing their redundancy addresses the problem of identifiers being “easily damaged”. Second, ensuring that dataset honeypoints are indistinguishable in form from real data resolves the problem of identifiers being “difficult to embed”. Finally, experiments on two data modalities, images and encrypted text, demonstrate that dataset honeypoints offer high anti-damage capability, high robustness, and low performance overhead.