Abstract:
Rich entity associations are a prerequisite for, and play an important role in, many applications in heterogeneous information spaces, such as data analysis, data mining, knowledge discovery, and semantic query. However, unlike in homogeneous information networks, the complexity, diversity, and heterogeneity of entity associations in heterogeneous information spaces make entity association mining a challenging task. This paper takes as an example the discovery of likely entity associations among heterogeneous entities in an author bibliographic network. In particular, targeting the characteristics of heterogeneous information spaces, a new general 4-step entity association mining algorithm, CFRQ4A (clustering, filtering, reasoning and quantifying for associations), is proposed. CFRQ4A leverages not only the attribute values of heterogeneous entities but also the structural (path) information of the heterogeneous information network. Association constraints are introduced to verify the semantic and logical correctness of entity associations during the mining process, and the filtering step further reduces the search space of the mining algorithm. Moreover, based on the inherent features of entity associations, a model for quantifying association strength is given. Experimental results on the DBLP dataset demonstrate the feasibility and effectiveness of the proposed algorithm.