Abstract:
The extended conditional functional dependency (eCFD) has been proposed as an extension of the conditional functional dependency (CFD) for data cleaning. Compared with CFDs, eCFDs admit more kinds of value patterns and capture more semantic information. However, eCFDs have received little attention so far. This paper addresses the problem of eCFD discovery, whose counterpart for CFDs has been studied extensively. To the best of our knowledge, this is the first work on eCFD discovery. To avoid inconsistencies and remove redundancies, a formal definition of the eCFD discovery problem is given, based on the notions of strongly validated and weakly non-redundant eCFDs, and the MeCFD method is proposed to solve it. MeCFD first generates all basic eCFDs that are weakly non-redundant and semantically equivalent to all strongly validated eCFDs, and then constructs compound eCFDs by merging basic eCFDs. By searching the candidate space in depth-first order, MeCFD needs only constant memory to maintain data partitions. Efficient pruning strategies are proposed to further improve its performance. Theoretical analysis establishes the correctness of MeCFD, and experiments on real data sets demonstrate its good scalability and the effectiveness of the pruning strategies and optimization methods.