Abstract:
The leading partitional clustering technique, K-Modes, is one of the most computationally efficient clustering methods for categorical data. In the traditional K-Modes algorithm, the simple matching dissimilarity measure is used to compute the distance between two values of the same categorical attributes. This compares two categorical values directly and results in either a difference of zero when the two values are identical or one if otherwise. However, the similarity between categorical values is not considered. In this paper, a new distance measure based on rough set theory is proposed, which overcomes the shortage of the simple matching dissimilarity measure and is used along with the traditional K-Modes clustering algorithm. While computing the distance between two values of the same categorical attributes, the new distance measure takes into account not only their difference but also discernibility of other relational categorical attributes to them. The time complexity of the modified K-Modes clustering algorithm is linear with respect to the number of data objects which can be applied for large data sets. The performance of the K-Modes algorithm with the new distance measure is tested on real world data sets. Comparisons with the K-Modes algorithm based on many different distance measures illustrate the effectiveness of the new distance measure.