基于分类型矩阵对象数据的MD fuzzy k-modes聚类算法

李顺勇1 张苗苗1 曹付元2

1(山西大学数学科学学院 太原 030006)2(山西大学计算机与信息技术学院 太原 030006)

摘 要 传统的聚类算法一般是对单值属性数据进行聚类.但在许多实际应用中,每个对象通常被多个特征向量所描述.例如,顾客在购物时可能同时购买多个产品.由多个特征向量描述的对象称为矩阵对象,由矩阵对象构成的数据集称为矩阵对象数据集.目前,针对矩阵对象数据聚类算法的研究相对较少,还有很多问题有待解决.利用fuzzy k-modes算法的聚类过程,提出一种基于矩阵对象数据的matrix-object data fuzzy k-modes(MD fuzzy k-modes)聚类算法.该算法结合模糊集的概念引入模糊因子β,重新定义了矩阵对象间的相异性度量,并给出类中心的启发式更新算法.最后,在5个真实数据集上验证了MD fuzzy k-modes算法的有效性,并分析了模糊因子β与隶属度w之间的关系.大数据时代,利用MD fuzzy k-modes算法对多条记录进行聚类,能更易发现顾客的消费偏好,从而做出更有针对性的推荐.

关键词 矩阵对象数据;MD fuzzy k-modes算法;相异性度量;类中心;聚类

聚类算法中最具代表性的是k-means,k-modes,k-prototype算法,其中,k-means[1]主要用于对数值型数据进行聚类.现实中,分类型属性数据也常见.1998年Huang[2]提出了k-modes算法,该算法用简单匹配计算2个对象间的距离,用modes代替means,基于频率来更新类中心.2001年Chaturvedi等人[3]改进了k-modes算法,提出了k-modes-CGC算法,有效地运用非参数方法对分类型数据进行聚类.随后,Huang等人[4]证明了二者的等价性.此外,在初始类中心的选取上,Ying等人[5]考虑将迭代求精法与k-modes算法结合;在相异性度量的选取上,Ng等人[6]和San等人[7]基于属性频率计算相似度,Li等人[8]基于生物特征计算距离.Liang等人[9-14]也基于不同度量提出了多种k-modes的改进算法.

以上种种算法在考虑类别归属时,其隶属度只考虑了0,1这2个值,即只能划分到确定的某一类中,属于硬划分.而数据的不同属性重要度会给部分数据的真实类别归属带来模糊性.粗糙集[15]和模糊集[16]理论的提出为数据在数据集中的位置提供了有利的基础,软划分应运而生.Bezdek提出的fuzzy c-means(FCM)算法[17]是软划分聚类的典例.1999年Huang等人[18]在FCM算法的基础上引进模糊因子、隶属度矩阵等,进一步提出fuzzy k-modes算法.2004年Kim等人[19]用模糊集对类中心的模糊化刻画分类数据中类的不确定性,提出了具有模糊类中心的Fuzzy k-modes算法.2005年Li等人[20]提出了基于特征加权的模糊聚类新算法(novel feature weighted clustering algorithm, NFWFCA).2007年Cai等人[21]结合局部空间和灰度信息,提出了快速通用的聚类算法(fast generalized fuzzy c-means, FGFCM).2016年Zhou等人[22]结合多目标优化算法与模糊中心点聚类,提出一种新颖的多目标模糊聚类算法.总之,k-modes算法对后续众多的拓展算法起到了积极的铺垫作用.

已有的聚类算法普遍使用X={X1,X2,…,Xn}的数据表示模式,X表示由n个对象组成的对象集, Xi=(Xi1,Xi2,…,Xim)表示每个对象由m个属性特征描述,每个属性特征有且仅有唯一的取值.然而实际应用中,对象的每个属性特征可能有不同的取值.例如顾客购物时,可能同时购买多个产品,这就容易产生矩阵对象数据[23].若利用已有的聚类算法处理该类数据,需用先验知识来选取其中一条记录,这会严重损失信息并破坏数据的原始性,且违背了以数据总体来做数据分析的初衷.因此,为了利用多条消费记录发现客户的消费喜好,从而做出更具针对性的推荐[23],有必要研究基于矩阵对象数据的聚类算法.Cao等人[24]首先提出基于集值对象的Set-Value k-modes (SV-k-modes)算法和fuzzy Set-Value k-modes(fuzzy SV-k-modes)算法[25].之后,Cao等人又提出基于矩阵对象的k-multi-weighted-modes(k-mw-modes)聚类算法[23].该算法在考虑类别归属的同时,其隶属度也仅仅考虑了0,1这2个值.由于数据集中属性重要度的不同,常常会给部分数据的真实类别归属带来模糊性.本文兼顾模糊集引入模糊因子,提出一种基于矩阵对象数据的模糊聚类算法(matrix-object data fuzzy k-modes, MD fuzzy k-modes).本文的主要贡献有4个方面:

1) 结合模糊集的概念提出了一种更新类中心启发式算法;

2) 提出了基于分类型矩阵对象数据的MD fuzzy k-modes聚类算法;

3) 实验验证了MD fuzzy k-modes算法的有效性;

4) 分析了模糊因子β与隶属度w的关系.

1 回顾fuzzy k-modes算法

X={X1,X2,…,Xn}是由n个对象、m个属性描述的分类型数据集,则XiXj间的相异性度量定义为

(1)

其中,

QX的类中心,如果Q能最小化

(2)

fuzzy k-modes算法用迭代方式将数据分为k类, 此算法的目的是最小化目标函数:

(3)

其中,W为隶属度矩阵.

2 MD fuzzy k-modes聚类算法

经典的k-type算法[1-2]主要由3部分组成:相异性度量的定义、类中心的表示和类中心的更新过程.本文提出的MD fuzzy k-modes算法也从这3方面考虑.

2.1 矩阵对象间的相异性度量

用简单0-1匹配、属性频率等相异性度量来计算数据间的距离适用于1对1对象数据,而矩阵对象数据每个属性有多于一个的属性值,这些相异性度量对矩阵对象数据有一定的局限性,由于k-mw-modes算法[23]中定义了2个矩阵对象间的相异性度量,本文直接引用此相异性度量.

定义1. 相异性度量.给定矩阵对象Xi,Xj,每个对象由m个分类型属性{A1,A2,…,Am}来描述,则XiXj的相异性度量定义为

(4)

其中:

δ(Xis,Xjs)=

(5)

(6)

式(5)中表示矩阵对象Xi在属性As上的所有属性值,即值域.12是标准化因子,因为0≤δ(Xis,Xjs)≤2,当=时,δ(Xis,Xjs)=0;当=∅时,δ(Xis,Xjs)=2.

可以验证该相异性度量满足非负性、对称性和三角不等式性,的确是一个距离.

例1. 表1是某一矩阵对象数据集的描述,其中X={X1,X2},A={A1,A2},计算X1,X2间的距离.

在属性A1上,在属性A2上,所以,X1,X2间的距离

Table 1 A Matrix-Object Data Set
表1 某一矩阵对象数据集

XA1A2XA1A2X1adaebdcfX2aececf

2.2 类中心的定义及启发式更新过程

聚类目的使是“类内距离最小,类间距离最大”,即同类中每个对象到类中心的距离最小,不同类中的对象到其他类中心的距离最大.因此,聚类过程中类中心的表示是必需的.针对一个集合的mode[2],集合X′={x1,x2,…,xn}的mode是一个向量Q′=(q1,q2,…,qm),使得目标函数最小化.也就是说,qs(1≤sm)是集合X关于第s个属性出现最频繁的值,使得向量Q′是一个mode.本文在k-mw-modes算法的基础上引入了模糊因子β和隶属度w.这样,目标函数在考虑距离的同时也应兼顾模糊隶属度.

定义2. 类中心.如果Ql能使目标函数达到最小:

(7)

QlX的类中心.

要求类中心Ql,即要最小化D(X,Ql),根据定义1,仅需要最小化设属性As的属性个数为|Vs|,那么,其类中心Qls属性值的个数只能从1到|Vs|中选,若要从|Vs|中选us个作为Qls的属性值,则有种情况,而us的取值范围也是1到|Vs|,根据多项式理论,类中心在属性As上的选择有2|Vs|-1种,需要遍历每种情况求再按降序排列,才能更新出在属性As上的类中心.同理,其他属性的类中心也可更新出来.

这种全局性更新类中心算法的时间复杂度为O(nmtk×2|V′|),n表示对象数,m表示属性个数,k表示分类个数,t表示迭代次数,|V′|=max{|Vs|,1≤sm}.由此可知,全局性更新类中心的算法时间复杂度随着对象个数、属性个数、分类数及迭代次数的增多呈线性增长,属性值的个数呈指数增长.

当矩阵对象数据中属性值个数过多时,全局更新类中心的算法计算量过大,耗时增强,故本文提出了启发式更新类中心算法.首先分析

(8)

这里记称作Xi的属性值的权重.

由式(8)可知,要想使最小,则需使最大.

假设第l类中属性Aj的值域是计算并按降序排列.若Qljrj个值,则有3种情形.

1) 当rj=1时,若

Qlj={};若有多于一个的最大值,则随机取一个.

2) 当时,则取全部值作为类中心.

3) 当时,有3种情形:

① 若选前rj个值作为类中心.

② 若则先选前rj-1个值Q″={,,…,}作

Qlj的一部分.若Qlj={}∪Q″;若

Qlj={}∪Q″,若则随机取一个.

③ 若

则先选前rj-p′-1个属性值作为类中心的一部分,记为Q″={,,…,},再从接下来的p′+p+1个属性值中选p′+1个作为剩余部分.假设Qj是剩余属性值的所有组合的集合,RtQj中每个组合的频率和,若

矩阵对象数据中每个属性有多于一个的属性值,类中心的属性个数rj取多少恰当也是一个值得研究的课题.为了增加算法的有效性,本文令根据以上启发式更新算法,属性Aj的类中心可以更新出来.同理,其他属性的类中心也能找到.

2.3 MD fuzzy k-modes聚类算法

k-means算法、k-modes算法、k-mw-modes算法的聚类目的是找到隶属度矩阵W和类中心Q,并使目标函数最小化;拓展的fuzzy k-means,fuzzy k-modes算法的聚类目的也是找到WQ,使目标函数最小.

本文在k-mw-modes算法的基础上,引入模糊因子并改进了类中心的表示及更新过程,提出了MD fuzzy k-modes算法.

定义3. 最小化目标函数.将一矩阵对象数据集X={X1,X2,…,Xn}划分为k类,则需最小化目标函数:

(9)

且满足:

wli∈[0,1], 1≤lk, 1≤in,

(10)

(11)

(12)

其中,Q=(Q1,Q2,…,Qk)中的元素Ql表示第l类的中心,Ql=(Ql1,Ql2,…,Qlm);W=(wli)是一个k×n维的隶属度矩阵,wli=1表示Xi被分到l类.

为使F′(W,Q)达到最小,要通过多次迭代过程使其收敛:1) 初始化类中心Qt;2) 固定Qt,找出使F′(W,Q)最小的Wt;3) 固定Wt,用启发式更新算法找出Qt+1使F′(W,Q)达到最小;4) 重复步骤1)2)3),直到类中心不变或目标函数小于阈值为止.

其中,隶属度矩阵W由定理1计算而来,类中心Q的更新由启发式更新算法而来.

定理1. 固定Q,在式(10)~(12)的限制下使F′(W,Q)最小,则W的更新为

(13)

MD fuzzy k-modes算法的基本步骤:

1) 随机选取k个对象作为初始类中心;

2) 根据2.1节,计算每个对象到k个中心的距离,将对象分配到与其距离最小的类中;

3) 根据2.2节,计算每个对象到k个中心的隶属度,并更新k个类的类中心;

4) 重复步骤2)3),直到类中心或目标函数不变为止.

算法1. MD fuzzy k-modes算法.

输入:X为由m个属性描述的n维矩阵对象数据,k为需要聚类个数,ε为阈值,idCentersk个初始类中心的标签,β为模糊因子;

输出:cid是聚类后所有对象的标签,num是迭代次数.

Q是初始类中心,value=0,num=0;

② while num<100 do

newvalue=0;

④ for i=1 to n do

⑤ for l=1 to k do

⑥ 计算第i个对象到第l个中心的距离d(Xi,Ql)(用式(4));

⑦ end for

⑧ end for

⑨ for i=1 to n do

⑩ for l=1 to k do

计算第i个对象到第l个中心的隶属度wli(用式(13));

end for

end for

if |newvalue-value|≤ε

break;

else

value=newvaluenum=num+1;

end if

for l=1 to k do

更新类中心Q(用启发式更新算法);

end for

end while

3 实验分析

为了评价MD fuzzy k-modes算法的有效性,本文考虑了5个真实数据集:Market Basket,Micro-soft Web,Musk,MovieLens,Alibaba.Market Basket记录了1 001个顾客的交易记录,每条记录由用户ID、交易时间、产品名称和产品ID这4个属性描述;Microsoft Web来自UCI数据集,记录了1998年1月份某周内32 711个匿名用户的网页浏览情况,每个用户由用户ID和网页ID这2个属性描述;Musk也来自UCI数据集,包括92个对象,每个对象由167个属性描述;MovieLens从MovieLens网站上下载,本文只使用其中的ratings数据,它记录了6 040个观众对3 900部电影的1 000 209条评分情况,每条记录由用户ID、电影ID、用户评分和提交评价的时间这4个属性描述;Alibaba是884个用户浏览某些品牌的182 880条记录,也由4个属性描述.这5个数据集均为矩阵对象数据集.为了增强聚类效果,本文对各数据集的属性做了相应的预处理,预处理后的数据形式如表2所示:

Table 2 Data Set after Preprocessing
表2 预处理后的数据集

Data SetObjectsAttributesRecordskMarket Basket900263003Microsoft Web 962298572Musk 921674762MovieLens 3759257824Alibaba79321656553

3.1 评价标准

本文采用精度(AC)、纯度(PR)、召回率(RE)、兰德指数(ARI)、归一化互信息(NMI)这5个评价指标对所提算法进行了有效性评价.AC表示分类正确的比例;PR表示预测为正的样本中有多少是对的;RE表示样本中的正例有多少被预测正确;ARINMI用来衡量2个数据分布的吻合程度.ACPRREARINMI的值越大,聚类结果越接近于数据集的真实划分,聚类效果越好.

X是一矩阵对象数据集,C={C1,C2,…,Ck}是X的聚类结果,P={P1,P2,…,Pk}是真实标签,聚类个数为k,真实类别数为k′.假定k=k′,5种评价指标定义为

(14)

(15)

(16)

ARI=

),

(17)

(18)

其中,nCinPj分别是CiPj中对象的个数,表示正确分到i类中的对象个数.如果聚类结果越接近于数据集的真实划分,那么ACPRREARINMI的值就越大,最大值为1.换句话说,实验结果在这5个评价标准上的值越大,聚类质量越好,算法越有效.

3.2 启发式与全局性更新类中心算法的比较

为了评价启发式更新类中心算法的有效性,本节在用MD fuzzy k-modes算法聚类的过程中,分别采用启发式(HAMF)和全局性算法(GAMF)更新类中心,对比了实验结果与运行时间.以Market Basket为例,运行10次,结果如表3和表4所示.其中,表3的“±”前后分别表示均值和标准差.

Table 3 Comparison Results of the MD fuzzy k-modes Algorithms with GAMF and HAMF
表3 在MD fuzzy k-modes算法中用GAMF和HAMF更新类中心的结果比较

AlgorithmsMean±StdACPRREARINMIMD fuzzy k-modes+GAMF0.9001±0.10190.8997±0.13040.8875±0.12950.7987±0.21030.7727±0.1408MD fuzzy k-modes+HAMF0.8808±0.12250.8870±0.14130.8665±0.14080.7287±0.19450.7126±0.1839

Table 4 Running Time of the MD fuzzy k-modes Algorithms

with GAMF and HAMF

表4 MD fuzzy k-modes算法中用GAMF和HAMF更新

类中心的运行时间

AlgorithmsRunning Time∕sMD fuzzy k-modes+GAMF3.46725×105 MD fuzzy k-modes+HAMF160.313812

Notes: The bold value represents that the running time of the MD fuzzy k-modes algorithm with HAMF is much shorter than GAMF.

从表3和表4可以看出,用全局性算法更新类中心的聚类效果要好于启发式更新算法,但耗时长达96 h.而启发式更新算法在聚类效果相似的情况下只需耗时160 s.因此,在用MD fuzzy k-modes算法进行聚类时,选用本文提出的启发式更新算法更有效.

3.3 MD fuzzy k-modes算法与其他算法的比较

本文选SV-k-modes,k-mw-modes,fuzzy k-modes,fuzzy SV-k-modes这4种算法与MD fuzzy k-modes算法进行比较,其中,fuzzy k-modes算法必须把矩阵数据转换为单值属性值形式,SV-k-modes,fuzzy SV-k-modes算法需把矩阵数据转换为集值数据形式.在与SV-k-modes,k-mw-modes算法比较时,由于这2种算法不含模糊因子β,本文假定MD fuzzy k-modes算法中的β=1.1.在与fuzzy k-modes,fuzzy SV-k-modes算法进行比较时,由于在fuzzy k-type聚类算法[17-21]中,初始类中心的选取和模糊因子β对聚类结果有重要的影响,不同的初始化类中心和不同的β取值会导致聚类结果不同.本文从这2方面验证MD fuzzy k-modes算法的有效性.在β的取值上,目前很多学者研究这一问题.Pal和Bezdek[26]在fuzzy k-means算法中设置β∈[1.5,2.5],Zhou等人[27]认为β的最优区间是[2.5,3],Huang等人[18]设置最小值β=1.1.由于β的取值没有公认的准则,目前研究的最小值为1.1,最大值为3.本文设置β∈[1.1,2.9],步长为0.2.在初始类中心的选择上,本文随机初始化类中心30次,即2种算法在不同的β取值下分别运行30次,通过计算平均聚类质量来验证MD fuzzy k-modes算法的有效性.数据集Market Basket,Microsoft Web,Musk,MovieLens,Alibaba在这5种评价标准上的实验结果如表5~9所示.其中,“±”前后分别表示30次实验结果的均值和标准差.

从表5可以看出,在不考虑模糊因子β的情况下,新提出的MD fuzzy k-modes算法比SV-k-modes算法、k-mw-modes算法在5种评价标准上的值高,说明聚类效果更好.

表6~9显示,考虑模糊因子β时, MD fuzzy k-modes算法相较fuzzy k-modes算法在5种评价标准上的值有明显提高.尤其是Market Basket和Microsoft Web数据集上,ACPRREARINMI值有30%~60%的提高,这说明MD fuzzy k-modes算法要比fuzzy k-modes算法的聚类效果好得多.在MovieLens数据集上RE值虽有所下降,但在其他评价标准上有20%左右的提高;Musk数据集的实验结果虽然没有前3个数据集的效果明显,但仍比fuzzy k-modes算法的值高.再者,相较fuzzy SV- k-modes算法,5种评价标准上的值也有所提高.在Market Basket和Microsoft Web数据集上,ACPRREARINMI值有10%~20%的提高,在Musk,MovieLens数据集上的值相近,但比fuzzy SV-k-modes算法的值高,也说明聚类效果好.

上述实验结果充分验证了MD fuzzy k-modes算法对矩阵对象数据进行聚类具有较好的可行性与有效性.

Table 5 Comparison Results of the Three Algorithms on Five Data Sets
表5 在5个数据集上3种算法的对比

Data SetAlgorithmsMean±StdACPRREARINMISV-k-modes0.6533±0.11240.7012±0.11010.6443±0.11940.2970±0.16770.3219±0.1516Market Basketk-mw-modes0.8147±0.15520.8262±0.15460.8002±0.16610.6093±0.27040.6084±0.2462MD fuzzy k-modes0.8936±0.11470.9044±0.11490.8841±0.12550.7487±0.20730.7314±0.1862SV-k-modes0.7745±0.11330.8307±0.08590.7363±0.14010.3313±0.27210.3332±0.2135Microsoft Webk-mw-modes0.9236±0.03130.9421±0.03660.9015±0.03170.7191±0.08900.6639±0.0936MD fuzzy k-modes0.9290±0.00580.9486±0.00380.9066±0.00770.7337±0.02040.6796±0.0182SV-k-modes0.5681±0.05550.5894±0.08710.5651±0.05530.0216±0.04530.0396±0.0725Muskk-mw-modes0.5667±0.05120.5840±0.07670.5614±0.05280.0187±0.03800.0318±0.0526MD fuzzy k-modes0.5722±0.04530.5962±0.04950.5754±0.04870.0194±0.02430.0349±0.0194SV-k-modes0.7057±0.11970.7978±0.07930.5462±0.10780.4088±0.20530.4802±0.1816MovieLensk-mw-modes0.8141±0.02710.7943±0.03270.6649±0.06220.5813±0.06060.5485±0.0559MD fuzzy k-modes0.8087±0.07190.8319±0.07270.6890±0.09700.5774±0.13080.6108±0.1074SV-k-modes0.5768±0.05900.6084±0.06860.4867±0.08520.1478±0.10760.1468±0.0778Alibabak-mw-modes0.6009±0.07090.5885±0.06500.5256±0.08120.2023±0.11580.1638±0.0886MD fuzzy k-modes0.6403±0.06330.6129±0.07360.5680±0.07960.2685±0.10780.2107±0.0875

Notes: The bold values represent that the highest value obtained by the MD fuzzy k-modes algorithm.

Table 6 Comparison Results of the Three Algorithms on Market Basket
表6 在Market Basket数据集上3种算法的对比

βAlgorithmsMean±StdACPRREARINMIfuzzy k-modes0.5505±0.05500.6034±0.06680.5885±0.09660.1443±0.04520.1778±0.03911.1fuzzy SV-k-modes0.8005±0.16160.8121±0.15960.7974±0.16070.5725±0.26340.5524±0.2369MD fuzzy k-modes0.8986±0.10480.9175±0.09510.8889±0.11520.7508±0.17350.7412±0.1476fuzzy k-modes0.5636±0.04960.6317±0.06770.5877±0.06500.1466±0.03780.1937±0.03221.3fuzzy SV-k-modes0.7444±0.16270.7554±0.16630.7346±0.16900.4822±0.24800.4810±0.2236MD fuzzy k-modes0.8842±0.14340.8920±0.14700.8737±0.15570.7441±0.24030.7247±0.2201fuzzy k-modes0.5632±0.05120.6220±0.06930.6016±0.07570.1493±0.03870.1907±0.03211.5fuzzy SV-k-modes0.7550±0.14840.7696±0.14470.7531±0.14810.4815±0.23670.4772±0.2082MD fuzzy k-modes0.8788±0.15030.8865±0.15070.8679±0.16220.7421±0.25540.7213±0.2296fuzzy k-modes0.5609±0.06040.6311±0.06750.5925±0.08280.1380±0.05450.1847±0.05361.7fuzzy SV-k-modes0.7310±0.13200.7439±0.12930.7209±0.14240.4404±0.21920.4424±0.1949MD fuzzy k-modes0.8640±0.14630.8668±0.15390.8497±0.16280.7092±0.26010.6888±0.2399fuzzy k-modes0.5693±0.05190.6326±0.08100.5938±0.06140.1488±0.03750.1973±0.04891.9fuzzy SV-k-modes0.7769±0.14360.7841±0.14670.7706±0.14740.5288±0.23880.5140±0.2125MD fuzzy k-modes0.8908±0.12250.9025±0.12040.8794±0.13700.7499±0.21010.7322±0.1879fuzzy k-modes0.5594±0.05720.6179±0.07350.5949±0.08220.1461±0.04540.1877±0.04262.1fuzzy SV-k-modes0.7581±0.11360.7723±0.11510.7397±0.13060.4787±0.19670.4794±0.1717MD fuzzy k-modes0.8912±0.12240.9004±0.12400.8802±0.13540.7500±0.21610.7321±0.1927fuzzy k-modes0.5581±0.04810.6340±0.05830.6193±0.08330.1408±0.03900.1912±0.03002.3fuzzy SV-k-modes0.7985±0.11750.8127±0.11050.7901±0.12620.5369±0.21790.5182±0.1911MD fuzzy k-modes0.9307±0.07810.9400±0.08110.9229±0.09040.8153±0.15060.7906±0.1356fuzzy k-modes0.5483±0.05640.6125±0.07150.5994±0.08180.1378±0.04100.1847±0.03592.5fuzzy SV-k-modes0.7234±0.12050.7331±0.12300.7122±0.12470.4314±0.19870.4269±0.1769MD fuzzy k-modes0.9176±0.09040.9292±0.09140.9091±0.10150.7953±0.14800.7749±0.1313fuzzy k-modes0.5525±0.05640.6228±0.07330.6139±0.07870.1356±0.04590.1855±0.04692.7fuzzy SV-k-modes0.7724±0.77240.7845±0.12950.7627±0.13460.5049±0.21130.4917±0.1863MD fuzzy k-modes0.9213±0.09200.9288±0.09980.9125±0.10530.8053±0.15550.7806±0.1414fuzzy k-modes0.5685±0.05120.6359±0.06610.6180±0.06520.1488±0.03770.1979±0.03342.9fuzzy SV-k-modes0.7514±0.13040.7618±0.13410.7483±0.12990.4837±0.20080.4768±0.1760MD fuzzy k-modes0.9340±0.07450.9408±0.08350.9278±0.08190.8310±0.11890.8016±0.1146

Notes: The bold values represent that the highest value obtained by the MD fuzzy k-modes algorithm.

Table 7 Comparison Results of the Three Algorithms on Microsoft Web
表7 在Microsoft Web数据集上3种算法的对比

βAlgorithmsMean±StdACPRREARINMIfuzzy k-modes0.6895±0.05950.7070±0.06400.7752±0.17090.1269±0.11570.1139±0.11071.1fuzzy SV-k-modes0.8316±0.11170.8626±0.11160.7891±0.13620.4756±0.25260.4475±0.2278MD fuzzy k-modes0.9286±0.00450.9483±0.00300.9061±0.00600.7324±0.01580.6783±0.0142fuzzy k-modes0.6769±0.06400.6895±0.06990.7905±0.18700.1059±0.12410.0974±0.11801.3fuzzy SV-k-modes0.8878±0.04160.9176±0.04520.8550±0.04870.6030±0.10960.5626±0.1107MD fuzzy k-modes0.9274±0.00420.9476±0.00270.9046±0.00550.7281±0.01450.6745±0.0130

Continued (Table 7)

βAlgorithmsMean±StdACPRREARINMIfuzzy k-modes0.6762±0.05830.6973±0.06230.7345±0.18770.1005±0.11340.0920±0.10461.5fuzzy SV-k-modes0.8584±0.09070.8884±0.09440.8230±0.10390.5369±0.20670.5044±0.1892MD fuzzy k-modes0.9281±0.00410.9479±0.00300.9056±0.00520.7307±0.01410.6762±0.0147fuzzy k-modes0.6900±0.05900.7047±0.06760.7436±0.17010.1285±0.11410.1146±0.11001.7fuzzy SV-k-modes0.8980±0.01240.9294±0.00730.8660±0.01630.6298±0.03960.5914±0.0322MD fuzzy k-modes0.9196±0.05740.9382±0.06070.8943±0.07540.7129±0.13710.6604±0.1268fuzzy k-modes0.6932±0.05090.7093±0.05570.7139±0.15450.1329±0.09990.1144±0.09971.9fuzzy SV-k-modes0.8678±0.07390.9005±0.06700.8288±0.09330.5534±0.18340.5167±0.1695MD fuzzy k-modes0.9091±0.07710.9253±0.08890.8864±0.07960.6898±0.17240.6360±0.1656fuzzy k-modes0.6908±0.06750.7030±0.07400.7687±0.17540.1335±0.13150.1241±0.12612.1fuzzy SV-k-modes0.8790±0.07400.9125±0.06170.8415±0.09670.5868±0.17610.5527±0.1526MD fuzzy k-modes0.9192±0.05690.9390±0.05470.8939±0.07470.7129±0.12680.6607±0.1133fuzzy k-modes0.6776±0.05730.6915±0.06540.7704±0.18420.1050±0.10990.0925±0.10452.3fuzzy SV-k-modes0.8534±0.10540.8868±0.10210.8102±0.13300.5298±0.23990.5040±0.2092MD fuzzy k-modes0.9145±0.06070.9312±0.07020.8932±0.06180.6988±0.14550.6436±0.1401fuzzy k-modes0.6756±0.05730.6880±0.06510.7679±0.18660.1018±0.11000.0895±0.10502.5fuzzy SV-k-modes0.8701±0.08540.9059±0.07450.8293±0.11230.5654±0.19710.5378±0.1632MD fuzzy k-modes0.9207±0.05370.9386±0.06050.8990±0.05310.7168±0.12340.6630±0.1186fuzzy k-modes0.6831±0.05830.6972±0.06780.7155±0.17370.1157±0.11250.1030±0.10742.7fuzzy SV-k-modes0.8714±0.06520.8980±0.07920.8368±0.07610.5622±0.15600.5218±0.1609MD fuzzy k-modes0.9125±0.05840.9317±0.06200.8924±0.04920.6915±0.14660.6403±0.1375fuzzy k-modes0.6861±0.05740.7021±0.06450.7520±0.17430.1204±0.11090.1065±0.10632.9fuzzy SV-k-modes0.8624±0.09500.8907±0.10190.8231±0.11700.5507±0.21280.5159±0.2002MD fuzzy k-modes0.9195±0.05710.9385±0.05650.8944±0.07510.7121±0.13730.6586±0.1235

Notes: The bold values represent that the highest value obtained by the MD fuzzy k-modes algorithm.

Table 8 Comparison Results of the Three Algorithms on Musk
表8 在Musk数据集上3种算法的对比

βAlgorithmsMean±StdACPRREARINMIfuzzy k-modes0.5428±0.02450.5431±0.02440.5409±0.0267-0.0013±0.01030.0070±0.00731.1fuzzy SV-k-modes0.5536±0.03860.5649±0.05500.5505±0.04030.0079±0.01980.0165±0.0226MD fuzzy k-modes0.5811±0.03400.5960±0.03570.5765±0.03630.0094±0.01840.0217±0.0142fuzzy k-modes0.5493±0.03730.5530±0.03960.5480±0.04050.0047±0.02070.0122±0.01621.3fuzzy SV-k-modes0.5554±0.03800.5637±0.04440.5512±0.03950.0089±0.02130.0151±0.0191MD fuzzy k-modes0.5814±0.03520.5829±0.04920.5770±0.03660.0104±0.01650.0196±0.0197fuzzy k-modes0.5275±0.02330.5305±0.02520.5227±0.0283-0.0054±0.00940.0043±0.00751.5fuzzy SV-k-modes0.5500±0.03880.5604±0.05600.5454±0.04190.0062±0.01900.0153±0.0224MD fuzzy k-modes0.5865±0.04050.5957±0.04950.5725±0.04210.0098±0.02520.0165±0.0208fuzzy k-modes0.5464±0.03240.5498±0.03530.5445±0.03660.0022±0.01440.0103±0.01191.7fuzzy SV-k-modes0.5489±0.03890.5568±0.04650.5447±0.04110.0058±0.0216 0.0134±0.0186MD fuzzy k-modes0.5791±0.04200.5857±0.05070.5753±0.04340.0113±0.02660.0180±0.0268

Continued (Table 8)

βAlgorithmsMean±StdACPRREARINMIfuzzy k-modes0.5471±0.03460.5503±0.03650.5459±0.03690.0031±0.01780.0107±0.01391.9fuzzy SV-k-modes0.5583±0.03760.5754±0.05670.5549±0.03890.0100±0.02020.0202±0.0249MD fuzzy k-modes0.5883±0.03770.5933±0.05360.5743±0.03920.0190±0.01960.0293±0.0242fuzzy k-modes0.5428±0.02920.5465±0.03270.5408±0.03350.0001±0.01250.0088±0.01082.1fuzzy SV-k-modes0.5649±0.04570.5722±0.05210.5611±0.04710.0156±0.02760.0208±0.0239MD fuzzy k-modes0.5820±0.04510.5936±0.05850.5879±0.04540.0199±0.03050.0220±0.0318fuzzy k-modes0.5286±0.02290.5314±0.02500.5246±0.0276-0.0052±0.00930.0044±0.00752.3fuzzy SV-k-modes0.5525±0.04150.5633±0.05740.5478±0.04470.0081±0.02220.0167±0.0238MD fuzzy k-modes0.5898±0.04430.5704±0.05420.5558±0.04580.0126±0.03070.0193±0.0267fuzzy k-modes0.5402±0.02970.5432±0.03190.5372±0.0336-0.0006±0.01340.0079±0.01112.5fuzzy SV-k-modes0.5500±0.03750.5580±0.04430.5453±0.04010.0062±0.02100.0132±0.0172MD fuzzy k-modes0.5800±0.03580.5692±0.04180.5561±0.03740.0097±0.02320.0173±0.0219fuzzy k-modes0.5493±0.03180.5534±0.03440.5490±0.03420.0032±0.01530.0112±0.01242.7fuzzy SV-k-modes0.5692±0.03810.5746±0.04100.5661±0.04050.0151±0.02210.0197±0.0176MD fuzzy k-modes0.5906±0.04230.5946±0.04750.5730±0.04260.0197±0.02470.0204±0.0203fuzzy k-modes0.5402±0.02120.5433±0.02320.5391±0.0242-0.0023±0.00880.0067±0.00742.9fuzzy SV-k-modes0.5587±0.03400.5697±0.04720.5552±0.03540.0088±0.01780.0170±0.0183MD fuzzy k-modes0.5659±0.04130.5860±0.06940.5617±0.04210.0150±0.03100.0282±0.0453

Notes: The bold values represent that the highest value obtained by the MD fuzzy k-modes algorithm.

Table 9 Comparison Results of the Three Algorithms on MovieLens
表9 在MovieLens数据集上3种算法的对比

βAlgorithmsMean±StdACPRREARINMIfuzzy k-modes0.7072±0.09080.7440±0.06750.7173±0.12130.4349±0.13950.4997±0.13601.1fuzzy SV-k-modes0.7815±0.10920.8211±0.07730.6290±0.12350.5538±0.18370.5831±0.1654MD fuzzy k-modes0.8076±0.05760.8287±0.05030.6783±0.09820.5810±0.11050.6105±0.0906fuzzy k-modes0.6892±0.08150.7323±0.06790.6814±0.09460.3966±0.12150.4636±0.10571.3fuzzy SV-k-modes0.7727±0.08280.8021±0.06220.6032±0.07590.5305±0.15490.5564±0.1385MD fuzzy k-modes0.7986±0.06190.8166±0.05980.6550±0.09170.5548±0.11420.5787±0.1005fuzzy k-modes0.6843±0.08970.7355±0.07290.7284±0.11840.4026±0.12790.4884±0.11581.5fuzzy SV-k-modes0.7786±0.08390.8011±0.07450.6333±0.11330.5599±0.14970.5903±0.1336MD fuzzy k-modes0.7867±0.07250.8088±0.06610.7110±0.11500.5478±0.13190.5853±0.1136fuzzy k-modes0.7122±0.07260.7523±0.06000.6949±0.12080.4413±0.11210.5052±0.10961.7fuzzy SV-k-modes0.7451±0.11080.7917±0.08940.5817±0.09730.4829±0.18510.5235±0.1669MD fuzzy k-modes0.8065±0.04840.8252±0.04850.6636±0.09090.5763±0.09570.6176±0.0782fuzzy k-modes0.6740±0.08690.7227±0.08260.6993±0.11370.3773±0.12430.4561±0.11571.9fuzzy SV-k-modes0.7775±0.07480.8139±0.06350.6379±0.10930.5486±0.13490.5805±0.1284MD fuzzy k-modes0.8022±0.07010.8297±0.06760.6488±0.10940.5597±0.13790.6075±0.1216fuzzy k-modes0.6784±0.09740.7236±0.09390.7150±0.13290.3929±0.13230.4727±0.12332.1fuzzy SV-k-modes0.7376±0.08200.8044±0.06350.5725±0.07360.4608±0.15040.5131±0.1236MD fuzzy k-modes0.7855±0.06190.8082±0.06300.6814±0.09820.5440±0.11360.5805±0.0940fuzzy k-modes0.7107±0.07630.7417±0.06730.6867±0.10030.4287±0.11680.4958±0.09672.3fuzzy SV-k-modes0.7626±0.08000.8126±0.05040.6119±0.09540.5086±0.15430.5460±0.1386MD fuzzy k-modes0.7969±0.07110.8122±0.07490.6790±0.10480.5593±0.13100.5932±0.1186

Continued (Table 9)

βAlgorithmsMean±StdACPRREARINMIfuzzy k-modes0.6701±0.08180.7139±0.06290.7304±0.11470.3858±0.12060.4687±0.10882.5fuzzy SV-k-modes0.7226±0.09770.7931±0.06530.5908±0.12150.4460±0.17140.5223±0.1453MD fuzzy k-modes0.7867±0.07200.8074±0.07010.6627±0.10020.5345±0.13950.5679±0.1240fuzzy k-modes0.6686±0.09480.7183±0.08210.6710±0.11380.3667±0.13870.4364±0.13692.7fuzzy SV-k-modes0.7631±0.06690.7949±0.06860.6205±0.08120.5026±0.12320.5340±0.1043MD fuzzy k-modes0.7788±0.07500.8023±0.07050.6495±0.09640.5294±0.13610.5623±0.1226fuzzy k-modes0.6681±0.09930.7155±0.09130.7115±0.12480.3736±0.14300.4430±0.14792.9fuzzy SV-k-modes0.7673±0.07720.8058±0.07270.6160±0.09030.5253±0.13990.5695±0.1142MD fuzzy k-modes0.8001±0.05040.8139±0.05320.6626±0.10350.5635±0.10540.5983±0.0930

Notes: The bold values represent that the highest value obtained by the MD fuzzy k-modes algorithm.

3.4 βw的关系

由于β的取值直接影响矩阵对象归属到每个类别的隶属度,因此有必要分析模糊因子β与隶属度w的关系.由于数据集的对象数过多,本文只取前10个对象作为研究对象.经过30次实验后求平均,Market Basket,Microsoft Web,Musk,MovieLens这4个数据集的实验结果分别如图1~4所示.其中,“○”表示矩阵对象分到第1类,“★”表示矩阵对象分到第2类,“□”表示矩阵对象分到第3类,“+”表示矩阵对象分到第4类.

Fig. 1 Relationship between β and w on Market Basket
图1 在Market Basket数据集上βw的关系图

Fig. 2 Relationship between β and w on Microsoft Web
图2 在Microsoft Web数据集上βw的关系图

Fig. 3 Relationship between β and w on Musk
图3 在Musk数据集上βw的关系图

Fig. 4 Relationship between β and w on MovieLens
图4 在MovieLens数据集上βw的关系图

由图1~4可知:隶属度w明显受模糊因子β的影响.随着β的增大,w的值呈递减(或递增)形式变化.β的值越大,曲线越平缓,即隶属同一类别的可能性越趋于一致.

4 结 论

实际应用中,大多数数据都是矩阵对象数据,为了对这类数据进行聚类,本文提出了一种新的聚类算法——MD fuzzy k-modes算法.首先,引用了矩阵对象间的相异性度量;其次,给出类中心的表示及启发式更新算法;再次,提出了MD fuzzy k-modes算法;最后通过在Market Basket,Microsoft Web,Musk,MovieLens,Alibaba这5个数据集上的实验分析,验证了本文所提出的MD fuzzy k-modes算法在聚类效果上的有效性并分析了模糊因子β与隶属度w之间的关系.大数据时代,通过MD fuzzy k-modes算法对多条记录进行聚类,能更易发现客户的消费喜好,从而做出具有针对性的推荐.

参考文献

[1]Macqueen J. Some methods for classification and analysis of multivariate observations[C] //Proc of the 5th Berkeley Symp on Mathematical Statistics and Probability. Oakland, CA: University of California Press, 1967: 281-297

[2]Huang Zhexue. Extensions to the k-means algorithm for clustering large data sets with categorical values[J]. Data Mining and Knowledge Discovery, 1998, 2(3): 283-304

[3]Chaturvedi A, Green P E, Caroll J D. K-modes clustering[J]. Journal of Classification, 2001, 18(1): 35-55

[4]Huang Zhexue, Ng M K. A note on k-modes clustering[J]. Journal of Classification, 2003, 20(2): 257-261

[5]Ying Sun, Zhu Qiuming, Chen Zhengxin. An iterative initial-points refinement algorithm for categorical data clustering[J]. Pattern Recognition Letters, 2002, 23(7): 875-884

[6]Ng M K, Li Mark Junjie, Huang Joshua Zhexue, et al. On the impact of dissimilarity measure in k-modes clustering algorithm[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2007, 29(3): 503-507

[7]San O M, Huynh V N, Nakamori Y. An alternative extension of the k-means algorithm for clustering categorical data[J]. International Journal of Applied Mathematics & Computer Science, 2004, 14(2): 241-247

[8]Li Cen, Biswas G. Unsupervised learning with mixed numeric and nominal data[J]. IEEE Transactions on Knowledge & Data Engineering, 2002, 14(4): 673-690

[9]Liang Jiye, Bai Liang, Cao Fuyuan. K-modes clustering algorithm based on a new distance measure[J]. Journal of Computer Research and Development, 2010, 47(10): 1749-1755 (in Chinese)(梁吉业, 白亮, 曹付元. 基于新的距离度量的K-modes聚类算法[J]. 计算机研究与发展, 2010, 47(10): 1749-1755)

[10]Bai Liang, Liang Jiye. The k-modes type clustering plus between-cluster information for categorical data[J]. Neurocomputing, 2014, 133(133): 111-121

[11]Cao Fuyuan, Liang Jiye, Bai Liang, et al. A framework for clustering categorical time-evolving data[J]. IEEE Transactions on Fuzzy Systems, 2010, 18(5): 872-882

[12]Cao Fuyuan, Liang Jiye. A data labeling method for clustering categorical data[J]. Expert Systems with Applications, 2011, 38(3): 2381-2385

[13]Cao Fuyuan, Liang Jiye, Li Deyu, et al. A dissimilarity measure for the k-modes clustering algorithm[J]. Knowledge-Based Systems, 2012, 26(9): 120-127

[14]Cao Fuyuan, Liang Jiye, Li Deyu, et al. A weighting k-modes algorithm for subspace clustering of categorical data[J]. Neurocomputing, 2013, 108(5): 23-30

[15]Pawlak Z. Rough Sets: Theoretical Aspects of Reasoning about Data[M]. Dordrecht, Netherland: Kluwer Academic Publishers, 1992

[16]Lehmann I, Weber R, Zimmermann H J. Fuzzy set theory[J]. Operations-Research-Spektrum, 1992, 14(1): 1-9

[17]Bezdek J C, Ehrlich R, Full W. FCM: The fuzzy c-means clustering algorithm[J]. Computers & Geosciences, 1984, 10(2/3): 191-203

[18]Huang Zhexue, Ng M K. A fuzzy k-modes algorithm for clustering categorical data[J]. IEEE Transactions on Fuzzy Systems, 1999, 7(4): 446-452

[19]Kim D W, Lee K H, Lee D. Fuzzy clustering of categorical data using fuzzy centroids[J]. Pattern Recognition Letters, 2004, 25(11): 1263- 1271

[20]Li Jie, Gao Xinbo, Jiao Licheng. A new feature weighted fuzzy clustering algorithm[C] //Proc of the Int Workshop on Rough Sets. Berlin: Springer, 2005: 412-420

[21]Cai Weiling, Chen Songcan, Zhang Daoqiang. Fast and robust fuzzy c-means clustering algorithms incorporating local information for image segmentation[J]. Pattern Recognition, 2007, 40(3): 825-838

[22]Zhou Zhiping, Zhu Shuwei, Zhang Daowen. Multiobjective clustering algorithm with fuzzy centroids for categorical data[J]. Journal of Computer Research and Development, 2016, 53(11): 2594-2606 (in Chinese)(周治平, 朱书伟, 张道文. 分类数据的多目标模糊中心点聚类算法[J]. 计算机研究与发展, 2016, 53(11): 2594-2606)

[23]Cao Fuyuan, Yu Liqin, Huang Joshua Zhexue, et al. k-mw-modes: An algorithm for clustering categorical matrix-object data[J]. Applied Soft Computing, 2017, 57: 605-614

[24]Cao Fuyuan, Huang Joshua Zhexue, Liang Jiye, et al. An algorithm for clustering categorical data with set-valued features[J]. IEEE Transactions on Neural Networks & Learning Systems, 2018, 29(10): 4593-4606

[25]Cao Fuyuan, Huang Joshua Zhexue, Liang Jiye, et al. A fuzzy SV-k-modes algorithm for clustering categorical data with set-valued attributes[J]. Applied Mathematics & Computation, 2017, 295: 1-15

[26]Pal N R, Bezdek J C. On cluster validity for the fuzzy c-means model[J]. IEEE Transactions on Fuzzy Systems, 1995, 3(3): 370-379

[27]Zhou Kaile, Fu Chao, Yang Shanlin. Fuzziness parameter selection in fuzzy c-means: The perspective of cluster validation[J]. Science China Information Sciences, 2014, 57(11): 1-8

A MD fuzzy k-modes Algorithm for Clustering Categorical Matrix-Object Data

Li Shunyong1, Zhang Miaomiao1, and Cao Fuyuan2

1(School of Mathematical Sciences, Shanxi University, Taiyuan 030006)2(School of Computer and Information Technology, Shanxi University, Taiyuan 030006)

Abstract Traditional algorithms generally cluster single-valued attributed data. However, in practice, each attribute of the data object is described by more than one feature vector. For example, customers may purchase multiple products at the same time as they shop. An object described by multiple feature vectors is called a matrix object and such data are called matrix-object data. At present, the research work on clustering algorithms for categorical matrix- object data is relatively rare, and there are still many issues to be settled. In this paper, we propose a new matrix-object data fuzzy k-modes (MD fuzzy k-modes) algorithm that uses the fuzzy k-modes clustering process to cluster categorical matrix-object data. In the proposed algorithm, we introduce the fuzzy factor β with the concept of fuzzy set. The dissimilarity measure between two categorical matrix-objects is redefined, and the heuristic updating algorithm of the cluster centers is provided. Finally, the effectiveness of the MD fuzzy k-modes algorithm is verified on the five real-world data sets, and the relationship between fuzzy factor β and membership w is analyzed. Therefore, in the era of big data, clustering multiple records by using the MD fuzzy k-modes algorithm can make it easier to find customers’ spending habits and preferences, so as to make more targeted recommendation.

Key words matrix-object data; MD fuzzy k-modes algorithm; dissimilarity measure; cluster centers; clustering

(lisy75@sxu.edu.cn)

DOI:10.7544/issn1000-1239.2019.20180737

收稿日期2018-11-01;修回日期:2019-04-09

基金项目国家自然科学基金项目(61573229);山西省基础研究计划项目(201701D121004);山西省回国留学人员科研资助项目(2017-020);山西省高等学校教学改革创新项目(J2017002)

This work was supported by the National Natural Science Foundation of China (61573229), the Shanxi Provincial Basic Research Foundation of China (201701D121004), the Shanxi Scholarship Council of China (2017-020), and the Shanxi Provincial Teaching Reform and Innovation Program in Higher Education (J2017002).

中图法分类号 TP391

Li Shunyong, born in 1975. PhD, associate professor and master supervisor. His main research interest is statistical machine learning.

Zhang Miaomiao, born in 1993. Master candidate. Her main research interest is statistical machine learning.

Cao Fuyuan, born in 1974. Professor and PhD supervisor. Member of CCF. His main research interests include data mining and machine learning.