Exploration Approaches in Deep Reinforcement Learning Based on Intrinsic Motivation: A Review
-
摘要:
近年来,深度强化学习(deep reinforcement learning, DRL)在游戏人工智能、机器人等领域取得了诸多重要成就. 然而,在具有稀疏奖励、随机噪声等特性的现实应用场景中,该类方法面临着状态动作空间探索困难的问题. 基于内在动机的深度强化学习探索方法是解决上述问题的一种重要思想. 首先解释了深度强化学习探索困难的问题内涵,介绍了3种经典探索方法,并讨论了这3种方法在高维或连续场景下的局限性;接着描述了内在动机引入深度强化学习的背景和算法模型的常用测试环境,在此基础上详细梳理各类探索方法的基本原理、优势和缺陷,包括基于计数、基于知识和基于能力3类方法;然后介绍了基于内在动机的深度强化学习技术在不同领域的应用情况;最后总结亟需解决的难以构建有效状态表示等关键问题以及结合表示学习、知识积累等领域方向的研究展望.
Abstract: In recent years, deep reinforcement learning has achieved many important results in game artificial intelligence, robotics, and other fields. However, in realistic application scenarios with sparse rewards and random noise, such methods suffer from the difficulty of exploring the large state-action space. Introducing the notion of intrinsic motivation from psychology into deep reinforcement learning is an important idea for solving this problem. Firstly, the nature of the exploration difficulty in deep reinforcement learning is explained, three classical exploration methods are introduced, and their limitations in high-dimensional or continuous scenarios are discussed. Secondly, the background of introducing intrinsic motivation into deep reinforcement learning and the common testing environments for algorithms and models are described. On this basis, the basic principles, advantages, and disadvantages of various exploration methods are analyzed in detail, covering count-based, knowledge-based, and competence-based approaches. Then, applications of intrinsically motivated deep reinforcement learning in different fields are introduced. Finally, this paper highlights the key problems that remain to be solved, such as the difficulty of constructing effective state representations, and points out prospective research directions such as representation learning and knowledge accumulation. Hopefully, this review can provide readers with guidance for designing suitable intrinsic rewards for the problems at hand and for devising more effective exploration algorithms.
-
天际线(Skyline)查询[1]作为多目标决策、兴趣点发现、推荐系统等领域关键问题的一种解决途径,在2001年被提出,自此受到研究学者的广泛关注与研究. 近些年,Skyline查询研究拓展到不确定数据Skyline查询[2]﹑数据流Skyline查询[3]﹑动态Skyline查询[4]﹑反Skyline查询[5] 、偏好Skyline查询等方面,其中偏好Skyline查询可以返回满足用户偏好需求的结果集. 针对因用户偏好不同导致属性的重要性不同问题,研究者们提出了新的支配关系与算法. 但已有研究主要集中在非道路网的用户偏好Skyline查询或者道路网单用户偏好Skyline查询方面,没有考虑道路网多用户偏好和权重的Top-k Skyline查询.
传统偏好Skyline查询算法主要存在3点局限性:1)偏好Skyline查询需要确定属性的重要程度,由于不同用户权重与偏好不同,因此不同属性的重要程度也不一致,而已有研究中较少有提出将用户偏好和权重综合考虑,得到对用户群统一的属性重要程度次序处理方法;2)传统偏好Skyline查询算法大多未考虑道路网环境下的距离维度,只考虑静态维度;3)传统偏好Skyline查询算法返回的结果集过大、无序,不能给用户提供有效的决策支持.
因此,针对道路网多用户偏好Top-k Skyline查询问题,本文提出满足多用户不同权重和偏好需求的查询方法.
本文的主要贡献有3点:
1)道路网中存在大量数据点且查询用户较多时,需要计算各数据点到每个查询用户的道路网距离,距离计算开销很大. 为提升距离计算效率,本文根据所提的Vor-R*-DHash索引结构以及数据点与查询用户群的空间位置关系,提前剪枝在距离维度被支配的大量数据点.
2)针对道路网Top-k Skyline查询处理时未综合考虑多用户不同权重和偏好、返回结果集数量不可控的问题,本文首先提出整体属性权重值的概念,综合考虑用户权重和偏好;进一步提出用户群权重偏好次序,并基于此次序提出一种新的支配关系,即K-准放松支配;接着根据偏好次序逐次放松支配,使返回结果集大小可控;同时当k值改变时,只需动态调整放松轮次即可获取候选结果集CS,无需重新计算距离、偏好次序等,减少了查询计算开销.
3)针对Skyline查询返回结果集无序的问题,本文基于z-整体属性权重值,提出了选取Top-k个结果集的打分函数,对候选结果集CS打分排序,返回有序结果集.
1. 相关工作
Skyline查询主要分为集中式查询和分布式查询. 其中集中式查询主要分为使用索引结构和不使用索引结构. 使用索引结构的算法常用R-tree等索引结构,例如文献[6]利用最近邻(nearest neighbor,NN)算法和R-tree索引查找Skyline点,基于R-tree可以快速判断数据点是否为Skyline点,接着利用数据点进行子集合的划分,递归查找Skyline点. 不使用索引结构的Skyline查询算法主要有基于排序的SFS(sort-filter Skyline)算法[7]. 而Skyline查询在不断发展过程中又产生了许多变种问题,例如K-支配空间Skyline查询[8]﹑连续Skyline查询[9]﹑针对推荐系统的范围障碍空间连续Skyline查询[10]﹑概率Skyline查询[11]以及Top-k Skyline查询等[12-13].
在集中式计算环境下,文献[14]根据用户不同偏好提出了维度不确定的定义,根据维度特征划分数据,进行Skyline概率支配测试,同时利用阈值处理大规模数据集Skyline查询问题. 文献[15]提出一种高效偏序域Skyline查询处理方法,利用倒排索引进行Skyline查询. 在并行计算环境下,文献[16]提出了不完全数据集的偏好Skyline查询算法SPQ(Skyline preference query). 文献[17]根据用户的偏好,基于Voronoi图将数据对象划分到不同网格中,并行计算所有对象组合,获取动态Skyline结果. 文献[18]提出了MapReduce下Top-k Skyline偏好查询.
道路网Skyline查询近些年来也受到越来越多的关注. 道路网Skyline查询既考虑数据点的路网空间属性,又考虑非空间属性. 文献[19]提出了基于范围的移动对象连续Skyline查询处理方法,利用Voronoi图组织道路网中的数据点,通过所提的3种算法减少道路网产生的相交节点数和距离计算开销. 文献[20]提出了道路网环境下综合考虑空间距离和社交距离的Skyline组用户查询方法.
Top-k Skyline查询在多目标决策中往往更具优势,因为它可以控制返回的结果集数量. 文献[21]提出基于安全区域技术解决连续Top-k Skyline查询结果更新问题,提出了结合Top-k查询和Skyline查询的安全区域构建算法. 文献[22]提出了MapReduce环境下Top-k Skyline处理方法. 文献[23]将K-Skyband查询与Top-k Skyline查询结合处理大数据集的Top-k Skyline查询.
目前道路网环境下Top-k Skyline查询研究大多集中在单用户场景,较少考虑多用户偏好和权重不同的场景. 针对已有方法的不足,本文利用查询点与数据点的位置关系剪枝数据集,利用所提的K-准放松支配控制结果集数量;利用所提的打分函数返回有序结果集,在理论论证和分析基础上提出了道路网多用户偏好Top-k Skyline查询方法.
2. 主要定义
设道路网环境下数据集P={p1, p2,…, pn},查询用户群G={q1, q2,…, qm}.
定义1. 道路网距离支配. 给定查询用户群G、数据点p1、数据点p2,数据点到查询用户的距离记为Dist. 当且仅当对任意i(1≤i≤m)有Dist(p1, qi)≤Dist(p2, qi),且存在i(1≤i≤m)使得Dist(p1, qi)<Dist(p2, qi)时,称p1道路网距离支配p2,记作p1►p2. 本文的距离如不特殊说明,均指道路网距离.
定义2. 整体属性权重. 给定查询用户群G,用户权重w={w1,w2,…,wm},用户qi的查询关键字keys={C1,C2},C1为优先考虑的属性集合,C2为一般偏好的属性集合,任意维度dj的整体属性权重Wj如式(1):
$W_j=\displaystyle\sum_{i=1}^{m}s_i\cdot w_i, \quad (1)$ 其中si代表属性dj对于用户qi的重要性得分.
在属性的重要性程度计分时,将属性偏好分为3类:优先考虑﹑一般偏好和未考虑. 不同类别分数不同,例如C1中的属性被赋予2分,C2中的属性被赋予1分,未考虑的属性被赋予0分.
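为便于理解式(1)的计算过程,下面给出一段示意性Java代码(假设性示例,并非论文原实现,类名与方法名均为说明而虚构),按文中示例的计分方式(优先考虑2分、一般偏好1分、未考虑0分)计算各维度的整体属性权重:

```java
import java.util.*;

// 式(1)的示意实现(假设性示例): userWeights[i]为用户q_i的权重w_i,
// priority/general分别为各用户优先考虑的属性集合C1与一般偏好的属性集合C2.
public class OverallAttributeWeight {
    public static Map<String, Double> compute(double[] userWeights,
                                              List<Set<String>> priority,
                                              List<Set<String>> general,
                                              List<String> dimensions) {
        Map<String, Double> w = new HashMap<>();
        for (String dj : dimensions) {
            double sum = 0.0;
            for (int i = 0; i < userWeights.length; i++) {
                int s = priority.get(i).contains(dj) ? 2        // 优先考虑记2分
                      : general.get(i).contains(dj) ? 1 : 0;    // 一般偏好记1分, 未考虑记0分
                sum += s * userWeights[i];                      // 累加 s_i * w_i
            }
            w.put(dj, sum);                                     // 维度dj的整体属性权重W_j
        }
        return w;
    }
}
```

例如2个权重分别为0.6和0.4的用户若都将同一属性列入优先考虑集合C1,则该属性的整体属性权重为2×0.6+2×0.4=2.0.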
定义3. 用户群权重偏好次序. 指针对查询用户群属性的有序集合 GP={d1, d2, …, di},其中di代表任意属性,GP中属性对用户群的重要性程度呈非递增排列. 用户群权重偏好次序综合考虑用户的偏好和权重.
定义4. K-准放松支配(KPRD). 设P为数据集,数据维度空间为D,dj为任意维度,总维度数为r,θ=(θ1,θ2,…,θK)是D上K个维度的无差异阈值. 数据点pi,pt∈P,pi K-准放松支配pt,记作piϾpt,当且仅当:
$\begin{cases}|\{\,j \mid p_i[d_j]-p_t[d_j]>\theta_j\,\}|=0,\\ |\{\,j \mid p_i[d_j]-p_t[d_j]>0\,\}| < |\{\,j \mid p_t[d_j]-p_i[d_j]>0\,\}|,\end{cases} \quad (2)$ 其中1≤j≤K,且各维度取值越小越优(与定义1一致).
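按照定义4,K-准放松支配的判断可用如下Java代码示意(假设性示例,非论文原代码;约定各维度取值越小越优,数组下标对应按用户群权重偏好次序重排后的维度):

```java
// K-准放松支配(式(2))的示意实现(假设性示例).
public class KPRDominance {
    /** theta为当前轮次的无差异阈值向量(未放松的维度取0), 返回true表示pi K-准放松支配pt */
    public static boolean kprDominates(double[] pi, double[] pt, double[] theta) {
        int beyondTheta = 0; // |{ j : pi[dj] - pt[dj] > theta_j }|, 条件1要求其为0
        int piWorse = 0;     // |{ j : pi[dj] - pt[dj] > 0 }|, pi较差的维度数
        int piBetter = 0;    // |{ j : pt[dj] - pi[dj] > 0 }|, pi较优的维度数
        for (int j = 0; j < theta.length; j++) {
            if (pi[j] - pt[j] > theta[j]) beyondTheta++;
            if (pi[j] - pt[j] > 0) piWorse++;
            if (pt[j] - pi[j] > 0) piBetter++;
        }
        return beyondTheta == 0 && piWorse < piBetter; // 分别对应式(2)的2个条件
    }
}
```

当θ的各分量均为0时,上述判断退化为经典的严格支配,这与定理4的结论一致.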
定义5. 道路网多用户偏好Top-k Skyline查询. 给定道路网路段集R、查询用户群G、数据集P、用户的查询关键字集合keys和用户权重集合w,道路网多用户偏好Top-k Skyline查询返回P的一个子集. 该子集中数据点在道路网的距离维度和静态维度都不能被P中任意其他数据点支配,并且是根据用户群偏好和权重排序的Top-k个数据点. 本文将道路网多用户偏好Top-k Skyline查询方法记作MUP-TKS.
3. 道路网多用户偏好Top-k Skyline查询
本文提出的道路网多用户偏好Top-k Skyline查询方法主要分为3个部分:距离较优集选取﹑K-准放松支配和Top-k个数据点选取.
3.1 道路网距离较优集选取方法
定义6. Mindist距离[24]. r维欧氏空间中,点p到同一空间内某矩形N的最小距离为Mindist(N, p).
定义7. Edist距离. 设查询用户群的最小外接矩形(minimum bounding rectangle,MBR)为Q,数据点构成的MBR为N,p为N中任意数据点,则min{Mindist(p, Q)}为(Q, N)最小欧氏距离,记作Edistmin;max{Mindist(p, Q)}为(Q, N)最大欧氏距离,记作Edistmax.
定义8.Ndist距离. 设查询用户群的MBR为Q,数据点p的MBR为N,有min{Ndist(p, Q)}为(Q, N)最小网络距离,记作Ndistmin;max{Ndist(p, Q)}为(Q, N)最大网络距离,记作Ndistmax,其中p为N中的任意数据点,Ndist(p,Q)为p到Q的网络距离.
定理1. 设查询用户群的MBR为Q,道路网中数据点构成的2个中间节点分别为N1,N2,若DE1=Edistmin(Q, N2),DE2=Edistmax(Q, N1),DN1=Ndistmax(Q, N1),并且DE1>DE2,DE1>DN1,则N1►N2,且N2中任意数据点都被N1中数据点距离支配.
证明. 假设DN2=Ndistmin(Q, N2),因为欧氏距离值一定小于等于道路网距离值,所以当DE1>DE2且DE1>DN1时一定有DN2≥DE1,可得DN2>DN1,即N2中数据点到Q的最小网络距离大于N1中数据点到Q的最大网络距离,进而可得N2中任意数据点到Q的网络距离都大于N1中任意数据点到Q的网络距离. 因此N1►N2,且N2中任意数据点被N1中任意数据点道路网距离支配.证毕.
剪枝规则1. 设数据点构成的MBR分别为N1,N2,查询用户群的MBR为Q,如果满足:Edistmax(Q, N1)≤Edistmin(Q, N2),并且Ndistmax(Q, N1)<Edistmin(Q, N2),则节点N2可被剪枝.
定义9. 道路网最大距离的最小值. 给定数据点p1,p2,查询用户群G,数据点p到查询点q的道路网距离为Ndist(p, q). 若有DN1=max{Ndist(p1, qi)},DN2=max{Ndist(p2, qi)}(1≤i≤m),并且DN1<DN2,则当前道路网最大距离的最小值为DN1,记作DN_MaxMin.对应的数据点为p1.
定理2. 若节点N的Edistmin(Q, N)>DN_MaxMin,则节点N可被剪枝.
证明. 因为Edistmin(Q, N)>max{Ndist(p, qi)}(1≤i≤m),所以Ndistmin(Q, N)>max{Ndist(p, qi)},即p►N,且N中数据点也被p距离支配.证毕.
剪枝规则2. 若Edistmin(Q, N)≥DN_MaxMin,则节点N被支配,即N和N中数据点{p1, p2,···, pi}被剪枝.
如图1所示,数据点p1,p2到查询用户群{q1, q2, q3}的最大网络距离分别为DN1,DN2,有DN1>DN2,则DN_MaxMin=DN2. 数据点{p3,p4,p5,p6,p7,p8}构成的MBR为N1. 若Edistmin(Q, N1)>DN_MaxMin,则N1中任意数据点p到各查询用户的网络距离满足Ndist(p, qi)≥Edist(p, qi)≥Edistmin(Q, N1)>DN_MaxMin≥Ndist(p2, qi)(1≤i≤3),所以p2►N1,N1可被剪枝.
定理3. 设DE为数据点pi到查询用户qj的欧氏距离,若min{DE(pi, qj)}>DN_MaxMin(1≤j≤m),则pi被剪枝.
证明. 假设p1为DN_MaxMin对应的数据点. 若min{DE(pi, qj)}>DN_MaxMin,由网络距离不小于欧氏距离可得Ndist(pi, qj)>DN_MaxMin(1≤j≤m),即数据点p1►pi,pi可被剪枝.证毕.
剪枝规则3. 假设数据点p1为DN_MaxMin对应的数据点,若存在DN_MaxMin<min{DE(pi, qj)}(1≤j≤m),则p1►pi,可将pi从候选集中删除,其中pi为任意不为p1的数据点.
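剪枝规则1~3本质上都是用易于计算的欧氏距离界去约束道路网距离,其判断条件可用如下Java代码示意(假设性示例,各距离值假定已按定义6~8计算得到,方法名为说明而虚构):

```java
// 剪枝规则1~3的判断条件示意(假设性示例).
public class PruneRules {
    /** 剪枝规则1: Edistmax(Q,N1)≤Edistmin(Q,N2)且Ndistmax(Q,N1)<Edistmin(Q,N2)时剪掉N2 */
    public static boolean rule1(double maxE_N1, double maxN_N1, double minE_N2) {
        return maxE_N1 <= minE_N2 && maxN_N1 < minE_N2;
    }

    /** 剪枝规则2: 节点N到Q的最小欧氏距离不小于当前DN_MaxMin时剪掉N */
    public static boolean rule2(double minE_N, double dnMaxMin) {
        return minE_N >= dnMaxMin;
    }

    /** 剪枝规则3: 数据点到各查询用户欧氏距离的最小值大于DN_MaxMin时剪掉该点 */
    public static boolean rule3(double[] deToUsers, double dnMaxMin) {
        double minDE = Double.MAX_VALUE;
        for (double d : deToUsers) minDE = Math.min(minDE, d);
        return minDE > dnMaxMin;
    }
}
```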
为了减少计算,在剪枝前基于路网数据点的网络Voronoi图构建Vor-R*-DHash索引结构,如图2所示.
Vor-R*-DHash索引结构构造过程有3步:
1)构建路网所有数据点的网络Voronoi图.
2)创建R*-tree.从R*-tree的根部开始,从上至下、从左至右给每个节点编号,从0开始编号.
3)构建2级HashMap结构,第1级HashMap为first_hash、key为R*-tree中每个节点编号;第2级HashMap为sec_hash、key为后续剪枝处理需要的值,包括isNode(非数据点的节点)、MinE(节点到Q的最小欧氏距离值)、MaxE(节点到Q的最大欧氏距离值 )、MinN(节点到Q的最小网络距离值)、MaxN(节点到Q的最大网络距离值)、{DN1, DN2,…, DNi}(数据点到各查询用户的网络距离)、{DE1, DE2,…, DEi}(数据点到各查询用户的欧氏距离).
2级key对应的value值初始都为空,若数据点根据剪枝规则提前被剪枝,则这些值无需计算.DEi,DNi的值也是后续需要使用才被计算,并存入sec_hash.
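Vor-R*-DHash中的2级HashMap可用如下Java代码示意(假设性示例,类名与字段组织为说明而虚构),其关键点是value按需延迟计算,被剪枝节点对应的值不会被填充:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// 2级HashMap结构的示意实现(假设性示例): first_hash以R*-tree节点编号为key,
// sec_hash以isNode、MinE、MaxE、MinN、MaxN等字符串为key, value延迟计算.
public class VorRStarDHash {
    private final Map<Integer, Map<String, Object>> firstHash = new HashMap<>();

    /** 取出(必要时创建)某节点编号对应的sec_hash */
    public Map<String, Object> secHash(int nodeId) {
        return firstHash.computeIfAbsent(nodeId, id -> new HashMap<>());
    }

    /** 仅在剪枝判断确实需要某个值时才计算并缓存, 避免无谓的距离计算 */
    public Object getOrCompute(int nodeId, String key, Supplier<Object> calc) {
        return secHash(nodeId).computeIfAbsent(key, k -> calc.get());
    }
}
```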
基于剪枝规则1~3和Vor-R*-DHash索引结构,进一步给出距离较优集选取方法,如算法1所示.
算法1. 距离较优集选取方法 G_DBC.
输入:查询用户群G,道路网路段集R,数据集P;
输出:距离维度不被支配的距离较优集DBC.
① 以P中数据点、道路网路段集R构建 Vor- R*-DHash索引;
② 构建查询用户群的最小外包矩形Q;
③ 初始化DBC←∅;
④ 根据索引找到距离查询用户最近的点point;
⑤ 将point加入DBC中;
⑥ 计算数据点到各查询用户的网络距离 Ndist(p, qi)、欧氏距离DE(p, qi);
⑦ 将数据点网络距离、欧氏距离存入sec_hash;
⑧ 找到数据点到查询用户的网络距离最大值 的最小值;
⑨ DN_MaxMin←min{Ndist(p,qi)}; /*将最小值 赋给DN_MaxMin*/
⑩ 将数据点父节点Ni加入队列queue中;
⑪ 计算Ni到Q的最小、最大欧氏距离和最 小、 最大网络距离,存至sec_hash;
⑫ N1←min{MaxN}; /*将当前支配能力最强的 节点赋值给N1*/
⑬ for node in queue do
⑭ if node的孩子节点都被访问过 then
⑮ 将node的父节点加入queue中;/*向上 一层访问*/
⑯ end if
⑰ if node的孩子节点N为非叶子节点 then
⑱ 计算N到Q的欧氏距离DE1;
⑲ if DE1 > DN_MaxMin then
⑳ Cut N;/*将N剪枝,剪枝规则2*/
㉑ else if MaxE(N1) < MinE(N) 且 MaxN(N1)<MinE(N) then
㉒ Cut N;/*剪枝规则1*/
㉓ else
㉔ 将N加入队列queue;
㉕ 计算N到Q的最小、最大网络距离, 并 存至sec_hash;
㉖ 更新N1←min{MaxN};/*当前支配能力强 的节点赋给N1*/
㉗ end if
㉘ end if
㉙ if node的孩子节点N为叶子节点 then
㉚ 计算数据点到各查询用户欧氏距离 DE(p,qi);
㉛ if min{DE(p,qi)} > DN_MaxMin then
㉜ Cut N;/*剪枝规则3*/
㉝ else
㉞ 计算N到各查询用户网络距离DN;
㉟ if min{DN} > DN_MaxMin then
㊱ Cut N;
㊲ else
㊳ 将N与DBC中数据点支配比较;
㊴ if N被支配 then
㊵ Delete N;
㊶ else
㊷ 将N加入DBC中;
㊸ 更新DN_MaxMin←min{DN};
㊹ end if
㊺ end if
㊻ end if
㊼ end if
㊽ end for
㊾ return DBC.
算法1首先构建Vor-R*-DHash索引和查询用户群最小外接矩形Q,可快速得到距离查询点最近的数据点point,计算并保存sec_hash所需数据. 将point加入距离较优集DBC,并初始化DN_MaxMin. 接着将point父节点加入队列queue中,计算并保存sec_hash所需数据,并初始化N1. 每次取出队头节点处理,依据剪枝规则1~3进行节点的剪枝或者将节点加入DBC,并判断是否需要更新N1,DN_MaxMin等值,直至队列为空,循环结束. 最后返回距离较优集DBC.
3.2 数据集的放松支配过程
3.2.1 获取用户群权重偏好次序
首先初始化整体属性权重集合W={W1,W2,…,Wi}={0,0,…,0};接着计算每个属性的整体属性权重值得到W;最后对整体属性权重值不为0的属性降序排列,得到属性的重要性次序,即用户群权重偏好次序.
在获取用户群权重偏好次序时,为了减小计算开销,利用HMap1,HMap2分别保存优先考虑的属性和一般偏好的属性. 当用户发起查询时,将C1中属性作为键,对应的用户权重作为值保存到HMap1;将C2中属性作为键,对应的用户权重作为值保存到HMap2.
进一步给出获取用户群权重偏好次序算法CDW,如算法2所示.
算法2. 获取用户群权重偏好次序算法 CDW.
输入:用户群G,用户查询关键字keys,用户权重w,维度空间D;
输出:用户群权重偏好次序GP.
① 初始化W为0; /*大小为数据集维度数*/
② 根据keys,w创建HMap1,HMap2;
③ for dj ∈D do
④ 基于HMap1、HMap2和式(1)得到Wj ;
⑤ end for
⑥ 根据W降序得到用户群权重偏好次序GP;
⑦ return GP. /*返回用户群权重偏好次序*/
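算法2第⑥行"根据W降序得到GP"的处理(过滤零权重维度并按整体属性权重排序)可用如下Java代码示意(假设性示例,延续前文虚构的整体属性权重计算结果W):

```java
import java.util.*;

// 用户群权重偏好次序GP的示意实现(假设性示例): 过滤整体属性权重为0的维度,
// 其余维度按W_j非递增排序.
public class CDWOrder {
    public static List<String> preferenceOrder(Map<String, Double> w) {
        List<String> gp = new ArrayList<>();
        for (Map.Entry<String, Double> e : w.entrySet()) {
            if (e.getValue() > 0) gp.add(e.getKey());            // 只保留W_j>0的维度
        }
        gp.sort((a, b) -> Double.compare(w.get(b), w.get(a)));   // 按W_j降序
        return gp;
    }
}
```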
3.2.2 基于用户群权重偏好次序的K-准放松支配
获取用户群偏好次序后,基于该次序进行放松支配处理. 本文中K为整体属性权重值不为0的维度数. 放松支配过程的处理对象为DBC与静态Skyline集取并集后的集合S. 经K-准放松支配后得到数量可控的候选结果集CS.
定理4. 任意2个数据点pi,pj∈P,若第i(i>0)轮在K个维度上piϾpj,则数据点pi必定在前K–i维支配数据点pj.
证明. 若在第i轮piϾpj,可知该轮的无差异阈值为(0,0,…,0,θK−i+1,…,θK),进而可得前K–i维使用的无差异阈值为(0,0,…,0),所以前K–i维为严格支配比较,即数据点pi必定在前K–i维支配数据点pj.证毕.
定理5. 数据集P经过第i(i>1)轮放松支配后所得结果集Si一定是第i–1轮放松支配后所得结果集Si−1的子集.
证明. 设第i轮放松的维度为第(K–i+1)~K维,第i–1轮放松的维度为第(K–i+2)~K维,其余维度使用严格支配. 可知第i轮的无差异阈值为(0,0,…,0,θK−i+1,θK−i+2,…,θK),第i–1轮的无差异阈值为(0,0,…,0,θK−i+2,…,θK),进而可知第i–1轮在前K–i+1个维度为严格支配比较,即在前K–i+1个维度的无差异阈值为(0,0,…,0). 第i轮不同于第i–1轮之处在于对第K–i+1维进行了放松支配,即在前K–i+1个维度无差异阈值为(0,0,…,0,θK−i+1),所以有Si⊆Si−1.证毕.
由定理4、定理5可直接得出定理6.
定理6. 给定数据集S,结果集数量随着每一轮放松而呈单调非递增趋势,即
$|KPRD(i,D,S)| \leqslant |KPRD(i-1,D,S)|. \quad (3)$
为使返回的结果集更符合用户群偏好,并保证数量可控,基于定理4~6进行逐次放松支配. 逐次放松支配过程中,θ是D上K个维度的无差异阈值,θ=(θ1, θ2, …, θK). 假定当前放松轮次为第i轮(1≤i≤K),无差异阈值θ=(0,0,…,0,θK−i+1,…,θK),即只放松对用户群而言最不重要的i个维度;位于维度dK−i+1之前的维度对用户群更重要,因此该轮对维度d1~dK−i仍使用严格支配比较. 放松支配从对用户群而言最不重要的属性开始,并预先将数据点按照用户群权重偏好次序非递增排序,距离维度值用数据点到查询用户群网络距离的最大值表示.
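按上述设定,第i轮放松所用的无差异阈值向量可由完整阈值θ构造得到,下面的Java代码给出一种示意(假设性示例,维度下标0对应用户群权重偏好次序中最重要的维度):

```java
// 第i轮放松使用的无差异阈值向量构造示意(假设性示例):
// 前K-i个维度取0(严格支配), 仅放松最不重要的i个维度.
public class RelaxSchedule {
    public static double[] thetaForRound(double[] theta, int round) {
        int k = theta.length;
        double[] t = new double[k];
        for (int j = 0; j < k; j++) {
            t[j] = (j < k - round) ? 0.0 : theta[j];
        }
        return t;
    }
}
```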
基于以上讨论,进一步给出基于用户群权重偏好次序的K-准放松支配算法KPRD,如算法3所示.
算法3. 基于用户群权重偏好次序的K-准放松支配算法KPRD.
输入:用户群G,无差异阈值θ,并集S,数据维度空间D,k值,用户查询关键字keys,用户权重w;
输出:候选结果集CS.
① GP←call CDW(G, keys, w, D);/*调用算法2 获取用户群权重偏好次序GP*/
② K←|GP|; /*整体属性权重值大于0的 维度数*/
③ 根据GP调整S中数据点;
④ 初始化CurS←S; /*CurS为每轮放松支配后 的结果集*/
⑤ 初始化oldCount←|S|; /*保存上一轮结果集 个数*/
⑥ 初始化curCount ←|CurS|;/*保存本轮结果集 个数*/
⑦ for j = K to 1 do /*进行最多K轮放松支配*/
⑧ for every pi,pj ∈ CurS do
⑨ if piϾpj then
⑩ 将pj从CurS删除;
⑪ curCount = curCount −1;
⑫ end if
⑬ end for
⑭ if oldCount ≥ k 且 curCount < k then
⑮ CS←S;
⑯ return CS;/*返回上一轮的结果集*/
⑰ else
⑱ 将CurS结果集保存至文件;
⑲ S←CurS;/*更新S*/
⑳ oldCount←|S|;/*更新oldCount*/
㉑ end if
㉒ end for
㉓ CS←CurS;
㉔ return CS.
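算法3的主循环可用如下简化的Java代码示意(假设性示例,复用前文虚构的KPRDominance与RelaxSchedule;为突出逻辑,此处用两重循环做支配比较,未包含将每轮结果集保存至文件等细节):

```java
import java.util.*;

// 逐轮放松支配的简化示意(假设性示例): s中各点的维度已按用户群权重偏好次序重排.
public class KPRDRounds {
    public static List<double[]> relax(List<double[]> s, double[] theta, int k) {
        List<double[]> cur = new ArrayList<>(s);
        for (int round = 1; round <= theta.length; round++) {
            double[] t = RelaxSchedule.thetaForRound(theta, round);
            List<double[]> next = new ArrayList<>();
            for (double[] p : cur) {                          // 删除本轮被K-准放松支配的点
                boolean dominated = false;
                for (double[] q : cur) {
                    if (q != p && KPRDominance.kprDominates(q, p, t)) { dominated = true; break; }
                }
                if (!dominated) next.add(p);
            }
            if (cur.size() >= k && next.size() < k) {
                return cur;                                   // 对应算法3的⑭~⑯: 返回上一轮结果集
            }
            cur = next;
        }
        return cur;
    }
}
```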
3.3 Top-k个数据点选取方法
通过放松支配处理后可有效控制返回用户群的结果集大小,本节进一步给出Top-k个数据点选取策略,使返回结果集有序. 利用z-整体属性权重值的打分函数选取Top-k个数据点,处理对象为候选结果集CS.
定义10. 单调打分函数F[25]. 单调打分函数以数据集中的数据点为输入,将数据点映射到实数范围. F由r个单调函数构成,F={f1, f2, …, fr}. 对于数据集中任意数据点p,有$F(p)=\sum_{j=1}^{r} f_j(p[d_j])$,其中fj(p[dj])为数据点在dj维度上的单调函数.
定理7. 假设数据集P的单调打分函数为F,若数据集中某一元组具有最高的分数,那么它一定是Skyline点.
证明. 以反证法进行证明. 假设有p1,p2∈P,p1的得分F(p1)为数据集的最高得分,F(p1)>F(p2),p1不是Skyline点,p2支配p1,p1[dj]≤p2[dj](1≤j≤r),则可得$\sum_{j=1}^{r} f_j(p_1[d_j]) \leqslant \sum_{j=1}^{r} f_j(p_2[d_j])$,即F(p1)≤F(p2),与假设矛盾.证毕.
定理8. 数据集P根据任意单调打分函数所得数据点顺序是Skyline支配的拓扑顺序.
证明. 以反证法进行证明. 假设存在2个数据点p1,p2∈P,单调打分函数为F,p1支配p2,F(p1)<F(p2),根据定理7可知,p1支配p2,则有F(p1)≥F(p2),与假设矛盾. 所以如果F(p2)>F(p1),可能有p2支配p1,但可以确定p1不可能支配p2. 如果F(p1)=F(p2),则p1支配p2或p2支配p1(这两者是等价的,会根据属性的映射关系排序),或者p1和p2之间不具备支配关系. 因此依据打分函数F所得数据点顺序是按照Skyline支配关系的一个拓扑顺序.证毕.
定义11. 线性打分函数[25]. 给定线性打分函数L,一般化形式为$L(p)=\sum_{j=1}^{r}\omega_j\cdot p[d_j]$,其中ωj为实常数,p[dj]为数据点在dj维度的取值.
定义12. z-整体属性权重值. 给定数据集P,数据点 {p_i}\in P,pi在dj维度的z-整体属性权重值为
$\varphi_{i,j}=\dfrac{V_{i,j}-\mu}{\sigma}\cdot W_j\cdot \zeta_j, \quad (4)$ 其中,$(V_{i,j}-\mu)/\sigma$为数据点pi在维度dj的z值,Wj为dj的整体属性权重值,ζj为dj的维度优劣值,ζj=1或ζj=−1. 由定义10~12可知,fj(p[dj])=φi,j=ωj·z,ωj=Wjζj.
定理9. 数据点任意维度的fj(p[dj])是单调的.
证明. 因为ωj=Wjζj,在打分阶段Wj为实常数,所以可得ωj为实常数,且随着数据点维度值变大,它的z值也变大,因此数据点的任意维度fj(p[dj])是单调的.证毕.
定义13. 基于z-整体属性权重值的打分函数. 数据点pi各维度z-整体属性值之和为它的得分,记作F(pi):
$F(p_i)=\displaystyle\sum_{j=1}^{r}\varphi_{i,j}. \quad (5)$
定理10. F(pi)是单调打分函数.
证明. 因为有$F(p_i)=\sum_{j=1}^{r} f_j(p_i[d_j])$,根据定理9可知数据点的任意维度fj(p[dj])随着维度值变大单调递增,它们具备相同的单调性,因此F(pi)也是单调的.证毕.
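基于式(4)(5)的打分过程可用如下Java代码示意(假设性示例:文中未明确μ,σ的统计口径,此处假定按候选结果集CS在对应维度上的均值与标准差计算):

```java
// 基于z-整体属性权重值的打分示意(假设性示例, 非论文原实现).
public class ZScoreRanker {
    /** mu、sigma为各维度的均值与标准差(此处假定按CS统计), w为整体属性权重, zeta取+1或-1 */
    public static double score(double[] p, double[] mu, double[] sigma,
                               double[] w, double[] zeta) {
        double f = 0.0;
        for (int j = 0; j < p.length; j++) {
            double z = (p[j] - mu[j]) / sigma[j];   // 式(4)中的z值
            f += z * w[j] * zeta[j];                // phi_{i,j} = z · W_j · zeta_j
        }
        return f;                                   // 式(5): F(p_i)
    }
}
```

对CS中各点计算得分后按降序排序并取前k个,即为返回给用户群的Top-k Skyline结果集.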
进一步给出Top-k个数据点选取方法,如算法4所示.
算法4.Top-k个数据点选取方法TK_DC.
输入:候选结果集CS,整体属性权重集合W,维度优劣集合ζ;
输出:Top-k Skyline结果集.
① for pi∈CS do
② 计算数据点的z-整体属性权重值;/*根据 式(4)*/
③ 计算数据点得分;/*根据式(5) */
④ end for
⑤ 根据F(pi)降序排序;
⑥ return Top-k个数据点.
算法4主要对经过算法3处理后的候选结果集CS打分:行②③根据式(4)(5)计算CS中各数据点的得分,行⑤⑥按得分对数据点降序排序,输出Top-k Skyline结果集给用户群.
综合距离较优集选取、K-准放松支配和Top-k个数据点选取的处理过程,进一步给出完整的MUP-TKS查询算法,如算法5所示.
算法5. 道路网多用户偏好Top-k Skyline查询算法MUP-TKS.
输入:数据集P,道路网路段集R,用户群G,用户查询关键字keys,用户权重w,无差异阈值θ,k,维度优劣集合ζ;
输出:Top-k Skyline结果集.
① 预先计算保存数据集的静态Skyline 集;
② 距离较优集选取方法G_DBC;/*调用算法1*/
③ 对距离较优集与静态Skyline集求并集S;
④ K-准放松支配算法KPRD; /*调用算法 3*/
⑤ Top-k 个数据点选取方法TK_DC. /*调用算 法4*/
4. 实验比较与分析
本节主要对MUP-TKS进行实验以及性能评估. 实验对比算法为道路网单用户偏好Skyline算法UP-BPA[26]、K支配空间偏好Skyline算法KSJQ[23]以及基于时间道路网多用户偏好Skyline算法DSAS[27]. UP-BPA算法适用于道路网单用户,为了更好地与本文所提MUP-TKS进行对比,将其扩展,对查询用户群的每个用户分别运行该算法;再对子结果集取并集,得到候选结果集CS;最后对候选结果集基于z-值的打分函数打分,得到Top-k个数据点,扩展后的算法称为EUP-BPA. 将KSJQ算法扩展,对每个用户单独执行该算法,用户偏好对应它的K个子空间;对每个用户的结果集取并集后得到候选结果集;对候选结果集CS基于z-值的打分函数打分,得到Top-k个Skyline结果集,扩展后的算法称为EKSJQ. 将DSAS算法扩展,对满足不同用户需求的数据点基于z-值打分函数打分,按照数据点得分从高至低返回Top-k个Skyline结果集,扩展后的算法称为EDSAS.
4.1 数据集及实验环境设置
实验使用真实道路网数据集. 道路网数据集是北美2.5×10⁷ km²范围内的路段信息,它包含175813个节点和179179条边;兴趣点数据集来自北美酒店及登记信息. 查询用户采用随机生成的方式. 本文使用Vor-R*-DHash索引结构组织数据集. 实验参数取值范围如表1所示,每个用户最大关注维度为4. 每个实验采取单一变量原则,其余变量为默认值,实验结果取30次实验运行的平均值.

表 1 实验参数设置
Table 1. Experimental Parameter Setting

| 参数 | 取值范围 |
| --- | --- |
| 用户数量 | 5, 10, 15, 20, 25, 30, 35 |
| 数据集规模 | 1×10⁴, 2×10⁴, 3×10⁴, 4×10⁴, 5×10⁴ |
| 数据维度 | 5, 7, 9, 11, 13, 15, 17 |
| 无差异阈值(标准差) | 0.1倍, 0.5倍, 1倍, 2倍, 10倍 |
| 获取数据点数量k | 2, 4, 6, 8, 10 |

注:加粗数值表示参数默认取值.

实验环境为:Windows 10(64位),Intel Core i5-5200U CPU @ 2.20GHz处理器,12GB运行内存. 在IntelliJ IDEA开发平台上使用Java实现本文所提的算法MUP-TKS和对比算法EUP-BPA,EKSJQ,EDSAS.
4.2 算法对比实验
1)用户数量对算法性能的影响
为了分析用户数量对算法性能的影响,本实验对不同用户数量下的MUP-TKS,EKSJQ,EDSAS,EUP-BPA算法进行测试,观察算法在不同用户数量下的CPU运行时间、候选结果集CS数量的变化情况.
图3展示了4种算法在不同用户数量下CPU运行时间变化情况.由图3可知,随着用户数量的增加,4种算法的CPU运行时间都在增加. 因为用户数量增加导致不同用户的偏好情况增加,从而需要更多时间处理用户偏好. MUP-TKS的CPU运行时间增长趋势没有其他3种算法的增长趋势大,主要原因是MUP-TKS将多用户的偏好转换成用户群权重偏好次序,对数据集按照该次序预排序,再进行K-准放松支配,使用户数量增加对CPU运行时间的影响减小.
图4展示了4种算法随着用户数量的变化,候选结果集CS数量的变化情况. 由图4可知随着用户数量的增加,CS的数量变大. 但MUP-TKS,EKSJQ,EDSAS算法的变化趋势远没有EUP-BPA算法的变化趋势大,主要因为EUP-BPA算法需要对每个用户进行偏好Skyline查询,再合并各用户的偏好Skyline结果集.
2)数据规模对算法性能的影响
为了分析数据规模对MUP-TKS性能的影响,本实验对不同数据规模下的MUP-TKS,EKSJQ,EDSAS,EUP-BPA算法进行测试,观察4种算法在不同数据规模下CPU运行时间、CS数量的对比情况.
由图5可知,随着数据集规模变大,CPU运行时间不断增加,因为当数据集规模变大时,需要比较的元组数量增加. 而MUP-TKS的增长趋势比其他3种算法小,主要因为MUP-TKS利用剪枝策略和Vor-R*-DHash索引提前剪枝大量不可能成为Skyline的数据点,减少了元组比较次数.
3)k值对算法性能的影响
图6展示了4种算法随着k值变化CPU运行时间变化的情况. 随着k值变化,MUP-TKS的CPU运行时间没有太大变化,因为MUP-TKS在每一轮放松支配后会保存结果集,当k值变化时,可直接找到对应符合大小要求轮次的CS打分,即可得到Top-k Skyline结果集,该过程时间消耗很小. 而EKSJQ,EUP-BPA算法都需要重新计算,时间消耗较大.
图7展示了4种算法随着k值变化元组比较次数的变化情况. 可以发现MUP-TKS随着k值增大,元组比较次数减少,因为当k值增大时,放松支配的轮次减少. 而随着k值增大,EKSJQ,EUP-BPA算法的元组比较次数增多,因为需要进行更多的支配比较找到Top-k个数据点. 随着k值增大,EDSAS算法的元组比较次数基本没有变化.
4)无差异阈值对算法性能的影响
本实验分析无差异阈值对MUP-TKS性能的影响. 图8展示了MUP-TKS在不同无差异阈值下CPU运行时间的变化情况. 由图8可知,若只考虑第1轮放松时间,无差异阈值变化对第1轮放松的CPU响应时间影响不大,因为不同无差异阈值的初始数据集大小都是相同的,处理相同数据集规模的时间差异不大. 而算法总运行时间随着阈值增大而减小,因为无差异阈值增大后,放松支配时会删减更多被支配的元组.
5. 总 结
本文针对现实生活中道路网多用户场景的偏好Top-k Skyline 查询问题,进行深入分析与研究. 作为道路网上单用户偏好Skyline查询问题的补充,提出了一种基于道路网环境下多用户偏好Top-k Skyline查询方法. 该方法利用剪枝规则和索引减少了距离计算开销,并利用用户群权重偏好次序进行放松支配,使结果集可控. 实验结果表明,本文方法能有效解决道路网多用户偏好查询问题,返回的结果集可以满足多用户偏好与权重需求,可以提供有效参考价值. 下一步研究重点主要集中在对多查询用户移动情况下偏好 Top-k Skyline 查询问题的处理.
作者贡献声明:李松提出了方法思路和技术方案;宾婷亮和郝晓红负责算法优化、完成部分实验并撰写论文;张丽平完成部分实验;郝忠孝提出指导意见并修改论文.
-
表 1 本文与已发表相关论文的异同
Table 1 Similarities and Differences of Our Paper Compared with Published Related Papers
| 相关综述 | 出发点 | 研究角度 | 与本文的主要区别 |
| --- | --- | --- | --- |
| 文献[5] | 解决RL面临的抽象动作(及其时序)和抽象状态表示, 以及在其基础上的高层序贯决策问题. | 借鉴发育学习理论, 依托分层强化学习、课程学习、状态表征等方法, 详细阐述了如何结合内在动机与深度强化学习方法帮助智能体获取知识和学习技能. | 该文重点阐述发育学习理论中2种主要的内在动机模型如何与RL相结合, 以解决稀疏奖励、表征学习、option发现、课程学习等问题, 然而对于内在动机如何解决各类探索问题并未深入研究. |
| 文献[6] | 为适应学习系统的行为, 研究如何优化值函数集合的学习问题. | 将并行价值函数学习建模为强化学习任务, 在提出的并行学习测试环境中, 基于带非静态目标的在线多预测任务设定, 研究和比较不同的内在奖励机制的表现. | 该文重点研究如何利用内在奖励从共享的经验流中学习价值函数集合, 以适应内在驱动学习系统的行为. |
| 文献[7] | 解决深度强化学习和多智能体强化学习在现实场景中的广泛应用和部署面临的瓶颈挑战——探索问题. | 从单智能体和多智能体角度出发, 系统性阐述了各类探索方法在深度强化学习领域的研究情况, 并在常见的基准环境中对典型的探索方法进行了综合对比. | 该文聚焦于阐述覆盖深度强化学习和多智能体强化学习的解决探索问题的多类方法, 基于内在动机的方法并非该论文的研究重点, 因此导致基于内在动机的探索方法覆盖面较小, 讨论深度不够. |
| 文献[8] | 解决未知且随机环境中序贯决策面临的探索问题. | 从智能体探索使用的信息类型出发, 全面阐述了无奖励探索、随机动作选择、基于额外奖励或基于优化的探索等方法在基于MDP的强化学习领域的研究情况. | 该文聚焦于为强化学习解决序贯决策问题中所涉及到的探索方法提供广泛的高层综述, 仅初步介绍了一些基于内在动机的探索方法. |

表 2 基于计数的主要方法小结
Table 2 Summary of Main Methods Based on Count
| 分类 | 算法 | 内在奖励形式 | 状态表示 | 主要测试环境和效果 |
| --- | --- | --- | --- | --- |
| 基于密度模型的伪计数 | PC[39] (NIPS-16) | CTS密度模型+伪计数的均方根 | | Atari-MR: 50M帧训练后得到2461均分, 100M帧训练后得到3439均分. |
| 基于密度模型的伪计数 | PixelCNN[44] (ICML-17) | PixelCNN密度模型+伪计数的均方根 | | Atari-MR: 100M帧训练后得到6600均分. |
| 间接伪计数 | EX²[47] (NIPS-17) | 判别器评估状态新颖性, 作为间接密度 | CNN | Doom-MWH: 平均成功率大于74%, 显著高于VIME[58], #Exploration[53], TRPO[59]. |
| 间接伪计数 | DORA[48] (ICLR-18) | 探索价值E-value作为间接计数 | | Atari-FW: DORA[48]在2×10⁶训练步数内收敛, 而PC需1×10⁷训练步数收敛[39]. |
| 间接伪计数 | SR[49] (AAAI-20) | SR的范数作为伪计数 | | Atari-HEG: 与PC[39], PixelCNN[44], RND[60]性能相当或略高. |
| 状态抽象 | #Exploration[53] (NIPS-17) | 基于状态Hash的计数 | Pixel, BASS, AE | Atari-HEG: 在除Atari-MR的问题上比PC[39]得分高, 在Atari-MR上显著低于PC. |
| 状态抽象 | CoEX[40] (ICLR-19) | 基于contingency-awareness状态表示的伪计数 | 逆动力学预测训练卷积, 注意力mask提取位置信息 | Atari-HEG: 在大部分问题上都比A3C+[39], TRPO-AE-SimHash[53], Sarsa-φ-EB[46], DQN-PixelCNN[44], Curiosity[61]效果好. |

注:CNN (convolutional neural networks), TRPO (trust region policy optimization), RND (random network distillation).

表 3 基于预测模型的主要算法小结
Table 3 Summary of Main Algorithms Based on Predictive Models
| 算法类型 | 算法 | 内在奖励形式 | 状态表示 | 抗噪 | 主要测试环境和效果 |
| --- | --- | --- | --- | --- | --- |
| 基于预测误差 | Static/Dynamic AE[71] (arXiv 15) | 前向动力学模型(仅2层网络)的状态预测误差的2范数平方 | Autoencoder的隐层 | 否 | 14个Atari游戏: 与DQN[72], Thompson sampling, Boltzman方法相比, 优势有限. |
| 基于预测误差 | ICM[61] (ICML-17) | 前向动力学模型的状态预测误差的2范数平方 | 逆动力学辅助训练CNN+ELU | 部分 | Doom-MWH: 探索和导航效率显著高于TRPO-VIME[58]. |
| 基于预测误差 | 文献[74] (ICLR-19) | 前向动力学模型的状态预测误差的2范数平方 | Pixels, RF, VAE[75], 逆动力学特征IDF | 部分 | 在48个Atari游戏、SuperMarioBros、2个Roboschool场景、Two-player Pong、2个Unity迷宫等环境中, Pixel表现较差, VAE[75]不稳定, RF和IDF表现较好, IDF迁移泛化能力强, RF和IDF学习效率受到随机因素影响. |
| 基于预测误差 | RND[60] (ICLR-19) | 状态嵌入预测误差的2范数平方 | PPO[56]策略网络中的卷积层 | 是 | Atari: 1970M帧训练, 在多个Atari-HEG(包括Atari-MR上获得≤8000均分)效果显著好于动力学预测方法. |
| 基于预测误差 | EMI[73] (ICML-19) | 前向动力学模型的状态预测误差的2范数平方 | 前向和逆向动力学互信息最大化 | | rllab任务: 显著优于ICM[61], RND[60], EX²[47], AE-SimHash[53], VIME[58]; Atari-HEG: 大部分游戏中稍优于上述方法. |
| 基于预测误差 | LWM[77] (NeurIPS-20) | 前向动力学模型的状态预测误差的2范数平方 | 最小化时序邻近状态的特征向量W-MSE损失函数 | 是 | Atari-HEG: 50M帧, 大部分游戏上明显优于EMI[73], EX²[47], ICM[61], RND[60], AE-SimHash[53]. |
| 预测结果不一致性 | Disagreement[79] (ICML-19) | 一组前向动力学状态预测误差的方差 | 随机特征/Image-Net预训练的ResNet-18特征 | 是 | Unity迷宫导航: 在noisy TV设置下探索效率明显高于RF下的前馈模型[74]. |
| 预测结果不一致性 | 文献[81] (ICML-20) | 对动力学模型后验分布的采样方差 | | 是 | rllab任务: 优于Disagreement[79], MAX[80], ICM[61]. |
| 预测精度提升 | 文献[82] (ICML-17) | 基于预测损失的提升或网络复杂度的提升的多种奖励 | | | 语言建模任务(n-gram模型, repeat copy任务和bAbI任务): 显著提升了学习效率, 甚至达到了1倍. |

表 4 基于信息论的主要方法小结
Table 4 Summary of Main Methods Based on Information Theory
| 算法类型 | 算法 | 内在奖励形式 | 状态表示 | 抗噪 | 主要测试环境和效果 |
| --- | --- | --- | --- | --- | --- |
| 信息增益 | VIME[58] (NIPS-16) | 预测模型参数的累计熵减(推导为前后参数的KL散度) | | 是 | rllab的多个任务(包括层次性较强的SwimmerGather): 得分显著高于TRPO和基于L2预测误差的TRPO. |
| 信息增益 | Surprisal[90] (arXiv 17) | 惊奇: 真实转移模型与学习模型参数之间的KL散度近似 | | 是 | 多个较困难的rllab任务和部分Atari游戏: 仅在部分环境下探索效率高于VIME[58], 但在其他环境与VIME有一定差距. |
| 信息增益 | AWML[69] (ICML-20) | 基于加权混合的新旧动力学模型损失函数之差 | 假定智能体具有面向物体的特征表示能力 | 是 | 多类动态物体的复杂3维环境: 精度明显高于Surprisal[90], RND[60], Disagreement[79], ICM[61]等方法. |
| 最大熵 | MaxEnt[92] (ICML-19) | 最大化状态分布的熵为优化目标, 以状态密度分布的梯度为奖励 | | | Pendulum, Ant, Humanoid控制任务作为概念验证环境: 相比随机策略, 诱导出明显更大的状态熵. |
| 最大熵 | 文献[94] (ICML-19) | 隐状态分布的负对数 | 基于先期任务的奖励预测任务得到最小维度隐状态表示 | | 简单的object-pusher环境: 获得外在奖励的效率显著高于无隐状态表示的MaxEnt[92]. |
| 互信息 | VMI[100] (NIPS-15) | 当前状态下开环option与终止状态的互信息 | CNN处理像素观测信息 | | 简单的静态、动态和追逃的网格世界: 展示了对关键状态的有效识别. |
| 互信息 | VIC[99] (arXiv 16) | 当前状态下闭环option与终止状态的互信息 | | | 简单的网格世界: 证明了对Empowerment的估计比VMI算法[100]更准确. |
| 互信息 | DIAYN[103] (ICLR-19) | 当前状态下闭环option与每一状态的互信息、option下动作的信息熵最大化 | | | 2D导航和连续控制任务: 相对VIC[99]能演化出更多样的技能. |
| 互信息 | DADS[107] (ICLR-20) | 式(19)的正向形式, 兼顾多样性和可预测性 | | | OpenAI Gym的多个控制任务: 与DIAYN[103]相比, 原子技能丰富且稳定, 更有利于组装层次化行为; 大幅提升下游基于模型规划任务的学习效率. |
-
[1] Sutton R S, Barto A G. Reinforcement Learning: An Introduction [M]. Cambridge, MA: MIT Press, 2018
[2] 刘全,翟建伟,章宗长,等. 深度强化学习综述[J]. 计算机学报,2018,41(1):1−27 doi: 10.11897/SP.J.1016.2019.00001 Liu Quan, Zhai Jianwei, Zhang Zongchang, et al. A survey on deep reinforcement learning[J]. Chinese Journal of Computers, 2018, 41(1): 1−27 (in Chinese) doi: 10.11897/SP.J.1016.2019.00001
[3] Liu Xiaoyang, Yang Hongyang, Gao Jiechao, et al. FinRL: Deep reinforcement learning framework to automate trading in quantitative finance [C] //Proc of the 2nd ACM Int Conf on AI in Finance. New York: ACM, 2022: 1−9
[4] 万里鹏,兰旭光,张翰博,等. 深度强化学习理论及其应用综述[J]. 模式识别与人工智能,2019,32(1):67−81 doi: 10.16451/j.cnki.issn1003-6059.201901009 Wan Lipeng, Lan Xuguang, Zhang Hanbo, et al. A review of deep reinforcement learning theory and application[J]. Pattern Recognition and Artificial Intelligence, 2019, 32(1): 67−81 (in Chinese) doi: 10.16451/j.cnki.issn1003-6059.201901009
[5] Aubret A, Matignon L, Hassas S. A survey on intrinsic motivation in reinforcement learning [J]. arXiv preprint, arXiv: 1908.06976, 2019
[6] Linke C, Ady N M, White M, et al. Adapting behavior via intrinsic reward: A survey and empirical study[J]. Journal of Artificial Intelligence Research, 2020, 69: 1287−1332 doi: 10.1613/jair.1.12087
[7] Yang Tianpei, Tang Hongyao, Bai Chenjia, et al. Exploration in deep reinforcement learning: A comprehensive survey [J]. arXiv preprint, arXiv: 2109.06668, 2021
[8] Amin S, Gomrokchi M, Satija H, et al. A survey of exploration methods in reinforcement learning [J]. arXiv preprint, arXiv: 2109.00157, 2021
[9] Lillicrap T P, Hunt J J, Pritzel A, et al. Continuous control with deep reinforcement learning [C/OL] //Proc of the 4th Int Conf on Learning Representations. 2016 [2022-09-06].https://arxiv.org/abs/1509.02971v6
[10] Plappert M, Houthooft R, Dhariwal P, et al. Parameter space noise for exploration [C/OL] //Proc of the 6th Int Conf on Learning Representations. 2018 [2022-09-06].https://arxiv.org/abs/1706.01905
[11] Fortunato M, Azar M G, Piot B, et al. Noisy networks for exploration [C/OL] // Proc of the 6th Int Conf on Learning Representations. 2018 [2022-09-06].https://arxiv.org/abs/1706.10295
[12] 章晓芳, 周倩, 梁斌, 等, 一种自适应的多臂赌博机算法 [J]. 计算机研究与发展, 2019, 56(3): 643−654 Zhang Xiaofang, Zhou Qian, Liang Bin, et al. An adaptive algorithm in multi-armed bandit problem [J]. Journal of Computer Research and Development, 2019, 56(3): 643−654 (in Chinese)
[13] Lai T L, Robbins H. Asymptotically efficient adaptive allocation rules[J]. Advances in Applied Mathematics, 1985, 6(1): 4−22 doi: 10.1016/0196-8858(85)90002-8
[14] Strehl A L, Littman M L. An analysis of model-based interval estimation for Markov decision processes[J]. Journal of Computer and System Sciences, 2008, 74(8): 1309−1331 doi: 10.1016/j.jcss.2007.08.009
[15] Jaksch T, Ortner R, Auer P. Near-optimal regret bounds for reinforcement learning [C] //Proc of the 21st Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2008: 89–96
[16] Azar M G, Osband I, Munos R. Minimax regret bounds for reinforcement learning [C] //Proc of the 34th Int Conf on Machine Learning. New York: ACM, 2017: 263–272
[17] Jin C, Allen-Zhu Z, Bubeck S, et al. Is q-learning provably efficient [C] //Proc of the 32nd Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2018: 4868–4878
[18] Kolter J Z, Ng A Y. Near-Bayesian exploration in polynomial time [C] //Proc of the 26th Int Conf on Machine Learning. New York: ACM, 2009: 513–520
[19] Russo D, Van Roy B, Kazerouni A, et al. A tutorial on thompson sampling[J]. Foundations and Trends in Machine Learning, 2018, 11(1): 1−96 doi: 10.1561/2200000070
[20] Osband I, Van Roy B. Why is posterior sampling better than optimism for reinforcement learning [C] //Proc of the 34th Int Conf on Machine Learning. New York: ACM, 2017: 2701–2710
[21] Osband I, Blundell C, Pritzel A, et al. Deep exploration via bootstrapped DQN [C] //Proc of the 30th Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2016: 4033–4041
[22] Thrun S B. Efficient exploration in reinforcement learning [R]. Pittsburgh, CP: School of Computer Science, Carnegie-Mellon University, 1992
[23] Barto A G, Singh S, Chentanez N. Intrinsically motivated learning of hierarchical collections of skills [C] //Proc of the 3rd Int Conf on Development and Learning. Piscataway, NJ: IEEE, 2004: 112–119
[24] Oudeyer P Y, Kaplan F. What is intrinsic motivation? A typology of computational approaches [J/OL]. Frontiers in Neurorobotics, 2007 [2022-09-06].https://www.frontiersin.org/articles/10.3389/neuro.12.006.2007/full
[25] Harlow H F. Learning and satiation of response in intrinsically motivated complex puzzle performance by monkeys[J]. Journal of Comparative and Physiological Psychology, 1950, 43(4): 289−294 doi: 10.1037/h0058114
[26] Hull C L. Principles of behavior [J/OL]. The Journal of Nervous and Mental Disease, 1945, 101(4): 396. [2022-09-06].https://journals.lww.com/jonmd/Citation/1945/04000/Principles_of_Behavior.26.aspx
[27] Deci E L, Ryan R M, Intrinsic Motivation and Self-Determination in Human Behavior [M]. Berlin: Springer, 2013
[28] Ryan R M, Deci E L. Intrinsic and extrinsic motivations: Classic definitions and new directions[J]. Contemporary Educational Psychology, 2000, 25(1): 54−67 doi: 10.1006/ceps.1999.1020
[29] Barto A, Mirolli M, Baldassarre G. Novelty or surprise [J/OL]. Frontiers in Psychology, 2013, 4: 907. [2023-09-06]. http://www.frontiersin.org/articles/10.3389/fpsyg.2013.00907/full
[30] Czikszentmihalyi M. Flow: The Psychology of Optimal Experience[M]. New York: Harper & Row, 1990
[31] Asada M, Hosoda K, Kuniyoshi Y, et al. Cognitive developmental robotics: A survey[J]. IEEE Transactions on Autonomous Mental Development, 2009, 1(1): 12−34 doi: 10.1109/TAMD.2009.2021702
[32] White R W. Motivation reconsidered: The concept of competence[J]. Psychological Review, 1959, 66(5): 297−333 doi: 10.1037/h0040934
[33] Baldassarre G. What are intrinsic motivations? A biological perspective [C/OL] //Proc of IEEE Int Conf on Development and Learning. 2011 [2022-09-06].https://ieeexplore.ieee.org/document/6037367
[34] Schmidhuber J. Formal theory of creativity, fun, and intrinsic motivation (1990–2010)[J]. IEEE Transactions on Autonomous Mental Development, 2010, 2(3): 230−247 doi: 10.1109/TAMD.2010.2056368
[35] Bellemare M G, Naddaf Y, Veness J, et al. The ARCADE learning environment: An evaluation platform for general agents[J]. Journal of Artificial Intelligence Research, 2013, 47: 253−279 doi: 10.1613/jair.3912
[36] Duan Y, Chen X, Houthooft R, et al. Benchmarking deep reinforcement learning for continuous control [C] //Proc of the 33rd Int Conf on Machine Learning. New York: ACM, 2016: 1329–1338
[37] Kempka M, Wydmuch M, Runc G, et al. VizDoom: A doom-based ai research platform for visual reinforcement learning [C/OL] //Proc of IEEE Conf on Computational Intelligence and Games. 2016 [2022-09-06].https://ieeexplore.ieee.org/document/7860433
[38] Brockman G, Cheung V, Pettersson L, et al. OpenAI gym [J]. arXiv preprint, arXiv: 1606.01540
[39] Bellemare M G, Srinivasan S, Ostrovski G, et al. Unifying count-based exploration and intrinsic motivation [C] //Proc of the 30th Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2016: 1479–1487
[40] Choi J, Guo Y, Moczulski M, et al. Contingency-aware exploration in reinforcement learning [C/OL] //Proc of the 7th Int Conf on Learning Representations. 2019 [2022-09-06].https://arxiv.org/abs/1811.01483
[41] Veness J, Ng K S, Hutter M, et al. Context tree switching [C] //Proc of Data Compression Conf. Piscataway, NJ: IEEE, 2012: 327–336
[42] Bellemare M, Veness J, Talvitie E. Skip context tree switching [C] //Proc of the 31st Int Conf on Machine Learning. New York: ACM, 2014: 1458–1466
[43] Hasselt H V, Guez A, Silver D. Deep reinforcement learning with double q-learning [C] //Proc of the 30th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2016: 2094–2100
[44] Ostrovski G, Bellemare M G, Oord A, et al. Count-based exploration with neural density models [C] //Proc of the 34th Int Conf on Machine Learning. New York: ACM, 2017: 2721–2730
[45] Oord A, Kalchbrenner N, Vinyals O, et al. Conditional image generation with PixelCNN decoders [C] //Proc of the 30th Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2016: 4797–4805
[46] Martin J, Narayanan S S, Everitt T, et al. Count-based exploration in feature space for reinforcement learning [C] //Proc of the 26th Int Joint Conf on Artificial Intelligence. Menlo Park: AAAI, 2017: 2471–2478
[47] Fu J, Co-Reyes J D, Levine S. EX2: exploration with exemplar models for deep reinforcement learning [C] //Proc of the 31st Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2017: 2574–2584
[48] Choshen L, Fox L, Loewenstein Y. DORA the explorer: Directed outreaching reinforcement action-selection [C/OL] //Proc of the 6th Int Conf on Learning Representations. 2018 [2022-09-06].https://arxiv.org/abs/1804.04012
[49] Machado M C, Bellemare M G, Bowling M. Count-based exploration with the successor representation [C] //Proc of the 34th AAAI Conf on Artificial Intelligence. Menlo Park: AAAI, 2020: 5125–5133
[50] Machado M, Rosenbaum C, Guo Xiaoxiao, et al. Eigenoption discovery through the deep successor representation[C/OL] //Proc of the 6th Int Conf on Learning Representations. 2018 [2022-09-06].https://arxiv.org/abs/1710.11089
[51] Schmidhuber J. Curious model-building control systems [C] //Proc of Int Joint Conf on Neural Networks. Piscataway, NJ: IEEE, 1991: 1458–1463
[52] Tao Ruoyu, Franois-Lavet V, Pineau J. Novelty search in representational space for sample efficient exploration[J]. Advances in Neural Information Processing Systems, 2020, 33: 8114−8126
[53] Tang Haoran, Houthooft R, Foote D, et al. #Exploration: A study of count-based exploration for deep reinforcement learning [C/OL]//Proc of the 31st Conf on Neural Information Processing Systems. Cambridge, MA: MIT. 2017 [2022-09-06].https://proceedings.neurips.cc/paper/2017/hash/3a20f62a0af1aa152670bab3c602feed-Abstract.html
[54] Charikar M S. Similarity estimation techniques from rounding algorithms [C] //Proc of the 34th ACM Symp on Theory of Computing. New York: ACM, 2002: 380–388
[55] Bellemare M, Veness J, Bowling M. Investigating contingency awareness using ATARI 2600 games [C/OL] //Proc of the 26th AAAI Conf on Artificial Intelligence. Menlo Park: AAAI. 2012 [2022-09-06].https://www.aaai.org/ocs/index.php/AAAI/AAAI12/paper/view/5162/0
[56] Schulman J, Wolski F, Dhariwal P, et al. Proximal policy optimization algorithms [J]. arXiv preprint, arXiv: 1707.06347, 2017
[57] Song Yuhang, Wang Jianyi, Lukasiewicz T, et al. Mega-Reward: Achieving human-level play without extrinsic rewards [C] //Proc of the 34th AAAI Conf on Artificial Intelligence. Menlo Park: AAAI, 2020: 5826–5833
[58] Houthooft R, Chen Xi, Duan Yan, et al. VIME: Variational information maximizing exploration [C] //Proc of the 30th Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2016: 1117– 1125
[59] Schulman J, Levine S, Abbeel P, et al. Trust region policy optimization [C] //Proc of the 32nd Int Conf on Machine Learning. New York: ACM, 2015: 1889–1897
[60] Burda Y, Edwards H, Storkey A. Exploration by random network distillation [C/OL] //Proc of the 7th Int Conf on Learning Representations. 2019 [2022-09-06].https://arxiv.org/abs/1810.12894
[61] Pathak D, Agrawal P, Efros A A, et al. Curiosity-driven exploration by self-supervised prediction [C] //Proc of the 34th Int Conf on Machine Learning. New York: ACM, 2017: 2778–2787
[62] Lopes M, Lang T, Toussain M, et al. Exploration in model-based reinforcement learning by empirically estimating learning progress [J/OL]. Advances in Neural Information Processing Systems, 2012 [2022-09-06].https://proceedings.neurips.cc/paper/2012/hash/a0a080f42e6f13b3a2df133f073095dd-Abstract.html
[63] O’Donoghue B, Osband I, Munos R, et al. The uncertainty Bellman equation and exploration [C] //Proc of the 35th Int Conf on Machine Learning. New York: ACM, 2018: 3836–3845
[64] Yu A, Dayan P. Expected and unexpected uncertainty: ACh and NE in the neocortex [C] //Proc of the 15th Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2002: 173–180
[65] Grossberg S. Adaptive resonance theory: How a brain learns to consciously attend, learn, and recognize a changing world[J]. Neural Networks, 2013, 37: 1−47 doi: 10.1016/j.neunet.2012.09.017
[66] Schmidhuber J. A possibility for implementing curiosity and boredom in model-building neural controllers [C] //Proc of Int Conf on Simulation of Adaptive Behavior: From Animals to Animats. Cambridge, MA: MIT, 1991: 222–227
[67] Thrun S. Exploration in active learning [J/OL]. Handbook of Brain Science and Neural Networks, 1995 [2022-09-06].https://dl.acm.org/doi/10.5555/303568.303749
[68] Huang Xiao, Weng J. Novelty and reinforcement learning in the value system of developmental robots [C/OL] //Proc of the 2nd Int Workshop on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems. Lund, SWE: Lund University Cognitive Studies, 2002: 47–55
[69] Kim K, Sano M, De Freitas J, et al. Active world model learning with progress curiosity [C] //Proc of the 37th Int Conf on Machine Learning. New York: ACM, 2020: 5306–5315
[70] Oudeyer P Y, Kaplan F, Hafner V V. Intrinsic motivation systems for autonomous mental development[J]. IEEE Transactions on Evolutionary Computation, 2007, 11(2): 265−286 doi: 10.1109/TEVC.2006.890271
[71] Stadie B C, Levine S, Abbeel P. Incentivizing exploration in reinforcement learning with deep predictive models [J]. arXiv preprint, arXiv: 1507.00814, 2015
[72] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529−533 doi: 10.1038/nature14236
[73] Kim H, Kim J, Jeong Y, et al. EMI: Exploration with mutual information [C] //Proc of the 36th Int Conf on Machine Learning. New York: ACM, 2019: 3360–3369
[74] Burda Y, Edwards H, Pathak D, et al. Large-scale study of curiosity-driven learning [C/OL] //Proc of the 7th Int Conf on Learning Representations. 2019 [2022-09-06].https://arxiv.org/abs/1808.04355
[75] Rezende D J, Mohamed S, Wierstra D. Stochastic backpropagation and approximate inference in deep generative models [C] //Proc of the 31st Int Conf on Machine Learning. New York: ACM, 2014: 1278–1286
[76] Savinov N, Raichuk A, Vincent D, et al. Episodic curiosity through reachability [C/OL] //Proc of the 7th Int Conf on Learning Representations. 2019 [2022-09-06].https://arxiv.org/abs/1810.02274
[77] Ermolov A, Sebe N. Latent world models for intrinsically motivated exploration. [C] //Proc of the 34th Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2020, 33: 5565−5575
[78] Badia A P, Sprechmann P, Vitvitskyi A, et al. Never Give Up: Learning directed exploration strategies [C/OL] //Proc of the 8th Int Conf on Learning Representations. 2020 [2022-09-06].https://arxiv.org/abs/2002.06038
[79] Pathak D, Gandhi D, Gupta A. Self-supervised exploration via disagreement [C] //Proc of the 36th Int Conf on Machine Learning. New York: ACM, 2019: 5062–5071
[80] Shyam P, Jaśkowski W, Gomez F. Model-based active exploration [C] //Proc of the 36th Int Conf on Machine Learning. New York: ACM, 2019: 5779–5788
[81] Ratzlaff N, Bai Q, Fuxin L, et al. Implicit generative modeling for efficient exploration [C] //Proc of the 37th Int Conf on Machine Learning. New York: ACM, 2020: 7985–7995
[82] Graves A, Bellemare M G, Menick J, et al. Automated curriculum learning for neural networks [C] //Proc of the 34th Int Conf on Machine Learning. New York: ACM, 2017: 1311–1320
[83] Holm L, Wadenholt G, Schrater P. Episodic curiosity for avoiding asteroids: Per-trial information gain for choice outcomes drive information seeking[J]. Scientific Reports, 2019, 9(1): 1−16 doi: 10.1038/s41598-018-37186-2
[84] Shannon C E. A mathematical theory of communication[J]. The Bell System Technical Journal, 1948, 27(3): 379−423 doi: 10.1002/j.1538-7305.1948.tb01338.x
[85] Frank M, Leitner J, Stollenga M, et al. Curiosity driven reinforcement learning for motion planning on humanoids [J/OL]. Frontiers in Neurorobotics, 2014 [2022-09-06].https://frontiersin.yncjkj.com/articles/10.3389/fnbot.2013.00025/full
[86] Alemi A A, Fischer I, Dillon J V, et al. Deep variational information bottleneck [C/OL] //Proc of the 5th Int Conf on Learning Representations. 2017 [2022-09-06].https://arxiv.org/abs/1612.00410v5
[87] Kim Y, Nam W, Kim H, et al. Curiosity-bottleneck: Exploration by distilling task-specific novelty [C] //Proc of the 36th Int Conf on Machine Learning. New York: ACM, 2019: 3379–3388
[88] Sun Yi, Gomez F, Schmidhuber J. Planning to be surprised: Optimal bayesian exploration in dynamic environments [C] //Proc of the 4th Conf on Artificial General Intelligence. Berlin: Springer, 2011: 41–51
[89] Chien J T, Hsu P C. Stochastic curiosity maximizing exploration [C/OL] //Proc of Int Joint Conf on Neural Networks. Piscataway, NJ: IEEE. 2020 [2022-09-06].https://ieeexplore.ieee.org/abstract/document/9207295
[90] Achiam J, Sastry S. Surprise-based intrinsic motivation for deep reinforcement learning [J]. arXiv preprint, arXiv: 1703.01732, 2017
[91] Laversanne-Finot A, Pere A, Oudeyer P Y. Curiosity driven exploration of learned disentangled goal spaces [C] //Proc of the 2nd Conf on Robot Learning. New York: ACM, 2018: 487–504
[92] Hazan E, Kakade S, Singh K, et al. Provably efficient maximum entropy exploration [C] //Proc of the 36th Int Conf on Machine Learning. New York: ACM, 2019: 2681–2691
[93] Lee L, Eysenbach B, Parisotto E, et al. Efficient exploration via state marginal matching [C/OL] //Proc of the 8th Int Conf on Learning Representations. 2020 [2022-09-06].https://arxiv.org/abs/1906.05274v1
[94] Vezzani G, Gupta A, Natale L, et al. Learning latent state representation for speeding up exploration [C/OL] //Proc of the 2nd Exploration in Reinforcement Learning Workshop at the 36th Int Conf on Machine Learning. 2019 [2022-09-06].https://arxiv.org/abs/1905.12621
[95] Liu H, Abbeel P. Behavior from the void: Unsupervised active pre-training[J]. Advances in Neural Information Processing Systems, 2021, 34: 18459−18473
[96] Seo Y, Chen L, Shin J, et al. State entropy maximization with random encoders for efficient exploration [C] //Proc of the 38th Int Conf on Machine Learning. New York: ACM, 2021: 9443−9454
[97] Still S, Precup D. An information-theoretic approach to curiosity-driven reinforcement learning[J]. Theory in Biosciences, 2012, 131(3): 139−148 doi: 10.1007/s12064-011-0142-z
[98] Salge C, Glackin C, Polani D. Empowerment—An Introduction [M]. Berlin: Springer, 2014
[99] Gregor K, Rezende D J, Wierstra D. Variational intrinsic control [J]. arXiv preprint, arXiv: 1611.07507, 2016
[100] Mohamed S, Rezende D J. Variational information maximisation for intrinsically motivated reinforcement learning [C] //Proc of the 29th Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2015: 2125– 2133
[101] Sutton R S, Precup D, Singh S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning[J]. Artificial Intelligence, 1999, 112(1/2): 181−211
[102] Campos V, Trott A, Xiong Caiming, et al. Explore, discover and learn: Unsupervised discovery of state-covering skills [C] //Proc of the 36th Int Conf on Machine Learning. New York: ACM, 2020: 1317–1327
[103] Eysenbach B, Gupta A, Ibarz J, et al. Diversity is all you need: Learning skills without a reward function [C/OL] //Proc of the 7th Int Conf on Learning Representations. 2019 [2022-09-06].https://arxiv.org/abs/1802.06070
[104] Kwon T. Variational intrinsic control revisited [J]. arXiv preprint, arXiv: 2010.03281, 2020
[105] Achiam J, Edwards H, Amodei D, et al. Variational option discovery algorithms [J]. arXiv preprint, arXiv: 1807.10299, 2018
[106] Hansen S, Dabney W, Barreto A, et al. Fast task inference with variational intrinsic successor features [C/OL] //Proc of the 8th Int Conf on Learning Representations. 2020 [2022-09-06].https://arxiv.org/abs/1906.05030
[107] Sharma A, Gu S, Levine S, et al. Dynamics-aware unsupervised discovery of skills [C/OL] //Proc of the 8th Int Conf on Learning Representations. 2020 [2022-09-06].https://arxiv.org/abs/1907.01657
[108] Mirolli M, Baldassarre G. Functions and mechanisms of intrinsic motivations [J/OL]. Intrinsically Motivated Learning in Natural and Artificial Systems, 2013 [2022-09-06].https://linkspringer.53yu.com/chapter/10.1007/978−3-642−32375-1_3
[109] Schembri M, Mirolli M, Baldassarre G. Evolving internal reinforcers for an intrinsically motivated reinforcement-learning robot [C] //Proc of the 6th Int Conf on Development and Learning. Piscataway, NJ: IEEE, 2007: 282–287
[110] Santucci V G, Baldassarre G, Mirolli M. Grail: A goal-discovering robotic architecture for intrinsically-motivated learning[J]. IEEE Transactions on Cognitive and Developmental Systems, 2016, 8(3): 214−231 doi: 10.1109/TCDS.2016.2538961
[111] Auer P. Using confidence bounds for exploitation-exploration trade-offs[J]. Journal of Machine Learning Research, 2002, 3(12): 397−422
[112] Sun Qiyu, Fang Jinbao, Zheng Weixing, et al. Aggressive quadrotor flight using curiosity-driven reinforcement learning[J]. IEEE Transactions on Industrial Electronics, 2022, 69(12): 13838−13848 doi: 10.1109/TIE.2022.3144586
[113] Perovic G, Li N. Curiosity driven deep reinforcement learning for motion planning in multi-agent environment [C] //Proc of IEEE Int Conf on Robotics and Biomimetics. Piscataway, NJ: IEEE, 2019: 375–380
[114] 陈佳盼,郑敏华. 基于深度强化学习的机器人操作行为研究综述[J]. 机器人,2022,44(2):236−256 Chen Jiapan, Zheng Minhua. A survey of robot manipulation behavior research based on deep reinforcement learning[J]. Robot, 2022, 44(2): 236−256 (in Chinese)
[115] Shi Haobin, Shi Lin, Xu Meng, et al. End-to-end navigation strategy with deep reinforcement learning for mobile robots[J]. IEEE Transactions on Industrial Informatics, 2019, 16(4): 2393−2402
[116] Hirchoua B, Ouhbi B, Frikh B. Deep reinforcement learning based trading agents: Risk curiosity driven learning for financial rules-based policy [J/OL]. Expert Systems with Applications, 2021 [2022-09-06].https://www.sciencedirect.com/science/article/abs/pii/S0957417420311970
[117] Wesselmann P, Wu Y C, Gašić M. Curiosity-driven reinforcement learning for dialogue management [C] //Proc of IEEE In Conf on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2019: 7210–7214
[118] Silver D, Singh S, Precup D, et al. Reward is enough [J/OL]. Artificial Intelligence, 2021 [2022-09-06].https://www.sciencedirect.com/science/article/pii/S0004370221000862
[119] 文载道,王佳蕊,王小旭,等. 解耦表征学习综述[J]. 自动化学报,2022,48(2):351−374 Wen Zaidao, Wang Jiarui, Wang Xiaoxu, et al. A review of disentangled representation learning[J]. Acta Automatica Sinica, 2022, 48(2): 351−374 (in Chinese)
[120] Kipf T, Van Der Pol E, Welling M. Contrastive learning of structured world models [C/OL] //Proc of the 7th Int Conf on Learning Representations. 2019 [2022-09-06].https://arxiv.org/abs/1911.12247
[121] Watters N, Matthey L, Bosnjak M, et al. COBRA: Data-efficient model-based RL through unsupervised object discovery and curiosity-driven exploration [C/OL] //Proc of the 36th Int Conf on Machine Learning. New York: ACM, 2017 [2022-09-06].https://arxiv.org/abs/1905.09275v2
[122] Kulkarni T D, Narasimhan K R, Saeedi A, et al. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation [C] //Proc of the 30th Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2016: 3682–3690
[123] Vezhnevets A S, Osindero S, Schaul T, et al. FeUdal networks for hierarchical reinforcement learning [C] //Proc of the 33rd Int Conf on Machine Learning. New York: ACM, 2017: 3540–3549
[124] Frans K, Ho J, Chen Xi, et al. Meta learning shared hierarchies [C/OL] //Proc of the 5th Int Conf on Learning Representations. 2017 [2022-09-06].https://arxiv.org/abs/1710.09767
[125] Ecoffet A, Huizinga J, Lehman J, et al. First return, then explore[J]. Nature, 2021, 590(7847): 580−586 doi: 10.1038/s41586-020-03157-9
[126] Chen T, Gupta S, Gupta A. Learning exploration policies for navigation [C/OL] //Proc of the 7th Int Conf on Learning Representations. 2019 [2022-09-06].https://arxiv.org/abs/1903.01959
[127] Chaplot D S, Gandhi D, Gupta S, et al. Learning to explore using active neural SLAM [C/OL] //Proc of the 8th Int Conf on Learning Representations. 2020 [2022-09-06].https://arxiv.org/abs/2004.05155
[128] Chaplot D S, Salakhutdinov R, Gupta A, et al. Neural topological SLAM for visual navigation [C] //Proc of IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 12875–12884
[129] Berseth G, Geng D, Devin C, et al. SMiRL: Surprise minimizing reinforcement learning in unstable environments [C/OL] //Proc of the 9th Int Conf on Learning Representations. 2021 [2022−09-06].https://arxiv.org/abs/1912.05510
[130] Singh S, Lewis R L, Barto A G, et al. Intrinsically motivated reinforcement learning: An evolutionary perspective[J]. IEEE Transactions on Autonomous Mental Development, 2010, 2(2): 70−82 doi: 10.1109/TAMD.2010.2051031
[131] Sorg J, Singh S, Lewis R L. Reward design via online gradient ascent [C] //Proc of the 23rd Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2010: 2190–2198
[132] Guo X, Singh S, Lewis R, et al. Deep learning for reward design to improve Monte Carlo tree search in ATARI games [C] //Proc of the 25th Int Joint Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2016: 1519–1525
[133] Zheng Zeyu, Oh J, Hessel M, et al. What can learned intrinsic rewards capture [C] //Proc of the 36th Int Conf on Machine Learning. New York: ACM, 2020: 11436–11446
[134] Forestier S, Portelas R, Mollard Y, et al. Intrinsically motivated goal exploration processes with automatic curriculum learning[J]. Journal of Machine Learning Research, 2022, 23: 1−41
[135] Colas C, Fournier P, Chetouani M, et al. CURIOUS: Intrinsically motivated modular multigoal reinforcement learning [C] //Proc of the 35th Int Conf on Machine Learning. New York: ACM, 2019: 1331–1340
[136] Péré A, Forestier S, Sigaud O, et al. Unsupervised learning of goal spaces for intrinsically motivated goal exploration [C/OL] //Proc of the 6th Int Conf on Learning Representations. 2018 [2022-09-06].https://arxiv.org/abs/1803.00781
[137] Warde-Farley D, Van de Wiele T, Kulkarni T, et al. Unsupervised control through nonparametric discriminative rewards [C/OL] //Proc of the 7th Int Conf on Learning Representations. 2019 [2022-09-06].https://arxiv.org/abs/1811.11359
[138] Pong V H, Dalal M, Lin S, et al. Skew-fit: State-covering self-supervised reinforcement learning [C] //Proc of the 37th Int Conf on Machine Learning. New York: ACM, 2020: 7783−7792
[139] Bengio Y, Louradour J, Collobert R, et al. Curriculum learning [C] //Proc of the 26th Int Conf on Machine Learning. New York: ACM, 2009: 41–48
[140] Jaderberg M, Mnih V, Czarnecki W M, et al. Reinforcement learning with unsupervised auxiliary tasks [C/OL] //Proc of the 5th Int Conf on Learning Representations. 2017 [2022-09-06].https://arxiv.org/abs/1611.05397
[141] Sukhbaatar S, Lin Z, Kostrikov I, et al. Intrinsic motivation and automatic curricula via asymmetric self-play [C/OL] //Proc of the 6th Int Conf on Learning Representations. 2018 [2022-09-06].https://arxiv.org/abs/1703.05407
[142] Gronauer S, Diepold K. Multi-agent deep reinforcement learning: A survey[J]. Artificial Intelligence Review, 2022, 55(2): 895−943 doi: 10.1007/s10462-021-09996-w
[143] Iqbal S, Sha F. Coordinated exploration via intrinsic rewards for multi-agent reinforcement learning [J]. arXiv preprint, arXiv: 1905.12127, 2019
[144] Jaques N, Lazaridou A, Hughes E, et al. Social influence as intrinsic motivation for multi-agent deep reinforcement learning[C] //Proc of the 35th Int Conf on Machine Learning. New York: ACM, 2019: 3040–3049
[145] Guckelsberger C, Salge C, Togelius J. New and surprising ways to be mean: Adversarial NPCS with coupled empowerment minimization [C/OL] //Proc of IEEE Conf on Computational Intelligence and Games. 2018 [2022-09-06].https://ieeexplore.ieee.org/abstract/document/8490453