融合语义知识的深度表达学习及在视觉理解中的应用

张瑞茂; 彭杰锋; 吴恙; 林倞

doi:10.7544/issn1000-1239.2017.20171064

融合语义知识的深度表达学习及在视觉理解中的应用

The Semantic Knowledge Embedded Deep Representation Learning and Its Applications on Visual Understanding

摘要

摘要: 近几年来，随着深度学习技术的日趋完善，传统的计算机视觉任务得到了前所未有的发展.如何将传统视觉研究中的领域知识融入到深度模型中提升深度模型的视觉表达能力，从而应对更为复杂的视觉任务，成为了学术界广泛关注的问题.鉴于此，以融合了语义知识的深度表达学习为主线展开了一系列研究.取得的主要创新成果包括3个方面：1)研究了将单类型的语义信息(类别相似性)融入到深度特征的学习中，提出了嵌入正则化语义关联的深度Hash学习方法，并将其应用于图像的相似性比对与检索问题中，取得了较大的性能提升；2)研究了将多类型信息(多重上下文信息)融入到深度特征的学习中，提出了基于长短期记忆神经网络的场景上下文学习方法，并将其应用于复杂场景的几何属性分析问题中；3)研究了将视觉数据的结构化语义配置融入到深度表达的学习中，提出了融合语法知识的表达学习方法，并将其应用到复杂场景下的通用内容解析问题中.相关的实验结果表明:该方法能有效地对场景的结构化配置进行预测.

Abstract: With the rapid development of deep learning technique and large scale visual datasets, the traditional computer vision tasks have achieved unprecedented improvement. In order to handle more and more complex vision tasks, how to integrate the domain knowledge into the deep neural network and enhance the ability of deep model to represent the visual pattern, has become a widely discussed topic in both academia and industry. This thesis engages in exploring effective deep models to combine the semantic knowledge and feature learning. The main contributions can be summarized as follows: 1)We integrate the semantic similarity of visual data into the deep feature learning process, and propose a deep similarity comparison model named bit-scalable deep hashing to address the issue of visual similarity comparison. The model in this thesis has achieved great performance on image searching and people’s identification. 2)We also propose a high-order graph LSTM (HG-LSTM) networks to solve the problem of geometric attribute analysis, which realizes the process of integrating the multi semantic context into the feature learning process. Our extensive experiments show that our model is capable of predicting rich scene geometric attributes and outperforming several state-of-the-art methods by large margins. 3)We integrate the structured semantic information of visual data into the feature learning process, and propose a novel deep architecture to investigate a fundamental problem of scene understanding: how to parse a scene image into a structured configuration. Extensive experiments show that our model is capable of producing meaningful and structured scene configurations, and achieving more favorable scene labeling result on two challenging datasets compared with other state-of-the-art weakly-supervised deep learning methods.

HTML全文

参考文献(0)

施引文献

资源附件(0)