ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2019, Vol. 56 ›› Issue (5): 1082-1091. doi: 10.7544/issn1000-1239.2019.20180471

• Artificial Intelligence •

Denoising Autoencoder-Based Language Feature Compensation

Miao Xiaoxiao1,2, Xu Ji1,2, Wang Jian1

  1. 1(Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190); 2(University of Chinese Academy of Sciences, Beijing 100190) (miaoxiaoxiao@hccl.ioa.ac.cn)
  • Published: 2019-05-01
  • Online: 2019-05-01
  • Funding: 
    National Key Research and Development Program of China (2016YFB0801203, 2016YFB0801200)

Abstract: In language identification (LID), accuracy degrades severely when the durations of the training and test utterances are mismatched. This paper proposes a denoising autoencoder (DAE)-based method that compensates the language features of test utterances of different durations, mapping them to a fixed-length representation, which to some extent mitigates both the duration mismatch and the unbalanced phoneme distribution. The method has four stages: 1) the speech signal is framed and transformed into low-level acoustic features; 2) the original i-vector of the utterance is extracted and its phoneme vector is computed; 3) the original i-vector and the phoneme vector are concatenated and fed into the DAE-based language feature compensation unit, which outputs a compensated i-vector; 4) the compensated i-vector and the original i-vector are each passed to the back-end classifier, yielding two score vectors that are fused at the score level before the final decision. Experiments on NIST-LRE07 show that the proposed compensation improves identification performance at every test duration. Compared with a traditional LID system, the relative improvement is 3.16% for 30 s test utterances and 2.90% for 10 s test utterances; compared with an end-to-end LID system, the relative improvement is 3.21% for 3 s test utterances.
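The compensation and fusion stages described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the authors' configuration: the single tanh hidden layer, the vector dimensions, the untrained random weights, the equal-weight linear fusion, and all function names (`dae_compensate`, `fuse_scores`, `decide`, `random_params`) are hypothetical. The abstract does not state the DAE architecture, the back-end classifier, or the fusion rule, and DAE training is omitted entirely.

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer; `weights` holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def dae_compensate(ivector, phoneme_vec, params):
    """Concatenate the i-vector and phoneme vector, then run a forward pass
    through a one-hidden-layer autoencoder (assumed architecture)."""
    x = list(ivector) + list(phoneme_vec)                 # concatenation step
    hidden = [math.tanh(h) for h in dense(x, params["w1"], params["b1"])]
    return dense(hidden, params["w2"], params["b2"])      # compensated i-vector


def fuse_scores(scores_a, scores_b, alpha=0.5):
    """Linear score-level fusion of two per-language score vectors.
    The weight `alpha` is a hypothetical parameter."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(scores_a, scores_b)]


def decide(scores):
    """Return the index of the language with the highest fused score."""
    return max(range(len(scores)), key=scores.__getitem__)


def random_params(in_dim, hidden_dim, out_dim, seed=0):
    """Untrained placeholder weights; a real system would learn these."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]

    return {"w1": matrix(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
            "w2": matrix(out_dim, hidden_dim), "b2": [0.0] * out_dim}


# Toy usage: a 4-dim i-vector and a 3-dim phoneme vector (made-up values).
ivec = [0.2, -0.1, 0.4, 0.0]
pvec = [0.5, 0.3, 0.2]
params = random_params(in_dim=7, hidden_dim=5, out_dim=4)
compensated = dae_compensate(ivec, pvec, params)

# Made-up back-end scores for the original and compensated i-vectors.
fused = fuse_scores([0.1, 0.7, 0.2], [0.3, 0.4, 0.3])
best_language = decide(fused)   # index 1 has the highest fused score
```

The output layer is sized to match the i-vector, so the compensated vector can be scored by the same back-end as the original one, which is what makes fusing the two score vectors straightforward.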

Key words: language identification (LID), i-vector, phoneme vector, feature compensation, denoising autoencoder (DAE)

CLC Number: