
    Research on a Neuro-Oriented Speaker Extraction Method Based on Gated Cross-Attention Fusion

    GALENet: Gated Attention and Local Feature Enhancement Network for Neuro-Oriented Target Speaker Extraction

    Abstract: Neuro-oriented speaker extraction is a speech processing technique that emulates human auditory attention: it decodes the attentional focus in a listener's electroencephalogram (EEG) signals and uses it to extract the target speaker's speech from a mixture. The technique offers a new approach to the "cocktail party problem" and matters for the development of intelligent hearing-assistive devices. Existing methods, however, face three challenges that limit practical performance: inefficient multimodal feature fusion, insufficient extraction of local speech features, and high computational complexity. This paper proposes GALENet, an end-to-end time-domain neuro-oriented speaker extraction model built around three core modules: (1) a gated cross-attention fusion mechanism, which models cross-modal correlations between EEG and speech through bidirectional cross-attention and introduces dynamic gating weights that adaptively adjust each modality's contribution, addressing the insufficient information interaction of conventional linear fusion; (2) an E-DPRNN network, which embeds Conv-U modules in the dual-path architecture to strengthen the capture of local short-term features and uses residual connections to improve gradient propagation; (3) a Single-Path Global Modulation (SPGM) block, which replaces conventional inter-block modeling with parameter-free pooling and feature modulation, reducing the parameter count while retaining global semantic modeling capability. Together, the three modules provide efficient multimodal feature fusion and accurate speaker extraction. Experiments show that the proposed method attains average SI-SDR of 13.59 dB, 10.44 dB, and 10.17 dB on the Cocktail Party, AVED, and MM-AAD datasets, respectively. Relative to the baseline model MSFNet, performance improves by 5.4%-8.2% and the parameter count falls by 49.7%-55.6%, supporting the model's potential and effectiveness for neuro-oriented speaker extraction.
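The gated cross-attention fusion described in module (1) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no layer details, so the sketch assumes that both streams are already projected to the same frame count and feature width, that attention is plain single-head scaled dot-product attention without learned projections, and that the gate is a sigmoid over the concatenated attended features. The names `cross_attention`, `gated_cross_attention_fusion`, `w_gate`, and `b_gate` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_attention(query, context):
    # Scaled dot-product attention: every query frame attends over all
    # context frames. Shapes: query (T, d), context (T, d) -> (T, d).
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context

def gated_cross_attention_fusion(speech, eeg, w_gate, b_gate):
    # Assumption: speech and eeg are (T, d) arrays with matching T and d.
    # Bidirectional cross-attention: each modality queries the other.
    speech_att = cross_attention(speech, eeg)  # speech frames attend to EEG
    eeg_att = cross_attention(eeg, speech)     # EEG frames attend to speech
    # Dynamic gate in (0, 1), computed per frame and per feature from
    # both attended streams (w_gate: (2d, d), b_gate: (d,)).
    gate = sigmoid(np.concatenate([speech_att, eeg_att], axis=-1) @ w_gate + b_gate)
    # Convex combination: the gate decides how much each stream contributes.
    return gate * speech_att + (1.0 - gate) * eeg_att
```

Because the gate lies strictly between 0 and 1, every fused value falls between the corresponding values of the two attended streams, whereas a fixed linear fusion would apply the same mixing weights to every frame.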
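The results above are reported in SI-SDR (scale-invariant signal-to-distortion ratio). For readers unfamiliar with the metric, the sketch below follows its standard definition: both signals are made zero-mean, the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. The function name `si_sdr` and the stabilizing constant `eps` are choices made for this illustration, not details taken from the paper.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-12):
    # Scale-invariant SDR in dB for two 1-D signals of equal length.
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference towards the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference   # part of the estimate explained by the reference
    noise = estimate - target    # everything else counts as distortion
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)
```

As a quick check, the reference `[1, -1, 1, -1]` with the estimate `[1.1, -0.9, 0.9, -1.1]` gives a target energy of 4 and a residual energy of 0.04, that is, 20 dB. Rescaling the estimate by any non-zero constant leaves the score unchanged, which is what "scale-invariant" refers to.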

       
