GATENet: Gated Attention and Local Feature Enhancement Network for Neuro-Oriented Target Speaker Extraction
Graphical Abstract
Abstract
Neuro-oriented target speaker extraction is an intelligent speech processing technique that emulates human auditory attention, extracting a target speaker from mixed speech by decoding the attentional focus embedded in a listener's electroencephalogram (EEG) signals. This technology offers new insight into the "cocktail party problem" and has significant implications for developing intelligent hearing-assistive devices. However, existing approaches face three critical challenges that constrain their practical effectiveness: inefficient multimodal feature fusion, inadequate local speech feature extraction, and high computational complexity. This paper proposes GATENet, a novel end-to-end time-domain speaker extraction model featuring three innovative modules: (1) a Gated Cross-Attention Fusion mechanism that establishes cross-modal correlations between EEG and speech through bidirectional cross-attention and adaptively modulates each modality's contribution via dynamic gating weights, overcoming the limited information interaction of conventional linear fusion; (2) an E-DPRNN network that incorporates Conv-U modules within the dual-path architecture to strengthen the capture of local short-term features, combined with residual connections to improve gradient propagation; (3) a Single-Path Global Modulation (SPGM) block that replaces traditional inter-block modeling with parameter-free pooling and feature modulation, preserving global semantic modeling capability while reducing the parameter count. Together, these modules achieve efficient multimodal fusion and precise speaker extraction. Experiments show that the proposed method attains average SI-SDR scores of 13.59 dB, 10.44 dB, and 10.17 dB on the Cocktail Party, AVED, and MM-AAD datasets, respectively, outperforming the baseline model MSFNet by 5.4%-8.2% while using 49.7%-55.6% fewer parameters.
These results substantiate the model's effectiveness and application potential in neuro-oriented target speaker extraction.
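The gated cross-attention fusion described in the abstract can be sketched in miniature as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes single-head attention, toy dimensions, random matrices standing in for learned weights, and EEG features already aligned to the speech frame rate; the actual GATENet layer sizes and gating formulation may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_feats, context_feats, Wq, Wk, Wv):
    """Single-head cross-attention: `query_feats` attends over `context_feats`."""
    Q, K, V = query_feats @ Wq, context_feats @ Wk, context_feats @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # scaled dot-product scores
    return softmax(scores, axis=-1) @ V

def gated_cross_attention_fusion(speech, eeg, params):
    """Bidirectional cross-attention, then a sigmoid gate that adaptively
    weights the two attended streams (a hypothetical, simplified gate)."""
    s2e = cross_attend(speech, eeg, *params["s"])  # speech queries, EEG context
    e2s = cross_attend(eeg, speech, *params["e"])  # EEG queries, speech context
    # dynamic gating weights computed from the concatenated attended features
    g = 1.0 / (1.0 + np.exp(-np.concatenate([s2e, e2s], axis=-1) @ params["g"]))
    return g * s2e + (1.0 - g) * e2s               # convex mixture of modalities

rng = np.random.default_rng(0)
T, d = 50, 64                                      # frames x feature dim (toy sizes)
speech, eeg = rng.standard_normal((T, d)), rng.standard_normal((T, d))
params = {
    "s": [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)],
    "e": [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)],
    "g": rng.standard_normal((2 * d, d)) / np.sqrt(2 * d),
}
fused = gated_cross_attention_fusion(speech, eeg, params)
print(fused.shape)  # → (50, 64)
```

Because the sigmoid gate lies in (0, 1), the fused representation is a per-element convex combination of the two attended streams, which is one way to realize the "adaptively modulating modal contributions" behavior the abstract describes.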