Abstract
Decoding human speech from neural signals is essential for brain-computer interface (BCI) technologies that aim to restore communication in individuals with neurological deficits. However, this remains a highly challenging task due to the scarcity of paired neural-speech data, signal complexity, high dimensionality, and the limited availability of public tools. We first present a deep learning-based framework comprising an ECoG decoder that translates electrocorticographic (ECoG) signals recorded from the cortex into interpretable speech parameters and a novel source-filter-based speech synthesizer that reconstructs spectrograms from those parameters. A companion audio-to-audio auto-encoder provides reference features to support decoder training. This framework generates naturalistic and reproducible speech and generalizes across a cohort of 48 participants. Among the tested architectures, the 3D ResNet achieved the best decoding performance in terms of the Pearson correlation coefficient (PCC = 0.804), followed closely by a Swin model (PCC = 0.796). Our models decode speech with high correlation even under causal constraints, supporting real-time applications. We successfully decoded speech from participants with either left or right hemisphere coverage, which may benefit patients with unilateral cortical damage. We further perform occlusion analysis to identify the cortical regions most relevant to decoding.
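To make the reported metric concrete, the following is a minimal sketch (not the dissertation's code) of the Pearson correlation coefficient (PCC) between a decoded spectrogram and its ground-truth reference, which is the evaluation measure quoted throughout this abstract. Variable names, array shapes, and the random stand-in data are illustrative assumptions.

```python
# Minimal sketch: PCC between a decoded and a reference spectrogram.
import numpy as np

def pearson_cc(decoded: np.ndarray, reference: np.ndarray) -> float:
    """Flatten both spectrograms (time x frequency) and compute their PCC."""
    x = decoded.ravel().astype(np.float64)
    y = reference.ravel().astype(np.float64)
    x -= x.mean()
    y -= y.mean()
    return float((x @ y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

# Example with random stand-in spectrograms (128 time frames, 80 mel bins).
rng = np.random.default_rng(0)
ref = rng.standard_normal((128, 80))
dec = ref + 0.5 * rng.standard_normal((128, 80))  # noisy "decoded" output
print(f"PCC = {pearson_cc(dec, ref):.3f}")
```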
We next investigate decoding from different forms of intracranial recordings, including surface (ECoG) and depth (stereotactic EEG or sEEG) electrodes, to generalize neural speech decoding across participants and diverse electrode modalities. Most prior works are constrained to 2D grid-based ECoG data from a single patient. We aim to design a deep-learning model architecture that can accommodate variable electrode configurations, support training across multiple subjects without subject-specific layers, and generalize to unseen participants. To this end, we propose SwinTW, a transformer-based model that leverages the 3D spatial locations of electrodes rather than relying on a fixed 2D layout. Subject-specific models trained on low-density 8×8 ECoG arrays outperform prior CNN and transformer baselines (PCC=0.817, N=43). Incorporating additional electrodes—including strip, grid, and depth contacts—further improves performance (PCC=0.838, N=39), while models trained solely on sEEG data still achieve high correlation (PCC=0.798, N=9). A single multi-subject model trained on data from 15 participants performs comparably to individual models (PCC=0.837 vs. 0.831) and generalizes to held-out participants (PCC=0.765 in leave-one-out validation). These results demonstrate SwinTW’s scalability and flexibility, particularly for clinical settings where only depth electrodes—commonly used in chronic neurosurgical monitoring—are available. The model’s ability to learn from and generalize across diverse neural data sources suggests that future speech prostheses may be trained on shared acoustic-neural corpora and applied to patients lacking direct training data.
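To illustrate the key architectural idea of accommodating variable electrode configurations, here is a hedged sketch of a layout-agnostic encoder: each electrode's 3D coordinate is projected into its token embedding so that a standard Transformer can accept arbitrary grid, strip, or depth-electrode sets and varying electrode counts across subjects. This is an illustration of the concept only, not the SwinTW implementation; all module names, dimensions, and the pooling/output head are assumptions.

```python
# Concept sketch: coordinate-aware tokens let one model handle any electrode layout.
import torch
import torch.nn as nn

class CoordinateTokenEncoder(nn.Module):
    def __init__(self, feat_dim: int = 16, d_model: int = 128, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, d_model)   # per-electrode neural features
        self.coord_proj = nn.Sequential(                # (x, y, z) -> positional embedding
            nn.Linear(3, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 80)              # e.g., one 80-bin mel frame

    def forward(self, feats: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # feats:  (batch, n_electrodes, feat_dim) -- n_electrodes may vary per subject
        # coords: (batch, n_electrodes, 3)        -- anatomical 3D electrode positions
        tokens = self.feat_proj(feats) + self.coord_proj(coords)
        encoded = self.encoder(tokens)
        return self.head(encoded.mean(dim=1))           # pool over electrodes

# Two subjects with different electrode counts pass through the same model.
model = CoordinateTokenEncoder()
out_grid = model(torch.randn(1, 64, 16), torch.randn(1, 64, 3))  # 8x8 ECoG grid subject
out_seeg = model(torch.randn(1, 37, 16), torch.randn(1, 37, 3))  # sEEG-only subject
print(out_grid.shape, out_seeg.shape)
```

Because the positional information enters through anatomical coordinates rather than a fixed 2D grid index, the same parameters can in principle be shared across subjects and electrode modalities, which is the property the multi-subject and leave-one-out results above rely on.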
We further investigate two complementary latent spaces for guiding neural speech decoding, aiming to add interpretability and structure to the decoding process. HuBERT offers a discrete, phoneme-aligned latent space learned via self-supervised objectives. Decoding sEEG signals into the HuBERT token space improves intelligibility by leveraging pretrained linguistic priors. In contrast, the articulatory space provides a continuous, interpretable embedding grounded in vocal tract dynamics. It enables speaker-specific speech synthesis through differentiable articulatory vocoders and is especially suited to both sEEG and surface electromyography (sEMG) decoding, where the recorded signals relate to the muscle movements underlying articulation. While HuBERT emphasizes linguistic structure, the articulatory space provides physiological interpretability and individual control, making the two spaces complementary in design and application. We demonstrate that both spaces can serve as intermediate targets for speech decoding across invasive and non-invasive modalities. As a future direction, we extend our articulatory-guided framework toward sentence-level sEMG decoding and investigate phoneme classifiers within the articulatory space to assist decoder training. These developments, together with the design of more advanced single- and cross-subject models, support our long-term goal of building accurate, interpretable, and clinically deployable speech neuroprostheses.
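As a hedged sketch of how a discrete HuBERT token space can serve as an intermediate decoding target, the snippet below extracts self-supervised HuBERT features from reference audio with torchaudio and quantizes them with k-means into unit IDs that a neural decoder could be trained to predict. The layer index, number of clusters, and file name are illustrative assumptions, not the dissertation's exact recipe, and in practice the k-means codebook would be fit on a larger corpus.

```python
# Sketch: derive discrete HuBERT units from audio as decoding targets.
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE            # pretrained HuBERT (16 kHz)
hubert = bundle.get_model().eval()

waveform, sr = torchaudio.load("reference_speech.wav")          # hypothetical file
waveform = waveform.mean(dim=0, keepdim=True)                    # mix down to mono
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    features, _ = hubert.extract_features(waveform)              # per-layer feature list
    layer_feats = features[6].squeeze(0)                         # (frames, 768); layer choice is an assumption

# Quantize frames into discrete units; 100 clusters is a common but assumed choice.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(layer_feats.cpu().numpy())
unit_ids = kmeans.labels_                                        # one token per ~20 ms frame
print(unit_ids[:20])                                             # token sequence a decoder would target
```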