Abstract
Decoding human speech from neural signals is essential for brain-computer interface (BCI) technologies that aim to restore communication in individuals with neurological deficits. However, this remains a highly challenging task due to the scarcity of paired neural-speech data, signal complexity, high dimensionality, and the limited availability of public tools. We first present a deep learning-based framework comprising an ECoG decoder that translates electrocorticographic (ECoG) signals recorded from the cortex into interpretable speech parameters and a novel source-filter-based speech synthesizer that reconstructs spectrograms from those parameters. A companion audio-to-audio auto-encoder provides reference features to support decoder training. This framework generates naturalistic and reproducible speech and generalizes across a cohort of 48 participants. Among the tested architectures, the 3D ResNet achieved the best decoding performance in terms of the Pearson correlation coefficient (PCC = 0.804), followed closely by a Swin model (PCC = 0.796). Our models decode speech with high correlation even under causal constraints, supporting real-time applications. We successfully decoded speech from participants with either left or right hemisphere coverage, which may benefit patients with unilateral cortical damage. We further perform occlusion analysis to identify the cortical regions most relevant to decoding.
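To make the reported metric concrete, the following is a minimal sketch (not the dissertation's code) of the Pearson correlation coefficient (PCC) between a decoded spectrogram and its ground-truth reference, which is the evaluation measure quoted throughout this abstract. Variable names, array shapes, and the random stand-in data are illustrative assumptions.

```python
# Minimal sketch: PCC between a decoded and a reference spectrogram.
import numpy as np

def pearson_cc(decoded: np.ndarray, reference: np.ndarray) -> float:
    """Flatten both spectrograms (time x frequency) and compute their PCC."""
    x = decoded.ravel().astype(np.float64)
    y = reference.ravel().astype(np.float64)
    x -= x.mean()
    y -= y.mean()
    return float((x @ y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

# Example with random stand-in spectrograms (128 time frames, 80 mel bins).
rng = np.random.default_rng(0)
ref = rng.standard_normal((128, 80))
dec = ref + 0.5 * rng.standard_normal((128, 80))  # noisy "decoded" output
print(f"PCC = {pearson_cc(dec, ref):.3f}")
```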
We next investigate decoding from different forms of intracranial recordings, including surface (ECoG) and depth (stereotactic EEG or sEEG) electrodes, to generalize neural speech decoding across participants and diverse electrode modalities. Most prior works are constrained to 2D grid-based ECoG data from a single patient. We aim to design a deep-learning model architecture that can accommodate variable electrode configurations, support training across multiple subjects without subject-specific layers, and generalize to unseen participants. To this end, we propose SwinTW, a transformer-based model that leverages the 3D spatial locations of electrodes rather than relying on a fixed 2D layout. Subject-specific models trained on low-density 8×8 ECoG arrays outperform prior CNN and transformer baselines (PCC=0.817, N=43). Incorporating additional electrodes—including strip, grid, and depth contacts—further improves performance (PCC=0.838, N=39), while models trained solely on sEEG data still achieve high correlation (PCC=0.798, N=9). A single multi-subject model trained on data from 15 participants performs comparably to individual models (PCC=0.837 vs. 0.831) and generalizes to held-out participants (PCC=0.765 in leave-one-out validation). These results demonstrate SwinTW’s scalability and flexibility, particularly for clinical settings where only depth electrodes—commonly used in chronic neurosurgical monitoring—are available. The model’s ability to learn from and generalize across diverse neural data sources suggests that future speech prostheses may be trained on shared acoustic-neural corpora and applied to patients lacking direct training data.
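To illustrate the key architectural idea of accommodating variable electrode configurations, here is a hedged sketch of a layout-agnostic encoder: each electrode's 3D coordinate is projected into its token embedding so that a standard Transformer can accept arbitrary grid, strip, or depth-electrode sets and varying electrode counts across subjects. This is an illustration of the concept only, not the SwinTW implementation; all module names, dimensions, and the pooling/output head are assumptions.

```python
# Concept sketch: coordinate-aware tokens let one model handle any electrode layout.
import torch
import torch.nn as nn

class CoordinateTokenEncoder(nn.Module):
    def __init__(self, feat_dim: int = 16, d_model: int = 128, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, d_model)   # per-electrode neural features
        self.coord_proj = nn.Sequential(                # (x, y, z) -> positional embedding
            nn.Linear(3, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 80)              # e.g., one 80-bin mel frame

    def forward(self, feats: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # feats:  (batch, n_electrodes, feat_dim) -- n_electrodes may vary per subject
        # coords: (batch, n_electrodes, 3)        -- anatomical 3D electrode positions
        tokens = self.feat_proj(feats) + self.coord_proj(coords)
        encoded = self.encoder(tokens)
        return self.head(encoded.mean(dim=1))           # pool over electrodes

# Two subjects with different electrode counts pass through the same model.
model = CoordinateTokenEncoder()
out_grid = model(torch.randn(1, 64, 16), torch.randn(1, 64, 3))  # 8x8 ECoG grid subject
out_seeg = model(torch.randn(1, 37, 16), torch.randn(1, 37, 3))  # sEEG-only subject
print(out_grid.shape, out_seeg.shape)
```

Because the positional information enters through anatomical coordinates rather than a fixed 2D grid index, the same parameters can in principle be shared across subjects and electrode modalities, which is the property the multi-subject and leave-one-out results above rely on.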
We further investigate two complementary latent spaces for guiding neural speech decoding, aiming to add interpretability and structure to the decoding process. HuBERT offers a discrete, phoneme-aligned latent space learned via self-supervised objectives. Decoding sEEG signals into the HuBERT token space improves intelligibility by leveraging pretrained linguistic priors. In contrast, the articulatory space provides a continuous, interpretable embedding grounded in vocal tract dynamics. It enables speaker-specific speech synthesis through differentiable articulatory vocoders and is especially suited to both sEEG and surface electromyography (sEMG) decoding, where the recorded signals relate to the muscle movements underlying articulation. While HuBERT emphasizes linguistic structure, the articulatory space provides physiological interpretability and individual control, making the two spaces complementary in design and application. We demonstrate that both spaces can serve as intermediate targets for speech decoding across invasive and non-invasive modalities. As a future direction, we extend our articulatory-guided framework toward sentence-level sEMG decoding and investigate phoneme classifiers within the articulatory space to assist decoder training. These developments, together with the design of more advanced single- and cross-subject models, support our long-term goal of building accurate, interpretable, and clinically deployable speech neuroprostheses.
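As a hedged sketch of how a discrete HuBERT token space can serve as an intermediate decoding target, the snippet below extracts self-supervised HuBERT features from reference audio with torchaudio and quantizes them with k-means into unit IDs that a neural decoder could be trained to predict. The layer index, number of clusters, and file name are illustrative assumptions, not the dissertation's exact recipe, and in practice the k-means codebook would be fit on a larger corpus.

```python
# Sketch: derive discrete HuBERT units from audio as decoding targets.
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE            # pretrained HuBERT (16 kHz)
hubert = bundle.get_model().eval()

waveform, sr = torchaudio.load("reference_speech.wav")          # hypothetical file
waveform = waveform.mean(dim=0, keepdim=True)                    # mix down to mono
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    features, _ = hubert.extract_features(waveform)              # per-layer feature list
    layer_feats = features[6].squeeze(0)                         # (frames, 768); layer choice is an assumption

# Quantize frames into discrete units; 100 clusters is a common but assumed choice.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(layer_feats.cpu().numpy())
unit_ids = kmeans.labels_                                        # one token per ~20 ms frame
print(unit_ids[:20])                                             # token sequence a decoder would target
```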