Abstract
Visual cues from a speaker’s face can significantly improve speech comprehension in noisy environments through multisensory integration (MSI)—the process by which the brain combines auditory and visual inputs. Individuals with Autism Spectrum Disorder (ASD), however, often show atypical MSI, particularly during speech processing, which may contribute to the social communication difficulties central to the diagnosis. Understanding the neural basis of impaired MSI in ASD, especially during naturalistic speech, is critical for developing targeted interventions. Most neurophysiological studies have relied on simplified speech stimuli (e.g., isolated syllables or words), limiting their ecological validity. In this study, we used high-density EEG and linear encoding and decoding models to assess the neural processing of continuous audiovisual speech in adolescents and young adults with ASD ( N = 23) and age-matched typically developing controls ( N = 19). Participants watched and listened to naturalistic speech under auditory-only, visual-only, and audiovisual conditions, with varying levels of background noise, and were tasked with detecting a target word. Linear models were used to quantify cortical tracking of the speech envelope and phonetic features. In the audiovisual condition, the ASD group showed reduced behavioral performance and weaker neural tracking of both acoustic and phonetic features, relative to controls. In contrast, in the auditory-only condition, increasing background noise reduced behavioral and model performance similarly across groups. These results provide, for the first time, converging behavioral and neurophysiological evidence of impaired multisensory enhancement for continuous, natural speech in ASD.
Significance Statement
In adverse hearing conditions, seeing a speaker's face and their facial movements enhances speech comprehension through a process called multisensory integration, where the brain combines visual and auditory inputs to facilitate perception and communication. However, individuals with Autism Spectrum Disorder (ASD) often struggle with this process, particularly during speech comprehension. Previous findings using simple, discrete stimuli do not fully explain how the processing of continuous natural multisensory speech is affected in ASD. In our study, we used natural, continuous speech stimuli to compare the neural processing of various speech features in individuals with ASD and typically developing (TD) controls, across auditory and audiovisual conditions with varying levels of background noise. Our findings showed no group differences in the encoding of auditory-alone speech, with both groups similarly affected by increasing levels of noise. However, for audiovisual speech, individuals with ASD displayed reduced neural encoding of both the acoustic envelope and the phonetic features, marking neural processing impairment of continuous audiovisual multisensory speech in autism.
1 Introduction
The social environment is inherently complex, requiring individuals to continuously process and integrate information from multiple sensory modalities. Successful navigation of social interactions relies on the ability to combine cues from speech sounds, facial expressions, gestures, and body language to create a coherent and accurate understanding of the environment ( Ross et al., 2007). Social cues are often ambiguous or degraded in real-world situations, making the integration of auditory, visual, and tactile information essential for interpreting the intentions, emotions, and communication of others ( Ito et al., 2021; Romanski, 2012; Ross et al., 2011). This process is driven by multisensory integration (MSI) — the brain’s ability to combine information from different senses to enhance perception and action. MSI optimizes detection and identification of environmental cues by leveraging redundant and complementary sensory inputs to improve perception and related actions, particularly under degraded conditions ( Bolognini et al., 2007; Gillmeister and Eimer, 2007; Shahin and Miller, 2009; Stein and Meredith, 1993). Individuals with Autism Spectrum Disorder (ASD), however, often exhibit difficulties in social interaction and communication, alongside atypical MSI ( Beker et al., 2017; Crosse et al., 2022; Gelder et al., 1991; Russo et al., 2010). These MSI deficits are observed for both speech and non-speech stimuli ( Beker et al., 2017; Brandwein et al., 2015, 2013; Crosse et al., 2022; Russo et al., 2010) and are thought to contribute to the challenges autistic individuals face, including in social situations, as they may struggle to combine and interpret sensory cues effectively. Audiovisual speech perception, which involves integrating lip and facial movements with speech sounds, is an important example of MSI that facilitates speech comprehension in noisy environments ( Crosse, Di Liberto, and Lalor, 2016; Ma et al., 2009; Ross et al., 2007). In autistic individuals, reduced audiovisual gain during audiovisual speech-in-noise tasks—a diminished ability to benefit from combining visual and auditory speech inputs—has been consistently reported ( Foxe et al., 2015; Stevenson et al., 2017, 2018). These challenges in multisensory integration are particularly evident during childhood and adolescence ( Beker et al., 2017; Foxe et al., 2015), when MSI is immature ( Crosse et al., 2022; Ross et al., 2011). The fact that several studies report reduced audiovisual speech benefits in autistic individuals during childhood ( Foxe et al., 2015; Stevenson et al., 2014a; Woynaroski et al., 2013) suggests that atypical MSI development may interfere with the learning processes necessary for effective social communication. Moreover, it has been proposed that reduced exposure to audiovisual speech due to atypical social biases in autism during childhood could contribute to MSI deficits ( Cuppini et al., 2017; Foxe et al., 2015). This limited exposure may delay the maturation of multisensory speech processing, with evidence suggesting a partial recovery in adolescence/young adulthood ( Beker et al., 2017; Foxe et al., 2015).
Despite this evidence, prior research on the neural mechanisms underlying impaired multisensory speech integration in autism remains limited. Most studies have employed highly simplified speech stimulation conditions in which isolated syllables or words are presented repeatedly, to facilitate traditional event-related analyses, which require averaging across many trials to examine brain responses (but see Ross et al., 2024). EEG studies using this approach, somewhat surprisingly, have largely failed to reveal evidence of differences in the neural processing of audiovisual speech stimuli in children with autism (Dunham-Carr et al., 2023; Irwin et al., 2023; Magnee et al., 2008; but see Megnin et al., 2012). While valuable, this approach fails to capture the complexity of processing naturalistic audiovisual speech. In real-world communication, speech unfolds continuously over time, with temporal and semantic cues shared between auditory and visual streams facilitating multisensory integration. It remains unclear how autistic individuals process continuous, naturalistic speech in noisy environments, and at which stages of processing the observed deficits in audiovisual speech perception arise. Furthermore, an open question is whether speech encoding deficits in autism are specific to audiovisual conditions, or if they also extend to auditory-only speech, particularly in the context of naturalistic, continuous speech. Some behavioral speech-in-noise studies using sentence-level (Alcantara et al., 2004; Smith and Bennetto, 2007), word-level (Foxe et al., 2015) and syllable-level stimuli (Irwin et al., 2011) in young autistic individuals suggest that auditory speech processing is relatively intact (for review, see Ruiz Callejo and Boets, 2023). Other studies, however, have reported deficits in auditory-only speech perception in autism (Schelinski and von Kriegstein, 2020; Stevenson et al., 2017), and a recent review (Key and D'Ambrose Slaboch, 2021) indicates that these difficulties become more pronounced as both stimulus and task complexity increase.
Here we sought to better approximate how auditory and audiovisual speech is processed in real-world conditions by leveraging recent advances in neural modeling. Linear encoding and decoding models applied to EEG data provide a powerful framework for analyzing neural responses to continuous speech without relying on time-locked averaging ( Goncalves et al., 2014; Lalor and Foxe, 2010; Lalor et al., 2009). These models allow for a more detailed exploration of how the brain tracks and processes speech in both auditory-only and audiovisual conditions, offering new insights into the mechanisms underlying social communication deficits in ASD. Linear models can be employed in two key ways: (1) stimulus reconstruction (or backward modeling), where EEG data is used to reconstruct specific stimulus features, and (2) temporal response functions (TRF) (or forward modeling), which characterize the brain’s response to various speech features over time ( Crosse et al., 2021). A significant advantage of these models is their flexibility in handling a range of features, from basic acoustic properties to higher-order phonetic features ( Crosse et al., 2021; Di Liberto et al., 2015).
To investigate the neural mechanisms underlying multisensory speech processing in autism, we employed linear encoding and decoding models to assess how autistic and neurotypical individuals use multisensory inputs to enhance the processing of both low-level acoustic features (e.g., the speech envelope) and higher-order phonetic features ( Crosse et al., 2021; Di Liberto et al., 2015). Participants were exposed to natural speech stimuli—continuous children’s stories narrated by an actress—under three sensory conditions (auditory-only, visual-only, and audiovisual speech) and across six levels of signal-to-noise ratio (SNR), while high-density EEG was recorded. We hypothesized that neural encoding of features of the continuous speech stimulus would decline as noise levels increased for both groups, for both auditory and audiovisual conditions. We furthermore expected that, consistent with the preponderance of findings in the literature ( Foxe et al., 2015; Stevenson et al., 2017, 2014b), autistic participants would show reduced neural encoding of the continuous audiovisual speech compared to controls. Finally, to address inconsistent findings in the literature ( Alcantara et al., 2004; Irwin et al., 2011; Key and D'Ambrose Slaboch, 2021; Schelinski and von Kriegstein, 2020; Smith and Bennetto, 2007; Stevenson et al., 2017) and to clarify whether multisensory speech encoding differences in autism for continuous narrative speech can be attributed primarily to multisensory integration deficits, we also examined whether neural tracking of speech in the auditory-alone condition differed between autistic and control groups.
2 Materials and methods
2.1 Participants and experimental procedure
The present study was carried out with 20 typically developing (TD) and 25 ASD participants. One TD and two ASD participants were excluded due to poor EEG quality (see Preprocessing section below), resulting in a final sample of 19 TD (age range: 8–17.7; mean: 12.93±3.0 years old) and 23 ASD (age range: 9–22.3; mean: 13.74±3.44 years old) participants. Demographics of all participants are presented in supplementary Table 1. All participants were native English speakers, had normal hearing based on audiometric threshold evaluation, and had normal or corrected-to-normal vision. Diagnoses of ASD were confirmed by a trained clinical psychologist using the Autism Diagnostic Interview-Revised (ADI-R; Lord et al., 1994) and the Autism Diagnostic Observation Schedule (ADOS-G; Lord et al., 2000). The Institutional Review Board of the Albert Einstein College of Medicine approved all procedures. Participants were given $12.00 an hour for their time in the laboratory. All procedures conformed to the ethical standards of the Declaration of Helsinki.
Participants were seated in a chair in an electrically shielded room (Industrial Acoustics Company Inc, Bronx, NY), 70 cm away from the visual display (Dell UltraSharp 1704FPT). Stimulus presentation was controlled using Presentation software (Neurobehavioral Systems). EEG data were recorded at a sampling rate of 512 Hz using a 64-channel BioSemi ActiveTwo system (BioSemi, Amsterdam, The Netherlands). The system uses active electrodes and direct current (DC) coupling with a hardware bandpass filter from DC to 150 Hz to reduce low-frequency drift and high-frequency noise. The recording reference was the Common Mode Sense (CMS) active electrode, which, together with the Driven Right Leg (DRL) passive electrode, forms a feedback loop to stabilize the reference signal and reduce common-mode noise. This referencing convention in BioSemi systems differs from traditional reference electrodes by dynamically compensating for potential differences between the scalp and the amplifier ground, effectively creating a "zero potential" point.
Audio was presented at 75 dB SPL using in-ear earphones (ER-1, Etymotic Research, with 10 mm eartips) while participants watched ∼30-second videos of continuous speech (see ‘Stimuli’) in one of three possible conditions: audio-only (A), visual-only (V), or audiovisual (AV). On each trial, the audio was presented at one of six possible SNRs: no-noise (NN), −3, −6, −9, −12, and −15 dB. For each combination of condition × SNR, 20 videos (10 min) were presented, leading to a total of 360 (20 × 3 × 6) videos presented to each participant.
To maintain participant attention, the experiment was divided into two separate recording sessions conducted on different days, each consisting of 180 videos. For each session, only three SNR levels were presented (randomly selected among pairs of levels: −15|−12, −9|−6, −3|NN) to balance noise levels across sessions. In total, 60 unique videos were used (see ‘Stimuli’ section for details), each presented six times across the experiment. The chronological order of each story was preserved across repetitions to maintain narrative coherence and participant engagement. However, the sensory modality (A, V, AV) and noise level were randomly assigned for each repetition, allowing each video to appear under different experimental conditions. One TD participant and one ASD participant completed only one of the two recording sessions, resulting in final samples of 18 TD and 23 ASD participants for the −15, −9, and NN SNRs, and 19 TD and 22 ASD participants for the −12, −6, and −3 SNRs. Notably, we used linear mixed models (LMMs) for statistical analysis, as they effectively account for repeated measures within subjects while handling unbalanced data. This approach ensures that differences in the number of participants across SNRs do not introduce bias in the statistical results, though it may slightly reduce power for conditions with fewer observations.
During the experiment, the sensory condition (A, V, or AV) was randomly selected, but the chronological order of the videos was maintained for all participants such that they could follow the story without repeated segments. In the auditory condition, a still image of the face of the actress was presented on the screen. In the visual condition, no speech sound was presented, but the various levels of background noise were presented. At the start of each video, a written word appeared on the screen and participants were instructed to vocalize the word.
Following this, the video began, and participants were required to press a button as soon as they recognized the word in the 30 s video. The word could occur one or more times in each 30-second video. To evaluate participants' performance in the task, we calculated the F-score (or F1 score), a metric that balances both precision and recall (Van Rijsbergen, 1979). Precision refers to the proportion of correct target identifications (true positives) out of all instances where the participant identified a target (including both true positives and false positives). In other words, it measures how accurate participants were when they indicated a target was present. Recall, on the other hand, refers to the proportion of actual targets correctly identified out of all true target instances, reflecting how sensitive participants were in detecting targets. The F-score is calculated as the harmonic mean of precision and recall, emphasizing their balance rather than treating them as independent metrics. The formula for the F-score is:

F = 2 × (Precision × Recall) / (Precision + Recall)
This calculation ensures that both precision and recall are equally weighted. An F-score of 1 indicates perfect performance (where both precision and recall are 100 %), while an F-score closer to 0 indicates poor performance. Unlike accuracy, which can be biased by class imbalances (e.g., detecting more frequent stimuli while ignoring rare ones), the F-score provides a more balanced assessment of performance by capturing both the ability to correctly identify targets and to avoid false alarms. This makes it particularly useful for evaluating participants' behavioral responses across different task conditions, especially in situations where both missed detections and false alarms need to be considered.
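For illustration, the metric can be computed as in the following minimal Python sketch; the function name and the example counts are ours and not part of the original analysis code.

```python
def f_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall (F1), as used for the word-detection task."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)  # correct detections / all button presses
    recall = true_positives / (true_positives + false_negatives)     # correct detections / all target occurrences
    return 2 * precision * recall / (precision + recall)

# Example: 8 correct detections, 2 false alarms, 2 missed targets -> F = 0.8
print(f_score(8, 2, 2))
```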
2.2 Stimuli
Videos of a professional actress reciting children's stories were recorded in a quiet, well-lit room. The actress stood in front of a plain grey background, positioned at the center of the screen, with only her head and torso visible. Three stories were recorded: The Lorax (00:14:38), Goldilocks (00:08:04), and Rumpelstiltskin (00:16:34). Subsequently, each story was segmented into 30–40 s videos, leading to a total of 61 videos (one video was used as a training trial at the beginning of each block). The segmentation was adapted for each video to allow for a natural transition point in the story. Each video had a resolution of 1280×720 pixels with a frame rate of 30 frames per second. Videos were exported in audio-only (A), visual-only (V), and audiovisual (AV) formats using Adobe Premiere Pro CC 2017 (Ver 11.0). Each of the 60 unique videos was presented six times to reach the full set of 360 trials. The soundtracks were sampled at 48 kHz, underwent dynamic range compression, and intensities were normalized based on root mean square levels (see Crosse et al., 2015). For the addition of noise, these soundtracks were masked with spectrally matched stationary noise to maintain consistent masking across the duration of each stimulus (Ding et al., 2014; Ding and Simon, 2013). This noise was generated in MATLAB using a 50th-order forward linear predictive model derived from the original speech recordings.
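The noise generation was implemented in MATLAB; the Python sketch below illustrates the same general technique (LPC-shaped stationary noise mixed at a target SNR) using librosa and SciPy. The function names and the RMS-matching step are our own simplifications under stated assumptions, not the authors' code.

```python
import numpy as np
import librosa
import scipy.signal

def spectrally_matched_noise(speech: np.ndarray, order: int = 50) -> np.ndarray:
    """Stationary noise whose long-term spectrum matches the input speech:
    fit a forward LPC model to the speech, then filter white noise through
    the resulting all-pole filter 1/A(z)."""
    a = librosa.lpc(speech.astype(float), order=order)            # [1, a1, ..., a_order]
    noise = scipy.signal.lfilter([1.0], a, np.random.randn(len(speech)))
    # Match the RMS level of the original speech
    return noise * np.sqrt(np.mean(speech ** 2) / np.mean(noise ** 2))

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then mix."""
    gain = np.sqrt(np.mean(speech ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return speech + gain * noise
```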
2.3 Data preprocessing
Analyses were conducted in Python (3.11) using MNE-Python (1.8.0; Gramfort et al., 2013) and custom scripts. Bad channel detection was performed using the NoisyChannels class from the pyprep toolbox.
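The exact detection parameters are not reproduced here; the snippet below is a minimal example of how NoisyChannels is typically applied to a raw BioSemi recording in MNE-Python. The file name and the RANSAC settings are illustrative assumptions.

```python
import mne
from pyprep import NoisyChannels

# Load one BioSemi recording (file name is illustrative)
raw = mne.io.read_raw_bdf("sub-01_ses-01_task-speech_eeg.bdf", preload=True)
raw.pick("eeg")
raw.set_montage("biosemi64", on_missing="ignore")

# Detect bad channels using PREP-style criteria (deviation, correlation, RANSAC)
nd = NoisyChannels(raw, random_state=42)
nd.find_all_bads(ransac=True)
raw.info["bads"] = nd.get_bads()

# Interpolate the flagged channels before further preprocessing
raw.interpolate_bads(reset_bads=True)
```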
2.4 Auditory feature extraction
The present study focuses on the measurement of the coupling between the EEG signal and two features of the speech stimuli: the acoustic envelope and the phonetic features. These two properties of the speech stimuli were extracted based on methodologies developed in previous studies (Di Liberto et al., 2015; Lalor and Foxe, 2010; Lalor et al., 2009; Mesgarani et al., 2014). The raw audio was read directly from each video (AVI) and processed in MATLAB (R2023b) to extract the desired stimulus features. First, this audio signal was filtered into 16 logarithmically spaced frequency bands between 70 Hz and 16 kHz using a dynamic compressive gammachirp filterbank (Irino and Patterson, 2006). Then, the temporal envelope of each frequency band was calculated with the mTRFenvelope function from the mTRF-Toolbox (v2.4) in MATLAB.
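Since the gammachirp filterbank and mTRFenvelope routines are MATLAB implementations, the sketch below gives only a rough Python analogue of the procedure (log-spaced band-pass filters, Hilbert envelopes, band averaging, and downsampling to an EEG analysis rate). The filter choices and the output rate are our assumptions, not the published pipeline.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, resample_poly

def broadband_envelope(audio: np.ndarray, fs: int, n_bands: int = 16,
                       fmin: float = 70.0, fmax: float = 16000.0,
                       fs_out: int = 128) -> np.ndarray:
    """Approximate speech envelope: filter into log-spaced bands, take the
    Hilbert envelope of each band, average across bands, and downsample."""
    edges = np.geomspace(fmin, fmax, n_bands + 1)
    envs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, min(hi, fs / 2 - 1)], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, audio)
        envs.append(np.abs(hilbert(band)))          # band-wise temporal envelope
    env = np.mean(envs, axis=0)
    return resample_poly(env, fs_out, fs)           # downsample to the EEG analysis rate
```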
2.5 Stimulus reconstruction and temporal response function
In this study, we employed both linear decoding models (also known as backward models) to reconstruct stimulus features from the EEG data, and linear encoding models (also known as forward models) to predict EEG responses based on stimulus features. Forward and backward modelling were conducted using the mTRF-Toolbox (Crosse, Di Liberto, Bednar, et al., 2016). The objective of both model types is to assess the relationship between specific features in the presented stimuli and the resulting brain activity. The methodologies of these approaches are thoroughly detailed in Crosse et al. (2021). Briefly, due to the delayed nature of brain responses to stimuli, these models utilize time-lagged [tmin, tmax] matrices to measure how the EEG responds linearly to changes in stimulus features within this time frame. Backward models decode speech features from the delayed brain responses by integrating across all EEG recording channels. Forward model weights, also known as temporal response functions (TRFs), indicate how EEG responses on a specific recording channel fluctuate in response to unit changes in a specified stimulus feature. While both decoding and encoding models were applied to the acoustic envelope, only encoding models were used for the phonetic-level features. This is because decoding models are optimized for reconstructing continuous univariate time series, whereas phonetic regressors consist of multiple sparse binary time series, making encoding approaches more appropriate for estimating the brain's response to each phonetic feature individually. Based on previous studies, we chose a time-lag window of 0–500 ms, as cross-sensory integration of speech relies on long temporal windows (Crosse, Di Liberto, and Lalor, 2016). For each participant, data for each condition (sensory modality and SNR) were randomly split into a training (80 %) and a test (20 %) set. The reliability of the models was evaluated using a leave-one-out cross-validation procedure and optimized through ridge regression, with the lambda regularization parameter selected from a predefined set of values.
The models were then used to predict the held-out test set, and the Pearson correlation coefficient was calculated between the actual and predicted signals (the stimulus for backward models and the EEG for forward models). Given that we used participant-dependent models with a limited number of trials, we repeated the training-testing cycle five times to mitigate any potential biases from selecting either an overly favorable or particularly challenging test set. The prediction accuracies of the models were determined by averaging the Pearson correlation values across these five iterations. For visualization purposes, the TRF waveforms ( Fig. 2G) were evaluated with time-lags of −100 to 600 ms. To select EEG channels of interest in a data-driven manner, we applied a spherical spline surface Laplacian (current source density, CSD) transformation to the TRF estimates. This spatial filtering technique enhances topographical resolution and attenuates volume conduction effects by estimating the second spatial derivative of the potential field ( Perrin et al., 1989; Kayser and Tenke, 2015). The resulting CSD maps provide reference-free estimates of cortical activity that emphasize local source contributions. These maps were used to identify regions of interest (ROIs), defined as the channels showing the highest TRF amplitudes in the 50–100 ms time window following stimulus onset, which approximately corresponds to early auditory cortical responses.
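The models themselves were fitted with the mTRF-Toolbox; as a conceptual illustration only, the following NumPy sketch shows the core computation behind a forward (encoding) model with 0–500 ms lags: a lagged design matrix, a ridge solution, and per-channel Pearson correlations on held-out data. The lambda value here is arbitrary (the actual value is selected by cross-validation), and the cross-validation loop is omitted.

```python
import numpy as np

def lag_matrix(stim: np.ndarray, fs: float, tmin: float = 0.0, tmax: float = 0.5) -> np.ndarray:
    """Time-lagged design matrix for a univariate stimulus feature (lags tmin..tmax, in s)."""
    lags = np.arange(int(round(tmin * fs)), int(round(tmax * fs)) + 1)
    X = np.zeros((len(stim), len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = stim[:len(stim) - lag]
        else:
            X[:lag, j] = stim[-lag:]
    return X

def fit_forward_trf(stim: np.ndarray, eeg: np.ndarray, fs: float, lam: float = 1e2) -> np.ndarray:
    """Ridge solution mapping the lagged stimulus onto each EEG channel (TRF weights)."""
    X = lag_matrix(stim, fs)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ eeg)

def prediction_accuracy(stim: np.ndarray, eeg: np.ndarray, weights: np.ndarray, fs: float) -> np.ndarray:
    """Per-channel Pearson r between predicted and recorded EEG on held-out data."""
    pred = lag_matrix(stim, fs) @ weights
    pred = pred - pred.mean(axis=0)
    resp = eeg - eeg.mean(axis=0)
    return (pred * resp).sum(axis=0) / (np.linalg.norm(pred, axis=0) * np.linalg.norm(resp, axis=0))
```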
2.6 Eye-tracking recordings & analysis
Gaze behavior was recorded using the EyeLink 1000 system (SR Research Ltd.) at a sampling rate of 500 Hz. Following data collection, recordings were converted to .asc format using EDF Converter 4.3.1 (SR Research) to ensure compatibility with MNE-Python for further analysis. Similar to the EEG analysis, epochs were created around the triggers marking the start of each video, and these epochs were categorized based on modality and noise level using trigger information. For the analysis, three regions of interest (ROIs) were defined using pixel coordinates corresponding to the face, eyes, and mouth. For each participant and each video, we calculated the time spent in each ROI and averaged these values across modalities and noise levels for each group. Due to hardware issues, eye-tracking data were unavailable for 4 TD participants and 4 ASD participants. Therefore, to examine the influence of looking behavior on behavioral and EEG measures, an additional statistical analysis incorporating gaze behavior as a covariate was performed on the participants with available eye-tracking data.
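As an illustration of the dwell-time measure, the following sketch counts gaze samples falling within rectangular ROIs for one video epoch; the pixel coordinates and variable names are hypothetical placeholders rather than the ROIs actually used.

```python
import numpy as np

# Illustrative ROI bounding boxes in screen pixels: (x_min, x_max, y_min, y_max)
ROIS = {
    "face":  (480, 800, 100, 620),
    "eyes":  (540, 740, 200, 300),
    "mouth": (560, 720, 420, 520),
}

def roi_dwell_time(gaze_x: np.ndarray, gaze_y: np.ndarray, fs: float = 500.0) -> dict:
    """Time (in seconds) spent in each region of interest during one video epoch."""
    valid = ~np.isnan(gaze_x) & ~np.isnan(gaze_y)      # drop blinks / lost samples
    dwell = {}
    for name, (x0, x1, y0, y1) in ROIS.items():
        inside = valid & (gaze_x >= x0) & (gaze_x <= x1) & (gaze_y >= y0) & (gaze_y <= y1)
        dwell[name] = inside.sum() / fs
    return dwell
```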
2.7 Audiovisual gain
Multisensory gain can be assessed in several ways. One common method is to compare a multisensory condition directly to a corresponding unisensory condition (typically by calculating the differential between their respective magnitudes). Another approach is to compute the summation of both unisensory conditions, known as the additive model, and compare this to the multisensory condition. In the context of linear models and neural data, MSI has been quantified by evaluating differences in reconstruction or prediction accuracy between the AV condition and a model that combines the A and V conditions (Crosse, Di Liberto, and Lalor, 2016). This method involves training an A model on ‘auditory-only’ trials and a V model on ‘visual-only’ trials, summing their covariances (with the weights multiplied by 2 to maintain equivalent power), and then validating the A + V model using AV data. MSI is then calculated by subtracting the prediction accuracy of the A + V model from that of the AV model. Based on this approach, we initially decided to include a visual-only condition in our analysis. However, our eye-tracking analysis indicated significant differences in attention allocation during the visual-only condition between groups: participants with ASD spent significantly less time looking at the speaker's mouth. This difference suggests that MSI evaluation could be biased between the ASD and TD groups. Since attentional patterns during the audiovisual speech condition were similar across groups, including the visual-only condition in our MSI calculation could lead to an overestimation of MSI for the ASD group, given their lower prediction accuracy in the visual-only condition. To avoid this bias, we excluded the visual-only condition from our MSI calculation. Instead, we defined ‘audiovisual gain’ as the difference in performance between the AV condition and the auditory-only condition (AV − A). Given this approach, we do not interpret our results as a direct measure of multisensory integration. Rather, we refer to the observed effects as ‘audiovisual gain’ or ‘multisensory enhancement,’ which more appropriately reflect the nature of the comparison. Importantly, this difference in visual attention was not observed in the audiovisual condition, where gaze behavior was similar between groups. However, to evaluate the effect of gaze behavior on the audiovisual gain, we included the time spent looking at the speaker’s mouth as a covariate in an additional statistical analysis.
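Computationally, the audiovisual gain reduces to a simple AV minus A difference per participant and SNR, as sketched below with pandas; the column names and example values are illustrative. An additive A + V model, had it been retained, would instead require summing the unisensory decoders before testing them on AV data.

```python
import pandas as pd

def audiovisual_gain(acc: pd.DataFrame) -> pd.DataFrame:
    """Audiovisual gain defined as AV minus A reconstruction accuracy,
    computed per participant and SNR level."""
    wide = acc.pivot_table(index=["participant", "snr"], columns="condition", values="r")
    wide["av_gain"] = wide["AV"] - wide["A"]
    return wide.reset_index()[["participant", "snr", "av_gain"]]

# Example with made-up accuracies for two participants at one SNR
acc = pd.DataFrame({
    "participant": ["s01", "s01", "s02", "s02"],
    "condition":   ["A", "AV", "A", "AV"],
    "snr":         [-9, -9, -9, -9],
    "r":           [0.05, 0.09, 0.06, 0.07],
})
print(audiovisual_gain(acc))   # av_gain = 0.04 and 0.01
```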
2.8 Statistical analysis
Statistical analyses were performed in Jamovi (The jamovi project, version 2.3.28). As described above, linear mixed models (LMMs) were used for the behavioral and neural measures, with group, SNR, their interaction, and age as fixed effects and participant as a random factor. The normality of residuals was verified using the Kolmogorov-Smirnov test, and post-hoc comparisons were corrected using the Bonferroni method. Comparisons between TD and ASD groups across combined SNR conditions were performed using a classical independent-samples t-test, after confirming the normality of the distribution (α = 0.05). If the Shapiro-Wilk test of normality (Shapiro and Wilk, 1965) or Levene’s test for equality of variances was significant, a Welch’s t-test or a Mann-Whitney U test was used instead.
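Although the statistics were run in Jamovi, an equivalent linear mixed model can be specified in Python with statsmodels, as sketched below on synthetic data; the variable names, the numeric coding of SNR (0 standing for no-noise), and the simulated values are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic example data: one row per participant x SNR (names and values are illustrative)
df = pd.DataFrame({
    "participant": np.repeat(np.arange(42), 6),
    "group": np.repeat(rng.choice(["TD", "ASD"], 42), 6),
    "snr": np.tile([-15, -12, -9, -6, -3, 0], 42),
    "age": np.repeat(rng.uniform(8, 22, 42), 6),
})
df["r"] = 0.1 + 0.002 * df["snr"] + rng.normal(0, 0.02, len(df))   # mock reconstruction accuracy

# Mixed model: group x SNR and age as fixed effects, random intercept per participant
model = smf.mixedlm("r ~ group * snr + age", data=df, groups="participant")
print(model.fit().summary())
```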
3 Results
3.1 Linking behavioral performance and stimulus reconstruction: evidence of reduced audiovisual gain in ASD
The first part of this study examined the behavioral performance of both TD and ASD participants in detecting a target word amidst varying SNRs. To account for false alarms (i.e., incorrect word identification), we computed the F-score for each participant under each condition. Results showed that the average F-score significantly dropped in both groups for both auditory and audiovisual speech as SNR decreased (i.e., noise level increased), confirming the expected difficulty in word detection under increasingly noisy conditions (Fig. 1). The F-score for visual-only speech remained consistent across different SNR levels within each group, indicating stable performance in the absence of auditory input (Fig. 2).
The patterns observed in the neural reconstruction of the acoustic envelope mirrored those seen in behavioral performance. As SNR decreased, reconstruction accuracy declined for both auditory (Fig. 1G, SNR factor: F = 100.35, df=191.4, η²=0.72, p < 0.001) and audiovisual conditions (Fig. 1H, SNR factor: F = 47.68, df=191.2, η²=0.55, p < 0.001) in both groups. However, while no group differences were found in the auditory condition (group factor: F = 2.0, df=38.3, η²=0.04, p = 0.166, Group*SNR: F = 2.08, df=191.4, η²=0.05, p = 0.07; Age: F = 6.74, Estimate=−0.004, df=41.4, η²=0.14, p = 0.013), there were significant group differences in the audiovisual condition (group factor: F = 6.56, df=38.7, η²=0.18, p = 0.014; Group*SNR: F = 0.493, df=191.2, η²=0.01, p = 0.781; Age: F = 7.912, Estimate=−0.007, df=45.7, η²=0.14, p = 0.007), with ASD participants showing significantly reduced AV gain (Fig. 1H, I; group factor: F = 7.13, df=39.2, η²=0.15, p = 0.011; SNR factor: F = 4.97, df=192.8, η²=0.11, p < 0.001; group*SNR: F = 2.07, df=192.8, η²=0.05, p = 0.07; Age: F = 2.16, Estimate=−0.002, df=41.3, η²=0.04, p = 0.149). A follow-up exploratory analysis revealed that the reduced AV gain in ASD was more pronounced at lower SNRs (Fig. 1J, t=−3.55, p = 0.001). The consistency between neural reconstruction accuracy and behavioral performance highlights the robustness of these results across both behavioral and electrophysiological measures. When incorporating time spent looking at the speaker’s mouth during the audiovisual condition as a covariate in the linear model, the results remained consistent, indicating a significant group effect independent of gaze behavior (group factor: F = 5.06, df=33.0, η²=0.13, p = 0.031; SNR factor: F = 4.75, df=149.3, η²=0.14, p < 0.001; Mouth-gaze: F = 0.58, df=123.2, η²<0.01, p = 0.447). To assess whether the group differences in TRF reconstruction accuracy reflect behaviorally meaningful variation, we performed within-group repeated-measures correlations between TRF accuracy (AV–A) and behavioral performance (AV–A F-score; Supplementary Fig. 2). In the TD group, we observed a significant correlation (r = 0.40, p = 0.0006), and in the ASD group, a weaker but still significant correlation (r = 0.21, p = 0.0497).
3.2 TRF analyses reveal decreased neural encoding of the acoustic envelope in audiovisual speech in ASD
The findings from the previous section indicate a diminished AV gain in individuals with ASD, evident in both behavioral performance and our ability to reconstruct the acoustic envelope based on the EEG data. To further explore this, temporal response function (TRF) estimation was employed as a complementary forward model to assess the neural encoding of the acoustic envelope at specific EEG channels. TRF topographies were first analyzed to identify a representative cluster of electrodes—referred to as the region of interest (ROI)—to examine neural encoding and detect potential group differences that may explain the reduced decoding accuracy for audiovisual speech in ASD. To enhance spatial resolution, a spherical spline surface Laplacian was applied to the TRF data (Kayser and Tenke, 2015; Perrin et al., 1989), resulting in current source density (CSD) estimates for both auditory and audiovisual speech (Fig. 3).
Prediction accuracy was calculated by predicting EEG signals based on the acoustic envelope using TRFs and computing the Pearson correlation coefficient between the predicted and actual EEG signals (Crosse et al., 2021). Across both auditory and audiovisual conditions, prediction accuracy was highest over centro-temporal channels (Fig. 4).
3.3 Deficit in neural encoding of phonetic features in audiovisual speech-in-noise in ASD
Following the analysis of neural encoding of the acoustic envelope, the same process was applied to phonetic features, a higher-order speech component. As in the acoustic envelope analysis, the temporal response functions (TRFs) of participant-dependent forward models were first evaluated. Given that phonetic features were represented as a 2D binary matrix, the resulting TRFs were also represented as 2D matrices, with each row corresponding to a specific phonetic feature and the color indicating TRF amplitude. For both groups, TRF amplitude was highest for two types of phonetic features: "Voicing" and "Vowels" (Fig. 5).
Similar to the acoustic envelope analysis, the resulting TRFs were used to predict EEG signals, this time providing the models with phonetic features. The topographical distribution of prediction accuracy resembled that obtained with the acoustic envelope, showing higher values over centro-temporal channels for both auditory and audiovisual speech conditions (Fig. 6).
4 Discussion
Here we studied audiovisual speech processing in autism using natural continuous-speech stimuli, to examine the integrity of multisensory enhancement of the neural processing of speech under naturalistic environmental conditions which are often noisy and challenging for the listener. Our key findings reveal that, under naturalistic conditions where continuous audiovisual speech is presented, audiovisual gain of speech processing is significantly reduced in ASD. Eye-tracking analysis suggests that this is not because of differences in visual attention. Furthermore, this is seen at multiple stages of speech processing as assessed using high-density EEG. In contrast, the cortical processing of auditory speech-in-noise seems to be largely intact in individuals with autism.
4.1 Impaired multisensory enhancement of speech in autism
In noisy settings, seeing a speaker's face and facial movements enhances speech comprehension through MSI, where the brain combines visual and auditory inputs to facilitate perception and communication (McGurk and MacDonald, 1976; Ross et al., 2007; Saint-Amour et al., 2007; Sumby and Pollack, 1954). In accordance with the principle of inverse effectiveness, that MSI is more pronounced when the unisensory signal is less informative (Meredith and Stein, 1986), both groups exhibited increased AV gain as SNR decreased, for both behavioral and neural measures. In contrast to previous behavioral studies by our lab where AV gain peaked at intermediate SNRs (Foxe et al., 2015; Ross et al., 2011), we saw a more linear inverse effect here. This is likely because participants already knew the word they were trying to detect, enabling them to perform better at very low SNRs. We further found that AV gain was reduced for ASD compared to TD individuals, confirming a reduction in behavioral indices of multisensory enhancement that is now well described in the literature (Alcantara et al., 2004; Foxe et al., 2015; Smith and Bennetto, 2007). Although previous studies have investigated audiovisual speech integration in autism, the use of continuous, naturalistic speech combined with varying levels of background noise represents a novel approach. This paradigm offers a more ecologically valid framework for probing multisensory speech processing in autism. Notably, our findings reveal previously unreported impairments in the neural processing of audiovisual speech in autism, highlighting the value of using naturalistic stimuli to uncover subtle deficits in multisensory integration.
It is worth noting that potential sex differences were also investigated, as previous research from our group reported that females outperform males in audiovisual word recognition tasks (Ross et al., 2015). However, in the present study, no significant sex differences were observed in either the TD or ASD groups. This lack of differences may be attributed to the small number of female participants in both groups (8 in TD and 5 in ASD). As such, the question of whether sex differences exist in speech perception when using continuous, natural speech stimuli remains unresolved and warrants further investigation with larger, more balanced samples. More broadly, the relatively small sample size of the present study represents a limitation that may impact the generalizability of the findings. While the use of linear mixed-effects models (LMMs) allowed us to account for repeated measures and handle unbalanced data, future studies with larger and more balanced samples will be essential to validate and extend these results.
Interestingly, a recent fMRI study from our group also investigated potential differences in brain network activation during continuous speech-in-noise processing in autism (Ross et al., 2024). Despite using a naturalistic audiovisual paradigm across a broad age range (8–40 years), the results revealed similar large-scale activation patterns between ASD and TD participants. At first glance, this may seem inconsistent with the current findings. However, in our EEG study, although no significant group differences emerged in the modeled brain responses (i.e., TRFs) for either the acoustic envelope or phonetic features—and the corresponding current source density (CSD) maps were highly similar—prediction accuracy from participant-specific models was significantly lower in the autism group.
This apparent discrepancy underscores an important point: while average neural responses may appear comparable across groups, increased variability within the ASD population may mask meaningful differences when relying on traditional group-level analyses. Traditional methods that rely on averaging, such as TRF or fMRI activation analyses, may fail to detect subtle but functionally meaningful differences in neural processing. In contrast, participant-dependent models optimize the fit for each individual, reducing variability and enabling the identification of more nuanced differences in speech processing mechanisms while accounting for the neural heterogeneity of ASD.
4.2 Hierarchical theory and findings on AV gain for speech
Speech perception is a complex, hierarchical process through which the human brain extracts semantic meaning from dynamic sound pressure signals. This process involves a series of transformations that generate increasingly abstract representations of the signal, allowing for consistent understanding of speech despite variations in acoustics due to differences in speakers and environments. Acoustic information is integrated through a sophisticated network that extends from the cochlea, through various structures along the subcortical auditory pathways, to the auditory and motor cortices. Research on the functional neuroanatomy of multisensory speech processing has shown that visual information enhances auditory speech perception at every stage of speech processing (Aina Puce et al., 1998; Beauchamp, 2005; Callan et al., 2003; Calvert et al., 1999; Calvert, 1997, 2001; Iacoboni, 2008; Meister et al., 2007; Ojanen et al., 2005; Okada et al., 2013; Puce and Perrett, 2003; Ross et al., 2022). According to Peelle and Sommers (2015), audiovisual (AV) speech integration occurs in two stages: an early stage where visual speech provides temporal cues about the acoustic signal (‘prediction’), and a later stage where visual cues about place and manner of articulation integrate with acoustic information to aid lexical selection (‘constraint’). Early-stage integration enhances auditory sensitivity through direct projections from the visual to the auditory cortex (Calvert, 1997; Grant and Seitz, 2000; Okada et al., 2013; Tye-Murray et al., 2011), while later-stage integration occurs in regions like the superior temporal sulcus (STS), where visual and acoustic information are combined (Beauchamp et al., 2004; Karas et al., 2019; Kayser and Logothetis, 2009; Zhu and Beauchamp, 2017). Previous studies using continuous speech stimuli have demonstrated enhanced audiovisual speech processing compared to audio-alone conditions in neurotypical adults, with improvements for both the acoustic envelope (Crosse et al., 2015; Crosse, Di Liberto, and Lalor, 2016; O'Sullivan et al., 2021) and phonetic features (Di Liberto et al., 2015; O'Sullivan et al., 2021). Here, we showed reduced neural encoding of both the acoustic envelope and phonetic features in audiovisual speech-in-noise in individuals with autism, characterized by a statistically significant group difference in AV gain for the acoustic envelope and reduced phonetic prediction accuracy specifically under high-noise audiovisual conditions.
Our study included a relatively wide age range of participants. The benefit of visual information during audiovisual speech perception changes throughout development and impacts our experience of language (Pepper and Nuttall, 2023; Ross et al., 2011). Regarding the effect of age in the present study, we consistently observed that older participants performed better behaviorally, which aligns with developmental expectations. However, we did not find a significant effect of age on audiovisual gain in either behavioral or neural measures. This may be due to the relatively small sample size, as multisensory integration is known to improve with age. Surprisingly, however, at the neural level, we observed a decrease in TRF amplitude with age for both the acoustic envelope and phonetic features.
This may reflect developmental changes in the topography of the auditory evoked potential (AEP), which gets smaller in amplitude over temporal scalp regions and larger over fronto-central scalp regions over the course of childhood development ( Gomes et al., 2001). In parallel, we found a decline in both reconstruction and prediction accuracy with age, which may stem from this reduction in TRF amplitude. Further studies using larger sample groups could help to further clarify at which stage of the speech processing hierarchy these multisensory deficits first appear in autism and how they evolve with development.
4.3 Atypical neuro-oscillations and audiovisual speech processing in autism
Although our results show clear group differences, the precise neural mechanisms underlying these deficits remain unclear. In part these may reflect attentional and subsequent learning differences, as considered below. However, there is also evidence for impaired neuro-oscillatory function in ASD that could contribute to altered multisensory speech processing. We recently proposed that dysfunctional cross-sensory oscillatory neural communication may be one key pathway to impaired multisensory processing in ASD ( Beker et al., 2017). In the context of speech, prior studies have implicated disruptions in oscillatory dynamics, particularly reduced theta-band (4–7 Hz) activity in the auditory cortex of individuals with autism ( Jochaut et al., 2015; Wang et al., 2023). Theta oscillations, which track syllable onsets, are thought to play a critical role in organizing neural activity into syllable-based integration windows that facilitate higher-level speech processing ( Ghitza, 2011). Jochaut et al. (2015) observed atypical theta–gamma coupling in the auditory cortex of individuals with autism while they watched and listened to a TV program. Unlike typically developing individuals—who show a downregulation of gamma activity by theta oscillations—both frequency bands increased simultaneously in the autism group. Importantly, this abnormal coupling pattern was linked to verbal impairments. Complementary findings from the same group ( Wang et al., 2023) showed that in very young children with autism, the expected theta–gamma coupling was diminished and replaced by abnormal beta–gamma coupling during audiovisual cartoon viewing.
Together, these findings suggest that atypical temporal neural dynamics and impaired long-range oscillatory coordination ( Beker et al., 2017) may underlie the neural deficits observed in AV speech processing in autism. Further research is needed to clarify whether these represent multisensory effects (with only an audiovisual condition, the respective contributions of visual and auditory inputs could not be assessed), how these disrupted mechanisms operate in naturalistic, noisy environments, and whether they represent viable targets for intervention.
4.4 Visual attention and neural prediction accuracy
It is important to address whether the observed reduction in audiovisual (AV) speech processing in ASD reflects a genuine deficit in multisensory integration or is simply a consequence of impaired neural processing of visual-only speech. To explore this, a visual-only condition was included in the experimental design, with the original plan being to compare neural measures of AV speech against an additive model (A + V), as proposed by Crosse et al. (2015), to quantify the "true" MSI effect—an approach commonly used in classical ERP studies (Besle et al., 2004; Molholm et al., 2002). Eye-tracking data revealed that during audiovisual speech, individuals with autism and typically developing (TD) participants exhibited similar fixation patterns, spending comparable amounts of time looking at the speaker's mouth. However, in the visual-only condition, individuals with autism spent significantly less time fixating on the speaker's mouth. Since fixating on articulatory movements is critical for leveraging visual speech information during AV speech perception (Tan et al., 2023), this lack of attention in the visual-only condition likely contributed to reduced neural prediction accuracy. Given these insights, the visual-only condition was excluded from the analysis of multisensory processing, and instead, the comparison focused on neural measures of audiovisual and auditory speech. Importantly, when we included the time spent looking at the speaker's mouth in the audiovisual condition as a covariate, differences in audiovisual gain between TD and ASD participants remained. This finding reinforces the interpretation that the observed deficits in audiovisual processing are not simply a result of reduced visual attention. Although we controlled for gaze behavior to ensure comparable visual input during audiovisual trials, it remains possible that differences in visual speech processing—not just gaze allocation—contribute to the reduced audiovisual gain observed in ASD. Future studies are needed to disentangle the relative contributions of visual processing deficits and integration impairments to audiovisual speech processing.
4.5 The role of attention in audiovisual speech processing
Although the results indicate that differences in audiovisual (AV) gain are not directly linked to visual attention, the observed deficits in individuals with autism may still be associated with broader attentional differences between groups. Research on the "cocktail party effect"—the ability to focus on a single speaker in a noisy environment—has shown that neural tracking of speech is modulated by attention ( O'Sullivan et al., 2015; Power et al., 2012). Individuals with autism often exhibit reduced attention to social stimuli, such as faces and speech cues ( Constantino et al., 2017; Dawson et al., 2004; Klin, 1991; Santapuram et al., 2022), which may result in fewer multisensory speech experiences in naturalistic environments. This reduced exposure would lead to fewer opportunities to learn visual-articulatory to speech sound correspondences, impacting the brain's ability to effectively track and integrate AV speech cues, potentially contributing to the reduced neural tracking observed in this study. In this context, the observed deficits may reflect a long-term consequence of attentional biases in daily life, rather than an immediate difference in attention allocation during the experimental task. This hypothesis is consistent with computational models suggesting that multisensory integration (MSI) maturation depends on repeated exposure to multisensory stimuli, and that delays in exposure can result in prolonged integration deficits in individuals with autism ( Cuppini et al., 2017). Accordingly, group differences in multisensory benefits from AV speech processing may gradually diminish with increased exposure and learning over time, highlighting the importance of experience-driven mechanisms in the development of MSI. While we did not find direct evidence for this relationship—such as an interaction between age and group for audiovisual gain—our results indicate that audiovisual speech processing deficits persist across a broad age range (from late childhood to young adulthood, ages 8–22). Future research using the current approach and focusing on narrower age groups will be important to understanding how audiovisual speech integration evolves over time in individuals with autism, to impact speech and language comprehension and communication abilities.
4.6 Intact auditory cortical processing of speech-in-noise in autism
We found that for auditory-alone speech, both behavioral and neural measures of speech processing decrease similarly for both ASD and TD individuals as SNR decreases. This aligns with most previous behavioral speech-in-noise studies using sentence-level stimuli in young individuals with autism (Alcantara et al., 2004; Ross et al., 2024; Ruiz Callejo and Boets, 2023; Smith and Bennetto, 2007). It is particularly noteworthy that neural processing of auditory speech is also similar between ASD and TD individuals. As in the audiovisual condition, we observed a reduction in the amplitude of the modeled brain response to both speech features as SNR decreased for both groups. This finding is consistent with other studies using linear models and the acoustic envelope in TD adults (Crosse, Di Liberto, and Lalor, 2016; Muncke et al., 2022), as well as with classical ERP studies (Russo et al., 2009) showing reduced amplitude and increased latency of speech-evoked responses in noise. The reduction in neural processing of speech with decreasing SNR is similar across ASD and TD individuals, both for the acoustic envelope (using backward and forward models) and for phonetic features (using forward models). These results are significant since current neurophysiological evidence related to auditory speech processing in autism has reported inconsistent findings (Key and D'Ambrose Slaboch, 2021; Schwartz et al., 2018). One explanation for the variability of results could be the different types of speech stimuli used between studies, primarily vowels, syllables, or word-level stimuli. However, these previous results suggest that atypical auditory speech processing becomes more apparent in ASD with increasing stimulus complexity (from vowels to multisyllabic stimuli) and task complexity (such as semantic comprehension; for review, see Key and D'Ambrose Slaboch, 2021). Here, we showed that with complex, natural speech stimuli, the neural processing of auditory speech-in-noise appears to be intact in individuals with autism.
5 Conclusions
Our study extends the understanding of multisensory speech processing deficits in ASD by investigating continuous audiovisual speech perception in noisy conditions, closely mimicking real-life communication. Using high-density EEG recordings and linear encoding/decoding models, we assessed the neural processing of speech features at both basic (acoustic envelope) and more abstract (phonetic features) levels during auditory-only and audiovisual speech across different SNRs. Our key behavioral and neurophysiological findings reveal a significant reduction in the multisensory facilitation of speech processing during audiovisual speech for individuals with ASD compared to their TD counterparts. This reduction in AV gain is evident despite intact auditory cortical processing of speech-in-noise in ASD. Overall, our findings underscore the importance of considering multisensory deficits in understanding communication challenges faced by individuals with ASD and highlight the need for interventions targeting these deficits to improve communication outcomes in challenging hearing environments.
Data & code availability
The raw EEG data that support the findings of this study are available from the corresponding author upon reasonable request. Analyses were conducted using custom Python scripts available on GitHub.
CRediT authorship contribution statement
Theo Vanneau: Writing – original draft, Visualization, Investigation, Formal analysis, Data curation. Michael J. Crosse: Writing – review & editing, Software, Investigation, Data curation, Conceptualization. John J. Foxe: Writing – review & editing. Sophie Molholm: Writing – review & editing, Supervision, Investigation, Conceptualization.
Declaration of competing interest
The authors declare no competing interests.
Acknowledgments
The authors thank Aida Davila for her assistance with the recording and analysis of pilot data and Giovanni Di Liberto for his valuable methodological guidance. This project was supported by funding from
Supplementary materials
Supplementary material associated with this article can be found, in the online version, at