Recent advances in the cognitive neuroscience of language have embraced naturalistic stimuli such as movies and audiobooks. However, most open-access neuroimaging datasets still focus on single-speaker scenarios, falling short of capturing the complexity of real-life, multi-speaker communication. To address this gap, we present the BABA functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) dataset, in which two independent cohorts (N = 30 each) viewed the same 25-minute excerpt from a Chinese reality TV show—one group in the fMRI scanner and the other in the MEG scanner. Set in a rural village, the show features 11 speakers, including five father–child pairs, engaging in spontaneous, emotionally rich dialogue with overlapping speech, rapid turn-taking, and natural interruptions. The combined use of fMRI and MEG allows researchers to explore both spatial and temporal aspects of language processing. This resource opens new avenues for studying neural mechanisms of multi-speaker comprehension, attentional shifts, and authentic social communication.
Background
Recent research in the cognitive neuroscience of language has shifted towards naturalistic paradigms, using stimuli such as movies, podcasts, and audiobooks1–8. In contrast to traditional controlled experiments, which often involve highly constrained and artificial linguistic inputs, naturalistic paradigms provide a richer, more ecologically valid window into how people comprehend language in everyday life. These approaches preserve the temporal dynamics, contextual richness, and social complexity inherent in natural communication, making them especially valuable for understanding language processing beyond the laboratory. However, most existing neuroimaging datasets have focused on single-speaker or two-speaker scenarios9–11. In contrast, real-life communication frequently occurs in multi-speaker environments, such as group conversations and family interactions. Processing language in these contexts likely engages additional cognitive and neural mechanisms, including the allocation of attention across speakers, the monitoring of speaker identity, and the integration of multiple discourse streams12. Rapid speaker switching, a common feature of group dialogue, poses unique challenges for the brain. It requires listeners to update predictions dynamically, reorient attention, and maintain multiple mental representations of different speakers’ perspectives and narrative threads13,14. Understanding how the brain manages these demands is crucial for developing more comprehensive models of natural language comprehension and has implications for fields ranging from the cognitive neuroscience of language and social cognition to developmental science and computational modeling. Despite its importance, this area remains underexplored, highlighting the need for datasets and paradigms that reflect the true complexity of real-world communication.
Here, we present the BABA functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) dataset15, collected from native Chinese speakers as they watched a 25-minute excerpt from the Chinese reality TV show “Where Are We Going, Dad? (Season 1)” (“baba” means “dad” in Chinese, which inspired the name of the dataset) while undergoing neuroimaging. Originally aired in 2013, the show features unscripted interactions between five fathers and their young children as they travel to a rural village and engage in everyday activities, offering a rich and dynamic naturalistic stimulus that closely mirrors real-world communication. Participants completed five multiple-choice questions following the video to ensure their engagement and comprehension (see Table S1). In addition, we collected 15 minutes of resting-state data immediately after the video, during which participants lay still in the scanner and were instructed to recollect and reflect on the content of the video. This post-stimulus resting period provides a unique opportunity to examine memory consolidation, post-comprehension processes, and the lingering effects of complex language exposure on brain activity. The dataset includes recordings from both fMRI and MEG modalities, allowing researchers to investigate the spatial localization and fine-grained temporal dynamics of multi-speaker, emotionally nuanced, and socially embedded communication.
Validation analyses confirm the dataset’s high quality and consistency across participants. Notably, we observed significant activation in the right temporoparietal junction (TPJ) during speaker-switching events, a region implicated in attentional reorienting to task-relevant, yet currently unattended stimuli16. These findings highlight the TPJ’s central role in processing conversational dynamics and support the utility of this dataset for advancing our understanding of naturalistic, multi-talker language comprehension in the brain. The dataset also includes comprehensive annotations for the video, covering transcribed speech for each speaker and detailed word-level and phrase-level annotations, such as log frequency, part-of-speech (POS) tags, and the number of parser actions for each word derived from the Stanford Parser17 using bottom-up, top-down, and left-corner parsing strategies18. These rich annotations enable fine-grained analysis of how specific acoustic and linguistic properties influence neural processing in complex auditory environments. All data and annotations are provided in standardized formats to ensure accessibility and reproducibility (see Fig. 1 for an overview of the data collection procedures, preprocessing steps, technical validation of the neuroimaging data, and annotation processes).
Fig. 1 [Images not available. See PDF.]
Overview of the fMRI and MEG data collection and analysis procedures. Different participants performed identical tasks in the fMRI and MEG scanners (green). Both fMRI and MEG data underwent preprocessing (purple) and were evaluated using quality control methods (yellow). Audio and transcribed speech from the video stimuli were annotated with acoustic and linguistic features (blue).
In sum, the BABA dataset15 offers several advantages for advancing research in language, social cognition, and auditory attention. It enables the investigation of neural mechanisms of multi-talker speech comprehension, including how the brain reallocates attentional resources across multiple interlocutors. The parent-child interactions also allow exploration of developmentally and emotionally grounded communication, which may engage unique neural circuits compared to adult-adult conversation.
Methods
Participants
Thirty participants (17 females, mean age = 23.17 ± 2.31 years) were recruited for the fMRI experiment at Shanghai International Studies University, Shanghai, China. An additional thirty participants (16 females, mean age = 22.67 ± 1.99 years) were recruited from the West China Hospital of Sichuan University, Chengdu, China for the MEG experiment. All participants were right-handed, had normal or corrected-to-normal vision, and reported no history of neurological disorders (see Table 1 for the participants’ demographic information). Before the experiment, all participants provided written informed consent and were compensated for their participation.
Table 1. Participants’ demographic information.
Participant ID from fMRI experiment | Sex | Age | Participant ID from MEG experiment | Sex | Age |
|---|---|---|---|---|---|
sub-01 | F | 24 | sub-31 | F | 24 |
sub-02 | F | 20 | sub-32 | M | 24 |
sub-03 | F | 22 | sub-33 | F | 27 |
sub-04 | M | 25 | sub-34 | F | 22 |
sub-05 | F | 25 | sub-35 | F | 21 |
sub-06 | F | 24 | sub-36 | F | 22 |
sub-07 | F | 24 | sub-37 | F | 21 |
sub-08 | F | 25 | sub-38 | F | 21 |
sub-09 | F | 21 | sub-39 | F | 24 |
sub-10 | F | 25 | sub-40 | F | 26 |
sub-11 | M | 25 | sub-41 | F | 24 |
sub-12 | F | 23 | sub-42 | M | 21 |
sub-13 | F | 19 | sub-43 | M | 20 |
sub-14 | F | 22 | sub-44 | M | 19 |
sub-15 | F | 22 | sub-45 | M | 21 |
sub-16 | M | 19 | sub-46 | F | 24 |
sub-17 | M | 21 | sub-47 | M | 23 |
sub-18 | M | 24 | sub-48 | F | 25 |
sub-19 | F | 25 | sub-49 | M | 20 |
sub-20 | M | 20 | sub-50 | M | 21 |
sub-21 | F | 23 | sub-51 | F | 22 |
sub-22 | F | 21 | sub-52 | F | 22 |
sub-23 | M | 24 | sub-53 | F | 22 |
sub-24 | M | 24 | sub-54 | M | 24 |
sub-25 | F | 21 | sub-55 | F | 21 |
sub-26 | M | 25 | sub-56 | M | 22 |
sub-27 | M | 24 | sub-57 | M | 22 |
sub-28 | M | 24 | sub-58 | M | 24 |
sub-29 | M | 24 | sub-59 | M | 26 |
sub-30 | M | 30 | sub-60 | M | 25 |
Stimuli
The video stimulus was extracted from the first episode of the Chinese reality TV show “Where Are We Going, Dad? (Season 1)” (openly available at https://www.youtube.com/watch?v=ZgRdRHmYuN8), which originally aired in 2013. The show features unscripted interactions between fathers and their children as they travel to a rural village and engage in daily activities. The selected excerpt has a total duration of 25 minutes and 19 seconds. The original video had a resolution of 640 × 368 pixels and a frame rate of 15 frames per second. It was presented in full-color (RGB) format, without embedded subtitles or captions.
Experimental design
The experimental procedures for the fMRI and MEG experiments were identical. Participants watched the video while inside the scanner. The video was presented via a mirror in both the fMRI and MEG scanners. Audio was delivered through MRI-compatible headphones (Sinorad, Shenzhen, China) during the fMRI experiment and MEG-compatible insert earphones (ComfortBuds 24, Sinorad, Shenzhen, China) during the MEG experiment. Following the video, participants were visually presented with 5 multiple-choice questions on the screen to assess their comprehension and ensure engagement with the stimuli (see Table S1). Participants responded using a button press, with a maximum response time of 10 seconds per question. If no response was recorded within this time, the experiment proceeded to the next question automatically. After the quiz, participants were instructed to close their eyes for 15 minutes without an explicit task. This period allowed for the recording of neural activity, capturing spontaneous mental replay of the video stimulus. The entire experimental procedure lasted approximately 45 minutes per participant. The fMRI experiment was approved by the Ethics Committee of Shanghai Key Laboratory of Brain-Machine Intelligence for Information Behavior (No. 2024BC028), and the MEG experiment was approved by the West China Hospital of Sichuan University Biomedical Research Ethics Committee (No. 2024[657]).
Data acquisition
The fMRI data were collected in a 3.0 T Siemens Prisma MRI scanner at Shanghai International Studies University, Shanghai. Anatomical scans were obtained using a Magnetization Prepared RApid Gradient-Echo (MP-RAGE) ANDI iPAT2 pulse sequence with T1-weighted contrast (192 single-shot interleaved sagittal slices with A/P phase encoding direction; voxel size = 1 × 1 × 1 mm; FOV = 256 mm; TR = 2300 ms; TE = 2.98 ms; TI = 900 ms; flip angle = 9°; acquisition time = 6 min; GRAPPA in-plane acceleration factor = 2). Functional scans were acquired using T2-weighted echo planar imaging (63 interleaved axial slices with A/P phase encoding direction; voxel size = 2.5 × 2.5 × 2.5 mm; FOV = 220 mm; TR = 2000 ms; TE = 30 ms; acceleration factor = 3; flip angle = 60°).
MEG data were recorded at West China Hospital of Sichuan University in Chengdu, China, using a 64-channel optically pumped magnetometer (OPM) MEG system (Quanmag, Beijing, China). The system consists of 64 single-axis OPM sensors (radial direction, fixed helmet) with a 1000 Hz sampling rate, <20 fT/√Hz sensitivity, and >100 Hz bandwidth. Each sensor (16 × 19 × 66 mm³) contains a 4 × 4 × 4 mm³ rubidium vapor cell and an integrated laser. The sensitive volume is located ~6 mm from the sensor’s outer surface. Sensors were mounted on a rigid, adult-sized helmet providing full-brain coverage. The system was housed in a six-layer magnetically shielded cylinder (1.5 mm permalloy, 10 mm aluminum), with residual magnetic field ≤1 nT and typical system noise of 20–30 fT/√Hz. Participants lay on a scanning bed inserted into the cylinder, wearing air-conduction headphones during the auditory task. Sensor positions were fixed by the helmet geometry, without additional digitization19. OPM-MEG is a new type of MEG instrumentation that offers several advantages over conventional MEG systems. These include higher signal sensitivity, improved spatial resolution, and more uniform scalp coverage. Additionally, OPM-MEG allows for greater participant comfort and compliance, supports free movement during scanning, and features lower system complexity, making it a promising tool for more flexible and accessible neuroimaging20. The MEG data were sampled at 1,000 Hz and bandpass-filtered online between 0 and 500 Hz. To facilitate source localization, T1-weighted MRI scans were acquired from the participants using a 3.0 T Siemens TrioTim MRI scanner at West China Hospital of Sichuan University (176 single-shot interleaved sagittal slices with A/P phase encoding direction; voxel size = 1 × 1 × 1 mm; FOV = 256 mm; TR = 1900 ms; TE = 2.3 ms; TI = 900 ms; flip angle = 9°; acquisition time = 7 min). All participants provided written informed consent outlining the experimental procedures and the data sharing plan prior to participation. They were compensated for their time and contribution.
Data preprocessing
All Digital Imaging and Communications in Medicine (DICOM) files of the raw fMRI data were first converted into the Brain Imaging Data Structure (BIDS) format using dcm2bids (v3.1.121) and subsequently transformed into Neuroimaging Informatics Technology Initiative (NIfTI) format via dcm2niix (v1.0.2022050522). Facial features were removed from anatomical images using PyDeface (v2.0.2). Preprocessing was carried out with fMRIPrep (v20.2.023), following standard neuroimaging pipelines. For anatomical images, T1-weighted scans underwent bias field correction, skull stripping, and tissue segmentation into gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF). These images were then spatially normalized to the Montreal Neurological Institute (MNI) space using the MNI152NLin2009cAsym:res-2 template, ensuring consistent alignment across participants. Functional MRI preprocessing included skull stripping, motion correction, slice-timing correction, and co-registration to the T1-weighted anatomical reference. For each BOLD run, head-motion parameters with respect to the BOLD reference (transformation matrices and six corresponding rotation and translation parameters) were estimated before any spatiotemporal filtering using ‘mcflirt’ (FSL 5.0.9), and slice-timing correction was applied using 3dTshift (AFNI 20160207). Co-registration to the anatomical image was performed with flirt using boundary-based registration (6 degrees of freedom). No susceptibility distortion correction was applied. Confound regressors included motion parameters (and their derivatives/quadratics), framewise displacement (FD), DVARS, global signals, and t/aCompCor components computed from white matter and CSF after high-pass filtering (128 s cutoff). Volumes exceeding FD > 0.5 mm or standardized DVARS > 1.5 were flagged as motion outliers. All transforms were applied in a single interpolation step using antsApplyTransforms with Lanczos interpolation. We further performed spatial smoothing on the preprocessed fMRI data (post-fMRIPrep) using an isotropic Gaussian kernel with an 8 mm FWHM. However, the versions uploaded to OpenNeuro remain unsmoothed so that researchers can choose whether to apply smoothing.
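For users who prefer to work with smoothed data, a minimal sketch of this step is given below, assuming nilearn and the BIDS-derivatives file naming described in the Data Records section; the subject and output file names are illustrative.

```python
# Minimal sketch: apply 8 mm FWHM isotropic Gaussian smoothing to an
# fMRIPrep output, since the files shared on OpenNeuro are unsmoothed.
# File paths are illustrative.
from nilearn import image

preproc_bold = (
    "derivatives/sub-01/func/"
    "sub-01_task-baba_desc-preproc_bold.nii.gz"
)

smoothed = image.smooth_img(preproc_bold, fwhm=8)
smoothed.to_filename("sub-01_task-baba_desc-smoothed_bold.nii.gz")
```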
MEG data preprocessing was conducted using MNE-Python (v1.8.024). We first applied a bandpass filter (1–38 Hz) to remove low-frequency drifts and high-frequency noise. We then identified bad channels through visual inspection and cross-validated them using PyPREP (v0.4.325); these bad channels were interpolated to maintain data integrity. To mitigate physiological artifacts, we performed independent component analysis (ICA) and removed components corresponding to heartbeat and eye movements. The data were then segmented into three task-related epochs corresponding to the video watching, question answering, and post-task replay conditions. We computed the noise covariance from the mean over each full epoch because, in our naturalistic movie paradigm, audiovisual stimulation is continuous, leaving no “clean” pre-stimulus interval from which to estimate a noise covariance matrix. Prior MEG work using naturalistic designs has not converged on a single baseline strategy: some apply no correction26, some use empty-room recordings27, some use the full epoch28, and others do not report a baseline procedure29–31. Because we lack empty-room data and prefer correction over no correction, we estimated the noise covariance from the full epoch, which effectively subtracts the mean distribution in the specified time interval (the “baseline”). T1-weighted MRI data were converted to NIfTI format and processed with FreeSurfer (v7.3.232) to reconstruct cortical surfaces and generate boundary element model (BEM) surfaces using a single-layer conductivity of 0.3 S/m. MEG-MRI coregistration was performed with fiducial points and refined via MNE-Python’s graphical interface. A source space (resolution = 5 mm) was generated using a fourth-order icosahedral mesh, and a BEM solution was computed to model head conductivity. A forward model was then created based on anatomical MRI and digitized head shape. Noise covariance matrices were estimated from raw MEG recordings, and inverse operators were constructed using minimum norm estimation (SNR = 3). Source reconstruction employed dynamic statistical parametric mapping (dSPM) for noise-normalized estimates. Task-related epochs (video watching, question answering, post-task replay) were used to compute source estimates, which were morphed onto the FreeSurfer average brain template for group-level comparisons.
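A minimal MNE-Python sketch of the main preprocessing steps is shown below; the file path, bad-channel names, ICA component indices, and the fixed-length epoching used for the covariance estimate are illustrative assumptions rather than the exact parameters used here.

```python
# Sketch of the MEG preprocessing steps described above, using MNE-Python.
import mne

raw = mne.io.read_raw_fif("sub-31_task-baba_meg.fif", preload=True)  # illustrative path

# Band-pass filter (1-38 Hz) to remove slow drifts and high-frequency noise.
raw.filter(l_freq=1.0, h_freq=38.0)

# Interpolate channels marked as bad (channel names are hypothetical).
raw.info["bads"] = ["MEG019", "MEG048"]
raw.interpolate_bads()

# ICA to remove cardiac and ocular components (component indices are illustrative).
ica = mne.preprocessing.ICA(n_components=20, random_state=0)
ica.fit(raw)
ica.exclude = [0, 3]
ica.apply(raw)

# Continuous stimulation leaves no clean pre-stimulus interval, so the noise
# covariance is estimated over the full epoch (baseline = entire epoch).
events = mne.make_fixed_length_events(raw, duration=10.0)
epochs = mne.Epochs(raw, events, tmin=0.0, tmax=10.0,
                    baseline=(None, None), preload=True)
noise_cov = mne.compute_covariance(epochs, method="empirical")
```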
Annotations
We first used Whisper Large-V333 to transcribe human speech into text with timestamps. We then manually reviewed and segmented the transcripts, resulting in 738 sentences and 2,930 words. Each sentence was annotated with one of 13 speaker labels, including the 11 main speakers, a general “villagers” label for all villagers, and a general “staff” label for all staff members of the TV show (see example in Fig. 2a). Additionally, we identified and isolated background audio events without speech, categorizing them as 57 instances of natural ambient sounds (“sound”) and 153 instances of artificial sound effects (“sound effect”). The demographic information for each speaker, along with their total speaking duration, total number of spoken words, and speaking rate, is shown in Fig. 2b.
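The transcription step can be reproduced with a few lines of Python; the sketch below assumes the openai-whisper package and an audio track extracted from the stimulus video (the path is illustrative), and its output was subsequently reviewed and segmented manually.

```python
# Sketch of the automatic transcription with Whisper Large-V3.
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("sourcedata/stimuli/baba_audio.wav",  # illustrative path
                          language="zh", word_timestamps=True)

# Print segment-level timestamps and text for manual review.
for segment in result["segments"]:
    print(f'{segment["start"]:.2f}-{segment["end"]:.2f}\t{segment["text"]}')
```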
Fig. 2 [Images not available. See PDF.]
Experimental procedure and the annotations of the audio. (a) Participants watched the video while undergoing neuroimaging. The video frames, audio waveforms and corresponding transcripts are shown, with different colors indicating different speakers. Participants then answered 5 multiple-choice quiz questions and rested for 15 minutes in the scanner. (b) Speaker information, including each speaker’s total speech duration, total number of words, and speaking rate in the video stimulus. (c) Acoustic features, including pitch (f0) and root-mean-square (RMS) intensity, extracted at every 10 ms of the audio. (d) Each word was annotated with speaker identity, timing (onset, offset, duration), log unigram frequency, part-of-speech (POS) tags, and syntactic complexity measures from different parsing strategies.
Acoustic features, including pitch (f0) and root-mean square (RMS) intensity, were extracted at 10 ms intervals using Praat-Parselmouth (v0.4.434), as shown in Fig. 2c. Additionally, a binary indicator was assigned to mark speaker switching between different speech segments. Linguistic annotations were applied to the manually segmented speech transcript, integrating multiple levels of information (Fig. 2d). Lexical frequency was estimated by applying a log transformation to unigram frequency values obtained from the Google Books Ngram Viewer dataset (https://books.google.com/ngrams). Part-of-speech (POS) tags were assigned using the spaCy Chinese pipeline (v3.7.535). To assess syntactic complexity, the number of parsing actions required to process each word within a sentence was derived from constituency trees generated by the Stanford Parser (v3.9.217), following three parsing strategies: top-down, bottom-up, and left-corner.
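The following sketch illustrates how these acoustic and lexical annotations can be generated, assuming Praat-Parselmouth and the spaCy Chinese pipeline; the audio path and the specific spaCy model (zh_core_web_sm) are assumptions for illustration.

```python
# Sketch of the acoustic and part-of-speech annotation steps.
import parselmouth
import spacy

snd = parselmouth.Sound("sourcedata/stimuli/baba_audio.wav")  # illustrative path

# Pitch (f0) and intensity contours sampled every 10 ms.
pitch = snd.to_pitch(time_step=0.01)
intensity = snd.to_intensity(time_step=0.01)
f0 = pitch.selected_array["frequency"]   # Hz; 0 where unvoiced
rms_db = intensity.values.squeeze()      # intensity contour in dB

# Part-of-speech tags for a transcript sentence (model choice is illustrative).
nlp = spacy.load("zh_core_web_sm")
doc = nlp("爸爸去哪儿")
pos_tags = [(token.text, token.pos_) for token in doc]
```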
Data Records
The dataset is available on the OpenNeuro repository15 (https://openneuro.org/datasets/ds005346) and follows the BIDS format (v1.8.0). All data were fully anonymized by removing identifiable information and anatomical details. We present both raw and preprocessed fMRI and MEG data, accompanied by relevant metadata, including participant information, behavioral results, and text and audio annotations. Figure 3 presents an overview of the dataset’s directory structure.
Fig. 3 [Images not available. See PDF.]
Structure of the dataset directories. (a) Folder layout for both raw and processed fMRI and MEG data. (b) Folder layout for audio and text annotations, video stimuli, and behavioral results for the quiz questions.
Annotation files
Location: sourcedata/annotation/transcripts.csv
sourcedata/annotation/wav_acoustic.csv
sourcedata/annotation/word_information.csv
File format: comma-separated value.
Annotation of acoustic and linguistic features for the audio and text of the stimuli.
Quiz files
Location: sourcedata/quiz/quiz_questions.csv
sourcedata/quiz/quiz_behavioral_results.csv
File format: comma-separated value.
Five multiple-choice questions and participants’ accuracy scores from the fMRI and MEG experiments.
Stimuli files
Location: sourcedata/stimuli/baba_video.mp4
File format: mp4 (MPEG-4 Part 14).
The video file corresponds to the first episode of the Chinese reality TV show “Where Are We Going, Dad? Season 1” (2013).
Anatomical MRI files
Location: sub-<ID>/anat/sub-<ID>_rec-defaced_T1w.nii.gz
File format: NIfTI, gzip-compressed.
Sequence protocol: sub-<ID>/anat/sub-<ID>_rec-defaced_T1w.json
Preprocessed data:
derivatives/sub-<ID>/anat/sub-<ID>_desc-preproc_T1w.nii.gz
fMRI data files
Location: sub-<ID>/func/sub-<ID>_task-[baba, question, replay, rest]_bold.nii.gz
File format: NIfTI, gzip-compressed.
Sequence protocol: sub-<ID>/func/sub-<ID>_task-[baba, question, replay, rest]_bold.json
Preprocessed data:
derivatives/sub-<ID>/func/sub-<ID>_task-[baba, question, replay, rest]_desc-preproc_bold.nii.gz
MEG data files
Location: sub-<ID>/meg/sub-<ID>_task-[baba, question, replay]_meg.fif
File format: FIF
Sequence protocol: sub-<ID>/meg/sub-<ID>_task-[baba, question, replay]_meg.json
Preprocessed data:
derivatives/sub-<ID>/meg/sub-<ID>_task-[baba, question, replay]_desc-preproc_meg.fif
Anatomical MRI files from the participants involved in the MEG experiment
Location: sub-<ID>/anat/sub-<ID>_rec-defaced_T1w.nii.gz
File format: NIfTI, gzip-compressed.
Sequence protocol: sub-<ID>/anat/sub-<ID>_rec-defaced_T1w.json
Technical Validation
Behavioral results
The mean accuracy on the quiz questions was 79.33% ± 19.29% for the fMRI experiment and 85.33% ± 18.14% for the MEG experiment. The mean log-transformed reaction times (RTs) were 1.29 ± 0.29 seconds and 1.41 ± 0.27 seconds, respectively (see Figure S1). These results suggest that participants were attentive to the video content during scanning and responded to the quiz questions both accurately and promptly.
fMRI data quality
Data quality was assessed by running MRIQC36 (MRI Quality Control) on the preprocessed data, which quantifies factors such as signal fidelity, noise levels, and anatomical precision. We also computed inter-subject correlation (ISC2) on the fMRI data to examine the consistency of neural activity across participants. Additionally, whole-brain general linear modeling (GLM) was performed to explore neural responses to the pitch and intensity of the audio and to speaker switching in the video.
MRIQC
Image-quality metrics (IQMs) from MRIQC for the raw and preprocessed anatomical and functional data are shown in Fig. 4a,b, indicating high overall data integrity. Low coefficient of joint variation (CJV) for gray and white matter, along with a low entropy-focus criterion (EFC), suggests minimal head motion and few artifacts. High contrast-to-noise ratio (CNR) and foreground-background energy ratio (FBER) indicate clear tissue differentiation and well-balanced image energy distribution. The white matter to maximum intensity ratio (WM2MAX) confirms that white matter intensity is within the expected range, while strong signal-to-noise ratio (SNR) for both gray and white matter further supports good data quality. The functional MRI data exhibited a high temporal signal-to-noise ratio (tSNR) and low temporal derivative of time courses of root mean square over voxels (DVARS), indicating strong temporal stability. Minimal head motion and artifacts were reflected in low average framewise displacement (FD) and AFNI’s outlier ratio (AOR). Additionally, low mean full width at half maximum (FWHM) and AFNI’s quality index (AQI) suggest a well-distributed and consistent image intensity across the brain. Low global correlation (GCOR) values and a small number of dummy scans further confirm the effectiveness of correction procedures and the stability of the initial scan state.
Fig. 4 [Images not available. See PDF.]
Quality assessment of fMRI data. (a) Group-level Image-quality metrics (IQMs) of anatomical MRI data through MRIQC. (b) Group-level IQMs of functional MRI data through MRIQC.
ISC of fMRI timecourses
We computed ISC for each voxel’s time series across participants during video watching. Specifically, each subject’s voxel-wise time series was correlated with the mean time series of the same voxel across all other participants (leave-one-out approach). This analysis produced a group-level ISC map, highlighting regions with the strongest correlations, particularly in the left temporal and bilateral occipital lobes (see Fig. 5).
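A minimal sketch of this leave-one-out procedure is given below; `data` stands for a hypothetical array of voxel time series with shape (n_subjects, n_timepoints, n_voxels).

```python
# Leave-one-out inter-subject correlation (ISC) per voxel.
import numpy as np

def leave_one_out_isc(data):
    """data: array of shape (n_subjects, n_timepoints, n_voxels)."""
    n_subj = data.shape[0]
    isc = np.zeros((n_subj, data.shape[2]))
    for s in range(n_subj):
        left_out = data[s]                                   # (time, voxels)
        group_mean = data[np.arange(n_subj) != s].mean(axis=0)
        # Pearson correlation per voxel between the left-out subject
        # and the mean time series of all other subjects.
        lo = (left_out - left_out.mean(0)) / left_out.std(0)
        gm = (group_mean - group_mean.mean(0)) / group_mean.std(0)
        isc[s] = (lo * gm).mean(axis=0)
    return isc  # (n_subjects, n_voxels); average over subjects for a group map
```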
Fig. 5 [Images not available. See PDF.]
Brain regions exhibiting significant inter-subject correlation (ISC) during video viewing in the fMRI scanner.
GLM analysis
We performed a GLM analysis to investigate the BOLD responses to pitch, intensity, and speaker switching in the video. The pitch and intensity features were extracted every 10 ms from the audio and standardized via z-scoring. Speaker switching was modeled as a binary regressor, with the offset of each speaker switch marked as 1 in the transcript. These regressors were convolved with the canonical hemodynamic response function (HRF) and regressed against each participant’s voxel-wise BOLD activity. The resulting beta maps were then subjected to a second-level GLM, where statistical significance at the group level was evaluated through cluster-based inference, applying a threshold of p < 0.05 (see Fig. 6a for the GLM analysis pipeline). The GLM results showed significant bilateral activation in the superior temporal lobes for pitch and intensity, highly consistent with the activation maps for the terms “pitch” (102 studies) and “intensity” (428 studies) extracted from term-based fMRI meta-analyses in Neurosynth37 (see Fig. 6b).
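For illustration, the sketch below shows one way to set up such a first-level model with nilearn, treating each 10 ms feature sample as an amplitude-modulated event convolved with the canonical (Glover) HRF; the feature values, scan count, and file path are placeholders rather than the exact pipeline used here.

```python
# Sketch of a first-level GLM with HRF-convolved pitch, intensity, and
# speaker-switching regressors (placeholder data).
import numpy as np
import pandas as pd
from nilearn.glm.first_level import FirstLevelModel, compute_regressor

t_r, n_scans = 2.0, 760                          # illustrative values
frame_times = np.arange(n_scans) * t_r
feat_times = np.arange(0, n_scans * t_r, 0.01)   # 10 ms feature grid

features = {"pitch": np.random.randn(feat_times.size),            # placeholder
            "intensity": np.random.randn(feat_times.size),        # placeholder
            "switch": np.random.binomial(1, 0.01, feat_times.size)}

design = {}
for name, values in features.items():
    exp_condition = np.vstack([feat_times,
                               np.full_like(feat_times, 0.01),    # durations
                               values])                           # amplitudes
    reg, _ = compute_regressor(exp_condition, "glover", frame_times)
    design[name] = reg.ravel()
design = pd.DataFrame(design, index=frame_times)
design["constant"] = 1.0

glm = FirstLevelModel(t_r=t_r)
glm = glm.fit("derivatives/sub-01/func/sub-01_task-baba_desc-preproc_bold.nii.gz",
              design_matrices=design)
switch_map = glm.compute_contrast("switch")      # map for speaker switching
```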
Fig. 6 [Images not available. See PDF.]
GLM analyses for localizing brain responses to word rate, pitch, intensity, and speaker switching. (a) Overview of the GLM methods. (b) Brain maps showing significant activation clusters associated with each feature, alongside corresponding clusters from term-based fMRI meta-analyses using Neurosynth37.
As our video stimulus features multiple speakers, it provides a unique opportunity to examine the neural mechanisms underlying multi-talker conversation processing. To this end, we specifically analyzed brain activity at moments when the speaker switched during the video. The results revealed significant activation in the right temporoparietal junction (TPJ; see Fig. 6b). The TPJ is widely recognized as a key component of the ventral attention network, which is responsible for reorienting attention to task-relevant but currently unattended stimuli (see16 for a review). To further validate our findings, we extracted the activation map for the term “switching” (based on 193 studies) from Neurosynth. The meta-analytic results showed prominent clusters in the bilateral TPJ, which partially overlap with the speaker-switching activations observed in our data. This further suggests that speaker changes during naturalistic conversation engage domain-general cognitive control mechanisms, particularly those associated with task switching and attentional reorientation.
MEG data quality
To assess the quality and reliability of the MEG data, we calculated the proportion of bad channels and removed artifact-related ICA components for all participants. We then computed ISC on the time series of source-level MEG data to evaluate the consistency of neural responses across individuals during video viewing. Additionally, we conducted a regression analysis to identify significant spatiotemporal clusters associated with pitch, intensity, and speaker switching, providing insights into how low-level acoustic features and conversational dynamics are represented in the brain during naturalistic viewing. Since 6 participants exhibited power spectral density (PSD) patterns that diverged from those of the remaining 24 (10 females, mean age = 22.75 ± 1.94 years; see Figure S2), we excluded their data from the MEG ISC and regression analyses. However, all datasets remain available in the OpenNeuro repository for other researchers’ use.
Proportion of bad channels and excluded ICA components
Figure 7a illustrates the MEG sensor layout. Figure 7b (left) summarizes the distribution of bad channels, indicating overall good signal quality. Eleven participants had no bad channels, while for sixteen participants, Channel 19 and Channel 48 were consistently identified as bad due to equipment-related issues. Figure 7b (right) presents the number of excluded ICA components; the mean across participants was 2.20 ± 0.54 components.
Fig. 7 [Images not available. See PDF.]
Quality assessment of the MEG data. (a) MEG sensor layout. (b) Number of interpolated channels and excluded ICA components for all participants. (c) Vertices showing significant ISC during video viewing. (d) Source-level significant spatiotemporal clusters for pitch, intensity and switching from regression analysis. Shaded regions indicate significant time windows. * denotes p < 0.05, ** denotes p < 0.01 and *** denotes p < 0.001.
ISC of MEG timecourses
We performed ISC analysis for source-level MEG data. For each source, we first computed a group reference by averaging the time series across all participants during video viewing. We then correlated each participant’s source time series with that group reference. At the group level, we tested the significance of these correlation coefficients using cluster-based two-sample t-tests with 10,000 permutations38, which revealed brain regions showing significant inter-subject synchronization. We observed elevated ISC across a broad bilateral network—including temporal, parietal, and occipital cortices—largely mirroring our fMRI ISC findings.
MEG regression results
To examine MEG responses to pitch, intensity, and speaker switching, we performed a linear regression analysis using these three features as regressors to predict source-level MEG activity during word-level epochs. Specifically, we extracted −100 to 500 ms epochs time-locked to the onset of each word in the audio stream. For each word, we used the maximum pitch and intensity values within the word interval as representative features. We performed the regression at 20 ms intervals across the 600 ms epoch at each source for every participant, yielding three beta time courses corresponding to the three regressors: pitch, intensity, and speaker switching. At the group level, we conducted statistical analysis using a spatiotemporal cluster-based permutation t-test38 with 10,000 permutations to identify sources and time points where the beta coefficients for pitch, intensity, and speaker switching were significantly different from 0. We observed a significant cluster in the right superior temporal lobe in response to pitch (t = 1.57, p = 0.042, Cohen’s d = 2.81, time window: 140–300 ms). Intensity was associated with bilateral superior temporal gyrus (STG) clusters (left: t = 1.1, p = 0.049, Cohen’s d = 1.91, time window: −80–320 ms; right: t = 1.07, p = 0.037, Cohen’s d = 2.23, time window: −100–200 ms). Additionally, speaker switching was associated with a significant cluster in the right TPJ and STG (t = 1.2, p = 0.041, Cohen’s d = 1.16, time window: 300–500 ms), as well as the right middle temporal gyrus (MTG; t = 1.25, p = 0.008, Cohen’s d = 1.7, time window: −60–500 ms), again consistent with the results observed in the fMRI GLM analysis (see Fig. 7d). These results provide additional insight into the temporal dynamics of speaker switching during multi-talker conversation, complementing the spatial information revealed by the fMRI results (see Fig. 6b).
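The group-level cluster statistic can be computed with MNE-Python as sketched below; `betas` stands for the per-subject beta time courses of one regressor (here a random placeholder), and the source-space file path is illustrative.

```python
# Sketch of the spatiotemporal cluster-based permutation test on regression betas.
import numpy as np
import mne
from mne.stats import spatio_temporal_cluster_1samp_test

# Adjacency of the source space used for morphing (path is illustrative).
src = mne.read_source_spaces("fsaverage-ico-5-src.fif")
adjacency = mne.spatial_src_adjacency(src)

# Per-subject beta time courses for one regressor, shape
# (n_subjects, n_times, n_sources); random placeholder here.
n_subjects, n_times = 24, 30
betas = np.random.randn(n_subjects, n_times, adjacency.shape[0])

# One-sample test against zero with 10,000 permutations.
t_obs, clusters, cluster_pv, _ = spatio_temporal_cluster_1samp_test(
    betas, adjacency=adjacency, n_permutations=10000, tail=0)

significant = [c for c, p in zip(clusters, cluster_pv) if p < 0.05]
```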
Usage Notes
This dataset provides a valuable resource for investigating the neural mechanisms underlying naturalistic audiovisual processing, multi-talker speech comprehension, and developmentally and emotionally grounded communication in both fMRI and MEG modalities. However, there are several limitations and potential bottlenecks in its usage.
Annotation bottleneck
Frame-by-frame speaker annotations in the video were not labeled in the current dataset, limiting the precision with which speaker-specific neural responses can be analyzed. Future work could incorporate automated annotation pipelines using multimodal large language models (LLMs) capable of processing audiovisual inputs. These models can jointly analyze visual frames and audio tracks to detect and label speaker identities, speaking turns, and facial expressions in real time.
Analysis bottleneck
More advanced analytical techniques—such as multivariate approaches or computational models capable of representing group conversations—may be employed to further uncover the neural mechanisms underlying multi-speaker communication. These methods can offer a more nuanced understanding of how the brain navigates and interprets overlapping speech signals in complex, real-world listening environments.
Acknowledgements
This work was supported by the CityU Start-up Grant 7020086, CityU Strategic Research Grant 7200747 (JL), the 1·3·5 project for disciplines of excellence–Clinical Research Incubation Project, West China Hospital, Sichuan University (No. 2020HXFH025) and the Open Project from Key Laboratory of Language Science and Multilingual Artificial Intelligence, Shanghai International Studies University.
Author contributions
J.L. designed the study. Q.W. and Z.M. collected the MEG data. R.X., S.F. and X.J. collected the fMRI data. Y.W., C.W., Z.M. and J.L. analysed the data. J.L. wrote the paper.
Code availability
Scripts used for data preprocessing and analysis are available on GitHub (https://github.com/compneurolinglab/baba).
Data availability
The dataset is available on the OpenNeuro repository15 (https://openneuro.org/datasets/ds005346).
Competing interests
The authors declare no competing interests.
Supplementary information
The online version contains supplementary material available at https://doi.org/10.1038/s41597-025-06110-5.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1. Brennan, J et al. Syntactic structure building in the anterior temporal lobe during natural story listening. Brain and Language; 2012; 120, pp. 163-173. [DOI: https://dx.doi.org/10.1016/j.bandl.2010.04.002] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/20472279]
2. Hasson, U; Nir, Y; Levy, I; Fuhrmann, G; Malach, R. Intersubject synchronization of cortical activity during natural vision. Science; 2004; 303, pp. 1634-1640. [DOI: https://dx.doi.org/10.1126/science.1089506] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/15016991]
3. Huth, AG; de Heer, WA; Griffiths, TL; Theunissen, FE; Gallant, JL. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature; 2016; 532, pp. 453-458. [DOI: https://dx.doi.org/10.1038/nature17637] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27121839][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4852309]
4. Li, J et al. Le Petit Prince multilingual naturalistic fMRI corpus. Sci Data; 2022; 9, 530. [DOI: https://dx.doi.org/10.1038/s41597-022-01625-7] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36038567][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9424229]
5. Malik-Moraleda, S et al. An investigation across 45 languages and 12 language families reveals a universal language network. Nat Neurosci; 2022; 25, pp. 1014-1019. [DOI: https://dx.doi.org/10.1038/s41593-022-01114-5] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35856094][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10414179]
6. Momenian, M et al. Le Petit Prince Hong Kong (LPPHK): Naturalistic fMRI and EEG data from older Cantonese speakers. Sci Data; 2024; 11, 992. [DOI: https://dx.doi.org/10.1038/s41597-024-03745-8] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/39261552][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11390913]
7. Nastase, SA et al. The “Narratives” fMRI dataset for evaluating models of naturalistic language comprehension. Sci Data; 2021; 8, 250. [DOI: https://dx.doi.org/10.1038/s41597-021-01033-3] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34584100][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8479122]
8. Wehbe, L et al. Simultaneously uncovering the patterns of brain regions involved in different story reading subprocesses. PloS one; 2014; 9, e112575. [DOI: https://dx.doi.org/10.1371/journal.pone.0112575] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/25426840][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4245107]
9. Li, J. et al. Multi-talker speech comprehension at different temporal scales in listeners with normal and impaired hearing. eLife13 (2024).
10. Zada, Z. et al. A shared model-based linguistic space for transmitting our thoughts from brain to brain in natural conversations. Neuron112 (2024).
11. Zion Golumbic, EM et al. Mechanisms underlying selective neuronal tracking of attended speech at a “cocktail” party. Neuron; 2013; 77, pp. 980-991. [DOI: https://dx.doi.org/10.1016/j.neuron.2012.12.037] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/23473326][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3891478]
12. McDermott, JH. The cocktail party problem. Current Biology; 2009; 19, pp. R1024-R1027. [DOI: https://dx.doi.org/10.1016/j.cub.2009.09.005] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/19948136]
13. Mesgarani, N; Chang, EF. Selective cortical representation of attended speaker in multi-talker speech perception. Nature; 2012; 485, pp. 233-236.2012Natur.485.233M [DOI: https://dx.doi.org/10.1038/nature11020] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/22522927]
14. Power, AJ; Foxe, JJ; Forde, E-J; Reilly, RB; Lalor, EC. At what time is the cocktail party? A late locus of selective attention to natural speech. European Journal of Neuroscience; 2012; 35, pp. 1497-1503. [DOI: https://dx.doi.org/10.1111/j.1460-9568.2012.08060.x] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/22462504]
15. Li, J., Wang, Y., Wang, C. & Ma, Z. Naturalistic fMRI and MEG recordings during viewing of a reality TV show. OpenNeuro https://doi.org/10.18112/openneuro.ds005346.v1.0.4.
16. Corbetta, M; Shulman, GL. Control of goal-directed and stimulus-driven attention in the brain. Nat Rev Neurosci; 2002; 3, pp. 201-215. [DOI: https://dx.doi.org/10.1038/nrn755] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/11994752]
17. Levy, R. & Manning, C. D. Is it harder to parse Chinese, or the Chinese treebank? in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics 439–446, https://doi.org/10.3115/1075096.1075152 (Association for Computational Linguistics, Sapporo, Japan, 2003).
18. Hale, J. Automaton Theories of Human Sentence Comprehension. (CSLI Publications, 2014).
19. Wang, X et al. Performance of optically pumped magnetometer magnetoencephalography: validation in large samples and multiple tasks. J. Neural Eng.; 2024; 21, 066033. [DOI: https://dx.doi.org/10.1088/1741-2552/ad9680]
20. Brookes, MJ et al. Magnetoencephalography with optically pumped magnetometers (OPM-MEG): the next generation of functional neuroimaging. Trends in Neurosciences; 2022; 45, pp. 621-634. [DOI: https://dx.doi.org/10.1016/j.tins.2022.05.008] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35779970][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10465236]
21. Boré, A; Guay, S; Bedetti, C; Meisler, S; Guenther, N. Dcm2Bids.