From Minutes to Days: Scaling Intracranial Speech Decoding with Supervised Pretraining
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper presents a framework to scale intracranial (iEEG) speech decoding by leveraging week-long, continuous brain and audio recordings as supervised pretraining data. Using a contrastive learning model that aligns brain activity with pretrained audio representations (wav2vec 2.0), the approach demonstrates significant gains over models trained solely on short, controlled datasets. The study reveals that pretraining performance improves log-linearly with the amount of data and that downstream performance on controlled tasks benefits robustly from large-scale pretraining followed by supervised finetuning. Analysis of the learned embedding spaces highlights issues of cross-day neural drift and distributional shift between ambient and experimental audio.
**Strengths:**
1. The methodology is clearly formalized, with careful and transparent documentation of preprocessing, architecture, and experimental protocols.
2. The empirical evaluation is rigorous: the impact of pretraining is shown clearly in Figure 2, and the demonstrated log-linear relationship between data quantity and downstream performance is rarely reported in the brain decoding literature.
3. The analysis of representation drift (Figures 6 and 9) addresses a valuable and often neglected aspect, revealing new neuroscientific challenges that arise at longer recording timescales.
**Weaknesses:**
1. Both the model architecture and training paradigm are directly adopted from [1] (Line 133). The dataset was not collected by the authors; instead, 3 of the 46 subjects in [2], a dataset that is not publicly available, were selected. The code link does not belong to this project but was copied from [1].
2. The experimental evaluation included only 3 subjects from [2] and lacked comparisons with advanced sEEG decoding baselines [3-6], which makes it difficult to position the contribution of this paper.
3. Although the idea of using sEEG-audio signal pairs from the non-task periods to improve decoding performance during the task phase is interesting, controlled experimental designs exist precisely to ensure that subjects attend to carefully designed cognitive tasks and that the recorded sEEG signals therefore contain information about language perception; the ambient recordings offer no such guarantee, which makes the neuroscientific basis and reproducibility of this work questionable.
**References**:
[1] Défossez A, Caucheteux C, Rapin J, et al. Decoding speech perception from non-invasive brain recordings[J]. Nature Machine Intelligence, 2023, 5(10): 1097-1107.
[2] Evanson L, Bulteau C, Chipaux M, Dorfmüller G, Ferrand-Sorbets S, Raffo E, Rosenberg S, Bourdillon P, King J-R. Emergence of Language in the Developing Brain. Manuscript online, May 2025. URL https://ai.meta.com/research/publications/emergence-of-language-in-the-developing-brain/ (Accessed 10/09/2025).
[3] Wang C, Subramaniam V, Yaari A U, et al. BrainBERT: Self-supervised representation learning for intracranial recordings[J]. arXiv preprint arXiv:2302.14367, 2023.
[4] Chau G, Wang C, Talukder S, et al. Population transformer: Learning population-level representations of neural activity[J]. arXiv preprint arXiv:2406.03044, 2025.
[5] Zhang D, Yuan Z, Yang Y, et al. Brant: Foundation model for intracranial neural signal[J]. Advances in Neural Information Processing Systems, 2023, 36: 26304-26321.
[6] Yuan Z, Shen F, Li M, et al. Brainwave: A brain signal foundation model for clinical applications[J]. arXiv preprint arXiv:2402.10251, 2024.
**Questions:** See the above weaknesses.
Fully human-written
---
From Minutes to Days: Scaling Intracranial Speech Decoding with Supervised Pretraining
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper describes week-long intracranial and audio recordings used to train a contrastive learning model. The learned representations seem to suggest that brain activity represents speech features, but that their global structure shifts across days; this identifies the practical problem that the shift ought to be explicitly accounted for.
**Strengths:**
- It is a strength that large amounts of data (over the course of a week) can be effectively used, apparently scalably. It is hard to assess the "over two orders of magnitude" claim (L17), though. This also reveals one of the main insights, regarding the cross-day neural drift and the need to correct for it.
**Weaknesses:**
- It is only the most minor of complaints, but the format of the Introduction is not quite typical of a scientific publication; consider omitting the boldfaced headings or adding a more narrative opening. Some claims are stated loosely (e.g., "patients...typically spend about a week", "about 100X more neural data") or without citation. The writing could generally be tightened and improved.
- Although references and related work are distributed throughout the paper, they tend to be isolated to specific decisions (e.g., the wav2vec 2.0 model used). It would have been easier to identify the apparent novelty of the work had it been couched in a fuller, well-contextualized background section.
- The core of the work is a standard CLIP(-like?) contrastive alignment with typical objectives -- there is no novel architectural, objective, or analytical contribution.
- The experiments are within-subject for a relatively small collection of patients. An ongoing problem in this community is how either to build thinker-independent models from scratch or to use generalizable foundation models, so that such small-N data (in terms of patients) can be leveraged. At least with respect to generalizability, the empirical results are narrow. Additional ablations or sweeps over adjustable parameters would also be expected.
**Questions:**
- L42: Are you suggesting that there is a tradeoff between EEG and MEG in temporal versus spatial resolution?
- L46: The moments when participants perform an overt speech task can be disastrous for EEG because of muscle and movement artifacts. Is overt speech in EEG not included in 'typically'?
- L122: In your loss, is the objective to pick the right V for a given U? This makes sense, but is the loss symmetric, as CLIP's typically is?
- Ibid.: Do negatives appear in-batch only? With a batch size of ~128, is the number and variety of negatives modest? (A minimal sketch of the kind of symmetric, in-batch objective being asked about is given below for reference.)
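For reference, this is the kind of objective the two questions above refer to: a symmetric, CLIP-style InfoNCE loss whose negatives come only from the other items in the batch. All names, the temperature value, and the batch size are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only, not the paper's code: a symmetric, CLIP-style
# InfoNCE loss in which negatives come solely from the other items in the batch.
import torch
import torch.nn.functional as F

def symmetric_clip_loss(u, v, temperature=0.1):
    """u: (B, D) brain embeddings; v: (B, D) audio embeddings; B ~ 128."""
    u = F.normalize(u, dim=-1)
    v = F.normalize(v, dim=-1)
    logits = u @ v.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(u.size(0), device=u.device)
    loss_u2v = F.cross_entropy(logits, targets)       # pick the right V for each U
    loss_v2u = F.cross_entropy(logits.t(), targets)   # reverse direction; CLIP averages both
    return 0.5 * (loss_u2v + loss_v2u)                # each sample sees B-1 in-batch negatives
```

If only the U-to-V direction is used, the loss would correspond to `loss_u2v` alone, and with B ≈ 128 each positive is contrasted against only 127 in-batch negatives.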
Fully human-written |
---
From Minutes to Days: Scaling Intracranial Speech Decoding with Supervised Pretraining
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper introduces a supervised pretraining framework for intracranial EEG (iEEG)-based speech decoding, leveraging week-long ambient and task-based brain-audio recordings from epilepsy patients. Using a contrastive learning approach, the authors align neural signals with representations from a pretrained speech model (wav2vec 2.0), scaling dataset sizes by orders of magnitude compared to traditional short, controlled experiments. The work demonstrates that pretraining on large-scale, ambient recordings significantly improves downstream decoding performance with robust log-linear gains as data expands, while detailed representational analyses reveal substantial cross-day drift in neural embeddings.
**Strengths:**
1. Real-world relevance: The authors effectively leverage week-long clinical iEEG recordings paired with ambient audio (data typically discarded) to scale training data by over two orders of magnitude. This represents a meaningful step toward real-world, scalable brain-speech decoding and is clearly motivated and illustrated (Figure 1).
2. Rigorous and comprehensive experimental validation: The pretraining framework consistently improves downstream speech decoding across all three subjects, with statistically significant gains (Figure 2A). The log-linear scaling with pretraining data quantity (Figure 2B) and sensitivity analyses (e.g., finetuning data ablation in Figure 4A) further strengthen the claims.
3. Representational and distribution shift analysis: The paper provides a clear analysis of the distribution shift between ambient and true audiobook sounds (Figure 3) and demonstrates the necessity of finetuning. The comparison between wav2vec 2.0 and melspectrogram features (Figure 5) offers valuable insights into which acoustic representations align better with neural activity.
4. Neurophysiologically informative embedding analysis: The UMAP visualizations and linear decoding analyses (Figures 6, 10) reveal meaningful structure in the learned embeddings, particularly the day-to-day drift in neural representations—a finding with important implications for future model design and clinical translation.
**Weaknesses:**
1. Limited comparison to recent state-of-the-art baselines: The paper does not adequately situate itself within the rapidly evolving literature on neural decoding. Key recent works, such as self-supervised pretraining on iEEG [1,2] and cross-subject or cross-session transfer learning [3], are not discussed or compared against. This omission weakens the claim of methodological novelty.
2. Incomplete coverage of pretraining innovations in brain decoding: While this paper emphasizes supervised pretraining on ambient recordings, it lacks a detailed overview of results from related foundation models [4,5] that also exploit large-scale neural data. A deeper discussion is needed of how this work relates to these methods and in what ways it advances beyond them.
3. Lack of neural-level interpretability and spatial ablation: The embedding analyses are informative but do not directly link to neural anatomy or functional localization. Ablations over electrode groups (e.g., auditory vs. non-auditory cortex) or analysis of how different brain regions contribute to the learned representations would strengthen the interpretability and biological plausibility of the model.
4. Superficial handling of temporal non-stationarity: Although the paper identifies day-to-day drift as a key challenge, the proposed model does not explicitly account for it. Incorporating temporal adaptation mechanisms, such as domain-adversarial training, sliding-window normalization, or time-aware embeddings, could improve robustness and generalization, and should be explored or at least discussed as a future direction (a minimal sketch of one such mechanism is given below).
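To make the suggestion concrete, here is a minimal sketch of one of these mechanisms, causal sliding-window normalization of the neural features. The function name, window length, and feature layout are illustrative assumptions; this is not the paper's pipeline.

```python
# Illustrative sketch of the reviewer's suggestion (not the paper's method):
# causal sliding-window normalization to attenuate slow, cross-day drift in iEEG features.
import numpy as np

def sliding_window_normalize(x, window_samples):
    """x: (T, C) array of iEEG features over time; each sample is z-scored
    against the statistics of the trailing `window_samples` samples."""
    out = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):
        start = max(0, t - window_samples + 1)
        window = x[start:t + 1]
        mu = window.mean(axis=0)
        sigma = window.std(axis=0) + 1e-6   # small constant avoids division by zero
        out[t] = (x[t] - mu) / sigma
    return out
```

Time-aware embeddings (e.g., a learned per-day offset added to the neural encoder's input) or domain-adversarial training would be learned counterparts of this simple statistical correction.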
**References:**
[1] Wu, D., Li, S., Feng, C., Cao, L., Zhang, Y., Yang, J., & Sawan, M. (2024). Towards Homogeneous Lexical Tone Decoding from Heterogeneous Intracranial Recordings. *arXiv preprint arXiv*:2410.12866.
[2] Zheng, H., Wang, H., Jiang, W., Chen, Z., He, L., Lin, P., ... & Liu, Y. (2024). Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals. *Advances in Neural Information Processing Systems, 37*, 79996-80033.
[3] Singh, A., Thomas, T., Li, J., Hickok, G., Pitkow, X., & Tandon, N. (2025). Transfer learning via distributed brain recordings enables reliable speech decoding. *Nature Communications, 16*(1), 8749.
[4] Zhang, D., Yuan, Z., Yang, Y., Chen, J., Wang, J., & Li, Y. (2023). Brant: Foundation model for intracranial neural signal. *Advances in Neural Information Processing Systems, 36*, 26304-26321.
[5] Chau, G., Wang, C., Talukder, S., Subramaniam, V., Soedarmadji, S., Yue, Y., ... & Barbu, A. (2025). Population transformer: Learning population-level representations of neural activity. *arXiv preprint arXiv*:2406.03044.
**Questions:** See Weaknesses.
Fully AI-generated
---
From Minutes to Days: Scaling Intracranial Speech Decoding with Supervised Pretraining
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The authors propose leveraging ambient audio data from long intracranial studies in a contrastive supervised pre-training stage. In turn, this enables learning from intracranial signals over the length of a study, vastly increasing the amount of training data available. The authors show that pre-training a contrastive model with this data allows it to generalise, with some fine-tuning, to downstream speech comprehension / audio listening tasks. The results also indicate that the pre-training scales log-linearly, suggesting further data could continue to improve generalisation performance.
**Strengths:**
- Interesting idea to leverage ambient audio for a supervised pre-training stage
- Fine-tuning the pre-trained model seems to convincingly beat the baseline
- Error bars and statistical tests included show that improvements are significant
- Performance appears to scale log-linearly with pre-training data between 0 and 100 hours (see the brief fitting sketch below)
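For clarity on what is meant operationally by a log-linear trend, i.e. accuracy ≈ a·log10(hours) + b, a short fitting sketch follows. The data points are placeholders chosen for illustration, not values taken from the paper.

```python
# Illustrative only: fitting a log-linear scaling trend, accuracy ~ a*log10(hours) + b.
# The data points below are placeholders, not results from the paper.
import numpy as np

hours = np.array([1.0, 3.0, 10.0, 30.0, 100.0])      # hypothetical pretraining amounts (hours)
accuracy = np.array([0.10, 0.14, 0.19, 0.23, 0.28])  # hypothetical downstream accuracy

a, b = np.polyfit(np.log10(hours), accuracy, deg=1)  # least-squares fit in log-space
print(f"accuracy ~ {a:.3f} * log10(hours) + {b:.3f}")
```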
**Weaknesses:**
- Missing baselines: Please include (1) an end-to-end baseline where you train your full architecture directly on the supervised data and (2) a baseline where you train a linear layer directly on the raw iEEG of the downstream data. Without these, it’s hard to determine whether the pre-training was necessary at all.
- Minor: Lines 126-128: Özdogan et al. (2025) quotes some of the work from [A], so [A] should also be cited here. Similarly, lines 441-442 discuss unsupervised models, for which you may also wish to cite [B] and [C] as unsupervised intracranial foundation models.
I am open to moving towards recommending acceptance if the authors can address the above concerns satisfactorily.
[A] Jayalath, D., Landau, G. and Jones, O.P., 2025. Unlocking non-invasive brain-to-text. arXiv preprint arXiv:2505.13446.
[B] Wang, C., Subramaniam, V., Yaari, A.U., Kreiman, G., Katz, B., Cases, I. and Barbu, A., 2023. BrainBERT: Self-supervised representation learning for intracranial recordings. arXiv preprint arXiv:2302.14367.
[C] Zhang, D., Yuan, Z., Yang, Y., Chen, J., Wang, J. and Li, Y., 2023. Brant: Foundation model for intracranial neural signal. Advances in Neural Information Processing Systems, 36, pp.26304-26321.
**Questions:**
- Why resample the brain data to 40 Hz for the architecture? Intracranial recordings often pick up gamma and high-gamma band frequencies that may be relevant for speech perception [D] and could improve results. The Défossez et al. (2023) architecture was designed for non-invasive recordings (MEG), where these frequencies are often low-signal or noisy, but in intracranial recordings they are likely to be useful. (An illustrative sketch of one way to retain high-gamma information at a 40 Hz input rate is given after the references below.)
- Why use the ambient data as a pre-training stage at all? What happens when you jointly train with the ambient data as well as the true audiobook data?
[D] Mugler, E.M., Patton, J.L., Flint, R.D., Wright, Z.A., Schuele, S.U., Rosenow, J., Shih, J.J., Krusienski, D.J. and Slutzky, M.W., 2014. Direct classification of all American English phonemes using signals from functional speech motor cortex. Journal of Neural Engineering, 11(3), p.035015.
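To illustrate the point about high-gamma information: one generic option (an assumed pipeline, not what the authors do) is to extract the 70-150 Hz amplitude envelope before downsampling, so that a 40 Hz input rate still carries high-gamma amplitude. The function name, filter order, and band edges below are illustrative assumptions.

```python
# Illustrative sketch only (not the paper's preprocessing): keep high-gamma information
# at a low model input rate by taking the 70-150 Hz amplitude envelope, then downsampling.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, resample_poly

def high_gamma_envelope(x, fs, fs_out=40, band=(70.0, 150.0)):
    """x: (C, T) iEEG sampled at fs Hz; returns (C, T') envelopes resampled to fs_out Hz."""
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    x_bp = sosfiltfilt(sos, x, axis=-1)                  # zero-phase band-pass filter
    env = np.abs(hilbert(x_bp, axis=-1))                 # analytic-amplitude (envelope)
    return resample_poly(env, fs_out, int(fs), axis=-1)  # the envelope varies slowly enough for 40 Hz
```

Whether this recovers useful signal in practice is of course an empirical question for the authors.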
Fully human-written |