Brain-Informed Language Model Training Enables Scalable and Generalizable Alignment with Human Brain Activity
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper explores whether functional MRI (fMRI) recordings of human brain activity can serve as a supervisory signal to train large language models (LLMs) toward more human-like, multimodal representations. Building on over 50 hours of fMRI data from participants watching Friends and 10 hours of additional movie data, the authors fine-tune GPT-2 (124M) and LLaMA-2 (7B) using Low-Rank Adaptation (LoRA) within a dual-objective framework that balances a standard language modeling loss with a brain alignment loss. Through systematic experiments, the authors show that brain-informed fine-tuning improves voxel-level encoding accuracy across auditory, temporal, and frontal cortical regions, scales with both model size and training duration, and generalizes across participants and unseen movie stimuli.
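For reference, the dual-objective setup described in the summary can be captured by a short sketch. This is a minimal illustration under assumptions (a linear voxel readout from pooled hidden states and a fixed mixing weight); it is not the authors' exact implementation.

```python
# Minimal sketch of a dual-objective (language + brain) fine-tuning step.
# Assumptions: a linear readout maps pooled hidden states to voxel responses,
# and a fixed weight lam mixes the two losses (the paper may schedule it).
import torch.nn.functional as F

def dual_objective_step(model, readout, batch, lam=0.5):
    """batch["input_ids"]: (B, T) tokens of the stimulus transcript;
    batch["voxels"]: (B, V) fMRI responses aligned to the same window;
    readout: torch.nn.Linear mapping hidden_size -> V voxels."""
    out = model(input_ids=batch["input_ids"],
                labels=batch["input_ids"],
                output_hidden_states=True)
    lm_loss = out.loss                            # next-token prediction loss

    hidden = out.hidden_states[-1].mean(dim=1)    # (B, hidden_size) pooled states
    brain_loss = F.mse_loss(readout(hidden), batch["voxels"])

    return lam * lm_loss + (1.0 - lam) * brain_loss
```

Whether the brain loss uses such a linear readout, how hidden states are pooled, and how the mixing weight is set remain open details.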
1. The paper shows that brain activity can effectively guide LM training, not merely serve as a downstream encoding target.
2. The paper proposes a dual-loss optimization scheme dynamically balancing language and brain supervision to improve generalization and stability.
3. The experiments show that neural alignment captures shared cortical structure rather than participant-specific noise.
1. Using fMRI to align LMs with brain activity is an incremental extension of prior work. For example, [1] found that brain prediction performance scales logarithmically with model size from 125M to 30B parameters. Moreover, an explanation of why or how brain alignment leads to better representations is missing: the claim that “brain signals provide multimodal inductive bias” remains intuitive but unquantified. The results show correlation gains but do not explain which representational dimensions change in the LM after brain fine-tuning. There is no layer-wise or feature-space analysis (e.g., probing semantic vs. perceptual dimensions, or alignment metrics such as CKA or RSA; an illustrative sketch follows the reference below) to reveal how brain-aligned features differ from text-only ones. An explanation of why the dual-objective training scheme prevents representational collapse is also missing.
2. The “brain-from-scratch” and “text-only” models are included, but stronger baselines are missing.
3. The writing and figures are hard to follow. For example, what do the blue and red colors in Figure 2 indicate?
[1] Antonello, Richard, Aditya Vaidya, and Alexander Huth. "Scaling laws for language encoding models in fMRI." Advances in Neural Information Processing Systems 36 (2023).
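As an illustration of the feature-space analysis suggested in weakness 1, linear CKA between layer activations of the text-only and brain-fine-tuned models could be computed in a few lines. This is a hypothetical sketch of the suggested analysis, not something reported in the paper; the layer choice and variable names are placeholders.

```python
# Linear CKA between two activation matrices, e.g. the same layer of the
# text-only and brain-fine-tuned models run on identical held-out stimuli.
import numpy as np

def linear_cka(X, Y):
    """X: (n_samples, d1), Y: (n_samples, d2) activation matrices."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))

# Hypothetical usage: acts_text_only and acts_brain_tuned are
# (n_tokens, hidden_size) activations from the same layer.
# similarity = linear_cka(acts_text_only, acts_brain_tuned)
```

A layer-wise CKA (or RSA) profile would show where in the network brain fine-tuning changes the representation, which the paper currently does not report.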
1. Why can the proposed framework inject multisensory inductive biases? Can the authors quantify or demonstrate this claim?
2. Why does the dual-objective loss prevent representational collapse? Are there any insights?
3. The writing can be improved.
Lightly AI-edited

---
Brain-Informed Language Model Training Enables Scalable and Generalizable Alignment with Human Brain Activity
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper investigates how fMRI recordings can be used to fine-tune large language models (LLMs) toward alignment with human brain activity. The authors propose a dual-objective framework combining standard language modeling with brain alignment, leveraging over 50 hours of naturalistic movie-watching fMRI data. Experiments on GPT-2 and LLaMA-2 show consistent improvements in voxel-wise encoding, cross-subject generalization, and downstream visually grounded commonsense tasks. The study suggests that neural supervision can inject multimodal, human-like structure into text-only LMs.
* Interesting and novel idea, directly training LMs with human brain activity rather than merely testing alignment.
* Strong empirical design using large-scale naturalistic fMRI datasets with both within- and cross-subject validation.
* Careful exploration of scaling effects (data size and model size) and clear demonstration that alignment benefits grow with scale.
* Connection to downstream commonsense reasoning provides an interesting link between neuroscience and AI representation learning.
* The paper may inadvertently compromise the double-blind review process: Appendix A.3 explicitly names computing clusters and institutions (e.g., CMU, University of Montreal), which should be anonymized.
* Several figures (e.g., Fig. 2/3/5/6) are unclear or potentially misleading, and the metrics are not labeled, making it difficult to interpret what is being compared or measured.
* The methodological description lacks precision. The "separate Ridge model" in Sec. 4.3 is not clearly explained, especially whether it was trained using all subjects or subject-specific data, and the strong cross-subject consistency in Appendix Fig. 12 is puzzling without further justification.
* The evaluation procedure on the CoDa dataset (Sec. 4.5) is underspecified; the paper does not clearly define how the metrics were computed.
* The paper focuses entirely on brain alignment and visually grounded reasoning but does not explore how brain fine-tuning affects performance on traditional text-only tasks. Understanding whether language quality is preserved or degraded is important for assessing the broader impact of this approach.
See above.
Heavily AI-edited

---
Brain-Informed Language Model Training Enables Scalable and Generalizable Alignment with Human Brain Activity
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper explores how fMRI signals can be used not only as an evaluation target for measuring alignment with LLMs but also as a supervisory signal for fine-tuning them. The authors explore the potential of guiding LLM training with brain data and test three strategies: (1) LoRA-based fine-tuning of pre-trained LLMs, (2) training LLMs from scratch on brain data, and (3) joint optimisation combining language and brain-alignment losses.
They report improvements in voxel-level encoding performance for brain-informed fine-tuning over brain-only models. They also highlight potential knowledge gains over text-only models on visually grounded language benchmarks, suggesting that fine-tuning on fMRI data injects perceptual and associative priors that text-only training lacks.
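For context on strategy (1), a LoRA fine-tuning setup of the kind described might look like the sketch below using the Hugging Face peft library; the rank, alpha, and target modules are illustrative assumptions, not the authors' reported configuration.

```python
# Sketch of a LoRA setup for parameter-efficient fine-tuning of a causal LM.
# Hyperparameters and target modules are illustrative guesses.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "gpt2"  # the paper also reports larger models (e.g., LLaMA-2 7B)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_cfg = LoraConfig(
    r=8,                        # low-rank dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2 attention projection; model-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only adapter weights remain trainable
```

Which modules the adapters are attached to is exactly the kind of detail asked about in the questions below.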
The core idea of leveraging biologically grounded signals to inject multimodal capabilities into LLMs is original, well-motivated, and points in the direction of more understandable AI models.
The use of the VL-Commonsense benchmark is a good idea to test whether brain supervision enhances perceptual knowledge (e.g., color, shape, co-occurrence).
The large-scale fMRI dataset (50h of Friends + 10h of movies) is valuable and significantly larger than prior studies.
Although the idea introduced is interesting and the motivation is clear, the paper in its current form is not yet ready for acceptance for several reasons.
1. First, some parts of the methods are insufficiently explained:
- The concepts of brain-fine-tuned, brain-from-scratch, and brain-only models are not well explained early in the paper (in the abstract and introduction), which makes the paper harder to follow.
- The reasoning behind taking a randomly initialised LLM architecture and training it only on brain data is not clear.
- The input data used (which texts? transcripts? captions?) is not described in the main text.
- Ridge modelling approach (Sec. 4.3): is a new ridge model trained for each test subject, and how does this affect the generalisation claim?
2. Some of the key hypotheses remain vague.
- The choice of some hyperparameters; see the next section for more details.
- The rationale behind training “brain-from-scratch” models is weak; given the data scale, the risk of overfitting is high.
- Missing or vague methodological justifications (e.g., why voxel-wise linear encoding? Why not spatially constrained or surface-based models?).
- A real baseline against which to compare the model's performance is missing (e.g., a simple correlation model between language embeddings and brain activations, to contrast with fine-tuning).
3. The results section is not strong enough to demonstrate the value of the method or, more importantly, to support the main claims made throughout the paper ("substantially improve", generalisation to new subjects and new stimuli, etc.).
- Some claims sound conclusive but are only suggestive, such as the generalisability to new stimuli and new subjects. There are no quantitative metrics supporting the model's generalisation to new subjects and new stimuli; only a few figures are presented (in the appendix), without a colour scale and without any way of interpreting how good the performance is.
- Similarly, the VL-Commonsense (CoDa) results would benefit from a more detailed analysis: which objects are classified better, what the prompts are, and how the models' answers differ semantically. As it stands, it is unclear what the model actually improves or learns from fine-tuning on brain data (is the additional training on new input text helping, or is it the brain-prediction task itself?).
4. The paper would benefit from a better evaluation of performance:
- Generally, I found it very difficult to appreciate the results due to the absence of a real baseline. How would a simple correlation model between language embeddings and fMRI activations perform (similar to J. Millet et al., 2022, "Toward a realistic model of speech processing in the brain with self-supervised learning")? A sketch of such a baseline follows this list.
- A proper quantitative evaluation of generalisation performance is missing; perhaps correlation scores per region (auditory, language areas, etc.)? Is the correlation score the best metric for assessing generalisation? What about stimulus (audio, visual, or text) retrieval? (See Dahan et al., 2025, "SIM: Surface-based fMRI Analysis for Inter-Subject Multimodal Decoding from Movie-Watching Experiments".)
- More systematic evaluations than hard-to-read correlation maps are needed.
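To make the requested baseline concrete, a frozen-LM ridge encoding model with voxel-wise correlation scoring could look like the sketch below; the feature extraction (layer, pooling, lag handling) and variable names are assumptions for illustration only.

```python
# Frozen-LM ridge baseline: regress fMRI voxels on fixed language embeddings
# and score with voxel-wise Pearson correlation. Details are illustrative.
import numpy as np
from sklearn.linear_model import RidgeCV

def ridge_baseline(train_feats, train_bold, test_feats, test_bold):
    """feats: (n_TRs, d) frozen-LM embeddings per TR; bold: (n_TRs, n_voxels)."""
    ridge = RidgeCV(alphas=np.logspace(0, 4, 9))
    ridge.fit(train_feats, train_bold)
    pred = ridge.predict(test_feats)              # (n_TRs, n_voxels)

    # Voxel-wise Pearson correlation between predicted and measured BOLD.
    pred_c = pred - pred.mean(axis=0)
    bold_c = test_bold - test_bold.mean(axis=0)
    r = (pred_c * bold_c).sum(axis=0) / (
        np.linalg.norm(pred_c, axis=0) * np.linalg.norm(bold_c, axis=0) + 1e-8
    )
    return r  # could then be averaged per region (auditory, language, ...)
```

Reporting such a baseline alongside the fine-tuned models, with region-averaged scores, would directly address the comparability concerns above.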
5. The paper overall needs substantial polishing in writing, structure, figure design, and presentation for clarity and readability.
- Some concepts are not properly introduced or cited (such as LoRA).
- The phrasing is often unclear; several paragraphs (e.g., the first paragraph of Related Work) would benefit from rewriting.
- Figure 1 should present the different training schemes more clearly.
- Figure legends are not self-explanatory; it is not stated what type of map is shown (Figure 6), and Figure 4 lacks clarity.
- Could you provide more rationale for the use of 32 tokens as the context window?
- How are the LoRA layers selected within the model?
- How would you evaluate the risk of overfitting when training an LLM architecture on a relatively small dataset?
- What are the main limitations of the dataset (small N, inter-session variability)? How dependent are the generalisation and sensory-representation results on this specific dataset? Have you considered richer benchmarks such as audiovisual captioning or narrative reasoning tasks to validate multimodal grounding?
- In terms of generalisation, what would the comparative encoding scores be when training/testing directly on Movie10?
- It might be interesting, as a lower bound for the experiment, to report the correlation score of a text-only model where only a final layer is fine-tuned to predict brain activity from the text embeddings.
- It is not clear from the text which training scheme the mention of "brain-only" refers to; I guess it is brain-from-scratch, but this is confusing.
Fully human-written |
---
Brain-Informed Language Model Training Enables Scalable and Generalizable Alignment with Human Brain Activity
Soundness: 3: good
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper applies LLMs to predict fMRI voxel responses from text stimuli. The authors compare three training approaches: frozen LM, LoRA finetuning, and full finetuning across different LLMs (GPT-2, Llama, Mistral). They test cross-subject and cross-movie generalization on a movie-watching dataset.
While the results demonstrate that full finetuning with more data yields the best performance, the paper's contribution remains unclear. The architectural variants (LoRA/HRF) lack novelty, and the reported correlations are substantially lower than those of other LLM-based voxel-prediction methods. The work would benefit from clearer positioning and substantial reframing to establish its distinct contribution to the field.
The paper provides detailed hyperparameter specifications that support reproducibility, and the clear section organization makes the methodology easy to follow. The experimental design includes valuable cross-subject and cross-movie generalization tests that are important for assessing model robustness in real-world applications.
### Architecture
The authors note that fMRI responses are "high-dimensional, have slow temporal dynamics, and are spatially structured," but the proposed architectures (frozen LM/LoRA/full finetuning) don't explicitly model these inductive biases. Missing architectural considerations include spatial voxel relationships, region-specific HRFs, or temporal dependencies from previous activations.
HRF details are missing: distribution type (Single Gamma/Canonical/Poisson), parameterization, and lag estimation methodology.
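To illustrate what a complete HRF specification would look like, the sketch below implements a commonly used canonical double-gamma HRF and convolves it with a feature time series; the parameter values and the TR are assumptions, not details taken from the paper.

```python
# Canonical double-gamma HRF with commonly used default parameters (assumed),
# convolved with stimulus features sampled at the fMRI TR.
import numpy as np
from scipy.stats import gamma

def canonical_hrf(tr=1.5, duration=32.0, peak=6.0, undershoot=16.0, ratio=1/6):
    """Return an HRF sampled every `tr` seconds (tr here is an example value)."""
    t = np.arange(0, duration, tr)
    hrf = gamma.pdf(t, peak) - ratio * gamma.pdf(t, undershoot)
    return hrf / hrf.sum()

def convolve_features(features, tr=1.5):
    """features: (n_TRs, d) stimulus features; returns HRF-convolved features."""
    hrf = canonical_hrf(tr=tr)
    return np.stack(
        [np.convolve(features[:, j], hrf)[: features.shape[0]]
         for j in range(features.shape[1])],
        axis=1,
    )
```

Stating whether the paper uses such a canonical form, a region-specific fit, or a learned lag would resolve the ambiguity noted above.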
### Results quality
Mean correlations are very low and potentially not statistically significant without proper testing that accounts for spatial structure. Other papers report results that are an order of magnitude higher [1][2][3]. While datasets and preprocessing differ, the gap warrants a detailed explanation.
### Presentation issues
- Figure 1: Links between panels A, B, C unclear; training regimes not referenced; text too small in B/C; technical terms (TR, HRF, BOLD) need explanation or simplification
- Figure 3: Color schemes unclear
- Inconsistent presentation throughout
### Typos
- line 250: "in 10" -> "in figure 10"
- line 329: "coincidentally" -> "fittingly"
- line 377: "Whether our model" -> "If our model"
[1] [On whether the relationship between large language models and brain activity is language-specific](https://2025.ccneuro.org/abstract_pdf/Gurel_2025_On_whether_relationship_large_language_models.pdf)
[2] [Hierarchical Brain–LLM Alignment Reveals Layer-Specific Neural Representations of Second Language Proficiency](https://www.biorxiv.org/content/10.1101/2025.06.17.660057v2.full.pdf)
[3] [fMRI predictors based on language models of increasing complexity recover brain left lateralization](https://arxiv.org/html/2405.17992v2)
What is causing the anomaly in Figure 4 at 10h of training data with Llama?
How do you justify the low correlation results compared to other fMRI-LLM papers? What specific dataset or preprocessing differences account for the order-of-magnitude performance gap?
Can you provide detailed HRF specifications: distribution type, parameters, and lag estimation method?
Why don't the proposed architectures explicitly model the spatial and temporal structure of fMRI data that you highlight as important?
Moderately AI-edited |