|
Instruction-Tuned Video-Audio Models Elucidate Functional Specialization in Brain |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper systematically evaluates the brain alignment capability of instruction-tuned multimodal large language models (Instruction-Tuned MLLMs) under naturalistic movie stimuli, providing the first cross-modal analysis of the relationship between model representations and neural activations across both visual and auditory modalities.
This work represents a good step toward comparing brain alignment between visual/audio modalities and large multimodal language models under naturalistic movie stimuli, and it conducts extensive experiments across multiple datasets and baselines to ensure the robustness and generality of the findings.
1. Limited originality. The main contribution of this paper lies in expanding the data dimension, without introducing any innovation in algorithmic design, model architecture, or theoretical analysis. Overall, the work appears to be an incremental extension of prior unimodal studies.
2. Although the authors claim that instruction tuning improves brain alignment, the paper does not explain why task instructions better capture neural task differentiation. The cognitive mechanism underlying this effect, whether it stems from changes in attention allocation, semantic structuring, or task-specific feature learning, remains unclear.
3. The current evidence for the effect of instruction tuning on brain alignment is based solely on cross-model comparisons, which suffer from confounding factors such as architectural differences, pretraining data, and parameter scale. To establish causal validity, the authors are strongly encouraged to design a self-controlled experiment comparing the same model before and after instruction tuning. This setup isolates the single variable of “whether instruction tuning has been applied,” allowing for a direct measurement of its pure effect. Such an experiment would not only answer whether instruction tuning improves brain alignment, but also how it reshapes internal representations, e.g., identifying which layers show the largest improvement in alignment, whether changes occur in semantic versus perceptual layers, and whether the representational distribution becomes more consistent with cortical activation patterns. Furthermore, by comparing model activations for the same instruction (e.g., “Describe this video”) before and after instruction tuning, the authors could directly demonstrate how tuning transforms the model from producing task-agnostic representations to generating task-specific ones that align with relevant brain regions, thereby providing strong empirical evidence for the paper’s core claim that task instructions guide functional specialization in the brain. Therefore, causal verification through self-controlled experiments is required (a minimal sketch of such a comparison appears after this list).
4. Experimental design lacks rigor. To validate “instruction tuning” within a neuroscience or cognitive science framework, experiments should systematically test semantically similar or equivalent instructions, comparing the resulting activation patterns and brain alignment. The current finding, that different task instructions (e.g., “video description” vs. “spatial understanding”) activate distinct brain regions, does not yet prove that the model captures semantic-level task distinctions rather than merely reacting to superficial textual differences.
To strengthen the argument, the authors should conduct a semantic similarity robustness test. Specifically, they could construct a dataset of instruction pairs with known semantic similarity (e.g., “Describe this scene” vs. “Narrate what happens in the video”) and compute both (a) the semantic similarity between instructions using text embedding models, and (b) the similarity of internal model representations (e.g., cosine similarity between layer vectors).
A strong positive correlation between these measures would demonstrate that instruction tuning enhances the model’s sensitivity to fine-grained semantic distinctions.
Additionally, clustering analysis of activation patterns induced by various instructions could reveal whether the model organizes tasks into conceptual categories resembling human cognition, e.g., grouping “object recognition” and “scene recognition” as perceptual tasks, and “inferring intentions” and “summarizing narratives” as social reasoning tasks. Such analyses would verify the model’s semantic precision and conceptual structuring ability, significantly enhancing the work’s interpretability and contribution to cognitive neuroscience (a sketch of both the similarity and clustering analyses appears after this list).
5. The paper does not provide or cite empirical findings showing that instructions themselves lead to differential neural activations in humans. Since the authors claim that instruction tuning improves brain alignment, they should explain the corresponding neural or representational mechanism, for example, whether task instructions alter attention distribution, facilitate semantic decomposition, or modulate higher-order reasoning representations in ways analogous to human cognition.
6. The compared multimodal large language models (MLLMs) differ in architecture size, pretraining corpus, and task scope. The paper should clarify how these factors were controlled to ensure fair and scientifically valid comparisons.
7. The paper emphasizes a “hierarchical correspondence” between model layers and brain regions, yet it lacks supporting evidence such as a correlation matrix or significance testing. The authors should verify whether this pattern reflects genuine layer-wise alignment or is driven by model size or random effects. Furthermore, it would be informative to test whether the same hierarchical pattern persists when replacing task instructions with random natural language prompts.
8. Weak performance of audio models remains unexplained. The paper should analyze whether this weakness stems from modality misalignment, instruction design, or feature fusion mechanisms, and discuss how these factors influence cross-modal representation learning.
9. Unjustified task categorization. The division of 13 tasks into 5 categories lacks statistical validation. The authors should clarify whether this categorization is subjective or data-driven, ideally supporting it with quantitative clustering or similarity analysis.
10. The voxel-wise mapping via ridge regression may oversimplify the relationship between model representations and neural activity by assuming linearity. The authors should consider nonlinear methods (e.g., kernel ridge regression or neural network mappings) to explore potential higher-order relationships and verify robustness (a minimal sketch of such a check appears below).
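To make the self-controlled comparison in point 3 concrete, a minimal sketch is given below. It assumes time-aligned fMRI responses and precomputed layer-wise features from the base and instruction-tuned checkpoints of the same architecture; all function and variable names are hypothetical and not part of the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def layerwise_alignment(layer_feats, fmri, alphas=np.logspace(0, 4, 9)):
    """Mean held-out voxel correlation per layer for one model checkpoint.

    layer_feats: dict {layer_name: (n_TRs, n_features) array}
    fmri:        (n_TRs, n_voxels) BOLD matrix, time-aligned to the features.
    """
    scores = {}
    for name, X in layer_feats.items():
        X_tr, X_te, Y_tr, Y_te = train_test_split(X, fmri, test_size=0.2, shuffle=False)
        pred = RidgeCV(alphas=alphas).fit(X_tr, Y_tr).predict(X_te)
        # Pearson r per voxel on held-out time points, averaged over voxels
        r = [np.corrcoef(pred[:, v], Y_te[:, v])[0, 1] for v in range(Y_te.shape[1])]
        scores[name] = float(np.nanmean(r))
    return scores

# Hypothetical usage: identical architecture, only the checkpoint differs.
# base  = layerwise_alignment(base_checkpoint_feats, fmri)
# tuned = layerwise_alignment(tuned_checkpoint_feats, fmri)
# gains = {k: tuned[k] - base[k] for k in base}  # which layers benefit most from tuning
```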
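For point 4, the proposed semantic-similarity robustness test and clustering analysis could be sketched as follows, assuming one text-embedding vector and one pooled model activation vector per instruction, plus an a priori category label per instruction (names are again hypothetical):

```python
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import adjusted_rand_score

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_robustness(text_embs, model_reps):
    """Spearman correlation between instruction-level semantic similarity
    (text-embedding cosine) and model representation similarity (layer-vector cosine).
    Both arguments are dicts {instruction string: 1-D vector} with identical keys."""
    pairs = list(combinations(text_embs.keys(), 2))
    sem = [cosine(text_embs[a], text_embs[b]) for a, b in pairs]
    rep = [cosine(model_reps[a], model_reps[b]) for a, b in pairs]
    return spearmanr(sem, rep)  # a strong positive rho suggests semantic, not surface, sensitivity

def cluster_against_categories(model_reps, category_labels, n_clusters):
    """Hierarchically cluster instruction-induced representations and compare the
    assignment to a priori task categories via the adjusted Rand index."""
    names = list(model_reps.keys())
    X = np.stack([model_reps[n] for n in names])
    assign = fcluster(linkage(pdist(X, metric="cosine"), method="average"),
                      t=n_clusters, criterion="maxclust")
    return adjusted_rand_score([category_labels[n] for n in names], assign)
```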
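For point 10, one inexpensive nonlinearity check could look like the sketch below, assuming model features X (time points by features) and a single voxel's response vector y (hypothetical names); a consistent gain of the kernel map over the linear map would suggest higher-order structure that ridge regression misses.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

def linear_vs_kernel(X, y, alphas=(1.0, 10.0, 100.0)):
    """Best cross-validated R^2 for a linear ridge map vs. an RBF kernel ridge map."""
    out = {}
    for name, make in [("linear_ridge", lambda a: Ridge(alpha=a)),
                       ("kernel_rbf", lambda a: KernelRidge(alpha=a, kernel="rbf"))]:
        # keep the best regularization strength per mapping type
        out[name] = max(cross_val_score(make(a), X, y, cv=5, scoring="r2").mean()
                        for a in alphas)
    return out
```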
See weaknesses. |
Fully AI-generated |
|
Instruction-Tuned Video-Audio Models Elucidate Functional Specialization in Brain |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This study uses instruction-tuned MLLMs to model multimodal human brain activity during audiovisual movie viewing. They find that instruction tuning improves the brain predictivity of both video and audio MLLMs. Instruction tuning improves video model predictivity across the brain, but audio improvements are limited to auditory cortex. The results also show how different types of instruction tuning differentially improve prediction across cortex, revealing known functional preferences across the brain. Finally, the models also reproduce the layer-wise hierarchy, with early layers most predictive of sensory regions and later layers predicting higher-level brain areas.
This is an interesting and timely paper showing novel and potentially more interpretable ways to improve modeling of multimodal brain activity.
The modeling results reproduce known feature tuning and cortical hierarchies.
This study seems to open the door to many future applications for both neuroscience and neuroengineering.
The paper is comprehensive and clearly written.
While the results are interesting and novel, it is not entirely clear what is at stake: is the main goal of this work to better understand the brain, or to enable future applications (and if so, which)? It is easy to imagine future benefits of this work, particularly given the differential effects of different types of instruction tuning, but it is not entirely clear how to move beyond mapping already known brain function.
Overall the figures are quite clear, but it is hard to keep track of the different models/instructions. I wonder if Fig 2 could be labeled not only with the model name but also with the model type (video, audio, instruction-tuned, not instruction-tuned, etc.). The color coding already seems to follow this grouping.
Relatedly, it is hard to differentiate the colors in Fig 3 (this may be unavoidable given the large number of categories). A table or graphic listing the different instructions and their task groupings, colored in the same way, may help.
As a smaller point: the intro lays out two types of multimodal brain models (for unimodal versus multimodal stimuli). While these are clearly different applications, the summary suggests they are largely similar (multimodal models better than unimodal models). This section could be streamlined, and I don't think it is entirely accurate, as the benefits of multimodal modeling are usually minimal for unimodal stimuli.
How should we interpret the finding that video models are the most predictive of language and auditory regions?
How can we disentangle the role of more training/data from the specific contribution of instruction tuning?
How does cross-subject predictivity compare to the model predictivity shown in Fig 2?
Are the whole-brain results computed over all cortical voxels? Given that some movies are repeated, it may make sense to restrict this analysis to only reliable voxels. |
Fully human-written |
|
Instruction-Tuned Video-Audio Models Elucidate Functional Specialization in Brain |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper investigates the alignment between instruction-tuned multimodal large language models (IT-MLLMs) for video and audio and human brain activity. Using fMRI data from participants watching naturalistic movies (video with audio), the authors employ a voxel-wise encoding methodology to compare representations from six video and two audio IT-MLLMs against non-instruction-tuned, in-context learning, and unimodal models.
- The paper addresses a crucial and timely question at the intersection of foundation models and neuroscience: Do instruction-tuned models, which are becoming the dominant paradigm, process information in a way that is more aligned with the human brain? This work provides strong affirmative evidence for the video modality.
- The use of naturalistic, multimodal stimuli (movies with audio) is a significant strength. This is a major step beyond previous work that often relied on unimodal stimuli (static images or text), allowing for a more ecologically valid assessment of multimodal processing.
- The paper finds that audio IT-MLLMs (Qwen-2.5-Audio, Kimi-Audio) provide only limited gains, significantly underperforming their video counterparts. The discussion acknowledges this and attributes it to potential differences in training data or objectives. However, the analysis could more deeply explore why this is the case. Is the audio stream simply less informative for this stimulus, or are current audio MLLMs genuinely less "brain-like" in their representations?
- The paper groups ICL models (Qwen-2.5-Omni, InternVL) as a baseline. While IT-MLLMs do outperform them, the conceptual line between a model following an in-context prompt and a model following an instruction it was tuned on is slightly blurred. A more direct discussion on what specific properties instruction-tuning adds beyond the zero-shot task-following capabilities of ICL models would strengthen the argument.
- The paper misses some good video LLM works, e.g., Kimi-VL and Seed1.5 VL.
Please see weaknesses. |
Lightly AI-edited |