LaVCa: LLM-assisted Visual Cortex Captioning
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper introduces LaVCa, a novel framework that leverages large language models (LLMs) to generate natural-language captions describing the selectivity of individual fMRI voxels in the human visual cortex. Although the proposed framework appears complex, the only trainable component is a standard voxel-wise ridge regression model, while all other modules remain frozen. Consequently, the methodological novelty lies primarily in the system design rather than in model learning or representation innovation.
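For concreteness, the trainable component can be sketched as standard voxel-wise ridge regression on frozen CLIP features (a minimal sketch only; the scikit-learn call, file names, and the single shared alpha are my assumptions, not the authors' exact implementation):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical inputs:
# X: (n_images, d) frozen CLIP embeddings of the training stimuli
# Y: (n_images, n_voxels) measured fMRI responses (e.g., NSD betas)
X = np.load("clip_embeddings_train.npy")
Y = np.load("voxel_responses_train.npy")

# One multi-output ridge fit; each voxel receives its own weight vector.
model = Ridge(alpha=1.0)  # in practice the regularization is typically tuned per voxel
model.fit(X, Y)

# Predicted responses (n_test, n_voxels) for held-out images.
Y_pred = model.predict(np.load("clip_embeddings_test.npy"))
```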
* The paper proposes a unique and well-motivated pipeline that applies LLMs to the problem of voxel-wise captioning. Overall, the paper is very well written, clearly organized, and easy to follow.
* The captions generated by LaVCa quantitatively capture more detailed properties than those of the existing method. Both inter-voxel and intra-voxel analyses are thorough and supported by quantitative evidence.
* The approach relies heavily on the output quality of LLMs. While the pipeline is appealing, the lack of systematic ablation across different LLMs, prompts, or hyper-parameter settings limits the generalizability of the results.
* The paper does not investigate how the choice of the external image dataset affects performance. All experiments rely on the OpenImages dataset.
* In BrainSCUBA, caption generation relies on a nearest-neighbor search operation. LaVCa uses a similar operation, only applied to an external dataset, which limits the methodological novelty of this step.
* LaVCa uses LLMs to explain voxel responses in natural language. Several recent works have explored generating natural-language descriptions directly from brain activity; although they focus on decoding rather than encoding, they share similar techniques, namely alignment between model and brain representations. The current manuscript does not explicitly discuss how LaVCa complements or differs from existing brain-to-text work such as [1–2], and without such discussion readers may conflate LaVCa with decoding frameworks.
[1] Bridging the Gap between Brain and Machine in Interpreting Visual Semantics: Towards Self-adaptive Brain-to-Text Decoding, ICCV 2025.
[2] MindGPT: Interpreting What You See with Non-invasive Brain Recordings, IEEE TIP 2025.
* Could the authors clarify whether the optimal image selection step could induce semantic biases that might affect voxel interpretations?
* The manuscript reports comparison results with only one method; it is unclear whether other more recent or advanced methods could be included for comparison.
* In Figure A3, how does a single voxel generate multiple captions? Is it based on different activation levels? The horizontal axis title in the upper right should be "caption."
Lightly AI-edited
LaVCa: LLM-assisted Visual Cortex Captioning
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
LaVCa introduces a data-driven approach for describing the selectivity of individual human brain voxels in the visual cortex using captions generated by large language models (LLMs). The proposed method trains encoding models for fMRI activity in response to images, isolates the most selective images for each voxel, uses a multimodal LLM for captioning, and finally composes concise keyword-driven voxel captions with an LLM and a keywords-to-sentence model.
Compared to prior methods, LaVCa provides richer, more interpretable, and more accurate natural-language descriptions of what each voxel encodes, revealing greater diversity and finer-grained functional specialization in the visual cortex.
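In code, the image-selection step of this pipeline amounts to ranking an external image set by each voxel's predicted response (a sketch under the same general assumptions; the weight matrices, file names, and image-set size are illustrative, not taken from the paper):

```python
import numpy as np

# Hypothetical inputs:
# X_ext: (n_external, d) frozen CLIP embeddings of the external images (e.g., OpenImages)
# W: (d, n_voxels) and b: (n_voxels,) learned ridge weights and intercepts
X_ext = np.load("clip_embeddings_openimages.npy")
W = np.load("ridge_weights.npy")
b = np.load("ridge_intercepts.npy")

pred = X_ext @ W + b                    # (n_external, n_voxels) predicted responses
k = 10                                  # illustrative per-voxel image-set size
top_k = np.argsort(-pred, axis=0)[:k]   # indices of each voxel's "optimal images"
# top_k[:, v] are the images that are then captioned and summarized for voxel v.
```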
The strengths of this paper lie in its creative use of LLMs to generate natural-language captions that accurately describe voxel-level visual selectivity in the human cortex, surpassing prior methods in both interpretability and diversity of representations.
The approach is robust across benchmarks, clearly demonstrates broader and finer-grained conceptual tuning in visual areas, and is built with modularity and reproducibility in mind, thereby enhancing its impact on both neuroscience and neuroAI.
- While the captions are more diverse, the method often omits very local, fine-grained details in face-selective or highly specialized voxels. Is this a consequence of the summarization steps and of current limitations of the captioning models?
- Some hyperparameters (e.g., the number of keywords and the image-set size) influence accuracy, and there is limited exploration of more structured or hierarchical compositional strategies for capturing multi-concept or multimodal selectivity.
- Benchmark comparisons focus primarily on accuracy and diversity but lack direct behavioral validation. How these captions relate to actual perceptual or cognitive phenomena in human subjects should be clarified.
The following are some questions/suggestions for authors:
- Can the pipeline be expanded to provide hierarchical or compositional captions, reflecting not just multiple keywords but structured relationships (object-action or scene-context)?
- How does LaVCa generalize to multimodal responses, including voxels sensitive to auditory or language stimuli? Are the methods readily adaptable, or are critical modifications needed?
- How reproducible are the identified semantic clusters across large populations? Do the same diversity patterns emerge in different datasets or subject groups?
- Could direct behavioral validation (e.g., relating captioned selectivity to subject perception, imagery, or recall tasks) link voxel captions to cognition and perceptual experience more strongly?
Fully AI-generated
LaVCa: LLM-assisted Visual Cortex Captioning
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes a novel LLM-based pipeline for fMRI interpretation at the voxel level. The authors leverage the Natural Scenes Dataset (NSD) and use encoding weights learned from a VLM on NSD to predict fMRI responses to a novel set of images. They select the image set that produces the highest predicted response for each voxel and use LLM keyword extractors / sentence composers to provide rich and interpretable descriptions for each voxel.
The method leverages state-of-the-art AI methods to improve not only prediction but also interpretability of fMRI data
The model generalizes fMRI responses to held-out images to generate rich descriptions of each voxel
The main figures replicate prior results of known neural tuning and may be a source of hypothesis generation for future studies.
The paper is well written with clear visuals
The main weakness is that it is unclear what the advantage of LaVCa is relative to using the direct fMRI responses from NSD. Given the large set of images shown in that fMRI dataset, which were drawn from MS-COCO, it seems possible to find the NSD image (or set of images) that drives the highest response in each voxel and apply the same LLM extraction / sentence composition to the captions of those images. (This could be done in a cross-validated / encoding framework if desired.) In most of the paper, these captions are treated as the "ground truth," and it is unclear what the advantage of the first part of the voxel-preference pipeline is. This major limitation significantly tempers claims about prediction accuracy (see below as well) and interpretability.
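Concretely, the baseline I have in mind would look roughly like the following (a sketch only; `load_coco_captions` and `summarize_with_llm` are hypothetical stand-ins for the caption source and for the paper's own keyword-extraction / sentence-composition stage):

```python
import numpy as np

betas = np.load("nsd_betas.npy")        # (n_images, n_voxels) measured NSD responses (hypothetical file)
nsd_captions = load_coco_captions()     # hypothetical helper returning one MS-COCO caption per NSD image

k = 10                                  # illustrative number of top images per voxel
top_k = np.argsort(-betas, axis=0)[:k]  # each voxel's most activating *measured* images

voxel_captions = []
for v in range(betas.shape[1]):
    caps = [nsd_captions[i] for i in top_k[:, v]]
    # Reuse the existing keyword-extraction / sentence-composition stage on these captions.
    voxel_captions.append(summarize_with_llm(caps))  # hypothetical stand-in for that stage
```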
While the paper is well written overall, the notion of prediction accuracy as the cosine similarity between the generated sentence or image and the original caption/image is not typical and not always clearly explained. For example, at a quick glance, Figs. 2 and 4 could be interpreted as reporting the more traditional encoding-accuracy score (the correlation between predicted and actual fMRI responses). More generally, this notion of prediction accuracy is quite complex, as it involves many steps of generative AI that remove it from the original data.
As a small point, the order of Fig. 2 and Fig. 3 should be swapped.
How do the voxel interpretations generated by keyword extractor / sentence composer directly on the captions (or an even simpler summary statistic of the captions) compare to the interpretations generated by LaVCa?
What are the scores of the encoding model? How does the choice of VLM encoding model affect the captions / how can we assess that this backbone of LaVCa is accurate?
How can this method be extended to other datasets? What scale / diversity of stimuli is needed?
Fully human-written
LaVCa: LLM-assisted Visual Cortex Captioning
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper investigates the selectivity of voxels in the human visual cortex by generating a descriptive caption for each voxel.
To briefly summarize what the authors do:
1. First, they build voxel-wise encoding models using embeddings from CLIP.
2. They identify the "optimal images" that most strongly activate a specific voxel.
3. Descriptive captions for these images are generated with MiniCPM-V.
4. They use GPT-4o to extract keywords from the captions and compose them into a final, concise summary.
The authors quantify the accuracy of their method in two ways:
1. They use Sentence-BERT to create text embeddings for each generated voxel caption and for the captions of all images in the test set (NSD). For a given voxel, they calculate the cosine similarity between its caption's embedding and every test image caption's embedding. This similarity score is treated as the predicted brain activity. They compare this predicted activity vector to the voxel's actual measured activity using Spearman's rank correlation. A higher correlation indicates a more accurate caption.
2. A similar evaluation in which the voxel captions are turned into images using FLUX.
The authors find that their method is more accurate than BrainSCUBA.
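For clarity, the first evaluation can be written out as follows (a minimal sketch assuming the `sentence-transformers` and `scipy` packages; the specific SBERT checkpoint and the toy data are illustrative, not the authors'):

```python
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the Sentence-BERT model used

# Toy inputs: one generated voxel caption, captions of the test images, and the voxel's measured activity.
voxel_caption = "close-up views of human faces"
test_captions = ["a dog running on a beach", "two people talking at a table", "a portrait of a smiling woman"]
measured_activity = np.array([0.1, 0.4, 0.9])

cap_emb = sbert.encode([voxel_caption], normalize_embeddings=True)   # (1, d), unit norm
test_embs = sbert.encode(test_captions, normalize_embeddings=True)   # (n_test, d), unit norm

pred_activity = (test_embs @ cap_emb.T).ravel()        # cosine similarity acts as "predicted" activity
rho, _ = spearmanr(pred_activity, measured_activity)   # higher rho = more accurate caption
```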
The problem is a very interesting one, and understanding the selectivity distribution across the visual cortex can be of medical and scientific interest.
The approach is simple and consists of modular components that could be swapped out with future improvements in model quality.
1. It is not clear to me why the authors use an encoder to rank images in the first step. Each voxel has ground-truth most-activating images (from the fMRI beta weights), so this step seems entirely unnecessary and would degrade the model by replacing the real-data ranking with a predicted one.
2. The keyword-extraction stage (Figure 3d) seems unnecessary. It is not clear that voxels would respond only to entities in an image, as opposed to actions or adjectives.
3. The in-text citation format across the paper is very problematic. For example, in Line 135 the citations should be parenthetical rather than in-text.
1. Line 195: do the authors normalize the image embeddings to unit norm before fitting the encoding model?
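To make the question concrete, I mean something like the following normalization before the ridge fit (variable and file names are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.load("clip_embeddings_train.npy")          # (n_images, d), hypothetical file
X = X / np.linalg.norm(X, axis=1, keepdims=True)  # project each embedding onto the unit sphere
Ridge(alpha=1.0).fit(X, np.load("voxel_responses_train.npy"))
```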
Fully human-written