ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (33%) | 6.00 | 5.00 | 1747 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 2 (67%) | 7.00 | 3.50 | 2814 |
| Total | 3 (100%) | 6.67 | 4.00 | 2458 |
Review 1

Title: Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper introduces Perceptual-Initialization (PI), which initializes the CLIP vision encoder using human perceptual similarity data (NIGHTS triplets) before standard large-scale image-text contrastive training on YFCC15M. Compared with random initialization and with post-hoc perceptual fine-tuning, the proposed method yields consistent zero-shot gains across 29 classification and 2 retrieval benchmarks. The authors argue that embedding human perceptual priors at the start of training leads to faster convergence and more human-aligned representations.

Strengths:
1. Novel use of human perceptual priors as initialization rather than alignment fine-tuning.
2. Comprehensive evaluation over diverse datasets shows consistent positive gains.
3. Very low additional compute cost.
4. Clear comparison showing that late perceptual fine-tuning disrupts alignment, opening a new direction for human- or brain-aligned pretraining.

Weaknesses:
1. No experiments using random or pseudo perceptual triplets to isolate the contribution of human perceptual structure.
2. The approach is validated only on NIGHTS; applicability to richer datasets remains untested.
3. No probing or visualization is provided to show how perceptual initialization changes the internal feature space or similarity structure compared to the baseline.

Questions:
1. Could the authors analyze which visual attributes benefit most from perceptual initialization (e.g., texture vs. shape bias)?
2. Does PI primarily affect the early layers or propagate to higher-level semantics during contrastive training?
3. How much perceptual data is necessary—does performance saturate after a certain fraction of NIGHTS triplets?
4. Could PI be combined with supervised or robust-CLIP initializations, or would they interfere?

EditLens Prediction: Fully AI-generated
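For concreteness, the Stage-1 objective this review describes (aligning the vision encoder to human similarity judgments on NIGHTS triplets before any image-text training) can be sketched roughly as below. This is a minimal illustration assuming NIGHTS-style 2AFC triplets and a simple margin hinge on cosine distance; the encoder, function names, and exact loss are assumptions rather than the paper's verified implementation.

```python
# Minimal sketch of a Stage-1 "perceptual initialization" objective on
# NIGHTS-style 2AFC triplets (reference, image A, image B, human vote).
# All names are hypothetical; the paper's exact formulation may differ.
import torch
import torch.nn.functional as F


def perceptual_triplet_loss(vision_encoder, ref, img_a, img_b, vote, margin=0.05):
    """vote[i] = 0 if humans judged A closer to the reference, 1 if B."""
    z_ref = F.normalize(vision_encoder(ref), dim=-1)
    z_a = F.normalize(vision_encoder(img_a), dim=-1)
    z_b = F.normalize(vision_encoder(img_b), dim=-1)
    d_a = 1.0 - (z_ref * z_a).sum(dim=-1)  # cosine distance to candidate A
    d_b = 1.0 - (z_ref * z_b).sum(dim=-1)  # cosine distance to candidate B
    d_pos = torch.where(vote == 0, d_a, d_b)  # distance to the human-preferred image
    d_neg = torch.where(vote == 0, d_b, d_a)  # distance to the rejected image
    # hinge: the preferred image should be closer by at least the margin
    return F.relu(d_pos - d_neg + margin).mean()
```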
Review 2

Title: Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper presents a two-stage training paradigm for VLMs such as CLIP, arguing for the benefits of perceptual initialization (PI) over random initialization. Further, it argues that incorporating PI at the initialization phase is more advantageous than post-hoc fine-tuning. The main contribution is demonstrating that this early-stage alignment provides a stronger foundation for general-purpose VLM intelligence. PI models show significant zero-shot performance improvements, without any task-specific fine-tuning, across a comprehensive suite of 29 classification and two retrieval benchmarks. The PI approach consistently outperforms a randomly initialized baseline, and a direct comparison shows that post-hoc perceptual fine-tuning is catastrophic to V-L alignment.

Strengths:
- **Originality**: Leveraging supervised human behavioural data as a foundational inductive bias in model initialization is a novel idea that opens a new research direction. The work provides a structured solution that converts the often-ignored variance of random initialization into a principled prior.
- **Significance**: The PI paradigm is the core strength of the paper. It uses supervised human perceptual data to initialize the VLM parameters prior to large-scale pretraining, providing a potent human-aligned inductive bias right from time t=0.
- **Quality**: The provided results empirically validate the PI hypothesis, showing consistent performance gains and outperforming the baseline on 23/29 classification tasks. Further, the paper shows how post-hoc fine-tuning leads to catastrophic forgetting.
- **Writing**: The argument for PI is presented logically, starting from the known "path-dependency" of deep networks and the variance of random seeds, making the motivation for a principled initialization intuitive.

Weaknesses:
- **Limited scope of the prior**: Only the vision encoder is initialized with PI; the text encoder is still randomly initialized and trained from scratch. What is the reason for this choice in the experiments? A CLIP-like model operates on a shared latent space of the vision and text modalities. The paper could be strengthened by exploring a complementary initialization of the text encoder, to see whether a fully PI-initialized model provides synergistic benefits.
- **Perceptual loss**: The core of the PI benefit lies in the perceptual loss function, which is derived from previous work. There is little evidence or interpretation (apart from the final results) of how this loss function works or fails in the assumed contexts: pretraining vs. post-hoc fine-tuning.
- **Mechanistic analysis**: While the efficacy is proven, the paper does not delve into why the inductive bias remains so effective after 32 epochs, whereas post-hoc fine-tuning fails. This theoretical insight is critical for judging whether the idea can be leveraged in different models or scenarios. Many of the questions in the **Questions** section cannot be answered from the given content of the paper.
- **Limited evaluation**: The current training uses 15M image-text pairs; while this is substantial, SOTA VLMs are often trained on hundreds of millions or billions of pairs. Will the proportional gains from PI persist, diminish, or grow continuously (though a limited scaling law is provided in the paper)? In failure cases, how should PI be addressed?

Questions:
- Do the PI weights remain closer to the perceptual optimum throughout training?
- How does the learned image-text alignment module interact differently with the PI-derived features versus the baseline features? How does the shared representation space differ?
- Have any preliminary experiments been conducted to determine the minimal amount of human perceptual data required in Stage 1 to achieve a statistically significant positive gain?
- Why/how does this loss function work?
- Can the authors analyze the evolution of the logit scaling parameter ($\tau$) in Stage 2?
- For failure cases, should the perceptual prior be "re-anchored" at intermediate stages, or perhaps weakened by introducing a temperature parameter to the perceptual loss?

EditLens Prediction: Fully human-written
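Regarding the question about the logit scaling parameter $\tau$: Stage 2 in a setup like this is typically the standard CLIP objective with a learnable logit scale, roughly as sketched below. This is a generic CLIP-style loss assuming normalized embeddings and in-batch contrastive targets; it is not taken from the paper, and the class and argument names are illustrative.

```python
# Sketch of a standard Stage-2 CLIP objective with a learnable logit scale
# (the tau the review asks about). Generic InfoNCE over in-batch pairs;
# class and argument names are illustrative, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClipLoss(nn.Module):
    def __init__(self, init_temperature=0.07):
        super().__init__()
        # learnable log of 1/temperature, as in CLIP; exp() keeps it positive
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / init_temperature).log())

    def forward(self, image_emb, text_emb):
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = self.logit_scale.exp() * image_emb @ text_emb.t()
        labels = torch.arange(logits.size(0), device=logits.device)
        # symmetric cross-entropy: each image matches its paired caption and vice versa
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```

Because the log-scale is a trainable parameter, tracking its value over Stage-2 epochs is straightforward, which is presumably what the reviewer's question about the evolution of $\tau$ is asking for.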
Review 3

Title: Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment
Soundness: 3: good
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper introduces a new visual representation learning scheme called Perceptual-Initialization, which trains the visual encoder to match human preference before contrastive learning. Specifically, the human preference alignment is achieved using a triplet contrastive loss on the NIGHTS dataset, and the resulting model weights are used as the initialization of the subsequent contrastive learning. PI achieves zero-shot performance improvements on a variety of image classification and retrieval benchmarks compared to the baseline CLIP.

Strengths:
- The proposed method is novel, simple, yet effective. The promising results of the paper can encourage follow-up research exploring other initialization strategies.
- Results on zero-shot image classification and retrieval tasks demonstrate that PI scales as the data volume increases, indicating the method's potential in large-scale training.
- The paper is well organized and nicely presented. The ending section points out remaining challenges faithfully and offers valuable insights, strengthening its contribution to the field.

Weaknesses:
- The proposed method limits its scope to the initialization of CLIP-type models, even though the human preference alignment is independent of the text encoder. The authors could add experiments on other visual backbones, such as vanilla ViTs, to fully explore the potential of the method.

Questions:
- As mentioned in the weaknesses, I'm wondering whether PI could also benefit other types of visual pretraining.
- Additionally, does a model trained using PI demonstrate stronger transferability compared to normal training?

EditLens Prediction: Fully human-written
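The initialization handoff this review describes (perceptual pretraining first, then CLIP training starting from those weights, rather than perceptual fine-tuning afterwards) might look roughly like the sketch below. The torchvision ViT-B/16 backbone and the checkpoint filename are stand-ins, not the authors' actual setup.

```python
# Sketch of the initialization handoff: Stage-1 perceptual weights become the
# starting point of Stage-2 CLIP training (rather than being applied post hoc).
# The torchvision ViT-B/16 backbone and the checkpoint path are stand-ins.
import torch
from torchvision.models import vit_b_16

# Stage 1: train the vision encoder on perceptual triplets (loop omitted;
# see the triplet-loss sketch earlier on this page), then save the weights.
vision_encoder = vit_b_16(weights=None)
# ... optimize vision_encoder with a perceptual triplet loss on NIGHTS ...
torch.save(vision_encoder.state_dict(), "perceptual_init.pt")

# Stage 2: CLIP image-text training starts FROM the perceptual weights,
# instead of from random initialization or perceptual fine-tuning afterwards.
clip_vision_tower = vit_b_16(weights=None)
clip_vision_tower.load_state_dict(torch.load("perceptual_init.pt"))
# ... joint image-text contrastive training (e.g., on YFCC15M) proceeds from here ...
```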