ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (25%) | 6.00 | 4.00 | 5936 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 3 (75%) | 6.00 | 2.67 | 1901 |
| Total | 4 (100%) | 6.00 | 3.00 | 2910 |
Individual Reviews
Review 1

Title: VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Ratings: Soundness: 3 (good) | Presentation: 3 (good) | Contribution: 3 (good) | Rating: 6 (marginally above the acceptance threshold) | Confidence: 4 (confident in the assessment, but not absolutely certain)

Summary: This paper presents VL-JEPA, a vision-language model formulated on the Joint Embedding Predictive Architecture (JEPA). Instead of traditional autoregressive token-level generation, VL-JEPA learns to predict target text embeddings directly in continuous space, thereby abstracting away surface linguistic variability and focusing on semantic representation. The model demonstrates improvements in efficiency and sample complexity compared to classical token-generative VLMs, particularly in zero-shot video understanding, retrieval, and real-time streaming scenarios. Extensive experiments benchmark VL-JEPA against leading models, with ablation studies and scalability analyses provided.

Strengths:
1. The paper's core contribution, applying a predictive JEPA-style objective to the cross-modal VL problem, is highly novel. Moving VL learning from the discrete token space to a continuous semantic space is a well-motivated and promising direction for addressing the known efficiency and latency bottlenecks of standard generative VLMs.
2. The "selective decoding" mechanism (Sec 4.3) is a standout contribution. The idea of monitoring the latent embedding stream for semantic variance and only triggering the expensive text decoder when a significant shift is detected is an elegant and practical solution for real-world, low-latency video streaming applications (a minimal sketch of such a trigger follows this list).
3. The model achieves state-of-the-art (SOTA) results across a wide and diverse range of video-language benchmarks, demonstrating the effectiveness and generalizability of the learned representations for both zero-shot and finetuned tasks.
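To make the selective-decoding idea in strength 2 concrete, below is a minimal sketch of a drift-threshold trigger of the kind the reviewer describes. The `SelectiveDecoder` class, the cosine-distance criterion, and the 0.3 threshold are illustrative assumptions for this sketch, not the paper's actual Sec 4.3 implementation.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity between two embedding vectors."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class SelectiveDecoder:
    """Run the (expensive) text decoder only when the predicted embedding
    has drifted far enough from the embedding at the last decode."""

    def __init__(self, decode_fn, threshold: float = 0.3):
        self.decode_fn = decode_fn   # hypothetical callback: embedding -> text
        self.threshold = threshold   # assumed drift threshold, not from the paper
        self.last_decoded = None     # embedding at the last triggered decode

    def step(self, z_pred: np.ndarray):
        """Call once per frame with the predicted text embedding."""
        if self.last_decoded is None or \
                cosine_distance(z_pred, self.last_decoded) > self.threshold:
            self.last_decoded = z_pred
            return self.decode_fn(z_pred)  # decode only on a semantic shift
        return None                        # otherwise skip decoding this frame
```

Under such a scheme the text decoder fires only when the predicted embedding has drifted past the threshold, rather than once per frame, which is where the latency savings discussed in this review would come from.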
Weaknesses:
1. While the paper emphasizes that the shift from token space to embedding space simplifies the target distribution, it provides no rigorous analysis of the measurability and discriminability of the resulting semantic embedding space. To substantiate this claim, the authors should provide supplementary analysis, such as: (i) visualization or quantitative studies of the embedding space's structure (e.g., its clustering properties and separability) to demonstrate the claimed simplification; (ii) a theoretical elucidation of the target distribution's compressibility, perhaps through the lens of the Information Bottleneck principle; (iii) a crucial ablation study comparing the performance impact of using different embedding spaces (e.g., from CLIP, SONAR, BERT-base) as the prediction target.
2. The L2 loss implicitly assumes a unimodal, deterministic target distribution. Real-world VL tasks are full of "one-to-many" ambiguities (e.g., "the light went out" vs. "the room became dark"): both are valid but semantically distinct answers. The L2 loss will penalize the model for predicting either correct answer, forcing the predictor to regress toward a non-existent "average" embedding located somewhere between the two valid target points. This regression to the mean will likely result in semantically "blurry", generic, or even nonsensical decoded outputs (a small numeric illustration follows this review). The paper completely fails to address this fundamental limitation.
3. Unfair and Misleading Efficiency Comparison: The core comparison in §4.2, which pits a 0.5B VL-JEPA predictor against a 1B VLM baseline, is fundamentally biased. The authors claim superior parameter efficiency, but their 0.5B predictor is not a neutral model; it is "cherry-picked" from the top-most, most semantically potent layers (L8-16) of the 1B Llama model. The paper's own ablation study (Table 5) confirms this bias, showing that these top layers (45.20% accuracy) are vastly superior to the bottom layers (35.86%). This is not an apples-to-apples comparison and does not prove the framework's efficiency, but rather the known fact that top-level LLM layers handle more complex semantics.
4. Lack of Statistical Rigor: The paper suffers from a critical lack of statistical validation. All reported results, including all SOTA claims in the tables and the key efficiency curves in Figures 3 and 4, appear to be point estimates from a single training run. No error bars, standard deviations, or significance tests are provided. This makes it impossible to determine whether the reported gains are statistically significant or merely the artifact of a single, fortunate random seed, which undermines the scientific validity of all conclusions.
5. Missing Ablation on the Critical Y-Encoder Component: The entire methodology is critically dependent on the properties of the frozen y-encoder (SONAR). The paper fails to provide the most crucial ablation study: testing the VL-JEPA framework with different frozen text encoders (e.g., CLIP's text encoder, or a standard Sentence-BERT). Without this, the paper's claims are not generalizable. It is impossible to know whether the authors have discovered a robust, general-purpose framework or simply a special-case solution that is uniquely and luckily compatible with the SONAR embedding space.

Questions:
1. How does the L2 loss framework handle inherently multi-modal or ambiguous targets where multiple, semantically distinct ground truths exist? Does this not lead to regression toward a semantically blurry "average" embedding?
2. Can you provide quantitative evidence (e.g., t-SNE, cluster variance) that the embedding space actually simplifies the target distribution (e.g., maps "light went out" and "room is dark" to nearby points) compared to a standard token space?
3. How robust is VL-JEPA to the choice of the frozen y-encoder? What is the performance impact of replacing the SONAR encoder with a standard CLIP or Sentence-BERT encoder?
4. Can you clarify the 2.85x saving calculation (is it 1 Hz / 0.35 Hz)? More critically, can you provide any statistical validation (e.g., mean and std. dev. over 3+ runs) for your key SOTA claims, or at least for the comparison in Fig 4?
5. Loss Function: Why was L2 loss chosen over a cosine similarity loss? Cosine loss would ignore magnitude and focus only on direction, which might be a more robust objective. Was this tested?

EditLens Prediction: Fully AI-generated
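The regression-to-the-mean concern in weakness 2 (and questions 1 and 5) can be illustrated numerically: for two equally plausible but semantically distinct unit-norm targets, the L2-optimal prediction is their midpoint, which is equidistant from both answers and has reduced norm. The random 256-dimensional vectors below are stand-ins for sentence embeddings, not outputs of SONAR or any model from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two distinct, equally valid unit-norm target embeddings, standing in for
# e.g. "the light went out" vs. "the room became dark".
t1 = rng.normal(size=256); t1 /= np.linalg.norm(t1)
t2 = rng.normal(size=256); t2 /= np.linalg.norm(t2)

# The prediction that minimizes the expected L2 loss
# 0.5 * ||p - t1||^2 + 0.5 * ||p - t2||^2 is the mean of the two targets.
p = 0.5 * (t1 + t2)

cos = lambda a, b: float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print("||p||      =", round(float(np.linalg.norm(p)), 3))  # < 1: shrunk toward the origin
print("cos(p, t1) =", round(cos(p, t1), 3))                # equally far from both
print("cos(p, t2) =", round(cos(p, t2), 3))                # valid targets
```

A cosine objective would remove the norm shrinkage but not the averaging itself, so whether such a midpoint still decodes to a sensible caption is exactly what the reviewer asks the authors to analyze.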
Review 2

Title: VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Ratings: Soundness: 3 (good) | Presentation: 3 (good) | Contribution: 3 (good) | Rating: 6 (marginally above the acceptance threshold) | Confidence: 3 (fairly confident; math/other details were not carefully checked)

Summary: This paper introduces VL-JEPA, a non-autoregressive vision-language model that predicts target text embeddings from visual tokens and a query. Inference is conducted in the embedding space. This non-autoregressive nature allows for selective and low-latency decoding at inference time, while still exhibiting strong zero-shot capability.

Strengths:
1. The paper is well-written and easy to follow.
2. The exploration of a novel architecture that deviates from the current mainstream (autoregressive VLMs) is meaningful and encouraged.
3. The low-latency and selective decoding properties of VL-JEPA make it adaptable to many practical applications such as robots and wearable devices.

Weaknesses:
1. The core idea of VL-JEPA, predicting the target response in the embedding space, is very similar to LCM [1]. The training objective is also the same. However, the authors did not discuss the similarity and relationship with LCM.
2. The encoder-decoder architecture may impair the model's ability to understand long queries and generate long responses. The model also cannot perform multi-round interaction.
3. The evaluations use accuracy and CIDEr as the main metrics and do not consider the readability of the generated responses. A user study or LLM-based evaluation is needed.
4. In Table 2, the bold metrics are not the best ones.

[1] Barrault, Loïc, et al. "Large concept models: Language modeling in a sentence representation space." arXiv preprint arXiv:2412.08821 (2024).

Questions:
1. Most datasets used in this paper do not contain question-style queries. What are the queries used for training and evaluation?
2. Are you using any queries when you perform CLIP-like zero-shot evaluation?
3. Since your training data covers only a limited number of tasks, have you tested the instruction-following capability of the model?
4. It seems that the current VL-JEPA can only perform well on some standard video understanding tasks such as action recognition and step recognition. For such tasks, what is the advantage of an encoder-decoder model like VL-JEPA compared with CLIP-style encoder-only models?

EditLens Prediction: Fully human-written
Review 3

Title: VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Ratings: Soundness: 3 (good) | Presentation: 3 (good) | Contribution: 3 (good) | Rating: 4 (marginally below the acceptance threshold) | Confidence: 2 (willing to defend the assessment, but central parts may have been misunderstood; math/other details were not carefully checked)

Summary: This paper presents VL-JEPA, a vision-language model built on top of a JEPA architecture. The model is evaluated on video understanding and world modeling.

Strengths:
1. The paper is clearly written and easy to follow and understand.
2. A new vision-language model leveraging the JEPA architecture instead of regular transformer decoders.
3. Comparable performance to existing transformer-based VLMs, with fewer parameters.
4. Extensive details are given about the training setup and resources.

Weaknesses:
1. The model seems to be focused on video understanding, as most of the training data relate to this task. This raises questions about the comparison to other VLMs that are trained and designed to be more generalist.
2. Experiments focus on only a subset of the use cases of a vision-language model (video understanding). More experiments on other types of tasks would have been appreciated (e.g., MMMU, OCRBench, DocVQA, etc.). If the JEPA architecture is intended to replace transformer-based VLMs, then more generalization experiments are required.
3. The choice of evaluation benchmarks is not well justified. For example, WORLDPREDICTION-WM is not known by the vision-language modeling community. If I'm not mistaken, the paper introducing this benchmark [1] was cited once.

[1] Chen, D., Chung, W., Bang, Y., Ji, Z., & Fung, P. (2025). WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning. arXiv preprint arXiv:2506.04363.

Questions:
1. Why does the benchmarking of this model not follow standard VLM benchmarking suites?
2. Is there a justification for focusing on video understanding?
3. Does VL-JEPA need a pre-training phase? How does the model size scale with the data used for training? The paper states that 64M samples are seen; how much is this in number of tokens?

EditLens Prediction: Fully human-written
Review 4

Title: VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Ratings: Soundness: 4 (excellent) | Presentation: 4 (excellent) | Contribution: 3 (good) | Rating: 8 (accept, good paper) | Confidence: 3 (fairly confident; math/other details were not carefully checked)

Summary: This paper introduces VL-JEPA, a vision-language JEPA model trained by predicting the embeddings of target texts. The authors demonstrate that this results in faster and more efficient training and, for selected tasks, yields state-of-the-art results. The model also obtains impressive zero-shot retrieval scores despite its training paradigm. In addition, a promising approach for selective decoding is presented. Overall, the paper is well written, and it is easy to find the relevant pieces of information.

Strengths:
- The paper introduces (or successfully reapplies) the JEPA architecture to the VL setting.
- It shows notable gains in both training speed and performance on zero-shot video captioning and classification (Fig. 3) while using a well-argued training procedure (JEPA).
- It shows non-trivial adaptation to retrieval and open-label classification, e.g., on YouCook2 and MSR-VTT (Table 6).
- The paper explores underexplored areas in the field by considering alternatives to generative token decoding, resulting in promising decoding strategies (selective decoding) and a reduced number of parameters compared to alternative models (Fig. 3, Tables 2 and 4).

Weaknesses:
- The benchmarks used for evaluation are only briefly introduced. I would have loved to see a more substantial justification for choosing these specifically.
- While it is stated that the model and code will be open-sourced, they could have been shared through existing anonymous platforms.
- In Table 6, VL-JEPA is compared to contrastively trained models, and it is implicitly argued that this is the reason for the subpar performance on some of the tasks. I would have loved to see a contrastive adaptation to check whether this assumption is correct (a minimal sketch of such an adaptation follows this review).

Questions:
I would love to get a more substantial justification for the choice of benchmarks/evaluation datasets.

EditLens Prediction: Fully human-written
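For reference, the contrastive adaptation the reviewer asks for could be as simple as an in-batch InfoNCE loss between predicted and target text embeddings, used in place of (or alongside) the regression objective for the retrieval comparison in Table 6. The function below is a generic sketch under that assumption; the temperature value and the symmetric formulation are choices made here, not details from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(pred: torch.Tensor, target: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric in-batch InfoNCE between predicted and target embeddings.

    pred, target: (batch, dim) tensors; row i of `target` is the positive
    for row i of `pred`, and all other rows in the batch act as negatives.
    """
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = pred @ target.t() / temperature               # (batch, batch) similarities
    labels = torch.arange(pred.size(0), device=pred.device)
    # Average the prediction->target and target->prediction directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

Usage would look like `loss = info_nce(predictor_output, frozen_text_embeddings)` for a batch of paired clips and captions; both tensor names are hypothetical placeholders.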