ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 2 (50%) | 4.00 | 3.00 | 2992 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (25%) | 4.00 | 3.00 | 2836 |
| Fully human-written | 1 (25%) | 6.00 | 3.00 | 1635 |
| Total | 4 (100%) | 4.50 | 3.00 | 2614 |
PairUni: Pairwise Training for Unified Multimodal Language Models

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes PairUni, which links multimodal understanding and generation at both the data and optimization levels. The method first uses the OpenAI o3 model to generate captions for understanding samples and QA pairs for generation samples, then forms retrieved pairs of semantically related data points. The authors further propose Pair-GRPO, which performs optimization on aligned examples to reduce task interference. Experiments conducted on MMMU, MMStar, WISE, GenEval, etc., demonstrate the effectiveness of the proposed method.

Strengths:
1. The motivation of this paper is strong. Balancing understanding and generation in unified vision-language models is challenging and worth studying.
2. The proposed method is novel. It combines a data pairing pipeline with a Pair-GRPO optimization procedure that minimizes interference between heterogeneous tasks.
3. The proposed method is effective. Extensive experiments on WISE, GenEval, MMMU, MMStar, etc., show that it outperforms many prior works (some improvements are small, but many are decent).

Weaknesses:
1. The overall presentation needs improvement and polish; in particular, the figures are not well drawn and do not help readers understand the method well enough.
2. Unified VLMs still seem far worse than understanding-only or generation-only models. The proposed approach represents decent progress, but the gap remains significant.

Questions:
1. The data is generated with the o3 model, which is large and powerful. I wonder whether the proposed approach would scale well when applied to stronger baselines (e.g., Qwen3-VL).

EditLens Prediction: Fully human-written
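For concreteness, below is a minimal sketch of the retrieval-based pairing step this review summarizes (linking each generation sample to a semantically related understanding sample). It assumes precomputed, L2-normalized embeddings; the 0.7 threshold and all function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def retrieve_pairs(gen_embs: np.ndarray, und_embs: np.ndarray,
                   threshold: float = 0.7):
    """Link each generation sample to its nearest understanding sample.

    gen_embs: (G, d) and und_embs: (U, d) L2-normalized embeddings,
    so the dot product below is cosine similarity.
    Returns (gen_idx, und_idx, score) triples above the threshold.
    """
    sims = gen_embs @ und_embs.T                  # (G, U) similarity matrix
    best = sims.argmax(axis=1)                    # nearest neighbor per row
    scores = sims[np.arange(len(best)), best]
    keep = scores >= threshold                    # discard weak matches
    return [(int(g), int(best[g]), float(scores[g]))
            for g in np.flatnonzero(keep)]
```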
PairUni: Pairwise Training for Unified Multimodal Language Models

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper addresses the challenge of task interference during reinforcement learning (RL) fine-tuning of Unified Vision-Language Models (UVLMs), where the heterogeneous nature of understanding and generation tasks leads to conflicting optimization gradients. The authors propose PairUni, a two-part framework to mitigate this issue. At the data level, they introduce a pipeline to create a new dataset, PairUG, which consists of understanding-generation (UG) pairs. These pairs are either "aligned" (understanding and generation data derived from the same source image) or "retrieved" (linking a generation sample to a semantically similar understanding sample). At the optimization level, they propose Pair-GRPO, a variant of Group Relative Policy Optimization that weights the advantage term by the similarity score of the data pair. Experiments on the Janus-Pro model show that PairUni achieves balanced improvements across both multimodal understanding and image generation benchmarks, outperforming strong baselines.

Strengths:
1. The paper effectively identifies and articulates a critical and timely problem in the development of UVLMs: the optimization conflict between understanding and generation tasks during unified RL. The empirical analysis in Figure 1, showing the correlation between gradient cosine similarity and benchmark performance, provides a strong, data-driven motivation for the proposed approach.
2. The proposed PairUni framework is elegant and logically sound. Tackling the problem at both the data level (through structured pairing) and the optimization level (through similarity-weighted advantages) is a comprehensive approach. The distinction between "aligned" and "retrieved" pairs is a practical way to balance supervision quality with data scale.
3. The paper demonstrates consistent and balanced performance gains on a variety of established benchmarks for both understanding (MMMU, MMStar, MME) and generation (WISE, GenEval). The improvements over the powerful Janus-Pro baseline are non-trivial and suggest the effectiveness of the proposed method.
4. The curated PairUG dataset is a valuable contribution to the community. By providing a structured, high-quality dataset specifically designed for unified RL fine-tuning, the authors enable further research in this area.

Weaknesses:
1. Limited scope of evaluation benchmarks: The evaluation primarily focuses on standard VQA/reasoning and text-to-image generation tasks. However, a key capability of modern UVLMs is instruction-following image editing, which requires a tight integration of understanding (the instruction) and generation (the edit), and this capability is not evaluated.
2. Insufficient comparison with state-of-the-art baselines: The set of compared models, while including the relevant Janus-Pro, could be expanded to include more recent and powerful UVLMs (e.g., Qwen-Edit, Kontext, and Step1X) known for their unified capabilities, particularly in instruction following and editing.
3. Lack of clarity in visualizations: Figure 2, which is central to understanding the data pairing pipeline, is too abstract and lacks the detail needed to be fully informative. The low resolution and simplistic diagrams make it difficult to grasp the nuances of the alignment, retrieval, and clustering processes.
4. Need for stronger evidence of the method's novelty: The paper's main contribution is the PairUni framework (the PairUG dataset plus the Pair-GRPO algorithm). While the results on Janus-Pro are strong, the experiments do not sufficiently disentangle the contribution of the method from the choice of the base model. To robustly claim that PairUni is a generally effective technique, its impact must be demonstrated across different model architectures.

Questions:
See weaknesses.

EditLens Prediction: Fully AI-generated
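One plausible reading of the similarity-weighted advantage this review describes, assembled from standard GRPO notation and the $w_p = \sqrt{s_p}$ form quoted by the other reviews; this reconstruction is an assumption, not the paper's verbatim Equation 2:

$$
\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)},
\qquad
\tilde{A}_i = w_p\, \hat{A}_i,
\qquad
w_p =
\begin{cases}
1, & \text{aligned pair},\\
\sqrt{s_p}, & \text{retrieved pair with similarity } s_p \in [0, 1].
\end{cases}
$$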
PairUni: Pairwise Training for Unified Multimodal Language Models

Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper proposes PairUni, a unified RL framework for unified vision-language models (UVLMs) covering multimodal understanding and generation. It pairs data into understanding–generation (UG) tuples via augmentation, forming a 16K PairUG set, and introduces Pair-GRPO, which weights advantages by pair similarity. On Janus-Pro, the method reports balanced gains across understanding and generation, with small positive transfer to a discrete-diffusion backbone.

Strengths:
1. Problem framing: A clear diagnosis that unified RL suffers from cross-task interference; the data-optimization alignment idea is intuitive and practically relevant.
2. Simple, implementable mechanism: Pair-GRPO reduces to advantage reweighting by a scalar similarity, making it easy to replicate and integrate with GRPO pipelines.
3. Paired dataset construction: A pragmatic pipeline (augmentation + clustering medoids + retrieval) that others can adopt; ablations show the pair structure matters beyond naïve mixing.
4. Broad evaluation surface: Reports results on understanding and generation, at two model scales, plus an extra architecture.

Weaknesses:
1. Incremental algorithmic novelty: Pair-GRPO is essentially per-trajectory reweighting by a heuristic similarity; there is no principled analysis of when or why this dominates simple pair-aware sampling/curricula or per-pair loss scaling. Theoretical claims stop at intuition, with no gradient-level diagnostics beyond a single correlation figure.
2. Similarity definition is underspecified and weak: Pair scores come from ResNet-50 image embeddings only, ignoring text (prompts/Q/A). This invites spurious cross-instance pairs and makes the weighting arbitrary. There is no comparison to stronger vision-language similarities.
3. What exactly does similarity buy beyond sampling? If batches are held fixed and only the weights are changed to $w_p = \sqrt{s_p}$, how much of the gain remains vs. (i) uniform weighting, (ii) $s_p$-proportional sampling, (iii) reward-proportional weighting?
4. Ablations do not isolate causes: The main ablation contrasts pairing varieties, but does not disentangle (i) augmentation quality, (ii) retrieval thresholding, (iii) the functional form of $w_p$ (linear vs. $\sqrt{\cdot}$ vs. softmax), or (iv) text-aware vs. image-only similarity. It also omits a reweight-by-reward baseline to check whether similarity adds anything beyond standard advantage magnitudes.

Questions:
Please see weaknesses.

EditLens Prediction: Fully AI-generated
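Since this review stresses that Pair-GRPO "reduces to advantage reweighting by a scalar similarity" computed from ResNet-50 image embeddings, here is a minimal sketch of both pieces under stated assumptions: pooled ResNet-50 features for the image-only similarity (the design the review criticizes for ignoring text) and vanilla group-normalized GRPO advantages. All names and numerical details (the 1e-8 stabilizer, clipping to [0, 1]) are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

# Image-only pair similarity via pooled ResNet-50 features.
weights = ResNet50_Weights.DEFAULT
encoder = resnet50(weights=weights)
encoder.fc = torch.nn.Identity()      # keep the 2048-d pooled embedding
encoder.eval()
preprocess = weights.transforms()     # the weights' matching preprocessing

@torch.no_grad()
def pair_similarity(img_a, img_b) -> float:
    """Cosine similarity of two PIL images, clipped to [0, 1]."""
    batch = torch.stack([preprocess(img_a), preprocess(img_b)])
    feats = encoder(batch)
    return max(0.0, F.cosine_similarity(feats[0:1], feats[1:2]).item())

def pair_grpo_advantages(rewards: torch.Tensor, s_p: float) -> torch.Tensor:
    """Group-relative advantages, rescaled by sqrt(s_p) for the pair."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # vanilla GRPO
    return (s_p ** 0.5) * adv          # the scalar reweighting in question
```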
PairUni: Pairwise Training for Unified Multimodal Language Models

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
- This paper introduces PairUni, a unified reinforcement learning framework for unified VLMs designed to balance multimodal understanding and generation tasks by reorganizing heterogeneous data into understanding-generation (UG) pairs.
- A high-quality dataset of UG pairs, named PairUG, is curated for RL fine-tuning from aligned and retrieved pairs. These pairs are used in Pair-GRPO by assigning a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference.
- PairUni is evaluated on Janus-Pro across a range of understanding and generation benchmarks, demonstrating balanced improvements and competitive performance among baselines.

Strengths:
- This work addresses an important problem, task interference in unified models stemming from data and objective heterogeneity, and proposes a plausible paired-data strategy to mitigate it.
- The framework demonstrates performance gains across a comprehensive suite of understanding and generation benchmarks, supported by several analyses.
- The approach validates its generalizability to some extent by demonstrating effectiveness beyond autoregressive transformers, showing positive results on a discrete diffusion model and a flow-based model.

Weaknesses:
- My main concern lies in the problem's setup and motivation. The methodology appears applicable primarily to a somewhat niche setting in which understanding and generation tasks are handled by a shared architecture; the task-heterogeneity problem does not arise in understanding-only or generation-only VLMs.
- The motivating link between gradient cosine similarity and performance (Figure 1) appears ambiguous rather than stark, raising doubts about whether task interference is the true bottleneck or whether performance on both tasks is simply governed by the model's overall capacity.
- The framework relies on several ad-hoc design choices that lack clear rationale or ablation studies, including the $\sqrt{s_p}$ weighting for retrieved pairs in Equation 2, the selection of the similarity threshold during data construction, and the K-means medoid selection, which may oversample infrequent data types without a clear analysis of its benefit or harm to the learning process.
- The contribution of the proposed method versus the data is not clearly disentangled. The "Unpair" baseline in Table 4, which simply uses the curated data without the pairing strategy, already appears highly competitive relative to other baselines (e.g., the original Janus-Pro-1B). This suggests that the performance gains might be attributable primarily to the quality of the new 16K dataset rather than to the pairing strategy and the Pair-GRPO algorithm.

Questions:
Questions are enumerated in the weaknesses, primarily concerning rigorous analysis and design choices.

EditLens Prediction: Lightly AI-edited
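The functional-form ablation this review asks for (and review 3's weakness 4 echoes: linear vs. $\sqrt{\cdot}$ vs. softmax) is cheap to set up. A hypothetical harness is sketched below; the mean-one softmax normalization and all names are illustrative choices, not from the paper.

```python
import torch

def pair_weights(s_p: torch.Tensor, form: str = "sqrt") -> torch.Tensor:
    """Candidate w_p functional forms over a batch of pair scores in [0, 1]."""
    if form == "linear":
        return s_p
    if form == "sqrt":
        return torch.sqrt(s_p)                           # the reported choice
    if form == "softmax":
        return torch.softmax(s_p, dim=0) * s_p.numel()   # mean-one weights
    raise ValueError(f"unknown form: {form}")
```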