ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 0 (0%) N/A N/A N/A
Heavily AI-edited 0 (0%) N/A N/A N/A
Moderately AI-edited 1 (25%) 4.00 4.00 2557
Lightly AI-edited 1 (25%) 4.00 4.00 2391
Fully human-written 2 (50%) 4.00 3.50 1696
Total 4 (100%) 4.00 3.75 2085
Review 1
Title: HieraQuery: Bridging Multimodal Understanding and High-Quality Generation through Multi-Scale Query Learning
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes HieraQuery, a multi-scale query learning method for high-quality visual generation. It leverages a hierarchy of learnable visual queries to generate images in a coarse-to-fine manner and includes a multi-scale representation alignment strategy for cross-scale consistency and convergence acceleration. Extensive experiments are conducted, and the proposed method demonstrates improved visual generation capability.

Strengths:
1. Query learning is worth studying and shares commonality across multimodal tasks; tackling it in this unified way is bold and novel.
2. The experimental results are solid for text-to-image generation, image style transfer, and fine-grained editing.
3. Convincing visualization examples are provided to demonstrate the effectiveness of the proposed method.

Weaknesses:
1. The paper lacks a conceptual comparison between the baseline methods and the proposed HieraQuery method. Providing one would improve the presentation.
2. The comparison on understanding benchmarks should include the MLLM backbone used by each method.
3. Several writing typos: "benchmarsk" at line 265 and "MJHQ FID)" in Table 4.

Questions:
1. See weaknesses.
2. The training cost and inference speed of the proposed method compared to the baselines should be provided.

Overall, I think this work is novel but has several weaknesses in writing and experiments, so I will adjust my score and confidence based on the authors' response and the opinions of other reviewers.

EditLens Prediction: Fully human-written
Review 2
Title: HieraQuery: Bridging Multimodal Understanding and High-Quality Generation through Multi-Scale Query Learning
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper introduces HieraQuery, a hierarchical query modeling framework to improve unified multimodal generation. Instead of relying on a single query token set, it designs multi-scale queries, from coarse to fine, to progressively build images: coarse queries capture global semantics and fine queries refine local details. It further aligns features across scales through multi-scale representation alignment, ensuring semantic consistency between the diffusion backbone and the visual encoder. Built on a frozen MLLM (Ming-Lite-Omni) with diffusion backbones such as SD3-Medium and SANA-1.6B, the system achieves better image fidelity and structure, improving FID, GenEval, and DPG performance while maintaining strong multimodal understanding.

Strengths:
1. The hierarchical query framework provides a clear and modular architecture that systematically connects multimodal understanding with coarse-to-fine generation.
2. The multi-scale query learning and cross-scale alignment improve the balance between global semantic coherence and fine-grained visual fidelity.
3. The model demonstrates strong results on both text-to-image and editing tasks, showing good generality across benchmarks.
4. The paper presents extensive experiments as well as thorough ablation studies.

Weaknesses:
1. There is a typo in Figure 3: it should be "qualitative", not "quantitative".
2. The qualitative examples are too simple. For compositional tasks, please include more complex prompts.
3. The paper does not provide a quantitative analysis of the additional computational cost, such as FLOPs, memory, or end-to-end inference time, so the efficiency trade-offs of the hierarchical design remain unclear.
4. The paper mainly reports benchmark scores (FID, GenEval, DPG) without perceptual or human evaluation of image coherence, consistency, or failure modes.
5. The understanding-generation balance is claimed but not verified with detailed breakdowns; it remains unclear whether improvements in generation come at the cost of understanding accuracy.

Questions:
In the ablation study, several configurations yield very similar quantitative results. How did the authors determine which setting is optimal in the first place? Was the choice based on efficiency (e.g., computational cost, FLOPs, or latency), stability, or qualitative evaluation? Providing a clearer selection criterion would help assess the robustness of the claimed improvements.

EditLens Prediction: Lightly AI-edited
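To make the coarse-to-fine conditioning described in this review concrete, here is a minimal sketch of how hierarchical query sets might attend into a frozen MLLM's hidden states to produce one conditioning sequence per scale. This is not the paper's implementation; all module names, query counts, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalQueryConditioner(nn.Module):
    """Illustrative sketch: K learnable query sets cross-attend into frozen
    MLLM hidden states, yielding one conditioning sequence per scale."""

    def __init__(self, dim=1024, queries_per_scale=(16, 32, 64), heads=8):
        super().__init__()
        # Coarse scales get fewer queries (global semantics), fine scales more (local detail).
        self.queries = nn.ParameterList(
            nn.Parameter(torch.randn(1, n, dim) * 0.02) for n in queries_per_scale
        )
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in queries_per_scale
        )

    def forward(self, mllm_hidden):  # mllm_hidden: (B, L, dim), from a frozen MLLM
        conditions = []
        for q, attn in zip(self.queries, self.attn):
            q = q.expand(mllm_hidden.size(0), -1, -1)
            c, _ = attn(q, mllm_hidden, mllm_hidden)  # queries read from the MLLM context
            conditions.append(c)                      # one condition per scale, coarse to fine
        return conditions
```

Under this reading, each scale's condition would drive the shared diffusion backbone at a matching resolution (e.g. 256, then 512, then 1024), with the earlier, coarser scales fixing global layout before finer scales add detail.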
Review 3
Title: HieraQuery: Bridging Multimodal Understanding and High-Quality Generation through Multi-Scale Query Learning
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper proposes HieraQuery, which extends MetaQuery to a multiscale paradigm by generating images at multiple resolutions from different sets of conditioning queries. To enforce cross-scale consistency, REPA is applied at each level, so the DiT features from all scales are aligned to the same semantic representation. Beyond image generation, the framework is also adapted to image editing with a progressive reconstruction-then-editing paradigm. In the experiments, a performance gain from the multiscale queries is observed on GenEval.

Strengths:
1. The motivation is clear, and the paper is easy to follow.
2. The proposed framework, HieraQuery, comprehensively supports image understanding, generation, and editing.
3. HieraQuery achieves competitive performance on all three tasks studied.

Weaknesses:
1. The technical contribution is limited, since the ideas of multiscale generation and representation alignment are already proposed in existing works, i.e., VAR and REPA.
2. Although this paper studies unified models covering understanding, generation, and editing, the advantage of multiscale queries is only experimentally supported by results on the image generation benchmark, GenEval.
3. Some expected results are missing from the ablation study:
a. In Table 3, the result of applying REPA to the single-scale setting is expected.
b. The DiT is shared across different scales. Would scale-specific parameters benefit the final performance?
4. In the second and third training stages, only image-to-image losses are applied, which seems likely to damage the text-to-image generation capability of the model.

Questions:
Some details are missing:
1. SD3-Medium and SANA-1.6B are both studied, but it is unclear which model is used in Tables 1 and 2.
2. Can the authors provide more details on editing? For example, are the same model parameters used for both generation and editing? Are VAE latents fed to the DiT to ensure consistency between source and edited images?

EditLens Prediction: Fully human-written
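The cross-scale consistency this review describes is essentially a REPA-style feature alignment applied at every scale. Below is a minimal sketch of such a loss, assuming intermediate DiT tokens per scale and tokens from a frozen semantic encoder (DINOv2 in the original REPA); the projection head, the sequence resampling, and the cosine objective are illustrative choices, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleAlignment(nn.Module):
    """Illustrative REPA-style loss: project intermediate DiT features at each
    scale and pull them toward features of a frozen semantic encoder."""

    def __init__(self, dit_dim=1152, enc_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dit_dim, enc_dim), nn.SiLU(), nn.Linear(enc_dim, enc_dim)
        )

    def forward(self, dit_feats_per_scale, target_feats):
        # dit_feats_per_scale: list of (B, N_k, dit_dim) intermediate DiT tokens, one per scale
        # target_feats:        (B, M, enc_dim) tokens from a frozen encoder on the clean image
        loss = 0.0
        for feats in dit_feats_per_scale:
            z = F.normalize(self.proj(feats), dim=-1)  # (B, N_k, enc_dim)
            # Resample the target token sequence to this scale's token count,
            # so every scale is aligned to the same semantic representation.
            t = F.interpolate(target_feats.transpose(1, 2), size=z.size(1),
                              mode="linear", align_corners=False).transpose(1, 2)
            t = F.normalize(t, dim=-1)
            loss = loss + (1.0 - (z * t).sum(dim=-1).mean())  # cosine alignment per scale
        return loss / len(dit_feats_per_scale)
```

A single shared alignment target of this kind is one way the coarse and fine generations could be kept semantically consistent while also speeding convergence, which is what the reviewer's single-scale-REPA ablation request would isolate.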
Review 4
Title: HieraQuery: Bridging Multimodal Understanding and High-Quality Generation through Multi-Scale Query Learning
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes a unified autoregressive multimodal LLM framework. It introduces hierarchical learnable visual queries for coarse-to-fine image generation: preceding queries handle low-resolution global structures, while subsequent ones refine high-resolution details. A multi-scale representation alignment strategy ensures cross-scale consistency and accelerates training.

Strengths:
1. The work shows strong technical quality through comprehensive experiments on benchmarks such as GenEval, MJHQ-FID, and GEdit-Bench, outperforming state-of-the-art unified models. Ablations validate the multi-scale query and alignment strategies.
2. HieraQuery advances unified multimodal LLMs, and its scalable multi-scale approach addresses bottlenecks in error accumulation and resolution handling, benefiting the AI community in developing more versatile vision-language models.

Weaknesses:
1. Table 1 shows noticeable drops on understanding benchmarks compared to MetaQuery (which uses a similar base): MMB decreases from 83.5 to 80.7, and MMMU from 58.6 to 54.3. MM-Vet improves slightly (66.6 to 74.0), but the overall trend suggests the multi-scale query integration may interfere with the LLM's text-visual alignment for understanding tasks.
2. The core contribution, multi-scale queries for coarse-to-fine generation, builds directly on MetaQuery by hierarchizing the queries. While this addresses MetaQuery's plateauing with token count, it resembles existing multi-scale generation techniques, such as *Chain-of-Sight* mentioned in the paper. This makes the novelty appear incremental rather than transformative for unified MLLMs.
3. Further detailed ablation studies are missing. For example, no quantitative analysis is provided on how performance and computational cost trade off as $K$ in $S = \{s_1, s_2, \ldots, s_K\}$ increases. Without this analysis, it is unclear whether increasing $K$ yields diminishing returns, introduces redundancy, or disproportionately raises inference latency.

Questions:
1. Could the authors elaborate on how HieraQuery's hierarchical queries differ mechanistically from multi-resolution approaches in diffusion models? A detailed comparison, perhaps with pseudo-code or diagrams, could clarify whether this is a truly novel integration or an incremental extension, potentially strengthening my view on the contribution's originality.
2. In Table 1, HieraQuery shows slightly lower scores on understanding benchmarks. Since the MLLM (Ming-Lite-Omni) is frozen during training, could this be causing the degradation?
3. Minor issue: there is a redundant writing issue in L267-L269.

EditLens Prediction: Moderately AI-edited