ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (25%) | 4.00 | 3.00 | 5023 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 2 (50%) | 3.00 | 3.50 | 2896 |
| Fully human-written | 1 (25%) | 4.00 | 3.00 | 4106 |
| Total | 4 (100%) | 3.50 | 3.25 | 3730 |
Title Ratings Review Text EditLens Prediction
Title: MIMIC-VQA: COMPILING AGENTIC REASONERS INTO EFFICIENT DOCUMENT VQA MODELS
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes a two-phase paradigm for document visual question answering. First, a teacher pipeline generates CoT reasoning traces. Second, a student model completes the rest of the reasoning. Results demonstrate faster inference speed.

Strengths:
1. The reviewer loves the idea of using a student model to do further explanations, trained on teacher expert data. This could be insightful for other VQA applications as well.
2. The reviewer appreciated the visual examples of Figure 2, but the reviewer still has some questions (see below).
Although this paper has weaknesses, its conceptual motivation is clear and potentially impactful.

Weaknesses:
1. Writing and formatting issues. The manuscript has notable presentation problems that obscure the technical content. Examples include: (1) Ln 83-86 is a repeat of Ln 87. (2) Ln 88, "compiled" should be changed to ``compiled'', which applies to all quotation marks. (3) Resulting numbers should be highlighted in the introduction, while the last paragraph of the introduction should be removed. (4) In related works, citation formats are wrong. (5) Ln 380-384, the format is wrong, and it looks like LLMs again. (6) Ln 440, imparts -> impacts.
2. Limited algorithmic novelty and fairness of comparison. The reported 5.3× speedup seems largely due to using a smaller model rather than a new algorithm. The student's performance is also notably below the teacher's in mAP. Speed comparisons should be conducted against similar 7B-scale models, such as LayTextLLM, for fairness.
3. Unclear conceptual framing. The teacher–student design is fairly standard, and it is unclear why the framework is termed "agentic." The paper would benefit from stronger justification or an ablation demonstrating agent-like reasoning behavior.

Questions:
1. In the second example of Figure 2, why are other numbers not marked?
2. Is mAP a good metric? If it's a grounding task, the reviewer would expect other metrics such as mIoU (a minimal sketch contrasting the two follows this review). Also, an analysis of the interpretability of the results is missing.
3. Why was this framework called agentic?
4. So, the outcome of this paper is only a student model? The speedup isn't a strong claim for an ICLR paper, and it's not due to the algorithmic design.
5. Why does SROIE not have an mAP result? Is it because of unavailable ground truth?

EditLens Prediction: Lightly AI-edited
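To make Question 2 above concrete, the following is a minimal sketch of the distinction between mIoU (average overlap quality) and an mAP-style thresholded hit count for box grounding. The (x, y, w, h) box format, the single 0.5 threshold, and the toy boxes are illustrative assumptions, not details taken from the submission.

```python
# Minimal sketch: IoU for (x, y, w, h) boxes, aggregated two ways.
# The box format and the 0.5 threshold are illustrative assumptions.

def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """mIoU: average overlap quality; rewards tight boxes continuously."""
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)

def hit_rate_at(preds, gts, thresh=0.5):
    """AP-style view at one threshold: a prediction either counts or it doesn't."""
    return sum(iou(p, g) >= thresh for p, g in zip(preds, gts)) / len(preds)

if __name__ == "__main__":
    gts = [(100, 40, 80, 20), (300, 200, 60, 30)]
    preds = [(102, 42, 80, 20), (300, 215, 60, 30)]  # one tight box, one shifted box
    print(f"mIoU = {mean_iou(preds, gts):.2f}")           # ~0.56
    print(f"hit rate @0.5 = {hit_rate_at(preds, gts):.2f}")  # 0.50
```

The two numbers respond differently to the shifted box, which is the distinction behind the reviewer's suggestion to report mIoU (or both) on a grounding task.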
Title: MIMIC-VQA: COMPILING AGENTIC REASONERS INTO EFFICIENT DOCUMENT VQA MODELS
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper investigates how to resolve the fundamental trade-off between inference latency and reasoning transparency in document VQA: modular agentic systems offer reasoning transparency but suffer from high latency, while end-to-end models are efficient but lack interpretability. The authors propose MIMIC-VQA, a knowledge distillation method for document VQA that distills not only the final answer but also the complete step-by-step reasoning process from a larger multi-agent teacher system into a single autoregressive generation by a smaller student model. The authors perform experiments on 5 benchmarks and demonstrate state-of-the-art performance on 4 of them, with the student model achieving 98.3% of teacher accuracy while operating 5.3× faster.

Overall, this is an interesting paper that addresses a practical problem. However, the main contribution, the distillation approach, largely applies well-established knowledge distillation techniques that have been widely studied in modern LLMs. Nevertheless, in the document VQA area, it still demonstrates promising results in the experiments.

Strengths:
- Great motivation for optimizing both inference latency and performance while maintaining reasoning transparency, addressing a genuine practical need in document AI deployment with significant infrastructure cost reductions (from 88.9 to 16.7 hours per 100K queries).
- The method is technically sound and the experimental results are strong, achieving state-of-the-art performance on 4 out of 5 benchmarks with substantial improvements in spatial grounding (20-30 mAP point gains over existing methods). The ablation studies effectively demonstrate the critical importance of Chain-of-Thought distillation, showing catastrophic degradation in spatial grounding (a 14-point mAP drop) when CoT is removed.
- The dataset of 102,447 Chain-of-Thought reasoning traces represents a substantial contribution to the research community, providing high-quality procedural reasoning annotations for document VQA that could foster further research in this important area.

Weaknesses:
### Major
- The major concern is that this work essentially applies well-established knowledge distillation techniques, which are widely used in LLMs, to the document VQA domain using vision-language models. While the application is competent and the "procedural distillation" framing is useful (a hedged sketch of such a target follows this review), the core methodological contribution is incremental rather than genuinely innovative.
- The experiments are limited to only one teacher-student combination (Llama 4 Scout + Gemma 3-27B/9B). It would significantly strengthen the evidence for this method's effectiveness to test whether other model combinations achieve similar gains, particularly given the authors' claim about the general applicability of the approach.
### Minor
- Line 397: Claims "achieves state-of-the-art performance across all five benchmarks" but actually achieves SOTA on only 4 out of 5 benchmarks.
- Lines 269 vs. 290: There are two instances of "Step 4: Answer Grounding," which creates confusion in the methodology description.
- The nearly full-page pseudocode on page 5 (Algorithm 1) doesn't seem to add much value beyond the textual description and hurts the presentation efficiency of the paper, in my opinion.

Questions:
- How can you better justify the overall effectiveness of this approach, since it is only tested with one teacher-student model configuration? What about other model selections, such as larger/smaller teacher models or different student architectures? This is crucial for establishing the method's generalizability beyond the specific Llama 4 Scout + Gemma 3-27B/9B combination.
- What about the data efficiency of the teacher data generation process? Could you provide more details on how the 102,447 number was determined? If the ratio of CoT reasoning traces to the original number of data points in the source datasets is too high, it could cause significant distribution shift, potentially leading to overfitting on synthetic reasoning patterns rather than genuine document understanding capabilities.

EditLens Prediction: Fully human-written
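For context on the "procedural distillation" framing discussed in the Major weaknesses above, here is a hedged sketch of what a single distillation target serialized from one teacher trace could look like. The tags, field names, and sample values are hypothetical and only illustrate the idea of supervising the student on reasoning, answer, and box as one sequence; the submission's actual serialization may differ.

```python
# Hypothetical sketch of a "procedural distillation" target: serializing a
# teacher trace (reasoning steps + answer + box) into one string on which the
# student is trained with an ordinary next-token loss. The tags, field names,
# and coordinate order are assumptions, not the submission's actual format.

def build_target(trace):
    steps = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(trace["steps"]))
    x, y, w, h = trace["bbox"]
    return (
        f"<reasoning>\n{steps}\n</reasoning>\n"
        f"<answer>{trace['answer']}</answer>\n"
        f"<bbox>{x} {y} {w} {h}</bbox>"
    )

example = {  # toy trace; values are invented for illustration
    "question": "What is the invoice total?",
    "steps": [
        "Locate the totals block in the lower-right region of the page.",
        "Read the value printed next to the 'TOTAL' label.",
    ],
    "answer": "$1,284.50",
    "bbox": (450, 80, 120, 25),
}

print(build_target(example))
# Because answer and box are part of the same supervised sequence, the student
# emits them in a single autoregressive pass, which is where the latency
# savings over a multi-step teacher pipeline would come from.
```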
Title: MIMIC-VQA: COMPILING AGENTIC REASONERS INTO EFFICIENT DOCUMENT VQA MODELS
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper introduces MIMIC-VQA, which tackles the latency-interpretability trade-off in document VQA by distilling the multi-step reasoning of a modular, tool-using teacher into a single student VLM. The teacher pipeline is used to produce ~102k CoT traces with bounding-box supervision. The student model (pruned to 9B from 27B) is trained to generate the reasoning, answer, and bbox in one forward pass. An optional constrained-decoding step restricts coordinate tokens using a lightweight OCR vocabulary at inference. Empirically, the student approaches teacher ANLS while running ~5× faster and shows large mAP gains for grounding on DocVQA, VisualMRC, FUNSD, and CORD. Ablations argue that CoT is critical for spatial grounding.

Strengths:
- The paper is clearly written and easy to understand.
- Convincing ablations showing CoT is crucial for spatial grounding.
- The paper presents an efficient version of the model which could be deployed in real-time scenarios.

Weaknesses:
- The teacher description is contradictory. In the main paper, the teacher is an OCR-based Llama Scout/Gemma pipeline; however, in Appendix B.2.1, the generation model (teacher) is described as Gemini with GPT-5 validation. This undermines the central claim because the nature of the procedural knowledge being distilled is unclear.
- The paper's methodology has a lot of similarities with AURELIA [1], which also distills reasoning information into VLMs via an agentic pipeline at test time. Therefore, the methodology of the proposed approach cannot be tagged as novel. Also, a comparison of the proposed method with AURELIA-like pipelines must be reported in the paper, as it presents a crucial baseline for testing reasoning distillation with fine-tuning vs. without fine-tuning.
- The authors mention using mAP@IoU; however, the metric is missing for almost every reported baseline in all tables. This raises concerns regarding the effectiveness of the proposed method, making it difficult to assess whether improvements hold against strong OCR-based contemporaries under the same metric settings.
- The GPU compute used to run the experiments is unclear. The main paper (Section 4.2) mentions using A100 GPUs, while Appendix A reports H100 GPUs being used. This will lead to reproducibility issues.
- Appendix Table 4 reports the usage of LoRA for training the student architecture; however, there is no mention of LoRA in the main paper. The details on self-consistency are also missing.
- While the pruning recipe is described, it lacks an ablation showing accuracy vs. sparsity and the contribution of pruning independent of distillation.
- In the main paper, the generated bbox has the format <x,y,h,w>; however, in the Appendix (page 16), the generated bbox has the format <x1,y1,x2,y2>. This raises questions about how the model is trained to emit boxes and how parsing errors are penalized (a short sketch of why the two conventions are not interchangeable follows this review).

References
[1] Chowdhury, S., Gani, H., Anand, N., Nag, S., Gao, R., Elhoseiny, M., ... & Manocha, D. (2025). AURELIA: Test-time reasoning distillation in audio-visual LLMs. ICCV 2025.

Questions:
- Which teacher actually produced the 102,447 traces? Please reconcile the OCR-based Llama/Gemma pipeline (Sec. 3–4, main paper) with the Gemini+GPT-5 pipeline (Appendix B).
- Box parameterization: Is the student trained to output <x, y, w, h> or <x1, y1, x2, y2>?
- Why are mAP scores omitted for the majority of cases?
- How is "coordinate hallucination" defined and measured for the claimed 73% reduction?

EditLens Prediction: Lightly AI-edited
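To underline the box-parameterization point above, here is a short sketch of why the two conventions this review names are not interchangeable; the example coordinates are arbitrary and purely illustrative.

```python
# Sketch of why the <x, y, w, h> vs. <x1, y1, x2, y2> inconsistency matters:
# the same four numbers describe different rectangles under the two
# conventions, so an evaluation script that guesses the wrong one scores
# otherwise valid boxes as (near-)zero overlap. Purely illustrative.

def xywh_to_xyxy(box):
    x, y, w, h = box
    return (x, y, x + w, y + h)

def xyxy_to_xywh(box):
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)

raw = (450, 80, 120, 25)
print(xywh_to_xyxy(raw))  # read as (x, y, w, h): corners (450, 80, 570, 105)
print(xyxy_to_xywh(raw))  # read as (x1, y1, x2, y2): width/height (-330, -55), a degenerate box
```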
Title: MIMIC-VQA: COMPILING AGENTIC REASONERS INTO EFFICIENT DOCUMENT VQA MODELS
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes MIMIC-VQA, a knowledge distillation framework for Document Visual Question Answering that aims to "compile" the reasoning process of a modular agentic system into an efficient neural network. The approach operates in two phases: (1) a teacher pipeline orchestrated by Llama 4 Scout generates 102,447 Chain-of-Thought reasoning traces with spatial grounding, and (2) these traces are used to train a student model derived from Gemma 3-27B, pruned to 9B parameters, to replicate the complete reasoning process, including bounding box coordinates, in a single autoregressive generation. The authors report state-of-the-art results on the DocVQA (89.7 ANLS), VisualMRC, FUNSD, and CORD benchmarks with a 5.3× inference speedup.

Strengths:
- Novel conceptual contribution: The idea of "compiling" procedural knowledge from a multi-step agentic system into a single efficient model through distillation is interesting and addresses a real trade-off in the field.
- Comprehensive evaluation: The paper evaluates on multiple established benchmarks (DocVQA, VisualMRC, FUNSD, CORD, SROIE) with both answer accuracy and spatial grounding metrics.
- Detailed methodology: The appendix provides extensive implementation details including hyperparameters, pruning schedules, and dataset generation procedures.
- Important ablation study: Table 3 demonstrates that Chain-of-Thought reasoning is critical for spatial grounding performance, which is a valuable finding.
- Focus on spatial grounding: The emphasis on both answer accuracy and localization is appropriate for document understanding applications.

Weaknesses:

Methodological Contradictions
a) OCR Dependency vs. End-to-End Claims: Section 3.3 describes "optional constrained decoding" that requires "lightweight OCR preprocessing" (adding 45 ms latency), while Appendix B emphasizes training for "visual-spatial reasoning WITHOUT text detection." Yet the teacher explicitly uses OCR (Algorithm 1, line 5) and deterministic ANLS matching (Algorithm 2). How can the student learn OCR-free spatial grounding when the teacher fundamentally relies on OCR? This contradiction is never resolved.
b) Inference Process Unclear: Is constrained decoding always used or truly optional? Table 2 shows separate results for "+ Constrained Decoding," suggesting it is optional, but Section 3.3 states it "ensures robust coordinate outputs"; when is it not used? What happens to spatial grounding quality without it?

Missing Critical Technical Details
Bounding Box Tokenization: How exactly are coordinates like "(450, 80, 120, 25)" tokenized? Are these separate number tokens? Single composite tokens? How large is the vocabulary? Algorithm 1, line 13 shows "Location: BA," but the actual format isn't specified. The format "x y w h" appears in examples, but is this (x, y, w, h) or (x1, y1, x2, y2)? Page 4, line 211 shows both formats.

Insufficient Baselines and Comparisons
a) Teacher Comparison: Table 2 doesn't show the teacher agent results, but Table 3 does (90.2 ANLS, 78.4 mAP). Why is the teacher not included as a baseline in the main results table? The student achieves 88.7 ANLS vs. the teacher's 90.2 (98.3% retention), but for spatial grounding it is 69.1 vs. 78.4 mAP (88.1%). Why the larger degradation?
b) Alternative Distillation Methods: No comparison with standard knowledge distillation (logit matching), with other model compression techniques, or with simply using the base Gemma 3-27B model.
c) Fairness Concerns: Compared models (DocLayLLM, LayoutLLM, etc.) don't use CoT or extensive spatial reasoning traces. Is it fair to compare a model trained on 102k expert-generated reasoning traces against models trained on standard datasets?

Experimental Rigor Issues
a) Statistical Significance: Appendix A.5 mentions 5 runs with different seeds, but Table 2 shows no error bars or confidence intervals, and no significance testing is reported. Given that improvements like 88.7 -> 89.7 ANLS are claimed as contributions, statistical significance is essential.
b) Data Leakage Potential: How was the generated dataset of 102,447 traces split? Were test set images used to generate teacher traces? This could lead to train/test contamination.

Questionable Claims
a) "Compiling" Metaphor: The paper claims to "compile" multi-step reasoning, but the student still generates the full CoT at inference. True compilation would eliminate the intermediate steps, but here they are still generated (and add tokens/latency). The speedup comes from model size reduction, not from compilation.
b) Spatial Grounding Performance: 20-30 point mAP improvements are claimed, but the DLaVA baseline may be weak. There is no comparison with specialized grounding models or object detection baselines, and the teacher's deterministic grounding (Algorithm 2) isn't a learned capability being transferred.

Questions:
- How is spatial grounding achieved without OCR? Resolve the contradiction between OCR-free claims and OCR-dependent implementation.
- What is the exact coordinate tokenization scheme? Provide concrete examples (a generic constrained-decoding sketch follows this review as a reference point).

EditLens Prediction: Fully AI-generated
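As a reference point for the constrained-decoding and tokenization questions above, here is a generic sketch of decoding restricted by logit masking over an allowed coordinate set. This is a common pattern offered as an assumption about the kind of mechanism Section 3.3 might be describing, not the paper's actual implementation; the toy vocabulary and token ids are invented for readability.

```python
# Generic logit-masking sketch of constrained decoding: restrict the next
# token to an allowed set (e.g., coordinate values supported by OCR). This is
# a standard pattern, not the submission's implementation; the toy vocabulary
# maps token id == coordinate value purely for readability.
import math

def mask_logits(logits, allowed_ids):
    """Drive every disallowed token to -inf so it can never be selected."""
    return [v if i in allowed_ids else -math.inf for i, v in enumerate(logits)]

def greedy_pick(logits):
    return max(range(len(logits)), key=lambda i: logits[i])

vocab_size = 1000
allowed_coordinate_ids = {450, 80, 120, 25}  # e.g., values an OCR pass actually saw

raw_logits = [0.0] * vocab_size
raw_logits[449] = 9.0   # unconstrained model slightly prefers a nearby, unsupported value
raw_logits[450] = 8.5   # the OCR-supported value

print(greedy_pick(raw_logits))                                       # 449 (a "hallucinated" coordinate)
print(greedy_pick(mask_logits(raw_logits, allowed_coordinate_ids)))  # 450 (constrained choice)
```

Note that measuring any reduction in "coordinate hallucination" under such a scheme still requires the precise definition the review asks the authors to provide.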