|
Towards Visual Text Grounding of Multimodal Large Language Model |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The authors introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking and improving the Text-Rich Image Grounding capabilities of MLLMs in document question answering. Specifically, they propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark and a large-scale training set of 90k synthetic examples drawn from four diverse datasets. A comprehensive evaluation of various MLLMs on the proposed benchmark exposes substantial limitations in their grounding capability on text-rich images. In addition, the authors propose two simple and effective TRIG methods based on general instruction tuning and a plug-and-play efficient embedding approach, respectively. By finetuning MLLMs on the synthetic dataset, they report improved spatial reasoning and grounding capabilities.
1) TRIG-Bench is introduced to evaluate the bbox generation of MLLMs.
Text-rich document image grounding is not a novel task; it has already been explored in many prior works:
[1] SPTSv1 and SPTSv2.
[2] The DocOwl series, which includes a large amount of visual text grounding data.
[3] Kosmos-2.5, which directly outputs spatially-aware texts (text + bounding boxes).
[4] Marten and TokenVL, which use pixel-level segmentation maps to guide visual text grounding.
The authors should discuss the differences from these works.
The related work section should be more detailed.
Please see the weaknesses. |
Fully human-written |
|
Towards Visual Text Grounding of Multimodal Large Language Model |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces the TRIG (Text-Rich Image Grounding) task, which aims to evaluate and enhance the visual text localisation capabilities of MLLMs in text-rich document images. The authors construct the TRIG-Bench benchmark, consisting of 800 manually annotated samples, along with a synthetic training set of 90,000 samples, both created through an OCR–LLM–human interaction pipeline. Additionally, the paper proposes two baseline methods: a general instruction fine-tuning approach and an efficient plug-and-play embedding method. A systematic evaluation of multiple existing MLLMs reveals significant limitations of current models in complex document understanding and localisation.
1. TRIG addresses text-intensive document scenarios, such as tables and infographics, which have been overlooked by existing visual localisation research. This work fills a gap in the verifiability and credibility assessment of MLLMs in document question answering.
2. By integrating OCR with GPT-4o for automated generation and manual verification, TRIG balances scale and quality. It particularly excels in annotating bounding boxes supporting answers with clearly defined semantic alignment objectives.
3. Through metrics such as the instruction-following rate, TRIG uncovers fundamental deficiencies in existing models for complex tasks, highlighting weaker instruction comprehension in comparison to spatial reasoning. This provides critical insights for the research community.
1. In Table 1, several open-source models (such as LLaVA and Qwen-VL) exhibit an IoU of 0.00 under the OCR-free setting, which is counterintuitive. It is advisable to verify the experimental setup and evaluation code (including prompt templates, model versions, random seeds, input preprocessing, and output parsing scripts) and to include several typical failure examples to identify the source of the issue; a defensive parsing sketch is provided after this list for reference.
2. The description of the embedding method lacks clarity. Please state explicitly which similarity measure is employed, how the embeddings are derived, how the similarity matrix is constructed and normalised, and how it is summarised into a similarity vector. It is recommended to provide the relevant formulas or pseudo-code and to report key hyperparameters as well as the computational complexity.
3. Although the proposed embedding method demonstrates advantages in terms of efficiency, it has not been adequately compared with recent related work under identical experimental conditions, which weakens the case for the method's novelty. It is recommended to perform the comparison using a unified dataset, the same OCR input, and consistent evaluation metrics, and to present a comparison table detailing accuracy, inference latency, and memory usage.
4. The notation is incomplete: the symbol l appears in equations (1) and (2) but is not defined.
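To make the parsing concern concrete, here is a minimal sketch of the kind of defensive output parsing we would expect the evaluation to perform; the coordinate formats it handles (absolute pixels and [0, 1]-normalized values) are our assumptions about typical model outputs, not the authors' actual script.

```python
import re

def parse_boxes(text: str, img_w: int, img_h: int):
    """Extract (x1, y1, x2, y2) boxes from free-form model output.

    Covers both absolute-pixel and [0, 1]-normalized coordinates; this is an
    assumed set of formats to be matched against the actual evaluation script.
    """
    boxes = []
    # Match any group of four numbers separated by commas/whitespace,
    # e.g. "[0.12, 0.30, 0.55, 0.90]" or "(120, 34, 560, 78)".
    num = r'(-?\d+(?:\.\d+)?)'
    for m in re.finditer(rf'{num}[,\s]+{num}[,\s]+{num}[,\s]+{num}', text):
        x1, y1, x2, y2 = map(float, m.groups())
        # Heuristic: values no larger than 1.5 are treated as normalized coordinates.
        if max(x1, y1, x2, y2) <= 1.5:
            x1, x2 = x1 * img_w, x2 * img_w
            y1, y2 = y1 * img_h, y2 * img_h
        if x2 > x1 and y2 > y1:
            boxes.append((x1, y1, x2, y2))
    return boxes
```

If a parser along these lines still recovers no valid boxes from the LLaVA and Qwen-VL outputs, the 0.00 IoU conclusion would be much easier to accept.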
1. In Table 1, several open-source models, including LLaVA and Qwen-VL, exhibit an IoU of 0.00 in the OCR-free scenario, which is markedly inconsistent with their performance in the standard VQA task. This discrepancy could potentially stem from errors in the evaluation code, deficiencies in the prompt design, or issues with the parsing of model outputs. If it can be demonstrated that the evaluation is indeed accurate and that the model is incapable of generating valid coordinates, the current conclusion may be upheld. Conversely, if the issue pertains to the evaluation or the parsing format, a revision of the conclusion will be necessary.
2. The current description lacks sufficient detail; for instance, the similarity measure is never specified. It is advisable to include the relevant formulas or pseudocode to clarify the computation that leads from the image and text embeddings to the resulting similarity vector, and to spell out both the similarity measure and the aggregation strategy. One plausible reading is sketched after these questions for the authors to confirm or correct.
3. The method should be compared systematically with recent publicly available works under identical conditions. The absence of a consistent comparative framework weakens the persuasiveness of the paper's contributions. |
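For reference, below is a minimal sketch of one plausible reading of the similarity computation (cosine similarity between text and patch embeddings, aggregated into a per-patch score vector); the tensor shapes, the max-over-tokens aggregation, and the min-max normalization are all assumptions for the authors to confirm or correct.

```python
import torch
import torch.nn.functional as F

def patch_similarity_vector(text_embs: torch.Tensor,
                            patch_embs: torch.Tensor) -> torch.Tensor:
    """Sketch of one possible similarity step (assumed, not the paper's code).

    text_embs:  (t, d) token/segment embeddings of the question
    patch_embs: (n, d) embeddings of the n image patches
    Returns a length-n vector with one relevance score per patch.
    """
    # Cosine similarity matrix between every text embedding and every patch embedding.
    t = F.normalize(text_embs, dim=-1)
    p = F.normalize(patch_embs, dim=-1)
    sim_matrix = t @ p.T                        # shape (t, n)
    # Assumed aggregation: a patch is relevant if it matches any text token strongly.
    sim_vector = sim_matrix.max(dim=0).values   # shape (n,)
    # Assumed normalization so scores are comparable across images.
    lo, hi = sim_vector.min(), sim_vector.max()
    return (sim_vector - lo) / (hi - lo + 1e-8)
```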
Lightly AI-edited |
|
Towards Visual Text Grounding of Multimodal Large Language Model |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces TRIG, a task and resource suite for Text-Rich Image Grounding in document VQA. The authors build (i) TRIG-Bench, a human-validated benchmark of 800 QA pairs with grounded bounding boxes drawn from DocVQA, ChartQA, InfographicsVQA and TRINS, and (ii) a ~90k synthetic instruction dataset created via an OCR–LLM–human pipeline. They further provide two baselines: an instruction-tuning approach (fine-tuning an MLLM to emit answers plus boxes) and a plug-and-play embedding approach that retrieves image patches by text–image similarity with adjacent-patch smoothing. Two evaluation settings are defined: OCR-free grounding (predict boxes from the image) with pixel-level IoU, and OCR-based grounding (select boxes among OCR detections) with instance-level IoU/Precision/Recall/F1. A broad evaluation across open-source and proprietary MLLMs highlights limited performance on OCR-free grounding and the gains from the proposed training/data.
1. Well-scoped, underexplored problem: Document-centric visual text grounding is distinct from typical natural-image grounding and is practically valuable for trust and verification in document QA. The paper motivates this gap clearly.
2. Data pipeline design: The indexed OCR overlay plus reflection/rectification loop is a thoughtful way to reduce prompt–vision misalignment and to auto-filter noisy candidates before human vetting. The final benchmark requires dual-annotator agreement per item.
3. Methodological diversity: The instruction-tuned baseline and the efficient embedding baseline probe complementary failure modes (instruction-following vs spatial retrieval) and give the community practical baselines to start from.
1. Limited analysis of the “indexed OCR overlay” design: The paper asserts improved alignment by drawing indices on images and mirroring them in text, but ablations isolating this factor (e.g., with/without indices, different index densities, noise) are not reported.
2. Embedding baseline details: The adjacent-patch averaging and the 2-level top-k expansion heuristic are plausible but under-analyzed; the sensitivity to neighborhood size, stride, and k is not fully characterized. A parameterized sketch of the kind of sweep we have in mind follows this list.
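To illustrate the kind of sensitivity study we have in mind, here is a sketch of how the adjacent-patch averaging and a two-level top-k expansion could be parameterized for a sweep; the patch-grid layout, neighborhood radius, and default k are assumptions rather than the paper's actual settings.

```python
import numpy as np

def smooth_and_expand(scores: np.ndarray, grid: tuple,
                      radius: int = 1, k: int = 5) -> np.ndarray:
    """Sketch of adjacent-patch smoothing plus a two-level top-k expansion
    (our reading of the heuristic; radius and k are the knobs to sweep).

    scores: flat per-patch relevance scores of length grid[0] * grid[1].
    Returns indices of the finally selected patches.
    """
    h, w = grid
    s = scores.reshape(h, w)
    # Level 0: average each patch with its neighbors within `radius`.
    smoothed = np.zeros_like(s, dtype=float)
    for i in range(h):
        for j in range(w):
            win = s[max(0, i - radius):i + radius + 1,
                    max(0, j - radius):j + radius + 1]
            smoothed[i, j] = win.mean()
    # Level 1: take the top-k smoothed patches, expand to their 8-neighborhoods,
    # then take the top-k again over the expanded candidate set.
    flat = smoothed.ravel()
    seeds = np.argsort(flat)[-k:]
    candidates = set()
    for idx in seeds:
        i, j = divmod(int(idx), w)
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    candidates.add(ni * w + nj)
    cand = np.array(sorted(candidates))
    return cand[np.argsort(flat[cand])[-k:]]
```

Sweeping radius in {0, 1, 2} and k in {1, 3, 5, 10} against the reported metrics would largely answer the stability question.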
1. Pixel-IoU vs Box-IoU: Why use pixel-level IoU to score rectangular predictions in OCR-free grounding? Would standard box IoU (with thresholds) change the ranking or the conclusions? Please provide a small study or justification; a sketch of the distinction we have in mind follows these questions.
2. Index-overlay ablation: Can you add an ablation comparing (a) “wrap OCR text only,” (b) “indices in text only,” (c) “indices drawn on image only,” and (d) “both,” to verify that the bidirectional index design is indeed the key factor?
3. Data provenance & leakage: Since GPT-4o participates in the pipeline, how do you mitigate downstream evaluation advantages (or disadvantages) for GPT-4o and related models? Any steps taken to avoid prompt/format leakage between data generation and evaluation? |
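For concreteness, here is a minimal sketch of the two metrics as we understand them (the box format and rasterization are assumptions): for a single predicted rectangle against a single ground-truth rectangle the two scores coincide up to pixel discretization, but they diverge once either side is a union of several boxes.

```python
import numpy as np

def box_iou(a, b):
    """Standard IoU between two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def pixel_iou(pred_boxes, gt_boxes, img_h, img_w):
    """Pixel-level IoU: rasterize the union of boxes on each side into a mask.

    Equals box_iou for one box per side; differs once multiple (possibly
    overlapping) boxes are involved, which is where the metric choice matters.
    """
    def to_mask(boxes):
        m = np.zeros((img_h, img_w), dtype=bool)
        for x1, y1, x2, y2 in boxes:
            m[int(y1):int(y2), int(x1):int(x2)] = True
        return m
    p, g = to_mask(pred_boxes), to_mask(gt_boxes)
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / union if union > 0 else 0.0
```

Reporting both metrics on the same predictions (or box IoU at a few thresholds) would show whether the model ranking is metric-dependent.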
Fully AI-generated |
|
Towards Visual Text Grounding of Multimodal Large Language Model |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes TRIG, aiming to enhance the text visual grounding capability of multimodal large language models in text-rich document images. It builds a dataset of human-verified and synthetic question–answer pairs using an OCR–LLM–human pipeline. The authors introduce two baseline methods—an instruction-tuning approach and an embedding-based approach—to improve grounding accuracy.
* The paper is clearly written and easy to follow.
* The research question of visual text grounding is important.
* The paper’s contribution is limited. It mainly focuses on reorganizing existing datasets using OCR tools rather than proposing a new dataset or introducing novel technical approaches, which does not meet the innovation standards expected at ICLR.
* Table 1 lacks comparisons with several quite related baseline methods, such as [1], [2] and [3], which weakens the empirical evaluation and makes it difficult to assess the claimed effectiveness of the proposed approach.
[1] Seed1.5-VL Technical Report. arXiv 2025.
[2] CogAgent: A Visual Language Model for GUI Agents. CVPR 2024.
[3] Harnessing Webpage UIs for Text-Rich Visual Understanding. ICLR 2025.
See weaknesses. |
Lightly AI-edited |