|
Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The authors propose a two-stage training strategy to obtain a reasoning model in the RS domain. First, they construct a large-scale (380K) dataset of structured Geo-CoT rationales to SFT the base model, GLM-4.1V. Geo-CoT380K includes data from tasks such as VQA, Image Captioning, Scene Classification, Visual Grounding, Object Counting, and Object Detection. The SFT model is then further trained via GRPO. The resulting RSThinker model shows impressive performance on Grounding, Counting, Classification, VQA, and Captioning tasks, with a reasonable, interpretable thinking process. The paper is well written and straightforward.
1. The RSThinker model shows impressive performance on Grounding, Counting, Classification, VQA, and Captioning tasks, with a reasonable, interpretable thinking process.
2. The prompts for generating Geo-CoT380K are well designed. Geo-CoT380K should be a high-quality SFT dataset for teaching the model to think.
3. The tasks included in this paper are quite complete.
4. In the RS domain, this is the first work I have seen that makes the CoT-SFT + GRPO pipeline work.
1. The ablation study is quite brief; more analysis could be provided. (1) Is 380K CoT data necessary for teaching the model to generate CoT? (Are 50K, 100K, or 200K enough? Is this CoT-SFT stage mainly for activating the base model’s parameters to learn the reasoning pattern, or for injecting RS knowledge into a general base model?) (2) GLM-4.1V is already quite impressive on RS tasks, even in a zero-shot manner. Will this CoT-SFT + GRPO pipeline work for other base models such as InternVL-3.5 or Qwen3-VL?
2. A crucial training detail is not disclosed. After SFT, do the authors train GRPO for each task independently or for all tasks jointly? For example, for the VG task, is the model tuned with GLM-4.1V + Geo-CoT380K (Stage I, SFT) + Geo-CoT380K-VG + Additional Dataset in Table 2 (Stage II, GRPO)? Or is it GLM-4.1V + Geo-CoT380K (Stage I, SFT) + Geo-CoT380K + Additional Dataset (Stage II, GRPO)?
3. From Table 8, it seems GRPO does not improve the model capability very much after training with Geo-CoT. Is this improvement actually from GRPO, or from continued training?
1. The result in Table 4 is not reproducible. I tried GLM-4.1V-Thinking (zero-shot on VG) in a fresh conda GLM environment and only got 50.67 (compared with 63.8 in Table 4). The prompt is taken from the GLM-4.1V paper: "Tell me the position of the referred object in the picture. {Question} Answer in [x1,y1,x2,y2] format.", with coordinates normalized to 0–1000. It would be appreciated if the authors could provide the inference script (including the prompt).
2. I fine-tuned GLM-4.1V on the training set of VRSBench using LLaMA-Factory, and the result is 62.06 mIoU, far from the 87.7 in Table 8. It would be appreciated if the authors could provide some of the Geo-CoT data (the VRSBench training portion) for verification.
3. Is 380K CoT data necessary for teaching the model to generate CoT? (Are 50K, 100K, or 200K enough? Is this CoT-SFT stage mainly for activating the base model’s parameters to learn the reasoning pattern, or for injecting RS knowledge into a general base model?)
4. GLM-4.1V is already quite impressive on RS tasks, even in a zero-shot manner. I wonder whether this CoT-SFT + GRPO pipeline will work for other base models such as InternVL-3.5 or Qwen3-VL on RS tasks. Will it work for a weaker model like Qwen2.5-VL? The grounding task alone should be enough to verify this.
5. After SFT, do the authors train GRPO for each task independently or for all tasks jointly? For example, for the VG task, is the model tuned with GLM-4.1V + Geo-CoT380K (Stage I, SFT) + Geo-CoT380K-VG + Additional Dataset in Table 2 (Stage II, GRPO)? Or is it GLM-4.1V + Geo-CoT380K (Stage I, SFT) + Geo-CoT380K + Additional Dataset (Stage II, GRPO)?
6. Will training all tasks jointly benefit the model capability?
7. From Table 8, it seems GRPO does not improve the model capability very much after training with Geo-CoT. Is this improvement actually from GRPO, or from continued training?
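Regarding my reproduction attempts in questions 1–2: the evaluation I ran can be sketched as follows (this is my own minimal script, not the authors' protocol; `parse_box` and the mIoU aggregation are my helpers, assuming one `[x1,y1,x2,y2]` box per query in 0–1000 normalized coordinates):

```python
import re

def parse_box(text):
    """Extract the last [x1,y1,x2,y2] box from a model response (0-1000 normalized)."""
    matches = re.findall(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]", text)
    return tuple(map(int, matches[-1])) if matches else None

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def miou(preds, gts):
    """Mean IoU over all queries; unparseable predictions count as IoU 0."""
    scores = [iou(p, g) if p is not None else 0.0 for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)
```

If the authors' scoring differs from this (e.g., different box-parsing rules or handling of malformed outputs), that alone could explain part of the gap, which is why I ask for the exact inference and evaluation script.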
I will certainly raise the score if the authors address (some of) my concerns. |
Fully human-written |
|
Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces the Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT), a framework intended to add verifiability to remote sensing VLMs by forcing them to link reasoning steps to visual evidence. The authors create a new dataset, Geo-CoT380k, by using GPT-4V to generate structured rationales for existing RS data. They propose a two-stage training strategy (SFT and GRPO) to train their model, RSThinker, on this new dataset. The resulting model, which outputs a reasoning trace, is shown to achieve strong performance on several remote sensing benchmarks.
1. The paper addresses the important and well-motivated problem of model verifiability, which is a significant barrier to using VLMs in high-stakes remote sensing applications.
2. The proposed "Planning-Grounding-Synthesis" cognitive architecture is a promising and intuitive structure for modeling geospatial analysis.
3. The public release of the Geo-CoT380k dataset is a potentially useful artifact that could spur further research into reasoning-based remote sensing models.
4. The ablation studies clearly demonstrate that training with the CoT-structured data (SFT w/ CoT) yields significant performance gains over a baseline fine-tuned on answers alone (SFT w/o CoT).
1. The related work section is incomplete, failing to cite or discuss relevant prior work on CoT frameworks in other geographic contexts, such as GeoChain (Yerramilli et al., 2025) and Gaea (Campos et al., 2025).
2. The paper's central claim of "faithfulness" is largely unsubstantiated. The Geo-CoT380k dataset, which underpins the entire method, is generated by GPT-4V. The authors' claim that this "ensures faithfulness by design" is a strong overstatement, as the LM may be generating post-hoc justifications that are plausible but not faithful, a bias the authors briefly acknowledge.
3. The evaluation metrics do not actually measure faithfulness or verifiability. The paper presents strong accuracy metrics (mAP, MAE, Acc) but mistakes this for faithfulness. High accuracy does not prove the reasoning trace is correct or causal; the lack of any direct evaluation of rationale quality is a major omission.
4. The methodological novelty of the training process is questionable. The description of Group Relative Policy Optimization (GRPO) appears to be a standard application of PPO rather than a novel algorithm.
5. The SOTA comparisons are potentially flawed. RSThinker is fine-tuned on 380k CoT rationales, while the baselines were not. The performance gains may be a result of this massive, task-specific fine-tuning dataset, not a superior reasoning architecture.
1. Given that "faithfulness" is the core claim, why was no human study conducted to validate the quality of the GPT-4V-generated rationales in Geo-CoT380k?
2. How can you be sure the model's reasoning trace is causal to its answer, and not just a plausible justification generated in parallel to a "black-box" prediction?
3. Can you clarify how GRPO is algorithmically novel compared to a standard PPO implementation with reward normalization?
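To make question 3 concrete: as I understand the standard formulation (this is my own sketch, not the authors' code), GRPO replaces PPO's learned value baseline with a per-prompt group statistic, normalizing the rewards of the G responses sampled for the same prompt:

```python
def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: z-score each reward within the group of G
    responses sampled for one prompt, replacing PPO's critic baseline."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

If the paper's implementation amounts to this, it is PPO with a group-based baseline rather than a new algorithm, and the contribution should be framed accordingly.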
4. Wouldn't a fairer baseline comparison involve applying few-shot CoT prompting to the baseline models using exemplars from your dataset? |
Fully AI-generated |
|
Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses a critical limitation of current Vision-Language Models (VLMs) in the remote sensing domain: their tendency to function as "black boxes" that produce plausible but often incorrect and unverifiable responses. To foster more trustworthy AI, the authors introduce the Geo-CoT (Geospatial Chain-of-Thought), a novel framework that mandates a structured, step-by-step reasoning process where each analytical step is explicitly grounded in visual evidence from the image.
To train models in this new paradigm, the authors have constructed Geo-CoT380k, a large-scale dataset of nearly 380k structured reasoning examples, by retrofitting verifiable rationales onto existing public datasets. They then propose a two-stage training strategy: first, Supervised Fine-Tuning (SFT) is used to instill the foundational cognitive architecture of Geo-CoT, followed by Group Relative Policy Optimization (GRPO), an outcome-based reinforcement learning stage, to incentivize the factual correctness of the reasoning. Their resulting model, RSThinker, demonstrates state-of-the-art performance across a comprehensive suite of multiple remote sensing benchmarks, validating the effectiveness of their approach.
1. **Addresses a Critical and High-Impact Problem:** The paper tackles a fundamental and highly relevant issue. The lack of interpretability and faithfulness in current VLMs is a major barrier to their adoption in high-stakes domains like disaster response and security. By emphasizing verifiable reasoning, this work makes a significant and timely contribution toward building more transparent and trustworthy AI systems for Earth Observation.
2. **Novel and Well-Designed Framework:** The proposed Geo-CoT framework is a key innovation. It moves beyond generic Chain-of-Thought by requiring each reasoning step to be explicitly linked to visual evidence (perceptual grounding), a powerful concept for ensuring faithfulness. This forces the model to justify its claims with specific pixel regions, a common failure point for many VLMs.
3. **Massive and High-Quality Dataset:** The creation of the Geo-CoT380k dataset is a monumental contribution in its own right. Its scale, diversity, and meticulous annotation process make it an invaluable resource for the entire remote sensing community. The semi-automated pipeline used to create it is a clever and practical approach to large-scale data generation.
4. **Strong and Comprehensive Empirical Validation:** The paper provides a thorough and convincing experimental evaluation. RSThinker is benchmarked against a wide array of state-of-the-art models across a diverse set of tasks, ranging from object counting to complex VQA. The consistent and significant performance gains provide strong evidence for the superiority of the proposed approach.
1. **Methodological Novelty:** While the application of the two-stage training pipeline (SFT followed by RL) is highly effective, the paradigm itself is not new to the machine learning community. The primary contribution of this paper lies in the novel Geo-CoT framework and the creation of the dataset, rather than in the invention of a new training methodology. Acknowledging this and framing the work more explicitly as a novel application and extension of an existing paradigm would be beneficial.
2. **Potential for "Faithful Hallucination":** The model is trained to generate plausible-looking reasoning chains. A potential failure mode is that the model could learn to generate a syntactically correct and seemingly logical reasoning chain that is, in fact, disconnected from the actual visual evidence. While the two-stage training process is designed to mitigate this, a more in-depth analysis of this potential failure case, perhaps through a targeted error analysis, would strengthen the paper's claims of faithfulness.
Could you provide a more detailed error analysis to investigate the failure mode of "faithful hallucination," where the model generates a structurally correct but factually ungrounded reasoning chain? What mechanisms in your framework are most effective at preventing this? |
Fully AI-generated |
|
Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper introduces Geo-CoT, which explicitly injects verifiable intermediate evidence into multimodal reasoning. During SFT, the model learns spatially anchored reasoning traces (with boxes/coordinates); afterward, GRPO uses task metrics as rewards for alignment. The authors claim improvements in faithfulness and interpretability across several vision–language tasks.
The idea of tying visual localization to the reasoning chain so that intermediate evidence is checkable is intrinsically valuable.
The framework is evaluated across diverse remote-sensing tasks (detection, counting, VQA, captioning).
1. The approach can largely be summarized as “structured CoT + RL with task-level rewards.” Compared with recent CoT+RLHF/RLAIF and visual-CoT lines, this feels more like an engineering integration than a conceptual/algorithmic advance.
2. The collapse of the output format without KL suggests that the model may be learning templates rather than reasoning logic.
3. The related work section should further elaborate on the conceptual and implementation differences between these recent visual-CoT / RLAIF / GRPO-style methods and the proposed approach.
4. Lack of specific failure cases and cause analysis.
5. Lack of few-shot or zero-shot performance analysis on broader datasets such as OPT-RSVG [1].
6. I would like to know about the implicit understanding ability of the proposed method (as in SegEarth-R1 [2]).
[1] Language-guided progressive attention for visual grounding in remote sensing images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 1-13.
[2] Segearth-r1: Geospatial pixel reasoning via large language model[J]. arXiv preprint arXiv:2504.09644, 2025.
See Weaknesses |
Lightly AI-edited |
|
Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Geo-CoT, a perceptually grounded chain-of-thought framework aimed at enhancing the faithfulness and interpretability of vision-language models in remote sensing. The authors propose a large-scale dataset (Geo-CoT380k) containing multimodal reasoning chains for tasks like Visual Question Answering (VQA), Visual Grounding (VG), Counting, and Captioning. They also introduce a two-stage training approach combining supervised fine-tuning (SFT) and GRPO-based reinforcement learning, leading to the development of RSThinker, a model that demonstrates strong performance across a range of remote sensing tasks.
The Geo-CoT380k dataset is constructed with clear annotation rules and a consistent generation process. The two-stage training strategy (SFT + GRPO) is straightforward and effective, with experiments showing that both stages contribute to factual consistency and spatial grounding.
The study shows steady performance gains across VQA, captioning, and visual grounding, and highlights the value of verifiable reasoning for applied geospatial AI.
Overall, I do not agree with the authors’ claim that Geo-CoT can serve as a solution to the “critical gap” between existing works such as MM-CoT and the field of remote sensing observation. My main concerns are twofold:
1. The authors emphasize that there is a huge gap between MM-CoT and geoscience (lines 89–90), and they propose a concept called an “intention-driven active perception process,” claiming that this represents a “critical gap” in Earth observation. They also highlight *Perceptually-Grounded Geo-CoT* as a new paradigm in their contributions. However, I believe this emphasis is overstated. The so-called new paradigm essentially “mandates a verifiable link between each analytical step and its corresponding visual evidence,” which has already been extensively explored in follow-up works after MM-CoT (2023). I can list several such studies [1–3], and the authors should conduct a more thorough review of these. In the nearly two years since MM-CoT was proposed, numerous works have investigated the correspondence between reasoning steps and their associated visual evidence. The authors do not clarify how their approach differs from these; instead, they focus only on MM-CoT (2023) and emphasize that thinking with corresponding visual evidence is the “critical gap” in the remote sensing field, even though it is already widely recognized in the general domain. The Geo-CoT paradigm does not clearly demonstrate any distinction between the remote sensing observation domain and the general domain. The authors even state in the related work section (lines 143–147) that existing studies focus on reasoning over whole objects, while remote sensing involves “a verifiable log of fine-grained perceptual operations.” However, if the authors examined recent works on visual reasoning from the past one to two years, they would see that this kind of “fine-grained perceptual operation” has already been addressed in many general-domain studies. The fact that the authors list this as the first of their three main contributions suggests they may have deliberately overlooked related work in order to overstate the novelty of their approach.
2. There are serious issues with the evaluation metrics used in the downstream tasks, which I hope the authors will clarify. Specifically, the authors use mAP and mIoU as evaluation metrics for the Visual Grounding (VG) task. I will detail these concerns in my comments, but this choice raises significant doubts about the reliability of their downstream task evaluations.
[1]Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
[2]V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
[3]CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation
1. Visual Grounding is essentially a single-instance localization problem conditioned on language, not an object detection task requiring multiple candidate boxes with confidence scores. The paper reports both mAP and mIoU as evaluation metrics for the Visual Grounding (VG) task. While mIoU is understandable as it directly measures localization accuracy, the use of mAP is conceptually unclear.
2. The Geo-CoT380k dataset is a key contribution, consisting of 380k reasoning traces generated by GPT-4V. However, the paper lacks any quantitative or qualitative validation of these traces. Prior CoT work often applies explicit quality control when using LLM-generated chains, such as multi-sample consistency (self-consistency), deductive stepwise verification, or human expert spot-checking (e.g., [1], [2], [3]). But the paper does not mention such validation, making it unclear whether the reported improvements stem from the quality of reasoning chains or simply from the large data volume. This casts doubt on the true impact of Geo-CoT on reasoning faithfulness.
[1]Wang X, Wei J, Schuurmans D, et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR, 2023.
[2]Ling Z, Fang Y, Li X, et al. Deductive verification of chain-of-thought reasoning. In Proceedings of NeurIPS 2023.
[3]Wang Y, Zeng Y, Zheng J, et al. VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool. CoRR, 2024.
(1) Have the authors explicitly distinguished the task boundary between “visual grounding” and “object detection” in the paper? If so, where; if not, please clarify and justify why mAP is an appropriate metric for a language-conditioned single-instance Visual Grounding task.
(2) In the Visual Grounding task, each sample corresponds to a single referring expression without predefined object categories or multiple predictions. Under this setting, how are confidence scores for mAP computation obtained? If the model outputs only one bounding box per query, the definition of precision–recall and mAP becomes ambiguous. Could the authors clarify this evaluation setup?
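To make questions (1) and (2) concrete: a single-instance grounding evaluation needs neither class labels nor confidence scores. A minimal sketch (my own code, not the authors' protocol) of the two standard quantities, Acc@0.5 and mIoU, computed from one predicted box per referring expression:

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def evaluate_grounding(preds, gts, thresh=0.5):
    """One predicted box per query: report Acc@thresh and mean IoU.
    Neither number requires a confidence score, which is exactly why mAP
    (a metric defined over confidence-ranked detections) is ill-posed here."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    acc = sum(v >= thresh for v in ious) / len(ious)
    mean_iou = sum(ious) / len(ious)
    return acc, mean_iou
```

If the authors instead ran a detection-style mAP evaluation, they should explain where the per-box scores and category labels came from.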
(3) Could the authors clarify whether any quality control was performed on the GPT-4V-generated reasoning traces? If not, how can we be confident that the improvements are due to reasoning supervision and not just data scale effects? |
Lightly AI-edited |
|
Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces RSThinker, a vision–language model designed for faithful reasoning in remote sensing. The key idea is the Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT) framework, which reformulates remote-sensing analysis as a multi-stage, verifiable reasoning process: planning → evidence collection → synthesis.
To train such reasoning ability, the authors propose a two-stage alignment pipeline:
1. Supervised Fine-Tuning (SFT) on a newly constructed Geo-CoT380k dataset, which contains structured reasoning traces aligned with spatial evidence;
2. Group Relative Policy Optimization (GRPO) reinforcement learning, using task-specific metrics (mIoU, mAP, MAE, CIDEr) as direct rewards to encourage factual consistency.
Experiments across six remote-sensing benchmarks (VG, detection, counting, classification, captioning, VQA) show consistent and sometimes striking improvements over both general and remote-sensing-specific VLMs. The paper claims that these gains stem from explicitly grounding reasoning steps to verifiable spatial evidence, rather than from larger training data or model size.
1) Problem significance: addresses the lack of verifiable reasoning in remote-sensing VLMs—a recognized bottleneck for high-stakes applications (disaster response, environmental monitoring).
2) Conceptual integration: unifies planning–evidence–synthesis reasoning with spatially grounded supervision and metric-aligned RL—a combination rarely systematized before.
3) Dataset & framework completeness: Geo-CoT380k provides the first large-scale, structured reasoning corpus in Earth observation; the pipeline is end-to-end reproducible.
4) Empirical performance: substantial and consistent improvements across tasks; ablations confirm that both SFT and GRPO contribute.
5) Broader impact: opens a pathway for “auditable” reasoning in multimodal models, extending beyond remote sensing.
1) Fairness and data leakage risk:
Some evaluation benchmarks (e.g., VRSBench-VG, DIOR-RSVG) appear in the SFT or RL training corpus (Table 1 & 2), yet the baselines may be zero-shot.
Without clear disclosure and matched fine-tuning, large performance gaps could reflect training data advantage.
→ Action: provide a detailed train/test mapping table and equal-training baselines.
2) Missing comparison to strongest related paradigms:
Section 2.2 mentions MM-CoT but omits direct references to Grounded-CoT / Argus.
→ Action: add a method-level comparison table clarifying differences (e.g., explicit spatial grounding, task-specific RL, remote-sensing scalability).
3) Evaluation scope:
The claim of “faithful reasoning” would be stronger with human or automated evidence–conclusion consistency scores.
4) Reward design details:
GRPO uses end-task metrics as rewards, but credit assignment to intermediate reasoning steps is under-specified; ablation on reward sensitivity is missing.
5) Lack of failure-case analysis:
The paper reports uniformly strong results but shows no failure examples or discussion of limitations.
→ Action: include representative failure cases and categorize typical errors (e.g., grounding misalignment, reasoning inconsistency, hallucinated evidence).
1) Please provide a complete table mapping each evaluation benchmark to its presence in the training corpus (SFT/RL), specifying which splits were used and ensuring no test leakage.
2) Were baseline models fine-tuned on the same data or evaluated zero-shot? If fine-tuned, what settings were used?
3) Could you report Seen vs Unseen benchmark results and a leave-one-benchmark-out generalization experiment?
4) How is the GRPO reward computed per sample? Is it normalized by sample difficulty? Any instability observed when optimizing with heterogeneous metrics?
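As a concrete version of question 4: when mIoU, MAE, and CIDEr serve as rewards in the same GRPO run, they live on different scales and orientations (lower MAE is better), so some per-task normalization seems necessary. A hypothetical sketch of what I mean (the function and its scale constants are my assumptions, not the paper's scheme):

```python
def normalize_reward(task, value, mae_scale=10.0, cider_scale=2.0):
    """Map heterogeneous task metrics onto a common [0, 1] reward scale.
    mae_scale and cider_scale are assumed normalization constants."""
    if task == "grounding":      # mIoU is already in [0, 1]
        return value
    if task == "counting":       # lower MAE is better; squash to (0, 1]
        return 1.0 / (1.0 + value / mae_scale)
    if task == "captioning":     # CIDEr assumed to lie roughly in [0, cider_scale]
        return min(value / cider_scale, 1.0)
    raise ValueError(f"unknown task: {task}")
```

Whether the authors do something like this, and how sensitive training is to the chosen scales, is exactly what the paper leaves unspecified.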
5) Do you have any evidence-consistency or hallucination evaluation to support the claim of “faithful reasoning”?
6) Please clarify whether RL rewards ever touched validation/test splits (they should not).
7) For fairness, consider adding at least one grounded-CoT / Argus-style baseline to demonstrate that the gain is not purely due to data scale. |
Fully AI-generated |