ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 15899 (21%) | 4.43 | 3.58 | 3687 |
| Heavily AI-edited | 3233 (4%) | 4.22 | 3.59 | 2990 |
| Moderately AI-edited | 7082 (9%) | 4.20 | 3.61 | 2722 |
| Lightly AI-edited | 16648 (22%) | 4.15 | 3.68 | 2746 |
| Fully human-written | 32938 (43%) | 4.13 | 3.62 | 2917 |
| Total | 75800 (100%) | 4.21 | 3.62 | 3026 |
Title Ratings Review Text EditLens Prediction
ChemReason: A Chemical Code-Driven Reasoning LLM via Verifiable Reinforcement Learning Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper focuses on training an LLM that reasons better by leveraging its chemical code generation ability, tool usage, and training strategies. It leverages several chemical tools that automatically return results given chemical code as input, constructs a reasoning output format, and devises ways to deal with the cold-start problem. Experimental results show that the LLM trained with the proposed scheme can outperform closed-source LLMs such as Claude-3.5 and open-source LLMs such as DeepSeek-R1. 1. The proposed methods are useful ways to augment an LLM's abilities, such as reasoning, tool usage, and reflection. 2. The experimental results are very impressive. 1. The proposed methodologies are generally known methods for training an LLM (cold start, training an LLM to learn better reasoning and tool usage). The contribution seems to lie more in adapting these known strategies and implementing them, presumably with good engineering, for a specific chemistry task. In this sense, the contribution is not so significant in terms of bringing new knowledge to the community. I would suggest that the authors rewrite the contribution section, perhaps considering what novel findings / insights emerged during implementation, how the implemented methodology differs from existing ones, and what insights enable an 8B model to be much better than the powerful public LLMs on some metrics. These unique insights may count as good contributions. 2. The writing and presentation can be improved. For example: (a) section 3 is named with "methodolog", lacking a "y"; (b) line 259 should use \citep{} instead of \citet{}. Which specific LLM is Claude 3.5? Sonnet or Haiku? Fully human-written
ChemReason: A Chemical Code-Driven Reasoning LLM via Verifiable Reinforcement Learning Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a chemical reasoning LLM, ChemReason, for 9 different tasks spanning molecular generation, molecular optimization, and molecular editing. The core idea of this paper is to incorporate code as a verification tool that allows chemical LLMs to verify and reflect during thinking. They construct such trajectories, train Qwen3-8B with cold-start SFT and RLVR, and evaluate their model using TOMG-Bench. They are the first to implement tool RL and use code as a tool in the field of chemistry. The results on TOMG-Bench are promising. 1. The paper is poorly written and organized, to the point of significantly hindering readability. Even several fundamental requirements for an academic paper have not been properly met. For example: * A great number of citations are missing, such as TOMG-Bench, Synapse, and all the baseline models used in their experiments. (Also, the format of all the citations is incorrect.) * The writing logic of the Related Work section is confusing, and some parts read more like content that should belong in the Introduction section. * Several concepts in the paper are insufficiently or inadequately explained, such as the method for assigning difficulty levels to the problems in TOMG-Bench, the role or function of R' in Figure 2, and the differences between C-GRPO and GRPO. 2. Their description of the advantages and core methodology of GRPO is almost entirely incorrect. The key improvement of GRPO lies in estimating the expected reward of the current state, V(s), through group-wise averaging, thereby eliminating the need for a critic model as used in PPO. However, in this paper’s discussion: * “Instead of assigning absolute rewards, GRPO normalizes rewards within each group.” Other algorithms, such as REINFORCE-Baseline and PPO, also compute the advantage rather than directly using the absolute reward, so this is not a distinguishing feature of GRPO. * “While other RLHF approaches require a dedicated reward model, GRPO can flexibly rely on any scoring function or even a stronger LLM to assess solution quality.” The authors seem to conflate the RL optimization algorithm with the reward system. For all algorithms that support optimization using outcome rewards, such as REINFORCE++, PPO, and GRPO, the method of obtaining the reward is irrelevant to the optimization process itself. As long as an outcome reward can be provided, these algorithms can be used for training. 3. The dedicated design of code templates seems to hinder the generalization capability of the model. 1. The format of all the citations is incorrect. You should use \citep, not \citet, in most cases. 2. Please provide more explanation of the questions in the third bullet point of Weakness 1. Fully human-written
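To make the group-normalization point in Weakness 2 above concrete, here is a minimal, illustrative sketch (not taken from the paper) of GRPO-style group-relative advantages: the group mean acts as the baseline that removes the need for PPO's learned critic, and the outcome rewards themselves can come from any scoring function (a rule check, code execution, or a judge model).

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO: each sampled completion's
    reward is normalized against the mean/std of its own group, so no learned
    critic (value model) is needed to estimate the expected reward."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Outcome rewards for G completions sampled from the same prompt; they could
# come from any scorer (rule-based check, code execution, stronger LLM, ...).
rewards = [1.0, 0.0, 1.0, 0.0, 0.0]
print(grpo_advantages(rewards))
```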
ChemReason: A Chemical Code-Driven Reasoning LLM via Verifiable Reinforcement Learning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors introduce ChemReason, a tool-augmented reasoning LLM fine-tuned for molecular editing, optimization, and generation. Grounded in generative code reasoning, it verifies and reflects on its own steps by auto-generating and executing verification code during the chain of thought. - S1: The paper tackles a real, high-impact challenge in chemistry/drug discovery: language-based molecular optimization. - S2: A solid pipeline to generate traces for SFT and model training, with an ablation study on the main components: access to tool calls and RL. - S3: Results are strong: the method outperforms much larger general-purpose models on the TOMG benchmark. ### Major: - **W1: Potential task overfitting vs. genuine chemical reasoning gains.** The model is explicitly trained on the same tasks it is evaluated on, unlike the baselines. It’s unclear whether improvements reflect overfitting or superior chemical reasoning. Please add: - *Task-transfer tests*: Without retraining, does the model generalize to related objectives (e.g., solubility/TPSA optimization, scaffold hopping)? - *Capability retention*: Does it retain general language-modeling abilities, or have these degraded? - **W2: Lack of uncertainty and significance.** The significance of results cannot be assessed without variability estimates. If these are Pass@1, please run ≥3 independent runs with different seeds and report mean ± s.d., plus ideally paired significance tests. - **W3: Unclear differentiation from tool-enabled general LLMs.** Conceptually, your model is an LLM with Python + *RDKit* tool access, capabilities also available in recent GPT models. What differentiates your model from them? I know that they are expensive models, but having this comparison would be very valuable. ### Minor: - W4: Missing Limitations section. - W5: Related work coverage (contemporaneous). Although these are near-contemporaneous, please mention early preprints on chemical reasoning LLMs (e.g., ether0, ChemDFM-R) in the Introduction/Related Work for completeness. - W6: Terminology: “cold-start.” The term is used inconsistently in the LLM/RL literature, sometimes for an SFT phase (as in your case) and sometimes for RL without an SFT phase. Currently, you introduce “cold-start” at L68 and only clarify SFT at L240. Please clarify the first time you use the term that you mean an SFT phase. - W7: Missing citation for Synapse (L127). - W8: Typo in "Methodology" (L146). - W9: Figures and Tables are not self-contained. E.g., Fig 2: Add a legend or panel description clarifying symbols/labels (T, C, R, CT, M, etc.) so the figure is self-contained, and Table 1: Define Aut., Bon., Fun. in the caption; state the metric(s); explain what bold and underlining denote. - What is the difference between the Qwen3-8B-SFT (original data) / SFT(ori), C-SFT, and C-SFT+TIR models? That is, what is the original data? Once again, I thought that you only had traces with code calls. Did I misunderstand something? Also, do I understand correctly that the difference between C-SFT and C-SFT+TIR is that both were trained on the traces with code but only +TIR had access to tool calls during evaluation?
As you can see, I am a bit confused about the differences between your models and the SFT training data. Please clarify these. It seems that you give some of these explanations in Section 4.3 after the main results. Please explain the different models in the main table before showing the main results, or at least give a hyperlink to that section, or simply remove the ablation study results from the main table since you show them in Fig. 4 anyway. - Q1: On L277, you mention that GRPO doesn't require a reward model, but in Fig. 3 you have a reward model in the C-GRPO pipeline... - Q2: I am not sure I understand Eq. 1 and what process it refers to. The trajectory generation for the SFT? To me it reads as the likelihood of the answer given a set of tasks $\mathcal{T}$, but it should probably be for one specific task? And do I understand correctly that $k$ is the number of reasoning-code-result iterations, or what exactly is a reasoning step? - Q3: I am not sure I understand Section 3.3.2 and Eq. 2 correctly. From previous sections, I understood that all your CoTs contain a tool call. However, from Eq. 2 I would understand that you have cases with tool calls and others without. Please clarify. - Q4: Please clarify the variants before the main results: What exactly are Qwen3-8B-SFT (ori), C-SFT, and C-SFT+TIR? What is "original" data? Also confirm whether C-SFT and C-SFT+TIR share the same code-trace training data, with +TIR differing only by tool access at evaluation, as to me TIR doesn't make sense during SFT. Add a short explanation of each of the variants before Table 1 (or at least cross-link to §4.3), or move the ablations out of the main table, as the ablation results are already shown in Fig. 4, to avoid confusion. Fully human-written
ChemReason: A Chemical Code-Driven Reasoning LLM via Verifiable Reinforcement Learning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose a model with native tool usage. Specifically, during trace generation, the model produces Python-based verification tools to check its own outputs. The generated code is executed in a sandbox environment, and the resulting outputs are incorporated back into the reasoning trace. Based on these tool results, the LLM learns to either initiate another reasoning round or stop and produce the final answer. The authors evaluate their model on the TOMG benchmark, where it outperforms baseline methods. An ablation study further shows that the proposed multi-stage training procedure contributes to the observed performance gains. * **(S1) – Relevance of the tackled research question.** The proposed approach rests on a critical assumption: compared to external tool usage, an LLM’s reasoning capabilities might improve when it is required to generate tools on the fly. This assumption appears plausible, as generating executable code forces the model to make explicit design choices and reason about program correctness. In doing so, the LLM may develop a deeper understanding of the problem, which could also enhance general reasoning, e.g., reasoning about orchestrating tool usage and arriving at final solutions. Consequently, investigating whether this assumption holds is highly relevant and could yield genuinely novel research insights. * **(S2) – Experimental results indicate potential significance.** Experiments on the TOMG dataset and corresponding ablation studies indicate that the staged training procedure (SFT followed by RL) is beneficial. Models trained with both stages outperform their respective baselines. Compared to other base models, the proposed approach also seems to perform well. The choice of the TOMG dataset is appropriate, as it enables evaluation across three relevant tasks: molecular editing, molecular optimization, and conditioned generation. ### Comments * A notable advantage of the proposed approach is its ability to verify the validity of generated SMILES strings (i.e. RDKit processibility). However, this benefit is conceptually equivalent to that of LLMs employing external tool usage, where RDKit acts as the external verifier. * **(W1) - The assumptions made are not well supported, diminishing both the manuscript quality and the overall relevance of the work**. - Statements about tool-augmented models: a) "they remain far from realizing the vision of autonomous scientific assistants" (l132) and b) "They still struggle to unify reasoning, tool invocation, and execution into a coherent, learnable process" (l133f). The reference provided for (a) does not substantiate this statement and does not focus at all on tool-augmented agents. For (b), no reference is provided. Consequently, the validity of both statements remains highly unclear. - Verifiability: In a strict sense, the correctness of the generated tools cannot be guaranteed. While in most practical cases it might be sufficient to check only for the correct answer, the generated tools might rely on Clever-Hans effects rather than on a correct implementation. 
* **(W2) - The central research question is never answered due to a missing experiment**. The interesting research question is: Does generating tools on the fly within the reasoning trace help to boost reasoning capabilities compared to models using external tools? However, this question is never answered because no model with external tool usage has been trained and evaluated. To answer the raised question, reasoning traces for both approaches would need to be compared. A standard RL-trained model without tool usage would also be necessary to evaluate whether tool usage is required at all to solve the tasks. Currently, the success rates of standard chemical reasoning models remain unclear. * **(W3) - The manuscript includes several mistakes, typos, and unclear passages**. - Brackets for references are missing. - Typos: * l092: introductionaa * l134: missing white spaces * l146: Methodolog - The paragraph in Section 3.1.1, which introduces the TOMG tasks, is understandable but needs rewriting. - Formula (1) is cluttered. - Clarity issues: * l199: "we prompt a strong code model" is not specific enough. The authors should describe the model used. * Figure 2: Used abbreviations, e.g., T'_2, should be introduced. * l264: "data portion" is unclear. The authors should specify the data used. * "C-GRPO": In the RL step, the model is essentially trained with standard GRPO. The rebranding to C-GRPO adds confusion. * "Code success nums" (l306f) is unclear and needs further explanation. * The abbreviation "Aut" (l381) in Table 1 is unclear. * Table 1: The included models SFT(ori), C-SFT, and C-SFT-TIR are insufficiently explained, despite the ablation study section allowing an educated guess as to their meaning. * **(W4) - Missing error bars and statistical tests reduce the significance of the results**. The tables and figures are reported without error bars or statistical tests, making it unclear how much of the observed variation could have arisen by chance. The authors should include error bars and perform appropriate statistical tests. * **(W5) - Related work and experiments lack consideration of other chemical reasoning models**. Reasoning models for chemical tasks is the core of the manuscript, yet the authors completely ignore large parts of relevant prior work in this area, e.g., [1,2]. ### References * [1] Narayanan. Training a Scientific Reasoning Model for Chemistry * [2] Zhao. MolReasoner: Toward Effective and Interpretable Reasoning for Molecular LLMs * How different is the generated code from the associated templates? Could the code also be fully pre-written, with the LLM only required to execute it? If so, how would the training behavior differ? Fully human-written
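A brief note on the verifiability comment above: RDKit processability only confirms that a generated SMILES string parses into a molecule, not that the generated verification code is chemically correct. A minimal sketch of such a validity check, assuming RDKit is installed, looks like this:

```python
from rdkit import Chem

def is_valid_smiles(smiles: str) -> bool:
    """Return True if RDKit can parse the SMILES string into a molecule."""
    return Chem.MolFromSmiles(smiles) is not None

print(is_valid_smiles("c1ccccc1O"))   # phenol -> True
print(is_valid_smiles("C1=CC=CC="))   # malformed ring -> False
```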
United Minds or Isolated Agents? Exploring Coordination of LLMs under Cognitive Load Theory Soundness: 3: good Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a new multi-agent prompting framework for enhancing LLM reasoning, particularly on complex reasoning problems. The paper draws an analogy between in-context learning (ICL) failures in LLMs and cognitive load theory, claiming that ICL's limitation can be viewed as cognitive overload, where the LLM's working memory is insufficient for the task's cognitive load. The paper then proposes CoThinker, which divides the cognitive load across multiple parallel agents and maintains a shared working memory. The CoThinker framework shows moderate improvements on LiveBench and CommonGen-Hard compared with direct IO, CoT, and previous multi-agent debating methods. - The analogy from cognitive load theory is interesting and has the potential to open up new perspectives in understanding and improving large language models - A comprehensive pilot study and citations from the cognitive science perspective are provided - The empirical results show that the framework generalizes well across different LLMs - The paper is clearly written and easy to follow - Although the paper spends a large portion trying to justify the CLT-LLM analogy, my main concerns are as follows: - It is still not clear what "cognitive load" means in an LLM: what is the difference between "cognitive load" and simply "task difficulty"/"reasoning complexity"? Both attention entropy and perplexity only reflect the "uncertainty" of the model, which also directly correlates with the difficulty/complexity of the task. - It is not clear why the analogy helps: I find it hard to identify any unique insights (compared with existing work on prompting methods) that the CLT-LLM analogy can bring. The final proposed method still boils down to a central meta-agent + multiple sub-agents. I feel the analogy does not truly contribute to proposing a novel and effective method. - On LiveBench, the gain over the second-best baseline is very marginal, again raising the question of why this analogy helps - Some closely related works are missing from both the related work section and the experiments: - Wang, Zhenhailong, et al. "Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration." arXiv preprint arXiv:2307.05300 (2023). - Suzgun, Mirac, and Adam Tauman Kalai. "Meta-prompting: Enhancing language models with task-agnostic scaffolding." arXiv preprint arXiv:2401.12954 (2024). N/A Fully human-written
United Minds or Isolated Agents? Exploring Coordination of LLMs under Cognitive Load Theory Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper investigates the performance limitations of Large Language Models (LLMs) on complex, multi-faceted tasks through the lens of Cognitive Load Theory (CLT) from cognitive science. The proposed multi-agent framework CoThinker operationalizes CLT principles by distributing intrinsic cognitive load through agent specialization and managing transactional load via structured communication and a collective working memory. Experiments on LiveBench and CommonGen-Hard demonstrate improved performance over the baselines, especially on high cognitive load tasks. 1. The application of Cognitive Load Theory to explain LLM limitations is novel and insightful, bridging human intelligence and machine intelligence. 2. Clear system design of the proposed method. Each component—agent specialization, the transactive memory system, and the communication moderator—directly maps to established principles for managing cognitive load in human collaborative systems. 1. Comparisons in experiments omit some strong structured reasoning and multi-agent baselines (e.g., Tree-of-Agents, Agents with a leader, etc.). Statistical significance and variance across seeds are not consistently reported. 2. Although each part of the system can be mapped to CLT, these three parts are typical settings for multi-agent systems. The inspiration from CLT for designing the detailed algorithms of each part is lacking. 3. There is a scalability analysis in the manuscript, but the experimental settings are too few to fully show the correlation between the number of agents and performance. 1. Can you provide quantitative confirmation of small-world properties in the communication graph? 2. How sensitive are results to the choice of N and β beyond the reported ranges? Could you include task-wise adaptive selection strategies? 3. Can you provide more insights from CLT to explain the details of the system design? Fully human-written
United Minds or Isolated Agents? Exploring Coordination of LLMs under Cognitive Load Theory Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper attempts to connect cognitive load theory with limitations of large language models, and builds a multi-agent framework to mitigate cognitive overload through agent specialization and structured communication. 1. The identified issue about the mismatch between task complexity and model processing capability is important and deserves further exploration. 2. The experiments are comprehensive, with solid ablation studies and well-organized analyses. 1. Cognitive science is mainly used as a rhetorical framing to justify a fairly standard multi-agent architecture (e.g., role assignment, communication bus, small-world topology). While it works, the CoThinker system lacks real novelty in design. I would expect cognitive theories to inspire genuinely new forms of multi-agent organization, rather than merely serving as interdisciplinary justification for existing designs. 2. If CLT is to be meaningfully applied to LLMs, the key questions should be: how to measure a model’s working-memory capacity, how to quantify a task’s cognitive load, and most crucially, how to determine (in a quantifiable way) how tasks should be decomposed or allocated once both cognitive load and working-memory capacity are measurable. The paper does not directly address these points. 3. The validation experiments in Section 3 do not provide real evidence; they merely restate an obvious fact: harder tasks make the model less confident, and clearer instructions help with difficult problems. 1. How do you envision quantitatively measuring “working memory capacity” in LLMs, beyond indirect proxies like attention entropy or perplexity? 2. Could the proposed framework adaptively estimate cognitive load and decide when to invoke multi-agent collaboration? 3. How sensitive is the performance of CoThinker to the chosen communication topology? Heavily AI-edited
United Minds or Isolated Agents? Exploring Coordination of LLMs under Cognitive Load Theory Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. - This work explains the performance ceiling of in-context learning by comparing the LLM attention mechanism with human working memory, arguing that LLMs are subject to a cognitive load analogous to that in cognitive science, which can be measured by attention entropy and perplexity. - Based on this CLT view of LLMs and solutions to cognitive overload, the work introduces a multi-agent framework, CoThinker, consisting of parallel agent thinking, a transactive memory system, and a communication moderator. The experiments show the effectiveness of this framework in improving LLM performance on complex tasks. - Cognitive Load Theory provides an explanation for LLM performance limits and a clear design rationale for multi-agent collaboration. - CoThinker takes cognitive science principles—like working memory, collective cognition, and small-world communication—and turns them into practical, easy-to-understand tools. This connects human cognitive theory with machine collaboration. - The authors tested their theory using quantitative measures (entropy, perplexity) and multiple benchmarks across different model families, showing it’s robust and works broadly. Detailed ablation studies (on communication moderators, TMS, and thinking styles) help figure out which mechanisms do the most to cut down cognitive load. - CoThinker underperforms on low-load tasks (e.g., instruction following) due to communication overhead—suggesting inefficiency in simple contexts. - The chosen proxies for cognitive load (attention entropy, perplexity) are suggestive but indirect. Can you give more explanation of the relationship between attention entropy and cognitive load? Why were attention entropy and perplexity chosen as cognitive load proxies, and how might alternative measures, e.g., gradient variance, compare? Fully human-written
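For reference on the proxy questioned in the reviews above: attention entropy is usually computed per query as the Shannon entropy of the attention distribution over the $n$ context tokens (the paper's exact definition may differ),

$$H(\alpha) = -\sum_{j=1}^{n} \alpha_j \log \alpha_j, \qquad \alpha_j \ge 0, \quad \sum_{j=1}^{n} \alpha_j = 1,$$

so higher entropy means attention is spread more diffusely over the context, which is the sense in which it is read as a load/uncertainty signal; perplexity is the exponentiated average negative log-likelihood of the generated tokens.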
Not All Clients Are Equal: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients Soundness: 4: excellent Presentation: 4: excellent Contribution: 4: excellent Rating: 10: strong accept, should be highlighted at the conference Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper tackles personalized federated learning (PFL) under realistic heterogeneity: data heterogeneity where each client has distinct multi-modal tasks with temporal shifts, and model heterogeneity where clients use different model families and sizes. It introduces FedMosaic, which combines relevance-guided aggregation and PQ-LoRA to enable selective knowledge sharing and cross-architecture parameter sharing. The authors also release DRAKE, a multi-modal PFL benchmark with 40 tasks spanning VQA, visual reasoning, and visual relations, including multi-image inputs and unseen task evaluation under distribution shifts. Empirically, FedMosaic outperforms strong baselines across heterogeneous/static/dynamic and cross-family settings on Self (personalization) and Others (generalization), and improves fast adaptation on unseen tasks; ablations show both RELA and PQ-LoRA contribute meaningfully. - Well-posed problem + realistic setup. The paper motivates that most PFL work oversimplifies heterogeneity; here, clients differ in both data and model (families and depths/sizes), which is closer to practice (agentic AI, device constraints). - Clear algorithmic design. The proposed method comes with clear motivation and a corresponding design solution. For instance, RELA computes client-wise gradients on a small frozen model. It applies EMA decay to track shifting client knowledge and also adds Gaussian noise + random subsampling for privacy and bandwidth. - Benchmark contribution. DRAKE covers multi-modal, multi-image tasks, temporal shifts, and unseen evaluation; the table contrasts prior FL benchmarks along these axes. - Strong and granular evidence. The experiments are comprehensive, covering settings such as heterogeneous (same-family) PFL and cross-family heterogeneity, and they also analyze performance from a per-client view and in terms of fast adaptation. Detailed ablations, showing that adding PQ-LoRA improves Others and adding RELA further lifts Self/Others, are also provided. While RELA applies EMA, noise, and subsampling to the last-layer gradients from a privacy perspective, an explicit comparison to baselines with stronger privacy guarantees would strengthen the claim. - I'm curious about the relevance between the tasks in your DRAKE benchmark, e.g., by showing the relevance matrix. Fully human-written
Not All Clients Are Equal: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper aims to address the challenges of personalized federated learning in scenarios where both data and models are heterogeneous. The authors propose a framework named FedMosaic, which consists of two core components: RELevance-guided Aggregation (RELA) and PQ-LoRA. RELA is a task-relevance-based model aggregation strategy that constructs customized global models for clients. PQ-LoRA is a module shareable across heterogeneous models, addressing differences in model depth and dimensions through "block-wise aggregation" and "weight alignment". Additionally, the authors propose DRAKE, a comprehensive multimodal federated learning benchmark that covers 40 different tasks and simulates real-world task diversity and temporal distribution shifts. Experiments on both multi-modal and text-only benchmarks demonstrate that FedMosaic outperforms PFL methods in both personalization and generalization. Overall, this paper tackles a meaningful problem in personalized federated learning. The authors find that existing personalized federated learning methods are still confined to simplified scenarios with highly homogeneous data and models across clients, while real-world scenarios are more complex. They propose FedMosaic, which addresses the simultaneous heterogeneity of data and models through a task-correlation-aware model aggregation strategy and dimension-invariant modules. Additionally, they introduce DRAKE, a comprehensive multimodal federated learning benchmark. **Major Weaknesses:** Overall, this paper has some merits, but there are a few weaknesses that stop me from giving a higher rating. My major concerns are as follows. (1) The paper mentions that the FedMosaic method does not require high computational costs, and the authors' experiments indeed include sections related to computational costs. However, the weight alignment process in PQ-LoRA seems to be relatively complex, and the paper does not provide information about the computational cost of this part. (2) Section 4.2.1 of the paper mentions using CKA to find relative depth alignment and demonstrates this with Llama-1B and Llama-3B, but lacks sufficient explanation regarding the applicability of this method. (3) The weight alignment part of the PQ-LoRA section mentions freezing the smaller model as a pivot and updating the larger model. The paper lacks an explanation of why this strategy was adopted. (4) DRAKE is one of the contributions of the paper, but extensive details are provided in the appendix, with relatively limited space allocated in the main text. **Minor Weaknesses:** (1) Figure 2 provides an overview of FedMosaic, but the image is relatively dense and slightly lacking in readability, so it could be adjusted a bit. Please address my concerns in the Weaknesses section. Lightly AI-edited
Not All Clients Are Equal: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This work addresses both data and model heterogeneity in Personalized Federated Learning (PFL). The authors propose FedMosaic, a framework that jointly mitigates these challenges through two core components: RELA and PQ-LoRA. RELA (Relevance-guided Aggregation) constructs client-specific global models by weighting updates based on task relatedness, enabling effective knowledge sharing among similar clients while reducing interference across unrelated tasks. PQ-LoRA introduces dimension-invariant low-rank adapters whose parameters depend only on rank $r$, allowing efficient and architecture-agnostic knowledge sharing across heterogeneous models. To more accurately capture real-world task heterogeneity and distribution shifts, they further introduce DRAKE, a comprehensive multi-modal federated learning benchmark. 1. Experimental results demonstrate consistent improvements over strong baselines, indicating the effectiveness of the approach. 2. The appendix further provides a thorough and extensive suite of experiments, supporting the validity and robustness of the reported findings. 3. The paper is generally well-written and clearly organized, making the technical ideas easy to follow. 1. The comparison against prior works using non-IID splits of a single dataset may not be entirely equitable. The contextual settings differ significantly (maybe the earlier studies targeted models specialized for single-domain or unimodal tasks, rather than fine-tuning billion-parameter foundation models). Moreover, the motivation for exploring multi-modal tasks in PFL requires further clarification. What are the practical or deployment-oriented use cases where clients naturally possess distinct modalities? At present, the setup appears somewhat hypothetical, with each client operating on different data and architectures. In such a scenario, the incentive for federated participation is not clearly articulated. 2. The novelty of RELA is not fully evident. The client-wise gradient update formulation $\hat{g}_i^{(t)} = (1 - \alpha) \hat{g}_i^{(t-1)} + \alpha g^{(t)}_i$ closely resembles a first-order exponential moving average (EMA), similar to adaptive optimization methods such as Adam. Furthermore, the addition of a sanitization or noise component introduces privacy-related implications that warrant more rigorous analysis. If differential privacy–like noise is applied, the paper should evaluate its robustness against gradient-based privacy attacks and report accuracy trade-offs with and without the noise injection. 3. The paper attempts to address multiple orthogonal challenges simultaneously (data heterogeneity, model heterogeneity, privacy), which can dilute the focus of the contribution. A clearer ablation or modular analysis could help isolate the effects of each component. Currently it's unclear how impactful the "computing gradients at every $m$ batch is" or how impactful (accuracy-wise) the sanitized gradients are. 1. In the related work section, most PFL citations are listed without discussion. 
It would be helpful to briefly summarize the current state of the field: What approaches do recent state-of-the-art methods adopt, and what limitations does FedMosaic specifically address beyond them? 2. The preliminaries conclude with a PFL objective, but the formulation of the global model objective is unclear. How does the given objective differ from the standard local objective, and why does it include terms dependent on other clients’ models? 3. In Equation (1) and Figure 2, are the gradients computed with respect to the frozen weights $W_s$? 4. The rationale for computing only the last-layer gradient (based on the proportionality of preceding gradients via the chain rule) requires further justification or empirical support. Are there results related to it in the appendix? 5. The paper mentions that gradients $g_i$ are computed every $m$ batch iterations rather than every batch. What is the observed accuracy trade-off with and without this optimization? 6. For PQ-LoRA, does the method assume that all clients use the same low-rank dimension $r$? If so, the approach still enforces a degree of architectural homogeneity. Given that the core challenge is model heterogeneity, how can we justify $P$ and $Q$ modules remaining dimensionally the same? 7. How are the LoRA parameters $A$ and $B$ trained? Lightly AI-edited
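To make the update discussed in point 2 of the weaknesses above concrete: the EMA formulation, plus the Gaussian-noise and subsampling sanitization the paper mentions, could be sketched as below. The function names and the exact noise/subsampling scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def update_client_gradient(g_ema, g_new, alpha=0.1):
    """First-order EMA of the client's last-layer gradient:
    g_ema <- (1 - alpha) * g_ema + alpha * g_new."""
    return (1.0 - alpha) * g_ema + alpha * g_new

def sanitize(g, noise_std=0.01, keep_frac=0.25, rng=np.random.default_rng(0)):
    """Illustrative sanitization: add Gaussian noise, then transmit only a
    random subset of coordinates (for privacy and bandwidth)."""
    noisy = g + rng.normal(0.0, noise_std, size=g.shape)
    mask = rng.random(g.shape) < keep_frac
    return noisy * mask

g_ema = np.zeros(8)
g_new = np.ones(8)
g_ema = update_client_gradient(g_ema, g_new)
print(sanitize(g_ema))
```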
Not All Clients Are Equal: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients Soundness: 3: good Presentation: 3: good Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper considers personalized federated learning (pFL) of multimodal large language models (MLLMs) under realistic scenarios that involve not only data heterogeneity, but also model architecture and model family heterogeneity, and task diversity. The paper designs a new method called FedMosaic that enables FL style collaboration across clients even in the simultaneous presence of all of the heterogeneities. FedMosaic has two important components - RELA (Relevance-guided Aggregation) and PQ-LoRA (Dimension-invariant Low-Rank Adaptation) that respectively address (data, task) and model heterogeneities. In an effort to make evaluation more realistic, the paper also introduces a new benchmark called DRAKE that incorporates all these heterogeneities and further includes aspects like dynamic distribution shifts and unseen tasks. Extensive experimental evaluation is provided in the paper for FedMosaic as well as several other state-of-the-art pFL baselines, which establish superior characteristics of FedMosaic. (S1) The writing and presentation of the paper are very clear in terms of both algorithm design and experimental results. Adequate intuitions are provided throughout the paper and appendices. The problem is well motivated, related work is well cited, and the contributions are contextualized appropriately. (S2) FedMosaic is an original and interesting solution to a very complex practical problem of multiple heterogeneities in pFL of MLLMs. This is a significant contribution to the field in terms of both ideas and solutions. RELA and PQ-LoRA would likely find use in other problems too. (S3) The supporting experimental evidence provided in the paper and appendices is quite exhaustive and impressive. The paper undertakes a wide diversity of studies to establish the characteristics of FedMosaic from several angles and shows competitive or improved performance w.r.t. all compared baselines. (W1) Introducing a new benchmark in an algorithms paper is counterproductive. The benchmark would be difficult to discover for any reader. To a reviewer, the benchmark's design is impossible to evaluate when only 10 lines can be allocated to it in the main body. While DRAKE looks extremely useful, there are several nuances which can only be understood by carefully reading multiple sections in the appendices. My opinion is that DRAKE should be submitted as a separate datasets & benchmarks style paper for it to be properly peer-reviewed as such. (W2) There is no benchmark called HFLB in (Chen et al., 2024). The name/citation should be corrected. (Q1) Section 4.2.1 Figure 4, and Appendices A.12, A.17: Even though supporting empirical evidence is provided, I don't understand why layer correlations should exist across model families (Llama, Qwen, etc.). Is this exclusively caused by the common training data source from which $D_P$ is sampled? PQ-LoRA would only work if such correlation exists, right? How should one think about the system when common subset from pre-training/post-training data may be unknown/may not exist/may be inaccessible? 
(Q2) Line 73, 210: Does the system require a separate model instance on the server for each client? If yes, is that scalable to a large number of clients? If no, what do the experiments suggest about the observed number of model instances on the server per client, across the datasets of interest? Fully human-written
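As background for Q1 above: cross-model layer correlation of this kind is commonly measured with linear CKA between layer activations on a shared probe set. The sketch below shows that general technique under this assumption; it is not necessarily the paper's exact procedure.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices (n_samples x dim),
    after centering each feature. Values close to 1 indicate similar layers."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# Toy example: two layers from different models evaluated on the same probe inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 64))      # e.g., features from one model's layer
Y = X @ rng.normal(size=(64, 96))   # a correlated layer with a different width
print(linear_cka(X, Y), linear_cka(X, rng.normal(size=(128, 96))))
```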
Fairness Aware Reward Optimization Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a new in-processing fairness approach for reward optimization, called FARO. The main idea is to embed group fairness constraints into the reward modeling objective, balancing accuracy-based objectives with fairness concerns. The paper also provides some theoretical analyses, which are useful for connecting the fairness domain to RL reward modeling. 1. The paper focuses on improving fairness in reward optimization, which is an essential domain to explore given the increasing reliance on LLMs in high-stakes applications. 2. The proposed algorithm is applicable to various important group fairness metrics, including demographic parity (DP) and equality of opportunity (EO). 3. The overall design is grounded in theory. My main concerns lie in the empirical verification of the proposed method, as the current experimental setup raises several questions regarding the robustness and generalizability of the findings. 1. The baseline data points in the LLM experiment are very limited. For example, there is no explicit baseline data provided for delta_dp, delta_eo, or delta_cf. It is critical to observe the performance changes in these fairness metrics, especially given that the algorithm is specifically designed to optimize them. Moreover, no other state-of-the-art fairness algorithms are compared in this experiment, making it difficult to understand the relative effectiveness of the proposed algorithm in terms of disparity mitigation. 2. The LLM experiments are performed only with a single dataset, BBQ. To ensure the generalizability of the findings, it would be important to evaluate the proposed method across a more diverse range of scenarios. 3. The paper does not provide any stability information in the LLM experiments (e.g., giving only single data points without standard deviations or confidence intervals). This makes the reported results less trustworthy and hinders an important understanding of the algorithm's performance consistency. My major questions are included in the above Weaknesses section. Minor: In Table 2, why is the order of DP, EO, and CF different across the models? Are there any typos? Fully human-written
Fairness Aware Reward Optimization Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces Fairness-Aware Reward Optimization (FARO), a framework for training LLM reward models that incorporate algorithmic fairness constraints such as demographic parity, equalized odds, and counterfactual fairness. FARO formulates reward modeling as a constrained optimization problem solved via a proxy Lagrangian descent–ascent (ProxyGDA) game. The authors provide theoretical guarantees that the resulting reward satisfies fairness constraints up to a vanishing slack. The authors further analyze the induced accuracy-fairness trade-off in KL-regularized RL fine-tuning and prove that using a fair reward model leads to fairer downstream policies, with the existence of a Pareto frontier between accuracy and fairness. Empirically, FARO reduces demographic bias and harmful generations across multiple LLMs on the BBQ dataset, while preserving or improving factuality and overall performance. The paper addresses an important and timely problem in LLM alignment, ensuring fairness during the reward modeling phase. Incorporating algorithmic fairness constraints into this stage is an important direction given the growing societal impact of biased model behavior. The work attempts to provide theoretical guarantees for fairness compliance and analyzes the accuracy–fairness trade-off induced by RL fine-tuning. It also highlights an underexplored yet socially significant issue, namely that biases in reward models can propagate into downstream system performance. This paper reads poorly in terms of presentation. For instance, there are many issues with definitions and notations, which make the paper difficult to follow. The symbol $\mathcal{J}$ first appears in Equation (1) on page 3 (line 128), but it is only formally defined on page 4 (line 145). On page 4 (line 167), the definition of $q$ is too informal. The events $\mathcal{E}$ and $\mathcal{E}'$ seem to play an important role in the definition of the $q$ function, but they are rarely mentioned or used later in the paper. In Proposition 4.3 and Theorem 4.4, the symbol $\Delta$ is not clearly defined. I only found the definition of $\Delta_{dp}$ on page 5 (line 188). Propositions and theorems need to be stated precisely; otherwise, this causes significant confusion. The experimental section also requires improvement. The first issue is that the experiments are quite limited in scope. Another issue lies in the writing and presentation. For example, in Table 2, the term Disambig Bias Score is never defined, and several other elements in the table lack clear explanation. In Figure 2, the orange marker is missing a label or legend, leaving readers to guess what it represents. For now, the paper appears to be a direct application of existing fairness concepts to reward modeling, and most of its findings are rather expected. In your opinion, what is the most valuable insight this paper actually provides? Fully human-written
Fairness Aware Reward Optimization Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces FARO (Fairness-Aware Reward Optimization), an in-processing framework that aims to ensure fairness during reward model training, a critical stage in aligning large language models or reinforcement learning systems to human preferences. The authors argue that post-processing or constraint enforcement after training fails to guarantee equitable treatment, because reward models must be ordinally correct (ranking behaviors properly), cardinally calibrated (reflecting magnitude), and fair with respect to protected attributes in the preference data. FARO formalizes fairness constraints (demographic parity, equalized odds, or conditional independence) directly into the reward model’s optimization objective. The paper provides theoretical motivation for embedding fairness penalties in the reward-learning process and evaluates FARO against standard baselines on synthetic and small-scale preference datasets, claiming improved fairness metrics with minimal alignment degradation. Timely motivation. The paper targets an important emerging issue (bias propagation in alignment and RLHF reward models). The observation that fairness must be considered during reward training rather than after deployment is accurate and relevant. Clear problem framing. The taxonomy of ordinal, cardinal, and fair reward desiderata is pedagogically helpful and could inspire future formal definitions of “fair reward alignment.” Conceptual simplicity. The proposed framework (adding fairness regularizers during reward-model fitting) is simple to implement and compatible with standard training pipelines. Readable and structured. The narrative is coherent, the introduction well motivated, and the figures (illustrating fairness trade-offs) are easy to follow. Lack of novelty. The notion of embedding fairness constraints or regularizers during reward learning closely parallels prior work such as Liu et al. (2023) "Fair RLHF", Narayanan et al. (2022) on “Fair Reward Shaping,” and even earlier “Fair Policy Gradient” papers. FARO does not present a distinct optimization approach, theorem, or architecture. The main fairness-regularized objective is standard L2 or KL regularization with group-conditioned loss terms. Suggestion: Explicitly position FARO against these existing methods and clarify what theoretical or practical advancement it brings. Theoretical underdevelopment. While the text refers to ordinal and cardinal calibration, there is no formal fairness definition connecting these notions to reward functions. No theorem or guarantee shows that FARO enforces or bounds demographic disparities. The fairness penalty is heuristic. Suggestion: Include at least one proposition or convergence result showing that fairness regularization modifies reward gradients in a provable way. Weak experimental validation. Experiments use toy synthetic preference datasets and small-scale binary comparisons. No large or realistic RLHF setup (e.g., human preference alignment on text or image data) is tested. Improvements in fairness metrics (DP gap, EO gap) are minor and within variance. Suggestion: Add a larger empirical evaluation or ablation showing stability and generalization. 
Ambiguous fairness metrics. The fairness objectives (DP, EO, CI) are mentioned but not precisely defined for pairwise preference data. It is unclear whether “equal opportunity” applies to preference comparisons or to label distributions. Without clarity, reported fairness improvements are difficult to interpret. Suggestion: Formalize fairness definitions specific to pairwise or ranking tasks. No real discussion of trade-offs. The paper lacks quantitative analysis of fairness–alignment trade-offs. Claims that FARO “preserves alignment quality” are unsubstantiated; there are no significance tests or error bars. Suggestion: Include Pareto front or fairness–utility curves to support this claim. Overly conceptual framing. Much of the paper reads as a position statement (“we should ensure fairness in reward models”) rather than a technical contribution. While conceptually important, it does not reach the level of methodological depth expected at a top-tier ML conference. Insufficient relation to prior work. The related work section omits direct references to prior fair reward modeling, constrained RL, and fair preference learning papers. Without positioning, FARO appears to rediscover well-established ideas. Can you formally define demographic parity or equalized odds in the context of pairwise preference data? How are fairness constraints enforced during gradient updates? Are they penalties, projections, or Lagrange multipliers? How sensitive is performance to the fairness-penalty weight $\lambda$? Is there a trade-off curve you can show? Have you tested FARO on real RLHF data (e.g., text alignment or summarization preferences)? Does FARO generalize to multi-attribute or intersectional fairness constraints? Could you discuss how fairness regularization interacts with the reward normalization typically used in preference modeling (e.g., Bradley-Terry scaling)? How does FARO compare to fairness-aware policy optimization (Fair PG, Fair Q-Learning) in terms of outcomes, not just rewards? Please clarify the computation cost—does fairness enforcement slow down training significantly? Fully AI-generated
Fairness Aware Reward Optimization Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes FARO, which adds fairness constraints directly into the reward model fine-tuning process during preference learning. They replace hard preference decisions with smooth Bradley–Terry probabilities so fairness can be optimized, and use a proxy-Lagrangian approach to enforce group fairness. They then show that when this fair reward model is used in RLHF fine-tuning, the resulting policy is also fairer. Experiments demonstrate reduced demographic bias with no major loss in alignment performance. Considering fairness in the setting of RLHF is well motivated and timely. The authors give a clear problem formulation and develop practical reformulations and optimization methods to solve the problem. The paper offers both theoretical and empirical insights, which together make a complete set of results. Overall, the work is also well structured. - The exposition is sometimes too sketchy on notation and key definitions, which makes the paper difficult to follow for non-experts. For example, the fairness notions in Section 2.2 are introduced largely in abstract terms, without concrete explanation of the variables and notations involved. This level of abstraction may be fine for domain experts but does not help with the accessibility for a broader audience. - While the motivation is strong and the problem is formulated rigorously, the technical contributions feel relatively modest. The reformulation of problem (4) looks more like a detour that eventually goes back to the probabilities of pair-wise comparisons. The reduction in the number of constraints via the anchoring trick is fairly straightforward, and it is unclear how much actual performance improvement this gives. Algorithm 1 is just a direct application of gradient descent, and the other theoretical results do not seem to introduce any fundamentally new insights. Some of them seem to be direct results of the application of gradient descent. Overall, the work presents a well-motivated direction with plausible empirical benefits, but the contribution feels moderate, and the presentation could be improved to better clarify the key ideas and make the method more accessible to a wider audience. - Typos (minor issue): Page 1, ln39: "an" educational-chat bot setting... Page 5, ln 224: ...hold for all $\ge 2$ hold... I don't have any questions. Fully human-written
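To illustrate what the formalization requested in these reviews might look like (an assumption for illustration, not the paper's own definitions): with a Bradley–Terry reward model $r_\theta$, the smooth preference probability and a demographic-parity-style gap across protected groups $A \in \{0, 1\}$ could be written as

$$P_\theta(y_w \succ y_l \mid x) = \sigma\!\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big), \qquad \Delta_{\mathrm{dp}} = \Big|\, \mathbb{E}\big[P_\theta(y_w \succ y_l \mid x) \mid A = 0\big] - \mathbb{E}\big[P_\theta(y_w \succ y_l \mid x) \mid A = 1\big] \,\Big|,$$

with the constrained training problem minimizing the usual preference negative log-likelihood subject to $\Delta_{\mathrm{dp}} \le \epsilon$.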
TESSAR: Geometry-Aware Active Regression via Dynamic Voronoi Tessellation Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper discusses TESSAR, an active learning method for regression that picks new points to label by using geometric structure. TESSAR builds a Voronoi diagram around the currently labeled points and seeks unlabeled samples that lie near the boundaries of the diagram, where the model is least certain. It then balances this with two complementary signals: a score that encodes the density of points, or how representative they are, and a score that encodes diversity by measuring distances from labeled data. The result is a single scoring rule that aims to be informative, diverse, and representative. Across many tabular datasets, this approach matches or beats strong baselines, with a practical update trick to keep computation reasonable. The paper presents a clear geometric idea for the solution of a well-motivated problem in active learning, which is integrated into a unified, comprehensive strategy. Attention is given to computational complexity. The technical sections are hard to follow, as several derivations are terse. The experiments are limited to modest-size tabular datasets; given the inherent computational complexity of the method, more challenging datasets seem appropriate. The authors appear unfamiliar with several active learning works that address a similar problem via coverage. Although those papers target classification in the low-budget regime, their focus - selecting spatially diverse and representative points from the underlying distribution - is closely related. Instead of employing a Voronoi diagram, the optimization is formulated in terms of set coverage; crucially, because the objective is submodular, the greedy solution enjoys efficient approximation guarantees. Published extensions in that line of work also incorporate uncertainty terms. It is therefore essential to compare against this coverage-based literature. 1) Yehuda, Ofer, et al. "Active learning through a covering lens." Advances in Neural Information Processing Systems 35 (2022): 22354-22367. 2) Bae, Wonho, Junhyug Noh, and Danica J. Sutherland. "Generalized coverage for more robust low-budget active learning." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024. Please address the relationship to the coverage-based literature discussed above, and clarify the method's scalability to larger datasets. Lightly AI-edited
TESSAR: Geometry-Aware Active Regression via Dynamic Voronoi Tessellation Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces TESSAR, an active learning framework for regression tasks that leverages Voronoi tessellation to improve sample selection. The core innovation is the Voronoi-based Least Disagree Metric (VLDM), which identifies informative samples near Voronoi faces in the input space, addressing the limitations of traditional distance-based methods that often overlook dense interior regions. VLDM is combined with a distance score for peripheral exploration and a density-based representativity term, resulting in a unified acquisition function. The authors provide theoretical motivation linking Voronoi faces to high predictive variance, along with an efficient approximation for VLDM computation. Empirical evaluations on 14 tabular regression datasets show that TESSAR matches or surpasses state-of-the-art baselines such as LCMD and BADGE in terms of RMSE, although its runtime increases with larger datasets, reducing efficiency, a limitation the authors explicitly address. The use of Voronoi tessellation as a geometric approximation for disagreement-based sampling in regression is a well-motivated idea. Unlike classification tasks with clear decision boundaries, regression lacks such structures, and this paper elegantly adapts the concept via VLDM. The theoretical analysis in Section 2.2, showing that points near Voronoi faces exhibit high variance under Lipschitz assumptions, provides solid grounding. TESSAR integrates informativeness, diversity, and representativity into a single score, with efficient dynamic updates (Algorithm 2) to avoid recomputing VLDM naively. The empirical consistency of VLDM (Figure 3) and ablation studies (Figure 4) clearly show the complementary benefits of the components in TESSAR. Evaluations on diverse datasets (e.g., Protein, Road, Stock) using performance profiles and penalty matrices (Figure 5, Table 1) highlight TESSAR's consistent superiority. It achieves the highest RA(0) of 41% in performance profiles, outperforming LCMD (29%). Runtime comparisons (Table 3) indicate it's competitive with baselines, with increases justified by better performance. The paper is well-written, with clear pseudocode, evaluations and detailed appendices on datasets and metrics, and thoughtful discussion of limitations (e.g., computational cost in large pools, homoskedasticity assumption). TESSAR's runtime scales with pool size and perturbations, making it slower on very large datasets (e.g., 547s on Road vs. ~150s for Coreset). The authors acknowledge this and suggest optimizations, but more scaling experiments (e.g., on million-scale data) could strengthen the case. Active Learning is designed for large data sets. The related works section is comprehensive but could better highlight differences from clustering-based methods like LCMD. Can we pre-evaluate how much TESSAR will outperform random sampling on a given dataset? Following LCMD’s analysis (Holzmüller et al., 2023), which showed that the ratio of initial RMSE to MAE on a small training set strongly predicts the benefit of LCMD-TP over random selection, a similar diagnostic could be developed for TESSAR. 
For example, a pre-evaluation metric could forecast TESSAR’s sample efficiency gains, helping practitioners decide when to deploy it. How sensitive is TESSAR to the choice of feature extractor (e.g., vs. raw inputs or other architectures)? The method relies on a feature mapping (e.g., neural network outputs), and varying this (e.g., with PCA) might affect Voronoi partitions and VLDM scores, warranting empirical sensitivity analysis. In the theoretical analysis, the Lipschitz assumption is reasonable, but are there empirical cases where it fails, and how does TESSAR perform there? Fully AI-generated
TESSAR: Geometry-Aware Active Regression via Dynamic Voronoi Tessellation Soundness: 4: excellent Presentation: 3: good Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes TESSAR, a geometry-aware active learning framework for regression. The key idea is to model uncertainty through VLDM, which measures the geometric instability of samples under small perturbations of labeled points. By combining VLDM with distance and density terms, the method dynamically selects informative samples in a model-agnostic manner. 1. The paper proposes a novel geometric perspective, which is interesting. The use of Voronoi tessellation and the proposed VLDM provide an innovative and well-motivated formulation of uncertainty for regression tasks. 2. The method achieves notable performance gains across various datasets and baselines. 1. The paper should include a static-Voronoi or less-frequent-update baseline to show the effect of dynamic tessellation. Also, what is the selection strategy for the Gaussian perturbation parameter? 2. It would be helpful to include an ablation replacing VLDM with simpler geometric proxies such as nearest-neighbor distance or local label variance. This would clarify whether the performance gains truly stem from the proposed VLDM formulation or can be achieved by simpler uncertainty measures. 3. Voronoi-based geometry can degrade in high-dimensional spaces due to distance concentration. Is the proposed approach still effective in high-dimensional regression tasks? See weaknesses. Fully human-written
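To make the perturbation-based quantity discussed in the TESSAR reviews above concrete, here is a minimal sketch of how a Voronoi-instability score can be estimated by perturbing the labeled set with Gaussian noise and counting nearest-neighbor (i.e., Voronoi cell) reassignments. This reflects one reading of VLDM as described in the reviews; the names and the perturbation scheme are assumptions, not the authors' implementation.

```python
import numpy as np

def voronoi_instability(X_pool, X_labeled, sigma=0.05, n_perturb=20, seed=0):
    """For each pool point, estimate how often its nearest labeled point (its
    Voronoi cell) changes when the labeled points are perturbed with Gaussian
    noise. Points close to a Voronoi face flip often and get scores near 1."""
    rng = np.random.default_rng(seed)

    def nearest(X, centers):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n_pool, n_labeled)
        return d2.argmin(axis=1)

    base = nearest(X_pool, X_labeled)
    flips = np.zeros(len(X_pool))
    for _ in range(n_perturb):
        noisy = X_labeled + sigma * rng.standard_normal(X_labeled.shape)
        flips += (nearest(X_pool, noisy) != base)
    return flips / n_perturb

# toy usage: query the most boundary-ambiguous pool point
X_pool, X_lab = np.random.rand(500, 2), np.random.rand(10, 2)
scores = voronoi_instability(X_pool, X_lab)
query_idx = int(scores.argmax())
```

In a full acquisition rule this instability score would be combined with the distance (diversity) and density (representativity) terms the reviews describe.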
Comp-LTL: Temporal Logic Planning via Zero-Shot Policy Composition Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes an approach that composes pretrained task primitives to satisfy Linear Temporal Logic (LTL) specifications, with the aim of avoiding retraining or fine-tuning whenever the specifications change. The core method constructs a transition system (TS), which is then pruned and made deterministic. The approach is compared with prior work. The approach aims to address an important problem in learning for LTL: the need to retrain or fine-tune whenever specifications change. In addition, the figures in the paper are clear and illustrative. - **W1.** The related-work section is not well organized or sufficiently extensive. For example, as far as I understand, your approach is quite similar to model-based planning without learning. The works by Qiu et al. (2023) and Jackermeier & Abate (2025) also appear closely related but were only briefly mentioned and were not explicitly compared with your approach. - **W2.** The contributions are not very clear. As I understand it, zero-shot approaches for LTL already exist; the novelty of your approach seems to stem from implicit safety integration rather than from being zero-shot per se, and this is not emphasized—for example, in the abstract. - **W3.** In my opinion, the assumptions are very strong: the environment is modeled as a deterministic Markov decision process (MDP) whose topology can be constructed, which makes the method strongly related to model-based planning approaches such as Kurtz & Lin (2023) rather than learning approaches. - **W4.** Only the construction of a TS is explained in the technical approach section. The pretrained task primitives, arguably the core of the approach, are neither explained nor discussed, which makes both the motivation for the TS and its function unclear. - **W5.** The experiments are not comprehensive; only two environments and two baselines are considered. ### References 1. Kurtz, Vince, and Hai Lin. “Temporal logic motion planning with convex optimization via graphs of convex sets.” *IEEE Transactions on Robotics* 39.5 (2023): 3791–3804. - **Q1.** See W1. Could you provide a more systematic related-work discussion and comparison? Possible categories: model-based planning for LTL; model-free and model-based learning for LTL; transfer learning and fine-tuning for LTL; zero-shot transfer for LTL. Could you explicitly state the advantages and disadvantages of your work relative to model-based planning approaches? What are the advantages and disadvantages of implicit vs. explicit safety, and why is this important? - **Q2.** See W2. Also, what does your approach contribute beyond Kloetzer & Belta (2008) and Nangue Tasse et al. (2020)? - **Q3.** What are your thoughts on W4? - **Q4.** Could you provide results for additional environments and compare your approach with methods beyond reward machines (RMs) and skill machines (SMs), particularly with other zero-shot and model-based planning approaches? Lightly AI-edited
Comp-LTL: Temporal Logic Planning via Zero-Shot Policy Composition Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces Comp-LTL, a framework for zero-shot satisfaction of Linear Temporal Logic (LTL) specifications using pretrained RL task primitives. Instead of retraining policies for new specifications, Comp-LTL composes existing primitives via Boolean task algebra and a pruned, deterministic transition system (TS) that ensures feasible and sound planning. The method integrates safety through minimum-violation (MV) semantics and constructs a product automaton with Büchi automata for execution. Experiments in grid-based environments (Office World, Video Game) show that Comp-LTL achieves safer, faster, and more generalizable performance than baselines such as Reward Machines, Skill Machines, and Boolean Composition. 1. The integration of deterministic TS pruning and Boolean policy composition for zero-shot LTL satisfaction is original and well-motivated. 2. The paper provides clear theorems (determinism, soundness, feasibility) with proof sketches, demonstrating a solid theoretical foundation. 3. Comp-LTL requires no fine-tuning or retraining, showing strong adaptability to unseen specifications in grid-world environments. 1. The Q-learning algorithms used in the paper are value-based, discrete-action algorithms — meaning they assume a finite, enumerable action space. The authors claim that "Our approach agnostic to the method in which the policies are trained, as we show Comp-LTL is successful with both tabular Q-learning and DQN policies.", but the Q-learning-based primitives in Comp-LTL cannot be directly applied to continuous-action robotic tasks. The experiments are confined to grid-based environments. Claims of generality would benefit from evaluation in continuous control or robotic settings. 2. While runtime and training times are reported, theoretical or empirical analysis of computational complexity (e.g., TS construction scaling with number of regions or propositions) is missing. 3. The claimed contribution of abstracting a geometric environment into a transition system with Boolean-composed task labels is not novel — similar abstractions were used in "Compositional RL from Logical Specifications" (NeurIPS 2020) and "Instructing Goal-Conditioned RL Agents with Temporal Logic Objectives" (NeurIPS 2023). The more distinct contribution lies in the pruning strategy ensuring deterministic, feasible TS representations and its integration with zero-shot safe composition. 1. How would Comp-LTL perform on LTL formulas containing the "Until" operator? And how about performance on specifications with $\omega$-regular expressions, which are an important extension of LTL? 2. Would Comp-LTL still maintain zero-shot composition in continuous-action environments? 3. See above. Moderately AI-edited
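For readers less familiar with the construction that several of the Comp-LTL reviews refer to, the sketch below shows the generic product of a labeled transition system with a (simplified) specification automaton, searched for an accepting plan whose edges would each be executed by a pretrained primitive. This is the textbook construction with hypothetical names; Comp-LTL's pruning, minimum-violation semantics, and proper Büchi acceptance handling are more involved.

```python
from collections import deque

# Labeled transition system: state -> list of (next_state, label) edges,
# where label is the set of atomic propositions holding along that move.
ts_edges = {
    "start": [("roomA", {"a"}), ("roomB", {"b"})],
    "roomA": [("goal", {"g"})],
    "roomB": [("goal", {"g"})],
    "goal":  [],
}

# (Simplified) automaton for "eventually g while avoiding b":
# state -> list of (next_state, guard) where guard(label_set) -> bool.
aut_edges = {
    "q0":  [("q0",  lambda L: "b" not in L and "g" not in L),
            ("acc", lambda L: "b" not in L and "g" in L)],
    "acc": [("acc", lambda L: True)],
}
accepting = {"acc"}

def plan(ts_init, aut_init):
    """BFS over the product (ts_state, automaton_state) graph; returns the
    first path whose automaton component reaches an accepting state."""
    start = (ts_init, aut_init)
    queue, parent = deque([start]), {start: None}
    while queue:
        ts_s, aut_s = queue.popleft()
        if aut_s in accepting:
            path, node = [], (ts_s, aut_s)
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for ts_next, label in ts_edges[ts_s]:
            for aut_next, guard in aut_edges[aut_s]:
                nxt = (ts_next, aut_next)
                if guard(label) and nxt not in parent:
                    parent[nxt] = (ts_s, aut_s)
                    queue.append(nxt)
    return None

print(plan("start", "q0"))  # each TS hop would be executed by a pretrained primitive
```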
Comp-LTL: Temporal Logic Planning via Zero-Shot Policy Composition Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a logic-based control framework that is _agnostic to the specific logical specification_, in the sense that changing the task objective, expressed as an LTL formula, does not require retraining any policy. This is achieved by combining a pre-trained set of primitive policies with a transition system (TS) that abstracts the environment into labeled regions. The TS is then composed with the automaton corresponding to the target LTL formula, inducing a product graph over which planning is performed. Consequently, adaptation to a new specification is achieved zero-shot, purely through symbolic planning on this graph, without additional policy learning. - The paper is well written, and the contribution is clear. - The proposed idea is novel. In particular, I find the connection between temporal logic specifications and zero-shot composition of policies interesting, given its similarity to the multi-task problem. - The authors formally prove the soundness of the proposed pruned transition system (TS). - While I find the empirical evaluation somewhat limited, it shows promising results and highlights that the proposed framework can satisfy new logic specifications without retraining. - The main concern is scalability. Constructing the transition system in realistic or continuous domains is likely intractable, especially since it requires identifying the regions where each atomic proposition holds. - The need to train one policy for each atomic proposition $\sigma \in \Sigma$ does not scale well as the number of labels grows. - As already anticipated, the empirical evaluation is limited to toy domains. It is difficult to understand whether this method can be applied in more realistic environments. - The framework assumes that the available primitives are sufficient to cover all relevant behaviors, which is a strong assumption in realistic settings (more in the questions). - How computationally demanding is it to construct the transition system in real environments, particularly regarding the recognition of regions where specific propositions hold? - Since the number of primitive policies required scales with the number of labels $\Sigma$, how does this behave in large or continuous domains? Could approximations or hierarchical abstractions make this approach feasible in practice? - Also, consider a simple navigation problem in a 2D environment where a goal can be any point (x, y); does this mean that infinitely many primitives are needed, one for each goal? Overall I find the idea interesting; my concerns are with the scalability and applicability of the method. I am open to discussion with the authors on the points raised. Fully human-written
Comp-LTL: Temporal Logic Planning via Zero-Shot Policy Composition Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This work considers the problem of composing task primitives using LTL specifications. This is an established approach within the literature, and this work aims to extend it in two ways: 1) by incorporating an LTL pruning mechanism which simplifies the transition system defining the temporal sequence of tasks; 2) by incorporating safety into the primitives themselves rather than relying on the LTL specification to guide the safety concerns, as is done in prior work. The paper compares its approach to state-of-the-art approaches that use LTL specifications for task composition and shows that its approach is superior on a safety metric and in terms of learning speed. ## Originality The work is grounded quite closely in the literature on using LTL specifications for temporal and spatial task composition. This is not inherently bad, and in fact by positioning the work clearly against these prior works it does make the differences stand out. ## Quality The motivation of the work is clear and the hypothesis is grounded in prior work. The results that are presented are interpreted fairly. ## Clarity The paper is well written and figures are clear and useful overall. The paper uses notation and symbols in a way that is typical of this line of work, which makes it easier to follow. ## Significance The work considers an important problem, safety within RL, and also supports faster learning, which is important as we expand our models into more difficult domains. Thus, there could be future work which builds on this paper and its stated claims. ## Clarity First, a minor concern: the figure captions are very uninformative, and this limits the benefit of the figures substantially. Figure 3a in particular really needs to be more descriptive, both in terms of the caption and the figure itself. The work also uses jargon which is not sufficiently defined, such as "sound". When a word is used in this manner to mean something technical, it is necessary to define it formally. More importantly, it is very difficult for me to see the connection between the two main concerns of this work: the TS pruning and the approach of embedding safety directly into the primitives. These seem like two entirely distinct directions, which makes the overall structure of the paper confusing. ## Quality and Significance My first critique here is that the paper takes for granted that safety should be embedded directly into the task primitives rather than specified in the LTL. This is not obvious to me and undermines the entire direction as a result. I would greatly appreciate clarity on why we even want this in the first place. Secondly, the consequences of putting the safety behaviour into the training of the primitives are not given due consideration. My understanding is that this will make all of the task primitives sub-optimal and inflexible to cases where the constraints may be temporary. So while domains with fixed constraints may be fine for this, the flexibility of the approach is limited greatly and by extension so is the applicability of the model.
Remark 1 similarly notes a trade-off that emerges from the paper's approach to zero-shot satisfaction, namely the possibility of introducing sub-optimality, and notes that RMs take a different approach by fine-tuning. But then why is the paper phrased as if it improves on RMs and SMs in this regard (for example on lines 471 to 473)? What is the point of being "fully zero-shot" if the proposed method is also suboptimal at this, just like the prior work, which at least considers fine-tuning? Finally, Table 4 seems unreasonable to me and is poor experimental design. Perhaps I am missing something, but to compare the prior methods on a fairly arbitrary metric (number of additional symbols collected) which they were not trained to consider at all, while the proposed method explicitly optimises the metric and then claims to be superior, is fairly meaningless. I would appreciate more explanation on why this is even a fair comparison. I have asked some questions in the review above which I would appreciate answers to. Additionally, what is the connection between the TS pruning and safety? Why report the computation time and the safety metric if this work is primarily concerned with safety? How should I interpret the speed-up relative to prior work when there is a trade-off as a result of this speed up (Remark 1)? Fully human-written
ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes ZSPAPrune, a zero-shot, plug-and-play visual token pruning method that accelerates inference in vision-language models without any fine-tuning. The approach selects a small budget of visual tokens in two stages: first choosing the tokens most relevant to the text prompt (task-relevant core set), then adding tokens that are maximally diverse to preserve global context. Evaluated on LLaVA-1.5-7B and Qwen2.5-VL-7B-Instruct under aggressive 90% pruning, ZSPAPrune matches or improves accuracy on benchmarks such as MMMU, GQA, AI2D, POPE, TextVQA, and ChartQA compared to strong baselines like DivPrune. The paper also reports modest latency and memory reductions at inference time and emphasizes that the method is model-agnostic and easy to integrate. The paper presents a clear, zero-shot pruning method that balances prompt relevance and visual diversity, which prior work did not. Experiments across strong VLMs and multiple benchmarks show it maintains or improves accuracy under extreme pruning while reducing cost. The method is practically significant because it can be dropped into existing VLMs without any retraining or architectural changes. The paper does not report direct quantitative comparisons against strong prompt-aware pruning baselines (e.g., GlimpsePrune), so it is hard to verify that the proposed approach is actually better than the closest prior work. The efficiency claims are based on a single model/setting and only at an extreme 90% pruning ratio, with limited analysis of where latency and memory savings come from or how they scale with pruning level. The method is essentially heuristic and lacks a clear formal objective or robustness analysis (e.g., failure cases when relevance vs. diversity is misbalanced). The evaluation is limited to ~7B-scale VLMs, and there is no evidence that the proposed pruning strategy remains effective or stable for larger vision-language models, where attention structure and token redundancy may differ. How stable is ZSPAPrune across different prompt styles (e.g., long multi-step reasoning questions vs. short factual queries), and does the same relevance/diversity ratio work across them without retuning? Have you investigated automatically selecting the relevance–diversity ratio at inference time (e.g., predicting it from the prompt or task type), rather than setting it manually per dataset? Fully AI-generated
ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes ZSPAPrune, a zero-shot, prompt-aware token pruning framework for Vision-Language Models (VLMs). Existing pruning methods are often prompt-agnostic, ignoring text guidance and thus failing to prioritize task-relevant visual information. ZSPAPrune addresses this by reframing pruning as a balance between task relevance and information diversity, achieved through a hierarchical process: Prompt Simplification, Prompt-Aware Selection, and Diversity Balance. The method selects core visual tokens most relevant to the prompt and augments them with diverse tokens to retain global context. Experiments on multiple benchmarks and models show that ZSPAPrune achieves state-of-the-art or comparable performance with minimal accuracy loss even when pruning up to 90% of tokens, while significantly reducing GPU memory usage and inference latency. 1. Adopting a prompt-aware token selection perspective that balances task relevance and information diversity in visual representations. 2. Introducing a hierarchical pruning mechanism composed of Prompt Simplification, Prompt-Aware Selection, and Diversity Balance to achieve controllable token reduction. 3. Achieving significant inference efficiency improvements with minimal accuracy loss under zero-shot settings across multiple Vision-Language Models and benchmarks. 1. The paper lacks comparison with other methods that explicitly address the trade-off between task relevance and information diversity. Without such a comparison, it remains unclear whether the proposed balance strategy is superior or merely heuristic. 2. As a plug-and-play method, ZSPAPrune should be validated on more models with different parameter scales to confirm its general applicability. The current experiments are limited to a narrow range of architectures, reducing the evidence of scalability. 3. The comparison with task-relevance-based approaches appears potentially unfair. Some baselines are reimplemented without clear alignment in training setup or hyperparameter tuning, which may bias the reported results. 4. The proposed method is overly simple and lacks crucial theoretical analysis. No formal justification or complexity discussion is provided to explain why the hierarchical prompt-aware pruning mechanism should work effectively. 5. The framework figure (i.e., Figure 2) is overly general and resembles a process diagram rather than an architectural framework. It fails to visually highlight the innovation and importance of the proposed components, and a more informative figure is recommended. Moderately AI-edited
ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper studies the token pruning issue in large vision-language models. Specifically, it treats token pruning in vLLMs as a tunable balance between task relevance and information diversity. In implementation, the prompt-aware score is computed from the relevance between the prompt and the visual token embeddings, while the diversity balance is achieved by repeatedly selecting the token most dissimilar to all previously selected tokens. Experiments are done on several benchmarks. The strengths are as follows: 1. The paper is easy to read and the method is easy to follow. 2. The evaluated datasets and vLLMs are diverse. The weaknesses are as follows: 1. There are many existing works on task relevance for token pruning in vLLMs. This work additionally considers information diversity, which seems like incremental novelty. Meanwhile, in Figure 1, it is not easy to understand why information diversity is useful for the token pruning task. 2. Missing related works. There are many other recent token pruning methods [1,2,3,4] that are not analyzed or discussed in this work. These works should also be added for comparison. 3. In the method design, I have some concerns: (1) In Eq. 4, average pooling is applied on the prompt token embeddings. This is not quite reasonable, since the prompt text may involve many tokens that are not task-related. (2) The diversity balance is performed by selecting some tokens dissimilar to previously selected ones. This could probably select some useless tokens and background tokens; I am not sure this motivation is correct. [1] Boosting multimodal large language models with visual tokens withdrawal for rapid inference. [2] Dynamic-llava: Efficient multimodal large language models via dynamic vision-language context sparsification. [3] Visionzip: Longer is better but not necessary in vision language models. [4] Folder: Accelerating multi-modal large language models with enhanced performance. See above Fully human-written
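For concreteness about the two-stage mechanism questioned in the ZSPAPrune reviews above (average-pooled prompt relevance followed by greedy diversity selection), a minimal sketch is given below. Function and variable names are hypothetical, and the paper's exact scoring and budget-splitting details may differ; this is not the authors' code.

```python
import torch

def prune_tokens(vis_tok, txt_tok, keep=64, core_ratio=0.75):
    """vis_tok: (N, d) visual token embeddings; txt_tok: (M, d) prompt token embeddings.
    Returns indices of the kept visual tokens."""
    prompt = txt_tok.mean(dim=0)                                   # average-pooled prompt vector
    rel = torch.cosine_similarity(vis_tok, prompt[None, :], dim=-1)

    n_core = int(keep * core_ratio)
    core = rel.topk(n_core).indices.tolist()                       # stage 1: prompt-relevant core

    remaining = [i for i in range(vis_tok.size(0)) if i not in core]
    selected = list(core)
    for _ in range(keep - n_core):                                 # stage 2: diversity balance
        sel = vis_tok[selected]
        # for each candidate, distance to its closest already-selected token
        d = torch.cdist(vis_tok[remaining], sel).min(dim=1).values
        pick = remaining.pop(int(d.argmax()))                      # most dissimilar candidate
        selected.append(pick)
    return torch.tensor(selected)

# toy usage: keep 58 of 576 visual tokens given a 12-token prompt
keep_idx = prune_tokens(torch.randn(576, 4096), torch.randn(12, 4096), keep=58)
```

The `core_ratio` split between relevance and diversity is exactly the kind of manually set trade-off the reviews ask about.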
Dig2DIG: Dig into Diffusion Information Gains for Image Fusion Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. Due to the spatiotemporal imbalance of information gain contributed by different image modalities during the denoising diffusion process, this paper proposes a diffusion-model-based dynamic fusion method that computes a distinct set of fusion coefficients at each diffusion time step, thereby optimizing modality-specific information gain across the different stages of the diffusion process. - Overall, the discussion on dynamic fusion during the diffusion denoising phase is meaningful, as it aims to enhance the information gain achieved through fusion. - The key idea of the proposed method for tightening the upper bound of fusion generalization is to ensure that the guidance weight assigned to each modality is positively correlated with the residual information of that modality that has not yet been integrated into the current image, and theoretical support for this is provided. - Modality residual information is proxied by the information gain obtained during the diffusion process. - Using DIG as a proxy for dynamic weighting is not entirely reasonable, since DIG only measures the information loss during the forward noising process. Intuitively, it is more correlated with the intrinsic information richness of the original image. Consequently, modality images containing richer information regions would consistently receive higher fusion weights at any diffusion timestep. This contradicts the methodological insight that different diffusion stages exhibit distinct generation dynamics, where low-frequency structures and high-frequency textures evolve at different rates. - The illustration in Figure 1 may cause misunderstanding, as the information gain shown in the figure is different from the Diffusion Information Gain (DIG). It would be helpful for the authors to additionally indicate the computation method of the information gain depicted in the figure, so as to facilitate clearer understanding and distinction between the two concepts. - An important question to consider is how the authors construct the loss when computing DIG, and whether using only the ℓ2 loss is reasonable. The dynamic weights derived solely from the ℓ2 loss merely reflect pixel-level differences, and it remains questionable whether directly employing such differences as dynamic weights to modulate other types of losses, such as gradient losses, is theoretically and practically sound. - The authors have considered the dynamic weights in isolation, whereas in practice, the fusion weights are largely dependent on the formulation of the fusion loss constraints. However, the authors did not design targeted conditional constraints; instead, they directly adopted the conditional constraint scheme from DDFM - The selection of comparison methods in the experiments appears somewhat inconsistent. For example, Text-IF is used as a comparison method for multi-focus and multi-exposure image fusion, even though it is not a unified fusion framework. In contrast, CCF, which is indeed a unified approach, is only compared in the task of infrared–visible image fusion. - Is it reasonable to use DIG as a substitute for the dynamic weights? - Is it reasonable to compute DIG using only the ℓ2 loss? 
For instance, can it effectively measure the information loss in texture gradients? - It is not clearly explained how the dynamic weights are incorporated into the EM conditions; this part requires further clarification. In addition, it remains unclear whether such dynamic weights are compatible with the conditional constraint process—for example, with the gradient penalty term in the EM algorithm. - Is the selection of comparison algorithms fair? Lightly AI-edited
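To ground the discussion in the Dig2DIG review above of how DIG-based dynamic weights would enter the sampler, the sketch below shows the generic pattern of softmax-normalizing per-modality gain maps and using them to weight per-modality guidance terms at a denoising step. The gain maps themselves are treated as given, since how they should be computed (e.g., whether an ℓ2 loss suffices) is exactly what the review questions; all names are hypothetical.

```python
import torch

def fused_guidance(grads, gain_maps, tau=1.0):
    """grads: (K, C, H, W) per-modality guidance terms (e.g., gradients of a
    data-fidelity term w.r.t. the current sample x_t);
    gain_maps: (K, H, W) per-modality information-gain estimates (the paper's DIG,
    here treated as given, e.g., an l2-based gain).
    Returns the dynamically weighted guidance of shape (C, H, W)."""
    w = torch.softmax(gain_maps / tau, dim=0)          # per-pixel weights over the K modalities
    return (w.unsqueeze(1) * grads).sum(dim=0)         # broadcast over channels, sum over modalities

# toy shapes: K=2 modalities, single-channel 64x64 images
g = torch.randn(2, 1, 64, 64)
dig = torch.rand(2, 64, 64)
guidance = fused_guidance(g, dig)
```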
Dig2DIG: Dig into Diffusion Information Gains for Image Fusion Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes Dig2DIG, a dynamic image fusion framework built upon diffusion models, introducing Diffusion Information Gains (DIG) to dynamically guide the fusion process at each denoising step. The core idea is to weight modality contributions by quantifying how much information each modality can provide at each step, theoretically linking this approach to a provable reduction of the generalization error bound in multimodal fusion. Theoretical analysis, practical algorithmic realization, and extensive experimental validation are provided. 1. This paper thoroughly analyzes the theoretical motivation behind dynamic fusion in denoising diffusion models. 2. This paper systematically compares Dig2DIG with strong baselines over multiple challenging datasets. 1. This paper misses some discussion and comparison with several recent works, e.g., [R1-R2]. 2. While the paper's theoretical and algorithmic contributions are clear, the system-level architecture largely repurposes standard DDPM sampling with softmax weighting for fusion guidance. 3. While the mathematics generalizes to $K>2$, the paper does not present empirical or even synthetic evidence for how the framework scales with larger numbers of modalities or more heterogeneous ones, nor does it discuss failure points in such scenarios. 4. Ablations that could further clarify the incremental importance of each design are only briefly touched upon, e.g., what if weighting is region-wise but not dynamic in time? What if different normalization or activation functions are used on DIG? References: [R1] Deng Y, Xu T, Cheng C, et al. Mmdrfuse: Distilled mini-model with dynamic refresh for multi-modality image fusion[C]//Proceedings of the 32nd ACM International Conference on Multimedia. 2024: 7326-7335. [R2] Yang B, Jiang Z, Pan D, et al. LFDT-Fusion: A latent feature-guided diffusion Transformer model for general image fusion[J]. Information Fusion, 2025, 113: 102639. See the weaknesses. Lightly AI-edited
Dig2DIG: Dig into Diffusion Information Gains for Image Fusion Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces a novel dynamic image fusion framework for diffusion models, addressing the limitations of static fusion strategies. The authors first identify a "spatio-temporal imbalance" in the denoising process, where information from source modalities emerges at unequal rates across different steps and regions. To leverage this, they propose a metric called Diffusion Information Gains (DIG) to quantify the per-modality information contribution at each denoising step. These DIG values are then used as dynamic weights to guide the fusion, ensuring that only the most informative regions from each source are integrated at the appropriate time. The authors provide a theoretical proof that this dynamic weighting scheme provably reduces the upper bound of the generalization error, and experimental results confirm that the method achieves superior fusion quality and inference efficiency compared to existing diffusion-based approaches. The paper is based on an interesting observation of spatio-temporal imbalance in the denoising process. The proposed dynamic weighting scheme is an intuitive response to this identified issue. 1. The qualitative comparisons appear incomplete and inconsistent with the quantitative evaluation. While numerous methods are benchmarked in the quantitative tables, the qualitative results in the figures only feature a select subset. For instance, in the visible-infrared fusion task (Figure 3), several strong baselines such as SwinFusion, DIVFusion, MoE-Fusion, CDDFuse, and DDFM are notably absent from the visual comparison. This omission is also observed for the MFFW and MEFB datasets. The authors should provide a rationale for this selective comparison or include qualitative results for all methods to ensure a fair and comprehensive evaluation. Additionally, DCEvo [1], a SOTA method for IV fusion, is also recommended for comparative analysis in the experiments. [1] DCEvo: Discriminative Cross-Dimensional Evolutionary Learning for Infrared and Visible Image Fusion. CVPR 2025. 2. The definition and application of the Generalization Error (GError) in Equation 3 are confusing in the context of image fusion. Generalization error is a concept fundamentally rooted in supervised learning, where a ground truth is available for evaluation. However, image fusion is an inherently unsupervised task that lacks a single, well-defined ground truth. Therefore, the direct application of this GError formulation seems unjustified. The authors must provide more substantial evidence or a more rigorous justification to demonstrate why this generalization bound is applicable and meaningful for an unsupervised task like image fusion. 3. The paper's central claim, that the weights $w_k$ derived from DIG guarantee a smaller generalization error bound, is not sufficiently proven. The theoretical connection between the proposed Diffusion Information Gain (DIG) and the reduction of the generalization error is tenuous. The manuscript lacks a formal proof demonstrating how optimizing for DIG directly leads to the tightening of this bound.
Without this crucial link, the motivation for introducing DIG appears ad-hoc, and its relationship to the theoretical framework is not well-founded. 4. The visualization of the information gains in Figure 2 is perplexing and counterintuitive. The 'vi information gain' and 'ir information gain' maps appear to be almost perfectly complementary. The expectation is that the information gain for a modality should highlight unique information present in that modality. However, the 'vi information gain' map seems to highlight information behind the smoke, which is not visible in the original visible image and is instead a key feature of the infrared image. This is a significant contradiction. The authors must clarify the calculation and meaning of these gain maps and provide additional visual examples to resolve this apparent inconsistency. Please see Weaknesses Heavily AI-edited
Dig2DIG: Dig into Diffusion Information Gains for Image Fusion Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes a novel dynamic image fusion framework called **Dig2DIG**, which leverages **Diffusion Information Gain (DIG)**. Traditional diffusion-based fusion methods employ fixed fusion weights, ignoring the varying rates of information contribution among different image regions and modalities during denoising. Dig2DIG addresses this issue by quantifying each modality’s information contribution at every diffusion step and dynamically assigning fusion weights based on DIG. Theoretically, the authors claim that aligning fusion weights with residual modality information can reduce the generalization error of the fusion model. Experiments on multiple datasets (e.g., LLVIP, M3FD, MSRS, MFFW, MEFB) demonstrate that Dig2DIG achieves state-of-the-art performance in both fusion quality and computational efficiency. 1. **Novel Theoretical Foundation** – The paper provides a formal analysis suggesting that dynamic weighting via DIG can theoretically improve generalization compared to static fusion. The proof is sufficiently detailed. 2. **No-Training Framework** – The method reuses pre-trained diffusion models without additional training or fine-tuning, enhancing generality and reproducibility across tasks. 3. **Efficiency Improvement** – The inference time is significantly reduced (up to 70%) by skipping low-information diffusion steps, while maintaining comparable accuracy, demonstrating potential for practical efficiency gains. 1. **Limited Applicability Beyond Diffusion Models** – The theoretical proof relies only on L-smoothness assumptions and does not strictly require a diffusion process. However, DIG itself is closely tied to diffusion-based sampling. Its generalization to non-diffusion or Transformer-based fusion frameworks remains unclear. 2. **Practical Interpretability** – Although theoretically well-founded, the practical estimation of “information gain” is not intuitively interpretable for non-expert users, and the DIG computation may be sensitive to design choices such as the metric function. 3. **Insufficient Experimental Validation** – Several theoretical or qualitative claims—such as the correlation between residual information and reduced generalization error, or the spatio-temporal imbalance illustrated in Figure 1—lack quantitative or ablation experiments for confirmation. Many statements remain conceptual rather than empirically supported. 4. **Heuristic Nature of Step-Skipping** – The claim that “steps with low information gain can be skipped” is based on empirical observation rather than theoretical proof. While steps with small DIG values contribute minimally and can be skipped with negligible quality loss, there is no formal guarantee that diffusion consistency is preserved. Thus, the approach should be viewed as a heuristic efficiency optimization, not a theoretically grounded one. 5. **Unclear Definition of the Residual Information Term $I_{k,t}$** – This term plays a central role in the theoretical derivation (Eq. 4), yet its formulation is only symbolic. The function $\Delta I(c_k, x_t, x^*(c))$ is introduced without specifying its exact mathematical form, dimensionality, or relationship to observable quantities. 6. 
**Unexplained Link Between DIG and $I_{k,t}$** – The paper states that since $I_{k,t}$ is unobservable, it is replaced by the empirical DIG metric in Section 3.3. However, the connection between this substitution and later equations (e.g., Eq. 37 in the appendix) is not clearly explained. 7. **Inconsistent Probabilistic Assumptions** – The derivation first assumes i.i.d. (independent) modalities for linear weighting of $x_t$, but later uses the DDFM framework, which is based on a joint hierarchical Bayesian model under EM optimization (keeps the joint distribution). These two assumptions contradict each other, and the latter is theoretically more appropriate. This inconsistency undermines the claimed theoretical rigor. 8. **Problematic Weighted Approximation (Eq. 16)** – The authors initially assume conditional independence (Eq. 15), then relax this by introducing arbitrary weights $w_k$ without redefining the joint distribution $p(c_1, \dots, c_K | x_t)$. This heuristic substitution lacks probabilistic justification. When modalities are correlated—which is usually true since all source images come from the same scene—the gradient of the joint log-likelihood should contain cross terms that cannot be represented by a simple weighted sum of marginal gradients. As a result, Eq. (16) violates Bayes' theorem consistency and weakens the "provable" nature of subsequent derivations. 9. **Redundant Description in Section C.1** – The paper states, "We follow the EM algorithm of DDFM," but the process described is nearly identical to DDFM's. The section should either explicitly detail differences in the E- and M-step formulations compared to DDFM's or be removed for conciseness. 10. **Inconsistent Equation Formatting** – Equation references are inconsistently labeled ("eq.", "Eq.", "equation", "eq. equation"), which detracts from clarity and professionalism. 11. **Problematic Geometric Analysis** – The analysis is not mathematically well-grounded, starting from $v = a\,u_{IR} + b\,u_{RGB} + c\,u_S$. The paper does not specify the underlying metric space, ignores the stochastic dependency of diffusion variables, and conflates statistical covariance with geometric cosine similarity. As a result, the claimed geometric intuition may be conceptually appealing but lacks rigorous justification or empirical verification. 1. Can the proposed DIG mechanism be extended to **non-diffusion generative models**, such as Transformer- or GAN-based image fusion systems? 2. Regarding **step-skipping (weakness 4)**: does "skipping low-information steps" correspond to using $S = 10$ or the DIG-25 variant for inference? What is the precise relationship between these two settings? 3. Could the authors provide additional **ablation or sensitivity analyses** to confirm how DIG values correlate with residual information and generalization error reduction? 4. How does the choice of metric $l(\cdot, \cdot)$ in Eq. (5) affect the stability of DIG computation? 5. How could the DIG framework be modified to capture **nonlinear inter-modality interactions** beyond simple time-dependent linear weighting? 6. What space or norm is this "angle" defined in, and how is the stochastic nature of the diffusion variables handled? Heavily AI-edited
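For readers following weakness 8 of the Dig2DIG review above, the structure of the approximation at issue, as described in the review (the paper's exact notation may differ), is:

$$\nabla_{x_t} \log p(c_1,\dots,c_K \mid x_t) = \sum_{k=1}^{K} \nabla_{x_t} \log p(c_k \mid x_t) \quad \text{(Eq. 15, under conditional independence)},$$

$$\nabla_{x_t} \log p(c_1,\dots,c_K \mid x_t) \approx \sum_{k=1}^{K} w_{k,t}\, \nabla_{x_t} \log p(c_k \mid x_t) \quad \text{(Eq. 16, heuristic weighting)}.$$

When the modalities are correlated, the left-hand side contains cross terms that no choice of $w_{k,t}$ in this additive form can recover, which is the basis of the reviewer's objection.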
RainPro-8: An Efficient Deep Learning Model to Estimate Rainfall Probabilities Over 8 Hours Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents RainPro-8, an efficient and compact deep learning model for 8-hour high-resolution probabilistic precipitation forecasting in Europe. RainPro-8 integrates heterogeneous multi-source data (radar, satellite, and numerical weather prediction, NWP), and adopts an optimized U-Net + MaxViT architecture. The model generates probabilistic maps for all time steps and precipitation classes in a single forward pass. The authors introduce an "ordinal consistent loss" to explicitly encode the ordinal nature of precipitation bins, improving interpretability and consistency of the probabilities. RainPro-8 uses less than 20% of MetNet-3's parameters but achieves superior performance across lead times and precipitation intensities, outperforming NWP baselines, extrapolation methods, and the latest deep learning models (e.g., Earthformer, SimVP, MetNet-3*). Extensive experiments, including ablations, attribution, and efficiency analysis, are provided, as well as competitive results on the short-term SEVIR benchmark. Code will be public, and datasets and experimental details are thorough. RainPro-8 efficiently combines heterogeneous multi-source data, achieves multi-step probabilistic forecasting in one pass with much fewer parameters than strong baselines, and its ordinal consistent loss is specifically designed for the nature of precipitation bins. The model consistently outperforms state-of-the-art competitors on accuracy, efficiency, and uncertainty quantification. The “ordinal consistent loss” (Section 3.2) is a core novelty, but its theoretical justification is limited. The paper gives only a high-level description and refers to another field (semantic segmentation) for motivation. There is no theoretical or empirical exploration of why ordinal modeling is essential for probabilistic precipitation. Ablation is very basic, and critical aspects such as loss stability, parameter sensitivity, and generalization are not deeply examined. While RainPro-8 claims to handle varying precipitation intensities, most experiments focus on average metrics. The model’s ability to capture extreme/rare strong precipitation events (e.g., ≥25 mm/h) is not thoroughly assessed. Only the class coverage is reported, but there are no case studies or targeted metrics for extreme events, missing a rigorous quantification of robustness or uncertainty calibration for outlier scenarios. does the ordinal consistent loss have any theoretical advantage (e.g., calibration, capturing uncertainty, or predicting extremes)? Could you provide more empirical or theoretical analysis on loss stability and generalization? In known European extreme rainfall cases or strong out-of-distribution samples, how does RainPro-8's predicted probability align with actual precipitation? Are confidence intervals and rare event probabilities calibrated and unbiased? Fully AI-generated
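Since the ordinal consistent loss is central to the questions in the RainPro-8 review above, here is a minimal sketch of one standard way such a loss is realized: predict cumulative exceedance probabilities per intensity threshold, train them with binary cross-entropy, and recover per-class probabilities by differencing. This is the generic ordinal-regression construction with hypothetical names; the paper's exact formulation (Section 3.2) may differ.

```python
import torch
import torch.nn.functional as F

def ordinal_exceedance_loss(logits, target_class):
    """logits: (B, J) raw scores, one per intensity threshold tau_1 < ... < tau_J;
    sigmoid(logits[:, j]) is read as P(rain >= tau_j).
    target_class: (B,) integer class in [0, J], i.e., number of thresholds exceeded."""
    B, J = logits.shape
    thresholds = torch.arange(J, device=logits.device).unsqueeze(0)   # (1, J)
    y = (target_class.unsqueeze(1) > thresholds).float()              # exceedance targets, (B, J)
    return F.binary_cross_entropy_with_logits(logits, y)

def class_probs(logits):
    """Turn exceedance probabilities into per-class probabilities by differencing."""
    p_exceed = torch.sigmoid(logits)                                   # (B, J)
    ones = torch.ones_like(p_exceed[:, :1])
    zeros = torch.zeros_like(ones)
    upper = torch.cat([ones, p_exceed], dim=1)                         # prepend P(>= lowest) = 1
    lower = torch.cat([p_exceed, zeros], dim=1)
    return (upper - lower).clamp(min=0)                                # (B, J+1), sums to ~1

# toy usage
logits = torch.randn(4, 6)
loss = ordinal_exceedance_loss(logits, torch.tensor([0, 2, 5, 6]))
probs = class_probs(logits)
```

A construction of this kind makes the ordinal structure explicit and keeps the per-class probabilities consistent by design, which is the property the reviewer asks the authors to justify theoretically.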
RainPro-8: An Efficient Deep Learning Model to Estimate Rainfall Probabilities Over 8 Hours Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper describes a deep-learning model for high-resolution probabilistic precipitation forecasting over an 8-hour horizon. The main advantage of the proposed model is its ability to predict precipitation over a longer time frame than standard nowcasting, which typically covers up to 2 hours. To achieve this goal, the authors fuse radar, satellite, and physics-based numerical weather prediction (NWP) data. The paper covers an important topic, precipitation forecasting. Originality: the paper leverages multi-source prediction and claims prediction of up to 8 hours (however, see the weaknesses below). Quality: the paper is well-written in general (however, there are caveats as some parts of the paper are difficult to follow). The equations, as I checked, are correct. Clarity and quality: see questions below. Significance: I think the authors need to clarify on this point. The contributions cite the following: - *Efficient architecture and training strategy*: however, the architecture seems to be a well-parameterised UNet-based model (Figure 1). The authors state: "Key differences include single-pass predictions without lead time conditioning (Section 3.3), early downsampling in the encoder, halving internal channels, and removing topographical embeddings, all contributing to a reduced parameter count of 36.7M from the original 227M." Would this be the contribution? I think it could be better to have some sort of takeaway message justifying these architectural solutions, and why they could help develop better new precipitation forecasting architectures. - "significant gains over deep-learning nowcasting models": see Q1. - "Demonstration of RainPro's versatility for radar-only 2-hour predictions on the SEVIR benchmark, achieving state-of-the-art performance compared to both deterministic and generative nowcasting models": I am not entirely sure I get it. If the claim is 8-hour prediction, is achieving state-of-the-art results on 2-hour radar predictions part of the contribution? 1. "Extensive empirical evaluation demonstrating that RainPro-8 outperforms existing operational methods by 65%": I am not sure where this is described; it only appears in the introduction. 2. I am not sure I can follow Figure 2; it is very small. From what I can follow, does the proposed method essentially follow MetNet-3* very closely; is that correct? In that case, the main claim could be that you achieve slightly higher performance, but reduce the computational costs 48 times. 3. For Table 3, the ablation study results show only small differences in performance. Could the authors give confidence intervals, if possible? Fully human-written
RainPro-8: An Efficient Deep Learning Model to Estimate Rainfall Probabilities Over 8 Hours Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose RainPro-8, a deep learning model for high-resolution, 8-hour probabilistic precipitation forecasting over Europe. The model is based on the MetNet-3 architecture but is modified to have significantly fewer parameters. It introduces a *single-pass prediction* strategy to improve inference efficiency and an Ordinal Consistent Loss to handle probabilistic bins. The authors claim the model surpasses existing deep learning and numerical weather prediction methods on several metrics. 1. The paper addresses the challenge of 8-hour, high-resolution probabilistic precipitation forecasting. This is a critical and difficult task that bridges the gap between traditional nowcasting and medium-range forecasting. 2. The Ordinal Consistent Loss, which models the conditional probability of exceeding intensity thresholds and is designed to explicitly account for the ordinal structure of precipitation classes, is a more principled approach than using a standard cross-entropy loss that treats classes as independent. 1. **Limited Methodological Novelty (*Main Weakness*)**: The primary weakness of this paper lies in its limited methodological novelty relative to ICLR standards, which emphasize fundamental advances in learning representations. - Aside from the new loss function, the work's primary contribution is an application of existing techniques to create an efficient system. - The model architecture is just based on MetNet-3. The main architectural changes include *early downsampling* and *halving internal channels*. These are standard engineering practices for model compression and efficiency, not **novel architectural designs or new methods for learning representations**. 2. **Lack of Comparison with GAN-based Models**: The paper's experimental validation lacks a crucial component in its discussion of *GAN-based generative models for precipitation*. In the related work section (Section 2), the authors correctly identify the importance of deep generative models, citing high-impact work such as DGMR (Ravuri et al., 2021) and NowcastNet (Zhang et al., 2023). These models, both published in *Nature* and recognized for their strong performance, are a key pillar of the state of the art in this field. Despite acknowledging this work, the main experimental comparison in Table 1 completely **omits comparisons against these (or any other) GAN-based generative methods**. This is a significant gap in the evaluation. Without benchmarking against this entire class of SOTA models, it is impossible for the reader to assess the paper's true performance, and the claim to "surpass... deep-learning nowcasting models" remains unsubstantiated. 3. **Contradiction Between Quantitative Metrics and Qualitative Results**: A significant weakness undermines the paper's experimental conclusions: the quantitative metrics and the qualitative case studies appear to contradict each other directly. - Quantitative Metrics: The paper's aggregated metrics, specifically the Frequency Bias Index (FBI) in Table 7 and Figure 8, suggest a key advantage for RainPro-8 in forecasting heavy precipitation.
At the 10.0 mm/h threshold, RainPro-8 reports an FBI of 1.636, which is notably lower (i.e., less over-forecasting) than the 1.821 reported for MetNet-3*. **This metric suggests the author's model is more balanced**. - Case Study Results: The paper's own qualitative visualizations repeatedly show the opposite. In the case study for the +8h forecast in Figure 19, the Ground Truth shows only a small area of precipitation at the >10.0 mm/h (yellow) level. However, **RainPro-8 predicts a large, distinct area of heavy rain, while the MetNet-3 forecast for the same event shows a substantially smaller overforecasted area**. This same pattern, where RainPro-8 visually over-forecasts heavy rain more severely than MetNet-3, is apparent in the other examples provided (Figure 17). This fundamental discrepancy, in which all qualitative examples in the paper contradict the aggregate metrics for heavy rain, is neither acknowledged nor explained. This casts serious doubt on the reliability of the reported results and the validity of the paper's evaluation. See weaknesses. Heavily AI-edited
RainPro-8: An Efficient Deep Learning Model to Estimate Rainfall Probabilities Over 8 Hours Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces RainPro-8, a deep-learning model for probabilistic precipitation forecasting. It proposes an Ordinal-Consistent Loss that maintains the natural ordering of rainfall intensity classes and a Single-Pass Prediction method that predicts all future timesteps at once for greater efficiency and temporal consistency. Built on a U-Net with MaxViT blocks, the architecture efficiently processes multi-resolution atmospheric inputs with significantly fewer parameters than prior MetNet models. - The paper identifies the limitations of radar-only forecasting methods and proposes a multi-source approach to improve prediction accuracy. - It provides a detailed analysis of recent Climate AI research, highlighting its characteristics and limitations, and conducts experiments following established protocols and evaluation metrics. - To address the ordinal nature of precipitation classes and the difficulty of long lead-time forecasts, it introduces a specialized loss function and a single-pass prediction method. - The appendix provides comprehensive details on the variables, data assimilation, and preprocessing steps for the diverse data sources used. - Like the baseline model MetNet, RainPro-8 relies on a variety of data sources, which limits its applicability to regions, such as developed countries, where high-resolution data from multiple sensors, satellites, and NWP outputs are readily available. - The model lacks significant technical novelty. Moreover, its performance improvement over the existing baseline, MetNet-3, is marginal. - Many models that perform 6-hour nowcasting are discussed in the Related Works section. What is the logical reasoning and supporting justification for forecasting 8 hours instead of 6, as done in those prior works? - Besides SEVIR, is it possible to evaluate performance on other relevant benchmark datasets such as the Shanghai dataset or CIKM? - In Line 109, the paper states that “its training requires significant time and resources, involving hundreds of Tensor Processing Units (TPUs) for multiple days.” However, if this method uses an “NVIDIA H100 SXM5 GPU,” isn’t that also an extremely powerful computational setup? Please summarize the computational power and training duration for the compared baseline methods. Lightly AI-edited
OR-PRM: A Process Reward Model for Algorithmic Problem in Operations Research Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces OR-PRM, a domain-specialized Process Reward Model for Operations Research (OR). The authors (1) diagnose high noise in existing OR datasets; (2) curate a cleaner seed set via a three-stage pipeline; (3) generate step-wise trajectories with MCTS; (4) have GPT-4o perform structured, step-level judgments and corrections; (5) train a generative PRM that outputs critiques rather than scalar scores; and (6) show gains under Best-of-N selection and a Modeling→Critique→Code pipeline, reporting up to +12.5 pp average improvement across benchmarks and model sizes (Qwen2.5 series, LLMOPT). 1. PRMs are a natural fit because OR solutions require step-wise logical validity (not only final objective values). The paper targets this gap explicitly. 2. Three-stage filtering for a seed set, MCTS exploration, and GPT-4o step audits with issue/judgement/correction fields—this is more structured than typical PRM pipelines in math reasoning. 3. Returning natural-language critiques and a corrected first error goes beyond scalar PRMs and mirrors trends in “corrective” PRMs in broader reasoning literature. 4. Best-of-N and model-agnostic critic usage show uniform gains; the Complex-LP results are especially notable (e.g., +24.2 pp on Qwen2.5-32B). 1. The paper should deeply contrast with OmegaPRM (automated process supervision via MCTS) which established scalable step-labeling at large scale in math reasoning (1.5M annotations) (https://arxiv.org/abs/2406.06592), and with recent generative/critic-style PRMs designed to explain and correct steps such as GM-PRM and VisualPRM/Athena-PRM (multimodal, but methodologically very close in “PRM generates critiques + refines BoN”) (https://arxiv.org/abs/2508.04088). The proposed “generative PRM” feels incremental w.r.t. these trends. 2. The paper claims >30% severe flaws in mainstream OR data and a rigorous three-stage filter, but I don’t see inter-rater reliability, error taxonomy distributions, or random spot-check protocols beyond illustrative examples. Given PRM sensitivity to label noise (documented in PRM surveys/lessons) (https://arxiv.org/abs/2501.07301) , stronger auditing is necessary. 3. Step labeling and final verification heavily rely on strong LLMs (GPT-4o/Qwen verifiers) inside the pipeline that also guide critique at inference. Without cross-model, cross-vendor checks and held-out validators, there is potential circularity (the critic agrees with the validator it was trained/selected with). 4. Since BoN/selection tends to give large lifts, the paper should compare against strong non-PRM baselines (self-consistency; majority vote with route-length filters; retrieval-augmented PRM or OOD-robust PRMs) that recent works show to be competitive, e.g., Retrieval-Augmented PRM (R-PRM / RAPRM) and R-PRM variants focusing on OOD and data bootstrapping. (https://arxiv.org/abs/2502.14361) Current ablations (e.g., “Major Voting”) look weak and not representative of the SOTA toolkit. 5. OR is a heterogeneous space (LP/MILP/NLP/CP-SAT with industry quirks). 
I don’t see OOD splits (new templates, new constraint families, solver switches) or robustness to noisy/problematic instances (e.g., infeasible but realistic specs). OOD weaknesses are a known pain point for PRMs. (https://arxiv.org/abs/2502.14361) 6. The Modeling→Critique→Code pipeline may conflate several effects: (i) data curation, (ii) structured prompts, (iii) critic guidance. A factorized ablation (swap in a scalar PRM; blind the critic to code; remove “Corrected Step”) would clarify whether “generative PRM” itself is the key. 7. Important but under-specified: exact prompts, MCTS hyper-parameters per task, GPT-4o temperature and refusal handling, and the policy/critic decoupling at test time (who conditions on whom). 1. How often does GPT-4o disagree with the MCTS label? Show confusion matrices and human audits on a random 500-step sample. How many “corrections” by GPT-4o were later proven wrong? 2. If you allow the critic to emit corrections (you already do), can you implement Refined-BoN like GM-PRM and report deltas? (https://arxiv.org/abs/2508.04088) 3. Evaluate with new solver backends (e.g., Pyomo→PuLP / OR-Tools), new template families, and noisy text to test OOD per RAPRM concerns. (https://arxiv.org/abs/2502.14361) 4. Ablate the “generative” aspect: Replace the critic with (i) scalar PRM, (ii) critique-only (no correction), (iii) correction-only (no explanation). Which component drives Complex-LP gains? 5. Add self-consistency, vote-with-length-normalization, and retrieval-augmented PRM. Current “Major Voting” is too weak to claim SOTA. (https://arxiv.org/abs/2502.14361) 6. End-to-end token/cost for data curation + training + inference (critic calls per step)? Compare with OmegaPRM’s cost per step label. (https://arxiv.org/abs/2406.06592) 7. Failure modes: Provide qualitative cases where OR-PRM gives confident but wrong constraints/objectives (classic OR pitfall), and whether the system catches feasible but semantically wrong models. 8. For 100 randomly sampled instances, have expert OR graders rate the usefulness and correctness of critiques (Likert & error-type taxonomy). Your current evidence relies mostly on automatic verifiers. 9. As far as I know, the algorithmic complexity of MCTS is relatively high, and I am interested in the impact of introducing MCTS on computational efficiency. 10. If we replace GPT-4o, which is a relatively strong model, with weaker models, can your method still demonstrate robustness? Fully AI-generated
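The Best-of-N selection protocol that these OR-PRM reviews repeatedly refer to can be summarized in a short sketch. The snippet below is illustrative only: `generate_solution` and `prm_score_step` are hypothetical placeholders rather than the paper's actual interface, and min-aggregation of step scores is just one common PRM scoring choice.

```python
# Illustrative sketch only: Best-of-N selection with a step-level process reward model.
# `generate_solution` and `prm_score_step` are hypothetical stand-ins, not the paper's API.
from typing import Callable, List

def best_of_n(problem: str,
              generate_solution: Callable[[str], List[str]],
              prm_score_step: Callable[[str, List[str], int], float],
              n: int = 8) -> List[str]:
    """Sample n candidate solutions and keep the one with the best aggregated step score."""
    best_steps, best_score = None, float("-inf")
    for _ in range(n):
        steps = generate_solution(problem)  # list of reasoning steps (modeling, constraints, code)
        step_scores = [prm_score_step(problem, steps, i) for i in range(len(steps))]
        # Min-aggregation penalizes a single bad step; mean-aggregation is another common choice.
        score = min(step_scores) if step_scores else float("-inf")
        if score > best_score:
            best_steps, best_score = steps, score
    return best_steps
```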
OR-PRM: A Process Reward Model for Algorithmic Problem in Operations Research Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces the first Process Reward Model tailored for Operations Research (OR-PRM), aiming to explore the potential of Large Language Models (LLMs) in this complex reasoning domain. The authors surprisingly find that direct training of a PRM on existing mainstream OR datasets yields weak performance. Through systematic analysis, they identify the primary bottleneck as data quality, noting that over 30% of existing annotations are severely flawed. To address this, the authors collect all existing synthetic datasets and employ a carefully designed filtering pipeline to construct a high-quality seed dataset, which is then used for model training and evaluation. This work not only lays the foundation for applying LLMs in the OR field but also provides a crucial warning and proposed improvement strategy for the quality of current OR algorithm datasets. 1. This is the first application of the Process Reward Model paradigm to algorithmic problems in the field of OR. OR is a domain highly dependent on structured reasoning and complex algorithms, making the integration of LLMs highly valuable for research and application. 2. The authors conduct a systematic analysis that clearly identifies a severe annotation quality issue (over 30% defects) in mainstream OR datasets. This is a significant contribution and warning to the wider community, as identifying data bottlenecks is a crucial step for field advancement. 3. The authors do not avoid the data quality issue but instead proactively address it by collecting existing synthetic datasets and applying a "carefully designed filtering pipeline" to build a high-quality seed dataset. This strategy of solving the problem at the data source is commendable. 4. If OR-PRM proves effective, it will greatly simplify the modeling and solving process for OR problems, providing new avenues for automated algorithm discovery and solution. No serious weakness. None Moderately AI-edited
OR-PRM: A Process Reward Model for Algorithmic Problem in Operations Research Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors develop the first Process Reward Model (PRM) specifically intended for Operations Research. Based on prior research showing that existing synthetic datasets in this area are significantly flawed, the manuscript proposes a filtering pipeline to initially construct a "seed" dataset of problems together with their formalized descriptions and solutions. Each problem is validated in multiple ways by ensuring that the reference code runs and yields the correct result, that constraints are valid, and that the mathematical model to solve represents the problem faithfully. This dataset is then used to construct several reasoning trajectories of multiple reasoning steps based on Monte Carlo Tree Search (MCTS), where each step is validated, to generate a dataset for training (called OR-ProcessQA). A process reward model, OR-PRM, is then trained on this dataset to provide dense rewards for guiding the reasoning effectively. The reward is not a single score; instead, the model provides feedback and corrections in natural language, so it is more like a critic model. * new dataset seems to be addressing a need in this field, if existing datasets are so severely lacking * writing seems to be mostly clear, especially the description of the dataset construction * field of Operations Research is never properly introduced * novelty is a bit diminished by over-claiming: abstract/introduction/conclusion give the impression that the analysis of the existing datasets is done by the authors of this manuscript, but judging from the related work section it was done by someone else * this directly hurts the motivation for the new dataset, since the problems in existing datasets are not discussed in detail * dataset statistics are mostly missing, for example: * number of samples in the (seed) dataset * What is a sample? * average length of trajectories, i.e. number of steps * number of trajectories per sample * how many failed/successful trajectories * dataset composition from the original datasets (partially in Appendix C.2) * little empirical analysis of the dataset creation pipeline * limited evaluation: * generation (not the primary focus of the manuscript): while different model sizes are used, they are only from one model family, leaving open the question whether the approach generalizes to other model families * only a single, small model trained as a reward model, so generalization is a big question * analysis focuses too much on the employed generative models and less on the capabilities of the reward model * no baselines for the first scenario (Best-of-N) besides the original model * unclear baselines for second scenario: * How does pass@1 with critic work? Does it consider the proposed corrected step/solution? * How does pass@8 without critic work? * potential baseline: Use of a general purpose model with the same prompt to act as a critic?
* missing cost analysis. Minor issues: * it seems that the wrong cite command is used in general, so brackets around the references are missing * related work: * title - seems not to be in all caps like the other section titles * line 147: for some reason "offline" is incorrectly hyphenated * Figure 2: * you might want to make the bubble (bottom row, roughly the middle) a bit bigger, so that "Data Diversification" does not touch the outline * same for the bubble inside "OR-PRM Model Training" * titles of 3.1 and 3.1.1: You might want to properly capitalize the titles. * 4.1: * line 327: "Specifically, Industry OR..." - not a proper sentence * line 331: "OR-PRM, We" - "we" should not be capitalized * line 346: reference for CoT? * references: * cited differently than other arXiv references (can only be surmised from the URL): * [DeepSeek-AI 2024] * [Luo et al. 2024] * [Ma et al. 2024] * [Zhang et al. 2025a] * additionally consider capitalizing the titles to be consistent, especially abbreviations and proper names (which is sometimes done): * [Huang et al. 2025a/b] * [Ma et al. 2024] * [OpenAI 2025] * [Wang et al. 2025] * [Wu et al. 2025] * [Xiao et al. 2024] * [Xiao et al. 2025] * [Xie et al. 2023] * [Yang et al. 2025b] * [Zhang et al. 2025a] * [Zhou et al. 2025] * missing place of publication: [Wu et al. 2025] * [Xiao et al. 2025] was published at IJCAI '25 * Appendix B: * NL4OPT: The number of samples (245) seems to differ from Table 3. * line 711: Wrong table reference (Table 3 instead of 4)? * Figure 5, caption: "Example:LLM" - missing whitespace * Appendix E, line 858: "etc.Three" - missing whitespace; "Three examples as follow:" - not a proper English sentence * Appendix F: title sounds a bit strange * 3.1.1: What does $\hat x$ represent? The new solution? But if the expected output is already known, why not use that as ground truth? * 3.1.2: What is $\mathcal{L}$? I suppose it is not the loss function? * 3.2: How are Generative PRMs different from a general critic model employed for each reasoning step? * Table 1: What would have been the improvements if the best solution had been selected based on the ground truth, i.e. what would have been the upper bound of improvement when generating additional trajectories? * This would give context on how well OR-PRM selects reasoning chains or whether improvements "just" stem from the additional generation, i.e. more token use. * this sentence in 4.4 seems to suggest that such data exists: "our Best-of-N performance is strong, but it still falls short of the theoretical upper bound" (line 466) * Why was OR-PRM not used in this setting for the proprietary models? * Table 2: * What does Qwen2.5 (Zero Shot) do? Why are the results different than in Figure 3? * same for OR-PRM (Ours) * How does majority voting (filtered null) work? * Which scenario is used for the ablation study? Probably the second one. * Did you try any out-of-distribution benchmarks? * 4.4/conclusion: How would expanding the dataset help to create credible baselines? * Appendix F, Critic Prompt: seems to apply only to the full solution * Can we actually call this a Process Reward Model if it does not look at individual reasoning steps? * What is the difference between the two prompts on page 24? Where are they applied? Fully human-written
OR-PRM: A Process Reward Model for Algorithmic Problem in Operations Research Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces OR-PRM, the first Process Reward Model (PRM) tailored for Operations Research (OR) reasoning tasks. The work targets a fundamental challenge: large language models (LLMs) often fail to produce reliable, logically consistent reasoning in OR problems involving optimization modeling, constraints, and solver code generation. To address this, the authors propose a three-stage pipeline: Data construction – Build a high-quality, process-annotated dataset (OR-ProcessQA) through careful seed curation, constraint validation, and semantic verification. Monte Carlo Tree Search (MCTS) – Generate diverse reasoning trajectories and preliminary correctness labels automatically. Process Reward Model (OR-PRM) – Train a generative PRM using Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO), enabling fine-grained, interpretable step-level feedback for OR reasoning. Extensive experiments on multiple open-source (Qwen2.5, LLMOPT) and closed-source (GPT-4o) models show that OR-PRM substantially improves reasoning accuracy, robustness, and interpretability. Well-motivated and novel contribution: Addresses a clear gap between generic reasoning PRMs and domain-specific reasoning in mathematical optimization. OR-PRM is the first reward model explicitly designed to evaluate reasoning steps in mathematical optimization, going beyond scalar scoring to produce structured, natural-language critiques and corrections. Methodologically rigorous: Multi-stage pipeline (SFT + DPO) is logically constructed, with explicit checks (execution, constraints, semantics) to ensure data correctness. Excellent technical clarity: The paper defines each component precisely (seed data, MCTS, PRM training), making it reproducible and interpretable. Comprehensive evaluation: Benchmarks include both open- and closed-source models, with ablation studies isolating the effect of DPO alignment. Strong empirical results and interpretability: The improvement is not just numerical but also interpretable, showing how OR-PRM critiques and corrects reasoning steps. It is not clear to me whether the evaluation pipeline can serve as the ground truth, since LLMOPT and GPT-4o may make mistakes. It is unclear how far the evaluation is from having an OR expert evaluate the reasoning steps. The pipeline heavily depends on LLMOPT. It would be nice to discuss how much impact using a different model to generate the data would have on the results. It seems that the OR problems have data that appears in the problem description in scalar form. What if the data is stored in CSV files? Would the pipeline still work? Lightly AI-edited
TEMPFLOW-GRPO: WHEN TIMING MATTERS FOR GRPO IN FLOW MODELS Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. TempFlow-GRPO is a temporally-aware reinforcement learning framework for flow matching models that improves human preference alignment by introducing trajectory branching, noise-aware weighting, and seed grouping to achieve precise credit assignment and efficient optimization across timesteps. For reinforcement learning tasks, dense rewards are crucial for effective credit assignment. The proposed Trajectory Branching mechanism provides an elegant and effective way to obtain dense rewards along the denoising trajectory. The introduced reweighting mechanism offers a valuable analysis of how gradients evolve across steps in baseline algorithms and presents a solution to mitigate the identified issues. The proposed method involves numerous ODE denoising steps, which substantially increase computational overhead. However, the paper lacks a comparison against the baseline method using training time as the horizontal axis to illustrate efficiency trade-offs. The authors should evaluate the performance of the reweighting mechanism under different $\sigma_t$ schedulers rather than relying solely on the one used in Flow-GRPO, to examine how the choice of scheduler influences its effectiveness. It remains unclear whether simply reweighting the coefficients in the earlier part to 1 would yield good results under different schedulers. The comparison between batch std and global std is only evaluated on PickScore. How does this observation generalize to other tasks? Can the proposed reweighting mechanism be applied to hybrid variants (FlowGRPO-Fast/MixGRPO) where only a subset of steps follows an SDE formulation? Lightly AI-edited
TEMPFLOW-GRPO: WHEN TIMING MATTERS FOR GRPO IN FLOW MODELS Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents TempFlow-GRPO, a new reinforcement learning framework that addresses the limitation of uniform credit assignment across timesteps. The method introduces trajectory branching, which switches from ODE to SDE sampling at selected timesteps to generate exploratory branches and assign their rewards to intermediate states. This paper further proposes noise-aware policy weighting, prioritizing optimization at high-noise early stages over low-noise refinement phases. Experiments show that TempFlow-GRPO achieves substantially improved efficiency and final performance compared to the baselines. - The paper is overall well-written and easy to follow. - The motivation and the proposed method are clear and straightforward: addresses the temporal inhomogeneity and credit assignment problems through intermediate resampling for intermediate value estimation and noise-aware reweighting. - The proposed method shows strong empirical performance in both efficiency and end-level performance, with comparisons that include GPU time. - Theorem 1 is intuitively reasonable, but labeling it as a Theorem feels overstated since the underlying assumptions and proof sketch are insufficiently formalized. The analytical depth is also somewhat limited. - The explanation around line 847 (regarding why the average number of branches is 4.5× when K = 10) is unclear. It is not obvious how this factor arises or how the branching schedule operates, and the paper does not explicitly describe it. - Adding more algorithmic details or pseudocode would improve readability and make the proposed procedure easier to follow. See weaknesses. Lightly AI-edited
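For readers following the TempFlow-GRPO discussion, the trajectory branching idea described in these reviews (deterministic ODE sampling with a single stochastic SDE step at a chosen timestep, whose branch rewards are credited to that intermediate state) can be sketched roughly as below. `ode_step`, `sde_step`, and `reward` are hypothetical placeholders and the loop structure is an assumption, not the paper's implementation.

```python
# Illustrative sketch of trajectory branching: follow the deterministic ODE, take one
# stochastic SDE step at timestep k to spawn exploratory branches, then continue
# deterministically and credit the branch rewards to the state at k.
import torch

def branch_and_score(x0, timesteps, k, num_branches, ode_step, sde_step, reward):
    x = x0
    for t in timesteps[:k]:                 # deterministic prefix up to the branch point
        x = ode_step(x, t)
    branch_rewards = []
    for _ in range(num_branches):           # stochastic exploration at timestep k only
        xb = sde_step(x, timesteps[k])
        for t in timesteps[k + 1:]:         # deterministic continuation to the final sample
            xb = ode_step(xb, t)
        branch_rewards.append(reward(xb))
    # The spread of branch rewards provides a dense, per-timestep credit signal for state x.
    return x, torch.tensor(branch_rewards)
```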
TEMPFLOW-GRPO: WHEN TIMING MATTERS FOR GRPO IN FLOW MODELS Soundness: 4: excellent Presentation: 4: excellent Contribution: 4: excellent Rating: 10: strong accept, should be highlighted at the conference Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposed TempFlow-GRPO, a framework that makes the optimization process temporally aware to address the key limitation of temporal uniformity in previous RLHF works. The paper introduces a mixture of ODE and SDE sampling, along with a noise-aware policy weighting scheme, to balance exploration and reward exploitation. Experiments demonstrate that TempFlow-GRPO achieves state-of-the-art performance, yielding higher rewards than standard GRPO approaches. - The paper pinpoints temporal uniformity as the primary limitation of existing flow-based GRPO methods and proposes TempFlow-GRPO to solve it with precise credit assignment and noise-aware optimization. The authors demonstrate this non-uniformity well with empirical evidence from rewards, supporting the need for temporal information. - The paper introduces the core mechanisms of trajectory branching and noise-aware reweighting to create temporally-structured policies that respect the dynamics of the generative process. The authors also provide a theoretical justification from the policy gradient perspective, further supporting the use of noise-aware reweighting. - The proposed TempFlow-GRPO achieves state-of-the-art performance compared to the existing vanilla GRPO approach, demonstrating the effectiveness of the method. The authors also include comprehensive ablation studies to better understand the dynamics of this model. - The computational cost, as thoroughly analyzed in Appendix A.6, will be higher than the vanilla GRPO models due to the branching process. Nonetheless, this is more like a trade-off between quality and time, given the superior quality metrics. - How is the performance affected by the number of branches (K) at each step, the specific timesteps chosen for branching, or the exact function used for noise-aware weighting? The ablation study (Fig. 8) shows that the 4x6 (seed x branch) configuration was chosen, but it's unclear how much tuning is required to find the optimal setup for a new model or dataset. A discussion on how to choose these hyperparameters will be useful for general applications of the proposed framework. Moderately AI-edited
TEMPFLOW-GRPO: WHEN TIMING MATTERS FOR GRPO IN FLOW MODELS Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses the sparse terminal reward and uniform credit assignment problem in GRPO training of flow models. The authors propose TempFlow-GRPO, which includes: (1) Trajectory Branching, where only one step of SDE is used at timestep k; (2) Noise-Aware Policy Weighting, which reweights according to noise level; and (3) a seed group strategy. The method achieves state-of-the-art performance in human preference alignment and text-to-image benchmarks. 1. The authors astutely identify that the Flow-GRPO algorithm treats all timesteps equally, and tackle this issue via single-timestep SDE optimization. 2. The noise reweighting method is shown to be effective through both solid theoretical analysis and experimental results. 3. The paper is generally well written with a clear logical structure. 1. The contribution of the seed group strategy is relatively small compared to other parts of the work, and the paper should provide additional details of the seed group strategy. 2. Similarly, MixGRPO [1] proposes a training window of SDE time steps that also tackles the issue of treating all timesteps equally. However, there is limited discussion comparing with MixGRPO. 3. The paper does not discuss the phenomenon of reward hacking, which is an inevitable problem for the GRPO method. [1] MixGRPO: Unlocking flow-based GRPO efficiency with mixed ODE-SDE 1. The trajectory branching mechanism appears similar to MixGRPO restricted to a single-timestep window. How do their efficiency and effectiveness compare? 2. The paper claims that Flow-GRPO (Prompt) is an improved baseline with group standard deviation stabilization, but does not provide much detail. Could the authors elaborate on this improved method? 3. Why are the PickScore curve trends by steps and GPU hours on the left of Figure 3 inconsistent? 4. Compared to Flow-GRPO, the Visual Text Rendering experiment is not included. How well does TempFlow-GRPO perform on this particular task? Fully human-written
Causal Partial Identification with Data Augmentation Soundness: 3: good Presentation: 1: poor Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper seeks to provide theoretical bounds for using data augmentation when coupled with partial identification. The bounds proposed give both best- and worst-case scenarios for partially identifying the causal averages they employ, with respect to predefined causal risks. Although I am no expert in either domain augmentation or partial identification, it seems to me that combining these two ideas is quite novel. The quality of writing is not incredibly good (see Weaknesses). The theorems appear to be sound and (without taking a close look at the proofs) the conclusions do seem natural enough. The concepts presented are clearly explained and it is clear that the theorems are algebraic derivations based on these definitions. Adding an image to illustrate an observation of the dataset described in Section 6.2 would make the explanation more intelligible. The abuse of notation in Section 3 (Line 202) was quite confusing at first. It seems to me that the notation employed could be formally introduced, to prevent confusion. The first equation hints at a very general functional model that is later replaced by a linear model, possibly because the results of Vankadara et al. (2022) are only valid for linear models. Is this the only bottleneck that prevents extending the current framework to more general models? ### Typos and errors: Several citations in the first pages (and possibly other parts of the text) should use citep instead of citet. This makes reading of these pages quite painful and annoying. - Line 20: ubiquitous. - Line 273: extra 'e' in perturbs. - Line 281: I suspect there are some parentheses or mathematical notation missing for $f^T x \in H_{da+pi}(x), H_{pi}(x)$. - Line 461: out -> our. - Line 464: identification. - Line 466: Rosenbaum is inconsistently cited (no year). - Equation (5): the norms used are not clearly defined. The (very subtle) difference between $X$ and $\mathbf{x}$ inside the do operator makes these hard to distinguish on a first read. The authors claim these bounds are valid for the infinite-data setting and naturally are only able to assess them in a finite-data setting. However, it would be good to numerically assess how the derived bounds behave for different data-size regimes. Outside of the optical device data used in Section 6, it is hard for me to envision the type of causal questions where the proposed tools would be useful. Could the authors point out other settings where this could be useful? Note that I am not requesting more experiments with this question, but they would be welcome. In Theorems 1 and 2, I assume that the '_equality iff_' would also correspond to having slack equal to zero. Is this intuition correct? If so, how does the slack increase for different degrees of independence? Since the models are linear (and possibly Gaussian), measuring this dependence could be easily done by correlations/covariances. On that note, the Gaussian assumption (over $G, U, N_X$, and $N_Y$) in Example 1 is quite strong (and it later affects all the theoretical developments), so in what key ways does your contribution depend on this?
Fully human-written
Causal Partial Identification with Data Augmentation Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors investigate whether outcome-invariant data augmentation (DA) can sharpen partial identification (PI) bounds for causal effects in the presence of unmeasured confounders. The key idea lies in constructing an outcome-invariant data augmentation transformation G as a transformation intervention do(x=Gx). A very simple linear Gaussian case study is provided. In experiments, the authors show that DA can help reduce the pointwise interval width. For the theoretical analysis, the authors show that DA lowers the worst-case causal excess risk over the identified set. This seems to be a flexible approach, as DA is model-agnostic and compatible with many PI frameworks. 1. The conceptual idea is very clear and straightforward: construct an outcome-invariant DA as a transformation intervention and then use it as a plug-in module in any PI pipeline. 2. The theoretical analysis is solid, especially the result that DA lowers the worst-case excess risk (Theorem 2). 3. The proposal is a pre-processing step that can be composed with existing PI methods without changing their solvers or constraint sets. 1. The linear-Gaussian SEM still seems restrictive, with additive noise and a particular sensitivity model. In other common cases, such as non-Gaussian or non-linear outcome models, many guarantees may not hold. 2. The outcome-invariance assumption is hard to test, and there is no diagnostic or stress test for misspecified augmentations. 3. The numerical results are not comprehensive enough. Only one real dataset (Optical Device) is used with fixed sample size (n=1000). 1. On the Optical Device data, why should flips/rotations/Gaussian noise be outcome-invariant for f? Can you provide predictive-invariance checks? 2. Can any of the theoretical results be extended to the non-Gaussian case, perhaps with some additional mild assumptions? 3. How to choose G seems very important and tricky. A practical guideline for choosing G is needed. 4. A comprehensive sensitivity analysis to mis-specified DA is very important for understanding the role of DA. 5. There are many augmentation parameters, such as noise scale, angle, etc. How do these parameters affect the PI bounds and the prediction performance? 6. The authors should test DA in multiple PI frameworks to demonstrate compatibility and gains beyond the partial R^2 model. 7. Since the authors argue compatibility with IVs, a toy example where an IV is present, showing that DA preserves IV validity and may further tighten bounds, would be very convincing. Fully AI-edited
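The review above summarizes the key construction as an outcome-invariant transformation G applied through do(x = Gx). A toy, self-contained sketch of what such an augmentation could look like in a confounded linear-Gaussian setting is given below; the SEM parameters and the specific choice of G are assumptions for illustration, not taken from the paper.

```python
# Illustrative sketch of outcome-invariant data augmentation as a transformation
# intervention do(x = Gx) in a toy confounded linear-Gaussian SEM.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 2
beta = np.array([1.0, 0.0])                 # true causal coefficients; f(x) = beta @ x
u = rng.normal(size=n)                      # unobserved confounder
X = rng.normal(size=(n, d)) + u[:, None]    # confounded covariates
Y = X @ beta + u + 0.1 * rng.normal(size=n) # outcome with confounding noise

# G perturbs X only along a direction that leaves f(x) = beta @ x unchanged
# (here, the second coordinate, which beta ignores), so Y can be reused unchanged.
G = np.eye(d)
G[1, 1] = -1.0
X_aug = np.vstack([X, X @ G.T])
Y_aug = np.concatenate([Y, Y])
print(X_aug.shape, Y_aug.shape)             # augmented dataset: (2000, 2) (2000,)
```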
Causal Partial Identification with Data Augmentation Soundness: 4: excellent Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a method based on data augmentation to improve partial identification bounds in causal inference when unobserved confounding prevents point identification. The key idea is to use data augmentation, specifically, an outcome-invariant augmentation, as an auxiliary source of information that can act as a ``soft intervention.'' If we can transform data in ways that leave the outcome function unchanged (say, rotations of a picture in a classification task), these transformations can help tighten the partial id bounds. The method requires background knowledge about the regression function, for instance the null space of the coefficients matrix in a linear regression model, which can be quite limiting in practice. 1. Even though the analysis is restricted to linear-Gaussian systems, I believe most of the results have straightforward generalizations to more complex models. --- 2. Synthetic and semi-synthetic experiments consistently show that DA+PI yields tighter bounds than baseline PI. 1. Significant overlap with Akbar et al [1]: Taking a look at the first reference of the paper, one can notice immediately that this paper has a significant overlap with, and is essentially not adding much to Akbar et al. Surprisingly, a great deal of content is copied without even rewording. Ideas are copied from that paper: DA as soft intervention, Figure 2 of the paper with its caption, the running example of the paper, and most significantly, the theoretical results of this paper such as proposition 2, Lemma 3 and Lemma 4, all appear in Akbar et. al., and the rest of the results provided here are quite trivial or straightforward at best to prove given those. Compared to Akbar et al., I don't see much of a novel idea, novel proof, novel result, or novel presentation, begging the question what is the merit of this paper given that? Only bringing up the observation that DA can be used for partial identification too? Then I do not believe this paper is contributing enough to the literature. --- 2. The paper presents its main claims (“valid bounds”, “sharpened partial identification”, etc) early as if fairly general (Sections 1–3). But when you dig into Section 4, you find that the proofs are only in the linear Gaussian “Example 1” setting with heavy structure/assumptions. This can be quite misleading. The authors should either restrict their claims explicitly in the first few sections, or lift their proofs to more general settings (if they can). --- 3. The manuscript repeatedly introduces notation and small definitions close together; readers must hunt back and re-read definitions frequently. I believe this paper can be presented in a much better (readable, at the least) way if a bit of time is spent on it. --- 4. I am not convinced by the idea of using the background knowledge (say symmetries of f) in the way that this paper suggests. Read my question below too. If you already have symmetry knowledge about f, why do DA instead of directly imposing constraints? 
If the researcher truly knows those symmetries, you can (in many cases) impose them directly in the inference/optimization (e.g., enforce invariance in hypothesis class H, add equality constraints, or augment the objective with invariance penalties) rather than applying DA as a pre-processing step. The manuscript should either 1) clearly justify why DA is preferable to directly injecting invariance into the estimator (computational simplicity? easier to combine with off-the-shelf solvers? better finite-sample behavior?), and provide a brief theoretical or experimental comparison; or 2) explicitly treat DA as one practical way to operationalize symmetry knowledge and discuss tradeoffs (what you lose/gain compared to e.g. constraining H). Fully human-written
Causal Partial Identification with Data Augmentation Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper studies partial identification bounds under known symmetries of the causal effect, i.e., under invariances of $f$ where $Y = f(X) + \xi$ and $\xi$ may be correlated with $X$, hence acting as an unobserved confounder. Primarily focusing on the population setting, the authors define a hypothesis class $H_{pi}$ of possible causal effect functions (each $h \in H_{pi}$ is a function from $x$ to $y$), which are consistent with the observational distribution $P_{X,Y}$ and satisfy the invariance constraints. For each $x$, this hypothesis class induces a set of possible causal effects $H_{pi}(x)$, which in turn can be used to define a worst-case excess risk. Assuming a multivariate Gaussian distribution over $(X, Y, \xi)$, they show that the excess risk strictly decreases under these constraints, as long as the symmetries are not almost surely orthogonal to the expectation of $X$ given $\xi$. In practice, rather than being strictly enforced, the invariances are captured using data augmentation. This approach is corroborated through a simulation experiment on a linear model, giving sharper identification bounds, and in real-world experiments on the Optical Device dataset. **Originality and significance:** The use of known symmetries to sharpen partial identification bounds is an interesting and (to the best of my knowledge) novel direction. As the authors discuss, such symmetries are common in many applications, especially in scientific machine learning, where causal inference is quite important, so the combination of the two is a very good match. **Quality and clarity:** The work is well-executed, the motivation is clear, and the mathematical details are well-written. ## Major weaknesses 1. **Limitation to multivariate Gaussian setting:** To the best of my understanding, the results are limited to multivariate Gaussian distributions on $(X, Y, \xi)$, and this limitation is not made as transparent as it should be. Based on the text after Assumption 1, I understand that the partial R-squared sensitivity model is not a necessary restriction, but I'm less certain about the Gaussianity assumption (or at least, a linearity assumption). For example, Proposition 1 invokes a Lebesgue measure over $H_{pi}$, which is initially defined as a function space, but in the proofs, $h$ is taken to be a vector (i.e., the coefficients of a linear function). Overall, this lack of transparency gives the feeling that the results are being oversold. 2. **Not enough focus on the quantitative form of the results:** In connection to Weakness 1, I would be much more interested to see Lemma 5 in the main paper and a more quantitative discussion of *how much* the invariances sharpen the partial identification bounds. Theorem 1, Proposition 2, and Theorem 2 don't provide any intuition about how much the invariance sharpens the bounds, which I think would be the most interesting part. ## Minor weakness 3.
**Overly focused on data augmentation:** I think a more logical way to present the results would be to focus on how known symmetries/invariances improve the partial identification bound, and afterwards connect these results to data augmentation. The results are really about the restriction of the hypothesis space, and would hold even when strictly enforcing the invariance - the approach of using data augmentation is more of a practical implementation detail. - Please address the Major Weaknesses, especially (1). Fully human-written
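As a point of reference for the setup this review describes, one plausible way to write down the induced identified set and the worst-case excess risk under squared loss is the following; the notation here is an assumption for illustration, not copied from the paper:

$$
H_{pi}(x) = \{\, h(x) : h \in H_{pi} \,\}, \qquad
R_{\mathrm{wc}}(h) = \sup_{h^{\star} \in H_{pi}} \mathbb{E}_{X}\!\left[ \big( h(X) - h^{\star}(X) \big)^{2} \right].
$$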
MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs Soundness: 3: good Presentation: 2: fair Contribution: 1: poor Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces MolecularIQ, a molecular structure reasoning benchmark composed of symbolically verifiable tasks. Specifically, the tasks consist of three types: (1) feature counting, (2) index-based attribution, and (3) constrained generation, and the features of interest include functional groups, chemical properties, synthesis, and so on. The MolecularIQ dataset is composed of 849 molecules. In addition, the paper proposes a framework that dynamically computes the ground-truth labels for the designed tasks, which it calls MolecularIQD. The experimental results show that current LLMs exhibit a significant gap in the compositional and structural reasoning abilities required. * This paper provides a wide-ranging evaluation across diverse model types and sizes. * The introduced MolecularIQD can be further utilized for new and open molecules, guaranteeing its scalability. * This study is grounded in the belief that there is a positive correlation between molecular structural understanding and molecular reasoning ability for complex property prediction. However, this belief is not explicitly demonstrated, so the necessity of building a structure-reasoning benchmark appears limited. I suggest presenting the relationship between structural understanding and predictive performance on molecular properties. * Overthinking is a well-known pitfall in molecular structure understanding, yet the results here differ from conventional wisdom. Providing an in-depth analysis of this aspect would strengthen the contribution of the proposed benchmark and offer clearer grounds for the necessity of reinforcement learning (RL) training. * The results clearly show where models fail, but the paper would be stronger with deeper qualitative error analysis. Presenting examples of incorrect molecules generated by top models and categorizing the types of structural mistakes would give model developers more actionable insights. * Could the authors provide the failure cases in the multi-task scenario? * What kinds of substructures do LLMs fail to capture? Fully human-written
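The symbolic verification this benchmark relies on (computing ground-truth feature counts directly from the molecular graph and checking generated molecules against constraints) can be illustrated with standard RDKit calls. This is a minimal sketch of the general idea, not the authors' released verifier; the SMARTS pattern and the example constraint set are assumptions for illustration.

```python
# Minimal sketch of symbolic verification for feature-counting and constrained-generation
# style tasks, using standard RDKit calls.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

def count_features(smiles: str) -> dict:
    """Compute a few symbolically verifiable ground-truth counts for a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return {
        "rings": rdMolDescriptors.CalcNumRings(mol),
        "aromatic_rings": rdMolDescriptors.CalcNumAromaticRings(mol),
        "heteroatoms": sum(1 for a in mol.GetAtoms() if a.GetSymbol() not in ("C", "H")),
        "hydroxyl_groups": len(mol.GetSubstructMatches(Chem.MolFromSmarts("[OX2H]"))),
    }

def check_constraints(smiles: str, constraints: dict) -> bool:
    """Verify a generated molecule against requested feature counts, e.g. {'rings': 2}."""
    counts = count_features(smiles)
    return all(counts[name] == value for name, value in constraints.items())

# Example: verify a generated SMILES against a two-ring, one-hydroxyl requirement.
print(check_constraints("OC1CCC2CCCCC2C1", {"rings": 2, "hydroxyl_groups": 1}))
```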
MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces MOLECULARIQ, a benchmark aimed at chemical structure reasoning with symbolically verifiable ground truths. Tasks are built directly on molecular graphs (via RDKit solvers) and span three orthogonal axes: reasoning category (feature counting, index attribution, constrained generation), multitask load (1, 2, 3, or 5 simultaneous requirements), and molecular complexity (Bertz bins). The authors also provide a dynamic variant (MOLECULARIQD) for regenerating fresh test sets and preventing saturation/overfitting. Evaluation is integrated into the lm-evaluation-harness, with hierarchical answer extraction and semantic (key-agnostic) comparison; accuracy is averaged over three independent rollouts. A large-scale study (34 models) finds that recent MoE/generalist reasoning LLMs dominate; chemistry-tuned instruction/RL models often underperform their bases; constrained generation is easiest at low constraint counts but degrades sharply as constraints stack; and multitask load harms performance more than molecular complexity. The paper argues symbolically verifiable tasks reduce leakage/bias and produce capability fingerprints localizing failure modes. Tight problem statement & clear contribution. A fully symbolically verifiable chemistry benchmark focused on structure-grounded reasoning (not factual recall) is timely and well-motivated. Three-axis profiling. Disentangling reasoning type, multitask load, and molecular complexity provides diagnostic granularity and actionable error localization. Index-based tasks. Pairing counting with index attribution helps distinguish genuine graph reasoning from pattern-matching/shortcut counts. Solid evaluation infrastructure. Harness integration, hierarchical extraction, and semantic answer comparison address brittle formatting issues; multi-rollout accuracy mitigates sampling variance. Dynamic benchmark (MOLECULARIQD). A path to refreshable evaluations that can evolve with the field and support RL with verifiable rewards. Empirical insights. Consistent trends (MoE leads; chemistry tuning can hurt; canonical/aromatic SMILES easier; multitask load dominates difficulty) are useful for both modeling and benchmark design. Transparent limitations section. Clear articulation of the current scope (2D graphs, single-molecule tasks, symbolic-only feature set). 2D-only scope. Restricting to graph connectivity omits 3D stereoelectronic/conformational effects that matter for realistic chemical reasoning; several stereochemistry tasks may still be fragile under a purely 2D treatment. Verifier dependence & edge cases. Heavy reliance on RDKit rules brings corner-case risk (e.g., aromaticity/kekulization, tautomers, undefined stereocenters). Clear auditing and unit tests for borderline cases would strengthen claims. Dataset scale & coverage. The main static benchmark uses hundreds of molecules / thousands of questions—adequate for signal but small relative to chemical space. It is unclear how representative the selected feature distributions and Bertz bins are for downstream applications. 
Generation tasks may be gameable. Low-constraint prompts (e.g., “has two rings”) can be satisfied by template snippets; evidence that models aren’t exploiting canonical pattern banks (beyond SMILES randomization) would be welcome. Per-model configuration fairness. “Tailored configs” improve each model’s score but may complicate cross-model fairness. A fixed canonical configuration alongside tailored ones would help disambiguate. Rollout averaging & significance. Averaging over three stochastic runs may be thin for close comparisons; confidence intervals and multiple seeds per model/config would improve robustness. Leaderboard/process details deferred. Several artifacts (e.g., public leaderboard link, full configs) are promised camera-ready; reproducibility would benefit from making the dynamic generator and verifier immediately available. Limited assessment of prompt/extraction shaping. Although hierarchical extraction is a strength, further stress tests against format-shaping and overfitting to extraction heuristics would increase trust. Verifier audits: How do you handle aromaticity/kekulization mismatches, tautomerism, and unspecified stereochemistry during indexing and generation checks? Any published test suite of adversarial edge cases? SMILES perturbations: Beyond canonical/aromatic randomization, did you try token perturbations (ring index relabeling, randomized branches) to further separate pattern recall from graph reasoning? Fairness controls: Can you report both fixed (uniform decoding and temperature) and tailored configs for all models to separate capability from tuning sensitivity? Rollouts & variance: Why three rollouts? Do results meaningfully change with 5–10 rollouts or multiple seeds? Please add per-model variance bars in the main text. Constraint hardness: For constrained generation, can you provide a calibrated hardness ladder (e.g., constraint sets with matched feasibility rates) and show how models scale as we move up the ladder? Dynamic set governance: How will MOLECULARIQD updates be versioned (to avoid moving goalposts) and how will you prevent train–test leakage as the community begins to tune on the benchmark? 3D extension: What’s the roadmap for 3D-aware, symbolically verifiable tasks (e.g., CIP resolution, ring puckers, distance constraints) and for multi-molecule tasks (reaction stoichiometry, scaffold ranking) while maintaining verifiability? Failure mode taxonomy: Can you release capability fingerprint templates and diagnostic exemplars so users can map a model’s errors to specific graph-perception or compositional failures? Fully AI-generated
MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This work introduces a new benchmark to evaluate LLMs on chemistry tasks. All tasks are grounded in the molecule’s graph structure and have answers that can be checked by a symbolic solver. The authors evaluate 34 LLMs and find that the largest mixture-of-experts models with high reasoning budgets achieve the best accuracy. 1. The paper presents a verifiable benchmark and all tasks have ground-truth solutions, allowing reliable automatic evaluation. 2. The benchmark varies the molecular complexity and tests different chemical reasoning skills. 1. The benchmark excludes tasks like quantitative property prediction or reaction prediction, so it does not evaluate LLMs’ ability on some real-world chemistry problems. 2. The tasks in the benchmark are fundamental checks that chemistry software can do straightforwardly. They may be somewhat disconnected from how humans typically solve chemistry problems. For example, asking a model to ‘generate a molecule with two rings and five heteroatoms’ is more like a puzzle or exercise than creative problem-solving. 3. The benchmark uses SMILES strings for molecules, and the authors noticed that LLMs may use pattern recognition rather than structural reasoning. Similar arguments have been discussed in the literature, such as the inconsistency of LLMs across molecular representations, which indicates that LLMs fail to capture the underlying chemistry. The authors may consider adding similar consistency checks in the benchmark. 1. Does better performance on the benchmark always suggest stronger reasoning? Is there a way to explicitly evaluate whether the model is doing reasoning or pattern recognition? 2. Some LLM systems, such as ChemCrow, use external tool calls to solve chemistry tasks. Do the authors envision the benchmark being used not just for standalone LLMs, but also for LLM-based agents? Fully human-written
CoRGI: GNNs with Convolutional Residual Global Interaction for Lagrangian Simulation Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper aims to address the challenge of modeling long-range interactions in particle-based fluid simulation. To achieve this, the authors propose a hybrid framework that combines particle and grid representations: particle features are projected onto a grid for convolutional updates and subsequently mapped back to the particle domain. Experiments conducted on several Lagrangian benchmark datasets demonstrate improvements over existing methods. * The proposed method achieves superior performance compared to existing baselines. * It maintains computational efficiency, introducing only minimal additional overhead. 1. Limited evaluation horizon. Results are reported only for 20-step rollouts, which is atypical for practical use. This short horizon makes it difficult to assess **error accumulation and long-term stability**. Please report long-term prediction performance (e.g., full-trajectory rollouts of $\ge$ 150 frames, as in GNS). 2. Insufficient temporal qualitative evidence. The qualitative evaluation relies on sparsely sampled frames, which obscures temporal coherence (smoothness, stability, and plausibility of motion). Side-by-side videos of predictions vs. ground truth at fixed frame rates would better demonstrate dynamics, including representative failure cases. 3. Missing 3D visualizations for 3D tasks. Although the experiments include 3D scenarios, only 2D snapshots are shown. Please provide 3D renderings for the 3D cases. Please refer to the weaknesses. Moderately AI-edited