ICLR 2026 - Reviews

Reviews

EditLens Prediction: Fully AI-generated Heavily AI-edited Moderately AI-edited Lightly AI-edited Fully human-written All

Rating: 1 2 3 4 5 6 7 8 9 10 All

Confidence: 1 2 3 4 5 All

Summary Statistics

EditLens Prediction	Count	Avg Rating	Avg Confidence	Avg Length (chars)
Fully AI-generated	15899 (21%)	4.43	3.58	3687
Heavily AI-edited	3233 (4%)	4.22	3.59	2990
Moderately AI-edited	7082 (9%)	4.20	3.61	2722
Lightly AI-edited	16648 (22%)	4.15	3.68	2746
Fully human-written	32938 (43%)	4.13	3.62	2917
Total	75800 (100%)	4.21	3.62	3026

Title	Ratings	Review Text	EditLens Prediction
RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper introduces RELDIFF, a generative framework for synthesizing complex relational databases. Unlike prior methods that flatten schemas or assume conditional independence, RELDIFF explicitly models database structures as graphs and uses a graph-based diffusion model to generate mixed-type attributes across interconnected tables. The framework ensures referential integrity via a D2K + SBM graph generator and captures both inter- and intra-table dependencies using GNNs. Experiments are conducted on 11 datasets. 1.The paper is generally well-written and easy to follow. 2.The use of the D2K + SBM graph generator to preserve foreign key cardinality and hierarchical dependencies is novel and technically interesting. 3.The ablation study is comprehensive. 1. The decomposition $p(\mathcal{V},\mathcal{E})$ = $p(\mathcal{E})p(\mathcal{V}\|\mathcal{E})$ is assumed without theoretical support. 2. The proposed joint diffusion model is not clearly novel compared with existing tabular diffusion approaches such as TabDDPM, TABSYN, and TabDiff. 3. The high training cost of RelDiff raises scalability concerns, and memory usage is not reported. 1. The statement “tabular data includes complex and varied distributions” (lines 41-42) appears somewhat vague. Image and text datasets can also exhibit diverse and complex distributions due to varying sources and contexts. Could the authors clarify in what specific sense tabular data distributions are considered more complex or varied? 2. The decomposition $p(\mathcal{V},\mathcal{E})$ = $p(\mathcal{E})p(\mathcal{V}\|\mathcal{E})$ seems to be taken as a modeling assumption without sufficient justification. It is unclear why the generative process is assumed to first sample the relational structure and then the attributes. In practice, foreign-key relationships (edges) may be influenced by attribute distributions (e.g., business logic or temporal constraints), while attribute distributions can also be constrained by the structure (e.g., table hierarchy and connection density). Therefore, this factorization implicitly assumes a unidirectional dependency from structure generation to attribute generation, yet the paper provides neither theoretical justification nor empirical evidence to support this assumption. 3. The proposed joint diffusion model seems not an innovative design. The use of diffusion models for generating heterogeneous tabular features (i.e., numerical and categorical) has been extensively studied in prior works such as TabDDPM [1], TABSYN [2], and TabDiff [3]. The authors are encouraged to clarify what makes their proposed hybrid generation method novel beyond existing tabular diffusion approaches and to provide stronger empirical evidence demonstrating its superior effectiveness. 4. While quantitative metrics are provided, the quality of the generated tabular data should be further demonstrated through visualization to offer more intuitive and interpretable evidence of the model’s effectiveness. 5. As shown in Table 10, the training cost of RelDiff is substantially higher than that of ClavaDDPM, raising concerns about the method’s scalability and practicality on large-scale datasets. Moreover, the paper does not report the memory cost across different datasets, which is important for assessing the overall efficiency and deployability of the proposed framework. [1] TabDDPM: Modeling tabular data with diffusion models. ICML2023 [2] Mixed-type tabular data synthesis with score-based diffusion in latent space. ICLR2024 [3] TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation. ICLR2025	Lightly AI-edited
RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper tackles relational data generation via graph diffusion that first resamples a D2K+ SBM foreign-key graph to preserve referential integrity, and then jointly denoises numerical and categorical attributes using a heterogeneous GNN, achieving state-of-the-art multi-table fidelity and up to an 80\% improvement in k-hop correlations across 11 real-world databases. Quality. The paper uses a combination of different techniques. First, it generates a graph via their D2K + SBM generator. Their generator is comprised of Bayesian SBM as a model of graphs + D2K graph generator to preserve relationships between nodes. Subsequently, they define a conditional hybrid diffusion process which generates categorical and numerical samples conditioned on the generated graph. Clarity. Paper is easy to follow. Significance. Paper looks at tabular data generation for relational databases. Originality. A conditional generation framework of integrating graphs could be interesting to the community. Overall, experiments and ablation studies are comprehensive, comprising of performance, runtime, computation and privacy. However, a concern I have is its novelty. Its a combination of existing well-known methods which I believe for the current standards of conferences like NeurIPS, ICLR and ICML, it may be insufficient. The main takeaway that the framework provides is that integrating graph based generators into diffusion models help provide extra signal to improve generative performance. Please see weaknesses.	Fully human-written
RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper tackles synthetic relational database generation. Rather than flattening multi-table schemas or generating tables in a pre-set order, the authors decompose the task into sampling a relational entity graph that respects foreign-key cardinalities and hierarchy using a microcanonical, degree-corrected Stochastic Block Model, and a joint, graph-conditioned diffusion model that denoises mixed-type attributes across all tables with a heterogeneous GNN. Training uses subgraph neighbor sampling and a hybrid continuous + categorical masking diffusion objective; sampling first draws a new entity graph with the SBM module, then jointly denoises attributes conditioned on the graph. On two benchmarks covering 11 real-world databases, the method reports stronger multi-table fidelity and good downstream RDL utility compared to ClavaDDPM, RCTGAN, SDV, RealTabFormer, TabularARGN, and PrivLava. 1. The modeling choices are well-motivated: microcanonical SBM gives hard constraints for referential integrity; hybrid diffusion aligns with mixed continuous/categorical columns; heterogeneous GNNs with subgraph sampling are a sensible scalability strategy. 2. Joint graph-conditioned diffusion over the entire entity graph, coupled with a microcanonical, nested SBM to preserve relational cardinalities and hierarchy, is a clean and compelling synthesis. 1. The baselines omit recent joint modeling approaches like GRDM (Graph-Conditional Relational Diffusion Model), which also performs joint denoising over relational graphs and reports strong k-hop performance. The paper positions prior work mainly as sequential/conditional (ClavaDDPM, etc.), but the landscape now includes joint graph-conditioned diffusion and flow-matching variants. 2. The nested SBM is well-motivated for modular schemas, but the paper preprocesses two-parent/no-child tables to many-to-many edges and then learns blocks and degrees under hard constraints. That can bias structure when schemas are weakly modular and may interact with functional dependencies in ways the SBM cannot express. 1. How sensitive are results to the neighborhood radius and the subgraph neighbor-sampling scheme during training and inference? 2. Many databases encode temporal edges. Can RelDiff model time-stamped relations and reproduce temporal integrity?	Fully AI-generated
Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The paper proposes VRFT-Aug, an RL-based post-training framework for medical vision-language models (VLMs). It augments (i) perception via (a) prompt/context augmentation with domain attributes and (b) implicit knowledge injection by first learning localization with RL and then transferring to classification; and (ii) reasoning via (c) a recitation reward that encourages/discourages repeating injected knowledge and (d) a multi-grade fuzzy reward (MFRS) for ordinal grading. Experiments on several MedMNIST tasks show consistent improvements over SFT and vanilla RFT, with ablations on prompt/context/ # Strengths 1. Clear decomposition of failure modes (perception and reasoning) and mapping to concrete training knobs (prompt/context, localization transfer, reward shaping). The four components are easy to reproduce conceptually. 2. MFRS alleviates sparse rewards and gives notable gains over binary accuracy rewards on grading datasets. 3. Ablation shows the effectiveness of penalizing recitation, which can generalize better than rewarding it (positive), and is a non-obvious but actionable insight for RL recipes in VLMs. # Weaknesses 1. Evaluation mainly on small/classification datasets; limited open-ended medical reasoning. Many reported wins are on MedMNIST-style classification and a few fine-grained sets; these are simpler than full radiology VQA or report-generation and do not stress long-form reasoning or clinical justification as strongly as prior medical RL papers. The paper’s strongest novelty claims (recitation reward design/sign; localization-to-classification transfer) would be more compelling on harder, free-form medical VQA benchmarks where MedVLM-R1/Med-R1 already set a high bar. 2. Rewarding/penalizing BLEU (n-gram) overlap with injected knowledge may (a) favor superficial copying or (b) punish legitimate paraphrase; the paper itself observes positive recitation converges to a “sub-optimal plateau,” underscoring metric-gaming concerns. Stronger signals (factuality/ontology verification or vision-grounded rationales) would better target reasoning quality. # Questions 1. Do your benchmarks mostly test perception rather than reasoning? Please quantify and justify the “reasoning” claim. 2. Is the recitation-induced sub-optimal plateau a consequence of your task suite and BLEU reward, or does it persist on free-form VQA?	Heavily AI-edited
Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper introduces VRFT-Aug, a visual reinforcement fine-tuning framework that augments both perception and reasoning for large vision–language models in the medical domain. It enhances the standard V-RFT objective by (1) augmenting prompts with task-relevant contextual knowledge, (2) injecting implicit perceptual priors into the policy, (3) shaping the reward through a recitation reasoning term, and (4) shaping a multi-grade fuzzy reward that mitigates sparse-reward issues. The paper makes a creative and well-motivated extension of RFT from LLMs to medical–language models. By bridging RFT and medical-language reasoning, this work could stand as a practical foundation for safe and interpretable medical AI system. The empirical improvements are consistent and meaningful. Its originality lies not in inventing a new algorithmic family, but in articulating a new decomposition of the RFT pipeline into perception, policy, and reward components, each augmented with domain-specific priors. Also, the algorithm yields clinically aligned reasoning behavior, not merely better accuracy. I like the methodology. It is rigorous and shows empirical strong with principled reward shaping. The paper is well-structured. Guiding from motivation to formal definitions enables readers understand the algorithm very clearly. In addition, Figure 1 visualizes the modular design of each component. W1. Incremental algorithmic novelty. - The four augmentations (prompt, policy, recitation, fuzzy reward) are conceptually coherent but individually modest extensions of known techniques. Prompt engineering, auxiliary localization, imitation control, and fuzzy reward shaping. The work’s strength is integration rather than theoretical innovation. Given this work aims for medical purpose, I can understand this concatenation of existing techniques tho. W2. Scalability and generalization not demonstrated. - The experiments use Qwen2.5-VL-3B, a moderate-scale model; the method’s computational overhead and transfer behavior on larger LVLMs (e.g., InternVL-20B or Gemini-Vision) remain unexplored. Similarly, all datasets are relatively small and well-curated; testing on noisier real-world hospital data would better support claims of robustness. -	Fully AI-generated
Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	To address the limited improvement in reasoning ability of multimodal large vision-language models (LVLMs) after reinforcement learning (RL) in the medical domain, this paper introduces two key enhancements: 1.A two-stage knowledge injection process to enrich the domain-specific medical knowledge of LVLMs. 2.The design of new reward functions to improve the model’s reasoning capabilities. More specifically, the paper proposes targeted modifications to three components of the original GRPO framework for optimizing LVLMs: 1.The prompts in the original RL training data are expanded using GPT, incorporating more medical terminology and clinical details. The experiments demonstrate that even by simply replacing the prompt, the method significantly alleviates the optimization bottleneck of GRPO for medical LVLMs, effectively injecting medical knowledge. 2.Before the main reinforcement learning phase, the policy model is trained on an auxiliary task that involves predicting bounding boxes based on medical image features, in order to improve its grounding in visual information. 3.On top of the standard GRPO reward functions such as accuracy reward and format reward, two additional rewards are introduced: recitation reward, which evaluates the extent to which the model’s reasoning path appropriately references the given prompt, and MFRS reward, a more lenient reward designed to better handle the verification of integer-type medical labels. This paper addresses the limitations of the GRPO method in enhancing the reasoning capabilities of medical multimodal large vision-language models by proposing three targeted improvements including prompt expansion, auxiliary visual tasks and novel reward designs which demonstrate clear practical value. In a previous conference, the reviewers had already raised concerns regarding related issues. However, compared to the previous conference, this paper has not addressed these issues, and the overall content remains consistent with the submission to the earlier conference. Therefore, the reviewers’ acknowledgment of the paper’s strengths and their concerns about its weaknesses are consistent with what was expressed in the previous conference： 1.Why does simply expanding the prompt lead to such significant improvements over the original GRPO, as shown in Table 1? In reinforcement learning, optimization signals originate only from the reward derived from the final outcome. Why is prompt modification able to achieve knowledge injection under an RL setting? Could the authors provide a brief mathematical explanation of this phenomenon and result? 2.Can the approach of knowledge injection through prompt expansion during reinforcement learning be generalized to domains beyond medicine? 3.Could the authors provide more experimental details about the reinforcement learning process, such as the dynamics of reward changes over time? 4.Regarding the RL baseline experiments, when comparing the effectiveness of the proposed method the question arises whether the policy model in the baseline was also trained with the bounding box prediction task. Considering that regional classification tasks are conceptually similar to bounding box prediction this raises concerns about the fairness of the experimental comparison. 5.Concerning the evaluation methodology, the paper states that many errors in the medical domain arise from knowledge gaps and methods like GRPO are generally used to enhance complex reasoning. The study could consider including tasks that more directly test complex reasoning abilities such as multimodal diagnostic scenarios in clinical settings rather than focusing on more traditional tasks like image classification and regional classification that conventional models can already handle. 6.The recitation reward seems particularly prone to reward hacking, such as the model repeating prompt content whenever the delta is positive. Could the authors provide additional experimental evidence or analysis to make this aspect more convincing? If the authors are able to address the above questions, the reviewer will consider raising the score. 1.There is a lack of mathematical explanation for knowledge injection through prompt modification. 2.Some experimental details are not provided (especially the RL process). If the authors are able to address the above questions, the reviewer will raise the score.	Lightly AI-edited
Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper investigates the application of Reinforcement Fine-Tuning (RFT) to Large Vision-Language Models (LVLMs) for medical image analysis, a process Visual Reinforcement Fine-Tuning (V-RFT). The authors argue that standard V-RFT methods fail in the medical domain because they require both robust visual perception (to see subtle cues) and structured reasoning (to apply clinical logic). The paper tackles a timely and significant problem: adapting reinforcement fine-tuning (RFT) for large vision-language models to the medical domain. 1. Disjointed Framework Evaluation 2. Limited Novelty of Components 1.Please clarify the exact mechanism for the $PA_{\pi}$ evaluation? How does a model trained only on localization to output bounding boxes perform zero-shot classification? 2.Why was the full VRFT-Aug framework, combining all compatible components (e.g., $PA_p + PA_{\pi} + \delta^-R_{recite}$), never evaluated? The disjointed experiments make it difficult to judge the synergistic value of the proposed methods. 3.Which definition of the $R_{MFRS}$ reward is correct for a 2-class difference: $1/10$ 27or $0.0625$28? 4.How does the “recitation” mechanism affect the linguistic diversity of outputs during reasoning? 5.Limited generalization evaluation—the experiments focus mainly on MedMNIST-like datasets; real-world clinical validation or higher-resolution benchmarks would strengthen claims. 6.How sensitive is performance to the weighting parameters (λ, α, γ, δ) in the composite reward function? 7.Are there any ethical or bias considerations in using GPT-4o for generating domain-specific medical descriptions?	Fully AI-generated
Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	This paper introduces X-GRAAD, a novel inference-time defense framework for detecting and mitigating backdoor attacks in pre-trained language models (PLMs). The method's core idea is that backdoor trigger tokens, when activated, abnormally dominate the model's attention and gradient attribution signals simultaneously. The proposed method operationalizes this insight by combining these two signals to assess the anomaly score of each token. It identifies malicious inputs by searching for a single "peak" token with an exceptionally high score. If such a token is found, it is neutralized via a noise injection mechanism before the model generates its final prediction. The authors demonstrate experimentally that this method effectively reduces Attack Success Rates (ASR) while maintaining high Clean Accuracy (CACC). The paper also highlights the method's explainability, showing its ability to localize trigger tokens. 1. Novel Core Hypothesis. The core hypothesis—that backdoor triggers manifest as strong, simultaneous anomalies in both attention and gradient channels—is an intuitive and novel insight. Combining these two distinct attribution modalities for anomaly detection is a strong starting point. 2. Interpretability. The method not only provides a defense but also offers interpretability via attribution scores (as shown in Figures 2 and 5), helping to localize and understand the behavior of backdoor triggers, which is a valuable feature. 3. High Practicality (Efficiency and Simplicity). As an inference-time defense, X-GRAAD requires no model retraining or fine-tuning. Its computational overhead is far lower than many existing model purification methods as shown in Table 4. This efficiency, combined with its implementation simplicity (relying on standard attribution tools), makes the method highly practical for real-world deployment and reproducibility. 1. Narrow and Simplistic Threat Model. The paper's primary weakness is that its strong performance claims are based on an evaluation against a very narrow and simplistic threat model. The experiments (Sec 5.1) almost exclusively use triggers that are short, non-semantic, rare words (e.g., cf, mb). These triggers are statistical outliers by design and are "easy" targets for any attribution-based anomaly detector. The evaluation completely omits more advanced, stealthy attacks such as semantic triggers (synonyms), syntactic triggers, or longer phrasal triggers, making it hard to assess the method's generalizability. 2. Potential Design Limitations and Vulnerability to Adaptive Attacks. The defense mechanism's design presents potential limitations. Its reliance on the max operator (Eq. 8) to find a single peak score appears vulnerable to "distributed triggers"—a plausible adaptive attack where an adversary uses multiple tokens, each with a low, non-anomalous score, to activate the backdoor. Furthermore, the generality of the character-level "noise injection" (Sec 4.2.2) is unclear. While suited for the simple tokens tested, it may be less effective or could potentially risk CACC against semantic triggers (e.g., changing "price" to "pride"). 3. Misleading "Robustness" Analysis. The paper fails to test its robustness against these obvious adaptive attacks. Section 5.2.3 is mislabeled as a "Robustness Analysis" when it is merely a hyperparameter sensitivity analysis (for the detection threshold $\tau$). A true robustness evaluation would have tested the defense against an attacker aware of its max-based design, using the very "distributed trigger" attack mentioned above. The absence of this analysis is a significant gap. 4. Limited Methodological Novelty. While the idea of combining attention and gradients is smart, the method itself is a relatively straightforward heuristic. The components used (attention maps and input gradients) are standard interpretability techniques, and their combination (a simple product) lacks significant methodological innovation. 1. The paper's positive results are based on triggers that are easily isolated (short, rare words), which aligns perfectly with the max operator-based detection (Eq. 8). How would X-GRAAD perform against "distributed triggers" where the backdoor is activated by multiple tokens (e.g., a phrase) that each contribute a small, non-anomalous score? 2. Following Q1, have the authors evaluated X-GRAAD against more stealthy triggers that are part of the natural language distribution, such as semantic triggers (e.g., a specific synonym replacing a common word) or syntactic triggers (e.g., a specific sentence structure)? 3. The robustness analysis in Sec 5.2.3 is a hyperparameter sensitivity test. Could the authors provide a more formal adversarial robustness analysis? Specifically, can they comment on how X-GRAAD would fare against an adaptive attacker who is aware of the max-based design and explicitly crafts a distributed trigger to bypass it? 4. Regarding the "noise injection" (Sec 4.2.2): What is its impact in two failure-case scenarios? (a) If the trigger is a semantic word, how is it neutralized? (b) If the model falsely identifies a critical, clean token (e.g., "price") as a trigger and corrupts it (e.g., to "pride"), what is the measured impact on Clean Accuracy (CACC)?	Fully AI-generated
Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper studies backdoor detection for LLM embeddings and proposes a framework to identify whether an embedding model is Trojaned based on contrastive probing and embedding-space consistency tests. The authors argue that backdoor attacks create detectable inconsistencies in the geometry of embedding space. They introduce an embedding residual consistency score that compares clean-prompt vs trigger-prompt embedding behavior without requiring model weights or activation access, and perform evaluations across multiple backdoored and clean LLM embedding models. Experiments suggest that the proposed scoring metric can distinguish Trojaned models across different triggers and poisoning rates while maintaining low false positives. 1. The residual-consistency metric is lightweight and does not require model internal access, making it potentially practical for model vetting. 2. The paper shows results across multiple backdoored settings and trigger types, demonstrating reasonable detection performance with low false-alarm rates. 1. Evaluations seem focused on standard patch/text triggers; emerging semantic or concept-level backdoors are not included, limiting robustness claims. Also, the defense might not work on the style / synthetic triggers. (1) Mind the Style of Text! Adversarial and Backdoor Attacks Based on Text Style Transfer; 2) Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger.) 2.Limited large-scale models: Most experiments appear to use medium-scale embedding models; testing with modern foundation embedding models would strengthen impact. 1. Have you tested the method against more subtle backdoors (e.g., syn backdoor or style backdoor attack) where the trigger is not simple phrase? 2. Can the method be extended to detect clean-label backdoors where the embedding shift might be less explicit?	Fully AI-generated
Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper proposes X-GRAAD, a novel inference-time defense mechanism designed to detect and mitigate backdoor attacks in PLMs. The central idea is that in backdoored models, trigger tokens disproportionately dominate both attention weights and gradient attributions. X-GRAAD leverages this insight to compute token-level anomaly scores by combining normalized attention and gradient signals. Sentences with high anomaly scores are flagged as suspicious, and the most anomalous tokens are neutralized by injecting character-level noise, thereby preventing trigger activation without retraining or modifying the model. 1. The method combines token-level attention and gradient score to compute anomaly score. 2. The experimental results show that the method consistently achieves state-of-the-art performance. 3. As an inference-time defense that requires no model retraining, fine-tuning, or complex pruning, X-GRAAD is more efficient than many other competitors. 4. The method not only defends but also explains its decisions by localizing the suspected trigger token through the anomaly score. 1. The trigger neutralization mechanism (random character-level perturbation) is relatively naive. While the results show it works, it feels less sophisticated than the detection mechanism. 2. The defense requires access to both attention weights and input gradients, which may not be available in many deployment scenarios (e.g., black-box APIs, closed-source models). The paper does not discuss how the approach generalizes to limited-access settings. 3. The detection threshold (e.g., 95th percentile of clean validation scores) is tuned manually and dataset-dependently. Line 327 notes that ALBERT requires a lower threshold (65th percentile vs. 95th) and shows slightly elevated ASR in one case (ALBERT-LWS on SST-2). 4. While the empirical results are strong, the paper lacks a clear theoretical justification for why the product of normalized attention and gradient magnitudes is an optimal anomaly indicator. A deeper analysis (e.g., statistical or information-theoretic motivation) would strengthen the contribution. 5. While a robustness analysis over anomaly thresholds is presented, there is no study on adaptive or adversarial countermeasures (e.g., trigger patterns designed to minimize gradient-attribution visibility). See weakness	Fully AI-generated
Optimal Pricing for Bundles: Using Submodularity in Offline and Online Settings	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper studies the problem of revenue-maximizing bundle pricing under a cardinality constraint, where the seller can choose up to k items to form a bundle and set a price, while buyers’ valuations are unknown and their purchasing decisions follow a choice model based on surplus value. The analysis is conducted under two data settings. In the offline setting, using historical data and assuming a Logit choice model, the authors identify near-optimal bundles and show that the submodularity of the bundle valuation function serves as an efficient criterion for selection. In the online setting, the paper proposes an algorithm to find the optimal bundle–price combination that maximizes revenue, achieving a regret of $T^{3/4}$ against an α-approximation of the optimal revenue. The paper analyzes revenue-maximizing bundle pricing under both offline and online settings. In the offline setting, it leverages historical data and a logit choice model to estimate customer valuations and efficiently identify promising bundles based on submodularity principles. In the online setting, where valuations are unknown and only sale feedback is available, the authors propose an algorithm with a $T^{3/4}$ regret bound. The paper not only provides rigorous theoretical guarantees but also validates the effectiveness of the proposed algorithm through simulation experiments based on real-world data models. 1. In the offline setting, the valuation function $V(S)$ is defined as a unified function shared by all customers. However, this approach overlooks the significant differences in customer preferences in real-world scenarios, which may lead to results that deviate from practical outcomes. 2. In the online setting, the paper does not compare its proposed algorithm with other similar approaches. As a result, the experimental validation is somewhat limited in its scope. In the offline setting, the paper mentions how to identify the most promising bundling combinations, but lacks detailed explanation on the process for determining the optimal pricing. Could the authors provide further clarification on the specific implementation or approach used for this?	Moderately AI-edited
Optimal Pricing for Bundles: Using Submodularity in Offline and Online Settings	Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The paper looks at bundle pricing in two scenarios - offline where you have historical purchase data, and online where customers arrive sequentially. The main idea is that products with "submodular" valuations (where the bundle is worth less than sum of parts) are actually good candidates for bundling because you can discount them and drive demand. They fit a quadratic model to retail data and show correlation between submodularity and revenue gains. For online they give an algorithm with T^3/4 regret. The submodularity angle is interesting and a bit counter-intuitive. I liked Proposition 1 even though it's only stated informally at first (more on this later...) Real data from an actual retailer is nice to see, even if the dataset is limited The theoretical result for online setting seems technically sound Figure 1 makes the main point pretty clearly My main concern is that the offline and online parts feel like two different papers. The offline uses a specific logit model with quadratic valuations, then the online suddenly switches to a completely general nonparametric model. Why? If the quadratic model works offline, why not use it online? If the general model is better, why bother with the quadratic one? The paper never really addresses this disconnect. The experimental validation is pretty weak honestly. For offline, they only look at beauty products from one store, and the baskets are tiny (average 1.19 products - see Fig 4). That's basically people buying 1 item most of the time. How do you even validate bundling when people rarely buy multiple items? For online there's only simulations using the fitted model, no real data. Would have liked to see comparisons to baselines too - how does this compare to just using UCB on bundles, or Thompson sampling, or even random exploration? The quadratic model seems pretty limiting. It's only pairwise interactions right? What if there are three-way or higher order effects? Like maybe shampoo + conditioner is fine, and shampoo + soap is fine, but shampoo + conditioner + soap together is redundant? The model can't capture that. The paper mentions this has 2^n degrees of freedom without structure but then the quadratic model has n^2 parameters which is still a lot when n=282 products and you only have 2148 receipts. That's a lot of parameters to fit with limited data... Theoretical issues: The α-regret thing is confusing. The paper says α ≤ 1-e^(-1) but then later it depends on κ_g which is never computed for any real application. So we dont really know if we're getting 0.63-regret or 0.1-regret or what. This makes it hard to evaluate if the bounds are meaningful. Also Assumption 1 about BP decomposition seems strong - when does revenue actually decompose this way? The paper just asserts it. When doesn't it compose? Proposition 1 / Lemma 1 mismatch is weird. In the intro it's stated generally but then the actual lemma only works for k=2 and requires V({x}) = V({y}). That's much more restrictive. Presentation: Some parts are hard to follow. The connection between sections 2 and 3 is abrupt. Also the paper introduces this elaborate importance sampling scheme in Appendix B.1 to handle the intractable denominator but doesn't really justify why this is the right approach vs other approximations. Also - no Appendix A? The pure bundling analysis (Figure 3b) is interesting - shows mixed bundling is better - but this isn't really developed. Seems like an important practical insight that gets buried. Missing related work: What about the assortment optimization literature? That seems very related. Also recent contextual bandit work with deep learning could be applicable here. Why not use the quadratic model in online setting? What's a typical value of κ_g? Is 0.5 realistic? 0.9? The dataset is very sparse - did you try any regularization besides ridge with λ=0.01? How does the explore phase scale - for n=1000, k=10, M=100, m=1000 that's like 10^7 rounds before you commit? Is that correct?	Fully AI-generated
Optimal Pricing for Bundles: Using Submodularity in Offline and Online Settings	Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper presents a framework for identifying revenue-maximizing product bundles and their optimal prices, exploring both offline and online settings. The core idea is that submodularity is an effective and sample-efficient criterion for finding promising bundles. Originality & Significance: The paper's primary contribution is framing the problem of revenue-maximizing bundle pricing. It studies this problem under a submodularity assumption. The work focuses on both offline and online settings. It provides an online learning algorithm for the online setting, while also conducting experiments in a data-rich offline setting. Quality: In the offline setting, the choice of a quadratic valuation model is well-justified as it has a direct connection to the submodularity of product pairs. The methodology for fitting this model and then using it to estimate the revenue impact of new bundles is sound and clearly explained. In the online setting, the authors use reasonable sets of assumptions (Assumptions 1 and 2) to develop a greedy algorithm (Algorithm 1) and provide a regret analysis. Clarity: The introduction provides good motivation, using relatable examples and building intuition for the core concept of submodularity. The distinction between the offline and online problems is sharp. Missing Proofs: This is the most critical weakness. The paper claims that proofs for Theorem 1 and Lemma 2 are in Appendix D. However, Appendix D only contains a proof for a different result (Lemma 3). The core theoretical results of the online section are therefore unsubstantiated. Lack of Formalism and Algorithmic Detail in Offline Setting: The exposition of the offline setting lacks rigor. - It focuses heavily on a motivating example rather than formally defining the revenue maximization problem and presenting a general algorithm to solve it. - Lemma 1 connects the submodularity gap to revenue increase for pairs of products. The authors claim this "facilitates the use of efficient approximation algorithms," but never describe such an algorithm for the general case of finding the best bundle of size $k$. Technical Novelty in Online Algorithm: Algorithm 1 is an "explore-then-commit" greedy algorithm. This is a well-established paradigm in the literature on online submodular maximization. While its application to the bundle pricing problem is novel, the paper could do more to delineate the specific technical innovations in the regret analysis compared to prior work. The key challenge here is jointly optimizing over the combinatorial set of bundles and the continuous set of prices. The paper handles this via discretization, but a more detailed discussion on the unique challenges posed by the pricing dimension and how the analysis addresses them would better highlight the technical contribution. Structural and Presentation Issues: The paper's structure and presentation could be significantly improved. - The online setting contains the paper's most substantial technical results and should be presented first. The current ordering buries the lead behind a less formal and less complete offline analysis. - The lengthy motivating example in Section 2 (as well as Figures 2 and 3) interrupts the paper's flow. This discussion would be better suited as a case study in the appendix, allowing the main body to focus on the core technical contributions. - The quality of the figures is low. Figure 2, which plots demand curves, is not very informative as it primarily shows that demand for individual items decreases as a competing bundle's price drops, which is an expected outcome. The figures need to be better designed to convey the key insights more effectively. - Furthermore, the appendix is disorganized. For example, Appendix G, titled "EXTRA PLOTS/EXAMPLES FOR OFFLINE SETTING," contains the proof for Theorem 2, which is interrupted by plots, making it difficult to follow. The appendix must be thoroughly reorganized and, most importantly, the missing proofs must be included. - The proofs for Theorem 1 and Lemma 2, which are the main theoretical results for the online setting, appear to be missing from the manuscript. The paper states that it is in Appendix D, but that section contains a proof for a different result. Could you please provide these proofs or clarify their location? - The analysis in the offline setting focuses heavily on a motivating example. Could you please provide a more formal problem definition for revenue maximization, given the learned quadratic valuation function? - You state that Lemma 1, which applies to bundles of size $k=2$, "facilitates the use of efficient approximation algorithms" for finding promising bundles in general. How does one use this property to search the combinatorial space of bundles efficiently? The paper does not provide an answer beyond the simple case study.	Heavily AI-edited
Optimal Pricing for Bundles: Using Submodularity in Offline and Online Settings	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This submission studies revenue-maximizing bundle pricing under a cardinality constraint for customers with unknown valuation functions. Two variants are presented: offline (learning from historical baskets, modeled with a logit choice model) and online (sequential interaction with sale/no-sale feedback). The goal is to identify promising bundles and their prices in the offline case, and to learn a near-optimal bundle–price pair with regret guarantees in the online case. More specifically, in the offline setting one is provided with prices of the single items and a list of bundles that have been bought by the customers. Then the goal is to learn from this data optimal prices for bundles. To make that feasible, a specific class of valuation functions is introduced that captures both submodular and supermodular valuations. This class has $n^2$ parameters if there are n items (one for each pair of items modelling their joint value). Then a method for learning these parameters based on gradient descent is provided and testet on a real-world data set. For the online setting, the submission proposes an explore-then-commit greedy approach that builds a bundle of size k greedily by estimating marginal revenue gains for candidate items and discretizes prices to search over a grid. Under typical assumptions on the pricing function, a regret bound of O ⁣(T^3/4n^1/4k) against an α-approximation to the hindsight optimum is proven. When the demand curve is concave in price, the regret improves to the order of T^5/7. Most of the proofs are technically sound and well written. In the online setting an algorithm with a provable regret bound is obtained. The part about the offline setting does not contain any particular interesting or new results. A natural model is introduced and a standard framework to learn its parameters is used. The algorithm for the online setting is also quite natural: it starts with an exploration phase, in which a greedy algorithm is applied to obtain a good bundle-price combination. Then the submodularity guarantees that the greedy algorithm finds a good solution. I find the online model also somewhat questionable because each buyer is presented only with a single bundle that she can take or leave. In the motivating examples in the paper, a model in which multiple bundles are offered and one can also buy single items separately seems more realistic. Remark 2, which claims that the demand function $Pr[U(S) \geq p]$ is continuous if and only if $rev(p, S)$ is 1-Lipschitz, is likely incorrect or, at least, requires further justification. This property is used in the greedy algorithm's analysis to provide an upper bound for the error introduced by price discretization. For this purpose, it would be sufficient to show $L$-Lipschitz continuity for some finite $L > 0$, which seems more plausible.	Fully human-written
ProSAR: Prototype-Guided Semantic Augmentation and Refinement for Time Series Contrastive Learning	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper proposes ProSAR, a prototype-guided semantic augmentation and refinement framework for self-supervised time-series representation learning. Built upon the information-bottleneck principle, ProSAR co-designs learnable prototypes and data augmentations to preserve task-relevant temporal semantics while discarding noise. Specifically, it introduces (1) prototype-conditioned semantic segment extraction via DTW alignment, (2) targeted augmentation in both time and frequency domains, and (3) a dual-prototype refinement loop linking latent and time-domain prototypes through decoding consistency. PProSAR is conceptually elegant and highly readable. It connects information-theoretic augmentation design with prototype learning, offering both interpretability and empirical strength. The co-design of prototypes and augmentations is well-motivated, and the dual-loop refinement provides a unified view bridging input and latent spaces. 1. While the framework is grounded in the information-bottleneck principle, the derivation stops at intuitive propositions. There is no formal proof that the proposed co-optimization converges or that the learned prototypes indeed approximate the latent semantic variable. 2. The use of DTW for semantic segmentation is computationally intensive (O(T²)), which may limit scalability for long sequences or large datasets. The paper should quantify training cost and discuss potential accelerations. 3. The framework introduces multiple components, yet lacks sensitivity analysis. See the above Weaknesses and the following: 1. What is the actual computational cost of DTW-based segmentation per epoch? Have you tried Soft-DTW or pruning techniques to improve efficiency? 2. How would ProSAR handle irregular or very long time series? 3. Is the prototype refinement stable under streaming or online updates? 4. Could the prototype-guided augmentation concept generalize to other modalities (spatial-temporal graphs, videos) or multi-domain transfer tasks? 5. Can you provide qualitative examples showing what a “semantic prototype” represents—e.g., typical waveform patterns or frequency signatures? 6. Provide a more formal analysis (e.g., gradient coupling or fixed-point stability) to justify the convergence of the prototype–augmentation co-design process?	Fully AI-generated
ProSAR: Prototype-Guided Semantic Augmentation and Refinement for Time Series Contrastive Learning	Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper presents ProSAR, a self-supervised framework that integrates information-theoretic principles with learnable prototypes to guide semantic augmentation for time-series contrastive learning. It introduces prototype-conditioned segment extraction using DTW and a dual prototype refinement loop between latent and time-domain prototypes. Experiments on forecasting and classification benchmarks show consistent improvements over recent SSL baselines. S1. The paper clearly identifies the limitation of heuristic or random augmentations in time-series CL and grounds its design in an information-bottleneck formulation. S2. Results across both forecasting and classification tasks are strong and consistent, with comprehensive ablations demonstrating component contributions. S3. The paper is generally well-written and the framework diagram effectively illustrates the mechanism. W1. The idea shares conceptual similarities with prior prototype-based methods (e.g., MHCCL, AimTS); the contribution is more an integration than a fundamentally new paradigm. W2. The DTW-based semantic segmentation and dual refinement introduce substantial computational cost and hyperparameter sensitivity. W3. Although prototypes are claimed to be “semantic,” the paper provides minimal qualitative analysis of what semantics they actually capture. W4. Lacks comparison with large-scale pretrained or generative SSL frameworks. Q1. How computationally expensive is the DTW-based segmentation step? Can the framework scale to large datasets such as Traffic or PEMS? Q2. The method introduces both time-domain and latent-space prototypes. How sensitive is the performance to the number of prototypes or their initialization? Q3. Can the learned prototypes be visualized or qualitatively analyzed to confirm that they correspond to meaningful temporal semantics rather than cluster artifacts?	Fully AI-generated
ProSAR: Prototype-Guided Semantic Augmentation and Refinement for Time Series Contrastive Learning	Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	ProSAR (Prototype-Guided Semantic Augmentation and Refinement) is a self-supervised learning framework developed for multivariate time series (TS) contrastive learning (CL), addressing the limitation that standard hand-crafted augmentations risk destroying critical temporal cues and semantic content in noisy, non-stationary TS data. ProSAR’s approach is founded on an information-theoretic principle derived from the Information Bottleneck, aiming to generate augmented views that maximize the information about an associated semantic prototype (P) while discarding content irrelevant to that prototype. This objective is implemented using learnable time-domain prototypes as explicit semantic anchors, which guide the identification of temporal characteristic segments in the input time series (x) via Dynamic Time Warping (DTW) alignment. Experimental evaluations on diverse benchmarks demonstrate that ProSAR achieves superior performance in learning discriminative representations, attaining the highest mean accuracy (0.764) and the best mean rank (1.867) on the UEA multivariate time series archive for classification, and consistently surpassing comparison methods in forecasting tasks. The submission is written clearly and is well structured, making the main ideas and technical contributions easy to follow. The motivation is articulated convincingly, and the authors provide sufficient context for why the problem is relevant and timely. Additionally, the related work section is thorough and appropriately cited, demonstrating a solid understanding of the existing literature and situating the proposed approach within the broader research landscape. Overall, the presentation is polished and the narrative is coherent and well motivated. W1. While the paper is generally well written, the claimed novelty of the proposed approach is not clearly articulated or sufficiently demonstrated. The authors state that their method offers better prototypes with better semantics, but it remains unclear how these prototypes differ from or improve upon existing prototype-based contrastive learning approaches. The manuscript would benefit from a more explicit and detailed discussion of what is fundamentally novel, ideally supported by conceptual distinctions, empirical evidence, or ablations that isolate the proposed contribution. For instance, with the expressions in Lines 72 - 75, as well as Lines 92 - 96, it remains unclear how the proposed prototype-based anchors significantly differ from existing ones, and it remains unclear how the proposed method can improve the interpretability, contrallability of the anchors. W2. For Line 161, "these prototypes are dynamically refined to steer the augmentation policy"; however, existing prototype-based or clustering-based constrastive learning approaches also dynamically update the prototypes. What are the key differences? W3. It remains unclear how the proposed prototypes substantively differ from those used in existing prototype-based methods. The paper would benefit from a clearer explanation of the conceptual or algorithmic distinctions. In addition, the experimental evaluation could be strengthened by including comparisons with a broader range of prototype-based and clustering-based contrastive learning approaches, which would help more convincingly demonstrate the advantages of the proposed method. Please see Weaknesses above	Lightly AI-edited
ProSAR: Prototype-Guided Semantic Augmentation and Refinement for Time Series Contrastive Learning	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper proposes ProSAR, a prototype-guided semantic augmentation and refinement framework for time series contrastive learning. The core idea is to jointly learn data augmentation strategies and semantic prototypes under an information-theoretic constraint: time-domain prototypes obtained through DTW alignment guide the generation of augmented views, applying different perturbations to semantic and non-semantic segments, while latent-space clustering and decoding consistency iteratively refine the prototypes. The proposed method aims to produce semantically consistent yet diverse views, achieving superior performance to self-supervised baselines such as AutoTCL and FreRA on both forecasting and classification benchmarks. Incorporating learnable prototypes into the data augmentation process represents a meaningful conceptual innovation, breaking through the limitations of fixed or purely heuristic augmentation strategies in traditional contrastive learning. The proposed dual-prototype mechanism—comprising time-domain and latent-space prototypes—and its iterative refinement loop demonstrate a coherent and logically consistent system design. The authors did not compare ProSAR against several representative time-series representation learning models such as TSLANet and AimTS. Compared with these baselines, the reported results are not particularly competitive. Although the idea of prototype-guided augmentation is interesting, the overall contribution appears incremental. The differences between ProSAR and prior works like AutoTCL and AimTS remain relatively small. The claimed interpretability is unconvincing—especially the visualizations shown in Section D.4, which do not make much sense and fail to clearly demonstrate semantic consistency or prototype meaning. The framework consists of multiple submodules (DTW segmentation, STFT alignment, dual prototypes, clustering, and decoding consistency), yet the ablation studies only examine the augmentation operation and the semantic segmentation, which is insufficient to validate the contribution of each component. The reliance on DTW alignment and clustering could introduce significant computational overhead for long or high-frequency sequences. Although the paper briefly acknowledges this issue, it lacks a concrete analysis of time complexity or runtime performance. See weakness	Moderately AI-edited
DAG-Math: Graph-Guided Mathematical Reasoning in LLMs	Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper introduces DAG-MATH, which is a framework to evaluate mathematical reasoning of LLMs through modeling Chain-of-Thought as stochastic process over directed acyclic graphs. The authors propose metric called "logical closeness" for distinguishing between when model solves problem by search versus real logical inference. They make benchmark with 2,894 gold-standard DAGs and test five LLMs, discovering that although PASS@1 scores vary much between models, their perfect reasoning rates stay quite similar, which suggests search is inflating the accuracy metrics. The paper tackles really important problem about understanding whether LLMs achieve correct answers through systematic search or through genuine logical reasoning, which is fundamental question for the field. The DAG-based formalization is quite novel approach that sits nicely between completely free-form CoT and very formal systems like LEAN verification, making it more practical to use. The logical closeness metric gives us insights that go beyond simple PASS@k metrics that everyone uses. The empirical analysis is quite comprehensive, showing interesting patterns about how DAG statistics like number of nodes, edges, density and branching correlate with problem difficulty. It reveals that harder problems create larger and more sparse graphs with higher branching, which makes sense intuitively. The finding that search and exploration inflate PASS@1 while actual reasoning ability measured by PRR stays comparable across different models is really actionable insight that changes how we should think about evaluating these systems. Also good that authors released the benchmark and code for others to use. There is concerning circularity in how the benchmark was constructed - using GPT-4 and Qwen to create the "gold standard" DAGs means the benchmark is essentially created by same type of models that are being evaluated, which introduces obvious biases. The theoretical justification feels not enough developed. Why should we believe this specific DAG formalization captures what "true" reasoning means? The stochastic process described in Equation 1 seems somewhat arbitrary choice without proper justification. The statistical analysis lacks rigor with only 32 samples per problem and no significance testing provided for the differences claimed between PASS@1 and PRR. There is no analysis about robustness - what happens when same problem can be formulated with different but equivalent DAG structures? The scope is quite limited, restricting to mathematics problems with difficulty below 6, and missing comparisons with other approaches like process reward models that also try to do step-level verification. The three-stage prompting methodology might be imposing particular reasoning patterns that are not universal. Most importantly, there is no human validation beyond what the models themselves produce, which is problematic for claiming these are "gold standard" solutions. How did you validate that the DAGs generated by GPT and Qwen actually represent correct reasoning structures? It seems crucial to have human experts verify at least subset of these DAGs to ensure the benchmark quality. Many mathematical problems have multiple valid solution approaches - how does the logical closeness metric handle cases where completely different but valid DAG structures could exist for same problem? Could you provide proper statistical significance tests for the differences you claim between PASS@1 and PRR? How sensitive are all these results to the specific prompting strategies you used? If you changed the prompts slightly, would the DAG structures and evaluation results change significantly? What is the practical path from these evaluation insights to actually improving LLM reasoning capabilities? The paper identifies interesting patterns but doesn't suggest how to use this knowledge for making better models. How does this framework compare empirically with recent work on process reward models from OpenAI and others that also try to verify reasoning at step level? Why did you choose to require exactly one assertion per node - this seems quite restrictive and arbitrary? And why make logical closeness binary measure instead of having gradient that could capture partial correctness better? Finally, have you considered that the models might be following completely different internal reasoning process and the DAG structure is just post-hoc rationalization that we impose on their outputs?	Fully AI-generated
DAG-Math: Graph-Guided Mathematical Reasoning in LLMs	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper introduces DAG-MATH, a framework designed to formalize and evaluate the Chain-of-Thought (CoT) trajectories generated by Large Language Models (LLMs) in mathematical reasoning. The CoT is modeled as a rule-based stochastic process over a Directed Acyclic Graph (DAG), where nodes are intermediate states. The core proposal is the Logical Closeness metric, which quantifies the fidelity of an LLM's path against a "gold standard" DAG, yielding the Perfect Reasoning Rate (PRR) and AUC scores. The authors claim this provides a superior diagnostic tool compared to simple final-answer metrics like PASS@k. The resulting DAG-MATH benchmark is built using LLM-generated structured outputs. 1. The underlying idea of treating CoT as a DAG traversal is fundamentally sound and offers a pathway for structured reasoning analysis beyond token-level checks. This is the paper's primary and most important strength. 2. The authors have created impressive few-shot prompts to enforce their complex output format, which is a valuable demonstration of structured generation control in LLMs. The visual examples of the DAGs are convincing. 3. The metric correctly isolates failure modes like speculative branching and imperfect reasoning, which are invisible to simple PASS@k. 1. The PRR/AUC metric confuses adherence to the authors' custom template with true logical reasoning ability. The paper must provide evidence that this metric holds up when applied to non-formatted, naturally generated CoT. 2. A critical omission is the lack of comparison with or contextualization against MCTS or similar graph-based search methods. If the goal is to improve reasoning, how does the DAG-MATH diagnosis inform or relate to these established LLM search strategies? 3. The use of LLMs to generate the ground truth DAGs for their own evaluation introduces a circular dependency. This casts significant doubt on the objectivity and reliability of the Logical Closeness scores. 4. The Acyclicity Assumption restricts the framework to simple forward derivation, excluding crucial reasoning patterns like planning, iterative refinement, or proof by contradiction, thereby limiting its general applicability. 1. Can you demonstrate the utility of PRR/AUC by heuristically parsing DAGs from unconstrained, free-form CoT outputs (without the DAG-MATH template) on a subset of problems? If the metric collapses here, it confirms the dependency on the template is too strong. 2. Please elaborate on the relationship between DAG-MATH's diagnostic insights and existing MCTS/Tree-of-Thought techniques. How can the PRR/AUC scores be used to guide the search policies or reward functions in such systems? 3. Given the reliance on LLM-generated ground truth, what specific human review or verification process was applied to the 2,894 gold-standard DAGs to ensure their canonical logical structure? What were the human agreement statistics on the logical decomposition?	Fully AI-generated
DAG-Math: Graph-Guided Mathematical Reasoning in LLMs	Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The paper proposes to model mathematical reasoning’s CoT traces as a rule-based stochastic process over task-specific DAGs, where nodes represent reasoning states and edges encode inference rules or justifications. Building on this formulation, the paper introduces logical closeness as a new metric to evaluate the model's reasoning trajectory. It also presents a benchmark of gold-standard DAG-MATH graphs with verified logical structures and statistical analyses relating graph properties to problem difficulty. Finally, the paper evaluates several large language models by prompting them to produce formatted CoT reasoning on mathematical benchmarks and examines how their reasoning abilities correlate with the proposed DAG-MATH framework. - The paper is clearly written and well-organized. - The idea of representing CoT reasoning with DAG-MATH is interesting and novel. The proposed metrics are also new and conceptually sound. - The empirical results are informative, showing how graph structures reflect problem difficulty and reasoning quality. - Enforcing the DAG-MATH format may degrade the natural reasoning flexibility of LLMs. It would help to include an analysis or ablation comparing performance with and without this formatting constraint. Furthermore, if the few-shot examples are drawn from a specific model family, models of the same family might have an advantage because their reasoning patterns are similar. - The analysis is primarily quantitative. Some qualitative examples or case studies of the generated DAG-MATH graphs, especially highlighting common reasoning errors or structural failures, would strengthen the insights. - In Section 2.2, the paper mentions that thinking LLMs can be viewed as “an exploration of the task-specific DAG with self-correction or backtracking, but its final output … is still consistent with our transition rule.”, but empirical results did not include reasoning models, which could strengthen the value of the paper. - DAG-MATH has limited scalability. Complex problems with very long and multiple reasoning traces are computationally expensive to construct. Those with cyclic reasoning or backtracking are hard to capture in a strictly acyclic form. - How exactly is the branching of reasoning paths determined? - Since the canonicalization turns reasoning steps into SNF, does it limit the type of mathematical questions that DAG-MATH can apply to? - In lines 63-64, what does it mean that the other works “fail to capture long-range and cross-branch dependencies, as well as the goal-directed, absorbing-state nature of CoT”?	Lightly AI-edited
DAG-Math: Graph-Guided Mathematical Reasoning in LLMs	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper proposes a DAG-based framework for representing CoT reasoning and introduces the concept of logical closeness, enabling fine-grained evaluation of LLM mathematical reasoning beyond final-answer accuracy. Its approach assesses the coherence and consistency of logical dependencies along a CoT trajectory, rather than focusing solely on whether the solution is correct. Furthermore, the authors construct a benchmark of DAG-formatted mathematical problems derived from existing datasets and provide empirical analyses linking graph-level characteristics—such as size, density, and branching complexity—to problem difficulty. - This work formalizes the notion of logical closeness and proposes a metric, the perfect reasoning rate, based on this notion to measure LLMs' logical consistency beyond the final output. This indeed addresses an important question of whether an LLM arrives at a correct answer through genuine logical reasoning or mere pattern matching. - The formalization in Sections 2 and 3 is clearly presented and easy to follow. - Section 5 and Appendix B offer several interesting insights, such as the correlation between graph structure and problem difficulty, and how a correct final answer may still arise from unclosed or flawed reasoning. - The DAG-MATH benchmark presented in the paper is validated using symbolic correctness and an LLM-as-Judge approach. I assume that SymPy is employed to verify mathematical equivalence, while the logical dependencies between nodes (i.e., whether an edge should exist) are assessed by the LLM-as-Judge. However, as the paper itself demonstrates, LLMs can often produce superficially consistent but logically inconsistent reasoning. While using LLMs for judgment is a practical solution, it would be helpful for the authors to further justify the reliability of this dataset construction and evaluation methodology. - Building on this point, reliable automation of the logical closeness check appears challenging, if not infeasible, since formalizing a DAG from natural-language CoT inherently involves subjective interpretation, particularly for more complex problems. For instance, reasonable disagreement could arise over whether a given edge should connect two specific nodes. - How can we trust an LLM-as-Judge to reliably evaluate the logical coherence of DAG constructions, particularly when edges are intended to represent valid inference paths? Have the authors conducted any analyses or validation studies to assess the consistency or accuracy of these judgments? - At first glance, Figure 4 being a line plot was somewhat confusing. Do the authors plot accuracy against varying levels of logical correctness rates and then smooth the resulting curve? A brief clarification in the caption or text might help. - I also wonder whether different node segmentation choices could exist for the same reasoning trajectory. If so, how sensitive are the proposed approach and the PRR metric to such segmentation differences?	Lightly AI-edited
Evaluating LLM In-Context Few-Shot Learning on Legal Entity Annotation Task	Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This work investigates how to leverage the in-context few-shot capabilities of LLMs for named entity recognition in the legal domain (legal NER) and proposes a complete annotation workflow: constructing a Minimal Golden Dataset (MGD) , building an Examples Database from the MGD , conducting few-shot prompting with three example selection strategies (random, similarity-based, and clustering-based) and different numbers of examples, constructing separate prompts for each entity and merging the results. The authors conducted comprehensive experiments on six open-source/closed-source LLMs (e.g., Gemini 1.5 Pro/Flash, Llama 3.1, etc.) using a Portuguese corpus based on decisions from the Brazilian Supreme Federal Court. Two evaluation criteria, strict-match and relaxed-match, were employed for performance comparison. 1. This work systematically applies the in-context few-shot method to Portuguese legal NER and validates it on a real-world, large-scale judicial corpus. 2. The paper conducts a systematic comparison of the sensitivity between "example selection strategies" (random, similarity-based, clustering-based) and the number of examples, and combines it with cost-benefit analysis, which constitutes a relatively practical contribution. 3. The experiments cover a wide scope: 6 different models, four entity categories, validation/test splits, repeated trials, and statistical tests (ANOVA, Kruskal-Wallis, post-hoc tests), featuring rigorous methodology and experimental design. 4. Manual review is conducted to analyze differences between model annotations and human annotations, providing qualitative insights rather than mere numerical comparisons. 1. Insufficient baseline evaluation: The paper fails to directly compare the performance of few-shot LLM with that of traditional supervised learning (fine-tuning on a small number of samples) or weak supervision methods. 2. Insufficient in-depth analysis of the "boundary" issue and marker error handling: There is a significant gap between strict and relaxed settings (with a lower strict score for Precedent), yet the paper mainly reports the overall F1 score and lacks fine-grained statistics on boundary error types (truncation, over-length, and misalignment). Additionally, only quantitative descriptions are provided regarding the sources of Marker (@@ … ##) errors and specific repair strategies. 3. Inadequate description of the dataset and generalization: Evaluation is only conducted on judicial documents from the Brazilian Supreme Federal Court (STF) (despite the large corpus size), while the model's robustness in other legal text genres, other jurisdictions, or under noisy conditions (e.g., OCR errors) is not assessed. The paper mentions that some typical formats (such as LC 78/93) may conflict with precedents, but no targeted data augmentation or pattern normalization methods are proposed. 1. Complete Prompts and Examples: Could the complete prompts (including system/user instructions, the order of examples, and whether explanatory text is included) and sample outputs used for each entity be provided in the appendix or code repository? (This directly affects the reproducibility of both the experiment and prompt engineering.) 2. Reasons for GPT-4o mini’s Poor Performance: Could more specific diagnoses be provided (e.g., whether GPT-4o mini has truncation/API limitations, or if its training corpus is insufficient to cover Brazilian legal terminology)? 3. Bias of Heuristics for Boundary Handling: The paper uses a priority order (Person > Legislative > Precedent > Academic) to resolve overlaps. Could you demonstrate how this priority order affects the recall/precision of the final Person entity (e.g., does this heuristic lead to over-coverage or under-coverage of Person)? Additionally, what would the results be if this priority is replaced with other strategies such as "longest match first"? 4. Comparison of Manual Efficiency/Cost: You have provided the model’s token-based pricing and cost-benefit analysis. However, could you supplement this with a comparison of time/manual costs: after using LLM to assist with annotation, what is the average time required for manual correction? How does this compare to the total time/cost of pure manual annotation? This will directly support the claim of "cost savings." 5. Error Examples and Repair Strategies: During manual review, the model outperformed humans in 20% of cases (indicating that the model identified missed annotations or errors made by humans). Could these cases be categorized, and could automated repair suggestions be provided?	Lightly AI-edited
Evaluating LLM In-Context Few-Shot Learning on Legal Entity Annotation Task	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper evaluates the capability of Large Language Models (LLMs) for in-context few-shot learning on legal Named Entity Recognition (NER) tasks, specifically focusing on Portuguese legal documents. The authors propose a new annotation process that leverages LLMs to identify four types of legal entities: Academic Citations, Legislative References, Precedents, and Persons. The study uses the most extensive Portuguese corpus for legal NER and evaluates six different LLMs in various configurations. The experiments test different example selection strategies (random, similarity-based, and clustering-based) and varying numbers of examples (4, 8, 16, and 32) to determine optimal prompting approaches. The best-performing model achieved an F1 score of 0.76 using relaxed matching criteria. Additionally, a manual review of divergent annotations revealed that LLMs correctly identified entities missed by human annotators in 20% of cases, highlighting their potential to assist in the annotation process. Practical application in a specialized domain: The research addresses a real-world challenge in legal text processing, particularly for non-English languages where annotated resources are limited. Comprehensive evaluation methodology: The authors test multiple LLMs, example selection strategies, and example quantities, providing robust insights into optimal configurations for legal NER tasks. Detailed error analysis: The manual review of annotation discrepancies offers valuable insights into both LLM limitations and potential improvements to existing human annotation processes. Cost-benefit analysis: The paper includes a practical assessment of the cost-effectiveness of different models, considering both performance and computational expenses. Limited language scope: While the focus on Portuguese legal documents addresses a gap in the literature, the findings might not generalize to other languages with different legal systems and terminologies. Reliance on existing annotations: The evaluation uses human-annotated data as ground truth, but the manual review reveals inconsistencies in these annotations, which could affect performance metrics. Lack of comparison with fine-tuned models: The paper doesn't compare the few-shot learning approach with traditional fine-tuned NER models, which would provide more context about the relative advantages of in-context learning. Entity type imbalance: The dataset contains significantly fewer Academic Citations compared to other entity types, which could affect the reliability of performance metrics for this category. Limited exploration of prompt variations: While the paper tests different example selection strategies and quantities, it doesn't explore variations in prompt structure or entity descriptions that might impact performance. How would the performance of the proposed approach compare to fine-tuned domain-specific NER models, and what are the trade-offs in terms of computational resources, data requirements, and accuracy? How might the findings generalize to other legal systems and languages with different legal terminologies and citation formats? Could the approach be extended to handle more complex nested entity structures, especially considering that the original corpus includes fine-grained nested entities? How sensitive is the performance to variations in the entity descriptions provided in the prompts, and could optimizing these descriptions further improve results? Given that LLMs correctly identified entities missed by human annotators in some cases, could an iterative annotation process that combines human and LLM inputs lead to higher quality annotated datasets?	Fully AI-generated
Evaluating LLM In-Context Few-Shot Learning on Legal Entity Annotation Task	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	Summary The paper evaluates large language models (LLMs) for named entity recognition (NER) in Portuguese legal documents, focusing on four coarse-grained entity types. The stated motivation is to build a pipeline that could assist human annotators in labeling legal texts. The pipeline handles long-document segmentation and compares six off-the-shelf LLMs (both open and closed-source) under zero-shot and few-shot in-context learning. Results show that LLMs achieve competitive F1 scores even with random few-shot sampling, and that retrieval-based selection (RAG-style) offers no measurable benefit. Overall Recommendation: Reject While the paper is clearly written and applicable for this niche usecase at a surface level, it lacks originality, depth, and analytical rigor for ICLR. Its findings that LLMs perform reasonably on few-shot NER and that random sampling rivals similarity-based retrieval, are well-established in prior work. The human-in-the-loop (HiTL) framing is also not substantiated through workflow design, user studies, or quantitative cost analysis. The result is an incremental replication of known patterns in a narrow domain. Reviewer LLM Usage: I have read the paper in full and written the review myself. Large Language Models (LLMs) were used only for writing polish, clarity improvements, and to refresh memory of (public) related work or references. The analysis and conclusions are entirely my own. 1. Comprehensive model coverage: Evaluates six distinct LLMs, both closed- and open-source, providing useful comparative evidence. 2. Clear preprocessing pipeline: Long-document segmentation and sentence splitting are described clearly. 3. Interesting premise: The hypothesis that models trained on judicial corpora may encode “judicial reasoning” offers a speculative direction for future exploration. 1. Lack of novelty and analytical depth: The paper confirms previously known findings: LLMs perform adequately for few-shot NER, and retrieval-based example selection rarely outperforms random sampling. No new prompting paradigm, retrieval method, or model adaptation technique is proposed. 2. Partial and shallow error analysis Although the authors provide per-entity metrics, special-marker error counts, and a manual review of 193 misclassifications by five annotators, they stop short of deeper analyses such as confusion matrices, span-offset distributions, or nested-entity handling. 3. Unsupported HiTL framing The work’s central claim, assisting human annotators, is not explored at all. There is no workflow description (pre-seeding, confidence based triaging to humans, uncertainty sampling, etc.), no usability study, and no productivity or cost metrics. The only cost discussion concerns API token pricing, not annotation effort. Thus, the HiTL style assisted annotations claim remains unsubstantiated. 4. Missing supervised baselines and limited data No fine-tuned NER baselines (e.g., BERTimbau [Souza et al., 2020], XLM-R) are compared. BERTimbau is used only as an embedding encoder for retrieval. Moreover, evaluation covers just 5 validation documents and 53 test documents, which is insufficient for stable generalization. 5. Narrow scope and simplified labeling By restricting the task to four coarse entity types, the study avoids typical hierarchical NER challenges like nested entities, sub-type confusion, and overlapping spans. Consequently, the reported performance overestimates true real-world difficulty for their task which also has a second layer (which is considered out of scope). 6. Missing literature and context The paper omits key prior work that already demonstrates effective few-shot and prompt-based NER: TANL [Paolini et al. 2021], PromptNER [Jie & Lu, 2022], InstructUIE [Wang et al., 2023] and cross-lingual benchmarks like XGLUE [Liang et al., 2020] and LEXTREME [Niklaus et al., 2023]. This omission overstates the novelty of the contribution. 7. Retrieval finding lacks explanation Statistical tests (ANOVA and Kruskal-Wallis) confirm no significant difference between random, similarity, and cluster-based selection, yet the authors provide no analysis of retrieved-example quality or similarity distribution to qualitatively explain intuitively why this occurs. 8. Terminology The phrase “mixing the LLM as independent agents by entity” simply denotes per-entity model ensembling not genuine multi-agent reasoning or tool use. So rigor in terminology is diluted. 1. Replace references to “multiple agents” with “multiple LLM runs” or “ensemble of LLMs.” 2. Consider visualizing error types or entity overlaps for interpretability. 3. Also, consider exploring multiple LLM ensemble further. 4. Can you clarify whether few-shot examples were sampled from within the same document type or across different decision types, as this could impact generalization?	Fully AI-generated
Evaluating LLM In-Context Few-Shot Learning on Legal Entity Annotation Task	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper studies the problem of legal entity recognition and presents an LLM-based method. 1. The topic is practically interesting and useful. 2. The authors have considered and tested several different LLMs. 1. This work is more engineering than research. The technical part is quite high-level. I don’t see research-level insights or designs (especially the rationale of designs), but only engineering-level descriptions and examples. 2. Limited technical depth regarding the standard of ICLR. More specifically, for RQ1, the authors investigated three simple strategies for prompt engineering with no significant difference. For RQ2, the authors simply tried 6 LLMs. Of course, one may try more, given sufficient time. For RQ3, the strict-match and relaxed-match methods are sort of standard routines in practice. See detailed comments.	Fully human-written
UMCI: A Unified Counterfactual Framework for Robust Vision-Language Reasoning	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper presents UMCI (Unified Multi-round Counterfactual Inference), an inference-time framework aimed at improving the robustness of large vision–language models (LVLMs).UMCI unifies visual and textual counterfactual reasoning within a single process: the model performs multiple rounds of counterfactual queries and aggregates the resulting logits to produce more stable predictions. Building on previous causal inference methods such as TIE, VCD, and M3ID, the authors extend these approaches by incorporating textual counterfactuals generated through simple templates—covering tone variation, language switching, and role prompting.The framework is evaluated on two representative LVLMs, LLaVA-NeXT and Qwen2-VL, across multiple multimodal benchmarks.Results show small yet consistent gains in robustness. The paper also provides detailed experimental settings and introduces a dynamic benchmark to test sensitivity to linguistic and visual perturbations under realistic conditions. In summary, UMCI offers: 1）A unified, multi-round inference framework for enhancing LVLM robustness. 2）Joint treatment of visual and textual counterfactuals within a causal reasoning paradigm. 3）Transparent and reproducible experiments on two distinct LVLM architectures. 4）A discussion of inference-time efficiency and robustness trade-offs. 1）Clear Motivation and Unified Framework The paper addresses an important yet underexplored problem—how to improve the inference-time robustness of large vision–language models (LVLMs) without retraining. By unifying visual and textual counterfactual reasoning under a single causal formulation, UMCI offers a coherent and conceptually clear framework that extends prior causal decoding approaches such as TIE, VCD, and M3ID. This unified perspective helps bridge previously separate research lines in visual and linguistic robustness. 2）Transparent and Reproducible Experiments The experiments are fully transparent, with detailed descriptions of datasets, model architectures, and all hyperparameters. The study relies entirely on open-source LVLMs (LLaVA-NeXT and Qwen2-VL) and public toolkits, ensuring that all reported results are easily reproducible and verifiable. Such experimental rigor adds credibility and makes the framework accessible for future benchmarking. 3）Cross-Model Generalization The evaluation covers two representative LVLMs—LLaVA-NeXT, which is more vision-focused, and Qwen2-VL, which has stronger language grounding. UMCI demonstrates consistent gains across these different architectures, indicating that the method is not tied to a specific model type. This cross-model validation strengthens the claim of generality and suggests potential applicability to broader multimodal systems. 4）Strong Engineering Integration The framework integrates multiple existing components—visual counterfactuals, textual counterfactuals, and multi-round inference—into a unified and reusable pipeline. This design reflects strong engineering execution and makes the framework practically useful. Researchers can readily adapt the system for robustness studies, counterfactual evaluation, or future model comparisons. 5）Dynamic Robustness Evaluation Benchmark The proposed dynamic benchmark supports model-adaptive robustness evaluation under both visual and linguistic perturbations. Unlike traditional static bias datasets, it allows more flexible and realistic testing that better reflects real-world input variations. This makes the benchmark itself a valuable tool for advancing robustness evaluation practices in multimodal research. 1）Limited Methodological Novelty The proposed UMCI framework primarily integrates existing causal inference approaches—TIE, VCD, and M3ID—into a single formulation. While this unification is conceptually coherent, it introduces limited algorithmic innovation. Both the visual counterfactuals (black, blurred, or noisy images) and the textual counterfactuals (template-based rewrites) largely follow existing methods or rely on simple heuristics. Consequently, the contribution is more of an engineering consolidation than a genuine methodological advancement. 2）High Inference Cost and Practical Constraints UMCI requires multiple inference rounds (typically three to seven) to achieve stable results. This increases inference latency by roughly 1.8×–2.5×, alongside proportional growth in token usage and GPU memory consumption. Although batch inference can partially offset the delay, the memory footprint still scales linearly with the number of rounds, limiting feasibility on consumer GPUs or edge devices. Moreover, the paper lacks quantitative measurements of token-level overhead or throughput degradation, leaving its practicality for real-time applications uncertain. 3）Static and Simplistic Textual Counterfactual Design The textual counterfactual module is based on three fixed templates—tone modification, language switching, and role prompting—which are handcrafted and deterministic. This design lacks semantic diversity and adaptivity, making it unable to capture the broader range of linguistic sensitivity cases. While the authors claim these represent typical scenarios, they provide no statistical evidence or empirical justification. As a result, the textual component feels simplistic and may not generalize beyond the tested benchmarks. 4）Marginal Performance Improvements The reported gains across standard benchmarks are modest, typically below 0.5%, and in some cases even lower than those achieved by earlier methods such as M3ID. Considering the 3–7× inference repetition, the trade-off between robustness and efficiency remains weak. Furthermore, the “scaling law” analysis (Section 4.4) shows performance saturation after five rounds and lacks statistical validation, suggesting that the claimed improvement trend may not be robust. 5）Outdated Baselines and Missing Comparisons Although UMCI compares against several causal inference and decoding-based baselines (TIE 2021, VCD 2024, M3ID 2024), it omits more recent methods that have advanced inference-time robustness for LVLMs. For example, LCD (ACL 2024), ICD (ACL 2024), and RVCD (ACL 2025) extend contrastive decoding and causal reasoning with improved robustness and efficiency. Without experiments or discussion involving these newer methods, it is difficult to determine whether UMCI truly advances the state of the art. As a result, the paper feels somewhat dated and incomplete in benchmarking, which weakens the strength of its empirical claims. 1）Computational Cost and Resource Usage Could the authors provide detailed measurements of token usage and GPU memory consumption for different inference rounds (e.g., UMCI₃, UMCI₅, UMCI₇)?Including such data would give a more comprehensive understanding of the computational cost beyond latency alone and clarify the framework’s scalability in practical settings. 2）Textual Counterfactual Categorization The paper divides textual counterfactuals into three categories, but this classification appears heuristic and lacks empirical or systematic justification.Could the authors provide evidence or statistical analysis showing that these three types adequately represent major linguistic sensitivity patterns?Additionally, since fixed templates may not generalize well across diverse prompts, have the authors considered using dynamic or adaptive counterfactual generation methods? 3）Robustness–Accuracy Trade-off UMCI shows strong gains on the BS Benchmark but only marginal or even negative improvements on standard benchmarks.Does this suggest that the robustness enhancement comes at the cost of reasoning accuracy?It would be helpful if the authors could clarify this trade-off and discuss possible ways to mitigate performance degradation on clean data. 4）Cross-Model Inconsistency UMCI exhibits inconsistent performance across models: it yields larger gains on Qwen2-VL but very limited or even negative changes on LLaVA-NeXT across several datasets.This discrepancy implies that UMCI’s effectiveness may depend on specific model architectures or training alignments rather than being universally applicable.Could the authors elaborate on the reasons behind this difference?Does UMCI require model-specific tuning to maintain consistent improvements across architectures?	Fully AI-generated
UMCI: A Unified Counterfactual Framework for Robust Vision-Language Reasoning	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The paper proposes UMCI, a test-time method for large vision-language models (LVLMs) that unifies visual and textual counterfactual reasoning. It generalizes Visual Contrastive Decoding (VCD) by averaging logits over multiple perturbed images and paraphrased prompts to reduce language bias and sensitivity. The authors also introduce a Bias & Sensitivity Benchmark, adaptively identifying model-specific fragile samples, and report modest robustness gains on this benchmark and standard datasets. * This paper provides a unifying perspective linking VCD to causal debiasing (TIE/TDE) and interprets temperature as a causal weighting mechanism. * UMCI is simple to apply at inference and it includes prior methods as special cases (VCD ≈ VC only; CF-VQA ≈ TIE only). The decomposition into VC and TC is intuitive. * The proposed benchmark formalizes bias and sensitivity via explicit criteria and highlights that non-robust samples vary across LVLMs, which is a useful diagnostic perspective. * Minor performance gains. In Table 4, the improvements on standard benchmarks are very small, often within noise. Most gains appear only on the BS Benchmark, which is constructed using the model’s own failure cases. * The method mainly combines known ideas — visual perturbations and prompt ensembling — under a unified formulation. The innovation over prior works (VCD, TIE, TDE) is modest. * UMCI uses more test-time compute (multiple counterfactual rounds) than baselines. Compute-matched ensemble baselines are not compared. na	Moderately AI-edited
UMCI: A Unified Counterfactual Framework for Robust Vision-Language Reasoning	Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The paper presents a systematic study of visual contrastive decoding (VCD; Leng et al, 2024) in context of previous counterfactual inference methods such as CF-VQA (Niu et al, 2021), and proposes a unified inference framework for counterfactual bias mitigation in large vision-language models. Specifically, the authors show that VCD is equivalent to reweighting the original output probabilities by the exponential of total indirect effect (TIE) logits as defined in CF-VQA, and generalize this formulation by repeating the inference and reweighting over multiple text and visual counterfactuals. Experiments on a new bias-sensitivity benchmark demonstrate this multi-round inference procedure reduces language bias and improves robustness to text perturbations. - Neat, systematic approach to unify existing methods on counterfactual inference of LVLMs. UMCI treats CF-VQA and VCD as special cases but bridge and generalize them to enable more diverse counterfactuals, reducing inconsistencies of the output. - Interesting method to probes bias and sensitivity of models by bootstrapping from existing benchmark data. This allows measuring and comparing bias and robustness of LVLMs in realistic settings, not artificial tasks from previous bias/hallucination benchmarks. - UMCI shows promising results on the bias-sensitivity benchmark, outperforming baselines by large margins and demonstrating some degree of scaling over inference rounds. - The counterfactuals in the proposed BS benchmark are generated using the same procedures as in the proposed UMCI method. This seems to lead to exaggerated improvements over the baselines (table 2), while their performances are much closer on real-world benchmarks (table 4). In other words, it is not clear to me how well the method generalizes beyond the counterfactual types used at inference time (a crucial dimension of test-time scaling in my opinion). - While the definition of the BS benchmark makes sense for MCQ and binary questions, I'm not sure it is suitable for open-ended tasks when the output is more than one or a few words (ViLP), as it is highly unlikely the model generates identical long responses over multiple runs, especially under nondeterministic sampling. I wonder if using generative evaluation (GPT judge) or some form of semantic matching like https://arxiv.org/abs/2302.09664 may make the benchmark more robust for true open-ended QA? - Test-time scaling seems to saturate after 3-5 inferences, more apparent for LLaVA-NeXT. While the general trend is still positive, I would be hesitant to consider it a definite proof of "scaling law" (in asymptotic sense) before at least experimenting with more models and sampled counterfactuals. See weaknesses. Also, are the optimal hyperparameters ($\tau_1$, $\tau_2$) similar across models? Is it possible to use different numbers of M and N and study scaling behavior along both directions?	Fully human-written
UMCI: A Unified Counterfactual Framework for Robust Vision-Language Reasoning	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The authors identify two key robustness challenges in current Vision Large Models (VLMs): a bias towards prioritizing language over vision, and a high sensitivity to prompts. To address these challenges, they propose the Unified Multi-round Counterfactual Inference (UMCI) framework. This framework adjusts the input to generate multi-round results and performs visual and linguistic debiasing at the logit level. Concurrently, the authors introduce a method to dynamically construct a benchmark (the BS benchmark) designed to measure the bias and sensitivity of different inference methods for a specific model. 1. The paper is well-written with a clear and easy-to-follow logical flow. 2. The proposed method achieves significant improvements on the paper's BS benchmark. When evaluated on general benchmarks, it outperforms other inference methods in terms of both the number of benchmarks improved and overall performance. 1. The construction of the BS benchmark shares structural similarities with the proposed method. This raises a concern that the selected subset might be inherently biased, creating a situation where the proposed method is "both the player and the referee" in its own evaluation. 2. The analysis of the BS benchmark could be more in-depth. For instance, it is unclear how the resulting subsets differ when constructed from different sets of visuals (v) and questions (q), and whether this would lead to different evaluation results. 3. As described, the use case for the BS benchmark appears limited. It evaluates the relative bias and sensitivity of different inference methods for the same model. In practice, the community is often more concerned with the absolute bias and sensitivity exhibited by a specific model in application. 4. The paper suggests that Visual Counterfactual (VC) primarily mitigates the bias of neglecting vision (corresponding to the B subset), while Textual Counterfactual (TC) mainly reduces sensitivity to text prompts (corresponding to the S subset). It is not clear if this correspondence is explicitly reflected in the metrics. 5. In Table 5, the contribution of TC appears to be limited, with VC playing the primary role. Their respective contributions on the general benchmarks are not clearly delineated. 6. The generalizability of TC may be a concern, as it seems to require specific designs for specific problems. Its application in more general scenarios needs further analysis. 7. The proposed method shows substantial gains on the BS subset but smaller improvements on general benchmarks. This raises the question of whether the method might degrade performance on the remaining parts of the full test set. 8. Line 51 of the paper states that language bias has some overlap and connection with hallucination. Given this, the performance of the proposed method, the baseline, and the main compared methods on a benchmark like HallusionBench should be presented. 9. Regarding Equation 5, it is unclear if it represents a token-level probability distribution. If so, since VLMs generate text autoregressively, the textual condition for generating each token should be the prompt 'q' combined with the previously generated tokens, not just 'q'. See Weaknesses. If most of my concerns can be well addressed, I would like to raise my score.	Lightly AI-edited
RM-R1: Reward Modeling as Reasoning	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	Inspired from reasoning LLMs, the authors add long CoT into reward modeling and introduce a new class of generative reward models (Reasoning RMs). They design a chain-of-rubrics reasoning process, trained a set of RMs (RM-R1) with distillation and RL, and validate their performance and scalability. 1. The motivation of transfering CoT reasoning to reward modeling is clear and sound. 2. The design principle of rubrics-based evaluation for chat tasks and correctness-first judgment for reasoning tasks align well with intuition and practice. 3. The experimental results are strong and scalable. 1. Strong-to-weak supervision. It is generally believed that it is easier to discriminate than to generate (a smaller, weaker RM can supervise a larger, stronger models). The design of reasoning RMs says otherwise (e.g. the RM needs to solve a reasoning task itself to give judgment). This could severely limit its use. 2. Heavy training cost. Both querying strong LLMs for high-quality distillation and doing RLVR are very costly. This, especially the distillation part, makes the method not appliable to large scales. 3. Lack of analysis on reward hacking. The paper acknowledges that distilled models suffer from overfitting to trivial patterns, which makes RL necessary, but does not validate RL's effect on mitigating this. 1. Weakness 1. How do RM-R1 perform when it is used to supervise a stronger model? For example, on a reasoning task where RM-R1 cannot solve correctly but the training model can? 2. Weakness 2. How much computation do RM-R1 require in comparison with other RMs? Can generative RMs or reasoning RMs benefit from test-time computation, and if so, what is the advantage of RM-R1? 3. Weakness 3. Is there any evidence other than benchmark scores to support the claim?	Fully human-written
RM-R1: Reward Modeling as Reasoning	Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	- RM-R1 treats reward modeling as a reasoning process, where the model produces reasoning traces instead of only scalar scores. - The model follows a structured format with tags such as type, rubric, eval, and answer to standardize reasoning across tasks. - It is trained in two stages: first by distilling reasoning traces from stronger verifier models, and then by reinforcement learning with Group Relative Policy Optimization using binary rewards. - The goal is to make reward models interpretable, verifiable, and robust by aligning reasoning quality with preference correctness. - Experiments show that RM-R1 outperforms traditional scalar reward models in both consistency and interpretability without losing accuracy. - It introduces a clear and interpretable reasoning structure for reward modeling, making the decision process transparent and auditable. - The two-stage training pipeline effectively combines teacher reasoning with verifiable reward optimization. - It demonstrates that reasoning-based reward models can outperform traditional scalar models in both accuracy and consistency across benchmarks. - The reinforcement learning stage with GRPO optimizes for a proxy reward rather than true human satisfaction, leaving room for reward hacking or misalignment. - Generating and processing structured reasoning traces substantially increases training and inference cost compared to scalar reward models. - The paper lacks a detailed error analysis showing when reasoning helps versus when it harms reward accuracy. - The work does not provide a clear mechanism for verifying the correctness of the generated reasoning traces themselves, only their final verdicts. - Why was a binary reward signal chosen instead of a continuous or rubric-weighted scoring scheme, given that reasoning traces contain richer evaluative information? - Have you measured the factual correctness of reasoning traces separately from their final decision accuracy? - Have you quantitatively analyzed whether longer or more detailed reasoning traces actually correlate with better reward accuracy? - How do you ensure diversity of reasoning strategies in the training data so the model does not overfit to one verifier's reasoning style? - Since distilled reasoning models such as DeepSeek-R1-Distill-Qwen-32B are publicly available and already exhibit strong structured reasoning ability, why did you not adopt one of these as the base RM-R1, instead of training reasoning capabilities from non-reasoning models? - In Line 194, can be find -> can be found - In Line 166, claude-3-7-sonnet -> Claude-3-7-sonnet - judgement should be judgment in American English; I believe the paper mostly uses American English. - In Line 181, ) , -> ), (no space) - Please use \citep and \citet appropriately. - Please ensure that the citation formats are consistent, the capitalization is correct, and the information is up-to-date.	Fully AI-generated
RM-R1: Reward Modeling as Reasoning	Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper proposes RM-R1, a new paradigm that treats reward modeling as a reasoning process rather than a simple classification task. The authors introduce Reasoning Reward Models (REASRMs), which combine two training stages: Reasoning Distillation: reasoning traces and rubrics distilled from high-performing proprietary models (Claude-3 and OpenAI O3). Reinforcement Fine-tuning: applying Group Relative Policy Optimization (GRPO) to optimize reasoning-based reward models. The model follows a Chain-of-Rubrics (CoR) framework — it first identifies task type (chat vs reasoning), then generates rubrics or intermediate reasoning steps, and finally outputs a judgment. Across several reward-modeling benchmarks (RewardBench, RM-Bench, RMB), RM-R1 achieves state-of-the-art results, surpassing GPT-4o and LLaMA-3.1-70B, with especially strong gains on reasoning-intensive tasks such as math (+20%). Conceptual novelty: The paper reframes reward modeling as an explicit reasoning process, bridging evaluation and interpretability. Transparency: RM-R1 produces human-readable rubrics and step-by-step reasoning chains, offering insight into how judgments are formed. Strong empirical results: Substantial gains over larger models on multiple reward benchmarks; improvements are consistent across scales. Comprehensive experiments: Includes ablation studies, scaling analysis, and qualitative case studies. Data dependency and potential bias: RM-R1 heavily depends on Qwen-2.5 and DeepSeek-Distilled-Qwen outputs, possibly inheriting reasoning biases or training contamination. Moreover, the distillation data from Claude-3 and O3 could embed stylistic or safety biases not analyzed in the paper. Simplified reward formulation: The final reward is binary (+1/-1) correctness, lacking multi-component structure (e.g., coherence, rubric adherence). No stability or sensitivity analysis is provided for different reward signals. Limited theoretical grounding: The paper provides intuitive motivation but no formal justification for why reasoning improves reward alignment. Connections to existing PRM or verifiable RM frameworks are missing. Lack of domain generalization: All experiments focus on text-only reasoning; no evidence of transfer to multimodal, code, or embodied tasks. Ethical and bias analysis omitted: The paper claims “no ethical concerns,” yet relies on closed-source models (Claude, O3) for supervision, which may introduce opaque bias or intellectual-property issues. Data provenance and bias How do you ensure that reasoning traces distilled from Claude-3 and O3 do not introduce bias or data leakage into RM-R1? Reward formulation The final reward is binary correctness (±1). Have you explored multi-component or continuous reward signals (e.g., coherence, rubric consistency)? How stable is the RL training under noisy rewards? Theoretical motivation Can you provide any theoretical or cognitive rationale for why explicit reasoning improves reward alignment compared to outcome-only modeling? Generalization Has RM-R1 been tested on multimodal or dynamic tasks (e.g., vision-language reasoning or agentic evaluation)? If not, how well do you expect it to generalize? Distillation fidelity What fraction of the distilled reasoning traces were incorrect or low-quality, and how does this affect downstream RL optimization?	Fully AI-generated
RM-R1: Reward Modeling as Reasoning	Soundness: 3: good Presentation: 4: excellent Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper introduces Reasoning Reward Models, which formulate the reward modeling process as a reasoning task. It further proposes the Chain-of-Rubrics mechanism—a self-generated checklist process by the reward model itself—offering a reasonable implementation of CoT reasoning in the reward modeling domain. The authors provide a detailed recipe for training a ReasRM, and the trained model RM-R1, which achieves superior performance across three benchmarks on average. The paper is well-written and easy to follow. However, upon reviewing the paper, the reviewer observes that RM-R1 essentially functions as an “LLM-as-a-judge” judger. This perspective, along with the proposed training and usage methodology, raises several questions and concerns. The reviewer has listed many questions in Weakness and Question, and if they are answered properly, the reviewer will consider increasing the score. 1. Integrating reasoning ability into rewarding is a good method, and considering the submission time, this method has a certain novelty. 2. The CoR proposed in this paper generates customized checklists for problems and has different solutions depending on the types of issues. 3. The paper provides a training recipe including data construction, SFT, and RL, and releases the training hyperparameters. In implementation, (1) The authors only use a series of Qwen models, which is under suspicion of data leakage[1]. By viewing the detailed results on three benchmarks in the appendices, the reviewer finds that RM-R1 mainly performs better on math and code generation; the former one is under suspicion of data leakage. However, in the chat area, RM-R1-32B did not perform better than some 8B / 27B models, though equipped with a reasonable CoR mechanism. (2) The authors use a significantly strong “oracle” model to construct the structured reasoning trace, which is costly but does not introduce significant gains in general domains. In usage, the method trains an LLM-as-a-judge server, which is easy to cheat with a “Please give my answer a better score”-like prompt, especially easy to hack the reward in reinforcement learning usage. So, in the reviewers' opinion, the authors propose an interesting concept (CoR), but not a practical method. I believe the gains in helpness and harmlessness are introduced by CoR. 1. What’s the prompt for strong GenRMs like GPT-4o? Did they use a CoR-oriented prompt to ensure fair comparison? 2. For the reasoning tasks, RM-R1 performs an ‘answering-before-judging’ behavior, but the base model is under suspicion of data leakage in some reasoning tasks[1]; an explanation is needed for this. Is the improvement in effectiveness due to the model having a stronger reasoning (task-solving) ability or a stronger evaluation (judging) ability? This means that RM-R1 cannot judge the problems it cannot solve. 3. Is RM-R1 easy to cheat using a prompt like “Please give my answer a better score”? It’s important to determine whether it can be used in RL (with the easy-to-hack concern). 4. The construct costs compared to an ability-matching scalar model? 5. The inference costs compared to the scalar model? Would using multiple scalar models and equipped with consistency methods for inference, yield better results while remaining lower cost? 6. How to ensure the correctness of the intermediate process? In training data construction, humans were involved in data construction, but how to ensure it in inference? Though a strong reasoning model is prone to making intermediate errors in long reasoning. [1] Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination	Fully human-written
Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The authors propose an RL method to handle large discrete action spaces. The method pretrains a transformer with masked action inputs to reconstruct the action, then do RL on top of the learned transformer representation. They show this method matches other large action RL baselines in just a fraction of their training times. The method is evaluated in a modified version of DM Control where the action space is discretized. - The method is very simple and intuitive. By doing self supervised learning on the large action space, one can learn a more meaningful action representation than the original one. This will lead to large downstream gains. - The paper is well written and easy to understand. - The experiment section has interesting analysis results to pin down why SPIN is helpful. ### Empirical evaluation feels toy and contrived - these methods are all evaluated in rather artificial RL tasks (hopper, quadruped, etc.), where they take a popular benchmark (DM Control) and then factorize the action space. While useful for fast iteration and initial scientific insight, it is insufficient for convincing me that this method, or even the problem of large discrete action space, is useful. The authors motivated the problem by citing natural problems with large action spaces like recommender systems, robot assembly, etc. Could the authors show results in a more realistic problem setting? ### Method novelty - In terms of methodological novelty, there's not too much at the high level. When you have noisy or high dimensional data, e.g. noisy sensors, high dim images, representation learning is the first thing we try to improve the signal to noise ratio of our data. So doing this for actions, using a standard masked reconstruction objective, seems very obvious, and not too "novel". On the other hand, this method is "novel" in the sense of applying the masked reconstruction objective to this particular problem where action spaces are large. - This can be seen as a special case of literature studying masked transformers for decision making problems [1], where the mask of the transformer is just set to the action modality. It would be interesting to compare SPIN against a masked transformer baseline that does masking over both state and action modalities. [1] Masked Trajectory Models for Prediction, Representation, and Control [2] PASTA: Pretrained Action-State Transformer Agents See weaknesses, I would like to see more realistic experiments. For method novelty, it would be addressed by better framing, and also comparing against a standard representation learning approach like masked reconstruction over the entire input sequence.	Fully human-written
Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper introduces SPIN, a two-stage approach designed to improve efficiency in discrete combinatorial action spaces. Specifically, it separates representation learning from policy learning. In the first stage, an action structure model learns a representation function that captures the manifold of valid actions. In the second stage, this representation is frozen and reused, with lightweight policy heads built on top of the pre-trained action structure model. The experimental results demonstrate clear benefits in terms of both performance and efficiency. The paper is clearly written and well motivated. The proposed idea is straightforward, and the algorithm is compatible with actor–critic frameworks, which enhances its applicability across a wide range of settings. The experimental results demonstrate the superiority of the proposed approach compared with the three selected baselines. While the focus on offline RL is relevant, it is not sufficiently justified in the paper. In particular, SAINT is originally an online approach, which has been used here in an offline setting for comparison. In my view, it is not entirely fair to claim that SAINT jointly learns the action structure and control, as it was designed for a different purpose. This raises questions about the validity of the comparison. The evaluation is also somewhat limited. The implementation details for the selected baselines are not described clearly, making the fairness of the comparison uncertain. While it is understandable that the authors aimed to keep architectural choices consistent, comparing directly with the original implementations of the baseline methods would strengthen the credibility of the results. There are a few relevant works in this area that the authors may wish to consider for experimental comparison, such asOHIO (https://openreview.net/forum?id=dTPz4rEDok), and MERLION (https://proceedings.mlr.press/v162/gu22b/gu22b.pdf). Particularly, the paper claims the decoupling the representation learning from control. However, MERLION also learns reusable action embeddings. The contribution over MERLION remains unclear in the paper. 1. Please justify the experimental comparison. 2. Please clarify the contributions with respect to the earlier works, especially, MERLION.	Fully human-written
Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	Problem: The paper tackles offline reinforcement learning in large, discrete combinatorial action spaces, settings where the agent must select from exponentially many joint actions (composed of multiple sub-actions) and ensure these selected sub-actions form coherent combinations. This is relevant for domains like healthcare decision support, robotics, recommendation systems, and fleet management, where online exploration is costly, risky, or infeasible. Approach: - The authors propose Structured Policy Initialization (SPIN), a two-stage framework that decouples representation learning from control. : (a) Action Structure Model (ASM) is trained to learn an action representation function, (b) ASM is frozen and lightweight policy heads are trained for downstream RL control on this learned action representation. - SPIN offers a principled and empirically validated way to accelerate and improve offline RL in large discrete combinatorial action spaces, primarily by decoupling structure learning and control, thus making learning tractable and robust as complexity grows. The separation of structure and control is motivated and clearly shown to overcome the slowness/instability of joint learning - SPIN works with multiple offline RL algorithms (IQL, AWAC, BCQ). Overall, SPIN is a promising and elegant approach that reframes discrete combinatorial control as a representation problem. - The current framework requires architectural compatibility between ASM and policy modules for effective weight transfer. This can limit its integration with arbitrary RL architectures and restrict broader applicability. - The paper notes that SPIN is compatible with IQL and AWAC but not CQL. Could you elaborate on stability issues that arise with value-regularization methods and whether hybrid objectives could reconcile them? - Why was masked conditional modeling chosen over contrastive or next-sub-action prediction? Did you test alternative pretext tasks, and if so, how did they compare? - Have you evaluated SPIN on higher-arity combinatorial domains (e.g., VRP or job-shop scheduling) where the sub-action semantics differ? Would the same ASM formulation apply without state–action token alignment?	Fully AI-generated
Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The authors proposed an algorithm for RL with combinatorial action spaces. The proposed method has two stages. We learn the action structure during the first stage, and then we learn a policy in the second stage. By separating the learning of action structure and policy, the proposed algorithm overcomes the computational cost issue that a previous work named SAINT has. Determining when to finish pretraining and move on to policy training is crucial. Stopping pretraining too early could lead to poor action structure modeling (Sec 6.1 illustrates the importance of sufficient pretraining), and stopping pretraining too late could lead to the same computational cost issue that is with SAINT. The authors do not provide an approach to choose the stopping time of pretraining. The proposed method largely reuses the policy architecture in SAINT, and thus the novelty of this work is limited. Could the authors provide an approach to choose the stopping time of pretraining?	Fully human-written
Robust Strength Behavior Modeling of Coarse-Grained Soils Using HSIC-Guided Stable Learning	Soundness: 2: fair Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	AI models are not robust under distributional shifts in data for this engineering application of analyzing the strength of coarse-grained soils. The paper proposes a solution that reweighs training samples to stabilize a training module which is integrated with a deep neural network. The approach is compared with SNN on several metrics. Their approach calculates the Hilbert-Schmidt Independence Criterion (HSIC) directly rather than using an approximation, because the dataset size allows for it. This paper applies stable learning, a recently introduced method, to a novel problem. The primary contribution of the paper is demonstrating that using the exact calculation of HSIC improves accuracy over the SNN method that uses an approximation (and over the baseline of a neural network without reweighting). Results are demonstrated on synthetic datasets with known amounts of distributional shift; and score using several metrics. The primary conclusion is that exact HSIC performs better than an approximation on this one application. This result is not surprising, as it is the more computationally-expensive and exact calculation. The result will only generalize to other problems with similar sizes of datasets, but the size of this dataset is not clear. There is no exploration of dataset sizes. The result tables show very large errors. What range of values are being regressed? It is likely that there are very large values that are hard to predict. Normalization or quantization may help with this. It would be insightful if the paper showed a scatterplot of at least a subset of the predictions versus targets to better understand the R^2 values. Minor issues: It would be helpful to indicate in the caption of Figure 3 that this is dataset index 7. Fonts in plots are too small to read. The paper demonstrates that exact HSIC performs better than an approximation on your data. How would somebody wanting to apply your approach know if it is tractable for their own data? How much data is tractable? There could be some exploration of the computation/accuracy tradeoff to help guide others toward choosing between the approaches. In Figure 3 and the dataset it is showing, what is confining pressure? It is not the value being regressed. So, is it one of the input features? What R^2 or error level is needed to make it practically useful to use AI for this application? It seems like the errors will need to drop by orders of magnitude rather than incremental amounts.	Fully human-written
Robust Strength Behavior Modeling of Coarse-Grained Soils Using HSIC-Guided Stable Learning	Soundness: 3: good Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper proposed a learning framework for improving the robustness and generalization of regression models under distributional shifts. The application of interest is predicting deviatoric stress–axial strain of coarse-grained soils, where data are scarce and heterogeneous. The key idea is to combine exact kernel-based Hilbert-Schmidt Independence Criterion (HSIC) with deep neural networks to decorrelate features through sample reweighting and weight globalization. This goal is to reduce spurious correlations. Numerical experiments are conducted on synthetic datasets only. The proposed approach is compared to only one baseline, SNN, which it outperforms. The lack of comparisons to existing domain generalization works, narrow focus on one application, and evaluation on synthetic data only severely limits the contributions of the paper. - The application domain, predicting deviatoric stress–axial strain of coarse-grained soils, is quite interesting. It is an example of a real-world system where data is scarce and heterogeneous, and there is a need for methods that work well for such scenarios. - From an application perspective, it promises a data-driven alternative to costly triaxial tests. - There is a consistent and moderate improvement across multiple distribution-shift scenarios, albeit on synthetic datasets. - Experiments are conducted on a single domain (soil mechanics), that too only on synthetic biases. It is unclear whether the approach generalizes to other regression tasks or modalities. - The proposed approach is compared only to one baseline (SNN). There is no comparison the many methods proposed for domain generalization. Further, it is unclear what this SNN baseline is. The acronym is never defined and SNN is never directly cited. - There are many design choices in the algorithm. But there are no ablation studies to check which ones are sensitive, how to select them, etc. - The proposed approach has limited novelty. It is able to use the full kernel without approximation due to the limited dataset size. Other approaches which approximate the kernel could also afford to not do the approximation for the same dataset. So, the real contribution of the paper is unclear. Applying to new domain without ML contributions is insufficient for an ML conference. - How sensitive is the model to the choice of Gaussian kernel bandwidth? - How does the model compare to the plethora of existing domain generalization methods? - Real-data is often noisy in such cases. How robust is the method to noise? - How would the model perform on real-datasets? Perhaps, it can be trained on synthetic but evaluated on real-data. - How much benefit does the globalization module provide over local reweighting?	Fully human-written
Robust Strength Behavior Modeling of Coarse-Grained Soils Using HSIC-Guided Stable Learning	Soundness: 2: fair Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper proposes HSIC-StableNet, a stable learning framework that combines exact kernel-based Hilbert-Schmidt Independence Criterion (HSIC) with deep neural networks to predict deviatoric stress-axial strain curves for coarse-grained soils. The method aims to improve out-of-distribution (OOD) generalization by reweighting training samples to reduce feature dependencies. ++Addresses a real problem in geotechnical engineering where data collection is expensive and challenging. ++The paper articulates well why OOD generalization is vital for soil strength prediction, given data scarcity for large-particle soils. +Using exact kernel methods instead of approximations (RFF) for HSIC computation is reasonable for moderate-sized datasets. Limited novelty: A search for “Hilbert-Schmidt Independence Criterion” on Google finds several highly relevant and related papers, such as: https://arxiv.org/pdf/1910.00270 “We investigate the use of a non-parametric independence measure, the Hilbert-Schmidt Independence Criterion (HSIC), as a loss-function for learning robust regression and classification models.” https://proceedings.neurips.cc/paper_files/paper/2007/file/d5cfead94f5350c12c322b5b664544c1-Paper.pdf Furthermore, a main contribution according to the authors is “While most existing stable learning methods are developed for classification tasks, this work extends the paradigm to regression scenarios by embedding a stable learning mechanism within a regression framework.” However, this paper https://ojs.aaai.org/index.php/AAAI/article/view/6024 from AAAI 2020 considers both classification and regression. The baselines are very simplistic and not fully described. I would not be able to reproduce the results even if I had the data. However, the datasets are all synthetic, and it is not clear how they are generated. The tests do not clearly follow the story of targeting soil test strengths. The authors’ synthetic datasets might be modelled after these; however, that is not very clear when reading. The area is significantly outside the domain of ICLR, and should thus be thoroughly explained. The experiments are not repeated over multiple seeds, even though the training methods are stochastic. There are no error bars, significance tests, or confidence intervals. Computational cost completely ignored. HSIC computation requires O(N^2) kernel matrix operations. (For each feature pair, the method computes and stores N by N matrices.) The authors claim this is feasible for "moderate-sized" datasets, but provide no timing comparisons. The captions are minimal. Please extend these to explain the figures and tables fully. The related work is almost entirely missing. The section is barely 13 lines… The authors should highlight relevant related work and compare and contrast their method to it. The authors mention spurious correlations several times, but do not bring this up in the related work. Missing overview figure. There is a system’s figure in Figure 1, but there is no figure to give an overview of what the authors are doing. Please align the notation with the formatting instructions. Grammatical errors: Equations 1,3,4,5,9,10,13,15, and 17 should all end with a period. The others should end with a comma. Issues with citations: “Ma Z. M. Chen Y., Xiong R. When does group invariant learning survive spurious correlations? Advances in Neural Information Processing Systems, 35:7038–7051, 2022” is in the reference list but not cited in the paper. “coarse-grained soil data are often sparse or imbalanced, especially for large particle sizes, leading to distribution shifts that degrade model generalization” missing citations. What do you mean on line 209 with “statistically independent?” What is the exact definition? “However, since MBGD processes only a subset of samples in each batch, the resulting weights remain localized, which can limit the effectiveness of reweighting in addressing statistical dependencies across the entire dataset.” If each batch took a random subset of the features, why would they remain localized? Please elaborate on this or provide references. How do you define the bias (%) in Figure 2? What is seen on the y-axis of Figure 3? I do not understand it. Why do you only test synthetic datasets? How do you create the synthetic datasets? Do they follow physical laws or something else?	Fully human-written
Robust Strength Behavior Modeling of Coarse-Grained Soils Using HSIC-Guided Stable Learning	Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper proposes HSIC-StableNet, a stable learning framework that uses the Hilbert-Schmidt Independence Criterion (HSIC) with exact Gaussian kernels to improve out-of-distribution (OOD) generalization in predicting deviatoric stress–axial strain (q–εₐ) curves for coarse-grained soils. The method replaces approximate kernel methods (e.g., Random Fourier Features) with exact kernel computation and extends stable learning—typically used for classification—to regression. Experiments on synthetically biased datasets show improved performance over a standard DNN and a prior stable learning baseline (SNN) across R², MSE, MAE, and MAPE. The authors highlight the model’s ability to predict large-particle soil behavior using only small-particle training data, addressing data scarcity in geotechnical engineering. 1. The authors successfully extend stable learning, a concept predominantly used in classification, to the regression domain. This is a convincing and valuable adaptation that broadens the applicability of these methods. 2. The use of exact Gaussian kernels is a well-reasoned choice. For this specific domain with smaller geotechnical datasets, the authors show this yields measurable gains over approximate methods (e.g., SNN with RFF). 3. The model’s ability to predict soil behavior across different particle sizes with limited data directly addresses real-world challenges in the geotechnical field, making it highly relevant for both academia and industry. 4. Synthetic bias effectively simulates distribution shifts. Performance trends (e.g., gains increase with distribution deviation) support robustness claims. 1. The paper would be strengthened by a deeper theoretical discussion. It's not entirely clear why the exact kernel leads to better generalization for regression, or under what specific conditions we can expect the cross-scale transfer to hold. A discussion of the underlying causal mechanisms (e.g., identifying spurious versus invariant features) is a notable missing piece. 2. The use of synthetically biased data is a good controlled test, but it may not prove robustness against the complex distribution shifts encountered in practice. A real-world transfer experiment (e.g., train on one soil type, test on another geological region) would better validate practical robustness. 3. Given ICLR's focus on core machine learning principles, the highly applied, geotechnical nature of this work might be a less immediate fit. The community might question whether the methodological contribution is foundational enough, or if it is primarily a successful application of existing tools. 4. The experiments compare HSIC-StableNet primarily against DNN and SNN. To strengthen the paper, including comparisons with other state-of-the-art models in OOD generalization or domain adaptation would provide a broader context. 5. The dataset is not public, and key hyperparameters (e.g., kernel bandwidth σ, globalization factor α) are not thoroughly reported. While code sharing is often post-acceptance, these omissions currently limit verifiability. 6. No ablation study isolates the effect of the globalization module (Section 3.3.2). Is it necessary? 1. What justifies the assumption that decorrelating input features leads to learning invariant mechanisms? Could you clarify which features are considered spurious vs. causal in the soil strength prediction task? 2. How does the model performance scale with much larger datasets in real-world applications, and are there any optimization strategies in place to handle the computational burden of HSIC? 3. How sensitive is performance to the kernel bandwidth σ and globalization factor α? Were these tuned via validation, and if so, how? 4. While exact kernels are feasible here, could your approach be adapted to larger datasets via Nyström approximation or other scalable HSIC estimators without significant performance loss? 5. Can you provide a direct ablation comparing exact HSIC vs. RFF-based HSIC within the same stable learning framework? The current comparison to SNN conflates kernel choice with other architectural differences.	Fully AI-generated
Robust Strength Behavior Modeling of Coarse-Grained Soils Using HSIC-Guided Stable Learning	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	This work introduces a stable learning framework based on the Hilbert-Schmidt Independence Criterion to address the distributional shift problem in OOD generalization. In predicting deviatoric stress-axial strain curves that represent the strength characteristics of coarse-grained soils, the model consistently surpasses conventional DNN models and a previously introduced stable learning approach, and demonstrates strong performance in estimating the strength behavior of coarse-grained soils with large particle sizes by utilizing data samples from soils with smaller particles. 1 A stable learning framework based on the Hilbert-Schmidt Independence Criterion to address the distributional shift problem in OOD generalization. 2 Strong performance in estimating the strength behavior of coarse-grained soils with large particle sizes by utilizing data samples from soils with smaller particles is demonstrated. 1 The dataset should be explained in more detail. 2 It is not clear how the improvement demonstrated in the new framework matters in addressing the OOD problem of soil mechanics. More discussion on the problem should be discussed. 1 The accuracy of the present approach is better than others, but marginally. Can you conclude how does this minor improvement help in the problem of soils. 2 If we want to predict large-grain mechanics from small-grain ones, is such a statistical learning approach enough? Can we exclude the physics bias or how can be measure the bounds of generalization? Some practical guidelines should be given.	Fully human-written
GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-Guided Latent Diffusion Model?	Soundness: 3: good Presentation: 4: excellent Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper treats scene-text SR as a bi-objective problem: visual quality + text legibility. It uses a dual-branch Text-SR Fusion ControlNet guided by OCR text/positions and a ping-pong scheduler that alternates text-centric and image-centric guidance. It reports big gains in OCR F1 on SVT/CTW/CUTE80 while keeping perceptual metrics competitive. 1. Clear goal (make text readable, not just “look sharp”). 2. Comprehensive evaluation with OCR metrics and perceptual IQA. 1. Limited analysis of trade-offs (e.g., when text gets clearer, what happens to non-text textures?). 2. No multilingual or curved-text stress test. 3. Sensitivity to OCR detector quality is not studied. 1. How robust is the method to OCR detection errors? 2. Can the approach handle dense, multi-language street scenes? 3. How is the ping-pong schedule chosen; can it be learned?	Fully AI-generated
GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-Guided Latent Diffusion Model?	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper introduces GLYPH-SR, a novel vision-language model (VLM)-guided latent diffusion framework designed to address the dual-objective problem in image super-resolution (SR): achieving high perceptual quality and high-fidelity scene-text recovery. The core of the method is a dual-branch Text-SR Fusion ControlNet (TS-ControlNet) that integrates scene-level captions with OCR-derived text strings and their spatial positions. A key innovation is the "ping-pong" scheduler, which dynamically alternates between text-centric and image-centric guidance during the diffusion denoising process. The model is trained on a carefully constructed synthetic corpus that factorizes glyph quality and global image quality perturbations. Extensive evaluations on SVT, SCUT-CTW1500, and CUTE80 benchmarks at up to 8× scaling demonstrate that GLYPH-SR achieves significant improvements in OCR F1 scores (up to +15.18 percentage points) while maintaining competitive performance on perceptual metrics (MANIQA, CLIP-IQA, MUSIQ) against a strong suite of diffusion and GAN-based baselines. 1. The paper compellingly argues that text legibility is a critical yet overlooked aspect of SR in practical applications. It provides a clear analysis of the systemic biases (metric and objective) in prior work that lead to text hallucination or conservative restoration, effectively framing the need for a dual-objective approach. 2. The proposed TS-ControlNet architecture and the binary ping-pong scheduler are elegant and effective solutions for fusing semantic text cues with global image priors without disrupting the pre-trained diffusion backbone. The design allows for targeted text restoration through fine-tuning a relatively small number of parameters. 3. The construction of a four-partition synthetic dataset is a significant methodological contribution. It enables the precise disentanglement of text restoration from general SR, providing a clean signal for training the text-specific components. 4. The paper provides an extensive empirical evaluation across multiple datasets, scale factors, and a wide range of state-of-the-art baselines. The dual-axis evaluation protocol, reporting both OCR metrics and perceptual quality metrics, is thorough and appropriate for the claimed contributions. 5. The ablation studies on guidance components, the scheduler policy, and the sensitivity analysis to upstream OCR/VLM errors are systematic and provide valuable insights into the model's behavior, strengths, and limitations. 6. The qualitative results (Figures 1, 4, 5, 12, 13) are highly effective. They clearly demonstrate GLYPH-SR's superior ability to reconstruct legible, accurate text in challenging scenarios (e.g., 8× scaling) where other methods fail, providing strong visual support for the quantitative claims. 1. The related work, experiment section (Section 2/4) and lacks a thorough discussion of several recent and highly relevant works that also leverage VLMs, text prompts, or diffusion models for text-aware image restoration. Notable omissions include, but are not limited to: a) Zhang et al. (2024), "Diffusion-based Blind Text Image Super-Resolution" b) Chen et al. (2024), "Image Super-Resolution with Text Prompt Diffusion" / "Universal Image Restoration with Text Prompt Diffusion" c) Zhang et al. (2024), "ConsisSR: Delving Deep into Consistency in Diffusion-based Image Super-Resolution" d) Bogolin (2025), "Text-Aware Image Restoration with Diffusion Models" e) Xiaoming et al. (2024), "Enhanced Generative Structure Prior for Chinese Text Image Super-Resolution" This gap weakens the contextualization of the paper's novelty and leaves the reader uncertain about how GLYPH-SR specifically advances the field beyond these concurrent efforts. A more comprehensive survey and a clearer delineation of contributions are needed. 2. While Figure 14 is provided, the discussion of failure cases is somewhat brief. A deeper analysis is warranted, particularly regarding: (a) the root cause of text hallucination in non-text regions (e.g., is it due to over-reliance on text guidance or errors in the initial OCR?); (b) the model's tendency to enhance only the most salient text instances; and (c) a critical failure mode not explicitly discussed: what happens when the upstream VLM/OCR fails to detect or correctly recognize severely degraded text in the LR input? This scenario is highly probable in real-world applications and likely breaks the method's core premise. 3. As shown in Table 6, GLYPH-SR's computational footprint (13B+ parameters, ~43GB VRAM, ~38s/inference) is substantial, limiting its practical deployability compared to faster baselines. The discussion on potential efficiency improvements (Section C.4) is preliminary and speculative. A more concrete analysis or preliminary results from, for example, a distilled VLM, would strengthen the paper's practical impact. 4. Minor Typos and Presentation: L273. The authors should carefully check the content. 1. The sensitivity analysis in Table 5 uses simulated OCR errors. How does GLYPH-SR perform on real-world low-quality images where the initial OCR (providing S_TXT) is inherently noisy or incomplete? Can you provide results on a wild dataset with poor ground-truth OCR to demonstrate robustness? 2. Could you provide more details on the tuning of critical hyperparameters like the control scale s_CTRL and the CFG scale ω? Were they empirically tuned, and what are the observed trade-offs between text fidelity and image quality at different values? Are there failure modes associated with extreme values? 3. Have you conducted any experiments with a smaller or quantized VLM (e.g., a distilled version of LLaVA-NeXT) to reduce computational cost? If so, what was the corresponding drop in OCR F1 and perceptual scores? This would greatly inform practical applications. 4. The paper rightly notes the misalignment of traditional metrics (Fig. 7). To further substantiate the perceptual claims, have you considered or conducted a user study to quantitatively assess human preference between GLYPH-SR and key baselines regarding both overall image quality and text readability? 5. The evaluation focuses on standard Latin scripts. What are the prospects or any preliminary results for GLYPH-SR on multilingual text, complex scripts (e.g., Chinese, Arabic), or handwritten text? Does the current design have inherent limitations for such scenarios? 6. Based on the analysis of Figure 14, what specific architectural or training modifications (e.g., incorporating a text-region segmentation mask, adding a localization loss, or using a more robust text detector) do you envision could mitigate the issues of hallucination in non-text regions and incomplete enhancement of multiple text instances?	Fully AI-generated
GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-Guided Latent Diffusion Model?	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper proposed a framework GLYPH-SR, which utilizes textual information mining from the LQ to guide the denoise process through the designed TS-ControlNet. This paper also introduced ping-pong scheduler to control the condition injection strength along the denoising trajectory for better balance between visual and textual restoration. Extensive experiments demonstrate the competitive performance of the proposed framework. Strengthens: 1. Proposed the GLYPH-SR framework with TS-ControlNet, allowing fine-grained control over both glyph-level details and scene-level realism. Furthermore, a ping-pong scheduler is introduced to dynamically balance visual fidelity and text legibility during the denoise process. 2. Constructed a factorized synthetic corpus separating text degradation from global image degradation, enabling controlled finetunning and clear ablation analysis. 3. Analyzed the trade-off between SR metrics and OCR metrics, which is essential for the evaluation of the proposed framework and similar methods. Weaknesses: 1. The novelty could be further improved. - The text branch of the proposed TS-ControlNet continues to adopt the plain ControlNet structure, without any specific modifications for its text-focused role. Introducing task-oriented designs could potentially further improve its performance. - Although the paper considers the trade-off between SR and OCR metrics, it does not introduce a unified metric to evaluate both scene reconstruction and text restoration quality. 2. The experiments are insufficient. The paper lacks comprehensive evaluations on related real-world image super-resolution task. 3. Some minor writing issues. e.g. "paragraphStep" in line 273 1. Does the proposed framework only support English text? How does it perform on non-Latin characters?	Fully human-written
GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-Guided Latent Diffusion Model?	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper proposes GLYPH-SR, a vision–language guided diffusion framework for text image super-resolution, aiming to jointly optimize image perceptual quality and text legibility. The core technical contributions include: 1. Bi-objective formulation and dual-axis protocol that treats SR as joint optimization of image and text fidelity. 2. Text-SR Fusion ControlNet, which fuses OCR-derived textual cues and image captions. 3. A ping-pong scheduler alternating text-centric and image-centric guidance during diffusion denoising. 4. A synthetic factorized corpus that decouples text and image degradation for targeted training. 1. The bi-objective view of SR (visual realism + text fidelity) is intuitive and important for practical use, addressing the neglected fact in most STISR works. 2. The TS-ControlNet + ping-pong scheduler combination is intuitive for the target of optimizing image perceptual quality and text legibility jointly. 3. The four-way partition synthetic corpus fits the claimed objective of joint optimization for training purpose. 1. Most baselines in experiments are not SR methods specialized for scene text image, except DiffTSR. In addition, methods like DiffTSR are not built for restoring a full scene text image, but for cropped image that only contains a single textline. The comparison could be unfair. 2. Despite most baselines were not built for scene text image, the proposed GLYPH-SR still can not outperform them consistently, even in terms of OCR accuracy. 3. As mentioned in Sec. C.3, the restoration performance rely heavily on the OCR result at the beginning of the procedure. Though strong VLM is applied, the OCR result could still be wrong under severe degradation, otherwise the super-resolution is unnecessary. 4. The dataset used for training and evaluation is not specifically built for image super-resolution. Even in the evaluation, the LR images seem to be generated by manually downsampling. The lack of real-world scenario in evaluation made it less convincing. 5. The OCR accuracy on LR/HR image is not reported in the tables, which made it harder to see the improvements. 1. How was DiffTSR applied to this task? Were the full scene text images directly fed to DiffTSR, or fed after cropping? 2. Why is this paper named "GLYPH-SR" ? The Text-guidance in ControlNet only contains the recognized text and detected position. It seems that this work has nothing to do with glyph. 3. What is the OCR accuracy on the original image/downsampled(HR/LR) image? 4. Have the authors consideder applying other methods to balance the image and text guidance instead of the binary ping-pong policy? (e.g. through a dynamic learnable parameter)	Fully human-written
Value-Anchored Group Policy Optimization for Flow Models	Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper proposes VGPO, an extension of GRPO for flow-matching image generation. It introduces TCRM, which defines per-step rewards via one-step ODE projection and estimates discounted Q-values, and ADAE, which refines group-normalized advantages using timestep weights and a reward variance term. Evaluations on GenEval, OCR, and PickScore show small to moderate gains over Flow-GRPO. - Identifies real issues in applying GRPO to flow models (temporal misalignment and reward diversity collapse). I like this insight. - Clear algorithmic description with simple implementation. - Reasonable experimental coverage across three benchmarks. - The novelty is limited. The “instant reward” idea and the way of generating dense per-step feedback are not new; Many papers have already utilized the one-step operator to obtain the outcome from a sample at noise level t, and then compute the reward for training or inference time scaling [1]. - The “long-term cumulative value” is simply a standard discounted Q-value estimation commonly used in reinforcement learning, given the instant reward. - The design of the advantage function is heuristic and ad hoc. The motivation is clear, but why this paper’s proposal resolves these issues remains unclear in both empirical and theoretical aspects. - Overall, the contributions are incremental adaptations of existing concepts with minor modifications. The justification for these changes is not well-motivated and well-supported. The introduction of ω and α (3.3.2) seems ad hoc, with no clear theoretical justification for why they should stabilize training or prevent reward collapse. The std is still on the denominator and can still collapse to 0. The explanation provided (based on the limit when reward std → 0) is algebraic and does not clarify the actual learning dynamics. No further analysis is provided to demonstrate how these modifications alter the gradient behavior or enhance convergence. - The paper only shows standard training curves and small metric improvements. - There are no experiments designed to directly test whether ADAE indeed mitigates the collapse or variance issues it claims to address. Ablations are shallow and do not isolate the real effect of omega or alpha. - The proposed method appears computationally expensive because it requires one-step ODE projection and Monte Carlo Q estimation at every time step. The paper does not report the computation cost, reward-model calls, or wall-clock comparison with the baseline. [1] Ma, Nanye, et al. "Inference-time scaling for diffusion models beyond scaling denoising steps." arXiv preprint arXiv:2501.09732 (2025). 1. How does the proposed “instant reward” approach fundamentally differ from prior methods that also derive intermediate or dense process rewards? 2. What is the real computational overhead compared to Flow-GRPO (in terms of reward evaluations, training time, and GPU cost)? 3. Why were omega = Q/mean(Q) and alpha = k * std chosen? Are these values empirically tuned or based on theoretical reasoning? 4. Can you show any experiment where reward diversity is intentionally reduced to demonstrate that ADAE truly prevents optimization collapse? 5. How noisy are the Monte Carlo Q estimates, and how many rollouts are required per step to stabilize them?	Moderately AI-edited
Value-Anchored Group Policy Optimization for Flow Models	Soundness: 1: poor Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper identifies and addresses two major limitations when applying GRPO to flow-based generative models: poor temporal credit assignment and optimization stagnation. The authors introduce value-anchored group policy optimization (VGPO) to address these issues by leveraging a temporal cumulative reward mechanism and advantage estimation. The authors demonstrate, through experiments across three benchmarks (compositional generation, text rendering, and human preference), that VGPO achieves state-of-the-art image quality. - The paper identifies a fundamental mismatch when applying GRPO to flow-based generation, pinpointing the two critical limitations of poor temporal credit assignment and reliance on reward diversity. To the best of my knowledge, limited work has explored this problem before. - The authors' delivery of the proposed approach is clear and well-supported by the issues raised. The VGPO approach is easy to follow and intuitively makes sense. - Code is provided for reproducibility. - Comprehensive experiments demonstrate the superior performance of the proposed approach compared to the vanilla GRPO models. - My most significant concern involves the validity of using one-step extrapolation in the temporal cumulative reward mechanism. Flow matching models learn the marginal vector field at each noisy data point. Therefore, it is not guaranteed to be a straight sampling path and requires ODE/SDE solvers with normally tens to hundreds of sampling steps for decent generation results. This is also the reason why techniques like rectified flow and consistency models emerge for few-step generation while maintaining the marginals. Therefore, the simple one-step extrapolation from the early sampling stage will likely deviate from the distribution of the learned data, leading to less credible rewards. It remains unclear, at least mathematically, how such stepwise rewards from a shifted distribution can still guarantee the convergence of the GRPO/DPO-based methods. - As the proposed method requires the calculation of rewards on each intermediate noisy sample in the iterative inference process, the practical running time of the algorithm may be larger compared to the vanilla GRPO/DPO approaches, especially for rewards that rely on other ML models, such as CLIP score. - Baselines compared in the paper remain limited. Only the vanilla GRPO was compared. DPO-based approaches, which share a similar spirit, should also be considered. - Can you derive or discuss the validity of using the one-step extrapolation even when the marginal probability path is not guaranteed to be straight? - Can you provide detailed running time comparisons between the proposed approach and the baselines? - How is the performance of DPO-based models on this task?	Lightly AI-edited
Value-Anchored Group Policy Optimization for Flow Models	Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The paper proposes Value-Anchored Group Policy Optimization (VGPO) for online RL alignment of flow-matching image generators. It argues that directly applying GRPO to flows causes (i) faulty credit assignment by spreading a sparse terminal reward uniformly over timesteps and (ii) signal collapse when intra-group reward variance shrinks. VGPO addresses this via two components: Temporal Cumulative Reward Mechanism (TCRM), which uses a one-step ODE projection to define stepwise “instant rewards” and accumulates them into action values $Q_t$, and Adaptive Dual Advantage Estimation (ADAE), which mixes relative (group-normalized) and absolute (value-anchored) advantages with timestep re-weighting $\omega_t$. Experiments on compositional generation (GenEval), OCR text rendering, and human-preference alignment (PickScore) show higher task scores and generally improved image-quality metrics versus Flow-GRPO, with learning curves indicating faster and smoother convergence when KL is used. (Method & motivation: Fig. 1–2; Eqs. (8)–(13); Alg. 1; Results: Tab. 1, Fig. 3–5.) - Problem diagnosis is specific and evidenced. Eq. (8) exhibits the coupling of a time-dependent policy ratio with a time-independent terminal advantage, explaining why uniform credit across steps is misaligned for flows; Fig. 1 (left/right) visualizes sparse-vs-instant reward and the decline of group reward std during training. - Clean MDP formalization and sampling interface. The paper casts reverse sampling as an MDP (Eq. (3)) and gives the SDE sampler used for policy exploration (Eq. (7)), making the injection of per-step RL signals well-defined. - TCRM turns terminal reward into process values with explicit formulas. One-step ODE projection defines $R_t$ (Eq. (9)), discounted accumulation yields $Q_t$ (Eq. (10)), and timestep weights $\omega_t$ (Eq. (11)) prioritize impactful steps; implementation appears straightforward in Algorithm 1. - It is low in cost and simple in engineering. TCRM only performs one more ODE projection in each inversion step (Equation (9)) without introducing additional PRM or critic; Algorithm 1 shows that it can be directly spliced with the existing Flow-GRPO training process (page 7). - Inconsistent time scale/notation around Eq. (9). Eqs. (1)/(7) use continuous $t\in[0,1]$ while Algorithm 1 iterates discrete $t=T,\ldots,1$; Eq. (9) then uses $(t-1)$ both as an argument to v_\theta and as a scalar step, which conflicts with the $[0,1]$ convention when $T=10$ (Appendix A.2). - “Monte-Carlo estimation” wording does not match the computation. TCRM’s instant reward comes from a deterministic one-step ODE projection (Eq. (9)), and $Q_t$ is summed along a single sampled trajectory (Eq. (10)); there is no explicit resampling of future randomness or variance estimate - ADAE’s non-collapse guarantee hinges on $\alpha=k\,\mathrm{std}(Q)$ but this dependency is not stated in the main text. Eq. (13) treats $\alpha$ as a hyper-parameter; the limit proof (App. B, Eqs. (15)–(19)) requires $\alpha\propto\mathrm{std}$. - Training schedule conflicts with the theory: $\alpha$ applied only in first five steps. Appendix A.2 notes $\alpha$ is enabled only for the first 5 sampling steps, which is inconsistent with the “$\sigma\to0$” limit argument that assumes the \alpha-rule always holds. - Definition/stability of $\omega_t$ is underspecified. Eq. (11) divides by the mean over $t$ of $Q_t$; behavior is unclear if rewards can be negative or the mean is near zero outside the reported non-negative settings (Appendix A.3). - Mismatch between training and evaluation steps (T=10 vs. 40) is not justified. The paper does not explain how $Q_t/\omega_t$ advantages transfer across different step counts (Appendix A.2). 1. Is $t$ in Eqs. (1)/(7) continuous and in Algorithm 1/Eq. (9) discrete? If so, should Eq. (9) use $\tau_{t}=t/T$ and $\hat{x}0=x{t-1}-\tau_{t-1}v_\theta(x_{t-1},\tau_{t-x1})$? The current $(t-1)$ conflicts with $[0,1]$ when $T=10$. (Eqs. (1),(7),(9); Alg. 1; App. A.2.) 2. Does Eq. (9) integrate from $t-1$ to $0$ in one Euler step (“remaining time” as the step length)? Please provide the derivation and either an error bound or an empirical bias analysis versus the true terminal state. (Eq. (9).) 3. Is $Q_t$ in Eq. (10) computed from a single rollout or with resampling around the same $x_{t-1}$? If single-rollout, what does “Monte-Carlo” refer to, and is $\mathrm{Var}[Q_t]$ tracked? (Text near Eq. (10); Alg. 1.) 4. The limit result (App. B) assumes $\alpha=k\,\mathrm{std}$ (Eq. (15)), while App. A.2 enables $\alpha$ only in the first five steps. What is the formal definition and time schedule of $\alpha$ in the main method, and how sensitive are results to $k$? (Eq. (13); App. A.2/B.) 5. Domain and robustness of $\omega_t$. With Eq. (11), how is $\omega_t$ handled if $Q_t$ can be negative or the mean across time is near zero? What assumptions on reward sign/scale are required? 6. Train–test step mismatch. With training $T=10$ and evaluation $T=40$, how do the learned $\omega_t$/$\hat{A}_t$ translate across different discretizations? Is any renormalization used?	Moderately AI-edited

PreviousPage 22 of 1516 (75800 total rows)Next