ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 2 (67%) 4.00 4.00 3317
Heavily AI-edited 0 (0%) N/A N/A N/A
Moderately AI-edited 1 (33%) 6.00 3.00 2276
Lightly AI-edited 0 (0%) N/A N/A N/A
Fully human-written 0 (0%) N/A N/A N/A
Total 3 (100%) 4.67 3.67 2970
Review 1

Title: BEYOND IMITATION: RECOVERING DENSE REWARDS FROM DEMONSTRATIONS
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes that supervised fine-tuning (SFT) can be viewed as a special case of inverse soft-Q learning under a deterministic token MDP with γ = 1 and a linear conjugate. Building on this, it defines a baseline-relative dense reward and introduces Dense-Path REINFORCE (DPR), a token-level REINFORCE update that uses log-probability differences between the SFT model and an earlier checkpoint as reward signals (a rough sketch of this reward follows the review). Experiments on AlpacaEval, MT-Bench, LIMA, and Arena-Hard show modest but consistent gains over SFT and competitive results with SPIN and GSIL.

Strengths:
1. The theoretical link between SFT and inverse RL is clear and well motivated.
2. The telescoping argument makes the equivalence intuitive under γ = 1.
3. DPR is simple, reproducible, and the dense-reward idea is easy to understand.
4. The ablations on γ, checkpoint choice, and baseline removal are informative and thoughtfully designed.

Weaknesses:
1. Narrow validity: the equivalence holds only for γ = 1, deterministic token transitions, and a linear conjugate. The paper does test γ < 1 and shows the expected degradation, but this confirms the limitation rather than extending the theory.
2. Inconsistent assumptions: the equivalence relies on a linear conjugate, while the later stability theorem assumes strong convexity. These are mathematically incompatible yet presented as part of one framework.
3. Evaluation design: DPR is trained and tested on the same prompt set, so gains could reflect continued fine-tuning rather than genuine reward recovery.
4. Limited robustness: the temperature ablation is minimal and does not analyze stochastic effects or variance.
5. Weak empirical evidence: all evaluations rely on GPT-4 judges without error bars, human checks, or multiple seeds; gains over SFT are small (a few percent).
6. Missing controls: there is no baseline comparing DPR to simply extending SFT training, so attribution is unclear.

Questions:
1. Can the same ψ be both linear (for equivalence) and strongly convex (for contraction)?
2. Why does the halfway checkpoint consistently give the best result?
3. Did you test multiple seeds or new prompts to confirm robustness?
4. Could a continued-SFT baseline match the reported gains?

EditLens Prediction: Fully AI-generated
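A minimal sketch of the baseline-relative dense reward and the γ = 1 REINFORCE update as this summary describes them, assuming policy-sampled sequences and Hugging Face-style causal-LM outputs; the function names, the `response_mask` convention, and the sampled-token KL penalty are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch (not the paper's code): dense reward r_t = log pi_SFT(a_t|s_t) - log pi_ref(a_t|s_t),
# gamma = 1 returns-to-go, and a REINFORCE update with a KL penalty toward the SFT model.
import torch
import torch.nn.functional as F

def token_log_probs(model, input_ids, attention_mask):
    """Per-token log-probs the model assigns to input_ids[:, 1:] under teacher forcing."""
    with torch.no_grad():
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    return logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)  # (batch, seq_len - 1)

def dpr_loss(policy, sft_model, ref_model, input_ids, attention_mask, response_mask, kl_coef=0.1):
    """input_ids are policy-sampled prompt+response tokens; response_mask has shape
    (batch, seq_len - 1), aligned with input_ids[:, 1:], and is 1 on response positions."""
    sft_logp = token_log_probs(sft_model, input_ids, attention_mask)
    ref_logp = token_log_probs(ref_model, input_ids, attention_mask)
    reward = (sft_logp - ref_logp) * response_mask              # baseline-relative dense reward
    # gamma = 1 return-to-go: G_t = sum over t' >= t of r_{t'}
    returns = torch.flip(torch.cumsum(torch.flip(reward, dims=[1]), dim=1), dims=[1])
    # Policy log-probs (with gradients) for the REINFORCE term.
    logits = policy(input_ids=input_ids, attention_mask=attention_mask).logits
    policy_logp = F.log_softmax(logits[:, :-1], dim=-1).gather(
        -1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    n_tokens = response_mask.sum().clamp(min=1)
    pg_loss = -(returns.detach() * policy_logp * response_mask).sum() / n_tokens
    # Sampled-token estimate of KL(pi_theta || pi_SFT), keeping the policy near the SFT model.
    kl = ((policy_logp - sft_logp) * response_mask).sum() / n_tokens
    return pg_loss + kl_coef * kl
```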
Review 2

Title: BEYOND IMITATION: RECOVERING DENSE REWARDS FROM DEMONSTRATIONS
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper reframes SFT under a token-MDP view (γ ≈ 1) and shows that the SFT objective is equivalent to an inverse soft-Q objective. This motivates extracting a dense proxy reward as a log-likelihood ratio between the final SFT model and a reference checkpoint, followed by a short, critic-free RL step (REINFORCE with a KL penalty toward the SFT policy). The method is closed-loop (no environment reward, no preference data, no reward model), simple to implement, and reports consistent gains over plain SFT across several backbones and evaluations.

Strengths:
- Dense credit assignment: per-token signals are actionable and address SFT's plateau on long sequences and intermediate steps.
- Low engineering overhead: uses existing SFT artifacts (final SFT model + reference checkpoint); no external judges or reward models.
- Theory ↔ practice alignment: the SFT ↔ inverse-soft-Q connection and the safe-improvement-style argument justify using the recovered dense signal for a small policy-gradient step.
- Stable optimizer: REINFORCE + KL to SFT (no critic) keeps the pipeline robust and easy to reproduce.

Weaknesses:
- Reference sensitivity: the approach assumes π_SFT is meaningfully better than π_ref. If the reference is too weak (noisy signal) or too strong (vanishing signal), the log-ratio reward becomes brittle or tiny.
- Log-ratio gaming: the policy may drift toward stylistic artifacts that inflate log π_SFT − log π_ref without improving task quality.
- Domain narrowness: because the SFT and reference models share training data, the proxy reward is intrinsically domain-tied; out-of-domain generalization of the dense signal is unclear.
- Eval reliance: gains are shown primarily via LLM-as-judge; stronger human or task-grounded metrics would strengthen the case.

Minor suggestions:
S1. Reference selection by evaluation, not step count:
- Choose π_ref via validation metrics (MT-Bench, a small human slice, a task-specific set) to target the "elbow" where the signal-to-noise ratio of the log-ratio is highest.
- Alternatively, use an EMA of SFT weights as π_ref to smooth noise.
S2. Multi-reference ensemble (see the sketch after this review):
- Define log π̄_ref = logsumexp_i(log π_ref,i) − log k, i.e., a uniform mixture over the reference checkpoints. The reward becomes r̂ = log π_SFT − log π̄_ref. This damps idiosyncrasies of any single checkpoint and makes "progress" less gameable.
S3. Dual-KL regularization and reward shaping to reduce gaming:
- Keep KL(π_θ || π_SFT) and add a small KL(π_θ || π_ref).
- Clip/normalize the per-token log-ratio (e.g., cap its magnitude or z-score it by position).
- Penalize tokens where both π_SFT and π_ref assign low probability (both uncertain) even if the difference is large.
- Add light style/fluency guards (repetition rate, perplexity bounds under a separate LM).
S4. Correlation-gated updates:
- On each batch, compute the correlation between the r̂-improvement and a cheap proxy (exact match on small QA, code unit tests, a math verifier). If the correlation drops below a threshold, reduce the step size or increase the KL weight. This is a simple "reward sanity check."
S5. Leverage the re-forward pass to incorporate useful rewards (when available), without changing the core method:
- Hybrid reward: use R = α·r̂ + (1−α)·r_ext, where r_ext can be any lightweight verifier signal (unit tests for code, an arithmetic checker, a safety filter, a format validator). α can be annealed from 1 to 0.8.
- Doubly-robust / token-aware AWR: advantage-weight the SFT tokens by r̂ (and r_ext if present), i.e., reweight the teacher-forced loss with w_t = exp(β·A_t) to unify imitation and RL in one pass.
- Counterfactual filtering: when the re-forward pass reveals contradictory beams (both low confidence), zero out r̂ for those tokens to avoid amplifying noise.
S6. Report a reference sweep and ablations:
- Show downstream metrics vs. reference placement (early/mid/late) to directly address whether π_SFT is actually better than π_ref.
- Include ablations for single-ref vs. multi-ref, with/without dual-KL, and with/without the correlation gate.

Questions:
Q1. How is the reference checkpoint chosen? Is it purely by training step or by validation metrics? What is the sensitivity curve (early/mid/late)?
Q2. Can an ensemble of references reduce variance/bias? (Pooling checkpoints into a smoother reference.)
Q3. How do you detect or mitigate reward hacking (odd outputs that maximize the log-ratio)?
Q4. What is the robustness out of domain (math/code/safety), where SFT confidence calibration differs?
Q5. Since the method already performs fresh forward passes, can those passes be leveraged to incorporate additional rewards or verifiers when available (see Suggestions)?

EditLens Prediction: Fully AI-generated
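A minimal sketch of suggestions S2 and S5, assuming per-token log-probs from the SFT model and from k reference checkpoints are already computed; the pooled reference follows the logsumexp formula in S2 (a uniform mixture over checkpoints), and the clipping threshold and α value are illustrative placeholders.

```python
# Illustrative sketch of S2/S5: pool k reference checkpoints into one smoother baseline
# (log pi_bar_ref = logsumexp_i log pi_ref_i - log k, a uniform mixture), clip the
# per-token log-ratio as in S3, and optionally blend in a lightweight verifier reward.
import math
import torch

def pooled_reference_logprobs(ref_logprobs: torch.Tensor) -> torch.Tensor:
    """ref_logprobs: (k, batch, seq) per-token log-probs from k reference checkpoints."""
    k = ref_logprobs.shape[0]
    return torch.logsumexp(ref_logprobs, dim=0) - math.log(k)

def dense_proxy_reward(sft_logprobs: torch.Tensor, ref_logprobs: torch.Tensor,
                       clip: float = 5.0) -> torch.Tensor:
    """r_hat = log pi_SFT - log pi_bar_ref, clipped per token to limit log-ratio gaming."""
    r_hat = sft_logprobs - pooled_reference_logprobs(ref_logprobs)
    return r_hat.clamp(-clip, clip)

def hybrid_reward(r_hat: torch.Tensor, r_ext: torch.Tensor, alpha: float = 0.9) -> torch.Tensor:
    """S5: R = alpha * r_hat + (1 - alpha) * r_ext, where r_ext is any cheap verifier signal."""
    return alpha * r_hat + (1.0 - alpha) * r_ext
```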
Review 3

Title: BEYOND IMITATION: RECOVERING DENSE REWARDS FROM DEMONSTRATIONS
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper introduces the novel idea that supervised fine-tuning (SFT) can be viewed as a special case of inverse Q-learning, suggesting that SFT does not merely imitate expert policies but implicitly learns a dense, token-level reward model. The authors then recover this implicit reward from the SFT model and propose Dense-Path REINFORCE (DPR), which leverages the recovered reward to optimize large language models (LLMs) more efficiently. Experimental results demonstrate that the proposed method outperforms standard SFT across multiple benchmarks.

Strengths:
1. The paper presents the novel idea that SFT can be viewed as a special case of inverse Q-learning, offering a new perspective for understanding SFT.
2. The authors provide a comprehensive theoretical analysis to support the proposed formulation.
3. Experimental results show considerable improvements over traditional LLM training methods.

Weaknesses:
1. The theoretical analysis makes several strong assumptions about the setting, for example, a deterministic token sequence and a fixed discount factor of $\gamma = 1$ rather than a value smaller than 1, as is typical in RL.
2. In the proposed DPR method, the reference policy $\pi_\text{ref}$ is not formally defined. The authors state that it is an SFT checkpoint trained on half of the training samples; however, if the dataset is sufficiently large, wouldn't this reference policy also become fully trained, thereby reducing the meaningful difference between $\pi_\text{ref}$ and $\pi_\text{SFT}$?

Questions:
1. The reviewer is not an expert in LLMs, but I question whether the current training paradigm of LLMs is primarily driven by SFT. Wouldn't RLHF have a greater overall impact on model alignment and performance? I therefore have some concerns about the potential contribution and significance of this work to the broader LLM literature.
2. If SFT is theoretically equivalent to IQL, would it be possible to apply IQL methods directly to learn the reward function instead of recovering the reward from SFT?
3. The authors choose REINFORCE for policy optimization. Could the authors clarify why this choice was made instead of using more advanced RL algorithms, such as actor-critic or PPO-based methods?

EditLens Prediction: Moderately AI-edited