ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (25%) | 4.00 | 3.00 | 4273 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 3 (75%) | 4.00 | 3.33 | 2838 |
| Total | 4 (100%) | 4.00 | 3.25 | 3196 |
Individual Reviews
Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper tackles covariate shift for multi-turn LM agents on SWE-bench, proposing On-policy Expert Corrections (OEC): roll out the student on-policy, switch to an expert model at a randomly sampled turn, keep only the expert segment (student tokens are masked during SFT), and filter trajectories via unit-test verification and simple repetition heuristics. Claimed contributions are: (i) an analysis showing student–expert divergence grows with depth; (ii) the OEC data-generation recipe (random switch + masking + verifier filtering); (iii) improved pass@1 over behavioral cloning (BC) and fully on-policy data under their setup; and (iv) ablations isolating the importance of masking and repetition filtering. Key evidence comes from scaling curves (OEC > BC > on-policy), covariate-shift diagnostics across turns, and controlled ablations on filtering/masking. (A minimal code sketch of this recipe is given after this review.)

Strengths:
- The paper empirically quantifies turn-wise divergence between student and expert, motivating partial on-policy corrections.
- OEC is simple to implement (random switch, mask student tokens, verifier-gated acceptance) and yields consistent gains over BC and fully on-policy data under the stated setup.
- The study isolates two actionable levers—on-policy masking and repetition filtering—and shows they materially affect results, especially at larger model scales.
- Findings like “later switches help” and “unit-test filtering alone is insufficient without anti-loop filters” offer concrete guidance to practitioners training SWE agents.

Weaknesses:
1. **Limited novelty relative to established interactive IL.** **Problem:** OEC’s core idea—on-policy rollouts with expert corrections—substantially overlaps with DAgger-style data aggregation and related learning-to-search methods; the paper’s unique elements are mainly engineering choices (random switch policy and masking) rather than a new principle. At ICLR, novelty beyond well-known IL frameworks is expected; otherwise, significance hinges on the breadth and rigor of validation. **Action:** Clearly position OEC against DAgger/LOLS/AggreVaTeD, including a small theoretical or synthetic analysis of the bias introduced by masking/truncation and the effect of the switch-time distribution; add a controlled comparison to a **DAgger-like** baseline adapted to the same scaffold.
2. **SOTA framing is ambiguous.** **Problem:** The paper claims or suggests “SOTA within class,” but the comparison set and constraints (backbone, scaffold, verifier usage, inference-time scaling) are not crisply defined, and some numbers are very close to strong baselines (Table 1). Without a precise protocol, improvements may reflect differences in test-time scaling or data collection rather than the training recipe. **Action:** Define the comparison “class” explicitly; if SOTA is not supported under strict parity (same scaffold, retries, timeouts, verifier configs), rephrase to “competitive.”
3. **Possible confounds from distribution differences.** **Problem:** Some OEC points in the scaling plots appear to mix slightly different instance distributions compared to BC/on-policy, so the gains could partially come from the instance mix rather than the method. **Action:** Re-run scaling where all regimes are collected on the same fixed instance set.

Questions:
1. Please define precisely what “state-of-the-art in their respective classes” means (backbone size, scaffold, verifier rules, retries/timeouts, inference-time scaling). Under those constraints, is OEC actually SOTA?
2. How many evaluation retries, max turns, and unit-test limits were used across methods? Were they identical for all baselines?
3. Can you report mean ± std over ≥3 seeds for all key numbers and curves?
4. What is the instance overlap (if any) among expert demos used for pretraining, OEC/BC/on-policy training data, and evaluation splits?
5. Please provide a controlled comparison to a **DAgger-like baseline** adapted to your scaffold, and discuss how masking/truncation differs theoretically from classic dataset aggregation.
6. Could you release the exact repetition-filtering heuristics (regexes/thresholds) and masking implementation code?
7. Have you tested OEC beyond SWE-bench (e.g., SWE-bench-Live or a non-SWE agent domain)? If not, can you discuss expected limitations?

EditLens Prediction: Fully AI-generated
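For concreteness, here is a minimal sketch of the OEC data-collection recipe as summarized in the review above: roll out the student on-policy, hand control to the expert at a uniformly sampled turn, mark only the expert segment as trainable, and gate acceptance on the verifier. The `student`, `expert`, `env`, `run_unit_tests`, and `is_repetitive` interfaces are hypothetical stand-ins for this sketch, not the authors' code.

```python
import random

def collect_oec_trajectory(student, expert, env, run_unit_tests, is_repetitive,
                           max_turns=50):
    """Sketch of OEC data collection: the student acts on-policy until a
    uniformly sampled switch turn, then the expert finishes the episode.
    Only expert-generated turns are marked as training targets."""
    switch_turn = random.randint(1, max_turns - 1)  # uniform switch time
    obs = env.reset()
    trajectory = []
    for turn in range(max_turns):
        policy = expert if turn >= switch_turn else student
        action = policy.act(obs)
        trajectory.append({
            "obs": obs,
            "action": action,
            "train_on": turn >= switch_turn,  # student tokens are masked in SFT
        })
        obs, done = env.step(action)
        if done:
            break
    # Verifier-gated acceptance: keep only trajectories that pass the unit
    # tests and a simple anti-repetition check (both passed in as callables).
    if run_unit_tests(env) and not is_repetitive(trajectory):
        return trajectory
    return None
```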
Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper introduces On-Policy Expert Corrections (OECs) for LLM imitation learning, collecting expert correction data by rolling out student trajectories and switching to the expert midway. This allows training on on-policy states while retaining expert supervision. Experiments on software engineering tasks show that OECs outperform both behavioral cloning and learning on fully on-policy data.

Strengths:
1. The experiments are performed on the newly released SWE-bench benchmark, outperforming the baselines in both the 7B and 32B settings.
2. The OEC framework is compatible with any combination of student and expert language models.

Weaknesses:
1. **Switching to the LLM expert at a random step is an inefficient and outdated way** to collect online correction data. A better method is to let the expert take over when the student is about to make a mistake, as in LEAP [1], or to analyze a complete failure rollout to identify the error steps and provide feedback, as in LFM [2]. (A minimal sketch of this uncertainty-triggered alternative is given after this review.)
2. Apart from corrective actions, **LLM experts can also provide reward signals and trajectory preference pairs**. Agent-RLVR [3] uses the expert to generate a corrected trajectory t' for a failed rollout t and performs preference learning over (t, t'), instead of doing behavioral cloning on t'. IRPO [4] and DMPO [5] use LLM-provided reward models to finetune the student via preference learning. In contrast, **OEC discards the student's portion and only trains on the expert portion of each trajectory**, which may lower training efficiency.
3. The baselines are insufficient. In the paper, OEC is **only compared against behavioral cloning and on-policy trajectories**. Several prior works [1–5] share a similar teacher–student learning setting (especially [3]), but the paper does not demonstrate that OEC achieves better sample efficiency or faster training compared to them.
4. The OEC algorithm is **essentially a variant of DAgger with limited novelty**. DAgger (Ross et al., 2010) has many variants in robotics and RL, including EnsembleDAgger (Menda et al., 2018) and ThriftyDAgger (Hoque et al., 2021). These variants involve expert interventions in a much more interactive way, allowing the expert to take over when the agent is uncertain or about to fail. OEC's sample complexity might be reduced by using such DAgger variants to collect correction trajectories.

References:
[1] Better than Your Teacher: LLM Agents that Learn from Privileged AI Feedback, S. Choudhury et al., 2024
[2] Policy Improvement using Language Feedback Models, V. Zhong et al., 2024
[3] Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards, J. Da et al., 2025
[4] Iterative Reasoning Preference Optimization, R. Y. Pang et al., 2024
[5] Direct Multi-Turn Preference Optimization for Language Agents, W. Shi et al., 2024

Questions:
1. Can OEC be deployed to other tasks? Does OEC still outperform the baselines on tasks other than SWE-bench? Why does the paper choose this task?
2. Can you further explain the "on-policy" baseline? When the trajectories are entirely rolled out by the student model, how do you use the expert model for supervised fine-tuning?

EditLens Prediction: Fully human-written
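To make weakness 1 concrete, here is a minimal sketch of the uncertainty-triggered takeover the reviewer points to (in the spirit of DAgger variants such as ThriftyDAgger), assuming a hypothetical per-turn `uncertainty` score on the student. This is the reviewer's suggested alternative, not the paper's method.

```python
def collect_with_intervention(student, expert, env, threshold=0.7, max_turns=50):
    """The expert takes over when the student's per-turn uncertainty exceeds a
    threshold, instead of at a uniformly random switch time as in OEC."""
    obs = env.reset()
    trajectory = []
    expert_in_control = False
    for _ in range(max_turns):
        if not expert_in_control and student.uncertainty(obs) > threshold:
            expert_in_control = True  # hand over for the rest of the episode
        policy = expert if expert_in_control else student
        action = policy.act(obs)
        trajectory.append({
            "obs": obs,
            "action": action,
            "train_on": expert_in_control,  # train only on expert turns
        })
        obs, done = env.step(action)
        if done:
            break
    return trajectory
```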
Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections

Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper introduces a method, termed `on-policy expert corrections (OECs)`, to improve imitation learning (i.e., finetuning on expert trajectories) for LLMs. The main problem they are trying to address is covariate shift in standard behavior cloning. OEC is a data-collection scheme where on-policy data from the learner is collected until a random timestep for each trajectory, and then the expert policy is used to collect data for the rest of the trajectory. They conduct experiments on SWE tasks and show that a combination of standard behavior cloning and learning from OEC data improves over standard behavior cloning alone.

Strengths:
The paper identifies an important problem that hasn’t been addressed yet (covariate shift in behavior cloning for LLMs), and proposes a well-motivated and easy-to-understand method.

The experiment design is appropriate for the claims in the paper, and there’s evidence that (i) there is covariate shift between the learner’s and the expert’s trajectory distributions, and (ii) OECs improve performance on the SWE task for 7B and 32B models derived from the Qwen2.5-Coder-Instruct series.

They include additional experiments (i) showing that masking the loss on the learner’s part of the trajectory and filtering out repetitive trajectories are important for good performance (a minimal sketch of such loss masking is given after this review), and (ii) showing an LLM-as-a-judge analysis of failure modes of the models trained with different strategies, which is great to have.

The presentation is very clear.

Weaknesses:
The paper introduces OEC as a novel data-generation method, but this strategy of learning from trajectories where we use on-policy behavior until a certain timestep and then switch to the expert policy for the rest of the episode has been studied before. For example, this family of strategies is the focus of “Learning to Search Better than Your Teacher” by Chang et al. at ICML 2015. I think the paper should cite and discuss previous work on OEC in the related works section, and appropriately frame its contribution (e.g., bringing this idea to the setting of behavior cloning for LLMs).

A significant issue with the experiments is that they are performed with a single seed, so there are no error bars. There is past work showing significant variation in LLM performance across fine-tuning runs [1, 2, 3], which makes the results suspect. I understand that single-seed results are somewhat the norm in the field due to computational costs, but I think the paper should at least acknowledge this limitation.

[1] Assessing the Macro and Micro Effects of Random Seeds on Fine-Tuning Large Language Models, by Zhou, Savova, and Wang
[2] Measuring the Instability of Fine-Tuning, by Du and Nguyen
[3] On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines, by Mosbach, Andriushchenko, and Klakow

Finally, it seems like the best results in the experiments are obtained by a combination of behavior cloning on expert data and OEC data, but the paper doesn’t suggest a strategy for choosing the ratio of the two data sources. Looking at Figure 3, it seems that including a small amount of OEC data could even degrade performance in some cases.

EditLens Prediction: Fully human-written
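As a pointer for readers, the loss masking this review highlights is typically implemented by setting the label of every non-expert (student or prompt) token to the cross-entropy `ignore_index`. Below is a minimal sketch assuming a PyTorch causal-LM SFT pipeline; the helper name and the simplified tokenization are illustrative, not the authors' implementation.

```python
import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_sft_labels(token_ids, expert_token_mask):
    """Build labels for causal-LM SFT where only expert-generated tokens
    contribute to the loss; student and prompt tokens are masked out.

    token_ids:         LongTensor of shape (seq_len,)
    expert_token_mask: BoolTensor of shape (seq_len,), True for expert tokens
    """
    labels = token_ids.clone()
    labels[~expert_token_mask] = IGNORE_INDEX
    return labels

# Example: a 6-token sequence where only the last 3 tokens came from the expert.
token_ids = torch.tensor([11, 42, 7, 99, 13, 5])
expert_token_mask = torch.tensor([False, False, False, True, True, True])
labels = build_sft_labels(token_ids, expert_token_mask)
# labels == tensor([-100, -100, -100, 99, 13, 5])
```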
Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper introduces an approach to finetune existing LLMs with on-policy expert corrections. The idea is to use a combination of on-policy and expert samples to construct trajectories, and to update the model only on the expert portion of each trajectory. This helps mitigate the effect of covariate shift and also stabilizes training compared to purely on-policy samples. Experiments are conducted on software engineering tasks with improvements over prior works. The method is a simple variation of online imitation learning and shows good empirical results. The analysis of the covariate shift between student and expert policies offers an interesting view of the problem.

Strengths:
* The idea behind the approach is simple and easy to follow.
* Empirical results show improvements over prior works on an open benchmark.
* The analysis of covariate shift is nicely done.

Weaknesses:
* The authors mention the potential of using RL with verifier rewards, but do not compare with any RL baselines.
* Based on the results, later switching improves the model more, yet the method uses uniform sampling to determine the switch time. This seems somewhat contradictory.
* The authors mention that the no-regret guarantee of DAgger is violated by using OECs, but do not provide any additional theoretical insights.
* There is no explanation of the results in Table 3, so I am a bit confused about the purpose of this experiment.
* Minor issue with the y-axis in Figure 2(a), which shows a 0.17% resolution rate (I believe this is a plotting mistake?).

Questions:
* What is the rejection-sampling-with-supervised-finetuning procedure? Is it simply using the successful trajectories, as judged by the verifier, for supervised training?
* What is the reasoning behind sampling the switch time from a uniform distribution rather than another distribution, e.g., a geometric distribution, given that the experiments show switching later is better? (See the sketch after this review for one way to bias switches later.)
* How does the method compare to DAgger?
* What does finetuning the student model from scratch mean? Do you initialize the weights randomly?
* Why is BC data added only in the 32B setting and not the 7B setting? Why is BC data important in one setting but not the other?
* Are there any theoretical guarantees in terms of regret bounds for this algorithm?

EditLens Prediction: Fully human-written
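On the switch-time question above, here is a minimal sketch of how the uniform switch distribution could be swapped for one biased toward later turns. The reviewer mentions a geometric distribution; the Beta(late_bias, 1) parameterization used here is purely illustrative and not from the paper.

```python
import random

def sample_switch_turn(max_turns, mode="uniform", late_bias=3.0):
    """Sample the turn at which control passes from student to expert."""
    if mode == "uniform":
        # Uniform over turns, as the paper reportedly uses.
        return random.randint(1, max_turns - 1)
    # Late-biased alternative: Beta(late_bias, 1) puts more mass near 1.0,
    # so switches tend to happen closer to the end of the episode.
    frac = random.betavariate(late_bias, 1.0)
    return min(max_turns - 1, max(1, round(frac * max_turns)))
```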