ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars)
---------------------|----------|------------|----------------|-------------------
Fully AI-generated   | 1 (25%)  | 4.00       | 4.00           | 4525
Heavily AI-edited    | 0 (0%)   | N/A        | N/A            | N/A
Moderately AI-edited | 0 (0%)   | N/A        | N/A            | N/A
Lightly AI-edited    | 1 (25%)  | 4.00       | 3.00           | 3575
Fully human-written  | 2 (50%)  | 4.00       | 3.50           | 2324
Total                | 4 (100%) | 4.00       | 3.50           | 3187
Individual Reviews
Review 1
Title: Co-Evolving Agents: Learning from Failures as Hard Negatives
Soundness: 2 (fair) | Presentation: 3 (good) | Contribution: 2 (fair)
Rating: 4 (marginally below the acceptance threshold) | Confidence: 4 (confident, but not absolutely certain)
EditLens Prediction: Fully AI-generated

Summary:
The paper proposes a co-evolving framework with two policies: a target agent trained via preference optimization and a specialized failure agent trained on failure-vs-failure preference pairs to synthesize hard negatives (near-success but still failing trajectories). By alternating updates, the failure agent continually produces informative negatives that sharpen the target agent's decision boundary. Experiments on WebShop, ScienceWorld, and InterCodeSQL report consistent gains over common preference/RL baselines (notably on unseen splits), with qualitative analyses suggesting the method increases the density and quality of "useful failures."

Strengths:
- Clear, intuitive idea: Elevating "failures" into structured supervision via failure-vs-failure preferences is conceptually neat and practically motivated.
- Co-evolution design: Alternating updates between a failure generator and a target learner is a simple mechanism to keep training signals challenging and fresh.
- Analyses beyond toplines: The paper examines failure quality/quantity and includes ablations (e.g., replacing the failure agent with a standard "positive" agent), which support the central claim that specializing on failures matters.

Weaknesses:
- Limited novelty in the core idea: The framework builds heavily on existing methods like ETO (Exploration-Based Trajectory Optimization) and DPO, primarily adding a "failure agent" for hard negatives. While this is an incremental improvement, it may not be sufficiently novel for ICLR, as similar concepts (e.g., negative agents in multi-agent systems or hard negatives in contrastive learning) are referenced in related work but not deeply differentiated.
- Baselines are not strong enough as configured: Prior work shows that RFT (or related preference-optimization variants) can be substantially stronger when rollouts are multi-sampled and DART-style data selection is applied (e.g., sample several candidates per prompt/step and keep the most informative pairs). In practice, "RFT + multi-sample + DART selection" can outperform ETO; using only ETO (or lightly tuned RFT) as the strongest baseline likely understates what a robust, compute-matched preference pipeline can do. The paper cites Iterative step-level Process Refinement (IPR) but does not provide a compute-matched, carefully tuned head-to-head comparison. Given IPR's strong step-level refinement on agentic tasks, a direct comparison (same models, prompts, sampling budget, and filtering strategy) is necessary to substantiate superiority.
- Computational overhead: Training two co-evolving agents alternately requires significant resources (e.g., 8 NVIDIA H100 GPUs for the experiments).
- Overfitting risks in the failure agent: The failure agent is trained solely on failure trajectories, which could lead to overfitting to specific failure modes rather than exploring a broad "failure landscape." The paper claims it generates hard negatives, but the empirical evidence is limited to one qualitative example and basic quantitative statistics (e.g., Figure 2), without deeper analysis such as diversity metrics or an ablation on failure-pair construction.
- Imbalanced training signals: The target agent's loss combines DPO and SFT with fixed weights (e.g., λ_DPO = 0.5, λ_SFT = 0.5 for expert pairs), but the justification for these hyperparameters is weak. As noted (citing Yuan et al., 2025a), DPO can be unstable due to imbalance between chosen and rejected trajectories, yet no sensitivity analysis or alternatives (e.g., other preference methods such as GRPO) are explored.

Questions:
- Novelty and differentiation from prior work: The failure-agent concept builds on ideas like hard negatives from contrastive learning (e.g., Robinson et al., 2021) and negative agents in multi-agent systems (e.g., Zhang et al., 2024). What are the key theoretical or empirical differences that make your co-evolutionary approach distinct from these, beyond the alternating training phases? For instance, does it provably converge or provide guarantees on generating harder negatives over iterations?
- Failure agent behavior: The qualitative example in Section 5.2.1 shows the failure agent generating more structured near-success failures. However, the quantitative analysis (Figure 2) is limited to trajectory counts and average rewards. Have you evaluated diversity metrics (e.g., trajectory entropy or edit distance) to confirm that the failure agent explores a broad failure landscape rather than overfitting to specific modes? What prevents the failure agent from collapsing into trivial or repetitive failures during co-evolution? A minimal sketch of such a diversity check is given after this review.
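To make the diversity request above concrete, here is a minimal, hypothetical sketch (not taken from the paper) of how failure-trajectory diversity could be measured; trajectories are assumed to be lists of action strings, and all names are placeholders.

```python
# Hypothetical sketch: mean pairwise edit distance and action-level entropy
# over a batch of failure trajectories (action-string sequences are assumed).
from collections import Counter
from itertools import combinations
import math

def edit_distance(a, b):
    """Levenshtein distance between two action sequences (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def diversity_report(trajectories):
    """Summarize how spread out a set of trajectories is."""
    pairs = list(combinations(trajectories, 2))
    mean_dist = sum(edit_distance(a, b) for a, b in pairs) / max(len(pairs), 1)
    counts = Counter(action for traj in trajectories for action in traj)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {"mean_pairwise_edit_distance": mean_dist, "action_entropy_bits": entropy}

# Toy WebShop-style failure trajectories for illustration only:
failures = [
    ["search[red shoes]", "click[item_1]", "click[buy]"],
    ["search[red shoes]", "click[item_2]", "click[buy]"],
    ["search[blue shoes]", "click[item_1]", "click[back]", "click[buy]"],
]
print(diversity_report(failures))
```

Tracking these two numbers across co-evolution iterations would directly address whether the failure agent collapses onto a few repetitive failure modes.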
Review 2
Title: Co-Evolving Agents: Learning from Failures as Hard Negatives
Soundness: 3 (good) | Presentation: 3 (good) | Contribution: 3 (good)
Rating: 4 (marginally below the acceptance threshold) | Confidence: 4 (confident, but not absolutely certain)
EditLens Prediction: Fully human-written

Summary:
This paper proposes a co-evolution framework with two agents: a target agent that is ultimately optimized for performance and a dedicated failure agent that learns a fine-grained model of the failure landscape. Both agents are initialized via behavior cloning, then independently tuned with SFT before co-evolution. The failure agent is trained with DPO on preference data obtained from failure trajectories generated by itself and the target agent. Once the failure agent reliably produces hard negatives, the target agent is updated using a weighted DPO objective on those negatives alongside SFT on expert-prediction and failure-failure pairs, plus an auxiliary SG loss to stabilize learning and anchor high-reward behaviors. The core idea is to use noisy interaction data as structured supervision that sharpens decision boundaries, and the results demonstrate improvements over baselines.

Strengths:
1. The paper is clearly written and provides strong motivation for converting interaction data into structured supervision, using failure trajectories to learn strong decision boundaries. The co-evolution learning process, which uses hard negatives that are near-success failures, promotes sharper decision boundaries and finer discrimination.
2. The paper includes an extensive evaluation across different tasks. The comparisons with baseline methods demonstrate the impressive performance of the proposed framework. The ablation studies further highlight the importance of each module in the method.
3. The problem formulation of a co-evolution learning process that exploits hard negatives is an interesting idea and could serve as a foundation for future research.

Weaknesses:
1. Hard negatives: It is unclear what "hard negative" means quantitatively. They are defined as trajectories that are closer to success but still unsuccessful; however, it is unclear how "close to success" is quantified. Is it based on a reward threshold? (A minimal sketch of one such thresholding scheme is given after this review.)
2. One of the claims of the paper is that the limited number of expert trajectories results in overfitting. However, no statistics on the number of successful trajectories versus generated hard-negative trajectories are provided, so it is hard to judge how much the co-evolution actually creates a balance of near-success, success, and failure trajectories, and how these numbers impact overall performance. It therefore remains unclear how data balance affects performance and the overfitting claims.
3. The qualitative analysis section mentions that generating hard negatives makes the agent capture subskills such as navigation, object manipulation, and device control; however, no qualitative results on any of the benchmarks demonstrate this. This is a critical part of the paper, as it would show the importance of using hard negatives for co-evolution.
4. Further, no qualitative comparisons among expert, hard-negative, and generated/predicted trajectories demonstrate the "near-success" informativeness.
5. Missing baseline: DPO without co-evolution, to isolate the contribution of the failure agent.
6. Supplementary material appears incomplete; more implementation details and dataset/task specs should be included.

Questions:
Please see the weaknesses above.
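As an illustration of the question raised in weakness 1, here is a hypothetical sketch (not the paper's actual definition) of how "close to success" could be operationalized with a simple reward threshold when mining hard negatives from rollouts; the Trajectory type and the threshold value are assumptions.

```python
# Hypothetical hard-negative selection by reward threshold (illustrative only).
from dataclasses import dataclass
from typing import List

@dataclass
class Trajectory:
    actions: List[str]
    reward: float   # task reward, assumed to lie in [0, 1]
    success: bool   # environment-reported success flag

def select_hard_negatives(trajs: List[Trajectory], tau: float = 0.5) -> List[Trajectory]:
    """Keep failed trajectories whose reward is at least tau: failures that came close."""
    return [t for t in trajs if not t.success and t.reward >= tau]

rollouts = [
    Trajectory(["search[thermometer]", "click[buy]"], reward=0.75, success=False),  # hard negative
    Trajectory(["look around"], reward=0.0, success=False),                          # trivial failure
    Trajectory(["search[thermometer]", "click[item_1]", "click[buy]"], 1.0, True),   # success
]
print(len(select_hard_negatives(rollouts)))  # -> 1
```

Reporting the chosen threshold (or whatever criterion the paper actually uses) and the resulting counts per benchmark would resolve this ambiguity.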
Review 3
Title: Co-Evolving Agents: Learning from Failures as Hard Negatives
Soundness: 2 (fair) | Presentation: 2 (fair) | Contribution: 2 (fair)
Rating: 4 (marginally below the acceptance threshold) | Confidence: 3 (fairly confident; math and other details were not carefully checked)
EditLens Prediction: Lightly AI-edited

Summary:
This paper introduces a co-evolving agents framework designed to enhance the learning of self-improving language model agents by leveraging failure trajectories as informative training signals. The key idea is to pair a target agent with a failure agent that learns to generate hard negatives, i.e., failed trajectories that are close to success. Unlike existing self-improving frameworks such as Exploration-Based Trajectory Optimization (ETO), which risk overfitting due to limited ground-truth supervision, the failure agent here continuously learns from both its own and the target's failure trajectories through Direct Preference Optimization (DPO). These refined hard negatives are incorporated into the target agent's preference optimization, sharpening decision boundaries and improving generalization. The authors demonstrate their method on the WebShop, ScienceWorld, and InterCodeSQL benchmarks. Their results show improved performance over baselines like ETO, particularly a significant +10.8 point gain in generalization on unseen ScienceWorld tasks.

Strengths:
1. The idea of a failure agent that co-evolves with the main model is novel and well motivated.
2. The theoretical setup (POMDP formalization, DPO-based optimization) is rigorous, and the experimental evidence convincingly supports the claims.
3. The approach directly addresses a key limitation of current self-improving LLM agents (overfitting to limited expert data) by leveraging a sustainable, self-generated source of supervision. Results across multiple domains, along with both quantitative and qualitative failure analysis, strengthen credibility.

Weaknesses:
1. The method relies on a multi-stage process (SFT → failure-agent DPO on failure-failure pairs → target-agent DPO+SFT on mixed pairs) executed in alternating iterations, but Figure 1 does not illustrate this flow or the data-construction steps, making Section 4 harder to follow at a glance. (A minimal sketch of the alternating loop, as I understand it, is given after this review.)
2. The most significant weakness is in Section 5.2.1. The text states it will provide a qualitative comparison of failure trajectories, describing the ETO baseline's "degenerate failure" and contrasting it with the "Ours" method's "more structured failure trajectories." However, the example provided for "Ours" is explicitly labeled as a success (Reward: 0.75, Outcome: Success). This example fails to illustrate what a "hard negative failure" from the proposed method actually looks like and directly contradicts the text's stated purpose.
3. The InterCodeSQL results show no improvement, which the authors attribute to sparse rewards. This suggests the method's applicability may be limited to environments with dense reward signals.
4. Appendix Section C appears incomplete.

Questions:
1. Training two agents alternately is presumably more expensive than training one. How much does this increase training time and compute requirements? Is the performance gain worth the added cost?
2. What prevents the failure agent from collapsing to a non-productive policy? For example, what stops it from becoming identical to the target agent and only learning to succeed, or, conversely, from learning a trivial policy that always fails with a reward of 0? How does it maintain the "near-success" decision boundaries while generating a larger pool of failures?
3. Could you provide more detail on the training dynamics in Section 4.3? Specifically, when constructing the target agent's dataset $D_{tgt}$, how are the three types of preference pairs, (expert, target-failure), (expert, failure-agent-failure), and failure-failure pairs, sampled and balanced during training?
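Since the training flow in weakness 1 is the main point of confusion, here is a hypothetical pseudocode sketch of the alternating pipeline as I read it from the summary (a reviewer's reconstruction, not the authors' code; the Agent stub, the 0.5 loss weights, and every name are placeholders).

```python
# Hypothetical sketch of the alternating co-evolution loop (placeholder Agent interface).
import random
from dataclasses import dataclass

@dataclass
class Traj:
    reward: float
    success: bool

class Agent:
    """Placeholder agent: real SFT/DPO updates and environment rollouts go here."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
    def sft(self, data): pass
    def dpo(self, pairs): pass
    def dpo_plus_sft(self, pairs, expert, lambda_dpo=0.5, lambda_sft=0.5): pass
    def rollout(self, n=8):
        return [Traj(self.rng.random(), self.rng.random() > 0.7) for _ in range(n)]

def co_evolve(target, failure, expert, iterations=3):
    target.sft(expert)   # Stage 0: behavior cloning / SFT for both agents
    failure.sft(expert)
    for _ in range(iterations):
        # Stage 1: collect rollouts from both agents, keep only the failures
        fails = [t for ag in (target, failure) for t in ag.rollout() if not t.success]
        # Stage 2: failure agent does DPO on failure-vs-failure pairs,
        # preferring the higher-reward ("closer to success") failure
        failure.dpo([(a, b) for a in fails for b in fails if a.reward > b.reward])
        # Stage 3: the failure agent's fresh failures serve as hard negatives for the
        # target agent's weighted DPO (expert preferred) plus SFT update
        hard_negs = [t for t in failure.rollout() if not t.success]
        target.dpo_plus_sft([(e, n) for e in expert for n in hard_negs], expert)
    return target

co_evolve(Agent(0), Agent(1), expert=[Traj(1.0, True)])
```

A figure at roughly this level of granularity, with the pair-construction step made explicit, would make Section 4 much easier to follow.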
Review 4
Title: Co-Evolving Agents: Learning from Failures as Hard Negatives
Soundness: 3 (good) | Presentation: 3 (good) | Contribution: 2 (fair)
Rating: 4 (marginally below the acceptance threshold) | Confidence: 3 (fairly confident; math and other details were not carefully checked)
EditLens Prediction: Fully human-written

Summary:
To address overfitting in self-improving agents with scarce data, the authors propose a novel co-evolving agents framework. The target agent is improved jointly with an auxiliary failure agent that uses preference optimization to generate informative hard negatives. Incorporating these hard negatives sharpens decision boundaries, significantly enhancing generalization and performance.

Strengths:
1. Well motivated: because expert data is scarce, the authors improve the target agent with a failure agent that generates hard negative samples.
2. The improvement appears significant: results show roughly a 5% average improvement over the best baseline.

Weaknesses:
1. The current experimental setup appears limited. The evaluation is conducted solely on the Llama-2-7B architecture rather than more recent model architectures (e.g., Llama-3, Qwen-3) or larger model scales.
2. The current ablation study lacks a critical control experiment. The proposed co-evolving framework inherently doubles the number of trainable parameters (target agent + failure agent) compared to a standard single-agent baseline (e.g., a 7B model). The authors must conduct an ablation to isolate whether the observed performance gains are due to the co-evolutionary mechanism itself or simply to the increased parameter count. Specifically, an experiment comparing the following two conditions is necessary:
   - a 7B target agent + 7B failure agent, and
   - a single 14B model trained with standard SFT or RFT.

Questions:
See weaknesses.