ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 2 (50%) | 5.00 | 3.00 | 2896 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 2 (50%) | 3.00 | 4.50 | 3056 |
| Fully human-written | 0 (0%) | N/A | N/A | N/A |
| Total | 4 (100%) | 4.00 | 3.75 | 2976 |
Forging Better Rewards: A Multi-Agent LLM Framework for Automated Reward Evolution

Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper introduces FORGE (Feedback-Optimized Reward Generation and Evolution), a multi-agent framework that automates reward synthesis for reinforcement learning using large language models (LLMs). FORGE replaces traditional genetic algorithm encodings with LLM-guided crossover, uses planner-based zero-shot initialization to generate structured reward functions, and introduces a depth metric to quantify reward complexity over evolutionary iterations. A reward pool is maintained as memory to manage and refine candidates. Experiments are conducted across four environments: three games (Tetris, Snake, Flappy Bird) and a continuous-control task (MuJoCo Humanoid), showing performance improvements over prior methods such as Eureka and REvolve, while claiming enhanced stability, interpretability, and token efficiency.

Strengths:
1.1 The paper compares different LLMs and evaluates on multiple environments.
1.2 The experiments are broad and include several baselines.
1.3 The depth measurement is a welcome addition, giving a way to track reward composition complexity.

Weaknesses:
2.1 The comparison with Eureka is unfair. Eureka is greedy (it keeps only the best reward per generation), not population-based. Plotting average population scores for FORGE and REvolve, but not the best-per-generation score for Eureka, makes the figure incomparable. A fair comparison would show the maximum score for each generation across all methods.
2.2 The REvolve baseline results are incorrect. The authors state that they use the environment's extrinsic rewards as the fitness score. However, in Fig. 2, the population average for REvolve decreases over generations, contradicting REvolve's framework, which adds individuals only if their fitness score exceeds the current population average. This guarantees that the average never decreases (Section 3.4 of the REvolve paper); see the sketch after this review.
2.3 The table results for the Humanoid task do not match the video output. The task is to move as fast as possible along the x-axis without falling. REvolve's humanoid performs this correctly and visibly better, while FORGE's agent moves much worse despite achieving a higher reward. The results and videos are therefore inconsistent.
2.4 The claimed "stability", "interpretability", and "token efficiency" are not supported. Figure 2 shows dips even for FORGE, so it is not stable. "Interpretability" is only a depth-vs-score correlation and does not explain why rewards work. "Token efficiency" is argued but not tested.
2.5 The authors claim that passing only the reward function and best score is sufficient and that raw metrics are not beneficial. Metrics can indicate which reward components failed and guide improvement in the next generation. REvolve showed that higher-quality feedback leads to better performance. No results are shown to support this claim.
2.6 Lines 217-219: the authors say "to address these limitations we generalize the evolutionary process by incorporating LLM inference…". This is presented as if it were new, which it is not (T2R and Eureka came first).
2.7 It is unclear to me what exact rule is used to retain or discard individuals and when/how the pool is pruned.
2.8 The Planner can be viewed as structured prompt design rather than actual planning. It is only used for zero-shot initialization, not for iterative optimization.

Questions:
3.1 Can the authors explain why the average population scores for REvolve decrease over generations, even though REvolve's framework guarantees a non-decreasing average as described in the original paper?
3.2 Can the authors explain why they plot average population scores instead of best-per-generation values, and provide additional results showing the best score per generation for all methods to make the comparison fair?
3.3 Can the authors elaborate on the points raised in 2.6 and 2.7?

EditLens prediction: Lightly AI-edited
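To make the invariant behind 2.2 and 3.1 concrete, here is a minimal Python sketch (not REvolve's actual code) of an admission rule that only accepts individuals whose fitness exceeds the current population average; the pool contents and score range are hypothetical. Under this rule the running average can never decrease, since every admitted score is strictly above it.

```python
import random

def try_admit(pool, fitness):
    """Add `fitness` to `pool` only if it beats the current pool average."""
    avg = sum(pool) / len(pool)
    if fitness > avg:
        pool.append(fitness)
        return True
    return False

pool = [1.0, 2.0, 3.0]                    # hypothetical fitness scores
averages = [sum(pool) / len(pool)]
for _ in range(100):
    try_admit(pool, random.uniform(0.0, 5.0))
    averages.append(sum(pool) / len(pool))

# Every admitted score exceeds the running average, so the average is monotone non-decreasing.
assert all(a <= b for a, b in zip(averages, averages[1:]))
```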
Forging Better Rewards: A Multi-Agent LLM Framework for Automated Reward Evolution

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes FORGE, a multi-agent LLM + evolution framework for automated reward function design. FORGE first produces structured reward specifications via a Planner agent, turns them into modular executable rewards, and then iteratively refines them through LLM-guided evolutionary operations (selection + crossover) under real environment feedback. A reward pool acts as specialized memory and a depth metric captures the structural complexity of rewards. Experiments on Tetris, Snake, Flappy Bird, and MuJoCo Humanoid show that FORGE consistently outperforms Eureka and REvolve.

Strengths:
- The clever decomposition into planner-based initialization + evolutionary refinement makes the framework both stable (at the beginning) and exploratory (later).
- The reward pool + depth metric is a lightweight and straightforward design for gaining interpretability and for analyzing how complexity correlates with performance.
- Empirical results (3 games + 1 continuous-control task) show gains over strong recent baselines (Eureka, REvolve).

Weaknesses:
- The most important issue is computational efficiency. Since each evolutionary step requires training/evaluating an RL policy under a new reward, the overall wall-clock/sample cost can still be the main blocker for practical use even if token usage is controlled. A comparison of "environment steps per performance gain" against Eureka/REvolve is missing (see the sketch after this review).
- The method lacks ablations on different LLMs. It assumes the LLM can both interpret two reward codes and synthesize a valid, environment-compatible offspring. It is unclear how robust this is to weaker models or higher code error rates. Ablations with a smaller/older LLM, or with constrained generation, would clarify the robustness.
- The setup says all baselines are re-implemented under the same foundation model, but the paper does not detail how much prompt engineering/tuning effort was spent on Eureka/REvolve. Since the key claimed improvement is "structured initialization + evolution", it would be good to show that Eureka/REvolve do not simply benefit from the same structured spec.

Questions:
- Considering that most modern LLMs have good vision capabilities, would feeding vision information (e.g., game environments / failure trajectories) into the LLM help further improve sampling efficiency?

EditLens prediction: Lightly AI-edited
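One way to make the requested comparison concrete: a hedged sketch of an "environment steps per performance gain" metric. The per-generation step budget and the per-generation best-score curves below are placeholders, not results reported by the paper.

```python
def steps_per_gain(best_scores, steps_per_generation):
    """Environment steps spent per unit of best-score improvement (lower is better)."""
    total_gain = best_scores[-1] - best_scores[0]
    total_steps = steps_per_generation * (len(best_scores) - 1)
    return float("inf") if total_gain <= 0 else total_steps / total_gain

# Placeholder best-score-per-generation curves for illustration only.
runs = {
    "FORGE":   [10.0, 14.0, 18.0, 21.0],
    "Eureka":  [10.0, 12.0, 15.0, 17.0],
    "REvolve": [10.0, 13.0, 16.0, 19.0],
}
for method, scores in runs.items():
    print(method, steps_per_gain(scores, steps_per_generation=1_000_000))
```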
Forging Better Rewards: A Multi-Agent LLM Framework for Automated Reward Evolution

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes FORGE, a multi-agent LLM framework that combines structured reward initialization, evolutionary refinement, and explicit complexity modeling. The core problem is that manual reward engineering is costly and suboptimal, while existing LLM-based methods (e.g., Eureka, REvolve) often produce unstable or opaque rewards. FORGE employs a Planner agent to generate modular reward specifications from task objectives and environment dynamics, followed by an evolutionary process where rewards are selectively combined and refined using LLM-guided crossover. Key contributions include a reward pool for efficient memory, a depth measure to quantify structural complexity, and token-efficient evolution. Extensive experiments across three games (Tetris, Snake, Flappy Bird) and a robotics task (Humanoid) demonstrate that FORGE achieves significant performance improvements (up to 38.5% over Eureka and 19.0% over REvolve on Humanoid) while maintaining competitive token usage.

Strengths:
**Method Design**: FORGE introduces a structured two-stage process (Planner-based initialization and Engineer-driven evolutionary refinement) that moves beyond direct LLM sampling for reward generation (Sec. 3; Fig. 1). This modular approach enhances interpretability and stability compared to prior methods like Eureka and REvolve.
**Comprehensive Experimental Evaluation**: The framework is tested across four distinct environments (three games and one simulated robotics task), covering both discrete and continuous control settings (Sec. 4.1; Table 2). FORGE consistently outperforms baselines, including general agentic frameworks, context-aware LLMs, and native environment rewards.
**Token Efficiency**: Despite the performance improvements, FORGE maintains competitive token usage by constraining LLM inference to modifying small subsets of reward functions (Sec. 4.5). This is a critical advantage for scalable deployment compared to other multi-agent LLM frameworks.

Weaknesses:
**Limited Generalization**: Evaluation is confined to simulated environments (MuJoCo, games), with no evidence of testing on real-world robotics or complex physical systems (Sec. 5). Tasks lack diversity in observation spaces (e.g., low-dimensional vs. high-dimensional inputs), raising questions about scalability to vision-based or partially observable domains (Table 1). There are no cross-environment transfer experiments to assess reward-function generalization beyond the training domains (Sec. 4.1).
**Incomplete Analysis of Evolutionary Components**: The probabilistic selection scheme (Eq. 6) uses unnormalized scores as weights, but no ablation is provided on alternative selection strategies such as rank-based or tournament selection (see the sketch after this review). The crossover operation relies solely on LLM inference without mutation mechanisms, potentially limiting diversity in later generations (Sec. 3.2). The depth measure is defined recursively but lacks theoretical grounding or comparison to other complexity metrics such as code length or entropy (Eq. 4).

Questions:
1. How does FORGE handle environments with highly sparse or delayed extrinsic rewards, and does the depth measure correlate with performance in such settings? (Sec. 4.1; Fig. 3)
2. Could the evolutionary process be enhanced by incorporating multi-objective optimization to balance reward complexity and performance, rather than relying solely on extrinsic return? (Eq. 6; Sec. 3.2)
3. What are the specific failure modes of FORGE in cases where the LLM generates invalid reward functions, and how frequently do these occur across different environments? (Sec. 3.2; Sec. 4.4)

EditLens prediction: Fully AI-generated
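For reference, a brief sketch of the selection alternatives named in the weaknesses: score-proportional sampling (the Eq. 6-style baseline), rank-based sampling, and tournament selection. The reward-pool interface and scores below are hypothetical, not FORGE's actual code, and scores are assumed non-negative for the proportional variant.

```python
import random

def proportional_select(scores):
    # Probability proportional to unnormalized (non-negative) scores, as in Eq. 6-style schemes.
    return random.choices(range(len(scores)), weights=scores, k=1)[0]

def rank_select(scores):
    # Rank-based: weights depend only on the ordering, so it is robust to score scale and outliers.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank = {idx: r + 1 for r, idx in enumerate(order)}  # 1 = worst, len(scores) = best
    return random.choices(range(len(scores)), weights=[rank[i] for i in range(len(scores))], k=1)[0]

def tournament_select(scores, k=3):
    # Tournament: return the best of k randomly drawn candidates.
    contenders = random.sample(range(len(scores)), k=min(k, len(scores)))
    return max(contenders, key=lambda i: scores[i])

scores = [12.0, 3.5, 7.2, 0.4]  # hypothetical fitness scores of a reward pool
print(proportional_select(scores), rank_select(scores), tournament_select(scores))
```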
Forging Better Rewards: A Multi-Agent LLM Framework for Automated Reward Evolution

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
FORGE is a multi-agent LLM framework for automated reward evolution, using a Planner agent for structured zero-shot reward initialization, an Engineer agent for iterative selection and crossover refinement, and explicit depth metrics for complexity. Evaluated on Tetris, Snake, Flappy Bird, and Humanoid (MuJoCo), it outperforms baselines such as Eureka, REvolve, and context-aware LLMs, achieving up to 38.5% gains over Eureka on Humanoid while maintaining token efficiency.

Strengths:
1. Structured initialization via the Planner yields strong zero-shot rewards, outperforming direct LLM sampling; evolutionary refinement drives consistent gains across discrete and continuous domains.
2. Experiments show superior performance in both zero-shot and evolved settings, with ablation studies confirming the value of key components such as selection and planning.
3. The reward pool and depth metrics support stability, interpretability, and efficiency, enabling analysis of complexity-performance correlations (e.g., optimal depth 3 for games, 7 for Humanoid).

Weaknesses:
1. The method relies on Claude Sonnet 4 without ablations on other LLMs; results may be model-specific and could degrade with open-source alternatives.
2. The paper claims token efficiency but consumes more tokens in some environments; it lacks a full compute-cost breakdown or scaling analysis for larger tasks.
3. Baselines use the same LLM, but the adaptations (e.g., replacing human feedback in REvolve) may not be optimal; the multi-agent setup adds overhead without clear justification over simpler methods.
4. The depth analysis is insightful but underexplored, e.g., there is no explanation for the domain-specific optima or for robustness to hallucinations during crossover.
5. Evaluation is limited to simulated environments; there are no real-world robotics tests, overlooking challenges such as sensor noise, delays, and safety that could undermine practicality.

Questions:
1. What is the sensitivity to the base LLM? Are there results with open-source models such as Llama or smaller variants?
2. What are the detailed token/compute costs per iteration and per environment? How does efficiency scale to more complex domains?
3. Why does evolution sometimes underperform context-aware LLMs on average? Are there mechanisms to boost consistency?
4. How can the depth metrics guide early stopping? Are there thresholds or heuristics for optimizing the number of iterations? (See the sketch after this review.)

EditLens prediction: Fully AI-generated
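A minimal sketch of the kind of heuristic asked about in question 4, assuming the best score per generation and the depth of the current best reward are tracked; the thresholds and the combination of plateau and depth criteria are hypothetical, not values or rules from the paper.

```python
def should_stop(best_scores, depths, patience=3, min_rel_gain=0.01, max_depth=7):
    """Stop evolving once the best score has plateaued for `patience` generations,
    or once reward depth exceeds a (hypothetical) domain-specific cap."""
    if depths and depths[-1] > max_depth:
        return True
    if len(best_scores) <= patience:
        return False
    past, recent = best_scores[-patience - 1], best_scores[-1]
    return (recent - past) / (abs(past) + 1e-8) < min_rel_gain

# Example: scores have stalled over the last three generations, so this returns True.
print(should_stop([10.0, 15.0, 18.0, 18.1, 18.1, 18.1], depths=[2, 3, 4, 5, 6, 6]))
```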