ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated   | 1 (25%)  | 6.00       | 2.00           | 1846
Heavily AI-edited    | 1 (25%)  | 2.00       | 5.00           | 1323
Moderately AI-edited | 0 (0%)   | N/A        | N/A            | N/A
Lightly AI-edited    | 1 (25%)  | 4.00       | 4.00           | 3892
Fully human-written  | 1 (25%)  | 2.00       | 5.00           | 4017
Total                | 4 (100%) | 3.50       | 4.00           | 2770
PaT: Planning-after-Trial for Efficient Code Generation

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper proposes the Planning-after-Trial (PaT) policy for code generation, in which a trial by a cost-efficient generator (the "Trial") is followed by a powerful planner (the "Planning") only if the initial attempt fails, which reduces unnecessary planning overhead.

Strengths:
* Efficiency: By avoiding planning on simple tasks, efficiency gains can be achieved.
* Adaptivity: The policy adapts to problem difficulty.

Weaknesses:
* Dependency: The claimed efficiency improvement arises from skipping explicit planning for easier tasks. However, this benefit relies heavily on the quality of the initial generation. A sensitivity analysis would be useful.
* Negative generation: An analysis of initial failures potentially negating the cost advantage would be necessary. If early-stage errors are frequent, can the total cost exceed that of a consistent planning-based approach?
* Cost: The overheads of deploying two models need to be further justified. More discussion of the increased resource overhead (e.g., memory, latency) would be useful.

Questions:
* Regarding dependency, can the policy adapt if the initial generation is systematically poor?
* Regarding negative generation, can the total cost exceed that of a consistent planning-based approach?
* Are the increased costs acceptable in real-world deployment?
* Could a single multitask model emulating both roles be an alternative?

EditLens Prediction: Heavily AI-edited
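For context, the trial-then-plan control flow this review summarizes can be sketched as follows; the function names and model handles are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of the Planning-after-Trial (PaT) control flow described in the
# review above. All names (generate, verify, plan, compose) are hypothetical
# stand-ins for illustration, not the paper's API.
from typing import Callable, List


def pat_generate(
    task: str,
    generate: Callable[[str], str],       # cost-efficient generator (small model)
    verify: Callable[[str], bool],        # run candidate code against test cases
    plan: Callable[[str], List[str]],     # powerful planner (large model) decomposes the task
    compose: Callable[[List[str]], str],  # assemble sub-solutions into one program
) -> str:
    # Stage 1 (Trial): attempt the task directly with the cheap generator.
    candidate = generate(task)
    if verify(candidate):
        return candidate  # simple task: terminate early, no planning cost incurred

    # Stage 2 (Planning): only on failure does the planner decompose the task;
    # the cheap generator then solves each sub-task.
    subtasks = plan(task)
    return compose([generate(sub) for sub in subtasks])


if __name__ == "__main__":
    # Toy demo with stub components, just to show the control flow.
    solution = pat_generate(
        task="sum two numbers",
        generate=lambda t: f"# code for: {t}",
        verify=lambda code: False,  # force the planning branch for the demo
        plan=lambda t: [f"{t} (step 1)", f"{t} (step 2)"],
        compose=lambda parts: "\n".join(parts),
    )
    print(solution)
```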
PaT: Planning-after-Trial for Efficient Code Generation

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper focuses on the efficiency of code generation: existing approaches either skip planning (which may lead to poor performance on complex tasks) or adopt a "Plan-before-Trial" (PbT) paradigm that wastes computational resources on simple tasks by enforcing planning regardless of task difficulty. To this end, the authors propose PaT (Planning-after-Trial), a two-stage framework that prioritizes efficiency without sacrificing performance. In the first stage, PaT leverages lightweight, low-cost small models to directly generate code for a given task; it then validates the generated code using test cases. If the code passes validation, PaT terminates early to save costs. If validation fails (indicating a complex task), PaT invokes a more capable but expensive large model to perform task decomposition (planning), breaking the complex task into manageable sub-tasks that are solved iteratively. To further optimize efficiency, PaT adopts a division-of-roles strategy: small models handle code generation, while large models only contribute to planning. The authors evaluate PaT across multiple models and code benchmarks, showing that it matches the accuracy of larger models (e.g., Qwen3-32B) with smaller models (e.g., Qwen3-4B) while reducing costs, and outperforms prior methods like FunCoder on complex tasks at ~60% of the computational cost. The paper's core contributions are the PaT framework, the role division of small/large models for cost-efficiency, and empirical validation of its effectiveness on diverse code generation tasks.

Strengths:
1. The paper proposes the use of heterogeneous models and a trial-first pipeline to resolve the trade-off between overhead and performance. Specifically, a smaller model is first employed for reasoning; if its output passes the test cases, the process terminates, and otherwise a more powerful model and code generation workflow are invoked to solve complex problems. The proposed approach follows a straightforward rationale, and the authors have conducted extensive experiments across multiple baselines to validate its effectiveness. They claim that their method outperforms the previous SOTA approach in both performance and efficiency.
2. The paper introduces a heterogeneous model framework to tackle complex problems: a more powerful large model is utilized for overall code generation planning, while a smaller model handles relatively deterministic code generation tasks. This collaborative approach aims to reduce the overall computational overhead of the code generation process.

Weaknesses:
1. The approach proposed in this paper is straightforward, with a key aspect being how to verify the correctness of the initial reasoning performed by the small model, which requires accurate test cases. However, the paper states that the authors used both the built-in test cases provided with the problems and additional test cases generated by the model based on the problem requirements to jointly validate the small model's outputs. This raises a concern: does using the provided test cases to verify the small model's results risk potential data leakage?
2. According to the experimental results, the proposed method demonstrates significantly and consistently superior performance compared to the previous SOTA method. While the paper claims to improve reasoning efficiency, the experiments show enhancements in both efficiency and performance over prior methods. Could the authors analyze the specific sources of these performance gains? Were the experimental settings fully aligned with those of the previous SOTA method?
3. Could the authors provide additional statistical insights, such as the proportion of problems successfully solved by the small model alone, and a comparison of token overhead between the proposed method and pre-decomposition approaches like FunCoder for complex problems, to offer a more intuitive understanding of the results?

Questions:
Please refer to the weaknesses.

EditLens Prediction: Lightly AI-edited
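To ground the verification step this review questions, here is a minimal sketch that checks a candidate against the union of the provided and model-generated test cases; the test-case format and helper names are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the verification step discussed above: a candidate is accepted
# only if it passes both the benchmark's built-in test cases and the model-generated
# ones. The (args, expected) tuple format is an assumption. Note that reusing the
# benchmark's provided tests for verification is exactly what the review flags as a
# potential data-leakage risk.
from typing import Callable, List, Tuple

TestCase = Tuple[tuple, object]  # (positional inputs, expected output)


def passes_all(candidate: Callable, provided: List[TestCase], generated: List[TestCase]) -> bool:
    for args, expected in provided + generated:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False  # runtime errors count as failures
    return True


if __name__ == "__main__":
    def add(a, b):
        return a + b

    provided = [((1, 2), 3)]                 # built-in tests shipped with the problem
    generated = [((0, 0), 0), ((2, 2), 4)]   # model-generated tests (hypothetical)
    print(passes_all(add, provided, generated))  # True
```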
PaT: Planning-after-Trial for Efficient Code Generation

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes PaT (Planning-after-Trial), an adaptive framework for efficient LLM-based code generation. Unlike prior Planning-before-Trial methods that plan before solving, PaT first attempts direct code generation and triggers planning only when execution fails, dynamically balancing cost and accuracy. The authors further design a heterogeneous configuration, using small models for generation and large models for planning, and provide theoretical analysis proving PaT's cost-optimality under scaling laws. Experiments on multiple benchmarks (HumanEval, MBPP, xCodeEval) and models (Qwen3, LLaMA3, DeepSeek-Coder) show PaT achieves up to 7–8% higher Pass@1 while reducing token cost by ~40%, outperforming prior methods like FunCoder and Self-Plan.

Strengths:
- Novel adaptive policy: The Planning-after-Trial framework is conceptually elegant and practical, bridging efficiency and accuracy.
- Strong empirical evidence: Consistent improvements across multiple model scales and benchmarks validate robustness.
- Theoretical rigor: Formal cost analysis and scaling-law-based proofs establish clear efficiency guarantees.

Weaknesses:
- Verification dependency: Performance heavily relies on the quality of automatically generated test cases. How robust is PaT under noisy or incomplete verification signals?
- Scaling analysis abstraction: The theoretical model assumes an idealized uniform difficulty distribution; how well does this assumption hold empirically in real-world workloads?
- Adaptivity granularity: The binary "trial vs. plan" trigger could be further refined. Could a graded or probabilistic planning policy yield smoother efficiency?

Questions:
- The method could be evaluated on more recent and challenging code benchmarks such as LiveCodeBench Pro, SWEBench, etc.
- How does performance scale with recursion depth or the number of subproblems in decomposition?

EditLens Prediction: Fully AI-generated
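As a back-of-envelope illustration of the cost trade-off these reviews question (this is an assumed cost model with made-up numbers, not the paper's theoretical analysis), one can compare the expected per-problem cost of PaT against an always-plan baseline:

```python
# Illustrative expected-cost comparison between PaT and an always-plan (PbT)
# baseline. The cost model and every number below are assumptions, not figures
# or formulas from the paper.

def expected_cost_pat(c_trial: float, c_plan: float, c_retry: float, p_pass: float) -> float:
    # PaT always pays the trial; with probability (1 - p_pass) it additionally
    # pays for planning plus regeneration of the sub-tasks.
    return c_trial + (1.0 - p_pass) * (c_plan + c_retry)


def expected_cost_pbt(c_plan: float, c_retry: float) -> float:
    # PbT pays for planning and sub-task generation on every problem.
    return c_plan + c_retry


if __name__ == "__main__":
    c_trial, c_plan, c_retry = 1.0, 4.0, 2.0  # relative token costs (assumed)
    # PaT is more expensive than PbT only when p_pass < c_trial / (c_plan + c_retry),
    # i.e. when first-try failures are frequent enough that the extra trial is wasted.
    for p_pass in (0.1, 0.5, 0.9):
        pat = expected_cost_pat(c_trial, c_plan, c_retry, p_pass)
        pbt = expected_cost_pbt(c_plan, c_retry)
        print(f"first-try pass rate {p_pass:.1f}: PaT {pat:.2f} vs PbT {pbt:.2f}")
```

Under these assumed costs the break-even pass rate is 1/6, which is the regime the first and fourth reviews ask about when initial failures are frequent.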
PaT: Planning-after-Trial for Efficient Code Generation

Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper proposes PaT (Planning-after-Trial) to improve the efficiency of LLM code generation. Unlike the Planning-before-Trial (PbT) policy, PaT first attempts a direct solution and only invokes the Planner upon verification failure. The authors further propose that PaT, combined with a "heterogeneous configuration" (using a small model for generation and a large model for planning), can achieve performance close to that of the large model at a fraction of the cost.

Strengths:
The paper is clearly written and easy to follow. Improving the inference efficiency of LLM code generation is of significant practical importance and broad interest.

Weaknesses:
1. The core idea (PaT) is an incremental improvement over the PbT policy. More critically, the method is highly similar to AdaCoder [1] (arXiv:2504.04220), which also proposes an adaptive planning strategy. Given that this paper cites later work from September 2025 (Paglieri et al.), [1] should not be considered concurrent work. The authors must clarify their contribution's novelty relative to [1].
2. The method in Section 3.2 relies on a model-generated unit-test set for verification, which is a highly problematic step. As noted in [5, 6], model-generated test oracles are prone to errors and can lead to "false positives," thus misleading subsequent generation. The authors' "plateau hypothesis" (i.e., assuming remaining failures are due to flawed tests when $p^{(t)} \le p^{(t-1)}$) is an unsubstantiated and unrigorous heuristic.
3. The theoretical analysis in Section 4.1 and Appendix A reads more like "mathematical decoration" than a source of practical insight. The theory is built on highly simplified, unmeasurable abstractions (e.g., representing "complexity" as a scalar $k$ and "capability" as $p$). Theorem 3's claim of a "practical, closed-form solution" to guide the selection of the optimal generator (sLM) is entirely unsubstantiated empirically. The authors never attempt to use their own data (e.g., Fig. 6) to estimate the formula's parameters and, in turn, predict their own experimental results (i.e., why Qwen3-8B was the optimal choice). This makes the theoretical section appear superfluous and unconvincing.
4. The DIVIDE and COMPOSE procedure in Section 3.1 is nearly identical to the hierarchical decomposition mechanisms in FunCoder and CodeChain [7]. The authors fail to clarify its novelty. Moreover, the paper lacks necessary citations and discussion of related work [2, 3, 4].

References:
[1] AdaCoder: An Adaptive Planning and Multi-Agent Framework for Function-Level Code Generation. https://arxiv.org/abs/2504.04220
[2] Planning In Natural Language Improves LLM Search For Code Generation. https://arxiv.org/abs/2409.03733
[3] INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair. https://arxiv.org/abs/2311.09868
[4] PairCoder: A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement. https://arxiv.org/abs/2409.05001
[5] TESTEVAL: Benchmarking Large Language Models for Test Case Generation. https://arxiv.org/abs/2406.04531
[6] Learning to Generate Unit Tests for Automated Debugging. https://arxiv.org/abs/2502.01619
[7] CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules. https://arxiv.org/abs/2310.08992

Questions:
1. Did you measure the "false positive/negative" rate of the model-generated test cases? How does your "plateau heuristic" perform in the face of these erroneous test signals? For example, can it distinguish between a failure caused by a code bug and a failure caused by a test bug?
2. Can you empirically apply Theorem 3's formula using your own data (e.g., by fitting $\alpha$ and $\beta$ from Fig. 6 and estimating the average planning cost $D_L$)? Does the predicted theoretical optimal generator cost $c_s^*$ align with the optimal empirical result you observed in Section 5.2 (the Qwen3-8B model)? If not, does this imply the theoretical model is too oversimplified to be useful for real-world deployment?

EditLens Prediction: Fully human-written
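For concreteness, here is a minimal sketch of the "plateau" stopping rule as the review above paraphrases it ($p^{(t)} \le p^{(t-1)}$); the pass-rate bookkeeping and interface are assumptions for illustration, not the authors' exact procedure.

```python
# Sketch of the "plateau" stopping rule as paraphrased in the review above: once
# the pass rate p over the generated test suite stops improving (p_t <= p_{t-1}),
# remaining failures are attributed to flawed tests and refinement stops.
# The trace-based interface below is an assumption for illustration.
from typing import List


def plateau_reached(pass_rates: List[float]) -> bool:
    # pass_rates[t] is the fraction of generated tests passed at refinement round t.
    if len(pass_rates) < 2:
        return False
    return pass_rates[-1] <= pass_rates[-2]


if __name__ == "__main__":
    trace = [0.4, 0.7, 0.7]        # hypothetical pass-rate trace over refinement rounds
    print(plateau_reached(trace))  # True: p_t <= p_{t-1}, so refinement would stop here
```

As the review notes, such a rule by itself cannot distinguish a stalled pass rate caused by a buggy candidate from one caused by buggy generated tests.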