ICLR 2026 - Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (25%) | 8.00 | 4.00 | 3271 |
| Fully human-written | 3 (75%) | 6.67 | 4.33 | 3879 |
| Total | 4 (100%) | 7.00 | 4.25 | 3727 |

Reviews

Kimi-Dev: Agentless Training as Skill Prior for SWE-agents

Soundness: 3: good
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
The paper argues that "agentless" workflow training can serve as skill priors (localization, code editing, self-reflection) for end-to-end SWE agents. It introduces Kimi-Dev, a 72B open-weight code LLM trained via a curated recipe (mid-training on PR/issue data, a reasoning-heavy cold start, RL on code edits, and test-time self-play). Under a workflow (agentless) setup, Kimi-Dev reaches 60.4% on SWE-bench Verified; with light SFT on 5k public SWE-Agent trajectories, the model attains 48.6% pass@1 in a multi-turn agent framework, comparable to Claude 3.5 Sonnet (241022). These results aim to bridge the workflow and agentic paradigms by treating agentless training as a foundation for agent adaptation.

Strengths:
- The paper executes a principled recipe and delivers strong results under agentic settings. The staged recipe (mid-training on ~150B tokens: 50B natural diffs, 20B PR commit packs, ~20B synthetic, upsampled ×4; a reasoning-heavy cold start; RL with outcome-only rewards; test-time self-play) is coherent and empirically validated. The paper achieves 60.4% on SWE-bench Verified under a standardized 40-patch/40-test setting, with clear scoring for self-play selection.
- Agent transfer with small data: with only ~5k public SWE-Agent trajectories, the adapted model attains 48.6% pass@1.
- The authors perform detailed experiments demonstrating the skill-transfer and generalization effect by controlling for different prior models. These experiments provide valuable insights into how the community can improve future agent models.

Weaknesses:
- It is unclear to me that RL is an indispensable part of the recipe. In Figure 5, the performance difference between the RL and SFT curves is relatively small. Does this indicate that RL is not really an essential part of the recipe, and that SFT on the agentless task is already sufficient to provide such a prior and raise the performance upper bound given sufficient multi-turn agent SFT data?
- It is unclear whether the agentless skill prior remains useful given sufficient multi-turn SFT data. Also in Figure 5, when given sufficient agent SFT data, the performance of MT gets very close to that of SFT/RL. Does that mean the value of the agentless skill prior drops as more in-domain agent SFT data becomes available?
- It is unclear whether the performance improvements come from distilling a stronger model. If I understand correctly, the SFT data for the agentless skills was distilled from DeepSeek R1 (Section 3.2); could the effectiveness of the "agentless prior" come from distillation of DeepSeek R1?

Questions:
- For weakness 1: it would be nice to either (1) show the benefit of RL by demonstrating performance improvements on other datasets/metrics/settings (e.g., higher pass@k for large k; see the note after this review), or (2) discuss the limitation of agentless+RL in providing a better prior.
- For weakness 2: it would be nice to either (1) disprove it or (2) discuss the limitation.
- For weakness 3, my question is: given the same computation budget, what if we perform distillation on the same set of training tasks (e.g., SWE-smith, SWE-Gym)? Is it possible that the performance gain there would be larger than the one we get with the agentless prior? If so, why is it a good idea to use an agentless prior rather than directly generating data from environments that better match the domain?

EditLens Prediction: Fully human-written
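
A brief note on the pass@k suggestion above (standard background, not part of the submission): pass@k is usually reported with the unbiased estimator of Chen et al. (2021), computed from $n$ sampled patches per issue, of which $c$ resolve it:

$$\text{pass@}k \;=\; \mathbb{E}_{\text{issues}}\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right].$$

Reporting this at large $k$ (with $n \gg k$) would help distinguish whether RL enlarges the set of resolvable issues or mainly sharpens pass@1, which is the distinction the weakness above is probing.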

Kimi-Dev: Agentless Training as Skill Prior for SWE-agents

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper introduces Kimi-Dev, an open-source LLM for software engineering that unifies agentless and agentic approaches. It shows that agentless training through modular tasks can build transferable skill priors that make subsequent agentic fine-tuning more efficient. Kimi-Dev achieves SOTA on SWE-bench Verified and enables competitive SWE-agent results with minimal additional training.

Strengths:
The main discovery, that agentless training can induce skills which make further agent training more effective, is interesting: it opens up a new paradigm for how researchers may train their agents. Open-sourcing the recipe further adds to the contribution. The proposed multi-stage training recipe is well explained and reflects the researchers' understanding of the nuances of scaling SWE models. The empirical results of the trained model are strong, achieving SOTA among open-source models for workflow-based approaches on SWE-bench and competitive pass@1 performance under an agentic framework, which shows the method's effectiveness. The analysis in the paper is very insightful, including the variation of the token budget, the RL progression, etc. All of these settings reflect a solid experimental approach.

Weaknesses:
Some technical details are deferred to the appendix: the main text only gives a final evaluation formula, while the concrete implementations of the reward, prompt sampling, etc. are not rigorously and clearly laid out. Some observations about false positives are mentioned, but the paper would further benefit from quantitative and qualitative analysis of the errors remaining with the current bug fixer and test writer, especially points that reflect the challenges in agentic skill-prior transfer. Much of the justification for skill transfer from the agentless to the agentic framework is empirical. As this is the core discovery of the paper, a more principled proof or study of how the skill prior is linked to generalization and how these skills manifest would further strengthen the claim.

Questions:
N/A

EditLens Prediction: Fully human-written

Kimi-Dev: Agentless Training as Skill Prior for SWE-agents

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This work presents a training recipe that leverages workflow-based pipelines, as opposed to more open-ended agentic rollouts, to train an open-source code LLM for SWE tasks. The so-called Agentless training approach provides more stability and the opportunity to introduce intermediate skills, such as locating files, writing tests, and fixing bugs, which may subsequently be used and composed when training a SWE-agent model. Concretely, the authors describe: a mid-training procedure where data from GitHub issues and PR commits provides examples of how developers reason about issues and fix bugs; a cold-start procedure where SWE-related reasoning traces from R1 are used to induce long-CoT abilities; an RL stage where a policy-gradient objective is used to train code-editing capabilities with outcome-based rewards; and finally a test-time self-play stage where the model learns to play the roles of a BugFixer and a TestWriter. Following this training procedure, the authors evaluate the resulting Kimi-Dev model on SWE-bench Verified, achieving a strong performance of 60.4% for an open-weights 72B model derived from Qwen 2.5 (Base). The authors then investigate whether the structured Agentless training procedure leads to model checkpoints that offer useful "priors" for more open-ended SWE-agent fine-tuning. Indeed, various stages of the Agentless training procedure, in particular the mid-training, lead to a large uplift in SWE-bench Verified performance after various amounts of SFT on SWE-Agent traces collected from Claude 3.5 Sonnet. The authors also present some evidence that the RL prior (obtained after the full Agentless training procedure) supports a greater number of turns in SWE-Agent-style rollouts after SFT.

Strengths:
While advanced coding-agent model training has been underway in model labs for some years, there are relatively few detailed accounts of how to take a pre-trained model and effectively imbue it with the skills needed to solve software engineering problems. From this perspective, the work offers a valuable contribution, in particular the details of the mid-training procedure, which appears to be the stage of Agentless training that yields the largest SWE-bench Verified performance improvement after SFT on SWE-Agent traces; perhaps because it trains the model to reproduce the motifs, reasoning patterns, and skills required to conduct software engineering work. The work is of a reasonably large scale and achieves compelling results for a model of its size, with a 60.4% resolution rate on SWE-bench Verified under the Agentless-like framework.

Weaknesses:
While the Agentless performance on SWE-bench Verified is reasonably strong at 60%, the pass@1 rate of the SWE-Agent results is somewhat low at 48.6%, particularly given that the model is roughly double the size of other models obtaining comparable scores. It would be good to investigate why this is, and crucially whether it is a result of the Agentless training procedure that preceded SFT for SWE-Agent.

Moreover, the authors train on the outputs of other models at two key points in the paper. This is disappointing, since it effectively hides how the corresponding skills can be trained from scratch in an open-source model, instead deferring this to the black box of a pre-existing model trained by some other lab. The first instance is during mid-training, which appears to be one of the steps in the training procedure that yields the greatest performance improvements, where traces from DeepSeek's R1 model are used to introduce skills such as reasoning, problem analysis, method sketching, self-refinement, and exploration of alternative solutions. This is an important procedure, and it would have been good to see discussion of how these skills could be bootstrapped without training on the outputs of a model that has already undergone this training. The second instance is the set of SWE-Agent traces from Claude 3.5 Sonnet used for SFT. While I understand that the primary focus of this paper is the Agentless training procedure itself, I do believe that any conclusions about the usefulness of this intermediate training procedure as a "prior" for training SWE-agents must be drawn with a more legible training procedure for the SWE-Agent stage. In other words, had the jump from Agentless to SWE-Agent been achieved by a transparent RL procedure, then conclusions about the usefulness of the Agentless skill prior might be more impactful. Instead, we appear to have learned that these "priors" are useful when doing imitation learning from black-box model traces.

Finally, on the point about generalization, the studies all revolve around SWE-bench-style tasks (including the live and multilingual variants in Appendix E). These represent only a small subset of software engineering work and do not address related tasks such as novel algorithm design, performance optimization, task management, and so forth, which I believe would be a better quantifier of generalization than merely more up-to-date or polyglot variants of SWE-bench's issue-resolution task. Indeed, this is an area where the paper's Agentless training thesis may hold promise: systematically created workflow-based mid-training procedures could imbue a model with a diverse range of useful skills, which the model then learns to compose in the RL stage.

Nits:
- L427: do you mean to say "The RL prior outperforms all the other models..."?
- What is behind the roughly linear increase in token length in Figure 3 as RL training proceeds? You analyze some of these traces in Appendix F, observing that there are some repetitive self-reflections, while also noting that increased response length correlated with improved answer accuracy. How do you quantify this at scale? Is the length increase due to your choice of RL objective? Recent work (https://arxiv.org/pdf/2508.08221) has commented on this sequence-length effect, albeit with a GRPO-style objective (see the sketch after this review).
- How much do you attribute the performance improvements measured in both the Agentless framework evaluations and the subsequent SWE-Agent evaluations to the dataset of DeepSeek R1-based reasoning trajectories? Looking at Figure 5, after $2^{23}$ SWE-Agent SFT tokens there is a large jump between the base model and the MT checkpoint. What explains this?

EditLens Prediction: Fully human-written
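
For context on the GRPO-style objective mentioned in the nits above, the following is standard background from the GRPO literature rather than the submission's own objective (which the review describes only as a policy-gradient method with outcome rewards). GRPO samples a group of $G$ responses per prompt and replaces a learned value baseline with a group-normalized advantage computed from the outcome rewards $r_1, \dots, r_G$:

$$\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}, \qquad i = 1,\dots,G.$$

How the resulting per-token losses are normalized across responses of different lengths is one commonly discussed candidate for the response-length drift raised in the nit above.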

Kimi-Dev: Agentless Training as Skill Prior for SWE-agents

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes a two-stage training approach for software engineering (SWE) agents: first training on "Agentless" workflows (fixed-pipeline tasks such as bug localization and repair), then adapting to multi-turn agentic frameworks. The authors introduce Kimi-Dev, a 72B-parameter model that achieves 60.4% on SWE-bench Verified through Agentless training, and 48.6% when adapted to the SWE-Agent framework with only 5k trajectories. The central thesis is that Agentless training, framed as a single-turn task, facilitates more scalable RL training and serves as an effective prior for transitioning to multi-turn scaffoldings.

Strengths:
1. The paper is well written, with clear separation and explanation of each component, making the methodology easy to understand and follow.
2. The ablation studies (Figure 5) systematically isolate the contributions of the mid-training, cold-start, and RL stages, clearly showing what each component adds. The sweep over different amounts of SWE-Agent SFT data (from 0 to ~1.5×2^28 tokens) is particularly comprehensive.
3. The reported performance of 60.4% on SWE-bench Verified within the Agentless scaffolding, using Qwen2.5-72B as the base model, is impressive.
4. The extensive appendices document data recipes, training procedures, and infrastructure details in substantial depth, significantly enhancing the work's reproducibility.

Weaknesses:
1. The anomalous performance dip of the "SFT" prior at 2^24 tokens (Figure 5) raises questions about the stability and reliability of the proposed priors. Including confidence intervals would help clarify whether the observed differences between priors are statistically significant or within the margin of experimental noise.
2. The 5k SFT trajectories represent a relatively small subset of the potential teacher data. Given that Claude 3.7 Sonnet is a substantially stronger model, scaling the number of trajectories might narrow the gap between the Base and Kimi-Dev priors. If so, the proposed approach would offer a more efficient rather than a more powerful method for training multi-turn SWE agents, which is a weaker claim than it could otherwise be.
3. While the Agentless checkpoint demonstrates strong performance as a prior for multi-turn adaptation, it remains unclear whether directly applying large-scale RL to the SWE-Agent framework would yield comparably strong or weaker results. The literature includes examples of applying GRPO (DAPO) to sparse-reward settings for LLMs on SWE tasks, such as DeepSWE [1]. A direct comparison or discussion of this alternative training paradigm would strengthen the paper's claims about the necessity and superiority of the two-stage approach.

[1] Michael Luo et al. DeepSWE: Training a State-of-the-Art Coding Agent from Scratch by Scaling RL.

Questions:
1. What is the precise configuration of the SFT used for Figure 5? While the number of trajectories is clearly specified, what is the number of training epochs? Were hyperparameters tuned independently for each experimental run, or was a single configuration used across all conditions?
2. What are your thoughts on the TestWriter false-positive examples documented in the appendix, and how might this limitation be addressed in future work?

EditLens Prediction: Lightly AI-edited