SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:
This paper introduces SWE-Perf, a benchmark designed to systematically evaluate LLMs' capability to solve realistic code performance optimization tasks. SWE-Perf includes both file-level and repository-level settings, and the experiments cover a wide range of open- and closed-source LLMs as well as agent-based systems. The results show an 8.59% performance gap between the current best approach and human experts.
Strengths:
1. The paper extends code performance optimization from the function level to realistic repository-level settings. This design assesses LLMs' ability not only to optimize a single function or algorithm but also to retrieve relevant snippets and locate performance bottlenecks within large, complex codebases.
2. The authors design a two-stage evaluation setup consisting of the Oracle and Realistic settings, at the file level and repository level, respectively. This hierarchical design provides a structured way to separate two key abilities: (1) directly optimizing performance when the target code snippet is known, and (2) retrieving and identifying optimization opportunities in a large repository, beyond simple performance improvement.
3. The authors evaluate model-generated results in three stages: apply, correctness, and performance. This gives a more complete and quantitative evaluation of whether LLMs not only improve performance but also preserve the functionality of the code.
4. The paper provides a comprehensive analysis by comparing both open- and closed-source LLMs, as well as agent-based systems, in the repository-level setting, which reveals not only the base models' capability but also how the agent system or workflow affects performance.
Weaknesses:
1. SWE-Perf does not account for differences in repository application domains, structural characteristics, or development maturity. In real-world software systems, these factors substantially influence performance requirements, optimization strategies, and optimization potential. However, SWE-Perf does not consider such contextual performance ceilings or differences in optimization structure and strategy, and simply mixes together instances from different domains and repositories with different characteristics.
2. SWE-Perf includes a limited number of repositories and a highly uneven distribution of test instances. As shown in Figure 4 and Appendix B, both human/model performance and runtime statistics vary significantly across repositories, while the number of instances per repository is strongly imbalanced. This imbalance can dominate the global average and lead to misleading conclusions about overall model performance (see the toy calculation after this list). Additionally, analyses such as the correlation between performance gain and function count are unreliable with so few instances, as the average improvement ignores differences in task difficulty, optimization ceiling, and repository features. Even for a single function, the achievable performance gain can vary greatly depending on its maturity and inherent optimization limits.
3. The benchmark construction pipeline involves executing the complete set of unit tests, resulting in expensive computational overhead and scalability constraints. The authors report that a single sample execution can exceed one hour and that only 140 instances were ultimately produced despite initially collecting 25,264 codebases. Although this level of rigor enhances the reliability of the benchmark, it significantly limits scalability for future enrichment or updates, especially given the limited number of instances currently available.
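To make point 2 concrete, here is a toy calculation (with invented numbers, not taken from the paper) showing how an instance-weighted global average can be dominated by a single over-represented repository, while a per-repository average tells a different story:

```python
# Invented per-repository gains to illustrate micro vs. macro averaging.
repo_gains = {
    "repo_A": [0.30] * 60,  # 60 instances with a large average gain
    "repo_B": [0.05] * 10,  # 10 instances
    "repo_C": [0.02] * 5,   # 5 instances
}

all_gains = [g for gains in repo_gains.values() for g in gains]
micro_avg = sum(all_gains) / len(all_gains)  # instance-weighted global mean
macro_avg = sum(sum(gains) / len(gains) for gains in repo_gains.values()) / len(repo_gains)

print(f"micro (global) average:   {micro_avg:.3f}")  # ~0.248, dominated by repo_A
print(f"macro (per-repo) average: {macro_avg:.3f}")  # ~0.123
```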
Questions:
1. The benchmark measures performance based on the runtime of existing unit tests in the repository. However, these tests may only cover specific inputs and may not reflect the code's performance in general. In typical performance evaluation, we often look at best-case, worst-case, and average-case behavior, or test scaling across different input sizes (see the sketch after this list). Could the authors explain why they chose to rely only on the repository's own test cases for performance measurement? Do the authors believe these tests are representative enough of the code's performance in real-world scenarios?
2. In the repository-level setting, the benchmark is said to test LLMs' ability to perform performance optimization across full repositories, potentially involving multiple functions and files. However, it is not entirely clear whether all instances truly require cross-function reasoning, or whether many represent independent, localized function-level optimizations within a large codebase. It would be better if the authors could clarify whether they conducted any manual inspection or quality control to confirm that the selected performance-improvement instances actually involve interactions across functions, files, or modules.
3. In Figure 9, the paper attributes the decline in expert performance improvement partially to higher optimization difficulty. However, in the repository-level setting, the target functions include all functions executed during the performance test, not only those directly modified by the experts or the LLMs. Could the observed decline simply be influenced by the presence of many non-optimized functions, whose runtime dilutes the overall improvement (a back-of-the-envelope illustration is given after this list)? It would be helpful if the authors could clarify this.
4. In Figures 8 and 9, the analysis groups performance results by the number of target functions and reports average improvements. However, different functions can have very different performance characteristics, optimization potential, and difficulty levels. Could the authors clarify how they justify aggregating all these diverse functions into simple averages? Have the authors considered classifying or normalizing the results by function type, domain, complexity, or optimization method to ensure the trend is meaningful?
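Regarding question 1, the following is a minimal sketch (not part of SWE-Perf) of the kind of input-scaling measurement alluded to there; `target_fn` and the input construction are hypothetical placeholders:

```python
import statistics
import time

def measure_scaling(target_fn, sizes=(1_000, 10_000, 100_000), repeats=5):
    """Time target_fn on inputs of increasing size; report the median of repeated runs."""
    results = {}
    for n in sizes:
        data = list(range(n))  # illustrative input; real inputs are domain-specific
        runs = []
        for _ in range(repeats):
            start = time.perf_counter()
            target_fn(data)
            runs.append(time.perf_counter() - start)
        results[n] = statistics.median(runs)
    return results

# e.g. measure_scaling(sorted) shows how runtime grows with input size
```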
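And for question 3, a back-of-the-envelope illustration (invented numbers) of the suspected dilution effect: even a perfect 2x speedup on code that accounts for only 20% of the measured test runtime yields just a 10% overall improvement.

```python
optimized_fraction = 0.20  # share of measured test runtime spent in the patched functions
speedup = 2.0              # speedup achieved on that share

new_runtime = (1 - optimized_fraction) + optimized_fraction / speedup  # 0.80 + 0.10 = 0.90
overall_improvement = 1 - new_runtime                                  # 0.10, i.e. 10%
print(f"overall improvement: {overall_improvement:.0%}")
```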
Fully human-written |
SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Summary:
This paper introduces SWE-Perf, a repository-level benchmark for evaluating LLMs in code optimization. The dataset comprises 140 curated instances derived from performance-improving pull requests from popular GitHub repositories. In particular, for each instance, the authors focus on a subset of the unit tests that show significant performance improvements to evaluate the model-generated patches. Using metrics tailored for code optimization, the authors compare the state-of-the-art Agentless and OpenHands, observing that both show substantial room for improvement on SWE-Perf.
Strengths:
1. Realistic benchmark: The authors adopt well-established pipelines from SWE-Bench and SWE-Gym to filter PRs across 12 real-world Python repositories. Furthermore, by distinguishing between the file-level and repository-level agentic settings, the benchmark captures both targeted (potentially algorithmic) and system-wide optimizations.
2. Empirical evaluation includes both pipeline-based and agent-based paradigms, further decoupling correctness from performance.
Weaknesses:
1. Test coverage identification: The authors select only unit tests directly tied to performance optimization, which may not represent the full set of tests relevant to a target function or patch. This, in turn, could skew the empirical findings. For instance, a loose set of related tests can be estimated from the static call graph by mapping test coverage to all the functions touched by a patch; for more precision, dynamic call graphs can be studied (a minimal sketch of this idea appears after the reference list below). This problem has been extensively studied in the test coverage estimation, test case prioritization, and regression testing literature (see, e.g., [1], [2]).
2. Missing related work: The problem of code optimization has been explored in prior work [3], [4], [5] (still an incomplete list), some with LLMs, some at the repository level, and some with deep learning models. None of this literature, though directly relevant, is acknowledged, which leads to positioning this paper as the "first benchmark designed to evaluate the ability of language models to optimize code performance on real-world repositories". A clear departure from these works needs to be established, or the contributions need to be correspondingly weakened.
3. Human patch as ground truth: While expert patches are a reasonable baseline, they may not represent the optimal performance ceiling. This could underestimate the true potential of LLM-driven edits, or unknowingly penalize effective optimizations. This limitation needs to be recognized as a threat to validity.
[1] Alves, Tiago L., and Joost Visser. "Static estimation of test coverage." 2009 Ninth IEEE International Working Conference on Source Code Analysis and Manipulation. IEEE, 2009.
[2] Gligoric, Milos, Lamyaa Eloussi, and Darko Marinov. "Practical regression test selection with dynamic file dependencies." Proceedings of the 2015 International Symposium on Software Testing and Analysis. 2015.
[3] Gao, Shuzheng, et al. "Search-based LLMs for code optimization." arXiv preprint arXiv:2408.12159 (2024).
[4] Nistor, Adrian, Tian Jiang, and Lin Tan. "Discovering, reporting, and fixing performance bugs." 2013 10th Working Conference on Mining Software Repositories (MSR). IEEE, 2013.
[5] Shypula, Alexander, et al. "Learning performance-improving code edits." arXiv preprint arXiv:2302.07867 (2023).
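As a concrete (hypothetical) illustration of the coverage-based mapping suggested in weakness 1, the sketch below runs each candidate test under coverage.py and keeps the tests whose dynamically covered files intersect the files touched by a patch; the test IDs and paths are placeholders, and this is not the authors' pipeline.

```python
import subprocess
from coverage import CoverageData

def tests_covering_patch(test_ids, patched_files):
    """Keep tests whose dynamic coverage touches any file modified by the patch."""
    related = []
    for test_id in test_ids:
        # Run one test under coverage; `coverage run` overwrites the default .coverage data file.
        subprocess.run(["coverage", "run", "-m", "pytest", "-q", test_id], check=False)
        data = CoverageData()  # reads the default .coverage file
        data.read()
        if set(data.measured_files()) & set(patched_files):
            related.append(test_id)
    return related

# e.g. tests_covering_patch(["tests/test_frame.py::test_merge"],
#                           ["/abs/path/to/repo/pandas/core/frame.py"])
```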
Questions:
1. Did you consider alternative methods to broaden or validate the test subset?
Lightly AI-edited |
SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Summary:
This paper introduces SWE-Perf, the first benchmark designed to evaluate the ability of Large Language Models (LLMs) to perform code performance optimization at the repository level. The authors identify a critical gap in existing research, where benchmarks either focus on code correctness (like bug fixing) in repositories or on performance optimization in isolated, function-level scenarios. These existing benchmarks fail to capture the complexity of real-world performance tasks, which often require changes across multiple files and modules.
Strengths:
- The data collection and curation process is exceptionally thorough. The multi-phase pipeline, which includes executing tests, ensuring reproducibility in a containerized environment, and using statistical tests (a Mann-Whitney U test; see the sketch after this list) to confirm performance improvements, lends high credibility and quality to the resulting dataset.
- The authors provide a deeper analysis of how performance varies with the complexity of the task (e.g., number of target functions, original runtime) and offer qualitative insights by comparing the types of code changes made by models versus human experts. This analysis provides valuable direction for future research.
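To illustrate the statistical check mentioned in the first strength, here is a minimal example (with made-up timing samples, not the authors' exact procedure) of a one-sided Mann-Whitney U test confirming that post-patch runtimes are significantly lower than pre-patch runtimes:

```python
from scipy.stats import mannwhitneyu

runtimes_before = [2.31, 2.28, 2.35, 2.40, 2.29, 2.33]  # seconds, repeated runs
runtimes_after  = [1.62, 1.58, 1.65, 1.60, 1.59, 1.64]

# alternative="greater": the "before" runtimes are stochastically greater than the "after" ones
stat, p_value = mannwhitneyu(runtimes_before, runtimes_after, alternative="greater")
print(f"U={stat:.1f}, p={p_value:.4f}, significant at 5%: {p_value < 0.05}")
```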
Weaknesses:
- Limited Dataset Size and Generalizability: The final dataset contains 140 instances from 9 Python repositories. While the curation process justifies the small size, it may limit the statistical power and generalizability of the findings. Performance on these popular repositories might not be representative of performance on other languages or less common software projects. The authors acknowledge this limitation.
- Reliance on Unit Tests: This is a key methodological limitation. While using unit tests enables a scalable and reproducible evaluation, it essentially reduces the task to micro-benchmarking. This approach cannot fully capture the complexity of macro-level, real-world application performance. It largely ignores system-level factors such as I/O bottlenecks, database interactions, network latency, high-concurrency stress, and inter-module overhead, which are often the true sources of performance issues. Therefore, a significant performance gain observed in an isolated unit test may not translate to a meaningful, user-perceptible improvement in a production environment. This simplifies the true complexity of the "performance optimization" task, although it is arguably a pragmatic compromise for creating the first automated benchmark of its kind.
- Lack of a cost metric for the different agent scaffolds.
Questions:
See the weaknesses above.
Fully AI-generated |