How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper addresses an important problem in LLM evaluation: how to construct an efficient, reliable, and unbiased benchmark for assessing the quality of automatic test case generation methods. The core contribution is a novel framework based on binary-matrix theory, which uses the matrix rank to simultaneously determine the minimal number of wrong codes required for evaluation and a theoretical upper bound on the number of test cases needed for full error-pattern coverage. Building on this framework, the authors devise WrongSelect, an approximation algorithm for the NP-hard problem of selecting a maximally diverse set of wrong codes, culminating in the construction of TC-Bench.
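To make the rank intuition concrete, here is a toy sketch (my own reconstruction, not the authors' code). I assume the rank is taken over GF(2); the paper may define it over a different field or as Boolean rank.

```python
# Toy reconstruction of the code-test failure matrix and its rank.
# Rows = wrong codes, columns = golden tests; entry 1 means the code fails that test.
import numpy as np

def gf2_rank(matrix):
    """Rank of a binary matrix over GF(2) via Gaussian elimination."""
    m = np.array(matrix, dtype=np.uint8) % 2
    rank = 0
    rows, cols = m.shape
    for col in range(cols):
        pivot = next((r for r in range(rank, rows) if m[r, col]), None)
        if pivot is None:
            continue
        m[[rank, pivot]] = m[[pivot, rank]]   # move the pivot row up
        for r in range(rows):
            if r != rank and m[r, col]:
                m[r] ^= m[rank]               # eliminate this column elsewhere
        rank += 1
    return rank

# Three wrong codes over four golden tests; the third failure signature is the
# XOR of the first two, so it contributes no new error pattern.
failure_matrix = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
]
print(gf2_rank(failure_matrix))  # 2 -> two wrong codes already span all patterns
```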
1. The paper tackles the crucial challenge of evaluating test case generation methods for code.
2. The authors propose WrongSelect, an efficient approximation algorithm combining pre-filtering and random-restart local search. It provides a reasonable and effective solution to the NP-hard problem of selecting a maximally diverse diagnostic basis from a vast collection of wrong codes.
3. The authors release a compact and diverse benchmark, TC-Bench. By design, TC-Bench can reduce the computational cost of evaluation and is resistant to score inflation.
1. Insufficient discussion of related work. The paper could be strengthened by a more thorough discussion of related work that also leverages code-test properties, for instance CodeT [1], or other approaches that likewise build on binary code-test matrices, such as B4 [2].
2. Limited theoretical justification for the rank-coverage duality. A central claim of the paper is that the matrix rank provides an upper bound on the minimum number of test cases required for fault coverage. However, the paper lacks a formal proof or a detailed discussion substantiating this claim; a deeper analysis would strengthen it beyond intuition.
3. Limitations of the chosen diversity metric. The paper employs Jaccard similarity to measure the overlap between failure signatures, which implicitly assumes that all GTs are of equal importance. In practice, some GTs may represent extremely rare and critical edge cases, while others are more trivial; the Jaccard metric cannot capture this weighted difference (see the toy sketch below). Have the authors experimented with, or considered, other similarity metrics that could account for the varying significance of different test cases?
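As a toy illustration of this point (my own sketch, not from the paper; the weights are made up), a weighted variant of Jaccard could down-weight overlap on trivial GTs and up-weight rare edge cases:

```python
# Plain vs. weighted Jaccard over failure signatures (sets of failed GT indices).
# The weights are hypothetical and only illustrate the reviewer's concern.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def weighted_jaccard(a, b, weight):
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return sum(weight[t] for t in a & b) / sum(weight[t] for t in union)

wc1, wc2 = {0, 1, 2}, {1, 2, 3}
weights = {0: 1.0, 1: 1.0, 2: 1.0, 3: 5.0}  # GT 3 is a rare, critical edge case

print(jaccard(wc1, wc2))                    # 0.5
print(weighted_jaccard(wc1, wc2, weights))  # 0.25: the shared, trivial failures count for less
```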
[1] Chen, Bei, et al. "CodeT: Code Generation with Generated Tests." The Eleventh International Conference on Learning Representations (ICLR), 2023.
[2] Chen, Mouxiang, et al. "B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests." Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2024.
See Weaknesses. I will be glad to raise my score if the authors could provide a sufficient rebuttal.
Lightly AI-edited
---
How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper addresses how to rigorously evaluate LLM-generated test cases by reframing benchmark construction as a binary matrix problem over wrong codes (rows) and golden tests (columns). The central idea is to select a compact yet representative set of wrong codes that preserves the matrix’s diagnostic capacity while avoiding redundancy and score inflation. Concretely, the authors require the kept rows to form a row basis whose size equals the matrix rank (capturing all independent error modes), and among all such bases, they choose one that maximizes diversity by minimizing average pairwise Jaccard similarity of failure signatures.
They propose WrongSelect, an efficient approximation combining principled pre-filtering and random-restart local search. Pre-filtering drops problems with any all-ones test column (hack-prone) and removes trivial wrong codes that fail ≥80% of tests; then the local search swaps rows in/out while maintaining rank to minimize the diversity objective.
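To check my understanding, here is a minimal sketch of the procedure as summarized above; it is my reconstruction, not the authors' implementation. Computing the rank over the reals via numpy and the specifics of the local search are simplifying assumptions on my part.

```python
# Sketch of a WrongSelect-style selection: pre-filter, then random-restart
# local search that swaps rows while preserving rank and lowering the
# average pairwise Jaccard similarity of failure signatures.
import random
import numpy as np

def prefilter(M, threshold=0.8):
    """Drop all-ones test columns (hack-prone) and wrong codes failing >= threshold of tests."""
    M = M[:, ~np.all(M == 1, axis=0)]
    return M[M.mean(axis=1) < threshold]

def avg_pairwise_jaccard(M, idx):
    sims = []
    for i in range(len(idx)):
        for j in range(i + 1, len(idx)):
            a, b = M[idx[i]].astype(bool), M[idx[j]].astype(bool)
            union = np.logical_or(a, b).sum()
            sims.append(np.logical_and(a, b).sum() / union if union else 0.0)
    return float(np.mean(sims)) if sims else 0.0

def wrong_select(M, restarts=10, iters=200, seed=0):
    rng = random.Random(seed)
    r = np.linalg.matrix_rank(M)
    best_idx, best_obj = None, float("inf")
    for _ in range(restarts):
        idx = rng.sample(range(len(M)), r)
        while np.linalg.matrix_rank(M[idx]) < r:        # resample until the rows span the row space
            idx = rng.sample(range(len(M)), r)
        obj = avg_pairwise_jaccard(M, idx)
        for _ in range(iters):
            pool = [i for i in range(len(M)) if i not in idx]
            if not pool:
                break
            out, cand = rng.randrange(r), rng.choice(pool)
            new_idx = idx[:out] + [cand] + idx[out + 1:]
            if np.linalg.matrix_rank(M[new_idx]) == r:  # keep full rank
                new_obj = avg_pairwise_jaccard(M, new_idx)
                if new_obj < obj:                       # accept only diversity-improving swaps
                    idx, obj = new_idx, new_obj
        if obj < best_obj:
            best_idx, best_obj = idx, obj
    return best_idx, best_obj
```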
1. The paper presents a fresh and elegant formulation that connects test selection to linear algebra.
Modeling wrong-code failure signatures as rows in a binary matrix and selecting a rank-preserving, diversity-maximizing row basis is a principled way to eliminate redundancy while retaining all independent error modes. This perspective clarifies what “good coverage” means and gives a theoretical upper bound on the minimal number of tests.
2. The empirical results are strong.
The proposed selection method yields a compact, diverse benchmark that meaningfully stresses current test generators, and the reported exclusion rates show clear, consistent improvements over baselines. The ablations and end-to-end evaluations convincingly demonstrate both effectiveness and practicality at scale.
1. The approach assumes access to a sufficiently rich pool of golden tests (public + private) to build informative failure profiles.
In many real-world settings, curating or synthesizing high-quality golden tests is costly and time-consuming. This reliance reduces the method’s portability and makes it difficult for third parties to reproduce or extend the benchmark to new domains or problem sets without significant investment.
2. While the formulation is theoretically sound (rank preservation plus diversity), the paper provides limited qualitative analysis to illustrate practical effectiveness.
Concrete case studies—e.g., representative wrong-code examples retained vs. removed, error modes uncovered by the selected basis, or developer-facing insights derived from the reduced matrix—would strengthen the claim that the method improves understanding and diagnosis beyond raw metrics.
Could you provide a concrete code example illustrating the effectiveness of your approach?
Fully AI-generated
---
How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective
Soundness: 3: good
Presentation: 2: fair
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes a new benchmark to evaluate LLMs' ability to generate high-quality test cases for coding problems, where quality is determined by how efficiently the tests cover a problem's error space. To better represent the error space, the authors form an error matrix whose rows are submissions and whose entries record pass/fail results on all test cases. Since the rank of this 2D matrix conveniently represents the "coverage" of submission patterns, they develop a greedy algorithm to maximize the diversity of wrong code submissions, which in turn yields a more compact submission set for efficient evaluation.
The authors then evaluate several test case generation methods across various LLMs on two main metrics: the accuracy of generated test cases (PR) and the degree of successful filtering of wrong code submissions (HR). The results show that the choice of generation method matters significantly more than the choice of the underlying LLM. At the same time, the relatively underwhelming numbers (the best combination reaches only ~60% HR) point to the importance of constructing a high-quality set of wrong code; the analysis of the unfiltered version of the dataset reveals inflated, inaccurate scores.
* A meaningful contribution to a much-needed area, both in terms of efficiency and framework. The method can see practical use cases.
* The proposed method for selecting the minimal set of wrong code submissions is principled and reasonable.
* The outcome dataset effectively pinpoints the need for better TCG methods, as evidenced by the underwhelming numbers.
* Apples-to-apples comparison against previous work: It would be stronger if you could replicate the other test case evaluation methods on the same problem set and show how much more critically TC-Bench measures the TCG quality of LLMs. I'm aware this is done partially by comparing the method against the "All WC" counterpart.
* WrongSelect's robustness: It's uncertain if the greedy algorithm yields a stable set of problems.
* The need for translation: I assume the dataset is entirely sourced from non-English problems. The choice of data sources affects the quality and the number of (valid) wrong code submissions per problem. Have the authors explored English data? Reducing the submission count to less than 2% depends on the population of wrong code submissions.
---
Below are NOT weaknesses of the method, but I think they are important to raise for a higher-quality submission. I'm leaving them here as they don't quite fit under "Questions".
* Presentation 1: Overall there is heavy use of acronyms, such as "some WCs labeled as WA under GTs produce RE or TLE when executed on ATs". I got somewhat used to it by the end, but I often had to go back and refresh my glossary. I would highly suggest fixing the inconsistency (e.g., "wrong code" vs. WC) and reducing abbreviations overall.
* Presentation 2: Plot styles are inconsistent: mostly Comic Sans, but some suddenly switch to a serif font (Times?). It looks patched together; I recommend following a single style.
* Appendix B has no content
* How much variance in PR / HR do you observe when running the optimization multiple times? When you say it converges, do the runs arrive at the same set of WCs?
* In the analysis comparing the method against "All WCs", is the All-WCs set taken before or after pre-filtering?
Fully human-written
---
How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This work addresses the challenge of evaluating automatically generated test cases for code using LLMs. The authors propose a new framework that models the relationship between code and test cases as a binary matrix, where the matrix rank reveals the minimal number of distinct wrong codes and test cases needed for evaluation. They introduce WrongSelect, an efficient algorithm for selecting a maximally diverse and representative set of wrong codes to form a compact benchmark, TC-Bench. Built from millions of competitive programming submissions, TC-Bench offers an efficient, diverse, and inflation-resistant benchmark for assessing test-case generation. Evaluation across 13 leading LLMs shows that even the best current methods achieve only around 60% fault detection, highlighting significant room for improvement in automated test generation and evaluation.
- Formulating the problem as a code-test matrix
- Number of evaluated LLMs
- Approximation algorithm
- Benchmark construction
- Source of data from programming contests
- Theoretically OK, but what about practicality?
- Not studying other similarity metrics
- How practical is this technique in industry settings?
- Do you think your technique generalizes to other datasets, given that your evaluation only uses programming contests?
- Do you think other similarity metrics would fit better than Jaccard?
Fully human-written |