ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (25%) | 4.00 | 4.00 | 5378 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 3 (75%) | 5.33 | 3.00 | 2893 |
| Total | 4 (100%) | 5.00 | 3.25 | 3514 |
Individual reviews (Title, Ratings, Review Text, EditLens Prediction)
Fundamental bounds on efficiency-confidence trade-off for transductive conformal prediction

Soundness: 3: good
Presentation: 1: poor
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

In the standard setup for conformal prediction, coverage is guaranteed for each test point separately. This paper addresses joint prediction sets, where the guarantee is on joint coverage: all labels in a batch must lie in their prediction sets for the batch to count as covered (formalized in the sketch following this review). The authors pose two questions: one on information-theoretic bounds on the trade-off between set size and joint coverage, and the other on how we can find an optimal way to leverage the entire labeled set (including the training set) for conformal prediction without sacrificing the coverage guarantee (conventionally, this guarantee breaks if we introduce training points into the process). The authors provide lower bounds for this setup showing that either the set size increases drastically or the guarantee vanishes to zero as the number of points over which the decision is jointly made increases. Furthermore, they discuss a specific case where the joint prediction is over a single label -- the case where a single datapoint is examined several times for robustness.

*Disclaimer.* I should note that due to the presentation issues I mention in the weaknesses, I could not fully follow the paper.

1. The targeted problem is very interesting (although its naming intersects with full conformal prediction). Being able to provide a joint guarantee better than the bounds from the Bonferroni correction is practically useful.
2. The theory of the paper is concretely written, and (to the best of my understanding while reading the paper) all theorems are proved solidly.
3. The connection of their work to robustness (although it should not be mistaken for adversarial or probabilistic robustness) is very interesting.

**Introduction is not easy to read.** The introduction is written with solid mathematical notation, and there is no issue with that. But it is not friendly to a non-expert reader. I would suggest elaborating more on the comparison of inductive and transductive conformal prediction, for example, and offering a brief one-sentence explanation of what is meant whenever a new concept is introduced. One really helpful way to improve readability is to say directly that "by transductive conformal inference on a batch of datapoints we are interested in the probability that all labels are within all prediction sets". I know that formally it can be inferred from the text, but saying it explicitly in the introduction would increase reading speed considerably. The authors introduce two interesting questions in Section 2, yet I cannot see a footprint of those questions (specifically the second one) in the introduction. Even the setup with all equal labels is not presented as such in the introduction. At the least, the authors could mention the robust-prediction application when introducing that setup for the first time.

**General readability.** The paper (due to its subject) is already not easy to read, and the authors sometimes introduce new notation without defining it, for instance $P_e$ in line 48.
The authors also do not provide any intuitive message from the theorems they prove (for example, in Theorem 3.1 I cannot parse what the theorem is trying to say about the joint probabilities).

**Limited experimental setup.** The paper only presents numerical results on the MNIST dataset. I do not count this as a negative point in my score, as this is a theory paper. The question remains how their results could be written as a lower bound for any dataset. If the authors provide a clear algorithmic approach to derive the bounds for any dataset, then I would increase my score.

1. In line 54, is the term $P(Y_{m+1}^{m+n} \mid X_{m+1}^{m+n})$ equal to the product of all conditional probabilities from $m+1$ to $m+n$? If so, can you elaborate on why it is not a function of the number of elements? From reading your proof I assume this is because the value $\alpha$ already encodes the number of elements, but I am not sure why.
2. Is the entropy mentioned in Theorem 3.2 written in terms of the true confidence? Are there any bounds on the accuracy of the model?
3. What is $\delta$ in Theorem 3.4? Also, is there any intuitive understanding of the other variables $\sigma$ and $\rho$? What are they encoding?
4. How can your results be expressed as a guaranteed upper bound on the joint guarantee for any dataset? Is this bound affected by the number of classes or the quality of the model? Or is there any need to exactly specify the ground truth $p(y|x)$ to derive these bounds?

Fully human-written
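The following is a minimal formalization of the contrast described in this review's summary, using generic notation (the sets $C_i$, batch indices $m+1,\dots,m+n$, and level $\alpha$ are illustrative, not necessarily the paper's symbols). The standard per-point (inductive) guarantee is

$$\Pr\!\left(Y_i \in C_i(X_i)\right) \ge 1-\alpha \quad \text{for each } i = m+1,\dots,m+n,$$

whereas the transductive (joint) guarantee requires the whole batch to be covered simultaneously,

$$\Pr\!\left(Y_i \in C_i(X_i) \ \text{for all } i = m+1,\dots,m+n\right) \ge 1-\alpha.$$

The Bonferroni baseline mentioned in the strengths recovers the joint guarantee from any per-point method run at level $\alpha/n$, via the union bound $\Pr\!\left(\exists\, i:\ Y_i \notin C_i(X_i)\right) \le n\cdot\tfrac{\alpha}{n} = \alpha$, typically at the cost of much larger sets.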
Fundamental bounds on efficiency-confidence trade-off for transductive conformal prediction

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

The paper asks how small a joint prediction set can be when doing transductive conformal prediction at a given confidence. It proves lower bounds on the expected set size. The main asymptotic message is that, to maintain any non-trivial confidence, the expected joint set must grow roughly like $\exp(n\,H(Y\mid X))$. A non-asymptotic refinement adds a second-order "dispersion" term that depends on the variance of the log probabilities. The authors also give an achievability result in an oracle setting where $P(Y\mid X)$ is known, and they analyze a special case where all test points share one true label by connecting to classical hypothesis testing with the generalized Jensen–Shannon divergence. They include a small MNIST toy study to illustrate trends.

The paper puts a clean information-theoretic lens on transductive conformal prediction and tries to sharpen earlier entropy-based limits with a finite-sample expansion. The reduction to hypothesis testing in the same-label case is neatly explained and hooks into known optimal tests. The statements are precise, the asymptotic and finite-blocklength styles are consistent, and the proofs trace to standard tools like information-density bounds, Berry–Esseen, and the method of types. The paper is careful about what is guaranteed and where the constants come from. Key quantities like the efficiency rate, the dispersion, and the role of conditional entropy are defined clearly. The contrast between joint confidence and per-point Bonferroni is made explicit and illustrated. The same-label section is self-contained and readable. Joint guarantees are relevant in batch certification, safety screening, and ranking. Having even pessimistic lower bounds helps practitioners understand why transductive sets often balloon as $n$ grows. The work could become a common reference when teams debate whether transductive guarantees are worth the price in set size.

The headline asymptotic bound reaffirms the conditional-entropy barrier already highlighted in recent work on conformal efficiency. The finite-sample refinement with a dispersion term is welcome, but the experiments show a persistent gap that closes slowly, which makes it hard to see the practical sharpening. The achievability result assumes oracle access to $P(Y\mid X)$. That turns the task into thresholding products of true class probabilities and inevitably matches the converse to first and second order. This is informative theoretically but not actionable. The paper stops short of proposing any implementable transductive procedure that approaches the bound when $P(Y\mid X)$ is learned with error.

Efficiency is measured only by expected set size. In transductive practice, teams often optimize other surrogates like false coverage proportion, cost-weighted set size, or rank-based utility. The bounds are said to "extend" to other notions, but those are deferred. Without at least one nontrivial worked-out alternative, the generality claim feels thin.

The MNIST label-noise toy is not enough. It uses simple noise models where $H(Y\mid X)$ is tractable, then shows that Bonferroni blows up. That result is unsurprising. There is no stress test on real transductive use cases such as ranking or batch classification with covariate shift, no study of how well one can estimate $H(Y\mid X)$ and the dispersion from data, and no attempt to check tightness against a strong transductive algorithm rather than a Bonferroni baseline. The dispersion is defined as the standard deviation of $\log P(Y\mid X)$. This could be quite useful as a diagnostic, yet the paper does not show how to estimate it reliably from finite data or how it correlates with observed set growth. The reader is left without a recipe to turn the bound into an engineering rule of thumb (a plug-in sketch follows this review).

Can you give a theorem or a corollary showing that your finite-sample lower bound strictly dominates prior entropy-only bounds over a clear domain of $\alpha$ and $n$, with explicit constants? A small synthetic example where you can compute both exactly would make the gain concrete. Suppose $P(Y\mid X)$ is approximated by a calibrated classifier with a known risk or Bregman divergence to the truth. Can you translate that misspecification into slack on the achievability side, even if loose? A bound of the form "the gap grows as $\epsilon$ to some power" would be valuable. Pick one alternative efficiency metric, for example the expected rank of the true label inside the joint set or a budgeted FCP, and carry your full derivation through to a nontrivial corollary. This would support the claim that the framework covers broader measures. Can you sketch an extension of the finite-sample bound to continuous $X$ with mild regularity on scores, perhaps using empirical-process tools instead of types? Even a simplified theorem for plug-in density ratios would widen the impact. Provide a practical estimator and a confidence band for $H(Y\mid X)$ and the dispersion from held-out data, with a study of bias and variance, then compare the predicted lower bound against observed growth on at least one real task. This would turn the theory into a planning tool. Finally, compare your bound with the joint set sizes produced by a modern transductive method that is more nuanced than Bonferroni, across a range of $n$ and calibration sizes, and highlight the regimes where the bound is close to achievable and where there is a big gap.

Fully AI-generated
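The following is a minimal plug-in sketch of the kind of estimator the review asks for: estimating $H(Y\mid X)$ and the dispersion from held-out predicted probabilities, and turning them into the first-order set-size scale $\exp(n\,H(Y\mid X))$. All names are illustrative; note that the entropy estimate equals $H(Y\mid X)$ only if the probabilities are the true conditionals, and with a learned model it is the held-out cross-entropy, which upper-bounds the conditional entropy.

```python
import numpy as np

def entropy_and_dispersion(probs, labels):
    """Plug-in estimates of H(Y|X) and the dispersion (std of log P(Y|X))
    from held-out predicted probabilities.

    probs:  (N, K) array of predicted class probabilities (assumed calibrated).
    labels: (N,) array of true labels in {0, ..., K-1}.
    """
    eps = 1e-12
    # log-probability assigned to the true label of each held-out point
    log_p_true = np.log(probs[np.arange(len(labels)), labels] + eps)
    cond_entropy = -log_p_true.mean()      # estimate of H(Y|X) in nats
    dispersion = log_p_true.std(ddof=1)    # estimate of sd(log P(Y|X))
    return cond_entropy, dispersion

def joint_set_scale(cond_entropy, n):
    """First-order scale of the expected joint set size for a batch of n
    test points: exp(n * H(Y|X)), per the entropy barrier discussed above."""
    return np.exp(n * cond_entropy)

if __name__ == "__main__":
    # usage sketch with random placeholder data standing in for softmax outputs
    rng = np.random.default_rng(0)
    K, N = 10, 5000
    probs = rng.dirichlet(np.ones(K), size=N)
    labels = np.array([rng.choice(K, p=p) for p in probs])
    H, sigma = entropy_and_dispersion(probs, labels)
    print(f"H(Y|X) ~ {H:.3f} nats, dispersion ~ {sigma:.3f}")
    print(f"first-order joint set scale for n=5: {joint_set_scale(H, 5):.1f}")
```

A bootstrap over the held-out points would give the confidence band for both quantities that the review suggests comparing against observed set growth.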
Fundamental bounds on efficiency-confidence trade-off for transductive conformal prediction

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

This submission is essentially split into two separate parts: first, it provides bounds on the set size for "transductive" CP by assuming a known data distribution; then it provides a set-based extension of a known result by Gutman for a special case of the problem. The first part builds on standard information-theoretic tools, while the second leverages standard results on binary hypothesis testing with training data. Experimental results are provided for the first part only.

The two parts of the submission, while limited on their own, offer a useful reference for researchers interested in "transductive" CP (defined here as the problem of producing a set prediction for a batch of test samples). The paper is clearly and formally written, and useful pointers are provided to the literature.

The analysis in Section 3 essentially assumes knowledge of the data distribution. While it is true that this assumption yields upper bounds on the true performance of transductive CP, the assumption also appears to completely hide the role of the calibration data size m. Furthermore, the results of the analysis appear to be rather expected and limited in scope. The typical set of a sequence of i.i.d. variables grows exponentially with the entropy, and so must any prediction set with non-vanishing coverage (a sketch of this standard argument follows the review). The result in Theorem 3.4 is also a refinement of the same idea. The setting studied in Section 4 is a direct extension of Gutman's work on binary hypothesis testing with training data. In fact, most of the section is devoted to reviewing existing results, and the new contribution follows directly by reframing the problem as one of set prediction. No experimental results are provided for the setting studied in Section 4.

1) How can the analysis in Section 3 be extended to provide insights on the role of the size of the calibration dataset?
2) Can Section 4 be rewritten to address directly the case with any number of hypotheses?
3) If so, can Theorem 5 be simplified to provide clearer insights into the average prediction set size?
4) How do the results in Section 4 connect to the theorem in Section 3?
5) Can experimental results be provided to relate the material in Section 3 and Section 4?

Fully human-written
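The following is a minimal sketch of the standard typical-set argument invoked in the weaknesses, written in generic notation rather than the paper's. By the AEP, for i.i.d. pairs $(X_i, Y_i)$ and any $\epsilon > 0$,

$$\Pr\!\left(\left|-\tfrac{1}{n}\log P\!\left(Y_1^n \mid X_1^n\right) - H(Y\mid X)\right| > \epsilon\right) \xrightarrow[n\to\infty]{} 0,$$

so every conditionally typical label sequence has probability at most $e^{-n(H(Y\mid X)-\epsilon)}$. Any prediction set $\Gamma(X_1^n)$ with $\Pr\!\left(Y_1^n \in \Gamma(X_1^n)\right) \ge 1-\alpha$ must therefore contain at least roughly $(1-\alpha-o(1))\,e^{n(H(Y\mid X)-\epsilon)}$ sequences, i.e., its size grows exponentially with the conditional entropy.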
Fundamental bounds on efficiency-confidence trade-off for transductive conformal prediction

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This is a theoretical paper that characterizes the trade-off between coverage and set size of transductive conformal prediction in the classification setting. In transductive conformal prediction, the goal is to construct prediction sets for $n$ test points such that all test labels fall within the joint prediction set with probability at least $1-\alpha$. The paper proves both asymptotic limits and non-asymptotic bounds for the "efficiency rate", a normalized measure of the size of the transductive prediction set.

The paper makes some interesting connections between conformal prediction and information theory and applies some interesting tools.

As someone who does not have a strong background in information theory or familiarity with Gutman's test, I found it hard to follow Section 4.

1. Theorem 3.1: Can you add an interpretation of this result in words? This applies to all theorem statements.
2. Do you view this work as more than simply "theory for the sake of theory"? Can this theory eventually lead to work that will inform practice?
3. On line 53, you write "When all test points share the same label, a scenario relevant to safety-critical applications…" — what is an example?

Typos/stylistic comments:
* There are multiple places where \citep is used where \citet should be used instead
* I would mention somewhere that what you call "confidence" is commonly called "coverage" in the conformal prediction literature
* Line 53: "same label" -> "same unknown label"
* Line 143: Capitalize "in"
* Line 174-175: "asymptotically" is used twice in this sentence
* Line 175: "In case" -> "In the case"
* Line 245: remove "setup"
* Line 249-250: I would replace the "=" with ":="
* Line 281: "prediction sets a single" -> "prediction sets with a single"
* Line 283: math mode error
* Theorem 4.4: "M class"? Should this be "M-class setting"?

Fully human-written