ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 2 (50%) | 4.00 | 2.50 | 1748 |
| Fully human-written | 2 (50%) | 4.00 | 3.00 | 5684 |
| Total | 4 (100%) | 4.00 | 2.75 | 3716 |
Title Ratings Review Text EditLens Prediction
How to Teach Label to Understand Decisions: A Decision-aware Label Distribution Learning Framework

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes a novel framework for decision-aware learning. It is based on LDL (label distribution learning), which in this context performs label enhancement in a decision-driven way: the label space is expanded to accommodate a certain number of decision-aware labels, whose support depends on the "transfer cost" between the decisions that would be made if one label versus another were presented for the data point at hand. The method is quite performant, achieving consistently the best overall regret and the best robustness compared to a variety of vanilla and more advanced decision-focused benchmark methods; this was tested on both synthetic and real-world data on classic and substantively nontrivial combinatorial decision tasks, so I overall found the selection of tasks for the empirical evaluation to be good. In particular, there is a pronounced low-data-regime superiority compared to other methods.

Strengths:
On the methodological side, the proposed method exhibits an intuitive multi-stage structure that lets it generate learned label distributions that can then be used to solve a large variety of decision-oriented problems; the distributions are learned in a way that leverages the downstream decisions and how those are correlated across samples. The use of such correlations at the decision/label level, as opposed to earlier approaches that embraced adaptivity at the feature level, appears in certain ways more principled in decision-focused settings than those prior methods with local feature-based adaptivity.

Weaknesses:
To start, the proposed method appears to be very computationally demanding compared to the alternatives it is benchmarked against. I could not find a runtime comparison in the PDF manuscript, so in the interests of transparency I'd like to ask for it to be disclosed. Overall, the expansion of the label space into a custom "label-distribution" space involves a lot of matrix computations, together with many parallel neural networks with non-trivial architecture. Furthermore, due to the heavy parameterization of the method, there appears to be a risk of non-robustness to misspecification that could be more pronounced than with the other methods. Thus, it would have been important to see how well this method does in noisy/drift-prone settings where the decision maps and/or labels could be misspecified.

Also, examining the plots, I would agree that the proposed method does exhibit a substantial regret benefit over the benchmark methods on aggregate (and that it is more robustly performant, with box widths smaller than the rest), but I would be more moderate in the performance-gain claims given that there is still substantial overlap between the boxes in most cases. Thus, what we can deduce with a lot of certainty is that the proposed method obtains much better regret than the naive benchmark in almost all evaluations, which the other methods by and large cannot consistently achieve in the sense of box plots. However, what I believe we cannot claim with absolute certainty is the superiority of the proposed method over all of the benchmarks at once: for instance, the KNN-based benchmark is usually in the same ballpark. Moreover, on the real-world multi-item newsvendor (Figure 3), the performance of most methods looks quite evenly matched, modulo the variation/box width.

As another meta-issue, while I appreciated the logical nature of the proposed pipeline, I was not as convinced about the variety of heuristic choices that went into it at most junctures (many of these choices are not ablated against and would in fact be difficult to ablate). This relates to the neural-net architectures and to hyperparameters such as the neighbor count M used when deciding on the largest transfer components; it also relates to other, subtler design choices that could have been made but were not, and were usually not discussed. Just as an example, when finding the M highest-transfer-cost samples, performance sounds like it could be quite sensitive to the hyperparameter M; similar samples could instead be clustered together before performing this step, as the optimal M would then be found in a smaller, more robust range.

Questions:
Please see above. Furthermore, I have some additional questions which, if the authors are able to address them, would likely require plots different from the ones displayed. First, the adaptively chosen support is mentioned quite a few times, but there are no illustrations that specifically showcase the adaptivity/variation in support across instances on any of the tasks, so I'd request that this be provided. Second, there is a lot of mention of the non-differentiability-related difficulties of the standard existing predict-then-optimize approaches that are based on designing customized decision-aware losses. Yet the comparison in the experimental section remains high-level and does not focus on exhibiting the favorable contrast as it specifically pertains to non-smoothness issues: I could imagine a dataset where decisions are intentionally set to be very discontinuous, showing the benefits that the current framework has over decision-loss-based ones locally. Currently, based on the results of this paper, it appears that the added benefit may lie in the extra stability of the proposed method, as I imagine it to be quite computationally demanding compared to any decision-loss-optimizing method, smooth or nonsmooth. Third, returning to the point of the KNN method being one of the most closely matched, this raises the question of whether feature-level similarity may in any way have translated to decision-level similarity. If there is a way to display whether or not that is the case empirically, that would be great; otherwise, a qualitative discussion for each of the two studied settings would suffice.

EditLens prediction: Fully human-written
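Since this review questions the sensitivity of the neighbor count M in the transfer-cost step, here is a minimal sketch of that selection, assuming a toy newsvendor-style cost, a trivial decision oracle, and ranking by lowest transfer cost (i.e., highest transferability); none of these are the paper's exact definitions.

```python
# Minimal sketch of the transfer-cost-based neighbor selection whose hyperparameter M
# the review discusses. The newsvendor-style cost, the decision oracle, and ranking by
# lowest transfer cost are illustrative assumptions, not the paper's exact definitions.
import numpy as np

def task_cost(z, y, over=1.0, under=2.0):
    """Assumed stand-in task cost c(z, y): over- and under-stocking penalties."""
    return over * max(z - y, 0.0) + under * max(y - z, 0.0)

def optimal_decision(y):
    """Decision oracle z*(y) = argmin_z c(z, y); trivial for this toy cost."""
    return y

def transfer_cost(y_i, y_j):
    """Extra cost on sample i when reusing the decision that is optimal for sample j."""
    return task_cost(optimal_decision(y_j), y_i) - task_cost(optimal_decision(y_i), y_i)

def top_m_neighbors(Y, i, M):
    """Indices of the M samples whose decisions transfer best to sample i."""
    costs = np.array([transfer_cost(Y[i], Y[j]) for j in range(len(Y))], dtype=float)
    costs[i] = np.inf                          # exclude the sample itself
    return np.argsort(costs)[:M]

Y = np.array([3.0, 3.1, 7.0, 7.2, 12.0])       # labels (e.g. demands)
print(top_m_neighbors(Y, i=0, M=2))            # -> [1 2]: decision-wise closest samples
```

The reviewer's clustering suggestion would amount to running this selection over cluster representatives of Y rather than over all raw samples.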
How to Teach Label to Understand Decisions: A Decision-aware Label Distribution Learning Framework

Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper proposes a decision-aware Label Distribution Learning (LDL) pipeline for contextual stochastic optimization (CSO). It constructs a decision-aware similarity from optimization transfer costs to build per-sample discrete label supports, estimates mixture weights via a manifold objective combining feature and task graphs, and trains dual-branch networks with an MMD loss to predict mixture positions and weights. Joint distributions are factored over marginals; downstream decisions minimize expected cost over the learned discrete mixtures. The authors perform an empirical evaluation on a multi-item newsvendor problem and a small quadratic network flow problem and report lower regret than baselines.

Strengths:
- The design of the framework seems novel.

Weaknesses:
- Limited information on the experimental protocol (especially on how baselines were trained and applied) raises concerns about reproducibility.
- It seems to me that the paper lacks some important baselines from the families of differentiable solvers, direct task-loss optimization with soft surrogates, etc.
- Random 80:20 split on the newsvendor data? Is it, in fact, time-series data? Is there potential leakage/test-set contamination here?
- Nit: results are reported only as plots, without means/standard deviations in a table.

Questions:
Please see the weaknesses section.

EditLens prediction: Lightly AI-edited
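This review's summary mentions dual-branch networks trained with an MMD loss over discrete mixtures. Below is a minimal sketch of a squared MMD between two weighted discrete supports, assuming an RBF kernel and this particular weighted formulation; the paper's actual kernel and loss may differ.

```python
# Minimal sketch of a squared MMD between two weighted discrete (mixture) supports,
# of the kind that could compare a predicted label mixture against an enriched target
# one. The RBF kernel and this weighted formulation are assumptions for illustration.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def weighted_mmd2(X, wx, Y, wy, gamma=1.0):
    """Squared MMD between the discrete distributions sum_i wx_i delta_{X_i} and sum_j wy_j delta_{Y_j}."""
    return (wx @ rbf_kernel(X, X, gamma) @ wx
            - 2.0 * wx @ rbf_kernel(X, Y, gamma) @ wy
            + wy @ rbf_kernel(Y, Y, gamma) @ wy)

# Predicted support/weights vs. a target support/weights for a one-dimensional label.
X, wx = np.array([[2.9], [7.1]]), np.array([0.5, 0.5])
Y, wy = np.array([[3.0], [7.0], [12.0]]), np.array([0.4, 0.4, 0.2])
print(weighted_mmd2(X, wx, Y, wy))             # small positive value; 0 iff the mixtures match
```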
How to Teach Label to Understand Decisions: A Decision-aware Label Distribution Learning Framework

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper proposes a decision-aware Label Distribution Learning (LDL) framework for Contextual Stochastic Optimization (CSO). By incorporating decision information into data representations via a decision-aware similarity matrix and predicting full label distributions, the method avoids modifying loss functions while naturally reducing risk in high-cost regions. Experiments on synthetic and real-world datasets, including comparisons with SAA, prescriptive analytics, and feature-based LDL, show consistent regret reduction, particularly in low-data settings, demonstrating the method's effectiveness for decision-focused learning tasks.

Strengths:
1. The proposed decision-aware LDL framework is novel, introducing a similarity matrix that explicitly incorporates decision information.
2. Experiments show that LDL achieves consistently lower regret and higher stability than baselines across both synthetic and real-world datasets.

Weaknesses:
1. There is no analysis of the hyperparameters, i.e., P, M, $\alpha$, and $\lambda$.
2. Although the proposed decision-aware LDL framework is novel and achieves strong performance, the manuscript does not discuss computational efficiency. LDL involves multiple steps, which may be slower than simpler baselines such as SAA or KNN. It would be better to include a table showing average training and inference times for all methods.
3. The decision-aware similarity matrix S is a key innovation, but no visualization or interpretability analysis is provided. It would help to show a heatmap comparing S with a feature-based similarity matrix.
4. Experiments only consider small-scale problems (K=2 or 4), so the method's scalability is unclear. It would be beneficial to include an experiment on larger-scale problems to assess performance and computational feasibility.
5. The manuscript lacks comparison with recent representative learning-to-decide or end-to-end decision-learning frameworks. Including such baselines would better contextualize the performance of the proposed method.

Overall, the idea is novel, but the experiments are insufficient, so I give a score of 4. I will base my final score on the other reviewers' comments and the authors' responses.

Questions:
Please refer to the details in the weaknesses.

EditLens prediction: Lightly AI-edited
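Weakness 3 of this review asks for a heatmap comparing the decision-aware similarity matrix S with a feature-based one. Here is a minimal sketch of how such a comparison could be produced, assuming Gaussian-kernel feature similarity, a toy newsvendor-style cost, and a symmetrized transfer cost; these are illustrative choices, not the paper's definition of S.

```python
# Minimal sketch of the comparison suggested in weakness 3: a feature-based similarity
# matrix alongside a decision-aware one built from transfer costs. The Gaussian kernel,
# the toy cost, and the symmetrization are illustrative assumptions, not the paper's S.
import numpy as np

def feature_similarity(X, gamma=1.0):
    """Gaussian-kernel similarity on the contexts x."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def decision_similarity(Y, cost, decide, beta=1.0):
    """Similarity built from symmetrized transfer costs between the labels' optimal decisions."""
    n = len(Y)
    T = np.array([[cost(decide(Y[j]), Y[i]) - cost(decide(Y[i]), Y[i]) for j in range(n)]
                  for i in range(n)])
    return np.exp(-beta * 0.5 * (T + T.T))

cost = lambda z, y: max(z - y, 0.0) + 2.0 * max(y - z, 0.0)   # toy newsvendor cost
decide = lambda y: y                                          # toy decision oracle z*(y)

X = np.array([[0.1], [0.2], [2.0]])                           # contexts
Y = np.array([3.0, 3.1, 7.0])                                 # labels
S_feat, S_dec = feature_similarity(X), decision_similarity(Y, cost, decide)
print(np.round(S_feat, 2))
print(np.round(S_dec, 2))
# Plotting the two matrices (e.g. with matplotlib's imshow) gives the requested heatmaps.
```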
How to Teach Label to Understand Decisions: A Decision-aware Label Distribution Learning Framework

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper studies the problem of contextual stochastic optimization, where the setup is as follows: there is a joint distribution $P$ over an observed context $x$ and unobserved problem parameters $y$, and a cost function $c(z, y)$ that associates a cost with each decision $z$ and problem parameters $y$. Given a context $x$, we would like to solve the optimization problem
$$ \min_{z \in \mathcal{Z}} \mathbb{E}_{y \sim P(y \mid x)}[c(z, y)]. $$
The distribution $P$ is not known, but we have a dataset of samples $(x_1, y_1), \ldots, (x_m, y_m)$ drawn from $P$. The approach explored in this paper and in prior work roughly involves using the data to learn a model that predicts, for each context $x$, the distribution over outcomes $y$. Then the optimization problem is solved using the predicted distribution in place of the true conditional distribution of $y$ given $x$. A recent line of work studies "Integrated Learning and Optimization", where the loss function used when learning to predict the distribution of $y$ given $x$ is informed by the downstream task (instead of simply being some generic measure of distributional similarity). However, the authors point out that a weakness of this approach is that these loss functions need to be designed bespoke for each downstream task, and are often difficult to work with due to being non-differentiable or discontinuous. This paper proposes a new approach called Label Distribution Learning (LDL) that does not require per-problem loss derivations, and argues that it generally achieves decision-awareness (i.e., works well for most downstream tasks). The high-level idea of the proposal is as follows:
1. From the training data of $(x_i, y_i)$ pairs, construct a distribution $p_i$ over $\mathcal{Y}$ for each example. The distribution $p_i$ is a product distribution over the coordinates of $\mathcal{Y}$ where each coordinate's marginal distribution is supported on a finite set of values. The support of each marginal distribution and the weights associated with each value are determined from the training data by incorporating the data manifold as well as the decision objectives.
2. Next, train a two-branch neural network to simultaneously predict the support and weights from the context $x$.
The authors then carry out experiments comparing their proposed method against several baselines on two separate tasks with both real and simulated data. The experiments show that the proposed method works well on these tasks.

Strengths:
The problem studied by the paper is interesting, and the approach seems quite different from prior work and is innovative and interesting. The experimental results are somewhat limited (only two decision tasks) but show promise for the proposed approach.

Weaknesses:
At a high level, the label enrichment process described in the paper makes intuitive sense. However, at the same time, there are no formal guarantees or arguments suggesting that the approach will always result in predicted conditional distributions over problem parameters that work well for the downstream decision task. While a formal guarantee is not required, the experiments section is limited to two decision tasks, so it remains unclear if the proposed approach would continue to work well across a wide range of tasks. My main concern with the paper is further justification for the details of the approach, either with theory arguing that the specific approach will capture important problem-specific structures, or with a broader experimental evaluation.

To give one example of a situation where the proposed approach might go wrong, suppose that the dataset of $(x, y)$ pairs has the property that every pair it contains appears at least $M$ times (so that there are many duplicates of every example). In this case, the set of top-$M$ neighbors that have maximal transferability to a given $(x, y)$ pair (defined on line 242) will be the $M$ copies of $(x, y)$. As a result, the support of each marginal distribution over the coordinates of $y$ will contain a single value: the one that was present in $y$. After this, my understanding is that the label enrichment process will associate each training $(x, y)$ pair with a distribution $p_i$ that is a point mass on $y$, which seems to undermine the goals of the process. While this is an extreme example, it seems that softer versions of it could cause the LDL method to behave poorly.

A second weakness (which is acknowledged by the authors in the conclusion) is that LDL always fits a product distribution for $y$, ignoring any correlations between the problem parameters. In some cases a product distribution might work well enough, but it seems like a significant simplifying assumption.

Questions:
1. In equation (7), distances are measured according to a distance metric $d$, but in equation (8) you switch to norm notation. Is the norm meant to be a different measure of distance, or are these the same distance?
2. How is the specific form of the objective in equation (10) motivated? Intuitively it makes sense to find distributions that are similar for other data points that are similar either in their context $x$ or where their problem parameters $y$ have exchangeable optimal decisions, but it seems like there are many possible choices.
3. In the special case where the training data contains $M$ identical copies of each $(x, y)$ pair, does the label enrichment process result in $p_i$ being a point mass on $y_i$?

EditLens prediction: Fully human-written
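The summary above states the CSO decision step $\min_{z \in \mathcal{Z}} \mathbb{E}_{y \sim P(y \mid x)}[c(z, y)]$ applied to a learned discrete mixture. A minimal sketch of that step, assuming a newsvendor-style cost, a finite candidate grid of decisions, and made-up mixture values:

```python
# Minimal sketch of the decision step the review describes: given a learned discrete
# mixture for P(y | x), choose z minimizing the expected cost. The newsvendor-style
# cost, the candidate grid, and the mixture values are illustrative assumptions.

def expected_cost(z, support, weights, cost):
    """E_{y ~ mixture}[c(z, y)] for a discrete mixture sum_k w_k * delta_{y_k}."""
    return sum(w * cost(z, y) for y, w in zip(support, weights))

def decide(support, weights, cost, candidates):
    """argmin_z E_{y ~ mixture}[c(z, y)] over a finite set of candidate decisions."""
    return min(candidates, key=lambda z: expected_cost(z, support, weights, cost))

# Over-/under-stocking penalties of 1 and 2 -> the optimal z is the 2/3 quantile of y | x.
cost = lambda z, y: 1.0 * max(z - y, 0.0) + 2.0 * max(y - z, 0.0)
support, weights = [3.0, 7.0, 12.0], [0.5, 0.3, 0.2]      # learned mixture for y | x
candidates = [0.5 * k for k in range(31)]                 # decision grid 0.0 .. 15.0
print(decide(support, weights, cost, candidates))         # -> 7.0
```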
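The duplicate-data scenario in the weaknesses can also be checked numerically. A tiny sketch, reusing the same illustrative cost and transfer-cost definitions as in the sketch after the first review, showing that with $M$ copies of every example the recovered support collapses to a single value:

```python
# Tiny numeric check of the degenerate case described above: when every (x, y) pair
# appears M times, the M most transferable neighbors of a sample are its own copies,
# so the recovered support collapses to a single value. The cost function, decision
# oracle, and transfer cost below are the same illustrative assumptions as before.
import numpy as np

cost = lambda z, y: max(z - y, 0.0) + 2.0 * max(y - z, 0.0)   # toy newsvendor cost
decide = lambda y: y                                          # toy decision oracle z*(y)
transfer = lambda yi, yj: cost(decide(yj), yi) - cost(decide(yi), yi)

M = 3
Y = np.repeat([3.0, 7.0, 12.0], M)             # every label duplicated M times

i = 0                                          # a sample whose label is 3.0
costs = [transfer(Y[i], Y[j]) for j in range(len(Y))]
neighbors = np.argsort(costs)[:M]              # the M most transferable samples (all copies, cost 0)
print(np.unique(Y[neighbors]))                 # -> [3.]: the enriched support is a point mass
```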