|
None to Optima in Few Shots: Bayesian Optimization with MDP Priors |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces the ProfBO method, a prior-fitted network (PFN) surrogate for transfer learning / metalearning in BO. A key aspect of the method is that, for a particular transfer learning problem instance, the source tasks are used to construct an MDP prior over optimization trajectories, and the PFN is fine-tuned on trajectories drawn from this prior. DQN in particular is used for modeling optimization trajectories, and a MAML-style approach is used for incorporating them into fine-tuning.
The focus of the paper is on settings where a small (20ish) number of evaluations can be made on the target task, and so metalearning from prior tasks is important. The paper shows good empirical performance in this setting on some new test problems and standard benchmarks.
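For readers unfamiliar with this kind of MAML-style fine-tuning, a minimal first-order sketch of what such an inner/outer loop could look like is given below (my illustration with hypothetical names such as `sample_support` and `loss_fn`; this is not the authors' code):

```python
import copy
import torch

def maml_finetune(pfn, tasks, loss_fn, inner_lr=1e-3, outer_lr=1e-4,
                  inner_steps=1, meta_epochs=100):
    # `tasks` is a list of source-task objects whose sample_support()/sample_query()
    # return batches of (context, target) pairs cut from optimization trajectories;
    # `loss_fn(model, batch)` returns a scalar loss such as the NLL of the model's
    # predictive distribution. All of these names are hypothetical.
    meta_opt = torch.optim.Adam(pfn.parameters(), lr=outer_lr)
    for _ in range(meta_epochs):
        meta_opt.zero_grad()
        for task in tasks:
            # Inner loop: adapt a copy of the model to one source task.
            adapted = copy.deepcopy(pfn)
            inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
            for _ in range(inner_steps):
                inner_opt.zero_grad()
                loss_fn(adapted, task.sample_support()).backward()
                inner_opt.step()
            # Outer loop (first-order approximation): evaluate the adapted copy on
            # held-out trajectory segments and accumulate its gradients into the
            # original parameters.
            query_loss = loss_fn(adapted, task.sample_query())
            grads = torch.autograd.grad(query_loss, list(adapted.parameters()))
            for p, g in zip(pfn.parameters(), grads):
                p.grad = g.detach() if p.grad is None else p.grad + g.detach()
        meta_opt.step()
    return pfn
```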
* The main contribution of the paper is the MDP prior and the associated incorporation of optimization trajectories into a PFN framework. This is novel and quite interesting. The general strategy also seems like it could be useful outside the transfer learning / metalearning setting that is the focus of the paper.
* The paper also introduces new problems that can be used for evaluating transfer learning methods that are based on real-world problems and seem that they will be useful for future work.
* The ablation study does a good job of showing the importance of various aspects of the optimization trajectory learning and providing insight into how that should be handled.
* Feasibility of learning the optimization trajectory policy on source tasks: The method includes learning a DQN policy network on the source tasks during fine-tuning. As I understand it, this requires being able to make new evaluations of the source tasks, and in particular, not just using whatever optimization trajectory you happen to have from some earlier optimization on this task. Is that correct? The typical assumption in metalearning for BO is that the data you have from each source task come from some previous run of BO or the like on that source task, so the number of points per source task would be in the neighborhood of 20-40. How many data points per source task were used here? Appendix C.2 says "In Cancer-B, we utilize three meta-training datasets (6T2W, NSUN2, RTCB), comprising 437,634 evaluations in total, and two meta-test datasets (WHSC, WRN), totaling 291,756 evaluations." Does that mean that all O(10^5) points were used as the dataset for learning the DQN on the source tasks? If so, it would present a serious problem for the motivation of the paper. I cannot think of many realistic scenarios where one would be able to run orders of magnitude more evaluations on the source tasks than on the target task. In particular, it does not seem to be the case in the scenarios used to motivate the paper: "In practice, these related source tasks can be the docking scores of a set of molecules evaluated on different receptors." If for one receptor (target task) it takes hours/days to do an evaluation, it is implausible that it would take only seconds/minutes to do the evaluation for some different receptor.
* The paper is broadly framed as being about few-shot learning, but it is really about transfer learning. These are not the same thing; in few-shot learning one generally may not have access to the similar source tasks required by the method. Even the title of the paper may be confusing: is it really "None" to optima when source tasks are required?
* The potential for negative transfer is an important issue in any transfer learning / metalearning method, and is not investigated in the paper at all. I'd like to see what happens to performance as unrelated tasks are added as source tasks.
* Understanding when trajectory information is helpful: The MDP prior and model for optimization trajectories is the core contribution of the paper. Within the paper, the claim that this is important/valuable seems well supported by the ablation studies, which show that eliminating either the positional encoding or the MAML training algorithm significantly deteriorates performance. But prior work has come to a different conclusion. The paper for the NAP method (Maraval et al. 2023) has a whole section (Property 3.2) claiming that history-order invariance is important for meta-RL generally, and that positional encoding should not be used. NAP performs nearly as well as ProfBO. So while removing positional encoding degrades ProfBO, other methods (NAP) intentionally exclude positional encoding and yet perform nearly as well as ProfBO with positional encoding. What is the source of this seeming discrepancy?
* The language around optimization trajectories and sampling strategies seems confused. Section 4 states "Under the GP assumption in standard BO, at iteration t, all queries in Dt−1 are treated as i.i.d. uniformly distributed, ignoring the fact that Dt−1 is actually a BO trajectory." This makes it sound like the GP is making distributional assumptions on the X input locations, which is not the case. I think the text is trying to get at the notion of exchangeability, which is a related albeit different concept: a GP posterior is invariant to the order in which the observations in Dt−1 arrive, but it does not assume anything about how the query locations themselves were generated.
* Number of samples: The paper is framed very strongly around 20 iterations being the upper bound for the number of samples. E.g., page 1 "fewer than 20 evaluations"; page 3 "within a few shots, e.g. T<=20"; page 4 "T can be fewer than 20", etc. Then suddenly in Section 5.2 this is changed to "within 20 or 40 iterations." This is of course because the method did not perform particularly well at 20 iterations on the Cancer problem and required 40 iterations to beat TNP. This raises the obvious question of what happens on the Covid problem if the number of iterations is taken out to 40, and furthermore it seems like there should be some acknowledgement that in some settings more iterations are required.
* It would be helpful to understand when/why more iterations are required. The HPO-B problems were run out to 90 iterations because that's what was done in the work this builds most directly on. We see that there are significant improvements from iteration 20 to iteration 90. So while it may be correct to say that ProfBO is the best of the methods at iteration 20, it is not correct to say that ProfBO has solved the problem well in fewer than 20 evaluations; the regret is still very high if one were using ProfBO and had to stop after 20 iterations. Is well-performing few-shot learning just not possible here? When is it possible?
* Please clarify how many points are being used as the dataset for each source task. If this is more than the 20-40 budget used in the target task, please justify. |
Fully human-written |
|
None to Optima in Few Shots: Bayesian Optimization with MDP Priors |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper targets the important problem of black-box optimization where only a few evaluations are possible, but optimization trajectories of related tasks are available to speed up the optimization. The paper uses a prior-fitted network (PFN) as the surrogate model in Bayesian optimization. The main contribution is a new method that incorporates optimization trajectories from existing tasks as an MDP prior into the PFN while employing MAML for adapting the trained PFN to the target task.
* Tackles an important established problem setting highly relevant to the ICLR community
* Usually PFNs are trained with synthetic data; incorporating actual evaluations is an interesting research direction and a good fit for the tackled problem setting
* Combining MAML and PFN is an interesting methodological contribution
* Follows best practices for reproducibility
* The experimental setup, including baselines, protocol, and datasets used, is described in sufficient detail.
* Sufficient discussion of related work that uses optimization trajectories of related tasks
* Clearly structured and understandable writing
* The paper does not prominently discuss the limitations of its method and experimental design
* The ablation study is not convincing. The results of the ablation study as discussed in the text may not hold in the target range of 20 evaluations: the results are very noisy there, and the un-ablated algorithm is not the preferred choice. The ablation study is also not performed on the Covid-B and Cancer-B benchmarks. Why?
* The approach does not take categorical parameters into account (and how about integer parameters?), which limits its applicability.
* It is unclear how the chosen related tasks affect the approach and evaluation. How many related tasks are necessary? Do some approaches work better for different numbers of related tasks? An explicit evaluation of this would strengthen the paper. What about the case of misleading related tasks? How related are your tasks?
* The authors use a different number of iterations for each benchmark (20, 40, 90). To keep the evaluation consistent, showing up to, e.g., 20 iterations everywhere (which is their target setting) would be better. For example, showing 40 evaluations for the cancer dataset, where your method performs worse than TNP at 20 iterations, seems hand-picked; what was the reason here? See also my comment on the ablation study.
* How were the final hyperparameter settings of your method chosen? The appendix lists a search space of hyperparameters for your method. Did you tune them? How? Did you also tune the baselines?
* The regret plots for Covid / Cancer (Figure 3) only have one y-axis tick label (10^-1), so it is not possible to see the differences between the approaches at all.
* It is unclear how beneficial meta-learning across related tasks is compared to other speed-up techniques for BO (e.g., multi-fidelity, cost-aware, or use of expert priors). An empirical evaluation would be best, but a discussion in related work is also missing. This is lacking in prior works, too.
Minor
* Line 180 "It [PFNs] outperforms traditional GP (see Figure 1), " While making an argument for use of PFNs over GPs, you are comparing a meta learned PFN to a standard GP, which is not an apples-to-apples comparison.
* A small discussion of NN-based BO in general would be helpful. In particular, a discussion of PFNs, their inherent meta-learning capabilities, and their existing applications to BO would be beneficial.
* The x-axis of Figure 4 is illegible
* The paper shows normalized regret. How meaningful are the improvements in absolute terms for the respective benchmark settings?
* How many seeds were used?
* You follow Maraval et al. in selecting 6 of the 16 HPO-B problems; how were they selected? How would the results look on all the problems? |
Fully human-written |
|
None to Optima in Few Shots: Bayesian Optimization with MDP Priors |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents ProfBO (Procedure-informed Bayesian Optimization), a framework that accelerates black-box optimization by leveraging priors learned from previous optimization trajectories on related tasks. ProfBO models a prior distribution of past optimization trajectories using a Markov Decision Process (MDP), enabling it to transfer knowledge gained from past optimization runs on related tasks to new tasks. The authors evaluate ProfBO on synthetic and real-world benchmarks, demonstrating that ProfBO consistently outperforms baselines from the literature, achieving faster convergence and lower regret within a very small number of function evaluations. This work shows that leveraging knowledge from past optimization of related tasks can substantially improve sample efficiency in Bayesian optimization.
Originality:
ProfBO introduces an original and conceptually elegant idea: learning procedural priors over past optimization trajectories to improve few-shot Bayesian optimization. The method treats prior optimization runs as Markov Decision Processes, enabling it to capture the dynamics of optimization itself. This “procedure-informed” view is a novel way to transfer optimization knowledge and differs from existing approaches.
Quality:
The paper is technically strong and empirically thorough. The methodology is well-motivated. Experiments are extensive, covering a large number of benchmark tasks and comparisons against baselines from the literature. ProfBO consistently achieves faster convergence and lower regret across tasks within small evaluation budgets.
Clarity:
The paper is clearly written and well-structured. The main components of the ProfBO method are clearly motivated and explained. The narrative is easy to follow, and the paper effectively communicates both intuition and implementation details.
Significance:
ProfBO demonstrates that procedural priors can improve sample efficiency of Bayesian optimization. This is significant in real-world domains such as drug design, materials design, etc. where sample efficiency is important.
While ProfBO demonstrates strong relative performance across benchmarks, including real-world-inspired tasks such as the COVID and Cancer benchmarks, the paper does not provide discussion or qualitative analysis of the meaningfulness of the obtained solutions. It remains unclear whether the best-found objective values correspond to scientifically relevant or near-optimal outcomes in these applications, or primarily represent improvements over baselines in a normalized performance sense. Adding brief commentary or case examples illustrating how close these solutions are to practically desirable or known-good results would strengthen the empirical claims and highlight the real-world impact of the method.
1: Could the authors provide discussion or intuition on why ProfBO achieves larger improvements over baselines on certain tasks but smaller gains on other tasks? Maybe leveraging prior optimization trajectories is more useful for some tasks than others? Do the authors have any intuition for why that is or what types of tasks ProfBO might be most useful for?
2: In experiments and method development, did the authors encounter any scenarios when procedural priors from source tasks are poorly aligned with the target? Is it possible that such a scenario could lead to ProfBO having worse performance than other methods rather than better performance?
3: For the biomedical benchmarks, do the best-found objective values correspond to scientifically meaningful or near-optimal solutions? Adding qualitative insight or examples would help contextualize the results. |
Fully AI-generated |
|
None to Optima in Few Shots: Bayesian Optimization with MDP Priors |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes PROFBO, a procedure-informed Bayesian optimization framework aimed at solving black-box optimization problems in very few evaluations (typically $T \le 20$). The central idea is to leverage optimization trajectories from related source tasks, modeled as MDP priors, so that the target-task BO does not have to learn an efficient search policy from scratch. Concretely, the method (i) trains lightweight DQN agents on source tasks to generate optimization trajectories, (ii) embeds these trajectory distributions $p(T^{(i)})$ into a prior-fitted Transformer (PFN) to act as a BO surrogate, and (iii) fine-tunes this PFN with MAML and positional encodings to transfer procedural knowledge while avoiding overfitting to specific source-task temporal patterns. At test time the BO loop is standard (context $\to$ posterior $\to$ acquisition), but the surrogate now reflects trajectory-informed priors. Experiments on two new real-world few-shot drug discovery benchmarks (Covid-B, Cancer-B) and on HPO-B show that PROFBO achieves lower regret and better early ranks than meta-surrogate baselines (META-GP, FSBO, TNP) and recent trajectory/meta-BO methods (MAF, NAP, OptFormer), while being more training-efficient than NAP.
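To make the test-time loop concrete, here is a minimal sketch of how a PFN surrogate would sit inside a standard BO loop (the names `pfn_posterior`, `objective`, and `candidates` are hypothetical placeholders, and the random warm start and EI acquisition are assumptions rather than the paper's exact choices):

```python
import numpy as np

def expected_improvement(bin_probs, bin_centers, best_y):
    # EI under a discretized ("bar") predictive distribution over y:
    # sum over bins of p(bin) * max(0, bin_center - best_y).
    improvement = np.maximum(bin_centers - best_y, 0.0)
    return (bin_probs * improvement).sum(axis=-1)

def pfn_bo_loop(pfn_posterior, objective, candidates, budget=20):
    # One target-task run: context -> posterior -> acquisition -> evaluate.
    rng = np.random.default_rng(0)
    X, y = [], []
    x0 = candidates[rng.integers(len(candidates))]  # assumed random warm start
    X.append(x0); y.append(objective(x0))
    for _ in range(budget - 1):
        # Single forward pass of the (fine-tuned) PFN conditioned on the context.
        bin_probs, bin_centers = pfn_posterior(np.array(X), np.array(y), candidates)
        scores = expected_improvement(bin_probs, bin_centers, max(y))
        x_next = candidates[int(np.argmax(scores))]
        X.append(x_next); y.append(objective(x_next))
    return np.array(X), np.array(y)
```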
* Clear and motivated problem setting: BO in regimes where $T \le 20$ and evaluations are expensive, which is where standard asymptotic BO results are less useful.
* Procedural transfer via MDP priors is novel in this particular PFN + MAML setup and allows the surrogate to internalize good search patterns rather than only response surfaces.
* The PFN backbone gives a principled way to do single-pass Bayesian inference over contexts, leading to faster inference than GP posteriors and enabling the use of non-GP priors.
* Ablation studies isolate the contribution of MDP priors, MAML, and positional encodings and show that each of them improves early-iteration regret.
* Strong empirical results on newly proposed real-world discrete/continuous drug-like tasks (Covid-B, Cancer-B) and on HPO-B, outperforming several state-of-the-art meta/few-shot BO baselines.
* Training-time comparison with NAP shows the proposed modular two-stage training (RL for priors, supervised PFN fine-tuning) is more efficient than end-to-end RL with transformers.
* The approach depends critically on the availability and quality of source-task optimization trajectories; the paper does not quantify how performance degrades when source tasks are few, noisy, or mismatched with the target.
* The MDP prior is learned with per-task DQN on (possibly) large discrete action spaces, and although the authors optimize it (subset of actions, batched GPU generation), this can still be expensive in domains without precollected meta-data.
* The method is benchmarked mostly on structured or tabular/meta datasets with fixed embeddings (26D molecule embeddings, HPO-B); it is less clear how the approach would behave on high-dimensional continuous design spaces where actions cannot be discretized so easily.
* No acquisition-function–level comparison is made under exactly matched hardware/time budgets; some of the reported gains could be due to PFN’s fast forward pass rather than the MDP prior per se.
* There is no formal analysis of negative transfer: in principle a trajectory prior that encodes a suboptimal policy could bias the PFN and hurt few-shot performance.
1. How sensitive is PROFBO to the number and diversity of source tasks? For example, if only a small subset of Covid-B problems is available for training, does the method still outperform TNP/META-GP on the held-out problems?
2. In Section 4.2, the MDP defines the action space as candidate points from the dataset. How would PROFBO be instantiated for continuous domains where we cannot enumerate actions and cannot easily train DQN on a discrete subset?
3. The PFN head outputs a discretized bar distribution over $y$. When computing EI/UCB from this surrogate, are the baselines given access to the same acquisition budget (number of candidate points scored per iteration)?
4. The two-stage training (GP-like pretrain, then MDP-prior fine-tune with MAML) is argued to prevent overfitting to trajectory order. Can the authors show a small example where training only on trajectory priors actually harms OOD target tasks?
5. For the drug-discovery tasks, can the authors confirm that target tasks are not leaked into the MDP-prior training stage (i.e., that there is a strict meta-train/meta-test split at the trajectory level)? |
Fully AI-generated |