ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars) |
|----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated   | 0 (0%)   | N/A        | N/A            | N/A                |
| Heavily AI-edited    | 0 (0%)   | N/A        | N/A            | N/A                |
| Moderately AI-edited | 1 (20%)  | 4.00       | 4.00           | 1493               |
| Lightly AI-edited    | 1 (20%)  | 2.00       | 4.00           | 4881               |
| Fully human-written  | 3 (60%)  | 5.33       | 3.67           | 2478               |
| Total                | 5 (100%) | 4.40       | 3.80           | 2761               |
HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
HiFo-Prompt tackles two common gaps in LLM-based Automatic Heuristic Design (AHD): lack of global search control and poor knowledge persistence. It adds a rule-based Foresight meta-controller that watches population progress/diversity and switches prompts among explore/exploit/balance regimes, and a Hindsight Insight Pool that distills reusable design principles from elites with utility-based credit assignment, then injects top-scoring insights into subsequent prompts. The method obtains the best results among various LLM-AHD baselines.

Strengths:
- The idea of tracking both local and global evolution dynamics via specialized modules is interesting and well executed.
- Useful ablation studies.
- Strong performance with few function evaluations.

Weaknesses:
1. Seed insights are required by the method, and these insights could significantly improve generation quality. In particular, “Design adaptive hybrid meta-heuristics synergistically fusing multiple search paradigms and dynamically tune operator parameters based on search stage or problem features.” is a high-quality handcrafted prompt that can have a substantial effect on the generation. For a fair comparison, the same information should be provided in the prompts of the other baselines, say EoH.
2. The novelty regarding global control and historical-information aggregation is overstated; e.g., ReEvo already implements short- and long-term reflection, which could be seen as a simpler version of hindsight. A discussion would be appreciated.
3. I am not convinced by the population size being set to 4. How can diversity be maintained in such a small population while avoiding inbreeding?
4. I found the methodology section quite confusing, with many complicated implementation details. For example, a decay rate is introduced, but there is no ablation or sensitivity analysis on it. Eq. 3, which describes the evolutionary contribution, is full of hardcoded parameters that are hard to parse, and the rationale for choosing them is not explained.
   - On this point, please clarify whether $g$ is a minimization or maximization objective. In EoH this is maximization, but Fig. 2 and the equations suggest otherwise; Section A.1 then again takes $g$ as an argmax. This is confusing.
5. No code is provided.

Questions:
1. About dissimilarity (Eq. 7): how are the textual descriptions compared, and how do you ensure these are the same? For example, will changing a single word make two descriptions count as different? (See the sketch after this review.)
2. There appears to be a massive degradation in Table 9 when Qwen 2.5-Max is not used. How do you explain this?
3. What would happen if the baseline methods also had the seed insights as part of their generator prompt?

EditLens Prediction: Fully human-written
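To make question 1 concrete, here is a minimal sketch, assuming (hypothetically) that the Eq. 7 dissimilarity reduces to an exact string comparison over heuristic descriptions; the function and example strings are illustrative, not the paper's implementation:

```python
# Hypothetical reading of the Eq. 7 concern: if dissimilarity reduces to an
# exact string comparison over heuristic descriptions, a single-word edit
# already makes two semantically identical heuristics count as distinct.

def dissimilar(desc_a: str, desc_b: str) -> bool:
    """Strict indicator: descriptions differ iff the strings differ."""
    return desc_a.strip() != desc_b.strip()

a = "Greedy nearest-neighbor insertion followed by 2-opt refinement."
b = "Greedy nearest-neighbour insertion followed by 2-opt refinement."

print(dissimilar(a, b))  # True: one spelling variant, yet counted as different
```

Under this reading the measure would overestimate diversity; whether the paper applies any normalization before comparing is exactly what the question asks.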
HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes HiFo-Prompt, a prompting framework for LLM-based automated heuristic design that marries two modules: Foresight (an Evolutionary Navigator that steers exploration vs. exploitation from population signals) and Hindsight (an Insight Pool that distills and reuses design principles from successful code across generations). By decoupling “thoughts” from code, HiFo-Prompt supplies state-aware guidance and persistent knowledge. Experiments on TSP, FSSP, online bin packing, and black-box functions show state-of-the-art quality, faster convergence, and lower token/time cost than prior AHD systems; ablations confirm both modules matter.

Strengths:
- The dual Foresight/Hindsight design elevates the LLM from code generator to meta-optimizer.
- The evaluation shows clear performance gains.

Weaknesses:
- It is unclear how a fair comparison was ensured under “the same query budget.” Does distilling insights consume additional queries? How many times did you run your method and the baselines? Did you use the same number of heuristic evaluations? Standard deviations are not reported, so the performance gains are not fully convincing.
- The approach involves many hyperparameters. It is unclear how they were chosen and how robust the method is to their settings.
- The method relies heavily on pre-engineered prompts.
- Similar ideas appear in EoH and ReEvo, where thoughts and reflections are distilled (both) and accumulated (the latter). Please clarify the novelty relative to these.

Questions:
See weaknesses.

EditLens Prediction: Moderately AI-edited
HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design

Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes HiFo-Prompt with (i) a Hindsight module that distills reusable principles from successful candidates, and (ii) a Foresight module that adaptively switches among explore/exploit/balance based on the population state to guide LLM-based AHD. The proposed method is evaluated on TSP, Online BPP, FSSP, and BO.

Strengths:
1. The proposed method is well motivated and outperforms recent LLM-based AHD baselines across several tasks.
2. The design details are well presented.
3. The limitations and future directions are clearly analyzed.

Weaknesses:
1. For TSP step-by-step construction (i.e., Table 1), Appendix B.1 states that HiFo-Prompt involves LLM calls at inference time; however, it is unclear whether this strategy also applies to the baselines. Please disambiguate: (a) if the baselines also call the LLM at inference, please explain why HiFo-Prompt's runtime is longer; (b) if they do not, please also report HiFo-Prompt under the same inference protocol for a fair comparison.
2. The main text claims the TSPLIB results are in Appendix C.1, but C.1 contains only descriptive text and a placeholder “Table ??”, with no actual results. Please add the promised table/metrics or revise the pointer.

Questions:
1. Line 387 says “100 instances at each of five sizes,” but Table 1 shows three; please fix the mismatch. There are also several misplaced “?” characters around lines 811, 854, 946, and 967 that need cleanup.
2. Can you present some of the actual heuristics generated and used to produce the reported results?
3. In Table 5, removing the Insight Pool makes the method perform worse than EoH, which is surprising since the setup still retains the Foundational Prompts adapted from EoH and the Navigator module. Can you analyze the concrete differences between EoH and HiFo-Prompt w/o Insight Pool & Navigator that explain this gap? Would the Navigator module improve baselines like EoH as a drop-in controller?
4. How frequently does the Navigator select explore or exploit across runs? Have you tried an ablation that fixes the state to “balance” throughout, to isolate the benefit of adaptive switching?

EditLens Prediction: Fully human-written
HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces HiFo-Prompt, a new method for LLM-based AHD. It integrates a Foresight module, featuring an Evolutionary Navigator that monitors population dynamics and steers the search using interpretable “verbal gradients,” and a Hindsight module, which maintains a self-evolving insight pool that distills successful design principles from high-performing code into knowledge. Evaluated on heuristic design tasks such as TSP, BPP, and FSSP, HiFo-Prompt demonstrates competitive performance, achieving superior results with greater computational efficiency than existing LLM-based AHD methods.

Strengths:
- The self-evolving insight pool (Hindsight) and foresight instructions effectively prevent knowledge decay while enabling more strategic exploration of the heuristic space.
- It achieves superior performance with fewer LLM calls and lower runtime compared to other state-of-the-art AHD methods.

Weaknesses:
- The Evolutionary Navigator uses a fixed, rule-based policy with hand-tuned thresholds, which may lack generalization.
- The paper would benefit from additional illustrations and a more extensive set of results to further support its claims.

Questions:
1. Stagnation is measured by the raw fitness improvement $\Delta g$ against a fixed threshold, which may generalize poorly to different heuristic design tasks. (See the sketch after this review.)
2. The semantic variety is calculated from the textual descriptions of algorithms (Eq. 7). Is this the thought or the code text? The indicator seems to treat two algorithms as the same only when their texts are exactly identical; will this be too greedy?
3. For the Foresight module, how was the specific set of Design Directives in the pool (Appendix G.3) designed? Was there an ablation study on the impact of different directive wordings on the LLM's output quality?
4. The framework's knowledge management is confined to a single task; can the learned insights be generalized or transferred to new, unseen problem domains?
5. What does the final algorithm look like, and how do the insights and foresight prompts contribute to the generation of better heuristics? Could you provide example illustrations?
6. A discussion and comparison with related works on prompt evolution and hierarchical search is suggested [1-3].

[1] MeLA: A Metacognitive LLM-Driven Architecture for Automatic Heuristic Design. arXiv.
[2] Large Language Model-driven Large Neighborhood Search for Large-Scale MILP Problems. ICML.
[3] Experience-guided Reflective Co-evolution of Prompts and Heuristics for Automatic Algorithm Design. arXiv.

There are also typos and inadequate descriptions, e.g., stray “?” characters at lines 811, 854, and 788; and in Figure 1, the “Left” and “Right” labels can be misleading.

EditLens Prediction: Fully human-written
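To make question 1 concrete, a small sketch contrasting an absolute stagnation test on raw fitness with a scale-free relative test. Both rules, the window, and the thresholds are hypothetical illustrations; the paper's exact stagnation criterion is not reproduced here:

```python
# Hypothetical stagnation checks illustrating the scale problem: a fixed
# absolute threshold on raw fitness gain that is sensible for TSP tour
# lengths (thousands) falsely fires on objectives that live near [0, 1].

def stagnant_absolute(history, eps, window=3):
    """Stagnant if the raw best-fitness gain over the last `window` gens < eps."""
    return abs(history[-1] - history[-window]) < eps

def stagnant_relative(history, eps, window=3):
    """Scale-free variant: gain measured relative to the current best value."""
    gain = abs(history[-1] - history[-window])
    return gain < eps * max(abs(history[-1]), 1e-12)

tsp = [11250.0, 11210.0, 11180.0]   # still improving by ~35 per generation
bpp = [0.92, 0.95, 0.97]            # also improving, but on a [0, 1] scale

print(stagnant_absolute(tsp, eps=0.5), stagnant_absolute(bpp, eps=0.5))   # False, True
print(stagnant_relative(tsp, eps=1e-3), stagnant_relative(bpp, eps=1e-3)) # False, False
```

The absolute rule misclassifies the improving bin-packing run as stagnant, which is the generalization worry the question raises.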
HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design

Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper presents HiFo-Prompt, a framework for LLM-based automated heuristic design that combines Hindsight, which builds an evolving Insight Pool of distilled design principles, and Foresight, an Evolutionary Navigator that adaptively balances exploration and exploitation. The method is applied to several optimization tasks (TSP, Online BPP, FSSP, and BO), and the authors report improvements over prior AHD methods in both solution quality and sample efficiency.

Strengths:
1. The motivation of this paper is clear and reasonable. The design ideas of global guidance and the insight pool are interesting and inspiring.
2. The similarity-based diversity discussion for the insight pool is conceptually stimulating.
3. The paper is clearly written and well organized, making it easy to follow.

Weaknesses:
1. I have concerns about the novelty threshold. The Insight Pool's novelty filtering relies on Jaccard similarity over token sets. While this removes near-duplicate sentences, such a purely text-based comparison cannot capture semantic overlap: one insight might be expressed in several different ways. Since this novelty threshold is crucial for ensuring diversity, I worry this design may harm the actual effectiveness of the diversity mechanism. (See the sketch after this review.)
2. The combination of a usage penalty and a recency bonus in $U(k, t)$ aims to balance exploration and exploitation, but the dynamics between these opposing terms are not analyzed. This could be sensitive in practice, and it would be helpful to justify or empirically demonstrate that this interaction leads to stable selection rather than oscillatory behavior. In particular, $w_u$ is a hyperparameter without ablation or sensitivity analysis, and the calculation of $B_r$ is not clearly presented. This reduces the soundness and reproducibility of the method.
3. The mapping from normalized performance $\tilde{\rho}$ to the effective credit $g_{\text{eff}}$ uses manually chosen piecewise constants (0.8, 0.6, 0.5, -0.3, etc.) with no theoretical justification or ablation. While the idea of tiered reward regimes is understandable, the specific scaling choices seem ad hoc and may not generalize across tasks. It would strengthen the work to at least provide guidelines on how to select these values.
4. The definition of phenotypic diversity as the fraction of non-identical algorithm text strings feels coarse and potentially misleading. The measure is lexical: two code snippets are treated as completely different even if they differ only by refactoring or variable renaming, ignoring actual semantic or functional similarity (similar to my comment in 2 above). As a result, the system may overestimate diversity and trigger unnecessary exploration. Moreover, this approach scales as $O(|P|^2)$ comparisons per generation, which may become inefficient for larger populations and increase token consumption for LLM-based evaluations. The diversity threshold is also arbitrary and is neither justified nor ablated. Overall, the lack of semantic grounding and the unclear efficiency raise concerns about the robustness and practicality of the Navigator's diversity control.
5. The experimental section raises several concerns about fairness, reproducibility, and efficiency. Although the paper states that all LLM-based baselines were evaluated under the same Qwen 2.5-Max model, implementation details and prompt adaptations are not provided, so fairness remains unclear (e.g., the baselines might use different LLMs, in which case they cannot be compared directly). HiFo-Prompt's runtime on small TSP instances (Table 1) is about an order of magnitude slower than competing methods, with no explanation, contradicting the claim of improved convergence speed. Token-usage statistics are summarized only coarsely (Appendix C.7), without a breakdown or cost analysis, leaving uncertainty about the true computational overhead. The brief multi-LLM comparison (Table 9) covers only two tasks and lacks analysis, providing little evidence of model generality. Finally, runtime behavior is inconsistent across tables (slower on TSP 10-50 but faster on TSP 100-500) with no explanation. Together, these issues make it difficult to assess the practical efficiency and generalizability of the proposed framework.
6. The code does not seem to be provided. Even though the authors share the core prompts, several computational details remain unclear, as mentioned in earlier points. This makes it hard to guarantee reproducibility and to verify the soundness of the proposed method.

Minors:
1. Very minor: for LaTeX quotation marks, please use the proper “…” format instead of plain double quotes. For example, in L055 the quotation marks are incorrectly formatted.
2. There are a few missing or incomplete citations in the appendix, such as at L811 and L1017. These should be corrected for completeness and consistency.

Questions:
See the weaknesses.

EditLens Prediction: Lightly AI-edited
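A minimal sketch of weakness 1, assuming whitespace tokenization and an illustrative novelty threshold (both hypothetical; the paper's exact filter is not reproduced here). Two paraphrased insights with the same meaning share almost no tokens, so a Jaccard filter would admit both as "novel":

```python
# Hypothetical Jaccard novelty filter over token sets, illustrating the
# weakness: paraphrases with identical meaning share few surface tokens,
# so both pass as "novel" even though they duplicate the same principle.

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between whitespace token sets of two insights."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

insight_1 = "prefer partially filled bins over opening a new bin"
insight_2 = "avoid starting fresh containers when existing ones have room"

sim = jaccard(insight_1, insight_2)
print(f"{sim:.2f}")  # 0.00: no shared tokens despite near-identical semantics
print(sim < 0.3)     # True: passes a (hypothetical) novelty threshold of 0.3
```

An embedding-based or LLM-judged similarity would catch this pair, which is why the purely lexical filter is flagged as a soundness concern.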