ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (25%) | 2.00 | 4.00 | 2488 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (25%) | 2.00 | 5.00 | 2566 |
| Fully human-written | 2 (50%) | 4.00 | 4.00 | 4365 |
| Total | 4 (100%) | 3.00 | 4.25 | 3446 |
Individual Reviews (fields: Title, Ratings, Review Text, EditLens Prediction)
Title: Structure Guided Equation Discovery with Influence-Based Feedback for Large Language Models

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes a two-agent LLM framework for symbolic equation discovery. The "Propose" agent generates candidate basis terms, a linear model fits coefficients, and per-term influence scores measure the contribution of each term by zeroing its weight and computing the change in validation error (see the sketch after this review). Experiments cover pharmacological, biological, and synthetic datasets.

Strengths:
+ The per-term influence mechanism provides a clear and interpretable feedback signal for guiding LLM pruning.
+ The two-agent architecture (Propose / Prune) is modular and amenable to ablations.
+ The biological case study shows potential for domain-specific insight generation.
+ The method demonstrates stable performance across several datasets and LLM backbones.

Weaknesses:
- The biological case study is mostly qualitative; there are no out-of-batch or cross-condition validation results to substantiate mechanistic claims.
- It is unclear which LLM produced the main "SGED (Ours)" results, and the number of random seeds differs across methods (25 vs 10), which invalidates the reported confidence intervals.
- PySR and AI-Feynman appear only in the appendix despite being the most relevant SR competitors.
- Ablations show minimal gains from MCTS relative to iterative pruning, suggesting the default configuration is suboptimal.
- Evaluation relies solely on MSE, ignoring structure correctness, sparsity, or parsimony.
- The method's complexity still scales with the number of proposed terms and LLM context size, yet no empirical analysis is provided.
- The same validation split is used for both pruning and reward evaluation, creating a high risk of overfitting.
- Datasets are largely older and omit standard SR benchmarks (SRBench, Nguyen, LLM-SRBench), limiting comparability.
- There are no ablations on noise robustness, out-of-distribution generalization, or prompt and LLM sensitivity.
- Critical hyperparameters, token budgets, and search details are scattered across appendices, hindering reproducibility.

Questions:
- Why were standard SR benchmarks excluded?
- Why was the "no-refit" influence variant chosen as default when "refit" or deeper MCTS rollouts yield better MSE in the appendix?
- What are the compute and wall-clock costs for MCTS vs the simple iterative loop?
- Which specific LLM(s) generated the "SGED (Ours)" results in Table 3? Are all baselines evaluated under the same model and seed count?
- Can the authors add structure-recovery metrics (exact match rate, symbolic distance) on synthetic ground truths?

EditLens Prediction: Fully AI-generated
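For concreteness, here is a minimal sketch of the per-term influence computation described in the summary above, in its simplest "no-refit" form: fit a linear model over the proposed basis terms, zero one coefficient at a time, and record the change in validation MSE. The function and variable names and the use of scikit-learn are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): per-term influence via coefficient zeroing.
import numpy as np
from sklearn.linear_model import LinearRegression

def influence_scores(Phi_train, y_train, Phi_val, y_val):
    """Phi_* are design matrices whose columns are LLM-proposed basis terms."""
    model = LinearRegression().fit(Phi_train, y_train)
    base_mse = np.mean((Phi_val @ model.coef_ + model.intercept_ - y_val) ** 2)
    deltas = []
    for j in range(Phi_train.shape[1]):
        w = model.coef_.copy()
        w[j] = 0.0                       # "no-refit": zero term j, keep the other coefficients fixed
        mse_j = np.mean((Phi_val @ w + model.intercept_ - y_val) ** 2)
        deltas.append(mse_j - base_mse)  # Delta_j > 0 means removing term j hurts validation fit
    return base_mse, np.array(deltas)
```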
Title: Structure Guided Equation Discovery with Influence-Based Feedback for Large Language Models

Soundness: 2: fair
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes SGED, a framework where LLMs iteratively propose basis functions for linear models, then prune them using per-term influence scores that measure each term's contribution to validation MSE. The method can operate through simple iterative refinement or be enhanced with MCTS for broader exploration. Experiments on biological and synthetic datasets show competitive performance, with some interesting detailed case studies (e.g., on RNA Polymerase II pausing). The core claim is that providing LLMs with granular, per-term feedback enables more effective equation discovery than coarse scalar metrics alone.

Strengths:
- The paper is very well-written and clearly structured, making the proposed method easy to follow. The inclusion of extensive experiments and ablations in the appendix (influence feedback impact, MCTS contribution, robustness across LLMs, scalability analysis, etc.) demonstrates thorough empirical work that goes beyond what is typical, and I really appreciate this.
- The benchmark problems are practical and interesting, particularly the RNA Polymerase II pausing case study with its large number of features (263). High-dimensional feature spaces where domain knowledge can guide and narrow down feature selection and engineering represent realistic scenarios where LLMs could provide real value. The biological validation of discovered patterns and correlations adds value.
- Several methodological choices are well-motivated: using MCTS to avoid local optima in the search space, the dual-agent architecture separating proposal from pruning, and the idea of providing granular per-component feedback rather than scalar metrics. The influence score mechanism provides an interpretable signal about term contributions that is more informative than overall MSE alone.

Weaknesses:
While I really appreciate the extensive experiments and ablations in the paper, I have a major concern with the paper's positioning and evaluation:
- The framing of the paper as "symbolic regression and equation discovery" and the comparison with SR methods is not very accurate. The method uses a fixed choice of model (a linear model) with discovered/transformed features. I see the work as much closer to efforts in the field of automated feature discovery/engineering than to symbolic regression, which aims at discovering general nonlinear equation structures.
- Furthermore, given that the problems studied in this work (e.g., biological processes) seemingly have unknown, complex (and most certainly nonlinear) behaviors, one could question the restriction to linear models. A good, well-studied alternative could be tree-based methods, which also provide some level of explainability (similar to linear models, as we are eventually sacrificing full explainability by selecting linear models).
- Apart from the choice of model (linear vs. tree-based), the main focus of this paper is automated feature discovery/engineering using an LLM-based framework. However, the paper omits evaluation, comparison, and discussion with the large body of research in automated feature engineering, from statistical and non-LLM-based approaches (e.g., AutoFeat [1], OpenFE [2]) to recent LLM-based approaches (e.g., CAAFE [3], FeatLLM [4], OCTree [5]).

Questions:
- First, I try to elaborate on my major concern for the authors: the paper positions itself as symbolic regression, but in my opinion it is a kind of automated feature discovery for linear models. One might argue that methods like SINDy also use linear models with nonlinear basis functions; however, those models are designed for specific problems (typically ODEs) that commonly have linear forms, and admittedly have limitations for general nonlinear forms. In this work, however, the studied problems and benchmarks have very complex (and almost certainly nonlinear) behaviors, and in some cases we are dealing with partially observable systems, where we do not expect to recover or discover ground-truth functions from the data for fully explainable models.
- Given this, I think the paper should discuss, evaluate, and compare with more relevant methods from the large body of research on automated feature engineering/discovery and with tree-based methods. To clarify, there are two main components here: (1) the approach for feature discovery/engineering, and (2) the choice of model that uses these features for prediction.
- In terms of (1), various methods approach this, including statistical and non-LLM-based approaches (like AutoFeat [1] and OpenFE [2], with linear models) and more recent LLM-based approaches that are very similar to this work (e.g., CAAFE [3], FeatLLM [4], OCTree [5], ...) that use LLM domain knowledge for feature discovery, while mainly relying on tree-based models due to their capabilities.
- In terms of (2), this work could provide tree-based methods like XGBoost, LightGBM, etc. as baselines, which are currently missing (a simple baseline could use raw features, and a better baseline could build on features discovered by the methods in the previous part; see the sketch after this review). It would also be interesting to see how the feature importance and explainability provided by those models (e.g., using SHAP) compare to the current linear models.
- In Figure 2, why is there already a substantial MSE gap at iteration 0, before the feedback mechanisms should differentiate the methods?
- For MCTS, where children are generated through stochastic sampling, what temperature or sampling parameters were used? How diverse are the candidates?
- Have you tested simple threshold-based or top-K pruning directly on influence scores as a baseline, compared to using an LLM and instructing it to prune? This might reduce the need for the second agent call and reduce computation.

References:
[1] AutoFeat: The autofeat Python library for automated feature engineering and selection, Horn et al. (2020)
[2] OpenFE: Automated Feature Generation with Expert-level Performance, Zhang et al. (2023)
[3] CAAFE: Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering, Hollmann et al. (2023)
[4] FeatLLM: Large Language Models Can Automatically Engineer Features for Few-Shot Tabular Learning, Han et al. (2024)
[5] OCTree: Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning, Nam et al. (2024)

EditLens Prediction: Fully human-written
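To make the suggested tree-based baseline concrete, here is a rough, self-contained sketch of the kind of comparison meant in point (2): an XGBoost regressor on raw (or AutoFE-discovered) features, with SHAP used for feature importance. The synthetic data and hyperparameters are placeholders, not drawn from the paper.

```python
# Rough sketch of the suggested tree-based baseline with SHAP importances (illustrative only).
import numpy as np
import shap
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                        # stand-in for raw or AutoFE-discovered features
y = X[:, 0] * X[:, 1] + np.sin(X[:, 2]) + 0.1 * rng.normal(size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05).fit(X_tr, y_tr)
test_mse = np.mean((model.predict(X_te) - y_te) ** 2)

# SHAP attributions play a role analogous to the linear model's per-term influences.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_te)
global_importance = np.abs(shap_values).mean(axis=0)  # one importance score per feature
print(test_mse, global_importance)
```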
Title: Structure Guided Equation Discovery with Influence-Based Feedback for Large Language Models

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper proposes a symbolic regression framework called Structure-Guided Equation Discovery (SGED). In this approach, an LLM is first used to generate candidate basis functions, whose coefficients are then calculated using a linear model. The influence score (contribution) of each candidate term is evaluated by removing it and measuring the resulting change in MSE. Based on these influence scores, the LLM is used to decide which terms to retain or remove. The entire process is iterative and can be integrated with a Monte Carlo Tree Search (MCTS) strategy. The proposed method achieves state-of-the-art performance on several benchmark datasets.

Strengths:
1. The motivation of the paper is commendable: previous methods that relied solely on mean squared error (MSE) for evaluation were overly coarse and lacked specific guidance mechanisms.
2. The inclusion of evaluation results on real-world case studies enhances the persuasiveness and practical relevance of the work.

Weaknesses:
1. The influence score (contribution) of each candidate basis function is computed by removing one term while keeping the others fixed. However, equation terms are often highly coupled, and this assumption fails to consider their interdependencies. The so-called "fine-grained" evaluation may therefore be misleading.
2. The pruning decision (which candidate term to remove) is based on an explicit influence score (Delta_j). Using an LLM to decide which terms to keep, prompted only by Delta_j, seems unnecessarily redundant and even risky given the possibility of hallucinations; this step could be accomplished simply by setting an appropriate threshold on Delta_j (see the sketch after this review).
3. Since LLMs are pre-trained on massive corpora that may include common equations or datasets, data leakage is a serious concern. The paper does not provide experiments on datasets unseen by the LLM, which would improve credibility. (Please refer to "LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models".)
4. The paper reports only predictive accuracy. In symbolic regression, equation complexity is equally important. Without it, it is difficult to judge the trade-off between interpretability and performance.
5. Lack of fine-grained ablation analysis, such as robustness to the prompting strategy or to noisy data.
6. The paper mentions better efficiency under fixed token budgets but omits quantitative results on the number of LLM calls, latency, or computational overhead, which are critical for assessing real-world feasibility.

Questions:
Please refer to the weaknesses.

EditLens Prediction: Lightly AI-edited
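As a concrete illustration of the non-LLM pruning baseline proposed in weakness 2, here is a minimal sketch of threshold- or top-K-based pruning applied directly to the influence scores Delta_j. The helper function and its parameters are hypothetical, not part of the paper.

```python
# Minimal sketch of pruning directly on influence scores Delta_j (hypothetical helper).
import numpy as np

def prune_by_influence(term_names, deltas, k=None, threshold=0.0):
    """Keep the k terms with the largest Delta_j, or every term whose Delta_j exceeds a threshold."""
    deltas = np.asarray(deltas)
    if k is not None:
        keep = np.argsort(deltas)[::-1][:k]        # top-K most influential terms
    else:
        keep = np.flatnonzero(deltas > threshold)  # terms whose removal would worsen validation MSE
    return [term_names[i] for i in keep]

# Example: keep the two most influential of three candidate terms.
print(prune_by_influence(["x1*x2", "sin(x3)", "x4**2"], [0.31, 0.05, -0.002], k=2))
```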
Title: Structure Guided Equation Discovery with Influence-Based Feedback for Large Language Models

Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces Structure Guided Equation Discovery (SGED), a framework for LLM-driven symbolic equation discovery that provides granular influence scores as feedback to guide iterative refinement. Unlike prior LLM-based systems (e.g., D3, LLM-SR), SGED quantifies each basis function's contribution to predictive performance and integrates this information into a dual-agent "propose-and-prune" process, enhanced by Monte Carlo Tree Search (MCTS). The approach is evaluated on several biological and pharmacological datasets, and ablation experiments suggest that the influence feedback and MCTS both positively impact model convergence and accuracy.

Strengths:
- The proposal of influence-based, structure-guided feedback as a more granular feedback signal is well-motivated.
- Applying LLM-driven equation discovery to real-world biological and pharmacokinetic data is also interesting and highlights practical benefits beyond benchmarks.
- The paper is generally well written and easy to follow.

Weaknesses:
- How does the proposed method perform on LLM-SRBench [1], the recent benchmark designed for LLM-based equation discovery beyond memorization? Evaluating on only six datasets from one domain is not sufficient to assess the generality of the proposed method (as claimed in the title). I believe a more comprehensive comparison on a larger, standardized benchmark like LLM-SRBench (which spans 100+ tasks across multiple scientific domains) is needed to provide a fair assessment of the framework's contribution.
- In Figure 2, why do the two models start from different initial points at iteration 0? The influence-based feedback should affect the efficiency and convergence of discovery, not the initial performance, which usually depends on the initial candidate pool, before any feedback is applied.
- I suggest the authors include some qualitative examples of this structure-guided feedback and of how it is incorporated into the iterative refinement (a hypothetical illustration follows this review). I also think a simplified example, perhaps integrated into Figure 1, would help in understanding it.
- It would also be helpful to provide some qualitative examples showing how influence feedback symbolically alters the model's reasoning or the discovered equations.

[1] LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models, ICML 2025

Questions:
Included in the weaknesses section.

EditLens Prediction: Fully human-written
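To illustrate what such a qualitative example of the feedback might look like, here is a purely hypothetical sketch of how per-term influence scores could be serialized into the pruning agent's prompt; the format is an assumption, not taken from the paper.

```python
# Hypothetical illustration of per-term influence feedback serialized for the pruning agent.
def format_influence_feedback(term_names, deltas, base_mse):
    lines = [
        f"Current validation MSE: {base_mse:.4g}",
        "Per-term influence (increase in validation MSE if the term is removed):",
    ]
    for name, delta in sorted(zip(term_names, deltas), key=lambda t: -t[1]):
        lines.append(f"  {name}: {delta:+.4g}")
    lines.append("Decide which terms to prune and briefly justify each removal.")
    return "\n".join(lines)

print(format_influence_feedback(["x1*x2", "sin(x3)", "x4**2"], [0.31, 0.05, -0.002], base_mse=0.42))
```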