A Neuro-symbolic Approach to Epistemic Deep Learning for Hierarchical Image Classification
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes unifying uncertainty estimation with an epistemic approach that ensures logical consistency for hierarchical image classification with pre-trained Swin transformers. The authors propose a two-head architecture (a fine head and a coarse head). The epistemic component follows the strategy of RS-CNN with focal sets induced in the latent space. The logical consistency is integrated into the learning process through a belief-based, logically constrained loss, ensuring that fine-level belief masses are compatible with the coarse level. An experimental validation is provided for two datasets: CIFAR-100 and iNaturalist 2021. The main contribution of the paper is the principle of unifying belief theory for epistemic uncertainty estimation with logical consistency.
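For concreteness, the coarse/fine compatibility constraint, as I understand it from the summary above, could look roughly like the following sketch (the fine-to-coarse mapping `g`, the belief tensors, and the L1-style penalty are my assumptions, not the authors' code):

```python
import torch

def consistency_loss(fine_belief, coarse_belief, g):
    """Penalize fine-level belief mass that is incompatible with the coarse level.

    fine_belief:   (batch, n_fine) belief masses over fine-grained classes
    coarse_belief: (batch, n_coarse) belief masses over coarse-grained classes
    g:             (n_fine,) long tensor mapping each fine class to its coarse parent
    """
    # Aggregate fine-level belief into the partition induced by the mapping g.
    aggregated = torch.zeros_like(coarse_belief)
    aggregated.index_add_(1, g, fine_belief)
    # Compatibility penalty: aggregated fine mass should match the coarse prediction.
    return (aggregated - coarse_belief).abs().mean()
```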
**originality**
+ The originality of the paper relies on the idea of unifying uncertainty estimation using belief theory and focal sets with semantic regularization using a logically constrained loss. The unification principle is straightforward and logical.
**quality**
+ The authors took care to formalize the proposed approach with a set of equations to ensure a clear understanding of the proposed regularized cost function.
**clarity**
+ The motivations of the paper are clear.
**significance**
+ The contribution addresses an important topic in AI, particularly for deep learning models, concerning their robustness and uncertainty estimation. The idea of leveraging belief theory in this context is not new, but it presents an interesting avenue for study. The proposed approach also falls within the category of neuro-symbolic approaches, integrating a priori knowledge (logical constraints) into the learning pipeline (here, in the training process, with a semantic regularization term). It is also an interesting direction to study, with different expected gains (robustness, frugality, explainability, ...).
I have several concerns about the paper.
+ **Concern 1: lack of technical novelty.**
+ The paper is strongly built on the [Random-Set Neural Network (RS-NN) paper](https://arxiv.org/pdf/2307.05772) and on classical semantic regularization from the neuro-symbolic literature. The contribution mainly relies on the integration of these two aspects. Moreover, some limitations of the proposed approach are not sufficiently motivated. For instance, hierarchical classification appears to be limited to a bi-level problem with fine-grained and coarse-grained classes. What about a hierarchy with more levels? Regarding the semantic regularization part of the work, it also lacks precise positioning in relation to the state of the art on semantic regularization. See, for instance, [Xu et al., 2018](https://proceedings.mlr.press/v80/xu18h/xu18h.pdf), [Ahmed et al., 2024](https://arxiv.org/abs/2405.07387), [Ledaguenel et al., 2024](https://filuta.ai/images/compai/CompAI_paper_7.pdf).
+ The paper also lacks positioning with regard to other uncertainty estimation approaches, such as conformal prediction.
+ **Concern 2: lack of clarity.**
Although the effort to formalize the approach is commendable, it seems that the formalization should be carefully reviewed and verified.
+ What is $C$ in Equation 3? It is not defined. While it seems clear that, in this context, it is the set of classes, as in the RS-NN paper, it should still be defined explicitly.
+ Are there any implicit constraints on the mapping function $g$? For example, can the function handle multiple parents? This function and its properties should be described in much greater detail.
+ The clustering process of the latent representations should be more detailed. What is $\mathcal{O}$ in equation 6?
+ **Concern 3: lack of an exhaustive positioning with respect to the neuro-symbolic literature**
+ Section 6.7 is too short. Important references from the NS literature are missing; see, for instance, the works on semantic regularization mentioned above. Moreover, an important missing aspect is an evaluation of the gain brought by the NS part.
Main questions:
+ What about the scalability aspect of the proposed approach? Indeed, the exponential complexity of using $2^{\mathbb{Y}_f}$ and $2^{\mathbb{Y}_c}$ sets of classes is an important issue. How is this aspect managed in the proposed approach?
+ What about more hierarchical constraints, involving more than two levels (coarse and fine)?
+ Why a hidden representation space of size 512?
+ See the previous questions regarding the mapping function $g$ and the clustering process.
+ What is the impact of the chosen pre-trained backbone on this clustering process?
+ What is the impact of the choice of the membership function and of the T-norm?
Fully human-written

---

A Neuro-symbolic Approach to Epistemic Deep Learning for Hierarchical Image Classification
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper proposes a neuro-symbolic framework that combines Swin Transformers with focal set reasoning and differentiable fuzzy logics for hierarchical image classification. The approach aims to improve calibration and logical consistency while maintaining competitive accuracy on CIFAR-100 and iNaturalist datasets.
- This paper combines epistemic uncertainty modeling (via focal sets and Dempster-Shafer theory), fuzzy logic (t-norms), and modern vision transformers.
- The primary strength is a new integration of two distinct fields: epistemic uncertainty and neuro-symbolic reasoning. The paper makes a strong case that most prior work addresses either logical consistency or uncertainty, but rarely both in a unified manner.
- The presentation of the results is difficult to follow. The authors should consider consolidating these findings into a summary table or figure to improve readability.
- The contributions of this paper are focused entirely on the proposed method, with no accompanying theoretical analysis or guarantees.
- Only two datasets tested, both with relatively shallow hierarchies (2 levels). No comparison with recent strong baselines (e.g., hierarchical softmax, conditional probability approaches).
- How sensitive is the model's performance to the pre-computed focal sets? How would performance change with different clustering algorithms or a different number of focal sets?
Lightly AI-edited

---

Efficient LLM Collaboration via Planning
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The main contribution of this paper is a multi-stage LLM inference pipeline that relies on plans: i) in the first stage, a small model generates a plan and tries to solve the problem; ii) if not satisfied, in the second stage, a large model generates a plan and then passes the plan to the small model to solve; iii) finally, if not satisfied, in the third stage, the large model executes the plan it generated in stage 2. Experiments on a couple of small models and a couple of large models and in comparison with several baselines (Cascade, ABC, ...) show improvements in accuracy with a decrease in cost. The cost savings are attributed to "early exits" by a model solving a problem after stage 1 or after stage 2 (avoiding the expense of inference on a large model as in stage 3).
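For concreteness, the staged control flow described in the summary can be sketched as follows (a minimal illustration under my own assumptions about the interfaces; the function names and the `is_confident` check are hypothetical, not the authors' code):

```python
def cope(problem, small, large, is_confident):
    """Three-stage small/large collaboration with early exits.

    small, large: models exposing .plan(problem) and .solve(problem, plan)
    is_confident: task-specific confidence rule (e.g., majority vote, pass rate)
    """
    # Stage 1: the small model plans and solves on its own.
    plan = small.plan(problem)
    answer = small.solve(problem, plan)
    if is_confident(answer):
        return answer, "stage-1"

    # Stage 2: the large model plans, the small model executes.
    plan = large.plan(problem)
    answer = small.solve(problem, plan)
    if is_confident(answer):
        return answer, "stage-2"

    # Stage 3: the large model executes its own plan.
    return large.solve(problem, plan), "stage-3"
```

The cost savings then come from how often execution stops at stage 1 or 2.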
+ I like the conceptual simplicity of the paper: it is presented for a small/large setup, but could easily be extended to n models of increasing strength.
+ The idea of planning at LLM inference time, wherein the model explicitly generates a plan before attacking the problem, is a subtle distinction from traditional CoT (and related approaches).
+ The experimental study shows consistent improvements in accuracy while reducing costs. I also appreciate the two results (Table 6 and 7) that show off the method on open-ended and agent tasks.
- The approach is "system-ized" in that the core idea of using plans to yield better results is hidden through the three stages. I would be interested to see how the results play out for each of the settings: small planner, small executor; large planner, small executor; large planner, large executor.
- The observations (Section 3) make good sense, and I appreciate that these are framed as just "observations", but it would be good to see more empirical support, e.g., considering other planners, executors, and ablations on the "plans aligned with its capacity".
- To this last point, a simpler plan is used for smaller models versus a more systematic plan for larger models: I wonder what the impact of the plan prompt is. These prompts are hand-crafted as best I can tell (from the Appendix). I could imagine many levels of detail for these planner prompts that could impact the quality of the results.
Can you show the results for the three settings in isolation (i.e., without passing from one stage to the next)?
Did you observe any impact of varying the planning prompt design?
Fully human-written

---

Efficient LLM Collaboration via Planning
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper proposes a three-stage collaboration framework for efficient large language model (LLM) usage, where smaller models handle simpler tasks and larger models are invoked based on confidence thresholds. The approach aims to balance accuracy and cost efficiency, and experiments on reasoning tasks demonstrate comparable performance and lower cost compared to existing methods.
- The paper is clearly written and easy to follow.
- The proposed approach is intuitive and straightforward.
- Code is provided for review.
- The proposed method is largely engineering-oriented, with the three-stage design appearing intuitive and lacking algorithmic or learning-based innovation.
- The proposed plan–then–execute paradigm raises concerns regarding its effectiveness. In our experience, prompting a model to first produce a plan and then execute it often yields worse results than directly applying CoT prompting. This limitation stems from the fact that the model must generate a plan based solely on the query, without access to any intermediate reasoning results, which may hinder its ability to form accurate or adaptive strategies.
- While using confidence as a trigger is empirically effective, it does not fully address the issue of overconfidence.
- Tables 3 and 4: the performance gains in both accuracy and cost over existing baselines (such as Cascade and ABC) are relatively limited.
See weaknesses.
Moderately AI-edited

---

Efficient LLM Collaboration via Planning
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper proposes COPE, a test-time collaboration scheme between a small model and a large model. A planner writes a short plan, an executor answers conditioned on that plan, and the system defers to the larger model when a confidence rule is not met. The claim is comparable quality to a strong single model with lower API cost across math, code, open-ended QA, and an agent benchmark.
- paper is well written and structured.
- clear problem framing: save large-model calls by escalating only when needed.
- simple interface: plan as a short guidance that conditions the executor.
- broad evaluation across several task families with cost reporting.
- sensible ablations on sampling, thresholds, and truncating later stages.
- paper is easy to reproduce as most resources are available in appendix.
- Novelty is arguably modest. COPE looks like a heuristic cascade or router with task-specific confidence checks -- one question from me: "how does COPE handle those killer questions that would need multiple rounds of large model involvement?". Mixture-of-Experts or learned routing can address the same goal in a more principled way. The paper does not compare to MoE or to strong learned routers, so the contribution reads as (arguably) incremental.
- Missing core metric. There is no clear “defer rate” analysis. I need to see how many items are solved by small-only, how many are escalated, and how accuracy and cost move conditional on the stage. Without this, the cost-savings story is incomplete.
- Confidence rules are heterogeneous across tasks (vote for math, pass rate for code, perplexity or LLM-as-judge for open-ended). This makes cross-task comparisons hard and risks bias in open-ended settings.
- Planner overhead is under-analyzed. Plans add tokens and extra calls. The paper should separate plan-token cost from answer-token cost and report the win only at matched accuracy.
Minor comments:
- Robustness is unclear. Results may depend on current pricing, specific model pairs, and judge choice. There is little stress-testing for model updates or alternative judges.
- Insights on “who should plan for whom” are unsurprising. The page-3 observations mirror common intuition and existing engineering writeups (for example, Anthropic's blog notes on multi-agent systems). Presenting them as headline findings feels overstated.
1. For each benchmark, what percentage of items are answered at Stage 1, Stage 2, and Stage 3? What is the accuracy conditional on each stage? What is the average cost per item at each stage, split into plan tokens and answer tokens?
2. What exactly is new relative to prior cascades and learned routing? How does COPE compare to a learned router that chooses small versus large using validation signals under the same budgets?
3. For open-ended tasks, do multiple independent judges or a small human study replicate the reported wins?
4. Relative to a no-plan executor at the same stage, how much does adding a plan improve accuracy?
5. Will you frame COPE as a practical cascade router rather than a new reasoning framework?
Fully AI-generated

---

Efficient LLM Collaboration via Planning
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes COPE, a test-time collaboration framework that enables small and large language models (LLMs) to work together more efficiently through planning. Instead of immediately generating an answer, a model first produces a high-level plan (e.g., goal or guideline), which then guides another model in solving the task. COPE proceeds in three escalating stages: 1) small model plans + executes; 2) large model plans + small model executes; 3) large model plans + executes. At each stage, a confidence measure (e.g., majority vote) determines whether to accept the output or escalate to the next stage. This minimizes expensive large-model usage while preserving performance. Experiments across math reasoning, code generation, open-ended tasks, and agent tasks show the advantages of COPE.
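As a rough illustration of the confidence rule mentioned above, a hypothetical majority-vote check over sampled answers might look like this (the sample count and threshold are assumptions, not values from the paper):

```python
from collections import Counter

def majority_vote(sample_answer, n_samples=8, threshold=0.75):
    """Sample several answers; accept only if the modal answer is frequent enough."""
    answers = [sample_answer() for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best, (count / n_samples) >= threshold
```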
1. This paper investigates the collaboration of small and large models to reduce LLM inference costs, which is a timely and important topic. The proposed method, COPE, reduces reliance on expensive LLM calls by letting small models solve easy problems and escalating to large models only when needed; for example, on MATH-500, cost drops ~45–50% versus GPT-4o.
2. COPE introduces an explicit planning stage where models write high-level goals/guidelines. This plan provides structured task abstraction and helps small executors perform better, stabilizing multi-stage reasoning. This also enables cross-model synergy not captured by prior cascades.
3. Unlike baselines that require few-shot data and training of a scoring function, COPE needs no extra training and no supervised planning data; it relies only on inference-time prompting.
1. The major concern with this work is its novelty. There are many previous papers in which a large model decomposes the task or generates a high-level plan and a small model then executes the sub-tasks, e.g., [1-2]. It seems that this multi-level planning framework is only an extension of previous work.
[1] Kwon, M., Kim, Y. and Kim, Y.J., 2025, May. Fast and accurate task planning using neuro-symbolic language models and multi-level goal decomposition. In 2025 IEEE International Conference on Robotics and Automation (ICRA) (pp. 16195-16201). IEEE.
[2] Li, Z., Chang, Y., Yu, G. and Le, X., 2025. Hiplan: Hierarchical planning for llm-based agents with adaptive global-local guidance. arXiv preprint arXiv:2508.19076.
2. The authors claim that small language models are free to access; however, many of them are not. Because of their size (e.g., Llama-3B needs more than 8 GB of VRAM), these models cannot run on small devices, and people still need to pay if the models are installed and maintained in a data center.
3. The writing needs to be improved. There are many confusing statements in this paper. For example, in the introduction, the third paragraph and the second-to-last paragraph contradict each other: the working mechanism of a previous method described in the third paragraph is presented again in the second-to-last paragraph as the proposed method, which is very confusing.
4. The authors evaluate the proposed framework on tasks across many areas. However, its capability in each specific area is not evaluated deeply and comprehensively enough. For example, for agent tasks, only ALFWorld is used, and other popular benchmarks, such as WebShop, are not included. We suggest the authors focus on one area, such as math reasoning, and conduct more comprehensive evaluations there.
1. The authors propose to use majority voting over self-generated answers to calculate the LLM's confidence score. However, it is widely known that LLMs can be overconfident in wrong answers. How do the authors address the situation where the LLM is confident but wrong?
2. The authors propose to use a plan generated by a large model to guide task execution. However, it is widely known that the planning capability of modern LLMs is still limited [1]. Specifically, the plan, even when generated by GPT-4o, may violate domain constraints. If the generated plan contains mistakes, it could mislead downstream task execution. How is this issue resolved in the proposed framework?
[1] Cao, P., Men, T., Liu, W., Zhang, J., Li, X., Lin, X., Sui, D., Cao, Y., Liu, K. and Zhao, J., 2025. Large language models for planning: A comprehensive and systematic survey. arXiv preprint arXiv:2505.19683.
3. The authors report that the success rate of the proposed method on ALFWorld is only around 40%. However, vanilla ReAct alone can achieve nearly 65% with GPT-4o. If the proposed executor is augmented with a generated plan, why is its performance worse than vanilla ReAct?
Fully human-written

---

Efficient LLM Collaboration via Planning
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper proposes a test-time framework to enable collaboration between small and large LMs to solve various tasks. The collaboration is done through a two-stage planning-and-execution process that is first attempted by a small model and can then be escalated to a larger model if the small model is not confident about the final answer. The authors test their approach on mathematical, coding, open-ended, and agentic tasks and show better or on-par performance relative to large closed-source models while reducing the API costs of relying only on large models.
- The paper tests the method on a wide range of tasks and benchmarks, including mathematical, coding, open-ended, and agentic tasks.
- The paper is generally well written.
- Some of the paper's claims are overstated and amount to rebranding of well-known existing approaches. For example, the use of "planning" in the paper is essentially similar to the well-known CoT (chain-of-thought) and cannot be considered a contribution of the paper. The authors need to provide a clear explanation of the novelty of their work compared to prior work.
- Relatedly, a fairer baseline to compare their method against is the vanilla large/small model with CoT, that is, prompting the model to first do some brief reasoning before attempting the problem.
- Overall, the contribution of the paper seems incremental.
1. How is the plan different from the commonly known CoT? Are the authors trying to rebrand CoT and reasoner as plan and planner? While I agree that the current reasoning landscape focuses more on long chains of thought, the earlier studies introducing the approach focused on short CoTs, which to me are similar in nature to what you call a plan, i.e., a brief guideline (as shown in Figure 1).
2. Some qualitative analysis/comparison between your plan and existing CoTs would be helpful to understand the difference (if there's any).
3. Suggestion: tables could benefit from more self-explanatory captions to guide the reader, for example, better baseline naming and an explanation of how cost is computed.
4. For a fair comparison between your method and the baselines, you should consider including CoTs (or your so-called plans) as inputs to the model, but generating them with the answer-generation model itself, as opposed to having a collaboration between small and large models. Plan/CoT generation is not a contribution of this paper and could thus be leveraged by all other baselines to better show the impact of the collaboration between small and large LMs.
Fully human-written

---

Bridging Gene Expression and Text: LLMs Can Complement Single-Cell Foundation Models
Soundness: 1: poor
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper explores methods for bridging the gap between text-based Large Language Models (LLMs) and specialized single-cell foundation models trained on raw single-cell data. Single-cell (gene expression) data is a high-dimensional vector of gene activity counts for a single cell that is usually very sparse (only a few genes are active/expressed in each cell at a time). The authors' core methodology is to convert this vector into a "cell sentence," which is a text string of gene names, ranked by their expression level. The main intuition behind this conversion is to tap into the vast textual knowledge of LLMs, namely from curated datasets about gene markers for a specific biological class. For example, an LLM trained on medical literature would likely learn the association between the word "insulin" and the phrase "Beta cell."
The paper has two parts: First, it presents an interpretability study to investigate what biological insights LLMs learn from these "cell sentences." The authors use standard interpretability techniques (LIME and Integrated Gradients) to uncover what "marker genes" were used for a prediction. They report that the genes the model found "important" were, in fact, biologically-known marker genes, confirming one of the parts of the hypothesis. Second, the paper introduces scMPT, a simple fusion model. This model takes the frozen embeddings from a single-cell model (scGPT) and a text-encoder LLM (Ember-V1), concatenates them, and feeds them into a small, trainable MLP. The authors claim this fusion model achieves strong performance, even suggesting it is "competitive with full fine tunes of scGPT" on tasks like cell-type classification. The claims for this second part are highly questionable, as discussed in the weaknesses section.
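To make the "cell sentence" conversion concrete, here is a minimal sketch of the rank-and-verbalize step (the top-k cutoff and the exact formatting are assumptions on my part, not the authors' implementation):

```python
import numpy as np

def cell_to_sentence(expression, gene_names, top_k=100):
    """Convert a sparse expression vector into a rank-ordered "cell sentence".

    expression: 1-D array of expression counts for one cell
    gene_names: gene symbols aligned with `expression`
    """
    order = np.argsort(expression)[::-1]                      # highest expression first
    order = [i for i in order if expression[i] > 0][:top_k]   # keep expressed genes only
    return " ".join(gene_names[i] for i in order)
```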
The paper's only methodologically sound contribution is the interpretability analysis in Section 4.1. This section is well-executed and validates that the authors' model learns to identify key biological features. However, the novelty of this analysis is limited, as the core finding (that a model learns marker-gene associations for cell typing, validated against the PanglaoDB database) was previously demonstrated by the scBERT paper (Yang et al., 2022). The paper's novel contribution is therefore the methodological validation of successfully replicating this concept on a different class of model (a general-purpose text encoder, not a specialized biological model). Therefore, this is a good but incremental validation rather than a novel conceptual discovery.
The paper has several critical flaws in methodology, which leave its central claims unsubstantiated.
- The paper's headline claim is that its scMPT model is "even competitive with full fine tunes of scGPT." The key evidence is in Table 3, where scMPT (F1=0.745) appears to beat the "scGPT (full fine-tune - reported)" baseline (F1=0.718) on the Pancreas dataset, which is highly surprising. However, scMPT's score (F1=0.745) was obtained on an **intra-study** benchmark, as elaborated here:
> We focus our experiments on the datasets that were used to evaluate the cell type classification and clustering performance of GenePT (Chen & Zou, 2023)... For each dataset, we use the **same train/test split** as each of these respective works ...
The cited GenePT and other cited works use a randomized intra-study split (e.g., they note "For the Aorta dataset, we used a random 80%/20% train/test split."). In contrast, the scGPT baseline (F1=0.718) was generated on a much harder, **cross-study** benchmark, elaborated here:
> The human pancreas dataset contains data from five scRNA-seq studies... The five datasets were **split into reference and query sets by data sources**. The reference set consists of data from **two data sources**, and the query set contains the **other three**.
Therefore, the authors are comparing their high score on a much simpler intra-study split task to a baseline's score on a hard generalization task, which invalidates their primary claim of SOTA-competitiveness.
- Flawed claim of minimal loss under "cell sentence" conversion: The paper builds on converting numerical expression data into "cell sentences," and cites Levine et al. (2023) to claim "minimal information loss" under such a conversion. However, a more careful examination of the cited paper (Levine et al., 2023) shows that a linear model attempting to reconstruct the expression values from the ranks achieves an $R^2$ score of 0.815 (Figure 3 in Levine et al.). This means that nearly **20%** of the variance in the original data is lost during the conversion to a "cell sentence." Representing a 20% loss of information as a "minimal loss" without citing an exact figure is a misrepresentation of what the literature shows.
- The paper also fails to cite and discuss CellPLM (Wen et al., 2023), a highly relevant related work that directly critiques the paper's "cell as sentence" methodology, arguing that it misses crucial cell-cell relationships.
- Missing baseline: The paper's empirical evaluation fails to compare against scBERT (Yang et al., 2022), another prominent transformer-based foundation model for this exact task (scBERT is cited but not used as a baseline).
Please address the issues raised above. In particular, I look forward to hearing the authors' responses and clarifications regarding the train/test split, which is my most important concern.
Fully human-written

---

Bridging Gene Expression and Text: LLMs Can Complement Single-Cell Foundation Models
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary
This paper (Bridging Gene Expression and Text: LLMs Can Complement Single-Cell Foundation Models) investigates how large language models (LLMs) can complement single-cell foundation models (scFMs) such as scGPT for cell type classification and related single-cell analysis tasks. The authors systematically evaluate several LLMs (e.g., Ember-V1 as an encoder, GPT-4o as a non-reasoning LLM, o3-mini as a reasoning model) on cell sentence representations (a cell represented as a list of genes sorted by expression, commonly used in most LLM-based cell foundation models), perform interpretability and ablation studies, and propose scMPT, a simple fusion model that concatenates scGPT and LLM embeddings followed by an MLP classifier. Results show modest improvements over individual models and qualitative evidence that LLMs capture marker-gene knowledge and simple gene-expression patterns.
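The scMPT fusion described above is essentially a late-fusion classifier over concatenated frozen embeddings; a minimal sketch (the hidden size and layer count are assumptions):

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Concatenate frozen scGPT and text-encoder embeddings, then classify with an MLP."""

    def __init__(self, scgpt_dim, text_dim, n_classes, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(scgpt_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, scgpt_emb, text_emb):
        # Both embeddings come from frozen encoders; only this MLP is trained.
        return self.mlp(torch.cat([scgpt_emb, text_emb], dim=-1))
```

Writing it out this way makes clear why the design reads as late fusion rather than a new architecture.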
Overall Assessment
This work provides a careful and systematic evaluation of how LLMs interact with scFMs and contributes valuable insights for future research. However, the methodological innovation is limited, the improvements are small and insufficiently substantiated, and the experiments focus on relatively easy tasks.
1. The paper is well written and clearly structured, with strong motivation for exploring complementarity between text-based LLMs and single-cell foundation models.
2. Provides comprehensive, systematic experiments including ablations, interpretability analyses (integrated gradients and LIME), and comparisons across datasets to explore what the LLMs use to perform the cell-oriented classification tasks. This is a very good angle, offering a fresh perspective for the community.
3. The concept of using LLMs as complements to, rather than alternatives to, scFMs is timely.
4. The work contributes useful empirical baselines for future multimodal single-cell modeling studies.
1. Limited architectural novelty.
The proposed fusion model (scMPT) concatenates scGPT and LLM embeddings followed by an MLP. This is a straightforward late-fusion ensemble rather than a novel modeling approach. The contribution is primarily empirical, focusing on evaluating existing models on existing datasets.
2. Only scGPT is used to represent scFMs.
There are many other scFMs (e.g., scBERT, scPRINT, scFoundation, Geneformer, just to name a few) that represent the cell in different ways and may interact with LLMs differently. Using only one seems relatively weak.
3. Small evaluation sample.
The generative re-ranking experiment uses only 100 samples per dataset, which is too few to claim meaningful cost savings or statistical significance.
4. Over-reliance on simple tasks.
The primary evaluation task, cell type classification, is known to be relatively easy (especially for well-known cell types), as marker genes alone yield high accuracy. The study does not address more challenging cases such as fine-grained subtypes or rare cell populations, where complementary multimodal reasoning would be more revealing.
5. Scope dilution.
Although interesting, the inclusion of LLM-based reasoning experiments (e.g., scGPT + o3-mini / DeepSeek-R1) feels underdeveloped and somewhat disconnected from the main contribution. The small scale and lack of deeper analysis make it difficult to interpret their significance (e.g., o3-mini alone outperforms other models on the Pancreas dataset).
6. Unrealized potential of the fusion idea.
The fusion concept is promising, but a more principled hybrid approach, such as alignment / contrastive pretraining between scGPT and LLM embeddings, could yield stronger gains. The current setup does not fully exploit cross-modal synergies.
1. Can the authors show per-class performance, especially for rare or ambiguous cell types?
2. Have the authors considered alternative fusion strategies (e.g., gating, cross-attention, contrastive alignment)?
Fully AI-generated

---

Bridging Gene Expression and Text: LLMs Can Complement Single-Cell Foundation Models
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper studies what biological insights contribute toward the performance of LMs when applied to single-cell data, and how these models can complement single-cell foundation models to improve upon their performance.
For the former, they find that LMs capture biological insight, specifically knowledge of marker genes, as well as simple but effective gene expression patterns (top marker gene patterns).
For the latter, they introduce scMPT, which leverages both the representations generated by scGPT and those of an Ember-V1 text encoder, enabling better overall performance for cell-type classification and disease phenotype prediction.
- The paper tackles a relevant and timely question in computational biology: how language models interpret gene-level signals in single-cell data, and how this can be leveraged to complement single cell specialized foundation models.
- Interesting findings and analysis: The interpretability analysis connecting marker genes with model attributions is interesting and may be useful to biological researchers. And the discussion on gene-name hashing raises a thoughtful point—why language models might already capture much of the relational information between genes through their co-occurrence in “cell sentences,” rather than through explicit gene-name semantics.
- The paper is clearly written.
- Recommend for journal submission instead of ML conference: The method itself does not introduce substantial theoretical or engineering innovations. It resembles an empirical benchmark or ablation study for a biological problem rather than a new modeling framework.
- Limited benchmarking: The experiments include a narrow range of datasets and baselines. Since scGPT’s release, many other cell foundation models have emerged, and traditional non-foundation approaches remain competitive. Without broader comparisons, it is difficult to establish whether scMPT achieves state-of-the-art performance.
- Weak ML insight: While the biological analysis is sound, combining two frozen encoders and concatenating the two feature sets for classification is a years-old architecture. In other words, the paper does not provide much new ML insight, such as novel objectives, architectures, or theoretical findings.
1. How is the model in Figure 1 trained? Is it trained from scratch for cell type classification with a cross-entropy loss? If so, isn't it natural that the interpretability methods give high scores to marker genes as explanations, since marker genes are the determining factor for cell types and cell types are used for supervision?
2. Could you clarify why gene-name hashing improves performance, based on your understanding? Do you believe semantic information from gene names contributes meaningfully beyond their co-occurrence statistics in cell contexts?
3. Will you expand benchmarking to include additional, more advanced cell foundation models and traditional baselines (e.g., scANVI)?
Fully human-written

---

Bridging Gene Expression and Text: LLMs Can Complement Single-Cell Foundation Models
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The authors pose the question on what LLMs learn from “cell sentences” and whether they complement single-cell FMs like scGPT. The paper finds (via IG+LIME and ablations) that LLMs lean on marker genes plus simple expression-rank patterns, then proposes scMPT, a light fusion that combines frozen Ember-V1 text embeddings with frozen scGPT features. scMPT typically matches or beats each component; a second scheme uses reasoning LLMs (e.g., o3-mini) to choose among scGPT’s top-3 labels and improves accuracy on several sets.
- The question is clearly formulated, and the evidence is solid, with interpretable analyses using integrated gradients and LIME against PanglaoDB markers.
- Given the experimental results, scMPT (frozen encoders + tiny MLPs) yields consistent gains; even reasoning-LLM reranking over scGPT’s top-3 helps.
- Some disease-phenotype results suggest benefits aren’t limited to cell-type classification.
- Hashing names and shuffling ranks show performance diffs on top ~10% in-context genes and their order, supporting the “simple pattern” claim.
- It seems that cell-sentence bias remains, as findings show reliance on high-rank genes/order; it’s unclear how robust this is to batch shifts.
- Regarding the claims about knowledge of marker genes, we observe only correlational evidence. To support such a claim, a stronger causal probe would be necessary, for example through counterfactual token tests.
- I find the evaluation of the generative LLMs limited and biased: the head-to-head with GPT-4o / o3-mini uses only 100 test cells per dataset and constrains the label space to a provided list. That is sample-inefficient and risks optimistic estimates (especially for few, separable labels).
- If I follow the methodology correctly, IG and LIME are applied to an MLP on frozen Ember-V1 embeddings, not to the end-to-end LLM or the generative-LLM setup. So the attributions correspond to the MLP+Ember composite, not to scMPT or the reranking LLMs.
- My main concern is that the ablations might hint at some shortcut learning: hashing names wipes out performance, which might mean heavy reliance on lexical identity rather than biological semantics. These patterns look like lexical + rank shortcuts, not robust biology. The authors should evaluate under batch shift and on low-depth cells.
Check weaknesses.
Lightly AI-edited

---

DATR: DDI-Aware Therapeutic Structure Reconstruction for Safer Medication Recommendation
Soundness: 3: good
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces DDI-Aware Therapeutic Structure Reconstruction (DATR), a framework for medication recommendation that jointly optimizes for accuracy and safety. The authors attempt to address two key issues: the semantic gap, where a drug's molecular structure doesn't capture its specific clinical use, and the post-hoc nature of existing DDI-avoidance strategies. DATR's solution involves first applying Therapeutic Structure Reconstruction, which learns drug representations by encoding their molecular structure conditioned on their ATC category. Second, it introduces a Potential DDI Constraint, an asymmetric penalty that identifies interacting drugs and suppresses the one with lower therapeutic relevance to the patient's current condition, preserving only the most critical treatment. Extensive experiments on the MIMIC-III and MIMIC-IV datasets demonstrate that DATR significantly outperforms all baselines, achieving state-of-the-art accuracy while simultaneously recording the lowest DDI rates.
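As a rough reading of the asymmetric penalty described in the summary, one could imagine something like the following sketch (the relevance scores, the pairwise interaction matrix, and the product-form penalty are my assumptions; this is not the paper's loss):

```python
import torch

def asymmetric_ddi_penalty(probs, relevance, ddi):
    """Penalize co-prescribing interacting drugs, pushing down the less relevant one.

    probs:     (batch, n_drugs) predicted prescription probabilities
    relevance: (batch, n_drugs) therapeutic relevance to the patient's condition
    ddi:       (n_drugs, n_drugs) binary drug-drug interaction matrix
    """
    # Joint prescription mass for every interacting pair (i, j).
    pair_mass = probs.unsqueeze(2) * probs.unsqueeze(1) * ddi
    # Keep only the direction where drug j is less relevant than drug i,
    # so the penalty falls on the lower-relevance member of each pair.
    j_less_relevant = (relevance.unsqueeze(2) > relevance.unsqueeze(1)).float()
    return (pair_mass * j_less_relevant).sum(dim=(1, 2)).mean()
```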
The problem is relevant, and the proposed approach is reasonable. The presented results significantly improve performance when compared to competitor models. Ablation studies provide sufficient evidence for the importance of many of the components of the model, and the robustness to the selection of hyperparameters is well demonstrated.
## General
While the method seems sound and efficient, the presentation of the paper requires more work. Many paragraphs are unclear and convoluted, and following the presented argumentation is at times very difficult. Moreover, mathematical notation is often inconsistent and does not help understand the problem formulation and the proposed solution. Some crucial modeling choices are not properly justified by referencing the relevant literature or claiming authorship of those ideas. Finally, competitor methods are only briefly discussed.
It seems that the paper suffers from a lack of space to develop certain ideas due to the page limit. Please note that some minor issues are not necessarily wrong passages, but suggestions of parts that could be reduced to leave space to improve the discussion in other sections.
## Major
* Pieces of the Introduction and Related Works sections are too dense and high-level. Examples, concrete cases, figures and a clearer explanation are necessary. Main examples of these issues can be found below, but others may also be present, so the paper would benefit from additional proofreading by the authors.
* Line 45: This alleged gap needs better justification. The cited paper discusses computer vision representation learning. It's not at all clear how this could be extended for molecular representations. Overall, it would be important to have stronger evidence of the existence of such gap, i.e., why only relying on the global structure is not sufficient.
* Line 130: the authors say "VAEs have recently..." and proceed to cite a paper from 2015. My understanding is that VAEs are not a recent model, at least by machine learning standards. Other than that, I believe the original VAE paper by Kingma and Welling (2013) [1] would be a better reference for this paragraph.
* Line 132: The sentence here may make a reader believe that the other models discussed in this section do not utilize gradients, which is not the case.
* In "Deep-learning-based molecular representations" there's no discussion on molecular transformers, even though this class of models has shown significant results [2].
* Line 154: It is said that A is a binary matrix, and afterward it is said that it is calculated as the number of known interactions between the medications. The two definitions are contradictory.
* Figure 1 does not correspond to the textual description in the main text. For example, it does not show how categorical features are used to quantify the relevance of a drug category to the health condition of the patient.
* Figure 2 has several problems: the VAE structure is not illustrated, the arrows corresponding to the Potential DDI constraint do not correspond to what is written in the text, and there is no mention that CA stands for Cross-Attention. Overall, these problems make it difficult to follow the general architecture of the model.
* Some references seem to have errors. For instance, the reference for "Attention is All You Need" only mentions Vaswani as the author, while there are others.
* Line 295: references to the previous works should be included in this part.
* Lines 214-257 present material that largely summarizes well-established concepts. There is no need to go in depth into the math; citing the original paper that derived these equations may be enough. This section could focus on explaining the practical steps employed to generate the reconstructions in the context of the model, which is not clear as it stands now.
* My major issue with the paper lies in the many details missing for understanding the overall structure of the model and the experimental settings. In addition to the examples already provided above, it is unclear what the actual input of the Reconstruction module is and how the authors choose to input the ATC label into the model.
* The paper doesn’t explain how the dataset was handled apart from mentioning the training and testing split. It’s not clear, for example, whether this split was done by patients, that is, whether the model was tested on patients who were not included in the training set.
* One of the claimed contributions of the paper is to address a “semantic gap” by using ATC labels. However, the ablation study does not include an experiment that tests whether this is actually addressed by DATR.
## Minor
* Line 97: this part is very confusing. If I understood it correctly, I would suggest something like "Furthermore, the model can avoid the dependency on specific drug pairs in the training data because of its global consideration of all drug pairs for potential interacting risks".
* Lines 111-123: expanding the explanation of instance-based approaches could be beneficial.
* Line 161: calligraphic M is used, while earlier it was a normal M (see line 151).
* Equation 1: epsilon is not described in the text.
* Equation 7: It may be better to use one equation per line to improve readability. Also, $E_d$, $E_p$, $E_m$ and $T(\cdot)$ are not explicitly defined in the text.
* Lines 268-269 need some work. It would be good to change them to something like "The medication taken by the patient at the previous time point is denoted by...".
* Table 1: column DDI has no runner-up. Also, it may be beneficial to push down the table.
* Line 434: Notation "R->T" needs to be introduced before it is used.
* Appendix C1 is unnecessary.
* D.3.1: could be enriched by adding sources on this feature of VAEs and limitations in transfer learning.
## Typos and Language
* Line 63: Reconstruction
* Line 66: label
* Line 101: bridge
* Line 172: obtains
* Line 184: medication, predictions, constraints
* Equation 2: comma in formula
* Line 267: linearly
* Line 983: be
* Line 1113: they are
* Line 1127: utilize
## References
[1] Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
[2] Luong, K. D., & Singh, A. (2024). Application of transformers in cheminformatics. Journal of Chemical Information and Modeling, 64(11), 4392-4409.
* Line 130: what do the authors mean by "prime factors"?
* Line 159: Is the DDI graph A or D? Or are there two matrices describing these interactions?
* Line 189: is the idea of using global and substructure-based representation new?
* Line 336: is there a difference between the process for choosing the hyperparameters for DATR and for competitor models? If so, why?
* Equation 12: only $L_{DDI}$ has a weight coefficient? It is common practice in VAEs to also weight the reconstruction loss. Is there a reason for this choice?
Fully human-written

---

DATR: DDI-Aware Therapeutic Structure Reconstruction for Safer Medication Recommendation
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper introduces DDI-Aware Therapeutic Structure Reconstruction (DATR), a framework aimed at improving medication recommendation systems by simultaneously enhancing accuracy and safety, specifically by reducing drug-drug interactions (DDIs). The framework integrates molecular structure information with therapeutic intent, using a novel therapeutic structure reconstruction method and a proactive DDI constraint. Experimental results on two real-world datasets show that DATR outperforms existing methods in both accuracy and DDI reduction.
1. The integration of therapeutic intent with molecular structure is an innovative approach. Explicitly considering the clinical significance differences between interacting drugs is an interesting idea. The model is theoretically well-founded.
2. Extensive experiments on two real-world datasets with ablation studies and case analysis demonstrate strong performance in both effectiveness and safety.
3. The paper is well-written and well-organized, making it easy to follow, and the appendices give reproducible details.
1. The paper does not provide sufficient insight into how the model makes decisions, particularly how the integration of molecular structures and therapeutic intent influences the final recommendations. Explainability is crucial for clinical decision-making.
2. The paper does not discuss how the framework might perform across different therapeutic areas or for drugs with less common usage patterns. It would be helpful to see more discussion on its adaptability or limitations in diverse clinical contexts.
3. The datasets used (MIMIC-III and MIMIC-IV) are widely used in the healthcare community, but their biases (e.g., patient demographic distribution or disease-specific contexts) might limit the generalizability of the model. Addressing potential biases and testing on more diverse datasets could improve the model’s applicability to different patient populations.
1. Can you illustrate how specific molecular fragments contribute to disease-specific recommendations through the proposed therapeutic structure reconstruction? Are there examples where the molecular features directly correlate with therapeutic outcomes?
2. The experiments did not report how many drugs the framework recommends on average. Could you provide this information and discuss how it might affect clinical decision-making and the adoption of the framework in real-world settings?
3. The variant of DATR without the substructure-level representation underperforms the original model, performing even worse than some baselines. Could you clarify why the substructure-level representation is crucial and whether the improvements come solely from the addition of BRICS or from other factors? Could alternative explanations be explored for this observed performance drop?
Fully AI-generated

---

DATR: DDI-Aware Therapeutic Structure Reconstruction for Safer Medication Recommendation
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes DATR (DDI-Aware Therapeutic Structure Reconstruction), a framework that jointly models drug molecular structures, therapeutic intent (ATC-4), and DDI (drug–drug interaction) knowledge. The method not only improves the overall performance of medication recommendation, but also reduces DDIs, thereby mitigating potential adverse impacts of drug–drug interactions during recommendation.
1. The authors introduce a DDI-constraint mechanism that simultaneously reduces the risk of drug–drug interactions and improves recommendation performance.
2. On two real-world clinical datasets, MIMIC-III and MIMIC-IV, the model achieves state-of-the-art results.
3. The study further involves 20 clinicians to subjectively evaluate the recommended drug combinations, taking into account medications in the patients’ histories that might be appropriate yet previously overlooked, which provides clinical validation of the approach.
In Appendix D2, the paper presents detailed case analyses for individual patients (e.g., patients X and Y). However, Section 5.3 does not sufficiently document the details of the 20-clinician expert evaluation (e.g., patient conditions, criteria for judging effectiveness, and the underlying rationale). The paper would benefit from adding this analysis or providing more comprehensive expert-evaluation information in the appendix.
1. In many clinical settings, patients have limited visit histories; thus cold-start scenarios (first several visits) are particularly important for medication recommendation. What's the performance stratified by the number of visits?
2. When the DDI loss is removed (i.e., $\gamma=0$), the predictive performance decreases rather than increases. What primarily drives this drop? Please clarify if the DDI constraint offers any direct benefit to recommendation accuracy beyond mitigating interaction risk.
Lightly AI-edited

---

Assumption-lean inference on treatment effect distributions
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The article under review proposes an efficient method for estimating bounds on the distribution of treatment effects. The approach combines two existing ideas: debiased estimators and smoothing techniques. This combination yields element-wise semiparametrically efficient estimators while simultaneously addressing the smoothing bias introduced by the technique.
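For reference, the Makarov-type bounds that this line of work builds on constrain the distribution of the treatment effect $\Delta = Y(1) - Y(0)$ using only the marginal outcome distributions (standard background, not the paper's smoothed version):

$$
\sup_{y}\,\max\{F_1(y) - F_0(y - \delta),\, 0\}
\;\le\; \Pr(\Delta \le \delta) \;\le\;
1 + \inf_{y}\,\min\{F_1(y) - F_0(y - \delta),\, 0\},
$$

where $F_1$ and $F_0$ are the distribution functions of $Y(1)$ and $Y(0)$. As I understand it, the paper's contribution concerns efficient estimation of smoothed versions of these endpoint functionals.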
The article relies on recent techniques from the literature, and the analysis appears sound.
1. **Bounds**
- The claim in the paper's abstract that existing workflows "overlook distributional risks" appears overstated. There is a considerable literature on partial identification for treatment effect distributions that directly addresses this. The more pressing challenge, which the paper does not discuss, is the practical utility of these existing methods. Often, the identified bounds, and particularly their confidence bands, are too wide to provide meaningful guidance, thus limiting their impact. This essential context is missing from the paper's framing.
- This omission becomes more concerning in light of the paper's own results. The authors are critical of existing work for its "non-standard inference methods," yet the figures presenting their own estimations omit the corresponding confidence bands.
To show the practical utility of the proposed method and maintain consistency with its own critiques, reporting these confidence bands is essential. Hiding this information prevents a full and fair evaluation of the method's practical contribution.
2. **Theoretical Contributions**
- *Concerns Regarding Novelty and Development*:
While Table 1 summarizes the paper's stated contributions, the core techniques, such as debiasing and smoothing methods, are established tools from the existing literature. This reliance on established methods raises concerns about the paper's marginal contribution and novelty. Moreover, the analysis appears underdeveloped in key areas, as detailed below.
- *On the Definition of "Efficiency"*:
The paper's claim of an "efficient" estimator appears to hold only in an element-wise sense (i.e., for each plug-in estimator). However, the primary quantity of interest is the **interval estimator**. The paper does not demonstrate that efficiency of the individual endpoints implies efficiency for the interval itself. This is a critical distinction that needs to be rigorously addressed.
- *Omission of Uniform Inference*:
Furthermore, when discussing distributions, inference should not be restricted to point-wise estimation. The more appropriate and relevant analysis would consider a **uniform bound** that holds over the entire distribution. This approach has been extensively investigated in the literature, and it is unclear why the authors did not conduct such an analysis. This omission is a significant gap, as it avoids the standard method for this class of problem.
3. **Simulation and Synthetic Data Studies**
- The table reports only the Mean Squared Error (MSE). Given that the proposed smoothing method intentionally introduces bias, bias must be reported separately. Presenting only MSE hides the trade-off at the heart of the technique.
- The MSE of each interval endpoint is of limited relevance. The primary object of interest is the interval itself. The authors should report metrics appropriate for interval estimation, such as average interval length and empirical coverage probability, rather than treating the problem as one of point-estimation.
- Questions 1-3: Please see Weaknesses 1-3.
- Question 4.
The paper highlights the lack of "asymptotic normality" in existing methods as a significant drawback. This criticism, however, appears overstated. Many well-established estimators in statistics feature non-standard asymptotic distributions, and this alone does not invalidate them. This critique is particularly questionable given that the proposed method introduces its own set of practical complications, namely smoothing bias and the necessity of selecting tuning parameters. Would a more balanced discussion of these respective trade-offs be helpful?
- Question 5.
The paper’s asymptotic analysis lacks clarity, particularly regarding the convergence rate and the precise role of the smoothing bias in the large-sample results.
While Corollary 4.3 provides a formula for a confidence interval, the main text lacks the explicit weak convergence results necessary to formally justify this corollary.
Furthermore, the proposed confidence interval depends on numerous nuisance parameters and functions. The paper fails to demonstrate that the interval remains theoretically valid when these nuisance components are estimated via a "plug-in" approach. This is a critical omission, as the validity of such a procedure is not self-evident. This concern is compounded by the vague specification of the paper's underlying assumptions, which makes it impossible to verify the method's theoretical soundness.
Where can I find this information in the paper? |
Moderately AI-edited |
|
Assumption-lean inference on treatment effect distributions |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
na
pros:
This paper makes a clear and original contribution to causal inference by proposing an assumption-lean framework for inferring the distribution of treatment effects. Unlike prior methods that rely on restrictive margin or smoothness assumptions, the authors introduce smoothed Makarov bounds that allow valid semiparametric inference even when standard assumptions fail. The method combines theoretical rigor—through efficiency theory and bias control—with strong empirical validation on synthetic, semi-synthetic, and real-world A/B test data. Its ability to uncover heterogeneous and potentially harmful treatment effects, even when the average treatment effect is positive, highlights significant practical value for risk-aware decision-making in experiments.
con:
The main limitations lie in computational complexity and scope. The proposed smoothing-based estimators require intensive numerical integration and careful tuning of smoothing parameters, which may limit scalability in high-dimensional or large-scale settings. In addition, the framework currently applies only to binary treatments and assumes bounded outcomes, leaving open challenges for extending it to multi-valued or continuous treatments and heavy-tailed outcomes.
Despite these constraints, the paper's methodological innovation and practical relevance make it a strong step toward more robust and distributional approaches to causal inference.
na
na |
Fully AI-generated |
|
Assumption-lean inference on treatment effect distributions |
Soundness: 3: good
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper develops a new way to estimate Makarov bounds on treatment effect distributions by replacing the non-smooth max/min operators with smooth log-sum-exp (LSE) approximations. This smoothing enables valid inference even when traditional semiparametric estimators fail due to non-unique extrema (margin violations).
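For readers' reference, the standard log-sum-exp facts that the summary above alludes to (generic identities, not taken from the paper): for values $x_1,\dots,x_n$ and temperature $\beta>0$,
$$\mathrm{LSE}_\beta(x_1,\dots,x_n)=\frac{1}{\beta}\log\sum_{i=1}^{n}e^{\beta x_i},\qquad \max_i x_i \;\le\; \mathrm{LSE}_\beta(x)\;\le\;\max_i x_i+\frac{\log n}{\beta},$$
and the analogous $-\frac{1}{\beta}\log\sum_i e^{-\beta x_i}$ approximates $\min_i x_i$ from below with the same $\frac{\log n}{\beta}$ gap. The surrogate is smooth everywhere and its approximation error is explicitly controlled by $\beta$, which is the property exploited here to restore differentiability.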
1. The idea of using log-sum-exp smoothing to recover differentiability is straightforward and simple.
2. The paper provides a principled and theoretically grounded framework for valid semiparametric inference under margin violations.
Overall, the work is technically sound, and well supported by strong theoretical guarantees and empirical validation.
_1. Weak literature review._
The paper omits a substantial body of work on Quantile Treatment Effects (QTE), which serves as a core approach to distributional causal inference. This omission weakens the positioning of the proposed Makarov-based framework within the broader context of distributional effect estimation. I recommend discussing the prior QTE literature and clearly articulating why the Makarov-based approach is advantageous when the joint distribution of potential outcomes is unidentified, whereas QTE only captures marginal contrasts.
_2. Unclear motivation and significance._
The paper does not sufficiently justify why margin violations pose a critical practical problem. While the theory is correct, the empirical motivation could more clearly demonstrate concrete failure cases of existing methods under margin violations, ideally with a real-world example rather than synthetic illustrations. Without such evidence, the significance of the proposed smoothing appears unclear.
_3. Incremental contribution_
The main methodological idea (replacing non-smooth max/min operators with log-sum-exp smoothing) is well-known in optimization and statistical theory. The paper mainly applies this existing trick to the Makarov bounds without offering fundamentally new theoretical insight or stronger guarantees beyond standard smoothing arguments.
_4. Presentation_
The table is too packed and the fonts are too small. I believe this is a violation of the font-size regulation ("do not change font sizes")
_5. Title and scope._
The title “Assumption-Lean Inference on Treatment Effect Distributions” is overly broad relative to the actual methodological contribution. The paper’s method still depends on very strong assumptions (e.g., ignorability condition, discrete treatments, overlap, etc.), so the “assumption-lean” claim seems overstated. A more concrete title explicitly reflecting the smoothing-based inference for Makarov bounds would better convey the scope and focus.
Q1. How tight or sharp is the smooth approximation of the Makarov bounds? |
Fully AI-generated |
|
Assumption-lean inference on treatment effect distributions |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper focuses on the problem of estimating the full distribution of treatment effects rather than just the average treatment effect (ATE). The authors focus on estimating the so-called Makarov bounds, which define the sharp limits on the possible treatment-effect distribution given only the observed marginal outcome distributions.
Estimating these bounds is challenging because they involve non-smooth operations and previous methods either rely on strong “margin” assumptions (which are often violated in practice), or use plug-in estimators that lack valid inference guarantees. The paper proposes an assumption-lean approach that smooths the non-differentiable components of the Makarov bounds using differentiable approximations. This smoothing enables the derivation of efficient influence functions and the construction of debiased, asymptotically normal estimators. The authors also provide an explicit bound on the bias introduced by smoothing and adjust their confidence intervals accordingly. They propose two data-driven procedures for selecting the smoothing parameters: one minimizing an empirical MSE bound and another based on a Lepski-type adaptive rule. Empirically, the method outperforms plug-in and “envelope” baselines on synthetic, semi-synthetic, and real A/B test data, especially when the margin assumption fails. The results show improved bias–variance trade-offs and more reliable inference.
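For context, the classical Makarov bounds being estimated (a standard result stated here for the reader; notation may differ slightly from the paper's): writing $\Delta = Y(1)-Y(0)$ and letting $F_1$, $F_0$ denote the marginal distribution functions of the treated and control potential outcomes,
$$\sup_{y}\max\{F_1(y)-F_0(y-t),\,0\}\;\le\;\Pr(\Delta\le t)\;\le\;1+\inf_{y}\min\{F_1(y)-F_0(y-t),\,0\}.$$
The sup/max and inf/min are exactly the non-smooth operators that the proposed method replaces with differentiable surrogates; non-uniqueness of the optimizing $y$ is what makes the margin assumption fragile.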
1. The paper targets inference on treatment-effect distributions rather than just the mean effect. This is an important and underexplored direction for causal inference and A/B testing.
2. Their method removes the need for the restrictive margin assumption used in prior work, which is often violated in realistic settings where treatment effects are constant or nearly constant. This makes the inference procedure more robust and broadly applicable.
3. The use of smooth surrogates (log-sum-exp and softplus) to approximate the non-differentiable Makarov bounds is conceptually elegant and may have an impact on other statistical topics as well. Moreover, for the smoothing parameters, the paper introduces two adaptive, data-driven selection methods: (i) minimizing an empirical MSE upper bound and (ii) a Lepski-type adaptive selection rule. The experiments also show the method's stronger performance.
1. Widened and potentially conservative confidence intervals. Because the method adds an explicit smoothing-bias correction term to guarantee valid coverage, the resulting confidence intervals can become conservative and potentially wider than necessary. In practice, this may dilute the practical utility of the inference, especially when the true bias is small.
2. A follow-up to point 1: the smoothing is effectively the cost of this paper's approach, which trades the difficulty of avoiding the restrictive margin assumption for the difficulty of balancing bias and variance. Although the paper provides two data-driven ways to pick the smoothing parameters, there are no theoretical guarantees for these choices, so the issue is not fully resolved (the simulation and real-data results could stem from an optimal search over these smoothing parameters rather than from a solid guideline). In addition, the data-splitting approach reduces data efficiency. This problem therefore calls for further investigation, both theoretical and empirical; so far, the proposed selection rules are only heuristic.
Please see Weakness and the following:
1. The bias bound $b(t_1,t_2)$ is derived under compact outcome support and finite Lebesgue measure. How sensitive are your guarantees to this assumption, and can it be relaxed for unbounded or categorical outcomes?
2. The framework assumes no interference between units. Could it be adapted to clustered or networked data, where spillovers exist?
3. Given the widened confidence intervals caused by smoothing, do you have empirical evidence or calibration plots showing how often they over-cover relative to the nominal level? To be precise, are there any evaluations of whether the bias-correction term $b(t_1,t_2)$ makes the confidence intervals conservative? |
Fully human-written |
|
Enhancing Diffusion-Based Sampling with Molecular Collective Variables |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces the Well-Tempered Adjoint Schrödinger Bridge Sampler (WT-ASBS), a novel method that significantly enhances the exploration capabilities of energy-based diffusion models for molecular systems. The core innovation is integrating an adaptive bias, inspired by well-tempered metadynamics (WTMetaD), into the Adjoint Schrödinger Bridge Sampler (ASBS) training loop. This bias is deposited along pre-defined Collective Variables (CVs), which are low-dimensional projections of molecular coordinates. The bias effectively increases the sampling temperature in the CV space, mitigating the mode collapse pitfall often seen in standard diffusion samplers. Crucially, the method retains the ability to recover the correct Boltzmann ensemble through importance reweighting. The authors demonstrate WT-ASBS's efficacy on conformational sampling benchmarks (alanine dipeptide and tetrapeptide) and, notably, achieve the first demonstration of reactive sampling using a diffusion-based model on $S_N2$ and post-transition-state bifurcation reactions, showing significant efficiency gains over traditional WTMetaD in terms of wall-clock time and energy evaluations.
I should note that I am familiar with MCMC methods and general energy-based sampling, but am less knowledgeable about the specifics of modern molecular-science applications.
- The work presents a highly original and timely unification of two distinct fields: enhanced sampling (metadynamics) and generative diffusion modeling. By incorporating the well-tempered biasing mechanism directly into the iterative proportional fitting (IPF) scheme of the ASBS, the authors solve the critical problem of mode collapse for diffusion models applied to complex, high-dimensional, and multi-modal free energy landscapes. The successful application of reactive sampling, which is extremely challenging for standard diffusion models, is a significant first.
- The paper is technically sound and well-executed.
- The authors provide a convergence guarantee (Proposition 3.1), showing that WT-ASBS provably approaches the desired well-tempered target distribution.
- The experiments are convincing. On the alanine tetrapeptide, WT-ASBS discovers all eight metastable states much faster than WTMetaD (Figure 3c). The ability to accurately resolve complex free energy landscapes (PMFs) for both conformational and reactive systems, with close agreement between PMF from bias and PMF from reweighting, strongly supports the method's correctness and efficiency.
- For the complex reactive systems using universal ML interatomic potentials (uMLIPs), WT-ASBS achieves convergence with significantly fewer energy evaluations and less wall-clock time compared to WTMetaD (Figure 5).
- The paper is well-structured and clearly written, making complex concepts accessible. The methodological details, especially the two-time-scale algorithm and the practical implementation with a replay buffer (Algorithm 2), are laid out logically. Figure 1 clearly illustrates the overall scheme, and the experimental figures (e.g., Figure 2) effectively show the evolution of the PMF during training.
The effectiveness of WT-ASBS hinges on the accurate selection of several critical hyperparameters inherited from the enhanced sampling domain, such as the initial Gaussian height $h$, the Gaussian width $\sigma$, and the bias factor $\gamma$.
It seems that the paper lacks a systematic ablation study investigating the sensitivity of the method to these parameters (e.g., how the convergence speed or final accuracy changes with different $\gamma$ values). Given that WT-ASBS generates uncorrelated samples, the optimal choice for parameters like $h$ might deviate significantly from WTMetaD conventions, so clearer guidelines for practitioners are needed.
See weakness |
Fully AI-generated |
|
Enhancing Diffusion-Based Sampling with Molecular Collective Variables |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
In this work, the authors present a collective-variable-based approach to enhance diffusion-based samplers. By incorporating a biasing force that is gradually deposited, similar to metadynamics, the approach improves the exploration ability of diffusion-based samplers. This addresses a consistent issue with diffusion-based samplers being susceptible to mode collapse.
The specific diffusion-based sampler enhanced using collective variables (CVs) is the Adjoint Schrödinger Bridge Sampler (ASBS), although the CV-based approach does not appear to be necessarily specific to this instance of diffusion-based sampler. The biasing along the CV is approached similarly to most biased enhanced-sampling methods: an additional bias potential is added to the base potential energy. This bias potential is then updated using the same rule as in well-tempered MetaD. This results in a two-step optimization process: an inner step, in which ASBS is performed (resulting in mode collapse), and an outer step, in which the bias potential is updated. In addition, a few practical improvements are implemented, such as the use of a replay buffer and a warm start.
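For context, the well-tempered update rule referenced above, written in its standard WTMetaD form (the paper's exact discretization and symbols may differ; $h$, $\sigma$, and the bias factor $\gamma$ are the usual metadynamics hyperparameters):
$$V_{k+1}(\xi)\;=\;V_k(\xi)\;+\;h\,\exp\!\left(-\frac{V_k(\xi_k)}{k_B\,\Delta T}\right)\exp\!\left(-\frac{\lVert \xi-\xi_k\rVert^2}{2\sigma^2}\right),\qquad \gamma=\frac{T+\Delta T}{T},$$
where $\xi_k$ is the CV value of the most recent sample. Because the Gaussian height decays as bias accumulates, the biased CV-space marginal approaches $p(\xi)^{1/\gamma}$ in the long run, i.e. sampling at an effectively higher temperature along the CVs while leaving the orthogonal degrees of freedom at the physical temperature.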
The presented method is evaluated on a number of interesting systems. First, the evaluation considers two peptides, Alanine Dipeptide and Alanine Tetrapeptide, and, following this, the authors study reactive pathways in the form of a simpler SN2 reaction and a more complex reaction with multiple products.
The authors present an interesting and seemingly successful approach to alleviate the well-known issue of mode collapse in diffusion-based samplers within the domain of molecular conformations. The focus on ASBS as the diffusion-based sampler to enhance, and the choice of the well-tempered bias update, also seems the most sensible approach. Furthermore, the paper is well written and does an excellent job highlighting the core considerations that go into an enhanced-sampling method.
The experimental evaluation is relatively in-depth and focuses on two interesting problems by considering both simple peptides and reactions. While quantitative results are limited, the presented results suggest a significant improvement over standard WTMetaD.
All in all, the paper presents an excellent approach to enhance diffusion-based samplers to make them more appropriate for the equilibrium sampling of a molecular system. As such, I vote to accept the paper for publication.
- Looking at Figure 2, I’m surprised to see roughly uniform sampling from the second outer-loop iteration onwards (assuming that this is what the dots in panel c represent). I would have expected the samples to be located around the two smaller modes at this point, or still located at the original modes. Only at the later iterations would I expect to see roughly uniform sampling. Notably, this is also directly in contrast with the intuition presented in Figure 1. An author response to this inconsistency is required for me to increase my score beyond a borderline accept.
- There are a number of important components to the training setup that would be good to see studied more in-depth, such as the use of a replay buffer and the warm start. Most importantly, I am interested in seeing how these components influence the number of energy evaluations needed.
- Due to the reliance on collective variables, it is difficult to envision the applicability of the presented approach outside domains where CVs are known and provide a clear boundary to the sampling domain. For example, in areas such as the study of protein folding dynamics, it is hard to define CVs that sufficiently restrict the important sampling domain.
- As the work could potentially spark new interest in enhanced sampling, it would be good to see a more in-depth discussion of work in this area, such as umbrella sampling and adaptive biasing force methods. This could potentially be combined with the discussion of the requirements in Section 2.1, whose current purpose is not entirely clear. Additionally, it would be good to see additional works discussed, such as [1–4] (not a complete list), all of which use a similar biasing force to that proposed here in the context of enhanced sampling (but are more focused on transition path sampling).
[1] Seong, Kiyoung, et al. “Transition Path Sampling with Improved Off-Policy Training of Diffusion Path Samplers.” arXiv preprint arXiv:2405.19961 (2024).
[2] Singh, Aditya N., Avishek Das, and David T. Limmer. “Variational path sampling of rare dynamical events.” Annual Review of Physical Chemistry 76 (2025).
[3] Holdijk, Lars, et al. “Stochastic optimal control for collective-variable-free sampling of molecular transition paths.” Advances in Neural Information Processing Systems 36 (2023): 79540–79556.
[4] Du, Yuanqi, et al. “Doob’s Lagrangian: A Sample-Efficient Variational Approach to Transition Path Sampling.” Advances in Neural Information Processing Systems 37 (2024): 65791–65822.
See weaknesses. |
Fully human-written |
|
Enhancing Diffusion-Based Sampling with Molecular Collective Variables |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes WT-ASBS (Well-Tempered Adjoint Schrödinger Bridge Sampler), a method that combines a neural diffusion-based sampler (ASBS) with a well-tempered bias potential defined in collective-variable (CV) space. The goal is to accelerate exploration in molecular systems by gradually flattening the free-energy landscape while retaining thermodynamic consistency through reweighting.
The authors claim that this hybrid design achieves faster and broader sampling compared to standard ASBS and conventional enhanced sampling methods such as WTMetaD. Empirical results on alanine dipeptide, alanine tetrapeptide, and chemical reaction benchmarks suggest improved exploration and free-energy reconstruction.
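As background for the reweighting claim in the summary (a standard importance-reweighting identity in its quasi-static form, assuming the bias $V(\xi)$ is held fixed when weights are computed; the paper may additionally use a time-dependent offset): with $p(x)\propto e^{-\beta U(x)}$ the target Boltzmann density and $p_V(x)\propto e^{-\beta(U(x)+V(\xi(x)))}$ the biased density,
$$\mathbb{E}_{p}[O(x)]\;=\;\frac{\mathbb{E}_{p_V}\!\left[O(x)\,e^{\beta V(\xi(x))}\right]}{\mathbb{E}_{p_V}\!\left[e^{\beta V(\xi(x))}\right]},$$
so expectations under the unbiased ensemble can be recovered from biased samples as long as the importance weights $e^{\beta V(\xi(x))}$ remain well behaved.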
**Conceptually innovative hybridization.**
The integration of Schrödinger bridge–based diffusion sampling with a well-tempered metadynamics bias is novel. It bridges probabilistic transport and enhanced sampling, offering a unified view of energy-driven exploration in molecular systems.
**Clear motivation from sampling efficiency.**
The work addresses a concrete challenge in molecular generative modeling — slow mode discovery and poor coverage of rare events — and proposes a pragmatic biasing mechanism that is both simple and theoretically interpretable.
**Transparent discussion of limitations.**
The paper openly discusses the non-amortized bias, potential generalization issues, and differences from traditional WTMetaD, which reflects scientific maturity and honesty.
**1. Conceptual tension between learned sampler and non-learnable bias.**
Although the method is presented as a neural diffusion sampler, the core bias $V_{WT}$ is manually accumulated (non-neural). This design choice blurs whether the sampler truly learns the target distribution or merely follows a hand-crafted MetaD bias.
**2. Lack of fair and direct comparison with ASBS.**
Since WT-ASBS is an ASBS variant augmented with a well-tempered bias, the most meaningful benchmark is ASBS itself under identical conditions. However, ASBS only appears in Fig. 3c (explored states) and is not evaluated for PMF accuracy, making the incremental benefit of WT unclear.
**3. Inconsistent force-call counts between figures.**
Figure 3.c shows up to $\approx$2 M evaluations, while Figure 3.d extends to $\approx$40 M. Because WT-ASBS reaches chemical accuracy after $\approx$4–5 M evaluations, truncating Fig. 3c prevents fair comparison at later stages.
**4. Limited diffusion-based baselines.**
Comparisons are restricted to WTMetaD, which mainly highlights conceptual differences rather than validating the WT mechanism across diffusion methods. Adding at least one diffusion-based baseline (annealed, energy-guided, or score-guided) would strengthen the evaluation.
**5. Scalability and consistency.**
WT-ASBS excels on alanine dipeptide (Figure 2.f) but underperforms WTMetaD on tetrapeptide (Figure 3.d), suggesting potential sensitivity to CV dimensionality or limited generalizability to more complex systems.
**6. Lack of statistical robustness.**
The reported results appear to be based on a single simulation run for WT-ASBS. Given the stochastic nature of both diffusion and metadynamics sampling, multiple independent runs (with mean and variance) are crucial to assess convergence stability and reproducibility. Without them, it is difficult to judge whether the reported trajectories and PMF curves reflect consistent behavior or a favorable random seed.
(Q1) If $V_{WT}(\xi)$ is not parameterized by a neural network, how does the model ensure that the learned sampler contributes beyond a standard MetaD bias? Could the authors clarify whether WT-ASBS should be interpreted as (a) a neural sampler guided by an analytic bias, or (b) a learnable bias parameterized through the neural transport model?
(Q2) In Figure 8, the reweighted and bias-derived PMFs are nearly identical, suggesting that the sampling in CV space is almost uniform. If so, what ensures that the sampled structures remain physically valid? Could the authors provide representative molecular structures sampled from different regions of the CV space (e.g., high- vs low-free-energy regions) to illustrate what kind of configurations the model actually produces? This would clarify whether the method truly explores meaningful conformations or simply performs uniform random exploration over CVs.
(Q3) Could you extend Figure 3.c to the same energy-evaluation range ($\approx$10–40 M) as Figure 3.d to show late-stage exploration and better quantify the WT bias’s long-term effects? Please include an ASBS PMF accuracy curve for Ala2 system.
(Q4) Could you summarize cost–benefit metrics (energy evaluations, wall-clock time, ESS, PMF error) across ASBS, WT-ASBS, and WTMetaD for standardized comparison?
(Q5) ASBS’s slow exploration (Figure 3.c) might result from strong pre-training confinement near the initial mode (marked * in Figure 3.b).
Have you explored reducing pretraining strength or fine-tuning its duration to encourage broader exploration?
(Q6) When computing the free-energy difference, which states or basins were used? Were they predefined minima or clusters in CV space?
Clarifying this would help interpret the reported PMF results.
(Q7) Are all reported curves from single runs, or averaged over multiple simulations? If single, could you comment on the run-to-run variability? Showing variance bands or repeated trials would help establish statistical reliability. |
Fully AI-generated |
|
Enhancing Diffusion-Based Sampling with Molecular Collective Variables |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces WT-ASBS, a diffusion-based generative sampler enhanced by biasing along CVs, inspired by WTMetaD.
The method replaces the local MD step in WTMetaD with a diffusion-based neural sampler (ASBS) that learns to generate molecular configurations directly from the potential energy function. A repulsive potential is iteratively constructed in CV space to encourage exploration of new regions and flatten free-energy barriers, with subsequent reweighting to recover the unbiased Boltzmann ensemble. Experiments demonstrate that WT-ASBS improves sampling efficiency and mode discovery for molecular systems.
1. The idea of integrating diffusion-based neural samplers with enhanced-sampling techniques from molecular simulation is elegant and promising. While both ASBS and WTMetaD exist independently, their combination is novel and creates a bridge between machine-learning-based and physics-based sampling paradigms.
2. The methodology is technically well-founded, with a clear derivation and an explicit algorithmic description (Algorithm 1). The paper includes a convergence proposition showing that the bias potential approaches the well-tempered limit, ensuring theoretical consistency.
3. The paper is clearly written, logically organized, and visually well-supported by figures. The workflow in Fig. 1 and the schematic diagrams for peptide systems help the reader follow the training and bias-updating procedure.
4. From an application perspective, the work demonstrates that diffusion-based samplers can efficiently handle physically meaningful molecular systems. The results on peptides and reactive landscapes show large efficiency gains over WTMetaD. This points toward real practical potential in molecular modeling.
1. The method mainly replaces the MD propagation in WTMetaD with ASBS without introducing new learning objectives or architectures. The innovation lies more in application and system integration than in core ML algorithmic development. For ICLR, where the focus is typically on advancing machine-learning methodology, the contribution might appear not so significant. The work may find a more natural home in computational chemistry or physics-oriented venues.
2. Algorithm 1 involves two nested loops (outer bias update over k, inner ASBS optimization over l). In addition, an iid sampling procedure is involved. However, the paper does not specify when or how to stop the iterations in practice. ASBS itself is self-consistent and lacks a single scalar loss whose decrease guarantees convergence. The absence of heuristic or empirical convergence criteria (e.g., stabilization of bias potential or free-energy estimates) makes reproducibility difficult.
3. All experiments use well-known CVs such as torsional angles or bond distances. For realistic, high-dimensional systems, CV discovery is non-trivial. It would strengthen the paper to evaluate the method with ML-discovered CVs (e.g., TICA, SPIB, or FMRC). Without this, the method’s applicability to problems where good CVs are not known remains uncertain.
4. Boltzmann Generators and related approaches can guarantee asymptotically correct estimates via MCMC or importance-sampling refinement. WT-ASBS claims reweighting via the bias but does not discuss whether such reweighting ensures unbiased expectations in the presence of neural-sampler approximation error.
5. In Fig. 3d, the free-energy mean-absolute-error of WT-ASBS is slightly larger than that of WTMetaD. It would be valuable to analyze the reason to understand when diffusion-based sampling may underperform.
6. The method’s feasibility for explicit-solvent systems remains unclear.
1. How should practitioners determine that the two-loop WT-ASBS training has converged?
2. Can the authors clarify whether the proposed reweighting scheme guarantees unbiased Boltzmann expectations, or whether the residual approximation in the learned diffusion process introduces systematic bias? Could additional refinement (e.g., short MCMC runs or importance sampling) ensure asymptotic correctness similar to Boltzmann Generators?
3. Have the authors considered coupling WT-ASBS with ML-based CV discovery?
4. WT-ASBS achieves broader exploration but sometimes slightly worse free-energy accuracy (Fig. 3d). Can the authors comment on this?
5. Can the authors explore the computational feasibility for systems with explicit solvent? |
Fully AI-generated |
|
ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents ThinkGeo, a benchmark designed to evaluate *tool-augmented LLM agents* in remote sensing (RS) tasks. Unlike general-purpose benchmarks such as ToolBench, GAIA, or GTA, ThinkGeo emphasizes *spatial reasoning* grounded in satellite and aerial imagery, assessing how agents plan, invoke tools, and integrate perception, logic, and operation modules.
The benchmark covers 486 agentic tasks with 1,773 expert-verified reasoning steps across seven RS domains (urban planning, disaster assessment, environmental monitoring, transportation, aviation, recreation, and industry). It includes an executable tool suite (14 tools) and two evaluation modes (step-by-step and end-to-end). Experiments over diverse models (GPT-4o, Claude-3.7, Qwen2.5, LLaMA3, etc.) reveal large performance gaps in spatial grounding, argument formatting, and multi-step consistency.
1. Novel application domain: ThinkGeo meaningfully extends agentic evaluation into remote sensing, an underexplored yet practically critical field.
2. Comprehensive dataset design: 486 tasks annotated with fine-grained ReAct traces and human validation offer substantial scale and detail.
3. Structured tool taxonomy: The integration of perception, logic, and operation tools (e.g., `ChangeDetection`, `SegmentObjectPixels`) provides a modular evaluation of spatial and reasoning ability.
4. Clear methodology: The pipeline for query generation, validation, and dataset curation is carefully documented and reproducible.
5. Empirical breadth: Covers a wide range of LLMs and provides both quantitative metrics (InstAcc, ToolAcc, ArgAcc) and qualitative analyses (failure examples, tool call errors).
1. Despite the impressive engineering effort, the paper mainly adapts existing ReAct-style benchmarks (e.g., GTA, ToolBench) to a new domain. It does not introduce fundamentally new evaluation principles or algorithms—its novelty lies primarily in the application context.
2. The benchmark remains evaluation-oriented rather than analysis-oriented. While the authors present tool accuracy and reasoning consistency, there is minimal exploration of why models fail (e.g., visual grounding vs. reasoning gaps) or how agentic design can mitigate these issues.
3. The benchmark’s query generation pipeline depends on human experts and semi-automated GPT scripting. This makes it difficult to replicate or extend without significant effort, and introduces potential annotator bias. A more data-driven or semi-synthetic generation strategy would enhance scalability.
4. The benchmark evaluates models only on remote-sensing tasks but does not verify whether improvements transfer to non-RS domains. Without such comparison, it is unclear whether ThinkGeo assesses *general agentic reasoning* or simply *domain familiarity*.
5. The design choices—such as the selected 14 tools, ReAct format, and evaluation metrics—are taken as fixed. There is no ablation on how task difficulty, tool diversity, or image modality (RGB vs. SAR) affect performance.
6. The paper adopts “LLM-as-a-judge” evaluation for correctness but provides little evidence of its reliability. No inter-rater agreement or validation against human experts is reported. This weakens the credibility of reported accuracy metrics.
7. Much of the main text reads like a technical report rather than an academic analysis. Figures and tables convey results clearly, but interpretive discussion (e.g., why GPT-4o outperforms open-source models, or what features drive performance) is limited.
1. Please quantify the reliability of your LLM-as-a-judge system. How consistent are its decisions with human evaluation? Reporting an inter-annotator agreement or accuracy over a labeled subset would strengthen the methodological validity.
2. Conduct an ablation varying (a) the number of tools available and (b) the task difficulty (easy vs. hard). Does model performance degrade linearly with reasoning steps or tool diversity?
3. Evaluate a small subset of models (e.g., GPT-4o, Qwen2.5) on cross-domain tool reasoning (such as GAIA or GTA tasks) to assess whether ThinkGeo captures domain-specific or general agentic competence.
4. The qualitative failures in Fig. 7 are informative; please complement them with a quantitative breakdown by error type (argument errors, redundant steps, wrong bounding boxes, etc.) across all models.
5. Appendix A.1 briefly reports SAR results but lacks discussion. Please analyze why optical-trained agents fail on SAR—e.g., due to missing spectral priors or inappropriate tool arguments. |
Fully AI-generated |
|
ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents a benchmark for geospatial agents on remote sensing tasks. The authors collected 486 tasks spanning a range of applications, from urban planning to change analysis and aviation monitoring, and grounded the answers in satellite or aerial imagery. To succeed, an agent has to correctly reason through the question and the images to find specific answers. The dataset is well constructed, including expert reasoning steps. The authors built an agent to evaluate the benchmark on a set of different models, finding reasonable accuracy for frontier models on the step-by-step evaluations but weak overall accuracy on the tasks.
The strength of the paper is the benchmark itself, which appears carefully put together, as well as the sets of tool calls needed to answer the questions. The expert review of each question evaluates the accuracy against the entire task suite.
The difficulty of this benchmark is hard to understand. It was constructed from a standard set of publicly available remote sensing datasets, with questions posed by "experts" about the datasets forming the core of the benchmark. If a model or a person could successfully solve all of the problems in the benchmark, what level of expertise would they have?
The expert evaluation is welcome, but it makes it difficult for this to serve as a reproducible benchmark -- we aren't given much information about the experts and their level of expertise, and from the tasks shown, the questions in the benchmark seem straightforward to answer even for a non-expert.
What is the difficulty of the benchmark, measured in terms of the skills of an expert human? Easy and hard are relative terms and are not well calibrated. If a system could perform well on this benchmark, what capability would it have?
To what extent do the tools you are providing to the agent affect the quality of the answers? |
Fully human-written |
|
ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces ThinkGeo, a new benchmark designed to evaluate tool-augmented LLM agents on remote sensing tasks. Unlike existing benchmarks such as GAIA, ToolBench, or GTA that focus on general-purpose reasoning or synthetic multimodal setups, ThinkGeo specifically targets spatially grounded, real-world Earth observation applications. The benchmark includes: 1. 486 agentic tasks with 1,773 expert-verified reasoning steps, 2. a suite of 14 tools for perception, logic, and spatial operations (e.g., object detection, segmentation, calculator), 3. step-by-step and end-to-end evaluation metrics (e.g., ToolAcc, ArgAcc, AnsAcc), 4. comparative evaluation of leading models (GPT-4o, Claude-3.7, Qwen2.5, LLaMA-3, etc.). The authors show that even top models struggle with tool argument formatting, spatial grounding, and multi-step planning, highlighting open challenges in agentic reasoning for geospatial tasks.
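To make the step-level metrics concrete, below is a minimal sketch of how per-step tool and argument matching could be scored against a reference trace. This is an illustrative reconstruction under my own assumptions (positional step alignment, exact-match arguments, placeholder field names), not the benchmark's actual scoring code.

```python
from typing import Dict, List

def step_metrics(pred_steps: List[Dict], gold_steps: List[Dict]) -> Dict[str, float]:
    """Score a predicted ReAct-style trace against a reference trace.

    Each step is assumed to look like {"tool": "ObjectDetection", "args": {...}};
    the field names and exact-match rule are placeholders, not ThinkGeo's schema.
    """
    n = min(len(pred_steps), len(gold_steps))
    if n == 0:
        return {"tool_acc": 0.0, "arg_acc": 0.0}
    pairs = list(zip(pred_steps[:n], gold_steps[:n]))
    tool_hits = sum(p["tool"] == g["tool"] for p, g in pairs)
    arg_hits = sum(p["tool"] == g["tool"] and p["args"] == g["args"] for p, g in pairs)
    return {"tool_acc": tool_hits / n, "arg_acc": arg_hits / n}

# Example: the right tool is called but with a wrong confidence threshold.
pred = [{"tool": "ObjectDetection", "args": {"class": "aircraft", "conf": 0.5}}]
gold = [{"tool": "ObjectDetection", "args": {"class": "aircraft", "conf": 0.3}}]
print(step_metrics(pred, gold))  # -> {'tool_acc': 1.0, 'arg_acc': 0.0}
```

Even a simple scorer of this kind makes clear why argument formatting can dominate the error profile separately from tool selection.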
The authors introduce a novel agentic benchmark for the remote sensing domain. While the benchmark is very domain specific, it requires skills such as fine-grained visual reasoning, geospatial reasoning, and tool selection, which are generalizable to a wide variety of domains. The benchmark curation pipeline is rigorous with expert validation and error analyses. Inclusion of finegrained step-wise evaluation is also not common of agentic baselines.
The paper could benefit by including more details on benchmark curation and evaluation. See questions.
The authors use gpt-4o for LLM-as-a-judge. Have you calibrated the judge across different models to verify concordance with human judgment / prevent bias toward GPT-family outputs?
In the Section 4 evaluation methodology, including concise descriptions of InstAcc, ToolAcc, etc. would make the paper more self-contained.
The benchmark relies on publicly available datasets. Is there risk of data leakage if pretrained models have seen these images before?
The datasets also heavily feature urban and transportation-heavy scenes. Does this create bias against rural and vegetation-dominant regions?
"Queries appearing earlier in the sorted list, with fewer complex keywords and shorter reasoning steps, were labeled as easy, while the rest were classified as hard." Could the authors clarify what "fewer" and "shorter" mean here? Was an explicit threshold used?
Table 3 reports very close numbers for some metrics. Is it possible to report a measure of spread (SD, confidence interval, etc.) to help contextualize these differences?
Could the authors include average per query costs for some common closed source models? |
Fully human-written |
|
Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents IKEA, a method that extracts knowledge from RAG systems using ordinary queries. IKEA stays stealthy by creating natural queries built from anchor concepts. The method has two parts. Experience Reflection Sampling chooses concepts that are likely linked to the RAG’s internal knowledge based on past query results. Trust Region Directed Mutation changes anchor concepts within a set similarity range to find new and related information more effectively. Experiments show that IKEA performs much better than other methods. The extracted knowledge can also be used to build a working substitute RAG system.
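To make the two components easier to picture, here is a minimal sketch of a cosine-bounded ("trust region") mutation over candidate anchor concepts, with query history used as a simple novelty signal. This is an illustrative reconstruction under my own assumptions (embedding source, thresholds, and the novelty heuristic are placeholders), not the authors' implementation.

```python
import numpy as np

def trust_region_mutate(
    anchor_vec: np.ndarray,       # embedding of the current anchor concept, shape (d,)
    candidate_vecs: np.ndarray,   # embeddings of candidate concepts, shape (n, d)
    history_vecs: np.ndarray,     # embeddings of already-used anchors, shape (m, d)
    sim_lo: float = 0.6,          # lower cosine bound of the trust region (placeholder)
    sim_hi: float = 0.9,          # upper cosine bound of the trust region (placeholder)
) -> int:
    """Return the index of the next anchor concept, or -1 if no candidate qualifies."""
    eps = 1e-12
    # Cosine similarity of every candidate to the current anchor.
    sims = candidate_vecs @ anchor_vec / (
        np.linalg.norm(candidate_vecs, axis=1) * np.linalg.norm(anchor_vec) + eps
    )
    # Trust region: stay semantically related, but not so close as to be redundant.
    in_region = (sims >= sim_lo) & (sims <= sim_hi)
    if not in_region.any():
        return -1
    # Prefer candidates far from past anchors, i.e. more likely to hit new documents.
    if history_vecs.size:
        hist_sims = (candidate_vecs @ history_vecs.T) / (
            np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
            * np.linalg.norm(history_vecs, axis=1) + eps
        )
        novelty = -hist_sims.max(axis=1)
    else:
        novelty = np.zeros(len(candidate_vecs))
    novelty[~in_region] = -np.inf
    return int(np.argmax(novelty))
```

The selected concept would then be phrased as an ordinary-looking question, which is what makes the attack hard to flag with malicious-prompt filters.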
- The paper studies an important security issue in RAG systems: extraction attacks. Its focus on harmless-looking queries sets it apart from most past work.
- The IKEA method is explained in a clear and direct way. Figure 1 gives a clear summary of the process.
- Experiments use several settings. Results show that IKEA keeps high EE and ASR while passing basic defenses. This is a strong finding.
- Code is provided, and it seems to make sense.
- The tested defenses are insufficient. Stronger and more realistic defenses would include semantic output filtering, consistency checks, detection of repeated probing, or methods targeting iterative query attacks.
- I am not sure about the main assumption: requiring the RAG topic to be fixed and known limits how well the method can be used in other cases.
- The results are not enough to support the claim that the substitute RAG performs “comparably” (Sec 4.5). Three metrics cannot capture many other aspects.
- The cost in time, API calls, and total query rounds is not clear. This may make the attack too expensive for extracting large knowledge bases.
1. See weakness
2. The topic probing method seems crucial for practical applicability. Could you provide more details on its robustness?
3. While IKEA avoids malicious prompts, could the pattern of queries generated be detectable by analyzing query sequences over time using anomaly detection techniques? |
Fully human-written |
|
Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes IKEA, a “benign-query” knowledge-extraction attack on RAG. It combines (i) Experience Reflection sampling over “anchor concepts” and (ii) Trust-Region Directed Mutation (TRDM) to explore the embedding space, and evaluates against RAG-Thief and DGEA under input/output defenses, reporting higher extraction efficiency and attack success.
1. The paper is well-written and easy to follow
2. The studied topic is important and novel
1. The paper evaluates against only two prior attacks, RAG-Thief (prompt-injection) and DGEA (jailbreak), even though the Related Work section lists additional, closely related extraction methods (e.g., Pirates of the RAG / adaptive black-box extraction) that are not included as baselines. This makes the claimed superiority (“surpassing baselines by 80%+”) hard to trust. At minimum, strong black-box, non-jailbreak/PIK variants and adaptive coverage attacks should be implemented. More discussion of the related work is needed.
2. “Semantic Similarity (SS)” uses an encoder to compare outputs with retrieved docs, favoring paraphrase-style extraction (IKEA) over verbatim baselines, while CRR (ROUGE-L) penalizes paraphrase. Claims hinge on SS/EE/ASR; there is no human audit of copyright risk nor independent leakage criteria. Copyright/privacy stakes aren’t well reflected by SS alone.
3. HealthCare-100k, HarryPotterQA, and Pokémon are niche; Pokémon is explicitly chosen as low-overlap with pretraining. Results may not generalize to enterprise RAG (contracts, support logs, medical records), where policy, formatting, and noise differ.
4. The main setup assumes a known domain topic; the “unknown topic” setting still uses a bespoke topic-probing stage powered by a secondary LLM, then evaluates almost identically—this weakens the claim that IKEA remains benign and practical under stricter assumptions.
5. Replacing Top-K with off-topic docs predictably tanks both the attack and benign utility to near zero (Table 4), which is not an acceptable real-world mitigation, so it doesn’t inform deployers what works.
6. The pipeline and equations are clear, but the headline claim (“surpassing baselines by >80% efficiency, >90% success”) rests on a baseline set that is neither representative nor matched to IKEA’s benign-query regime. Without stronger baselines, the empirical claim reads overstated.
1. Add competitive benign-query baselines: random/diversity sampling; k-center or farthest-point query selection; BM25 lexical sweeps; self-ask/chain-expansion; an adaptive coverage agent; and a re-implementation of adaptive black-box extraction from the works already cited in §5.
2. at least one enterprise-style corpus with policy/PII-like structure, and long-document settings that stress retrieval/reranking. |
Fully AI-generated |
|
Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper studies extraction of documents from a RAG knowledge base. Instead of using malicious queries, the authors use benign queries repeatedly to collect RAG answers as stolen knowledge, and propose several tricks to improve search efficiency—e.g., avoiding duplicate retrievals and increasing coverage. Experiments evaluate metrics including extraction efficiency and the downstream performance of RAG systems reconstructed from the stolen knowledge.
1. The paper addresses an important problem by studying the privacy risks of RAG systems under more realistic settings—specifically, black-box access with defenses in place.
2. Compared with baselines, the proposed method demonstrates stronger robustness against defended RAG systems, successfully extracting more knowledge when defenses are applied.
3. The paper is well-written, and the overall idea is intuitive and easy to follow.
1. The idea of using query–response semantic distance as a proxy for local RAG density is based purely on intuition, without further discussion. The paper does not provide references or experiments to validate this assumption.
2. The evaluation includes only two baselines, while several other relevant methods are mentioned but not compared experimentally.
3. The extracted documents achieve low ROUGE scores (below 0.3, Table 1), indicating that the extracted content fails to accurately recover the original documents. This limits the practical implications for privacy or copyright concerns.
4. Some metric definitions are unclear. For example, *extraction efficiency* depends on the number of “unique” extracted documents, but the notion of uniqueness is not specified. Moreover, since the method does not reconstruct original documents, comparability of this metric with prior work is questionable. Similarly, the definition of *ASR*—the ratio of non-rejected queries—does not directly measure extraction success.
5. The proposed method introduces many hyperparameters (over ten), which may be difficult to tune in practice. The paper provides little discussion on how these parameters are chosen.
6. Ablation results show only marginal improvements over random baselines (Table 13), particularly for ASR, CRR, and SS metrics, raising concerns about the actual effectiveness of the proposed approach.
In the evaluation, some methods such as DGEA achieve high ROUGE scores (up to 0.96 in Table 1) in the no-defense setting, suggesting near-literal copying. However, their embedding similarity remains relatively low. What are the possible reasons?
Lightly AI-edited |
|
Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This submission investigates covert extraction of proprietary knowledge from retrieval-augmented generation (RAG) systems and proposes “IKEA,” an implicit, benign-query attack that grows “anchor concepts” via history-aware sampling and a trust-region mutation in embedding space. Evaluations across several corpora and model/retriever pairings suggest higher extraction efficiency than prompt-injection baselines and show that a substitute RAG assembled from harvested content retains non-trivial utility. Topic probing for unknown domains and simple adaptive/DP-style defenses are also explored to characterize security–utility trade-offs.
1. This paper clearly specifies a realistic black-box threat model for RAG and delineates attacker capabilities and constraints with precision.
2. Empirical coverage is broad, spanning multiple LLM–retriever configurations and defenses, and the attack remains effective when common jailbreak/prompt-injection attacks are blocked.
3. The method is straightforward and reproducible—anchor-based benign queries guided by history-aware sampling and a cosine-bounded trust-region mutation—with prompts and hyperparameters disclosed.
1. Algorithmic novelty feels limited; the core components amount to history-penalized sampling and cosine-bounded mutations without formal coverage or sample-complexity guarantees.
2. This paper depends on a known or easily probed domain topic and centralized corpus semantics, making generalization to heterogeneous, multi-topic enterprise deployments uncertain.
3. This paper’s defense study leans on simplistic or utility-destroying mechanisms and omits deployable strategies like per-client rate limiting, query-set anomaly detection, and semantic drift monitoring.
4. This paper lacks an end-to-end economic analysis of the attack (token/time costs and sensitivity to generator quality), which is crucial for real-world risk assessment.
No more questions. |
Heavily AI-edited |
|
AnveshanaAI: A Multimodal Platform for Adaptive AI/ML Education Through Automated Question Generation and Interactive Assessment |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents an AI/ML learning platform that includes questions on various AI topics, covering different difficulty levels and Bloom’s taxonomy categories. The authors describe the dataset of questions used in the platform, along with its system architecture and user interface. They analyze the dataset’s statistics and explore the relationships between question category, difficulty, and taxonomy level to evaluate its quality. Finally, they report results from fine-tuning a language model using this dataset.
- The dataset is extensive and includes a range of taxonomy types and difficulty levels.
Overall, the paper reads more like a technical report describing an educational framework rather than a research paper presenting novel findings. The main issues are as follows:
- It’s not clear to me how the 3 research questions proposed on the first page are addressed in this work.
- The dataset generation process is not clearly described. It appears that instructors created the questions using the proposed framework, but the details of the annotation process are missing. For example, how difficulty levels were defined or determined.
- The distribution of data across categories and difficulty levels is not reported.
- The paper does not include any sample questions from the dataset, which makes it difficult to assess its quality or variety.
- The purpose of the fine-tuning experiments is unclear. Moreover, the questions generated by the model are not evaluated for quality or correctness. Reporting improved perplexity alone is not sufficient to demonstrate that the model produces high-quality questions.
- The scope of the dataset is also limited to the AI domain.
In addition to the points mentioned in the weaknesses, I have the following questions:
- How is user engagement measured? Are there any statistics or analyses showing how users interacted with the Q&A content on the platform?
- How was the difficulty scaling and cross-mode adaptation described in Section 2.2 conducted? |
Lightly AI-edited |
|
AnveshanaAI: A Multimodal Platform for Adaptive AI/ML Education Through Automated Question Generation and Interactive Assessment |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work presents AnveshanaAI, an application-based learning platform for artificial intelligence. With AnveshanaAI, learners are presented with a personalized dashboard with streaks, levels, badges, and structured navigation across domains such as data science, machine learning, deep learning, transformers, generative AI, large language models, and multimodal AI, with scope to include more in the future. The authors also design gamified tracking with points and achievements to enhance engagement and learning, while switching between Playground, Challenges, Simulator, Dashboard, and Community supports exploration and collaboration.
1. The paper is easy to read and understand.
2. The related work makes it easy for readers to understand the context.
3. The authors intentionally highlight lots of key words, which is good but can also be distracting.
1. The main contribution is more around the user interface design side, rather than model/algorithm side. As such, this work may be more suitable for conferences in UI design such as ACM Conference on Intelligent User Interfaces (IUI), instead of machine learning conferences.
2. The experimental evaluation focuses more on the dataset property itself, rather than a proposed model or algorithm. Thus, this paper may be more suitable for a dataset track paper.
4. What do the results of the fine-tuned Mistral 7B model mean? It is unclear what advancement the authors want to demonstrate, as no other models or datasets are compared. As such, the significance is unclear.
5. The current experiments cannot really answer the three proposed research questions, given their limitations: the lack of a novel model design, no model comparisons, no ablation studies, and no human-factor studies with real learners.
5. There is only one dataset from the authors for evaluation. No other public datasets are used, limiting the rigor and effectiveness of this work.
6. F1-score = 0.427 does not seem to be a good performance. Moreover, there are no baseline models nor ablation studies.
7. This work lacks insights about model contributions.
Overall, this work is not ready for formal publication. I suggest the authors introduce a novel model design and run more rigorous experiments to strengthen this work.
Please check my concerns and questions in Weaknesses section. |
Fully human-written |
|
AnveshanaAI: A Multimodal Platform for Adaptive AI/ML Education Through Automated Question Generation and Interactive Assessment |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors propose AnveshanaAI, an application-based learning platform designed to support artificial intelligence education.
The proposed AnveshanaAI is a platform designed to support artificial intelligence education and to demonstrate its application value.
- This work is primarily application-oriented, and the authors present it more as a project or program report rather than as a rigorous academic paper.
- Consequently, the writing, organization, and individual sections show limited connection to a well-defined research question or hypothesis. In its current form, it is difficult to identify clear evaluation criteria or a scientific contribution to assess.
NA |
Moderately AI-edited |
|
AnveshanaAI: A Multimodal Platform for Adaptive AI/ML Education Through Automated Question Generation and Interactive Assessment |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 0:
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents AnveshanaAI, an educational platform for AI/ML learning featuring automated question generation via fine-tuned Mistral-7B, gamification elements, and multiple interaction modes. The authors construct a 10K+ question dataset aligned with Bloom's taxonomy and deploy a web-based system with playground, challenges, simulator, and viva modes.
1. Addresses relevant problem in AI/ML education
2. Comprehensive dataset statistical analysis (entropy, Bloom's taxonomy coverage etc.)
3. Dataset publicly available on HuggingFace
4. Multi-modal platform design with gamification elements
1. Wrong venue: Applications papers at ICLR must contribute novel ML methods or representation learning insights. This paper proposes no new ML techniques.
2. Minimal validation of core claims - no human evaluation
3. Weak experimental methodology: fine-tuning on 10K utterances
4. Insufficient dataset validation
5. Missing critical comparisons with similar platforms
1. Most critical: Where are the engagement measurements claimed in the abstract? Did any students actually use this platform?
2. Have you validated that generated questions are factually correct and pedagogically appropriate?
3. Why is BERTScore F1 so low (0.427)? How does this compare to baselines?
4. How does question quality compare to human experts or GPT-4?
5. Why should this appear at ICLR rather than e.g., AIED? |
Heavily AI-edited |
|
Extrapolating Large Models from the Small: Optimal Learning of Scaling Laws |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the problem of uncertainty in scaling prediction in LLM evaluations and proposes a framework that cuts the evaluation costs when using smaller models to predict the larger model's performance.
- The problem of scaling prediction seems to be correctly introduced and motivated in the introduction
- General organization of the paper is in a good state.
- **W1) Unclear Definition 1.** The definition of the equivalent sample size/ESS (Definition 1) is central to this paper, but the definition is not clear. It does not clearly introduce the "equivalent sample size". The definition just says that a distribution has an equivalent sample size if the number of samples $n$ fulfills a certain condition. Is $n$ the equivalent sample size or what do you mean specifically? If your goal is to introduce a metric for quantifying prediction uncertainty you have to explicitly introduce this metric (this is not sufficiently clear in the current state). A clear introduction would be especially critical for researchers / practitioners to understand how to adapt your measure for prediction uncertainty.
- **W2) Missing justifications.** The derivation in line 180 suggests that Hoeffding's inequality can be applied; however, applying this inequality requires justification. You also do not clearly introduce the random variable to which you want to apply the inequality. Do you want to bound the expectation of the random variable $\ell(f(S_i), R_i)$, where $\ell$ is a loss function? The issue is that this random variable would need to be bounded in order to apply Hoeffding's inequality (in my understanding), which would require an additional assumption on the loss function (the standard statement is restated after the weakness list below). It would also be helpful to cite a source where this inequality is introduced, e.g. (Hoeffding, 1963).
- **W3) General writing quality is low.** In line 178, $\hat{P}_n$ is introduced as the empirical performance, but in Definition 1 it denotes a distribution (line 184). First, $X_*$ is a non-negative value (line 212), but later you consider $X_* \in [x_l, x_u]$ (line 254), and it remains unclear why that changes exactly. Sections 4 and 5 refer to a Theorem 4.1, but there is no such theorem (just a proposition). $\overline{XY}_M$ in line 216 is never used. The main Section 4 aims at quantifying "reliability" instead of uncertainty (line 169), and it remains unclear what is meant by reliability. The paper requires significant polishing.
**Minor weaknesses**
- The paper would generally benefit from a first figure.
- Typo in line 180. Do you mean $\epsilon_n$ instead of $\epsilon$ in the confidence interval?
- Example 1 (lines 231-237) is heavy on notation and rather challenging to follow.
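For reference, here is the standard statement of the inequality discussed in W2 (textbook Hoeffding, stated for the reader; it is not a claim about the paper's own derivation): for i.i.d. random variables $Z_1,\dots,Z_n$ bounded in $[a,b]$,

$$
\Pr\left(\left|\frac{1}{n}\sum_{i=1}^{n} Z_i - \mathbb{E}[Z_1]\right| \ge \epsilon\right) \le 2\exp\left(-\frac{2n\epsilon^2}{(b-a)^2}\right),
$$

so with probability at least $1-\delta$ the deviation is at most $(b-a)\sqrt{\ln(2/\delta)/(2n)}$. The boundedness of the $Z_i$ (here presumably $Z_i = \ell(f(S_i), R_i)$) is exactly the assumption whose justification W2 asks for.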
Overall, the paper requires major revisions throughout the entire manuscript; I cannot recommend it for acceptance in its current state.
What do we need the "effective sample size" for (line 192)? This is never used again (unless I'm missing something).
Further questions see weaknesses. |
Fully human-written |
|
Extrapolating Large Models from the Small: Optimal Learning of Scaling Laws |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper focuses on the confidence intervals of scaling law. It mainly puts forward a series of optimization strategies to address the problem that large model evaluation consumes excessive resources.
The authors' discussion revolves around two key perspectives:
1. As the model scales up, the confidence intervals widen;
2. As the volume of test data increases, the confidence intervals narrow.
Accordingly, the authors propose that more evaluations should be conducted at critical junctures, with the aim of reducing the widening of confidence intervals during the model scaling process.
The authors' discussion angles demonstrate remarkable novelty.
Moreover, they indeed possess considerable practical value—specifically, they can help pre-training teams proactively predict model performance.
- Some studies have proposed that scaling laws may not follow a simple logarithmic curve—especially with an increase in data repetition. This could lead to biases in the widening of the authors' confidence intervals, greatly undermining the effectiveness of the authors' method.
- The authors' discussion lacks consideration of learning rates. In fact, most model training processes may involve adjustments to learning rates during multi-stage training, which exerts a significant impact on model performance. Meanwhile, the usability of the authors' method is also compromised.
Chen, Zhengyu, et al. "Revisiting scaling laws for language models: The role of data quality and training strategies." Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.
Muennighoff, Niklas, et al. "Scaling data-constrained language models." Advances in Neural Information Processing Systems 36 (2023): 50358-50376.
Hernandez, Danny, et al. "Scaling laws and interpretability of learning from repeated data." arXiv preprint arXiv:2205.10487 (2022).
How do the authors plan to mitigate the aforementioned weaknesses? |
Moderately AI-edited |
|
Extrapolating Large Models from the Small: Optimal Learning of Scaling Laws |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper studies using small models to fit scaling laws and then extrapolate to large models to cut evaluation cost, but finds that such extrapolation is out-of-distribution and often unstable. The main contribution of the work is proposing an interpretable ESS metric, explaining the variance-amplification mechanism of out-of-range extrapolation, and providing an optimal selection method for small models.
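For reference, the variance amplification mentioned in the summary has a textbook analogue in ordinary least squares under homoscedastic Gaussian noise (this is a standard illustration, not a restatement of the paper's Proposition 5.1): if the scaling relation is fit on design points $x_1,\dots,x_n$ (small models, the rows of the design matrix $X$) and extrapolated to a design point $x_*$ (a large model), then

$$
\operatorname{Var}\big[\hat{y}(x_*)\big] = \sigma^2\, x_*^{\top}\big(X^{\top}X\big)^{-1} x_* ,
$$

which grows with the leverage of $x_*$ and is typically largest when $x_*$ lies outside the range spanned by the small-model designs; out-of-range extrapolation amplifies prediction variance in exactly this sense.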
1. The proposed definition and framework are novel and workable.
2. In implementing the algorithm, the objective function and constraints are clearly defined, and the experiments support the theoretical claims.
1. Reframing the Problem Motivation: The paper compellingly frames the problem around the "prohibitively expensive" cost of evaluation. This is a valid point. However, in the context of SOTA model development, the cost of training is often the dominant bottleneck by several orders of magnitude. The primary industrial use of scaling laws is typically for pre-training decision support (e.g., comparing architectures or data/model allocations) rather than saving post-training evaluation costs. The paper's impact could be significantly strengthened by discussing how the proposed ESS metric and optimal selection algorithm could be adapted for this (arguably more critical) pre-training decision-making scenario.
2. On the 'Model Family' and 'Universal' Link Function Assumption: The framework's practicality hinges on the assumptions made about 'model families.' In practice, scaling laws are often most needed to compare models with subtle but critical differences (e.g., aspect ratios, data mixes), which may already constitute different 'families.' The paper's real-world experiment (Section 7.2) and Remark 1 assume a 'largely universal' link function by fitting it across multiple, distinct model architectures (LLaMA, Qwen, Bloom, etc.). This assumption feels very strong and may not hold in the very scenarios where scaling prediction is most valuable. It would be helpful for the authors to provide more justification for this universality or discuss the sensitivity of their method to this assumption.
3. Reliance on a Correctly Specified Model Form: The paper's core theoretical contributions, such as the variance characterization (Proposition 5.1) and the optimal selection algorithm (Theorem 6.1) , are derived from a specific power-law model (Eq. 1) assuming Gaussian noise. This framework primarily addresses aleatoric uncertainty (noise) and extrapolation variance, assuming the model form itself is correct. As the authors thoughtfully acknowledge in Section 8, this is a key limitation. In practice, model misspecification (i.e., the true scaling relationship is not captured by Eq. 1) is often the largest source of error. This creates a risk that the proposed algorithm could lead to a solution that is "optimally confident" but potentially "incorrect" (i.e., high ESS for a biased prediction). A discussion on how ESS might also help detect or quantify this model form uncertainty would be a valuable addition.
4. Distinguishing Prediction 'Confidence' from 'Correctness': Following the point above, the paper's focus is on minimizing variance to maximize ESS. This is an important and non-trivial contribution. However, it would be beneficial to more clearly delineate this from the challenge of correctness (i.e., bias). The current framework does not seem to penalize a model that is "confidently wrong." It would strengthen the paper to discuss this distinction and whether the optimal selection strategy might inadvertently shift if the goal was to minimize total error (Bias + Variance) rather than just variance.
5. Simplification of Scaling Factors and Interaction Terms: The analysis in Section 5 simplifies the design factor $X$ to a single dimension (log model size) for illustrative purposes. While the model $Y=\alpha+\beta X$ could in principle handle a multi-dimensional $X$, it remains a linear model. This may not be sufficient to capture the complex, non-linear interaction terms between scaling factors that prior work [1] has shown to be critical. It is unclear how the optimal selection algorithm (Theorem 6.1) would behave if the true underlying scaling law had such cross-terms.
[1] Houyi Li, Wenzhen Zheng, Qiufeng Wang, Zhenyu Ding, Haoying Wang, Zili Wang, Shijie Xuyang, Ning Ding, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang. "Predictable Scale: Part II, Farseer: A Refined Scaling Law in Large Language Models." arXiv:2506.10972 [cs.LG]. https://arxiv.org/abs/2506.10972.
1. Regarding the problem motivation on evaluation cost: Could the authors elaborate on the practical scenarios where this cost is the primary bottleneck, especially in relation to the much larger training costs that scaling laws are typically used to predict? We are interested in how the framework might be repositioned to address the pre-training decision problem.
2. Regarding the 'universal' link function: This assumption is central to the experimental setup . How sensitive is the method to this assumption? In practice, one often needs to compare two very similar, but distinct, model families. How would a practitioner validate whether they can use a shared link function, or how would the method be adapted if they cannot?
3. Regarding the ESS metric (Definition 1) : This provides a valuable measure of confidence based on prediction variance. However, if the underlying scaling model (Eq. 1) is misspecified, ESS might be high for a prediction that is, in fact, highly biased. Could the authors comment on this potential trade-off? Is there a way to use ESS to also help diagnose model misspecification, rather than only quantifying variance?
4. Regarding interaction terms: The theoretical analysis relies on a linear relationship between the critical quantity $Y$ and the design factors $X$ (Eq. 1). How would the theoretical results change if the true scaling law involved non-linear interaction terms between factors, as suggested by other recent work on scaling laws [e.g., Farseer, Li et al., 2025]? |
Fully AI-generated |
|
Extrapolating Large Models from the Small: Optimal Learning of Scaling Laws |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the statistical guarantees of the observational scaling law. It introduces a metric called equivalent sample size to quantify prediction uncertainty. The authors also develop an efficient algorithm to maximize ESS within computational budgets. Synthetic and real-world experiments demonstrate the proposed method’s effectiveness.
- The paper tackles an important problem regarding formal guarantees for scaling laws.
- The proposed efficient model-selection method is technically sound. Some conclusions are counterintuitive and therefore noteworthy.
- The proposed framework largely relies on the functional form derived from observational scaling laws. This raises the risk of model misspecification, and the Gaussian assumption may not hold in that case.
- The computation saved by the proposed algorithm seems modest to me in experiments.
- Minor issue: It would be better to reference the appendix contents in the main paper. Currently, none of the proofs are cited.
Please refer to those in the weaknesses. |
Lightly AI-edited |
|
Noise-Guided Transport for Imitation Learning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper sets out to address imitation learning in low-data regimes. To this end, the authors define and learn a “potential”, $h$, and a trainable predictor, $f$. They then use the difference of the potential's expectations over the expert vs. agent distributions as an adversarial reward, and argue this is equivalent to optimizing an optimal transport objective provided that the potential is 1-Lipschitz. Large parts of the methodology are focused on providing a practical recipe for the 1-Lipschitz approximation. Empirically, results on MuJoCo Gymnasium tasks, including a more challenging humanoid control task, show strong performance in low-data regimes.
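For context, the equivalence the summary alludes to is the standard Kantorovich-Rubinstein duality for the Wasserstein-1 distance (a textbook fact, stated here for the reader rather than taken from the paper):

$$
W_1\big(P_{\text{expert}}, P_{\text{agent}}\big) = \sup_{\|h\|_{L}\le 1}\; \mathbb{E}_{x\sim P_{\text{expert}}}[h(x)] - \mathbb{E}_{x\sim P_{\text{agent}}}[h(x)],
$$

where the supremum runs over 1-Lipschitz potentials $h$. The difference of expectations used as the adversarial reward corresponds to the inner objective, which is why the quality of the 1-Lipschitz approximation is central to the method.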
I appreciate the authors' effort in presenting a complete and thorough work. I believe that the motivation is strong: the authors properly identify the problem with vanishing gradients in GANs and try to tackle this with a novel formulation, especially under low-data regimes. The literature is sufficiently covered, and the appendix is rather thorough. The paper explains key concepts and design choices in great detail, and attempts to support them with proofs. The authors also compare their method against multiple baselines, some re-implemented, and release the codebase, which helps the credibility and reproducibility of the empirical results.
**Introduction**
1. The statement “IRL … is shielded from compounding errors” is not accurate and overstates IRL's capabilities. IRL helps mitigate compounding errors, but it doesn’t guarantee avoidance. If the learned reward is misspecified/ambiguous, if the discriminator overfits, or if optimization stalls, the learned policy can still drift into regions where its behavior is poor and errors cascade.
2. The method, NGT, is defined in the intro, but the logic behind the chosen name is left to multiple pages into the manuscript, and remains unclear to the reader until then. Please consider aligning the explanation with the method's name.
3. The last paragraph of the introduction is not cohesive. The text keeps referring to Figure 3 and tries to justify or explain the results. The introduction is better off focusing on the high-level picture; explaining specific results can be done in the experiments. It could also benefit from being split into two paragraphs. Maybe consider a more structured, cohesive, and to-the-point version of this paragraph?
**Related work**
4. The authors perhaps misuse LaTeX's \cite and \citep commands, which makes the related work section harder to follow.
5. Some abbreviations are defined multiple times (like OT), and some are defined and never used again, like Maxent IRL.
6. The related work section could also use some structure, like categorizing the previous work based on their main perspective or application.
**Background**
7. What does it mean to reset $\gamma$? Isn't the discount factor fixed while learning?
**Methodology**
8. The definition of actor-critic methods belongs more to the background than the methodology; the same holds for the replay buffer and SAC.
9. Equation (1) is far from any density estimator, not only because it doesn't integrate to 1, but because it varies based on different seeds, scaling, etc. Given this, is calling it a pseudo-density still justified?
10. In Equation (2), the authors basically reparameterize the discriminator as a frozen random network + predictor, but it's still optimizing a divergence between expert and agent distributions. There's no clear reason why this avoids vanishing gradients. Don't gradient magnitudes depend on how the predictor generalizes, not on its target's randomness? A clarification on this is indeed needed here.
11. Why choose an exponential function of potential for the reward (other functions can provide positivity and a suitable range)? Exponentiating the reward makes the system exponentially sensitive to small changes in the potential, which can happen based on noise or initialization.
12. The reward is a derived quantity, not a general potential. There's no guarantee this mapping can represent all 1-Lipschitz functions. It's actually a highly restricted subset of functions shaped by the architecture, the loss, and the initialization of the networks. So how is this an effective search within 1-Lipschitz functions?
13. Upon reading the proof of Theorem 4.1, I believe there is an ambiguous logical step in the last line, where the comparison switches from an inequality (less than or equal) to an equality without any justification. Can you elaborate on this?
14. In addition to the previous concern, I think the basic properties of Lipschitz compositions are known, and Theorem 4.1 merely restates them? It’s useful to state, but the framing reads stronger than it is, more like part of the paper's contributions. Can you justify this?
**Experiments**
15. Running for only 4 random seeds could be insufficient, especially for low-data regimes with higher uncertainty. Is the low number of seeds due to computational problems of the method / baselines?
16. On the previous note, I expected to see larger error bands for fewer demonstrations, but the plots in the main text do not follow this trend. Why is this the case? And why does lower data not cause higher uncertainty in the predictions, not only in NGT but also across other baselines?
Please refer to the weaknesses, and justify the comments / answer the questions. I am open to discussion and will indeed be willing to raise the score if these points are sufficiently addressed.
Again I thank the authors for their time and dedication. |
Fully human-written |
|
Noise-Guided Transport for Imitation Learning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents Noise-Guided Transport, an off-policy imitation learning (IL) method that uses an optimal transport formulation of IL and adversarial learning for optimisation. One of the main components of the paper is a frozen random prior network that acts as a “regulariser” by moving the optimal solution away from agent data. The primary focus of the paper is to develop a method for IL in data-sparse settings, and its main claims are that it achieves high data and sample efficiency.
The paper presents an interesting reformulation of the imitation learning problem from the lens of optimal transport. The authors formulate an objective similar to several prior works in imitation learning (matching expert and agent distributions), and approach this as the optimisation of the earth mover distance using a histogram loss. While this is similar to WGANs (and WGAN-based IL), the paper presents some interesting theoretical results and a new learning strategy that uses a frozen random prior network to regularise the predicted rewards. The proposed ideas seem novel and worth exploring.
While the idea of the paper is interesting, I believe that the execution is lacking in presentation and experimentation. Overall, I find the paper poorly written and often unnecessarily hard to digest. Several sections are very long-winded, and explicit mathematical notation is often missing. A clearly written background would have significantly improved the clarity and impact of the paper. Additionally, I find some claims overly bold and the experimental section to be insufficiently rigorous. I address some specific concerns below:
1. Section 4.1 is very unclear. The paper does not motivate or introduce “learning from random priors”. A short background on this would greatly improve the clarity and impact of the paper. On line 133, “By the properties of neural networks,” is also a very vague reasoning. If I understand correctly, this is true only given infinite data and if the function is trained till global optimality. This assumption is clearly violated in the data-sparse settings that this paper analyses. Please correct me if I am mistaken.
2. On line 168, “Methods based on GANs optimise a JS-divergence between distributions $P_{expert}$ and $P_{agent}$, which suffers from mode collapse and vanishing gradients” is again an unsupported and potentially incorrect statement. If I am not mistaken, optimising the Jensen-Shannon divergence is not necessarily the root cause for GAN instabilities. Rather, the issue likely arises because of perfect discrimination and poorly aligned supports [1, appendix E in 2].
3. Could you please clarify how Eq 2 is different from the apprenticeship learning setup (Eq 6 in [3])? To my understanding, your objective of “distinguishing the expert and agent distributions” is closely aligned with several prior works in IRL. If this is the case, I request the authors to rewrite their claims to mention that their method is a reformulation of the distribution matching idea (but from an OT point of view).
4. The core functionality of this paper is quite close to Adversarial Motion Priors [4], a method that uses the WGAN style loss for adversarial IL. I believe that this is important prior work that should at least be mentioned in the analysis of this paper (and potentially used as a baseline).
5. The experiments in this paper only use 4 random seeds per task. In my opinion, this is quite low and not rigorous enough to rule out the possibility that the reported results are due to random fluctuations or chance rather than consistent underlying performance differences. A larger number of seeds (10+) would help to claim statistically significant improvements.
6. On line 460, the authors state that “we did not carry out per-task tuning for any method.” I believe this is a flawed methodology. Each method should be tuned individually for each task, or alternatively, all methods should be tuned to achieve the best average performance across all tasks. Using the same hyperparameters across tasks that differ significantly in dynamics, reward formulations, and optimal expert policies leads to an unfair comparison and potentially misleading conclusions. Different environments naturally require distinct exploration strategies and hyperparameter settings. Consequently, applying a fixed hyperparameter set across all tasks may cause some baselines (such as DiffAIL) to underperform relative to their true potential. I therefore request that the authors clarify their hyperparameter tuning strategy—specifically, for which environments the methods were tuned and for which they were kept fixed.
**Clarity Concerns:**
1. In section 3 (lines 110-117), could you please include the full expressions for the transition dynamics $P(s_{t+1} | s_t, a_t)$ and reward function $r(s,a)$. The line “policy π(a|s) maps states s ∈ S to distributions over actions a ∈ A, aiming to maximize cumulative rewards” is incorrect as the policy does not map states to distributions. It *is* a distribution over actions conditioned on a given state. Also, the RL objective is to maximise the *expected* cumulative reward (unless your work sets up the problem differently). Additionally, could you please clarify the line “We work in the episodic setting, where γ resets to 0 upon episode termination”?
2. The beginning of section 4 (lines 121-137) is phrased very awkwardly in my opinion. I would prefer if the terms are defined explicitly in mathematical language (eg: $r_{\zeta}: \mathbb{X} \rightarrow \mathcal{R}$ where $\mathbb{X}$ is either $\mathbb{S} \times \mathbb{S}$ …). Please also explicitly define $P_{expert}$ and $P_{agent}$. The definitions will change depending on $\mathbb{X}$ (eg: $P_{agent}(s,a) = \pi(a|s) \sum_{t} \gamma^t P_{agent}(s)$). I believe that, unless necessary for the rest of the paper, the explanation of the actor-critic methods can be relegated to the appendix.
3. I don’t find Figure 1 particularly informative. From what I understand, it is an aggregation of all the curves in Figure 2. However, such aggregation across different tasks loses any nuance/meaning. I also don’t understand how the normalization is done in this figure. Do you consider mean expert performance across tasks? If so, how do you account for the fact that all environments have different max expert performances? Further, the dotted lines in Figure 1 aren’t labelled, nor is Figure 1 referenced in the text. Could you also explain why the humanoid results were omitted?
**References**
[1] Arjovsky M, Bottou L. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862. 2017 Jan 17.
[2] Diwan AA, Urain J, Kober J, Peters J. Noise-conditioned Energy-based Annealed Rewards (NEAR): A Generative Framework for Imitation Learning from Observation. arXiv preprint arXiv:2501.14856. 2025 Jan 24.
[3] Abbeel P, Ng AY. Apprenticeship learning via inverse reinforcement learning. InProceedings of the twenty-first international conference on Machine learning 2004 Jul 4 (p. 1).
[4] Peng XB, Ma Z, Abbeel P, Levine S, Kanazawa A. Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG). 2021 Jul 19;40(4):1-20.
Included alongside weaknesses |
Fully human-written |
|
Noise-Guided Transport for Imitation Learning |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work proposes Noise-Guided Transport (NGT), a lightweight off-policy imitation learning method for low-data regimes. NGT formulates imitation as an optimal transport problem solved via adversarial training, requiring no pretraining or specialized architectures. It naturally models uncertainty, is simple to implement, and achieves strong performance on challenging continuous control tasks with as few as 20 expert transitions.
- The paper proposes a novel approach to enforce the Lipschitz condition and measure distributional distance under the Wasserstein-1 metric without relying on a gradient penalty.
- Empirical results demonstrate that the proposed method outperforms state-of-the-art adversarial imitation learning (AIL) approaches in terms of episode rewards.
- The ablation studies are comprehensive, and the paper provides theoretical justifications supporting the proposed method.
- The writing quality of the paper needs improvement. There are numerous unnecessary bolded words, and missing hyperlinks (e.g., line 706 in the appendix).
- Although the authors claim improved computational efficiency by avoiding gradient penalties for enforcing the Lipschitz condition, Table 3 shows that the proposed method does not demonstrate a clear advantage in computation time compared to existing approaches.
- Could the authors provide a more detailed explanation or empirical justification for the claimed computational efficiency of the proposed method compared to approaches that use gradient penalties? |
Moderately AI-edited |
|
Noise-Guided Transport for Imitation Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper studies the imitation learning problem in the low-data regime and introduces an off-policy approach called Noise Guided Transport (NGT). The method addresses the problem using adversarial training. The authors evaluate the performance of their approach in several continuous control environments, such as Ant, HalfCheetah, etc.
The paper is overall well written with some room for improvement. The problem it addresses is both interesting and relevant. The experimental results demonstrate that the proposed method performs well compared to state-of-the-art approaches.
There is some concern regarding the novelty of the proposed method and its positioning within the existing literature. The idea of using generative adversarial models for imitation learning, i.e., employing a loss function similar to Eq. (2), has been explored in prior work, including studies that establish its connection to optimal transport theory, e.g., [Xiao et al., 2019]. The authors should more clearly emphasize the novelty and specific contributions of their approach to distinguish it from existing methods in the literature.
The Lipschitz constant of the potential function plays an important role in the analysis of this method. However, this constant can become arbitrarily large for functions with sharp transitions, such as ReLU-type functions. How would such large Lipschitz constants affect the performance of the proposed method?
In the experimental results (Figure 2), it appears that the performance of NGT does not improve as the number of demonstrations increases, for instance, in the HalfCheetah, Humanoid, or Walker environments. This raises the question of whether the proposed method is consistent.
Please see the Weaknesses. |
Lightly AI-edited |
|
OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes Orthogonal Sparse Autoencoders that add an orthogonality penalty on decoder atoms to mitigate absorption and composition while keeping a classic SAE training loop. The penalty acts on the maximum cosine similarity within random chunks of the dictionary, which brings the cost near linear in the number of atoms. Experiments on Gemma 2 and Llama 3 layers compare OrtSAE to ReLU SAE, BatchTopK, and Matryoshka. Core metrics show slightly lower explained variance than BatchTopK but improved mean nearest neighbor cosine, reduced composition and reduced absorption, with SAEBench reporting a gain on spurious correlation removal and broadly similar performance elsewhere. Figures 1, 3 and 4 plus Appendix C document these effects and the chunking tradeoff.
The paper is well written; I would like to thank the authors for that. I noticed some really positive points (P) that I will describe here:
P1. Clear objective that is easy to implement. Section 3.3 gives a concrete loss with a practical chunk strategy, and the design slots into existing BatchTopK code paths. The result is near random init orthogonality on the mean nearest decoder cosine in Figure 3c.
P2. Sensible empirical sweep. The paper reports reconstruction and KL scores, interpretability via Autointerp, atomicity via composition and absorption, and downstream SAEBench. Figures 3 and 4, plus layer and chunk ablations in Appendix B and C, help readers understand where the gains come from.
P3. Useful qualitative decompositions. Figures 5, 8, 9 and 10 illustrate how a composite or overly narrow BatchTopK feature can be expressed as a sparse mix of more atomic OrtSAE features.
Now, even though I liked the paper, in my opinion there are still some weaknesses, some major (M) and some minor (m), which I will detail here:
M1. Reconstruction versus sparsity does not guarantee the right concept basis. The paper still optimizes explained variance plus sparsity and adds orthogonality, but none of these objectives ensure recovery of the correct semantic axes. I think SAEBench goes in the right direction here, but I would also like to see simple synthetic data with planted factors, to test subspace recovery and concept identifiability.
M2. Phenomenology of orthogonality is under-motivated. Section 3 argues that absorption and composition correlate with high decoder cosine and therefore orthogonality should help, which is plausible, but there is a broader literature on representation geometry that could predict when orthogonality should or should not help. A short review and positioning would help justify the method theoretically, and it would also guide where to apply it in depth.
M3. Global orthogonality versus conditional orthogonality at inference. The current loss pushes all decoder atoms apart. For downstream use, what we actually need is low mutual coherence among the subset of atoms that activate together on a given input. Please consider a conditional version that penalizes the Gram matrix of only the active atoms per batch, or an evaluation that measures conditional mutual coherence at inference. This could preserve reconstruction while targeting the failure mode more precisely.
M4. To continue on this, you should use the Babel score [1]. Since the paper cares about sets of correlated atoms, Babel and cumulative coherence are natural complements to mean nearest neighbor cosine. Please report Babel on the whole dictionary and on active subsets during inference, and compare to BatchTopK and Matryoshka at matched sparsity. This will also clarify whether orthogonality is achieved where it matters most.
[1] Greed is good: Algorithmic results for sparse approximation. Tropp.
M5. Missing baseline on MP-SAE. Matching pursuit and orthogonal matching pursuit induce a conditional orthogonality effect at selection time, which is directly relevant to absorption and composition. As such, MP-SAE [2] is extremely relevant here. Without it the geometric contribution is hard to isolate.
[2] From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit. Costa et al.
M6. Connection to Grassmannian frames and linear representation hypotheses. If the intended target is a dictionary that minimizes mutual coherence for a given size, this is close to Grassmannian frames [3] or even equiangular tight frame objectives [4]. The paper could formalize this perspective, relate the proposed loss to a relaxation of Grassmannian frame construction, and test whether learned dictionaries move toward frame-like spectra. A useful justification would be: if the linear representation hypothesis holds at the chosen layer, then an approximately Grassmannian decoder is preferred, hence the proposed loss is a practical surrogate. Empirically, the singular value spread of W_dec and the off-diagonal structure of the Gram matrix could be tracked across training.
[3] On the structures of Grassmannian frames, Haas et al.
[4] On the existence of equiangular tight frames, Sustik et al.
M7. Explain why reduced cosine should yield better concepts rather than just different ones. Figure 3c shows much lower nearest neighbor cosine and Figure 4 shows better atomicity metrics, but a causal story is missing; as it stands, the argument feels circular. Please add targeted interventions where you hold reconstruction constant and vary the orthogonality weight.
M8. Scope of downstream improvements is modest. SAEBench shows a clear gain on spurious correlation removal but other tasks are similar to baselines or favor Matryoshka. I would not expect large universal gains from pure geometry, but the paper could better delineate which tasks benefit and why, possibly linking to the conditional orthogonality hypothesis above. See Figure 6.
Now for the minor points:
m1. Consider reporting conditional Gram statistics. For each batch, compute the Gram matrix of the active atoms and report the distribution of its off-diagonal entries; this directly measures the property the method aims to improve (a minimal sketch of such a diagnostic is given after this list).
m2. Clarify the chunk size choice. Section 3.3 sets K as the ceiling of m over 8192. A short sensitivity table on chunk size versus reconstruction and atomicity could help, though I agree it would be a bonus.
m3. Add a short note on the slight explained variance drop. Figure 3a shows OrtSAE is a little below BatchTopK. A one paragraph discussion of when that trade is acceptable for users focused on interpretability would be helpful.
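To make m1 (and the conditional-orthogonality suggestion in M3) concrete, here is a minimal sketch of the diagnostic I have in mind, written in PyTorch against assumed tensor shapes; the names and shapes are my own assumptions, not the paper's code:

```python
import torch

def conditional_gram_stats(decoder_weight, codes, eps=1e-8):
    """Off-diagonal Gram statistics restricted to the atoms active on each sample.

    decoder_weight: (num_latents, d_model) decoder directions (rows are atoms).
    codes:          (batch, num_latents) sparse latent activations for a batch.
    Returns per-sample mean and max absolute cosine similarity among active atoms.
    """
    # Normalize atoms so the Gram matrix contains cosine similarities.
    atoms = decoder_weight / (decoder_weight.norm(dim=1, keepdim=True) + eps)

    mean_sims, max_sims = [], []
    for sample_codes in codes:
        active = torch.nonzero(sample_codes, as_tuple=True)[0]
        k = active.numel()
        if k < 2:
            continue  # coherence is undefined with fewer than two active atoms
        gram = atoms[active] @ atoms[active].T          # (k, k) cosine similarities
        off_diag = gram - torch.eye(k, device=gram.device)
        mean_sims.append(off_diag.abs().sum() / (k * (k - 1)))
        max_sims.append(off_diag.abs().max())

    if not mean_sims:  # no sample had at least two active atoms
        return torch.empty(0), torch.empty(0)
    return torch.stack(mean_sims), torch.stack(max_sims)
```

Reporting the distribution of these per-sample statistics for OrtSAE, BatchTopK and Matryoshka at matched sparsity would directly measure orthogonality where it matters at inference; a cumulative, Babel-style variant would be a straightforward extension.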
Cf Major point 1-8 and minor points 1-3. |
Fully human-written |
|
OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper tries to address the symptoms of feature absorption and feature composition by adding an orthogonality loss to SAE training. This loss term can be added to any SAE architecture, and penalizes the max cos sim of each latent with all other latents. The paper provides an optimized version of the algorithm that allows this to scale linearly with decoder size rather than quadratically. The paper shows that BatchTopK SAEs trained with this orthogonality loss achieve better scores on absorption and several other downstream metrics compared to standard SAEs, and are competitive with Matryoshka SAEs.
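To fix ideas about what is added to the training objective, here is a minimal sketch of a chunked max-cosine-similarity penalty of the kind described; this is my own illustrative code under assumed tensor shapes and a guessed reduction, not the authors' implementation:

```python
import torch

def orthogonality_penalty(decoder_weight, num_chunks=8, eps=1e-8):
    """Penalize each latent's max cosine similarity to other latents in its chunk.

    decoder_weight: (num_latents, d_model) decoder directions (rows are atoms).
    Splitting the dictionary into chunks keeps the cost roughly linear in
    num_latents (each chunk's similarity matrix has (num_latents / num_chunks)^2
    entries) instead of quadratic.
    """
    atoms = decoder_weight / (decoder_weight.norm(dim=1, keepdim=True) + eps)
    # Random permutation so chunk membership changes between training steps.
    perm = torch.randperm(atoms.shape[0], device=atoms.device)

    penalties = []
    for chunk in torch.chunk(atoms[perm], num_chunks, dim=0):
        sims = chunk @ chunk.T                    # pairwise cosine similarities
        sims.fill_diagonal_(-1.0)                 # exclude self-similarity
        penalties.append(sims.max(dim=1).values)  # max similarity per latent
    return torch.cat(penalties).mean()

# Hypothetical usage inside an SAE training step (`sae.W_dec` and the loss names
# are placeholders; gamma is the penalty weight discussed in the weaknesses below):
# loss = recon_loss + sparsity_loss + gamma * orthogonality_penalty(sae.W_dec)
```

If the actual loss clamps negative similarities, uses a soft maximum, or weights chunks differently, the overall shape of the computation is the same; the key point is that only within-chunk pairs are compared.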
- The logic behind the technique makes sense, and the training optimization also makes sense
- I like the MetaSAEs-based compositionality score, that seems like a good contribution too.
- The technique is simple and can be easily tacked-on to existing SAE architectures
- The metrics are impressive, and seems like a good architecture to add to the SAE toolbox
- The LLM SAE training details all look very reasonable (enough tokens, reasonable width, reasonable dataset choice, reasonable learning rate, etc...)
- The paper does not explore the sensitivity of the SAEs to the parameter $\gamma$. In the paper, $\gamma$ is set to 0.25, but it's not clear how important this specific value is or how it was chosen.
- The paper makes an implicit assumption that "true features" should be very close to orthogonal, but there is some evidence this is not always the case. E.g. there has been work showing that days of the week are represented in a circle in a 2d plane [1], so the cosine similarity between these feature directions will naturally be high. Regardless, this assumption should be made explicitly and implications should be discussed.
- The plots don't have error bars or stdev. This isn't a huge deal for plots where there's an obvious trend, but for bar plots or plots that look noisy (e.g. sparse probing) this would be helpful to have.
- It's unclear when one would choose to use OrtSAEs over Matryoshka SAEs based on the results.
### References
[1] Engels, Joshua, et al. "Not All Language Model Features Are One-Dimensionally Linear." The Thirteenth International Conference on Learning Representations.
- Is there a "rule of thumb" for how to set $\gamma$ ? What happens if it's set too high, and what is a reasonable range for it? As a practitioner training OrtSAEs this is very important to know. I'm worried that setting this incorrectly can easily break the SAE.
- Figure 1 feels a bit misleading since it's not showing Matryoshka SAEs. Matryoshka SAEs are the obvious comparison to make. It feels like this was left out because it makes OrtSAEs look less impressive when compared with Matryoshka.
- What are the group sizes used for the Matryoshka SAEs in this paper? I didn't see it specified anywhere, including the appendix. I think the number of groups and size of the groups will also impact the results for Matryoshka SAEs. |
Fully human-written |
|
OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents a regularizer for sparse autoencoders. It encourages orthogonality between the weights of the different latents by:
1) sampling random blocks of latents
2) computing and penalizing cosine similarity between different weights within each block.
This method is cheap (a stated advantage over Matryoshka-SAEs).
In experiments, it seems effective at reducing feature absorption and composition, which is the stated motivation, and performs competitively in other regards.
The problem is significant and timely, as Sparse Autoencoders are a significant and active area of research, and this work attacks some substantial, qualitative issues with common approaches.
I expect a method like this to be a standard baseline for other approaches, as it is simple, cheap, and appears effective according to various evaluations.
The paper is well written, with only a few minor issues.
(moderate): When I google "orthogonal autoencoder", I encounter several prior works, e.g. "Orthogonality-Enforced Latent Space in Autoencoders: An Approach to Learning Disentangled Representations", Cha & Thiyagalingam, ICML 2023. I'm not sure how similar any of these works actually are, but I think the submission should include references to this and any other such works in any case, e.g. to reassure the reader that they are in fact quite different, if that is the case. If they are substantively similar, I don't think it's a major problem for this work, it just needs to be noted.
(moderate): I feel like Figure 5 (and Appendix I) are missing similar decomposition results for other methods, unless I misunderstand, there is no side-to-side comparison included at the moment.
(moderate): The technique seems like a blunt tool. I'm not sure how competitive such an approach will prove to be compared to alternatives that tackle the problems more head-on. The paper doesn't explain how encouraging orthogonality is expected to change what the SAE ends up learning in any detail; a more thorough analysis would be welcome. The submissions states: "Feature absorption and composition lead to redundant representations where multiple latents capture overlapping concepts, which results in high cosine similarities between them. This suggests that enforcing orthogonality between SAE latents could provide a principled approach to mitigate these issues." The first sentence of this quotation is unsupported, and the second sentence doesn't sound very principled -- the way it's described, this method is treating a symptom of the problem, rather than the problem itself. I believe the submission could strengthen the motivation here.
(minor): The statement ‘Sparse Autoencoders (SAEs) have gained significant traction for interpreting LLMs, addressing the challenge that LLMs often function as “black boxes”’ is vague and suggests a higher level of effectiveness in addressing the “black box” issue than I think is warranted.
(minor): I didn't find the descriptions of feature absorption very clear, and had to look at the referenced paper to understand this concept. Relatedly, Section 3.2 felt repetitive with previous content. I recommend focusing on explaining the issue more thoroughly there, and editing to remove redundant content.
(nit): “penalize high cosine similarities between SAE latents”: it would make more sense to say you're penalizing similarities between the weights.
(typo): Line 420: Appx. D should link to Appendix I.
Do you have any explanation of the qualitatively different results on different downstream benchmarks (Figure 6)?
Is the Cross-Model Feature Overlap Analysis sensitive to the threshold of .2?
In Figure 3d, the performance of different methods is quite different for high sparsity, with Relu-SAE performing the best and OrtSAE the worst. Do you have an explanation for this? Do you think this is a potential problem for the method?
Isn't the ultimate goal interpretability? With absorption and composition being relevant inasmuch as they make the results less interpretable? Given that the OrtSAE only achieves comparable levels of interpretability according to Figure 3d, I wonder if it’s actually achieving the aim of the work?
“The main contribution of OrtSAE is the introduction of a new orthogonalization penalty” -- Are there others?
Since the method is introduced as an approximation, it would be interesting to run some experiments that don't involve approximation to see if they in fact yield better results.
Is it necessarily the case that "Feature absorption and composition" result in "high cosine similarities between [latents]"? |
Fully human-written |
|
How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses how to rigorously evaluate LLM-generated test cases by reframing benchmark construction as a binary matrix problem over wrong codes (rows) and golden tests (columns). The central idea is to select a compact yet representative set of wrong codes that preserves the matrix’s diagnostic capacity while avoiding redundancy and score inflation. Concretely, the authors require the kept rows to form a row basis whose size equals the matrix rank (capturing all independent error modes), and among all such bases, they choose one that maximizes diversity by minimizing average pairwise Jaccard similarity of failure signatures.
They propose WrongSelect, an efficient approximation combining principled pre-filtering and random-restart local search. Pre-filtering drops problems with any all-ones test column (hack-prone) and removes trivial wrong codes that fail ≥80% of tests; then the local search swaps rows in/out while maintaining rank to minimize the diversity objective.
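To make the selection objective concrete, here is a minimal sketch of a rank-preserving, Jaccard-minimizing local search in the spirit of the described procedure. It is illustrative only: it uses NumPy's rank over the reals (the paper may define rank differently), omits the pre-filtering and random restarts, and is not the authors' WrongSelect implementation.

```python
import numpy as np

def avg_pairwise_jaccard(rows):
    """Mean Jaccard similarity between binary failure signatures (0/1 rows)."""
    sims = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            union = np.logical_or(rows[i], rows[j]).sum()
            inter = np.logical_and(rows[i], rows[j]).sum()
            sims.append(inter / union if union > 0 else 0.0)
    return float(np.mean(sims)) if sims else 0.0

def select_wrong_codes(matrix, iters=1000, seed=0):
    """Pick a row subset of size rank(matrix) whose rows stay linearly independent,
    while greedily lowering the average pairwise Jaccard similarity."""
    rng = np.random.default_rng(seed)
    rank = np.linalg.matrix_rank(matrix)

    # Start from an arbitrary maximal set of independent rows (a row basis).
    selected = []
    for idx in rng.permutation(len(matrix)):
        candidate = selected + [int(idx)]
        if np.linalg.matrix_rank(matrix[candidate]) == len(candidate):
            selected = candidate
        if len(selected) == rank:
            break

    best = avg_pairwise_jaccard(matrix[selected])
    for _ in range(iters):  # local search: try swapping one row for another
        out_pos = int(rng.integers(len(selected)))
        in_idx = int(rng.integers(len(matrix)))
        if in_idx in selected:
            continue
        candidate = selected.copy()
        candidate[out_pos] = in_idx
        if np.linalg.matrix_rank(matrix[candidate]) != rank:
            continue  # the swap must preserve the row-basis property
        score = avg_pairwise_jaccard(matrix[candidate])
        if score < best:
            selected, best = candidate, score
    return selected
```

In the paper's setting this kind of search would presumably be run after the pre-filtering step described above, with restarts to reduce sensitivity to the initial basis.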
1. The paper presents a fresh and elegant formulation that connects test selection to linear algebra.
Modeling wrong-code failure signatures as rows in a binary matrix and selecting a rank-preserving, diversity-maximizing row basis is a principled way to eliminate redundancy while retaining all independent error modes. This perspective clarifies what “good coverage” means and gives a theoretical upper bound on minimal tests.
2. The empirical results are strong.
The proposed selection method yields a compact, diverse benchmark that meaningfully stresses current test generators, and the reported exclusion rates show clear, consistent improvements over baselines. The ablations and end-to-end evaluations convincingly demonstrate both effectiveness and practicality at scale.
1. The approach assumes access to a sufficiently rich pool of golden tests (public + private) to build informative failure profiles.
In many real-world settings, curating or synthesizing high-quality golden tests is costly and time-consuming. This reliance reduces the method’s portability and makes it difficult for third parties to reproduce or extend the benchmark to new domains or problem sets without significant investment.
2. While the formulation is theoretically sound (rank preservation plus diversity), the paper provides limited qualitative analysis to illustrate practical effectiveness.
Concrete case studies—e.g., representative wrong-code examples retained vs. removed, error modes uncovered by the selected basis, or developer-facing insights derived from the reduced matrix—would strengthen the claim that the method improves understanding and diagnosis beyond raw metrics.
Could you provide a concrete code example illustrating the effectiveness of your approach? |
Fully AI-generated |
|
How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective |
Soundness: 3: good
Presentation: 2: fair
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a new benchmark to evaluate LLMs' ability to generate high-quality test cases for coding problems. Quality is determined by how efficiently the test cases cover the error space of the problem. To better represent the error space, the authors form an error matrix consisting of all submissions and their pass/fail results on all test cases. Since the rank of this 2D matrix conveniently represents the "coverage" of submission patterns, they develop a greedy algorithm to maximize the diversity of wrong code submissions, which in turn yields a more compact submission set for efficient evaluation.
The authors then evaluate several test case generation methods on various LLMs on two major metrics – the accuracy of generated test cases (PR) and the degree of successful filtering of wrong code submissions (HR). The evaluation results show that the choice of generation method matters significantly more than the choice of the underlying LLM. At the same time, the relatively underwhelming numbers (the best combination reaches only ~60% HR) point to the importance of constructing a high-quality set of wrong code – the analysis on the unfiltered version of the dataset reveals inflated scores, which are inaccurate.
* A meaningful contribution for a much-needed area, both in terms of efficiency and framework. The method can see practical use cases.
* The proposed method for selecting the minimal set of wrong code submissions is principled and reasonable.
* The outcome dataset effectively pinpoints the need for better TCG methods, citing the underwhelming numbers.
* Apples-to-apples comparison against previous work: It would be stronger if you could replicate the other test case evaluation methods on the same problem set and show how TC-Bench more critically measures the TCG quality of LLMs. I'm aware this is done partially by comparing the method against the "All WC" counterpart.
* WrongSelect's robustness: It's uncertain if the greedy algorithm yields a stable set of problems.
* The need for translation: I assume the dataset is entirely sourced from non-English problems. The choice of data sources affects the quality and the number of (valid) wrong code submissions per problem. Have the authors explored English data? Reducing the submission count to less than 2% depends on the population of wrong code submissions.
---
Below are NOT weaknesses of the method, but I think they are important to raise for a higher-quality submission. I am leaving them here because they do not quite fit under "Questions".
* Presentation 1: overall there is heavy use of acronyms, such as "some WCs labeled as WA under GTs produce RE or TLE when executed on ATs". I got somewhat used to it by the end, but I often had to go back and refresh my glossary. I would strongly suggest fixing the inconsistency (e.g., "wrong code" vs. WC) and reducing the overall number of abbreviations.
* Presentation 2: plot styles are inconsistent – mostly Comic Sans, but some suddenly switch to a serif font (Times?). It looks patched together; I recommend following a single style.
* Appendix B has no content
* How much variance in PR / HR do you observe by running the optimization multiple times? When you say convergence, do they arrive at the same set of WCs?
* In the analysis comparing the method against "All WCs", are the WCs taken before or after pre-filtering? |
Fully human-written |
|
How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work addresses the challenge of evaluating automatically generated test cases for code using LLMs. The authors propose a new framework that models the relationship between code and test cases as a binary matrix, where the matrix rank reveals the minimal number of distinct wrong codes and test cases needed for evaluation. They introduce WrongSelect, an efficient algorithm for selecting a maximally diverse and representative set of wrong codes to form a compact benchmark, TC-Bench. Built from millions of competitive programming submissions, TC-Bench offers an efficient, diverse, and inflation-resistant benchmark for assessing test-case generation. The evaluation across 13 leading LLMs shows that even the best current methods achieve only around 60% fault detection, highlighting significant room for improvement in automated test generation and evaluation.
- Formulating the problem as a binary code-test matrix
- Number of evaluated LLMs
- Approximation algorithm
- Benchmark construction
- Source of data from programming contests
- Theoretically OK, but what about practicality?
- Not studying other similarity metrics
- How applicable is this technique in practice/industry?
- Do you think your technique generalizes to other datasets, given that your evaluation only uses programming contests?
- Do you think other similarity metrics would fit better than Jaccard? |
Fully human-written |
|
How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses an important problem in LLM evaluation: how to construct an efficient, reliable, and unbiased benchmark for assessing the quality of automatic test case generation methods. The core contribution is a novel framework based on binary-matrix theory. This framework uses the matrix rank to simultaneously determine the minimal number of wrong codes required for evaluation and a theoretical upper bound on the number of test cases needed for full error pattern coverage. Based on this framework, the authors devised WrongSelect to find an approximate solution to this NP-hard problem, culminating in the construction of TC-Bench.
1. The paper tackles the crucial challenge of evaluating test case generation methods for code.
2. The authors propose WrongSelect, an efficient approximation algorithm combining pre-filtering and random-restart local search. This provides a reasonable and effective solution to the NP-hard problem of selecting a maximally diverse diagnostic basis from a vast collection of wrong codes.
3. The authors release a compact and diverse benchmark, TC-Bench. By design, TC-Bench can reduce the computational cost of evaluation and is resistant to score inflation.
1. Insufficient discussion of related work. The paper could be strengthened by a more thorough discussion of prior work that also leverages code-test properties, for instance CodeT [1], or other approaches that also use binary matrices, such as B4 [2].
2. Limited theoretical justification for the rank-coverage duality. A central claim of the paper is that the matrix rank provides an upper bound on the minimum number of test cases required for fault coverage, yet the paper lacks a formal proof or a detailed discussion to substantiate this claim. A deeper analysis would strengthen the claim beyond intuition.
3. Limitations of the chosen diversity metric. The paper employs Jaccard similarity to measure overlap, which implicitly assumes that all GTs are of equal importance. In practice, some GTs may represent extremely rare and critical edge cases, while others are more trivial; the Jaccard metric cannot capture this weighted difference. Have the authors experimented with or considered other similarity metrics that could account for the varying significance of different test cases (see the sketch after the references below)?
[1] Chen, Bei, et al. "CodeT: Code Generation with Generated Tests." The Eleventh International Conference on Learning Representations.
[2] Chen, Mouxiang, et al. "B4: Towards optimal assessment of plausible code solutions with plausible tests." Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 2024.
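To make point 3 above concrete, here is a minimal sketch of a weighted-Jaccard alternative (the GT weights, failure sets, and function names are hypothetical illustrations; the paper itself uses plain Jaccard):

```python
def jaccard(a, b):
    """Plain Jaccard similarity over the sets of GTs that two wrong codes fail."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def weighted_jaccard(a, b, w):
    """Weighted variant: each GT contributes its weight instead of 1, so overlap
    on rare or critical edge-case tests counts for more than overlap on trivial ones."""
    a, b = set(a), set(b)
    inter = sum(w[t] for t in a & b)
    union = sum(w[t] for t in a | b)
    return inter / union if union else 0.0

# Hypothetical example: gt3 is a rare edge case, gt0 is a trivial test.
gt_weights = {"gt0": 0.1, "gt1": 1.0, "gt2": 1.0, "gt3": 5.0}
wc_a = {"gt0", "gt3"}          # fails the trivial test and the edge case
wc_b = {"gt0", "gt1", "gt2"}   # fails the trivial test and two ordinary tests

print(jaccard(wc_a, wc_b))                       # 0.25   -> moderate overlap
print(weighted_jaccard(wc_a, wc_b, gt_weights))  # ~0.014 -> nearly disjoint
```

In this toy case the plain Jaccard reports moderate overlap, whereas the weighted variant treats the two wrong codes as nearly disjoint because their only shared failure is the trivial test.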
See Weaknesses. I will be glad to raise my score if the authors could provide a sufficient rebuttal. |
Lightly AI-edited |
|
From Fragile to Certified: Wasserstein Audits of Group Fairness Under Distribution Shift |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper studies the instability of existing group fairness functionals under distributional shifts. It proposes a distributionally robust framework that assesses fairness in the worst case over all test distributions within a Wasserstein ball centered at the empirical data. The key concept, $\varepsilon$-Wasserstein Distributional Fairness (WDF), provides a formal criterion for certifying fairness robustness. The authors present theoretical results ensuring feasibility and consistency, showing that the proposed framework offers stable and reliable fairness evaluation beyond a single observed dataset.
The main contribution lies in formulating fairness auditing as a distributionally robust optimization (DRO) problem under Wasserstein uncertainty. This approach generalizes group fairness notions such as demographic parity and equalized odds within a unified framework. The paper derives a tractable reformulation via strong duality and establishes finite-sample guarantees connecting empirical and population-level fairness. The method provides formal robustness certification, addressing a key limitation of prior fairness approaches that lack distributional reliability. Overall, the work offers a clear theoretical foundation and a systematic formulation for fairness auditing under distribution shift.
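For readers less familiar with this machinery, the following is a schematic of the type of formulation and duality involved, written for a demographic-parity-style gap; the paper's exact fairness functional, transport cost, and dual reformulation may differ:

```latex
% Worst-case (certifiable) disparity over a Wasserstein ball of radius eps
\mathrm{WDF}_{\varepsilon}(h)
  = \sup_{Q:\; W_c(Q,\widehat{P}_n)\le \varepsilon} \Delta(h;Q),
\qquad
\Delta(h;Q) = \bigl|\mathbb{E}_Q[h(X)\mid A=0] - \mathbb{E}_Q[h(X)\mid A=1]\bigr|.

% For an expectation-type objective E_Q[f(Z)], standard Wasserstein-DRO strong
% duality gives a one-dimensional reformulation over a multiplier lambda:
\sup_{Q:\; W_c(Q,\widehat{P}_n)\le \varepsilon} \mathbb{E}_Q[f(Z)]
  = \inf_{\lambda \ge 0}\Bigl\{ \lambda\varepsilon
    + \frac{1}{n}\sum_{i=1}^{n} \sup_{z}\bigl(f(z) - \lambda\, c(z, Z_i)\bigr) \Bigr\}.
```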
While the paper presents a clear and well-motivated formulation, its degree of theoretical novelty appears somewhat limited. The proposed framework essentially extends existing Wasserstein DRO theory to a fairness functional, without introducing new theoretical or algorithmic components. Although the application of DRO to fairness auditing is conceptually meaningful, many of the theoretical results—such as dual reformulation and finite-sample guarantees—are direct consequences of established DRO literature rather than novel contributions.
Beyond the issue of limited novelty, several other practical aspects deserve further consideration:
1. The claim of improved stability under distributional shifts is not fully substantiated. Figure 2 demonstrates that WDF serves as an upper bound on fairness violation; however, this alone does not establish empirical stability (in fact, the variability of the empirical estimates and of the WDF estimates appears to be quite similar). The authors motivate the introduction of WDF by showing in Figure 1 that existing fairness functionals are fragile under distributional shifts, but this qualitative observation does not quantitatively confirm that WDF indeed yields more stable fairness estimates. Analyzing the variability of the fairness functionals under resampling or perturbations (see the sketch after this list) would provide stronger evidence for the robustness claims beyond merely showing an upper bound and would offer a more comprehensive validation of the framework.
2. The framework is theoretically sound, but its scalability to high-dimensional or large-scale fairness auditing tasks remains unclear. Solving Wasserstein-based optimization problems typically requires large-scale linear programming or optimal-transport computation, which may be computationally demanding in practice. Moreover, the proposed formulation may not readily extend to general fairness functionals, as many fairness notions are non-differentiable or non-convex, making their integration into the Wasserstein framework non-trivial. A discussion of approximate or scalable alternatives would enhance the practical impact of the work.
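To make the resampling analysis suggested in point 1 concrete, a minimal sketch is given below; `audit_empirical`, the toy data, and the bootstrap harness are hypothetical stand-ins rather than the paper's estimators:

```python
import numpy as np

rng = np.random.default_rng(0)

def audit_empirical(A, Yhat):
    """Plain demographic-parity gap on the given sample (illustrative)."""
    return abs(Yhat[A == 0].mean() - Yhat[A == 1].mean())

def resampling_spread(audit_fn, A, Yhat, n_boot=200):
    """Bootstrap the audit statistic and report its standard deviation --
    the kind of variability comparison requested in point 1."""
    n = len(A)
    vals = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        vals.append(audit_fn(A[idx], Yhat[idx]))
    return float(np.std(vals))

# Synthetic audit set standing in for the real data.
n = 2000
A = rng.integers(0, 2, size=n)
Yhat = (rng.random(n) < 0.5 + 0.05 * A).astype(float)

print("empirical DP-gap spread:", resampling_spread(audit_empirical, A, Yhat))
# Running the same harness with the paper's eps-WDF estimator in place of
# audit_empirical would yield the empirical-vs-WDF variability comparison
# that point 1 asks for.
```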
The choice of the Wasserstein radius $\delta$ is fixed (e.g., 0.01) without explicit justification or sensitivity analysis. Since $\delta$ controls the size of the uncertainty set, it plays a crucial role in determining the robustness–accuracy balance. Providing a principled selection rule or an empirical calibration strategy for $\delta$ would strengthen the practical validity and reproducibility of the results. |
Fully AI-generated |
|
From Fragile to Certified: Wasserstein Audits of Group Fairness Under Distribution Shift |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses fairness assessment under distribution shifts. The core contribution is a Wasserstein distributionally robust optimization framework for certifying worst-case group fairness. Specifically, the authors define $\epsilon$-Wasserstein Distributional Fairness ($\epsilon$-WDF) as a robust audit target that quantifies worst-case group fairness over a set of plausible test distributions, modeled as a Wasserstein ball centered at the empirical data distribution. Recognizing the computational challenges of optimizing over infinite-dimensional sets of distributions, the authors derive tractable reformulations for $\epsilon$-WDF and its associated DRO regularizers. To address the out-of-sample problem, where only empirical data are observed, the paper establishes finite-sample certification guarantees for auditing fairness. The theoretical insight is that the worst-case fairness estimated from finite samples upper-bounds the true worst-case disparity under shifts within an $\epsilon$-Wasserstein ball.
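Schematically, the certification guarantee described above has the following shape (my paraphrase of the stated result; the precise radius inflation $r_n(\beta)$, disparity functional $\Delta$, and regularity conditions are those of the paper and are not reproduced here):

```latex
% Delta(h;Q) denotes the group-fairness disparity of classifier h under Q.
% With probability at least 1 - beta over the n audit samples,
\sup_{Q:\; W(Q,\,P)\le \varepsilon} \Delta(h;Q)
  \;\le\;
  \sup_{Q:\; W(Q,\,\widehat{P}_n)\le \varepsilon + r_n(\beta)} \Delta(h;Q),
% i.e., the worst-case disparity computed from the empirical distribution with a
% slightly inflated radius upper-bounds the true worst-case disparity under shift.
```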
**Originality**:
The paper's primary original contribution lies in the novel formulation of $\epsilon$-WDF. By proposing a Wasserstein distributionally robust framework, the authors introduce a new and more resilient approach to fairness assessment that directly addresses the fragility of existing metrics under distribution shifts. This re-framing of the fairness audit problem, moving towards worst-case certification over plausible distributions, is a significant contribution. While DRO and Wasserstein distances have been applied in various machine learning contexts (especially distribution shifts), their comprehensive integration specifically for certifying group fairness showcases the originality of this framework.
**Quality**:
The theoretical development is rigorous, starting from a clear problem statement and building towards a robust solution.
**Clarity**:
The paper is well-written and generally clear. The logical flow from defining the problem to proposing $\epsilon$-WDF, detailing its theoretical underpinnings, and outlining the estimation method is easy to follow. The mathematical notation is consistent.
**Significance**:
The significance of this work is substantial, with implications for both theoretical research and practical applications. By providing a framework to certify worst-case group fairness, the paper offers a more reliable basis for auditing algorithmic systems. The proposed DRUNE estimator has the potential to be adopted in practical auditing and training pipelines.
- **Computational Complexity**: the authors acknowledge that Stage 1 of Algorithm 1 has an $O(d^3)$ per-point time cost and that Stage 2 runs in $O(N\log N)$ time. While this may be efficient for a handful of closest-point computations, it could still pose a significant bottleneck in high-dimensional feature spaces (especially those built on textual or visual embeddings).
- **Comparison to Related Works**: the related work section is underdeveloped and does not engage with a broad range of relevant studies. In particular, it fails to explicitly discuss or compare the proposed framework with Chen et al. (2022) [1], which also addresses fairness certification on the target distribution under small distributional shifts.
[1] Yatong Chen, Reilly Raab, Jialu Wang, and Yang Liu. 2022. Fairness transferability subject to bounded distribution shift. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22). Curran Associates Inc., Red Hook, NY, USA, Article 819, 11266–11278.
1. The notations $\mathcal{S}\_{\delta, q}$ and $\mathcal{I}\_{\delta, q}$ are a bit repetitive. If $f$ were defined via the absolute value, i.e., $f=|h(x) \phi(\cdot, \cdot)|$, then the two quantities $\mathcal{S}\_{\delta, q}$ and $\mathcal{I}\_{\delta, q}$ could be consolidated into the single, unified notation $\mathcal{S}\_{\delta, q}$. In that way, Eqs. (9), (10), and (11) could be further reduced to a single regularizer, and you would not need to swap the 0 and 1 in the subsequent expressions.
2. The time complexity of the DRUNE estimator is $O(d^3)$ to compute $d_i$ for each data point and $O(N\log N)$ for solving the knapsack problem. However, if one wants to incorporate the estimator into an actual fairness optimization problem, this cost may be incurred at every iteration (since the model parameters change), which may make it impractical for training a fair classifier.
3. How does the Wasserstein regularizer compare to the bounds in Chen et al. (2022) [1]? They also provide a provable worst-case guarantee of group fairness. If we only look at the data examples within the thin boundary region, it seems that the regularizers in your work could be converted into the bounds in theirs.
[1] Yatong Chen, Reilly Raab, Jialu Wang, and Yang Liu. 2022. Fairness transferability subject to bounded distribution shift. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22). Curran Associates Inc., Red Hook, NY, USA, Article 819, 11266–11278. |
Lightly AI-edited |