ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 15899 (21%) 4.43 3.58 3687
Heavily AI-edited 3233 (4%) 4.22 3.59 2990
Moderately AI-edited 7082 (9%) 4.20 3.61 2722
Lightly AI-edited 16648 (22%) 4.15 3.68 2746
Fully human-written 32938 (43%) 4.13 3.62 2917
Total 75800 (100%) 4.21 3.62 3026
Title Ratings Review Text EditLens Prediction
GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper posits that the performance gap between Vision-Language Models (VLMs) and humans in GUI automation stems from a core knowledge deficit, rather than failures in reasoning or planning alone. The authors introduce GUI Knowledge Bench, a diagnostic benchmark designed to evaluate this specific knowledge gap. The benchmark is structured around three dimensions derived from common agent failure modes: Interface Perception (recognizing widgets, layouts, and system states), Interaction Prediction (predicting action types and their effects), and Instruction Understanding (planning and verifying task completion). 1. The primary strength is the conceptual shift in evaluation. Instead of focusing on end-to-end task success, which conflates multiple agent capabilities (e.g., reasoning, planning, knowledge, grounding), the benchmark isolates and directly probes the knowledge component. 2. The benchmark is comprehensive, covering 6 platforms and 292 applications, which ensures the findings are generalizable. 1. Conflation of "Knowledge" and "Reasoning" in the Benchmark: The paper's primary claim is to evaluate GUI "knowledge" ("Different from most existing benchmarks that primarily evaluate task success, which mainly focus on the grounding, reasoning, and planning") as distinct from "reasoning" and "planning." However, several tasks within the benchmark, particularly in the "Ins Understanding" (e.g., Task Planning) and "Interaction Prediction" dimensions, appear to inherently require complex reasoning or logical capability of LMs. 2. There is a contradiction in the paper's terminology. The authors claim to test 'base VLMs,' but the models listed are clearly instruction-tuned, optimized, and/or RLHF'd (e.g., 'GPT-5-Chat'). 'UITARS-1.5-7B' is itself a fine-tuned GUI agent (SFT/RL-trained on top of Qwen2.5-VL). This invalidates the premise of testing 'base' models. 3. The OSWorld validation in Sec 4.4 provides evidence that providing plans improves agent performance. This validates the 'Instruction Understanding' gap. However, the link for 'Interface Perception' and 'Interaction Prediction' is supported primarily by two qualitative examples described in words. A stronger validation would require a systematic correlation analysis. 1. The benchmark's construction relies heavily on 'GPT-5' to generate question-answer pairs. This methodology is concerning as it may introduce artifacts or biases specific to the generator model, which could then be reflected in the evaluation of other VLMs. Does this benchmark test for fundamental GUI knowledge or for knowledge that 'GPT-5' happens to encode well? Lightly AI-edited
GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper investigates why large vision–language models still struggle with automated GUI tasks despite recent advances. The authors hypothesize that missing core GUI knowledge in current VLMs is a key factor behind their performance gap compared to humans. To address this, the paper makes several notable contributions. First, it defines three dimensions of GUI knowledge based on common failure patterns: (1) interface perception, (2) interaction prediction, and (3) instruction understanding. Using these categories, the authors introduce GUI Knowledge Bench, a comprehensive benchmark composed of multiple-choice and yes/no questions derived from over 40,000 GUI screenshots and 400 execution traces across six platforms and 292 applications. This benchmark systematically evaluates what GUI-relevant knowledge is encoded in VLMs prior to any fine-tuning on task-specific data. The paper’s experiments show that while current VLMs can often recognize basic widget functions, they struggle with perceiving system state, predicting interaction outcomes, and verifying task completion. Importantly, the authors demonstrate a close link between performance on the knowledge benchmark and real GUI task success: models with more encoded GUI knowledge perform better on actual GUI automation tasks, and providing missing knowledge (e.g. in the form of operation plans) significantly improves task execution success. Overall, the paper’s contributions include a novel benchmark dataset for GUI knowledge, an analysis revealing specific knowledge gaps in state-of-the-art VLMs, and empirical evidence that addressing these knowledge gaps can improve GUI task automation. 1. The paper tackles a crucial and timely problem in multimodal AI - why VLM-driven GUI agents often fail in real scenarios. The authors clearly motivate that beyond reasoning and planning, knowledge of GUI specifics is missing in current models. 2. The introduction of a large-scale benchmark with 3483 knowledge-centric questions across 6 operating systems and 292 applications is a significant contribution. 3. The paper’s breakdown of GUI knowledge into three dimensions – interface perception, interaction prediction, and instruction understanding – is well-grounded in observed failure modes and prior literature. 4. The authors evaluate a wide range of state-of-the-art models, including both closed-source and open-source models. The benchmarking results are detailed per knowledge category and sub-task, allowing for nuanced comparisons. This extensive comparison lends credibility to the findings. 5. The results yield clear, actionable insights. 6. A significant strength is the additional experiment bridging the benchmark to real task execution. 1. While the paper states that the 3,483 questions were produced via automated generation plus manual annotation, it provides few details on this process. It’s not fully clear how questions and answers were generated or verified for correctness and difficulty. 2. Focus on base VLMs without fine-tuning. 
The benchmark specifically tests models “prior to downstream training” – i.e., base VLMs without task-specific fine-tuning. This isolates inherent knowledge but also means models are out-of-the-box, not specialized for UI understanding. 3. Evaluation of knowledge integration methods is limited. The paper’s solution implications mainly suggest selecting better base models or augmenting them with knowledge. While it demonstrates that adding operation plans helps one model, it stops short of exploring other ways to inject GUI knowledge (for instance, using the benchmark itself as additional training data or using retrieval during inference). A deeper discussion on concrete strategies (beyond the brief mention of retrieval augmentation) would have been useful to translate the findings into actionable guidance for building improved agents. 1. How were the 3483 knowledge questions constructed and validated? The paper mentions a mix of automated generation and manual annotation, but clarification is needed on the process. For example, did you use templates to generate questions from execution trajectories, and what steps ensured that the questions have unambiguous correct answers and appropriate difficulty? 2. The OSWorld case study is great evidence of knowledge helping in one setting. Did you observe (or do you anticipate) a strong correlation between a model’s score on GUI Knowledge Bench and its success rate on various downstream GUI tasks (beyond OSWorld)? For example, if Model A scores 10% higher on the benchmark than Model B, is A consistently better when fine-tuned or evaluated on full task benchmarks? Fully AI-generated
GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents GUI Knowledge Bench (GUIKB), a diagnostic benchmark designed to analyze knowledge gaps in Vision-Language Models (VLMs) applied to graphical user interface (GUI) understanding and control. The benchmark categorizes GUI-related reasoning into three dimensions (Interface Perception, Interaction Prediction, and Instruction Understanding) and builds a large-scale dataset spanning 6 operating systems, 292 real-world applications, and over 3,400 questions derived from around 40,000 screenshots and 400 GUI trajectories. Timely and relevant problem focus: It goes beyond measuring overall task success to dissecting the specific types of GUI knowledge involved. Broad empirical coverage: The benchmark's diversity across multiple OSes and hundreds of applications is impressive, offering a realistic assessment environment that surpasses previous GUI benchmarks in scope. Practical community utility: The dataset could serve as a diagnostic suite to evaluate model grounding and reasoning for future GUI or computer-use agents. The main conceptual novelty lies in organizing and scaling previous evaluation ideas. It extends earlier benchmarks (e.g., MMBench-GUI, SeeClick, Web-CogBench) in breadth rather than introducing new forms of reasoning or interaction. The multiple-choice and yes/no question format, combined with visual hints, simplifies grounding and may allow models to exploit linguistic or positional biases instead of demonstrating genuine understanding. Free-form questions (which would move beyond the 25% random-guess floor) could provide a richer signal. The OSWorld improvement experiment is intriguing but small in scope and lacks variance or ablation analysis, making the link between benchmark performance and real agent success suggestive rather than proven. The paper also lacks deeper analysis that could give insights into what current agents lack in terms of decision making. Have you tried a free-form response setting (without multiple-choice cues) to confirm that models truly possess GUI knowledge rather than exploiting format bias? Could you share dataset composition metrics (OS/app balance, interface types, redundancy rate) to verify diversity and coverage? I am not very confident in accepting this paper; my decision is subject to the rebuttal. Lightly AI-edited
Computational Bottlenecks for Denoising Diffusions Soundness: 4: excellent Presentation: 3: good Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses a very interesting question: Can denoising diffusions efficiently sample from any distribution $\mu$ for which sampling is otherwise tractable? The paper presents a rigorous negative answer by showing that diffusion samplers inherit the computational hardness of the underlying denoising task. The core mechanism of diffusion relies on learning the drift term $m(y,t)$, i.e. the Bayes-optimal denoiser. If this denoising problem exhibits an Information-Computation Gap (ICG), a regime where the statistically optimal error is low, but no polynomial-time algorithm can achieve it, then the authors show that diffusion sampling based on score-matching must fail. The paper includes several key theoretical results. Theorems 2 and 5: Efficient diffusion sampling implies efficient near-Bayes-optimal estimation. The contrapositive confirms the central thesis: computational hardness in denoising implies diffusion sampling fails. Theorem 1: The paper proves the existence of polynomial-time drifts that are super-polynomially close to the optimum (among poly-time algorithms) in the score-matching objective, yet produce samples maximally far from the target $\mu$ (in Wasserstein distance). So the standard training objective is insufficient and potentially misleading when an ICG exists. Theorem 3: The paper shows that any poly-time, near-optimal drift satisfying a mild Lipschitz condition fails to sample correctly. Furthermore, Corollary 5.1 provides a concrete sufficient condition: a $C/t$-Lipschitz drift fails when $t \ge (1+\delta)\,t_{\mathrm{alg}}$. These results are illustrated using the sparse PCA model, where sampling is trivial but denoising is conjectured to be hard below a threshold $t_{\mathrm{alg}}$. Empirical results show that the diffusion process produces degenerate samples. This work is very significant. It goes beyond analyzing statistical generalization or discretization errors, identifying fundamental limits of denoising diffusions. Originality: The paper goes beyond previous analyses of diffusion samplers devoted to KL/Wasserstein bounds or generalization arguments. It provides an interesting computational perspective. In particular, it identifies the information-computation gap as the key problem for correct diffusion sampling; it shows that even near-optimal score-matching drifts (among poly-time methods) can sample the wrong distribution; and it establishes reductions from diffusion sampling to denoising. Quality: Theorems 1-3 are stated with clear hypotheses, and the Appendix proofs are thorough. The proofs are relatively easy to follow given the technicality of the arguments. Clarity: The paper is well-written. The assumptions specifying the ICG are stated clearly. The paper neatly separates the main conceptual arguments in the main text from the proofs. The figures nicely illustrate the theoretical claims. Significance: This is a very interesting paper explaining when diffusion samplers fail for computational reasons—even when sampling from $\mu$ is itself easy.
They also specify when positive results (which rely on accurate scores or favorable distribution classes) do not apply. Dependence on assumptions: The main results depend on Assumptions 1 and 2, which do not hold unless an ICG is present. This aligns with predictions for models like Sparse PCA, but the underlying hardness remains conjectural in the general case. It would help to map Assumptions 1 and 2 to specific hardness conjectures in the literature. Robustness of Theorem 1: Theorem 1 shows that sampling can fail when the drift is $O(n^{-D})$-close to the optimum among poly-time algorithms. This is a fairly stringent definition of near-optimality. How robust are the results to this assumption? Even a heuristic argument would be good. Construction of the drift (Theorem 1): The drift $\hat{m}$ constructed to prove Theorem 1 is somewhat artificial. While this demonstrates the insufficiency of the score matching objective, it does not demonstrate that such deceptive drifts arise naturally during standard training procedures (e.g., gradient descent on neural networks). Actionable insight: Discuss whether standard optimization dynamics are likely to find these specific deceptive local minima, or if this is primarily an existential proof. Are there architectural choices or regularization techniques that might steer optimization away from these solutions? Lipschitz assumption: Theorem 3 establishes a negative result for Lipschitz drifts, specifically requiring a $C/t$-Lipschitz condition for $t \ge (1+\delta)t_{\mathrm{alg}}$. However, non-Lipschitz drifts might succeed. It would be good to clarify whether this Lipschitz condition is key to the failure or just a technical requirement. Do the typical architectures used in practice (e.g., your GNN in Section 3) satisfy this specific $C/t$-Lipschitz property? Applicability of the results: The analysis focuses on distributions where the ICG is well understood, e.g., Sparse PCA. How do these findings translate to standard distributions encountered in physical systems, e.g., Lennard-Jones or $\phi^4$? It would be beneficial if the authors could discuss, even informally, the characteristics of distributions that might lead to an ICG. Minor point: The authors follow a non-standard presentation of diffusion models (the one originating from stochastic localization). I know this is equivalent, but I think it hurts the readability of the paper. Alternative Objectives: Proposition 2.1 suggests that score matching is the root cause of the failure when ICGs are present. Does this motivate the development of alternative training objectives for diffusion models that do not strictly enforce denoising optimality, particularly at low SNR (small $t$)? Optimization: Theorem 1 implies that the score matching loss landscape possesses near-optimal solutions (in the poly-time class) that are bad for sampling. Do we know anything about whether standard optimization methods are likely to find these bad near-optimizers rather than potential good ones? Fully human-written
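For context on the notation used in this review ($m(y,t)$, the $C/t$-Lipschitz scaling, the threshold $t_{\mathrm{alg}}$), the stochastic localization convention the paper follows can be sketched as follows; this is standard background (Montanari, 2023, cited as [1] in the next review) rather than material taken from the submission itself:
\[
y_t = t\,x + W_t, \qquad W_t \sim \mathcal{N}(0,\, t\, I_n), \qquad m(y,t) = \mathbb{E}[\,x \mid y_t = y\,] = \frac{y}{t} + \nabla_y \log p_t(y),
\]
\[
\mathrm{d}y_t = m(y_t, t)\,\mathrm{d}t + \mathrm{d}B_t, \qquad y_0 = 0, \qquad \frac{y_t}{t} \longrightarrow x \sim \mu \quad (t \to \infty).
\]
Here $p_t$ denotes the marginal density of $y_t$; large $t$ corresponds to high signal-to-noise ratio, and the conjectured hardness threshold $t_{\mathrm{alg}}$ marks the regime below which no poly-time algorithm is believed to approximate $m(y,t)$ to near-Bayes accuracy.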
Computational Bottlenecks for Denoising Diffusions Soundness: 4: excellent Presentation: 3: good Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. In this paper, the authors study the feasibility of sampling with diffusion models. More precisely, they are interested in answering the following question: "Are there distributions for which there exist almost-optimal denoisers that can be computed in polynomial time, yet diffusion sampling fails?" They answer this question in the affirmative. The paper proceeds as follows: * In Theorem 1, they show that for any arbitrary polynomial-time denoiser which satisfies two generic assumptions, there exists a modified denoiser for which diffusion sampling fails. In Corollary 3.1, they apply Theorem 1 to the case of sparse random matrix measures. Other sparse examples are considered. * While Theorem 1 shows that "good denoiser does not imply good sampling", in Theorem 2 they show that "good sampling implies good denoiser". * In Theorem 3, they show that under an additional Lipschitz condition, the results can be strengthened to show that *any* polynomial-time algorithm which is nearly optimal for the denoising task will lead to poor sampling quality. The authors conclude the paper with toy numerical experiments. * The paper is theoretically strong. I have to commend the quality of the writing (apart from the introduction of diffusion models and some notation choices discussed in the "Weaknesses" section). Even though I am not a domain expert on stochastic localization and random matrix theory, the paper was pleasant to read and the ideas clearly exposed. * The results presented in the paper will be of great interest to the theoretical diffusion model community. Those results indeed fill a gap in the literature regarding what can be achieved with diffusion models. * One of my main pain points with the paper is the choice of following the notation and convention of [1]. While I understand that it might be easier for the authors, who seem to be familiar with the techniques developed in [1], to build their theory on top of this framework, it severely limits the adoption of the work, as it has to be translated back into the "classical" diffusion model framework (see [2, 3, 4] for a (non-exhaustive) list of papers which adopt the classical diffusion model convention). The burden of the explanation and communication should be on the authors and not on the readers. * In my opinion, the strongest result is Theorem 3, but there is little discussion of how realistic the Lipschitz assumption the authors consider is. I found Corollary 5.1 to be very hard to read. To come back to the Lipschitz justification, I am not sure I understand the following sentence: "We assume the Lipschitz constant to be C/t, because the input of the denoiser is yt = tx + Wt, and hence the two t-dependent factors cancel." * Something that is unclear to me is the generality of the results presented in the paper. While Theorem 1 is very general, it is not quite clear to me how to apply it beyond the sparse random matrix measures studied by the authors. In particular, I am wondering how specific those results are.
To further specify my question: is (one of) the contribution(s) of the paper 1) the identification of a set of measures under which the assumptions stated in the general Theorems are valid, thereby showing some limitations of diffusion models, or 2) the statement that most data distributions (even the ones considered in practice) fall under the category of the general Theorems, thereby showing the much stronger statement that diffusion models are flawed? It would be good to clarify the scope of the paper in that respect. [1] Montanari (2023) -- Sampling, diffusions, and stochastic localization [2] Conforti et al. (2024) -- KL Convergence Guarantees for Score Diffusion Models under Minimal Data Assumptions [3] Li et al. (2024) -- O(d/T) Convergence Theory for Diffusion Probabilistic Models under Minimal Assumptions [4] Chen et al. (2022) -- Sampling is as Easy as Learning the Score: Theory for Diffusion Models with Minimal Data Assumptions * "However if such drifts exist, our results suggest that minimizing the score matching objective is not the right approach to find them (since the difference in value with bad drifts will be superpolynomially small)." (l.186) This is an interesting remark. There is some evidence that diffusion models fail at approximating the score. First, they are trained on the empirical measure; hence the "optimal" denoiser for this target distribution is a fully memorizing denoiser. Second, due to neural network approximation errors they often fail to correctly approximate the true denoiser. I would be keen to understand how the authors deal with that intrinsic limitation of diffusion models and how this would affect their analysis. * In Proposition 2.1, I do not understand why we can have equality to 0 in (ii). * Could you provide more details regarding the construction of the drift in Proposition 2.1 (i.e., the construction of a drift which is a bad approximation to the true denoiser but gives good diffusion sampling results)? * Before Theorem 1, I think it would be worth providing just a few sentences regarding the validity of Assumption 1 and Assumption 2. At this stage it is very much unclear for the reader whether those assumptions are easy to satisfy or not. Fully human-written
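On the notational pain point raised above, the translation between the stochastic localization convention of [1] and the classical score-based convention of [2, 3, 4] is a simple reparameterization; the identities below are standard background, not a claim about the paper's own exposition:
\[
\frac{y_t}{t} = x + \sigma\,\varepsilon, \qquad \sigma = \frac{1}{\sqrt{t}}, \qquad \varepsilon \sim \mathcal{N}(0, I_n),
\]
so the localization time $t$ plays the role of an inverse noise level, and the drift is the usual Tweedie denoiser evaluated on the rescaled input:
\[
m(y,t) = \mathbb{E}\!\left[x \;\middle|\; x + \sigma\varepsilon = \tfrac{y}{t}\right] = D\!\left(\tfrac{y}{t},\, \tfrac{1}{\sqrt{t}}\right), \qquad D(z,\sigma) = z + \sigma^{2}\,\nabla_z \log p_\sigma(z).
\]
This rescaling is also why the sentence quoted in the review about the Lipschitz constant says that "the two t-dependent factors cancel": a $C/t$-Lipschitz bound on $y \mapsto m(y,t)$ is a $C$-Lipschitz bound with respect to the normalized input $y/t$.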
Computational Bottlenecks for Denoising Diffusions Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper's primary contribution is to rigorously prove that a computational bottleneck, not just a statistical one, limits the power of denoising diffusion models. The authors prove that not all easily-sampled probability distributions can be efficiently learned by these models. They construct a specific counterexample based on the sparse submatrix estimation problem, which is conjectured to be hard. This shows that even a nearly-perfect, efficiently-computed drift function will fail to generate accurate samples, thus identifying a fundamental gap between what is statistically possible and what is computationally feasible for this class of generative models. 1. This paper's core strength is its novel perspective, shifting the analysis of diffusion models from standard statistical limits to the overlooked statistical-computational gap. 2. The authors construct an excellent counterexample using the sparse submatrix estimation problem to provide a distribution that is easy to sample from directly but computationally intractable for a diffusion model to learn. 3. Supportive simulations clearly validate the theory, showing that a computationally-bounded diffusion model fails to generate accurate samples from this "hard" distribution, just as the authors predict. 4. The paper's argument is mathematically rigorous, employing formal, well-established proofs. 1. The paper's organization is not straightforward; the structure is relatively loose, which makes the argument hard to follow. 2. The planted clique counterexample feels specific, which makes it difficult to connect this theoretical computational bottleneck to the broad success of diffusion models on real-world data. 1. Would considering the 'average complexity' via diffusion over a class of easily-sampled distributions be more meaningful than focusing on 'bad' extreme counterexamples like the sparse submatrix prior? 2. Do you have conclusions about the training procedure in this paper, or does it imply that for the provided counterexample, the training of diffusion models will always fail due to its inherent computational hardness? 3. Is it possible this specific computational bottleneck is just an artifact of this particular formulation of diffusion sampling, or does it represent a fundamental hardness that alternative generative models like flow-matching models could not bypass? Fully AI-generated
Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents the Knowledge Reasoning Language Model (KRLM), a novel framework for inductive knowledge graph reasoning (KGR). The primary goal is to address the "knowledge distortion" problem in existing LLM-based approaches, where sparse contextual information from a knowledge graph (KG) can interfere with or override the dense intrinsic knowledge of the LLM. KRLM aims to create a more unified coordination between the LLM's internal knowledge and the external KG's structural knowledge. The paper clearly identifies and articulates the "knowledge distortion" problem, which is a significant and practical challenge in combining LLMs with KGs. The goal of unifying the two knowledge sources is highly relevant. The proposed KRLM is a sophisticated system where each component is designed with a clear purpose that ties back to the central goal of knowledge coordination. The KRL instruction format, the memory-augmented attention, the structured predictor, and the mutual distillation loss all work in concert. This is a significant engineering and research effort. The proposed KRLM architecture is very complex. It involves multiple GNNs (for relations, entities, and the projection head), a modified attention mechanism, and a custom tokenizer, all integrated with a base LLM. While the results are strong, this complexity raises questions about its practical scalability and efficiency. The computational complexity analysis in Appendix E confirms that this is a heavy model. Could the authors comment on the trade-offs between this complexity and the performance gains? Is it possible to achieve similar benefits with a simpler architecture? The experiments are based on Llama2-7b. While this is a reasonable choice, the field of LLMs is moving incredibly fast. How dependent are the architectural choices and performance gains on this specific base model? Would the same knowledge coordination mechanisms be as effective or even necessary with more advanced, capable models (e.g., Llama-3, Mistral, etc.) that might have better inherent reasoning and context-handling abilities? see weaknesses Fully AI-generated
Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces KRLM, a novel framework designed to address the problem of knowledge distortion in LLM-based KGR models. Specifically, KRLM aims to integrate the inherent knowledge of LLMs with the structural context of KGs to improve the accuracy and reliability of fact reasoning in open-domain KGs. The proposed model shows strong performance in both zero-shot and fine-tuned inductive KGR tasks, outperforming existing models on several benchmark datasets. 1. The paper proposes a sophisticated mechanism for integrating the structural knowledge of KGs with the intrinsic knowledge of LLMs. 2. The experimental results indicate KRLM's superiority in both zero-shot reasoning and fine-tuning scenarios, outperforming several state-of-the-art models. 3. The model is well-grounded in both theory and practice, with clear explanations of the new modules. 1. The reasoning complexity of KRLM, particularly during fine-tuning and inference, is a significant concern. While the model shows excellent performance, the computational overhead may limit its practical deployment in large-scale real-time applications. The authors briefly mention the computational complexity but do not provide enough detail. 2. The paper does not sufficiently address how the model would perform or be adapted to environments with limited computational resources or sparse KGs. 3. Traditional knowledge graph reasoning methods need to be discussed. The paper assumes that the knowledge memory will contain relevant information for reasoning. However, in cases where the relevant entities or relations are not part of the memory, will the model's performance drop dramatically? Fully AI-generated
Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes KRLM, a Knowledge Reasoning Language Model for inductive knowledge graph reasoning where entities/relations may be unseen. KRLM aims to coordinate intrinsic LLM knowledge with sparse KG context to mitigate knowledge distortion and reduce hallucinations. The method introduces (i) a KRL instruction format and a KRL tokenizer that map entities and relations into unified tokens aligned with LLM text representations; (ii) a KRL attention layer with a dynamic knowledge memory to balance LLM priors and KG evidence during reasoning; and (iii) a structure-aware next-entity predictor that constrains decoding to a trusted knowledge domain. Across 25 real-world inductive KGR datasets, KRLM is reported to outperform prior methods in both zero-shot and fine-tuned settings. 1. The paper focuses on an important pain point: leveraging LLM prior knowledge without letting sparse KG signals distort or override it, while also constraining generative hallucinations. 2. The KRL instruction + tokenizer provide a concrete alignment mechanism between symbolic KG elements and LLM token space, which is practical and reusable beyond a single dataset. 3. The KRL attention layer with dynamic memory is a sensible architectural step toward balancing textual priors and structured context, and the design is easy to ablate. 4. Results on 25 inductive benchmarks (zero-shot and fine-tuning) suggest robustness across datasets and settings, which strengthens external validity. 1. While the KRL interface and attention layer are coherent, the overall contribution currently reads as a well-engineered integration of known ingredients. The paper needs to sharpen what is theoretically or algorithmically new. 2. There is no efficiency study, yet the method introduces extra tokens (KRL), a memory mechanism, and constrained decoding. The paper should report per-query token counts, latency, and memory usage broken down by component, and compare to strong LLM-based KGR and KGFM baselines under matched budgets. 3. Ambiguity around the "dynamic knowledge memory." It is not clear how the memory is constructed, updated, and queried. How are conflicts between LLM priors and KG entries resolved? 4. The paper should specify how KRL embeddings, attention layers, and predictor parameters are initialized (random, vocabulary-tied, or from pretrained adapters), and whether the backbone LLM is fully fine-tuned, LoRA-adapted, or frozen. Include sensitivity to model scale and initialization choices. 5. The central claim (that coordinated KRL reduces LLM knowledge distortion) needs direct measurement. Controlled ablation studies are needed that vary KG sparsity, conflicting triples, and noisy relations, and report distortion/hallucination metrics. How exactly does the structure-aware predictor enforce structural constraints? Fully AI-generated
Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents KRLM, an LLM-based foundation model for inductive knowledge graph (KG) reasoning. It aims at mitigating the issue of knowledge distortion. KRLM introduces several key components: a custom knowledge reasoning language (KRL) instruction format and tokenizer, a KRL attention layer that dynamically integrates intrinsic LLM knowledge with KG context through a memory mechanism, and a structure-aware next-entity predictor. Extensive experiments across 28 datasets demonstrate that KRLM consistently outperforms state-of-the-art models in both zero-shot and fine-tuning settings. - The paper proposes a well-integrated architecture that effectively aligns LLM representations with KG structure. - The proposed method shows consistent performance gains across a wide range of datasets, particularly in inductive settings, with thorough comparisons to both structural and LLM-based baselines. - The contribution is primarily empirical; a stronger theoretical justification for how KRLM’s architectural choices address knowledge distortion would enhance the paper’s depth. - While KRLM employs a memory-efficient tokenizer, the paper lacks discussion on computational efficiency, including training time and parameter overhead compared to prior models. - The model heavily relies on KRL-format instructions, yet the design choices—such as instruction length, style, and vocabulary—are not systematically analyzed. - The concept of “knowledge distortion”, while intuitive, is loosely defined. Can it be quantified or diagnosed more rigorously? A deeper analysis would help ground this notion. - Table 1 emphasizes accuracy-based metrics (e.g., MRR, Hits@10), but omits key efficiency metrics such as FLOPs, memory footprint, and wall-clock time for fine-tuning vs. pretraining. These are crucial for assessing practical viability. - The paper would benefit from qualitative examples illustrating how KRLM produces more faithful or interpretable reasoning compared to baseline LLMs. Fully AI-generated
ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a benchmark including response-criterion pairs to evaluate professional knowledge across multiple fields. It introduces an efficient LLM-as-judge evaluation framework that mitigates self-enhancement bias. The authors evaluate current LLM performance on both criterion fulfillment classification and response generation for these challenging tasks. 1. The benchmark covers multiple scientific domains and evaluates knowledge storage and complex reasoning capabilities. Its expert-designed criteria facilitate precise, granular assessment of LLM performance on challenging tasks. 2. The paper assesses a wide range of LLMs to provide comprehensive performance benchmarks. The experimental design encompasses comparisons across model accessibility (open-source and closed-source), scale, and reasoning capabilities. 3. The high-quality annotator group and reliable rubric creation pipeline guarantee the dataset quality. 1. In Section 4, the LLM-as-judge is used as a binary classifier, with performance evaluated by F1 score. The target LLM is used to identify whether the response fulfills all the requirements of the provided criterion, as a check on response quality. However, for such complex tasks, the F1 score only captures misalignment between the LLM and human experts. It does not reveal the LLM's internal understanding of the task or identify specific weaknesses. 2. In the rubric creation process from Section 3, the criteria creation and review stages are not described in detail. Both stages appear heavily dependent on annotator judgment, and it remains unclear how each proposed criterion contributes to the granular assessment of response quality. 1. When using the LLM to judge criterion fulfillment, is there any way to extract more information from the LLM performance for further failure-mode analysis? Reasoning models are allowed to generate some inference steps prior to their binary predictions. How might these reasoning traces be utilized to analyze misalignment between LLM predictions and human annotations? Lightly AI-edited
ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces ProfBench, a high-quality evaluation benchmark spanning multiple professional domains. It contains 80 tasks and more than 7,000 human-constructed response–criterion pairs, developed entirely by domain experts with PhD or MBA degrees and without any LLM assistance, ensuring authenticity and professional rigor. The authors further propose a rubric-based LLM-as-a-Judge evaluation paradigm and design three metrics to comprehensively assess assessment consistency, fairness, and efficiency. A systematic evaluation of over 40 open-source and closed-source models investigates the influence of different reasoning mechanisms, model sizes, and response lengths, and presents strategies to reduce evaluation cost and bias. Experimental results show that even the strongest closed-source model (GPT-5-high) reaches only 65.9% on this benchmark, highlighting the professional difficulty and challenge posed by the tasks. In the data annotation phase, the paper adopts a rigorous expert participation mechanism. A total of 38 professionals with PhD or MBA backgrounds were recruited to design tasks and formulate scoring criteria. Multiple rounds of review and consistency verification were conducted to ensure the reliability of annotations. This process guarantees the high quality of the dataset in terms of knowledge depth and annotation accuracy. The paper systematically compares over 40 types of mainstream models, covering dimensions such as closed-source vs. open-source, different sizes, and "thinking" settings. It also analyzes the relationships between model performance, bias, output length, and reasoning costs. The overall experiments are comprehensive, and the conclusions are credible. During the evaluation process, the authors systematically studied the performance and cost differences of different judge models, and proposed an optimal sample allocation method and a low-cost evaluation scheme. This strategy can significantly reduce evaluation costs while maintaining high consistency, which provides valuable insights for large-scale evaluation practices. ProfBench covers four professional domains: Physics, Chemistry, Finance, and Consulting. With diverse task types, it can comprehensively reflect the generation and judgment capabilities of large language models in professional scenarios. 1. The paper describes the annotation process and consistency metrics in detail, but lacks qualitative demonstration of controversial samples and explanations of the adjudication mechanism. Given the subjectivity of rubric-based evaluation, it is recommended that the authors supplement several typical cases to illustrate the judgment differences among different annotators and the final adjudication process, thereby enhancing interpretability and transparency. 2. The current definition of the Bias-Index relies on a limited set of reference models, and directly subtracting the Bias-Index from the Macro-F1 score to obtain the "Overall" indicator may lead to issues of dimension inconsistency and sensitivity in multi-model scenarios. 
It is suggested that the authors verify the robustness of this indicator on a larger model set and supplement other fairness metrics. 3. The paper’s results show obvious variations in task difficulty across different domains, yet the authors do not explain whether task balancing or weighting was performed. When comparing across domains, comprehensive scores may be affected by the distribution of domain samples, thereby impacting overall fairness. 1. You mentioned that the annotation process underwent multiple rounds of review and reported a relatively high consistency metric. When discrepancies arise among annotators, is there a fixed adjudication process or arbitration mechanism in place? If yes, could you supplement typical cases in the appendix to help readers understand how to handle criteria with strong subjectivity? 2. Is this metric stable when the number or type of reference models changes? Have sensitivity analyses been conducted, or has the consistency of calculating the bias-index in a larger model pool been tested? Lightly AI-edited
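To make the sensitivity concern in point 2 above concrete, here is a small, purely hypothetical illustration; the actual ProfBench Bias-Index may be defined differently, and the spread-based bias term, model names, and numbers below are assumptions for illustration only.

def overall_score(per_reference_f1: dict[str, float]) -> float:
    # Assumed combination: Overall = mean macro-F1 minus a Bias-Index taken here,
    # purely for illustration, as the spread of F1 across reference models.
    f1s = list(per_reference_f1.values())
    macro_f1 = sum(f1s) / len(f1s)
    bias_index = max(f1s) - min(f1s)
    return macro_f1 - bias_index

# The same judge evaluated against two different reference-model pools (toy numbers).
pool_small = {"o3": 0.78, "Grok4": 0.74, "R1-0528": 0.76}
pool_large = {**pool_small, "ref_model_x": 0.68, "ref_model_y": 0.81}  # hypothetical extra references

print(overall_score(pool_small))  # ~0.72: modest spread, small penalty
print(overall_score(pool_large))  # ~0.62: the penalty grows as the pool widens

Under this reading, the "Overall" value depends as much on which reference models are in the pool as on the judge itself, which is exactly the robustness check requested in question 2.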
ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces a new benchmark (ProfBench) to test LLMs on PhD/MBA-level tasks that require specialized knowledge. Unlike prior benchmarks focusing on quickly verifiable problems, ProfBench focuses on open-ended tasks like writing reports that require expertise. The paper's contributions include: (1) establishing a multi-domain, expert-annotated rubric benchmark; (2) assessing LLMs both as report generators and as judges; and (3) proposing techniques for LLM-based grading that aim to reduce evaluation cost and bias. The experiments demonstrate that LLM judges can grade responses with reasonable agreement to human experts. - A major strength is the exploration of LLMs as automated judges. Building on work in rubric-based evaluation, the authors propose a framework to have LLMs determine if a given response satisfies each expert criterion. The framework aims to reduce self-enhancement bias (i.e., where LLM judges would favor responses from the same model or provider), as well as API costs. - ProfBench benchmarks LLMs as report generators in scenarios that mirror actual professional workflows, requiring multi-step reasoning and synthesizing information from multiple reference documents. - Ablation experiments demonstrate the importance of reference documents for model performance. - The benchmark's scoring relies on LLM judges, and while the authors do measure agreement with human annotators, it's shown that the best judge isn't near perfect (<80% Macro-F1 overall). There are risks that LLM judges might miss nuanced criteria fulfillment or penalize creative answers. The paper doesn't deeply discuss failure modes of the LLM judge. - ProfBench covers only four domains, leaving out several important domains of professional reasoning — notably, the legal, health, and engineering domains. - Relatedly, the paper omits some relevant benchmarks involving professional tasks in the legal and engineering domains, which are not covered by the benchmark: Chilton, A., Guha, N., Nyarko, J., Ho, D., Ré, C., Narayana, A., ... & Peters, A. (2023). LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 36. Zhou, X., Wang, X., He, Y., Wu, Y., Zou, R., Cheng, Y., ... & Zhao, J. (2025). EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving. arXiv preprint arXiv:2509.17677. - The paper does not discuss any strategy for preventing test data leakage or overfitting. In contrast, some prior benchmarks like MMLU-Pro introduced hard, out-of-distribution questions to stay ahead of models. - The paper's evaluation doesn't include an overall quality judgment beyond summing criteria. While rubric scoring is objective, it might not capture important aspects of professional work (e.g., originality or creativity) that are hard to enumerate. This is a philosophical weakness of rubric-based evaluation in general, and acknowledging this potential limitation would improve the work. - Is there a risk of saturation with the best model scoring over 65%? Is the benchmark future-proof?
Should there be a “hard” subset of prompts that are more adversarial in nature? - Could the authors clarify their rationale for selecting the four domains for ProfBench? For example, was the health or legal domains excluded due to difficulty in obtaining annotations or because existing benchmarks like HealthBench or LegalBench already cover this ground? If the goal is to provide a broad measure of professional knowledge, adding other professional domains seems valuable. Fully human-written
ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces ProfBench, a rubric-guided benchmark of 80 expert-curated tasks spanning four domains (Chemistry, Physics, Finance, and Consulting), with more than 7000 human-written response-criterion pairs. Domain experts create the tasks as well as weighted task-specific rubrics and label three "seed" model responses (o3, Grok4, R1-0528) per criterion as Yes/No, forming the ground truth for calibrating an LLM judge. The quality of the judge is measured by a metric that also takes into account the bias towards the same model family. Each task has the structure of report-writing with grounding documents, and removing the documents leads to worse performance. The best model reaches an average score of 65.9%, with the lowest performance on Physics (49.3%) and the best performance on Consulting (80%). - The tasks are realistic and complex, come with grounding documents, and are created by domain experts with reviewer feedback. - The LLM judges are evaluated in a clear way that also takes into account the bias towards the same model family. - Separate re-annotations show high inter-annotator agreement. Moreover, the LLM judge is highly reliable, with only a tiny difference from human-annotated scores. - GPT-5 already achieves a high score on Consulting (~80%), while Physics lags at ~49%. This suggests that one domain might saturate sooner than the others. - The set-up is text-only, even if tool use might be helpful for some tasks, e.g., calculators, spreadsheets, code, etc. - Despite its current difficulty, the text-only format may offer limited room for improvement as models become more capable. - Domain coverage is narrow, covering only two science and two business domains. - What was the rationale for choosing the four domains? Do you plan to add others? - Are you planning on performing evaluations with models that can use tools relevant for the benchmark tasks? - The LLM judge evaluation depends on three seed models. I wonder whether the judge evaluation can change when picking different models. Do you have any thoughts on that? Fully human-written
StylOS: Multi-View 3D Stylization with Single-Forward Gaussian Splatting Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes Stylos, a feed-forward 3D Gaussian stylization framework. Given one or more unposed RGB views of a scene plus a single style reference image, Stylos directly predicts a stylized 3D Gaussian scene in a single forward pass, without per-scene optimization or known camera intrinsics/extrinsics. Stylos explicitly separates geometry and style: geometry is inferred by a VGGT-like alternating-attention backbone, while style is injected through a Style Aggregator that applies global cross-attention from the style tokens to all content-view tokens. The paper also introduces a voxel-based 3D style loss: instead of matching style statistics (mean/variance in VGG feature space) per-frame, Stylos fuses multi-view features into a shared voxel grid and aligns those 3D features to the style distribution, which the authors argue enforces cross-view consistency and preserves geometry. Experiments on CO3D, DL3DV-10K, and Tanks and Temples report improved multi-view consistency (LPIPS/RMSE) and competitive perceptual/artistic quality (ArtScore) compared to per-scene stylization baselines such as StyleRF and StyleGS, and include ablations on cross-attention design and style loss variants. - Practical significance. Single-forward stylization of an entire 3D Gaussian scene (poses + Gaussians + colors) without per-scene finetuning or known camera parameters could be transformative for content pipelines. This is not just incremental NeRF/3DGS stylization but closer to the real situation, like “instant turn-this-video-into-stylized-3D-assets.” - Architectural clarity. Clean separation between a geometry backbone and a Style Aggregator branch (global cross-attention conditioning on the style image). The ablation across “frame-only,” “hybrid,” and “global-only” variants is convincing and shows why global conditioning improves structure and texture coherence. - Breadth of evaluation. The paper reports not only reconstruction-like metrics (PSNR/SSIM/LPIPS) but also temporal / multi-view consistency metrics (LPIPS & RMSE at different frame gaps) and ArtScore for perceptual “artness,” and demonstrates cross-category and cross-scene generalization. - 3D style loss justification is empirical. The 3D voxel-statistics loss gives better LPIPS/RMSE consistency, but the paper does not analyze potential side effects such as global color bleeding or loss of local style detail in poorly observed regions. Some intuition or failure-case visualization would strengthen the argument that voxel-statistics matching is fundamentally better than 2D scene-level matching. - Scalability limits. Although Stylos is advertised as handling up to “hundreds of views,” the scaling experiment shows visible degradation beyond ~32 views per batch (edge artifacts, instability in the Gaussian representation). The method may still be quite useful, but this practical constraint, probably brought by VGGT, should be communicated more prominently. - Failure cases of the 3D style loss. Are there any visual failure cases where voxelized style statistics hurt local detail (e.g., oversmoothing, texture bleeding in occluded regions)? 
Right now we only see success cases and quantitative gains. Fully AI-generated
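Since the voxel-based 3D style loss is the component these reviews discuss most, a minimal sketch may help make it concrete. The snippet below is an assumption based on the reviews' description (fuse multi-view VGG features into a shared voxel grid, then match feature mean/variance to the style image), not Stylos' actual implementation; in particular, the voxel indexing is taken as precomputed and the confidence weighting mentioned elsewhere is omitted.

import torch

def voxel_style_loss(content_feats, voxel_ids, style_feats):
    """Sketch of a voxel-level mean/variance style loss (an assumption, not
    Stylos' actual loss).

    content_feats: (N, C) multi-view VGG features of back-projected 3D points
    voxel_ids:     (N,)   LongTensor voxel index of each point (precomputed)
    style_feats:   (M, C) VGG features extracted from the style image
    """
    C = content_feats.shape[1]
    num_voxels = int(voxel_ids.max().item()) + 1

    # Fuse multi-view features into a shared voxel grid by averaging per voxel.
    sums = torch.zeros(num_voxels, C, device=content_feats.device)
    counts = torch.zeros(num_voxels, 1, device=content_feats.device)
    sums.index_add_(0, voxel_ids, content_feats)
    counts.index_add_(0, voxel_ids, torch.ones_like(content_feats[:, :1]))
    occupied = counts.squeeze(1) > 0
    voxel_feats = sums[occupied] / counts[occupied]

    # Match first- and second-order statistics of voxel features to the style image.
    c_mean, c_std = voxel_feats.mean(0), voxel_feats.std(0)
    s_mean, s_std = style_feats.mean(0), style_feats.std(0)
    return ((c_mean - s_mean) ** 2).mean() + ((c_std - s_std) ** 2).mean()

Matching statistics over occupied voxels rather than per rendered frame ties the style constraint to the 3D scene itself, which is presumably why such a loss improves the cross-view consistency metrics the reviews report.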
StylOS: Multi-View 3D Stylization with Single-Forward Gaussian Splatting Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents StylOS, a single-forward 3D style transfer framework based on 3D Gaussian Splatting. The method employs a Transformer backbone with a dual-pathway design that separates geometry prediction from style injection: the geometry pathway maintains self-attention mechanisms to preserve geometric fidelity, while style is injected via global cross-attention to ensure multi-view consistency. 1. Proposes the first single-forward-pass 3D style transfer framework that requires no camera pre-calibration, offering significant practical value. 2. The method demonstrates strong scalability, supporting processing from a single view to hundreds of views. 1. Lacks fair comparison with per-scene optimization methods under the same computational budget. 2. Critical implementation details are missing regarding the voxelization operation: What is the voxel resolution? How are occlusions handled? What is the specific implementation of confidence weighting? Do these factors have significant impact on the results? 3. The weight parameters for each loss term are not provided. 4. The specific architecture of the Gaussian Adapter is not described in detail. 5. There is insufficient discussion on which scene types or style categories the method performs poorly on. 6. Why does global cross-attention not only improve style consistency but also enhance geometric fidelity? This seems counterintuitive and requires deeper analysis. See Weaknesses. Lightly AI-edited
StylOS: Multi-View 3D Stylization with Single-Forward Gaussian Splatting Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces StylOS, an efficient framework for view-consistent 3D style transfer using 3D Gaussian Splatting (3DGS). The main technical contribution is a single-forward inference model, built upon VGGT and AnySplat, that directly predicts the stylized 3DGS representation. This approach enables optimization-free, geometry-aware synthesis from unposed content views (from a single image to a multi-view collection) and a reference style image. By bypassing time-consuming per-scene optimization, StylOS achieves fast processing and superior generalization to unseen scenes and styles. 1. **Single-Forward Efficiency**: The most compelling contribution is the ability to perform high-quality 3D style transfer in a single forward pass without per-scene optimization. This drastically reduces processing time and makes 3D stylization scalable for real-time or large-scale applications. 2. **Transformer Integration and Unposed Capability**: By utilizing a Transformer architecture (built upon VGGT) to directly predict the parameters of the stylized 3D Gaussian Splatting representation, the method bypasses the traditional optimization pipeline and simultaneously enables robust handling of unposed content (single or multi-view). This significantly lowers the barrier to entry for users, as complex camera pose estimation is not required. 3. **Generalization and Robustness**: The method demonstrates excellent generalization capabilities across unseen content scenes, object categories, and style images, suggesting robust learning of disentangled style and geometry features. 1. **Limited Technical Novelty in Overall Framework**: The overall conceptual framework of achieving fast, optimization-free 3D stylization by integrating a feed-forward 3D reconstruction model (VGGT/AnySplat) with cross-attention for style injection is highly similar to previous works like Styl3R (NeurIPS 2025, published on arXiv in May 2025), which uses Dust3R. While StylOS's integration of VGGT allows it to process a wider and more flexible range of input views, the core technical contribution of the idea of a single-forward 3D stylization framework is substantially weakened by this conceptual overlap. The authors should better address this overlap and clearly articulate how their specific implementation of the style injection and 3DGS prediction provides unique advantages over Styl3R's architecture. 2. **Style Fidelity vs. Optimization**: Although the speed advantage is clear, a deeper discussion or visual comparison is needed to evaluate the potential trade-off in style fidelity against state-of-the-art optimization-based 3D style transfer methods like 𝒢‐Style: Stylized Gaussian Splatting or ARF. Are there certain styles (e.g., fine-grained texture styles) where the feed-forward approach visibly struggles compared to optimization? 3. **Minor Formatting Concern (Font)**: It appears the authors may have modified the default font used in the ICLR style configuration. 1. **Style Control and Blending**: Can the authors demonstrate fine-grained control over the stylization process? Specifically: a. 
Can the system adjust the content-style trade-off weight post-inference to control the strength of the stylistic transfer? b. Can the model perform multi-style blending by interpolating or averaging the style embeddings from two or more distinct style images? 2. **Input and Output Resolution**: What is the range of resolutions for the input content views (single or multi-view) and the output rendering that the model is designed to support? Are there inherent limitations or scaling constraints related to the Transformer architecture or 3DGS representation density? Fully AI-generated
StylOS: Multi-View 3D Stylization with Single-Forward Gaussian Splatting Soundness: 4: excellent Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper introduces a feedforward stylized Gaussian method to achieve efficient, high-quality, and visually pleasing 3D scene reconstruction and stylization in a single step of model inference. The proposed approach is built upon VGGT, taking multi-view scene images and a style exemplar image as input. The model predicts the depth and camera poses for each view, while the predicted pixel Gaussian attributes are merged into voxels using confidence-based weights following the AnySplat framework. The results, as demonstrated in the figures and supplementary video, are remarkable. This feedforward stylized Gaussian model has the potential to significantly impact the field. Artistic stylization, a cornerstone of human creativity throughout history, plays a vital role in bridging the technical and emotional aspects of visual representation. In the context of 3D world reconstruction, artistic stylization becomes even more critical, as it encapsulates the essence of reality through the lens of human emotion, emphasizing mood, character, and individuality. By enabling stylized 3D scene reconstruction without the need for test-time training, this model addresses a significant challenge and represents a major step forward in both technical and artistic innovation. The technical strengths of the paper can be summarized as follows: 1. This work is the first to achieve a stylized 3D world in a fully feedforward manner. From the early days of this field, StylizedNeRF introduced the concept of reconstructing 3D scenes using NeRF, accompanied by a conditional stylization module. While it generalized across styles, it required per-scene optimization. ARF, on the other hand, achieved superior results using a style-matching loss and a deferred backward strategy, but it was not generalized across scenes or styles. Both methods, being NeRF-based, faced limitations due to the high GPU demands during training, requiring solutions like mutual learning and deferred optimization to mitigate these challenges. Similarly, StyleRF was also restricted to per-scene optimization. In the era of 3DGS, StylizedGS achieved impressive results in 3D stylization but remained constrained by per-scene fitting. This paper, however, marks a significant milestone by creating a stylized 3D world using a tuned foundation model like VGGT, combined with a generalized stylization module, enabling a fully feedforward pipeline. This eliminates the need for lengthy optimization processes. There is no more waiting hours to transform multi-view images into an artistic 3D world. This breakthrough paves the way for future research to focus on achieving even faster and higher-quality stylizations using similar feedforward approaches. 2. The results presented in the paper are both impressive and inspiring. The examples, particularly in the supplementary video, showcase how realistic 3D scenes can be transformed into artistic styles such as cartoon, sketch, and painting. The stylization maintains consistency across views without any flickering artifacts, which is a significant achievement. 
Moreover, obtaining such high-quality results in a single feedforward inference is truly remarkable and highlights the potential for real-time applications in this domain. 1. The most significant contribution of this work is the introduction of a feedforward 3D stylization model, whose core advantages lie in faster inference and reduced memory usage. However, these critical metrics are not evaluated in the experiments. The paper should include a comparison of stylization time across prior works, from the earliest StylizedNeRF, ARF, and StyleRF to the more recent StyleGaussian and StylizedGaussian. For methods requiring per-scene fitting, the stylization time should account for both training and rendering times. Such comparisons would provide readers with a clearer perspective on how this work advances the field in terms of efficiency and practicality. 2. The comparison between NeRF-based and Gaussian-based stylization approaches is not adequately addressed. While StyleRF focuses on zero-shot stylization, its stylization quality is not the strongest. A detailed comparison with stylization examples from StylizedNeRF and similar NeRF-based methods is essential to highlight the advantages of the proposed approach. Moreover, StylizedGS, which outperforms StyleGS, should be included as the representative Gaussian-based stylization method in the comparisons. Incorporating these comparisons would strengthen the paper by demonstrating the superiority of the proposed method over both NeRF-based and Gaussian-based approaches. 1. **Include Inference Speed Comparisons**: Adding a comparison of inference speed would highlight the advantages of the proposed method and make the paper more compelling. Demonstrating faster processing times would provide a stronger justification for the model's practical contributions to the field. 2. **Compare with StylizedNeRF and StylizedGaussian**: Include comparisons with StylizedNeRF and StylizedGaussian, as their stylization results are superior to those of StyleRF and StyleGaussian, respectively. The latter two methods primarily focus on zero-shot and fast stylization but rely on a CNN decoder, which means they are not strictly stylized **radiance** fields but rather stylization **feature** fields. Lightly AI-edited
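The confidence-weighted merging of per-pixel Gaussians into voxels described in the review above (following the AnySplat framework) can be pictured with a short sketch; the voxel size, tensor shapes, and function name are my assumptions, and a real merge would also carry opacity, scale, rotation, and SH attributes.

```python
import torch

def merge_gaussians_into_voxels(xyz, feats, conf, voxel_size=0.05):
    """Hypothetical confidence-weighted merge of per-pixel Gaussians into voxels.
    xyz: (N, 3) predicted centers, feats: (N, C) remaining attributes,
    conf: (N,) per-pixel confidence from the reconstruction head."""
    cell = torch.floor(xyz / voxel_size).long()                  # voxel cell per Gaussian
    _, inv = torch.unique(cell, dim=0, return_inverse=True)      # inv: (N,) voxel index
    V = int(inv.max().item()) + 1
    w = conf.clamp(min=1e-8).unsqueeze(-1)                       # (N, 1) weights
    denom = xyz.new_zeros(V, 1).index_add_(0, inv, w)            # total weight per voxel
    merged_xyz = xyz.new_zeros(V, 3).index_add_(0, inv, w * xyz) / denom
    merged_feats = feats.new_zeros(V, feats.shape[1]).index_add_(0, inv, w * feats) / denom
    return merged_xyz, merged_feats, denom.squeeze(-1)
```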
SFdiff : Diffusion Model with Self-Generation for Probabilistic Forecasting Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This work presents a robust diffusion framework for probabilistic multivariate time-series forecasting that explicitly addresses noisy or unreliable conditioning histories. Rather than conditioning on a fixed past and generating only future values, the proposed SFdiff jointly reconstructs the past and predicts the future within the same reverse-time process. By doing so, the model “purifies” the historical window on the fly, leading to more stable forecasts. The authors provide a theoretical argument via an upper bound on sensitivity, indicating that the joint sequence score is less affected by perturbations in the history than a future-only score, and supply a compatible denoising score-matching training objective. They further integrate classifier-free guidance into score-based conditional modeling and show that, under self-generation, modest guidance improves calibration and sharpness. Across two synthetic dynamical systems and five real benchmarks, SFdiff generally achieves stronger CRPS_sum than prior diffusion, flow, and non-generative baselines. Comprehensive ablations detail the effect of history/future loss masking, guidance weights, and the number of sampling steps. The approach is promising for scenarios with corrupted contexts, though it requires careful hyperparameter tuning and retains the higher sampling cost typical of diffusion methods. - The toy benchmarks are minimal yet representative, making it straightforward to diagnose where the proposed approach helps under perturbed histories. - The writing is concise and well structured, allowing readers to grasp the core idea and theoretical claim without unnecessary detours. - The ablation study is quite insufficient; for example, I did not see an ablation analyzing the contribution/effect of the historical sequence. - Since this is a forecasting task, why not include **point-forecast metrics** (e.g., MSE, MAE, RMSE) in the evaluation? - The model framework is unclear—for example, it is not specified what architecture is used for the denoising network, and many implementation details remain ambiguous. See Weakness. Lightly AI-edited
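A rough sketch of the self-generation idea described in the review above, i.e., denoising the history and the future jointly so that the conditioning window is purified on the fly; the network interface, noise schedule, and the γ weighting are assumptions on my part, not the authors' exact objective.

```python
import torch

def self_generation_loss(denoiser, x_hist, x_future, t, noise_schedule, gamma=0.5):
    """Sketch of a joint past+future denoising loss.
    x_hist: (B, H, D) possibly noisy history, x_future: (B, F, D) target window."""
    x_full = torch.cat([x_hist, x_future], dim=1)          # the whole sequence is generated
    alpha_bar = noise_schedule(t).view(-1, 1, 1)            # (B, 1, 1) cumulative noise level
    eps = torch.randn_like(x_full)
    x_t = alpha_bar.sqrt() * x_full + (1 - alpha_bar).sqrt() * eps
    eps_hat = denoiser(x_t, t)                               # predicts noise on past and future jointly
    per_step = (eps_hat - eps).pow(2)
    H = x_hist.shape[1]
    # gamma trades off reconstructing (purifying) the history vs. fitting the future.
    return gamma * per_step[:, :H].mean() + (1 - gamma) * per_step[:, H:].mean()
```

Under this reading, γ controls how much capacity is spent on reconstructing (and hence denoising) the past versus fitting the future.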
SFdiff : Diffusion Model with Self-Generation for Probabilistic Forecasting Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces SFdiff, a diffusion-based model for probabilistic time-series forecasting. The method couples a "Self-Generation" mechanism that jointly denoises past and future segments with classifier-free guidance (CFG), and the experiments span two toy datasets plus five real-world datasets compared against a broad mix of classical and neural baselines. The motivation resonates: noisy historical conditions genuinely hinder diffusion forecasters, and the self-generation story is communicated with helpful illustrations. I also appreciate the wide baseline coverage. 1. The related-work discussion acknowledges earlier full-sequence diffusion approaches such as TSDiff, yet the manuscript never spells out how SFdiff differs in architecture, loss design, or sampling; despite the γ-sweep and prediction/self-generation comparisons, the incremental novelty over these predecessors remains vague. 2. Theoretical support is also opaque: Theorem 3.1 invokes assumptions (A1–A3) and constants that are never defined, and no proof sketch accompanies the statement, so the promised robustness guarantee cannot be verified. 3. On the empirical side, the narrative that CFG "significantly reduces forecasting errors" conflicts with Table 1/3, where Solar performance worsens once CFG is applied; the text needs to reconcile or explain this divergence. 4. Reproducibility concerns compound the issue: the repeatedly promised "Table 5" with dataset statistics and hyperparameters never appears, leaving key experimental information missing. 1. What tangible distinctions separate SFdiff from TSDiff and other total-sequence diffusion models, and can ablations quantify the incremental benefit of γ-weighting and self-generation? 2. What exactly are assumptions (A1–A3) and the constants in Theorem 3.1, and how do they connect to the VP-SDE sampler used in practice? A proof or detailed sketch is necessary. 3. Could you supply the missing Table 5 (or an equivalent appendix) that records model architectures, diffusion steps, training schedules, compute budgets, and tuning protocols for every method? Beyond addressing those questions, it would help to clarify why CFG degrades Solar results while improving others, ideally with diagnostics that measure the claimed purification of historical inputs. Quantitative evidence of the purification effect and high-level pseudocode would also improve the presentation. Fully AI-generated
SFdiff : Diffusion Model with Self-Generation for Probabilistic Forecasting Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper tackles probabilistic forecasting for multivariate time series using a diffusion-based approach that departs from the common “predict-future-only” setup. The authors propose SFdiff, which reconstructs the entire sequence—past and future—during reverse diffusion. This “self-generation” step implicitly denoises the historical context so that errors and outliers in the conditioning window exert less influence on the generated forecast. The method is formalized with a score-matching objective and a sensitivity bound showing that whole-sequence conditioning is less susceptible to perturbations than future-only conditioning. - **S1** The paper is easy to follow end-to-end, with crisp notation and a clean separation of method, theory, and experiments. - **S2** The analyses on history/future masking, guidance strength, and sampling steps provide actionable takeaways for reproducing and deploying the method. - **W1** The task is time-series forecasting, evaluation relies almost exclusively on CRPS. I would like to see complementary point-forecast metrics (e.g., MAE/MSE/RMSE/SMAPE) to assess accuracy alongside calibration. - **W2** Qualitative results are shown only for the synthetic setups. It remains unclear how SFdiff behaves visually on real benchmarks. Add forecast trace plots and predictive intervals on several real datasets (e.g., Exchange/Electricity), including challenging noisy cases. - **W3** Provide sensitivity to key hyperparameters (history/future weight, guidance strength, steps) and statistical significance tests (e.g., paired t-tests or bootstrap CIs) against strong baselines. - **W4** The framework may benefit from longer contexts (potentially noisier histories), yet the effect of varying input/forecast lengths is not systematically studied. Run controlled studies with longer contexts and horizons to test whether SFdiff’s advantage widens as historical noise increases. See W1 to W4. Moderately AI-edited
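W1 in the review above asks for point-forecast metrics alongside CRPS; these are straightforward to compute from the predictive samples, e.g. on the sample median, as in this minimal numpy sketch (array shapes are assumptions):

```python
import numpy as np

def point_forecast_metrics(samples, target):
    """samples: (S, T, D) predictive samples; target: (T, D) ground truth."""
    point = np.median(samples, axis=0)            # or samples.mean(axis=0)
    err = point - target
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    smape = 200.0 * (np.abs(err) / (np.abs(point) + np.abs(target) + 1e-8)).mean()
    return {"MAE": mae, "RMSE": rmse, "SMAPE": smape}
```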
SFdiff : Diffusion Model with Self-Generation for Probabilistic Forecasting Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper aims to enhance the performance of conditional diffusion models for time series forecasting by accounting for the intrinsic noise within the historical context. Specifically, the paper proposes SFdiff, which reconstructs the full sequence, including both the historical part and the target part, instead of only predicting the target part given the history. The paper claims that, in this way, high-frequency anomalies are largely reduced, thereby minimizing their impact on forecasting. Experiments over 5 datasets and 12 baselines show that the proposed method achieves the lowest probabilistic forecasting error. Case studies illustrate the anomaly-reduction effect. Ablation studies discuss the performance sensitivity to different classifier-free guidance scales and numbers of diffusion steps. 1. This paper takes the interesting perspective of reducing the influence of anomalies within the look-back window by treating them as part of the generation target. 2. The experiments validate this crucial claim, well supporting the idea that the anomalies can be reduced. 3. The paper is well-structured. Experiments are extensive. 1. Theorem 3.1 is not rigorous. A tighter upper bound does not necessarily lead to a strictly lower function value. Assumptions (A1)-(A3) are never defined in the paper. 2. Baselines are old. The authors should consider newer diffusion-based generation models like TimeDiff, TSDiff, MG-TSD, TMDM, NsDiff, etc. 3. The authors should discuss in more detail the differences and connections between the proposed method and TSDiff, which employs a similar technique named observation self-guidance. 4. Formatting issue: the equations are not numbered, and lines 246-247 run out of the margin. Is the performance gain of the model significantly and positively correlated with how anomalous the dataset is? Fully human-written
Provably Efficient Policy-Reward Co-Pretraining for Adversarial Imitation Learning Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a theoretical analysis of the effects of pre-training both the policy and the reward function before running AIL. The paper proposes an interesting theoretical analysis of AIL with policy pretraining alone. Furthermore, it quantifies the regret arising from not pre-training the reward function (Proposition 1, Eq. (5)) and provides an improved bound to showcase the effects of pre-training the reward function. 1. The paper's contributions are mainly incremental and overlap with existing analyses in the AIL literature. 2. Previous theoretical work on AIL has already highlighted its reliance on reward updates (e.g., Lemma 3.2 and Theorem 4.5 in [1]). 3. The reward pretraining contribution is not entirely novel, as the connection between policy pretraining and reward learning has already been explored in [2]. 4. The choice of experimental baselines is not fully justified. Given the close connection to OLLIE in [2], which similarly combines pretraining with AIL, including OLLIE as a baseline would have been more appropriate. Since CoPT-AIL emphasizes stronger theoretical analysis, the comparison against OLLIE would have been particularly interesting to demonstrate the practical benefits of the added theory. 5. The results in Fig. 2 are not fully convincing and do not clearly substantiate the paper's claims regarding CoPT-AIL's superiority over prior AIL approaches. **References**: [1] Chen Y, Giammarino V, Queeney J, Paschalidis IC. Provably efficient off-policy adversarial imitation learning with convergence guarantees. arXiv preprint arXiv:2405.16668, 2024. [2] Yue S, Hua X, Ren J, Lin S, Zhang J, Zhang Y. OLLIE: Imitation learning from offline pretraining to online finetuning. arXiv preprint arXiv:2405.17477, 2024. 1. Setting aside the theoretical analysis, what are the main differences between OLLIE in [2] and CoPT-AIL? Are there reasons why one approach should be preferred over the other? 2. Can you make concrete claims about the quality and quantity of data required for CoPT-AIL to succeed? In particular, how does it perform with suboptimal demonstrations or with only a small number of expert trajectories? Lightly AI-edited
Provably Efficient Policy-Reward Co-Pretraining for Adversarial Imitation Learning Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper investigates the theoretical foundations of pretraining in adversarial imitation learning (AIL). The authors first theoretically reveal the fundamental reason why policy-only pretraining fails: reward error becomes the bottleneck. Based on this insight, they propose CoPT-AIL, which leverages reward shaping theory to jointly pretrain both policy and reward through a single BC procedure. Theoretically, they prove that when the number of expert demonstrations satisfies N ≳ C√(|S||A|H²/K), the method achieves a tighter imitation gap bound compared to standard AIL. Experiments validate the effectiveness of the approach on 6 DMControl tasks. 1. Clear motivation and important problem: The paper addresses a practical pain point in AIL—the requirement for extensive online interactions—and provides a theoretical explanation for why existing policy-only pretraining approaches have limited effectiveness. The problem formulation is valuable. 2. Solid theoretical contributions: The error decomposition in Proposition 1 clearly reveals the bottleneck role of reward error. The use of reward shaping theory to circumvent the reward ambiguity issue is clever. Theorem 1 provides the first theoretical guarantee for the benefits of pretraining in AIL, filling an important theoretical gap. 3. Simple and elegant method: The derivation of r̃*(s,a) = log π^E(a|s) from the expert's soft-optimality property is natural, enabling joint pretraining of policy and reward through a single BC procedure. The method is simple to implement and computationally efficient. 4. Reasonable experimental design: The paper includes comparisons with multiple SOTA methods, and ablation studies adequately validate the necessity of joint pretraining. The theoretical analysis is explicitly limited to tabular MDPs (using |S|, |A| notation), which the authors acknowledge in the conclusion: "this work focuses on the standard tabular setup". However, the experiments use continuous state-action spaces in DMControl tasks, which do not match the theoretical setting. There is no discussion of how to extend the theory to function approximation and continuous spaces. Moreover, the constant C = max d^{πBC}(s)/d^{πE}(s) can be large when BC quality is poor, potentially weakening the practical utility of the theoretical results. Setting r¹(s,a) = log π^BC(a|s) depends on BC quality, which may be poor when demonstrations are limited. For continuous action spaces, Algorithm 3 does not specify how to compute log π^BC(a|s) (a sketch of the computation I have in mind is given just below). If π^BC deviates significantly from π^E, the pretrained reward may mislead subsequent training. The paper only tests on 6 DMControl tasks, lacking environmental diversity. A sensitivity analysis with respect to the number of expert demonstrations N (which the theoretical condition depends on) is missing. Eq. 14 introduces the regularization term β exp(-r), but Table 1 does not report the value of β, and there are no ablation experiments to validate its importance. Additionally, there is no analysis of whether the condition N ≳ C√(|S||A|H²/K) is satisfied in the experiments.
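Returning to the continuous-action point above: assuming a (non-squashed) Gaussian BC policy, the computation of r¹(s,a) = log π^BC(a|s) I would expect looks like the following; this sketch is my own and should not be read as the authors' Algorithm 3.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    def __init__(self, s_dim, a_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, a_dim)
        self.log_std = nn.Parameter(torch.zeros(a_dim))

    def dist(self, s):
        h = self.net(s)
        return Normal(self.mu(h), self.log_std.exp())

# 1) Policy pretraining: behavior cloning on expert (s, a) pairs.
def bc_loss(policy, s, a):
    return -policy.dist(s).log_prob(a).sum(-1).mean()

# 2) Reward pretraining: initialize the AIL reward at r1(s, a) = log pi_BC(a | s).
def pretrained_reward(policy, s, a):
    with torch.no_grad():
        return policy.dist(s).log_prob(a).sum(-1)   # per-dimension log-probs summed
```

If the actual policy class is tanh-squashed or the log-probability is clipped, that should be stated, since it changes the shape of the pretrained reward.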
The derivation from r*(s,a) to r̃*(s,a) in Section 4.1 is somewhat abrupt; the choice of shaping function Φ could be explained more explicitly. Implementation details for baselines in the experimental section are insufficient, such as specific hyperparameter settings for each method. While Proposition 2 is based on the classic result of Ng et al. (1999), its application in the current problem is reasonable. 1. Your theory explicitly targets tabular MDPs (original text: "standard tabular setup"), but experiments use continuous state-action spaces. Can you (a) discuss how the theory extends to function approximation settings, or (b) at least provide heuristic analysis or experimental validation of theoretical predictions in continuous spaces? 2. For continuous actions, how do you compute r¹(s,a) = log π^BC(a|s)? Do you assume π^BC has a specific parametric form (e.g., Gaussian policy)? Algorithm 3 should explicitly clarify this implementation detail. 3. Practical impact of constant C: In your experimental tasks, what are typical values of C = max d^{πBC}/d^{πE}? How does this affect the tightness of the theoretical results? 4. Regarding the regularization coefficient β in Eq. 14: (a) Why is the value of β not listed in Table 1? (b) Can you provide ablation experiments to validate the importance of this regularization term? 5. The theory requires N ≳ C√(|S||A|H²/K). Can you verify whether this condition is satisfied in your experiments, or show sensitivity analysis of algorithm performance with respect to N? 6. Can you provide experimental comparisons with Jena et al. 2021 "Augmenting GAIL with BC for Sample Efficient Imitation Learning"? This method also combines BC and AIL, and comparison would more clearly demonstrate the advantages of CoPT-AIL. 7. Can you discuss the dependence of the method on the soft-optimality assumption for the expert policy (Eq. 1)? If the expert is suboptimal, does the derivation of r̃*(s,a) = log π^E(a|s) still hold? Note: Satisfactory answers to questions 1-5 would strengthen the paper significantly and could raise my recommendation to Accept. Questions 6-7 are less critical but would further enhance the contribution. Heavily AI-edited
Provably Efficient Policy-Reward Co-Pretraining for Adversarial Imitation Learning Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper tackles the failure mode of BC pre-training in adversarial imitation learning (AIL), where the AIL agent barely benefits from BC pretraining, if at all. This phenomenon has been observed and reported in prior works, but without identifying an actionable cause or carrying out a rigorous theoretical study. The paper addresses this overlooked gap in the literature and identifies the culprit as the lack of reward pre-training in AIL: the reward starts from random initialization. They augment the AIL algorithm with a reward pre-training step, such that the policy and reward are co-pre-trained, hence naming their approach AIL with Policy-Reward Co-Pretraining (CoPT-AIL). Theoretical results show a tighter imitation gap. Empirical results show that CoPT-AIL can outperform standard AIL methods. * The paper targets a crucial desideratum of AIL: taking advantage of potentially limited expert data to squeeze as much performance as possible while remaining offline. * The detailed contribution rundown in the introduction is appreciated. This could be improved by indicating, for each point at least, where in the paper it is treated. * The regret bounds the authors end up with are easy to interpret. For example, the new "relative policy evaluation error" bound decreases as 1/N as the number of expert trajectories N increases. * Despite the weaknesses of the experimental setup, the proposed "fix" to BC-pretrained AIL is so simple that there is little reason for any practitioner not to use it. * The authors mention "the telescoping argument" as a discovered property; it is not an incidental property but the reason reward shaping is constructed as such. It might be worth writing about the design principle behind the definition of reward shaping earlier in the main text or with more emphasis, especially considering how central it is to the solution developed in the paper. * Experiments: going beyond DMC would be a big plus; but even in DMC, there are tasks like humanoid-walk/run, quadruped, and dog that could all be valuable additions. While the authors treat 6 environments, those are 6 tasks from 4 different environments (i.e. only 4 different underlying simulated robotics models). The environment with the highest degrees of freedom (leading to the highest state space and action space dimensionalities) is the bipedal walker. A quadruped and/or a humanoid would be more convincing. * When it comes to the practical use of the algorithm, it would be useful to see whether there is some "unlearning" happening in the reward, e.g. by showing how the reward gradients are behaving, or how aligned the BC and AIL gradients are, which would give empirical evidence that, with CoPT-AIL, BC and AIL are not working against each other anymore. In Walker-Run for example (the most complex task treated in the paper), there is a ~50k step initial phase where CoPT-AIL is unable to accumulate any reward while 3 baselines can.
How I interpret this initial plateau at near-zero return where nothing happens: the reward pre-training with the log-probability of the BC-pre-trained policy as reward leads the reward to be peaked at its maximal value on the expert samples, and zero anywhere outside the support of the expert distribution, with very abrupt changes in value as we go in and out of the support. This is a tedious signal to optimize via RL (sparse) a priori. The typical AIL agent (especially in the off-policy setting, cf. "Lipschitzness is all you need" by Blondé et al., 2022) would employ a gradient penalty regularizer to smooth out the reward signal and make learning possible. The authors seem to do something akin to such penalization in their practical implementation (page 22). In L1159-1165, the authors introduce a reward regularizer to "improve the stability of reward training". What the exponential-based regularizer does: the term penalizes very low or negative reward values, and suppresses excessively large ones, since exponentiation smooths the gradient and bounds the value to a small positive range. It would therefore be useful to show and discuss the reward landscape that the co-pre-training creates, which would at least answer the question of why there is a flat performance plateau at the very start of the learning curve for Walker-Run, only for CoPT-AIL. Note: using the reward smoothing regularizer also in the reward pre-training phase might solve the issue already, since it would allow AIL to start with a non-random yet smooth reward landscape. * With CoPT-AIL, the policy and reward are co-pre-trained (first the policy, then the reward). By the end of the pre-training phases, Q is still randomly initialized. Would it not make even further progress toward your goal of avoiding any negating effect to also pre-train Q? For example, with a method akin to fitted Q iteration? That however assumes that the expert data is available in triplets (state, action, next-state), since the next state is required for the Q target. This condition is a fortiori satisfied if the expert trajectories are available whole and ordered, without subsampling. * L95-96: aligning with how Ross and Bagnell 2010 ("Efficient Reductions in IL") defined their non-stationary policies, I think it would be clearer for the authors to qualify the policies, dynamics, and reward structures introduced as non-stationary in the text, grounding the exposition in the pre-existing literature and improving precision. Do the concepts introduced by the authors not align with those in Ross and Bagnell 2010? * The point made by the authors about the presence of a difference of value with an untrained reward in the bound of Prop 1 is sound. That being said, would it not be best to leave the second term of what the authors call "reward error" out of the curly brace? The same goes for the very last term of the bound. * How many expert demonstrations were in use? (In the appendix the authors write how many expert demonstrations were collected from the experts, but it is unclear how many were used for the experiments shown on the plots.) * How does the method fare w.r.t. various numbers of available demonstrations? * Is the relative policy gap increasingly reduced compared to the baselines, as the developed theory would dictate? * It would be interesting to see how CoPT-AIL reacts to different policy and reward architectures, considering how sensitive RL agents generally are to their reward landscape.
For example, a reward designed from the error of a powerful diffusion model would, by its representational capacity, probably overfit the reward signal to the pre-training reward learning signal CoPT-AIL introduces, compared to a weaker model. If that process makes the reward landscape particularly non-smooth and peaked, AIL would have to deal with an initially sparse reward function, which would probably be harder to optimize than a randomly initialized reward. Style, typos, suggestions: * The shorthand "relative policy evaluation error" is rather confusing, as it is part of the reward error, which is distinct from the policy error. It could simply be called reward error if, in Prop. 1, the authors were to call the first term of the bound the (scaled) reward error and the second term of the bound the "approximation error", since it grows to 0 as the dataset size tends to infinity. * The authors should state in the plot legends what the dotted horizontal lines represent. * [minor] Exposition is very clear; the first part of 4.1, however, is unnecessarily long for how well-known its contents are. Fully human-written
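The smoothing suggestion in the review above (applying the gradient-penalty regularizer of Blondé et al. already during the reward pre-training phase) would look roughly as follows; this is a generic WGAN-style penalty written as an illustration, not the exponential regularizer the authors actually use.

```python
import torch

def reward_gradient_penalty(reward_net, sa_expert, sa_policy, coef=10.0):
    """Smoothness penalty on the pretrained reward, in the spirit of a WGAN-style
    gradient penalty; sa_expert / sa_policy are (B, s_dim + a_dim) batches."""
    eps = torch.rand(sa_expert.size(0), 1, device=sa_expert.device)
    sa_mix = (eps * sa_expert + (1 - eps) * sa_policy).requires_grad_(True)
    r = reward_net(sa_mix).sum()
    grad = torch.autograd.grad(r, sa_mix, create_graph=True)[0]
    return coef * (grad.norm(2, dim=1) - 1.0).pow(2).mean()
```

Adding such a term to the reward pre-training loss would give AIL a non-random yet smooth starting landscape, which is the property the plateau discussion in the review calls for.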
Diffusion Bridge Variational Inference for Deep Gaussian Processes Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes Diffusion Bridge Variational Inference (DBVI), a novel method for posterior inference in Deep Gaussian Processes (DGPs). It aims to solve a key limitation of its predecessor, Denoising Diffusion Variational Inference (DDVI), namely the inefficiency of starting the reverse diffusion from a fixed, unconditional Gaussian prior (a "cold start"). The core idea of DBVI is to replace this fixed start with a learnable, data-dependent "warm start" distribution, $p_0^\theta(U_0|x) = \mathcal{N}(U_0; \mu_\theta(x), \sigma^2 I)$. The paper introduces two key technical innovations to make this work. First, it formally re-interprets the diffusion process as a Doob-bridged diffusion, grounded in Doob's h-transform, which allows the derivation of a new, tractable ELBO objective. Second, to make the initial distribution $p_0^\theta$ scalable and avoid conditioning on the full dataset, it proposes a structured amortization network $\mu_\theta$ that cleverly conditions on the layer-wise inducing inputs $Z^{(l)}$ as proxies for the data. The paper's core strength is its theoretical rigor. It correctly identifies a clear weakness in DDVI (the "cold start") and proposes a non-trivial, principled solution. Grounding the "warm start" in the mathematics of Doob's h-transform allows for a clean derivation of the ELBO and proves that DBVI is a strict generalization of DDVI. The most innovative practical contribution is the structured amortization design. Naively conditioning $\mu_\theta$ on the raw data $x$ would be infeasible. The proposal to use the learnable inducing inputs $Z^{(l)}$ as a data-dependent, low-dimensional proxy is an elegant and effective solution that neatly sidesteps issues of dimensionality and data dependency. The paper's central hypothesis, that a "warm start" improves inference efficiency, is directly and convincingly validated by the case study in Figure 4. This plot clearly shows DBVI converging significantly faster and to a better final RMSE than DDVI, confirming the mechanism works as intended. The experimental evaluation is incomplete for a paper claiming state-of-the-art posterior approximation. A major, competing line of work for expressive DGP posteriors, Normalizing Flows (NFs), is entirely absent from the related work and experimental comparisons. Without benchmarking against NF-based VI methods, the claims of superior posterior quality and accuracy are unsubstantiated. Missing Practical Baseline: The paper fails to establish a practical "sanity-check" baseline. For the image classification tasks, the DGP is applied to features from a ResNet-20. The performance of this feature extractor backbone alone must be reported. If the final, highly complex 4-layer DBVI model (which achieves 95.68% accuracy on CIFAR-10) does not substantially outperform the ResNet-20, it implies the entire DGP/DBVI machinery adds significant complexity for little to no practical gain. Unfavorable Complexity-Performance Trade-off: This is the most significant weakness. The paper advocates for a method that is substantially more complex than its predecessor.
It requires an SDE solver for $s_\phi$, a new NN for $\mu_\theta$, and an ODE solver for $(m_t, \kappa_t)$. The justification for this complexity rests on predictive gains that are empirically marginal (e.g., 95.68% vs 95.56% on CIFAR-10; 0.859 vs 0.857 AUC on HIGGS). This trade-off makes the practical utility of DBVI highly questionable. While Table 1 provides per-iteration timings, the paper lacks a formal analysis of the additional computational overhead. It should provide a breakdown of the cost of the $\mu_\theta$ forward pass and the $(m_t, \kappa_t)$ ODE solver, and discuss how these new costs scale with the number of inducing points ($M$) and layers ($L$). Baselines: Could the authors provide results for two critical missing baselines? a. A competing SOTA expressive VI method, such as a Normalizing Flow-based DGP. b. The "backbone-only" baseline for the CIFAR-10 experiment (i.e., the accuracy of the ResNet-20 feature extractor). Given that the strongest empirical result is the convergence speedup (Fig. 4), while the final accuracy gains are marginal, would the authors agree that the primary contribution of DBVI is in inference efficiency rather than predictive accuracy? If so, I would recommend reframing the paper's narrative to emphasize this. Could the authors provide a more detailed breakdown of the additional computational cost of DBVI over DDVI? Specifically, what is the wall-clock cost and scaling behavior of the $\mu_\theta$ forward pass and the $(m_t, \kappa_t)$ ODE solver? The $Z^{(l)}$-amortization is a clever feedback loop. How sensitive is this mechanism to the initialization of the inducing inputs $Z^{(l)}$? Does a poor $Z^{(l)}$ initialization lead to a poor $U_0$ start, which in turn hinders the model's ability to find good $Z^{(l)}$ locations? Fully AI-generated
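To fix intuition for the $Z^{(l)}$-amortized warm start discussed in the review above, here is a minimal sketch; the network architecture, dimensions, and the fixed $\sigma$ are assumptions, and the real method additionally couples this start distribution to the Doob-bridged reverse SDE.

```python
import torch
import torch.nn as nn

class AmortizedStart(nn.Module):
    """Hypothetical mu_theta: maps the layer-wise inducing inputs Z^(l) to the mean of
    the data-dependent start distribution p_0(U_0 | x) = N(mu_theta, sigma^2 I)."""
    def __init__(self, d_in, d_out, sigma=0.1, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d_out))
        self.sigma = sigma

    def sample_U0(self, Z):            # Z: (M, d_in) inducing inputs of one layer
        mu = self.net(Z)               # (M, d_out) warm-start mean for the inducing outputs
        return mu + self.sigma * torch.randn_like(mu)   # U_0 ~ N(mu, sigma^2 I)
```

The point of the sketch is only that the reverse diffusion starts from a $Z$-dependent Gaussian rather than a fixed prior, which is what makes the extra forward pass (and its cost) unavoidable.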
Diffusion Bridge Variational Inference for Deep Gaussian Processes Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper generalizes denoising diffusion variational inference (DDVI) for deep Gaussian processes (DGPs) by replacing the unconditional starting distribution with a learnable, data-dependent initial distribution and reinterpreting the DDVI framework, through the incorporation of Doob's h-transform, as a diffusion bridge model (DBVI). The proposed method can reduce the gap between the prior and the posterior in the diffusion process and hence speed up the training. A few benchmarks are included to demonstrate the improvements in accuracy, convergence speed, and posterior quality. It elegantly extends DDVI with a Doob-bridge modification. The theory behind this extension is sound and neat. The derived loss clearly connects to that of DDVI, and it is straightforward to spot the innovation. However, the numerical experiments could be improved. The benchmark results mainly focus on estimation/prediction accuracy, not much on the posterior analysis. See more comments in the questions below. I would rate 5 if allowed and will raise my score if the numerical evidence is strengthened. 1. There is only Figure 4, arguably illustrating behavior after convergence, on the small dataset. It appears DDVI reduces the error faster initially (before 50 iterations), with the gap closing around 200 iterations. How about after 200 iterations? Can you include more such plots to demonstrate "faster convergence"? 2. Claiming better posterior quality, the authors only show reconstruction RMSE and test log-likelihood, which may not adequately evaluate the quality of the learned posterior. How about the empirical KL, if you can test it in some simulation? Or the coverage rate of credible intervals over the true values? 3. Is the computational cost proportional to the number of layers of the deep GP? How does the diffusion time step K interact with the number of layers L? 4. Minor suggestion: are $d_{in}$ and $d_{out}$ layer-specific? If so, should they carry a layer index $l$? Fully human-written
Diffusion Bridge Variational Inference for Deep Gaussian Processes Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes Diffusion Bridge Variational Inference (DBVI), an extension of the recently introduced Denoising Diffusion Variational Inference (DDVI) method for Deep Gaussian Processes. While DDVI models the variational posterior as the terminal distribution of a reverse-time diffusion SDE parameterized by a neural score network, it suffers from an unconditional Gaussian initialization that is typically far from the true posterior, resulting in long, inefficient inference trajectories. DBVI addresses this limitation by making the diffusion start data-dependent, using an amortized initialization. Using Doob's h-transform, the authors reinterpret the reverse SDE as a bridge process that explicitly "bends" the diffusion between a start and an end distribution. Empirical results across regression, classification, and image-reconstruction benchmarks show that DBVI consistently improves predictive accuracy, posterior quality, and convergence speed over DDVI and other baselines. - The paper identifies the unconditional start distribution in DDVI as the source of slow convergence and inaccurate posteriors, and replaces it with a data-conditioned start via a Doob bridge. - The use of a linear reference drift ensures that even after introducing endpoint conditioning, the bridge process has closed-form Gaussian marginals at all intermediate times. - Like DDVI, the variational posterior is defined implicitly by a reverse diffusion, which is significantly more flexible than the mean-field or low-rank Gaussian approximations typically used in DGP inference. - Improvements are observed not only on small UCI datasets but also on large datasets (like HIGGS/SUSY), image classification (more subtle), and the Frey Faces task. - The paper explains the Doob correction, but the experimental section gives limited direct visualization of how the bridge shortens the reverse diffusion trajectory (e.g., path length, KL rate, or score-norm decay). This makes it harder to see the effect that drives the performance gains. - The method (in my understanding) depends crucially on using a linear reference diffusion so that intermediate marginals remain Gaussian and h is tractable. If the model or dataset induces posteriors that are far from those implied by linear reference dynamics, the reference score may become a poor baseline and learning may slow or plateau. - The model fixes the covariance and learns only the mean. How sensitive is DBVI to this choice? Does performance or convergence degrade meaningfully if the variance is misspecified, and can a learned covariance or low-rank structure further improve inference? - The Doob bridge construction relies on the fact that the forward reference diffusion has closed-form Gaussian marginals at all t; does learning the covariance disrupt this? - More generally, what are the limitations of tying amortization exclusively to the inducing input grid? - Could there be a diagnostic metric that quantifies how much the data conditioning shortens the reverse path? Heavily AI-edited
Diffusion Bridge Variational Inference for Deep Gaussian Processes Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes Diffusion Bridge Variational Inference (DBVI), a new approach for performing inference in Deep Gaussian Processes (DGPs). DBVI builds on Denoising Diffusion Variational Inference (DDVI) but improves it by learning how to start the reverse diffusion process from a more informed, data-dependent initialization instead of a random prior. This helps the model start closer to the true posterior and improves both accuracy and efficiency. The authors interpret this modification through Doob’s h-transform, giving a bridge-based view of the diffusion process that keeps the method theoretically consistent while making it more flexible. They also describe a practical inference scheme based on inducing points to make training scalable. In experiments on regression, classification, and unsupervised learning tasks, DBVI shows consistent improvements over DDVI and other standard inference methods for DGPs such as DSVI, IPVI, and SGHMC. Originality: The paper proposes the novel idea of reinterpreting DDVI as a kind of diffusion bridge using Doob’s h-transform. This leads to a principled way of conditioning the diffusion process on input data. The use of an input-dependent initialization for diffusion-based inference is conceptually elegant and gives a new perspective on how diffusion models can be adapted for Bayesian inference. Quality: The technical development is solid and well thought out. The authors clearly connect their bridge formulation to the underlying variational objective. The experiments are thorough, covering regression, classification, and unsupervised learning, and DBVI consistently outperforms DDVI and other inference methods like DSVI, IPVI, and SGHMC. Clarity: Overall, the paper is well written and easy to follow. The authors do a good job of explaining the motivation behind their changes to DDVI. The main ideas are presented in a logical order, and the appendix includes helpful details for implementation. Significance: This work makes a meaningful contribution to improving inference in deep Gaussian process models. By addressing a key limitation of DDVI and showing consistent gains across several tasks, the paper offers a practical improvement that should be useful to researchers. Primary Weakness: Magnitude of improvements: While DBVI consistently outperforms DDVI and other methods, the numerical improvements are sometimes modest (e.g., some overlapping error bars in Figure 3). A discussion of whether these gains translate to meaningful practical differences would strengthen the empirical section. Minor Weaknesses: Figure readability: The font size in Figures 1, 2, and 3 is quite small, making some labels difficult to read without zooming in a lot. This is a minor visual issue that can easily be fixed for the camera-ready version. Formatting issue: The arrow in Figure 2 partially obscures the word “likelihood.” 1: Could the authors comment on the computational cost of the additional amortization network used to initialize the diffusion bridge? 
In particular, does this learned initialization introduce noticeable overhead compared to the standard DDVI setup? 2: In Figure 3, the improvements over DDVI are consistent but often modest. Could the authors comment on whether larger benefits might appear in other settings and why? 3: Do the authors plan to release code and trained models for reproducibility and to facilitate adoption by the community? Fully AI-generated
Endogenous Communication in Repeated Games with Learning Agents Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper analyzes endogenous communication among learning agents in infinitely repeated stage games with a costless pre-play channel. Each agent compresses its private signal via an encoder subject to an information budget, then plays the stage game; policies are updated by no-regret learning, while encoders optimize a myopic value-minus-information objective. 1. The paper cleanly ties cheap talk to information bottlenecks: it formalizes value-sufficiency, defines a budget threshold, proves existence of efficient communication above the threshold, and establishes a necessary pooling structure with an explicit welfare-gap lower bound below it. These results offer actionable predictions about when emergent messages become informative and when they collapse. 2. The stability notion is coupled to no-regret policy updates and information-penalized encoder updates, with a finite-sample convergence guarantee under standard step sizes and ergodicity. The provided alternating scheme makes the framework concrete. 1. The paper is incomplete and lacks a great amount of detail. The proofs are only sketches. 2. Many assumptions are strong and unjustified. NA Fully human-written
Endogenous Communication in Repeated Games with Learning Agents Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The authors study a model in which no-regret learning agents are augmented with the ability to send costless messages to each other. I think the intersection of agent communication and learning in games can produce interesting settings and research directions. The paper is very poorly written. There are 10 references, some of which are only tangentially related, and none of which are even mentioned in the main body, unless I have missed something. The proofs are too informal and vague, and have several non-sequiturs. The setup is not specific enough. This is not a length issue either; the paper is only six pages long including appendix, and the extra length could easily have been used to provide much more relevant detail. These writing issues alone are enough to recommend rejection. I implore the authors to add more detail. The setting certainly looks interesting enough that there could be some interesting results and analysis in this paper, but the writing issues meant that I gave up on attempting to parse the paper before being able to come to a complete understanding of what the claims and techniques are. None. Fully human-written
Endogenous Communication in Repeated Games with Learning Agents Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper studies pre-play communication in infinitely repeated games. Each agent observes a private signal and sends a discrete message via an encoder constrained by a mutual information budget. Policies are learned by mirror descent; encoders maximize expected continuation value minus $\lambda$ times the mutual information. The authors define a "stable communicating equilibrium" where policies are best responses, encoders are budget-optimal, and learning converges. They show that: (1) if budgets exceed a problem-specific threshold κ*, value-sufficient messages enable efficient payoffs; (2) below the threshold, any equilibrium pools signals into a finite partition of size bounded by $\exp(\kappa)$, implying a welfare gap; and (3) standard no-regret dynamics are sufficient to reach a near-stable point with $O(1/\epsilon^2)$ data. 1. The paper poses a clear, meaningful problem and introduces a formulation that links repeated-game incentives with information-constrained pre-play communication, with a notion of stable communicating equilibrium. 2. The thresholding results given by Theorems 1–2 are clean and interesting: when the information budget exceeds a problem-specific threshold, value-sufficient messaging can implement efficient outcomes; when it does not, any equilibrium must pool signals, leading to an unavoidable welfare loss. 1. The writing is often unclear. Key terms such as the formal definition of V, the notion of Lipschitz continuity, and the exact meaning and role of the learning rate $\eta$ are never properly defined. It is also confusing to bundle assumptions about the game itself and the learning algorithm into one block. The reference to the "standard folk theorem" should be made explicit rather than assumed. 2. The proofs are mostly brief sketches and difficult to follow. The theorems are not stated in a fully formal way, and several terms used in them are never clearly introduced. 3. The related-work discussion is thin. It mentions prior directions in broad terms but does not cite or compare against specific, closely related papers. Overall, the paper is very hard to follow, especially for readers who are not already experts in all the relevant literatures. Clearer structure and more careful exposition would make it far more readable. 1. Could the authors clearly define all notation and formally state each theorem, giving precise definitions for verbal notions and complete proofs instead of sketches? The paper is quite hard to follow, and clearer formalization would make it easier to evaluate. 2. In Theorem 3, the learning-rate choice $\eta_t \propto t^{-1/2}$ appears inconsistent with Assumption 1's requirement that $\sum_t \eta_t^2 < \infty$; could the authors resolve this apparent contradiction? Heavily AI-edited
Provable Guarantees for Automated Circuit Discovery in Mechanistic Interpretability Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a framework for provable circuit discovery in mechanistic interpretability, offering formal guarantees of input robustness, patching robustness, and minimality. It introduces two algorithms—a greedy local-minimality search and a blocking-set duality method—for discovering circuits with verified faithfulness using α–β-CROWN. Experiments on MNIST, CIFAR-10, GTSRB, and TaxiNet confirm 100 % robustness within specified domains but show high computational cost. 1. Clear definitions and proofs for robustness and minimality in circuit discovery. 2. The greedy and hitting-set approaches are well motivated and practically implementable. 3. Connects mechanistic interpretability with formal verification through provable guarantees. 1. High computational cost: Verification with α–β-CROWN for each candidate circuit is extremely slow; scalability remains a bottleneck. 2. Small experimental scope: Only small convolutional and MLP networks are tested; no evidence on larger or modern architectures. 3. Limited interpretive analysis: The paper emphasizes correctness and robustness, but offers little discussion of what the discovered circuits mean semantically. 4. Evaluation imbalance: Most comparisons are against heuristic baselines without runtime or coverage trade-offs clearly analyzed. A recent paper, “Learning Minimal Neural Specifications” (Geng et al., NeuS 2025), finds that a small subset of neurons can often characterize a model’s robust behavior. Your minimal-circuit formulation appears conceptually related. Could you discuss the connections between these phenomena? Fully AI-generated
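Schematically, the greedy local-minimality search described in the review above amounts to repeated certified-deletion attempts against a verifier oracle; the sketch below is a paraphrase with placeholder names (the `verify` call standing in for an α,β-CROWN query), not the paper's exact algorithm.

```python
def greedy_local_minimal_circuit(components, verify):
    """Greedy deletion sketch: `components` is the full set of model components,
    `verify(circuit)` is an oracle returning True iff the circuit is certified
    faithful on the chosen input/patching domain (e.g., an alpha-beta-CROWN query)."""
    circuit = set(components)
    assert verify(circuit), "the full model itself must verify"
    changed = True
    while changed:
        changed = False
        for c in list(circuit):              # try dropping one component at a time
            candidate = circuit - {c}
            if verify(candidate):            # still provably faithful without c
                circuit = candidate
                changed = True
    return circuit                           # locally minimal: no single removal verifies
```

Each `verify` call is itself a full certification query, which is why the review flags per-candidate verification cost as the main scalability bottleneck.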
Provable Guarantees for Automated Circuit Discovery in Mechanistic Interpretability Soundness: 4: excellent Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. In the context of mechanistic interpretability for neural networks, the authors use neural network verification methods to come up with circuits with provable guarantees. The authors claim to outperform conventional circuit discovery methods, i.e., they provide stronger robustness guarantees for the discovered circuits. - The problem addressed is impactful, and the authors identify a clear blind spot in the literature. - Thorough theoretical contribution. - Experiments rely on the VNN-COMP community benchmark. Results show the authors' method strongly outperforms the chosen baselines across all experiments. - The paper is clear and flows well; the authors' contribution is also clear. - The literature review is carried out with diligence. I am not a domain expert - fellow reviewers with greater expertise may have identified gaps. - Some neural network verification concepts could have been introduced in more detail (e.g., "patching"). - The lack of a running example makes it hard to ground the method in a real-world, impactful use case. - I could not find a discussion of the computational overhead of the method w.r.t. the baselines. Could you briefly post a comment about it? (Apologies if I have missed it.) Fully human-written
Provable Guarantees for Automated Circuit Discovery in Mechanistic Interpretability Soundness: 4: excellent Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper focuses on automated circuit discovery in neural networks, one of the key challenges of mechanistic interpretability (MI). Current circuit discovery methods are heuristic and rely on sampling and approximations, therefore failing to provide theoretical guarantees about the obtained circuits (faithfulness and robustness to perturbations). The authors introduce a novel framework that uses neural network verification testing to discover circuits with three types of provable guarantees: - Input-domain robustness, ensuring that the circuit's behavior remains faithful across a continuous input region. - Patching-domain robustness, guaranteeing faithfulness under a continuous range of perturbations (patching) to the activations of non-circuit components. - Minimality: The authors introduce and formalize four types of circuit minimality, hierarchically ordered (quasi-, local, subset- and cardinal minimality) and provide algorithms to either obtain or approximate those. Some of the key technical contributions of this work include: - A siamese network encoding, which allows standard verifiers to certify the aforementioned properties - Identifying a "circuit monotonicity" property that enables stronger minimality guarantees - Using a circuit-blocking-set duality based on Minimum Hitting Sets (MHS) to find cardinally-minimal circuits. The authors validate their approach on several vision models and benchmarks, obtaining circuits with substantially stronger robustness guarantees than sampling-based baselines. - The core contribution (applying formal verification to automated circuit discovery) is novel and significant. It directly addresses a fundamental and critical limitation in the field of MI (moving from heuristic approximations to provable guarantees), which has been acknowledged in many recent works in the literature. This work makes a significant step towards increasing the reliability of MI methods and is much-needed. - The paper is technically excellent. The formalizations are clear, and the introduced hierarchy of minimality guarantees (Definitions 3 through 6) clarifies a concept often used loosely in the literature. All of the stated theorems are proved in depth in the Appendix, which is rare to find in this type of work. - The experiments clearly support the paper's claims. The authors use the SOTA α,β-CROWN verifier on standard benchmarks, and their proposed methods achieve 100% certified robustness (Tables 1 and 2), whereas sampling-based methods largely fail. Finally, the experiments in section 5.3 are a good illustration of the trade-off between minimality strength (as defined in the proposed hierarchy) and runtime. - The paper is very well-written, given the complexity of the topic. - The most significant weakness (which the authors do acknowledge) is the scalability of the neural network verification methods. The experiments are conducted on relatively small vision models (e.g. ResNet2b) yet already require computation time in the minutes to hour range. 
Current MI research largely focuses on LLMs and other models built on the Transformer architecture, and as proposed, the authors' framework would likely be computationally intractable for such models. This is an inherent challenge of the verification field rather than a flaw in the paper's methodology, and it doesn't detract from the foundational contribution of the work, but it does limit the immediate practical applicability to SOTA models. - The paper could be strengthened by adding a more detailed qualitative analysis of the discovered circuits. For example, the provably robust circuit depicted in Figure 1 contains an extra component. A natural question is: what functional role does it play? An illustrative example would make the benefits of the approach even more tangible and would also be interesting from a research perspective. - The paper uses a logit-difference metric for faithfulness, which works well for verification and is a standard choice in MI, but may not always accurately reflect the semantic meaning of a circuit that one may be interested in. It is not entirely clear if a small logit difference is always the most meaningful proxy for a circuit "doing the same thing" as the model (a circuit could potentially maintain a small logit difference but modify its internal representations or attributions in a way which may be significant). However, this is, again, the de facto standard used in most MI research, so this is a fairly minor point. - Have the authors considered adapting the framework to certify different properties? For instance, instead of bounding the logit difference (which can be problematic, as stated above), could it be adapted to certify that the predicted class remains invariant across the input domain? This seems like a natural guarantee for classification. - The complexity of MHS depends on enumerating blocking sets up to size t_max. What was the practical size of the blocking sets found in the experiments, and how large did t_max need to be to provide good lower bounds or find the minimal circuit? (My apologies if this is answered in the attached codebase, which I did not consult.) - For the example in Figure 1, do the authors have any functional hypothesis for why the components highlighted in green are essential for robustness (handling specific edge cases)? What kinds of input perturbations would cause the sampling-based circuit to fail, which the provably-robust circuit correctly handles? - Do the authors have any insight into potential ways to apply this framework to larger models/architectures, such as Transformers? Are there specific verification techniques that offer a better trade-off for this application? Fully human-written
From Curiosity to Caution: Mitigating Reward Hacking for Best-of-$N$ with Pessimism Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. * The paper aims to mitigate the performance degradation of the Best-of-N sampling scheme for large values of $N$ ("reward hacking") by constructing a weighted sampling scheme which combines reward and uncertainty scores. * In the model, a language model stochastically maps a prompt $x$ to a response $y$, and a reward model $r$ is a mapping from prompt-response pairs $(x,y)$ to scores. Given reward model $r$, uncertainty estimate $u$, and hyper-parameter $\lambda$, the proposed approach samples candidates according to the "pessimistic" reward $r_\mathrm{LCB} = r - \lambda u$. The uncertainty estimate is inspired by random network distillation, aiming to detect OOD datapoints. * A theoretical analysis is briefly described in the body of the paper, and presents a simplified setting in which pessimistic sampling converges towards the optimal-reward response but the absolute gap between naive and pessimistic sampling grows with $N$. * The empirical analysis evaluates performance on three datasets, showing favorable performance compared to the Best-of-$N$ baseline. * Addresses a well-motivated and timely issue. * The proposed method seems relatively simple to implement, suggesting potential practical applicability. * The presentation is generally clear and easy to follow. * Despite the focus on empirical analysis, the analysis code doesn't seem to be attached. * While related prior work is mentioned (e.g., Huang et al. 2025, Jinaai et al. 2024), the empirical evaluation doesn't seem to include it as a possible baseline. * The theoretical analysis underlying Section 2.3 seems to be relatively intricate, but the body of the paper presents it only briefly. In particular, key assumptions, their limitations, implications of the analysis, and proof techniques do not seem to be discussed. * Is it possible to elaborate on the intuitive assumptions and limitations underlying the theoretical analysis? * What are the general limitations of the proposed approach, and when is it expected to fail? * How does the method handle cases in which the ground-truth reward has "true" label noise (e.g., topics on which annotators typically have diverse preferences)? * Is it possible to formulate the proposed method as imposing a "soft distance constraint" between the training-set and candidate-response distributions? For example, if I understand correctly, it seems that only "in-distribution" candidates will be sampled when $\lambda \to \infty$ and $u$ is a perfect OOD detector, because any OOD candidates will be penalized heavily. Does this intuition hold? And if so, what can be said about intermediate values of $\lambda$? Fully human-written
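To illustrate the pessimistic selection rule $r_\mathrm{LCB} = r - \lambda u$ summarized in the review above, a minimal sketch follows; the z-score normalization is an assumption for illustration and all names are hypothetical, not the authors' implementation.

```python
import numpy as np

def pessimistic_best_of_n(rewards, uncertainties, lam=1.0):
    """Select a candidate index by a pessimistic (lower-confidence-bound) score.

    rewards: reward-model scores for N candidate responses.
    uncertainties: uncertainty estimates (e.g., feature-reconstruction errors).
    lam: trade-off weight; lam=0 recovers vanilla Best-of-N.
    """
    r = np.asarray(rewards, dtype=float)
    u = np.asarray(uncertainties, dtype=float)
    # z-score both signals so a single lam is comparable across prompts
    # (an assumption here; the paper's exact normalization may differ)
    r = (r - r.mean()) / (r.std() + 1e-8)
    u = (u - u.mean()) / (u.std() + 1e-8)
    score = r - lam * u  # pessimistic reward r_LCB = r - lam * u
    return int(np.argmax(score))
```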
From Curiosity to Caution: Mitigating Reward Hacking for Best-of-$N$ with Pessimism Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses reward hacking in Best-of-N (BoN) sampling. The authors introduce an elegant, lightweight, and practical solution called "caution." The method works by training a lightweight predictor network to mimic the internal features of a given reward model on in-distribution data. At inference time, the predictor's error on a new candidate response serves as an uncertainty penalty, which is subtracted from the original reward score. Extensive experiments demonstrate that this approach effectively mitigates reward hacking, leading to improved performance as N grows. - A core strength is the paper's excellent presentation and framing of the problem. Drawing a parallel between curiosity-driven exploration in RL and "caution" for avoiding OOD reward exploitation is an insightful and generally useful contribution. - The proposed solution is compelling in its simplicity and practicality. Training the auxiliary predictor network is a one-time offline cost (with unlabeled data), and the inference-time overhead is minimal as it leverages the same internal representations as the reward model. - The experiments are extensive and well-designed. W1. This is both a perceived weakness and a question. Please correct me if I misunderstood the setup. The "caution" mechanism defines "in-distribution" by learning the typical feature patterns of the base reward model. This is effective at penalizing novel responses that exploit weird loopholes, but it may fail to mitigate, and could even reinforce, the reward model's own systemic biases, which you mention in Lines 81-83 and in Figure 4. For instance, if a reward model was trained on data where humans consistently preferred verbose answers, its internal features would encode verbosity as a "normal" characteristic of high-quality responses. The caution predictor would learn this pattern and would consequently fail to penalize a verbose, incorrect, reward-hacking response. It might even penalize a concise, correct response for being stylistically "out-of-distribution." In essence, the method regularizes generations toward the central tendencies of the reward model, and if those tendencies are themselves flawed, the method risks preserving those flaws. If doing so is indeed shown to mitigate hacking, why do you think hacking happens to begin with? W2. What is the most fundamental way to combine a reward signal and an uncertainty signal with BoN? Subtracting the uncertainty from the reward feels very natural from an RL / training point of view. However, BoN only cares about the ranking of the responses according to their rewards, making it insensitive to any shifts or rescaling. Is there a way to introduce uncertainty into the ranking itself, and how would you justify subtraction for inference-time alignment? I think this aspect is worth discussing in the paper; it would make your decision to apply a z-score transformation and then use a global hyperparameter to balance the reward against the uncertainty penalty more convincing.
Other work has considered, for example, a multiplicative exponential term to model uncertainty in reward scoring, motivating it well for policy optimization [1]. [1] Houliston, Sam, et al. "Uncertainty-penalized direct preference optimization." arXiv preprint arXiv:2410.20187 (2024). - How do you expect the "caution" mechanism to perform if the base reward model is systematically biased? - The paper's discussion of Huang et al. (2025b) is very clear. A direct empirical comparison on the same experimental setup would be appreciated, as it represents the most relevant alternative approach. - Why do you not consider a calibrated reward as done in InfAlign (Balashankar et al., 2024)? Since Best-of-N only cares about the ranking of the reward scores, such a calibration is theoretically motivated. Instead, you opt for a z-score transformation. Can you elaborate on lines 300-302? In particular, did you perform the z-score transformation per prompt or across prompts? - You may consider citing [2]. They use the reward model's learned representation of the prompt-response pair to model its uncertainty. They argue that when a data point is (almost) in distribution, there is less uncertainty. [2] Zhang, X., Ton, J. F., Shen, W., Wang, H., & Liu, Y. (2024). Overcoming reward overoptimization via adversarial policy optimization with lightweight uncertainty estimation. arXiv preprint arXiv:2403.05171. Fully human-written
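The "caution" uncertainty described in the two reviews above can be sketched as a reconstruction-error score over frozen reward-model features. The snippet below is one plausible reading (a bottlenecked predictor over intermediate-layer features); the paper's exact predictor input/target pairing and architecture may differ, and all names are illustrative.

```python
import torch
import torch.nn as nn

class CautionHead(nn.Module):
    """Lightweight predictor trained to reconstruct frozen reward-model features.

    High reconstruction error on a new response is read as the reward model being
    'unfamiliar' with it, i.e., an OOD / uncertainty signal. Sizes are illustrative.
    """
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim)
        )

    def forward(self, rm_features):
        return self.net(rm_features)

def uncertainty(caution_head, rm_features):
    """Per-example reconstruction error used as the pessimism penalty u(x, y)."""
    with torch.no_grad():
        pred = caution_head(rm_features)
    return ((pred - rm_features) ** 2).mean(dim=-1)

# Offline training on unlabeled 'typical' responses would minimize the same error:
# loss = ((caution_head(rm_features) - rm_features) ** 2).mean()
```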
From Curiosity to Caution: Mitigating Reward Hacking for Best-of-$N$ with Pessimism Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes Best-of-N with pessimism to mitigate potential reward hacking issues in test-time scaling. Specifically, they capture "caution" by training a lightweight predictor network to reconstruct reward model features, and the pessimistic score is larger when the prediction error is lower (i.e., the reward model is more familiar with the response pattern). Experimental results on GSM8K, MATH-500, and BBH demonstrate that their method is stronger than the Best-of-N baseline. - The proposed pessimistic score is reasonable, as it accounts for the reward model's familiarity with the input. The reward design thus becomes the original reward minus the prediction error, and is clear and easy to implement. - Experimental results demonstrate that the proposed pessimistic score is effective when applied to the Best-of-N method. What's more, it is also useful for OOD-domain test-time scaling. - The authors did not compare with other strong test-time scaling methods, such as self-consistency. So it is unclear whether their method is widely applicable to other test-time scaling methods, including state-of-the-art inference-time scaling methods. - The proposed method is not as lightweight as the authors claimed. In fact, it requires extra training for each reward model. That means that for every reward model on a specific task, a new predictor network must be trained for that specific reward model. - The layer number L is a hyperparameter, and it is unclear how they select it. The authors only mention in the appendix that they use L=10. It is unclear whether this hyperparameter can be applied to other settings / domains / reward models. See weaknesses above. Fully human-written
From Curiosity to Caution: Mitigating Reward Hacking for Best-of-$N$ with Pessimism Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper addresses the problem of reward hacking in Best-of-N sampling, where, due to biases of the reward model, suboptimal selections might be made. The authors propose "caution," an inference-time technique that instantiates the principle of pessimism from reinforcement learning. It works by training a lightweight "predictor" network to predict the internal feature representations of the frozen reward model on a dataset of typical responses. The prediction error of this network is then used as an uncertainty score. This score is subtracted from the reward model's output to create a "pessimistic" reward, which penalizes responses that are out-of-distribution (OOD) from the perspective of the reward model. The authors demonstrate this approach with significant gains on GSM8K, MATH-500, and BigBench-Hard reasoning tasks. - The paper addresses reward hacking, an important and persistent problem that remains despite a large number of papers on the subject. - The experimental evaluation is comprehensive and well-designed. The testing across different distributions and domains (GSM8K, MATH-500, BBH) convincingly demonstrates the robustness of the approach. - The ablation studies in Section 3.2 are thorough and successfully validate the key design decisions, especially the critical importance of using reward model features over random ones. - The proposed method is computationally efficient and practical. The predictor network is trained fully offline, and at inference time, its forward pass can be parallelized with the reward model's, adding minimal overhead. This makes it a readily applicable solution for practitioners already using BoN sampling. - Weak theoretical justification: it is unclear how the insights from this linear-Gaussian model translate to the complex, high-dimensional geometry of transformer feature spaces. - Stronger baselines: the paper could be strengthened by comparing "caution" to other plausible inference-time uncertainty estimation or OOD detection techniques. There has also been a lot of work on reward hacking mitigation, so a comparison with more techniques would be appreciated. - Qualitative examples: the paper would benefit from more qualitative study of the hacking mitigation. - Human alignment: more human evaluation would strengthen the usefulness of the approach for most tasks. - The main comparison is to standard BoN. How might "caution" compare to other classes of OOD/uncertainty detection methods applied at inference time, such as using Monte Carlo dropout on the RM's final layers or calculating Mahalanobis distance in the feature space of the RM to penalize outliers? - The predictor Pθ is trained on typical responses from one policy (Llama-3.2-3B). How well would this pre-trained predictor generalize if used to score responses from a different, more capable base model? - The case study highlights the detection of formatting errors. Can you provide examples of more semantic or stylistic deviations that the method successfully identifies and penalizes? Moderately AI-edited
From Curiosity to Caution: Mitigating Reward Hacking for Best-of-$N$ with Pessimism Soundness: 4: excellent Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. In this work, the authors mitigate reward hacking by adding a pessimism or caution term to a reward model. Specifically, they focus on Best-of-N sampling and mitigate the selection of high proxy-reward, low true-reward generations by penalizing out-of-distribution samples where the proxy-reward estimate is uncertain and inaccurate. They do so by training a second network to reconstruct the proxy-reward model's latent representation. Then, on future data, they use the error of this reconstruction as a measure of the reward model's uncertainty. They regularize their reward with this uncertainty term, penalizing OOD samples that yield poor reconstruction while having abnormally high reward. * The method is intuitive and interesting in its own right: quantifying OOD-ness of data using a second model trained with a reconstruction loss. * The paper is clear and well-written; it provides a thorough study of the method and demonstrates its efficacy. * The empirical results demonstrate hacking mitigation on mathematical reasoning tasks. * The authors say they see "monotonic" performance improvements for all N, yet in my opinion the curves look more like a plateauing reward. Indeed, past work on hacking often frames the approach as avoiding over-optimization and recovering an "optimal" reward rather than expecting gains as N increases forever [1]. I don't think this claim is necessary for the impact of the paper, and I think some may disagree with it. * Notationally, it was a little confusing introducing $T$ as separate from $r$. I understand that you take some intermediate layer from $r$, but I might define that more explicitly in the paragraph on line 202 or streamline notation. [1] Khalaf, H., Verdun, C. M., Oesterling, A., Lakkaraju, H., & Calmon, F. D. P. (2025). Inference-Time Reward Hacking in Large Language Models * In Table 2, it seems you use $\lambda$ as a convex combination of reward terms, $\lambda r(x) + (1-\lambda) u(x)$. This is different from your regularization term introduced in line 236. Can you standardize this? * Intuitively, why do we expect any performance gains when using caution only? We aren't using any reward signal at all, and are just picking completions that are more in distribution. What are the costs in terms of coherency or diversity in the completions? * In Table 3, for Lightweight + Separate Emb.: How can Peak Acc be lower than Final Acc? * In your proofs, you condition on $\theta$, but $\theta$ is not a random variable, is it? Why are you conditioning on it? * Can you explain in Prop. 2 where you use $E[r^*|\hat{r}]$? Fully human-written
MVP: Memory-enhanced Vision-Language-Action Policy with Feedback Learning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents MVP, a memory-enhanced Vision-Language-Action model that addresses the limitations of Markovian VLA policies by incorporating episodic memory from historical observations and actions. The key technical contributions include a compact memory representation using adaptive pooling inspired by video understanding, and a novel feedback learning strategy based on SO(3) trajectory perturbation that encourages the model to leverage temporal information during training. 1. The SO(3) augmentation approach is well-motivated—by randomly rotating trajectories, the current observation alone becomes insufficient, forcing the model to genuinely leverage historical context rather than taking shortcuts. 2. The adaptive pooling strategy for compressing historical observations is simple yet effective, enabling the model to handle long histories (128 steps) while maintaining reasonable inference speed and memory footprint. 1. The paper doesn't compare against other potential augmentation strategies (e.g., SE(3), temporal jittering, or other forms of trajectory perturbation). The ablation only shows z-axis vs xyz-axis rotation—this is too limited to validate the design choice. Also, applying the same rotation to the entire trajectory seems artificial; real-world disturbances are typically non-uniform. 2. The authors acknowledge that most tasks in their training data (OXE) are Markovian, yet they claim to learn non-Markovian policies. How can a model learn meaningful temporal reasoning when 90%+ of training examples don't require it? The SO(3) augmentation feels like a patch rather than a principled solution to this data problem. 3. The adaptive pooling approach is presented as inspired by "video understanding techniques," but there's no comparison with actual video compression methods (e.g., the techniques from PLLaVA, Flash-VStream that they cite). Why is (2,2) pooling optimal? What information is lost? The inference speed comparison (Table 6) shows their method is slower than CogACT, undermining claims of efficiency. 4. The paper fails to cite several highly relevant recent works on temporal modeling in VLAs, such as TraceVLA (Zheng et al., 2024) and UniVLA (Bu et al., 2025b). The comparison also lacks recent strong baselines like π0, raising questions about whether the improvements hold against state-of-the-art methods. 5. The real-world evaluation consists of only a single task type (object swapping with 4 variants), which is insufficient to demonstrate generalization of the memory mechanism. The simulated experiments use SIMPLER, which is not considered state-of-the-art, and the baseline methods (RT-2-X, OpenVLA) are relatively outdated given the rapid progress in VLA community. 6. Unclear necessity of the approach: The paper's core motivation relies on the claim that most VLA tasks require non-Markovian reasoning, yet their own results show only modest improvements (4-9%) on SIMPLER benchmarks that supposedly don't require memory. 
The proof in Section 3.2 merely restates the Markov property without providing compelling evidence that real manipulation tasks actually violate this assumption—the swapping task seems cherry-picked to favor their method. See weaknesses. Fully AI-generated
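As a rough illustration of the SO(3) trajectory perturbation discussed in the MVP reviews above, the sketch below applies one random z-axis (yaw) rotation to an entire end-effector trajectory; the rotation range and whether positions, action deltas, or full poses are rotated are assumptions here, not the paper's exact recipe.

```python
import numpy as np

def zaxis_rotate_trajectory(positions, max_angle_rad=np.pi):
    """Apply one random z-axis (yaw) rotation to a whole trajectory.

    positions: (T, 3) array of xyz waypoints. The same rotation is applied to
    every step, so the current frame alone no longer determines the next action
    and the policy is pushed to rely on its history.
    """
    theta = np.random.uniform(-max_angle_rad, max_angle_rad)
    c, s = np.cos(theta), np.sin(theta)
    rot_z = np.array([[c, -s, 0.0],
                      [s,  c, 0.0],
                      [0.0, 0.0, 1.0]])
    return positions @ rot_z.T
```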
MVP: Memory-enhanced Vision-Language-Action Policy with Feedback Learning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces MVP, a robotic control model that integrates memory of past images and actions to solve complex tasks requiring long-horizon reasoning. Its core innovation is an SO(3) rotation-augmented training strategy that effectively forces the model to leverage historical information for decision-making. Experiments show MVP excels on simulated and real-world benchmarks, especially on memory-dependent tasks where traditional models fail. - The SO(3) augmentation strategy is a clever solution to prevent the model from "shortcut learning" and ignoring history during training. - The paper introduces a compact memory representation using adaptive pooling, which is highly effective, allowing the model to maintain high inference speeds and manageable memory usage, making it practical for real-world deployment. - The model achieves state-of-the-art performance in simulation and significantly outperforms baselines on a custom real-world task designed to be memory-dependent. - Real-world validation relies on a single task type ("object swapping"), leaving the model's generalization to other complex tasks unclear. - The paper convincingly shows MVP is better than a Markovian baseline (CogACT) on the memory task. However, it doesn't compare its performance against other models mentioned in the related work, such as GR-2 and RoboVLM, which also rely on historical states and images for action prediction. This omission makes it difficult to conclude that MVP's specific memory architecture and training strategy are superior to alternative designs. - The ablation study shows that full 3-axis SO(3) augmentation degrades performance, while z-axis only augmentation is optimal. The paper’s explanation is superficial and lacks deeper analysis. This omission leaves key questions about the method’s limitations and its true generalization capabilities unanswered. - While the failure modes of the baseline model are analyzed, the paper provides no corresponding analysis for the proposed MVP model, leaving its specific limitations unexamined. see weakness Moderately AI-edited
MVP: Memory-enhanced Vision-Language-Action Policy with Feedback Learning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a non-Markovian Vision-Language-Action (VLA) model that incorporates past observations and actions as auxiliary information, along with an augmentation method to enhance robustness. 1. The paper is easy to understand and follow. 2. The experimental design and completeness are relatively strong. 1. After reading the related work section, it remains unclear how this paper differs from prior studies and what specific advancements it contributes beyond existing methods. 2. Leveraging past information is a very common strategy — many previous imitation learning approaches have incorporated historical observations [1,2]. Therefore, presenting this as a main innovation makes the contribution seem somewhat incremental. Moreover, prior works [1] and many others have also discussed the copycat problem that arises when using past observations in behavioral cloning. [1] Fighting Copycat Agents in Behavioral Cloning from Multiple Observations. NeurIPS 2023. [2] Rethinking Latent Redundancy in Behavior Cloning: An Information Bottleneck Approach for Robot Manipulation. ICML 2025. 1. The paper states that “while a Markovian formulation suffices for simple, short-horizon tasks, it fundamentally restricts the robot’s ability to reason about temporally extended goals or to learn from the consequences of previous actions.” However, this limitation can be addressed by incorporating several past frames to compensate for the lack of historical information — an approach that some prior VLA models have already adopted. Therefore, the proposed method seems to mainly extend the temporal window by including additional past observations, together with the Legend data augmentation technique to improve robustness, which makes the contribution appear somewhat incremental. 2. In the introduction, the authors mention that “while memory is not strictly necessary for these benchmarks.” Could you clarify why memory is not required in the simulation benchmarks? 3. I am also confused about the use of SO(3) perturbations. Why do you introduce a distortion in the action space? Wouldn’t large perturbations make it more difficult for the model to learn the correct action mapping? 4. Finally, I am curious why setting the memory length to 8 yields the best performance. Why doesn’t increasing the amount of historical information further lead to continued improvement? Lightly AI-edited
MVP: Memory-enhanced Vision-Language-Action Policy with Feedback Learning Soundness: 4: excellent Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes MVP to overcome a fundamental weakness in current robotic manipulation models: they only consider the present state and cannot reason about temporal sequences. The authors argue that simply adding memory to these models is insufficient because they learn to ignore it, since most training tasks can be solved without historical context. Their main technical insight is to apply random rotations to robot trajectories during training, which breaks this dependency on current observations alone and forces the model to leverage past information. They compress the visual history using pooling techniques to maintain computational tractability. The experimental results reveal a modest 4-9% improvement on standard simulation benchmarks where memory is not essential, but a more convincing 2.5x performance gain on real-world tasks that genuinely require temporal reasoning. The work demonstrates that memory matters for complex manipulation, though it remains constrained by the scarcity of memory-dependent examples in existing training datasets. The paper makes solid contributions through its SO(3) trajectory augmentation strategy, which cleverly addresses the training shortcut problem where models ignore historical information. The theoretical motivation is sound, with clear explanation of why Markovian shortcuts emerge. The experimental design appropriately combines standardized benchmarks with custom memory-dependent tasks, and the 2.5x improvement on object swapping provides convincing evidence that memory matters for long-horizon manipulation. The presentation is clear with effective visualizations, and the problem formulation distinguishing MDPs from non-MDPs is both precise and accessible. The paper has two critical limitations. First, the real-world evaluation is insufficient, testing only a single task type (object swapping) with 4 variants and 40 demonstrations each, which provides weak evidence for generalization to diverse memory-dependent manipulation scenarios. Second, the absence of direct comparisons with memory-augmented baselines like GR-2 and RoboVLM makes it impossible to determine whether performance gains stem from the proposed SO(3) augmentation strategy or simply from incorporating memory itself, leaving the core contribution unclear. 1. Your theorem shows current state is sufficient for MDPs, and SO(3) augmentation breaks this sufficiency. However, can you provide empirical analysis of how the model learns to use history? Specifically, does it learn causal relationships between actions and state changes, or does it simply memorize rotation-specific trajectories? 2. Your method requires storing and processing historical observations, which becomes prohibitive for very long horizons. Have you explored hierarchical memory architectures where recent history is detailed but distant history is compressed into higher-level state summaries? Fully AI-generated
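The compact memory representation mentioned in the MVP reviews (adaptive pooling of per-frame vision tokens, e.g., down to a 2x2 grid per past frame) can be sketched as follows; token shapes and layout are assumed for illustration and may not match the paper's implementation.

```python
import torch
import torch.nn.functional as F

def compress_history(frame_tokens, out_hw=(2, 2)):
    """Compress a history of per-frame vision-token grids with adaptive average pooling.

    frame_tokens: (T, H, W, D) tokens for T past frames. Each frame's HxW grid is
    pooled down to out_hw (e.g., 2x2), so a 128-step history costs only
    T * out_hw[0] * out_hw[1] memory tokens.
    """
    T, H, W, D = frame_tokens.shape
    x = frame_tokens.permute(0, 3, 1, 2)            # (T, D, H, W) for pooling
    x = F.adaptive_avg_pool2d(x, out_hw)            # (T, D, 2, 2)
    return x.permute(0, 2, 3, 1).reshape(T, -1, D)  # (T, 4, D) compact memory tokens
```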
DIVERSE: Disagreement-Inducing Vector Evolution for Rashomon Set Exploration Soundness: 3: good Presentation: 4: excellent Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors introduce DIVERSE, a novel method to find other models in the ***Rashomon Set*** of a reference model. A *Rashomon Set* is defined as the set of models that have similar predictive accuracy on a given task. The DIVERSE algorithm is based on two main components: FiLM layers, which apply affine transformations to pre-activations, and CMA-ES for gradient-free optimisation of the parameters of the FiLM layers. Since CMA-ES is not scalable to high dimensions, DIVERSE uses it to optimise a low-dimensional vector $z$, which is then projected to higher dimensions using random fixed matrices. The authors evaluate their method using both prediction-based and probability-based metrics, in all cases after the last layer. They compare against two baselines: re-training using different seeds, and dropout-based Rashomon exploration. In terms of diversity, the generated models are generally less diverse than with full re-training, but at a runtime orders of magnitude lower. This paper proposes a relatively simple, inexpensive and elegant solution to the Rashomon Set exploration problem. S1: The proposed method is relatively simple, yet effective. This makes it very relevant for solving the Rashomon Set exploration problem. S2: The evaluation of the algorithm is good, with metrics covering both class predictions and output probabilities. Furthermore, multiple hyperparameters are explored and DIVERSE is compared to existing baselines. S3: The paper is generally clear and well-written. The introduction effectively contextualises the Rashomon Set problem, and the experimental setup and results are generally clear. I think that the paper is overall quite solid. However, I believe its main weak points are related to its impact and motivation. W1: Based on the Introduction and the Conclusion of the paper, I do not understand why Rashomon Set exploration is an important problem to solve. I would like the authors to better motivate why their work is impactful for research in Machine Learning. W2: Furthermore, I think the results insufficiently show that DIVERSE is better than dropout-based Rashomon exploration. In particular, dropout-based Rashomon exploration is less computationally expensive than DIVERSE, and shows better performance in most cases for PneumoniaMNIST. I believe the authors should better highlight where their method outperforms the existing baselines, and should clarify under what conditions DIVERSE is preferable, such as specific datasets, architectures, or computational constraints. W3: For the experiments comparing DIVERSE with its baselines, it is unclear to me what hyperparameters are being used. I think this should be described in the experiment setup. W4: This is a minor point, but I believe it would be preferable that the mathematical definitions of the metrics, currently in Appendix A.1, be integrated into the main text. This would improve the clarity and readability of the results Section.
Q1: The Introduction defines the Rashomon Set as the set of models that achieve similar performance on the same task, whereas in Equation (1) this set is constrained to a hypothesis space of models parametrised by weights $w \in \mathbb{R}^p$. Could the authors please clarify whether the Rashomon Set is constrained to models that use the same architecture? Q2: In Table 1, it is unclear which size $d$ is used for the vector $z$. Since CMA-ES struggles with higher dimensionalities, how do higher values of $d$ impact the runtime of DIVERSE? Is there a trade-off between diversity, Rashomon ratio and runtime as $d$ increases? Q3: The paper only explores using DIVERSE on CNNs. Would DIVERSE be applicable to Transformers? If yes, demonstrating this applicability in the paper could further strengthen it. If not, what are the potential challenges? Fully human-written
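A minimal sketch of the DIVERSE-style search loop described in the reviews above: CMA-ES optimizes a low-dimensional vector $z$, a fixed random matrix maps $z$ to FiLM parameters, and a penalty keeps candidates near the Rashomon accuracy bound. It assumes the pycma package and a user-supplied evaluation function; it is an illustration, not the authors' code.

```python
import numpy as np
import cma  # pycma; assumed available

def explore_rashomon(ref_acc, eval_acc_and_disagreement, d=16, film_dim=512,
                     eps=0.01, penalty=10.0, iters=50):
    """CMA-ES over a low-dim z; a fixed random matrix maps z to FiLM parameters.

    eval_acc_and_disagreement(film_params) -> (accuracy, disagreement with the
    reference model) is assumed to be supplied by the user (it runs the
    FiLM-augmented network on a validation set).
    """
    proj = np.random.randn(film_dim, d) / np.sqrt(d)   # fixed random projection

    def fitness(z):
        acc, dis = eval_acc_and_disagreement(proj @ z)
        slack = max(0.0, (ref_acc - eps) - acc)        # distance below the Rashomon bound
        return -dis + penalty * slack                  # maximize disagreement, stay in set

    es = cma.CMAEvolutionStrategy(np.zeros(d), 0.5)
    for _ in range(iters):
        zs = es.ask()
        es.tell(zs, [fitness(z) for z in zs])
    return es.result.xbest                             # best latent found
```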
DIVERSE: Disagreement-Inducing Vector Evolution for Rashomon Set Exploration Soundness: 3: good Presentation: 3: good Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors introduce DIVERSE, a framework for exploring the Rashomon set of deep neural networks, an innovative way to find high-quality, diverse models that match a reference model's accuracy but differ in their predictions. DIVERSE adds Feature-wise Linear Modulation (FiLM) layers to a pretrained model and uses Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to search a latent modulation space, producing diverse model variants without retraining or gradients. On MNIST, PneumoniaMNIST, and CIFAR-10, DIVERSE finds multiple high-performing models that behave differently. - Innovative approximation of Quality-Diversity evolution - Well-written paper - Limited ablation study on the evolution side. Why CMA-ES and not another evolutionary algorithm? Fully human-written
DIVERSE: Disagreement-Inducing Vector Evolution for Rashomon Set Exploration Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper addresses the problem of exploring the Rashomon set of a trained machine learning model in a cost-efficient and diverse manner. The $\epsilon$-Rashomon set is the set of all machine learning models that reach the same empirical risk as the reference model up to a tolerance $\epsilon$. In this set lies a multiplicity of different models that may produce different predictions for identical individuals. Being able to explore $\epsilon$-Rashomon sets is useful for constructing diverse ensembles of machine learning models for uncertainty quantification or improved predictive performance. As exploring the $\epsilon$-Rashomon set of a deep neural network can be compute-intensive with naive methods (retraining from scratch or exploring the whole parameter space), this paper proposes to explore a subspace of the $\epsilon$-Rashomon set using Feature-wise Linear Modulation (FiLM), a low-dimensional parameterized transformation of a neural network. In this low-dimensional space, a black-box, derivative-free optimization algorithm (CMA-ES) is used for exploration. The constraint that explored models be included in the $\epsilon$-Rashomon set is relaxed by adding a penalization to the objective function minimized by CMA-ES. The proposed approach (DIVERSE) is then evaluated on three small- to medium-scale image classification datasets. Empirical results show that DIVERSE can be a good compromise between exploration and compute. Extensive experiments and ablation studies are conducted to evaluate the impact of each introduced hyperparameter. 1. The paper is well-written with a clear objective; all notions are introduced clearly. 2. The proposed approach has the potential to be a fundamental tool in many domains of machine learning. 3. The proposed approach is sound and well grounded in the literature. 4. The ablation study is quite thorough and the experimental protocol is sound. 1. The paper lacks qualitative or illustrative experiments to help the reader gain intuition about the significance of the results. 2. The paper lacks experiments with downstream tasks (uncertainty quantification, ensembling, ...) to better assess whether models generated with DIVERSE are actually effective for practical tasks. 3. The paper lacks quantification of how much smaller the explored set of models induced by FiLM is than the true $\epsilon$-Rashomon set. Has Rashomon set exploration been done for different training tasks such as regression? ### Remarks The columns of Figures 2 and 3 are not in the same order. The impact of $\lambda$, the mixing coefficient between soft and hard agreement, is not evaluated in the ablation study. More diverse datasets could be used in the experimental protocol. MNIST (and I think PneumoniaMNIST also) is quite an easy dataset, where a linear model trained on the raw pixels can achieve very high accuracy. Even though not ideal, Fashion-MNIST and K-MNIST might be better alternatives. MNIST-1D, even though not an image classification dataset but a time-series classification one, could be a potential candidate, as linear models have poor performance on it.
- Fashion-MNIST : Xiao, Han, Kashif Rasul, and Roland Vollgraf. "Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms." 2017. - K-MNIST : Clanuwat, Tarin, et al. "Deep learning for classical japanese literature." 2018. - MNIST-1D : Greydanus, Sam, and Dmitry Kobak. "Scaling down deep learning with mnist-1d." 2020. Fully human-written
DIVERSE: Disagreement-Inducing Vector Evolution for Rashomon Set Exploration Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces a gradient-free method called DIVERSE to explore the Rashomon set of deep learning models. This set contains models with similar accuracy but different predictive behaviors. DIVERSE takes a pre-trained model and augments it with Feature-wise Linear Modulation (FiLM) layers. DIVERSE then uses a search algorithm, Covariance Matrix Adaptation Evolution Strategy (CMA-ES), to find different model variations without needing to retrain them from scratch. The experiments show that DIVERSE can uncover multiple high-performing and functionally distinct models efficiently. It offers a competitive way to explore the Rashomon set, achieving comparable diversity to retraining but at a much lower computational cost. Originality: The paper reframes latent exploration as Rashomon set exploration in deep networks. It searches a bounded FiLM modulation space around a fixed model and balances an accuracy tolerance $\epsilon$ with explicit control over behavioral diversity. This perspective differs from weight generation or full retraining by focusing on efficient, local, and controllable exploration around a single reference model. Quality: The methodology is clearly specified and reproducible, with objectives, constraints, and data splits stated in enough detail. The experimental design is sound for the stated goals and uses appropriate datasets and baselines. The evaluation employs complementary metrics including Rashomon Ratio, discrepancy, ambiguity, VPR, and RC, which together provide a comprehensive view of diversity under an accuracy constraint. The analysis includes sensitivity to key hyperparameters and highlights dataset-dependent effects. Clarity: The paper is clearly written and easy to follow. The flow from problem setup to method and experiments is logical. Notation is consistent, and the figures make the FiLM-based search space and the role of the latent variable $z$ intuitive. Strengths: The approach is practical and training-free for a given reference model, which makes audits of the local Rashomon set feasible under realistic compute budgets. The joint use of decision-level and probability-level metrics supports a nuanced interpretation of disagreement. By mapping accuracy-constrained behavioral variants, the method helps characterize the local performance-diversity landscape of a trained model and can inform stress testing, ensembling, and selective prediction. The compute footprint is small compared to retraining, enabling broader exploration on larger models and datasets. Scaling of the search. The method relies on full-covariance CMA-ES over FiLM latents, which does not scale well as dimensionality increases. This limits exploration on deeper or more complex models and constrains the approach's practical reach. Architecture locality. Because FiLM layers are inserted into a specific pretrained network, the results are tied to that architecture. It is unclear whether conclusions transfer across backbones with comparable accuracy, and cross-architecture comparability is not established. Objective agnostic to why models disagree.
The fitness targets disagreement under an accuracy tolerance but remains insensitive to the underlying cause. As a result, discovered variants may be superficial perturbations rather than models with meaningfully different reliance on features, robustness characteristics, or fairness profiles. Experimental scope and baseline breadth. Experiments are confined to image classification on moderate-scale datasets. The behavior of the Rashomon set may differ substantially on larger benchmarks (e.g., ImageNet) or in other modalities (e.g., NLP with attention). Moreover, "retraining" is treated as a single comparator despite encompassing diverse regimes, leaving the trade-off landscape underexplored. Which layers contribute most to disagreement? Please provide a layerwise or stagewise sensitivity analysis of FiLM norms versus diversity. How portable is the method across backbones with similar top-line accuracy? A compute-matched comparison on two architectures would clarify whether results are model-local or architecture-agnostic. Fully AI-generated
Intra-Trajectory Consistency for Reward Modeling Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes an Intra-Trajectory Consistency Regularization (ICRM) method for reward modeling, aiming to improve credit assignment along reasoning trajectories without requiring explicit step-level labels. The key idea is to enforce consistency between neighboring reasoning steps by introducing a probabilistic regularization term, which supposedly yields more coherent and fine-grained process rewards. The authors claim that ICRM outperforms generalized reward models (GRMs) and enhances downstream reasoning tasks such as best-of-N reranking and RLHF-style fine-tuning. 1. The work addresses a meaningful problem in reward modeling, how to approximate process-level supervision from outcome-level signals, which is timely and important for reasoning-heavy LLM tasks. 2. The idea of linking neighboring steps via a Bayesian decomposition is conceptually interesting and could potentially bridge ORM and PRM methods. 3. The authors evaluate across multiple tasks and models, using RewardBench and RLHF-style setups. 4. The paper provides ablations and partial visualizations (heatmaps, section D) that help interpret the proposed regularization. 1. Main claim not empirically supported: The paper claims that ICRM improves credit assignment along trajectories, but the presented heatmaps primarily reflect accumulative error effects (probabilities naturally decay toward later tokens in autoregressive models). Without disentangling this from the regularizer’s effect, it is unclear whether credit assignment actually improved. 2. Visualization is confounded by probability decay: Figure 3’s token-level heatmap shows reward decline mainly in later segments, which likely results from accumulated log-probability decay rather than effective intra-step credit redistribution. No analysis isolates the effect of the regularizer from this autoregressive bias. 3. Selective reporting and overstatements: Claims such as “ICRM surpasses all GRMs” are overstated. Tables show categories where GRM still performs better (e.g., Chat domain). The improvements are average, not universal. 4. Lack of statistical rigor: Standard deviations and multiple random seeds are missing for most key metrics. RLHF evaluations rely solely on a single automatic judge (QRM-Llama3.1) rather than human ratings, making results less reliable. 5. Credit signal concentrated at sequence end: Section D.10 shows that ICRM primarily improves error detection in later parts of the trajectory, with little gain in early steps. This suggests that the method strengthens terminal penalties rather than distributing credit more effectively throughout the process. 6. No control for probability influence: Since the regularization explicitly uses next-token probability as its weighting factor, the correlation between probability and reward consistency should be empirically verified. The authors provide no statistical analysis (e.g., correlation plots or regression residuals) to support the assumed relationship. 7. 
Weak comparison baseline: The paper only compares against GRMs, but recent implicit PRM and DPO-Q works (e.g., Yuan et al., 2024; Rafailov et al., 2024) already achieve process-level supervision from outcome labels. Without including those baselines, it is unclear whether ICRM offers any genuine advantage. 8. Interpretation inconsistencies: The authors attribute posterior reward decay to “better credit distribution,” yet the data could equally reflect probability mass shrinkage or end-of-sequence effects. This ambiguity is never addressed. 1. Can you provide an analysis separating the influence of next-token probability from the effect of consistency regularization? For instance, show residual rewards after regressing out token probabilities. 2. How does ICRM perform if the regularizer is disabled but probabilities are still normalized? Does the improvement persist? 3. Why are there no standard deviations or multiple random seeds reported for the main tables? 4. Have you compared ICRM directly against implicit PRM or DPO-based Q-value approximations, which also derive process-level feedback from outcome labels? 5. Can you visualize cases with early or mid-sequence errors rather than only late errors to test whether ICRM truly improves step-level credit assignment? 6. Does the observed performance improvement hold when the generator and reward model come from different distributions or architectures? Fully AI-generated
Intra-Trajectory Consistency for Reward Modeling Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Reward model training usually uses coarse, response-level supervision, which can miss which parts of a trajectory actually drive the final score and can overfit to spurious cues (e.g., length). The paper proposes propagating this coarse signal to intermediate steps by adding a regularizer so that, within the same response, adjacent prefixes receive more similar rewards when the next-token probability is higher. - It supplements the standard BT outcome loss; no extra labels are needed, just next-token probabilities from a (frozen) generator. - With only response-level labels, ICRM approaches process-supervised PRMs and even boosts a process-reward model when combined. - Improvements hold across reward model benchmarks, RLHF policies, and inference-time verification, and extend to code generation. [minor weakness] - In the Table 2 reasoning section, the bold is wrongly placed (Classifier + label smooth shows higher performance). - In Figure 2, the two methods are not distinguishable; it would be better to use different colors so they can be compared more easily. - In Figure 3's heatmap, the color bar showing the scale is missing, and it would be better to use a denser color scale to show the differences clearly. [weakness] - You weight consistency by the model's next-token probability, but the generator is not guaranteed to be calibrated. How sensitive is your method to miscalibration? - Why did you choose the sampled token's probability instead of distributional uncertainty metrics like entropy or a margin score? - Because the LM is conditioned on the prefix, next-token probabilities can vary with token position in the text. Do you observe any position-dependent trends in your regularization term? - Your tokenizer is BPE, so tokens don't necessarily align with words. However, in your example in Figure 3, the text seems to be split exactly along word boundaries. Did you distinguish within-word subtoken transitions from cross-word transitions when applying consistency? - If you randomly choose adjacent tokens, do you still see gains? This would isolate how much of the effect comes from probability-based selection itself, versus smoothing any adjacent pair. See above Lightly AI-edited
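A minimal sketch of the intra-trajectory consistency idea described in the review above, assuming per-prefix reward scores and generator next-token probabilities are available; the exact weighting and loss form in the paper may differ.

```python
import torch

def intra_trajectory_consistency(prefix_rewards, next_token_probs):
    """Consistency regularizer over one response trajectory.

    prefix_rewards: (T,) reward-model scores for the T prefixes of a response.
    next_token_probs: (T-1,) generator probability of token t+1 given the prefix
    up to t. Adjacent prefixes are pulled toward similar rewards, more strongly
    when the next token is high-probability.
    """
    diffs = prefix_rewards[1:] - prefix_rewards[:-1]
    return (next_token_probs * diffs.pow(2)).mean()
```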
Page 39 of 1516 (75800 total rows)