ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 15899 (21%) 4.43 3.58 3687
Heavily AI-edited 3233 (4%) 4.22 3.59 2990
Moderately AI-edited 7082 (9%) 4.20 3.61 2722
Lightly AI-edited 16648 (22%) 4.15 3.68 2746
Fully human-written 32938 (43%) 4.13 3.62 2917
Total 75800 (100%) 4.21 3.62 3026
Title Ratings Review Text EditLens Prediction
Active Learning for Flow Matching Model in Shape Design: A Perspective from Continuous Condition Dataset Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper describes the application of active learning methods to training flow matching models in a data-efficient way. New selection strategies that adapt standard active learning to this new scenario are proposed and evaluated. One strategy bases selection on diversity, another on accuracy, and a third is a hybrid of the two. These selection strategies are "model-free", relying on characteristics of the data alone rather than the outputs of the model for selection. The proposed method is evaluated using four datasets. The main strengths of the paper are: -- The work described addresses a niche problem in active learning that is not widely studied and would be of interest to ICLR attendees. -- The work blends practical and theoretical contributions and insights. -- The proposed approach is evaluated on multiple datasets. -- The visual presentation of results is clear and carefully designed. -- Based on the authors' evaluation experiments, the models appear to perform well. -- The theoretical contributions are supported by detailed appendices. The main weaknesses of the paper are: -- The authors do not sufficiently explain how the active learning problem is mapped to the flow matching, generative-model scenario. The exact role of labels in this scenario is not clear from the authors' explanations, nor is the role of labelled data in training the models. A much clearer explanation of how the authors approach the active learning problem is required. -- The paper requires more careful review and revision as there are multiple typographical errors, e.g. “For example, GALISPZhang et al. (2024) consider ”subject of interest” which transforming the open querying problem in the label space into a semi-open one.” and "where et,i is the noise that make xi to x′ “ and "Because d+ 1 points form a d-dimensional plane.” Also, the referencing style appears incorrect, and opening and closing quotes in LaTeX are not used appropriately throughout. -- The evaluation setup is not completely clear: what data is used for evaluation? Is this different from the data used to train the models? Also, is the use of accuracy and diversity for evaluation appropriate, given that the proposed approach maximises these? -- The ideas of model-free active learning selection strategies and of selection strategies mixing diversity and accuracy exist in the literature. It is not clear exactly where the novelty of the approach lies. It would be useful for the authors to answer the following questions: -- Exactly what is the role of labels in the flow matching model training process described, and how does the active learning approach integrate with how that model is trained? -- Why does accuracy appear to decrease in Figure 4 as the algorithm proceeds? Fully human-written
Active Learning for Flow Matching Model in Shape Design: A Perspective from Continuous Condition Dataset Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The main contribution is a unique analytical framework to better understand how the composition of a data set affects the behavior of an FM model. Here, the FM neural network is modeled as a Continuous Piecewise-Linear (CPWL) function. From this, the authors derive a crucial finding that data points with labels similar to existing labels provide the main source of diversity in the output of the model, whereas data points with dissimilar labels improve the accuracy of the output. It provides a practical, data-driven process that can decrease the cost of training superior generative models in critical scientific and engineering fields, which are often label-poor. The essential insight that this selection process can be “decoupled” from the training of the FM model itself, relying solely on properties of the dataset and an inexpensive surrogate model, allows it to be an efficient and practical process. 1. The entire theoretical framework is based on the assumption that the FM network behaves as a piecewise linear interpolator. The authors state that their network (an 8-layer, 512-hidden-unit FCN) uses LeakyReLU, which is CPWL. However, this theoretical underpinning rests on what is known as the "condensation phenomenon", which is by no means guaranteed to hold for all architectures or training regimes. 2. The diversity strategy $Q_D$ (Eq 4) feels slightly ad hoc. The first term ($-\text{distance}(y, \mathcal{Y})$) is well-motivated by the theory (Section 2.3). However, the second ($\Delta \text{entropy}$) and third ($\text{distance}(x, \mathcal{X})$) terms are imported from other concepts. The ablation study (Fig 9) then reveals that the entropy term—which is part of the justification for balancing $m$ and $n$ in the 1D case—has a "comparatively minor effect". This makes the final formulation feel more "engineered" than "derived" and slightly undermines the elegance of the data-centric argument. 3. The paper does not discuss the scalability of the proposed methods as the label (condition) dimension $d$ increases. The core analysis in Section 2.3 uses $c \in \mathbb{R}^1$ for intuition, and the experiments go up to $y \in \mathbb{R}^4$. Nonetheless, the theory (e.g., the error bound in Eq. 5) and the entropy term must be computed (with clustering) over a partitioning of the label space into convex hulls/simplices, and may therefore suffer from the curse of dimensionality, either becoming computationally intractable or rendering the analysis irrelevant for high-dimensional condition spaces. 1. Equation 3’s “generation law” implies that samples generated under new conditions, $c^*$, are basically linear blends of existing training examples. This makes the model sound more like a memorizer and interpolator than a true generator, which raises some concerns about its ability to create genuinely new designs beyond what it has already seen. How do the authors reconcile this interpretation with the well-known creative power of generative models? And could the $Q_A$ strategy—by focusing only on reducing interpolation error—actually discourage the model from being creative? 2. 
The accuracy strategy $Q_A$ (Eq. 6) is motivated by the error bound in Eq. 5 and works like a coreset method in label space—sampling near the “edges” to shrink the $\max ||c_i - c_j||^2$ term. But this approach could overlook regions where the underlying function $f(x)$ is highly non-linear, even if those regions are relatively small. Have the authors thought about alternative versions of $Q_A$—for example, an uncertainty-based strategy that samples from the center of the largest unexplored region (as shown in Fig. 1b), or from areas with high predicted interpolation error, instead of just focusing on the boundaries? 3. Both $Q_D$ and $Q_A$ rely on a surrogate RBF neural network to predict labels across the entire unlabeled pool. This means the effectiveness of the query strategy heavily depends on how accurate that surrogate is. How sensitive are the results in Figure 4 to the quality of this RBF predictor? Also, what’s the actual computational overhead of retraining this surrogate at each active learning step? It would be helpful to know how this cost compares to the “cumbersome intermediate training cycles” of model-dependent strategies—especially as the unlabeled pool $U^n$ becomes large. Fully human-written
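As a concrete reading of the two query strategies discussed in the reviews above, here is a minimal sketch of how the diversity and accuracy scores might be computed. The function names, the use of nearest-neighbor distances, and the boundary-seeking interpretation of the accuracy term are assumptions rather than the paper's exact Eqs. 4 and 6, and in the paper the candidate labels would come from the surrogate RBF predictor.

```python
import numpy as np

def q_diversity(x_cand, y_cand, X_lab, Y_lab, entropy_gain):
    """Sketch of a diversity score in the spirit of Eq. 4: favor candidates whose
    (predicted) label is close to existing labels, that would increase label-space
    entropy, and whose input is far from already-labeled inputs."""
    label_term = -np.min(np.linalg.norm(Y_lab - y_cand, axis=1))   # -distance(y, Y)
    input_term = np.min(np.linalg.norm(X_lab - x_cand, axis=1))    # distance(x, X)
    return label_term + entropy_gain + input_term

def q_accuracy(y_cand, Y_lab):
    """Sketch of an accuracy score in the spirit of Eq. 6: favor candidates near the
    edges of the labeled label set, so that the max ||c_i - c_j||^2 term in the
    interpolation error bound shrinks for future queries."""
    return np.max(np.linalg.norm(Y_lab - y_cand, axis=1))
```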
Hierarchical LLM-Guided Multi-Task Manipulation with Multimodal Learning and Action-Mask Policy Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. In this paper, the authors propose a pipeline consisting of a high-level task planner and a low-level action planner to perform manipulation tasks. Specifically, the authors use a VLM to extract visual features and an LLM to decompose the task for the low-level action head. The authors then train a low-level action head based on ACT, modifying vanilla ACT to accept multi-modal inputs. The authors perform real-world experiments in a weighing scenario and a multi-object scenario, demonstrating the effectiveness of the proposed method. 1. The paper is well-written and easy to follow. 2. Compared to the baselines, the proposed method improves performance effectively. 1. The method's novelty is limited. The idea of combining a high-level planner with a low-level action head has been broadly explored in previous methods. Also, a multi-modal ACT conditioned on text inputs is common and not novel. 2. The scope of the experiments is limited. The experimental scenarios are relatively simple and cannot be considered challenging for the VLM and LLM. 3. Baselines are missing. I noticed the authors did not compare with concurrent VLA methods such as pi0, pi0.5, or Gr00T. See the weakness. Fully human-written
Hierarchical LLM-Guided Multi-Task Manipulation with Multimodal Learning and Action-Mask Policy Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This work presents a comprehensive framework for addressing long-horizon and complex tasks, consisting of two main components: a high-level planner and a low-level policy, with an additional voice control module. The high-level planner is responsible for scene understanding and task decomposition, while the low-level policy introduces a language-conditioned, act-based model for action execution. 1. The overall structure of the paper is well-organized, and the logic is clear. 2. The method is explained clearly, allowing readers to quickly understand the detailed approach of the work. 3. This is a complete and systematic piece of research. 1. The proposed method feels somewhat “straightforward.” Equipping a VLA model with a high-level planner is essentially an internal mechanism for handling long-horizon tasks, which makes it difficult to view as a core innovation. Moreover, the feature fusion design in the low-level policy also appears to have limited novelty. 2. The experimental setup is not clearly described — for example, it is unclear what fine-tuning data were used, how large the dataset is, and other relevant details. 1. I find the action-mask policy used as a task completion detector somewhat confusing. Why does adding a mask enable the model to determine whether a task has been completed? Shouldn’t task completion be judged based on visual input or certain state information instead? 2. Fine-tuning a pretrained encoder (such as the SigLIP2 text encoder) is not a common practice. Could you explain why fine-tuning is necessary in this case? 3. The experimental setup is not clearly described. I am unsure what fine-tuning data were used, how large the dataset is, and what the detailed training configurations are. Lastly, a minor suggestion: this work feels more suitable for ICRA or IROS, as it represents a complete system-level effort but lacks an element of novelty or conceptual intrigue. Moderately AI-edited
Hierarchical LLM-Guided Multi-Task Manipulation with Multimodal Learning and Action-Mask Policy Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a hierarchical robotic manipulation framework that integrates large language models (LLMs) and vision-language models (VLMs) for multi-task, long-horizon manipulation. The method separates perception (via a VLM) and reasoning/planning (via an LLM) to mitigate modality bias. At the low level, the framework introduces (1) an asymmetric multimodal encoder (SigLIP2 + DoRA for text and ResNet-based vision), (2) a Temperature-Scaled Spatial Attention (TSA) and Bidirectional Cross-Attention (BCA) for fusion, and (3) an explicit action-mask policy that jointly predicts actions and termination masks for sub-task switching. Experiments on dual-arm robot setups (weighing and multi-object manipulation) show improvements in task success and efficiency, with ablations validating the contribution of each component and generalization shown on a second platform. - The separation between VLM perception and LLM planning is well-justified to combat modality bias, which is an important issue in VLM–LLM integration for robotics. - The two-stage planning pipeline (VLM for structured scene representation → LLM for task reasoning) is elegant and interpretable. 2. Novel low-level policy design. - The explicit action-mask mechanism for real-time sub-task completion is a practical and innovative contribution that directly addresses inefficiencies in long-horizon control. - The asymmetric encoder architecture balances computational cost and multimodal expressivity, which is valuable for real-world deployment. 3. Comprehensive experiments. - Includes comparisons to strong baselines (SayCan, YAY, CoT planning, ACT). - Ablation studies are systematic and isolate contributions of LLM choice, encoder symmetry, TSA/BCA fusion, and mask policy. - The cross-platform validation adds credibility to the generalization claim. 4. Strong engineering execution and clarity. - Figures and diagrams (Figures 1–4) are informative and easy to follow. - Prompts are provided in the appendix, supporting reproducibility. 1. Limited novelty at the conceptual level. - While the paper’s combination of known ideas (hierarchical LLM-VLM structure, cross-attention fusion, mask-based termination) is effective, the individual components are mostly incremental extensions of existing work (e.g., SayCan, ACT, SigLIP-based fusion). - The explicit action-mask policy resembles termination gating or validity prediction seen in prior action-chunking or skill-switching literature (e.g., hierarchical RL termination functions). 2. Evaluation scope is relatively narrow. - Only two primary scenarios (weighing and multi-object) are tested, both tabletop and dual-arm setups with limited object diversity. - Tasks involve mainly short sequences (≤4 sub-tasks), so claims about “long-horizon manipulation” may be overstated. 3. Insufficient quantitative analysis of planning vs. control contributions. - While ablations isolate modules, it remains unclear how much of the performance improvement comes from high-level planning vs. low-level policy. 
- The success metrics mix perception, reasoning, and control success; more fine-grained metrics (e.g., sub-task detection accuracy, action validity prediction F1) would strengthen the analysis. 4. No real comparison to existing hierarchical frameworks. - Missing comparisons to HAMSTER (ICLR 2025) or Robridge (arXiv 2025a), which also use hierarchical multimodal reasoning for manipulation. - These would provide stronger evidence that the proposed hierarchical integration offers unique benefits. 5. Writing and presentation. - While overall clear, some sections (especially in Appendix A.10) contain minor grammatical errors and redundancy. - The “Conclusion” section reads more like a restatement of contributions than a discussion of implications or limitations. 1. How does the action-mask policy compare to learned termination functions in hierarchical reinforcement learning (e.g., options framework)? 2. How robust is the system to VLM misperception? Does an incorrect JSON scene description propagate to major planning errors? 3. What is the real-time inference latency for both planners combined? Can this system operate interactively in real-world settings (>5 Hz)? 4. Did you attempt fine-tuning the LLM on task planning examples? Or is it purely prompt-based zero-shot reasoning? 5. Could the method extend to continuous high-level planning (without discrete sub-task libraries)? Fully AI-generated
Hierarchical LLM-Guided Multi-Task Manipulation with Multimodal Learning and Action-Mask Policy Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces a hierarchical framework for long‑horizon robot manipulation that separates perception from reasoning: a VLM first converts the scene into a structured JSON description, and an LLM turns that into an ordered sequence of sub‑tasks from a predefined library. Execution is handled by an asymmetric multimodal policy that encodes language (SigLIP2+DoRA) and multi‑view vision (ResNets), fuses them with temperature‑scaled spatial attention and bidirectional cross‑attention, and adds an action‑mask head that predicts when actions are invalid so the system can stop a sub‑task and switch without extra inference. Across multi‑step tabletop tasks (weighing objects and multi‑object manipulation), the approach achieves higher planning accuracy and faster, more reliable execution than baselines, with ablations showing the two‑stage planner, fusion modules, and action‑mask are key contributors and a deployment on a different robot indicating some generalization. It cleanly separates perception from reasoning (VLM → LLM), which reduces modality bias and yields more reliable, library-constrained plans; it introduces an action-mask that lets the policy detect sub-task completion and switch tasks without extra LLM calls, making execution faster and more stable; its asymmetric encoders plus TSA/BCA fusion tighten language-vision alignment; results show higher planning success and shorter execution times than strong baselines across multi-step tabletop tasks; and ablations plus a cross-robot demo support both the design choices and some generalization. - Relies on a fixed sub-task library, limiting open-ended generalization to unseen skills. - Requires per–sub-task teleop demonstrations and a multi-stage pipeline, adding data and engineering overhead. - Sub-task switching depends on an action‑mask threshold (e.g., n consecutive invalids), which can be sensitive to noise and may need tuning. - Two‑stage planning improves accuracy but adds latency; also introduces dependence on large, possibly closed and costly LLMs. - Evaluation is confined to a few tabletop scenarios with modest trial counts; evidence beyond a single cross‑platform demo is limited. - Robustness/safety under disturbances, sensor noise, or environment shift is not extensively analyzed. See weakness Moderately AI-edited
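To make the sub-task switching mechanism questioned in the reviews above more concrete, here is a minimal sketch of a termination check driven by consecutive invalid-action predictions. The class name, default threshold, and boolean interface are hypothetical and not taken from the paper.

```python
class ActionMaskMonitor:
    """Sketch of mask-based sub-task termination: the policy emits a validity flag
    per action chunk, and the controller switches to the next sub-task after n
    consecutive 'invalid' predictions (the threshold flagged above as potentially
    noise-sensitive)."""

    def __init__(self, n_consecutive: int = 5):
        self.n_consecutive = n_consecutive
        self.invalid_run = 0

    def update(self, action_valid: bool) -> bool:
        """Return True when the current sub-task should be terminated."""
        self.invalid_run = 0 if action_valid else self.invalid_run + 1
        return self.invalid_run >= self.n_consecutive
```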
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games Soundness: 4: excellent Presentation: 3: good Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper presents ORAK, a comprehensive benchmark designed to evaluate and train large language model (LLM) agents across a diverse set of 12 video games spanning six major genres. The benchmark addresses critical gaps in existing game-based evaluations by incorporating complex, real-world video games, providing modular agentic strategies (e.g., reflection, planning, and tool use), and releasing a high-quality fine-tuning dataset derived from expert gameplay. The authors also propose a unified evaluation framework that includes leaderboards, LLM battle arenas, and in-depth analyses of input modalities, agentic strategies, and fine-tuning effects. The experiments demonstrate the capabilities and limitations of both proprietary and open-source LLMs across various tasks, showcasing the potential of ORAK as a foundation for advancing general-purpose gaming agents. To be honest, I'm very excited to see an LLM benchmark that integrates complex games. The LLM field is currently flooded with a large number of fixed datasets, yet claiming to be “agentic” is clearly insufficient. ORAK covers a wide range of game genres, including action, adventure, role-playing, simulation, strategy, and puzzle games. This breadth ensures a holistic evaluation of LLM capabilities, from logical reasoning to spatial understanding and long-term planning. The use of the Model Context Protocol (MCP) to integrate LLMs with game environments and agentic modules is a significant contribution. This modular approach enables plug-and-play experimentation and facilitates systematic studies of LLM behaviors in diverse scenarios. The release of expert gameplay trajectories across multiple genres is a valuable resource for the community. The dataset encapsulates meta-knowledge and demonstrates how fine-tuning can enhance LLM performance in both gaming and non-gaming tasks. The paper provides extensive experimental results, comparing proprietary and open-source LLMs across tasks and modalities. The insights into the effects of fine-tuning, visual inputs, and agentic strategies are particularly compelling. I strongly recommend acceptance of this paper. It makes a significant contribution to the field of LLM evaluation and training, offering a robust benchmark that bridges the gap between academic research and real-world applications. The paper is well-written, methodologically sound, and forward-looking, providing a solid foundation for future work in gaming AI and general-purpose LLMs. I particularly appreciate the authors' attention to detail in designing ORAK and their commitment to open science through the release of datasets and tools. This work is not only timely but also highly impactful, and I believe it will become a key reference for researchers in the field. This paper still has some minor flaws, and I hope the authors will pay attention to the following issues. 1. The current setup pauses games during LLM inference, which simplifies evaluation but does not fully reflect real-world gaming scenarios. Including preliminary results or discussions on latency-aware evaluation protocols would strengthen the paper. 2. 
Although the paper mentions RL-based fine-tuning as future work, a brief discussion on how ORAK could be adapted for RL experiments (e.g., reward design, dynamic data extraction) would be valuable for readers interested in this direction. 3. The authors acknowledge the cost of proprietary games and LLM APIs. Exploring potential solutions, such as open-source alternatives or simplified game environments, could help lower the barrier to entry for researchers with limited resources. If the authors could provide detailed data for problems 1 and 2, as well as potential solutions for problem 3, I believe this paper would be worthy of a spotlight paper. See weakness. Moderately AI-edited
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes **Orak** benchmark and evaluation suite for training and evaluating LLM/VLM agents on a diverse set of real video games. The benchmark includes 12 games spanning action, adventure, strategy, and puzzle genres, a modular agent interface (reflection / planning / action / memory components) built on a Model Context Protocol (MCP), prompt templates and action-space definitions, and an expert-trajectory dataset used for supervised fine-tuning. The submission reports cross-model comparisons, ablations of agentic modules, and fine-tuning experiments including transfer tests to out-of-distribution and non-game tasks. 1. **Broad coverage & engineering effort.** The benchmark covers 12 real games across diverse genres and provides a modular evaluation harness, which is useful for benchmarking different LLM agent designs. 2. **Multi-dimensional ability taxonomy.** The paper defines and uses a set of capabilities (e.g., long-horizon planning, spatial reasoning, rule compliance) and maps games to these capability needs, enabling capability–task analyses. 3. **Publicly available supervised trajectories & fine-tuning experiments.** The authors collected and provide a dataset of expert LLM interaction trajectories and demonstrate supervised fine-tuning improvements and some transfer effects. 4. **Useful baseline comparisons.** Results compare multiple closed-source and open-source LLMs under several agentic strategies (zero-shot, reflection, planning, ref-plan), giving a practical snapshot of current model gaps and engineering trade-offs. 1. **Insufficient novelty argument / differentiation from prior benchmarks.** The paper lists related benchmarks but does not convincingly quantify or empirically demonstrate how Orak meaningfully advances beyond existing game/agent benchmarks. The unique scientific questions Orak enables are not sharply distinguished. 2. **Experimental robustness & statistical reporting are incomplete.** Many reported results lack rigorous statistical detail (consistent number of seeds/trials, confidence intervals, or significance testing). Some reported scores show large variance, which weakens the reliability of the conclusions. 3. **Lack of systematic prompt / hyperparameter sensitivity analyses.** The paper attributes improvements to agentic modules (reflection/planning), but does not systematically vary prompt wording, temperature, max-context length, or other prompt-engineering factors to rule out that effects are largely prompt-driven. 4. **Real-time interaction and latency issues under-addressed.** For latency-sensitive or action-timed games (e.g., fighting or platformers), the evaluation often pauses the game during inference. This departs substantially from real-world online agent constraints; latency-aware experiments are deferred to future work, limiting external validity. 5. GPT-generated trajectories (e.g., gpt4o, o3-mini) **risk bias, hallucinations, low strategy diversity, and contamination**; the authors should document generation details and show that fine-tuned models generalize beyond the generator. 1. 
For each major table/figure, how many independent trials and random seeds were used? Please add confidence intervals and describe any hypothesis tests performed. If trials vary by game, report that explicitly. 2. Did you run controlled sweeps over prompt phrasings, temperatures, context lengths, or token limits when comparing agentic modules? If not, please run such sweeps for key games or explain why module effects are independent of prompt variants. 3. Can you provide at least one latency-aware experiment for a timing-sensitive game (e.g., impose an upper bound on LLM response time or simulate delay) and report performance degradation as a function of latency? 4. Fine-tuning data & overfitting controls. For the supervised fine-tuning dataset: how were trajectories sampled (top trajectories or diverse sampling)? What regularization, early-stopping, or validation protocols prevented overfitting to a specific agentic workflow? 5. I request ablation results comparing fine-tuning on different generator sources (e.g., gpt4o-only, o3-mini-only, mixed-source, and, if available, human or RL trajectories) to quantify generator-specific biases. Fully AI-generated
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 8: accept, good paper Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper proposes Orak, a benchmark and dataset for evaluating foundation models in dynamic digital game scenarios. Orak increases the diversity of game genres covered in evaluation compared to previous benchmarks, and the developed platform also allows plugging in and enabling/disabling different agentic modules for ease of evaluation/ablation. Moreover, Orak includes a dataset of fine-tuning data, collected as interaction trajectories generated by foundation models guiding gameplay across all 12 games supported by the platform. Experimental results show the performance of 15 foundation models on the benchmark, an ablation study of 2 of those models utilizing different agentic approaches to play all games, results on scenarios combining different data modalities, and SFT results using Llama to illustrate the benefits of the collected trajectory dataset. Orak provides a very functional benchmark platform for different game genres, as well as somewhat general coverage of game genres that allows better insight into the capabilities required of foundation models during gameplay. Such results can also potentially generalize to wider impact beyond digital games alone. The presented results well illustrate the current capabilities of foundation models and how different games probe them in different dimensions. The created dataset (and, more importantly, data collection platform) can also serve as a stepping stone for further research on improving gameplay models/agents/systems. While well presented and illustrating the potential of the benchmark in principle, the current paper presents some limitations, mostly regarding the analysis of the results and the rationale for some of its design aspects. The manuscript claims "in-depth analyses of input modality, agentic strategies, and fine-tuning effects", but falls a bit short of this challenging goal. It does provide some interesting insights, but doesn't really go deep into any of the 3 areas. Having said that, the platform itself is already of great value and can be used as a strong base for further evaluation and analysis. Toning down such claims still leaves the rest of Orak as solid work. Regarding input modalities, Orak emphasizes pre-extraction of game state information and textual representation of such data. This both: i) greatly reduces the challenge of observation/state/world understanding and already biases results toward text; and ii) muddles the later analysis of providing different modalities, which then also needs to deal with potential conflicts between modalities and with some models seemingly preferring specific modalities; both are known issues previously discussed in the literature. The presented ablation for agentic modules is interesting as a high-level overview, but its results as presented are not well discussed and don't show significant insights. The analysis here could go much more in-depth in future work. The LLM fine-tuning experiments and results are not in-depth and only take a quick look at one foundation model in two small sizes, where SFT with almost any data would already provide most of the benefit.
The paper would benefit from a more detailed analysis of dataset quality and what/how it actually contributes during training, as well as any possible insights into data scaling. However, the main limitation of Orak as a platform is the lack of functionality to properly evaluate grounding of actions, as each game's action space has been manually defined differently and already mapped to high-level functions that heavily abstract and simplify the problem. Though, the platform could be easily modified to offer a full pixel-to-keyboard/mouse-commands interface that fully exposes the complexity and allows users to benchmark at different levels. I don't think the games industry itself is the main beneficiary of such a benchmarking effort. How does this motivation and focus affect the benchmark design? Agent autonomy using games as a learning environment could have much wider impacts and would need to be analysed differently. The paper mentions that previous evaluations "often rely on visual inputs" as a criticism. Why? I'd argue that understanding of visual observations is exactly the most important area where games can help as a benchmark for foundation model capabilities. In the Arena setting results, why exactly was Starcraft evaluated with one fewer model than Street Fighter? Also, critically, do you have any insights on why Minitron-8B performs so much better than in the full results? Minitron was 0 for Starcraft in Table 3. Any deeper understanding here could be significant. BTW, Table 3 shows the results only of using the "auto extracted state in text only form", correct? It would be interesting to have an easy comparison of that vs. the best results somewhere, even if in an appendix. In a similar vein to the Arena question for Minitron, do you have any insights on why there is no SFT benefit intra-game for Starcraft and Baba? Typo: "In game industry" -> "In the games industry" Fully human-written
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces Orak, a benchmark for evaluating LLM agents across 12 video games spanning six genres. The authors aim to address limitations in existing game benchmarks by offering greater diversity, enabling studies on agentic modules (such as reflection and planning), and providing resources for adapting LLMs into gaming agents. The key contributions are the benchmark itself, which uses a plug-and-play interface based on the Model Context Protocol (MCP) for standardized evaluation, and a fine-tuning dataset derived from expert LLM trajectories designed to distill gaming skills into smaller models. The paper presents a series of experiments on 15 LLMs, analyzing their performance, the impact of agentic modules, the effect of visual inputs, and the generalization capabilities of fine-tuned models. The primary strength of this paper is the scale and diversity of the benchmark. Compiling 12 games across six distinct genres, each with its own environment setup and state representation, is a big **engineering effort.** This provides a broad testbed for evaluating a variety of agent capabilities, from reaction time in action games to long-term planning in strategy games. The introduction of a unified, plug-and-play interface using MCP is a commendable step towards standardized and reproducible evaluation of LLM agents in gaming environments. The release of a fine-tuning dataset, while based on LLM-generated trajectories, is still a useful resource for the community. The paper, despite its significant engineering effort, suffers from several weaknesses in its core claims, methodology, and the novelty of its conclusions, which limit its overall contribution. Largely Unsurprising and Incremental Conclusions: The main findings drawn from the extensive experiments largely confirm well-established knowledge in the LLM agent community, offering little new insight. - The conclusion that proprietary, closed-source models outperform their open-source counterparts is widely accepted and requires little further validation in 2025. - The finding that agentic workflows (e.g., reflection, planning) benefit capable models is not new. The claim that visual inputs often hinder performance is misleading. This outcome is likely an artifact of the experimental design, where highly structured and pre-processed text provides a cleaner, more direct signal than raw visual data for current VLMs. A more accurate conclusion would be that under this specific setup, the models fail to extract sufficient value from visual inputs to overcome the noise, rather than a general indictment of visual modalities for gaming agents. Questionable Design Choices in Benchmark and Data Generation: - The selection of games appears to be driven more by the availability of existing APIs or emulators (e.g., Mineflayer for Minecraft, PyBoy for Pokémon Red) than by a principled selection of titles that would best probe the frontiers of AI capabilities. The benchmark lacks modern, complex 3D games that pose severe challenges in perception from raw pixels, physics-based interaction, and complex spatial reasoning. 
- The fine-tuning dataset is generated by an 'expert' LLM (GPT-4o), which fundamentally caps the potential performance of any fine-tuned model at the level of the teacher model. This methodology prevents the discovery of novel strategies that might surpass the teacher's capabilities and introduces the teacher's inherent biases and failure modes into the student models. - By selecting only the highest-scoring trajectories for the fine-tuning dataset, the authors introduce a strong survivorship bias. The models learn from 'perfect' or near-perfect executions ('sunny day' scenarios) but are not exposed to data on how to recover from mistakes, adapt to unexpected situations, or turn a losing game around. This is a critical omission for developing robust agents that can handle the stochasticity and adversity inherent in complex games. The paper positions Orak as a foundational benchmark that pushes the boundaries of agent evaluation. However, the tasks often seem simplified through pre-processed states and high-level APIs, which might not establish a clear differentiation from prior work. Could the authors elaborate on the unique challenges Orak presents compared to existing agent benchmarks? The current results do not seem to establish a clear differentiation, as the main conclusions are largely echoes of findings from other domains. What specific agent capabilities are uniquely tested in Orak that are not adequately covered by prior benchmarks? A more compelling case could be made by showcasing a task where top-performing agents from other domains systematically fail due to a game-specific challenge that Orak is specifically designed to evaluate. For instance, is there a scenario that rigorously tests an agent's ability to reason under partial observability from raw signals, a core challenge in many games? I'm willing to increase my score if the author can answer my question. Fully AI-generated
On the Reasoning Abilities of Masked Diffusion Language Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper provides a formal analysis of the reasoning and computational capabilities of MDM. Under a finite-precision, logarithmic-width transformer setting, the authors prove that MDMs are theoretically equivalent to PLTs and can simulate CoT reasoning. They further show that MDMs are provably more efficient only for parallelizable problems (e.g., regular languages), where parallel denoising enables faster reasoning, while for inherently sequential tasks (e.g., P-complete problems) MDMs offer no efficiency gain and may even be less practical due to architectural overhead. 1. The paper offers a novel and rigorous theoretical framework for analyzing the reasoning capability of masked diffusion models, a topic that has been largely unexplored. 2. It provides clear conceptual connections between MDMs, chain-of-thought transformers, and padded looped transformers, helping to unify different reasoning paradigms under one formulation. 3. The analysis yields meaningful theoretical insights into when MDMs can achieve efficiency gains through parallelism, offering guidance for future research on diffusion-based reasoning models. 1. The theoretical framework relies on strong assumptions about positional encodings, which are constructed to include arithmetic information (e.g., division and modulo) that real transformers cannot compute, potentially overstating MDMs’ practical capability. 2. The analysis is purely theoretical, without even minimal empirical validation or illustrative experiments to verify whether the predicted efficiency gains appear in practice. 1. The paper makes strong assumptions about positional encodings, requiring them to include arithmetic information such as division and modulo. Do widely used schemes like RoPE [1] satisfy these assumptions? 2. The authors argue that for P-complete problems, MDMs cannot benefit from parallelism and that the absence of KV-cache makes autoregressive models preferable. Would this conclusion still hold if we consider recent MDM variants that incorporate KV-cache [2]? [1] Su et al. RoFormer: Enhanced Transformer with Rotary Position Embedding. [2] Wu et al. Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding. Lightly AI-edited
On the Reasoning Abilities of Masked Diffusion Language Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper analyzes the computational capability of MDMs, which is composed of a planner and a predictor, both implemented by transformers. One core finding is the equivalence between MDMs and PLTs under the discussed setting. Based on this equivalence, this paper characterizes the computational capability of MDMs and provides a comparison between MDMs and CoT Transformers. 1. The paper is well-written and provides a detailed discussion of the problem setting, related work, and its theoretical assumptions. 2. The target question, the computational capability of MDMs, is practically relevant and important. 3. The proposed "planner-predictor" formulation for MDMs is reasonable. 4. The approach of linking MDMs to the PLTs is natural and allows the use of well-studied conclusions about PLTs. 1. Positional encodings. The main theoretical results appear to rely on very strong assumptions about the positional encodings. As discussed in Appendix D.1, the PEs are constructed to carry complex algorithmic information (e.g., the results of division and modulo operations). This assumption seems to offload a large part of the required computation from the Transformer's computation mechanism onto the input representation. 2. Idealized planner. In the proofs, the planner is a Transformer designed to perfectly execute a specific algorithm. It is unclear if the practical confidence-based unmasking planner could replicate this perfect capacity. 3. Empirical study. While the work is theoretical, the results would be strengthened by even simple empirical studies to see the practical performance of the predicted capabilities (e.g., the efficiency of MDMs on parallelizable tasks versus CoT). 1. Positional encodings. How much of the computational capability attributed to MDMs (e.g., in Theorem 3.2) is actually due to the pre-computed information in the PEs rather than the Transformer’s computation mechanism? This should be more clearly discussed. 2. Idealized planner. How could practical planners approximate the performance of the idealized planners? A more detailed explanation or empirical investigation of this gap would strengthen the paper's real-world implications. Lightly AI-edited
On the Reasoning Abilities of Masked Diffusion Language Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper provides a formal theoretical analysis of the reasoning capabilities of Masked Diffusion Models (MDMs) for language generation. It establishes equivalence between MDMs and Padded Looped Transformers (PLTs) under finite-precision, log-width settings, and compares their expressivity to Chain-of-Thought (CoT) transformers. The authors show that MDMs can simulate CoT reasoning (with some overhead), and are provably more efficient on parallelizable problems. They also identify a "sequentiality bottleneck" in CoT transformers, which MDMs can overcome due to their parallel nature. The paper concludes that MDMs are better suited for parallelizable reasoning tasks, while CoT is more efficient for inherently sequential ones. 1. The paper provides formal proofs and complexity-theoretic characterizations of MDMs, grounding their reasoning capabilities in well-understood models like PLTs and CoT transformers. 2. It introduces the idea of a "sequentiality bottleneck" in CoT and shows how MDMs can leverage parallelism, offering a clear separation in expressive efficiency for certain problem classes (e.g., regular languages, NC1). 3. The idealized MDM model is motivated by practical implementations, and the authors show how their theoretical framework aligns with real-world MDM behaviors, such as confidence-based unmasking and resampling. 1. The paper is purely theoretical and lacks experimental validation of the claims. While theoretical depth is valuable, even synthetic experiments could help illustrate the practical implications of the findings. 2. The assumptions may be too strong. The analysis relies on idealized assumptions (e.g., finite-precision, log-width transformers, perfect planners/predictors), which may not reflect real-world limitations of MDMs or PLTs. 3. The reasoning tasks considered (e.g., regular languages, circuit complexity classes) are formal and abstract. It’s unclear how the results translate to more complex, open-domain reasoning tasks commonly faced by LLMs. 1. Do you plan to validate your theoretical findings empirically? For example, can you design controlled experiments to show that MDMs outperform CoT on parallelizable tasks like expression evaluation or state tracking? 2. How do your assumptions affect the realism of the model? What are the implications of relaxing assumptions like perfect approximation or uniform unmasking? How would your results change under noisy or learned planners? 3. Can your framework be extended to other reasoning paradigms? For instance, could similar analyses be applied to latent diffusion models, state-space models, or hybrid autoregressive-diffusion architectures? Fully AI-generated
On the Reasoning Abilities of Masked Diffusion Language Models Soundness: 4: excellent Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper explores the reasoning capabilities of MDMs from the standard complexity-theory perspective. The main message of this paper is that MDMs excel on highly parallelizable or ambiguous-generation tasks, achieving strong results with few denoising steps and easy steering via subtask learning. The authors also provide rigorous theoretical statements to support this claim. This paper works on a topic that previous work didn't explore at all: theoretically characterizing the reasoning ability of MDMs. Although the community believes that MDMs can achieve better performance than causal models on tasks that require non-causal thinking, the community has lacked a formalization of that intuition. Last but not least, the paper is clearly well-written; even people who aren't familiar with complexity theory can easily follow it. This paper doesn't provide any empirical results, although that is admittedly outside its scope. Moreover, although the introduction is clearly well-written, Figure 1 gives too much information and is even a bit hard to follow. Also (as far as I understand), this paper's theoretical analysis is based on a remasking scheme (where we remask the unmasked tokens again), which people don't really use in their large-scale masked diffusion models. Although the authors provide two MDM references that use this scheme, my thought is that the remasking strategy is not the mainstream sampling approach, at least so far. - How are the theoretical results affected without the remasking strategy? - Are there empirical results (at least from prior work) that can support the authors' claims? Fully human-written
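For readers unfamiliar with the sampling scheme these reviews refer to, here is a minimal, generic sketch of confidence-based unmasking with optional remasking in a masked diffusion sampler. It is not the construction used in the paper's proofs; the tensor shapes, the threshold, and the per-step unmask budget are assumptions.

```python
import torch

def confidence_unmask_step(logits, tokens, mask_id, k_unmask, remask_threshold=None):
    """One denoising step: unmask the k masked positions the predictor is most
    confident about; optionally remask previously unmasked tokens whose current
    confidence falls below a threshold (the 'remasking' scheme discussed above)."""
    probs = logits.softmax(dim=-1)            # (seq_len, vocab)
    conf, pred = probs.max(dim=-1)            # per-position confidence and argmax token
    masked = tokens == mask_id

    # Planner: pick the k most confident currently-masked positions to unmask.
    if masked.any():
        cand_conf = conf.masked_fill(~masked, float("-inf"))
        k = min(k_unmask, int(masked.sum()))
        idx = cand_conf.topk(k).indices
        tokens[idx] = pred[idx]

    # Optional remasking of low-confidence positions that were already unmasked.
    if remask_threshold is not None:
        low = (~masked) & (conf < remask_threshold)
        tokens[low] = mask_id
    return tokens
```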
KVCache-Centric Memory for LLM Agents Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. MemArt reframes agent memory as KVCache-centric rather than plaintext. The paper shows how MemArt stores past turns as reusable KV blocks and retrieves them by computing attention in latent space. This avoids retrieval drift and preserves prefix-caching benefits. The system comprises (a) AABB-based key compression for each fixed-size block, (b) multi-token aggregation retrieval that scores blocks against all query tokens, (c) decoupled positional encoding that re-embeds retrieved KV without stale RoPE offsets, and (d) a managed memory pool. Compression represents each block with coordinate-wise minima and maxima, enabling fast coarse filtering. Notably, the relevance for a single token is upper-bounded via the dot-product with those bounds. For multi-token prompts, scores are first normalized per token across all blocks and then aggregated to select top-k blocks in chronological order. Retrieved blocks are concatenated with the current KV and re-encoded with unified positions, ensuring coherent attention within the current window without exceeding native limits. MemArt's design is quite interesting. It reframes memory to be KVCache-centric with latent-space retrieval, decoupled positional encoding, and lightweight AABB key compression with multi-token aggregation. This yields a model-agnostic, plug-and-play system. System-wise, memory-pool I/O can add non-trivial latency, and safe reuse critically depends on decoupled positional re-embedding. The issue is that, without it, long-context behavior can degrade. 1. I am curious, what is the precision and recall trade-off of the AABB block filter on adversarial or highly paraphrased queries? 2. What is the end-to-end latency and memory-traffic breakdown (prefill, retrieval, re-embed), and how would specialized KV hardware change the bottlenecks? 3. How does MemArt compare head-to-head with KV pruning strategies like Keyformer and MorphKV under the same memory budget and latency constraints? Moderately AI-edited
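As an illustration of the AABB filtering and multi-token aggregation described in the summary above, here is a minimal sketch. The softmax-plus-sum aggregation, the block dictionary layout, and the chronological ordering by block id are assumptions, not the authors' implementation.

```python
import numpy as np

def aabb_upper_bound(query, key_min, key_max):
    """Upper bound on q.k over all keys k inside an axis-aligned bounding box:
    for each coordinate, pick the box corner that maximizes the dot product."""
    return np.sum(np.where(query >= 0, query * key_max, query * key_min))

def select_blocks(queries, blocks, top_k):
    """Score each stored KV block against all query tokens with the coarse AABB
    bound, normalize per token across blocks, aggregate over tokens, and return
    the top-k block ids in chronological order (ids assumed chronological)."""
    scores = np.array([[aabb_upper_bound(q, b["key_min"], b["key_max"])
                        for b in blocks] for q in queries])     # (n_query, n_block)
    scores = np.exp(scores - scores.max(axis=1, keepdims=True)) # per-token softmax
    scores /= scores.sum(axis=1, keepdims=True)
    agg = scores.sum(axis=0)                                    # aggregate over tokens
    chosen = np.argsort(-agg)[:top_k]
    return sorted(chosen.tolist())
```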
KVCache-Centric Memory for LLM Agents Soundness: 4: excellent Presentation: 3: good Contribution: 4: excellent Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes MemArt, a memory system that stores and retrieves past context directly in KV-cache space instead of plaintext. It introduces: (i) AABB-based key compression to index each KV block with min-max vectors; (ii) multi-token aggregation retrieval that scores blocks using normalized per-token relevance and then aggregates across tokens; and (iii) decoupled positional encoding that strips RoPE at storage time and re-embeds positions at reuse time to avoid positional mismatch. MemArt’s KV-native retrieval aligns with the attention mechanism and removes prompt concatenation, which can avoid retrieval drift and preserve prefix-caching efficiency. The AABB compression is simple and allows a fast coarse filter before fine attention. The multi-token aggregation is well-motivated and the ablations (Softmax vs reciprocal-rank; Sum vs Max; block size) help isolate what matters. The decoupled positional encoding is clearly described and addresses long-context reuse failure modes. 1. Model coverage is limited for a 2025–2026 submission. Results are only on LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct, with no newer families and no size sweep to show scaling trends. 2. Baseline breadth is narrow. The method is compared to plaintext memory systems (Mem0, Zep), but there is no head-to-head with cache-centric and dynamic sparse attention systems that also select KV blocks (for example Arkvale, InfLLM, Quest, NSA), even though they are discussed. 3. Scope of datasets is narrow. All main results are on LoCoMo; there is no test on other long-horizon agent traces Because MemArt stores and retrieves latent KV-cache tensors instead of text, the retrieved memory is not human-interpretable. This makes it difficult to verify what information is actually being recalled or whether retrieval errors occur. Can you provide any mechanism to improve interpretability — for example, storing lightweight metadata (token spans, summaries, or embeddings) alongside each KV block, or decoding retrieved KV tensors back into approximate text via the model’s unembedding layer? Additionally, can you report any qualitative analysis showing that the retrieved memories correspond to semantically relevant parts of the dialogue? Without such transparency, it is hard to assess whether MemArt retrieves correct information or merely benefits from implicit correlations. Heavily AI-edited
KVCache-Centric Memory for LLM Agents Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes MemArt, a new KV-cache-centric memory paradigm for LLM agents that replaces plaintext with direct reuse of latent KV-cache blocks. Instead of re-feeding retrieved text into prompts, MemArt stores and retrieves prior computation directly in latent space, which dramatically improves both accuracy and efficiency. Specifically, they propose to compress keys via a bounding box, then compute attention over KV blocks through normalization and aggregation over the query tokens. Finally, they append these KV blocks after injecting the positional index to start decoding. * The proposed KV-cache-centric memory paradigm can directly reuse the KV computed during prefill, which reduces computational overhead. * The proposed multi-token aggregation alleviates retrieval overhead by reducing the number of indices. * Their proposed decoupled positional encoding practically solves the issue. * While the proposed method achieves higher accuracy and lower latency, it inevitably involves an ever-growing memory size that might cause storage issues. This is due to two design choices: 1) the KV cache is represented in floating-point numbers and scales much faster than plaintext; 2) the memory grows linearly with no upper bound on its size. * Another drawback of the KV-cache paradigm is that it does not generalize across models. The importance of memory sharing is amplified in multi-agent systems, where one model needs to understand another model's memory. This limits the scope of the paper. * There also seems to be a lack of experimental comparison with the KV-cache compression literature. I have listed several works below for reference. 1. _H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models_ 2. _SnapKV: LLM Knows What You are Looking for Before Generation_ 3. _A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression_ * Finally, the model sizes experimented with are limited and lie primarily around 7/8B. I believe the work would benefit from validation on larger-scale models such as 32B or MoE models. * How would you differentiate your work from the KV-cache compression literature? Fully human-written
Don't Run with Scissors: Pruning Breaks VLA Models but They Can Be Recovered Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper starts from the observation that standard, LLM-validated pruning catastrophically collapses VLA policies—success drops to 0% on both manipulation (OpenVLA: 85.2% to 0%) and navigation (NaVILA: 43% to 0%). It then proposes GLUESTICK, a training-free, pruning-agnostic weight-space correction: compute the dense–pruned gap per linear layer, take a truncated SVD, and add a low-rank correction at inference to restore “lost directions”. GLUESTICK substantially recovers manipulation (≈50% of success lost to pruning) and fully restores navigation success while keeping most of the VRAM savings of structured sparsity; unsafe-episode rates return near dense baselines. The paper further diagnoses why VLAs are fragile: compared to LLM layers, VLA layers show flatter singular spectra, meaning “useful signal” is spread across many directions and is easily excised by structured pruning. * Clear empirical finding: the results are quite good across two domains (manipulation/navigation) and three architectures. * Simple method architecture: GLUESTICK is training-free, drop-in, and pruning-agnostic; a single interpretable hyperparameter r controls the memory-recovery trade-off. * Thoughtful analysis: the analysis provides a plausible reason VLAs differ from LLMs (flatter spectra -> pruning removes distributed, important directions), which aligns with the effectiveness of a low-rank “stitch-back” on top of pruned weights. * Corrections are applied only to the linear layers: for models with heavy convolutional vision encoders or attention projections with structured kernels, the proposed method might lose some effectiveness, as in the WorldVLA case. * The rank scheduling is empirical: a single global r is used throughout, despite large per-layer variation and a parameter-sensitive vision backbone, without accounting for differences between layers. * The requirement of dense weights: the method needs the original dense checkpoint to compute the gap SVD, which might constrain its usage in some scenarios. * Can the method be combined with other techniques such as quantization or LoRA? Could you try with more compression baselines for more comprehensive ablation studies? * You have mentioned that the manipulation task can only achieve ~50% of recovery; can you analyze the performance differences across task settings in more depth? * How sensitive is the method to domain shift and long-horizon tasks? * In Appendix D.1, the authors find that a smaller, more "targeted" calibration set for Wanda pruning yields a 2% performance gain. This is an interesting but counter-intuitive result. Does this suggest that pruning methods are highly sensitive to the quality and relevance of calibration data, and that "more data" is not always better? Lightly AI-edited
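For readers unfamiliar with the correction the reviews describe, here is a minimal per-layer sketch of a truncated-SVD "stitch-back" term computed from the dense-pruned gap; the function name and the way the factors are applied at inference are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def gluestick_correction(w_dense, w_pruned, rank):
    """Low-rank correction for one pruned linear layer.

    Returns factors (U_r, V_r) such that w_pruned + U_r @ V_r recovers the
    top-`rank` directions of the gap removed by pruning.
    """
    gap = w_dense - w_pruned                           # what pruning excised
    U, s, Vt = np.linalg.svd(gap, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]                       # fold singular values into U
    V_r = Vt[:rank, :]
    return U_r, V_r

# At inference the corrected layer output would be computed as
# y = x @ (w_pruned + U_r @ V_r).T, or with the factors kept separate
# so that only O(rank * (m + n)) extra parameters are stored.
```

Storing the two factors instead of the dense gap is what preserves most of the memory savings: for a rank r much smaller than the layer dimensions, U_r and V_r together are far smaller than the dense weight matrix.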
Don't Run with Scissors: Pruning Breaks VLA Models but They Can Be Recovered Soundness: 3: good Presentation: 4: excellent Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The manuscript presents a study of VLA pruning, demonstrating that VLA models lose considerably more performance under pruning than their LLM counterparts. The authors demonstrate and benchmark this behavior on manipulation as well as navigation tasks using the OpenVLA, WorldVLA, and NaVILA models. By analyzing the spectrum of model weights, the authors further demonstrate a difference in the weight space between VLAs and LLMs. Based on this observation, GLUESTICK is proposed as a mitigation method. By compressing the most important components of the pruned weights using SVD, and thus reconstructing the suppressed weight components, the authors are able to recover part of the model performance. This behavior is demonstrated on simulation benchmarks for manipulation and navigation tasks. - The analysis of performance loss is well executed, spanning multiple models and tasks. - The proposed recovery method is grounded in the joint findings from a study considering the pruning process itself, as well as from evaluations in a robotic simulator. - GLUESTICK is able to recover part of the lost model performance in both deployment scenarios across model architectures. - The method is straightforward to implement without requiring target domain calibration data and shows strong improvements over the baseline. - As a major shortcoming, the work misses a comparison to VLM pruning, where the strong performance drop of VLMs compared to LLMs is a known property [1,2]. In previous works, this behavior is especially prominent at or below 50% sparsity, the operating point chosen in this work. Since VLAs typically build on top of VLMs, rather than on a language-only LLM, this comparison and a discussion of related works are required. - Since VLMs lose considerable performance from pruning, it remains unclear if the demonstrated problem is a problem stemming from the VLM backbone or the full VLA built for the robotic task. - The experiment in Q6 should be put in context. 200 SVD components are an extremely strong compression of the weight space, especially when compared to 200/500 residual components in GLUESTICK. - Experiments are purely performed in simulation. Since real-world deployment of VLA policies can show considerably different performance, a small study on real robots would strengthen the experiments in this work. [1] Liang, Yinan, et al. "Efficientllava: Generalizable auto-pruning for large vision-language models." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025. [2] Koike-Akino, Toshiaki, Jing Liu, and Ye Wang. "$\mu $-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts." arXiv preprint arXiv:2505.18451 (2025). - How strong is the performance loss at different operating points of pruning, especially with lower sparsity? - How does Figure 2 change when compared to the corresponding VLM model? - Overall, the work should discuss the relation to VLM pruning and put the findings and novelty in the context of existing work in this area. I will reconsider my rating if this shortcoming is adequately addressed in the paper's discussion, related work, and experimental validation. Fully human-written
Don't Run with Scissors: Pruning Breaks VLA Models but They Can Be Recovered Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 10: strong accept, should be highlighted at the conference Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper is about the very important question of how to make a VLA model efficient (fast) by pruning without degrading its performance too much. The approach computes a correction term based on the un-pruned and pruned models, which is later used at inference with the pruned model. The term is computed only once and requires no knowledge of the pruning method. The paper provides empirical evidence of the problem actually existing, provides insights as to why the problem arises, and demonstrates that the proposed method solves the issue. - The paper addresses a relevant and important problem. - The paper presents a good explanation and demonstration that the issue exists and is relevant. - The paper presents an effective solution to the problem. - The approach has favorable properties such as only computing the correction term once and being independent of the pruning method. - The introduction and related work sections are well formulated. - The method is innovative, making use of a low-dimensional correction term (from SVD). - The paper contains code examples for the most important parts of the approach. - The presentation of Figure 2 is hard to read, and a different way of presenting the same information would help. - The proposed approach computes the correction term after pruning, which itself happens after the model has been trained in the first place. Would it be possible to improve the correction performance by pruning in a certain way or learning parameters that make correction with the SVD approach easier? - What bias is the SVD approach introducing to the correction? - There exist other low-rank decompositions of matrices. Why is the SVD-based one preferred? Fully human-written
Don't Run with Scissors: Pruning Breaks VLA Models but They Can Be Recovered Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces GLUESTICK, a training-free, pruning-agnostic post-pruning recovery method. While pruning is an effective compression technique for LLMs, it causes drastic degradation in VLA models, leading to near-zero task success rates and increased safety violations. GLUESTICK computes a lightweight corrective term via SVD of the difference between dense and pruned weights and then applies this correction during inference. A single interpretable hyperparameter rank $r$ is used to balance efficiency and accuracy. The overall method is simple and easy to implement. Experimental results across several VLA models and benchmarks show that GLUESTICK can help recover most of the lost performance while maintaining memory efficiency. - The paper’s observation of the pruning collapse issue in Vision-Language-Action (VLA) models is quite meaningful. - The proposed method is simple, efficient, and easy to implement, compatible with various pruning techniques. - It offers valuable insights for the compression, pruning, and deployment of VLA models. - The method presented in this paper bears some similarity to adding a low-rank adapter on top of pruning to offset pruning-induced losses. It would be better for the authors to elaborate on the differences between their proposed method and approaches like LoRA. - When utilizing different backbones and dimensions, how should the hyperparameter $r$ (rank) be determined for each scenario? Would it be feasible to assign distinct $r$ values to different weight matrices? This adjustment seems promising for further enhancing the trade-off between performance and efficiency. - The performance of the method under different pruning sparsity levels is not explored. - Does the proposed method have any impact on inference speed and latency? - In practice, VLA deployment often requires combining pruning with other methods like quantization. Does GLUESTICK still work with these techniques? Lightly AI-edited
CodeMirage: Stress-Testing AI-Generated Code Detectors Against Production-Level LLMs Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. ## Paper Summary This paper introduces CodeMirage, a large-scale benchmark designed to evaluate and stress-test AI-generated code detectors under realistic, multilingual, and adversarial conditions. It addresses limitations in prior benchmarks that (1) focused on few programming languages, (2) relied on non–production-level LLMs, and (3) lacked adversarial scenarios such as paraphrasing. CodeMirage includes 210K samples across 10 programming languages, derived from human-written code (CodeParrot dataset), AI-generated code from 10 commercial LLMs, and AI-perturbed (paraphrased/adversarial) variants. The authors evaluate 10 representative detectors covering four paradigms (zero-shot, embedding-based, fine-tuned, and pretrained-LLM-with-classifier) under four configurations — in-distribution, out-of-distribution (cross-model/language), adversarial, and hybrid (OOD+adversarial). ## Strengths 1. Comprehensive Benchmark Coverage – CodeMirage spans 10 major programming languages and includes 10 diverse production-level LLMs (e.g., GPT-4o, Claude, Gemini, DeepSeek, Qwen), significantly improving realism over prior datasets. 2. Valuable Insights – The analysis (e.g., fine-tuning overfitting vs. zero-shot robustness) offers practical implications for deploying detectors in real-world environments. 3. Well-Structured and Readable – The paper is clearly written, well-organized, and presents technical content in an accessible and logical manner. ## Weaknesses 1. Limited Novelty – The novelty of this work appears modest. Comparing with related work in Table 1, the contribution primarily lies in integrating existing dataset design principles (multi-language coverage, adversarial perturbation, multi-model generation) rather than proposing a fundamentally new methodology. If the main contribution is the benchmark construction, similar ideas have already been explored in works such as CoDet-M4 (Orel et al., 2025) and LLMGCode (Xu et al., 2024b). If the contribution lies in evaluating more baselines, the paper does not introduce new detection methods or theoretical insights, and thus the advancement is mostly empirical rather than conceptual. 2. Motivation Needs Stronger Justification – The motivation for document-level detection remains unclear. The paper should better articulate why evaluating detectors on entire files (instead of function- or snippet-level, as in prior work) is necessary or more realistic. For example, are detectors expected to operate on full repositories in deployment? Or is document-level detection shown to capture distributional cues that function-level detection misses? Without such justification, the motivation appears weak. 3. Missing Implementation Details – Several approach details are under-specified. For AI-code perturbation, the six transformation types are only briefly mentioned, but not defined. Likewise, the implementation details for Multi-Round Paraphrasing, DeepWordBug, and AST-based Perturbation are omitted in the main text and deferred to appendices, making reproducibility difficult. 
A short description in the main section would improve clarity. 4. Lack of Fine-Grained Analysis – First, there is no detailed discussion on which specific LLMs produce code that is easier or harder to detect — although Figure 3 hints at this, the insight is not explicitly analyzed. Second, the evaluation setup seems largely model-agnostic, raising the question of how it leverages CodeMirage’s unique dataset properties. Since most perturbation techniques (e.g., paraphrasing, AST rewriting) are reused from prior work, the connection between the dataset design and the evaluation outcomes is unclear. Third, no dataset-specific evaluation (e.g., per-language or per-perturbation difficulty analysis) is provided to highlight what CodeMirage uniquely contributes beyond existing benchmarks. 5. Lack of Broader Impact or Future Vision – The paper does not clearly discuss how CodeMirage will influence future research or deployment. For example, will it enable detector training, robustness certification, or standardized evaluation protocols? Without articulating such a broader vision, the impact of CodeMirage may remain limited to a one-time empirical study rather than a lasting benchmark standard. 1. What are the six transformation types in the adversarial perturbation generation? 2. If you conducted the same experiments on the datasets from existing work, would you reach the same conclusions? 3. What is the motivation for document-level detection? Moderately AI-edited
CodeMirage: Stress-Testing AI-Generated Code Detectors Against Production-Level LLMs Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces CodeMirage, a comprehensive benchmark for evaluating AI-generated code detectors under realistic and adversarial conditions. CodeMirage comprises approximately 210,000 code samples across 10 programming languages, including human-written code from GitHub and AI-generated/perturbed variants produced by 10 state-of-the-art production-level LLMs (including reasoning models like DeepSeek-R1, GPT-o3-mini, and Gemini-2.0-Flash-Thinking). The authors design six progressively challenging evaluation tasks across four configurations: in-distribution testing, out-of-distribution (cross-model and cross-language) testing, adversarial perturbation (primarily paraphrasing), and hybrid testing combining OOD shifts with adversarial attacks. Through extensive experiments with 10 representative detectors spanning four methodological paradigms (zero-shot, embedding-based, fine-tuning-based, and pretrained-LLM with downstream classifiers), the paper reveals several key findings. + The paper constructs a comprehensive and realistic benchmark that includes 10 programming languages and applies varied production-level LLMs to rigorously evaluate LLM-generated code detectors + The benchmark includes varied tasks to evaluate LLM-code detectors at different difficulty levels and from multiple perspectives + The paper presents a very comprehensive evaluation over existing LLM-code detectors and draws useful conclusions as clear takeaways. - **Outdated code snippets for evaluation**. To avoid contamination with AI-generated code, the authors use the CodeParrot GitHub-Code-Clean dataset collected in May 2022, before the widespread deployment of modern code-generating LLMs. However, this reliance on pre-2022 code may introduce distributional shifts that affect the validity of findings, as coding practices, library usage patterns, and programming paradigms have evolved significantly since then. This setting raises concerns about whether the conclusions drawn from this benchmark generalize to contemporary code written today and in the future, particularly given that modern developers are evolving their coding behaviors by collaborating with advanced libraries and AI tools, potentially adopting significantly different patterns to detect unwanted AI usage or plagiarism. - **Existing detectors seem to perform well for in-distribution data, challenging the difficulty and practical value of the benchmark**. As shown in Figures 3 and 9, many detectors achieve high F1 scores (often >0.85-0.95) in the in-distribution setting, with fine-tuned methods like GPTSniffer and CodeT5+ performing very well across languages and LLMs. This suggests that the in-distribution detection task may be too easy when training and test data share the same generator and language, potentially limiting the benchmark's ability to differentiate detector capabilities in this most basic scenario.
The authors apply a conservative BLEU < 0.5 threshold to filter out potentially memorized AI-generated code, which may unintentionally introduce a distribution divergence between human-written and AI-generated code that can be easily captured by existing detectors. This aggressive filtering may artificially inflate the degree of divergence required of AI-generated code in the benchmark, potentially biasing the dataset toward only highly divergent AI outputs while excluding realistic scenarios where AI models appropriately generate conventional solutions that naturally resemble human code patterns. - Could the authors conduct some analysis or provide some conceptual discussion to assess whether the temporal gap (only pre-2022 code) introduces systematic biases when evaluating existing detectors on the most recent code? - Could the authors explain whether the low-BLEU-score filtering artificially introduces distribution divergence and makes it easier for detectors to tell AI-generated code from human-written code? Fully human-written
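As a point of reference for the filtering step discussed above, a BLEU-threshold memorization filter of the kind the review refers to could look like the sketch below; the whitespace tokenization, the smoothing choice, and interpreting the 0.5 cutoff on NLTK's 0-1 scale are assumptions, since the benchmark's exact BLEU configuration is not specified here.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def keep_generated_sample(ai_code: str, human_reference: str, threshold: float = 0.5) -> bool:
    """Drop AI-generated code that is too close to its human reference (possible memorization).

    Tokenization is naive whitespace splitting; a real pipeline would use a code tokenizer.
    """
    hyp = ai_code.split()
    ref = human_reference.split()
    score = sentence_bleu([ref], hyp, smoothing_function=SmoothingFunction().method1)
    return score < threshold  # keep only sufficiently divergent samples
```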
CodeMirage: Stress-Testing AI-Generated Code Detectors Against Production-Level LLMs Soundness: 2: fair Presentation: 4: excellent Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This work investigates the important task of stress-testing AI-generated code detectors. It includes considerations of key data domains such as programming language types and challenge levels, and generates a large volume of test data. Additionally, this work leverages models as assistants to enhance the efficiency of data generation, making improvements and supplements to traditional approaches. 1: This paper conducts testing across a wide range of programming languages, effectively addressing the limitations of existing works that predominantly focus on mainstream backend programming languages such as C and Python. 2: The work considers the distribution of objective factors such as challenge levels and scene types, thereby providing a comprehensive data framework. 3: Through extensive experimentation, a large number of models are evaluated, leading to eight key findings. 1: The scale of the dataset remains a weak point, especially in the context of ten programming languages and a large amount of LLM-assisted sample generation. There are concerns about whether the current dataset size can adequately support stress testing in the complex scenario of code generation. 2: There is insufficient discussion on how this work differentiates itself from existing studies, such as the peer-reviewed and published work Droid: A Resource Suite for AI-Generated Code Detection, and potentially other relevant works. These studies demonstrate significant overlap with the content of this work and are based on more comprehensive efforts, which undermines the contributions presented in this paper. 3: The use of LLMs for data generation is not a novel technique. Rather than focusing on the increased output of test data through LLMs, greater attention should be paid to ensuring the quality and diversity of synthesized test data, particularly for code—an inherently complex data structure. This aspect warrants further exploration. 4: There are minor issues in the paper’s presentation. While the number of categories and hierarchy are indeed essential considerations in dataset construction, the rationale and comprehensiveness of the classification scheme deserve more detailed discussion. Additionally, these categories lack a macro-level, systematic visual representation. The eight key findings in the experiments appear overly trivial and lack robust justification and deeper analysis. These issues cause the paper's intended message to be muddled and lacking in impact. 1: Could the authors elaborate on the potential scalability of this work? For instance, how could the dataset size be increased through upgrades to the LLM or enhancements in resource allocation? 2: Have the authors considered the positive impacts of evaluation in this context? For example, how could this work help mitigate specific shortcomings in code generation tasks? 3: The authors should provide clearer visual representations and deeper explanations of the motivation behind the hierarchical classifications across different dimensions. Without such clarification, it would be difficult to accurately assess the true value of these figures.
4: Is it possible to present a more objective comparison of the advantages of this work relative to other recent, relevant studies? Heavily AI-edited
Reinforced Data-Driven Estimation for Spectral Properties of Koopman Semigroup in Stochastic Dynamical Systems Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes Reinforced Dynamic Mode Decomposition, which introduces Reinforcement Learning (RL) to automatically guide the data collection process for Stochastic Dynamic Mode Decomposition. They shape a new reward signal to guide the agent based on spectral consistency, which measures how well the Koopman operator has been estimated. The method has been validated using three different RL algorithms on canonical systems. Moreover, they provide a theoretical analysis of the proposed algorithm under strong assumptions. - They introduce a new reward function to guide the agent in collecting the data. The function leverages an exploitation term, which is defined as the spectral consistency, and an exploration bonus measured with a Gaussian kernel. - The data-collection method is shown to work with three different RL algorithms. 1. Lack of baselines: The paper is missing baselines to show that the method brings benefits. For example, randomly initializing the agent position at each rollout and collecting data from there could be a simple yet effective comparison. 2. Inconsistency between the theory and the experimental results: The theoretical results are built upon the strong assumptions that the Q and V functions can be expressed linearly in the DQN and PPO algorithms, respectively. However, these assumptions are not tested to see if they hold in practice. 3. Missing reward learning results: The paper does not demonstrate whether the proposed reward function can actually be learned by the agent in practice. The authors should include figures relating the agent’s performance to the reward convergence, to illustrate how expert the agent used for data collection actually is. 4. Computational costs: The proposed method appears to be computationally intensive. Without a comparison to baseline methods, it remains unclear whether the computational costs are justified. 5. Unclear role of R_0: You don’t justify the role of this baseline, and whether it is a hyperparameter to be tuned or not. This term is left unaddressed throughout the paper. Minor: 1. In Line 356, you mention that Figure 5 shows the first eigenfunction, but you are showing just the second one. 2. Typo in line 483 -> “Essentiall” Unclear meaning of images and doubts on the evaluation process: 1. You show in Figures 2, 3, and 5 that the Koopman eigenfunctions are learnt better on the data points collected using an improved agent over training. Are these eigenfunctions learnt using a fixed number of points? Are these coming from the different policies obtained during training in an “off-policy” way? 2. Where do the points on which you evaluate the eigenfunctions come from? Are the eigenfunctions learnt using all of those points or just a portion of them? Fully human-written
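To illustrate the kind of reward the reviews describe, here is a minimal sketch combining an eigenpair one-step consistency term with a Gaussian-kernel exploration bonus; the exact functional form, the continuous-time eigenvalue convention, and all names are assumptions made for illustration, not the paper's definition.

```python
import numpy as np

def spectral_consistency_reward(x_t, x_next, eigvals, eigfuns, visited,
                                dt=0.01, beta=0.1, bandwidth=0.5):
    """Illustrative reward: eigenpair one-step consistency plus a kernel exploration bonus.

    eigvals: estimated (continuous-time) Koopman eigenvalues
    eigfuns: callables, eigfuns[i](x) -> eigenfunction value (possibly complex)
    visited: (n, d) array of previously sampled states
    """
    # Exploitation: how well each eigenpair predicts the observed evolution,
    # phi_i(x_{t+dt}) ~ exp(lambda_i * dt) * phi_i(x_t)
    errs = [abs(f(x_next) - np.exp(lam * dt) * f(x_t))
            for lam, f in zip(eigvals, eigfuns)]
    consistency = -float(np.mean(errs))

    # Exploration: Gaussian-kernel novelty relative to previously visited states
    if len(visited) > 0:
        sq_dists = np.sum((visited - x_t) ** 2, axis=1)
        novelty = 1.0 - np.max(np.exp(-sq_dists / (2 * bandwidth ** 2)))
    else:
        novelty = 1.0
    return consistency + beta * novelty
```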
Reinforced Data-Driven Estimation for Spectral Properties of Koopman Semigroup in Stochastic Dynamical Systems Soundness: 3: good Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper integrates Reinforcement Learning (RL) with Stochastic Dynamic Mode Decomposition (SDMD) to improve data-driven Koopman spectral estimation for stochastic dynamical systems. The method is named Reinforced Stochastic Dynamic Mode Decomposition (Reinforced SDMD). It is well understood that the capacity of Koopman-based methods to approximate the spectral decomposition of the (possibly stochastic) evolution operator crucially depends on the distribution of samples, that is on where and how trajectories are obtained. Poorly chosen initial conditions, or the long time scales needed to escape metastable regions, lead to inaccurate estimation of eigenfunctions and spectra. To address this issue, Reinforced SDMD casts data collection as an RL problem. That is, the agent’s policy determines the initial sampling locations of trajectories, which are then obtained by numerically solving a known SDE over some time horizon. The reward is based on spectral consistency, that is on how well estimated eigenpairs predict system evolution, and an exploration term encouraging coverage of the state space. The paper explores sequential decision-making algorithms including Bandit, DQN, and PPO, showing that the agent identifies dynamically informative regions. Theoretical analysis that links the quality of the learned policy to Koopman operator estimation accuracy is provided. Experiments on small-dimensional canonical stochastic systems (double-well, Duffing oscillator, FitzHugh–Nagumo) show efficient discovery of coherent regions without prior domain knowledge. (1) The proposal to use an RL framework in combination with numerical solvers of SDEs to estimate the spectral decomposition of the corresponding semigroup of Markov transfer operators, and hence build an efficient ML-based solver, is, to the best of my knowledge, novel and interesting. (2) The choice of analysing different sequential decision-making algorithms (Bandits, DQN, PPO) is appreciated. (3) The choice of canonical SDEs is appropriate for small-dimensional problems, and the experiments support the claim that Reinforced-SDMD can obtain good estimates. (1) The paper fails to report on the big body of work on learning transfer operators of stochastic systems, related not only to the proposed SDMD but also to the problem of sampling from complex distributions via data-driven SDEs. To name just a few directly related works: - Christof Schütte and collaborators have a big track record on learning stochastic systems and in particular treating the problem this paper tackles, see e.g. the 150-page review Overcoming the Timescale Barrier in Molecular Dynamics, Acta Numerica 2023, and references therein. - Frank Noe and collaborators have also made a significant impact on this topic, see e.g. VAMPnets for deep learning of molecular kinetics, Nature Communications 2018. - Massimiliano Pontil and collaborators made significant contributions in understanding learning algorithms of transfer operators associated to SDEs, see e.g. Learning dynamical systems via Koopman operator regression in reproducing kernel Hilbert spaces, NeurIPS2022.
- More recent papers providing methods on learning continuous semigroups of stochastic processes, also containing statistical learning bounds: - Hou et al., Sparse learning of dynamical systems in RKHS: An operator-theoretic approach. ICML2023 - Devergne et al., From biased to unbiased dynamics: An infinitesimal generator approach. NeurIPS2024 - Kostic et al., Laplace transform based low-complexity learning of continuous Markov semigroups. ICML2025 (2) The proposed method is not adequately compared to the vanilla use of numerical SDE solvers. Namely, the reader cannot judge whether Reinforced-SDMD, which carries the big overhead of running sequential decision-making algorithms to get new samples, has any advantage over randomly sampling initial points in the state space. This is all the more important since the current paper only works in small state dimensions, and its scalability is not clear. (3) The authors make a strong assumption on the core transfer-operator estimation method (SDMD), and then continue with standard theoretical arguments for the convergence of the considered sequential decision-making algorithms. Hence, the main novelty of the paper is methodological with weak, at best, theoretical novelty. (4) From my perspective, many aspects of the paper are not clarified enough, please see the questions for details. (5) Given the range of methods for learning representations with neural networks (VAMPnets, DPNets, LoRA, ...) that can provide the subspace on which SDMD is run, as well as the many competitors of SDMD, at least a broader discussion of the different approaches that can be coupled with the RL formulation is needed to appreciate the authors' proposed approach. My current score reflects the identified weaknesses; however, I am ready to revise it depending on the authors' clarifications and revision of the paper. (1) Your choice of reward implicitly assumes that the noise level (diffusion term) is much smaller than the signal of the drift, making the forecasting of states a reasonable task. But isn't the forecasting of distributions the adequate reward for stochastic systems? What happens if the diffusion is stronger, and how does it impact the experiments? (2) Looking at Eq. (2.7), one expects that the estimator has a large number of eigenvalues close to one when the dictionary is sufficiently large. Is this the case in your experiments? If yes, how do you choose the eigenvalue-eigenfunction pairs which approximate the true leading non-trivial ones (typically just a few close to one)? (3) In Assumption 5.1, which norm is used? If the norm is the operator norm on $\mathcal{F}=L^2(\mathcal{M},\rho)$, how do you formally define the estimator acting on this domain? If it is the norm w.r.t. the finite-dimensional subspace of $\mathcal{F}$ given by the dictionary, the transfer operator is typically not well-defined on it, so the assumption becomes unreasonable. I believe that the formal presentation of the method needs to improve significantly in order for the proofs to be verifiable. (4) For learning the transfer operator, or the generator, the samples need to come from the invariant distribution so that one can guarantee learning of the object on the domain $\mathcal{F}$. However, in your proposal, we are getting samples from adequate supports of the invariant distribution (dense in meta-stable states), but I don't see how the samples are distributed according to $\rho$. What am I missing? Fully human-written
Reinforced Data-Driven Estimation for Spectral Properties of Koopman Semigroup in Stochastic Dynamical Systems Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. In the present paper, the authors introduce an amendment to the stochastic dynamic mode decomposition in which reinforcement learning is utilized in an outer loop to optimally select the trajectory sample points to satisfy a Koopman reward function. Evaluation is performed on a set of 3 stochastic dynamical systems, namely the double-well potential, the stochastic Duffing oscillator, and the FitzHugh-Nagumo model. The paper is well-motivated, and very well written in its construction of the individual algorithm variants, relating the background theory for the Koopman operator to their subsequent construction. As is, the paper suffers from a number of inherent flaws; in short, at a high level: * Insufficient relation to existing work on optimal/intelligent sampling, especially to active sampling, into the broader purview of which this paper falls * Lack of a coherent design of evaluation * Approach seemingly confined to a single DMD algorithm. Broader utility to the field not readily apparent #### Insufficient relation to related work * The paper is not set in relation to other work which utilizes reinforcement learning for optimal sampling strategies such as Zhao and Pillai [1], and Shen et al. [2]. * The core idea of the paper, moving away from random sampling for sDMD, is not confined to DMD alone but extends to the wider scientific computing and machine-learning-based design literature. Some use GP surrogates to sample from in order to alleviate expensive individual sampling trajectories [3], others train individual sampling models [4], yet all of them can be broadly summarized under the larger umbrella of _Active Learning_ [5]. * Line 54-56, the choice of dictionary is orthogonal to the same issue faced in symbolic regression / SINDy-based approaches. Drawing this connection would aid greatly in embedding this work in existing literature. * Line 58, some have started utilizing LLMs for the learning of the dictionary. See e.g. [6]. #### Design of Evaluation * The current evaluation does not permit drawing conclusions on the performance of the introduced algorithms as only the potential and eigenvalues of the stochastic dynamical systems are shown. Going further, it is not apparent which algorithm(s) are used for the construction of the eigenvalues. * There exists no actual performance evaluation, such as e.g. evaluating each of the 3 algorithms on each of the stochastic dynamical systems for its sampling efficiency to reach a predetermined quality. This would also be a natural point to introduce ablations. * The authors perform no ablations. To properly motivate the use of reinforcement learning for optimal sampling, an ablation against random sampling and, e.g., importance sampling would be highly desirable to actually establish the performance advantage of the introduced algorithms. As is, it is not apparent whether the new algorithms outperform a random sampling baseline or not. #### References 1. Zhao, D., & Pillai, N.S. (2024). Policy Gradients for Optimal Parallel Tempering MCMC. ArXiv, abs/2409.01574. 2. Shen, W., Dong, J., & Huan, X. (2023). 
Variational Sequential Optimal Experimental Design using Reinforcement Learning. ArXiv, abs/2306.10430. 3. Jones, A., Cai, D., Li, D. et al. Optimizing the design of spatial genomic studies. Nat Commun 15, 4987 (2024). https://doi.org/10.1038/s41467-024-49174-4 4. Fannjiang C, Listgarten J. Autofocused oracles for model-based design. Advances in Neural Information Processing Systems. 2020;33:12945-56. 5. Hsu, D.J. (2010). Algorithms for active learning. 6. Grayeli, A., Sehgal, A., Costilla-Reyes, O., Cranmer, M.D., & Chaudhuri, S. (2024). Symbolic Regression with a Learned Concept Library. ArXiv, abs/2409.09359. * Why are the algorithms not evaluated on more challenging (stochastic) environments? As is, it is hard to evaluate the limits of the approach. * Has there been any quantitative comparison to commonly used sampling algorithms? Training a reinforcement learning agent is not cheap, and especially in such a highly specific application it is not readily apparent to the reviewer how the expended compute is to be amortized later on. A potential comparison here might be taking one of your existing environments on which the agent is trained, and providing the same compute budget to a random sampling based SDMD to then compare them on key metrics. Fully human-written
Reinforced Data-Driven Estimation for Spectral Properties of Koopman Semigroup in Stochastic Dynamical Systems Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces a novel technique for learning the Koopman operator via reinforcement learning (RL). The proposed framework, called Reinforced Stochastic Dynamic Mode Decomposition (Reinforced SDMD), integrates RL with SDMD to actively guide the data acquisition process in stochastic dynamical systems. The method leverages three RL algorithms (Multi-Armed Bandit, Deep Q-Network (DQN), and Proximal Policy Optimization (PPO)) to generate well-behaved trajectories that enhance the robustness of Koopman operator estimation. The reward signal is based on a spectral consistency criterion, designed to encourage the agent to collect informative trajectories while maintaining adequate exploration. The authors validate their approach on three synthetic systems (the double-well potential, stochastic Duffing oscillator, and FitzHugh–Nagumo model), showing that the agent can identify dynamically relevant regions. They also provide a theoretical convergence analysis that connects estimation accuracy to the quality of the learned sampling policy. The integration of RL and Koopman operator estimation is conceptually appealing and addresses an important limitation of existing data-driven spectral methods, i.e., their dependence on data quality and sampling. I appreciate that this approach is systematically evaluated using three distinct RL algorithms and tested across three representative dynamical systems. Overall, the paper is clearly written, well structured, and technically sound. - While the qualitative illustrations are convincing, it remains unclear how much improvement RL sampling yields compared to random or uniform sampling strategies. Including quantitative metrics, for instance the eigenvalue distance between estimated and ground-truth spectra (e.g., using the directed Hausdorff distance) would significantly strengthen the empirical validation. - The experiments focus exclusively on 2D systems; even a moderate increase in dimensionality (e.g., 5-10 dims) would help demonstrate the method’s scalability and computational feasibility. See Weaknesses Fully AI-generated
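The eigenvalue-distance metric suggested in this review is straightforward to compute; a minimal sketch using SciPy's directed Hausdorff distance on eigenvalues viewed as points in the complex plane is given below (the symmetrization via the max of both directions is an assumption, since the review leaves the exact variant open).

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def spectrum_distance(est_eigs, true_eigs):
    """Hausdorff-style distance between two sets of (possibly complex) eigenvalues."""
    est = np.column_stack([est_eigs.real, est_eigs.imag])
    true = np.column_stack([true_eigs.real, true_eigs.imag])
    return max(directed_hausdorff(est, true)[0],
               directed_hausdorff(true, est)[0])

# example:
# spectrum_distance(np.array([0.98 + 0.01j, 0.70 + 0j]), np.array([1.0 + 0j, 0.72 + 0j]))
```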
A superpersuasive autonomous policy debating system Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper describes an end-to-end, fully autonomous debating system for American-style policy debates, a significantly more complex competitive debate format than the ones previously considered in this space. The system contains multi-agent workflows for strategic planning and argument generation where multiple agents collaborate and critique each other's outputs to eventually come up with the best response. Successfully participating in competitive policy debates requires a very high level of planning, strategizing and reasoning capabilities. Most humans will struggle if asked to participate in such a debate. Developing autonomous systems that can achieve this is extremely ambitious and the authors should be commended for pursuing this goal. The techniques suggested, even though they lack details, seem reasonable as they take inspiration from the cognitive processes that humans go through when participating in a debate. The paper lacks details. While there is a full system diagram in the appendix, it is very complex and difficult to understand. The authors write that each component is a multi-agent system where agents collaborate and critique each other, but hardly any details are provided on how this is done. Figure 2 is too high-level and the description in Section 4.3 presents the stages which expert human debaters go through during a competitive debate. It involves a lot of debating terminology but no description of the actual implementation of the multi-agent workflows. Very few experiments were conducted, with no ablations whatsoever. It is possible that the authors developed a very impressive system, but the way the paper is currently written, it is not ready for publication. Please address the following comments regarding the comparison to Project Debater: - From the description in Section 3, one can infer that IBM's Project Debater generates a one-off speech. While it is true that the debate format addressed in Project Debater is much simpler than policy debates, it does include opening, rebuttal, and summary speeches. - The system described in the paper uses a database of high-quality arguments (‘cards’) created specifically for policy-style debates. This bypasses the difficult task of extracting claims and evidence from an unstructured corpus and putting them together, which was the main challenge in previous autonomous debating systems such as Project Debater. Fully human-written
A superpersuasive autonomous policy debating system Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents an autonomous system designed to compete in full-scale, American competitive policy debates, surpassing prior efforts like IBM Project Debater, which is designed for shorter and simplified debates for lay people. The system uses a hierarchical architecture of LLM agents that collaborate through specialized workflows to perform tasks such as evidence retrieval, argument synthesis, and self-correction, leveraging the OpenDebateEvidence corpus. Demonstrated through a live, ongoing debate performance, the agents generate complete speech transcripts, engage in multi-turn exchanges, and conduct intelligent cross-examinations. Preliminary evaluations show the system consistently produces superior arguments compared to human-authored cases and wins simulated rounds judged by an autonomous adjudicator. Expert debate coaches also favor its outputs over those of human debaters. Overall, the manuscript seems to have been written hastily. For publication, important details need to be added, the presentation should be improved, etc., as detailed in the weaknesses section below. 1. This work tackles the interesting and important topic of persuasive argument generation. 2. The experiment results seem promising. 1. The manuscript is missing important details, preventing an adequate evaluation. - The use of multiple agents is heavily emphasized, yet they are not explained. The only hint I can find is not in the main text but in the caption for Fig 1, where it states that gpt-4-mini is used. Does that imply an “agent” is simply gpt-4-mini with a different prompt? If so, what are the prompts? How were they selected? - The interaction between agents in the pipeline presented in Sec 4.3 is not clearly described. - Given that practically no information is provided about the human authors who competed against the proposed system, it is difficult to judge the significance of the experiment results. 2. The experiments do not adequately evaluate the system nor provide useful insights. - Without any analysis of results, it is hard to know what the strengths and weaknesses of the system are. - The experiments are only done against human authors, but other LLM-based baselines can show which components of the proposed pipeline are responsible for the performance. For instance, given that a policy debate case has a rigid structure, an approach that uses templates would be a good baseline to showcase the superiority of the proposed pipeline for “mastering the intricate structure and esoteric strategies.” 3. Dense retrievers have been around for years and have been shown to be superior to BM25 in general. Yet a simple BM25 is used for retrieval. 4. The clarity can be improved. More details about American competitive policy debate should be provided for the general audience. It follows a rigid structure, yet the structure is not explained. For instance, the components and structure of an affirmative case can be presented. Also, the figures can be better prepared. Fig 3 is a very important figure, but it was not designed for easy perusal. Fig 1 takes up a lot of space without providing much information. Lastly, the paper can be better formatted. 
For instance, the unnecessarily large gaps in pg 2 can be removed and the appendix can be formatted so that readers know what to look for and where. 5. The whole body of work on argument generation, which is much more than just IBM Project Debater, is closely related to this work and should be surveyed. - Spreading seems to be a relatively easy technique for AI systems to master. If it is considered a legal technique, what prevents AI from abusing it? - This task seems a lot more rigid in structure than the setup for IBM Project Debater. I see that the need to fluently generate all the components can be more challenging, but doesn’t the rigid structure also make this task easier? Fully human-written
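For context on the retrieval criticism in this review, a keyword-based BM25 retriever over evidence "cards" can be set up in a few lines, which is part of why reviewers question its use over dense retrieval; the rank_bm25 library and the toy cards below are assumptions chosen for illustration, not the paper's actual retrieval stack.

```python
from rank_bm25 import BM25Okapi

# Toy evidence store of debate "cards"; the paper would use the OpenDebateEvidence corpus.
cards = [
    "Federal investment in renewable energy creates manufacturing jobs.",
    "Carbon pricing reduces emissions without harming GDP growth.",
    "Grid modernization is a prerequisite for large-scale renewables.",
]
bm25 = BM25Okapi([c.lower().split() for c in cards])

query = "economic impact of renewable energy policy"
top_cards = bm25.get_top_n(query.lower().split(), cards, n=2)
print(top_cards)  # lexical-overlap ranking; no semantic matching as in dense retrievers
```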
A superpersuasive autonomous policy debating system Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 0: Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors tackle the problem of American-style policy debating, a challenging and structured competitive debate format that emphasizes detailed, high-quality evidence and structural constraints. They propose a multi-agent pipeline that constructs the various components of the policy debates, relying on sparse retrieval over a curated dataset of debate evidence. The paper presents results showing that this agentic pipeline outperforms human-authored debates. The work deals with a timely topic, illustrating the potential (and danger) of producing persuasive content with modern LLMs. 1. In my view, the paper does not offer any technical novelty - the gist of it is that the authors tailored a multi-agent pipeline to their problem and dataset and used gpt-4.1 mini along with agent frameworks and retrieval with BM25. While limited technical novelty can be OK for a paper in the applications track, it does certainly increase the burden to deliver strong contributions in other aspects (e.g., comprehensive analyses, useful code, surprising results, valuable resources etc.), which unfortunately I personally thought were lacking as well. Moreover, when claiming that "our core contribution is a novel multi-agent architecture" I would have expected something more than a keyword search with a pipeline of LLM-based nodes that look quite standard (generate -> search -> review quality, as described in Figure 2). 2. There is no analysis, and really very little in terms of results - the entire empirical section consists of human evaluation of 3 system outputs and the evaluation of 20 simulated debates using an LLM judge (with no additional validation). So this raises some doubts about the robustness and significance of the work, but just as importantly in my opinion does not provide many insights to the reader, whether in terms of ML, LLM reasoning, implementation challenges or even debating-specific insights. As I see it, the paper content can be accurately summed up as "we tailored a multi-agent pipeline with retrieval and according to an LLM judge it beats humans". In my view, this is simply not enough as a contribution for a conference paper. 3. The description of the pipeline is composed mainly of a large amount of debating jargon, making it largely incomprehensible to the naive reader. 4. I felt that the paper somewhat misrepresents prior works. Notably, the authors state as an advantage of the present work that it relies on a human-curated evidence dataset, whereas prior works synthesized arguments from broad corpora. But importantly, argument mining and synthesis from a large-scale general-purpose corpus is precisely the technical challenge that many of the prior works aimed to tackle; and relying on a debate-specific corpus, where relevant debate arguments and debate evidence are readily accessible, mainly means that the present work tackles a very different, and in certain ways easier, technical problem. Also regarding specifics, stating for example that Slonim et al. 
"produces a single, monolithic speech with very limited reference to evidence" is incorrect (they produced more than one speech per motion and did employ evidence as a central component). * Why does the abstract talk about a "continuously running live spectacle debate performance"? I am not sure how this connects to the rest of the paper content. Typos: - l. 108 Our contribution -> Our contributions - l. 174 growing body literature -> growing body of literature - l. 178 "hyperpersuasion -> "hyperpersuasion" Fully human-written
A superpersuasive autonomous policy debating system Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper investigates argument modeling and introduces a multi-agent system for policy debate. The proposed framework follows a structured retrieval–synthesis–self-correction workflow to generate debates, leveraging an external knowledge base, OpenDebateEvidence, to provide factual grounding. Through this design, the system produces arguments that are both logically coherent and supported by evidence. The authors evaluate the framework on simulated debates assessed by human judges, focusing on argument quality, factual accuracy, and faithfulness to retrieved evidence. - This paper proposes a multi-agent system for debate modeling augmented with external evidence. - The paper lacks sufficient novelty, as its main contribution lies primarily in system construction rather than method innovation. As such, it would be more appropriate as a demonstration paper than as a full research submission to a top-tier venue like ICLR. - Important implementation details are missing, including the choice of the base model, prompt design, and human evaluation setup. The absence of these details significantly reduces the transparency and reproducibility of the work. - The experimental evaluation is insufficient. The authors only report results on 20 simulated debates without comparing against baseline systems or conducting ablation studies to justify the effectiveness of each component. Moreover, the paper provides little to no analysis of the results, which further weakens the empirical support for the proposed framework. - There is no ethics statement, even though an argumentation or debate system may produce harmful content. None Lightly AI-edited
Invariant and equivariant architectures via learned polarization Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a novel theoretical framework for constructing invariant and equivariant neural networks that respect group symmetries. Whereas traditional approaches often rely on the "generating set" of the invariant ring, which is frequently computationally intractable, this work focuses on the less restrictive concept of a "separating set." **Importance of the Problem Setting and Viewpoint:** The paper's premise—that identifying the "generating set of the invariant ring," which serves as the theoretical foundation for designing symmetric neural networks, is extremely difficult—is accurate and highly relevant. **Introduction of "Separating Sets":** In response to this difficult problem, the theoretical approach of introducing the concept of a "separating set"—a weaker condition than requiring the complete information of the invariant ring (the generating set)—and attempting to construct an architecture that can universally approximate invariant functions using this set, is original. **Complete Lack of Experimental Validation :** The paper remains a purely theoretical proposal, and no experiments whatsoever have been conducted to demonstrate the effectiveness, practicality, or limitations of the proposed framework. The fact that the proposed method theoretically guarantees the *existence* of a separating set is an entirely different matter from whether it can be stably *learned* as a machine learning model on actual data (e.g., via gradient descent) and whether it possesses practical advantages (e.g., data efficiency, generalization, computational cost) compared to existing methods. To claim theoretical soundness, minimal proof-of-concept (PoC) experiments (e.g., a demonstration on a simple finite group) are essential. **Lack of Awareness of Related Work (Especially Practical Invariant/Equivariant Networks) :** The authors repeatedly claim that "existing methods require explicit knowledge of the generating set of the invariant ring," but this does not accurately reflect recent research trends. For example, many studies, led by Deep Sets (Zaheer et al., 2017) and Invariant Graph Networks (Maron et al., 2018), achieve permutation invariance (a type of symmetry) using simple operations like sum-pooling or averaging, without any complete knowledge of the invariant ring, and have demonstrated high practical utility. The theoretical contribution of this paper needs to be clearly discussed and compared with these (already practical) approaches to clarify its relationship and potential advantages. However, this contextualization is severely lacking, making it difficult to judge the paper's novelty and contribution. **Regarding the lack of experiments:** Why does this paper not include even minimal experiments (e.g., a proof-of-concept on a simple dataset with known symmetries) to validate the effectiveness of the proposed method? A theoretical "existence proof" does not necessarily guarantee practical effectiveness as a machine learning architecture in terms of learning stability, expressive power, or computational cost, does it? 
**Regarding comparison to existing practical methods:** What are the theoretical and practical differences between the "learned polarization" proposed in this paper and existing methods that achieve invariance/equivariance without explicit generating sets of the invariant ring (e.g., sum-pooling), as exemplified by Maron et al. (2018) and Zaheer et al. (2017)? Fully AI-generated
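To make the sum-pooling comparison in the last question concrete: a minimal Deep Sets-style model whose permutation invariance comes from the symmetry of the sum alone, with no knowledge of the invariant ring required; the layer widths below are arbitrary and purely illustrative.

```python
import torch
import torch.nn as nn

class DeepSets(nn.Module):
    """Permutation-invariant model f(X) = rho(sum_i phi(x_i)).

    Invariance follows from the symmetry of the sum alone; no generating
    set of the invariant ring is needed. Layer widths are illustrative.
    """
    def __init__(self, in_dim: int = 3, hidden: int = 64, out_dim: int = 1):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_points, in_dim); pool over the set dimension
        return self.rho(self.phi(x).sum(dim=1))

model = DeepSets()
x = torch.randn(2, 5, 3)
perm = torch.randperm(5)
# Output is unchanged (up to float round-off) under permutation of the set elements
assert torch.allclose(model(x), model(x[:, perm, :]), atol=1e-5)
```

Spelling out how the learned-polarization construction compares to this kind of already-practical architecture would directly answer the comparison question.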
Invariant and equivariant architectures via learned polarization Soundness: 2: fair Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper discusses how to construct (separating sets of) invariant/equivariant polynomials for high-dimensional group representations, starting from low-dimensional ones. This is based on the polarization method from invariant theory. - The paper is well written: it is concise, clear, and the mathematical ideas are presented in a friendly and pedagogical way, despite being fully rigorous. - The idea of relying on polarization to construct invariant/equivariant models is original, and can be potentially built upon in the context of geometric deep learning. The main limitation of the paper is that, for the most part, it is not targeted towards machine learning / neural networks, which, in my opinion, is fundamental for a machine learning venue such as ICLR. While generating and separating sets are briefly motivated from the perspective of constructing neural networks (lines 106-114), this relation is never elaborated further. Instead, all the results, hypotheses, and constructions are written in a purely algebraic formalism, without any connection to neural networks. I do not see how the constructions of separating sets can be applied to obtain actual neural architectures. Some specific issues are: - The constructions discussed in the paper are recursive, i.e., they require a starting separating set. It is unclear when and to what extent a starting separating set is known for representations used in practical applications. - The paper focuses on invariant polynomials, as traditional in algebraic invariant theory, and the constructions crucially rely on their algebraic nature. However, neural networks typically do not define polynomials, so it is unclear how to apply these constructions to them. - The constructions relying on standard and simple polarization require assumptions on the dimensions of the representations, and on their multiplicities (see line 169, 201-202, Proposition 1 and 2). Again, it is unclear whether these assumptions are satisfied for commonly-deployed representations. Note that this is not an issue for cheap polarization, since it does not require this type of assumption. Moreover, the paper does not provide a single example, even outside of the machine learning world. Instead, I believe it would greatly benefit from examples, especially related to neural networks, and from expanding on the connection to deep learning. As a side note, the paper is less than 7 pages long, implying that the lack of examples and of connection to deep learning is not related to space constraints. In conclusion, I unfortunately believe that the paper, despite proposing an original and appealing idea, does not deserve acceptance. I am, of course, open to discussion. Minor: - Formatting issue: I believe that some (sub-) sections are too short (e.g., 3.1.1 and 3.1.2), which is inappropriate formatting. - Definition 1 looks redundant to me, since it reintroduces the notion of separating set, which already appears as early as in Section 2.1.5.
I understand that the notion in Section 2.1.5 is defined only for finite sets, while Definition 1 is for infinite/continuous families (indexed by $\mathbf{\lambda} \in \mathbb{R}^p$). However, this is a mere convention: one can directly define a general notion of separation for arbitrary families of functions. Why not introduce the notion of separating set for arbitrary families from the beginning, and, where appropriate, require them to be finite? - Typo on line 319: the sentence ends in a comma instead of a period. I would be grateful if the authors could elaborate on the main weakness raised above, i.e., the relation to actual neural network architectures. For example, could you explain how these polynomial constructions apply to neural networks? Could you provide examples of the resulting separating sets, and of representations appearing in practice that satisfy the assumptions? Fully human-written
Invariant and equivariant architectures via learned polarization Soundness: 4: excellent Presentation: 4: excellent Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents a principled method for constructing computationally efficient, separating, polynomial equivariant features. They show that this enables the efficient construction of models universal in the class of continuous equivariant functions. The approach repeatedly applies polarization (and variants) on the isotypic decomposition of the input representation to build such features. The authors show that polarization and simple polarization are asymptotically equivalent in computational cost, and often intractable, whereas cheap polarization, though limited to finite groups, is typically more efficient. - A clear and accurate exposition of the introductory material on classical invariant theory, polarization, and separating sets. - The problem and research direction are relevant to the equivariant ML community, with both practical and theoretical potential. - **Limited novelty and contribution.** The main propositions and theorems are direct applications of standard results. In particular, Theorem 1 is an immediate corollary of Theorem 1.7 in [1] (mis-cited in the manuscript as Theorem 2.7), while Propositions 1–3 follow directly from interpolation via the Vandermonde matrix, a standard technique in classical invariant theory (see, e.g., [2]). - **Literature positioning is incomplete.** The paper should more thoroughly address related work in equivariant ML based on classical invariant theory and clearly position itself relative to fundamental work in the field such as [3, 4]. To better address the graph-learning community, the authors might also cite [5], which studies equivariant polynomials as equivariant features in graph learning. - **No empirical validation.** While empirical validation is not strictly required for this type of theoretical work, providing it where feasible would strengthen the contribution (see also Q1). - **No illustrative examples.** Including simple, worked examples in concrete settings (e.g., the symmetric or cyclic groups) would substantially improve readability. ##### References: [1] N. Dym and S. J. Gortler, *Low-dimensional invariant embeddings for universal geometric learning*, 2025 \ [2] J. Draisma et al., *Polarization of Separating Invariants*, 2008 \ [3] B. Blum-Smith and S. Villar, *Machine learning and invariant theory*, 2023 \ [4] B. Blum-Smith et al., *A Galois theorem for machine learning: Functions on symmetric matrices and point clouds via lightweight invariant features*, 2025 \ [5] O. Puny et al., *Equivariant Polynomials for Graph Neural Networks*, 2023 1. Is it feasible to compute separating sets with software tools such as Magma, SageMath, or Macaulay2? More broadly, how should the proposed pipelines be implemented in practical ML workflows? 2. For groups where cheap polarization is unavailable, what concrete advantages can polarization offer over existing invariant bases? Lightly AI-edited
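To illustrate the kind of worked example requested in the last weakness: for the permutation action of $S_n$ on $\mathbb{R}^n$, the power sums $p_k(x)=\sum_i x_i^k$, $k=1,\dots,n$, form a separating (in fact generating) set of invariants. A minimal numerical check (the tolerance and random seed are arbitrary):

```python
import numpy as np

def power_sums(x: np.ndarray) -> np.ndarray:
    """Separating invariants for the permutation action of S_n on R^n."""
    n = len(x)
    return np.array([np.sum(x ** k) for k in range(1, n + 1)])

rng = np.random.default_rng(0)
x = rng.normal(size=5)
y = x[rng.permutation(5)]          # same S_5-orbit as x
z = x + 0.1 * rng.normal(size=5)   # generically a different orbit

# Equal invariants <=> same orbit (up to numerical tolerance)
print(np.allclose(power_sums(x), power_sums(y)))  # True
print(np.allclose(power_sums(x), power_sums(z)))  # False (generically)
```

An example of this size, worked through the paper's polarization machinery, would go a long way toward making the constructions readable.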
Invariant and equivariant architectures via learned polarization Soundness: 3: good Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents tools based on polarization from invariant theory and suggests that they can be used to build equivariant neural networks. Three variants of polarization are analyzed and a scheme for learned polarization is suggested. - The paper presents interesting potential for bridging concepts from classical invariant theory in machine learning in a novel way - I have to say that despite being quite familiar with equivariant machine learning, most of the concepts developed in this paper were quite obscure to me. I think the writing is heavy and assumes knowledge of algebraic invariant theory, which is uncommon - I think a very small audience within ICLR will be able to understand the paper and take interest in it. I suggest instead submitting to an applied math venue, where the work will benefit from a more appropriate audience and better feedback - The lack of a proposal for a practical implementation or a suggestion of a specific application is also something that makes ICLR a suboptimal venue for this work - Fully human-written
QUASAR: Quantum Assembly Code Generation Using Tool-Augmented LLMs via Agentic RL Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes QUASAR, an agentic RL framework to fine-tune LLMs to generate OpenQASM 3.0 programs for quantum optimization tasks. The method augments a 4B SFT model with a tool-use loop that calls an external quantum simulator and optimizes the policy with GRPO using a four-level hierarchical reward: (i) syntactic validity; (ii) distributional alignment via Jensen–Shannon distance with a qubit-mismatch penalty; (iii) expectation-value proximity to the ground truth problem Hamiltonian; and (iv) optimization-progress that rewards fewer classical optimization steps and better final value. The pipeline improves both syntax and semantics over SFT and strong prompting baselines on a dataset of graph-based optimization instances. Ablations suggest the distributional alignment term is the dominant driver, with expectation-value and optimization-progress giving complementary benefits. - The reward shaping is well-designed. Clear decomposition into a hierarchy of four levels: syntax, distributional alignment, expectation value, and optimization progress. The qubit-mismatch penalty addresses a common failure mode of wrong wire counts. Ablations reinforce the effectiveness of the RE term. - Realistic training stack and reproducible high-level settings. - **Overclaimed scope**. The title claims to be "quantum assembly code generation", and the introduction targets general quantum circuits. However, all tasks, rewards, and metrics presuppose Hamiltonians + parameterized ansatzes. Nothing addresses general OpenQASM programs (e.g., QFT/PE, arithmetic, mid-circuit measurements, control flow, etc.). As far as I can see, the rewards and metrics cannot be adapted directly to universal quantum circuits, which limits the applicability of this framework. - **Limited conceptual novelty**. The framework largely repackages a common agentic RL template. The quantum parts are well-crafted instantiations rather than new principles. - **Gains over SFT are modest**. QUASAR's main gains are semantic but the margins over SFT are somewhat incremental. An analysis of marginal improvement vs. extra compute would strengthen the case. - **Threshold choices & metric redundancy (minor).** The SREV tolerance $|E(C) - E^\star| \le 0.2$ is not justified. Sensitivity to this threshold should be reported. HQCR is defined as RE within 0.1, which in my opinion is redundant as a metric given RE. - **Lack of open-sourced code**. The abstract claims to provide training code on GitHub, but the linked repository is empty. Also, a non-anonymized link violates the double-blind review policy, which in principle warrants a desk rejection. See Weaknesses. Lightly AI-edited
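As a concrete reading of reward level (ii) above, a minimal sketch of a JS-based distributional reward with a qubit-mismatch penalty, assuming measurement outcomes are available as bitstring count dictionaries; the penalty value and the exact normalization are guesses, not the paper's actual choices.

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def distribution_reward(counts_gen: dict, counts_ref: dict,
                        mismatch_penalty: float = 1.0) -> float:
    """Reward from alignment of measurement distributions.

    If the generated circuit uses a different number of qubits than the
    reference, bitstring lengths differ and a fixed penalty is applied
    (the penalty scheme here is an assumption, not QUASAR's exact rule).
    """
    n_gen = len(next(iter(counts_gen)))
    n_ref = len(next(iter(counts_ref)))
    if n_gen != n_ref:
        return -mismatch_penalty
    keys = sorted(set(counts_gen) | set(counts_ref))
    p = np.array([counts_gen.get(k, 0) for k in keys], dtype=float)
    q = np.array([counts_ref.get(k, 0) for k in keys], dtype=float)
    return 1.0 - js_divergence(p, q)

print(distribution_reward({"00": 480, "11": 520}, {"00": 500, "11": 500}))
```

Note that a reward of this form measures closeness to the dataset circuits rather than to the problem Hamiltonian, which is exactly the tension the next review raises.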
QUASAR: Quantum Assembly Code Generation Using Tool-Augmented LLMs via Agentic RL Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes QUASAR, an agentic reinforcement learning framework for post-training large language models to generate parameterized quantum circuits in OpenQASM 3.0. It also introduces a four-level hierarchical reward mechanism to enhance the effectiveness of the RL training process. - Improves LLMs' proficiency in PQC generation. - Well-structured and easy to follow - A clear summary of quantum optimization problems. - The motivation is not accurate - Gap between reward mechanism and evaluation metrics - What is QUASAR's motivation? QUASAR looks more like an approach to help the LLM generate better PQC structures and initial parameters, rather than to generate general quantum circuits. The distinction between these goals should be made clearer. - How is the expectation-value reward calculated? In Section 4.2 the expectation-value reward is computed from the distance between the eigenvalues of the generated circuit and the ground-truth circuit. However, in Section 4.2.3 it is computed from the problem-specific cost Hamiltonian. - Why use JS divergence as a reward? The JS divergence reward encourages the model to generate circuits whose unitaries closely match those of the ground-truth circuits. In contrast, the expectation-value reward and optimization-step reward aim to produce circuits that better approximate the target Hamiltonian. However, there remains an inherent gap between the dataset circuits and the ideal Hamiltonian solution. Therefore, the three rewards are not fully consistent in optimization direction. As the number of qubits increases, the JS divergence metric loses accuracy in capturing the difference between the two distributions, making it a less effective reward for high-dimensional quantum systems. - Why is SREV used as the evaluation metric instead of the expectation value directly? Since the expectation value measures how closely the parameterized quantum circuit (PQC) approximates the target Hamiltonian, it is unclear why SREV was chosen as the primary indicator. According to the paper, SREV appears to better capture the approximation degree between the generated PQC and the target Hamiltonian. However, the experimental results show that increasing the expectation-value reward actually decreases SREV performance, which is counterintuitive. A detailed explanation of this discrepancy and the rationale behind selecting SREV over direct expectation-value measures would greatly improve the clarity of the evaluation section. - How are the training and test datasets partitioned? - How are the prompts constructed? It would be better to give an example. Fully human-written
QUASAR: Quantum Assembly Code Generation Using Tool-Augmented LLMs via Agentic RL Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces QUASAR, an agentic reinforcement learning (RL) framework that augments large language models (LLMs) with external quantum simulators for quantum assembly code generation. The goal is to improve the syntactic and semantic correctness of OpenQASM 3.0 quantum circuits. QUASAR integrates supervised fine-tuning with a four-level hierarchical reward mechanism that incorporates syntactic validity, distributional alignment (via Jensen–Shannon distance), expectation-value alignment, and optimization progress. The approach is evaluated by augmenting a 4B-parameter LLM and compared with GPT-4o, GPT-5, DeepSeek-V3, and RL-only or SFT-only baselines. Results show high syntactic validity (99.31% Pass@1) and improved circuit quality on several quantum optimization benchmarks. 1. The topic is timely and relevant, addressing quantum circuit generation using tool-augmented LLMs. 2. The hierarchical reward mechanism is conceptually well motivated. 3. The evaluation covers both syntactic and semantic metrics with clear quantitative reporting. 4. The experimental setup includes multiple baselines and ablation studies. 1. In Section 4.2 and Figure 3, the hierarchical reward mechanism, while interesting, lacks theoretical grounding or ablation analysis showing why the chosen four components (syntax, entropy, expectation value, optimization) are optimal. Other plausible metrics could exist, but justification is not provided. 2. The comparison with GPT-4o and GPT-5 in Table 1 does not constitute a fair baseline against a fine-tuned model. The paper should clarify hyperparameter settings, prompt design, and reproducibility details for these baselines. 3. Figure 5 reports ΔE distributions but omits units and normalization conventions. Without specifying whether values correspond to expectation differences or normalized eigenvalue gaps, it is difficult to interpret it. 4. The Agentic RL in Section 3.3 largely reiterates standard GRPO methods (Shao et al., 2024) without adaptation to quantum contexts. The contribution seems incremental, since it applies an existing RL algorithm to a new domain with minimal innovation. 5. The quantum verification pipeline in Section 4.1 is described only superficially. Implementation details of the “Quantum Tool Server” and simulation fidelity are missing. It is unclear whether noise, decoherence, or realistic hardware constraints were modeled. 6. The reward normalization in Eq. (3) and Eq. (4) assumes bounded eigenvalues, but many Hamiltonians used in QAOA/VQE have variable scaling. This could bias the reward and affect convergence; no normalization consistency checks are discussed. 7. Section 2 (Related Work) misses recent key works on quantum circuit compilation via differentiable programming and symbolic optimization. The related work is dominated by LLM-based citations and omits competing non-LLM approaches. 8. Figure 2 and accompanying description do not specify the optimization problem (Hamiltonian) associated with the illustrated ansatz. It would be recommended to provide context, to improve clarity. 9. 
The evaluation metrics in Section 5.1 rely on Pass@K-style measures, which are adapted from code generation. These metrics may not align with physical correctness or execution fidelity on real quantum backends. Including hardware-executed validation would strengthen the paper. 10. The presentation has recurring typographical and formatting errors (e.g., "desigin" in the introduction, inconsistent use of subscripts in formulas), which reduce readability. Figures also have low resolution. 11. The hierarchical reward ablation in Table 2 suggests only marginal gains from additional reward components, implying that most improvements could stem from data scale or SFT pretraining rather than from RL itself. 12. While the authors promise to release the code after acceptance, it would be preferable to make it available during review as supplementary material. 13. The paper's novelty lies mostly in combining existing techniques (OpenQASM simulation, GRPO RL, and LLM fine-tuning) rather than introducing a fundamentally new algorithmic insight. 1. How sensitive is the performance to the relative weighting of the four reward components? 2. What mechanisms prevent reward hacking when circuits add extraneous qubits? 3. How does QUASAR scale beyond 9-qubit or 12-qubit benchmarks? 4. Can the hierarchical reward framework be generalized to other DSLs beyond OpenQASM? Fully AI-generated
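For reference on point 9, the standard unbiased pass@k estimator from the code-generation literature (Chen et al., 2021) is sketched below; whether QUASAR uses exactly this estimator is not clear from the review and is an assumption here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))  # 0.25
```

The estimator only measures whatever the per-sample correctness check measures, so unless the check itself includes hardware execution, the metric cannot speak to physical fidelity.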
QUASAR: Quantum Assembly Code Generation Using Tool-Augmented LLMs via Agentic RL Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors attempt to build quantum circuits efficiently with a specially trained LLM. On the measures defined by the authors, it outperforms unoptimized LLMs such as GPT-5. The application area of the paper is certainly novel. It indicates that superior performance can be obtained from an LLM, even on tasks it is not well suited to, if it is refined further. The central idea, while imaginative, is not sufficiently grounded in the technical realities of quantum computing. Key challenges, such as the exponential scaling of gate requirements with qubit count, are not adequately addressed. Moreover, the paper provides limited information about the scale of the experimental setups, leaving it unclear how complex or realistic the tested instances are. - Could the authors clarify the size of the experiments, in terms of qubit count and circuit depth? - Do you observe any changes in code generation performance as problem size increases? - Are there notable differences in quality when generating dense versus sparse circuits? Moderately AI-edited
How Does Local Landscape Geometry Evolve in Language Model Pre-Training? Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper investigates how the local loss landscape geometry evolves during the pre-training of large language models. The authors identify two distinct phases in training dynamics. Early phase: At the beginning of training, the loss landscape exhibits high sharpness. During this stage, the learning rate governs the training stability. Therefore, a **learning rate warmup** is essential to move from sharp to flatter regions safely. Late phase: Once the model enters a more stable training regime, the batch size and the resulting gradient noise scale become the dominant factors shaping the local landscape. This implies the importance of a **dynamic batch-size schedule**. **Systematic analysis**: The paper provides a systematic framework to analyze how the local loss landscape evolves during pre-training. **Empirical insights**: It empirically identifies a two-phase transition in sharpness and links this dynamic to training stability and learning rate schedules. **Practical implications**: The theoretical insights are translated into practical training strategies (learning rate warmup and dynamic batch-size schedule). **Limited architecture and optimizer scope**: As noted in the paper’s limitations, the analysis is restricted to the LLaMA-2 architecture and the AdamW optimizer. It remains unclear whether the findings would generalize to other architectures or optimization algorithms. **Lack of analysis on interaction effects**: The paper analyzes batch size ramping and learning rate decay in isolation. The interaction effects between BS ramping and LR decay therefore remain unclear. **Theoretical simplifications**: Assumptions such as noise covariance being proportional to M, time-invariant M, equilibrium, and local quadratic approximation are introduced for theoretical convenience, but may not hold in actual training dynamics. **Overstated novelty discussion**: The authors point out that prior works are “largely restricted to small-scale networks” and emphasize that their study provides the “first systematic study of how local landscape geometry evolves in large-scale language model pre-training.” To make this claim more meaningful, it would be helpful to more clearly articulate the qualitative differences between small-scale networks and large-scale language models. Moreover, discussing how the observed dynamics might change when scaling beyond the 93M and 170M parameter models used in this work would further strengthen the argument. 
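For concreteness, the "gradient noise scale" invoked in the summary is presumably the simple noise scale of McCandlish et al. (2018),
$$\mathcal{B}_{\text{simple}}=\frac{\operatorname{tr}\big(\Sigma(\theta)\big)}{\lVert\nabla L(\theta)\rVert^{2}},$$
where $\Sigma(\theta)$ is the per-example gradient covariance; once gradients shrink late in training this quantity grows, which is the usual argument for ramping the batch size. Whether the paper uses exactly this definition is an assumption here, and stating the precise quantity would help readers connect the batch-size schedule to prior work.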
Minor Errors Line 129: The Hadamard product are -> The Hadamard product is Line 132: near an local -> near a local Line 134: at i-the example -> at i-th example Line 146: gradient covariance at $\theta$ -> gradient covariance at $\theta^\star$ Line 157: Our experiments varies -> Our experiments vary Line 67: landscapes gradually widens/flattens -> landscapes gradually widen/flatten Line 266: in the most-case directions -> in the most directions Line 279: This rational further suggests -> This rationale further suggests Line 322: two key question remains -> two key questions remain Line 172: Loss spikes and plateau -> Loss spikes and plateaus Line 238: only a sufficient small -> only a sufficiently small Heavily AI-edited
How Does Local Landscape Geometry Evolve in Language Model Pre-Training? Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper adopts a two-phase analysis of language model pre-training to address two crucial decisions: the learning rate warmup phenomenon and batch-size scheduling. The authors note the common hyperparameter settings used in the literature, wherein a large number of warmup steps is used to reach the maximal stable learning rate (LR) applied, followed by a stable or annealed LR. Also, the use of arbitrary batch size schedules is reported in the literature. The locally quadratic loss landscape analysis in the paper looks at two distinct training stages relying on SGD: (i) *early phase*: the loss landscape is sharper and the imperative is `to flatten`, hence LR warmup, so that optimization begins from a less sharp region, or attractor basin; (ii) *late phase*: the loss landscape tends to a flatter region owing to SGD convergence, and the imperative is `to deepen` the basin reached for an improved loss, via either LR decay or a batch-size ramp-up. * Strong motivation and setup, to address two important empirical choices in language model pretraining, which are often the first hyperparameters adjusted for any new task or hardware. * Fairly clear math and assumptions for the analytical explanation of the pre-training dynamics using the quadratic approximation of the loss landscape around the theoretical minima; empirical analysis supports the claims. * Clear conditions given on the practical rule-of-thumb for setting the LR warmup and batch-size ramping, for a warmup-stable learning rate schedule. * The terminology and goal of the analysis can be made cleaner; that is, the terms sharpness, flat minima, wide basin, and deep basin could be clarified independently early on, before the lemmas and theorems. * It is hard to understand if the direction of analysis emerges as a result of wanting to go from sharp to flat early on (and wide to deep in the later phase), or vice versa, especially since the locally quadratic assumption does not necessarily hold at initialization. * Given that LR decay is an important part of managing LR schedules, and thus of scaling laws fit on such data, the interaction of LR decay and increasing batch size feels underexplored. * Given the relatively limited empirical experimental scope, a broader *Limitations* section is warranted, especially for unknowns such as the role of the LR annealing choice and how the timing and number of batch ramp-ups matter. Below is an enumeration of various questions and suggestions. Please note that the rating will go up contingent on the points below, with more weight on the following points: 1, 2, 4, 5, 9. 1\. L76-81: Could the authors please explain (or elaborate here) why the deepening of the loss basin at *late-stage* training is crucial and different from the flat-minima conversation around SGD convergence? 2\. L106-114: Could the authors include [1] here and also contrast their early-late training phase interpretation with the loss landscape perspective given here? 3\.
Equation 3: What is the role or effect of the $B$ in the denominator given the $1/n$ already included in $\Sigma ({\theta}^{*})$ (L145-146)? 4\. Figure 2: It appears that only the larger LRs (in the grid shown here) lead to spikes. Could the authors intuit why and how the finding here regarding the edge-of-stable LRs explains this phenomenon? * How does zero warmup (not in Figure 2) actually affect the findings here, and what is its actual effect on *moving away from sharp loss landscapes*? Given that we expect noisy updates in the first few steps in most cases, do no warmup and a small enough batch size have the same effect? 5\. L178: Could the authors make an overall comment on how the absence of LR decay influences the assumptions and findings, and thereby the practical recommendations, given that recent literature suggests LR decay is crucial before a model's loss is used for scaling-law derivations, downstream performance, finetuning, etc.? 6\. Figure 3: Could the authors please provide some guidance on how to read Figure 3? What exactly is being plotted with the orange line, and what are $u_1$ and $u_w$? 7\.1. Figure 4 (top): For which layer matrix are the eigenvalues being reported? 7\.2. Figure 4 (bottom): Does this perform a perturbation on all the weights? 8\. Lemma 4.2: Could the authors explain (or intuit) the definition of $z$ and therefore $z_l$? 9\. L270-291 and Figure 3: Does the figure suggest that anything *more* than the optimal LR warmup length leads to a worse loss? What, thus, is a *reasonable range* (L290) in practice? 10\. Figure 10: Could the settings here be marked differently (markers/linestyles) and not just with transparency? 11\. Personally, it took me a long time to understand why we want to move away from a wide basin, until I realized we are talking about a region around the minima (a basin) and not the flat minima we converge to. Therefore, the finding of a small batch required for a wider basin felt counterintuitive and required re-reading of Section 5. The motivation, analysis, and conclusion can be presented much more cleanly, building up to a general batch size schedule. Additional comments on this, or identifying future directions w.r.t. the role of LR schedules, could be more explicit. 12\. L266, L469: What does `most-case direction` mean? --- References: [1] Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs, Bergsma et al., 2025, arXiv:2502.15938 [cs.LG]. Fully human-written
How Does Local Landscape Geometry Evolve in Language Model Pre-Training? Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors analyze the "local landscape" of the loss as training progresses. The paper is organized around two phases: "early" (when loss is sharp) and "late" (when loss is flat). Within these two phases, they ask and answer two questions related to training dynamics. Based on these theoretical results, they offer heuristics/principles which they instantiate and show experimentally. The questions/answers are: * Why do short warmups produce instability? (Because the LR rises too quickly when the model is still too sharp) * Why do spikes and plateaus only happen at the beginning of training? [Reviewer's note: they do not only occur then?] Because things are smoother later on. * "Why is there a trade-off between widening and deepening the basin?" Because temperature affects basin selection and different temperatures prefer different basins. * How does BS impact this trade-off? BS impacts temperature. At a high level, the paper reads really nicely. It's easy to follow: the math, the figures, the simulation results, and the real results are all presented in a coherent, digestible way. It would, in many ways, make a nice tutorial/review about some of the concepts in this space, linking theory to practice. I especially like the visualizations of sharpness along 1D slices. This is a nice visual tool, though I feel like the authors don't actually make that much use of it. I think the experiments mostly support the claims made in the paper (mostly, though I have a quibble). While it is nicely written, the authors claim much too much novelty. They apply existing tools to familiar settings but without much new insight. They don't seem to arrive at any substantially different conclusions. The math is sometimes a bit different (though usually not), but when it is, it provides seemingly no insight over what's already known. Given that there is in fact quite a lot of prior art for many of these findings (which, admittedly, the authors do frequently cite), it would be good if they empirically compared their prescriptions to those other sources. ## Most results are known, both theoretically and empirically To be blunt, I feel like the majority of the insights of this paper are fairly well known, both from an empirical and a theoretical perspective. For example, I think Theorem 5.1 is a repackaging of known results (e.g., eqn 12 of Jastrzębski et al. (2018) is, I think, the two-basin version of this). To drive the point home, I asked ChatGPT: "does the batch size impact the type of basin found when using Adam?" and it produced a detailed, similar explanation with cites to existing theoretical and empirical work. (To be clear, I am not insinuating the authors had ChatGPT write this paper, just that the results are standard.) I could dig up additional academic sources, but I'm fairly confident that I could pose ChatGPT most of the questions addressed in this paper and reach largely similar results, turning up already-published papers, both empirical and theoretical. As another example, it's well known that BS and LR trade off with one another, including using theory from SDEs!
One example: https://www.cs.princeton.edu/~smalladi/blog/2024/01/22/SDEs-ScalingRules/ The batch size warmup is also known, with known theoretical grounding relating to CBS/gradient noise scale: https://arxiv.org/pdf/2505.23971 And even the "LR warmup needs to be longer for higher max LRs" is known, including a similar theoretical justification. The authors dismiss Kalra and Barkeshli as being focused on resnets, but they also do studies on GPT-2 style transformers with much longer warmups than the authors claim in their description of the work? ## The principles/recipes are no more actionable than what is known/done The "recipe" for LR warmup length literally just says... "the larger the peak LR, the longer the warmup should be". How much longer? Proportional to the LR change? Proportional to LR^2? To me, a recipe would suggest a particular heuristic or criterion for, say, warming up the batch size. The authors say "Ramp the BS once loss reduction becomes marginal," but then seemingly the authors just use a fixed step/token count for when to ramp? How is this an improvement over what we know? (cf. Merrill et al. 2025, https://arxiv.org/pdf/2505.23971, who actually provide a criterion for when to scale that can be tracked during training) To be a bit less polemical, regarding the batch size experiments: the BS vs LR doesn't really match up super closely: increasing BS and decreasing LR both make the loss go down, great. The Princeton link I pasted above would say that Adam LR should scale with sqrt(BS) rather than linearly. If you ran your experiment using sqrt(BS) scaling, would it match more closely? I guess yes. Merrill et al. explore batch size warmup from a CBS perspective, deriving a specific, trajectory-driven way of determining when to warm up batch size. How does your approach compare? Fully human-written
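To spell out the scaling rules behind the last paragraph (under the usual SDE approximations; the Adam rule is from the SDE scaling-rule analysis the Princeton link summarizes, Malladi et al. 2022):
$$\text{SGD: }\ \frac{\eta}{B}=\text{const}\ \Rightarrow\ \eta\propto B,\qquad\quad \text{Adam: }\ \frac{\eta}{\sqrt{B}}=\text{const}\ \Rightarrow\ \eta\propto\sqrt{B}.$$
So the concrete check is to rerun the batch-size sweep with $\eta\propto\sqrt{B}$ rather than $\eta\propto B$ and see whether the BS and LR curves then line up more closely.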
How Does Local Landscape Geometry Evolve in Language Model Pre-Training? Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper aims to systematically study the geometrical evolution of the local loss landscape during LLM pre-training and correlate it with hyperparameter tuning strategies. The authors divide the process into two phases. In the first phase (early), the authors report a "from sharp to flat" evolutionary trend and, based on this, provide a geometrical explanation for the necessity of LR warmup via linear stability analysis. In the second phase (late), this paper argues that the landscape geometry is dominated by gradient noise. Through the continuous-time limit of Stochastic Differential Equations (SDEs) and the principle of free energy minimization, it reveals how batch size regulates a "depth-flatness" trade-off, thereby proposing a dynamic batch size scheduling strategy. 1. This paper intuitively documents and reports the "from sharp to flat" macro-dynamic trend in LLM pre-training, which is an important empirical finding that contradicts studies on small-scale models. 2. The proposed hyperparameter tuning strategies are logical, easy to implement, and experimentally shown to significantly improve training efficiency, contributing directly to reducing the cost of large model training. 1. In Section 4, this paper attempts to explain the loss spikes and plateaus of the early training phase. The core argument is: first, it empirically observes that the initial landscape is very sharp (high Hessian eigenvalues), and then points out that a large LR on such a sharp landscape causes instability. To theoretically support this, the authors borrow linear stability analysis from standard gradient descent (Lemmas 4.1 and 4.2). However, this analysis is strictly established on a **deterministic (noise-free), quadratic model centered around a local minimizer $\theta^{*}$** (Equation 4). **The key issue here is the applicability of this "extrapolation"**: While this paper does not claim the quadratic model *fits* the initial state, it *assumes* that the stability condition derived from this highly simplified, local-convergence model (i.e., $\eta < 2/\lambda_{max}(S)$) can effectively explain the dynamics of the earliest training phase (far from any minimizer, highly non-convex, and noisy). This is a strong assumption. The true early-stage dynamics are highly complex. Attributing the instability primarily to the simple linear interaction between LR and the local Hessian's max eigenvalue is likely an oversimplification, ignoring other non-linear or stochastic noise effects. Therefore, while the conclusion "high sharpness + high LR = instability" is intuitive and matches the data, using Lemmas 4.1 and 4.2 as its primary theoretical basis acts more as an insightful **analogy** than a rigorous proof for this specific phase. The validity of this explanation depends on the extent to which this local linear approximation dominates the early global dynamics, which is not sufficiently justified in this paper. 2. The core theory in Section 5 (Prop 5.1 and Thm 5.1) relies on the SDE continuous-time limit, which assumes $\eta \to 0$. 
This contradicts modern LLM training practices (including the use of relatively large peak LRs in this paper's Section 4 experiments). Although the theory's key prediction (noise scale $\tau \propto \eta/B$) appears to match the experiments (Figure 8), this treats a heuristic approximation as a rigorous explanation. 3. The study is limited to 93M and 170M parameter models under the LLaMA-2 architecture, which differs significantly from current mainstream model sizes. Whether its conclusions hold for much larger models remains unknown. 4. This paper attributes the necessity of warmup to an external factor: the "landscape sharpness". However, a more direct and well-known explanation lies in the internal flaws of the Adam optimizer itself: its second-moment estimate ($v_t$) has high variance in the early stages, and its initial update degenerates into unstable "sign descent". The loss spikes observed in this paper are highly consistent with these known optimizer startup problems. This paper fails to clearly disentangle whether the observed instability originates from the landscape geometry or simply from the well-known startup deficiencies of the Adam optimizer. 1. Regarding the early-stage stability analysis: How do the authors justify that a noise-free, quadratic model (Eq 4), based on the neighborhood of a local minimizer, can effectively explain the instability phenomena in the earliest, far-from-equilibrium, and highly stochastic phase of training? Is there a more suitable theoretical model to describe this "chaotic initial" phase? 2. Regarding the SDE limit and steady-state assumption: Given that LLM pre-training uses finite, large learning rates and is terminated long before reaching a theoretical steady state (stationary distribution), can the authors provide additional evidence or arguments to support the approximate validity of the SDE limit and free energy minimization theory in this scenario? 3. Regarding the deeper reasons for blockwise heterogeneity: The authors attribute the ordering difference with Wang et al. to measurement methods and gradient sparsity. This raises a question: should we focus on the "intrinsic" Hessian geometry defined by the architecture, or the "effective" dynamic geometry jointly determined by data flow and the optimizer? Heavily AI-edited
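For reference, the linear stability condition discussed in weakness 1 above is a one-line consequence of the quadratic model: assuming Eq. 4 yields updates $\theta_{t+1}-\theta^{*}=(I-\eta S)(\theta_t-\theta^{*})$ with $S\succ 0$ the curvature matrix (notation taken from the review, not verified against the paper), the iterates converge iff $|1-\eta\lambda_i|<1$ for every eigenvalue $\lambda_i$ of $S$, i.e. iff $\eta<2/\lambda_{\max}(S)$. Writing it this way makes explicit that the condition is a statement about deterministic dynamics near a minimizer, which is exactly the extrapolation the first weakness questions.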
DRIP: Decompositional reasoning for Robust and Iterative Planning with LLM Agent Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes DRIP, a planner that alternates backward goal decomposition with forward execution. On BlocksWorld, DRIP outperforms CoT and ReAct in terms of success. In Minecraft, DRIP achieves the highest success rate, trading step efficiency for robustness via finer-grained subtasks. Overall, DRIP is a lightweight, LLM-agnostic approach that scales from classical to open-world tasks. 1) The paper solves an interesting problem. The method cleanly separates planning from execution and is explained with good figures. 2) The paper compares against LLM baselines (CoT, ReAct) across two domains. 3) Results show consistent gains on tasks in both a classical benchmark, BlocksWorld, and Minecraft. 1) BlocksWorld is altered (3 ops, multi-hold), which weakens comparability to prior work. I suggest that the authors - Add a parallel track with the traditional 4-operator, single-gripper domain and - Include classical planning baselines, such as Fast Downward. Report success, plan-length gap to optimal/classical planning baselines, and expansions/time. 2) The study did not use GPT-4 (only Claude 3.5) in Minecraft, so cross-model conclusions are thin. Please include Minecraft with GPT-4o. 3) I also did not understand the "Manual" condition: specify the number of participants, the decision rules, the inter-rater checks, and whether participants could correct invalid steps. Please specify what the human condition actually was and the protocol. 4) The paper should also include an English variant; while not central, this may help eliminate any language effects on the accuracy (which looks dismal) - There are many minor grammar/punctuation issues; the paper requires a careful read. 1) Why were classical planners not tested with conventional/original problem specifications? I strongly encourage adding that baseline (plus see my comments in the Weaknesses section). 2) What was the human protocol for Manual conditions? 3) Are the conditions for using Fisher's test met in your study design? 4) Will you add GPT-4o for the Minecraft study? 5) Was there a specific reason for not comparing Japanese and English specifications? 6) Could you please provide citations to support this statement: "In contrast, LLMs offer a unique advantage in their ability to dynamically generate and adapt rules based on their extensive pre-trained knowledge."? Fully human-written
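On question 3, a minimal illustration of what such a significance check involves; the counts below are hypothetical, not taken from the paper.

```python
from math import comb

def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact p-value for a 2x2 table [[a, b], [c, d]],
    testing whether the first row's success rate exceeds the second's."""
    n, col1 = a + b + c + d, a + c
    p = 0.0
    for k in range(a, min(a + b, col1) + 1):
        p += comb(a + b, k) * comb(c + d, col1 - k) / comb(n, col1)
    return p

# Hypothetical success/failure counts for two planners on the same instance set
print(fisher_one_sided(45, 65, 10, 100))
```

Note that Fisher's exact test assumes independent groups; if both planners are run on the same problem instances, the outcomes are paired and McNemar's test is arguably the more appropriate choice, which may be what the question about Fisher's conditions is getting at.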
DRIP: Decompositional reasoning for Robust and Iterative Planning with LLM Agent Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes DRIP, a backward-reasoning, decomposition-first planning framework for LLM agents. Given a goal, an LLM recursively decomposes it into prerequisite subtasks; an executability module filters which subtasks can run under the current state; successful child nodes propagate executability upward (checkParentExec), yielding a plan. Experiments on BlockWorld (hard split, 6–15 blocks, modified rules to allow lifting stacks) and Minecraft (“mine diamond” from scratch) show higher robustness than forward approaches (CoT, ReAct). With Claude 3.7 Sonnet, DRIP hits 40.9% vs CoT 23.6% and ReAct 9.1% on BlockWorld; a manual-execution variant reaches 82.7%, indicating the gains are from planning rather than actuation. In Minecraft, DRIP succeeds on diamond 4/5 trials (ReAct 1/5, CoT 0/5). The paper is clear about limitations, LLM decomposition errors, reliance on natural-language state, more LLM calls than CoT, and occasional inefficiency in open-world tasks. 1. Clear decomposition loop with explicit executability check and upward propagation; easy to implement. 2. Robustness gains on BlockWorld (large effect vs ReAct; solid vs CoT on Claude) and open-world Minecraft where many forward plans stall. Manual actuator study isolates planning quality from execution bugs, showing good methodology. 3. DRIP uses ~4–5 fewer steps than baselines in successful cases. 4. Honest limitations and discussion (need for formal state, LLM call budget, trade-off between step count and success). 1. Non-standard BlockWorld setup (multi-block lifting & holding) inflates branching and may favor the proposed decomposition; please also report standard constraints for comparability. 2. Small-N in Minecraft (n=5 per resource) and single seed/model for many parts; results could be noisy. 3. No comparison to planner-assisted LLMs (e.g., LLM+P/Task-graphs) or hybrid symbolic planners with LLM heuristics. 4. Executability via natural language is brittle; the paper shows this, but there’s no quantitative analysis of that component (accuracy/confusion). 5. Efficiency trade-off in Minecraft (more subtasks than ReAct in its single success) is under-analyzed. What’s the token/call budget? 6. Novelty relative to recent backward-planning with LLMs (e.g., explicit backward search/goal regression) needs sharper positioning. 1. Report DRIP/CoT/ReAct under the standard “one block in hand, must clear top” constraints. How do the conclusions change? 2. Provide a labeled set of state–action pairs to measure precision/recall of “Executable/Unexecutable/Unnecessary,” and error breakdowns that lead to plan failure. 3. Report tokens and LLM calls per solved instance; DRIP vs ReAct vs CoT, and for Minecraft, include code-generation retries. 4. (a) depth cap / tree-policy; (b) re-decomposition strategy; (c) swapping backward step with least-to-most prompting. 5. Add a hybrid symbolic baseline (e.g., PDDL planner with LLM goal translation) or LLM+P. 6. How sensitive are results to language (the BlockWorld prompts were in Japanese)? Any cross-language trials? 7. 
Can DRIP reconcile goal maintenance vs temporary goal violations (e.g., allowing undo/redo with bookkeeping)? 8. Will you release code, prompts, and Minecraft environment scaffolding to ensure reproducibility? Fully AI-generated
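To pin down the mechanism summarized above, a minimal sketch of the decompose/check/propagate loop; `llm_decompose` and `is_executable` are hypothetical stand-ins, and this is an illustration of the loop as described in the summary, not DRIP's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    task: str
    children: list = field(default_factory=list)
    executable: bool = False

def plan(goal: str, state: dict, llm_decompose, is_executable, max_depth: int = 5) -> Node:
    """Backward decomposition with upward executability propagation (illustrative only).

    llm_decompose(task) -> list[str]: prerequisite subtasks proposed by the LLM.
    is_executable(task, state) -> bool: whether the task can run in the current state.
    """
    root = Node(goal)
    frontier = [(root, 0)]
    while frontier:
        node, depth = frontier.pop()
        if is_executable(node.task, state):
            node.executable = True
            continue
        if depth >= max_depth:
            continue
        for sub in llm_decompose(node.task):
            child = Node(sub)
            node.children.append(child)
            frontier.append((child, depth + 1))

    # A parent becomes executable once all of its prerequisite children are
    # (analogous to the checkParentExec step named in the summary).
    def check_parent_exec(node: Node) -> bool:
        if node.executable:
            return True
        results = [check_parent_exec(c) for c in node.children]
        node.executable = bool(results) and all(results)
        return node.executable

    check_parent_exec(root)
    return root
```

Presumably the real system interleaves this with execution and re-decomposition on failure; the sketch only captures the tree building and upward propagation, which is where the depth cap and re-decomposition questions (4a, 4b) bite.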
DRIP: Decompositional reasoning for Robust and Iterative Planning with LLM Agent Soundness: 3: good Presentation: 4: excellent Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces DRIP, a planning framework for LLM agents based on backward reasoning and task decomposition, aimed at enhancing robustness in long-horizon planning tasks. Its core contribution lies in formalizing a human-like problem decomposition mechanism for LLM planning, realized through the construction of a dynamic, goal-driven plan via an executability reasoning tree. Experimental results in both BlockWorld and Minecraft environments demonstrate its superior robustness compared to forward-reasoning baselines. 1. **Originality:** The paper presents a systematic implementation of backward reasoning for LLM planning, offering a clear and contrasting alternative to the predominant paradigm of forward reasoning. 2. **Quality:** The proposed method is well-designed and rigorously described. The experimental design is comprehensive, effectively validating the framework across both structured (BlockWorld) and open-world (Minecraft) tasks. 3. **Clarity:** The paper is clearly structured. The inclusion of overview diagrams, detailed algorithm pseudocode, and a comprehensive symbol table greatly aids in understanding the proposed framework. 1. **Insufficient Experimental Comparison:** The empirical evaluation lacks direct comparisons with other recent backward reasoning methods. This omission makes it difficult to precisely assess the unique advantages and distinctive contributions of DRIP within the landscape of backward reasoning approaches. 2. **Limited Generalizability Validation:** The framework's performance is validated only in the BlockWorld and Minecraft domains. Broader assessment on more diverse and realistic task benchmarks—such as robotic manipulation or everyday planning tasks—is needed to fully establish its general applicability. 3. **Incremental Nature of Contribution:** The core idea of backward reasoning is well-established in classical AI planning. While the work is solid, the primary novelty lies in its effective adaptation and demonstration using LLMs, rather than in introducing a fundamentally new reasoning paradigm. * Q1: Have the authors considered a hybrid approach that strategically combines DRIP's backward reasoning with elements of forward reasoning? This could potentially strike a more optimal balance between robustness and planning efficiency, mitigating the observed increase in subtask steps in Minecraft. * Q2: How does DRIP handle scenarios with ambiguous goal states or multiple concurrent goals? Could the authors discuss the framework's stability in such settings and any potential strategies to address these challenges? * Q3:Are there plans to evaluate DRIP in more complex simulated environments or on real-world physical tasks? This would significantly strengthen the claims regarding its practicality and robustness for real-world applications. Fully AI-generated
DRIP: Decompositional reasoning for Robust and Iterative Planning with LLM Agent Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The authors consider planning an important problem, but current premier LLMs fall short of generating robust plans. The paper devises a planning process grounded in cognitive psychology, stating that the proposed DRIP framework leverages human-inspired decomposition to enhance LLMs' planning capabilities. They argue that the novelty and effectiveness of DRIP lie in performing both forward, top-down reasoning and backward reasoning. The introduction is light on technical specifics and on clear statements of novelty; the authors should at least explain why backward reasoning is helpful. The tooth-brushing example is weak and offers little insight. This paper points out that planning is an important problem, and LLMs could be helpful. It casually reasons that backward reasoning can be helpful to reduce computational cost. Limited studies on "simple" planning problems seem to yield some improvement. But the paper should step up to deal with serious planning problems such as supply-chain management, etc. There are several shortcomings in this paper. 1. Related work coverage. It is not yet comprehensive for a planning paper. The section emphasizes decomposition and regression planning but misses four pillars that serious planners consider essential: uncertainty and belief tracking, plan repair and rollback, tool-grounded interaction, and memory or context management. It also needs a brief evaluation critique. 2. Native LLM limitations unaddressed. Context loss on long horizons is a known problem. Self-validation is inherently limited in light of Gödel's incompleteness results. The paper does not discuss these issues. 3. Cost of backward search. Backward search can explode when many goal configurations are admissible. What constraints are used (landmarks, goal ordering, HTN templates, causal graphs) to prevent exponential cost? The empirical study should examine efficiency and effectiveness trade-offs. 4. Persistent memory. LeCun has noted that LLMs lack persistent memory and therefore struggle with long-horizon planning. This fundamental issue should be addressed. 5. Evaluation realism. The empirical study uses rudimentary problems and does not stress-test the proposed schemes. The authors are encouraged to consider planning work from the database and systems community since the 1980s, including the recent SagaLLM work (VLDB 2025). 1. Positioning and related work. Can you provide a comparative assessment that covers: * native LLM limits such as context loss and attention narrowing, * structured speculative methods such as Tree of Thought and successors, * persistent memory and transactional stability such as SagaLLM (VLDB 2025). Explain why each is relevant or not to your planning setup and how your method addresses the gaps. 2. Do search complexity and pruning follow rigorous theories? Both forward and backward reasoning can exhibit exponential branching. How do you constrain backward alternatives in practice? Specify the constraints and heuristics you use, for example landmarks, goal ordering, HTN templates, causal graphs, bidirectional search, or admissible heuristics, and report their effect on complexity and token cost. 3. Grounding, commonsense, and uncertainty. 
You did mention in "limitations" that commonsense could be an issue. How does DRIP handle commonsense and locale-dependent logistics that break pure context reasoning, for example landing time versus airport exit time, baggage claim, customs, or rental queues? Describe your belief tracking under partial observability, your information-gathering actions, and any tool-based validation or buffer policies, and evaluate their impact on plan validity. Fully human-written
Navigating the Latent Space Dynamics of Neural Models Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents an alternative representation of autoencoder (AE) models as dynamical systems acting on the latent manifold. The authors introduce the theoretical framework required for this interpretation and establish results linking the dynamical system induced by the iterative application of an AE to the gradients of the underlying data distribution. They also theoretically characterize the attractors of the latent vector field and link them to memorization and generalization. This framework is then used to analyze how regularization affects the phenomena of memorization and generalization in AEs, showing that as memorization decreases and generalization increases, the attractors of the AEs evolve from latent points corresponding to training examples toward more general attractors. The paper further shows that a transition from memorization to generalization occurs during training, with an increasing number of attractors being learned and the similarity of attractors found using different data converging. Finally, the authors extract AEs from common pre-trained models and show that noisy inputs can be used to find the attractors of the induced dynamical system, and that these attractors can serve as a dictionary helping reconstruct data points from diverse distributions (as compared with a random orthogonal basis). The trajectories of examples can also be used for OOD detection. - This new perspective on AEs is simple and intuitive. It is somewhat surprising that this type of analysis has not been done sooner. - The links to regularization, memorization, and generalization are interesting, and the proposed framework could be a useful analysis tool. - The theoretical framework is well presented and clear, and the theoretical results appear correct. - The paper is well written; the dynamical-systems terminology is clear and intuitive. - The experiments using AEs extracted from pre-trained models are particularly strong, without these, the toy settings described earlier would not have been sufficient. - The scope of the paper is somewhat limited since the theory only holds for AEs. - While the proposed framework is well justified and interesting in its own right, its impact is difficult to gauge. There is no immediate practical impact for practitioners, nor any strikingly new finding that this framework helps uncover. However, the work has clear potential as a future analysis tool. See the Questions section for more precise comments. - Theorem 1 strength and scope: The result relies on uniform contractivity and latent concentration on fixed points, which are strong assumptions rarely satisfied by general AEs. The empirical section provides motivation for approximate contractivity, but the theorem should be reframed as a local or heuristic alignment with the score, not strict proportionality. - Banach fixed-point misstatement: The text says well-posedness “holds iff f is Lipschitz-continuous.” Banach’s theorem requires a contraction (Lipschitz constant < 1), not mere Lipschitz continuity. Please correct. 
- Definition 3 typo: There appears to be an extra f after the Jacobian when defining the Lipschitz constant in Definition 3. - Detail: The numbering of the Theorems / Propositions is inconsistent across the main paper and appendix, which is confusing. Either match the numbering exactly or link the appendix theorem/proposition in the main paper. - Section 4.1: The claim that “OMP = PCA” when using a random orthogonal basis is incorrect. PCA involves the data-covariance eigenbasis; OMP on a random orthobasis is simply sparse projection in that basis, not PCA. Please fix the description and, ideally, compare against true PCA (top-k principal components learned from data) as a stronger baseline. - From the definition of the trajectory score in Sec. 3.2.2, it is not directly clear whether the distance is to all training attractors at once, and which exact point-cluster distance is used (this likely affects results since different point-cluster distances capture different notions of similarity). - OOD baselines are too weak. The proposed trajectory-distance score is only compared to K-NN (with K = 2000). Modern OOD detection for vision backbones includes MSP/energy scores, Mahalanobis, ViM, ODIN, KLM, etc. Adding these would materially strengthen the claim that trajectories convey additional signal beyond embeddings. Also, which neighbors are considered? Since it is trajectories that are analyzed, and each embedding along the trajectory may have different K-NNs, the reference points are moving if you recompute the K-NN for each point in the trajectory. - The distinction between “aggressive regularization” (1) and “over-parameterization” (2) as two forms of memorization is very interesting and would warrant further analysis. Perhaps this framework could allow the characterization of different forms of memorization in NNs. Currently, these are disjoint and hard to compare since (1) is presented in the main text as a function of k (latent-space dimension) and (2) is presented in the appendix as a function of dataset size. Unifying these observations would be valuable. Fully human-written
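For concreteness, here is a minimal sketch of one plausible reading of the trajectory score that the last few points question: iterate the latent map, take the distance from each iterate to its nearest training attractor, and average along the trajectory. The contractive map `W`, the attractor set, and the choice of nearest-point distance are assumptions for illustration, not the paper's actual definitions.

```python
# One plausible reading of a trajectory-based OOD score: mean nearest-attractor
# distance along the iterated latent trajectory. All components are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)
Q = np.linalg.qr(rng.standard_normal((8, 8)))[0]
W = 0.8 * Q                                        # toy contractive latent map
f = lambda z: W @ z                                # stands in for f(z) = E(D(z))

train_attractors = rng.standard_normal((16, 8))    # assumed attractor dictionary

def trajectory_score(z0: np.ndarray, n_steps: int = 20) -> float:
    """Mean nearest-attractor distance along the trajectory (lower = more in-distribution)."""
    z, dists = z0.copy(), []
    for _ in range(n_steps):
        z = f(z)
        dists.append(np.linalg.norm(train_attractors - z, axis=1).min())
    return float(np.mean(dists))

print(trajectory_score(rng.standard_normal(8)))
```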
Navigating the Latent Space Dynamics of Neural Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper shows that iterating the autoencoder‑induced map $f(z)=E \circ D(z)$ implicitly defines a vector field in latent space. It then exploits the field's dynamics and attractor structure to diagnose memorization versus generalization and to detect out‑of‑distribution (OOD) inputs. - It is a novel and distinctive observation that iteratively applying $f$ induces the residual vector field $V(z) = f(z)-z$, whose fixed points serve as attractors toward which nearby trajectories converge. - The claim that this vector field is proportional to the score of the latent prior $q(z)$ is highly intriguing; it effectively generalizes the small‑noise limit result for denoising autoencoders to the latent space. - Proposition 2 is particularly insightful: when training biases the model toward memorization, the prototype term approaches zero while the coverage term narrows, yielding a clear, interpretable criterion for judging memorization versus generalization from the proposed error decomposition. - The paper also establishes a lower bound on the number of iterations required to converge in simple linear settings, grounding the dynamics with an interpretable complexity estimate. - The explanation for why contraction emerges *naturally* via initialization bias, explicit regularization, and implicit regularization would benefit from a stronger theoretical foundation or, at least, a more formal set of sufficient conditions. - Several assumptions, e.g., smoothness of the induced latent distribution and related regularity, are stated, but the extent to which they hold for large‑scale models in practice remains unclear. - **Numerical validation of Theorem 2.** Can you empirically validate Theorem 2? Such evidence would help assess the plausibility of the assumptions underlying its derivation and test the theorem's robustness in realistic settings. - **Iteration complexity under weaker assumptions.** Is it possible to analyze (or bound) the number of iterations required to reach a fixed point under assumptions weaker than those currently stated? Lightly AI-edited
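A small numerical sketch of the central objects this review describes, using a toy linear autoencoder rather than the paper's models: the induced latent map $f(z) = E(D(z))$, the residual field $V(z) = f(z) - z$, and fixed-point iteration toward an attractor, with a contraction check via the spectral norm of the (here constant) Jacobian. The matrices `D`, `E`, and the 0.9 scaling are assumptions chosen so the toy map is contractive.

```python
# Toy illustration of the latent map f(z) = E(D(z)) and its residual vector field.
import numpy as np

rng = np.random.default_rng(1)
D = rng.standard_normal((32, 4))      # toy decoder  R^4 -> R^32
E = 0.9 * np.linalg.pinv(D)           # toy encoder, scaled so E @ D is contractive

f = lambda z: E @ (D @ z)             # latent map induced by one encode-decode pass
V = lambda z: f(z) - z                # residual vector field

# Contraction check: spectral norm of the Jacobian E @ D (constant in this linear toy).
lip = np.linalg.norm(E @ D, 2)
print(f"Lipschitz constant of f: {lip:.3f} (contractive iff < 1)")

# Iterate from a noisy latent point and watch ||V(z)|| shrink near a fixed point.
z = rng.standard_normal(4)
for _ in range(25):
    z = f(z)
print(f"residual norm after 25 iterations: {np.linalg.norm(V(z)):.2e}")
```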