ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 15899 (21%) | 4.43 | 3.58 | 3687 |
| Heavily AI-edited | 3233 (4%) | 4.22 | 3.59 | 2990 |
| Moderately AI-edited | 7082 (9%) | 4.20 | 3.61 | 2722 |
| Lightly AI-edited | 16648 (22%) | 4.15 | 3.68 | 2746 |
| Fully human-written | 32938 (43%) | 4.13 | 3.62 | 2917 |
| Total | 75800 (100%) | 4.21 | 3.62 | 3026 |
Each entry below lists the submission title, ratings, review text, and EditLens prediction.
$\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This work proposes a 3-stage framework to understand the emergence of feature learning in large-width two-layer neural networks, named **Li$_{2}$**. This consists of: 1. Lazy regime: first-layer weights are effectively random. The second layer fits the data using random features. The back-prop term to the hidden layer, $GF$, carries little usable structure and, without weight decay, can vanish at the lazy fixed point (ridge solution). 2. Independent: With weight decay $\eta>0$, $GF$ acquires target structure. Under some assumptions and large width $K$, it simplifies to $$ GF = \frac{\eta}{(Kc_1+\eta)(nc_2+\eta)}\tilde Y\tilde Y^{\top}F, $$ so each neuron follows gradient **ascent** on the single-neuron energy $$ E(w)=\frac{1}{2}\big\|\tilde Y^{\top}\sigma(Xw)\big\|_2^2, $$ i.e., neurons learn useful directions independently. 3. Interactive: As several features are learned, neuron–neuron interactions reshape $GF$: similar features repel, and the signal is steered toward missing directions. This is discussed explicitly for a modular arithmetic task with quadratic activation $\sigma(x)=x^{2}$. Using group-theoretical tools, it is shown that local maxima of $E$ align with group irreducible representations, which for this task coincide with Fourier modes, yielding closed-form descriptions of learned features and their attained energies $E^\star$. Finally, an extension to the multi-layer case is also discussed. The paper is clearly written and provides a fairly generic picture of the mechanism for feature learning in two-layer neural networks, which is task-independent and relies only on an investigation of the gradients. The results for each section rely on different assumptions, which makes **Li$_{2}$** look more like different patches of results rather than a unified picture. Some of the observations appearing in this work have also appeared in other works in the feature learning literature, and a more thorough comparison is lacking. I expand on these two points below. 1. Different results in this work appear to rely on different assumptions about $n,d,K,M$, and it is not immediately clear whether these are mutually consistent. For instance, Lemma 1 assumes that $K$ is sufficiently large and that $x_i^\top x_j=\rho$, which is only possible when $n<d$ unless the $x_i$ are degenerate. By contrast, Theorem 2 and Corollary 1 take $d=2M$ and $n=M^2$, which is consistent with Lemma 1 only for $M<2$. Similarly, Theorem 4 requires $n\gtrsim d_k^{\,2} M \log(\delta/\delta)$ via Matrix Bernstein. Overall, I found it confusing to determine whether the regimes considered for the different phases of Li$_2$ hold simultaneously. 2. Feature learning for two-layer neural networks has been studied extensively in recent years, with several works arriving at a picture closely related to **Li$_2$**. For example: - One line of work systematically analyzes one-pass SGD dynamics in teacher–student settings (a 2LNN learning another, not necessarily identical, 2LNN).
These studies show that the dynamics decompose into plateau / saddle-to-saddle phases: first-layer weights move within a neighborhood of zero (the *mediocrity phase*), then individual neurons correlate with target directions independently before finally coalescing (the *specialization phase*); see Saad and Solla (1995), Goldt et al. (2019), and Arnaboldi et al. (2023). While these works differ technically (finite-width networks, one-pass SGD), the overall mechanism is closely related, with a key difference being the absence of an overfitting phase under one-pass SGD. - Another closely related line of work considers feature learning during the first few steps of GD with aggressive learning rates (Damian, 2022; Ba et al., 2022; Dandi et al., 2024). The energy in Eq. (8), often called the “correlation loss,” is a common approximation in this literature because it yields exact weak-recovery thresholds for the initial gradient steps. A notable observation is that, after a single aggressive step, the gradient can be asymptotically characterized—depending on sample complexity—by a rank-one matrix that correlates with the target, enabling the network to express nonlinear components with limited data. Is this the same mechanism as in Theorem 3? - More recently, Montanari and Urbani (2025) studied a related teacher–student setting for full-batch GD on a wide 2LNN learning a single neuron, and identified three timescales: a lazy timescale, a generalization timescale, and an overfitting timescale (motivating early stopping). (There is no specialization here because the target has a single neuron.) In Li$_2$, overfitting is associated with the lazy phase. How can this be reconciled with Montanari and Urbani (2025)? How should we understand the benefits of early stopping within the Li$_2$ framework? **References** - (Saad and Solla, 1995) Dynamics of On-Line Gradient Descent Learning for Multilayer Neural Networks. NeurIPS 1995. - (Goldt et al., 2019) Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. NeurIPS 2019. - (Arnaboldi et al., 2023) From high-dimensional & mean-field dynamics to dimensionless ODEs: A unifying approach to SGD in two-layers networks. COLT 2023. - (Damian, 2022) Neural Networks can Learn Representations with Gradient Descent. COLT 2022. - (Ba et al., 2022) High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation. NeurIPS 2022. - (Dandi et al., 2024) How Two-Layer Neural Networks Learn, One (Giant) Step at a Time. JMLR 2024. - (Montanari and Urbani, 2025) Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks. arXiv 2025. Fully human-written
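To make the Stage-II picture in the review above concrete, the following is a minimal numpy sketch of per-neuron gradient ascent on the single-neuron energy $E(w)=\frac{1}{2}\|\tilde Y^{\top}\sigma(Xw)\|_2^2$. The Gaussian inputs, one-hot labels, quadratic activation, step size, and per-step normalization are illustrative assumptions, not the paper's exact setup.

```python
# Minimal numpy sketch of the Stage-II "independent feature learning" picture:
# a single neuron performs gradient ascent on E(w) = 0.5 * || Ytilde^T sigma(X w) ||^2.
# Data, activation, and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, C = 200, 32, 8                       # samples, input dim, number of classes
X = rng.standard_normal((n, d))            # inputs (rows are x_i^T)
Y = np.eye(C)[rng.integers(0, C, n)]       # one-hot labels
Ytilde = Y - Y.mean(axis=0)                # centered labels, as in the \tilde Y notation

def sigma(z):                              # quadratic activation used in the analysis
    return z ** 2

def dsigma(z):
    return 2.0 * z

def energy(w):
    return 0.5 * np.sum((Ytilde.T @ sigma(X @ w)) ** 2)

def grad_energy(w):
    h = X @ w                              # pre-activations, shape (n,)
    r = Ytilde @ (Ytilde.T @ sigma(h))     # back-projected label signal, shape (n,)
    return X.T @ (r * dsigma(h))           # dE/dw, shape (d,)

w = rng.standard_normal(d) / np.sqrt(d)    # one neuron's input weights
for t in range(500):                       # gradient *ascent* on E
    w += 1e-3 * grad_energy(w)
    w /= np.linalg.norm(w)                 # keep only the direction; E is scale-sensitive
print(f"E(w) after ascent: {energy(w):.3f}")
```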
$\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper proposes **Li²**, a mathematical framework to explain grokking (delayed generalization) in two-layer nonlinear networks via the structure of the back-propagated gradient matrix $G_F$. Training is decomposed into three phases: **(I) Lazy learning**, where $G_F$ is effectively random and the top layer overfits random hidden representations; **(II) Independent feature learning**, where each hidden unit learns independently because each column of $G_F$ depends only on its own activation—weight decay injects label signal, the dynamics become exact gradient ascent on an energy $E$, and the local maxima of $E$ coincide with the emergent features; and **(III) Interactive feature learning**, where hidden units begin to interact and $G_F$ reorients toward missing features that must be acquired for generalization. On group-arithmetic tasks, the authors analyze when these energy-induced features generalize, their representational power, and how they vary with sample size. The framework yields **provable scaling laws** for memorization and generalization as functions of weight decay, learning rate, and data size, and offers a first-principles explanation for the effectiveness of optimizers such as **Muon**. The analysis is argued to extend to deeper architectures. 1. The paper is overall solid and offers a detailed study of the dynamics of two-layer neural networks. 2. It provides experiments that validate the theoretical results. 1. The writing feels very rushed: many symbols are undefined or unexplained, making the paper hard to follow and overly dense. 2. The theoretical setup is quite restricted; for example, a projection function is deliberately designed so that the hidden layer receives gradients that are random noise. 3. It is unclear whether the group (arithmetic) example pertains only to Stage II or also extends to Stage III. 4. The theoretical analysis of Muon appears disconnected from the main body of the paper, and the setup and assumptions of Theorem 8 are entirely unclear. 5. The relationships among the three stages are not well articulated; the analysis reads like heuristic case-by-case treatment rather than a genuinely unified three-stage dynamics analysis. 6. The abstract mentions scaling laws, but it is not evident where in the paper this is actually developed. See weaknesses. Lightly AI-edited
$\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper analyzes grokking dynamics in the presence of regularization. They identify three phases in the learning process - the lazy learning regime and the independent and interactive feature learning regimes - and prove various theorems which govern different aspects of the training dynamics. 1) The theorems proved are mathematically rigorous, with detailed proofs in the appendices. 2) The scaling law analysis in Section 5.4 will be useful since it gives a first-principles derivation of the scaling phenomenon. 3) Extensions to modern optimizers and deeper networks were adequately discussed. 1) My main concern is that the key observation (that there is a lazy and rich learning regime) was already reported in [1], who study this in the context of polynomial regression. While the current paper offers mathematically rigorous proofs, the distinction with [1] needs to be more elaborately discussed in the main text. 2) The mathematical framework presented displays the three stages, but fails to offer more insight into what drives these transitions or when they occur. For example, based on lines 471-475, it seems like it's the (inverse) learning rate which sets the scale for when these transitions occur, but it would be great if this could be explained in more detail. 3) How does one relate the top-down modulation in Sec 6 to the continuous feature learning observed in [2]? 4) A discussion of limitations is not found in the main text. [1] Tanishq Kumar, Blake Bordelon, Samuel J. Gershman, and Cengiz Pehlevan. Grokking as the transition from lazy to rich training dynamics, 2024. [2] Gromov, A. (2023). Grokking modular arithmetic. arXiv preprint arXiv:2301.02679. 1) Lines 105-106: I suggest modifying "grokking mostly happens ... regularization" to "grokking is accelerated ... regularization" 2) Why are the axes cut off till epoch 300 in the last column of Fig. 2? 3) Since the discussion on Thm 3 involves optimization in the complex domain, were any follow-up experiments done with complex weights? 4) The existence of maxima for the ascent functions is interesting (Thms 4-5). What can be said about the network's ability to find these maxima? Fully human-written
$\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization Soundness: 4: excellent Presentation: 3: good Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a three-stage framework to theoretically study grokking using a two-layer neural network with MSE loss and quadratic activation functions. It finds connections between stage II and the optimization of an energy function. It rigorously analyzes the roles of learning rate, weight decay, and sample size, and proposes a scaling law for the generalization/memorization boundary. It also theoretically analyzes the benefits of the Muon optimizer in the feature learning regime. 1. The first work to rigorously analyze the role of learning rate, weight decay, and width on grokking dynamics. 2. Interesting and novel analysis of the interactive feature learning regime. 3. First theoretical analysis of Muon in the feature learning regime. 4. The theoretical results on the two types of memorization are intriguing. 1. MSE loss is uncommon in the training of deep learning models, though it's fine for the ease of theoretical analysis. 2. The theory needs a nonzero weight decay to provably show the three phases, while in practice, weight decay is unnecessary for grokking to occur. 3. Several assumptions need to be justified. 1. In stage I, why are the activations F mostly unchanged? Are there any empirical observations to support such an assumption? I guess it needs assumptions on the learning rate, weight decay, and initialization scale of V. 2. In Lemma 1, why can we assume W always follows a normal distribution at each step? Besides that, W is assumed to follow a normal distribution, but in the proof (lines 714-716), you assume the w_i follow N(0, I), which is stronger. 3. In Lemma 1, why can we assume $\langle x_i, x_{i'}\rangle = \rho$? Does this assumption hold in any synthetic tasks that show grokking? 4. In Theorem 5, what are focused and spreading memorization? What's the difference between memorization and overfitting? Fully human-written
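All four reviews above refer to the same underlying setting: a two-layer network with quadratic activation trained with MSE loss and nonzero weight decay on a modular-arithmetic task. A minimal PyTorch sketch of that setting follows; the modulus, width, optimizer, and hyperparameters are illustrative assumptions, and the timing of any delayed-generalization gap depends on them.

```python
# Minimal PyTorch sketch of the reviewed setting: a two-layer network with
# quadratic activation, MSE loss, and weight decay on modular addition (p = 23).
# Hyperparameters and the one-hot encoding are illustrative assumptions.
import torch

torch.manual_seed(0)
p, K = 23, 256                                               # modulus, hidden width
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
X = torch.cat([torch.nn.functional.one_hot(pairs[:, 0], p),
               torch.nn.functional.one_hot(pairs[:, 1], p)], dim=1).float()   # d = 2p
Y = torch.nn.functional.one_hot(pairs.sum(dim=1) % p, p).float()

perm = torch.randperm(len(X))
n_train = len(X) // 2                                        # ~50% of the p^2 examples
tr, te = perm[:n_train], perm[n_train:]

W = (0.1 * torch.randn(2 * p, K)).requires_grad_()           # first layer
V = (0.1 * torch.randn(K, p)).requires_grad_()               # second layer
opt = torch.optim.Adam([W, V], lr=1e-3, weight_decay=1e-3)   # nonzero weight decay

def forward(x):
    return (x @ W) ** 2 @ V                                  # quadratic activation

def acc(idx):
    return (forward(X[idx]).argmax(1) == Y[idx].argmax(1)).float().mean().item()

for step in range(20000):
    opt.zero_grad()
    loss = torch.mean((forward(X[tr]) - Y[tr]) ** 2)         # MSE loss
    loss.backward()
    opt.step()
    if step % 2000 == 0:
        with torch.no_grad():
            print(f"step {step:5d}  train acc {acc(tr):.2f}  test acc {acc(te):.2f}")
```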
DualTune: Decoupled Fine-tuning for On-Device Agentic Systems Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses the performance shortcomings of local LLMs deployed in on-device agentic systems. It reveals the suboptimal performance of local LLMs in tool orchestration through detailed analysis. To address these issues, the authors propose a "decoupled fine-tuning" method and the "DualTune" inference framework, whose main idea is to decompose the tool-calling pipeline into tool selection and argument generation. Furthermore, the paper argues for the importance of toolset separation, which can effectively boost the performance of tool selection compared to direct fine-grained tool selection. Extensive experimental results demonstrate the effectiveness of the proposed tuning method and inference framework. 1. The analysis section is comprehensive and convincing. It also serves as a strong foundation for the proposed method and inference framework. 2. I agree with the finding that selection among fine-grained tools leads to overly long contexts, which further degrades the tool-selection capabilities of LLMs. To resolve this, the proposed 2-tiered tool-selection process is reasonable and efficient. 3. The experimental results demonstrate the effectiveness of the proposed methods. 4. The writing and structure of the paper are clear and easy to follow. 1. As illustrated in the Limitations section, the scalability of the proposed method is limited, as it requires further fine-tuning to adapt to new tools. 2. The experimental setup is somewhat unfair. In the main experiment, the DualTune-testset is an in-domain test set (since DualTuneModel-7B is trained on the corresponding training set), and tools in MCP-Bench are also fine-tuned using synthetic data. It would be more convincing to demonstrate the effectiveness of tool-selection/argument-generation separation and the two-tiered selection if the authors evaluated on some OOD benchmarks or toolsets. 3. In Line 294, the authors state that "while it may contain cases where GPT-5-mini performs poorly, such cases are rare." It would be more convincing if the authors provided quantitative results. 4. The formatting of the paper title does not meet the ICLR 2026 guidelines. 1. Can you provide some examples of GPT-5 generated data? 2. What is the average additional time required for DualTuneModel-7B to complete a full cycle of “toolset selection -> tool selection -> argument generation -> tool execution” compared to direct tool calling? Lightly AI-edited
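Regarding the question about the full "toolset selection -> tool selection -> argument generation -> tool execution" cycle, the sketch below shows schematically where each stage and adapter sits and where the extra latency accumulates. The `call_llm` stub, its canned responses, and the toy tool registry are hypothetical placeholders, not DualTune's actual interfaces.

```python
# Schematic of the two-tier tool-calling cycle discussed above. All names and
# responses are hypothetical placeholders; the point is only to show where the
# selector and per-tool adapters sit and where end-to-end latency comes from.
import json
import time
from typing import Callable

def call_llm(prompt: str, adapter: str) -> str:
    """Stand-in for a local LLM call with the named LoRA adapter active."""
    canned = {
        "selector": "filesystem" if "Toolsets:" in prompt else "read_file",
        "args::read_file": '{"path": "notes.txt"}',
    }
    return canned[adapter]

TOOLSETS: dict[str, dict[str, Callable[..., str]]] = {
    "filesystem": {"read_file": lambda path: f"<contents of {path}>"},
    "calendar": {"create_event": lambda title, when: f"created '{title}' at {when}"},
}

def run_query(query: str) -> str:
    t0 = time.time()
    # Tier 1: route to a toolset using only toolset names (short prompt).
    toolset = call_llm(f"Toolsets: {list(TOOLSETS)}\nQuery: {query}\nPick one toolset:", adapter="selector")
    # Tier 2: pick a tool within the chosen toolset.
    tool = call_llm(f"Tools: {list(TOOLSETS[toolset])}\nQuery: {query}\nPick one tool:", adapter="selector")
    # Argument generation with the tool-specific adapter, then execution.
    args = json.loads(call_llm(f"Schema for {tool} ...\nQuery: {query}", adapter=f"args::{tool}"))
    result = TOOLSETS[toolset][tool](**args)
    print(f"full cycle took {time.time() - t0:.3f}s")  # the end-to-end latency asked about above
    return result

print(run_query("show me notes.txt"))
```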
DualTune: Decoupled Fine-tuning for On-Device Agentic Systems Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. Local LLMs struggle with tool selection and argument generation, so the authors split tool-calling into two subtasks and train separate LoRA adapters for each (“decoupled fine-tuning”). At inference, DualTune selects the tool, dynamically loads the matching adapter, and generates arguments with hierarchical orchestration to avoid unnecessary tools. On MCP-Bench, Qwen-2.5-7B with this method improves tool-calling accuracy by 46% and typically outperforms similar-size and even 2× larger local baselines. The contribution of the paper is a decoupled fine-tuning method—instantiated as DualTuneModel-7B—that trains separate LoRA adapters for tool selection and argument generation. Across two benchmarks, DualTune outperforms similar-size local models, matches or beats local reasoning baselines at lower latency, and yields ~2× larger gains in tool-calling on an MCP filesystem benchmark than conventional fine-tuning. It also runs efficiently on consumer-grade hardware, enabling privacy-preserving on-device agents. The strengths of the paper are, - The method is simple and easy to understand, making it easy to follow. The weaknesses of the paper can be summarized as, - Introduction Part - “Accessible directly through LLMs.” Tools are accessed via the client/runtime that wraps the model, not “directly through LLMs.” This overstates the model’s capabilities and hides the execution layer. - Cost claim is overstated. “Simultaneously eliminating the costly API expenses associated with orchestration via frontier models.” Local inference removes per-token API fees but introduces compute, energy, engineering, and possibly remote tool/API costs. “Eliminating” is inaccurate. - Unsupported blanket claims. Phrases like “poor tool selection capabilities” and “poor argument generation capabilities” are strong generalizations about “existing local LLMs” without quantitative evidence (models, sizes, datasets, metrics, error bars). - Causal story is dubious. It claims large tool sets → “expanded context length … overwhelms attention mechanisms.” Long context alone doesn’t imply degraded attention for modern long-context models; failure is more often due to retrieval/noise or planning, not the mechanism being “overwhelmed.” Provide ablations isolating (a) number of tools, (b) prompt length, (c) retrieval quality. - Iterative repair claim lacks basis. “Limited ability to fix mistakes in subsequent steps” is not inherent to local models; it depends on the planner/executor loop and tool feedback. If this is empirical, report the loop design and repair rates. - Metric confusion. “Prompt tuning yield marginal accuracy improvements” — accuracy of what? Tool selection EM? End-task success? Provide concrete metrics and deltas; avoid vague performance statements. - Oversimplified task taxonomy. Framing selection as “a classification task” and parameter filling as “syntactic accuracy” is reductive. Selection often requires multi-label reasoning with constraints; parameter filling requires semantic grounding and schema validity, not just syntax. - Rhetorical emphasis over analysis. Multiple assertions (e.g., “not substantial”) are qualitative without confidence intervals or statistical tests. 
Academic writing should ground these claims in data and controlled comparisons. - Method Part - Metric inconsistency. The text defines ToolFit as a 0–10 score, yet Table 1 reports percentages (e.g., 58.3, 88.4) and labels the column “ToolFit (%)”. These cannot both be true; specify one scale and keep it consistent. - LLM-as-judge overclaim. Calling the judge “deterministic and accurate” is unjustified. Even with temperature 0, LLM judgments can vary across prompts/versions; accuracy requires validation (e.g., agreement with human annotators), which is not provided. - The meaning of Qwen-3-8B* in Table 1? - Misleading “consumer-grade GPU” premise. Framing 24 GB as the “typical” consumer GPU capacity is inaccurate (8–16 GB is far more common). Conclusions based on this premise may not generalize. - Overgeneralized claim of incapability. From one toolset and 50 tasks the authors conclude “local models are not able to perform effectively as orchestrators.” That’s too strong: it ignores other toolsets, planners, retrieval strategies, and stronger local models; also the baseline GPT-5-mini is non-public, making the comparison unverifiable. - Latency claim without evidence. “Reasoning models incur high latency … a known problem” is asserted without measurements (decode lengths, tokens/sec) or citations that actually establish this for the authors' setups. - Confounded “decoupling” ablation. Swapping in GPT-5-mini to generate arguments when evaluating a local model’s tool selection (and vice-versa) does not isolate a single stage: ToolFit depends on both the chosen tool and the validity/format of its arguments. Using a much stronger model for the other stage changes the distribution and can inflate the measured gain, so attributing the 16%→60.8% jump “primarily to tool selection” is not warranted without a control that keeps the counterpart stage constant and equivalently capable. - Causal claim about long context is overgeneralized. “Large context increases the complexity of the attention mechanism, causing the accuracy to drop” is asserted as general fact with citations but no controlled evidence from the authors' setup (no ablation of prompt length vs. retrieval/noise, no per-token loss/attention diagnostics). It may be true in some cases, but the blanket causality is unsupported here. - Attribution error in the long-context experiment. The authors change two variables at once—tool count (12→30) and description length (≈3–4k→9.8k tokens)—then ascribe the drop (16%→10.4%) to context length. Increased choice set size alone can reduce ToolFit. A valid test must hold tool count constant while varying description length (or vice versa). - “Ground truth” misuse and unsupported quality claim. LLM-generated trajectories are labeled as ground truth and asserted to be “high-quality” with “rare” errors, but no human validation rates or QC protocol are reported. - Circularity/contamination risk. The same teacher model generates both the training data and the separate “DualTune-TestSet,” and earlier sections also use the teacher as a judge. This teacher-generated test plus teacher-as-judge setup can inflate performance and undermines independent evaluation. - Scalability mismatch. Training a separate argument-generator adapter per tool scales linearly with tool count and contradicts the claim of a “general-purpose” approach; memory/maintenance costs are not addressed. - Test set design is weak/ambiguous. 
A fixed 50-query test set across all tools is too small for reliable estimates and appears to be sampled from the same distribution as training prompts, with unclear disjointness from the 80/20 split. - Unsubstantiated scaling claim. “Allows the system to scale to a larger number of tools without performance degradation” is too strong. Hierarchical routing adds extra LLM calls and dynamic adapter loads; any savings from shorter prompts must be shown to outweigh these costs with measurements. - Method underspecified / hard to reproduce. Phrases like “guided by a system prompt and structured decoding” lack operational detail (inputs seen at tier-1, decoding rules, thresholds). Without this, others cannot reproduce or evaluate the claimed benefit. - Token/complexity accounting is missing. The authors argue context length drives difficulty but don’t quantify how much text the tier-1 router reads (names vs. descriptions) nor the end-to-end token, memory, and latency budget versus the flat baseline. - Experiments Part - Math/number inconsistencies. The baseline is 10.6% in Fig. 2 but 10.4% in the text below. You also write “up to 60% higher than Qwen-2.5-7B,” yet 61.5 vs 16.0 is +45.5 percentage points (≈+184% relative), and 61.5 vs 10.6 is +50.9 pp (≈+480%). Say “percentage-point gain” and ensure the baselines match. - Unfair ablation setup. You “perform this experiment on the Filesystem toolset,” yet claim hierarchical orchestration helps by restricting to the correct toolset. If evaluation only contains filesystem tasks, letting the hierarchical method filter to the filesystem toolset effectively gives it oracle prior knowledge and shrinks the search space, while the flat baseline appears to consider all tools. Make the candidate tool sets identical for all variants or route on a mixed multi-toolset workload. - Component attribution is confounded. The jump 16%→39% (traditional FT) and 16%→61.5% (decoupled FT) use training data generated by your own pipeline; without a held-out, independently sourced test set and variance/error bars, the “contribution of each component” plot is not trustworthy. - Reproducibility gap. The model name “DualTuneModel-7B” and training details (LoRA ranks, data volume per tool, selection prompts) are insufficient to reproduce these exact numbers; the figure has no confidence intervals or multiple-seed runs. Please refer to the Weaknesses section for details. Overall, the manuscript lacks sufficient novelty and technical depth for this area. The problem motivation and positioning are unclear; the method relies on several unvalidated assumptions; experiments are limited in scope, with narrow model/data choices, and the reported results do not adequately support the claims. Descriptions of implementation and workload are insufficient, hindering reproducibility. I recommend clarifying the contribution boundaries, adding strong baselines and ablations, and providing fuller methodological details before resubmission. Fully AI-generated
DualTune: Decoupled Fine-tuning for On-Device Agentic Systems Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a decoupled strategy for tool use, which divides the tool-using process into two stages: tool selection and argument generation. For each toolset, the authors train an independent LoRA module to enhance tool-specific performance. Experiments on the MCP-Bench dataset are conducted to demonstrate the effectiveness of the proposed approach. While the idea of decoupling tool-related sub-tasks is interesting and potentially beneficial, the paper overlooks important aspects regarding efficiency, scalability, and inter-toolset interaction, which limits the practicality of the approach. 1. The paper presents a well-motivated and intuitive decoupling idea, which offers a new perspective on tool-use modeling. 2. The writing is clear and the framework design is easy to follow, making the overall contribution accessible to readers. 1. The paper should provide a clearer illustration of the statistics and composition of the evaluation benchmark (e.g., number of tools). This would help readers better understand the experimental setup and the claimed generalization ability. 2. The proposed approach introduces two sequential inference processes (tool selection and argument generation), which substantially increases computational cost. Since these are performed using separate models, KV caching and similar acceleration techniques cannot be shared between stages, leading to high latency and memory overhead. The problem is further exacerbated in multi-turn function-calling scenarios, where model switching and repeated prefill computation may result in severe performance degradation. An analysis of training/inference efficiency or latency comparison with unified models would make the work more convincing. 3. The current framework assumes independence across toolsets, which restricts its applicability to tasks involving inter-tool dependencies (e.g., when the output of a tool in one set serves as input for another). Such relationships are common in complex workflows, but the proposed model cannot capture them. Moreover, the inference cost may grow linearly with the number of toolsets, making it impractical for large-scale deployment. 4. More comprehensive ablation studies are needed to justify the decoupling strategy. For instance, a well-trained tool selector could be combined with a base model as the argument generator to isolate the effect of each component. Please refer to the weaknesses. Moderately AI-edited
DualTune: Decoupled Fine-tuning for On-Device Agentic Systems Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents DualTune, an approach to fine-tuning tool-calling agents in which, instead of fine-tuning the agent for the overall tool-calling task, the authors split the task into tool selection and argument filling and fine-tune several LoRA adaptors (one for each function, plus a tool selector). They show that this method can help on-premise deployments achieve better performance than their base models and close the gap with frontier models. The authors show a detailed analysis of the approach using MCP-Bench and present various results compared to baseline models. There are several other tool-calling benchmarks, but the authors only presented numbers on a benchmark curated on their side and MCP-Bench. Results from other benchmarks would have been good to see. I don’t understand the reason to train a new adaptor for each function for the argument-filling task. This seems like overkill. Did the authors try a single LoRA adaptor for all the functions and see if it can do the generic argument filling across tools? Also, instead of LoRA adaptors, if I go with ICL examples for each function, will it give similar results as fine-tuning? What are the inference-time implications of LoRA adaptor switching? Any analysis around that? Some basic baselines need to be shown to really make the case for per-function LoRA adaptors: a single argument-filling model vs. multiple, with respect to accuracy and time, and a comparison of simple ICL examples per selected function vs. the fine-tuned model. Please check the weaknesses section. Fully human-written
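On the question of the inference-time cost of LoRA adaptor switching, the following rough harness times a generation call after each adapter switch using Hugging Face peft. The adapter paths and names are hypothetical, the adapters are assumed to have been trained elsewhere, and this is a sketch of one possible measurement setup rather than the paper's actual serving stack.

```python
# Rough timing harness for the adapter-switching question above: one LoRA adapter
# per role (tool selection, per-tool argument generation).  Paths and adapter
# names are hypothetical; the adapters are assumed to exist already.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-7B-Instruct"                      # base model used in the paper
tok = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")

# Hypothetical local adapter checkpoints.
model = PeftModel.from_pretrained(base, "adapters/tool_selector", adapter_name="selector")
model.load_adapter("adapters/args_read_file", adapter_name="args_read_file")

def timed_generate(prompt: str, adapter: str) -> str:
    t0 = time.perf_counter()
    model.set_adapter(adapter)                            # switch the active LoRA weights
    ids = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=64)
    print(f"[{adapter}] {time.perf_counter() - t0:.2f}s")
    return tok.decode(out[0], skip_special_tokens=True)

tool = timed_generate("Which tool answers: 'show me notes.txt'?", "selector")
args = timed_generate(f"Fill arguments for {tool} given: 'show me notes.txt'", "args_read_file")
```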
NaviAgent: Bilevel Planning on Tool Navigation Graph for Large-Scale Orchestration Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes the NaviAgent bilevel framework, which decouples task planning and execution in LLM tool orchestration. By integrating the TWNM (Tool World Navigation Model) to dynamically model tool dependencies and enable closed-loop optimization, NaviAgent significantly outperforms baselines in Task Success Rate (TSR) on the API-Bank and ToolBench datasets, achieving efficient navigation in large-scale tool ecosystems. 1. Its architectural innovation decouples key components (planning and execution). 2. The TWNM design unifies the capture of both tool structural and behavioral dependencies. 3. The paper conducts comprehensive experiments covering multiple models and scenarios. 4. The paper is well-written, featuring a clear logical structure. 1. There is a gap between tools in simulated and real environments. Real-world APIs are diverse and dynamic, with frequent error fixes, feature updates (e.g., new parameters added), or temporary outages. Although TWNM incorporates a dynamic graph evolution design, it relies on historical execution feedback to update the graph structure, leading to a time gap—for instance, if an API’s error is just fixed but its weight in the graph remains low, NaviAgent may still avoid using it; conversely, sudden API failures without timely pruning result in invalid calls. 2. Real-world APIs are extensive, and numerous tools not included in the initial graph exist, causing TWNM to fail in generating optimal toolchains. Please refer to the Weaknesses. Lightly AI-edited
NaviAgent: Bilevel Planning on Tool Navigation Graph for Large-Scale Orchestration Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents NaviAgent, a bilevel framework for large-scale tool orchestration by LLMs. It decouples task planning (4D decision space: direct response, intent clarification, toolchain retrieval, execution) from tool execution (Tool World Navigation Model, TWNM). TWNM models tool structural/behavioral dependencies via a dynamic graph, enabling adaptive toolchain search. Closed-loop feedback optimizes planning/execution. Experiments on API-Bank/ToolBench show NaviAgent outperforms baselines in TSR, balancing efficiency/robustness. 1. This paper proposes a novel bilevel architecture that decouples task planning from tool execution, enabling NaviAgent to handle thousands of tools without being hindered by inter-tool complexity, thus addressing scalability issues of existing agents. 2. The Tool World Navigation Model (TWNM) dynamically encodes tool structural and behavioral dependencies, supports adaptive toolchain search/evolution, and significantly boosts performance on complex tasks by up to 17 points. 3. It integrates a closed-loop optimization mechanism using real tool interaction feedback to refine planning and execution, enhancing robustness and adaptability to dynamic API ecosystems. 4. For me, the strengths of this paper lie in the following aspects: it can dynamically adjust based on the difficulty level of various problems and the latest status when addressing them; it exhibits strong overall engineering feasibility; and in terms of innovation, the search strategies such as Alpha-Beta pruning can handle and prune some extreme cases, enabling rapid acquisition of effective toolchains. 1. My biggest concern is that this paper imposes overly strong constraints on input problems. For example, if we obtain the corresponding tool invocation path through the proposed graph, can we dynamically switch to an alternative path if a problem occurs in the middle of the current path? The paper seems to lack sufficient explanation regarding how to handle such errors. 2. When using Alpha-Beta pruning for search, the evaluation of Alpha and Beta values is crucial. For instance, if I choose a certain edge under a specific decision, how do you update the evaluation value of this decision in the global context? If only factors like tool invocation success rate and relevance are used, the accuracy of relevance evaluation needs to be very high. From this perspective, the paper’s evaluation of heuristic search seems relatively simplistic, and in some cases, it might eliminate effective tool invocation branches. 3. How do you evaluate the dependency between two tools? The paper mentions using weights for evaluation—are these weights based solely on the historical information of the tools observed so far? If the invocation relationship between two tools changes significantly, will the tools overly rely on historical data? 4. Does the proposed framework rely too heavily on the evaluation of tool invocation success rates? If a tool has a certain invocation success rate but delivers excellent results when it works, we might not want to sacrifice its usage frequency.
Alternatively, for your paper, is tool invocation speed more important than the overall reasoning performance? 5. Real-world requirements are highly diverse. For a new requirement, can this framework demonstrate better tool planning capabilities compared to traditional methods like ReAct and Tool-Planner? See above. Lightly AI-edited
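To make concrete the kind of edge evaluation the questions above are about, here is a generic best-first search over a weighted tool-dependency graph in which each edge is scored by historical success rate times estimated relevance. This is not NaviAgent's Alpha-Beta procedure, and all weights are made up.

```python
# Not NaviAgent's actual algorithm: a generic best-first search over a weighted
# tool-dependency graph, scoring a chain by the product of (success_rate x relevance)
# along its edges.  All numbers are invented for illustration.
import heapq

# tool -> list of (next_tool, historical_success_rate, relevance_to_query)
GRAPH = {
    "start":          [("search_flights", 0.95, 0.9), ("search_hotels", 0.90, 0.2)],
    "search_flights": [("book_flight", 0.80, 0.9)],
    "search_hotels":  [("book_hotel", 0.85, 0.3)],
    "book_flight":    [("done", 1.0, 1.0)],
    "book_hotel":     [("done", 1.0, 1.0)],
}

def best_chain(start: str = "start", goal: str = "done"):
    # Max-score search: negate scores so heapq's min-heap pops the best chain first.
    frontier = [(-1.0, [start])]
    while frontier:
        neg_score, chain = heapq.heappop(frontier)
        node = chain[-1]
        if node == goal:
            return chain, -neg_score
        for nxt, succ, rel in GRAPH.get(node, []):
            heapq.heappush(frontier, (neg_score * succ * rel, chain + [nxt]))
    return None, 0.0

chain, score = best_chain()
print(chain, round(score, 3))   # ['start', 'search_flights', 'book_flight', 'done'] 0.616
```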
NaviAgent: Bilevel Planning on Tool Navigation Graph for Large-Scale Orchestration Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper presents NaviAgent, a bilevel planning framework for tool-use agents. It separates high-level reasoning (deciding when to respond, clarify, retrieve, or execute) from low-level execution using a Tool World Navigation Model (TWNM), a dynamic graph that captures dependencies among tools. Their quantitative experiments show significant gains in task success and completion rates over baselines. The research problem is timely and important, given the rise of agentic LLMs. The quantitative experiments show consistent improvements over the baselines. 1. The paper provides no qualitative or quantitative analysis of the learned graph structure; TWNM is evaluated only indirectly through overall task performance, making it difficult to see what the graph actually learns. 2. The high-level decision labels (Direct Response / Clarify / Retrieve / Execute) are derived from rule-based relabeling of ToolBench and API-Bank traces with additional synthetic augmentation, rather than real human data. This raises concerns about whether the learned planner generalizes to real-world cases, particularly in the Direct Response and Clarify categories. 1. In Section 3.3 (L319–323), the paper mentions repeating recombination “until infeasibility,” but the stopping criterion is not defined. Could the authors clarify how termination is determined? 2. How sensitive is the planner to the four-action taxonomy? Could a different or better set of decision types help? 3. The paper claims TWNM supports dynamic integration of new tools; do you have evidence of generalization to tools unseen during training? 4. How expensive is maintaining and updating TWNM as the tool set grows? Lightly AI-edited
NaviAgent: Bilevel Planning on Tool Navigation Graph for Large-Scale Orchestration Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper addresses the challenges LLM-based agents face when orchestrating large-scale, dynamic tool ecosystems, specifically targeting issues arising from sequential, step-by-step tool invocation, such as error accumulation and inability to handle complex inter-tool dependencies. The authors propose NaviAgent, a bilevel framework that decouples high-level task planning from low-level tool execution. - **Bilevel Decoupling:** The core architecture uniquely decouples high-level task reasoning (the four-dimensional decision space: respond, clarify, retrieve, execute) from low-level tool orchestration. This contrasts with standard ReAct-style agents that interleave reasoning and single-step execution, often leading to error accumulation in complex tasks. - **Evolving Tool World Navigation Model (TWNM):** Moving beyond static tool graphs, the TWNM is highly original in its integration of "behavioral chains" derived from actual execution traces alongside standard "structural chains" (API schemas). Treating inter-tool dependency discovery as a link prediction problem using Heterogeneous Graph Transformers is a sophisticated approach to a typically heuristic-heavy domain. - **Search Algorithms for Toolchains:** The adaptation of classic search algorithms—specifically Alpha-Beta pruning for backward search and a hybrid genetic/simulated annealing heuristic for forward search—to orchestrate entire toolchains is a creative departure from standard retrieval-augmented generation (RAG) or simple depth-first search methods. - **Cold-Start**: The Tool World Navigation Model (TWNM) heavily relies on "behavioral chains, derived from historical usage data" and "statistical weight... reflecting empirical invocation patterns". This creates a significant cold-start problem. The framework might underperform significantly in new domains where these rich execution traces are unavailable, yet the paper does not quantify this degradation. - **Justification of the Four-Dimensional Decision Space**: The high-level planner uses a fixed "four-dimensional decision space" (Direct Response, Intent Clarification, ToolChain Retrieval, Tool Execution). While functional, it is not clear if this specific taxonomy is optimal or necessary compared to a more flexible, LLM-driven dynamic planning approach. It risks being too rigid for edge cases that don't fit neatly into these four categories (e.g., partial execution with mid-stream replanning without full re-retrieval). - **No Optimization for Arguments**: For tool/function-calling tasks, selecting the correct function is important, but much of the time LLMs fail at generating the correct arguments for the functions. See Weaknesses. Fully AI-generated
TCMAgent: A Multi-Agent Framework for General Traditional Chinese Medicine Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper presents TCMAGENT, a multi-agent framework for Traditional Chinese Medicine (TCM) decision-making. It processes six types of clinical inputs through specialized agents, integrates knowledge via a TCM RAG database, and generates treatment plans through an internal debate between aggressive and conservative strategies, overseen by a reflective judge. Tests on the ClinicalLab dataset (1,500 cases) show that TCMAGENT consistently improves relevance, accuracy, and safety over standard LLMs. Key contributions include: (1) a distributed multi-modal reasoning design; (2) a debate–judge–reflection mechanism for explicit trade-offs; and (3) empirical evidence of reduced errors and contraindications. 1. Clear architecture (Fig. 1) with separate analysis, diagnosis, and debate stages for interpretability and ablation. 2. Retrieval and reflection/debate components (Figs. 4–6; Tables 6–8) consistently reduce errors and contraindications. 3. Code and prompt details (Appendix E) are publicly available. 1. The evaluation relies almost entirely on an LLM-as-judge framework, using GPT-4.1-mini as the evaluator (Sec. 4.1) while also including it as a backbone model in the system (Table 1), introducing potential coupling and bias. Moreover, no human expert evaluation is provided to validate the clinical accuracy, safety, or reliability of the judging criteria. 2. Baseline fairness is questionable: $\textbf{while baselines are given inputs sequentially “to mimic a clinical scenario”}$, $\textbf{TCMAGENT receives all modalities simultaneously (Sec. 4.1)}$, conflating the benefits of agent collaboration with those of richer input access. As a result, it remains unclear whether the observed gains stem from the multi-agent design or improved information packaging. 3. The paper overstates its novelty and generalization. Although it claims to provide the “first empirical evidence” that distributed, deliberative agent architectures outperform monolithic models in complex medical settings (pp. 1–3), it does not include direct comparisons with previously cited agent-based medical systems such as $\textbf{ClinicalAgent}$ [1] or $\textbf{MedAgents}$ [2], nor with competitive single-agent reasoning baselines. These omissions weaken the evidential basis for its broad claims, which should be either narrowed or supported by more comprehensive comparisons. 4. The paper’s use of the term “multi-modal” is ambiguous. All inputs appear to be textual summaries, such as written “imageological exam” reports, rather than raw images or physiological signals. Since no vision or multi-modal model is actually employed, the work does not fully support its claim of integrating parallel multi-modal evidence. [1] Yue, Ling; Xing, Sixue; Chen, Jintai; Fu, Tianfan. ClinicalAgent: Clinical Trial Multi-Agent System with Large Language Model-based Reasoning. In Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM BCB 2024), article 11, 2024. [2] Tang, X., Zou, A., Zhang, Z., Li, Z., Zhao, Y., Zhang, X., Cohan, A., & Gerstein, M. (2024). 
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning. In Findings of the Association for Computational Linguistics: ACL 2024, 599-621. N/A Moderately AI-edited
TCMAgent: A Multi-Agent Framework for General Traditional Chinese Medicine Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces TCMAGENT, a multi-agent framework designed to replicate the complex reasoning of Traditional Chinese Medicine (TCM) practitioners. The system uses parallel agents for multi-modal data synthesis, a retrieval-augmented knowledge base for diagnosis, and a collaborative, adversarial debate module to refine treatment recommendations. The paper's primary strength lies in its architectural design, which operationalizes a distributed and reflective clinical workflow. My major comments are: 1. The entire evaluation relies on using GPT-4.1-mini as an automated judge. This methodology is highly questionable for a high-stakes medical domain. It is prone to self-enhancement bias (when evaluating its own backbone) and potential alignment failures. 2. The paper's own analysis reveals that medication recommendations have the lowest actionability scores. The reason given is that they "often lack precise dosage or administration instructions", which makes the generated treatment plans clinically incomplete and unsafe for real-world use. 3. The novelty is highly limited, and the real-world clinical applicability is also limited. The framework's effectiveness is not universal. While it improves performance for models like GPT-4.1-mini and DeepSeek-v3, it causes significant performance degradation across most metrics for LLaMA-3.3-70b and yields mixed results for Gemini-2.0. 4. The cross-backbone analysis found very low Jaccard similarity in the generated treatment recommendations. The paper frames this as diversity similar to real-world clinical scenarios. This is a weak interpretation; such high variance more likely indicates a lack of robustness and stability in the framework's outputs. 5. The "experiential reflection mechanism" is presented as a key innovation. However, its implementation is described vaguely as retrieving "deliberation traces from historical cases". The paper fails to specify how these traces are retrieved, represented, or used by the Judge Agent, making this contribution difficult to assess or reproduce. Please refer to my comments above. Heavily AI-edited
TCMAgent: A Multi-Agent Framework for General Traditional Chinese Medicine Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces TCMAgent, a multi-agent framework designed to emulate expert-level, holistic decision-making for Traditional Chinese Medicine (TCM) clinical tasks. The architecture features three key modules: (i) parallel evidence synthesis via modality-specific agents, (ii) a knowledge-grounded inference step pairing retrieved domain knowledge with diagnosis, and (iii) a deliberative recommendation module where adversarial agents debate potential actions. The framework is evaluated against several proprietary and open-source LLMs on a multi-modal TCM clinical dataset, with results showing improved safety, coherence, and interpretability versus LLM baselines. 1. The paper operationalizes a multi-agent workflow for TCM, coordinating parallelized evidence gathering, structured adversarial deliberation, and reflective learning. 2. The experimental evaluation is thorough, with direct comparisons to strong proprietary and open-source LLMs, and with detailed metric-based benchmarking. 3. The paper includes extensive ablation studies isolating the effects of knowledge retrieval, debate, and reflection mechanisms. 1. The overall framework more closely resembles a meticulously orchestrated multi-step prompt chain, rather than a genuine agent architecture featuring novel learning mechanisms or reasoning structures. Its contributions are predominantly reflected in prompt engineering and domain-specific applications, rather than in innovations at the foundational model or algorithmic level. 2. There is no experiment or discussion comparing medical multi-agent frameworks such as MedAgents (Tang et al., 2023) and MDAgents (Kim et al., 2024) or voting-based non-agentic approaches. 3. All evaluations are conducted on a single dataset (ClinicalLab, 1,500 samples). Broader generalization, transfer to other TCM datasets, or wider clinical use cases (other diseases, Western clinical benchmarks, etc.) are not tested. 4. The case study only presents the final perfect outcome, but fails to reveal the core mechanisms claimed by the paper, such as the debate process between the Aggressive and Conservative debaters, how parallel analysis among multiple agents is integrated, and how the mechanism of reflection operates. 1. How do parallel encoding and multi-agent debate affect latency and resource consumption, especially as the number of agents or rounds scales? 2. While the reflection and retrieval enhancements are empirically shown to reduce error/contraindication, the analysis of their practical limitations is superficial. For example, when historical traces are sparse or knowledge retrieval is noisy, what is the system's failure mode? There is no in-depth case study or error analysis, and failure cases are not qualitatively explored. Moderately AI-edited
TCMAgent: A Multi-Agent Framework for General Traditional Chinese Medicine Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The work proposes a multi-agent design specifically for Traditional Chinese Medicine clinical decision-making. The system distributes reasoning across specialized agents for data analysis, diagnosis, and treatment deliberation, followed by a reflection phase to refine reasoning. The authors claim that this agent-based workflow better handles multi-modal patient information and conflicting therapeutic principles in TCM. Experimental results on a multi-modal TCM dataset show performance improvements over monolithic LLM baselines in terms of safety, coherence, etc. - The presentation is clear. - The system design (Figure 1) is comprehensive. - The pipeline in Figure 1 and the motivation in the introduction (see below) seem to be generally applicable to many other clinical tasks as well. I fail to understand why the authors study Traditional Chinese Medicine (TCM) specifically. ``` As a medical system practiced for millennia, TCM's efficacy hinges on holistic diagnosis derived from heterogeneous patient data (Yue et al., 2024b; Ma et al., 2021; Wang et al., 2023). Its global relevance, especially for chronic and complex conditions, underscores the need for computational frameworks capable of mastering this form of reasoning (Zhang et al., 2023; Zhuang et al., 2025). Yet the core cognitive tasks of TCM—synthesizing multi-modal evidence and deliberating over conflicting therapeutic principles—remain beyond the reach of conventional AI architectures (Zhang et al., 2025). ``` - While the entire system might be novel in that it's a specific workflow for analyzing information, each separate component is not novel, as they have been proposed or used in previous works. Therefore, I am not sure about the novel part of this work. This makes it read more like an engineering work. - The scope seems to be limited, as the work doesn't consider cases where there are missing data. Specifically, in ``` In this study, we leverage data from ClinicalLab (Yan et al., 2024) with their permissions, which contains 1,500 examples with features including patients' medical histories, laboratory examinations, physical examinations, imaging studies, demographic information, and pathological assessments. ``` What if there's no patient medical history data? - This is connected to the first point. In the experiments, the evaluation is a bit limited. It seems the pipeline, except for the task-specific knowledge part, can be applied to other clinical diagnosis processes. - How do the authors know the treatment is the only ground truth? In ``` We leverage LLM-AS-JUDGE (Gu et al., 2024) to evaluate outcomes due to data scarcity of evaluating agent framework in TCM domain. ``` It indicates that there's only one ground truth, perhaps from the recommended treatment from humans? See weaknesses. Fully human-written
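Several of the reviews above ask how the aggressive/conservative debate and the reflective judge actually interact. The sketch below is a schematic of such a debate-and-judge loop with a placeholder `ask` function; the prompts, roles, and stopping rule are illustrative assumptions, not TCMAgent's implementation.

```python
# Schematic of a debate-and-judge orchestration like the one the reviews discuss.
# `ask` is a stand-in for an LLM call; prompts, roles, and the fixed number of
# rounds are illustrative assumptions, not TCMAgent's actual design.
def ask(role: str, prompt: str) -> str:
    """Placeholder LLM call; replace with a real backend."""
    return f"[{role}] response to: {prompt[:40]}..."

def deliberate(case_summary: str, retrieved_knowledge: str, rounds: int = 2) -> str:
    transcript = []
    plan_a = ask("aggressive", f"Propose an assertive treatment plan.\n{case_summary}\n{retrieved_knowledge}")
    plan_c = ask("conservative", f"Propose a cautious treatment plan.\n{case_summary}\n{retrieved_knowledge}")
    for r in range(rounds):
        # Each debater critiques the other's current plan; critiques accumulate in the transcript.
        transcript.append(ask("aggressive", f"Round {r}: critique the cautious plan.\n{plan_c}"))
        transcript.append(ask("conservative", f"Round {r}: critique the assertive plan.\n{plan_a}"))
    # The judge weighs the debate (and, in the paper's description, past deliberation traces).
    verdict = ask("judge", "Weigh the debate and issue a final plan:\n" + "\n".join(transcript))
    return verdict

print(deliberate("45-year-old patient, chronic fatigue, pale tongue ...", "retrieved TCM knowledge ..."))
```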
DynaIP: Dynamic Image Prompt Adapter for Scalable Zero-shot Personalized Text-to-Image Generation Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes DynaIP, a training-free method for dynamically selecting visual prompts to enhance the out-of-distribution (OOD) generalization of frozen vision-language models (VLMs), such as CLIP. Rather than relying on static or handcrafted prompts, DynaIP selects relevant image patches from a diverse visual prompt pool using a lightweight policy model trained to maximize alignment with ground-truth labels. The method is designed to be plug-and-play and broadly applicable across different tasks and VLM architectures. 1. Strong OOD performance: The approach shows consistent performance gains across various OOD and compositional reasoning benchmarks. 2. Training-free adaptability: The method improves generalization at test time without requiring fine-tuning of the underlying VLM, making it easy to deploy. 3. Dynamic prompt selection: Unlike static prompt methods, DynaIP adapts to each test example by selecting the most relevant prompt patches, enhancing flexibility. 1. Complexity of policy training: While inference is training-free, the policy model itself must be trained in advance, and the training procedure is not fully detailed. This raises potential concerns regarding reproducibility and scalability. 2. Limited theoretical grounding: The paper lacks a deeper theoretical explanation for why dynamic patch selection improves generalization, especially under significant domain shifts. 1. Policy Training Details: it would be good to provide more specifics about how the policy is trained. For example: What is the exact reward function used? How sensitive is the method to the choice of policy architecture or optimization hyperparameters? How much training data is needed to train the policy effectively? 2. Prompt Pool Construction: How is the visual prompt pool curated? Is the diversity of the prompt pool critical to performance? 3. Failure Cases: Are there particular tasks or datasets where DynaIP fails or underperforms compared to static methods? It would be helpful to include a short discussion of failure modes or limitations in the results section. Fully AI-generated
DynaIP: Dynamic Image Prompt Adapter for Scalable Zero-shot Personalized Text-to-Image Generation Soundness: 2: fair Presentation: 4: excellent Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper DynaIP proposes a plug-and-play adapter for scalable zero-shot personalized text-to-image generation. It introduces a Dynamic Decoupling Strategy to disentangle concept-specific from concept-agnostic information in multimodal diffusion transformers, improving the balance between concept preservation and prompt following, and enabling multi-subject generation without retraining. Additionally, a Hierarchical Mixture-of-Experts Feature Fusion Module leverages multi-level CLIP features to enhance fine-grained visual fidelity and allow flexible control over visual granularity. Extensive experiments show that DynaIP outperforms prior methods in both single- and multi-subject personalization tasks. 1. The comparative experiments are comprehensive, including extensive comparisons with a wide range of related methods, which demonstrates the robustness and effectiveness of the proposed approach. 2. The proposed method is concise and easy to implement, making it friendly to real-world applications. 1. In L264, the paper states that “In this way, the noisy image branch focuses on capturing the concept-specific information of the reference image, such as the subject’s ID and unique appearance, while the text branch specializes in learning the concept-agnostic information like posture, perspective, and illumination.” However, no experimental evidence is provided to substantiate this claim, which also appears inconsistent with intuition. 2. The prompts employed by FLUX.1 Kontext Dev and Qwen-Image-Edit adopt an instruction-based format, making the comparison in the paper potentially unfair. A fair comparison would require fine-tuning all models on the same dataset. 3. The study [Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices] has demonstrated that multi-layer visual feature fusion outperforms the use of only the final-layer features and has explored multiple fusion strategies. The proposed HMoE-FFM is highly similar to that work and may not constitute a fundamentally novel contribution, yet the study is not cited in the paper. 1. During the inference stage, only the image tokens are involved in cross-attention. How, then, does the model decide, based on the prompt, whether to use the content or the style from the reference image? Fully human-written
DynaIP: Dynamic Image Prompt Adapter for Scalable Zero-shot Personalized Text-to-Image Generation Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper argues that current methods for personalized text-to-image (PT2I) generation face critical challenges, including the difficulty of balancing concept preservation (CP) and prompt following (PF), the loss of fine-grained visual details from reference images, and limited scalability to multi-subject personalization in a zero-shot setting. To address these issues, the paper proposes DynaIP, a plug-and-play image prompt adapter for multimodal diffusion transformers (MM-DiT). DynaIP introduces a Dynamic Decoupling Strategy (DDS) to disentangle concept-specific and concept-agnostic features during inference, thereby improving the CP-PF trade-off and enabling robust multi-subject composition. Additionally, it incorporates a Hierarchical Mixture-of-Experts Feature Fusion Module (HMoE-FFM) that adaptively leverages multi-level CLIP features to preserve fine-grained visual details while maintaining semantic consistency. Experiments demonstrate that DynaIP achieves state-of-the-art performance in both single- and multi-subject PT2I tasks—despite training only on single-subject data. - *Achieves strong performance.* The methods proposed in this paper effectively address the key challenges faced by current approaches to personalized text-to-image (PT2I) generation, and their efficacy is convincingly demonstrated through extensive qualitative examples and comprehensive quantitative evaluations. - *Comprehensive comparisons.* The selection of methods for comparison is thorough and well-considered, encompassing a broad spectrum of both open-source and closed-source state-of-the-art approaches. - *Clear exposition.* The paper clearly articulates the key challenges currently facing the personalized text-to-image (PT2I) generation field and its approach to addressing them. - *Limited post-hoc analysis of key components.* The paper lacks in-depth post-hoc analysis of the proposed Dynamic Decoupling Strategy (DDS) and Hierarchical Mixture-of-Experts Feature Fusion Module (HMoE-FFM). For instance, the decoupling effect of DDS could be visualized through attention maps, and the $w_l$ in Eq.~(7) of HMoE-FFM could be analyzed across diverse cases to reveal how granularity control is adaptively achieved. Such analyses would significantly enhance the interpretability and credibility of the proposed method. - *The prompt may matter considerably in HMoE-FFM.* For example, replacing the prompt in Fig. 1(b) with ‘A photograph of an old wooden bridge over a tranquil pond, rendered in melancholy tones’—which emphasizes the *color palette* and *mood* of the reference image—would likely require the model to prioritize high-level semantic attributes over low-level textural details, potentially leading to different $w_l$ distributions. - *User’s flexible control may be impractical.* In Fig. 1(b), the prompt specifies ‘in Impressionist swirls,’ and the fine-grained result indeed better satisfies this stylistic requirement. This outcome likely stems from the weight allocation $w_l$ computed by the HMoE-FFM. 
However, rather than leaving the choice of granularity to user customization, the system should arguably infer and apply the optimal set of $w_l$ automatically—i.e., directly output the configuration that yields the best visual fidelity and prompt alignment without requiring manual intervention. - Although the paper claims the generalizability of the proposed method to other models with different sizes or architectures (Lines 327–328), it does not provide compelling empirical evidence to substantiate this assertion. - Section B.5 (User Study) includes an incorrect reference to Table 1 (presumably Table 3 is intended). - In the HMoE-FFM, features from CLIP layers 10 and 17 are selected as inputs for the low- and mid-level expert networks, respectively. Are there additional ablation studies exploring alternative layer choices—for instance, using layer 9 (or other layers) for the low-level expert? - In the Dynamic Decoupling Strategy (DDS), what would be the impact on disentanglement performance if, during inference, the same reference image token interaction mechanism used during training were retained? Lightly AI-edited
DynaIP: Dynamic Image Prompt Adapter for Scalable Zero-shot Personalized Text-to-Image Generation Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces DynaIP, a novel Dynamic Image Prompt Adapter designed to enhance the capabilities of state-of-the-art Text-to-Image (T2I) multimodal diffusion transformers (MM-DiT) for Personalized Text-to-Image (PT2I) generation. DynaIP addresses three key challenges: balancing Concept Preservation (CP) and Prompt Following (PF), retaining fine-grained concept details, and extending single-subject personalization to multi-subject scenarios. The proposed solution leverages a Dynamic Decoupling Strategy (DDS) to separate concept-specific from concept-agnostic information and a Hierarchical Mixture-of-Experts Feature Fusion Module (HMoE-FFM) to effectively utilize multi-level CLIP features. Extensive experiments demonstrate DynaIP's superior performance in both single- and multi-subject personalization tasks. 1. The proposed method is technically sound and effective for personalized image generation. The DDS effectively disentangles concept-specific and concept-agnostic information, leading to a better balance between CP and PF, and the HMoE-FFM module utilizes multi-level CLIP features, providing flexible control over visual granularity. 2. Comprehensive Experiments: The paper includes extensive experiments on both single- and multi-subject datasets, demonstrating the effectiveness of DynaIP across various scenarios. 3. DynaIP can be integrated with various downstream applications, including ControlNet-like generation and regional generation. 4. The overall writing is clear and easy to follow, and the presented figures demonstrate the motivation, model framework, and qualitative results. 1. The evaluation details, including the system prompts and the detailed metrics of Table 1, should be provided. To me, the multi-subject prompt following (PF) result of 0.997 is not convincing, as this means nearly all user instructions are perfectly rendered by the proposed method. It would be better to include more details to clarify this. Additionally, the reliance on a single Vision-Language Model (VLM) for evaluation could still introduce biases or limitations. 2. In MM-DiT style models, after the multimodal text and visual branches, the text hidden states and visual hidden states are concatenated and full attention is performed over them. Yet the authors claim that MM-DiT exhibits decoupled learning behavior in which the noisy image branch captures the concept-specific information while the text branch learns the concept-agnostic information; is there any qualitative or quantitative evidence to illustrate this phenomenon? Also, add corresponding references if any. 3. In my opinion, some of the presentation of Sec. 2 and Sec. 5.4 should be included in the main paper. 4. Qwen-Image and SD3 are also based on MM-DiT architectures; it would be better to perform evaluation on these models to demonstrate generalization. My major concern is the evaluation results, so I give a score of 4 for the current version. I would consider raising my score if the authors could address my concerns, especially regarding the evaluation results and details. Fully human-written
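To make the joint-attention pattern questioned in the review above concrete, the following is a minimal sketch of how MM-DiT-style blocks mix the two token streams: text and image hidden states are concatenated and a single full self-attention runs over the combined sequence, after which the streams are split back. The module, dimensions, and tensors here are illustrative assumptions, not code from DynaIP or any specific MM-DiT implementation; the off-diagonal attention blocks are what a decoupling analysis via attention maps would need to inspect.

```python
# Sketch of joint attention over concatenated text and image tokens (assumed shapes).
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

B, n_txt, n_img = 2, 16, 256
text_h = torch.randn(B, n_txt, d_model)   # text-branch hidden states
img_h = torch.randn(B, n_img, d_model)    # noisy-image-branch hidden states

joint = torch.cat([text_h, img_h], dim=1)        # (B, n_txt + n_img, d_model)
joint_out, attn_w = attn(joint, joint, joint)    # full attention over both streams
text_out, img_out = joint_out.split([n_txt, n_img], dim=1)

# attn_w has shape (B, n_txt + n_img, n_txt + n_img); its off-diagonal blocks
# (text-to-image and image-to-text) carry the cross-modal mixing, so any claimed
# decoupling between the branches would show up in how mass is distributed there.
```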
Towards a more Holistic Evaluation of Object-Centric Learning Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper critiques the limited evaluation paradigms in object-centric learning (OCL), arguing that standard metrics like mean IoU or object discovery fail to assess what properties are actually represented in the learned slots. The authors propose a new evaluation framework using vision–language models (VLMs) built via visual instruction tuning (LLaVA-style) where object-centric encoders provide visual tokens to the LLM. They introduce a new metric, Attribution-aware Grounded Accuracy (AwGA), which jointly measures “what” (object attributes) and “where” (localisation) by computing grounded accuracy weighted by slot-level attribution maps. They benchmark multiple OCL methods (e.g., DINOSAUR, FT-DINOSAUR, StableLSD, SPOT) on VQA, counterfactual reasoning, OOD generalisation, and compositional reasoning. They also propose a new baseline, mFRESA, which adds multiple reconstruction targets (pixels, DINO features, HOG features) and shows improved performance on both VQA and AwGA metrics. - The paper convincingly highlights a major gap in OCL evaluation: existing metrics are narrow and don’t reflect broader reasoning goals (OOD, compositionality, counterfactual reasoning). - The distinction between Type 1 and Type 2 inconsistencies (localisation vs. redundancy) is well thought-out and nicely visualised. - Novel metric (AwGA). A sound attempt to unify localisation and representation evaluation, addressing known shortcomings of mIoU and VQA-only accuracy. - Evaluates a wide range of OCL methods across diverse tasks, providing a valuable comparative baseline for the field. - Simple, interpretable baseline (mFRESA). The multi-feature reconstruction approach is practical and demonstrates that better feature-level reconstruction helps representation quality. * From my understanding, the proposed framework evaluates through a language-mediated bottleneck, meaning performance is confounded by how the LLM integrates slots via cross-attention rather than by intrinsic slot quality. For instance, two encoders with very different slot semantics could yield similar VQA results if the LLM adapts its attention weights effectively. * Thus, the claim that this is a holistic evaluation is weakened by the absence of direct geometric or retrieval-based analyses (e.g., nearest-neighbour retrieval between slots, clustering consistency, object-level contrastive evaluation). * There seems to be an overreliance on the LLaVA-style architecture. The improvements might partly come from instruction tuning and alignment quality of the VLM, not necessarily from the OCL model itself. There’s limited analysis of how much the connector or instruction tuning dominates the results. 1. How sensitive are the results to the choice of LLM and connector architecture? 2. Could AwGA be applied without ground-truth masks, e.g., via attention supervision or self-generated masks? Otherwise its scalability remains limited. 3. How much do you believe instruction-tuning data leakage affects fairness? Many LLaVA datasets contain COCO images, which overlap with OCL training data. 4. 
Would instance retrieval or clustering consistency across views yield the same model rankings as AwGA? This could validate whether AwGA captures true representational quality. The result will not be exact, but one would expect that a good representation captures both the what and the where of the object in the scene. 5. For mFRESA: do the additional reconstruction heads improve binding (object separation) or attribute encoding? A disentanglement or slot purity analysis would clarify this. Fully AI-generated
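As a rough illustration of the joint "what"/"where" measurement discussed in the review above, the sketch below scores each question by its answer correctness weighted by the IoU between the most-attributed slot's mask and the ground-truth object mask. This is a schematic under assumed inputs (binary slot masks, per-slot attribution scores, ground-truth masks); it is not the paper's exact AwGA definition or its Top-K weighting.

```python
# Schematic attribution-weighted accuracy: "what" (answer correctness) gated by
# "where" (localisation of the slot the answer is attributed to). Assumed inputs.
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def attribution_weighted_accuracy(correct, slot_attributions, slot_masks, gt_masks, thr=0.5):
    """correct: list[bool] per question; slot_attributions: per-question array of
    per-slot scores; slot_masks: per-question (K, H, W) masks; gt_masks: (H, W) masks."""
    scores = []
    for ok, attr, masks, gt in zip(correct, slot_attributions, slot_masks, gt_masks):
        k = int(np.argmax(attr))               # slot the answer is attributed to
        loc = iou(masks[k] > thr, gt > thr)    # does that slot actually cover the object?
        scores.append(float(ok) * loc)         # correctness weighted by localisation
    return float(np.mean(scores))
```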
Towards a more Holistic Evaluation of Object-Centric Learning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper presents a new evaluation framework for object-centric learning using vision-language models. The idea is interesting and addresses a real gap in current OCL evaluation methods. The proposed metric and experiments are convincing. However, the paper did not include the necessary data on computational requirements. Overall, it is a solid and meaningful contribution. - originality: 3/5, - quality: 3/5, - clarity: 3/5, - significance: 4/5. The four takeaways are very valuable and worth considering for all future OCL research. W1 --- Line 073-076: > Specifically, we employ the visual instruction tuning method of Liu et al. (2023), which modifies an LLM into a vision-language model (VLM). We use object centric models as the vision encoders, enabling us to evaluate OCL methods through visual question answering (VQA) via the VLM. The proposed method is valuable and effective. However, the visual instruction tuning required in the authors' evaluation framework poses very high computational demands, both in space and time. This even makes it impractical for most researchers, who have limited computation resources. ### W1.1 Figure 2: Although stage 1 visual instruction tuning only requires tuning the "MLP Connector", inference of the LLM is still very expensive, let alone stage 2, which requires tuning the LLM. Therefore, it is necessary to provide detailed training-time space and time costs, as well as the number and models of GPUs used. W2 --- Figure 7. > Multi-feature reconstruction for slot attention (mFRESA). The authors claim to propose a novel learning/optimizing method for OCL models. However, the reconstruction of VFM (vision foundation model) features and the reconstruction of input pixels, except for HOG, have already been used in many existing OCL methods. Besides, the two extra reconstruction decoders multiply the training cost. Thus, the related cost data should also be provided. W3 --- Section 2 Related Work: Some important OCL advances are missing. Discussion should be provided on these OCL baselines and the ones chosen in the experiments. - SOLV: Self-supervised Object-Centric Learning for Videos - VQ-VFM-OCL: Vector-Quantized Vision Foundation Models for Object-Centric Learning - MetaSlot: MetaSlot: Break Through the Fixed Number of Slots in Object-Centric Learning - VideoSAUR: Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities - SlotContrast: Temporally Consistent Object-Centric Learning by Contrasting Slots ### W3.1 I am very interested in the effect of slot pruning techniques, like SOLV and MetaSlot, on the authors' experimental results. ### W3.2 OCL can be conducted on both images and videos. Thus, it is suggested to also provide some results on video-based vision-language tasks. The corresponding OCL methods could be SOLV, VideoSAUR, or SlotContrast. W4 --- According to Table 6, only seven slots are used, which is far fewer than the dense features used in the original VLMs. So it would make this work more complete if results with larger numbers of slots were provided, e.g., 16, 32, and 64. N.A. Fully human-written
Towards a more Holistic Evaluation of Object-Centric Learning Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper proposes a broader evaluation protocol for object-centric learning (OCL), arguing that standard object discovery metrics (e.g., segmentation quality) do not reflect downstream reasoning and robustness. It evaluates slot-based models by integrating them into a vision-language pipeline and testing them on diverse VQA-style, compositional, OOD, and counterfactual benchmarks. The authors also introduce AwGA, a metric that jointly measures answer correctness and whether the model used the correct object regions. They further present mFRESA, a multi-feature reconstruction variant of an object-centric model, which they claim improves both grounding and downstream robustness compared to prior slot-based methods. 1. The paper is well written, clear, and easy to follow. 2. The VLM-based evaluation pipeline for OC models enables a wide range of evaluations across perception, reasoning, robustness, and compositionality. 1. Some of the claims have already appeared in the literature, although under different experimental setups. Please see the questions below for more details. 1. As my main concern: several claims and takeaways in the paper seem to have already been shown in prior work, but with a different evaluation pipeline. For example, [1] has already analyzed (1) comparisons between foundation models and OC models in terms of raw VQA performance, and (2) the correlation between unsupervised object discovery and downstream reasoning performance. In addition, [2] analyzes compositional generalization of OC models compared to foundation models in a fully controlled setting, showing advantages for OC models. Given this, it seems that the main contributions of the current paper come down to (1) the VLM-based evaluation pipeline, (2) the introduction of mFRESA, and (3) the AwGA metric. I would appreciate it if the authors could elaborate on this. 2. In addition, I would like to see an ablation on the loss term weights for mFRESA to see how much each reconstruction target (e.g., pixels, DINOv2 features, HOG features) affects the downstream/upstream performance of the model. 3. To further enhance the results of the paper, I would also recommend showing the qualitative effectiveness of mFRESA compared to other methods, e.g., showing attribution maps for a few questions or (if possible) attention maps over slots from the LLM, to show which slots are actually being used for question answering. 4. In Section 4.1 (last paragraph), it is mentioned that on the compositional reasoning task of SugarCrepe, OC models lag behind foundation models like DINOv2. On the other hand, on similar (but synthetic) compositional generalization tasks, [2] shows that OC representations generalize better compositionally. Could you please elaborate on the differences between these two papers and why the trends seem reversed? 5. Minor typo: Line 147 defines S as a k-slot vector, but the indices are listed from 0 to k. Given all the above, I believe this paper is taking an important and necessary step towards better understanding the role of object-centric learning in the current era of foundation models, and I would recommend the acceptance of the paper. Fully human-written
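For the loss-weight ablation requested above, a multi-target reconstruction objective of the kind attributed to mFRESA can be written as a weighted sum of per-target reconstruction errors, with the weights being the ablation knobs. The sketch below is a generic formulation with placeholder decoders and targets (pixels, DINO features, HOG); it is an assumption-laden illustration, not the actual mFRESA implementation.

```python
# Generic multi-target reconstruction loss with per-target weights (assumed interfaces).
import torch
import torch.nn.functional as F

def multi_recon_loss(slots, decoders, targets, weights):
    """slots: (B, K, D) slot representations;
    decoders: dict name -> module mapping slots to a reconstruction of that target;
    targets: dict name -> ground-truth tensor (e.g. pixels / DINO features / HOG);
    weights: dict name -> float loss weight (the quantities an ablation would sweep)."""
    total = slots.new_zeros(())
    per_term = {}
    for name, dec in decoders.items():
        recon = dec(slots)                        # e.g. a spatial-broadcast decoder
        loss = F.mse_loss(recon, targets[name])   # reconstruction error for this target
        per_term[name] = loss.detach()            # logged per-target for the ablation
        total = total + weights[name] * loss
    return total, per_term
```

Sweeping the entries of `weights` (including setting some to zero) is one straightforward way to attribute downstream gains to individual reconstruction targets.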
Towards a more Holistic Evaluation of Object-Centric Learning Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces a new framework for evaluating object-centric learning (OCL) models beyond traditional object discovery tasks. It argues that existing metrics such as linear probing or segmentation accuracy fail to capture higher-level reasoning abilities like compositionality, out-of-distribution robustness, and counterfactual reasoning. More specifically, the authors highlight that current evaluation methods suffer from Type 1 inconsistencies (when a model predicts object properties correctly but localizes them poorly) and Type 2 inconsistencies (when multiple slots redundantly encode the same object, causing fragmented representations). To address these issues, the authors propose a Vision-Language Model (VLM)-based evaluation protocol, where OCL models act as vision encoders in visual question answering tasks, enabling multi-dimensional assessment. They also introduce a unified metric, Attribution-Aware Grounded Accuracy (AwGA), which jointly measures the “what” (object properties) and “where” (localization) aspects of object representations. Finally, they present mFRESA, a simple multi-feature reconstruction OCL baseline that outperforms existing methods across several holistic evaluation benchmarks. 1. Addresses a critical gap in OCL evaluation. The paper tackles the limitations of current evaluation schemes that focus mainly on object discovery and fail to assess higher-level reasoning such as compositionality, counterfactual reasoning, and out-of-distribution (OOD) generalization. It also clearly identifies Type 1 and Type 2 inconsistencies, motivating the need for a more holistic framework. 2. Integration of VLMs as evaluators. The authors propose using Vision-Language Models (VLMs) for evaluating OCL by treating OCL models as vision encoders connected to an LLM through a small MLP mapping. This design enables multi-dimensional assessment through visual question-answering tasks. 3. Unified evaluation metric (AwGA). The Attribution-Aware Grounded Accuracy (AwGA) metric effectively combines “what” (object properties) and “where” (localization) into a single measure and shows strong correlation with both, helping address the identified Type 1 and Type 2 issues. 4. Clear and professional presentation. The paper is clearly written and well-structured, with informative figures and detailed experimental descriptions that enhance readability and reproducibility. 5. Practical baseline improvement (mFRESA). The proposed mFRESA model, which uses multiple reconstruction targets, provides a simple yet effective improvement over existing OCL methods and supports the paper’s empirical claims. 1. Evaluation may not isolate OCL representation quality (most critical). Because the proposed framework embeds OCL encoders within a large Vision-Language Model (VLM), much of the reasoning can be performed by the language model itself. This makes it unclear whether the evaluation truly measures the representational quality of the OCL model or simply the VLM’s doing the heavy lifting. 
While linear probing is limited in scope, its simplicity ensures that it directly reflects representation quality; in contrast, using a complex multimodal pipeline introduces confounding factors that weaken interpretability. 2. Limited analysis and validation of the AwGA metric (highly important). The Attribution-Aware Grounded Accuracy (AwGA) metric depends on a Top-K attribution step that can still include redundant slots and may not fully resolve Type-2 inconsistencies. The paper provides no ablation or sensitivity study on K, leaving questions about how stable or fair the metric is across different settings. 3. Heavy dependence on pretrained components. Results may partly reflect the capabilities of large pretrained models (LLMs and feature encoders) rather than the OCL methods themselves. 4. Overstated causal claims. The evaluation benchmarks counterfactual question answering rather than true causal reasoning, so causal interpretability might remain unproven. 5. High computational cost. As mentioned in the paper, the proposed framework requires multi-stage fine-tuning of large models, which reduces scalability compared with simpler baselines like linear probes. 1. How do the authors ensure that the VLM-based framework measures OCL representation quality rather than the LLM’s reasoning ability? 2. How sensitive is the AwGA metric to the choice of Top-K, and could redundancy among slots affect its reliability? 3. Why is the proposed complex evaluation preferable to simpler linear probes that directly assess representation quality? Heavily AI-edited
Gen-DFL: Decision-Focused Generative Learning for Robust Decision Making Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes Gen-DFL, a “Decision-Focused Generative Learning” framework for robust decision-making under uncertainty. It extends decision-focused learning (DFL) by replacing deterministic point predictions with a conditional generative model (e.g., conditional normalizing flow) that captures the conditional distribution $p_\theta(c|x)$ of optimization parameters. The framework combines this with a CVaR-based risk-sensitive optimization objective, forming a generate-then-optimize (GTO) paradigm. Theoretical analysis provides regret bounds suggesting that Gen-DFL performs better than traditional DFL when variance, dimensionality, or nonlinearity increases. The topic is timely and relevant, as robust decision-making under uncertainty is a rapidly growing area at the intersection of optimization, learning, and generative modeling. The proposed framework is general and conceptually interesting—it integrates generative modeling and decision-focused optimization in a unified formulation that, in principle, could be applied across a wide range of uncertain decision-making problems. The problem setup is clearly motivated, and the generate–then–optimize structure is intuitively appealing. - Limited conceptual novelty: The central idea—modeling parameter uncertainty via a generative model and optimizing CVaR—is highly similar to existing work in distributionally robust DFL and end-to-end conditional robust optimization (E2E-CRO). The paper does not clearly articulate how Gen-DFL differs from or improves upon these prior methods. - Mathematical presentation is confusing: As noted in the comments, notation is inconsistent ($p_\theta(c|x)$ vs. $q(c|x)$ in equation (7)); $w^\star$ is used even though it is inaccessible; and several definitions (e.g., for CVaR-based regret) are unclear. Some derivations are incomplete or informal, lacking well-defined assumptions and proof steps; and several mathematical objects (e.g., $R(x)$ in Theorem 5.4) are never defined. - Lack of explanation for the surrogate loss function: A central component of the algorithm—the surrogate loss function used for training—is introduced without any explanation, justification, or derivation. Since the entire algorithm builds upon this surrogate, the omission makes the methodology difficult to follow and gives an impression of a lack of transparency, as if the authors are glossing over key technical details. This severely hurts the paper’s clarity and credibility. - What is the relationship between $p_\theta(c|x)$ and $q(c|x)$? Are they referring to the same thing? - Since $w^\star$ is not observable in real-world decision-making, how is regret approximated during training and evaluation? - Regarding the regret definition: why is the CVaR positioned outside the regret difference rather than inside it in equation (5)? Note that these two quantities can be different. Fully AI-generated
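As background for the CVaR-based generate-then-optimize formulation summarized above, the following is a minimal sketch of how a candidate decision can be scored by a sample-average CVaR over costs drawn from a learned conditional distribution. The sampler, the cost function, and the tail-level convention (average of the worst alpha-fraction of samples) are assumptions of this sketch, not the paper's exact equations (5) and (7).

```python
# Sample-average CVaR of a decision's cost under generated scenarios (assumed setup).
import numpy as np

def cvar_of_decision(sample_costs, alpha=0.1):
    """sample_costs: array of f(w, c_i) for scenarios c_i drawn from p_theta(c | x);
    returns the average of the worst ceil(alpha * m) sampled costs."""
    m = len(sample_costs)
    k = max(1, int(np.ceil(alpha * m)))
    worst = np.sort(sample_costs)[-k:]   # highest costs form the alpha-tail
    return float(np.mean(worst))

# Usage (schematic): draw m samples c_i ~ p_theta(c|x), evaluate f(w, c_i) for each
# candidate decision w, and pick the w minimizing cvar_of_decision(costs_for_w).
```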
Gen-DFL: Decision-Focused Generative Learning for Robust Decision Making Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes GEN-DFL, a decision-focused learning approach for contextual stochastic optimization problems that uses generative models (e.g., normalizing flows) to generate samples that are fed into a sample average approximation model used to optimize a CVaR objective. In particular, in analogy to classical DFL approaches, they propose to train the generative (prediction) model using a surrogate decision-oriented loss function in order to obtain “decision-focused” samples that minimize a CVaR regret. The paper proposes a contrastive surrogate loss function for this setting, and it provides a theoretical analysis that provides an error bound for the surrogate loss function and a quantification of the improvement of the regret obtained by their model in contrast to classical (single-point-forecast-based) DFL approaches. In a set of computational experiments with classical predict-then-optimize, DFL, and a robust DFL approach as baselines, it is shown that the proposed approach yields much better results in terms of a CVaR-based regret. - The paper addresses a critical weakness of classical decision-focused learning approaches, namely the fact that the optimization models rely on a single point forecast, and addresses that weakness by using an SAA-based approach to compute a CVaR objective. - The paper provides a surrogate loss function that is then usable in a classical end-to-end learning pipeline, and quantifies the error introduced by the surrogate loss. - The paper ignores the fact that there exist many papers that deal with contextual stochastic optimization, see e.g. the survey paper Sadaba et al. (2025) (A survey of contextual optimization methods for decision-making under uncertainty, European Journal of Operational Research, https://doi.org/10.1016/j.ejor.2024.03.020), many of them using SAA-based approaches that rely on generating samples from conditional distributions that are generated by machine learning models. - Given those approaches, for me, the big (and obvious) question is: What is the benefit of training the (probabilistic) machine learning models used for generating the samples with a decision-oriented loss function instead of using classical loss functions? In other words: What is the benefit of step 2 (Model learning) in section 4.2 of the paper? This question is not answered in the computational experiments, and this is, in my opinion, the big weakness of the paper. If there is not a big benefit, why should the reader bother with the expensive DFL pipeline? Personally, I do not think that the task-specific loss will bring much of an advantage, but I would love to see results from experiments that show the contrary. Also, I would like to emphasize that in classical DFL, it is common to compare “predict-then-optimize” (with classically trained prediction models) to “predict-and-optimize” aka DFL. This is why I find it strange not to compare “generate-then-optimize” (for which there is a lot of literature, see section 4 of the abovementioned survey) to “generate-and-optimize” (which is obtained by using a decision-oriented loss function). 
- In my opinion, the experiments are also flawed in a different way, as the approaches use different objective functions. This is similar to a setting in which one solves a classical stochastic optimization problem with an expected value objective, and another with a CVaR objective, and then compares their performance in terms of CVaR; it would come as no surprise that the solution of the first performs much worse. - The paper does not mention approaches such as the one in Kallus and Mao (2022), Stochastic Optimization Forests, Management Science 69(4):1975-1994, which also uses a decision-focused approach to solve risk-averse objective functions involving CVaR, also using sample-based approximation objectives (this would also be a good benchmark approach). In its present state, the paper suffers from the weaknesses sketched above, namely neglecting a whole stream of relevant literature on contextual stochastic optimization and not comparing the decision-focused training of the generative model to a classically trained generative model using the same SAA-based CVaR loss function. If this decision-focused training is not significantly better, the main contributions of this paper are not significant enough to justify a publication at ICLR. - The most important question is: In your experiments, given a fixed number of samples, how does generate-then-optimize using a generative model trained in a classical way compare to your approach, which uses a surrogate decision loss for training? Fully human-written
Gen-DFL: Decision-Focused Generative Learning for Robust Decision Making Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper combines the advantages of decision-focused learning (DFL) and robust optimization (RO) to propose a decision-making framework robust to model uncertainty. The authors introduce a method that uses the conditional distribution learned by generative models instead of uncertainty sets to protect the decision-making against tail regions of the distribution. This results in optimizing a Conditional Value-at-Risk objective. The paper provides theoretical results on the loss difference using an approximation of the conditional distribution and on the regret gap between standard DFL and their method. They validate their methods on an energy-cost-aware scheduling problem and the COVID-19 resource allocation problem, along with three synthetic experiments. The paper proposes a method that addresses the faults of RO and DFL. I believe the introduction, related works, and preliminaries provide sufficient context for the paper. The method was robustly validated in experiments with a wide range of data sets and ablation studies. The method was compared against many baselines in RO and DFL. The empirical results make a strong and convincing case for using this method. Theorem 5.4, which is one of the two main theoretical contributions of the paper, studies the upper bound of $\mathbb{E}_x\left[|\Delta R(x)|\right]$ to characterize the factors that influence the performance gap between standard DFL and Gen-DFL. However, the authors analyze these factors to highlight the failure modes of Pred-DFL, which is reflected by $\mathbb{E}_x\left[\Delta R(x)\right]$ instead of $\mathbb{E}_x\left[|\Delta R(x)|\right]$. While the analysis of the performance gap is interesting, it doesn’t strengthen the case made for Gen-DFL. Additionally, it seems like the performance gap, as defined by the authors, should also have a distribution distance term between $p(c\mid x)$ and $q(c\mid x)$ (similar to the result in Theorem 5.1). It seems odd that the miscalibration of the generative model has no effect on the performance gap. If satisfactory upper and lower bounds for $\mathbb{E}_x\left[\Delta R(x)\right]$ are provided, I would be willing to increase my score. Minor weaknesses * The theorem statement of Theorem 5.4 isn’t the same as that of Theorem A.8. * In lines 344-345, the authors argue that the estimation error of the predictor in Pred-DFL grows at a certain rate, but I don’t think this is immediately obvious. It would’ve been nice if there were some explanation around that statement or a reference to the original result. * Why aren’t the RO baselines included in Table 1? * What data sets are used to produce Figures 3-7? * Why is the $\beta$ hyperparameter introduced in the experiments section? Why can’t $\gamma$ be varied in experiments instead of $\beta$? * What is this “auxiliary model” in line 259? Are there two models in Gen-DFL? Fully human-written
Gen-DFL: Decision-Focused Generative Learning for Robust Decision Making Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper tackles the problem of making decisions via predictive models. Specifically, this paper focuses on the end-to-end regime of decision-focused learning (DFL), which has been explored in recent years. For DFL, the model often takes the form of a bilevel structure with a predictive model (for the cost vector) and an inner minimization of a downstream expected objective. This paper defines gen-DFL, which, instead of a point predictor, uses a generative model that predicts a distribution over the cost vector. This results in improvements in high-dimensional settings over baselines, and the authors provide theoretical bounds on performance. It has clear empirical improvements and clear theory on generalization bounds and provides an interesting alternative method to the other formulations of DFL. The justification of DFL versus point-wise robust methods is not theoretically clear. First, while they provide gaps between the point-prediction methods of pred-DFL and gen-DFL, it is unclear what the $\Delta R$ term really means. The differing definitions of regret for gen-DFL and pred-DFL make sense, as one is a regret realized by a single cost value and the other is over a distribution of realizations. Taking the difference between these two terms is therefore what seems a little strange. Later, it seems that the definition of $\Delta R$ in Appendix A.8 uses the same CVaR regret for both but doesn’t use the defined pred-DFL regret. Thus, the $\Delta R$ term is confusing and some clarification would help here. Furthermore, the purpose of this bound is a little unclear. If the motive is to demonstrate theoretical improvements of the gen-DFL method over pred-DFL-type methods, why is it an upper bound on the absolute residual gap? Shouldn’t it instead be a lower-bound-type result on the gap without the absolute value, such that the regret of pred-DFL methods is at least some quantity larger than the regret of gen-DFL? So while the upper bound is valid, it doesn’t necessarily distinguish the two methods, and perhaps a corresponding lower bound would help. Lastly, while there are many solid synthetic experiments, there are only two real-world ones, and on one of them diff-DRO outperforms gen-DFL (admittedly by a small amount). It would be nice to include another such experiment in which the empirical performance of gen-DFL on real data is demonstrated. Additionally, the results on these real-world experiments seem lacking. Could the computational overhead of gen-DFL be problematic under settings that require retraining under distribution shift, perhaps? Is there an equivalence in solutions between the "right size" of an uncertainty set for pred-DFL with robust optimization and the full tail distribution of gen-DFL? Fully human-written
GAMBIT: A Graph-structured and Decision-Aware Benchmark for MoBile GUI Tasks Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces GAMBIT, a new benchmark designed to evaluate mobile GUI agents in complex, graph-structured, and decision-aware task environments. The benchmark features 830 task episodes (totaling over 11,000 actions) across 35 Android and iOS apps, with a variety of task topologies including sequential, conjunctive, conditional, and hierarchical workflows. Tasks are annotated at both high and low levels, and the authors propose new evaluation metrics (weighted LCS and decision accuracy) to more effectively measure long-horizon progress and decision correctness. Extensive experiments on seven current agents demonstrate GAMBIT’s significant difficulty and diagnostic value, exposing key deficiencies in decision-aware and conditional reasoning under mobile GUI interaction. 1. GAMBIT advances the field by offering a benchmark that moves decisively beyond template-based and sequential task formulations, emphasizing branching, conditional, and hierarchical dependencies. The graph-based modeling of tasks expands coverage over prior GUI datasets. 2. The benchmark’s 830 graph-structured tasks, spanning multiple app categories, platforms (Android/iOS), and languages (English/Chinese), offer significantly more realistic complexity than previous datasets. Figure 1 and Figure 2 make the contrast between simple sequential templates and graph-structured tasks explicit, showing how real-world constraints and decisions are integrated. 1. The methodology and evaluation sections do not discuss or compare against several directly relevant benchmarks from the broader graph learning literature. Notably, there is no mention of GSLB (Li et al., 2023) or GC-Bench (Sun et al., 2024), which set standards for graph-structured benchmarks and could inform both task modeling and metric choices in GAMBIT. Incorporating a principled comparison or at least discussion of these would sharpen the benchmark's positioning and theoretical underpinnings. 2. While the weighted LCS and decision accuracy metrics are described at a high level (Section 3.6, Equation W-LCS), critical formal and implementation details are missing. For example, the exact procedure for weighting within branching workflows (i.e., how weights are propagated on variable-length branches, how conflicting actions in parallel branches are resolved) is not fully specified. For decision accuracy, the criteria for identifying branch points and their correct traversal could be formalized, perhaps as a function $D(\hat{\mathcal{T}}, \mathcal{G})$ given the prediction $\hat{\mathcal{T}}$ and gold graph $\mathcal{G}$. This lack of mathematical rigor makes reproducibility and external comparison more difficult and reduces trust in the fairness of new metrics. 3. While the empirical pipeline is thorough, the paper does not attempt to theoretically ground the choice of task structures, guard conditions, or evaluation metrics. For instance, it is unclear whether the four chosen graph topologies (Figure 2(b)) cover the space of real-world app workflows (or could be unified under a broader formalism). 
There is also no formal complexity analysis (e.g., task branching-factor distribution), nor a justification (proof or constructive argument) of metric sensitivity or discriminative power over prior best practices (EM, SR, GP). 4. Insufficient comparison/enumeration of alternative metrics: Although GAMBIT introduces new metrics, the paper under-delivers on a rationale for why the proposed design is preferable to other sequence/graph alignment metrics (e.g., edit distance, graph isomorphism, tree edit distance), or whether decision accuracy is robust to minor label annotation inconsistencies. This leaves open whether the evaluation suite truly advances the measurement of agent competence. 5. Table 4 and appendix tables reveal stark differences in per-action model performance (e.g., on Long Press and Complete/Stop). Yet there's little explanation or modeling of possible action ambiguity (multiple equivalent ways to complete a task), nor an attempt to quantify inter-annotator agreement on low-level/stepwise execution. 6. For tasks involving decisions or fallback paths (hierarchical/conditional), the sampling procedure for branches (e.g., negative/”IMPOSSIBLE” cases) is only cursorily discussed. How are ambiguous, unreachable, or failed-state paths annotated and incorporated into the analysis? Are rejected paths equally weighted when scoring W-LCS or decision accuracy? N/A Fully AI-generated
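To illustrate the kind of formalization the review above asks for, the sketch below computes a weighted longest-common-subsequence score between a predicted action trajectory and a gold trajectory: each matched gold action contributes its own weight, and the result is normalized by the total weight. The uniform default weights and the treatment of branches are placeholders; GAMBIT's actual W-LCS weighting may differ.

```python
# Generic weighted-LCS score over action sequences (weights assumed per gold action).
def weighted_lcs_score(pred_actions, gold_actions, gold_weights=None):
    n, m = len(pred_actions), len(gold_actions)
    w = gold_weights or [1.0] * m
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if pred_actions[i - 1] == gold_actions[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + w[j - 1]   # matched step earns its weight
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    total = sum(w)
    return dp[n][m] / total if total > 0 else 0.0

# Example: weighted_lcs_score(["open", "search", "tap"], ["open", "tap", "confirm"])
# returns 2/3 with uniform weights, since the common subsequence is ["open", "tap"].
```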
GAMBIT: A Graph-structured and Decision-Aware Benchmark for MoBile GUI Tasks Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes GAMBIT, a graph-structured and decision-aware benchmark designed for evaluating mobile GUI agents on long-horizon and complex tasks. It features diverse graph topologies, covers both Android and iOS platforms, and includes cross-application scenarios. It also proposes novel evaluation metrics beyond task success rate and step accuracy. - The focus on long-horizon, decision-aware complex tasks is well-motivated and addresses a clear gap in existing mobile GUI benchmarks. - Representing complex tasks using graph structures is innovative. - The benchmark is comprehensive, covering cross-platform (Android and iOS) and cross-app scenarios. - The benchmark remains offline and static. It is unclear how it handles evaluation scenarios where multiple valid action trajectories or paths exist for completing a task, which is common in real-world GUI interactions. - The experimental evaluation lacks results from state-of-the-art closed-source models (e.g., Claude, Gemini), limiting the analysis of actual task difficulty. - Many descriptions in the paper are unclear and ambiguous. For instance, the "dual-layer quality control" is mentioned in the paper but not elaborated on subsequently. - How does the benchmark's graph architecture account for tasks where multiple valid trajectories can lead to successful completion? Are all valid paths considered in the ground truth or metrics? - Could you please clarify what the "dual-layer" in "dual-layer quality control" means? - The benchmark is claimed to be representative of everyday usage but lacks justification, especially compared with other benchmarks like AndroidWorld [1] and SPA-Bench [2]. [1] Rawles, C., Clinckemaillie, S., Chang, Y., et al. AndroidWorld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024. [2] Chen, J., Yuen, D., Xie, B., et al. SPA-Bench: A comprehensive benchmark for smartphone agent evaluation. NeurIPS 2024 Workshop on Open-World Agents, 2024. Fully human-written
GAMBIT: A Graph-structured and Decision-Aware Benchmark for MoBile GUI Tasks Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces GAMBIT, a new benchmark designed to evaluate mobile GUI agents on long-horizon, decision-aware tasks across Android and iOS apps. Unlike existing datasets focused on short, linear workflows, GAMBIT includes over 800 task episodes with branching structures and provides dual-level annotations. The authors also propose decision-sensitive evaluation metrics, such as weighted LCS and branch accuracy, and evaluate 7 models, finding sharp performance drops in complex settings. GAMBIT reveals significant challenges in current agents’ ability to reason, plan, and adapt in realistic mobile scenarios. GAMBIT addresses a timely and underexplored challenge—evaluating mobile GUI agents on complex, decision-aware tasks—through the design of a well-constructed benchmark. GAMBIT is notable for its diversity (830 tasks across 35 apps), realistic graph-structured workflows, and dual-level annotations that support both fine-grained and high-level evaluation. The proposed metrics, particularly weighted LCS and decision accuracy, offer a more nuanced view of agent performance beyond traditional step-level success. The experimental evaluation is thorough, spanning seven competitive agents and revealing significant performance gaps that highlight the difficulty and diagnostic value of the benchmark. * Although the paper claims that code and data are hosted on an anonymous site, the linked site was blank. * The experimental results show that all evaluated agents perform poorly on complex decision-aware tasks, with success rates dropping below 5% on longer or branching workflows. While this highlights the benchmark's difficulty, the paper stops short of analyzing *why* agents fail. A more detailed breakdown—e.g., by reasoning failures, perceptual grounding issues, or instruction misinterpretation—would clarify which capabilities current models lack. Additionally, exploring whether training on a subset of the benchmark improves performance on held-out complex tasks would help assess whether these challenges are surmountable with current architectures. * Although the dataset includes a nontrivial portion of cross-app tasks (~12.5%), the paper does not explicitly analyze how agent performance varies between single-app and cross-app scenarios. Given the increased complexity of app-switching workflows, a focused breakdown would enhance understanding of where current models struggle and how to better design agents for multi-app interactions. * Given the extremely low success rates on longer or branching tasks, can the authors provide a more detailed analysis of failure cases? Specifically: * Are these failures more often due to reasoning errors, perception/grounding problems, or instruction misunderstanding? * Could a few concrete examples be shared to illustrate common error modes? * Have you considered analyzing errors by task topology (e.g., are hierarchical tasks disproportionately prone to early failure)? * Have you considered training or fine-tuning any models on a subset of GAMBIT, such as shorter or less-branching tasks, and evaluating on held-out complex ones? 
This would help clarify whether current models can improve with exposure, or whether structural or architectural innovations are required. * While GAMBIT includes a nontrivial portion of cross-app tasks, there appears to be no separate evaluation or discussion of how agents perform on these compared to single-app workflows. Could the authors report metrics stratified by cross-app vs. single-app scenarios to better understand the specific difficulties introduced by cross-context transitions? Fully AI-generated
GAMBIT: A Graph-structured and Decision-Aware Benchmark for MoBile GUI Tasks Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces GAMBIT, a new benchmark for evaluating mobile GUI agents (with more than 800 episodes). Unlike existing benchmarks that primarily focus on simple, linear tasks, GAMBIT presents complex, graph-structured challenges that involve decision-making and branching scenarios to better reflect real-world usage. 1. The paper is well-structured and clearly written, making it easy to follow. 2. The contribution of a new, open-source benchmark dataset is a key strength and a valuable resource for the community. 3. The paper provides a thorough experimental evaluation, including comprehensive comparisons with existing agents. 1. The figures in the paper are of low resolution and are difficult to read. 2. The core innovation appears to be more complex high-level instructions (i.e., adding more conditions and logical judgments), which may not be a substantial enough contribution. The authors need to better justify this. 1. Fundamental difference from existing datasets: For a "Conditional" or "Hierarchical" episode, are the underlying sequences of screenshots and step-level instructions still inherently sequential? If this is the case, the main distinction of GAMBIT seems to lie only in the complexity of the high-level instruction. Could a similar dataset be constructed by rewriting existing sequential datasets? For instance, by augmenting them with app-specific atomic instructions and constraints, one could transform sequential data into Conditional or Hierarchical tasks. The authors should add a discussion or experiments in the paper to address this point. 2. Unrealistic instructions: In real-world scenarios, users are unlikely to provide such long and detailed instructions. They tend to offer shorter constraints. The example in Figure 1 illustrates this well: a user is more likely to say, "Help me find the cheapest, non-smoking, pet-friendly accommodation in Amsterdam," rather than articulating the complex if-else logic shown in the blue box. The authors need to re-evaluate the plausibility of these high-level instructions. If they do not accurately reflect real user behavior, they should be considered for rewriting to be more naturalistic. 3. About atomic instructions: How many atomic instructions and constraints were generated for each application? This number directly impacts the diversity of the final dataset. If the variety and quantity of these building blocks are insufficient, the dataset's overall diversity will be limited. The authors should provide an analysis and present statistics on this. Lightly AI-edited
DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces DATASETRESEARCH, a benchmark for evaluating agent systems on demand-driven dataset discovery and synthesis. It curates 208 real-world dataset demands (from Hugging Face and Papers with Code) and pairs each with a reference dataset and reference metadata (MetaTriplets). Agents are assessed along three axes: (i) metadata alignment (o3-judged semantic similarity), (ii) few-shot performance (1/3/5-shot), and (iii) fine-tuning performance on LLaMA-3.1-8B, with scores normalized by the reference-data upper bound. Baselines span search agents, synthesis agents (o3-generated 500-sample datasets), and deep research agents. Results show a clear split: search excels on knowledge-based tasks, synthesis on reasoning-based tasks, while all methods struggle on the harder DatasetResearch-pro subset (best ≈0.22), highlighting substantial headroom for hybrid and more generalizable approaches. Turns “find or build the dataset that matches a natural-language demand” into a measurable benchmark with paired reference data and metadata, covering the full path from requirements to downstream utility. Combines intrinsic (metadata similarity) and extrinsic (few-shot and fine-tuned task performance) measures, with normalization that enables comparison across heterogeneous NLP tasks. Systematically contrasts search, synthesis, and deep research paradigms, revealing a knowledge vs. reasoning specialization and consistent failure on corner cases—useful guidance for designing future hybrid agents. - The same model family (o3) is used to generate reference metadata/demands, parse discovered data, and score alignment—inviting self-consistency bias rather than genuine agreement, and masking contamination through stylistic echoing. - Overreliance on closed-source systems (o3 for synthesis/judging; GPT-4o/Deep Research variants for search) undermines reproducibility, accessibility, and cost realism. Results may reflect vendor-specific capabilities rather than agent design quality. - No systematic comparisons with open retrieval stacks (e.g., BM25 + dense retrievers/ColBERT), open reasoning LMs (e.g., Llama-3.x-70B, Mistral-Large-Instruct, Qwen2.x/3-Instruct), or open toolformer/agent frameworks. - Despite broad claims, coverage is text-only across six NLP tasks; no CV/audio/tabular/time-series/multimodal demands; limited external validity. - The benchmark is closer to a controlled template-matching exercise that reproduces a known target than to open-world dataset discovery. In practice, the “task” is largely to find (or approximate) someone else’s already-curated dataset given a stylized natural-language demand. But in real settings, you usually don’t have such ready-made, perfectly matched datasets—you have to prospect, acquire, clean, align schemas, and handle licensing/privacy—so the setup falls well short of real-world data discovery. see weakness Fully AI-generated
DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper explores a novel agent application that utilizes LLMs to discover or synthesize datasets that meet specific user requirements. Its major contribution is the introduction of a dataset discovery benchmark of 208 types of demands covering knowledge-intensive and reasoning-intensive tasks. The paper hopes the benchmark and analysis can benefit the progress of self-improving AI systems. If well-justified, dataset discovery would be an interesting direction for LLM agents to explore. 1. The paper still needs in-depth justification of the motivation for dataset discovery demands. It is always intriguing to utilize LLM-based agents to explore different applications. However, the paper still lacks examples of practical use cases in which human users would employ dataset discovery agents. 2. The paper claims the dataset discovery benchmark reveals interesting demands related to knowledge-intensive or reasoning-intensive tasks. However, both task types have mature strategies for data exploration/building. For instance, RAG and search-agent techniques are widely used in knowledge-intensive tasks, and reasoning tasks involve data synthesis (e.g., WizardMath) and RL-related long-CoT and test-time scaling strategies. It is unclear how the proposed data discovery agent differs from these widely used existing methods in resolving related tasks. 3. The benchmark results in Table 2 seem incomplete. They miss multiple open-source models such as Qwen and DeepSeek, as well as other popular models such as the Gemini and Claude series. I would also be interested to see the performance of GPT-4.1 and GPT-5 models. None. Fully human-written
DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces a benchmark designed to evaluate AI agents' ability to autonomously discover or synthesize datasets given natural-language task requirements. The benchmark consists of 208 real-world NLP dataset demands, with reference datasets and metadata for objective comparison. Models are assessed via metadata alignment, few-shot performance, and fine-tuned downstream results. Experiments show a clear split: search-based agents excel at knowledge-oriented tasks, while synthesis-based models achieve superior performance on reasoning tasks. The work provides the first systematic evaluation pipeline for demand-driven data discovery. 1. First comprehensive framework targeting automated data discovery, a growing but under-studied problem. 2. Uses gated datasets + reference metadata, preventing leakage and reflecting real research workflows. 3. Combines metadata scoring, few-shot results, and fine-tuning, which is much richer than single-metric evaluation. 1. OpenAI o3 is used both to generate reference/discovered metadata and to judge metadata similarity. This creates a closed loop that may favor o3's own preferences rather than true task fit. 2. When starting from gated datasets, the benchmark prevents agents from downloading the ground-truth data. This structurally disadvantages search agents (vs. synthesis) and conflates "access policy" with "discovery ability." 3. Data scope is narrow: NLP-only and text-only. See weaknesses. Moderately AI-edited
DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper points out that relevant data forms a crucial bottleneck to advance AI models. The authors seek to answer whether “conventional data search” methods can be replaced with “AI-agent search.” To answer this question, they propose a new benchmark aimed at evaluating AI agents’ ability to discover and synthetically generate datasets. Sourcing data from Hugging Face and Papers with Code, the authors built a benchmark that automatically generates data demands split between knowledge-intensive and reasoning-intensive tasks. They use this benchmark to evaluate several leading models and conclude that current models do not perform well. The authors further show that “search agents” do better at knowledge tasks while “synthesis agents” outperform on reasoning tasks. Both agent types perform poorly on “corner cases.” [**significance**] The authors correctly identify data and data discovery as an important challenge to improving AI models. Efforts aimed at creating benchmarks designed to isolate capabilities useful for automating such challenges are an important endeavor. [**clarity**] - The paper (rightfully) emphasizes the importance of data to further advance AI models. Unfortunately, the problem specification is overly vague, entangling different challenges and data use cases. For example, the abstract mentions “countless valuable datasets [...] and domain platforms” (l13-14), but does not specify if these are hidden due to access constraints or limitations attributable to search algorithms. - The word “synthesis” is used several times in the introduction without defining its meaning. - The experimental setup misses many important details. [**quality**] - The related work section is severely lacking. For example, it appears to completely ignore decades of information retrieval research. The literature on synthetic data generation similarly lacks any discussion of core concepts like diversity, complexity, and quality. Also absent is the extensive literature on “retrieval augmented generation” (RAG) based systems. - The core contribution of this work is a new benchmark designed to simulate real-world data discovery. However, the methodology used to create this benchmark appears to have various questionable aspects (see questions). - The paper seemingly uses OpenAI’s o3 model for every step of the pipeline. This reviewer fears that any takeaways or analysis are therefore overly biased and do not necessarily generalize. - Reported metrics lack confidence intervals. - In Section 5.2, the authors write “we identify that [...] instruction-following capabilities” (l431-448). As OAI o3 is used to generate data, this can simply reflect existing data knowledge of o3, rather than any relation to retrieved or “discovered” data. This is an important confounding factor not accounted for in the empirical evaluation. [**significance**] The authors claim that their work provides “the foundation for the next generation of self-improving AI systems” (l29-30), which is a lofty claim that does not appear to be supported by empirical or theoretical evidence. Q1. Confusing notation: in “Given a natural language [...] the specified demand D” (l146-148), what is the subscript “d” in S_d?
Do the authors mean a “set” of datasets? This continues in l150-153, where now “r_i” is used without introduction. Q2. In Section 3.2, Step 1, the authors write that “gated” datasets are used to mitigate data leakage. What evaluation was performed to check against data leakage? Q3. In Section 3.2, Steps 2 and 3, a number of filtering steps are performed to narrow down the dataset candidates. If the goal of this challenge is to measure models on “realistic” conditions, these steps appear to strongly bias the remaining set towards an unrepresentative sample. Q4. In Section 3.2, Step 6-7: What were the rejection criteria to decide if a dataset was “unsuitable for fine-tuning” (l236), and what were the criteria used to check if the generated metadata and demand descriptions are faithful to the underlying data and ecologically valid? (l235-241) Q5. In Section 3.2, l250-260, the authors propose a binary classification of knowledge-based vs. reasoning-based tasks. Yet, to this reviewer, it appears that *many* queries require a combination of these two. Could you please provide the systematic rubrics used to annotate these tasks? Were any consistency checks performed, e.g., cross-annotator agreement? Q6. In 3.3 it is claimed that using OpenAI’s o3 model for scoring both reference and discovered metadata mitigates potential scoring biases (l296-297). This claim lacks evidence, e.g., are the reference and discovered metadata distributions similar? Is scoring consistent across different types of underlying data? Is scoring robust across multiple samples and/or prompts? Q7. The authors report a “Normalized Score” (l321), which finetunes a model on a reference dataset and uses this as the “theoretical maximum performance achievable” (l316-317). First, this assumes that a reference dataset contains both a train and test subset. Second, this reviewer sees no reason why combining one or multiple other datasets could not lead to better performance. For example, training a model on a challenging math dataset and evaluating it on a simpler reference dataset fits this scenario. As such, how does this differ from dividing $S_{\text{eval}}$ by an arbitrary fixed number, given that scores are now “on a scale from 0 to 1, or higher” (l322)? Q8. A “synthesis agent” uses OAI o3 to generate 500 data samples (l357): How? What criteria are used to evaluate these samples? Q9. How is finetuning done? Q10. Key experimental setup details explaining how the “deep research” systems of various providers were evaluated are missing. The text mentions manual actions: what were these? Suggestions: - typo: “evaluable” (l204) Fully human-written
SafeCoop: Unravelling Full Stack Safety in Agentic Cooperative Driving Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper studies full-stack safety for natural-language-based collaborative driving systems. It identifies four attack surfaces (connection disruption, relay/replay, content spoofing, and multi-connection forgery) and proposes SafeCoop, an agentic defense pipeline combining (i) semantic firewalling, (ii) language-perception consistency checking, and (iii) multi-source consensus with temporal checks. Evaluations in CARLA across 32 scenarios demonstrate that malicious language communication severely harms collaborative driving, and SafeCoop substantially recovers performance and detects adversarial agents, achieving up to ~69% driving score improvement and ~67% F1 detection. 1. Timely problem: Addresses emerging risks in language-based V2X collaboration, an important but under-explored area as driving LLMs become more capable. 2. Comprehensive threat model: The taxonomy spans channel-level and semantic attack vectors, grounding them in adversarial V2X literature. 3. Agentic defense pipeline: Novel multi-module agent approach (firewall, perception-language consistency, consensus), offering interpretability. 1. Novelty feels incremental. The paper mainly repackages known security concepts (trust-based filtering, majority consensus, temporal consistency) into an LLM-driving setting. While the agentic framing is interesting, the core mechanics resemble classical V2X trust scoring + consistency checks, raising questions about conceptual novelty. 2. Limited realism & scalability assumptions. The system assumes synchronous simulation, and perfect ego perception during consistency checks. Real V2X networks are asynchronous, lossy, and perception-noisy, which may reduce effectiveness. 3. High latency & unclear deployment feasibility. Even fast models are around 700ms and larger models exceed 3s latency, far above real-time driving constraints. The paper acknowledges this but does not meaningfully address deployment pathways. 4. How frequently is the defense executed? If the defense must run continuously or per-frame, the computational burden may be prohibitive; a one-time malicious-agent flag is insufficient for practical systems. Clarifying the defense invocation frequency and cumulative runtime overhead (e.g., cost per second of driving) would be important to judge real-world viability. 5. Formula Clarification. For Eq(2), why is the attacker constrained to modifying a single sender? Additionally, the expression for $D_j$ appears to use observation $o_i$ instead of $o_j$. Is it intentional or an indexing error? Please respond to each point in Weaknesses. Fully AI-generated
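For context on weakness 1 above, here is a minimal sketch of the kind of classical V2X-style baseline that combines trust scoring with consistency checks and multi-source consensus. It is purely illustrative and is not SafeCoop's pipeline; the distance scale and fusion weights are arbitrary assumptions.

```python
import numpy as np

def consistency(claim: np.ndarray, reference: np.ndarray, scale: float = 5.0) -> float:
    """Map the distance between a claimed object position and a reference estimate to (0, 1]."""
    return float(np.exp(-np.linalg.norm(claim - reference) / scale))

def trust_scores(claims: np.ndarray, ego_estimate: np.ndarray) -> np.ndarray:
    """claims: (num_senders, 2) claimed positions for the same object.
    Fuses (i) agreement with ego perception and (ii) agreement with a robust consensus."""
    ego_term = np.array([consistency(c, ego_estimate) for c in claims])
    consensus = np.median(claims, axis=0)              # robust multi-source consensus
    peer_term = np.array([consistency(c, consensus) for c in claims])
    return 0.5 * ego_term + 0.5 * peer_term            # simple fusion of the two checks

# Third sender spoofs a far-away position and receives a much lower trust score.
claims = np.array([[10.0, 2.0], [10.2, 2.1], [25.0, -3.0]])
print(trust_scores(claims, ego_estimate=np.array([10.1, 2.0])).round(2))
```

Comparing against this kind of cheap numeric baseline would help quantify how much the LLM-based semantic reasoning adds beyond classical trust scoring.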
SafeCoop: Unravelling Full Stack Safety in Agentic Cooperative Driving Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents the first systematic study of safety in natural-language-based collaborative autonomous driving, identifying four attack surfaces (Connection Disruption, Relay/Replay Interference, Content Spoofing, and Multi-Connection Forgery) and proposing SafeCoop, an agentic defense pipeline with three specialized agents. Evaluated on 32 CARLA scenarios, SafeCoop achieves 69.15% driving score improvement under attacks and 67.32% F1 score for malicious detection. + Novel and Timely Problem Formulation + Comprehensive Attack Taxonomy + Closed-loop and Thorough Evaluation Weaknesses: - Unclear relationship between dual objectives: The system outputs a trust score for each agent, but claims to address both "performance" and "anomaly detection" objectives. However, the system design only explicitly addresses anomaly detection through trust scoring and filtering. It remains unclear how performance is directly optimized, or whether the authors simply assume that anomaly detection will inherently improve performance. This assumption should be explicitly stated and justified. - Insufficient attack implementation details: While Section 3 and Appendix D provide a taxonomy of attacks, the paper lacks a detailed explanation of how attacks are actually designed and implemented in the evaluation. For reproducibility and proper assessment of the defense mechanisms, more specifics are needed on attack generation, particularly for the MLLM-based Content Spoofing attacks (e.g., prompt engineering strategies, attack success rates, stealthiness measures). Further, in Table 1, several adversarial scenarios (w/o defense) achieve better performance than the Benign (Non-collab) baseline across multiple metrics. This counterintuitive result raises questions about attack severity and undermines claims about the effectiveness of the defense. While Appendix G partially addresses this through the computation scaling phenomenon, this critical finding deserves prominent discussion in the main paper with a thorough analysis of what it implies for both attack potency and defense necessity. - Unexplained performance degradation in ablation study: Table 3 shows that adding LPC and MSC agents actually increases the Vehicle Collision (VC) score under CS+MCF attacks, which is concerning. The paper provides no investigation or explanation for this degradation. Could false positives in anomaly detection be causing the defense to filter legitimate messages, thereby reducing situational awareness and increasing collisions? Without presenting false positive/false negative rates and analyzing this phenomenon, the reliability of the defense pipeline remains questionable. How is "performance" explicitly optimized in the system design, or is it assumed to naturally follow from anomaly detection? Why do some adversarial scenarios (Table 1) outperform the non-collaborative baseline? What does this imply about attack severity? What causes LPC and MSC agents to increase vehicle collisions under CS+MCF attacks (Table 3)? Are false positives responsible?
Why does the Firewall Agent require MLLMs for semantic reasoning when it appears to only process JSON-formatted input? Lightly AI-edited
SafeCoop: Unravelling Full Stack Safety in Agentic Cooperative Driving Soundness: 3: good Presentation: 4: excellent Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents a comprehensive study on full-stack safety and security issues in natural-language-based collaborative driving. The authors develop a taxonomy covering both generic threats such as connection disruption and relay interference, and language-specific vulnerabilities like content spoofing. They introduce SafeCoop, a defense pipeline integrating a semantic firewall, language-perception consistency checks, and multi-source consensus. The experimental evaluation in CARLA simulations demonstrates the approach's effectiveness against adversarial scenarios. While the methodology is well-structured and the analysis thorough, the defense mechanisms primarily adapt existing natural language processing security techniques rather than introducing novel multi-agent security innovations. The paper's key contribution lies in systematically addressing security gaps for language-mediated vehicular cooperation, though it does not sufficiently explore emergent security challenges arising from multi-agent coordination dynamics in adversarial environments. This work provides valuable insights for robust defense design in natural language-based V2X systems despite the limited novelty in its core defense architecture. 1. It develops a well-structured taxonomy that effectively categorizes both generic threats and language-specific vulnerabilities. 2. The methodology is clearly presented with a thorough analysis of the proposed solution. 3. The work systematically addresses security gaps in language-mediated vehicular cooperation frameworks. 4. The SafeCoop defense pipeline offers practical insights for building robust natural language-based V2X systems. 1. Some of the proposed attacks, such as connection disruption and relay interference, are not specific to natural language-based collaborative driving systems. 2. The defense mechanisms primarily adapt existing natural language processing security techniques rather than introducing novel multi-agent security innovations. 3. The work fails to adequately highlight how the multi-agent perspective creates unique security considerations not addressed in standard NLP security approaches. 1. From a multi-agent perspective, what aspects of the proposed attacks and defenses most clearly distinguish this work from existing NLP security research? Specifically, how do the interactions between agents in collaborative perception introduce unique vulnerabilities that traditional NLP security approaches do not address? Does the SafeCoop framework effectively mitigate these through its semantic firewall language-perception consistency checks and multi-source consensus mechanisms? 2. What assumptions does the defense mechanism make regarding the ego vehicle's knowledge of the attacker's capabilities? How does the proposed defense perform against unknown or adaptive attacks? 3. How does the language-driven driving system meet the real-time requirements of autonomous vehicles? 
What is the computational overhead introduced by the additional defense components, like semantic firewall analysis and language-perception consistency verification, and could these impact critical driving decisions in time-sensitive scenarios? Fully AI-generated
SafeCoop: Unravelling Full Stack Safety in Agentic Cooperative Driving Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes using LLMs to safeguard natural language-based cooperation among vehicles. The paper summarizes the taxonomy of the attack surface, which includes Connection disruption, Relay interference, Content spoofing, and Sybil attack (multi-connection forgery). The proposed approach combines several LLM agents, each more effective in one, or a subgroup of, the defenses. Evaluation using CARLA shows defense performance using end-to-end metrics such as driving score, route completion, etc. The ablation study showcases different agents' capabilities, and the authors also compare the performance among multiple LLMs as the defense agents. + The paper explores cooperative driving security in the language domain. + The writing quality is above average, though the paper misses a lot of details and clarity + The evaluations are systematic and comprehensive - Missing details on the range/magnitude of the attacks being applied during the evaluation. How is each of the attacks staged? What is the range of the attack, e.g., how much temporal misalignment in relay/replay interference? What is being introduced in content spoofing? How many forged connections are there in the Sybil attack? These are important details to provide context to gauge the evaluation metrics. Depending on whether these staged attacks are trivial or sophisticated, the results can lead to entirely different conclusions. One way to show the sophistication is to test whether conventional defenses against these attacks can succeed, how the LLM compares against conventional methods, and whether using an LLM is overkill. - Missing details on how the LPC agent compares ego perception with received messages. Is it taking a multi-view image or lidar as input? What if the received message is outside the field of view of the perception module? - Figure 1, appearing before the abstract, could use additional legends and details to be self-explanatory. What do the numbers after CD, CS, and CS+MCF mean? Why does the MCF description point to the CS score, and the CS description to the CS+MCF score? Before the readers read the abstract, a couple of legends and explanations could be very helpful to avoid confusion. - The paper claims to have studied four attack surfaces (CD, RI, CS, MCF). What has been studied for CD (connection disruption)? There are no results on CD in Table 1. Table 2 is attack detection, not defense. If it is difficult to detect and evaluate, it is OK to tone the claim down a bit. - The evaluation results show LLMs are insufficient in safeguarding against attacks, cannot recover the collaborative driving performance attained before attacks, and cannot run in real time. These observations are good to know, but only marginally advance the field's understanding. It would be helpful to show contributions/insights obtained to advance the defense. Please refer to the weaknesses section. Fully human-written
Neurosymbolic Language Reasoning as Satisfiability Modulo Theory Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors introduce Logitext, a neurosymbolic language that extends SMT solving with natural language text constraints (NLTCs). They identify "compositional" and "combinatorial" reasoning gaps in LLMs when dealing with certain types of documents. In these cases, Logitext lets you formalize part of the document (the "logic") and then uses an iterative solver, with Z3 handling the logical constraints and an NLSolver handling the textual constraints. However, this method feels quite contrived, and I don't see any applicability of this beyond some carefully curated examples that require a lot of manual annotation anyway. The complexity of the Logitext system appears disproportionate to the demonstrated gains. 1. The paper does address a real problem wrt LLMs struggling with logical consistency in policy documents. 2. The paper is clearly motivated through empirics on reasoning gaps. 3. To my knowledge, this mixture of SMT solvers with NL constraints is novel. 1. The Logitext system seems quite contrived and requires significant manual effort to convert natural docs. 2. Manual annotation of the logical structure defeats the purpose of scalable automated reasoning. Therefore, real usage of this is highly questionable. 3. The convergence of the NLSolver is not guaranteed, and the caching employed seems quite ad hoc. 4. Some of the proposed baselines appear weak and raise concerns about the quality of the evaluation. 1. How does the cost of annotation compare to simply using better prompting strategies? 2. How sensitive is the performance to annotation quality and completeness? 3. What happens (as is likely) when logical structures in the document are incomplete or vague? Fully human-written
Neurosymbolic Language Reasoning as Satisfiability Modulo Theory Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces LogiText, a neurosymbolic language which supports partial formalization, and a novel SMT (Satisfiability Modulo Theory) solving framework. This approach aims to bridge the "compositional" and, most notably, the "combinatorial" reasoning gaps that persist in LLMs by coupling an SMT solver's Boolean search with an iterative LLM-driven "generate-validate-refine" loop. 1. This paper proposes a neurosymbolic language, LogiText. LogiText does not require converting the entire document into strict logical formulas; instead, it allows for explicitly annotating only the key logical structures (e.g., Boolean relations) while retaining ambiguous textual clauses as natural language. This design bridges the gap between traditional symbolic solvers (which require fully formalizable domains) and real-world complex documents (which are essentially a mix of text and logic), greatly enhancing the practical value of neurosymbolic methods in domains like legal analysis and content moderation (CMOD). 2. The authors propose a neurosymbolic framework for reasoning with semi-structured language that positions the LLM as an SMT theory solver. Specifically, the SMT symbolic algorithm is responsible for efficient Boolean structure search, and the LLM-driven NLSolver then generates assignments that satisfy logical and semantic constraints by iteratively calling LLM sampling, validation, and refinement operations. 3. Experimental results demonstrate that on the text instance generation (TIG) and text coverage generation (TCG) tasks, the LogiText-based formalization and the LLM-driven SMT neurosymbolic solving algorithm significantly outperform end-to-end LLMs. 1. The framework's reliance on precise clause-level annotation and evaluation is a critical vulnerability. LogiText relies on human experts to manually annotate natural language documents. This not only incurs high labor costs but also limits the method's scalability and application scope. Furthermore, the framework is fragile to "clause-level" errors. It still relies on the precise evaluation of each clause, and a failure in evaluating one clause can cause the entire logical chain to collapse. As results on LegalBench (Figure 8) show, a textual judgment error by the LLM on any single clause can lead to reasoning failure. In contrast, the "holistic reasoning" of end-to-end LLMs sometimes exhibits stronger reasoning capabilities. Especially in real, complex scenarios, the partitioning of clauses and the formalization of their logical relationships remain a significant challenge. 2. In the NLSolver algorithm, although the authors introduce a caching mechanism to reduce the number of LLM calls, the cost of LLM calls in the "generate-validate-refine" framework (an iterative refinement process) is still a non-negligible issue. Admittedly, we believe that proposing a low-cost and efficient solving method remains a challenge in this type of search-based neurosymbolic framework.
To better assess the practical viability of this framework, we ask the authors to report the average (and maximum) number of LLM calls required by NLSolver to solve each task in the experimental evaluation. 3. Section 3 is hard to follow, partly because of the complex notation. Please improve the presentation quality of this part. (Formatting Issue) Line 412 is incomplete. 1. The paper states that the benchmarks consist of 15 tasks with "10+ instances each". This seems like a very small scale for evaluation. Could the authors elaborate on the size of the test sets? How can you be confident in the generalizability of the results? 2. The paper mentions using the set of unsatisfied constraints to guide the refinement (Algorithm 1b, line 24). This is a key mechanism. Could the authors provide a concrete example of this refinement prompt? How is the LLM instructed to 'fix' the previously generated text based on which specific natural language constraints failed? Fully human-written
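To visualize the SMT-plus-NLSolver interaction described above, here is a minimal DPLL(T)-style sketch using z3py. The clause names, the toy skeleton, and the `llm_clause_holds` stub are hypothetical placeholders standing in for the paper's LLM calls; this is not the paper's actual algorithm.

```python
from z3 import Solver, Bool, Not, And, Or, is_true, sat

# Toy Boolean skeleton of a policy: a violation holds iff (C1 and C2) or C3.
C1, C2, C3 = Bool("C1"), Bool("C2"), Bool("C3")
skeleton = Or(And(C1, C2), C3)

def llm_clause_holds(clause_id: str, document: str) -> bool:
    """Stub standing in for the LLM 'theory solver' that judges whether a single
    natural-language clause is supported by the document."""
    return clause_id in ("C1", "C2")   # pretend the text supports C1 and C2, but not C3

def solve(document: str):
    s = Solver()
    s.add(skeleton)
    while s.check() == sat:
        model = s.model()
        assignment = {d.name(): is_true(model[d]) for d in model.decls()}
        # Validate the Boolean candidate against the text (the "validate" step).
        unsupported = [cid for cid, val in assignment.items()
                       if val and not llm_clause_holds(cid, document)]
        if not unsupported:
            return assignment              # text-consistent satisfying assignment found
        # Refine: block clauses the text cannot support, then ask Z3 for another candidate.
        for cid in unsupported:
            s.add(Not(Bool(cid)))
    return None                            # no assignment is consistent with the text

print(solve("example policy paragraph"))
```

The blocking step mirrors the "refine" operation: Boolean assignments the text cannot support are ruled out, and the SMT solver proposes the next candidate. Counting the calls to the stubbed check in a loop like this is also how one could report the per-task LLM-call budget requested above.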
Neurosymbolic Language Reasoning as Satisfiability Modulo Theory Soundness: 4: excellent Presentation: 3: good Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper describes an approach to the logical annotation of unstructured text documents that allows the definition of logical constraints in natural language in the context of a semi-structured prompt to a hybrid LLM/SMT solver. This allows the document to be translated into an SMT theory, formed by combining LLM valuations for atomic propositions within the constraints with auto-formalization of complex statements in the document, for which a solver then finds a satisfying assignment or determines the theory is unsatisfiable. An effective and innovative approach to logical reasoning with LLMs that uses an SMT solver as cognitive scaffolding. This is an improvement over previous approaches which depend on autoformalization into logical statements followed by independent execution of a solver over the generated theory. The deeper integration of the LLM into the core of a hybrid reasoning engine is an important direction to pursue, because it begins to address how best to exploit the complementary strengths of formal reasoners and the approximate retrieval of LLM parametric knowledge for reasoning. The presentation is clear and, for this reviewer, informative and thought-provoking; the description of the formalization for natural language text constraints and of the NLSolver was thorough and easy to follow. The author(s) effectively make the case that using an LLM to ground atomic statements in the context of a solver is a viable, practical approach to reasoning in neurosymbolic systems that addresses some of the shortcomings of current approaches. The paper extends SMT with a theory for textual constraints, but does not discuss formal soundness or completeness guarantees. What are the conditions under which the NLSolver is guaranteed to find a solution if one exists? The paper would benefit from a more rigorous theoretical treatment of these questions. The paper doesn't adequately address how it ensures factuality of LLM evaluations when these serve as atomic propositions in the logical theory. What happens when the LLM misclassifies a natural language constraint? The LLMVerify function is treated as an oracle, but in practice, LLMs can be inconsistent or incorrect. The paper would benefit from error analysis on how LLM evaluation errors propagate through the logical reasoning process and what safeguards exist to detect or mitigate such errors. Have the authors investigated the usability of the Logitext language from a human factors perspective? What is the learning curve for non-experts to write effective annotations? How error-prone is the annotation process? Understanding the practical challenges of getting users to correctly annotate documents is crucial for real-world deployment. How does the system handle ambiguous natural language that could have multiple valid autoformalizations? Does it maintain multiple hypotheses or commit to a single interpretation? This is particularly important for legal and policy documents where ambiguity may be intentional or unavoidable. Can the authors provide any formal guarantees about their system? For example, under what conditions is the NLSolver guaranteed to be sound, never returning an incorrect solution?
Even partial guarantees would strengthen the theoretical contribution. Fully AI-generated
Neurosymbolic Language Reasoning as Satisfiability Modulo Theory Soundness: 2: fair Presentation: 1: poor Contribution: 3: good Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes Logitext, a neurosymbolic language that bridges the gap between natural language and formal language. Logitext is expressed as natural language text constraints, which make explicit a logical structure underlying the natural language text. Furthermore, the authors extend an SMT solver and propose an NLSolver that uses an LLM to solve natural language text constraints. The proposed method is evaluated on 15 tasks and yields impressive performance in three types of settings, supporting its effectiveness. 1. The paper newly introduces a language called Logitext to fully represent the logical structure behind the natural language text. This is a very important problem in legal, political, and societal domains. 2. The performance is impressive. Logitext performs much better than few-shot prompting and the previous neurosymbolic approach on various benchmarks. The presentation should be further improved for clarifications. ### Major points - The word "solver" appears everywhere, but I am often confused about which solver is meant. I think you'd better come up with clearer terminology. There are many related terms (e.g., solver, logical solver, SMT solver, LLM-based solver, symbolic solver), and for people reading this paper for the first time, it is really hard to tell which "solver" is meant throughout the paper. - For Logitext, what if an implicit premise is hidden in the text, so that you can't capture the entire underlying logical structure of the text by simply annotating it? - The description of the algorithm (Section 3.3) could be further improved. Currently, too many variables are defined only in Figure 4, not in the main paper, which hinders a full understanding of the details. - For the new benchmark CMOD, what is the source of these moderation policies? I cannot find it anywhere. - Line 367: 10+ instances per task seems very small. Please elaborate on the dataset size for each task. - Line 425-426: The authors should explain what this neurosymbolic prompt looks like. If it comes from previous work, then the authors should cite that work. ### Minor points - In Figure 2 (b), the authors should mention that the definition of the combinatorial gap is in Appendix A.1. Readers would otherwise be confused. - Which solver did you use in Section 2.2? Did you use NLSolver or an SMT solver? - In Figure 3, why is there a C8? For defining disruptive behavior and immediate threat, C8 is not even used, and as the definition of C6 shows, I guess C8 is equivalent to C6. - Line 213: if I understand correctly, the first `<var>` and the second `<var>` refer to two different variables. For clarification, you should denote them as `<var1>` and `<var2>` instead. Also, what do `<var>`, `<clause>`, and `<phrase>` exactly mean? The authors should elaborate on this right after introducing the new notation. - Line 217: the authors should cite related papers for `pyz3`. - Line 253-254: the authors should describe how u_i is related to c_jh and what the p_i's mean here. I could understand from the context, but to formally define Logitext, the authors should address this. Also, why does this formalization reoccur here?
I think it is already briefly described in Section 3.1 (Clause naming). - Line 263: the notation l_jh suddenly appears. What does it mean? - Line 268: in the previous paragraph, an NLTC was denoted v_jh, but here it is denoted v_i where 1<=i<=n. Could the authors clarify this point? - Line 412: Text is overlapped by Figure 8. See Weaknesses. Fully human-written
How Confident are Video Models? Empowering Video Models to Express their Uncertainty Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a black-box framework that lets text-to-video models express uncertainty by decomposing predictive uncertainty into aleatoric (prompt vagueness) and epistemic (model ignorance) components. The framework is evaluated with a rank-correlation-based calibration metric, and a 40K-video UQ benchmark is released. 1. The paper is well-presented, well-written, and the motivation is justified. 2. The research topic of principled evaluation of synthetic videos is very timely and important. 3. The proposed dataset will be valuable. 1. The method's evidence of **general** video-model UQ almost entirely depends on one text-to-image-to-video pipeline (Cosmos-Predict2). While I appreciate that the authors state the API/compute constraints, it would be more convincing if the paper proposed potential solutions or fixes to overcome this challenge. That being said, the practicality and calibration of stronger video models should be evaluated. 2. Please fix salient typos such as "video modes" (Page 3) and "peak signal-to-noise ration" (Page 13). While there are several weaknesses stated above, I believe this paper is a meaningful contribution and will provide new insights to the community. I therefore give an initial rating of 6 for this paper. Please note that my final rating will be conditioned on the soundness of the rebuttal. Fully human-written
How Confident are Video Models? Empowering Video Models to Express their Uncertainty Soundness: 3: good Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. - The authors introduce a framework to measure the uncertainty of video generative models. - The framework consists of a metric for evaluating the calibration of video models based on robust rank correlation estimation. - They also introduce S-QUBED, a black-box UQ method for video models. S-QUBED effectively distinguishes between uncertainty arising from ambiguous prompts and uncertainty stemming from the model's lack of knowledge. - They will also release a dataset of 40K videos across diverse tasks to help benchmark calibration in video models. - The authors use their method to disentangle and understand the aleatoric and epistemic uncertainty of the video generation models. For example, to assess epistemic uncertainty, they generate multiple videos for the same prompt and embed them. Then, they measure the embeddings' spread, with wider spread indicating higher epistemic uncertainty. - For the main result of their work, they further study the correlation between accuracy and the different uncertainties. They find that when uncertainty is higher, accuracy tends to be lower. This holds for both overall uncertainty and its aleatoric/epistemic components. - Uncertainty quantification of LLMs is well studied, but not studied at all for video generation models. This work is novel in that it studies uncertainty quantification for video generation models. - The black-box approach makes it accessible to evaluate any video generation model. - The authors presented the material well, providing the necessary background to understand the motivation and importance of this work, which is especially important given its novelty. - I would like to see empirical results validating S-QUBED on other open (non-API) video models, given that it is a black-box approach. The authors mentioned that different models were considered but not evaluated due to access and compute constraints. However, I believe there should be multiple open text-to-video models to evaluate S-QUBED on (e.g., OpenSora). - Typical metrics (e.g., CLIP, PSNR) for evaluating text-to-image and text-to-video models often do not align with human judgment. I would like to see the correlation of uncertainty with human judgment metrics. No questions, as the background, motivation, and results were presented well. Fully human-written
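As a minimal reading of the embed-and-measure-spread procedure described in the review above, the sketch below fits a Gaussian to the embeddings of several generations for one prompt and uses its differential entropy as a spread proxy. The embedding dimension, sample count, and estimator are assumptions for illustration, not the paper's S-QUBED implementation.

```python
import numpy as np

def gaussian_entropy(embeddings: np.ndarray, jitter: float = 1e-6) -> float:
    """Differential entropy (in nats) of a Gaussian fit to the embeddings.
    embeddings: (num_samples, dim) array, e.g. one video embedding per generation."""
    d = embeddings.shape[1]
    cov = np.cov(embeddings, rowvar=False) + jitter * np.eye(d)  # regularized covariance
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

rng = np.random.default_rng(0)
# Placeholder embeddings for 8 generations of the same prompt, 16-dim features.
tight = rng.normal(scale=0.1, size=(8, 16))   # consistent generations -> small spread
loose = rng.normal(scale=1.0, size=(8, 16))   # inconsistent generations -> large spread
print(gaussian_entropy(tight) < gaussian_entropy(loose))  # True: wider spread, larger proxy
```

Note that the value of such an entropy proxy depends heavily on the embedding model and its dimensionality, which is related to the sensitivity concern raised in a later review of this paper.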
How Confident are Video Models? Empowering Video Models to Express their Uncertainty Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. Text-to-video models are quickly improving and creating excitement both among researchers and users of AI. However, like LLMs, these models are prone to hallucinate details of their output, especially when the input prompt is underspecified or underrepresented in training data. To address this challenge, this work presents the first (to their knowledge) study of uncertainty quantification in text-to-video models. They propose a black-box UQ method based on the epistemic/aleatoric decomposition to help identify when a text-to-video model is likely to hallucinate, and also plan to release a dataset of 40K videos for benchmarking UQ. Effective uncertainty quantification is a central pillar in creating trustworthy AI systems. While most focus on UQ in deep learning has been in image classification and more recently LLMs, it is important that these tools are extended to other fields and application areas, for example robotics or other generative media besides text. This paper aims to take the first step towards developing a framework and tools for UQ in text-to-video systems. This is a very solid motivation, and creates the potential for a significant contribution. The main weakness I find is that this paper does not carefully treat the concepts of epistemic and aleatoric uncertainty, in particular by treating them primarily in terms of the input prompt rather than as properties that depend on the interaction between the model, its capacity, and the data distribution. Aleatoric uncertainty is described as randomness from prompt vagueness, while epistemic uncertainty is tied to a lack of model knowledge. This framing assumes these uncertainties are intrinsic to the prompt, but in practice, they are model- and data-dependent. For instance, if the entire training set consists of videos of cats napping on purple beds in the backs of pickup trucks, then the prompt “a cat napping on a purple bed in the back of a pickup truck” would still display high aleatoric uncertainty, not because the prompt lacks specificity, but because the data distribution itself is highly variable in that region. By focusing almost entirely on prompt semantics, the paper overlooks the fact that the distinction between epistemic and aleatoric uncertainty depends fundamentally on the model and the data it has seen. This conceptual problem extends directly into the method. The decomposition in Equation (3) is presented as a principled separation between epistemic and aleatoric uncertainty, but in practice both quantities depend on the behavior and biases of the specific models used to estimate them. The authors estimate aleatoric uncertainty by prompting an LLM to generate refined textual variants and epistemic by sampling multiple videos from the same generative model. Both steps produce variability that arises from model architectures and training data of the various models, not from isolated intrinsic uncertainty types. What they call aleatoric uncertainty reflects the LLM’s own distribution, while their epistemic uncertainty reflects the video embedding model’s representation space, making the split depend on implementation choices rather than underlying epistemic principles. 
As a result, the decomposition is not theoretically or empirically meaningful. Beyond these conceptual issues, the method relies on untested and implausible assumptions. The independence assumption discards dependence between the text prompt and the generated video, which is unlikely to hold in text-to-video generation. The estimation of entropy in embedding spaces further introduces arbitrary geometric distortions, since the embedding dimensions and projection have a major effect on the computed entropies. The authors provide no sensitivity analysis or justification for these choices, leaving the reported uncertainty values largely uninterpretable. The experimental evaluation generally lacks rigor. The decision to use CLIPScore as the primary accuracy metric is based on a small 10 sample correlation study, an inadequate basis for methodological justification. The subsequent experiments that claim to disentangle aleatoric and epistemic uncertainty depend on opaque subsetting of data where one component is deemed zero according to the authors’ own estimators, introducing circular reasoning. These experimental protocols make the reported calibration and correlation results difficult to trust. Overall, the proposed decomposition lacks solid conceptual grounding, the implementation does not meaningfully separate uncertainty types, and the empirical evaluation does not convincingly support the claims. See weaknesses. Fully human-written
How Confident are Video Models? Empowering Video Models to Express their Uncertainty Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper is (to the authors' knowledge) the **first study of uncertainty quantification (UQ) for text-to-video models**, proposing a three-part framework: (i) a **calibration metric** based on robust rank correlation between uncertainty and task accuracy, (ii) a black-box UQ method, **S-QUBED**, that uses a **latent-space factorization** to **decompose total predictive uncertainty** into **aleatoric** (prompt vagueness) and **epistemic** (model ignorance) components, and (iii) a ~**40K-video UQ dataset** for benchmarking. Experiments on VidGen-1M and Panda-70M show that S-QUBED's total uncertainty is **significantly negatively correlated** with semantic accuracy (CLIP score), and its decomposition yields calibrated aleatoric/epistemic trends on subsets where the other source of uncertainty is minimal. * Positions UQ for video generation as a first-class problem; the formal **entropy decomposition** $h(V|\ell)=h(V|Z)+h(Z|\ell)$ cleanly maps to epistemic vs. aleatoric sources. * **S-QUBED** operates without model internals, aligning with many **closed-source video models**. * Uses **Kendall's τ** and demonstrates **significant negative correlation** between S-QUBED uncertainty and **CLIP accuracy**, with visuals that match the trend. * Empirical **disentangling** of aleatoric vs. epistemic uncertainty shows expected behavior on curated subsets. * Plans to release a **~40K-video UQ dataset** covering diverse tasks. * Calibration hinges primarily on **CLIP similarity**; other perceptual metrics (SSIM/PSNR/LPIPS) show weak or insignificant correlations, raising concerns about **metric sensitivity** and potential semantic-evaluator bias. * Estimating **epistemic uncertainty** requires **multiple generations per latent prompt**, which the authors acknowledge as a limitation. * Main experiments use **Cosmos-Predict2** and two datasets; broader **model diversity** and real-world perturbations (codecs, length, audio conditions) are not deeply explored. 1. Beyond CLIP, what **additional accuracy signals** (e.g., human semantic judgments, video-text retrieval scores, physics consistency probes) are necessary to **validate calibration** and mitigate evaluator bias? 2. What **sampling schedules** (fewer latent prompts/videos, adaptive stopping) or **latent-space proxies** would you require to deem S-QUBED **computationally practical** without sacrificing epistemic resolution? 3. Which **additional models/datasets** or **deployment artifacts** (compression, prompt styles, audio/no-audio) would most convincingly demonstrate that the **aleatoric/epistemic decomposition** remains **stable and calibrated** in the wild? Heavily AI-edited
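For reference, a minimal sketch of the rank-correlation calibration check mentioned in the reviews above: Kendall's τ between per-prompt uncertainty and a per-prompt accuracy score (such as CLIP similarity). The arrays are illustrative placeholders, not values from the paper.

```python
from scipy.stats import kendalltau

# Placeholder per-prompt values; in the paper's setting these would be S-QUBED
# uncertainties paired with CLIP-based semantic accuracy scores.
uncertainty = [0.90, 0.70, 0.55, 0.40, 0.20, 0.10]
accuracy    = [0.21, 0.30, 0.33, 0.45, 0.52, 0.61]

tau, p_value = kendalltau(uncertainty, accuracy)
print(f"Kendall tau = {tau:.2f}, p = {p_value:.3f}")  # strongly negative tau suggests good calibration
```

Because Kendall's τ only uses ranks, the check is insensitive to monotone rescalings of either quantity, which is what makes it usable across heterogeneous accuracy metrics.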
AdaSpec: Adaptive Spectrum for Enhanced Node Distinguishability Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 8: accept, good paper Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This is a good paper that mainly investigates how distinct eigenvalues and missing frequency components affect node distinguishability. The paper proposes AdaSpec, which includes three modules: INCREASE DISTINCT EIGENVALUES, SHIFT EIGENVALUES FROM ZERO, and INCREASE FREQUENCY COMPONENTS. The proposed method is supported by solid theoretical proofs, and extensive experiments demonstrate the effectiveness of AdaSpec. 1. The paper studies how the graph matrix and node features jointly influence node distinguishability, which is an interesting direction. 2. The paper provides solid theoretical proofs to support the proposed method. 3. The time complexity analysis and comprehensive experiments further validate the effectiveness of AdaSpec. 1. In line 175, the paper states that "The presence of zero eigenvalues can hinder node distinguishability", but I did not find any detailed explanation about this point. It should be the zero frequency components, not the zero eigenvalues, that hinder node distinguishability. The authors need to clarify the role of Section 5.2 in detail; otherwise, this section should be removed. 2. In lines 91–94, the paper mentions that eigenvalue correction does not ensure permutation invariance, but no further explanation is provided, leaving readers unclear about why this is the case. 3. In Figure 1, why can't node 1 and node 3 be distinguished? Based on the current analysis, this conclusion does not seem directly supported. 4. Although the paper mainly studies node distinguishability, only Figure 1 involves this concept; both the method and experiment sections lack relevant discussion. The authors could conduct further analysis, for example, by relating their work to the Weisfeiler–Lehman (WL) test. 5. The two factors affecting node distinguishability mentioned in this paper (distinct eigenvalues and nonzero frequency components) have already been discussed in prior work, such as Wang & Zhang (2022). This weakens the contribution of the present paper. Please see Weaknesses. Lightly AI-edited
AdaSpec: Adaptive Spectrum for Enhanced Node Distinguishability Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. AdaSpec is an adaptive spectral module for GNNs that learns to modify a graph's spectral structure (eigenvalues and frequency components) to make node representations more distinguishable. It generates an adjusted graph matrix that increases the number of distinct eigenvalues and enhances the frequency coverage of node features in the spectral domain. The approach is theoretically grounded, preserving permutation equivariance and providing provable guarantees on node distinguishability. - AdaSpec provides a clear theoretical analysis linking graph spectra and node features to node distinguishability, and even derives a lower bound on how many nodes can be distinguished. - Unlike prior spectral GNNs that use fixed graph matrices, AdaSpec learns to adjust the graph's spectral structure (eigenvalues and frequency components), leading to improved representational power without extra computational cost. - The method maintains a critical property for graph learning (permutation equivariance), ensuring that node reordering doesn't change the model's predictions, which preserves theoretical and practical soundness. - The model is rigorously tested across 18 benchmark datasets, consistently improving node distinguishability and classification performance, showing that the theoretical ideas translate into real-world gains. - While AdaSpec focuses on adaptive graph matrix generation, prior works such as ARMA-GNN [1] and SpecFormer [2] have already explored adaptive spectral filtering or learnable spectral responses. AdaSpec's contribution lies more in its theoretical framing (node distinguishability + eigenvalue diversity) than in introducing a fundamentally new mechanism. [1] Graph Neural Networks with convolutional ARMA filters [2] Specformer: Spectral Graph Neural Networks Meet Transformers - Although AdaSpec theoretically explains how distinct eigenvalues improve distinguishability, it doesn't provide much empirical interpretability (e.g., how the learned spectral modifications relate to graph structure or which frequencies become emphasized). This makes the adaptive mechanism somewhat of a black box. - The paper claims "no increase in computational complexity," but adaptively generating or modifying a graph matrix could introduce training instability or hidden overhead. The supplementary doesn't discuss runtime comparisons or scalability to very large graphs, areas where methods like GPR-GNN [3] are better optimized. How would AdaSpec perform on very large graphs? [3] Adaptive Universal Generalized PageRank Graph Neural Network - Recent works such as SpecFormer achieve similar goals but with stronger end-to-end learnability and better scaling to large, dynamic graphs. - Can AdaSpec handle temporal or evolving graph spectra, or is it limited to static adjacency matrices? See all the points in weaknesses. Fully AI-generated
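To make the distinct-eigenvalue and frequency-component discussion in the two reviews above concrete, here is a small numpy sketch on a toy graph (illustrative only, not AdaSpec's code): it counts the distinct Laplacian eigenvalues and checks which frequency components of a node feature vector vanish.

```python
import numpy as np

# Toy 4-node path graph; A is the adjacency matrix, L the unnormalized Laplacian.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

eigvals, U = np.linalg.eigh(L)                      # columns of U form the graph Fourier basis
distinct = len(np.unique(np.round(eigvals, 6)))     # number of distinct eigenvalues
print("distinct eigenvalues:", distinct, "of", len(eigvals))

x = np.ones(4)                                      # constant node feature
x_hat = U.T @ x                                     # frequency components of the feature
zero_components = np.isclose(x_hat, 0.0, atol=1e-8)
print("zero frequency components:", int(zero_components.sum()))
# With only the zero-frequency component present, any polynomial filter of L
# produces identical values at every node, so no nodes can be distinguished,
# regardless of how many distinct eigenvalues L has.
```

This is exactly the interplay the first review points to: distinct eigenvalues alone do not help when the input features have missing (zero) frequency components.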
Page 41 of 1516 (75800 total rows)