ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 15899 (21%) 4.43 3.58 3687
Heavily AI-edited 3233 (4%) 4.22 3.59 2990
Moderately AI-edited 7082 (9%) 4.20 3.61 2722
Lightly AI-edited 16648 (22%) 4.15 3.68 2746
Fully human-written 32938 (43%) 4.13 3.62 2917
Total 75800 (100%) 4.21 3.62 3026
Title Ratings Review Text EditLens Prediction
ACON: Optimizing Context Compression for Long-horizon LLM Agents Soundness: 1: poor Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors propose a framework called ACON designed to reduce the computational cost of LLM agents for long-horizon tasks. The authors identify the growing context length due to accumulated histories of actions and observations as a key obstacle to efficiency. ACON tackles this by introducing a compression guideline optimization that learns how to summarize and retain essential information across steps in long-horizon tasks through a contrastive, min–max formulation. The authors also experimented with distillation of the learned compressor into smaller models using LoRA fine-tuning. Experiments on three benchmarks (AppWorld, OfficeBench, and multi-objective QA) show improvements in task success rates and moderate reductions in peak input tokens. However, while ACON achieves better reasoning stability, the actual efficiency gains in terms of total token usage and runtime cost are less convincing. * Formulation of context compression as a learning problem: ACON elegantly formulates context compression as a contrastive optimization problem. By pairing successful trajectories with failed ones after compression, it directly trains the model to preserve information that determines success. This min–max objective formalizes what to keep and what to drop in a principled way, moving beyond rule-based or heuristic memory truncation. * Clear and rigorous methodological description: The paper provides a detailed explanation of the compression guideline optimization process. The use of LLM-as-a-judge evaluation for multiple candidate guidelines, iterative feedback generation, and adaptive prompt selection is described with strong clarity. This makes the method reproducible and highlights the thoughtfulness of the design. * Exhaustive evaluations with multiple benchmarks: The authors conducted experiments on three distinct benchmarks under varying conditions and provided detailed analyses to understand various aspects of the proposed framework. Evaluations on three distinct long-horizon agentic benchmarks demonstrate consistent improvements in accuracy and moderate token reductions. The inclusion of both full-scale and distilled compressors supports the framework's flexibility and practical deployment value. * Strong contribution to reasoning stability: Even though ACON's original goal was efficiency, its most significant contribution appears in reasoning stabilization. Compressed and structured contexts improve coherence in long-horizon planning, reducing redundant exploration and logical drift in LLM agents. * Limited resolution of the claimed efficiency problem: Although the paper motivates ACON as a solution to computational inefficiency caused by long contexts, experiments show that overall token usage and runtime cost did not decrease significantly. In fact, repeated compressor invocations increased API calls, and the authors acknowledge that execution latency rose.
The framework thus enhances task performance but not genuine efficiency. * High optimization cost in the guideline learning phase: The compression guideline optimization is extremely expensive, involving iterative LLM calls across the full D_cont dataset. With 20–25 candidate prompts per iteration and multiple iterations, the process may require hundreds of thousands of API calls and many hours of training time. The paper admits this cost but omits quantitative measurements, treating it as an offline overhead. This weakens the practicality of ACON for large-scale or domain-adaptive deployment. The authors argue that guideline optimization is performed once per domain and reused across tasks. However, in realistic multi-domain settings, new environments would require data collection and retraining that reintroduces the same heavy computational overhead. This limits ACON's scalability and adaptability for general-purpose agent systems. * Distillation effect is limited: The distillation step offers only marginal performance gains. The paper itself notes that even GPT-4.1-mini without distillation performs comparably to distilled small models. Hence, the true utility of distillation lies in cost reduction rather than learning transfer, making its contribution modest. * Guideline optimization depends heavily on heuristic search: Although the optimization is presented as learning-driven, it fundamentally relies on prompt-based heuristic exploration with LLM-as-a-judge feedback. This process lacks theoretical guarantees of convergence or optimality and may depend heavily on model biases and dataset idiosyncrasies. Please refer to the Weaknesses section to address the raised issues. Fully AI-generated
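For readers unfamiliar with the mechanism the review above summarizes, here is a minimal sketch of a contrastive guideline-optimization loop. It is an illustration only, not ACON's implementation: `llm`, `run_agent`, and `compress` are hypothetical stand-ins, and this simplified single-candidate loop omits the paper's multi-candidate LLM-as-a-judge scoring.

```python
# Illustrative sketch of contrastive compression-guideline optimization
# (hypothetical helpers: llm(), run_agent(), compress(); not ACON's actual code).

def optimize_guideline(tasks, guideline, llm, run_agent, compress, n_iters=3):
    """Refine a natural-language compression guideline from task outcomes."""
    for _ in range(n_iters):
        failures = []
        for task in tasks:
            full = run_agent(task, compress_fn=None)  # full-context rollout
            short = run_agent(task, compress_fn=lambda h: compress(h, guideline))
            # Contrastive pair: success with full context but failure after
            # compression suggests the guideline dropped needed information.
            if full.success and not short.success:
                failures.append((task, full.trajectory, short.trajectory))
        if not failures:
            break
        # Ask a strong LLM to diagnose what the compressed contexts lost,
        # then propose a revised guideline that preserves that information.
        diagnosis = llm(
            "Compare these successful full-context and failed compressed-context "
            f"trajectories and list the information lost by compression:\n{failures}"
        )
        guideline = llm(
            f"Current compression guideline:\n{guideline}\n\n"
            f"Failure analysis:\n{diagnosis}\n\n"
            "Rewrite the guideline so the listed information is retained."
        )
    return guideline
```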
ACON: Optimizing Context Compression for Long-horizon LLM Agents Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Large language models serve as agents in dynamic environments where they accumulate extensive interaction histories, leading to increased computational costs and inefficiencies in long-horizon tasks. The motivation arises from the need to compress these growing contexts effectively, as prior methods primarily address single-step or domain-specific scenarios and fail to preserve essential multi-step signals. Challenges involve retaining diverse information such as states, causal relations, preconditions, and decision cues across heterogeneous tools without losing critical details. ACON introduces a unified framework that optimizes compression guidelines through natural language failure analysis and distills them into smaller models to achieve efficient, informative condensations. 1. ACON reduces peak tokens by 26-54% while preserving or enhancing task performance. This efficiency stems from targeted compression that eliminates redundancies without sacrificing key information. Agents can thus handle longer horizons more cost-effectively. 2. The guideline optimization leverages contrastive feedback from successful and failed trajectories. This process refines prompts in natural language space to better capture task-specific needs. As a result, compression becomes more adaptive and effective across diverse environments. 3. Experiments demonstrate consistent gains on AppWorld, OfficeBench, and Multi-objective QA benchmarks. These validations cover varied domains like productivity and question answering. The broad applicability underscores the framework's robustness. 4. ACON improves smaller agents' performance by 20-46% by mitigating long-context distractions. Concise summaries focus reasoning on essential details. This equalization empowers less capable models to tackle complex tasks. 5. The method operates gradient-free, making it suitable for API-based LLMs. No parameter updates are required during optimization. This flexibility supports integration with proprietary systems. 1. The optimization phase demands collecting feedback from multiple trajectories. This requires significant upfront computation for contrastive pairs. Deployment in time-sensitive scenarios becomes challenging. 2. Benchmarks are simulated and may not capture real-world variability. Unforeseen environmental changes could degrade performance. Broader testing in live settings is essential. 3. Distillation incurs a minor performance drop despite high retention. Critical applications risk failures from lost nuances. Enhanced techniques to minimize this gap are necessary. 4. Thresholds for invoking compression need per-benchmark tuning. Suboptimal values lead to either excessive calls or insufficient reduction. This hyperparameter dependency complicates usage. 5. Comparisons omit some recent agent-specific compression methods. Relative advantages remain unclear without these baselines. Expanding evaluations could better position ACON. See Weaknesses. Fully AI-generated
ACON: Optimizing Context Compression for Long-horizon LLM Agents Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces Agent Context Optimization (ACON), a framework designed to address the challenge of ever-growing context length for LLM agents operating in long-horizon tasks. The core contribution is a novel, gradient-free method for optimizing context compression guidelines. This is achieved by analyzing pairs of trajectories—one with full context that succeeds and one with compressed context that fails—using a powerful LLM to identify the causes of failure and iteratively refine the compression prompt. The authors also propose distilling the optimized compressor into a smaller, more efficient model. Experiments conducted on three benchmarks (AppWorld, OfficeBench, and Multi-objective QA) demonstrate that ACON can significantly reduce peak token usage (26-54%) while largely preserving task performance, and in some cases, even enhancing the capabilities of smaller agent models. Originality and Significance: The paper tackles a highly significant and practical problem for the advancement of LLM agents: context management. The proposed method for optimizing compression guidelines is novel and clever. Using the task outcome (success vs. failure) as a supervisory signal in a gradient-free, natural language optimization loop is an original approach that is broadly applicable, even to closed-source API-based models. Clarity: The paper is exceptionally well-written and clearly structured. The problem, the proposed solution, and the experimental results are all explained with high clarity. The figures, particularly Figure 1 and 3, are effective at illustrating the core trade-offs and the optimization mechanism. Empirical Rigor: The experimental evaluation is comprehensive, covering three distinct and challenging long-horizon benchmarks. The results are strong, showing that ACON not only maintains performance close to the "no compression" upper bound but also significantly outperforms other compression baselines, which often suffer from severe performance degradation. Practical Cost vs. Token Efficiency: The primary weakness lies in the trade-off between peak token reduction and overall computational/API cost. The paper itself acknowledges this limitation in Section 4.5. While ACON successfully reduces the maximum context length (peak tokens), the process of history compression (which involves frequent calls to the compressor LLM) can break the KV-caching mechanism of the agent LLM. This forces re-computation and can lead to a higher total number of tokens processed and thus higher API costs, as shown in Figure 7. This is a significant practical drawback that might limit the adoption of the history compression part of the framework where cost, not just memory, is the main concern. Cost and Scalability of the Optimization Process Itself: The paper details the effectiveness of the ACON framework but does not sufficiently discuss the "meta-cost" of the guideline optimization process. This process requires running multiple full trajectories (both with and without compression) and then using a powerful "optimizer" LLM for analysis. 
For new, complex domains, this optimization phase could be prohibitively expensive and time-consuming. The scalability of this approach to a wide variety of new tasks without incurring substantial upfront costs is unclear. Generalizability of Optimized Guidelines: The experiments show that guidelines optimized on a specific benchmark's training set work well on its test set. However, the generalizability of these highly specialized guidelines across different domains remains an open question. For instance, would a guideline optimized for AppWorld's application-based tasks be effective for a radically different domain like code generation or scientific literature review? The paper could be strengthened by including an experiment that tests this cross-domain transferability. Regarding the critical issue of computational overhead: Could the authors provide a more detailed analysis of the trade-off between peak token reduction and total API cost, especially for history compression? For what types of tasks or interaction patterns does the cost-saving from reduced context outweigh the extra cost of the compressor calls and KV-cache invalidation? Could the authors elaborate on the cost and complexity of the guideline optimization phase? What is the estimated computational cost (e.g., in terms of number of LLM calls or GPU hours) required to generate an optimized guideline for a new benchmark of similar complexity to AppWorld? The quality of the optimized guideline seems to depend heavily on the capability of the "optimizer" LLM (O3 model in this case). How sensitive is the final guideline quality to the choice of this optimizer? If a less capable model (e.g., gpt-4.1-mini) were used as the optimizer, would the process still yield significant improvements, or does ACON fundamentally rely on having access to a state-of-the-art reasoning model for optimization? Fully AI-generated
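The peak-token vs. total-cost trade-off raised in the review above can be made concrete with a toy calculation (purely illustrative numbers, not from the paper): if a compressor is invoked every k steps, each invocation re-reads the history and the agent loses its KV cache for the rewritten prefix, so total processed tokens can rise even while the peak context shrinks.

```python
# Toy accounting of peak vs. total processed tokens (illustrative numbers only).

def no_compression(steps, obs_tokens):
    """Each step appends obs_tokens; with KV caching only new tokens are processed,
    but the peak context grows linearly."""
    peak = steps * obs_tokens
    total = steps * obs_tokens          # each token processed once (cache reused)
    return peak, total

def with_history_compression(steps, obs_tokens, every_k, summary_tokens):
    """Every k steps a compressor reads the window and replaces it with a summary;
    the agent's KV cache is invalidated for the rewritten prefix and re-processed."""
    peak = every_k * obs_tokens + summary_tokens
    compressor_reads = (steps // every_k) * every_k * obs_tokens
    # After each compression the agent re-reads summary + subsequent observations.
    agent_reprocess = (steps // every_k) * summary_tokens + steps * obs_tokens
    return peak, compressor_reads + agent_reprocess

print(no_compression(steps=40, obs_tokens=500))          # peak 20000, total 20000
print(with_history_compression(40, 500, every_k=10, summary_tokens=800))
# peak 5800 (much lower), but total processed tokens rise to 43200.
```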
Learning Part-Aware Dense 3D Feature Field For Generalizable Articulated Object Manipulation Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. In this work, the authors consider the task of learning representations for articulated objects which are useful in downstream manipulation tasks. Specifically, the authors propose a procedure to pre-train a neural network to map 3D point clouds to part-aware features, with two different contrastive supervision techniques (spatial and semantic). They build on top of Sonata (PTv3 pre-trained), but make some architectural modifications to enable higher-resolution representations. These representations are then used in several downstream manipulation tasks. The authors show compelling results on simulated & real tasks. * Their proposed architecture, pre-training, and modification are all sensible & principled approaches to handling object-level features at high resolution * The results do improve over SOTA considerably * There are extensive ablations showing how different parts of the system contribute to performance, as well as qualitative visualizations of feature representations. * The paper is well-written * It's unclear whether the comparison with DP3 is completely valid (e.g. Sonata + DP3); the authors should clarify the difference between DP3 and the various ablations (e.g. where/when SigLIP is included, architectures, etc.). It would help the authors cleanly show that 1) the point cloud architecture change and 2) the pre-training compared to DP3 make a major difference on the task (right now it's just difficult to tell from the details of the paper). * It's unclear how much the spatial vs. semantic components actually make a difference. In ablations, the contrastive pretraining (feature refinement) is not broken down by whether the spatial or semantic components make the difference * Details on architecture / training are a bit sparse * Lots of details of training / architecture are omitted - despite being pointed to Appendix A I didn't see much there. Particularly interested in specific architectural modifications, fine-tuning/pre-training technique when using the pre-trained Sonata weights. * Why is it called a field instead of a representation? It seems to be per-point features, and querying other points in space does not seem possible without altering the representation. Fully human-written
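To make the "spatial and semantic" contrastive supervision discussed in the review above concrete, here is a minimal per-point part-contrastive loss in the InfoNCE style; it is a generic sketch of this family of objectives under assumed inputs (per-point features and part labels), not the paper's exact losses.

```python
# Minimal part-contrastive loss over per-point features
# (generic sketch, not PA3FF's exact objective).
import torch
import torch.nn.functional as F

def part_contrastive_loss(feats, part_ids, temperature=0.07):
    """feats: (N, D) per-point features; part_ids: (N,) functional-part label per point.
    Pulls together points on the same part and pushes apart points on different parts."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / temperature                 # (N, N) cosine similarities
    same_part = part_ids.unsqueeze(0) == part_ids.unsqueeze(1)
    eye = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    pos_mask = same_part & ~eye                           # positives: same part, not self
    # log-softmax over all other points, averaged over each point's positives
    logits = sim.masked_fill(eye, float('-inf'))
    log_prob = F.log_softmax(logits, dim=1)
    pos_counts = pos_mask.sum(1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(1) / pos_counts
    return loss[pos_mask.any(1)].mean()                   # ignore points with no positives

# Toy example: 6 points belonging to 2 parts
feats = torch.randn(6, 32)
part_ids = torch.tensor([0, 0, 0, 1, 1, 1])
print(part_contrastive_loss(feats, part_ids))
```

The semantic side of such objectives typically replaces the point-to-point positives with point-to-text positives (e.g., aligning part features with embeddings of part names), which is a separate term not shown here.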
Learning Part-Aware Dense 3D Feature Field For Generalizable Articulated Object Manipulation Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a novel Part-Aware 3D Feature Field (PA3FF), which is trained via contrastive learning to integrate 3D geometric priors and functional part awareness, addressing the challenge of limited generalization in articulated object manipulation. Building upon this feature, the authors introduce the Part-Aware Diffusion Policy (PADP), an imitation learning framework , and demonstrate superior performance over existing 2D and 3D representation baselines in both simulated and real-world tasks. 1. Clarity and Novelty of Motivation: The paper proposes PA3FF as a novel 3D-native feature representation, directly addressing challenges faced by lifting existing 2D foundation features to 3D space, such as long runtime, multi-view inconsistencies, and low spatial resolution. PA3FF explicitly incorporates functional part awareness, which is crucial for generalizable articulated object manipulation. 2. Demonstrated Generalization Capability: The proposed PADP framework is claimed by the authors to achieve superior performance over baselines in both simulated and real-world environments. It demonstrates notable robustness, particularly in handling unseen objects, spatial generalization, and environment generalization tasks. 3. Methodological Completeness: The proposed approach features a complete multi-stage learning framework: 1) leveraging the pre-trained Sonata model to extract 3D geometric priors; 2) employing contrastive learning to fuse a geometric loss and a semantic loss, thereby enhancing feature part-awareness and semantic alignment ; and 3) integrating PA3FF into a diffusion policy for action generation. 1. The architectural modification of the PTv3 backbone (removing down sampling layers and stacking additional Transformer blocks) is a core engineering contribution. However, the paper lacks sufficient quantitative details (e.g., parameter count, FLOPs, precise layer configuration) and a dedicated gain analysis for these changes. This absence of detailed exposition and architectural diagrams prevents readers from adequately assessing its contribution to the final performance and hinders the reproducibility of the research. 2. The real-world experiments are evaluated with only 10 trials per task. This low number of evaluations in robotics may not provide sufficiently high statistical reliability to convincingly support the "state-of-the-art" claims. Furthermore, restricting the ablation study to a single task ("Put in Drawer") severely weakens the proof of generality for component contributions across a broad range of tasks and different generalization types (e.g., OI, OC). 3. PADP exhibits a significant drawback in inference speed compared to baselines like DP3 (4.23 FPS vs. 12.7 FPS). Although PADP achieves higher success rates, this over 60% reduction in inference speed has not been adequately justified as a necessary trade-off (i.e., whether a 10~20% success rate gain, which does not guarantee deterministic success, is worth the real-time cost). 
In real-time robotic control requiring high-frequency feedback or integration into large policy frameworks, this performance-efficiency trade-off may reduce its practical applicability. 1. Generalization Source and Action Semantics Decoupling: The PartInstruct benchmark tests generalization across various factors, including Object State (OS), Object Instance (OI), Part Combination (TP), Task Category (TC), and Object Category (OC). Please confirm if the model is trained only on the Training Set data. If so, how does PADP or its language encoding module achieve generalization over action direction semantics (e.g., generalizing from *forward* to *backward* action prediction as seen in Figure 12)? This requires explaining the policy's mechanism for understanding and decoupling non-object-related semantics in the language instruction. 2. Precise Role of Language Embedding in Feature Aggregation: The paper states that the "semantic embedding of the task-critical part name" is used as the CLS token in the Transformer encoder to guide aggregation. Concurrently, Language Instruction is shown as an input in Figure 2, Stage III. Please clarify which specific text input (task-critical part name vs. full language instruction) is input to the PERCEPTION part for embedding, and how it relates to the language information input to SigLIP during PA3FF training. 3. Definition of Real-World Evaluation Metrics: Table 2 reports Train/Test success rates for real-world tasks. Given the experiment statement "Each task is evaluated with 10 trials under randomized initial conditions", please explicitly define: Does the Train success rate represent testing under randomized initial conditions using the exact objects and scenes used for training? And does the Test success rate represent testing under randomized initial conditions using unseen object instances or environmental changes? Clarifying these definitions is crucial for interpreting the real-world generalization performance. 4. Feature Robustness and Cross-Topology Consistency: For the same object category with significantly different spatial morphologies, can PA3FF effectively cluster and align features? For example, regarding the faucets with topologically distinct structures shown in Figure 6, please provide a quantitative assessment of PA3FF features' cross-topology consistency/transferability between them. This is needed to more fully demonstrate the robustness limits of the part-aware features. 5. Completeness of the PADP Framework Flow: The overall flow of the PADP framework remains insufficiently clear. Please provide a detailed description of the training process for a new task, explicitly detailing which point cloud data requires manual labeling, how the data flows in the framework, and what information needs to be unified. Fully human-written
Learning Part-Aware Dense 3D Feature Field For Generalizable Articulated Object Manipulation Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper argues that robots handle new objects better when they reason about the parts that matter for action—like handles, buttons, and lids—rather than whole objects. The authors introduce a 3D "part-aware" feature field that turns a point cloud into dense features where points on the same functional part look alike, and those features are tied to plain-language part names. They then use this representation in a diffusion policy that conditions control on the named part, letting the robot plan motions directly from the 3D scene. Because the features are native to 3D, they're more consistent across viewpoints and make part boundaries clearer. In experiments spanning simulation and eight real-world tasks, the method outperforms strong 2D-feature and 3D-policy baselines, particularly when generalizing to unseen objects and states. The same features also enable point-to-point correspondence and unsupervised part segmentation, making the approach a broadly useful backbone for part-centric perception and manipulation. The paper introduces a 3D-native, part-aware representation that's aligned with language and plugs cleanly into a diffusion control policy, leading to strong generalization across unseen objects, states, and tasks. The evaluation is thorough—covering simulation and eight real-world tasks with clear five-way generalization splits—and shows sizable gains over both 2D-lifted and 3D baselines. Beyond control, the same features enable point correspondence and unsupervised part segmentation, and ablations clarify why the 3D-native design outperforms view-lifted alternatives. 1. The work has extensive evaluation on robot experiments but lacks quantitative evidence on the feature field quality. 2. Runtime/latency: PADP runs at ~4.23 FPS vs. DP/DP3 at ~12 FPS, limiting high-frequency control. 3. Dependence on part supervision & external text embeddings: training leans on labeled parts and SigLIP part-name embeddings; baselines may not use comparable supervision. Here are more feature splatting papers to cite: LERF: Language Embedded Radiance Fields (ICCV 2023) Feature Splatting: Language-Driven Physics-Based Scene Synthesis and Editing (ECCV 2024) M3: 3D-Spatial Multimodal Memory (ICLR 2025) Lightly AI-edited
Learning Part-Aware Dense 3D Feature Field For Generalizable Articulated Object Manipulation Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Summary: The paper proposes PA3FF, a 3D-native, part-aware dense feature field learned directly from point clouds, and PADP, a diffusion policy that uses PA3FF as a frozen perception backbone with language and robot-state conditioning. Together they target articulated-object manipulation with better sample efficiency and generalization to unseen objects, showing superior results to prior 2D/3D features plus diffusion-policy baselines in both simulation and real-world tasks, and also enabling downstream uses like correspondence and part segmentation. Main contributions: 1.PA3FF: a part-aware 3D feature field that enforces within-part feature coherence and between-part separability, trained with contrastive objectives and text alignment to functional part names. 2.PADP: a diffusion policy built on the frozen PA3FF backbone, conditioned on language cues and robot state, improving sample efficiency and cross-instance and cross-category generalization. 3.Strong empirical gains over representative baselines (e.g., CLIP/DINOv2 features and DP/GenDP families) across multiple generalization splits in simulation and eight real-world articulated-object tasks. 4.Versatility of the learned representation, supporting additional perception tasks such as shape correspondence and part segmentation. ### Quality * Pipeline is reasonably complete: pretrained 3D backbone, contrastive representation learning, language conditioning, and diffusion policy, with corresponding ablations. * Experiments cover simulation and a modest set of real tasks, include cross-instance/category splits, and compare against common 2D/3D feature and diffusion baselines with generally consistent gains. ### Clarity * Problem framing is clear: focus on functional-part consistency to address articulated-object manipulation bottlenecks. * Exposition is structured (representation → policy), with losses and inputs described in layers; implementation details are sufficient for high-level replication. ### Significance * Potential to reduce instance-specific engineering and data needs, especially under shifts from unseen objects or deformations. 1. Insufficient novelty (core issue) The paradigm—“part-aware dense 3D representation + language prompts + diffusion policy”—reads as a combination/tuning of existing components (NDF-style dense correspondence, DP3/GenDP-style 3D-aware diffusion, ULIP-style language–3D alignment). The manuscript does not present an indispensable conceptual increment (new inductive bias/new representational property/new problem formulation); current differences are mainly in implementation and loss engineering. — Suggestion: Use a “conceptual comparison + ablation proof” to pinpoint your **single unique idea**: show that removing that idea (e.g., the part-consistency term or specific field structure) causes a **significant** drop, and provide head-to-head results against the strongest nearby baselines (DP3/GenDP/NDF variants). 2. 
Lacking verifiable “necessity evidence” for the representation claim You claim “part-aware” beats a “generic 3D semantic field,” but there is no counterfactual under matched supervision and budget to show the advantage comes from the representation itself rather than backbone scale or the pretraining data distribution. — Suggestion: Under the **same backbone and training budget**, swap only “part-aware field ↔ generic semantic field,” and report cross-task/cross-object gains with significance tests. 3. Unclear scope of the language module’s contribution Conditioning/alignment on part names is not novel, and there is no degradation/robustness quantification (synonyms, hierarchical terms, noisy/wrong labels, no-language variant) to show language is a **key** driver rather than a cosmetic add-on. — Suggestion: Provide curves of **language-noise strength → performance**, and report **per-task/per-part** marginal contributions. 4. Heavy reliance on strong pretrained backbones; insufficient factorized ablations A large portion of the gains may come from PTv3/Sonata-style pretraining; current ablations do not sufficiently disentangle backbone capacity from your objectives/structure. — Suggestion: Run **full-factorial ablations** (backbone type × with/without large-scale pretraining × with/without language alignment, contrastive losses, part supervision, structural tweaks). 1. Please state the paper’s **single indispensable conceptual increment** (not implementation or loss details) and explain what **new inductive bias/representational property** it introduces. 2. Please provide a **conceptual comparison table** contrasting NDF / DP3 / GenDP / ULIP / *this work* , and mark which elements are **first introduced** by this paper. 3. Please report ablations that **remove the key new component(s)** (e.g., part-consistency loss, specific field structure, alignment mechanism): do all primary metrics **drop significantly**? Include statistical significance. 4. Under **identical sensing inputs, action space, number of demonstrations, and training budget**, run **head-to-head** comparisons against the strongest nearby baselines (DP3/GenDP/NDF variants). If the method still does not win, explain how the claimed “novelty” stands. 5. Please provide a **correlation analysis from field quality to performance**: e.g., within-part coherence, cross-view consistency versus success rate, with correlation coefficients and visualizations. 6. Please add **degradation and robustness** studies for the language component: * Synonym substitution (handle/knob/grip), hierarchical terms (door handle vs. handle); 7. How is the vocabulary constructed and disambiguated? How do you handle **same-name different parts** or **cross-category semantic drift**? Please provide **error cases and proportions**. 8. Please provide a **full-factorial ablation**: {Backbone: PTv3 / Point Transformer / others} × {With/without Sonata or equivalent large-scale pretraining} × Report results in **both simulation and real** settings. **If the authors can satisfactorily address these questions, I would raise my score.** Fully AI-generated
Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses object hallucination, a phenomenon prevalent in existing VLMs that is typically caused by over-reliance on language priors. The authors propose Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT), a simple training-free method that pre-computes a visual saliency map by tracking positive changes in visual attention, which is then leveraged to amplify attention at the decoding step. They demonstrate its effectiveness on multiple hallucination benchmarks. - Presentation. The overall presentation of this paper is clear. - Clarity. The overall idea is straightforward and easy to follow. - Comparison to Attention Modification Approaches. Note that there is a series of studies [A,B,C,D] that focus on fixing attention patterns to address object hallucination in this field, yet the authors ignore such discussions, which should be included. This could involve discussions in related work and performance comparisons. [A] Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding. CVPR 2025. [B] Mitigating Object Hallucination via Concentric Causal Attention. NeurIPS 2024. [C] Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention. CVPR 2025. - Performance on General Benchmarks. How are the performance gains on general benchmarks rather than hallucination benchmarks, for example, MMStar or MMBench? See weakness 2. Fully human-written
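A schematic of the mechanism summarized in the review above (tracking positive changes in visual attention across informative query tokens, then rescaling attention with the resulting saliency map) might look like the following; the tensor shapes and the rescaling rule are assumptions made for illustration, not the paper's exact formulation.

```python
# Schematic "gaze shift" saliency and attention rescaling
# (illustrative assumptions, not GIFT's exact method).
import torch

def gaze_shift_saliency(attn_to_image, informative_mask):
    """attn_to_image: (num_query_tokens, num_image_tokens) attention from query tokens
    to image tokens; informative_mask: (num_query_tokens,) bool for content-bearing tokens.
    Saliency accumulates only *positive* changes in visual attention between
    consecutive informative query tokens."""
    rows = attn_to_image[informative_mask]                  # keep informative tokens only
    deltas = (rows[1:] - rows[:-1]).clamp(min=0)            # positive attention shifts
    saliency = deltas.sum(dim=0)
    return saliency / (saliency.sum() + 1e-8)               # normalized map over image tokens

def rescale_attention(attn_row, image_slice, saliency, alpha=1.0):
    """Amplify attention on image tokens in proportion to saliency, then renormalize
    so the row still sums to one (query-token attention is rescaled implicitly)."""
    out = attn_row.clone()
    out[image_slice] = out[image_slice] * (1.0 + alpha * saliency)
    return out / out.sum()

# Toy usage: 5 query tokens (3 informative), 8 image tokens
attn = torch.rand(5, 8)
mask = torch.tensor([True, False, True, True, False])
sal = gaze_shift_saliency(attn, mask)
print(rescale_attention(torch.rand(12), slice(0, 8), sal))
```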
Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces Gaze Shift-Guided Cross-modal Fusion Enhancement, a novel method for mitigating hallucinations in VLMs. The proposed method tracks changes in visual attention, referred to as gaze shifts, during the processing of information-rich query tokens. These gaze shifts are then used to create a visual saliency map that guides cross-modal fusion, enhancing both visual and query attention during decoding. The paper demonstrates that GIFT effectively reduces hallucination in VLMs across various tasks and datasets, providing significant improvements in hallucination mitigation while maintaining general performance with low computational overhead. 1. The idea of using gaze shifts to dynamically adjust visual attention in VLMs is a novel and promising approach. It effectively addresses key challenges in cross-modal fusion and visual attention misallocation (visual attention sink), which are critical issues in VLM performance. 2. The paper provides extensive experiments that show GIFT achieves up to 20.7% improvement in hallucination mitigation, outperforming existing methods across several vision-language datasets and models of varying architectures. 3. GIFT demonstrates that it can improve hallucination mitigation without introducing significant computational overhead, making it a practical solution for inference-time interventions in VLMs. 1. Some formulas are missing concluding punctuation (e.g., periods at the end of equations). Sections 5 and 6 could be merged. Both sections discuss experimental results and analyses, and their separation feels redundant. Combining them into a single cohesive section would improve the flow and clarity of the paper. 2. The experiments in the paper are mainly focused on the LLaVA model, which limits the generalizability of the results. Although the authors show promising results for LLaVA, there is no comprehensive evaluation on other popular VLMs or tasks. This raises concerns about the method's applicability to a wider range of models and real-world scenarios. The lack of a broader experimental comparison is the primary reason I am rating this paper 6/10 instead of 8/10, as it makes it difficult to assess whether GIFT is a universally applicable solution or if it is specific to certain architectures. 3. While the paper emphasizes that GIFT maintains a low computational cost compared to some baselines, it still incurs a slight increase in latency (1.13x compared to greedy decoding). However, this is not a major issue. See Weakness. Moderately AI-edited
Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper addresses the critical issue of hallucination in VLMs. The authors propose GIFT (Gaze Shift-Guided Cross-Modal Fusion Enhancement), a lightweight inference-time method inspired by human visual gaze dynamics to tackle: (1) misallocated attention to irrelevant visual regions, (2) over-reliance on linguistic priors, and (3) imbalanced cross-modal fusion. 1. GIFT introduces a human-inspired "gaze shift" tracking approach that addresses a critical gap in existing work: static attention averaging (used by baselines like VAF) often misallocates attention to irrelevant regions. 2. It integrates into existing VLMs without retraining, unlike training-based methods that incur high computational costs. 3. It consistently improves performance across diverse VLMs (LLaVA-1.5 7B/13B, Qwen2-VL 7B) and tasks (object detection, captioning, VQA), demonstrating its versatility. 1. GIFT heavily relies on "information-rich query tokens" (identified via POS tagging) to compute accurate saliency maps. The authors acknowledge that vague, ambiguous, or visually irrelevant queries (e.g., "Describe this image" without specific cues) may lead to inaccurate maps and reduced hallucination mitigation. However, they do not provide concrete strategies to handle such cases—e.g., no analysis of performance on low-specificity queries or a fallback mechanism for query-scarce scenarios. 2. While the authors tune key hyper-parameters, they lack a deeper analysis of how these choices generalize. 3. GIFT is compared to three baselines (VAF, Rel-Attn, VAR) but not to recent state-of-the-art contrastive decoding methods [1,2]. These methods reduce hallucination by contrasting outputs with perturbed visual inputs and have shown strong performance on VLM hallucination tasks. Omitting this comparison limits the paper's ability to position GIFT against the broader landscape of mitigation strategies. [1] Mitigating Object Hallucinations in Large Vision-Language Models Through Visual Contrastive Decoding. [2] Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models. Please refer to Weaknesses. Moderately AI-edited
Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes GIFT, an inference-time hallucination mitigation method for VLMs. The key novelty lies in creating a visual saliency map by tracking positive changes in visual attention during comprehension of information-rich query tokens. Unlike previous approaches that only enhance visual attention, GIFT also proportionally adjusts query token attention to preserve cross-modal fusion balance, reducing the risk of visual attention sink and low visual contribution. Evaluations on multiple hallucination benchmarks (CHAIR, POPE, MMHal-Bench) and general VLM benchmarks (MME, SEED-Bench) show significant decrease in hallucination rates with minimal impact on overall reasoning capabilities. The method is lightweight, training-free, and generalizes across different VLM architectures. 1. Clear and intuitive idea: The gaze shift concept is easy to understand and well-motivated by human visual attention behavior. 2. Addresses multiple issues simultaneously: Tackles visual attention sink, low visual contribution, and imbalanced cross-modal fusion, which existing methods often address in isolation. 3. Low computational overhead: Achieves improvements with modest runtime increase compared to greedy decoding. 4. Comprehensive evaluation: Experiments span multiple benchmarks, models, and hallucination types, with ablation to validate design choices. 1. Experimental setting is somewhat outdated: The chosen base VLMs (LLaVA-1.5 series and Qwen2-VL) were released over a year ago. More recent models—such as LLaVA-Next, InternVL—implement Dynamic High Resolution image processing, which could impact saliency computation. Testing the method on these architectures would strengthen claims about generality. 2. Limited hallucination benchmarks: Evaluation could include newer datasets such as HallusionBench or other recent challenging hallucination tasks to better measure robustness. 3. Interpretability validation missing: Since the method relies heavily on saliency maps, adding segmentation-based experiments from classic interpretability literature could reveal whether human-perceived semantic enhancement indeed contributes to hallucination reduction[1, 2]. 4. Hyperparameters vary per model without explicit robustness test: While tuning is explained, an ablation on sensitivity to hyperparameter changes would reinforce the robustness claim. [1] Optimizing Relevance Maps of Vision Transformers Improves Robustness [2] Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers See weaknesses. Moderately AI-edited
Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper shows that homogeneous MAS workflows (same base LLM, different prompts/tools) can be simulated by a single LLM via multi-turn role-play. It then proposes OneFlow, which consists of two parts: (1) searching for an optimized workflow and (2) performing a single-LLM implementation. Across six benchmarks, single-agent execution often matches or slightly exceeds multi-agent performance at a much lower cost. * The paper proposes an interesting point of view. * The experiments cover six benchmarks and report both accuracy and cost to support the claims. * The OneFlow method consists of two parts: searching for an optimized workflow and performing a single-LLM implementation. The first part seems like an improved version of AFlow and lacks novelty; for example, the critic prompt is adopted from AFlow. * The costs for single-agent are simulated due to closed-weight APIs; add open-weight runs (or vendor KV-sharing APIs) to validate real-world latency/$ savings. * While the method mentions tool calling, the benchmarks tested are static QA/math/code; include tool-use tasks with external side-effects and interactive settings. * See weakness. * Clarification: In 4.2 single-LLM simulator, it says "Set the system message to p_{i_t}"; does this mean replacing the system prompt at the beginning? Can you give an example of how this is different from a multi-agent system? Fully human-written
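Regarding the clarification question in the review above, one hedged reading of "Set the system message to p_{i_t}" is that each workflow step is played as one more turn in a single chat, with the step's role prompt injected before that turn rather than spawning a separate agent context. The sketch below illustrates that assumed reading; `chat` is a hypothetical single-LLM API, not the paper's code.

```python
# Sketch of simulating a homogeneous multi-agent workflow as one multi-turn chat
# (assumed reading of the paper; `chat` is a hypothetical single-LLM API).

def run_workflow_single_llm(chat, role_prompts, task, step_order):
    """role_prompts: dict mapping role name -> role/system prompt p_i.
    All steps share one conversation, so the prefix (and its KV cache) is reused."""
    messages = [{"role": "user", "content": f"Task: {task}"}]
    for role in step_order:
        # Instead of a fresh agent whose system prompt is p_i, the role prompt is
        # injected as an instruction turn inside the same conversation.
        messages.append({"role": "user",
                         "content": f"Now act as: {role_prompts[role]}"})
        reply = chat(messages)                    # single LLM, full shared history
        messages.append({"role": "assistant", "content": reply})
    return messages[-1]["content"]                # final step's output

# In a true multi-agent setup, each role would instead get its own context:
#   chat([{"role": "system", "content": role_prompts[role]},
#         {"role": "user", "content": relevant_history}])
# which duplicates the shared prefix and forfeits KV-cache reuse.
```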
Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper investigates whether the advantages of multi-agent systems built from homogeneous LLMs can be replicated by a single LLM through multi-turn interactions and KV-cache sharing. The authors empirically evaluate this hypothesis across six benchmarks (code generation, mathematics, and QA tasks) and introduce OneFlow, an automated workflow design algorithm that employs dual meta-LLMs (Designer and Critic) under an MCTS framework. The results suggest that single-agent implementations can match or exceed the performance of multi-agent workflows while substantially reducing inference cost. The paper further discusses the limits of this equivalence in heterogeneous multi-agent contexts and proposes directions for future research. - **S1.** The paper tackles a timely and important issue whether multi-agent systems provide real advantages over single-agent reasoning when the base LLM is homogeneous. - **S2.** Well-explained theoretical formulation that logically connects shared KV cache to computational efficiency. - **S3.** Comprehensive experimental coverage across six benchmarks and one domain-specific dataset. - **W1.** The OneFlow framework largely replicates the AFlow architecture with minor adaptations. The use of MCTS for workflow generation is not new, and the manuscript does not clearly articulate what conceptual or technical innovation distinguishes OneFlow from AFlow. - **W2.** The evaluation primarily relies on closed-weight models (GPT-4o-mini, Claude 3.5 Haiku), and the KV-cache advantages are simulated rather than directly measured. The real experiments using open models capable of genuine KV sharing are absent, limiting the credibility of efficiency claims. - **W3.** The paper primarily contrasts with AFlow and manual CoT baselines, omitting recent heterogeneous agentic frameworks (e.g., MasRouter) that could reveal where single-agent designs fail. - **W4.** No exploration of when and why the single-agent execution begins to fail (e.g., under longer reasoning chains or tool dependencies). - **Q1.** Could the authors provide concrete evidence (with open-weight models) that KV-cache reuse yields measurable cost savings in practice rather than theoretical simulation? - **Q2.** How does OneFlow's Designer-Critic interaction differ algorithmically from AFlow's meta-LLM setup beyond re-using prompts? - **Q3.** Several datasets (e.g., GSM8K, MBPP) are solvable via direct prompting. Have the authors tested tasks that genuinely require multi-stage *agentic* reasoning or tool usage? Fully AI-generated
Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper questions whether multi-agent LLM workflows truly outperform single LLMs when all agents share the same base model. It formally shows that for homogeneous workflows (same base LLM, different prompts/tools), a single LLM can simulate the entire multi-agent pipeline through multi-turn dialogue while reusing the KV cache, gaining efficiency without loss of expressivity. Based on this, the authors propose OneFlow, an automatic workflow design framework using dual meta-LLMs (Designer + Critic) and Monte-Carlo Tree Search to generate workflows optimized for single-agent execution. Across six benchmarks (HumanEval, MBPP, GSM8K, MATH, HotpotQA, DROP) and one domain-specific Shopping-MMLU set, OneFlow-single achieves comparable or better performance than existing multi-agent frameworks (AFlow, etc.) while cutting inference cost by up to 10×. The paper concludes that homogeneous MAS can be largely simulated by a single agent and that future work should focus on truly heterogeneous systems. The paper reframes multi-agent research with a rigorous single-agent equivalence argument. It covers six general benchmarks plus domain-specific tasks and quantifies KV-cache benefits clearly. OneFlow's dual meta-LLM + MCTS design is creative and reproducible, and the paper clearly delineates where single-agent simulation applies and where heterogeneity still matters. Limited empirical heterogeneity analysis: Pilot study is small; results inconclusive about real multi-model synergy. Simulation of KV cache: Since APIs hide internal caching, efficiency results are theoretical. A small open-weight replication (e.g., LLaMA-3 8B) would strengthen credibility. Ablations: Lack of ablation on MCTS parameters (α, β, iterations) and meta-LLM roles; unclear how much each contributes. Over-dependence on closed models: Limits reproducibility beyond cost estimation. Writing could be tighter: Some redundant explanations and long prompts in appendix. Can you verify KV-cache reuse gains empirically using an open-source model? How sensitive are results to the α/β weights in Eq. (1)? Would OneFlow still outperform AFlow if inference cost were excluded (i.e., pure accuracy metric)? Have you tested whether role-switching (different prompts within same chat) affects coherence or context interference? How does OneFlow perform when the base model has small context windows (e.g., 4k tokens)—does summarization degrade accuracy? Fully AI-generated
Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper examines the value proposition of LLM-based multi-agent systems in settings where current frameworks are largely homogeneous. The authors empirically show that a single LLM using multi-turn conversation with KV cache reuse can match or outperform such multi-agent workflows in both performance and cost across six benchmarks spanning coding, mathematics, and question answering. Building on this finding, they propose OneFlow, an algorithm for automatic, cost-aware workflow optimization tailored for single-agent execution without compromising accuracy. 1. The paper offers a reassessment of the prevailing practice of homogeneous multi-agent workflows in LLM systems, combining theoretical reasoning with strong empirical evidence. This provides an important sanity check for the rapidly growing MAS literature. 2. Experiments on six standard and one domain-specific dataset using multiple LLMs convincingly show that single-agent execution can match or surpass homogeneous multi-agent performance while reducing cost substantially. 1. The heterogeneous experiments (Table 3) rely on automatically generated workflows with unclear tuning and no ablation of model-assignment policies. Thus, the claim that a single-LLM implementation can outperform heterogeneous setups is only provisional. 2. The OneFlow search process is fixed and shallow, with no analysis of sensitivity to search depth, hyperparameters, or model choice. This leaves the robustness of the optimization procedure underexplored. 3. While quantitative results are comprehensive, there is little discussion of failure cases or qualitative differences between single-agent and true multi-agent behaviors, leaving interpretability and diagnostic insight limited. 1. How does OneFlow’s workflow quality and cost-performance trade-off scale with deeper or wider search? Is the dual meta-LLM architecture robust to prompt or model changes? 2. Have the reported KV-cache efficiency gains been validated using open-weight models that support cache reuse, or might the simulated API-based estimates introduce systematic bias? Fully AI-generated
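Several of the reviews above ask for empirical validation of the KV-cache savings; a toy prefix-reuse calculation (illustrative numbers only, not measurements from the paper) shows where the theoretical gap comes from.

```python
# Toy comparison of prompt tokens processed by a 3-step homogeneous workflow
# when each role runs in its own context vs. one shared multi-turn chat
# (illustrative numbers; no real KV-cache measurement).

shared_prefix = 1200     # task description + accumulated workflow history (tokens)
role_prompt = 150        # per-role instruction
steps = 3

# Separate agent contexts: every step re-submits (and re-computes) the prefix.
separate = steps * (shared_prefix + role_prompt)

# Single chat with KV-cache reuse: the prefix is processed once, each step adds
# only its role instruction (prior turns stay cached).
single_chat = shared_prefix + steps * role_prompt

print(separate, single_chat)   # 4050 vs. 1650 prompt tokens in this toy setting
```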
TusoAI: Agentic Optimization for Scientific Methods Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces TusoAI, an agentic AI framework that automatically develops and optimizes computational methods for scientific tasks. Given task descriptions and evaluation functions, it can generate new algorithms rather than just run existing analysis pipelines. The main idea is to structure domain knowledge into a “knowledge tree,” combine it with Bayesian-guided iterative optimization and diagnostic feedback, and iteratively generate, implement, and evaluate candidate solutions. On 11 benchmarks, TusoAI achieved a higher average rank compared to expert-designed methods and existing AI agents. - Clear motivation and problem framing. The paper focuses on creating methods rather than just executing pipelines. It specifies the objective and agent loop, including the task description and the evaluator $h(\cdot)$, with cold/warm start. - Breadth of benchmarks with application case study. Empirical scope spans 11 tasks across two domains (six single-cell, five scientific deep learning). - The warm-start improvements on scDRS and pgBoost are interesting. It is interesting to consider concrete empirical deltas and biological findings (e.g., new T-cell disease associations; rs138917529→GCK link). - Too many symbols in Algorithm 1. If the authors want to introduce these annotations, they should include an appendix table. - Baseline coverage is insufficient, mainly including general-purpose agents (AIDE, Biomni, ChatGPT-Agent) and simple expert or AutoML models. It lacks domain-specific, publication-level baselines for each scientific task. Including parts of the leaderboard is recommended. - Ablation and diagnosis analyses lack depth. The ablation study in Table 3 mixes tasks from unrelated domains. It omits per-task variance or statistical testing, making it impossible to isolate the contribution of each component (knowledge tree, Bayesian updates, diagnosis). - No detailed analysis of failure cases or performance drops is provided, so readers cannot understand when the system fails or regresses. - Case studies serve as illustrations but lack control. The genetic applications (scDRS, pgBoost) show improvements, yet they lack independent validation or error analysis. - Can the authors clearly specify what metric is used in each column of Tables 1 and 2, and how the “Avg” and “Avg rank” values are computed across heterogeneous tasks? If normalization was applied (e.g., 1 – MSE, scaled decomposition R²), please make this explicit in the main text or table captions. - How do the qualitative examples, such as NMF with dropout/Poisson modeling and Satellite ensembling, support the narrative of the methods being "distinct" or "custom"? - Is using text similarity (Fig. 2A) a sufficient proxy for diversity, or should a more principled novelty metric be employed in the evaluation? - This work’s broad, application-driven scope and descriptive evaluation would be better suited for an applied-science journal rather than a methodology-focused venue like ICLR. Lightly AI-edited
TusoAI: Agentic Optimization for Scientific Methods Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents TusoAI, a method that hand-designs a set of steps for automated model discovery using LLMs. The authors compare their method against baseline methods on a set of tasks and find that their approach achieves good performance with high diversity in the suggested solutions. - the problem of automated computational model design is important - ablation studies are provided to reveal some insight into the importance of the method's components - strong results are achieved on a set of tasks - the paper lacks any discussion or conclusion - the Bayesian update step is unclear and not described - the method seems overall very heuristic. A (theoretical) justification of the algorithm would be of great value. - additionally the algorithm is rather complex and the explanations are difficult to follow at times - figures are sometimes unclear and would need re-working - what does your acronym TusoAI stand for? - Tables 1 and 2: How is performance defined in these tasks? - Is the comparison in the experimental section fair in terms of computational load? How long are baseline methods run compared to the 8h for TusoAI? - Figures 2, 3, and 4: font size is too small and hard to read. - line 323: How were the text embeddings computed? - Figure 2B: Does this "optimization trajectory" look similar for all other tasks? Are there examples where the algorithm converges fast and then does not find a better solution? - In the methods there is no explanation of how the Bayesian update is performed. Can you explain where and how exactly a Bayesian update is performed and what the priors and posteriors involved are? - Table 3: Do the ablations only disable one of the parts (i.e. "No Bayesian" only disables the Bayesian part but keeps all other parts)? - line 380: How do you define the mean time to optimize? What is the criterion that optimization is achieved? - line 385: "Results are shown in Table ??" - Figure 3 B/C: What do the circles show? How should we read and interpret those figures? - Figure 4: - what does the distance threshold mean? - How should we read and interpret Figure 4C? Fully human-written
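On the reviewer's question about the Bayesian update: one common pattern for "Bayesian-guided" selection in such agents, offered here only as a guess at what the mechanism could look like rather than a description of TusoAI, is a Beta-Bernoulli posterior over how often each optimization-instruction category improves the score, sampled Thompson-style.

```python
# Hypothetical Beta-Bernoulli / Thompson-sampling interpretation of a
# "Bayesian-guided" instruction choice (a guess for illustration, not TusoAI's actual update).
import random

class InstructionSelector:
    def __init__(self, categories, prior=(1.0, 1.0)):
        # Beta(alpha, beta) prior on the probability that a category improves the score.
        self.params = {c: list(prior) for c in categories}

    def pick(self):
        # Thompson sampling: draw from each posterior, pick the best draw.
        draws = {c: random.betavariate(a, b) for c, (a, b) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, category, improved):
        # Posterior update from the observed outcome of one optimization attempt.
        a, b = self.params[category]
        self.params[category] = [a + 1, b] if improved else [a, b + 1]

# Toy usage with hypothetical instruction categories
sel = InstructionSelector(["change_model", "add_preprocessing", "tune_loss"])
cat = sel.pick()
sel.update(cat, improved=True)
print(sel.params)
```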
TusoAI: Agentic Optimization for Scientific Methods Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. ## Summary - The paper introduces TusoAI, an LLM agent for developing computational methods for domain-specific scientific tasks. They apply it to several biological and deep learning problems and demonstrate improvement over other agentic and LLM models. They also demonstrate its performance on iterating on existing state-of-the-art methods and show new scientific insights. ## Recommendation - Reject. While results comparing with other agentic models are there, the actual performance isn't that much better compared to existing models on the specific tasks. The most interesting results are the case studies, but the other models were not run on the same case studies. There is a lack of case studies in other domains. Code is poorly written and not amenable to open source. ## Strengths - Model architecture is properly motivated and described in sufficient detail. - The utility of the model is conveyed properly; I can see the usefulness of this. - Code is provided. ## Weaknesses - TusoAI is designed to work with any domain as long as domain knowledge is provided; however, there is a strong focus on biological applications. This paper would benefit from showing TusoAI's performance in other scientific domains similar to the two biological case studies. I see that it worked on deep learning but only a cold start was performed. I am specifically asking for a warm start case study in a non-biology domain. - Diversity in methods produced is a core architecture consideration, but it is not substantiated in the text. While evidence is provided that TusoAI writes diverse code, there are no references supporting the assumption that generating more diverse code is better. - Code is poorly written / formatted. ## Questions - Was an ablation study performed on the number of points saved for each paper's methods in Step 1 "Gather domain knowledge"? How does increasing/decreasing the number of point summaries affect the final model performance? - Why were other models not tested on the case studies? A lot seems to be riding on TusoAI's new findings, but other models were not given a chance. - How is the iterative refinement done for "The initialization agent Ainit drafts 5 candidate solution descriptions from T and iteratively refines them using each paper summary in P"? It is not clear and appears to be super important. - How many hyperparameters are there total for this agent? It appears to have a lot, and I would be curious how those hyperparameters affect the solutions (i.e., run time, integer values used in various places, utility scalars, probabilities, bug-fix attempts, etc.). - Which hardware is used for all of this? 8 hours means a totally different thing on a laptop vs. a data center. Are all models tested on the same hardware? I see some models are only accessible on the web: how do you account for different compute capabilities? - How many runs were used for ablation analysis (Table 3)? Was it just run once, or several times with mean score? If it was run once, then it needs to be run more times, since LLMs are very stochastic and the results can vary significantly. - Where is the conclusion/future work section? The paper seems to end abruptly.
## Feedback - The code is a mess. There are import statements all over the place, comments are unstructured and lacking in many places, and there are no clear instructions on how to reproduce results. Please fix this so I can take a look at the code myself and run things locally. - I_{diag} is not updated in Algorithm 1; it appears to remain an empty list. Please consider removing it or correcting this. - Line 265: Change "instruction by first uniformly draws" to "instruction by first uniformly drawing". - The claims (line 297, first paragraph in section 4.1) that TusoAI provides "novel" and "computationally efficient" methods are not substantiated. Provide quantitative results for computational efficiency and a better empirical comparison of the results across all models. Simply stating the method TusoAI came up with does not mean it's novel. - Please make it clear which model is presented in figure 2B in the caption. This specific sub-figure would benefit from a better description in general. - Please provide motivation for why we care about generating diverse code. I would argue that if an agent can produce better methods, then it doesn't matter how diverse the methods produced along the way are. This is a central point studied in the paper that needs to be substantiated. In fact, producing lots of diverse models is at odds with the 8-hour time limit, since I can also imagine a model would benefit from trying to iterate on a model that works decently well instead of making smaller changes to a larger number of models in the same period of time. - Line 386, please correct: "Table ??". - Line 430: "This improvement reflects TusoAI’s ability..." is not really a fair comparison. TusoAI started from the work the original authors had already done. Please correct this sentence to show how TusoAI expands upon the original work in a short amount of time, instead of making it seem as if it tested 167 unique versions _from scratch_. - Same comment for line 464 for the second case study. - The text in the figures is tiny. Please make the text larger so it's easier to read instead of having to zoom in 300% to make out what it says. - Line 1244: "It's samples" change to "Its samples". Fully human-written
TusoAI: Agentic Optimization for Scientific Methods Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper develops TusoAI, an agentic AI system that, given a scientific task description and an evaluation function, autonomously develops and optimizes computational methods for the application. TusoAI is tested across several single-cell tasks and some general deep learning tasks. 1) Mixing Bayesian optimization in the loop of an agentic AI system is interesting. I wish the paper had substantiated whether this is the first agent of this type or whether the idea is inspired by existing agents. 2) The results show that some of the methods constructed by TusoAI are novel rather than simple re-implementations of existing approaches or calls to standard packages. 3) The agentic system is tested in many scenarios, from final performance to behavior over time. 4) The single-cell experiments show promising biological results. 1) The writing can be improved. For example, the Abstract is excessively long. Instead of listing all results, please summarize the key points and condense the Abstract's opening. 2) While the evaluations involve tasks beyond single-cell analysis, it seems like TusoAI's performance is more pronounced in the single-cell case, which might suggest the paper could have made that the central focus rather than a general-purpose multi-agent AI. 3) The code is not available for a full review of the paper. 4) Narrow scope to problems that can be solved using off-the-shelf ML models and strategies rather than those that need in-depth model development (e.g., new deep learning models rather than, say, fine-tuning ResNet). 5) As discussed in the *LLM-based general learning agents* section, there are already agents developed for ML engineering. This work differentiates itself from those by focusing on scientific domains, but some baselines are still missing (R&D, DS, etc.). 6) Limitation in scale: only 10 papers retrieved from Semantic Scholar. 7) I have some concerns about model selection and evaluation, which I have asked in the box below. 1) The key motivation behind TusoAI is to develop new computational solutions to existing problems in science. Do the results show a new computational method developed beyond your expectations? One that is far beyond the reach of human ML engineers? 2) Is there an underlying optimization problem that the Bayesian framework is solving? 3) What is the convergence criterion of the system? Figures 2A and 2B suggest that TusoAI never really converges, as code diversity oscillates and performance dips. Is this expected or desired? How did you decide when the algorithm should be terminated? Are there any prompting guardrails that avoid oscillations? 4) Does increasing the number of papers retrieved from Semantic Scholar (beyond 10) improve the results? 5) Can you elaborate on how the paper ensures the model selection has been done in an apples-to-apples way across the datasets/baselines? Is model evaluation also handled by TusoAI or externally? Fully human-written
TusoAI: Agentic Optimization for Scientific Methods Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces TusoAI, an LLM-based agentic system for automatically developing and optimizing computational methods for scientific applications. The system structures domain knowledge into a hierarchical knowledge tree, uses Bayesian updates to sample optimization strategies across categories, and incorporates diagnostic feedback during iterative refinement. The authors evaluate TusoAI on 11 scientific benchmarks (6 single-cell analysis tasks and 5 deep learning tasks), showing consistent improvements over baselines including AIDE, Biomni, and expert-designed methods. Two case studies in genetics demonstrate practical utility by improving existing methods (scDRS and pgBoost) and uncovering novel biological insights. 1. The paper evaluates TusoAI on 11 tasks spanning single-cell analysis and scientific deep learning, demonstrating broad applicability. The inclusion of both cold-start and warm-start settings, along with real-world case studies in genetics, strengthens the practical relevance. 2. The knowledge tree representation provides a structured way to organize domain knowledge, and the Bayesian update mechanism offers a principled approach to balancing exploration and exploitation. The diagnostic-based optimization is a nice addition that grounds the system in empirical data patterns. 3. TusoAI consistently outperforms strong baselines including AIDE and domain-specific expert methods. The ablation studies effectively demonstrate that each component (categories, Bayesian updates, diagnostics, domain knowledge) contributes meaningfully to overall performance. 1. The core algorithmic components are largely existing techniques (LLM agents, tree-based search, iterative optimization), and the main contribution is their combination for scientific method development. The concurrent work by Aygun et al. (2025) appears to address very similar problems, but the paper dismisses direct comparison due to unavailable code without sufficiently differentiating the approaches conceptually. 2. The 8-hour optimization budget is relatively short and may favor methods that converge quickly over those requiring longer refinement. The diversity analysis (Figure 2A) shows TusoAI explores more than AIDE, but it's unclear whether this translates to better generalization. The selection strategy (shortest code within 0.1% of best performance) seems ad-hoc and could inadvertently favor simpler but less robust solutions. 3. The paper doesn't discuss when TusoAI fails or performs poorly, nor does it address practical concerns like how to set appropriate evaluation functions for novel scientific problems, how sensitive the system is to task description quality, or how to validate generated methods when ground truth is unavailable. The computational cost analysis is limited to one brief mention ($0.37-$0.41 for case studies). 4. The case studies claim to discover "9 new associations" and "7 previously unreported links," but these are computational predictions that would require experimental validation to be considered true biological discoveries. 
The statistical methodology for declaring associations as "novel" versus "missed by previous methods" is not clearly described. 5. While the authors promise to release code upon publication, key details are missing: which specific LLM API versions were used, how were papers selected from Semantic Scholar (just citation count?), what are the exact prompts for different agents, and how stable are results across different random seeds beyond the reported 95% CIs? 1. How does TusoAI handle the cold start problem for truly novel scientific domains? The knowledge tree construction relies on retrieving papers from Semantic Scholar based on task descriptions. For emerging research areas with limited literature, how does the system bootstrap domain knowledge? Have you tested TusoAI on tasks where relevant literature is scarce or the task description is deliberately vague? 2. Can you provide more details on the comparison with Aygun et al. (2025) and clarify the key technical differences? Beyond the unavailability of their code, what are the conceptual and methodological differences between your approach and theirs? This seems critical for establishing the novelty of your work. Fully AI-generated
One Bad Sample May Spoil the Whole Batch: A Novel Backdoor-Like Attack Towards Large Batch Processing Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a novel Batch-Oriented Backdoor Attack named BOBA, which aims to control the classification results of all the samples in a batch by poisoning only one of them. BOBA exploits an intrinsic mechanism of the Batch Normalization (BN) layer in deep learning models, where the BN layer relies on the statistics of the current batch. This allows a single anomalous sample to contaminate the mean and variance of the entire batch, thereby affecting the feature representations of all other normal samples within it. Notably, on CIFAR-10, BOBA can cause 848 of 1024 samples within a batch to be misclassified while manipulating only 10 poisoned samples, highlighting the severity of the security risk in BN layers. 1. This work reveals a security risk in the Batch Normalization (BN) layer of deep learning models and demonstrates that its intrinsic mechanism can be exploited to implant a backdoor. 2. This backdoor attack considers the scenario of batch data processing, which has a certain degree of practical relevance. 1. The approach of designing attacks by exploiting the intrinsic mechanisms of deep learning models is not highly novel, as similar research already exists. For example, Yuan et al. [1] designed an attack by utilizing the random neuron dropping mechanism of Dropout, while Wei et al. [2] implanted a backdoor by leveraging the down-sampling mechanism in DL models. 2. The paper's threat model states that the backdoored model is delivered to the user as a black-box product. However, the experimental section severely lacks an evaluation against current, state-of-the-art, general-purpose black-box backdoor defense methods, such as [3] [4] [5]. These defense methods require no prior knowledge of the backdoor attack and align perfectly with the paper's black-box setting, yet the paper lacks comparative experiments for BOBA against these SOTA defenses. 3. In Section 4.1, the paper sets the default training batch size to n = 1024, while in Table 2, the authors evaluate three different inference batch sizes (512, 1024, 2048). The authors need to clarify the experimental setup: for each column in Table 2 (e.g., (n = 512)), are the results obtained from a model trained with the corresponding batch size (n = 512), or are all results derived from a single model trained with the default batch size (n = 1024)? In other words, did the experiments train a separate model for each inference batch size n, or did they use one fixed model trained with (n = 1024) and test it under different batch sizes? If the training batch size is fixed, how does BOBA perform at much larger batch sizes (e.g., 10,000)? If a different model must be trained for each batch size, this would weaken the practicality of the attack. 4. An analysis of the computational cost of the BOBA training process is missing. [1] Yuan A, Oprea A, Tan C. Dropout attacks[C]//2024 IEEE Symposium on Security and Privacy (SP). IEEE, 2024: 1255-1269. [2] Wei C, Lee Y, Chen K, et al. Aliasing backdoor attacks on pre-trained models[C]//32nd USENIX Security Symposium (USENIX Security 23).
2023: 2707-2724. [3] Guo J, Li Y, Chen X, et al. SCALE-UP: An Efficient Black-box Input-level Backdoor Detection via Analyzing Scaled Prediction Consistency[C]//ICLR. 2023. [4] Zeng Y, Park W, Mao Z M, et al. Rethinking the backdoor attacks' triggers: A frequency perspective[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2021: 16473-16481. [5] Yang Y, Jia C, Yan D K, et al. Sampdetox: Black-box backdoor defense via perturbation-based sample detoxification[J]. Advances in Neural Information Processing Systems, 2024, 37: 121236-121264. 1. What is the total computational time required for the entire training process of BOBA? Compared to training a standard benign model on the same dataset, by what factor does this overhead increase? 2. During the trigger's gradient optimization or the inference process, are the trigger's pixel values constrained to a valid image data range (e.g., [0, 1] or [0, 255])? Because illegal pixel values can often influence a model's output more significantly, the authors need to clarify this setup. If the trigger contains illegal pixel values, the credibility of the reported high attack success rates would be questionable. 3. The trigger optimized in this paper is in the form of a patch. Could it be replaced with a global perturbation, for example, by blending the perturbation with the image at a certain ratio to serve as the trigger? Would the effectiveness of the attack be affected by this setting? 4. The paper's experimental evaluation is primarily focused on low- or mid-resolution image datasets. How does the proposed BOBA attack perform on higher-resolution images (e.g., 224x224)? 5. What is the Attack Success Rate (ASR) of the trigger optimized in Stage 1 of BOBA? Fully human-written
One Bad Sample May Spoil the Whole Batch: A Novel Backdoor-Like Attack Towards Large Batch Processing Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a new Batch-Oriented Backdoor Attack (BOBA), which includes two stages: trigger derivation and contrastive contamination-based retraining. As long as a carefully designed "poisoned sample" is mixed into one batch, BOBA can lead to the contamination of the prediction results of the entire batch in large-scale inference or training using Batch Normalization (BN). The experimental results show that on various models (such as ResNet, VGG, EfficientNet) and datasets (MNIST, CIFAR-10, GTSRB, Tiny-ImageNet), as long as a very small number of samples with triggers are mixed in the batch, most samples can be misclassified. 1. This paper reveals for the first time the security risk of cross-sample contamination existing in Batch Normalization (BN) in large-batch scenarios and proposes a brand-new attack view. 2. The proposed BOBA framework is divided into two stages (trigger derivation + contrastive contamination), with clear logic. 1. The model specificity of triggers limits the generalization of this method. In this paper, triggers are derived for specific trained models; the ablation experiments also indicate that trigger derivation performs very poorly on untrained models. 2. Some assumptions look too strong. First, BOBA requires setting track_running_stats=False; when the defender sets track_running_stats=True, BOBA is basically ineffective. However, in many typical deployments (especially for inference/online services), the running statistics of BN are frozen (track_running_stats=True and using running_mean/var) to ensure inference stability. Second, the attacker must have a pre-trained target model (not a randomly initialized raw model, but a normally trained model with decent performance) to reverse-derive the most effective trigger patch. 3. Although the paper evaluates various defenses, the evaluated defense methods such as "noise addition" and "statistical detection" are relatively simple; how about more advanced defenses? 4. There is a lack of theoretical explanations or quantitative modeling of contamination propagation through the BN normalization equation. 1. One of my concerns is that the method is only evaluated on low-resolution datasets and small models. This may limit the method's interest to a broader audience, and it is not clear why the authors chose to focus on small-scale datasets and why they did not include pre-trained models such as ViT. 2. In the experiments, the paper sets batch sizes from 512 to 2048; how about smaller batch sizes, such as 128 or 256? Fully human-written
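To make the BN mechanism discussed in the two reviews above concrete, here is a minimal, runnable sketch (assuming PyTorch) of how one extreme sample shifts the normalized features of every other sample in the batch when track_running_stats=False; it illustrates the generic BatchNorm behaviour only, not BOBA's optimized trigger or its retraining stage.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# With track_running_stats=False, BatchNorm normalizes with the current batch's
# mean/variance (even in eval mode), so samples are no longer independent.
bn = nn.BatchNorm2d(3, track_running_stats=False)
bn.eval()

clean = torch.randn(8, 3, 16, 16)
out_clean = bn(clean)

poisoned = clean.clone()
poisoned[0] += 50.0          # one large-magnitude sample skews the statistics
out_poisoned = bn(poisoned)

# The 7 untouched samples now receive noticeably different features.
drift = (out_clean[1:] - out_poisoned[1:]).abs().mean()
print(f"mean absolute feature drift on clean samples: {drift.item():.3f}")

# With track_running_stats=True (frozen running statistics at inference), this
# cross-sample contamination disappears, which is the defense the second
# review points to.
```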
One Bad Sample May Spoil the Whole Batch: A Novel Backdoor-Like Attack Towards Large Batch Processing Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper offers an original and thought-provoking contribution by exposing a batch-level vulnerability in BN layers and designing a corresponding attack mechanism. The experimental validation is thorough, but the clarity of presentation can be improved. The paper identifies a previously underexplored vulnerability in BN layers under large batch settings, revealing that inter-sample dependencies can be exploited for a new type of batch-oriented backdoor attack. The evaluation is extensive, covering multiple datasets and architectures. The introduction of new metrics like attack contamination rate (ACR) demonstrates methodological rigor. The exploration of adaptive and differential privacy-based defenses shows a thoughtful attempt to analyze attack resistance and propose mitigation strategies. While the batch-oriented perspective is interesting, the overall structure still closely parallels traditional backdoor frameworks. The novelty lies more in the attack surface than in fundamentally new techniques. Most results are empirical. Analytical insights into why one poisoned sample can dominate batch statistics would enhance the scientific depth. While the attack scenario differs from classical backdoors, including comparisons with state-of-the-art stealthy attacks under modified conditions would better position the work in context. Fully AI-generated
DERMARK: A Dynamic, Efficient and Robust Multi-bit Watermark for Large Language Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes DERMARK, a dynamic multi-bit watermarking scheme for autoregressive LLMs that (1) models per-segment embedding success via a CLT/Poisson-binomial approximation, deriving an inequality to decide when a generated token segment has enough capacity to encode one watermark bit, (2) performs online variable-length segmentation during generation to place each watermark bit into just-large-enough segments, and (3) recovers bits with a dynamic-programming extractor that minimizes segmentation + color losses to improve robustness to edits. Empirically, the method is evaluated on OPT-1.3b and LLaMA-2-7b and shown to reduce tokens-per-bit, lower embedding-time overhead, and increase robustness vs. a Balance-Marking baseline. 1. Principled theoretical framing (Poisson-binomial → CLT → inequality) that connects token-level probabilities to required segment length. 2. Practical algorithm: online segmentation during inference with negligible extra compute compared to baseline multi-bit methods. Reported embedding overhead is near zero and extraction is efficient enough for practice. 3. Strong empirical gains on tokens-per-bit and robustness to small insertion/deletion attacks across two model families. Table 1 + figures show consistent improvements. 1. The CLT approximation may be unreliable when segments are short (the very regime the method targets), and the paper lacks finite-sample error bounds or bootstrap-style corrections. 2. Many heuristics are involved (λ smoothing, β weighting, iterative ϵ updates). The paper reports defaults, but more ablations on hyperparameter sensitivity and cross-domain robustness (beyond news-like prompts) would be helpful. 3. The authors justify using Balance-Marking as SOTA and critique MPAC; still, including more recent multi-bit baselines (or reproducing MPAC carefully under comparable settings) would make the empirical claims stronger. The authors discuss this choice, but reviewers may still view it as a gap. 4. The method improves edit robustness but remains vulnerable to large rewrites / reorderings — inherent to dispersed multi-bit strategies. The paper states this limitation but does not quantify the breakpoint where robustness collapses. 1. Can you provide a small-N correction or empirical calibration strategy that quantifies CLT approximation error for segments of length, say, 5–20 tokens? (A simple calibration table would help.) 2. How sensitive are final detection rates to β and λ across domains (e.g., code, dialogue, scientific text)? Please provide an ablation sweep or an appendix table. 3. The DP extractor is O(N²); what are practical limits on N for real documents? Is there a streaming or beamed approximate extractor that keeps near-optimal segmentation with lower cost? Fully AI-generated
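As a rough illustration of the Poisson-binomial/CLT reasoning summarized above (and of the short-segment concern in weakness 1), the following sketch checks a normal-approximation majority condition against a Monte Carlo estimate of the exact Poisson-binomial probability; the specific decision rule, threshold z, and probabilities are assumptions for illustration, not DERMARK's actual inequality.

```python
import math
import random

def segment_long_enough(ps, z=1.645):
    # Normal (CLT) approximation to the Poisson-binomial count of "aligned"
    # tokens: require mean - z * std to exceed the majority threshold.
    mean = sum(ps)
    std = math.sqrt(sum(p * (1 - p) for p in ps))
    return mean - z * std > len(ps) / 2

def majority_prob_mc(ps, trials=100_000):
    # Monte Carlo estimate of the exact majority probability, useful for
    # gauging how far off the CLT approximation is on short segments.
    hits = 0
    for _ in range(trials):
        aligned = sum(random.random() < p for p in ps)
        hits += aligned > len(ps) / 2
    return hits / trials

ps = [0.7] * 8  # a short segment with per-token alignment probability 0.7
print(segment_long_enough(ps), majority_prob_mc(ps))
```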
DERMARK: A Dynamic, Efficient and Robust Multi-bit Watermark for Large Language Models Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a new watermarking framework for LLMs that dynamically adjusts watermark embedding based on text capacity and token statistics. 1. DERMARK adaptively determines segment boundaries in real time based on token-level statistics, achieving 2–4 fewer tokens per bit at the same detection rate. This dynamic rule substantially enhances embedding efficiency without retraining. 2. The embedding complexity is linear (O(N)), extraction is O($kL^2$), and the method is tested on a large model (LLaMA-2-70B). The method is fully plug-and-play, requiring no fine-tuning or architectural modification. 3. The inclusion of perplexity (PPL) experiments confirms that semantic quality is largely preserved across different watermark strengths ($\delta$), addressing concerns about possible generation degradation. 1. While Appendix C discusses MPAC (NAACL 2024) conceptually, the paper still provides no quantitative comparison with recent multi-bit watermarking approaches. Furthermore, I think the method can be extended to multi-bit watermarking methods such as MPAC. A discussion of this point would greatly strengthen the paper. 2. The robustness tests remain restricted to random insertion/deletion attacks. No experiments address paraphrasing, shuffling, gradient-based, or LLM-assisted removal attacks, which are crucial for assessing real-world resilience. 3. The central-limit-theorem assumption in Lemma 2 is untested for short segments, leaving the statistical soundness of the normal approximation uncertain. 4. Detection assumes perfect access to the watermark key and exact segmentation alignment; the paper does not discuss desynchronization or partial-key scenarios. 5. Although equations for bias parameters and color loss are formalized, their conceptual motivation and iterative update dynamics remain only briefly explained. See above. Heavily AI-edited
DERMARK: A Dynamic, Efficient and Robust Multi-bit Watermark for Large Language Models Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces DERMARK, a dynamic multi-bit watermarking framework for large language models (LLMs). The method adaptively determines text segment lengths during generation based on an inequality derived from a normal distribution assumption, aiming to balance watermark capacity, efficiency, and robustness. * The paper is well-motivated, addressing the limitations of fixed-length segmentation in prior multi-bit watermarking methods. * The theoretical formulation connecting watermark embedding and normal distribution is novel and mathematically rigorous. * **Typos and citation issues** Yoo et al. (2024a) and Yoo et al. (2024b) refer to the same paper and should be merged. Line 265: “Eq. equation 4” should be corrected to “Eq. (4)” for consistency. * **Misrepresentation of prior work (L115)** The description of Yoo et al. (2024b) is inaccurate. Their method does not assign bits to segments manually; instead, the bit–token mapping is determined via a hash function, as shown in BiMark [1], Robust Multi-bit Watermarking [2], and StealthInk [3]. The paper should revise this discussion to reflect the actual mechanism. * **Limited experimental comparison** The comparison in Section 5 includes only Balance-Marking. More recent and representative baselines such as MPAC (Yoo et al., 2024a), BiMark [1], and StealthInk [3] should be incorporated to strengthen the empirical claims. Without these, it is difficult to assess the relative advantage of DERMARK in the evolving landscape of multi-bit watermarking. [1] Feng, X., Zhang, H., Zhang, Y., Zhang, L. Y., & Pan, S. (2025). BiMark: Unbiased Multilayer Watermarking for Large Language Models. arXiv:2506.21602. [2] Qu, W., Zheng, W., Tao, T., Yin, D., Jiang, Y., Tian, Z., ... & Zhang, J. (2025). Provably Robust Multi-bit Watermarking for AI-generated Text. USENIX Security 2025. [3] Jiang, Y., Wu, C., Boroujeny, M. K., Mark, B., & Zeng, K. (2025). StealthInk: A Multi-bit and Stealthy Watermark for Large Language Models. arXiv:2506.05502. * **Incomplete treatment of text-length limitations** Section 4.2 discusses handling overly long text but overlooks the case when the generated text is too short to encode the full bit string. Appendix F.1 briefly mentions this issue, but the main paper should explicitly explain how DERMARK behaves or fails in this scenario, and whether it adapts δ or truncates the watermark. * **Segmentation vulnerability and missing evaluation metrics** The method’s reliance on segmentation per bit raises robustness concerns when the text is truncated or edited. As each bit is tied to a segment, any truncation makes bit recovery impossible. Moreover, although DERMARK is presented as a multi-bit watermark, the bit match rate (BMR)—a standard evaluation metric for multi-bit detection—is not reported. Including this metric would provide a fairer comparison. Moderately AI-edited
DERMARK: A Dynamic, Efficient and Robust Multi-bit Watermark for Large Language Models Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces a multi-bit text watermarking method for LLMs that dynamically determines segment lengths for embedding each watermark bit based on a probabilistic criterion derived from the model’s logits. It (1) analyzes watermark embedding as following a normal distribution, leading to an inequality that estimates whether a segment has enough capacity to reliably encode one bit; (2) uses this condition online during generation to adaptively end a segment and move to the next bit; and (3) proposes a dynamic-programming extractor that combines a segmentation loss (how tightly the inequality is satisfied) with a “color” imbalance loss to improve robustness to edits. Experiments on OPT-1.3B and LLaMA-2-7B claim fewer tokens per bit and lower time overhead than Balance-Marking, with improved robustness to token insertions/deletions. 1. The paper explains why multi-bit watermarking (beyond one-bit detection) is needed for fine-grained attribution (LLM/user) and why fixed-length segmentation can fail, especially on low-entropy text. 2. Derives an inequality from a CLT-style analysis that treats aligned-token proportion as approximately normal; this enables an online, per-bit stopping rule during generation. 3. For matched detection rates, DERMARK uses fewer tokens per embedded bit, with further gains on low-entropy subsets. 1. The experimental setup is outdated. There are many new multi-bit watermarking works. However, this paper only uses one 2023 paper as a baseline. Comparing with more extensive, recent baselines would strengthen the claims. 2. Considering that most application scenarios for LLM watermarking involve chat, evaluating performance on a long-form QA dataset and instruction-tuned models (e.g., Llama-3.1-8B-Instruct) is necessary. 3. Robustness focuses on random insert/delete at 5–10%, which is limited compared with existing works, e.g., [1]. [1] http://arxiv.org/abs/2401.16820 Please see above. Lightly AI-edited
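For readers unfamiliar with the extractor design the summary describes, below is a generic O(k·N²) dynamic program over segment boundaries that decodes one bit per segment by minimizing a sum of per-segment costs; the toy cost (a simple color-imbalance margin) and the 0/1 token representation are placeholders, not DERMARK's actual segmentation and color losses.

```python
def decode_bits(tokens, k, segment_cost):
    # dp[b][i]: best total cost of splitting the first i tokens into b segments.
    n = len(tokens)
    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    back = [[-1] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for b in range(1, k + 1):
        for i in range(b, n + 1):
            for j in range(b - 1, i):
                cost, _bit = segment_cost(tokens[j:i])
                if dp[b - 1][j] + cost < dp[b][i]:
                    dp[b][i] = dp[b - 1][j] + cost
                    back[b][i] = j
    # Trace back the chosen boundaries and decode one bit per segment.
    bits, i = [], n
    for b in range(k, 0, -1):
        j = back[b][i]
        bits.append(segment_cost(tokens[j:i])[1])
        i = j
    return list(reversed(bits))

def toy_cost(segment):
    # Placeholder cost: decode by majority "color"; more imbalance = lower cost.
    greens = sum(segment)
    bit = 1 if 2 * greens >= len(segment) else 0
    return -abs(2 * greens - len(segment)), bit

# tokens here are just 0/1 color indicators for illustration
print(decode_bits([1, 1, 0, 1, 0, 0, 0, 1], k=2, segment_cost=toy_cost))
```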
KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a kinematic-aware diffusion framework for category-level articulated object shape reconstruction and generation. It first encodes SDFs, joint angles, and part segmentation into a structured latent space via a Kinematic-Aware VAE and then employs two conditional diffusion models for regressing global pose and joint parameters. Finally, it introduces an iterative optimization module to refine the reconstruction. 1. The idea of encoding everything, including the geometry and kinematic information, into a unified latent space sounds reasonable to me, since the development of 3D generation models has gradually switched to native 3D spaces. 2. The two diffusion models that respectively learn kinematic-aware information and part geometry sound reasonable. 1. The way the authors cite papers is really hard to read, which I believe is due to the package or template. 2. The authors should polish the figures, especially Fig. 2. In the Pose and Joint Estimation Module, what's the difference between the two lines with (X_T, Y_T)? Does that mean a single inference step? If I understand it correctly, this is the part of a conditional diffusion model that conditions on the partial point cloud (encoded by PointNet++) and predicts base pose and joint parameters, but the way the authors draw the figure is quite confusing. What is actually being denoised? 3. Are the Pose and Joint Estimation Module and the reconstruction module related to each other? From the reviewer's understanding, it seems that these two modules are separate ones. 4. The authors should elaborate on the generation mentioned in all the experiments. It seems that they are mainly generating novel poses of the objects instead of generating articulated objects from few images? 5. The original PARIS method assumes images from two states, but from the reviewer's understanding, it seems that the proposed method only takes in single-state images as input? How is the comparison performed? I appreciate the idea of directly denoising in a native VAE space, but the writing and explanation seem confusing, so I currently lean towards rejecting. But I am willing to change my view after the rebuttal and after seeing other reviewers' comments. On what dataset are the KA-VAE and the diffusion models trained? Fully human-written
KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents KineDiff3D, a diffusion-based framework for reconstructing articulated 3D objects from sparse multi-view or single-view inputs. The approach achieves state-of-the-art results on benchmarks like PartNet-Mobility and new synthetic data. - Novel integration of kinematic constraints into diffusion-based 3D modeling. - Demonstrates improved generalization to unseen articulations and novel part combinations. - Visualization and ablations clearly illustrate the role of kinematic priors. - The writing of the paper needs improvement, and the overview figure cannot present the methods clearly. - Some improvement margins over baselines are modest, suggesting incremental benefit in certain settings. Moreover, it seems that it is not compared with the latest SOTA methods, but only with those from a few years ago. - Limited qualitative demonstrations on real-world data; most results are synthetic. - The novelty mainly lies in integrating existing techniques (diffusion + kinematic loss) rather than introducing a fundamentally new formulation. - It seems rather strange that the ablation experiment is only conducted in one category (Dishwasher). How can experiments in one category prove that these components are also useful in other categories? see weakness Moderately AI-edited
KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper presents KineDiff3D, a single-view articulated object reconstruction framework that integrates a Kinematic-Aware VAE (KA-VAE), conditional diffusion models, and an iterative optimization module. The pipeline first encodes geometry (SDF), part segmentation, and joint angles into a shared latent space using KA-VAE. Two diffusion models are then trained: one for pose/joint estimation, and another for latent code generation from partial point clouds. Finally, a joint-centric optimization loop refines pose and geometry via Chamfer distance while preserving kinematic constraints. Experiments on synthetic, semi-synthetic, and real datasets (ArtImage, ReArtMix, RBO) show that KineDiff3D improves Chamfer Distance and joint pose accuracy over previous category-level baselines such as A-SDF, CARTO, Paris, and Ditto. The paper tackles an important and challenging problem — category-level articulated object reconstruction from single views. Integration of geometry, kinematics, and generative modeling within a single framework is conceptually appealing. The ablation on iterative optimization shows the model can improve with refinement steps. The implementation appears complete and reproducible in principle, showing non-trivial engineering effort. Lack of true novelty: The proposed KA-VAE and diffusion combination is a straightforward hybrid of known components. Recent works (Real2Code 2024, Reacto 2024, ArticulatedGS 2025) already address similar goals with more rigorous modeling and stronger baselines. Misleading claim of “generation”: The paper never performs unconditional or cross-category generation; it only interpolates joint angles of known shapes. Incomplete evaluation: The experiments omit essential baselines, use limited datasets, and report no standard deviations or statistical tests. Unclear articulation encoding: Handling of variable joint topology, parameterization of revolute/prismatic joints, and fusion between geometry and motion are insufficiently explained. Superficial diffusion analysis: There is no comparison showing that diffusion improves over direct latent regression or normalizing flows. Overly dense presentation: The paper reuses long derivations of SDE-based diffusion without insight, making it difficult to distinguish novelty from background. No qualitative failure analysis or runtime discussion, despite introducing multiple heavy submodules (two diffusion networks + VAE + optimization). What is the exact advantage of diffusion over a simple regression network for latent prediction? Can you provide quantitative evidence (e.g., ablation replacing diffusion with MLP)? How does the framework handle variable joint topology across categories (e.g., eyeglasses vs. drawer)? Are joint parameters padded, masked, or predicted per-category? Is the KA-VAE trained jointly with the diffusion modules or sequentially? If sequential, how is latent-space alignment ensured? How many iterations are required for the optimization loop, and what is its runtime overhead? How are “generation” results produced? 
Are they stochastic samples from diffusion or deterministic angle interpolation? Could the authors compare against 2024–2025 state-of-the-art methods (e.g., Reacto, Real2Code, ArticulatedGS) using the same metrics and datasets to substantiate the claimed progress? Please clarify whether the real-world evaluation uses any domain adaptation or fine-tuning, as results seem unexpectedly strong given the fully synthetic training. Fully AI-generated
KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This work tackles 3D reconstruction and pose estimation of articulated objects from single-view inputs. The proposed KineDiff3D framework uses a Kinematic-Aware VAE to encode geometry, joint angles, and segmentation, along with conditional diffusion models for pose/joint regression and latent code generation. An iterative optimization module further refines results while maintaining articulation constraints. Experiments demonstrate strong performance across multiple datasets. 1. Comprehensive unified framework: The paper presents a well-designed end-to-end system that jointly addresses multiple challenging tasks—shape reconstruction, pose estimation, and novel articulation generation—within a single framework. 2. The bidirectional optimization module that simultaneously refines reconstruction accuracy and kinematic parameters while preserving articulation constraints works well. This design leverages the mutual dependencies between geometry and kinematics, likely leading to more robust and accurate results compared to methods that treat these aspects independently. 1. The inputs to the model should be clarified at the beginning of the method section. Specifically: a) Is the input a single-view image with depth information? b) How is the full object point cloud (shown at the top of Figure 2) obtained? c) How is the partial object point cloud obtained? 2. Why did you choose to use PointNet and PointNet++ when there are many more powerful models? 3. The pipeline overview in Figure 2 needs improvement. The flow lines are difficult to follow and make the overall process unclear. 4. The glasses' legs appear incomplete in the generated novel articulations in Figure 4. Could you clarify why this occurs? I think that novel articulation generation should preserve the object's shape and only modify its pose configuration. 5. Appendix formatting (Section 4.1): `Metrics` is used as a paragraph heading, but I think that "Reconstruction and Generation Task" and "Pose and Joint Estimation Task" should not be listed as paragraphs. Please see the weaknesses. I think the writing is not reader-friendly, especially in the method section. It would be better if the authors could further improve it. Lightly AI-edited
Guaranteeing Conservation of Integrals with Projection in Physics-Informed Neural Networks Soundness: 1: poor Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes an analytical formula for enforcing conservation laws of the form $u_{t}=-\nabla \cdot \mathbf{F}(u)$. The integral form $\frac{d}{dt} c(t)=\frac{d}{dt} \int_{\mathcal{X}} u\; dx=-\oint \mathbf{F}(u) \cdot d \mathbf{S}$ describes how the conserved quantity $c(t)$ changes over time, and $c(t)$ can be computed via temporal integration. The key idea is to represent the PINN ansatz $u_{\theta}$ by its evaluation on a uniform grid on the domain, and compute the conserved quantity using a uniform quadrature: $\hat{c}(t)\approx 1_{n}^{\top}u_{\theta} \Delta x$. The constraint $\hat{c}(t)=c(t)$ becomes linear, and a minimum-L2-distance projection $\tilde{u}_{\theta}$ has an analytical solution that depends on $c(t)$. The method is tested on four 1D equations, and the results show that the conservation-law constraint is well satisfied. The paper is well-written and is easy to follow. The proposed method is mathematically clear and easy to implement. - One of the key strengths of PINN is its mesh-free property. Enforcing the constraint by projecting PINN to a uniform grid defeats the purpose of having PINN in the first place. If we are already projecting PINN to a uniform grid, why don't we use a finite element method? Also, from Table 4, we see that PINN-proj is very slow, exactly due to the use of a uniform mesh. Also, this is just 1D; for 2D and 3D, it would be much worse. Furthermore, in real problems, the domain could be irregular, and uniform discretization doesn't always work. The paper didn't discuss these limitations at all. - The baseline is weak: one can use augmented Lagrangian, ADMM, and even just a straightforward soft constraint $\lambda |\mathbb{E}[u_{\theta}] - c(t)|^{2}$. Furthermore, the baseline PINN-SC uses only a single weight of $\lambda=10$, without any ablation. - The experiments in the paper are only toy examples. Showing the method works on a more realistic example would greatly increase the impact. - Why is extending the projection to continuous space (Eq. 33) valid? - How is $c(t)$ calculated? - The initial condition doesn't satisfy the boundary condition. Why is that the case? - Although the paper proposes a method that works for $\frac{d}{dt}c(t)\neq 0$, for example the reaction-diffusion equation, Figure 1 only shows the case for $\frac{d}{dt}c(t)= 0$. Why is that the case? Non-constant $c(t)$ is much more interesting. Fully human-written
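For concreteness, the linear-constraint case the summary describes admits the following closed-form projection; this is a minimal sketch assuming a uniform 1D grid and the notation used above, and it does not cover the paper's quadratic (e.g., energy) constraints.

```python
import numpy as np

def project_linear(u, c, dx):
    # Minimum-L2 projection of grid values u onto { v : dx * sum(v) = c }:
    # v = u + (c/dx - sum(u)) / n, i.e., a constant shift of every grid value.
    n = u.shape[0]
    return u + (c / dx - u.sum()) / n

u = np.random.randn(101)           # PINN output on a uniform grid at time t
dx = 1.0 / (u.shape[0] - 1)
c_t = 0.5                          # target conserved quantity c(t)

u_proj = project_linear(u, c_t, dx)
print(dx * u_proj.sum())           # recovers c_t up to floating-point error
```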
Guaranteeing Conservation of Integrals with Projection in Physics-Informed Neural Networks Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper studies the projection method for the conservation of integral quantities in PINNs. It can be regarded as additional regularization for solutions of PINNs, with additional knowledge on integral quantities (energy). Instead of adopting soft constraints in the loss, the paper formulates it as a nonlinear projection applied after the training of PINNs, thereby acting as a hard constraint. Experiments show the superiority of the proposed projected PINNs over the soft-constrained PINNs. (1) The presentation is good. (2) The method is simple and well motivated. (3) The conservation of solution energy boosts the performance of PINNs. (4) The experimental results are strong in that PINN-Proj consistently outperforms PINN and PINN-SC. (1) Scaling the method to high-dimensional PDEs is difficult. The advantages of PINNs over classical numerical methods are their potential in high-dimensional PDEs. However, the projection method still suffers from the curse of dimensionality, since it requires the discretization of the space. Therefore, the nonlinear projection is infeasible and computationally expensive for high-dimensional PDEs. I think this is the main weakness of the method. PINNs try to avoid discretization, but the proposed method requires it. (2) It would be good if you could present the algorithm, so that readers can easily capture the core idea of this paper. The projection applies once after the training of PINNs. For example, if PINNs overfit the data so that they do not satisfy the conservation law, will the projection further degrade the performance of PINNs? I think if we can apply the projection during PINN training (alternating between PINN training and conservation projection), then the projection will gradually guide the PINNs to learn this conservation. Fully human-written
Guaranteeing Conservation of Integrals with Projection in Physics-Informed Neural Networks Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a projection-based method to enforce conservation laws within Physics-Informed Neural Networks (PINNs) by incorporating linear and quadratic integral constraints through a post-training projection step. The approach aims to improve the physical fidelity of PINN solutions by ensuring that conserved quantities are satisfied exactly, thereby addressing common issues with conservation violations in neural PDE solvers. The authors demonstrate the method on several benchmark PDEs, showing improved conservation and, in some cases, enhanced convergence properties. The paper addresses an important limitation of PINNs by proposing a projection method that guarantees the conservation of integral quantities. The formulation is mathematically well-motivated and provides a clear, closed-form projection for enforcing conservation for 1D uniform-grid cases. Empirical results demonstrate consistent improvements on standard benchmark problems. 1. It is unclear whether the observed improvement in PDE solution error is truly due to the enforcement of conservation constraints. The error reduction may result either from matching the integral quantities or from the projection improving training stability. An ablation study is needed to clarify this. 2. The post-hoc projection does not consider the PDE residual. Therefore, even if the conservation constraint is enforced, the solution may violate fundamental physical laws, such as mass, momentum, or energy conservation, especially in nonlinear or strongly coupled systems. This raises concerns about the physical plausibility of the projected solutions. 3. It is unclear whether the conservation is preserved at points $(x,t)$ outside the training set. Demonstrating generalization beyond training data—particularly in comparison with approaches like PINN-SC—is necessary. 4. The comparison set is limited. Evaluating only vanilla PINN and soft-constrained PINN may be insufficient. In particular, PINN-SC requires careful tuning of the regularization parameter. Additional baselines, such as Lagrange multiplier-based methods, would strengthen the evaluation. 5. Since the integral is approximated via a discretized sum, the projection inherits discretization errors. The method may be sensitive to the number of discretization points $n$. Moreover, the method’s scalability to higher-dimensional domains (e.g., 2D or 3D) is questionable because: * The number of constraints grows significantly, * The computational cost of the projection becomes prohibitive even in 1D, * Closed-form projection solutions may not exist. Furthermore, extending this approach to tensor- or vector-valued fields is nontrivial. Therefore, experiments on higher-dimensional problems are essential to validate the method’s scalability and practical feasibility. Extending the projection to non-uniform grids, adaptive meshes, or higher-dimensional domains (>=3D) may not be straightforward, and closed-form solutions may not exist for such cases. 6. The experiments are limited to simple, low-dimensional examples. 
It remains unclear how much enforcing conservation alone improves PDE accuracy or stability, and whether the approach generalizes to a broader class of PDEs or practical problems. 7. The claim that Hessian eigenvalues of PINN-Proj are more tightly clustered near zero, suggesting improved conditioning, is potentially misleading. Eigenvalues concentrated near zero typically indicate ill-conditioning, which can slow or destabilize optimization. A more detailed explanation and clarification are necessary. Additionally, eigenvalue distributions alone may not fully characterize optimization performance or convergence, so such claims should be presented cautiously. 8. For hyperbolic conservation laws, convergence to the entropy solution is crucial before enforcing conserved quantities. The proposed method does not address this requirement, which may limit its applicability to such systems. **Minor Comments** * Subsections 3.1 and 3.2 do not present the proposed methodology of this paper and would be better placed in a preliminary or background section for clarity. * In Eq. (4), the conserved quantity c depends on the solution $u$. Therefore, it would be more precise to denote it as $c(u(⋅,t))$ or $c[u](t)$ instead of simply $c(t)$, to explicitly indicate this dependency. Please refer to the comments in the Weaknesses section for detailed remarks. Moderately AI-edited
Guaranteeing Conservation of Integrals with Projection in Physics-Informed Neural Networks Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The manuscript considers physics-informed neural networks (PINNs) for the solution of evolution equations with conserved quantities. It proposes a method of obtaining predictions that satisfy the conservation law exactly based on an explicit transformation of the network's output. The resulting method is evaluated on a variety of PDEs, showing improved accuracy and conservation of the quantities over vanilla PINNs. + The manuscript treats the important and timely problem of producing more physically plausible predictions with neural networks. + The manuscript gives a clean way to exactly impose conservation of linear and quadratic quantities. + Depth of the contribution: The manuscript proposes essentially a post-processing of the predictions. Whereas a slight improvement over vanilla PINNs is shown, it is not clear whether this enables the application of PINNs to problems previously out of reach. + Ablation: The post-processing is used during training. An ablation study of training without the post-processing but using it at evaluation is not presented. + The complexity of the training procedure and the training times are not discussed or reported. + Evaluation: + The manuscript does not give insight into the training process. In particular, the choice of the optimizer is crucial in PINNs, as demonstrated by the recent success of second-order optimizers like energy natural gradient. In particular, energy natural gradient leads to a loss with an optimal condition number. + The results are not competitive with PINNs optimized with current state-of-the-art optimizers. + The method is relatively ad-hoc, hence only applicable to the mean and variance as conserved quantities. + At times, the manuscript is not an entirely smooth read, e.g., in the abstract *While the soft constraint PINNs use to enforce the structure of partial differential equations (PDEs) enables necessary flexibility during training* might be a typo? Also, in Subsection 3.1 it is not clear what is meant by *The physics-informed neural network (PINN) is defined as $f\coloneqq u_t + \mathcal N[u]$*. Firstly, it is good to remind the reader that $u_t$ refers to the time derivative. Secondly, it is a bit of a stretch to call the residual $f$ the _physics-informed neural network_. Further, Section 3 consists of a lot of very short subsections and would benefit from streamlining and condensation. + Which optimizer was used during training? + How does your architecture perform when compared to PINNs trained with second-order optimizers? + Can you use second-order optimizers to efficiently train your architecture? + Can you generalize your method to other conservation laws? Fully human-written
Personalization Under Value Conflict: Resolving Contradictory Preferences with Paired Fine-Tuning Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. Humans often adjust their preferences when facing different situations, which causes models trained on a single preference to underperform. In this work, the authors propose **Preference-Paired Fine-Tuning (PFT)**, which optimizes the models to align contradictory preferences under the same scenario simultaneously. They present a dataset called **Value Conflict Dilemma (VCD)** that consists of contradictory paired examples, and apply supervised fine-tuning on it. The experimental results indicate that PFT performs better than all single-preference baselines for both positive and negative preferences. - The motivation for aligning the models for both contradictory preferences at the same time is well-defined. - The inclusion of human studies is appreciated, as it strengthens the evaluation. **About the method** - In Section 3.5, the authors mention that the PFT-trained model can quickly adapt to an ICL scenario, but do not explain why. The training process itself does not appear to include any ICL-related design. **About the experiments** - The ICL experiments in Section 4.3 are not clearly explained. For example, the source of the user history data is unclear, and it is not specified which model was used. Without these details, the results are difficult to interpret. - In line 418, the authors claim that single-preference trained models cannot generalize well to other preference types without giving the supporting evidence. **About the writing** - The use of the synchronous update method for the main approach is mentioned only in the captions of Tables 1 and 2. This should be clearly stated in the main text to distinguish it from the baselines. - Human evaluation is mentioned in line 387 but is only described in the appendix, making it hard to follow. - Figure 5 appears to contain duplicated bars. - Does the improvement mainly come from using contradictory preferences, or simply from sampling multiple preferences within the same scenario? Have the authors tested a dataset built with independent preferences and fine-tuned the models on it? - For the single-preference baselines, do they use the same number of training examples (1000), or does the main method actually use 1000 *pairs* of examples? - The negative preference results seem consistently lower than the positive ones. Since both types of preferences should be neutral with respect to human values, what might explain this difference? - This framework appears adaptable to RLHF-based methods such as PPO or GRPO. Have the authors explored or considered experiments in that direction? Lightly AI-edited
Personalization Under Value Conflict: Resolving Contradictory Preferences with Paired Fine-Tuning Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper studies personalization when a single user can toggle between contradictory preferences. It introduces Preference-Paired Fine-Tuning (PFT): train with both sides of each pair via either alternating updates or a summed loss, then steer at inference with ~3 in-context examples. The paper also releases a synthetic Value Conflict Dilemma (VCD) dataset and evaluates on VCD and selected tasks. 1. The paper presents a clear problem framing of value conflict alignment and a simple recipe that plugs into SFT/DPO pipelines. 2. The VCD dataset could be potentially useful, as it defines explicit contradictory pairs and is human-checked for label quality. 1. **Baselines for multi-objective control are missing:** Several directly comparable multi-objective alignment/controlled generation methods, e.g., [1,2,3], are not compared. These methods also aim to train one steerable policy that trades off conflicting objectives during inference. 2. **Limited technical contribution:** PFT is essentially cross-entropy on both sides of a pair (either alternated or summed), with standard gradients and weights. The proposed fast personalization path is simply user-context-conditioned generation obtained by adding user history to the prompt, which has already been explored in several inference-time steering methods, e.g., [2,3]. 3. **Scalability and robustness:** The paper studies k=2 (binary contradictory preferences), but many user preferences are multi-dimensional and non-exclusive. Whether the method scales to >2 dimensions or to interactively changing preferences is unknown. [1] Zhou, Zhanhui, et al. "Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization." Findings of the Association for Computational Linguistics ACL 2024. 2024. [2] Wang, Kaiwen, et al. "Conditional Language Policy: A General Framework For Steerable Multi-Objective Finetuning." Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. [3] Guo, Yiju, et al. "Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment." Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. See Weaknesses. Moderately AI-edited
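Since the weakness above hinges on PFT being "cross-entropy on both sides of a pair," a minimal sketch of the summed variant may help make the point concrete; the Hugging-Face-style `model(...).logits` interface and the batch field names are assumptions, and this is not claimed to be the authors' exact implementation.

```python
import torch.nn.functional as F

def paired_sft_loss(model, batch_pos, batch_neg, w_pos=0.5, w_neg=0.5):
    """Summed paired objective: one token-level cross-entropy term per side of a
    contradictory preference pair, combined with fixed weights."""
    logits_pos = model(batch_pos["input_ids"]).logits          # (B, T, V)
    logits_neg = model(batch_neg["input_ids"]).logits
    loss_pos = F.cross_entropy(logits_pos.transpose(1, 2), batch_pos["labels"])
    loss_neg = F.cross_entropy(logits_neg.transpose(1, 2), batch_neg["labels"])
    # The alternating variant would instead take a gradient step on each term in turn.
    return w_pos * loss_pos + w_neg * loss_neg
```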
Personalization Under Value Conflict: Resolving Contradictory Preferences with Paired Fine-Tuning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper tackles the challenge of aligning LLMs with heterogeneous and contradictory user preferences. The paper proposes Preference-Paired Fine-Tuning (PFT), a method that fine-tunes a single model on both sides of a contradictory preference pair, enabling the model to handle opposing preferences without requiring separate models for each preference direction. The paper additionally introduces the Value Conflict Dilemma dataset and show that PFT can combine paired training with lightweight in-context adaptation to better match individual preference histories. - Well-motivated problem. The paper addresses an important limitation in current LLM alignment approaches: most methods optimize for universal preferences rather than handling individual-level preference diversity and conflicts. - Comprehensive experiments. Tests on multiple model sizes and families (Qwen, LLaMA), multiple baselines (SFT, DPO, CAA), multi-format evaluation (multi-choice classification with “pick-one” and “select-all-that-apply” protocols, and open-ended generation scored by GPT-4o-mini), as well as ablations on dataset size and preference-pair combinations. - New Dataset. VCD focuses specifically on value-conflict scenarios and includes human validation. Even if synthetic, the attention to contradictory labeling could be useful. - Methodological Novelty. The core idea is essentially training on both sides of a preference pair simultaneously. This is a relatively incremental modification to standard SFT. The mathematical formulation (especially the synchronous update in Eq. 5-7) is just standard multi-task learning with weighted losses - Strength of Claims vs. Results. Some narrative framing seems overstated relative to the reported improvements. For example, several gains in the tables are modest, and certain baselines (e.g., DPO in single-preference directions) outperform PFT in their own setting. The claim that PFT “significantly” improves open-ended generation would benefit from a more tempered interpretation. - Missing Critical Operational Detail. Several important methodological details are missing or insufficiently described in the main text. For example, regarding multi-choice evaluation, the main text does not explain how model outputs are converted into selected choices, nor how generation is constrained for the “All” setting. Additionally, given that VCD is positioned as one of the principal contributions, the main text provides only high-level construction details. - Unclear How Explanations are Used or How Important They Are. Although explanations are repeatedly emphasized as part of the single-choice training data (“triplet of <scenario, preference, explanation>”), the paper does not run any experiments isolating the impact of these generated explanations. Their actual contribution remains unclear. - What is the empirical impact of including generated explanations during training or inference? Since explanations appear prominently in the data pipeline, an ablation (e.g., training with vs. without explanations) seems necessary to understand their influence. 
- How exactly is the user-history context constructed for ICL? Are histories sampled directly from the training distribution, synthesized in a principled way, or drawn from held-out samples? - Why are there duplicate entries in Figure 5? - The abstract mentions a ~40% reduction in data requirements compared to single-preference fine-tuning, and the conclusion similarly claims that the method is more “data-efficient than SFT and DPO.” However, none of the reported results in the main text seem to justify these numbers. Could you clarify how this number was computed and how the experiments support this conclusion? Moderately AI-edited
Personalization Under Value Conflict: Resolving Contradictory Preferences with Paired Fine-Tuning Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes Preference-Paired Fine-Tuning (PFT), a method that trains a single LLM on paired demonstrations of contradictory preferences to enable dynamic personalization under value conflict. It introduces the Value Conflict Dilemma (VCD) dataset and shows that PFT outperforms baselines in both classification and open-ended generation tasks. 1. The work tackles the important and underexplored challenge of personalizing LLMs when user preferences conflict. 2. It introduces VCD, a high-quality, human-validated dataset that supports future research on value conflicts. 3. The paper is well-written and easy to read. 1. Despite introducing the terms “asynchronous” and “synchronous” update strategies, the method is essentially standard supervised fine-tuning on preference-conditioned paired data and offers limited novelty. 2. The paper models preferences as strict binary opposites, whereas real-world preferences often exist on a spectrum or are contextually blended, limiting the framework’s applicability to nuanced user behaviors. 3. The evaluation primarily focuses on the single-dimensional VCD benchmark, lacking assessment in multi-dimensional or finer-grained preference settings, which limits the validation of the method’s generalizability. 4. The paper primarily compares against general alignment methods (e.g., SFT, DPO, CAA) but omits comparisons with recent specialized personalization techniques. This limits the assessment of PFT’s relative advantages in the broader landscape of personalized LLMs. See the weaknesses. Moderately AI-edited
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes a new reinforcement learning (RL) algorithm for diffusion large language models (dLLMs), which have intractable likelihoods that make standard policy gradient methods infeasible. Prior approaches approximate the log-likelihood using the ELBO, but the lower-bound approximation introduces gradient bias and limits learning from negative rewards. To overcome this, the authors introduce the Sandwiched Policy Gradient (SPG) method, which optimizes a “sandwiched” objective combining both a lower bound (ELBO) for positive-reward samples and an upper bound (EUBO) for negative-reward samples, thereby reducing bias in the policy gradient. SPG further employs a block-wise masking strategy to stabilize Monte Carlo estimation and a mixture formulation that adaptively blends upper and lower bounds to reduce gradient variance. Experiments on four reasoning benchmarks—GSM8K, MATH500, Countdown, and Sudoku—show that SPG achieves consistent improvements. - Originality: A novel Sandwiched Policy Gradient (SPG) framework is proposed to leverage both lower and upper bounds of log-likelihood for diffusion LLM, which is a clear conceptual advance over prior ELBO-only RL approaches. - Quality: Theoretical development is coherent and well-motivated, with a solid integration of Rényi-based upper bounds and a mixture formulation that balances bias and variance. - Clarity: The paper is well-written and logically structured. Figures and equations clearly illustrate the SPG process. 1. Non-standard evaluation protocol. The paper selects checkpoints every 100 steps based on the highest test accuracy, which risks test set overfitting. While this follows d1 for consistency, it is methodologically questionable. The model should instead be selected via a validation set or by reporting the final checkpoint performance, as adopted in [1]. Revising the evaluation protocol would improve the experimental rigor. 2. Absence of standard RL stabilization techniques. SPG appears to use a naive policy gradient update without employing the clipping or KL-regularization mechanisms used in PPO or GRPO. Without importance sampling ratio corrections, it is unclear whether SPG remains stable for off-policy updates $\mu > 1$. This raises concerns about potential instability or divergence during training. 3. Lack of comparison with key related work. The paper omits comparison with [2], which is the first to successfully apply trajectory-level RL to diffusion LLMs and reports state-of-the-art results on similar reasoning benchmarks. Including this baseline is essential for contextualizing SPG’s improvements and validating its claimed advantage. [1] DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation, arXiv:2506.20639 [2] Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models, NeurIPS 2025. - Regarding Weakness 1: Would the authors consider adopting a more standard evaluation setup (e.g., validation-based checkpoint selection or reporting final-step results) to reduce overfitting concerns? - Regarding Weakness 2: Could the authors clarify why SPG does not adopt clipping or KL regularization? 
Is this omission due to the intractability of computing importance ratios caused by the constant $C(T)$ term in EUBO, or was it omitted for simplicity? - Regarding the tightness of the EUBO: As discussed in [3] and [4], the ELBO of dLLMs equals the AO-ARM loss, and becomes tight when the joint distribution $p(x|\sigma)$ is consistent across different orders of $\sigma$ (i.e., the inequality in Eq. (2) of [3] holds with equality). However, as shown in Appendix C.3 of this paper, the EUBO does not appear to be tight even under this condition. Could the authors explain why this phenomenon occurs, and whether it implies a fundamental looseness in the Rényi-based upper bound? [3] Autoregressive Diffusion Models, ICLR 2022 [4] Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data, ICLR 2025 Moderately AI-edited
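For reference, the sandwiching idea under discussion can be summarized in a few lines; this sketch only reflects the bound-selection rule described in the review (ELBO for positive advantages, EUBO for negative ones) and deliberately omits importance ratios, clipping, and KL regularization, which is precisely the gap the questions above point at.

```python
import torch

def sandwiched_surrogate(advantages, elbo, eubo):
    """Per-sample surrogate for the policy gradient of a dLLM with intractable
    log-likelihood: use a lower bound when pushing probability up (adv > 0) and
    an upper bound when pushing it down (adv < 0), so the bound never overstates
    the log-likelihood in the direction it is being moved.

    advantages, elbo, eubo: (B,) tensors of per-sample advantages and bound estimates.
    Returns a scalar objective to maximize.
    """
    bound = torch.where(advantages > 0, elbo, eubo)
    return (advantages * bound).mean()
```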
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces SPG for dLLMs. Because dLLMs have intractable log-likelihoods, standard policy gradients cannot be applied directly. SPG sandwiches the likelihood by maximizing an ELBO for positive advantage samples and minimizing an EUBO for negative advantage samples, paired with blockwise masking for Monte Carlo estimation. SPG achieves SOTA accuracy on GSM8K, MATH500, Countdown, and Sudoku among RL for dLLMs. 1. The paper articulates the RL bottleneck for dLLMs and motivates a natural sandwiched approach. 2. Experimental results show consistent gains across benchmarks, sequence lengths, and inference strategies; ablations on some choices further support robustness. 3. The paper is easy to follow and uses a compact structure that makes the method clear. 1. LoRA-only evidence without full fine-tuning controls: all experiments rely on LoRA, so findings may hinge on LoRA’s inductive biases; it remains unclear whether the gains and stability persist under full fine-tuning. 2. The authors mention that masking 15% of the prompt can improve performance. Although they claim this choice follows d1, they do not seem to explain here the motivation or why it leads to better results. 1. In all main experiments, the paper set the number of Monte Carlo samples to $m=2$. Ignoring computational constraints, how does increasing $m$ affect optimization variance and end-task performance? 2. Figure 9 depicts the training dynamics of the effective generation length. What do these trajectories imply about how SPG allocates its reasoning budget over training? For tasks such as Countdown and Sudoku, the effective length shows large-amplitude fluctuations; what factors drive these oscillations? Lightly AI-edited
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper addresses policy gradients for diffusion language models, where $\log \pi_{\theta} (x |c)$ is intractable, by optimizing a lower bound for positive-advantage traces and an upper bound for negative-advantage traces, with a practical block-wise masking estimator. Because dLLMs do not expose a tractable log-likelihood, prior works optimize the ELBO or one-step proxies, which bias gradients (the ELBO is only a lower bound). Sandwiched Policy Gradient proposes to "sandwich" the objective: maximize a tractable lower bound $L_{ELBO}$ on positive-advantage samples while minimizing a tractable upper bound $U_{EUBO}$ on negative-advantage samples (for instance, with GRPO-style advantages). The problem is well-defined, with clear mathematical analysis and extensive experiments against RL baselines. 1. Very clear problem: ELBO-only training biases gradients when rewards can be negative; upper bounds let the algorithm penalize low-reward traces without relying on the true $\log \pi$. 2. Strong experimental results: the four reasoning tasks show very high performance gains. 3. The theory is reasonable. The EUBO derived from a Rényi variational inequality gives a tractable surrogate. Mixing lower/upper bounds reduces variance both intuitively and theoretically (Prop. 1) and avoids gradient clipping/vanishing. 1. Baselines. Comparisons focus on GRPO-like/diffusion-RL methods. Missing: preference optimization tailored to diffusion (DPO-style for dLLMs), and score-function or pathwise estimators based on learned surrogates or likelihood ratios via learned decoder controls. 2. The datasets are limited to reasoning tasks. Though the performance gains are significant, the datasets are limited to math/logic. No tool-use or multi-turn agent settings. No natural-language preferences. 3. Training cost versus baselines and wall-clock time are not clearly examined. Block-wise MC might add computational overhead. Can the authors report the computational overhead of each method? What’s the wall-clock and token-throughput overhead of block-wise MC vs. random masking and vs. GRPO/D1/W1, at equal budgets? Could the authors add (even small-scale) comparisons to (a) DPO-style methods for diffusion, (b) score-matching/pathwise surrogates with a learned decoder, (c) GRPO + stronger variance reduction? Can the authors add more datasets (non-reasoning, multi-turn, tool-use benchmarks)? Fully human-written
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces a policy gradient approach for RL fine-tuning of diffusion LLMs (dLLMs). The authors note that the policy gradient involves the gradient of the log-likelihood of the policy, and this log-likelihood is intractable in the case of dLLMs. Existing approaches rely on an evidence lower bound (ELBO), which holds only when the return is positive. The authors build an evidence upper bound (EUBO) as a corollary of the Rényi variational bound, and use it for negative returns (the ELBO being used for positive rewards). The authors also provide additional improvements. One is the mixing of ELBO and EUBO for negative returns, with a theoretical argument about decreasing variance (not discussing the bias, though); the other is a block-wise masking strategy for MC estimation. The authors then present a quite thorough experimental study, comparing to a number of baselines and ablating the different components of the proposed approach. * The paper is clearly written and structured, easy to follow globally, and well argued * The point of the ELBO not being valid for negative rewards is a very good one, and the proposed approach makes a lot of sense and provides good experimental results * The empirical study is quite thorough, with interesting ablations and relevant baselines * From a reinforcement learning perspective, a lot of aspects of the proposed approach lack clarity. See questions for more details. * The experiments may be presented in too favorable a light for the proposed approach. This may also be related to insufficiently discussed baselines (especially regarding their key differences). See questions for more details. ### Clarity on the reinforcement learning aspect The reinforcement learning part is presented in a quite limited way, raising a number of questions. * The underlying Markov decision process is not even defined; it could be, at least briefly. * The fundamental argument of the paper is that the ELBO doesn’t hold for a negative reward. This is true. However, in RL, the optimal policy is invariant to reward shifting. So in principle one can assume the reward to be positive without loss of generality. This may not be the case empirically, for various reasons (one being the variance of the policy gradient, for example), but this would call for at least some discussion, ideally some baseline/ablation. The closest ablation is SPG w/ neg, which is something fundamentally different (it’s closer to SFT on positive examples, ignoring negative ones, a smooth version of best-of-n). Such a baseline could be Reinforce, but one could also imagine a GRPO variation with a normalization leading to positive advantages (using the fact that a state-dependent baseline does not bias the gradient). Please discuss this aspect as thoroughly as possible. * The objective in Eq. (4) is biased, because the GRPO objective is biased (the baseline does depend on the sampled action). The correct way to do it would be to use a leave-one-out empirical expectation of the return, known as RLOO [A, B].
It’s not a big deal and probably doesn’t change much empirically, but given that the whole point is about better sandwiching the policy objective, it is worth starting from an unbiased one. * The objective in Eq. (4), and then the proxy in Eq. (5), do not consider importance sampling at all. However, it seems that most of the baselines do. Is the proposed approach purely on-policy, like Reinforce, or does it allow some off-policyness, like GRPO, that is not taken into account in the loss, implying some additional bias in what is sandwiched? (As a side remark, RLOO without importance sampling is not biased in the off-policy case, as a corollary of [C], but the overall discussion point remains.) * Many approaches in LLMs, but also in dLLMs, consider regularized RL (typically KL-regularization towards the initial model). It is not discussed here. Is it that no regularization is considered (which is perfectly fine, but worth discussing), or that it is skipped to lighten notations (in which case it should not be, as it could have implications)? ### Too favorable presentation of the experiments This may be a wrong impression, and I am happy to be corrected, but it relies on the following points. * As an initial side note, it would make sense to consider some of the baselines/ablations suggested above (related to the invariance of the optimal policy to shifting rewards). * The main point is that the considered baselines (D1, WD1, UniGRPO) are not described in enough detail (even considering the related works section in the appendix), so we do not know what the key differences with the proposed approach are. The current ablation shows well that the different components are important to the proposed approach, but what is missing is that the considered baselines are probably a form of ablation somehow. For example, SPG w/ ELBO seems pretty close to some of the baselines (like uniGRPO, but without IS/clipping?); how does it compare? Another example is that, if we look at the results of Table 1 for the baselines and compare them to the results of SPG with the random masking strategy, the results are much closer (e.g., uniGRPO vs SPG w/ EUBO and random masking). So one may wonder whether just applying the block-wise masking to uniGRPO would not lead to very good results too (weakening the EUBO contribution). Maybe it is not the case thanks to the mixing, but maybe also the results are more nuanced than how they are presented (or not, but this would then call for a better discussion of the baselines and their key differences, notably w.r.t. SPG). * Another point that presents the proposed approach too favorably is that the SPG-specific hyperparameters ($\beta$ and $\omega$) are basically tuned on the test set, which biases the comparison too favorably toward SPG. [A] Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLM, Ahmadian et al., 2024 [B] Buy 4 REINFORCE Samples, Get a Baseline for Free! Kool et al, 2019. [C] Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion, Flet-Berliac et al, 2024. Fully human-written
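To make the RLOO remark concrete, a leave-one-out baseline can be computed as below (a generic sketch, not tied to the submission's notation); because each sample's baseline depends only on the other samples in the group, it does not bias the policy gradient.

```python
import torch

def rloo_advantages(rewards):
    """Leave-one-out advantages for a group of G completions of the same prompt.

    rewards: (G,) tensor of per-completion returns.
    """
    G = rewards.shape[0]
    baselines = (rewards.sum() - rewards) / (G - 1)   # mean reward of the other G-1 samples
    return rewards - baselines
```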
EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes EmotionThinker, a novel framework for explainable speech emotion recognition (SER) that leverages CoT reasoning and reinforcement learning (RL) to move beyond standard categorical classification. The authors introduce EmotionCoT-35K, a large dataset with CoT annotations and fine-grained prosodic and semantic factors tailored for emotion reasoning. They further propose an RL-based optimization framework (GRPO-PTR) that incorporates a progressive, trust-aware reasoning reward, balancing outcome accuracy and reasoning quality. Extensive experiments over four benchmarks and ablation studies demonstrate that EmotionThinker achieves superior emotion recognition accuracy and richer, more interpretable explanations compared to a wide range of baselines. 1. The reformulation of SER as a deep reasoning task—rather than mere label prediction—is timely and promising for advancing interpretability in multimodal LLMs. 2. The proposed dataset, EmotionCoT-35K, fills a significant gap with CoT-style, prosody-aware emotion reasoning data, with a scalable, largely automated annotation pipeline. This may have value for the broader community. 3. The proposed reinforcement learning scheme employs progressive reward scheduling and a trustworthiness weight to dynamically balance outcome and reasoning reward signals. This helps mitigate reward hacking, stabilizes training, and may be meaningful for the LLM RL community. 1. The data construction pipeline heavily relies on LLMs, and the reasoning trace data is constructed with GPT-4o without the actual speech input. This may lead to unexpected failures and biases in the dataset. It would also be beneficial to input the speech and conduct a human review of the data quality. 2. The proposed reward model plays a critical role in the RL process. However, there is little discussion or quantitative validation of its calibration. The distributions of GPT-annotated versus human-annotated reward scores are not directly compared. 3. The description of the RL part is not very easy to follow. It would be better to improve the logic flow in this part. 1. Can the authors provide more analysis comparing the similarities and differences between GPT-4o-based and human-based scoring for CoT data and reasoning reward trace quality? Are there specific failure modes or biases in the model-synthesized data? 2. Are there statistics on annotation accuracy for the automated annotations (prosody, stress, speaker traits) used in EmotionCoT-35K? Do certain emotion categories or speaker groups have systematically noisier annotations or explanations? 3. Typos: Line 270, Appendix without reference. Fully human-written
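As a reading aid for the GRPO-PTR discussion, one plausible form of the progressive, trust-weighted reward combination is sketched below; the linear warm-up schedule and the multiplicative trust weight are assumptions for illustration, not the paper's exact formula.

```python
def combined_reward(r_outcome, r_reasoning, step, total_steps, trust):
    """Blend outcome and reasoning rewards: the reasoning term is introduced
    progressively over training and scaled by a trustworthiness weight in [0, 1]
    reflecting how well the reasoning trace supports the predicted emotion."""
    ramp = min(1.0, step / (0.5 * total_steps))   # assumed linear warm-up over the first half of training
    return r_outcome + ramp * trust * r_reasoning
```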
EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. First, I would like to summarize the contributions of the work by reading the abstract. First, a speech emotion dataset was constructed. Second, current speech LLMs have weak prosody perception. This work tries to address this issue by developing a prosody-enhanced foundation model. Third, a new type of reinforcement learning protocol is proposed, which progressively introduces a reasoning reward by dynamically adjusting it with trustworthiness weights, reflecting the alignment between reasoning and outcome. **1.** The motivation of the work is clearly stated and explained. **2.** A first RL-based emotion recognition framework that provides not only accurate classification but also detailed reasoning rationales and informative captions for the audio. **3.** Each stage of the proposed framework is clearly defined. **4.** The evaluation and ablation are comprehensive. **1.** For the accuracy of emotion recognition, I would also like to know the performance on each individual discrete emotion. That way, we can have a more concrete and detailed understanding of the framework's capabilities and limitations. **2.** To construct the reasoning responses, is there a specific reason that only GPT-4o is used? **Other comments:** For Section 3.1, the authors discuss the open-sourced datasets they used to construct EmotionCoT-35K. I think the authors could also briefly discuss other related multimodal datasets in the paper, whether or not they are used for constructing the new dataset, such as the following: MSP-PODCAST (the most recent, final version): Busso, Carlos, et al. "The msp-podcast corpus." arXiv preprint arXiv:2509.09791 (2025). MERSA: Zhang, Enshi, Rafael Trujillo, and Christian Poellabauer. "The MERSA dataset and a transformer-based approach for speech emotion recognition." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. CMU-MOSEI: Zadeh, AmirAli Bagher, et al. "Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. DECAF: Abadi, Mojtaba Khomami, et al. "DECAF: MEG-based multimodal database for decoding affective physiological responses." IEEE Transactions on Affective Computing 6.3 (2015): 209-222. Fully human-written
EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces **EmotionThinker**, a novel model for speech emotion reasoning that aims to reframe Speech Emotion Recognition (SER) from a simple classification task into a deep, explainable reasoning problem. The core contribution is the design of a **Prosody-Aware Reinforcement Learning (RL) framework**. This framework guides a Large Language Model (LLM) to generate coherent, feature-grounded text explanations (i.e., reasoning paths), thereby bridging raw acoustic signals, the textual reasoning process, and the final emotional label. This innovative approach addresses the critical lack of interpretability in existing SER systems and SpeechLLMs. - **Originality (Originality):** Very High. The combination of SER, LLM, and RL specifically tailored for generating prosody-grounded explanations is highly novel within the speech community. - **Quality (Quality):** High. The approach moves beyond simple performance metrics by incorporating explanation quality into the optimization objective, indicating a rigorous research focus on a complex problem. - **Clarity (Clarity):** Good. The core idea is presented clearly, and the reasoning process (via case studies) is easily digestible. - **Significance (Significance):** Substantial. This work significantly pushes the boundaries of transparency and trust in SpeechLLMs for emotional tasks, which is an important step for the future of multimodal AI. 1. **Technical Granularity of RL and Prosody Integration (Critical):** - The paper emphasizes **"Prosody-Awareness,"** but the precise and **explicit mechanism** by which the RL framework guides the LLM to attend to the *most critical* prosodic features (e.g., sudden pitch shifts over average pitch) needs more profound elaboration. - The **RL Reward Function** is paramount. A detailed ablation study is essential to show how the different components of the reward (classification accuracy vs. fluency vs. factual grounding to acoustic features) are balanced and how this balance impacts the quality and faithfulness of the final explanation. The current description suggests this crucial balance may be underspecified. 2. **Data and Generalization Concerns:** - Training LLMs via RL for generation tasks often relies heavily on high-quality **human-annotated reasoning path data** for initial Supervised Fine-Tuning (SFT) or as part of the reward signal. The paper must provide a candid discussion on the cost and scarcity of this data and how the model manages to generalize its reasoning to novel or atypical speech examples outside the training distribution. - Generalization to different languages or accents, crucial for a model involving LLM-style reasoning, is also a concern that needs addressing. 3. **Efficiency and Deployment Feasibility (Practicality):** - The combination of LLM and RL training typically incurs a substantial computational overhead. The paper is currently lacking in a detailed analysis of the **training efficiency, required computational resources (GPU-hours)**, and most importantly, the **inference latency** compared to existing, lightweight SER systems. 
This is vital for assessing the model's practical viability for real-world deployment. 1. RL Reward Faithfulness and Ablation (Critical) Provide an **Ablation Study** on the RL reward components. How do you ensure the explanations are **truly faithful to prosodic facts** and not just syntactically fluent fabrications? 2. Technical Mechanism of Prosody Grounding Clarify the **explicit mechanism** that links a generated text token (e.g., "high pitch") to the **specific, salient acoustic feature** in the speech input. 3. Practicality and Efficiency Analysis Provide detailed **Inference Latency** and **Training Cost (GPU-hours)** analysis. Is this model practically deployable compared to lightweight SER baselines? 4. Generalization Show **Out-of-Distribution (OOD)** results to prove that the learned reasoning generalizes beyond the training corpus. Fully AI-generated
EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper explores an interesting problem by extending emotion modeling from classification to reasoning with promising results. However, the methodology and training details are unclear, reproducibility is lacking due to missing code release, and definitions of prosody and emotional cues need stronger justification and clarity. The motivation is clear, and the research problem is interesting, as it extends beyond improving emotion classification toward developing deeper reasoning capabilities. The proposed model demonstrates strong performance in both emotion recognition and emotion reasoning, providing valuable insights for advancing SpeechLLMs toward more effective emotion reasoning capabilities. It is unclear how your model is trained and how it builds upon Qwen2.5-Omni-3B. Please clarify the training process and provide clear explanations for all symbols and notations in your equations, as they are currently difficult to interpret. The methodology section, particularly Section 3.3, lacks clarity. Please provide a clear description of the overall training pipeline and explain the motivation behind each step. The writing in Section 3.3.1 should be further improved for better structure and readability. Additionally, clarify the purpose of the forward reward and outcome accuracy reward, why both are needed, how they relate to the components shown in Figure 3, and what their specific inputs and outputs are. Will you release your code and dataset? The reproducibility checklist is missing, and without a clear commitment to open-sourcing these resources, the paper’s reproducibility and credibility are severely limited. I may not be able to recommend acceptance unless this issue is properly addressed. How do you handle emotional cues that arise from linguistic content? For example, if the text conveys a happy emotion but the corresponding speech expresses sadness, which modality is prioritized in your final emotion prediction? Does emotion inferred from text affect the overall performance of your model? Please provide both theoretical justification and experimental evidence to support your claim that “prosodic signals are core carriers of emotional intent.” How do you account for the role of textual content and nonvocal components (e.g., crying, laughter)? If you argue that prosody is the most dominant factor, please include empirical evidence demonstrating that prosody contributes more significantly to emotion perception than textual and nonvocal cues. Please clarify how you define prosody. Are speaker traits such as gender and age also included under this term? Appropriate literature references should be provided to accurately define both “prosody” and “speaker traits,” as some of the current definitions appear to be inaccurate. Lightly AI-edited
TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper introduces TEMMED-BENCH, a benchmark for assessing temporal medical image reasoning for LVLMs. Each test item pairs a historical and a current chest radiograph from the same patient and evaluates models on three tasks: binary VQA about condition change, change-focused report generation, and an image-pair selection task where the model must pick the pair that matches a specified change statement. The paper also proposes pairwise image retrieval for multi-modal RAG, which retrieves image-pairs (and their reports) whose historical/current images are jointly similar to the query pair. The results show that several LVLMs benefit more from multi-modal RAG than from text-only RAG. Overall, most LVLMs perform near chance on several tasks; reasoning-oriented proprietary models fare best, but none are strong yet on temporal change analysis. - The motivation is crisp and clinically grounded: real radiology practice is often longitudinal, while most benchmarks are single-visit. - The proposed image-pair selection task is novel to me. It is also more vision-centric than typical multi-choice VQA and stresses multi-image reasoning. - The paper is clear, with concrete task definitions, corpus statistics, and straightforward evaluation protocols. **1. Some claims need to be clarified with discussion** - Please check the claim “the first benchmark that focuses on evaluating the temporal reasoning ability of LVLMs”, since there are existing works (e.g., MedFrameQA [1], MMXU [2], MedMIM [3], ICG-CXR [4]) that also feature temporal evaluation by gathering longitudinal studies from a pool of imaging studies. The authors should discuss how their work differs from those. [1] MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning (arXiv 2025.05) [2] MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression (ACL 2025) [3] Medical Large Vision Language Models with Multi-Image Visual Ability (MICCAI 2025) [4] Towards interpretable counterfactual generation via multimodal autoregression (MICCAI 2025) - In Line 201, the authors claim that “We randomly selected 1,000 instances as the test set and used the remaining instances as the knowledge corpus.” I am unclear whether the split is based on patients or instances (images from the same patient can appear in multiple instances). If patient overlap exists between the test set and the knowledge corpus, there might be near-duplicate visits from the same patient and the RAG gains might be inflated. I suggest the authors check the split and see if the results need to be re-evaluated. **2. Some important details are missing** - During dataset curation, from multiple longitudinal images of the same patient, how is each image pair in each case chosen? Are these image pairs from consecutive examinations (e.g., study1-study2, study3-study4), from non-consecutive examinations (e.g., study1-study3, study2-study4), or from both? - What is the distribution of time gaps between historical/current images, and how do performance and error types vary by short vs. long intervals?
- While the image-pair selection task is novel to me, I am curious how this task is done by existing LVLMs, since the models may not natively support it. Specifically, how do the authors modify the LVLMs so they can perceive the order of input image-pairs (“A”, “B”, “C”)? Do the authors directly concatenate tokens of different image pairs together? Does that need an extra level of order position encoding? - In Lines 223-226, could the authors provide some statistics that lead to this observation (e.g., how many reports match the regular expression)? This can benefit follow-up research on CXR report analysis. - Which image encoders are used in the pairwise retrieval score? Please see "Weaknesses" Fully human-written
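The question about the pairwise retrieval score could be sharpened with a concrete form; the sketch below assumes a frozen image encoder and a simple sum of cosine similarities over the historical and current images, which is one natural reading of "jointly similar" rather than the paper's confirmed scoring rule.

```python
import torch
import torch.nn.functional as F

def pairwise_retrieval_scores(q_hist, q_curr, corpus_hist, corpus_curr):
    """Score candidate image pairs so that both the historical and the current
    image must match the query pair.

    q_hist, q_curr:           (D,)  embeddings of the query's historical / current image
    corpus_hist, corpus_curr: (N, D) embeddings of the N corpus pairs
    Returns: (N,) joint scores; the top-k pairs (and their reports) are retrieved.
    """
    s_hist = F.cosine_similarity(corpus_hist, q_hist.unsqueeze(0), dim=-1)
    s_curr = F.cosine_similarity(corpus_curr, q_curr.unsqueeze(0), dim=-1)
    return s_hist + s_curr
```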
TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents TemMed-Bench, a new benchmark for evaluating temporal reasoning in medical vision-language models. Unlike previous datasets that test single-image understanding, TemMed-Bench requires models to compare image pairs from different patient visits to assess disease progression or improvement. It includes three task types: VQA, report generation, and image-pair selection. The authors evaluate open and proprietary LVLMs and find that most fail to perform reliable temporal reasoning. Incorporating multi-modal retrieval (image + text) improves performance over text-only retrieval but only modestly. - **High practical relevance**: The proposed benchmark reflects real clinical workflows where patient history and temporal comparison are essential for diagnosis. - **Comprehensive benchmark design**: It combines multiple task types (VQA, report generation, image-pair selection) to test LVLMs in medical reasoning from different perspectives. - **Clear writing**: The paper is well written and clearly structured. The figures and tables are visually clear and effectively support the main points. - **Benchmark difficulty**: The best-performing model, GPT-4o mini, already achieves around 80% accuracy (Table 3 and 4), suggesting that parts of the benchmark may not be very challenging for advanced models. In my opinion, this raises concerns about the benchmark’s long-term difficulty and its ability to differentiate performance among next-generation LVLMs. - **No human expert validation**: The benchmark lacks human radiologist agreement or interpretability checks to confirm tasks and annotation correctness. Although the paper mentions a manual quality control for the VQA dataset, it focuses only on consistency checking rather than clinical verification by radiologists. - **Limited task diversity beyond radiology**: The benchmark focuses mainly on chest X-ray data (CheXpert Plus), limiting the generalization to other modalities such as CT, MRI, or ultrasound. 1. The paper shows that multi-modal RAG (combining image and text retrieval) improves performance over text-only retrieval. Could the authors clarify how the retrieved visual and textual evidence are integrated into the LVLM’s reasoning process? 2. How do the authors ensure that retrieval method does not introduce data leakage or spurious correlations from similar patient cases within the same dataset? 3. Given that GPT-4o mini already reaches around 80% accuracy, how likely is the benchmark to remain challenging and discriminative as future models improve? Moderately AI-edited
TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces TEMMED-BENCH, a benchmark to assess temporal reasoning by comparing historical vs. current medical images. It demonstrates that multi-modal RAG substantially boosts performance and includes manual QC for VQA, yet its single-modality/single-source design, limited failure-case analysis, and lack of broader SOTA baselines leave it vulnerable to score gaming and constrain generalizability. 1. Presents a benchmark that explicitly compares historical vs. current medical images to judge disease progression. 2. Systematically explores the impact of multi-modal RAG and demonstrates that it can markedly enhance task performance, providing a practical pathway for future agents/medical LLMs to handle such problems. 3. Includes manual quality control for the VQA subset, correcting approximately 10% of items and thereby improving the benchmark’s credibility. 4. Compared with datasets like U-Xray and MIMIC-CXR, this benchmark emphasizes reasoning grounded in retrieved evidence, making it less vulnerable to simple pattern-matching hacks. 1. The image modality is overly narrow, and the data source is single and easily accessible; as a benchmark, it is easy to reverse-engineer and reward-hack, which may make it obsolete quickly. 2. The paper mainly offers macro-level discussion of “why this is hard” and “which settings cause models to fail or become ineffective,” but lacks a systematic, category-wise failure analysis with concrete error cases. Since a benchmark is meant to drive LLM progress in a specific domain, the work should provide insights explaining why certain models fail. 3. Many evaluated LVLMs are not SoTA in 2025. For example, only Gemini 2.5 Flash was tested (not Gemini 2.5 Pro), and Claude 3.5 Sonnet (not 3.7 or later). Newer models should be included. Several open-source baselines are also relatively old, e.g., LLaVA-Med. Minor: The benchmark’s “temporal reasoning” setup is essentially **two-timepoint differencing**, rather than true longitudinal reasoning about disease progression over time. Please answer the following questions: 1. Because the dataset comes from a single, easily accessible source, the benchmark appears highly vulnerable to reverse engineering and score gaming. Do you have any safeguards to prevent this? 2. Please provide a failure-case analysis with concrete examples, and consider adding this section in future revisions. 3. If possible, include results from more advanced models to strengthen the evaluation. Lightly AI-edited
Page 26 of 1516 (75800 total rows)