|
Where and Why in Image Forgery: A Benchmark for Joint Localization and Explanation |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Forgery Attribution Report Generation, a new forensics task that simultaneously localizes forged image regions (“Where”) and generates human-readable explanations of the manipulations (“Why”). To support this task, the authors release Multi-Modal Tamper Tracing (MMTT), a large-scale dataset with human annotations.
The authors further propose ForgeryTalker, a unified vision–language framework for forgery localization and textual explanation. The model is trained in two stages, covering cross-modal alignment and finetuning. Extensive experiments show that ForgeryTalker outperforms strong language and localization baselines (e.g., InstructBLIP, LISA-7B, SCA) in both textual report generation and forgery localization. Ablation studies confirm the importance of the Forgery Prompter Network (FPN) and the pretraining design.
Overall, the work contributes a new benchmark, task definition, and baseline model for explainable image forgery analysis, moving beyond binary classification toward interpretive forensics.
1. The paper clearly formalizes localization and explanation as a joint problem for image forensics to provide interpretability.
2. The proposed MMTT is a large-scale (150K+) dataset that integrates pixel-level masks with linguistic annotations.
3. ForgeryTalker elegantly integrates multimodal reasoning with a shared encoder and dual decoders, enabling joint optimization on localization and text explanation.
4. The model achieves state-of-the-art performance across both language (CIDEr and BLEU) and localization (IoU, Precision) metrics, outperforming existing baselines.
Major Weaknesses:
1. While the paper proposes the new task of joint localization and explanation, the necessity for combining these subtasks is not clearly justified. Given that both have been studied in the field of image forgery, the authors should clarify how their joint formulation yields additional insights or mutual improvements.
2. In Fig. 4, the textual instruction fed to the mask decoder differs between the Forgery-aware Pretraining Stage and the Forgery Generation Stage, which may introduce potential biases.
3. The baselines are limited to general multimodal language models, while existing multimodal forgery explanation methods (e.g., FakeShield[1], M2F2-Det[2]) are absent.
4. The method is evaluated only on DQ_F++, another face-forgery dataset. Evaluations on broader out-of-distribution forgeries would strengthen the authors' claims.
Minor Weaknesses:
1. Some inconsistent names and definitions (e.g., “Forgery Attribution Report Generation” vs. “Joint Forgery Localization and Explanation”, "Forgery Generation Stage" vs. "Attribution Report Generation Stage") may confuse readers.
2. Typo in "Cross-model Alignment Learning". "Cross-model" should be "Cross-modal".
[1] Xu, Zhipei, et al. "Fakeshield: Explainable image forgery detection and localization via multi-modal large language models." arXiv preprint arXiv:2410.02761 (2024).
[2] Guo, Xiao, et al. "Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
1. What advantages arise from jointly performing forgery localization and explanation, compared to handling them separately?
2. Please provide details of the textual instructions fed to the mask decoder in the pretraining and generation stages.
3. The authors should include multimodal forgery explanation methods (e.g., FakeShield, M2F2-Det) for fair comparisons.
4. How does the model generalize to out-of-distribution forgeries in other benchmarks? |
Lightly AI-edited |
|
Where and Why in Image Forgery: A Benchmark for Joint Localization and Explanation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces a multimodal task of joint forgery localization and explanation, proposing the MMTT dataset with 152,217 samples containing forged images, pixel-level masks, and human-authored textual descriptions. The authors present ForgeryTalker, a unified framework that combines InstructBLIP with a Forgery Prompter Network (FPN) to generate both segmentation masks and natural language explanations of facial manipulations.
1. The paper constructs a large-scale dataset of 152,217 samples with careful human annotation, including both pixel-level masks and detailed textual descriptions, which will benefit the research community.
2. The paper successfully combines InstructBLIP, SAM's decoder, and a custom FPN to jointly address forgery localization and explanation generation.
3. The paper is easy to follow with informative figures and accessible writing that clearly motivates each design choice.
1. The annotation process may introduce confirmation bias since annotators are shown the ground-truth masks before writing descriptions. This guidance could lead them to describe artifacts that are not perceptually obvious or do not actually exist, especially for high-quality forgeries. The paper lacks inter-annotator agreement analysis or blind validation to verify annotation quality.
2. The paper only evaluates on MMTT (their own) and DQ_F++ datasets, both synthetic and research-oriented. Critical tests on real-world deepfakes, cross-forgery-tool generalization, and robustness to different manipulation methods are missing, raising concerns about practical applicability.
3. The paper does not explain why Q-former is necessary over the now-standard image encoder + LLM architecture. An ablation comparing direct ViT features + LLM versus the Q-former approach would help justify this design choice.
4. The paper lacks discussion and comparison with recent relevant works in multimodal face forgery detection, particularly: "FFAA: Multimodal Large Language Model Based Explainable Open-World Face Forgery Analysis Assistant", "MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection", "FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models" (ICLR 2025), "Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector" (CVPR 2025), and "Towards General Visual-Linguistic Face Forgery Detection" (CVPR 2025, which similarly uses mask information to assist annotation). Direct comparisons would better position this work's contributions.
5. While the paper ablates the FPN component, it lacks individual analysis of the four pretraining losses ($L_{mlm}$, $L_{lm}$, $L_{seg}$, $L_{con}$). How were the loss weights (2:1:1:1) determined? What happens if individual losses are removed? Table 5 shows only different weight combinations but no systematic analysis of each loss function's contribution.
1. How do you ensure annotation quality given the confirmation bias risk? Since annotators are shown ground-truth masks before describing forgeries, what measures prevent them from "over-interpreting" high-quality fakes or describing non-existent artifacts? Did you conduct any blind validation or inter-annotator agreement tests?
2. Why does the FPN achieve only marginal improvement (39.16 vs 38.92 PLM)? Given the large performance gap between using ground-truth prompts (95.1 CIDEr) and predicted prompts (59.3 CIDEr), how do you address this bottleneck? Have you explored end-to-end training or alternative prompt generation strategies?
3. What are the computational costs and inference efficiency? How long does the two-stage training take, and what are the GPU memory requirements? What is the inference time per image? This is important for assessing the practical deployment feasibility of ForgeryTalker. |
Fully AI-generated |
|
Where and Why in Image Forgery: A Benchmark for Joint Localization and Explanation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper introduces a joint image forgery understanding task that requires models to determine both where an image has been manipulated and why. To support this, the authors build a new large-scale dataset (MMTT) containing over 150k manipulated images with pixel-level masks and human-written explanations covering multiple manipulation types (face swapping, attribute editing, inpainting, etc.). They also propose a unified model called ForgeryTalker.
1. The joint task of localization and explanation is clearly defined and offers a more interpretable understanding of image forgeries.
2. The proposed MMTT dataset contains over 150k samples with pixel-level masks and region-grounded textual descriptions. It covers diverse manipulation types and provides fine-grained part-level statistics.
1. If the dataset focuses primarily on face manipulations rather than general image edits, this should be explicitly stated in the title and abstract. Moreover, there are existing works on general image tampering that include both facial and non-facial examples with region-level and textual annotations, but comparisons and cross-domain evaluations with such datasets are missing.
2. Cross-dataset evaluation is very limited. There is no mask-level generalization test on established third-party localization datasets.
3. VLMs are evaluated in zero-shot mode, while ForgeryTalker is trained on the MMTT dataset. More importantly, comparisons with specialized deepfake or forgery detection methods are absent, making it difficult to assess whether the gains come from model design or data familiarity.
4. The paper claims over 150k human-written explanations, which implies a massive annotation effort. However, it does not specify how many annotators participated, how annotation consistency was ensured, or how quality control was performed. These details are crucial to assess dataset reliability for the benchmark.
Issues raised in the *Weaknesses* section. |
Moderately AI-edited |
|
Where and Why in Image Forgery: A Benchmark for Joint Localization and Explanation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces a new large-scale dataset of real images from FFHQ and CelebA, manipulated using three groups of techniques (namely, face swapping, face editing, and image inpainting). The generated deepfakes are then shown to 30 trained human annotators, who, after observing the images and manipulation masks for a minimum of 1 minute, write a text description of the obvious areas of manipulation.
Subsequently, the dataset is used to train an MLLM on mask prediction and interpretation. The MLLM extends the InstructBLIP architecture by incorporating a SAM decoder and several losses to improve performance.
- The paper is well written and easy to follow. The steps to obtain the manipulated data and annotations can be replicated from the paper and supplementary material. Similarly, the MLLM architecture is relatively clear (although it has many components).
- In terms of performance, ForgeryTalker performs better than InstructBLIP on localisation, which is expected. We also see good performance on DQ_F++.
- Most weaknesses identified are in relation to the trained MLLM. First and foremost, the ablations seem incomplete. The method uses several losses and a two-stage training method. The ablations are limited to with/without FPN and very few variations of the weight of each loss component; what happens if we only train end-to-end? What if we drop the MLM loss?
- The model is trained on localisation, but this assumes the FPN has already detected the areas. Converting the class labels to coherent sentences is a trivial task for the MLLM, so the effectiveness of the architecture is a little misleading. Of course, the pipeline still works, but it seems overkill -- i.e., why convert a set of class labels to sentences if they are already human-readable? The architecture would make more sense in a VQA format.
- The architecture seems to be trained only on forgeries. As such, we should expect some hallucinations and false positives on real images. This is, in fact, a drawback, as it assumes another pre-trained method ahead of the FPN to do the binary detection, further raising computational requirements.
Some minor points:
- It is unclear if the "augmented evaluation to DQ_F++" means cross-dataset generalisation or if the model is finetuned on that dataset.
- line 281: there should be a new paragraph for this component, to maintain template consistency
- The MLM loss section is a little unclear. $t$ is never defined, and it is also unclear what the ground truth is and what the reasoning behind it is.
- How does the model perform on other standard datasets, e.g., FF++?
- How does the model perform when contrastive and mlm losses are not used?
- How does the model behave when real images are given?
- Is table 3 showing cross-dataset generalisation?
- Why is the binary detection dropped? What is the motivation for such a choice? |
Fully human-written |
|
Linkage-Guided Genetic Variation: Overcoming Operator Blindness in Genetic Algorithms |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper introduces the Evolving Locus Linkage Graph (ELLG), an online learning mechanism to make genetic crossover and mutation location-aware. In a standard GA, crossover and mutation points are chosen uniformly at random, often disrupting co-adapted genes (“building blocks”). ELLG addresses this by maintaining a weight $W(k)$ for each adjacent pair of loci $k$ and $k+1$ along the chromosome (effectively a weight on each edge in a linear genome graph). These weights represent learned linkage strength between loci – high weight implies the loci form a tightly linked segment (should be preserved together), while low weight implies a weak dependency (a good cut point).
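For concreteness, here is a minimal sketch of how such per-site linkage weights could drive variation; the function names, the inverse-weight sampling rule, and the simplified update are illustrative assumptions rather than the paper's exact Eq. (6):

```python
import numpy as np

def pick_cut_site(linkage_weights, rng=None):
    """Sample a crossover/mutation site between loci k and k+1.

    Assumption for illustration: weakly linked boundaries (low W(k)) should be
    cut more often, so the sampling probability is inversely proportional to
    the learned weight. The paper's exact sampling rule may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(linkage_weights, dtype=float)
    probs = 1.0 / (w + 1e-8)
    probs /= probs.sum()
    return int(rng.choice(len(w), p=probs))  # index k = cut between loci k and k+1

def update_weight(linkage_weights, cut_site, offspring_fitness, mean_fitness, lr=0.1):
    """Qualitative reading of the fitness-feedback update: if cutting at this
    site hurt fitness (Delta = f(G) - f_bar < 0), increase the weight so the
    segment is preserved in the future; if it helped, decrease it. The paper's
    threshold lambda and exact functional form are not reproduced here.
    """
    delta = offspring_fitness - mean_fitness
    linkage_weights[cut_site] = max(0.0, linkage_weights[cut_site] - lr * delta)
    return linkage_weights
```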
1. A major strength is that ELLG can be integrated into existing GAs without changing their selection mechanism or encoding. It treats the GA as a black-box optimizer and only tweaks how variation operators are applied.
2. Empirically, ELLG shows dominant or at least competitive performance on a wide array of problems – from low-dimensional, multimodal test functions to complex neural network searches.
3. The learning update and sampling are lightweight. The paper notes essentially no runtime penalty for using ELLG.
1. ELLG only learns linkage between adjacent loci on the chromosome (like a chain graph). This assumes that important building blocks correspond to contiguous segments in the representation. In many problems, especially combinatorial ones, genes that are far apart in the encoding might actually be correlated.
2. The method’s application to multi-objective problems requires collapsing multi-dimensional performance into a scalar $\Delta$. The paper doesn’t explain how this is done, which is a technical weakness in clarity. If the chosen scalarization is naive (e.g., using just one objective or a weighted sum), the learning signal might misrepresent true solution quality in multi-objective space.
3. The integration of ELLG requires that crossover and mutation be implemented in a way that a specific “site” can be chosen. In the standard NSGA-II (and many MOEAs for real vectors), crossover is SBX, which does not have a single crossover point – it mixes every variable. This seems fine, but it effectively means the baseline NSGA-II in the comparisons might not be using exactly the same operator behavior as ELLG-GA.
4. I don't quite get why Eq. (6) can intuitively learn the linkage weight $W^{(k)}$ for each k. Since this rule/update is applied to every individual and between each "cutting site" (k, k+1), I do not see how the fitness gap $\Delta = f(G) - \bar{f}$ can reflect which genes/bits should be bound together w.h.p. More details or better clarification is needed here.
1. In Eq. (2), what are $p_{11}$, $p_{A_1}$, and $p_{B_1}$?
2. Why take the absolute value of LD? Because it can be negative?
3. What value of the reinforcement threshold $\lambda$ was used in Eq. 6 for the experiments? The paper defines $\lambda \ge 0$ but doesn’t say if it was set to 0 or some positive value.
4. You initialize all linkage weights to 1.0. Did you try other initializations (e.g., all weights 0, or random small values)? Would it actually matter in the long run? |
Heavily AI-edited |
|
Linkage-Guided Genetic Variation: Overcoming Operator Blindness in Genetic Algorithms |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This manuscript presents a new mechanism to decide the cutting position of crossover and the position of mutation in genetic algorithms (GAs), based on the linkage strength of the variables. The motivation of the work is to preserve the strongly connected building blocks while exploring at the more independent positions. Experimental validations on multi-objective problems and neural architecture search are performed.
1. The proposed mechanism is based on observations in biological genetics, with a well justified motivation.
2. The experimental design is reasonable, and a number of baseline algorithms are compared.
1. The presentation of the algorithmic details is confusing. See “Questions” for details.
2. Detecting linkage strengths based on fitness statistics is not new. Estimation of distribution algorithms (EDAs) are doing it as well.
3. The ablation studies suggest that strong preservation of the linkages may hamper the explorative ability of the search, defeating the purpose of the mechanism.
1. The meaning of each entry in the strength matrix is confusing. In Figure 2, it seems like each W represents the strength of two specific alleles at two positions. But in the description of the method, each W only has a single superscript k, which corresponds to a locus, instead of a specific allele combination. This inconsistency in notations makes the algorithm description very confusing. Are you trying to learn the strength for every possible pair of alleles, or for each locus?
2. The ablation study in Figure 4 shows that smaller sensitivity parameter leads to better performance. However, this also means we should not preserve strong linkages, which is opposite to the motivation of this work. This observation makes the story a bit self-conflicting. Can you elaborate on this aspect?
3. Can you discuss the difference and connection of the proposed method to Estimation of distribution algorithms (EDAs), which also try to preserve strong linkages based on fitness statistics.
4. Why not test on those non-separable single-objective problems?
5. For the topology search problems in NAS, can you also compare with the shortest edit path crossover [1], which also tries to preserve strongly linked subgraphs while exploring in those more independent regions?
[1] Xin Qiu, Risto Miikkulainen, “Shortest Edit Path Crossover: A Theory-driven Solution to the Permutation Problem in Evolutionary Neural Architecture Search”, ICML 2023. |
Fully human-written |
|
Linkage-Guided Genetic Variation: Overcoming Operator Blindness in Genetic Algorithms |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work proposes a mechanism for adaptively modifying mutation and crossover probabilities according to the current fitness of each individual in the Genetic Algorithm. The adaptivity allows the GA to avoid bottlenecks in the evolutionary process and to arrive at suboptimal solutions more rapidly.
The strength of this work lies in its simplicity; the proposed adaptive mechanism is well-described and makes sense. The simplicity allows it to be incorporated into the conventional GA without altering its core mechanism.
1. The main weakness is the absence of a theoretical argument for this work. For example, for the standard GA there is the Schema Theorem for arguing about convergence. In contrast, this work does not provide any argument about what is changed by the adaptive mutation and crossover rates.
2. The reviewer is not sure that this work is appropriate for ICLR. This work may draw better attention in, for example, the IEEE Congress on Evolutionary Computation (CEC), Genetic and Evolutionary Computation Conference (GECCO), etc.
1. Please try to develop a theoretical argument to add technical depth to this work. For example, how the adaptivity of the mutation and crossover probabilities changes the convergence in the Schema Theorem.
2. This work is not entirely novel. There were many works dealing with the adaptive mutation rates, for example:
P. Hartono, S. Hashimoto, and M. Wahde, Labeled-GA with Adaptive Mutation Rate, Proc. IEEE CEC 2004, pp. 1851-1858 (2004),
doi: 10.1109/CEC.2004.1331121.
Please attempt to compile a more comprehensive reference list of past work on adaptive evolutionary parameters and compare the current work with the relevant past works.
3. Figure 3 should be explained better, as it illustrates the core idea of this paper. For example, it isn't easy to understand Fig. 3(a). The reviewer understands that the column represents the sequence of alleles in an individual in a particular generation. But what do the rows represent? If they represent the generations, doesn't the order of alleles in each generation change? Representing them with the same sequence of colored dots feels strange.
4. Fig. 1 is too small to see.
5. If possible, please provide one negative example in which the adaptivity is detrimental to the evolutionary process, and discuss under what conditions this happens. This will add depth to this paper. |
Fully human-written |
|
Linkage-Guided Genetic Variation: Overcoming Operator Blindness in Genetic Algorithms |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces the Evolving Locus Linkage Graph (ELLG), a novel approach to address the "operator blindness" problem in Genetic Algorithms (GAs). The authors draw inspiration from natural genetic processes to dynamically learn and update linkage weights, guiding crossover and mutation operators to preserve strong genetic segments and target weak boundaries. While I appreciate the creative biological inspiration and the potential to overcome limitations in traditional GAs, I have significant concerns about the paper’s clarity, computational feasibility, and experimental rigor.
On the positive side, ELLG demonstrates promising performance improvements in benchmarks, suggesting it could enhance optimization tasks. However, the paper is difficult to follow for non-experts, and the lack of analysis on computational overhead raises doubts about its practical applicability. Additionally, the limited experimental evaluation, particularly in the NAS case study, and the unavailability of source code undermine the reliability and reproducibility of the results. While the idea is innovative, the paper needs substantial improvements in presentation, computational analysis, and experimental rigor to be fully convincing.
- The paper presents a novel evolutionary algorithm inspired by natural genetic processes, addressing a fundamental limitation of traditional GAs. The biological inspiration is creative and well-motivated.
- ELLG outperforms state-of-the-art approaches in genetic algorithm benchmarks, demonstrating its potential to improve optimization tasks.
- The paper is difficult to follow for non-experts in genetic algorithms. The authors frequently reference biological and chemical concepts that may be unfamiliar to readers in the ICLR community. The presentation of the approach needs significant improvement for broader accessibility.
- The computational cost of ELLG is not investigated. While traditional GAs rely on randomness to reduce complexity, ELLG’s approach of identifying relevant segments for mutation and crossover may introduce substantial overhead. This is a critical concern, as GAs are already computationally expensive, and the lack of analysis on computational requirements undermines the practical relevance of the paper. The limited scope of experiments (e.g., NAS benchmarks) suggests potential computational constraints. The authors should evaluate and report the computational requirements of ELLG to demonstrate its feasibility in practice.
- The evaluation on NAS is limited and insufficient:
1 - The authors did not use gold-standard NAS benchmarks like NAS-Bench-101 or NATS-Bench, which are essential for reliable comparisons.
2 - The comparison is restricted to a single genetic baseline, whereas Table 1 includes multiple baselines. The authors should extend their experiments to include multiple genetic and non-genetic baselines, as GAs are not state-of-the-art in NAS.
3 - Given the limitations of GAs in NAS, the authors should consider focusing on domains where genetic search is more impactful or provide a stronger justification for using NAS as a case study.
- The source code is not available, which hinders reproducibility and reliability. The authors should release their code to allow the community to verify and build upon their results.
- The paper is challenging to understand for non-experts. Could the authors simplify the presentation and avoid assuming familiarity with biological or chemical concepts? For example, the high-level overview of ELLG’s mechanism is not helping with clarity, but rather it is confusing the reader, as the authors attempt to combine a lot of information across a few images without providing clear explanations.
- What is the computational cost of ELLG compared to traditional GAs? Could the authors provide an analysis of the overhead introduced by learning and maintaining linkage weights? How does this cost scale with problem size?
- Why were gold-standard NAS benchmarks (e.g., NAS-Bench-101, NATS-Bench) not used in the evaluation? Could the authors extend their experiments to include these datasets for a more robust comparison?
- The NAS evaluation only compares ELLG to a single genetic baseline. Could the authors include multiple genetic and non-genetic baselines to provide a comprehensive assessment of ELLG’s performance?
- Given that GAs are not state-of-the-art in NAS, could the authors justify their choice of NAS as a case study or consider focusing on domains where genetic search is more effective?
- Why is the source code not available? Releasing the code would greatly enhance the reproducibility and credibility of the results. Could the authors commit to making the code publicly accessible?
- ELLG’s reliance on learning linkage weights may limit its practical applicability due to computational constraints. Could the authors discuss potential optimizations or approximations to reduce overhead while maintaining performance?
- The biological inspiration behind ELLG is intriguing. Could the authors elaborate on how specific biological mechanisms (e.g., genetic linkage, recombination hotspots) directly informed the design of ELLG? Are there biological processes that could further inspire future improvements?
- How do the linkage weights evolve over generations? Could the authors provide examples or visualizations of how these weights change in response to fitness feedback, particularly in complex problems?
- Are there theoretical guarantees (e.g., convergence properties, optimality conditions) associated with ELLG? Could the authors discuss how ELLG’s design aligns with or deviates from traditional theoretical frameworks in evolutionary computation? |
Fully AI-generated |
|
Linkage-Guided Genetic Variation: Overcoming Operator Blindness in Genetic Algorithms |
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
Rightly suggesting that the crossover operator in evolutionary algorithms can be disruptive, the authors propose to learn good crossover points that preserve semantic "building blocks" in the genotypes of the parents. The paper evaluates their approach with standard single- and multi-objective evolutionary algorithms, giving good results.
Their approach is described in great detail, including with pseudo-code! Experimental results are evaluated for statistical significance.
The authors seem unaware that linkage learning has been an established subfield of evolutionary computation for several decades. The paper makes no reference to the many directions and ideas that have been proposed earlier. Two prominent examples are NEAT, which is a "neat" way of doing semantically meaningful crossover in neuro-evolution, and the GOMEA class of algorithms. The authors should have compared with existing linkage learning algorithms.
How does your approach compare with state of the art linkage learning algorithms, e.g., those utilised in GOMEA? |
Fully human-written |
|
MIAU: Membership Inference Attack Unlearning Score for Quantifying the Forgetting Quality of Unlearning Methods |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes MIAU (Membership Inference Attack Unlearning Score), a composite metric for evaluating machine unlearning. MIAU compares three MIA setups—Forget vs Test, Forget vs Retain, and Retain vs Test—then normalizes each attack’s accuracy between a baseline model trained on all data and a retrained model trained without the forget set. A logistic mapping turns a “gap-closure fraction” into a bounded 0–100 score, and the final MIAU is a weighted average (default equal weights). Experiments on MNIST, CIFAR-10/20, and MUCAC across ResNet-18, All-CNN, and ViT, plus several unlearning methods (Fine-tune, SSD, Amnesiac, Teacher), indicate MIAU can separate methods and is often monotone under “partial retraining” baselines.
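To make the scoring pipeline concrete, a minimal sketch of the gap-closure and logistic aggregation described above follows; the logistic slope, the epsilon guard, and the equal default weights are illustrative assumptions, not the paper's calibrated values:

```python
import math

def gap_closure(acc_unlearned, acc_baseline, acc_retrained, eps=1e-8):
    """Fraction of the baseline-to-retrain MIA accuracy gap closed by the
    unlearned model: ~0 means it still behaves like the baseline model,
    ~1 means it behaves like the fully retrained reference.
    """
    return (acc_baseline - acc_unlearned) / (acc_baseline - acc_retrained + eps)

def miau(accs_unlearned, accs_baseline, accs_retrained,
         weights=(1 / 3, 1 / 3, 1 / 3), slope=6.0):
    """Aggregate the three MIA setups (Forget vs Test, Forget vs Retain,
    Retain vs Test) into a single bounded 0-100 score via a logistic map.
    The slope and the equal weights are assumed defaults for illustration.
    """
    total = 0.0
    for w, a_u, a_b, a_r in zip(weights, accs_unlearned, accs_baseline, accs_retrained):
        g = gap_closure(a_u, a_b, a_r)
        total += w * 100.0 / (1.0 + math.exp(-slope * (g - 0.5)))
    return total
```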
- This paper introduces an interpretable metric that aggregates complementary MIA views and normalizes them between baseline and retrain references, yielding a single bounded score that’s easier to compare across methods and datasets.
- This paper proposes a practical audit$\rightarrow$deploy workflow: use a one-time retrained reference to select an unlearning method offline, then apply the chosen method in production—keeping evaluation principled while limiting operational overhead.
- The contribution feels largely constructive/standardizing rather than conceptually new: it consolidates existing MIA signals with anchoring and a bounded mapping to improve interpretability, which is valuable but incremental in novelty.
- The proposed audit$\rightarrow$deploy workflow reads as a sensible formalization of common practice in unlearning evaluation, offering operational clarity but not introducing a fundamentally new deployment paradigm.
- There is growing evidence that MIAs lose discriminative power as model capacity increases [1]. The paper’s own ResNet vs. ViT results seem broadly consistent with this trend. It would strengthen the contribution to discuss the implications for MIAU at larger scales and, where feasible, include evaluations on substantially larger models or complementary privacy signals for regimes where the MIA signal weakens.
- The placement of Figure 2 may be a bit early: it introduces notation before the symbols are formally defined (later in the text), which can make a first read harder to follow. Consider moving the figure to the methods section where definitions appear, or adding a brief notational reference in the caption so the figure is self-contained.
- For $\beta,\gamma,\delta$, the paper fixes equal weights but offers no sensitivity study or guidance to select weights for different privacy/utility trade-offs. This limits practical tunability.
[1] Duan, M., Suri, A., Mireshghallah, N., Min, S., Shi, W., Zettlemoyer, L., ... & Hajishirzi, H. (2024). Do membership inference attacks work on large language models?. arXiv preprint arXiv:2402.07841.
See Weaknesses section. |
Moderately AI-edited |
|
MIAU: Membership Inference Attack Unlearning Score for Quantifying the Forgetting Quality of Unlearning Methods |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes MIAU (Membership Inference Attack Unlearning Score), a new metric designed to evaluate the effectiveness of machine unlearning methods. Unlike prior approaches that rely on a single MIA setting or raw accuracy differences, MIAU integrates three complementary MIA comparisons, Forget vs Test, Forget vs Retain, and Retain vs Test, to capture residual memorization, removal specificity, and generalization stability. The key idea is to position an unlearned model between two meaningful reference points: the baseline model trained on all data and the fully retrained model without the forget set. Using a “gap‐closure fraction” and a calibrated logistic transformation, MIAU provides a normalized, interpretable score (0–100) indicating how closely an unlearning method approximates the privacy behavior of full retraining. Extensive experiments across datasets, architectures, and unlearning methods demonstrate that MIAU can differentiate strong and weak unlearning approaches and generally increases under partial retraining, reflecting progressive forgetting.
(1)One of the most valuable aspects of the proposed MIAU score is the normalization between the fully trained model and the fully retrained model. This creates a principled evaluation interval and enables intuitive interpretation of “how much forgetting has been achieved.”
(2)The authors evaluate MIAU on multiple benchmarks, including MNIST, CIFAR-10/20, and MUCAC, and across diverse model families such as ResNet, AllCNN, and ViT. This broad coverage enhances the external validity of the findings.
(1)The proposed MIAU score entirely depends on membership inference attacks, inheriting several well-known limitations of MIAs themselves, such as instability across random seeds, sensitivity to data complexity, calibration issues, and weak discriminative power on well-generalized models (e.g., MNIST). As a result, the reliability of MIAU is fundamentally constrained by the weaknesses of its underlying signal.
(2)The method implicitly assumes that a retrained model fully represents “ideal forgetting,” but this assumption does not necessarily hold. Retraining may still capture distribution-level information about the forgotten data or produce nontrivial MIA signals. Thus, anchoring the metric strictly between baseline and retrain introduces a conceptual bias and may misrepresent unlearning effectiveness in certain settings.
(3)Different applications impose different priorities: privacy-focused scenarios emphasize Forget vs Test, while utility-centric ones emphasize Retain vs Test. Using fixed uniform weights lacks methodological grounding and may mask important trade-offs.
The experimental setup largely relies on weak or limited MIAs, uses small and simple datasets, includes only a narrow set of unlearning baselines, and lacks evaluation on more recent or complex models. |
Moderately AI-edited |
|
MIAU: Membership Inference Attack Unlearning Score for Quantifying the Forgetting Quality of Unlearning Methods |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes MIAU, an unlearning audit score that aggregates several MIAs and normalizes them between a baseline model and a fully retrained model, aiming to provide a quick offline way to compare unlearning methods without retraining.
1. The “offline audit --> choose a method --> deploy” workflow is well presented and could be convenient in practice.
2. The evaluation is broad, spanning multiple datasets and architectures, with partial-retrain references that acknowledge MIA limits and attempt graded validation.
1. The paper states that prior works “often rely on a single comparison and lack reference points (baseline/retrain).” However, this statement is inaccurate as many unlearning papers do compare to retrain/baseline, including (and not limited to):
[1] Fan, Chongyu, et al. "Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation." ICLR 2024.
[2] Zhao, Kairan, et al. "What makes unlearning hard and what to do about it." NeurIPS 2024.
2. The argument that Retain–Test is essential as a generalization sanity check is not so convincing when standard retain/test accuracies already capture this. Moreover, for the forget-retain setup, the paper suggests that effective unlearning should increase separability between forget and retain sets, but this also risks enabling attackers to identify the forget set, which undermines privacy goals. As a result, I don't think the justification for these two additional setups is strong enough.
3. There are some other questionable statements in the paper. E.g., "after unlearning, accuracy on the forget set should drop slightly, ideally approaching test-level performance." This sounds vague and can be misleading, as matching forget-set performance to test-set performance is not generally a sound or universal unlearning target.
1. You use a logit-based binary classifier as the MIA, but have you tried other (more advanced) MIAs? And how would the choice of MIA affect the results? |
Fully human-written |
|
MIAU: Membership Inference Attack Unlearning Score for Quantifying the Forgetting Quality of Unlearning Methods |
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces the Membership Inference Attack Unlearning Score (MIAU), a composite metric to assess machine unlearning quality. MIAU aggregates three pairwise MIA tasks: Forget vs. Test, Retain vs. Forget, and Retain vs. Test, and normalizes performance between a original model and a retrained model.
A logistic transformation maps gap closure into a 0–100 score. The authors propose MIAU as an offline auditing tool to select the best unlearning method for a model-dataset pair, avoiding repeated retraining in deployment. Experiments span image classification benchmarks (MNIST, CIFAR-10/20, MUCAC), model architectures (ResNet-18, All-CNN, ViT), and unlearning methods (Fine-tune, SSD, Amnesiac, Teacher). The paper claims MIAU captures gradual forgetting and overcomes limitations of single MIA evaluations.
- The paper clearly articulates failure modes of individual MIA tasks (Section 1.1) and motivates the need for a unified metric.
- The experimental scope is reasonable, covering multiple datasets, models, unlearning methods, and robustness checks with multiple iterations.
- Combining three complementary MIA tasks is a logical step beyond single-pair evaluations in prior work.
- The core contribution appears to be a calibration and aggregation of three existing MIA comparisons (Forget vs. Test, Retain vs. Forget, Retain vs. Test), which are already well-established in the literature (as cited at L58-L59 of the paper). The normalization via gap closure and logistic transformation feels like a minor post-processing step rather than a substantive innovation.
- The paper asserts that MIAU detects imperfect unlearning where individual MIAs fail (e.g., retained representations despite low forget accuracy), but it provides no targeted experiments to demonstrate this. For example, no adversarial unlearning setups, synthetic failure cases, or ablations are included. Without such evidence, the claimed improvement over raw MIAs remains unsubstantiated.
- In L325-L327, the paper expects MIAU to increase strictly with partial forgetting (MIAU_{25%} < MIAU_{50%} < MIAU_{75%}), but offers no theoretical justification. MIA accuracy reflects binary classification on differing logit distributions; there is no inherent reason for monotonic improvement. Diverse training dynamics in partial retraining may alter separability unpredictably. This assumption is critical to the “gradual forgetting” claim and requires formal derivation or counterexamples.
- Standard deviations frequently exceed 20 points on a 0–100 scale. This implies that the same unlearning method may receive scores differing by over 40 points across random seeds. Such instability prevents reliable method ranking and defeats the stated goal of efficient offline auditing.
- Can you provide experiments where MIAU detects imperfect unlearning (e.g., retained internal representations with low forget accuracy) while individual MIAs fail?
- Why should MIAU increase monotonically with partial forgetting levels? Please include a theoretical derivation or counterexamples where this property fails.
- How can the high standard deviations in Table 1 be reduced for practical use? |
Heavily AI-edited |
|
PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper argues that preserving direction is more important than preserving magnitude in vector quantization, whereas current VQ methods emphasize reducing the magnitude error. PCDVQ utilizes polar coordinates to enhance the expressivity of the codebook for directional information. PCDVQ also shares a codebook across the entire model to minimize memory consumption, by regularizing all weights to follow the same Gaussian distribution. Experimental results show that PCDVQ achieves state-of-the-art performance.
1. The idea of decomposing the expressive power of the codebook into fine-grained components (direction and magnitude in PCDVQ) is innovative.
2. Regularizing the distribution of weights to share the codebook sounds solid and effective.
1. The paper lacks a theoretical analysis on why directional information is more important than magnitude information.
2. The efficiency should be compared not only with the full-precision model but also with methods such as VPTQ.
3. (minor) Citation format seems inappropriate. Should have used \citep instead of \cite.
1. How are the direction and magnitude individually quantized in Figure 1(a)?
2. How does PCDVQ determine bit widths $a$ and $b$ for direction and magnitude?
3. It appears that using a polar coordinate representation requires additional computation during dequantization. Does this make the method slower than VPTQ? |
Fully human-written |
|
PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling |
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes PCDVQ (Polar Coordinate Decoupled Vector Quantization), a novel post-training quantization (PTQ) framework for compressing large language models (LLMs).
The key insight is that a vector’s direction is more sensitive to quantization errors than its magnitude, yet most existing vector quantization (VQ) methods couple them together and use Euclidean distance, which overemphasizes magnitude errors.
To address this, the authors:
1) Propose Polar Coordinate Decoupling (PCD) — representing weights in polar form and independently quantizing direction and magnitude, allocating more bits to direction.
2) Introduce Distribution-Aligned Codebook Construction (DACC) — building codebooks aligned with theoretical distributions: E8 lattice-based greedy sampling for direction and Lloyd-Max quantization for magnitude.
Extensive experiments on multiple LLMs (LLaMA-2/3, Mistral) show consistent improvements in 2-bit quantization performance, without introducing extra inference cost.
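A minimal sketch of the decoupling step described in this summary (illustrative only: the E8-lattice direction codebook and the Lloyd-Max magnitude quantizer are replaced by generic tensors supplied by the caller, with direction codewords assumed to be unit-norm):

```python
import torch

def quantize_polar(w, dir_codebook, mag_codebook):
    """Quantize a weight sub-vector by decoupling direction and magnitude.

    w: (d,) weight vector; dir_codebook: (K, d) unit-norm direction codewords;
    mag_codebook: (M,) scalar magnitude codewords. Illustrative sketch only.
    """
    mag = w.norm()                 # magnitude component
    direction = w / (mag + 1e-12)  # unit-norm direction component

    # Direction is matched by cosine similarity (inner product with unit
    # codewords), magnitude by absolute error, mirroring the decoupled design.
    d_idx = torch.argmax(dir_codebook @ direction)
    m_idx = torch.argmin((mag_codebook - mag).abs())

    w_hat = mag_codebook[m_idx] * dir_codebook[d_idx]  # dequantized vector
    return d_idx, m_idx, w_hat
```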
Mathematical rigor: The decomposition of quantization error and the codebook derivations are theoretically justified.
Strong empirical validation: Broad experiments on LLaMA-2/3 and Mistral confirm robustness and generality.
Practical efficiency: PCDVQ maintains inference speed while achieving higher accuracy and compression.
Limited discussion on scalability: While the method performs well at 2–2.25 bits, it remains unclear whether benefits persist at moderate bitwidths (e.g., 3–4 bits) or in activation quantization.
Dependency on Gaussian regularization: The approach assumes weights approximate a standard Gaussian distribution after the randomized Hadamard transform; it would be useful to test models with non-Gaussian weight distributions.
Overlap with QuIP#: Many technical components (e.g., E8 lattice codebook and fine-tuning scheme) are similar to QuIP#. The paper does not sufficiently clarify the conceptual distinction and the essential novelty beyond reinterpreting QuIP# in polar coordinates.
Reproducibility and code release: The paper does not explicitly mention whether code and trained quantization configurations will be made publicly available, which is important for validation and adoption.
Q1: How does the method perform at moderate bitwidths (e.g., 3–4 bits) and for activation quantization?
Q2: How sensitive is PCDVQ to the Gaussian regularization step? What happens if it is omitted or replaced with another normalization?
Q3: What is the essential novelty beyond QuIP#? How does the method conceptually differ from QuIP# despite using similar components like the E8 lattice codebook and fine-tuning scheme?
Q4: Will the authors release code, pretrained codebooks, and fine-tuning scripts to ensure reproducibility? |
Fully AI-generated |
|
PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work is motivated by the observation that in vector quantization, the directional component is more sensitive to quantization errors than the magnitude component. Existing Euclidean distance based quantization methods primarily focus on minimizing magnitude errors, which contradicts this finding and consequently leads to larger overall quantization errors. To address this issue, the paper proposes a polar decoupled vector quantization framework, which achieves satisfactory results across multiple experimental settings.
1. Introducing polar decoupled vector quantization is an interesting and novel attempt.
2. The overall writing is clear and easy to follow.
3. The method demonstrates superior performance on several large language models, including LLaMA-2/3 and Mistral, achieving better zero-shot accuracy and perplexity at the 2-bit weight quantization level compared with existing state-of-the-art quantization approaches, which validates its effectiveness.
1. The PCDVQ framework introduces additional computational steps, including polar coordinate conversion, two independent codebook searches using cosine similarity and Euclidean distance respectively, and possible inverse conversion. The paper reports improved throughput mainly due to reduced memory bandwidth, but it does not quantify the impact of these added operations on single inference latency. This is important because on many edge devices compute cost is more critical than memory bandwidth. The authors should provide a detailed latency breakdown, including wall clock time per layer for conversion, codebook lookup, and inverse mapping, measured on representative CPU and low-power GPU hardware, and compare end-to-end latency and energy consumption with baseline methods.
2. The effectiveness of the DACC module relies on the assumption that weight vectors, after a random Hadamard transform, follow an approximate standard Gaussian distribution. It is unclear whether this approximation holds uniformly across all layers and architectures (for example, different LLaMA and Mistral variants). It would be better to include empirical diagnostics showing distributional statistics (mean/variance/skewness/kurtosis) of transformed weights per layer and per model, and discuss cases where the Gaussian approximation breaks down and how that affects quantization error.
3. All experiments are conducted on decoder-only Transformer large language models and evaluated on a limited set of zero shot tasks and language modeling benchmarks. It remains an open question whether the direction magnitude decoupling idea transfers to encoder models such as BERT, Vision Transformer models, or multimodal models. These models have different activation and weight statistics and different sensitivity to quantization.
4. For tasks that require stronger reasoning ability, such as mathematical reasoning or code generation, it is important to know how PCDVQ affects fine-grained semantic fidelity. The current evaluation set does not cover these demanding reasoning tasks. The paper would be stronger if it reported results on a suite of hard reasoning benchmarks and provided error analyses that reveal whether performance degradation (if any) is systematic and whether it is attributable to directional quantization errors or to capacity limits of the codebooks.
Please see the Weaknesses. |
Heavily AI-edited |
|
PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors propose to decouple magnitude and direction quantization when performing vector quantization in large language models (LLMs). The proposed pipeline transforms vectors of weight matrices into polar coordinates and quantizes the magnitude and direction separately, taking into account their distinct statistical distributions. The paper presents a persuasive comparison with other scalar-based and vector-based quantization methods across a range of model sizes.
- The paper provides an insightful observation about vector-based quantization: the difference in the approximation behavior of direction and magnitude.
- The idea of decoupling magnitude and direction components and handling their different distributions is creative and conceptually elegant.
- The proposed method achieves strong performance compared to established baselines.
- The experiments primarily focus on a range of LLaMA models, with only a single Mistral experiment included. Broader evaluation across different architectures would strengthen the paper.
- The comparison with scalar-based quantization methods is limited and could be expanded for a fairer assessment.
- While the idea is simple and well-motivated, its conceptual simplicity raises questions about whether it is substantial enough for a full-length scientific paper.
- Is there a difference in inference speed between scalar-based and vector-based quantization methods? If so, wouldn’t it be fair to include that in the comparison?
- Throughout the paper, you mention quantizing model weights one-by-one. Did you mean layer-by-layer?
- Were the results in Table 1 and Table 2 obtained without fine-tuning?
- Why was QuaRot not included in the experimental comparison, despite being mentioned in the paper? |
Lightly AI-edited |
|
PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Polar Coordinate Decoupling Vector Quantization (PCDVQ), a method designed to improve the accuracy of low-bit quantization for Large Language Models (LLMs). The core objective is to mitigate the substantial performance degradation faced when compressing LLMs to extremely low bitrates (e.g., $\leq 2.5$ bits). PCDVQ addresses this by observing that a vector's direction is significantly more sensitive to quantization error than its magnitude. It thus proposes to decouple the vector into polar coordinates (direction and magnitude) for independent quantization. Empirically, PCDVQ demonstrates superior accuracy retention compared to existing quantization methods.
**Novel and Well-Motivated Decoupling Mechanism:** The fundamental insight that vector direction and magnitude exhibit different quantization sensitivities is novel for LLM quantization and provides a strong, intuitive justification for the methodological decoupling. Quantizing these components separately via polar coordinates is a sound solution to preserve the crucial directional information.
**1. Lack of Robust Experimental Validation on Difficult Benchmarks:** The experimental evaluation is limited to zero-shot multiple choice benchmarks. To fully validate the method's contribution, the paper must be evaluated on more difficult and diverse reasoning benchmarks like MMLU and GSM8K, where small quantization errors often lead to catastrophic failure. The reported accuracy improvement also appears marginal on the limited set of reported tasks, necessitating further validation.
**2. Missing System-Level Inference Evaluation and Comparison:** The inference speed (or throughput) of the quantized model is a crucial component for any quantization algorithm. The paper currently lacks a systemized analysis and comparison of the PCDVQ inference latency against competitors. This absence makes it impossible to assess the practical, end-to-end efficiency trade-off of the proposed method.
**3. Unclear Experimental Consistency in Fine-tuning:** The paper does not clearly articulate whether the same post-quantization fine-tuning methods (if any were used) were applied across all compared baselines (e.g., GPTQ, AQLM, GPTVQ) and PCDVQ. Without explicitly confirming that all methods were compared under the same training/fine-tuning regime, the claimed accuracy improvements may be due to differences in the fine-tuning processes rather than the core PCDVQ mechanism.
1. Could you provide a detailed analysis of the inference speed of PCDVQ? I want to check whether it slows down the model’s inference compared to existing methods.
2. Could you provide experimental results on MMLU and GSM8K, or other comparably challenging benchmarks?
3. Could you clarify the fine-tuning recipe for all competitors? For example, did you perform fine-tuning after quantizing with GPTQ or GPTVQ? |
Lightly AI-edited |
|
PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces PCDVQ, a post-training weight-only quantization framework for LLMs that operates in polar coordinates: each weight vector is decomposed into direction and magnitude, which are quantized independently with a larger bit budget for direction. To reduce distortion, the method builds distribution-aligned codebooks. Across multiple LLMs and benchmarks in low-bit settings, PCDVQ consistently outperforms strong VQ and SQ baselines, indicating that decoupling and aligning to componentwise distributions yields better accuracy at very low precision.
1. The paper shows direction is markedly more sensitive to quantization than magnitude, and analyzes why Euclidean MSE emphasizes magnitude errors more strongly, supporting the decoupling design. The motivation of the work is clear and reasonable.
2. This work provides a clear and comprehensive theoretical foundation for polar coordinate decoupling, demonstrating strong depth and theoretical rigor.
3. Across multiple LLM families and standard zero-shot benchmarks, the main results tables show that PCDVQ generally matches or surpasses strong low-bit VQ/SQ baselines.
1. The choice to allocate more bits to direction is well supported by experiments, but the paper offers no formal analysis to guide the split or to select an optimal allocation under different conditions.
2. The method adopts a fixed vector dimension and borrows several settings from prior work, but it remains unclear how to adapt the dimension or the direction–magnitude bit split across model sizes, layer types, or differing weight statistics. Robustness to these design choices is not systematically examined.
1. How sensitive is PCDVQ to design choices such as the direction similarity metric, codebook size, and the vector dimension, and are there general guidelines for setting these across models? |
Lightly AI-edited |
|
Musculoskeletal simulation of limb movement biomechanics in Drosophila melanogaster |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper constructs a foreleg musculoskeletal system of the fly based on anatomical data and optimization. The authors model 15 muscle-tendon units in each foreleg based on two public and one custom anatomical dataset, and optimize the muscle parameters in OpenSim to align with fly joint movements. They also convert the OpenSim model to the MuJoCo platform and improve model stability by adding passive biomechanical forces. Model analysis demonstrates the qualitative accuracy of the fly musculoskeletal model in OpenSim, and the converted MuJoCo model with passive forces is capable of imitating fly motion data.
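As background for the passive biomechanical forces mentioned in the summary (and for question 3 below on how their parameters were chosen), here is a minimal MuJoCo example of how passive joint stiffness and damping can be specified; the joint name and all numerical values are placeholders, not the parameters of the paper's fly model.

```python
import mujoco

# A single hinge joint with a passive spring-damper. Stiffness, damping, and
# springref values are placeholders, not the paper's fitted parameters.
XML = """
<mujoco>
  <compiler angle="radian"/>
  <worldbody>
    <body name="tibia" pos="0 0 0.1">
      <joint name="femur_tibia" type="hinge" axis="0 1 0"
             stiffness="0.5" damping="0.02" springref="0.3" range="-1.5 1.5"/>
      <geom type="capsule" size="0.01" fromto="0 0 0 0 0 -0.1"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)

# With no actuation, qfrc_passive contains the spring-damper torque pulling
# the joint back toward its spring reference angle.
data.qpos[0] = 1.0
mujoco.mj_forward(model, data)
print("passive joint torque:", float(data.qfrc_passive[0]))
```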
1. The proposed model is the first musculoskeletal model of the fly, which required substantial effort to assemble the anatomical data. The model provides a platform for better biomechanics and neuroscience research.
2. The whole model building pipeline is detailed, which might help the musculoskeletal modeling of other animals.
3. The paper is clear and well written.
1. The model only has foreleg muscles and ignores all contacts, so it cannot reproduce common behaviors such as locomotion and manipulation.
2. The whole model building, optimization, and analysis are conducted in OpenSim, but the musculoskeletal control is conducted in MuJoCo. The MuJoCo model has passive biomechanical forces that the OpenSim model does not. It is not clear whether the muscle accuracy and synergy results obtained in OpenSim transfer to a different platform.
3. In Figure 3c, the RMSE of the coxa-trochanter joint is large with low correlation. Video results of the tracking performance in both OpenSim and MuJoCo would help better assess the fidelity of the model.
1. In Figure 3c and Figure 4d, what does (au) mean in the y-axis label?
2. In figure 4a, it would be better to show the reference joint trajectory compared against the joint movement generated by OpenSim static optimization.
3. How were the parameters of the passive forces in MuJoCo determined? |
Fully human-written |
|
Musculoskeletal simulation of limb movement biomechanics in Drosophila melanogaster |
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper discusses the addition of muscles to the NeuroMechFly drosophila body model. 15 muscle-tendon units were added to each of the forelegs, based on micro-CT imaging. The authors used Hill-type muscle models, and tuned the parameters to reconstruct captured tethered walking and grooming data. They then used imitation learning to learn policies to perform walking and grooming behavior. They also analyzed muscle synergies and examined the effect of passive forces on the policy learning process.
1. The paper contributes a new component to a whole-body model of the fruit fly: anatomically realistic leg muscles.
2. The authors integrate anatomical imaging data, behavior data, muscle and joint dynamics modeling, and imitation learning in an interesting way. Their pipeline seems like a reasonable starting point that others can learn from, especially with respect to selecting which muscles parameters to optimize and the actual process of optimizing them.
3. The paper is well-written and the results are presented clearly.
1. The model the authors present only has muscles in parts of the front two legs, and the work doesn't deal with ground contact or external forces, which could make it difficult to directly apply when modeling many behaviors.
2. It's a little hard to tell without frame-by-frame comparisons, but it seems like the reconstruction quality presented in panel 3c might not be that high, at least for walking data.
3. I'm not sure what to make of the muscle synergy analysis. If a meaningful set of synergies was really discovered, I would expect to see a knee in panel 4c, but it doesn't seem like there is one. The explained variance numbers are a little hard to interpret as well; it might be more interesting to know how the fit quality would degrade using 1-, 2-, or 3-dimensional control vectors (with learned per-muscle weights) instead of controlling each muscle independently. (A toy sketch of the kind of analysis I have in mind appears after this list.)
4. The motivating claim on line 451, "Placing a model of the musculoskeletal system—an additional layer of processing—between the policy network’s output and physical actions stabilizes the control task by making the action space better formed and more error-tolerant", does not seem to be substantiated. To support this claim, I think the authors would need to compare policy learning in models with and without muscles and per-joint passive forces.
5. Unless I missed it, there isn't a direct comparison to the original version of the NeuroMechFly model the authors are extending. This would probably help contextualize the behavior fitting results.
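A toy version of the synergy analysis mentioned in point 3: factor a (time x muscles) activation matrix with NMF and track variance-accounted-for (VAF) as the number of synergies grows, looking for a knee. The data below are synthetic placeholders standing in for the paper's learned activations.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
T, n_muscles, true_k = 500, 15, 3

# Synthetic non-negative activations generated from 3 underlying synergies.
H_true = np.abs(rng.normal(size=(T, true_k)))
W_true = np.abs(rng.normal(size=(true_k, n_muscles)))
A = H_true @ W_true + 0.05 * np.abs(rng.normal(size=(T, n_muscles)))

for k in range(1, 8):
    nmf = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
    H = nmf.fit_transform(A)                     # (T, k) synergy activations
    A_hat = H @ nmf.components_                  # rank-k reconstruction
    vaf = 1.0 - np.sum((A - A_hat) ** 2) / np.sum(A ** 2)
    print(f"k={k}: VAF={vaf:.3f}")               # a knee should appear near k=3
```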
1. What exactly are the inputs to the static optimization procedure used to fit muscle parameters? Model structure + instantaneous joint positions, velocities, and accelerations? And what is the loss function?
2. Why does panel 3c use arbitrary units? Either degrees or (unitless) joint range fractions would be easier to interpret.
3. The plot labels in panel 3d might be a little confusing to some readers. I think I eventually figured out the shorthand, but it took me a minute.
4. On line 381, what is meant by "systematic evaluation"? Do the plotted curves correspond to the best models discovered via a grid search? If so, shouldn't the red-orange family be expected to have an advantage, since there's a larger space to explore?
5. Where did the reference angles for the joints come from? Would it be worth attempting to discover better values for these?
6. In terms of reproducibility, it looks like the tracked keypoint locations are included in the supplementary material, which should be helpful to readers interested in replicating the authors' results. Are the behavior videos and/or the imaging data available as well, or will they be made available? |
Fully human-written |
|
Musculoskeletal simulation of limb movement biomechanics in Drosophila melanogaster |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors of this manuscript developed a pipeline to estimate the properties of leg muscles for simulation directly from X-ray scans. They then applied them to imitation learning experiments to make predictions about muscle synergies employed across different behaviors, and investigate the influence of passive joint properties on learning.
The muscles in musculoskeletal modeling are a much-needed part of understanding the neural control of movement. This has been sorely lacking in biomechanical simulations of Drosophila, but this paper aims to address that gap.
1. The details of their optimization-based approach to estimating muscle parameters is interesting, since experimentally estimating these properties is challenging.
2. While the results about muscle synergies differing across behaviors are not surprising, the investigation of the effect of passive joint properties on learning is a clever use of MuJoCo joint parameters to address a scientific question about the advantage of combining passive mechanics and active control.
1. The result that distinct synergies are employed across behaviors (walking vs. grooming) aligns with well-known concepts. While this confirms the model’s biological plausibility, it does not yield surprising scientific insight.
2. Environmental contact forces and interactions are excluded, restricting the ecological validity of locomotion simulations.
3. Although the optimization framework is solid, the lack of direct experimental validation (e.g., comparison with EMG) limits confidence in the accuracy of the predicted activations and synergies.
4. The topic is very niche.
See weaknesses above. |
Fully human-written |
|
Musculoskeletal simulation of limb movement biomechanics in Drosophila melanogaster |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper presents a 3D musculoskeletal model of Drosophila legs. The authors propose a data-driven approach to building these models, with unknown muscle parameters estimated by an optimization approach quite similar to system identification techniques. Once these detailed muscle models have been built, the authors present results on muscle activation patterns, muscle synergies, and the effect of passive joint properties such as damping and stiffness on muscle activations, using reinforcement learning to mimic recorded Drosophila leg movements for two distinct tasks - locomotion and grooming.
The paper is very well written, structured and easy to read. The level of detail, extensive experiments and the thoroughness of the evaluation presented in this paper needs to be appreciated. The problem statement is well motivated and its relation to existing research clearly presented.
The visualizations used in the paper, both of the muscles as well as quantitative results, are great and are really helpful for a reader to understand the text better.
The results presented in the paper are very insightful and come from well thought out experiments. The discussions about muscle synergies and effect of passive joint properties will definitely inspire future research.
While the results are promising, I found that certain details make the descriptions in the paper confusing. For example, all through Sections 3.2 and 3.3, one of the tasks being described is locomotion, and most readers will have a mental image of a Drosophila walking. But in the limitations and future work section, it’s mentioned that body-body and body-environment contact forces are ignored - does this mean that the trajectory following was done without any contact between the leg and a surface? Adding to the confusion, in the appendix the authors mention “We concentrated our efforts on developing a fully functional front-leg muscle model for two main reasons”. Does this mean that only the motion of the front two legs was simulated in the locomotion experiments? Clearly describing the exact task would be very helpful to the reader.
Some details about imitation learning would be a great addition to the paper - what was the dimensionality of the state and action space?
For most muscles in figure S6, the learned activations seem to be either 0 or 1 across all passive joint properties. Can the authors comment on the plausibility of this in a real musculoskeletal system?
Typically for reward functions in reinforcement learning for locomotion or any in general imitation of a recorded trajectory - there is a term that penalizes effort. For example, sometimes this can be minimizing joint torques, accelerations, muscle activations, etc.. I'm curious if this was considered during experiments?
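To make the question concrete, the kind of effort-penalized imitation reward I have in mind looks roughly like the following; the weights and exact terms are illustrative assumptions, not the paper's reward.

```python
import numpy as np

def imitation_reward(qpos, qpos_ref, activations, w_track=5.0, w_effort=0.1):
    """Tracking term rewards matching the reference joint angles; the effort
    term penalizes squared muscle activations (placeholder weights)."""
    tracking = np.exp(-w_track * np.sum((qpos - qpos_ref) ** 2))
    effort = w_effort * np.sum(activations ** 2)
    return tracking - effort
```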
I'm also interested in the authors' thoughts about the relationship between muscle synergies and passive joint properties; would it make sense to examine how the synergies change as the joint properties change?
Comment - Typically, exploration in reinforcement learning is challenging, especially when the action space consists of muscle activations. There has been some work on addressing this - https://openreview.net/forum?id=C-xa_D3oTj6 . The idea is not to explore naively but to identify some state-dependent exploration that improves learning efficiency. This could be a useful tool for the next set of experiments using reinforcement learning. |
Fully human-written |
|
Structure Guided Equation Discovery with Influence-Based Feedback for Large Language Models |
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes SGED, a framework where LLMs iteratively propose basis functions for linear models, then prune them using per-term influence scores that measure each term's contribution to validation MSE. The method can operate through simple iterative refinement or be enhanced with MCTS for broader exploration. Experiments on biological and synthetic datasets show competitive performance, with some interesting detailed case studies (e.g., on RNA Polymerase II pausing). The core claim is that providing LLMs with granular, per-term feedback enables more effective equation discovery than coarse scalar metrics alone.
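For concreteness, the per-term influence mechanism described above can be sketched as follows: fit a linear model over the proposed basis functions, then zero out each fitted coefficient in turn (without refitting) and record the change in validation MSE. Regularization and refitting details are assumptions on my part.

```python
import numpy as np

def influence_scores(Phi_train, y_train, Phi_val, y_val):
    """Phi_*: (n, m) candidate basis functions evaluated on train / validation
    inputs. Returns, for each term j, the increase in validation MSE when its
    fitted coefficient is zeroed (large positive => removing term j hurts)."""
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    base_mse = np.mean((Phi_val @ coef - y_val) ** 2)

    scores = np.zeros(Phi_train.shape[1])
    for j in range(Phi_train.shape[1]):
        c = coef.copy()
        c[j] = 0.0
        scores[j] = np.mean((Phi_val @ c - y_val) ** 2) - base_mse
    return scores
```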
- The paper is very well-written and clearly structured, making the proposed method easy to follow. The inclusion of extensive experiments and ablations in the appendix (influence feedback impact, MCTS contribution, robustness across LLMs, scalability analysis, etc.) demonstrates thorough empirical work that goes beyond what is typical and I really appreciate this.
- The benchmark problems are practical and interesting, particularly the RNA Polymerase II pausing case study with large number of 263 features. High-dimensional feature spaces where domain knowledge can guide and narrow down feature selection and engineering represent realistic scenarios where LLMs could provide real value. The biological validation of discovered patterns and correlations adds value.
- Several methodological choices are well-motivated: using MCTS to avoid local optima in the search space, the dual-agent architecture separating proposal from pruning, and the idea of providing granular per-component feedback rather than scalar metrics. The influence score mechanism provides an interpretable signal about term contributions that is more informative than overall MSE alone.
While I really appreciate the extensive experiments and ablations in the paper, I have a major concern with the paper's positioning and evaluation:
- The framing of the paper as "symbolic regression and equation discovery" and the comparison with SR methods are not very accurate. The method uses a fixed modeling choice (a linear model) with discovered/transformed features. I see the work as much closer to efforts in the field of automated feature discovery/engineering than to symbolic regression, which aims at discovering general nonlinear equation structures.
- Furthermore, given that the problems studied in this work (e.g., biological processes) seemingly have unknown, complex (and most certainly nonlinear) behaviors, one could question the restriction to linear models. A well-studied alternative could be tree-based methods, which also provide some level of explainability (similar to linear models, since full explainability is ultimately sacrificed by selecting linear models).
- Apart from the choice of model (linear vs. tree-based), the main focus of this paper is automated feature discovery/engineering using an LLM-based framework. However, the paper omits evaluation, comparison, and discussion with the large body of research in automated feature engineering, from statistical and non-LLM-based approaches (e.g., AutoFeat [1], OpenFE [2]) to recent LLM-based approaches (e.g., CAAFE [3], FeatLLM [4], OCTree [5]).
- First, let me elaborate on my major concern for the authors: the paper positions itself as symbolic regression, but in my opinion it is a kind of automated feature discovery for linear models. One might argue that methods like SINDy also use linear models with nonlinear basis functions; however, those methods are designed for specific problems (typically ODEs) that commonly have linear forms, and admittedly have limitations for general nonlinear forms. In this work, however, the studied problems and benchmarks have very complex (and almost certainly nonlinear) behaviors, and in some cases we are dealing with partially observable systems, where we do not expect to recover or discover ground-truth functions from the data for fully explainable models.
- Given this, I think the paper should discuss, evaluate, and compare with more relevant methods in the large body of research in automated feature engineering/discovery and tree-based methods. To clarify, there are two main components here: (1) The approach for feature discovery/engineering, and (2) the choice of model that uses these features for prediction.
- In terms of (1), there are various methods that approach this including statistical and non-LLM-based approaches (like AutoFeat [1], OpenFE [2], with linear models) and more recent LLM-based approaches that are very similar to this work (e.g., CAAFE [3], FeatLLM [4], OCTree [5], ...) that use LLM domain knowledge for feature discovery similar to this work, while mainly using tree-based models due to their capabilities.
- In terms of (2), this work could provide tree-based methods like XGBoost, LightGBM, etc. as baselines, which are currently missing. (A simple baseline could be on raw features, and a better baseline would be on top of discovered features from methods in the previous part.) It would be also interesting to see how the feature importance and explainability provided by those models (e.g., using SHAP) compare to current linear models.
- In Figure 2, why is there already a substantial MSE gap at iteration 0 before feedback mechanisms should differentiate the methods?
- For MCTS, where children are generated through stochastic sampling, what temperature or sampling parameters were used? How diverse are the candidates?
- Have you tested simple threshold-based or top-K pruning directly on the influence scores as a baseline, compared to using an LLM instructed to perform the pruning? This might remove the need for the second agent call and reduce computation; see the sketch below for what such a baseline could look like.
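A minimal sketch of such a rule-based pruning baseline, with made-up term names, scores, and thresholds:

```python
import numpy as np

def prune_by_influence(term_names, scores, keep_top_k=None, min_score=0.0):
    """Keep terms whose influence score exceeds a threshold, optionally capped
    at the top-k most influential; no LLM call is needed for this step."""
    order = np.argsort(scores)[::-1]                      # most influential first
    kept = [i for i in order if scores[i] > min_score]
    if keep_top_k is not None:
        kept = kept[:keep_top_k]
    return [term_names[i] for i in kept]

# Hypothetical terms and influence scores:
terms = ["x1", "x1*x2", "sin(x3)", "x2**2"]
scores = np.array([0.31, 0.002, 0.12, -0.01])
print(prune_by_influence(terms, scores, keep_top_k=3))    # ['x1', 'sin(x3)', 'x1*x2']
```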
References:
[1] AutoFeat: The autofeat python library for automated feature engineering and selection, Horn et al. (2020)
[2] OpenFE: Automated Feature Generation with Expert-level Performance, Zhang et al. (2023)
[3] CAAFE: Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering, Hollmann et al. (2023)
[4] FeatLLM: Large language models can automatically engineer features for few-shot tabular learning, Han et al. (2024)
[5] OCTree: Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning, Nam et al. (2024) |
Fully human-written |
|
Structure Guided Equation Discovery with Influence-Based Feedback for Large Language Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes a two-agent LLM framework for symbolic equation discovery. The “Propose” agent generates candidate basis terms, a linear model fits coefficients, and per-term influence scores measure the contribution of each term by zeroing its weight and computing the change in validation error. Experiments cover pharmacological, biological, and synthetic datasets.
+ The per-term influence mechanism provides a clear and interpretable feedback signal for guiding LLM pruning.
+ The two-agent architecture (Propose / Prune) is modular and amenable to ablations.
+ The biological case study shows potential for domain-specific insight generation.
+ The method demonstrates stable performance across several datasets and LLM backbones.
- The biological case study is mostly qualitative; there are no out-of-batch or cross-condition validation results to substantiate mechanistic claims.
- It is unclear which LLM produced the main “SGED (Ours)” results, and the number of random seeds differs across methods (25 vs 10); this invalidates the confidence intervals.
- PySR and AI-Feynman appear only in the appendix despite being the most relevant SR competitors.
- Ablations show minimal gains from MCTS relative to iterative pruning, suggesting the default configuration is suboptimal.
- Evaluation relies solely on MSE, ignoring structure correctness, sparsity, or parsimony.
- The method’s complexity still scales with the number of proposed terms and LLM context size, yet no empirical analysis is provided.
- The same validation split is used for both pruning and reward evaluation, creating a high risk of overfitting.
- Datasets are largely older and omit standard SR benchmarks (SRBench, Nguyen, LLMSRBench), limiting comparability.
- There are no ablations on noise robustness, out-of-distribution generalization, or prompt and LLM sensitivity.
- Critical hyperparameters, token budgets, and search details are scattered across appendices, hindering reproducibility.
- Why were standard SR benchmarks excluded?
- Why was the “no-refit” influence variant chosen as default when “refit” or deeper MCTS rollouts yield better MSE in the appendix?
- What are the compute and wall-clock costs for MCTS vs the simple iterative loop?
- Which specific LLM(s) generated the “SGED (Ours)” results in Table 3? Are all baselines evaluated under the same model and seed count?
- Can the authors add structure-recovery metrics (exact match rate, symbolic distance) on synthetic ground truths? |
Fully AI-generated |
|
Structure Guided Equation Discovery with Influence-Based Feedback for Large Language Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes a symbolic regression framework called Structure-Guided Equation Discovery (SGED). In this approach, an LLM is first used to generate candidate basis functions, whose coefficients are then calculated using a linear model. The influence score (contribution) of each candidate term is evaluated by removing it and measuring the resulting change of MSE. Based on these influence scores, LLM are used to decide which terms to retain or remove. The entire process is iterative and can be integrated with a Monte Carlo Tree Search (MCTS) strategy. The proposed method achieves state-of-the-art performance on several benchmark datasets.
1. The motivation of the paper is commendable: previous methods that relied solely on mean squared error (MSE) for evaluation were overly coarse and lacked specific guidance mechanisms.
2. The inclusion of evaluation results on real-world case studies enhances the persuasiveness and practical relevance of the work.
1. The influence score (contribution) of each candidate basis function is computed by removing one term while keeping the others fixed. However, equation terms are often highly coupled, and this assumption fails to account for their interdependencies. The so-called “fine-grained” evaluation may therefore be misleading.
2. The pruning decision (which candidate terms to remove) is based on an explicit influence score ($\Delta_j$). Using an LLM to make this decision, prompted only by $\Delta_j$, seems unnecessary and even risky given the possibility of hallucinations; the same step could be accomplished simply by setting an appropriate threshold on $\Delta_j$.
3. Since LLMs are pre-trained on massive corpora that may include common equations or datasets, data leakage is a serious concern. Experiments on datasets unseen by the LLM would improve credibility, but the paper does not provide them. (Please refer to “LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models”.)
4. The paper reports only predictive accuracy. In symbolic regression, equation complexity is equally important. Without it, it is difficult to judge the trade-off between interpretability and performance.
5. Lack of fine-grained ablation analysis, such as robustness to the prompting strategy or to noisy data.
6. The paper mentions better efficiency under fixed token budgets but omits quantitative results on the number of LLM calls, latency, or computational overhead, which are critical for assessing real-world feasibility.
Please refer to the weaknesses above. |
Lightly AI-edited |
|
Structure Guided Equation Discovery with Influence-Based Feedback for Large Language Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Structure Guided Equation Discovery (SGED), a framework for LLM-driven symbolic equation discovery that provides granular influence scores as feedback to guide iterative refinement. Unlike prior LLM-based systems (e.g., D3, LLM-SR, etc.), SGED quantifies each basis function’s contribution to predictive performance and integrates this information into a dual-agent “propose-and-prune” process, enhanced by Monte Carlo Tree Search (MCTS). The approach is evaluated on biological and pharmacological datasets, and ablation experiments suggest that the influence feedback and MCTS both positively impact model convergence and accuracy.
- The proposal of influence-based, structure-guided feedback as a more granular feedback signal is well-motivated
- Applying LLM-driven equation discovery to real-world biological and pharmacokinetic data is also interesting and highlights practical benefits beyond benchmarks.
- The paper is generally well written and easy to follow.
- How does the proposed method perform on LLM-SRBench [1], the recent benchmark designed for LLM-based equation discovery beyond memorization? Evaluating on only six datasets from one domain is not sufficient to assess the generality of the proposed method (as claimed in the title). I believe a more comprehensive comparison on a larger, standardized benchmark like LLM-SRBench (which spans 100+ tasks across multiple scientific domains) is needed to provide a fair assessment of the framework’s contribution.
- In Figure 2, why do the two models start from different initial points at iteration 0? The influence-based feedback should affect the efficiency and convergence of discovery, not the initial performance, which usually depends on the initial candidate pool, before any feedback is applied.
- I would suggest the authors include some qualitative examples of this structure-guided feedback and how it is incorporated into the iterative refinement. A simplified example, perhaps integrated into Figure 1, may also help readers understand this better.
- It would also be helpful to provide some qualitative examples showing how the influence feedback symbolically alters the model’s reasoning or the discovered equations.
[1] LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models, ICML 2025
included in the weaknesses section |
Fully human-written |
|
Liars' Bench: Evaluating Deception Detectors for AI Assistants |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces a benchmark for evaluating deception detection methods. The benchmark consists of deceptive and honest outputs from LMs. The paper evaluates a number of different detection methods on their benchmark.
Overall I think the paper contributes a solid benchmark and methodology with somewhat limited evaluation of different detection methods.
The paper is well written and presented. Figure 1 provides clear examples from the benchmark.
Overall good awareness of related literature.
A solid dataset of curated LM responses.
Principled and informative metrics are used. The use of the Alpaca control dataset to set the FPR is very intuitive and sound. It would be helpful to add a random classifier as a baseline to the plots.
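For reference, the control-set calibration presumably amounts to something like the sketch below: the detection threshold is set at the appropriate quantile of scores on the honest control set (e.g., Alpaca) so that at most 1% of control responses are flagged, and balanced accuracy is then computed on the benchmark. The scores here are synthetic; only the calibration logic matters, and a random classifier sits at 0.5 balanced accuracy by construction.

```python
import numpy as np

def calibrate_threshold(control_scores, max_fpr=0.01):
    """Threshold such that at most `max_fpr` of honest control responses are flagged."""
    return np.quantile(control_scores, 1.0 - max_fpr)

def balanced_accuracy(scores, labels, threshold):
    preds = scores >= threshold
    tpr = np.mean(preds[labels == 1])        # recall on deceptive examples
    tnr = np.mean(~preds[labels == 0])       # specificity on honest examples
    return 0.5 * (tpr + tnr)

rng = np.random.default_rng(0)
control = rng.normal(0.0, 1.0, 5000)                          # honest control scores
bench_scores = np.concatenate([rng.normal(0.0, 1.0, 1000),    # honest
                               rng.normal(2.0, 1.0, 1000)])   # deceptive
bench_labels = np.concatenate([np.zeros(1000), np.ones(1000)]).astype(int)

thr = calibrate_threshold(control, max_fpr=0.01)
print("balanced accuracy:", balanced_accuracy(bench_scores, bench_labels, thr))
```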
I think the main weakness is that there are not that many detection methods tested (especially as the two BB methods are essentially the same with different models).
Could you try some other methods, like the black-box lie detector from https://arxiv.org/abs/2309.15840?
Minor
Especially as you consider the reasons for deception, you should relate this to the fact that deception is intentional; cf. the refs below.
https://arxiv.org/abs/2312.01350
https://arxiv.org/abs/2402.07221
I guess it's unsurprising that the Claude model outperforms self-evaluation, because the open-weight models are very small in comparison. For models of the same capability, it would be interesting to evaluate whether they are better at self-evaluation.
Are there other detection methods you can try? |
Fully human-written |
|
Liars' Bench: Evaluating Deception Detectors for AI Assistants |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper presents Liars' Bench, a benchmark of deceptive and honest responses from 6 distinct categories of deception, generated by three open-weights frontier models. Datasets are categorised based on two axes: 'object of belief' and 'reason for deception'. Various standard deception-evaluation metrics, including LLM-as-a-Judge and linear probes, are used to validate the dataset, and find differences in performance across datasets and models.
This is a needed, broad addition to the field of LLM deception benchmarks.
Typology of the datasets is cool & seems meaningful and reasonable.
Great to identify the limitations of existing datasets e.g. true/false factual claims, deceptive cover-ups.
Great to have fine-tuned some generative models, so as to not always depend on instruction via prompt (whether implicit or explicit).
58k samples is substantial
Language is exceptionally clear; terms are precise ("functional definition of deception"); citations are clean, abundant and relevant; flow is easy to follow.
It's good to mention that existing datasets are off-policy.
The detailed comparison to MASK (and TQA) is welcome. The principled exclusion of MASK from Liars' Bench is appreciated.
The identification of two axes, 'object of belief' and 'reason for deception', is novel and interesting.
For knowledge-report section: great to take WMDP results and filter for knowledge.
The identification of Knowledge-Report as a form of deception with particular challenges is a valid insight.
The benchmark is helpfully compiled in a standard format (ChatML) with relevant info (boolean flag, model used to generate the transcript).
It shows thorough thinking and forthrightness to recognise that prompt information may still be indirectly present in the assistant's response, even after the system prompt is masked.
The metrics are carefully chosen to account for potential balance issues, 1% FPR on control set is a reasonable rate, and care is taken to mitigate methods which trigger on presence of mere mention of deception.
Recap of linear probes is succinct.
The caution against over-reliance on LaaJ is appreciated.
The variety of deception-detection methods is sufficiently broad, and the methods sufficiently standard, to validate the dataset.
The Future Work section benefits from discussion about how response behaviour will likely change as situational awareness and model introspection improves at larger model size and capability.
Existing datasets are off-policy, true: but doesn't your dataset become off-policy as soon as it's tested on a new model? (FYI, I've seen a pre-print of a paper showing that probes trained off-policy do well on-policy, though performance is slightly harmed (~90% efficacy))
The chats are all model-generated, based on fictitious scenarios, rather than instances of deception in the wild.
The Sonnet-3-as-a-Judge was the best-performing method on most datasets, yet still falls foul of judges interpreting harm-preventing deception as responsible rather than deceptive.
typos: L52, L73 (should "compiles" be "comprises"?)
L122: "evaluate" is justified but it seems cheeky to "propose" LLM-as-a-Judge as a black-box approach - implies (at least to me) some novelty - there's got to be public prior art here. Same with self-evaluation.
What's the inspiration for the typology? Is it principled, or intuitive?
You say (L79) that these settings are potentially more realistic - how do you define, measure or quantify this? Same for "challenging".
I'm a little confused by the Harm-Pressure work: you filter based on models answering WMDP correctly, but I'm surprised that models wouldn't be HHH anyway, to avoid answering correctly. So could these selected questions be the 25% which the model randomly succeeds on? Do you determine model knowledge some other way, e.g. checking resistance to prompt perturbation?
1% FPR on control is reasonable when doing experimentation, but is surely too high for deployment. How would tighter tolerances affect the experiments? |
Fully human-written |
|
Liars' Bench: Evaluating Deception Detectors for AI Assistants |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces Liar's Bench, a benchmark for deception detection methods. The benchmark consists of 6 datasets and contains responses from 3 open-weights LLMs. An evaluation of several black-box and white-box detection methods is performed on the benchmark.
Overall, the paper tackles the very timely problem of evaluating methods for detecting deceptive behaviour in LLMs. I believe a benchmark for deception detection is of interest for the community. I found the paper to be mostly well written and easy to follow. In particular:
* The presentation is clear, the taxonomy of deception as well as the individual datasets are explained well.
* Both white- and black-box methods are evaluated, which covers a good amount of methods.
* The evaluation metrics are rigorous with the balanced accuracy being calibrated for at most 1% false positive rate on the control dataset Alpaca.
* The prompts are clearly outlined in the Appendices and datasets are also described in detail.
While the topic is highly relevant I found the evaluation and overall contribution of the paper limited. Specifically, I think it would be very meaningful to identify some broader trends rather than benchmark and model specific observations. Below are my main concerns.
**Marginal novelty of the overall dataset:**
The main contribution of the paper seems to be the dataset as a benchmark. However most datasets are slight adaptations and combinations of existing ones, with newly generated responses. In other words, only limited new data or methodology is introduced. Overall, I believe a benchmark for deception detection is needed, so it would be great if Liar's bench could be extended to include more novel and realistic scenarios (see also below re artificial deception setups).
**Model scope is limited:**
The scope of models is limited for a large-scale deception benchmark. The paper only uses 3 open-weights models that are smaller and less capable than current frontier models. This may not capture more complex deceptive behaviour by larger and more recent models such as GPT5, Claude Sonnet 4.5,... While I appreciate the constraints regarding cost and also the requirement of activations for the white-box methods, I do think the paper would be significantly stronger if it included a larger range of models (the white-box methods would not need to be evaluated on all models). This would also allow for some interesting results / trends that are generalisable beyond the specific benchmark and models: for example, how does the performance of deception detectors scale with the deceptive model's size and performance, and similarly when scaling the strength of the detector?
**Artificial deception setups:**
While the datasets are mostly adapted from prior work, I find the setups rather artificial, which makes me doubt the real-world applicability of the benchmark. For example, several datasets explicitly instruct the model to lie.
**Lack of human/external evaluation:**
A more in-depth analysis of policy compliance, i.e., whether the models actually respond deceptively when instructed to do so, would be beneficial. As far as I understood, the labels of deceptive and truthful are purely based on the prompts / intent? I checked Appendix E, which was quite useful for understanding why Llama and Qwen were removed from HP-C; do you have a hypothesis or justification for why these differences happened?
**Overall analysis and discussion of results could be improved:**
It would be great to see a more in-depth discussion of why the results are as they are, going beyond the main metrics shown in Figure 2. In particular, the discussion of the LLM-as-a-Judge method is quite speculative. I believe a more detailed analysis with quantitative results regarding the hypotheses could significantly strengthen the discussion, e.g., can the fact-checking of relevant claims be analysed (is there a correlation between the judge being able to answer the questions correctly / knowing the facts and being able to detect deception), and how often did the judge explicitly state that lying to prevent harm is responsible?
**Additional points:**
* The standard errors seem to be missing on HP-C and CG in Figure 2?
* What are the "given models" in step 2 of the probe training pipeline (line 345)? This seems like an important detail.
* While conceptually straight forward, I think the description of the follow-up probe could be improved for clarity.
* I think discussing MASK in 2.2 is distracting; removing it would improve the flow, in my opinion.
* line 421 "this method performs depends with how..." -> this method depends on how...
Please see my main concerns above. Here are some additional questions that would be great if the authors could clarify:
* How is the pooled subset for the training of the upper-bound probe selected?
* Does the strong performance of the upper-bound probe (in comparison to the weaker performance of the mean probe) suggest that features of deception are very dataset dependent? I.e. are the differences between deceptive and truthful behaviour not generalisable?
* The reason for excluding ST is unclear to me based on lines 357-359. What do you mean by "each variation"? |
Fully human-written |
|
Liars' Bench: Evaluating Deception Detectors for AI Assistants |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces LIARS’ BENCH, a benchmark (~58k examples) for evaluating deception detection in AI assistants across six settings that vary by (i) object of belief (world vs. self-knowledge) and (ii) reason for deception (pressure from context vs. inherent). Data are generated with three open-weight models (and fine-tuned variants). The authors evaluate three detector families: LLM-as-a-judge, self-evaluation (ask the model to reflect on its prior answer), and white-box linear probes. They report that LLM-as-a-judge performs best on average, while all methods struggle when deception hinges on the model’s latent knowledge (HP-KR).
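For context, a minimal version of the white-box linear-probe family evaluated here might look like the sketch below: mean-pool one layer's residual-stream activations over the response tokens and fit a logistic regression to separate honest from deceptive transcripts. Layer choice, pooling, and regularization are my assumptions, not necessarily the paper's recipe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_probe(acts, labels, C=1.0):
    """acts: list of (n_tokens, d_model) activation arrays, one per transcript;
    labels: 1 = deceptive, 0 = honest."""
    X = np.stack([a.mean(axis=0) for a in acts])   # mean-pool over response tokens
    probe = LogisticRegression(C=C, max_iter=2000)
    probe.fit(X, labels)
    return probe

def probe_scores(probe, acts):
    X = np.stack([a.mean(axis=0) for a in acts])
    return probe.predict_proba(X)[:, 1]            # per-transcript deception score
```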
* Addresses a timely and important problem with a broad, multi-slice benchmark that goes beyond simple factual lies.
* Clear articulation of detector design axes (model access, scope, interactivity, semantic verification).
* Inclusion of challenging scenarios (e.g., insider-trading concealment; soft-trigger deception) likely to matter in practice.
* Open-weight focus enables reproducibility and white-box baselines.
### Overall Weaknesses
* Conceptual ambiguities and inconsistencies. Figure 1’s ID example/caption conflict (the caption does not reflect the figure); definitions blur deception with persuasion. The text claims alignment with MASK, but MASK targets belief consistency rather than intent to change another’s belief [2]. However, the authors write on line 161
> The assistant behaves deceptively if it shows apparent intention to change a belief of their conversational partner, in a way that is unfaithful to the evidence held by the assistant regarding
* On-policy vs. pre-filled motivation is undercut by setups where deception is induced by hidden system prompts that detectors cannot see; realism claims should be tempered.
* Baselines are incomplete. The widely used unrelated-question detector [1] is a natural, strong baseline and should be included. White-box probe results are weaker than recent reports (e.g., [3]) without a clear replication/diagnostic.
* HP-KR slice is effectively unobservable for black-box methods under the paper’s own definition (depends on what the model knows). Provide belief evidence/oracles or mark it as white-box-only; otherwise it serves more as a stressor than a fair benchmark.
### Clarity
Writing and figures need tightening. Resolve Figure 1, clearly distinguish deception vs. persuasion, and state how your definition relates (or not) to MASK’s honesty definition. Reduce “see appendix” dependencies by surfacing key prompt templates and sampling decisions in the main text.
### Relation to Prior Work
* Positioning is incomplete. Please compare against Pacchiardi et al. (unrelated questions) [1] and discuss differences with MASK [2] (honesty vs. accuracy disentanglement). For white-box, reconcile your probe performance with Goldowsky-Dill et al. [3].
* Heavy reliance on appendix for core details; multiple cross-references and naming are hard to follow.
[1] Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner. How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions. ICLR 2024.
[2] Richard Ren, Arunim Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, Brad Kenstler, Mick Yang, Isabelle Barrass, Alice Gatti, Xuwang Yin, Eduardo Treviño, Matias Geralnik, Adam Khoja, Dean Lee, Summer Yue, Dan Hendrycks. The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems. arXiv:2503.03750, 2025.
[3] Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, Marius Hobbhahn. Detecting Strategic Deception with Linear Probes. ICML 2025 (PMLR v267), 2025.
1. Fix Figure 1: which ID configuration is correct?
2. Clarify definition: is deception labeled by contradiction to the model’s own beliefs (MASK-style) or by intent to change the user’s belief? If different from MASK, please say so plainly.
3. For HP-KR, what information can a black-box detector access to infer the model’s belief? Can you release per-item belief evidence (e.g., neutral elicitation answers/consistency stats)?
4. Include the unrelated-question detector [1] or justify its exclusion.
5. White-box probes: Can you replicate the strongest probe configurations reported in [3]? |
Fully AI-generated |
|
Leveraging Generative Trajectory Mismatch for Cross-Domain Policy Adaptation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces DADiff, a diffusion-based framework that addresses the challenge of transferring reinforcement learning policies across domains with different dynamics. By leveraging the generative trajectory discrepancy between source and target domains, DADiff estimates dynamics mismatch and adapts policies through either reward modification or data selection strategies. Supported by theoretical analysis showing the performance difference is bounded by generative deviation, the method demonstrates superior effectiveness in experiments with kinematic and morphology shifts compared to existing approaches.
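For concreteness, once a per-transition deviation estimate is available from the diffusion model, the two adaptation strategies described above reduce to something like the sketch below; the penalty weight and the selection fraction are placeholders, not the paper's settings.

```python
import numpy as np

def modify_rewards(rewards, deviations, lam=1.0):
    """DADiff-modify-style adaptation (sketch): penalize source-domain rewards
    in proportion to the estimated generative trajectory deviation."""
    return rewards - lam * deviations

def select_transitions(batch, deviations, keep_frac=0.25):
    """DADiff-select-style adaptation (sketch): keep only the source transitions
    whose deviation from the target dynamics is smallest."""
    k = max(1, int(keep_frac * len(deviations)))
    idx = np.argsort(deviations)[:k]
    return {key: value[idx] for key, value in batch.items()}
```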
# Strengths
- This paper is well-motivated and mostly well-written
- This paper is easy to follow, and the studied topic is of importance in the context of the reinforcement learning community. It is always important to develop more general and stronger transfer algorithms in RL, especially considering the fact that online off-dynamics RL papers have rarely appeared in recent years
- The authors include theoretical analysis to provide better guarantees for the proposed method (despite the fact that some of the theoretical results resemble those in prior works, they are still interesting and bring some insights into the cross-domain reinforcement learning). I appreciate that the authors include a detailed discussion about the connections between the theoretical bounds of their method and those of PAR
- The presentation is good, and I like the way the authors tell the whole story
- The parameter study is extensive, covering numerous tasks in the main text and the appendix.
# Weaknesses
- The authors propose to address the online policy adaptation problem from the perspective of generative modeling; however, the downstream methods still rely on reward modification or data filtering, which resembles DARC, PAR, and VGDF
- The evaluations are limited to kinematic shift and morphology shift. As far as the reviewer can tell, ODRL provides other dynamics shifts like gravity shift, friction shift, etc. This paper can benefit greatly from extending its experimental scope
- The authors mention flow matching in the main text. This raises the question of how diffusion compares with the numerous other generative modeling methods; the paper lacks such a comparison.
Overall, I would recommend a "weak accept" of this paper.
# Questions
1. As a generative modeling method, diffusion can also be used for data augmentation, e.g., generating samples that lie within the scope of the target domain. What is the insight behind using the diffusion model for *Generative Trajectory Mismatch* rather than for target-domain data augmentation?
2. How do diffusion models compare against other generative modeling methods such as flow matching or VAEs?
3. The diffusion steps seem to have a significant impact on DADiff. Can the authors provide more insight on how to select this parameter and why different diffusion steps have such significant impacts on DADiff? |
Fully human-written |
|
Leveraging Generative Trajectory Mismatch for Cross-Domain Policy Adaptation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a diffusion-based method to estimate the domain gap and provides reward modification and data filtering strategies for policy learning. The authors give a theoretical guarantee on the policy $\pi$'s performance across the two domains. Theoretical and empirical results show performance improvements of the method.
The paper is well written and easy to follow. The authors propose a new diffusion-based domain-gap measure in the spirit of DARC and PAR: similar to those methods, they characterize the performance gap of policy $\pi$ between the source and target domains in terms of the KL divergence of the latent representations.
1. The ODRL benchmark has both gravity shift and friction shift, which are not included in the experiments. Also, what is the shift level of the experiments? Is it easy, medium, or hard?
2. The novelty of the paper does not seem significant to me. The high-level idea is to obtain a more fine-grained shift measurement compared to DARC and PAR, and the theoretical analysis is actually similar to those papers, except with slightly different notation for the shift measurement. Also, the performance does not show a significant improvement over them.
3. DARC and PAR have been shown to be ineffective when the shift is large. What is your performance in a large-shift case?
4. The performance of your method seems to rely on the assumption that the domain shift is not that large. If the shift is too large, the KL in Eq. 5 will grow extremely large, or even become infinite. The performance is bounded only when the KL between the source and target is bounded. Also, as stated in [1], the KL can be ill-defined when the shift is large because some target transitions have no support in the source domain.
In summary, I question whether the reward modification method is still a valid way to solve the off-dynamics RL problem, as much previous work [1,2] has shown its limitations: when the shift is large (the joint distribution overlap is small), the reward modification method always fails, performing well in the source domain but poorly in the target.
[1] Composite Flow Matching for Reinforcement Learning with Shifted-Dynamics Data
[2] Off-Dynamics Reinforcement Learning via Domain Adaptation and Reward Augmented Imitation
See the weaknesses above. |
Fully human-written |
|
Leveraging Generative Trajectory Mismatch for Cross-Domain Policy Adaptation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the problem of online dynamics adaptation in reinforcement learning, where a policy is pre-trained in a source domain (e.g., a simulator) and must be adapted to a target domain (e.g., the real world) with only limited interactions. The authors propose DADiff, a novel framework that leverages generative models, specifically diffusion models, to capture the dynamics mismatch between domains. The core idea is to interpret the state transition as a conditional generative process and to measure the "generative trajectory deviation"—the discrepancy between the latent state trajectories of the source and target domains during the diffusion generation process. The paper provides a theoretical performance bound linking this deviation to the policy's performance gap and proposes two practical variants: DADiff-modify (which penalizes source-domain rewards based on the deviation) and DADiff-select (which filters source-domain data). The method is also extended to the Flow Matching framework. Experiments on MuJoCo environments with kinematic and morphology shifts show that DADiff outperforms several strong baselines, including PAR.
The paper offers a fresh and principled perspective on dynamics adaptation by framing it as a problem of generative trajectory mismatch. This is a significant conceptual shift from prior work that often relies on domain classifiers or single-step representation learning.
The primary concern is the justification for the added complexity of using a diffusion model for dynamics modeling. While the results are strong, the performance gain over the strongest baseline, PAR, is sometimes marginal (e.g., in `ant(broken hips)` or `walker(broken right foot)`). The paper acknowledges that VGDF, a model-based method, is significantly slower, but it does not provide a detailed analysis of DADiff's own computational cost (e.g., training/inference time, memory footprint) compared to PAR, which is a crucial factor for real-world applicability. The increased complexity needs a more compelling justification in terms of capability.
The experiments are conducted on standard MuJoCo locomotion tasks, which, while common, have relatively simple and deterministic dynamics. The paper’s core claim is about capturing complex dynamics mismatches via diffusion models. To truly validate the advantage of modeling the full generative trajectory, experiments on tasks with more complex, high-dimensional, or highly stochastic dynamics would be far more convincing. The current experiments, which largely follow the setup of PAR, do not fully showcase the potential benefits of the proposed method in more challenging scenarios.
The paper mentions an extension to Flow Matching (Appendix C). Could the authors elaborate on the specific modifications required in the algorithm? In the diffusion framework, the deviation is calculated using the noise prediction model `ϵ_θ`. What is the direct analogue in the Flow Matching framework? Is it solely based on the vector field prediction `v_θ`, and if so, how does the continuous-time nature of the trajectory in Flow Matching affect the calculation and interpretation of the "generative trajectory deviation" compared to the discrete steps in diffusion? |
Fully AI-generated |
|
Leveraging Generative Trajectory Mismatch for Cross-Domain Policy Adaptation |
Soundness: 3: good
Presentation: 1: poor
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes DADiff, an online dynamics adaptation method for RL that measures source–target dynamics shift via the generative trajectory deviation of diffusion models. It develops two variants: reward modification and data selection. A performance bound links the return gap to KL terms along a shared latent trajectory. Experiments on 4 MuJoCo environments report competitive or superior performance relative to DARC, VGDF, PAR, SAC‑tune, and SAC‑tar.
1. Clear theoretical link from generative trajectory discrepancy to performance, with clean proof and recovery of PAR as a special case.
2. Consistent improved empirical performance on many tasks. DADiff‑modify often leads; DADiff‑select is strong when penalties mis‑shape rewards.
3. Parallel latent sampling avoids reverse‑chain cost; runtime comparable to model‑free baselines and far below VGDF.
1. Baseline fairness. SAC‑tar is trained for 10^5 target steps, while DADiff and SAC‑tune use 1M source steps + 10^5 target steps. This probes a target‑only‑from‑scratch regime but does not compute‑match total experience. Please add a compute‑matched target‑only control with comparable total environment interactions and gradient updates
2. Insufficient analysis: The text narrates Fig. 2 but provides little analysis in Sec. 5.2. Please also quantify the deviation difference between the two generative trajectories, since the paper only covers computational efficiency; there should be a measurable deviation difference between these two trajectories.
3. Writing quality: Multiple typos, symbol switches, and undefined or late‑defined notation reduce clarity. For example: Fig. 4(a) uses $\gamma$ while Eqn. 11 uses $\lambda$; $\phi_i$ in Eqn. 14 is undefined until one works out that the algorithm is based on SAC; Sec. 5.3 states that the optimal $\lambda$ is task-dependent while Sec. E.2 (line 1019) says $\lambda$ is task-independent.
Same as weakness
Additional questions:
1. Is there a missing square in Eqns. 12 and 13? If not, justify using $E[Q−TQ]$ rather than an MSE loss. If yes, re‑run the results with the corrected losses and report any deltas.
2. Can you extend the analysis of why "directly filtering for transitions with low dynamics mismatch is a more effective strategy than modifying rewards." in Sec. 5.2 (line 352)? Provide mechanism‑level reasoning and ablations that compare filtering only vs. reward‑shaping only vs. both. An analysis from the perspective of the probabilistic trajectory in the diffusion model might explain why filtering is better. |
Lightly AI-edited |
|
How to Spin an Object: First, Get the Shape Right |
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes a new method to generate novel-view point maps and RGB images from single-view inputs. The key idea is to adopt multiview diffusion to generate both; the point cloud can then be directly extracted from the generated point maps and RGB images. Finally, the performance is evaluated on the novel-view synthesis and 3D reconstruction tasks, outperforming baselines like Free3D, One-2-3-45, and OpenLRM.
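For concreteness, the extraction step described here can be illustrated with a minimal sketch (my own illustration, not the authors' code; array shapes and names are assumptions): once per-view point maps and RGB images are generated, the colored point cloud is essentially a masked concatenation.

```python
import numpy as np

def pointmaps_to_cloud(point_maps, rgb_images, fg_masks):
    """Assemble a colored point cloud from multiview generations.

    point_maps: (V, H, W, 3) per-pixel 3D coordinates, one map per generated view
    rgb_images: (V, H, W, 3) generated RGB images, values in [0, 1]
    fg_masks:   (V, H, W)    boolean foreground masks
    """
    points, colors = [], []
    for xyz, rgb, mask in zip(point_maps, rgb_images, fg_masks):
        points.append(xyz[mask])   # keep foreground pixels as 3D vertices
        colors.append(rgb[mask])   # reuse the pixel color as the vertex color
    return np.concatenate(points, axis=0), np.concatenate(colors, axis=0)
```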
The whole pipeline is reasonable, and the authors successfully train the model and demonstrate its performance.
1. The idea of generating point maps has already been explored by SweetDreamer (ICLR'23) two years ago. Some recent works, like World-consistent Video Diffusion with Explicit 3D Modeling (CVPR'25), also use this idea. The paper does not discuss the difference. Some very similar papers about multiview diffusion papers, like MVDream, SyncDreamer, and Wonder3D, are not included in the discussion either. The method proposed by the paper is already well-studied in these existing works.
2. Another main problem is that the paper seems to miss a whole set of papers about latent vecset diffusion, like CLAY, Hunyuan, TripoSG, and so on, which could produce much better results than the proposed method.
In summary, the idea is already well-explored by existing works, and the authors are encouraged to read these papers and include more discussion on the differences from the existing works.
N/A |
Fully human-written |
|
How to Spin an Object: First, Get the Shape Right |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
- The paper is clearly written, with many details deferred to a well-organized appendix.
- It fine-tunes a dedicated VAE for CROCS and reports a substantial VAE score improvement after fine-tuning.
- It presents extensive quantitative experiments and ablations, with results consistently favoring CROCS on the novel-view synthesis tasks.
- Code and checkpoints are open-sourced.
- Figure 5 is the only qualitative 3D reconstruction example; stronger qualitative evidence is needed. CROCS point maps may appear noisy on edges and thin structures, making denoising and detail preservation non-trivial; the resulting point cloud may be coarse, and vertex color aggregation across predicted views can be inconsistent, requiring non-trivial texture post-processing.
- Novel view synthesis is restricted to eight canonical views and cannot sample arbitrary viewpoints.
- Baselines are a little outdated; recent open-source SOTA (e.g., TRELLIS) is missing.
- The novelty of CROCS is limited: CROCS is adapted based on SpaRP’s NOCS variant (Xu et al., 2024a). Sec. 3.3 and Figure 3 explain the differences between the original NOCS and CROCS. However, both SpaRP’s NOCS variant and CROCS are oriented by the source camera’s azimuth and are axis-aligned; at the source view (Figure 3), CROCS is the same as SpaRP’s NOCS.
- Please provide more qualitative examples and comparisons for 3D reconstruction results (point clouds) to demonstrate usefulness beyond canonical novel views.
- In the novel view synthesis experiments (Table 3), how are target views selected for each method? These baselines define different canonical target views than unPIC. Please clarify to ensure a fair comparison. |
Lightly AI-edited |
|
How to Spin an Object: First, Get the Shape Right |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper presents a framework for generating 3D-consistent novel views of an object from a single image. Unlike previous methods that jointly infer geometry and appearance, unPIC disentangles them in two stages: geometry prior and appearance decoding. In the first stage, the model predicts the object geometry using CROCS (Camera-Relative Object Coordinates), a dense, camera-aligned geometric representation. In the second stage, it decodes this geometry into multiview textured images, ensuring geometric consistency across different views. Moreover, the use of CROCS also enables direct reconstruction of 3D point clouds from the generated views.
- The paper introduces a simple yet sound idea that leads to clear performance improvements.
- The use of CROCS allows consistent 3D encoding without explicit segmentation or class priors, outperforming existing alternatives like NOCS.
- CROCS also allows direct extraction of 3D point clouds from generated views, simplifying 3D reconstruction pipelines by removing postprocessing.
- The experiments are extensive
- Tab3 shows NVS results on 4 different datasets (Objaverse-XL, GSO, ABO, and DTC), comparing against 6 baselines.
- Tab5 and Fig5 demonstrate superior performance on 3D reconstruction.
- The authors also ablate the importance of the geometry prior in Tab4.
- There are insufficient qualitative results showing the reconstructed 3D objects – either as point clouds or meshes. Only a single example is provided in Fig5.
- The quantitative results in Tab3, particularly the PSNR values, appear unusually high compared to the baselines. However, the qualitative examples (Fig4 and Supp.) do not seem to reflect such a large margin of improvement. This discrepancy raises concerns about how the metrics were computed and whether all methods were evaluated under the same viewing conditions.
- Since some test images include camera elevation, is elevation provided or conditioned for the baseline methods?
- In Fig1, the authors refer to “input image(s),” suggesting that the method may accept multiple input views. However, the paper only presents results using a single input image. It would be helpful to clarify whether the proposed framework can be extended to multi-view inputs, and if so, how the model’s performance scales with additional views. |
Lightly AI-edited |
|
How to Spin an Object: First, Get the Shape Right |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces unPIC, a method for generating a fully 3D-consistent spin of an object from a single input image by explicitly separating the prediction of underlying 3D geometry from textured appearance.
This hierarchical generation is implemented using two independently trained diffusion models: a multiview geometry prior, followed by a multiview appearance decoder. A key contribution enabling this architecture is a novel geometric representation called CROCS (Camera-Relative Object Coordinates), which provides dense pointmaps encoding per-pixel 3D coordinates anchored to the source camera.
The predicted geometry serves as a blueprint to coordinate the final views, enforcing consistency and enabling the direct generation of a 3D point cloud without a separate post-hoc reconstruction step.
This geometry-driven framework significantly outperforms leading methods on novel-view quality, geometric accuracy, and multiview consistency.
- The paper introduces CROCS (Camera-Relative Object Coordinates), a novel dense pointmap representation that is critical to the method's success. Empirical evidence strongly supports its effectiveness.
- By conditioning the appearance decoder on CROCS, the framework allows for direct 3D generation. The output multiview CROCS images provide the vertices, and the RGB images provide the vertex colors, which assemble directly into a colored point cloud, bypassing the need for a separate post-hoc reconstruction step common in other pipelines.
- unPIC demonstrates superior quality, consistently outperforming strong contemporary baselines. The authors provide extensive experiments that substantiate the method’s performance advantages, and the model exhibits robust generalization to challenging real-world captures.
- The paper is easy to follow, with clear writing.
- The authors made a deliberate design choice not to canonicalize for changes in camera elevation. While this choice aligns with typical human mental rotation habits, it forces the model to implicitly infer the camera elevation from the source image. This implicit reliance on the appearance module to deduce a crucial geometric parameter is a source of fragility, which is confirmed by the observed failure case where the model misinterprets the source image (e.g., as an overhead view) and performs incorrect planar rotation.
- Both the geometry prior and the appearance decoder are implemented as multi-view diffusion models (MVD) and trained separately. This hierarchical two-stage MVD architecture, totaling 1.1 Billion parameters, is inherently computationally expensive.
- Why did the CROCS VAE have a significantly lower KL divergence before fine-tuning compared to the RGB VAE (20823 vs 28005)? Does this suggest that the latent space of the CROCS representation is inherently smoother or closer to a standard Gaussian, contributing directly to CROCS's superior predictability?
- Could the authors provide a detailed breakdown of the total 1m13s inference time? Specifically, what proportion is spent in the Geometry Prior module versus the Appearance Decoder module? Such a decomposition would be valuable for guiding subsequent runtime optimizations.
- unPIC provides "Direct 3D" output by combining CROCS vertices and RGB colors into a colored point cloud. Unlike geometry-supervised pipelines that return explicit surfaces, the point cloud is not a ready-to-use explicit surface, potentially necessitating further reconstruction steps for applications requiring watertight meshes. Given that the output is a high-accuracy colored point cloud, have the authors explored post-processing to convert this point cloud into an explicit mesh? If so, what are the resulting mesh quality and the practical utility for downstream applications?
- A key advantage claimed for unPIC is its hierarchical approach designed to maximize diversity by sampling multiple geometries and appearances; however, the main evaluation focuses on the accuracy of a single best output. How is diversity quantified? Have the authors conducted a quantitative assessment of generative diversity—for example, for a given input image, how much variation is observed among the N geometry latents $\hat{Z}_{g}$ produced by the Prior (e.g., via Chamfer distance or LPIPS), and among different appearances $\hat{Z}_a$ generated by the Decoder conditioned on the same $\hat{Z}_g$? |
Lightly AI-edited |
|
Masked Generative Policy for Robotic Control |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces the Masked Generative Policy (MGP), a novel framework for robot imitation learning that eliminates the inference bottlenecks of diffusion models and the sequential constraints of autoregressive models.
MGP-Short is specifically designed for Markovian tasks, adapting the masked generative transformer for short-horizon sampling. It demonstrates improved success rates on standard benchmarks while significantly reducing inference time.
MGP-Long allows for globally coherent predictions over long horizons, enabling dynamic adaptation, robust execution under partial observability, and efficient, flexible execution. It achieves state-of-the-art results in dynamic, observation-missing, and non-Markovian long-duration environments.
The authors validated the effectiveness of MGP in multiple simulated environments.
The authors conducted a thorough analysis of current action generation methods and proposed MGP to address the latency issues inherent in diffusion-style or autoregressive-style action generation. The paper is clearly articulated and easy to follow. The concept of using MGP to re-predict tokens with low confidence while maintaining those with high confidence is intriguing. Theoretically, this approach could indeed reduce the time consumed in predicting actions.
1. I acknowledge that the results in the simulated environment are impressive. However, due to the sim-to-real gap, it is often necessary to demonstrate effectiveness in real-world settings within this field.
2. Regarding the confidence score. Could you analyze the situations that might lead to a lower confidence score? Additionally, how can we ensure the accuracy of the confidence score itself?
3. About the MGP-Long settings. In long sequences, certain objects may cause environmental changes due to previous actions. At this point, the predictions may no longer remain globally coherent, and we would need to generate a new action sequence based on the changed objects.
1. The results in the simulated environment are impressive. However, due to the sim-to-real gap, it is often necessary to demonstrate effectiveness in real-world settings within this field.
2. How can we ensure the accuracy of the confidence score itself?
3. Regarding the MGP-Long settings: In lengthy sequences, some objects may lead to environmental changes as a result of prior actions. When this occurs, the predictions may lose their overall coherence, necessitating the generation of a new action sequence that takes into account the modified objects. |
Lightly AI-edited |
|
Masked Generative Policy for Robotic Control |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Masked Generative Policy (MGP), a new visuomotor imitation learning framework that models robot control as a masked token-generation problem.
MGP first discretizes continuous actions with a VQ-VAE tokenizer, then trains a masked generative transformer (MGT) to reconstruct full action sequences from partially masked tokens conditioned on current observations.
Two inference paradigms are proposed:
- MGP-Short for Markovian, short-horizon tasks: parallel token generation with one or two score-based refinement steps.
- MGP-Long for non-Markovian, long-horizon tasks: predicts the entire trajectory in one pass and adaptively refines uncertain future tokens through posterior-confidence estimation (PCE) as new observations arrive.
Extensive experiments on Meta-World and LIBERO benchmarks show strong gains—up to 35× faster inference and higher success rates (+9% overall, +60% in dynamic or missing-observation settings).
Ablations (MGP-FullSeq, MGP-w/o-SM) validate that PCE-based selective refinement is critical for efficiency and global coherence.
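For readers unfamiliar with MaskGIT-style decoding, the confidence-guided refinement loop summarized above can be sketched as follows (my own schematic under an assumed model interface and a simple linear re-masking schedule; not the authors' implementation):

```python
import torch

def refine_tokens(model, obs, tokens, mask, steps=2):
    """Parallel mask-and-refine decoding over a discrete action sequence.

    tokens: (T,) current action-token ids; mask: (T,) bool, True = unknown.
    Each step predicts all masked positions in one forward pass and only
    re-masks the least confident predictions for the next step.
    """
    for step in range(steps):
        logits = model(obs, tokens, mask)             # (T, vocab) in a single pass
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)                    # per-token confidence and argmax
        tokens = torch.where(mask, pred, tokens)      # commit predictions at masked slots
        if step == steps - 1:
            break
        # keep the most confident predictions, re-mask the rest
        k = int(mask.sum().item() * (1 - (step + 1) / steps))
        conf = conf.masked_fill(~mask, float("inf"))  # never re-mask already-fixed tokens
        remask = conf.argsort()[:k]                   # lowest-confidence positions
        mask = torch.zeros_like(mask)
        mask[remask] = True
    return tokens
```

The key property is that every masked position is predicted in a single forward pass and only the low-confidence subset is revisited, which is where the claimed latency advantage over autoregressive decoding comes from.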
Original idea: creatively transfers masked-generation paradigms (MaskGIT/MUSE) to robotic action synthesis.
Technical soundness: clearly defined VQ-VAE tokenizer, transformer conditioning, and confidence-guided refinement loop.
Empirical rigor: evaluated on 150+ tasks across difficulty levels; includes robustness tests (dynamic, missing-observation, non-Markovian).
Fair comparison: benchmarks against continuous-action (diffusion/flow) and discrete-token baselines under identical encoders and demos.
Ablation insight: MGP-w/o-SM (without score-based masking) confirms that selective refinement improves both efficiency and success rate.
Relevance: unifies the advantages of diffusion (sample quality) and autoregressive (temporal coherence) methods in a parallelizable design.
Limited analysis of tokenizer sensitivity: performance may depend on the VQ-VAE codebook design, but this is not explored.
Hyperparameter transparency: the exact confidence-masking threshold and its effect on refinement stability are not analyzed.
Potential complexity: the two-stage training (tokenizer + policy) increases implementation effort; joint end-to-end training would strengthen the approach.
How is the confidence-based masking threshold determined? Fixed ratio or adaptive per step?
Does the posterior-confidence estimation ever over-mask or destabilize refinement when confidence calibration drifts?
How sensitive is performance to the tokenizer’s codebook size and discretization granularity?
Would an end-to-end jointly trained transformer + VQ-VAE outperform the current two-stage pipeline?
Discrete tokens normally introduce information loss—what do the authors believe enables MGP’s discrete representation to outperform continuous-action models like Diffusion Policy? Is it the global trajectory modeling, masked refinement dynamics, or some property of the VQ-VAE discretization? |
Fully AI-generated |
|
Masked Generative Policy for Robotic Control |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This manuscript proposes a novel imitation learning framework for learning a visuomotor policy parameterized by a masked generative transformer (MGT), which enables high inference efficiency for closed-loop control while maintaining robustness in long-horizon and non-Markovian tasks. Specifically, two sampling strategies are designed: (1) MGP-Short performs short-horizon sampling and refines action tokens within a few iterations for the best performance-efficiency trade-off in Markovian tasks; and (2) MGP-Long samples the full trajectory and adaptively refines tokens with updated observations from the environment to retain global coherence. Experiments demonstrate the strong performance of the proposed methods in Markovian and more challenging tasks.
- Unlike diffusion-based policy, which might require external distillation for fast inference speed, MGP puts less stress on iterative sampling for obtaining clean actions, and has high flexibility of test-time adjustment with proposed sampling strategies.
- MGP-Long iteratively refines the action tokens using the executed actions along with the updated observations to improve trajectory-level coherence; it achieves strong performance in non-Markovian and dynamic environments and remains robust to missing observations.
- Baselines such as diffusion-based policies and VQ-BeT stand out when learning multimodal action distributions. Since MGP is also built on vector quantization, it is not yet clear how the proposed sampling methods perform on tasks with explicit multimodality.
- Since all tokens are predicted in parallel, the refinement process can be affected if low-quality actions are initially predicted with high confidence, causing error accumulation throughout the following iterations. Furthermore, it would be helpful to extend the first ablation study to investigate how much performance gain can be obtained from additional refinement steps, especially in more challenging environments.
- Please include standard deviations in the table for thoroughness if multiple seeds are used to aggregate the result.
- Typo: “blcoks” -> “blocks” in line 191
- In Figure 3, should the unexecuted token “52” at the bottom left be “53” before Posterior-Confidence Estimation?
- In line 269, the authors mentioned four ablation studies were conducted, but in section 4.5, only three of them are elaborated.
- How many actions are encoded into one discrete token? And would that hyperparameter affect performance on different tasks? |
Fully human-written |
|
Masked Generative Policy for Robotic Control |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces Masked Generative Policy, which is a new framework for visuomotor imitation learning that models robot actions as discrete tokens and leverages masked generative transformers to efficiently generate and refine action sequences. Unlike autoregressive or diffusion generative policies, MGP tries to generate globally coherent future plans and refine them online. It combines MaskGIT-style generation with robotic action modeling. The experimental results demonstrate state-of-the-art performance on Markovian and non-Markovian control.
- It reframes the policy generation problem as masked generative modeling is new and practical, especially given the latency and horizon challenges in robotics. The tokenization of actions is smart to allow transformer modeling of full sequences.
- The global coherence maintains long-horizon consistency through token memory. The parallel sampling and selective refinement drastically cut latency, leading to high inference efficiency.
- The experimental results are comprehensive and demonstrate the effectiveness of the proposed method across simulations and tasks. While diffusion models model smooth distributions and autoregressive models enforce causality, MGP smartly bridges them using mask-and-refine semantics, achieving both speed and robustness.
- The system design and two-stage training are complex. The VQ-VAE and MGT pipeline introduces extra overhead and possible distribution shift between discrete tokens and true continuous actions.
- When predicting all tokens at once, it loses the explicit notion of conditioning the next tokens on the current action. In dynamic control, this can lead to physically inconsistent predictions.
- The model must have enough context to predict consistent future tokens without sequential conditioning. It could work in structured simulation, but may fail with partial observability or noisy real-world sensors where causality exists.
- Is the pipeline easy to smoothly transfer to real-world tasks? The robustness to sensor noise, delays, or physical contact uncertainty remains a question, especially when it requires a strong encoder and global context.
- There are few visual rollouts or per-task failure analyses; showing how token refinement behaves in specific dynamic scenes would be more illustrative.
- A discussion section on the potential domain mismatch and increased complexity of the proposed two-stage training is helpful. |
Fully AI-generated |
|
Designing Observation and Action Models for Efficient Reinforcement Learning with LLMs |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces LOAM (LLM-based design of Observation and Action Models), a novel framework that leverages Large Language Models (LLMs) to "automate the generation" of observation and action models for reinforcement learning agents, particularly in complex domains like humanoid robotics. The core idea is to use structured prompts—containing information about the agent's morphology , the "task description" , and "available state variables" —to guide an LLM to generate Python functions (compute_obs and compute_action). These functions serve as a "wrapper" around the original environment , creating a lower-dimensional and more informative state-action space for the RL agent. To address the "inherent stochasticity" of LLM outputs, the authors also propose LOAM-Race, a mechanism based on the "principle of optimism in the face of uncertainty (OFU)" that efficiently evaluates multiple generated models in parallel and allocates training resources to the most promising candidates. The method is evaluated on the HumanoidBench benchmark, where it demonstrates significant improvements in sample efficiency and final performance over strong baselines.
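To make the interface concrete, a wrapper of the kind described would look roughly as follows (a minimal sketch with hypothetical names, assuming a Gymnasium-style environment; not code from the paper):

```python
import numpy as np
import gymnasium as gym

class LOAMWrapper(gym.Wrapper):
    """Wraps an environment with LLM-generated observation/action models."""

    def __init__(self, env, compute_obs, compute_action, act_dim):
        super().__init__(env)
        self.compute_obs = compute_obs        # raw state -> compact feature vector
        self.compute_action = compute_action  # low-dim policy output -> full actuator command
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(act_dim,), dtype=np.float32)
        # (observation_space would likewise need to be redefined to match compute_obs)

    def reset(self, **kwargs):
        raw_obs, info = self.env.reset(**kwargs)
        return self.compute_obs(raw_obs), info

    def step(self, policy_action):
        raw_obs, reward, terminated, truncated, info = self.env.step(self.compute_action(policy_action))
        return self.compute_obs(raw_obs), reward, terminated, truncated, info
```

The RL algorithm then only ever sees the compact spaces, which is why the approach plugs into existing pipelines without modification.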
The paper's claimed contributions are exceptional. The proposed method demonstrates strong empirical performance on the challenging HumanoidBench benchmark. This suggests a true qualitative leap in capability, not merely a quantitative improvement. The qualitative analysis in Appendix B further reveals that the LLM (allegedly) generates sophisticated, domain-aware Python code that embeds complex physical priors, such as "heading-invariance" via quaternion rotations and "biomechanical priors" like contralateral coordination.
1. The paper tackles a well-known and significant bottleneck in applying RL to complex robotic systems: the design of observation and action spaces. The proposed approach of using LLMs to automate this process is novel, timely, and presents a compelling new direction for environment design in RL.
2. The experimental results, as presented, are impressive. Achieving "over 3x faster learning on average" on a challenging benchmark like HumanoidBench (Figure 1) is a substantial improvement. The learning curves in Figure 4 clearly demonstrate that LOAM and LOAM-Race consistently and significantly outperform strong baselines like FastTD3 and LESR across a wide variety of tasks. The qualitative win on the h1hand-reach-v0 task is particularly noteworthy, suggesting the framework can discover representations superior to human-engineered ones.
3. The LOAM-Race mechanism is an intelligent and practical solution to the inherent stochasticity of LLM code generation. Instead of viewing variability as a weakness, the authors turn it into an opportunity for robust model selection. The method, based on the principle of optimism in the face of uncertainty, is principled and shown to be effective at finding better and more stable solutions.
1. The work primarily automates the implementation of the wrapper, not the conceptual design of the task. The framework's success hinges on access to the pre-existing, human-engineered structure of the HumanoidBench environment and a "well-defined reward signal", a limitation the authors admit in the conclusion. The LLM does not operate from raw physics but from a curated set of variables and, most importantly, a human-scripted reward function for each task; being given the task description and reward logic drastically simplifies the problem of identifying relevant features. The paper frames this as "automating design," but it feels more like automating the translation of a detailed human specification into code. This raises significant questions about the method's utility in more realistic scenarios where the reward is sparse or the task goal is not so clearly defined.
2. While simulation is a necessary first step, the paper makes strong claims about solving a key bottleneck for real-world robotics without providing any experiments or even a substantive discussion on the challenges of transferring these generated models to physical hardware. LLM-generated code might create brittle policies that overfit to simulation-specific dynamics or artifacts. An analysis of the sim-to-real gap would be essential for a paper with such practical claims.
See Weakness. |
Fully AI-generated |
|
Designing Observation and Action Models for Efficient Reinforcement Learning with LLMs |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces LOAM (LLM-based design of Observation and Action Models), a framework that automates the design of observation and action representations in reinforcement learning (RL). Instead of relying on handcrafted feature and actuator mappings, LOAM uses large language models (LLMs) to generate executable Python functions that define observation and action models. It also introduces LOAM-Race a model selection mechanism that evaluates multiple LLM-generated candidates in parallel and adaptively allocates training resources to the most promising ones using upper-confidence-bound (UCB) estimates.
Applied to **HumanoidBench**, a high-dimensional humanoid control benchmark, LOAM achieves up to **3× faster learning** than expert-designed models using the same RL algorithm (**FastTD3**). LOAM-Race further improves robustness by mitigating the variability of LLM-generated designs.
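For reference, my reading of LOAM-Race is a UCB-style successive elimination over the generated candidates; a rough sketch (the exact schedule and bonus term are my assumptions, not taken from the paper) is:

```python
import numpy as np

def race(candidates, train_chunk, evaluate, rounds, c=1.0):
    """Allocate a fixed training budget across LLM-generated designs.

    candidates:  list of candidate (observation, action) designs, each with its own agent
    train_chunk: callable that trains one candidate for a fixed number of env steps
    evaluate:    callable returning the mean return of a candidate's current policy
    """
    active = list(range(len(candidates)))
    returns = [[] for _ in candidates]
    for _ in range(rounds):
        for i in active:
            train_chunk(candidates[i])
            returns[i].append(evaluate(candidates[i]))
        # optimism in the face of uncertainty: mean return plus an exploration bonus
        ucb = {i: np.mean(returns[i]) + c * np.std(returns[i]) / np.sqrt(len(returns[i]))
               for i in active}
        if len(active) > 1:
            active.remove(min(active, key=ucb.get))  # drop the currently weakest design
    return candidates[max(active, key=ucb.get)]
```

This reading is consistent with the "adaptively allocates training resources using UCB estimates" description above, but the precise elimination schedule should be checked against the paper.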
- **Novel contribution:** Automates a fundamental but underexplored RL design component, observation and action model specification using LLMs.
- **Clear methodology:** The structured prompting framework for code generation (system, observation, and action prompts) is systematic and well-motivated.
- **Strong empirical results:** Demonstrates consistent gains across locomotion and manipulation tasks on HumanoidBench, surpassing FastTD3 and LESR baselines.
- **Comprehensive experiments:** Includes detailed ablations on observation-only vs. full design, candidate count, and racing behavior, alongside qualitative code analysis.
- **Reproducibility:** Provides complete templates, prompts, and structured pipeline descriptions, making the approach easily replicable.
- **Limited theoretical grounding:** The paper is largely empirical and it lacks formal analysis of why LLM-generated designs improve representation quality or exploration.
- **Dependence on LLM reliability:** Quality and efficiency depend on the correctness of generated code while the robustness under different model types (e.g., GPT-4 vs. GPT-5) is not explored.
- **Limited scope of environments:** Focuses solely on humanoid control in simulation. Additional results on other domains (e.g., vision-based or multi-agent tasks) would test generality.
- **Novelty overlap:** Shares conceptual territory with recent LLM-for-environment design papers such as **LESR (Wang et al., 2024)**, **ExploRLLM (Ma et al., 2024)**, **Eureka (Ma et al., 2023)**, and **Text2Reward (Xie et al., 2023)**. The distinction lies mainly in targeting observation and action models rather than rewards.
- **No real-world validation:** Physical robot experiments or noisy sensory settings would significantly strengthen claims of practical impact.
LOAM-Race is claimed to be the first to use an LLM and race different candidates, but ExploRLLM (https://arxiv.org/abs/2403.09583) also does this; it also shapes observation and action spaces.
Along with weaknesses:
1. How does LOAM generalize to non-robotic domains (e.g., grid-worlds or visual navigation)?
2. How sensitive is LOAM to LLM type and prompting format? Would smaller models (e.g., GPT-3.5) produce usable models?
3. Could LOAM-Race be extended to also handle reward model design simultaneously?
4. How does LOAM ensure the generated code’s physical plausibility (e.g., avoiding unfeasible joint mappings)? |
Fully AI-generated |
|
Designing Observation and Action Models for Efficient Reinforcement Learning with LLMs |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes LOAM, which uses an LLM to automatically generate the observation and action functions that define how an RL agent perceives the environment and issues commands. Instead of relying on handcrafted mappings, LOAM creates these interfaces in code and plugs them into existing training pipelines. A second component, LOAM-Race, samples multiple candidate designs, briefly trains each one, and then reallocates the fixed training budget to the most promising candidates. Experiments across 12 HumanoidBench tasks show faster convergence and, in some cases, better performance than other baselines.
1.1: Addresses an underexplored component in RL, which is the design of observation and action mappings.
1.2: Strong experimental results with different environments (same domain) with multiple seeds.
2.1: No optimization in the design space. The method goes straight from zero-shot generation to training to selection; there is no iterative refinement or evolution of the generated code (obs/action functions).
2.2: LOAM-Race is not clearly described. The figure mentions selection “every ~128k steps,” but the paper does not specify the details. The results show that LOAM-Race has the same total training steps as the other baselines, so it is unclear how the total training budget is divided.
2.3: The rule of discarding the weakest candidate every 128k steps can be wrong. More complicated functions may require more training time to converge and could lead to better final results; this should be investigated.
2.4: The claim that LOAM-Race “handles LLM stochasticity” is misleading. The approach samples multiple candidates and reallocates training based on performance every 128K steps.
2.5: Limited domain diversity. All experiments are on humanoid tasks; the method should be tested on other domains or harder environments to verify generality and performance beyond convergence speed.
2.6: Section 3 is overly detailed, with prompt templates that belong in the appendix. The space should instead be used to expand on 2.2 (racing/budget details).
3.1: Can the authors provide more information about the overall pipeline design and justify why there is no optimization or refinement over the generated code space (i.e., beyond zero-shot generation and simple selection among K candidates)?
3.2: Can you show additional results or analysis verifying whether weaker early-performance candidates in LOAM-Race eventually lead to worse final policies, or if early performance reliably predicts long-term outcomes?
3.3: Can you add additional results in more domains where LOAM (and/or LOAM-Race) outperforms other baselines consistently, not only in speed but in performance as well? |
Lightly AI-edited |
|
Designing Observation and Action Models for Efficient Reinforcement Learning with LLMs |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
LOAM uses LLMs to automatically generate Python functions for observation and action models in RL, based on environment specs and tasks, enabling efficient integration into training pipelines. LOAM-Race races multiple variants to select the best under fixed budgets. Evaluated on HumanoidBench (12 locomotion/manipulation tasks) with FastTD3, achieving 3x faster learning and higher returns than baselines like handcrafted features and LESR.
1.Automates a key RL bottleneck (obs/action design) with LLMs, yielding compact, task-relevant models that boost sample efficiency and final performance across diverse tasks.
2.LOAM-Race effectively mitigates LLM output variability via optimistic racing, identifying strong designs in a single run without exhaustive training.
3.Structured prompts incorporate robotics priors (e.g., posture stability), enhancing model quality as shown in ablations.
1.No quantitative LLM cost analysis—racing requires multiple generations/evaluations, potentially prohibitive for larger setups.
2.Baselines (e.g., LESR) are adapted but may not be optimally tuned; lacks comparisons to non-LLM obs/action methods.
3.Results confined to simulation; real-robot gaps (noise, delays) unaddressed, limiting claims of practical utility.
4.Over-reliance on proprietary GPT-5 without testing open-source alternatives or model robustness.
1.How does LOAM perform with open-source LLMs (e.g., Llama-3) versus GPT-5? Any degradation in model quality?
2.What are the total LLM inference costs (tokens, time) for generating and racing models per task?
3.Can LOAM handle visual or partial observations, e.g., by incorporating neural encoders?
4.Why no ablation on racing hyperparameters like K (candidates) beyond K=1-4, or confidence estimation methods?
5.Have you tested LOAM on non-MuJoCo envs or real hardware to validate beyond simulation? |
Fully AI-generated |
|
Designing Observation and Action Models for Efficient Reinforcement Learning with LLMs |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
Disclosure: Claude is used to refine this review.
This paper introduces LOAM (LLM-based design of Observation and Action Models), a framework that leverages large language models to automatically generate observation and action representations for reinforcement learning tasks. Given environment specifications and task descriptions, LOAM produces Python functions that transform raw sensory inputs into compact observation vectors and map low-dimensional policy outputs to full actuator commands. The authors also propose LOAM-Race, which handles LLM output variability by racing multiple generated designs and progressively selecting top performers. Experiments on HumanoidBench demonstrate that LOAM-designed models achieve approximately 3× faster learning on basic locomotion tasks compared to handcrafted baseline models.
- The paper addresses an important yet understudied problem in reinforcement learning - automated design of observation and action spaces, which practitioners typically handle through manual feature engineering. The core insight that LLMs can automate this process is compelling and timely.
- The empirical results are strong, with consistent improvements across multiple tasks in HumanoidBench, and the reach task result is particularly impressive where LOAM succeeds while all baselines fail.
- The paper provides extensive implementation details, including full prompt templates and generated code examples in the appendices, which aids reproducibility and understanding.
- The evaluation scope is limited to a single benchmark (HumanoidBench), so it's unclear if it can generalize to different robots or task domains.
- The prompts appear heavily engineered with domain-specific guidance, which contradicts claims of automation and suggests significant manual tuning was required. Maybe a reasonable comparison is how much human effort it saves (i.e., how many human hours are needed to match the performance of the proposed method).
N/A |
Fully AI-generated |
|
Active Learning for Flow Matching Model in Shape Design: A Perspective from Continuous Condition Dataset |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper investigates the effect of active learning strategies on flow matching. It presents an analysis based on a piece-wise linear framework. Based on this analysis, the authors propose an acquisition strategy to increase the diversity of the generative model, and a second query strategy to increase the accuracy of the flow matching model. The experiments were performed on a number of shape-generation datasets and show that both strategies reach their respective goals of maximizing diversity or accuracy.
1. One of the first works to investigate active learning for flow matching.
2. The writing is easy to follow.
1. The piece-wise linear framework appears to be overly simplistic. According to the framework, if we selected only a single context c, could flow matching only replicate training data samples?
2. For the diversity strategy $Q_D$ , the terms $-distance(y, \mathcal{Y})$ and $\Delta entropy$ seem to be conflicting. The first one is supposed to bring the samples to the label of an already known sample, the second one is supposed to "promote a more uniform label distribution".
3. It appears that the pool-based setting already reveals all the shapes, as they are the inputs of the unlabeled dataset. Wouldn't we get perfect diversity by just training the generative model on all the pool inputs, at least unconditionally?
4. The active learning strategies do not make use of the actual flow matching model at all. Instead, the accuracy query strategy is just the standard output diversity maximization, which is not flow-matching specific at all.
6. Many formatting errors (parentheses, spaces around citations, the subfigure captions in Figure 1, referencing equations as eq2 instead of Eq. 2 for example).
7. No statistical evaluation of the results. This is really important for AL, since it can be sensitive to the initial data, for example, and the small datasets lead to a large variance. Hence, the experiments should be repeated multiple times.
1. How was the coreset baseline applied? Using the latent space of the regression model?
2. In the ablations, did you also try diversity sampling without entropy and distance in label space?
4. Why did you choose an RBF network as the regression model? |
Fully human-written |
|
Active Learning for Flow Matching Model in Shape Design: A Perspective from Continuous Condition Dataset |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes an active learning method for a flow matching model that emphasizes both accuracy and diversity.
Flow matching models are relatively new and it's great this paper has decided to explore active learning for these exciting methods.
I'll admit that the theoretical foundations of the flow matching model are not clear to me. Additionally, I was having trouble following the piece-wise linear interpolation argument, possibly because I am also not familiar with this theory. What is the condition? What are you interpolating exactly? Could you provide a less technical and more conceptual explanation?
Ignoring my inability to follow the primary theoretical motivation, Equation 4 does not make sense to me. Where do y and Y come from? Additionally, I'm unsure if distance(x, X) is the correct formulation for Coreset optimization (perhaps the greedy version of Coreset?). Adding entropy to the distance metrics seems ad hoc to me. For example, if we were to calculate pairwise distances for the Coreset algorithm and then include entropy when calculating the distances, this seems somewhat more sensible to me. Additionally, there are too many tuning parameters, which suggests that this algorithm probably overfits to a particular choice of tuning parameters. To confirm, in (7), are you combining the different queried points to interpolate? The whole methodology is unclear to me.
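To make my comparison concrete, the greedy (k-center) version of Coreset I am referring to selects points one at a time by maximizing the distance to the already-selected set; a minimal sketch (my notation, assuming a non-empty initial selection) is:

```python
import numpy as np

def greedy_coreset(pool, selected, k):
    """Greedy k-center selection: repeatedly add the pool point farthest
    from the current selected set (distances computed in feature space).

    pool:     (N, D) candidate feature vectors
    selected: non-empty list of already-selected (D,) feature vectors
    """
    selected = list(selected)
    for _ in range(k):
        # distance of every pool point to its nearest selected point
        d = np.linalg.norm(pool[:, None, :] - np.asarray(selected)[None, :, :], axis=-1).min(axis=1)
        selected.append(pool[np.argmax(d)])
    return selected
```

An entropy term could then be folded into the per-point score inside this loop, rather than added to a single distance as in Equation 4, which might make the combination easier to interpret.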
The evaluation is also unclear to me: why are you evaluating diversity and accuracy separately and giving them equal weight? We should focus primarily on accuracy and consider diversity as a secondary metric to assess robustness. A simple explanation of how accuracy is determined would also be helpful; how would we assess accuracy if we are generating novel images? It somewhat hurts the evaluation that all the datasets are synthetic and seemingly simple, which suggests that the method may not generalize to other use cases. However, I don't fully understand the method, so I may be wrong (perhaps the novelty of the approach warrants looking at simplified evaluations first).
Please see the "Weaknesses" section. |
Fully human-written |