|
TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This study proposes TemMed-Bench, the first benchmark specifically designed for evaluating Large Vision-Language Models' (LVLMs) ability to reason over temporal medical images by tracking changes in patients' conditions between different clinical visits. The benchmark consists of a test set across three distinct tasks:
1. Visual Question Answering (VQA): Challenges models to answer binary "yes" or "no" questions about condition changes between a historical and current image pair.
2. Report Generation: Requires models to generate a report analyzing condition changes observed between a historical and current image.
3. Image-pair Selection: A vision-centric task where models must select the image pair (from three options) that best matches a given medical statement describing a condition change.
The benchmark further includes a knowledge corpus of over 17,000 instances to support retrieval-augmented generation (RAG). To enhance retrieval quality for temporal reasoning, the study introduces a novel "pairwise image retrieval method," which computes similarity over both the historical and the current image so that retrieved instances reflect similar condition changes, thereby enabling multi-modal retrieval augmentation. The evaluation reveals significant limitations in current LVLMs' temporal reasoning abilities and highlights the effectiveness of multi-modal RAG.
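As a point of reference, pairwise image retrieval of this kind can be sketched as scoring each corpus instance by the similarity of both images in the pair; the averaging rule and the `encoder` abstraction below are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pairwise_retrieve(query_hist, query_curr, corpus, encoder, top_k=5):
    """Rank corpus instances by the joint similarity of their (historical, current)
    image pair to the query pair. Each corpus item carries 'hist' and 'curr' images
    plus its associated report for retrieval augmentation."""
    q_h, q_c = encoder(query_hist), encoder(query_curr)
    scores = []
    for item in corpus:
        s_hist = cosine(q_h, encoder(item["hist"]))   # similarity of historical images
        s_curr = cosine(q_c, encoder(item["curr"]))   # similarity of current images
        scores.append(0.5 * (s_hist + s_curr))        # assumed pairwise combination rule
    top = np.argsort(scores)[::-1][:top_k]
    return [corpus[i] for i in top]
```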
1. Large-scale Benchmark Construction: A total of 18,144 instances were constructed using an automated data collection pipeline applied to the CheXpert Plus dataset.
2. Multimodal RAG: The study introduces a multimodal RAG approach, incorporating a pairwise image retrieval method that considers similarity between historical and current images.
3. Temporal Reasoning Evaluation: Beyond single-visit image interpretation benchmarks, this work evaluates models’ ability to infer condition changes over time by comparing historical and current medical images.
This study makes a valuable contribution to evaluating temporal medical image reasoning in LVLMs. However, it exhibits several limitations:
1. Restricted Scope of "Temporal" Reasoning: The benchmark limits temporal analysis to a binary comparison between the current image and only the most recent prior visit. This simplified setting does not evaluate long-term progression, intermittent conditions, or multi-visit trends typical in clinical practice, reducing realism and the complexity of the reasoning task.
2. Lack of Robustness and Transparency in Data Collection: The dataset relies on handcrafted regular expressions to extract “condition change” sentences from medical reports, without any quantitative validation (precision, recall, F1) to assess extraction accuracy. Considering the variability of clinical language, this approach risks misinterpretation, extraction errors, and bias, ultimately harming dataset quality and representativeness. Clear documentation of the dataset extraction process is needed, along with transparent analysis of coverage and missed cases. Quantitative evidence demonstrating extraction performance would strengthen confidence in the dataset.
3. Insufficient Transparency and Expertise in Human Review: While the paper notes that human reviewers validated AI-generated VQA question-answer pairs, additional details about the review process would help support the reliability of this validation. Information such as the number of reviewers, their clinical expertise, review guidelines, and inter-rater agreement metrics (e.g., Cohen’s Kappa) would strengthen the methodological transparency.
Were data ethics maintained in the usage of the proprietary LVLMs? |
Moderately AI-edited |
|
Rotation Control Unlearning: Quantifying and Controlling Continuous Unlearning for LLM with The Cognitive Rotation Space |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the problem of machine unlearning in LLMs. The authors note that existing methods rely on a retained dataset to preserve utility, become impractical in larger and more realistic settings, and can lead to catastrophic utility loss in the continual setting. The authors propose rotation control unlearning (RCU), which interprets LoRA updates as rotations within a specific cognitive rotation space, and claim that the rotation angle corresponds to the degree of forgetting. They also introduce an orthogonal rotation axes loss to allow multiple unlearning steps. Experiments are conducted on a QA benchmark and TOFU with a single non-instruct Llama model.
The studied problem is important and the proposed method shows its effectiveness to some extent.
Overall, there are several weaknesses the authors should address. Most importantly, why can the degree of unlearning be quantified by a rotation angle? The paper does not provide conceptual, empirical, or mathematical justification for why unlearning should correspond to a rotation angle in a rotation space. The writing also needs significant improvement: throughout the paper, the authors use terms extensively without explanation (for example, what exactly is the "cognitive rotation space"?). These terms should be explained in the preliminary section, which currently reads as a ‘related work’ section. The method section likewise reads as a list of techniques without explaining why they are needed or how they relate to the core problem of unlearning. Overall, the paper does not convincingly argue that RCU is an effective or principled approach to unlearning.
I also have concerns regarding the experiments. Moving into 2026, this setting feels too limited to demonstrate practical relevance; the authors should try larger models on many more unlearning requests to prove the effectiveness of their method. Also, the evaluation metrics are not well justified or properly defined. For the TOFU task, why not use the evaluation metrics provided by the original authors, for example, the p-value statistical test and the average of ROUGE, Answer Probability, and Truth Ratio? The authors should provide more context about these evaluations and justify their choices.
While [1] is a recent work, it has achieved ideal unlearning by masking out the training signal of TOFU in its corpus. According to the results in [1], an ideally unlearned model achieves an ideal forget-quality vs. utility trade-off. Also, a lower ROUGE on the forget set is not always better; this carries over to the authors' choice of SU and DU, where they claim that lower is always better. I strongly recommend that the authors use TOFU's evaluation metrics and compare their results with LMLM from [1]. The authors should also compare computational costs, since we want a simple and efficient unlearning algorithm.
Lastly, [2] has pointed out that unlearning isn’t always robust. More experiments on the robustness of their methods should be presented in the paper.
[1] Zhao, L., Zalouk, S., Belardi, C. K., Lovelace, J., Zhou, J. P., Weinberger, K. Q., ... & Sun, J. J. (2025). Pre-training Large Memory Language Models with Internal and External Knowledge. arXiv preprint arXiv:2505.15962.
[2] Łucki, J., Wei, B., Huang, Y., Henderson, P., Tramèr, F., & Rando, J. (2024). An adversarial perspective on machine unlearning for ai safety. arXiv preprint arXiv:2409.18025. |
Fully human-written |
|
Rotation Control Unlearning: Quantifying and Controlling Continuous Unlearning for LLM with The Cognitive Rotation Space |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes Rotation Control Unlearning (RCU), a novel method for Large Language Model (LLM) unlearning that precisely quantifies and controls the unlearning degree during continuous unlearning requests.
They propose the RCU method, which reformulates the LoRA update process as rotations within a "cognitive rotation space". This approach allows RCU to effectively unlearn information continuously without needing a retained dataset.
The method introduces 3 new key components:
1. Skew Symmetric Loss: A loss function designed to construct the cognitive rotation space. This constraint ensures the LoRA parameter updates behave like rotations (a generic sketch of this construction is given after the list).
2. Rotational Salience Weight: A metric derived from an OOD detector that is used to precisely quantify and control the degree of unlearning (i.e., the rotation angle) for any given request.
3. Orthogonal Rotation Axes Regularization: A loss function that enforces mutually perpendicular rotation directions for consecutive unlearning requests. This minimizes interference between different unlearning tasks and directly addresses the problem of cumulative catastrophic utility loss.
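To make the skew-symmetric component concrete: if an update matrix $C$ satisfies $C^\top = -C$, then $\exp(C)$ is orthogonal, i.e., a rotation. The sketch below illustrates this general recipe only; the loss form, the use of `matrix_exp`, and the role of the scalar `beta` are assumptions rather than the paper's exact design.

```python
import torch

def skew_symmetric_penalty(C: torch.Tensor) -> torch.Tensor:
    """Penalty driving an update matrix C (e.g., C = B @ A from LoRA factors)
    toward skew-symmetry, C^T = -C."""
    return torch.linalg.matrix_norm(C + C.T, ord="fro") ** 2

def rotation_from_update(C: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """For (approximately) skew-symmetric C, exp(beta * C) is (approximately) a
    rotation; beta acts as a scalar controlling the rotation angle."""
    return torch.linalg.matrix_exp(beta * C)

# sanity check: a rotation R satisfies R^T R = I
C = torch.randn(4, 4)
C = 0.5 * (C - C.T)                      # exactly skew-symmetric
R = rotation_from_update(C, beta=0.3)
assert torch.allclose(R.T @ R, torch.eye(4), atol=1e-5)
```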
**1. Fine-grained Control over Unlearning Intensity:**
The paper introduces a novel perspective by modeling the unlearning strength as a controllable variable through rotational scaling. This design allows the model to precisely adjust the degree of forgetting at a fine-grained level. Such a mechanism provides more flexibility and interpretability for selective and controlled unlearning.
**2. Theoretical Novelty through the Cognitive Rotation Space:**
The proposed Cognitive Rotation Space offers a strong theoretical contribution. By formulating unlearning as a spectral-space rotation governed by skew-symmetric transformations, the authors connect geometric orthogonality with cognitive forgetting behavior. This framework is both conceptually elegant and mathematically grounded, bringing a fresh theoretical lens to the study of machine unlearning.
**3. Strong Empirical Performance and Robustness:**
The experimental results are impressive; the method demonstrates consistently strong unlearning performance across two benchmarks while significantly alleviating cumulative catastrophic utility loss in continual unlearning settings. The comprehensive evaluation convincingly shows that the proposed approach achieves an excellent balance between forgetting precision and retention stability.
**1. Limited Orthogonality Scope:**
The current design of the orthogonal rotation loss $\mathcal{L}_o$ appears to only enforce pairwise orthogonality between the current and the immediately preceding rotation axes. However, it does not ensure global orthogonality across all historical rotations. As the number of unlearning requests $T$ grows, earlier forgetting directions may be reprojected onto new ones, potentially leading to knowledge leakage or cumulative utility degradation. Have the authors considered enforcing global orthogonality, and is it necessary for maintaining long-term stability in continual unlearning?
**2. Granularity of the Rotational Salience Weight $\beta$:**
In the RCU algorithm, is the $\beta$ obtained through the OOD detection and Distributional Shift Compensator shared among all forget data within one unlearning process, or is a separate $\beta$ computed individually for each data sample?
**3. Computational Overhead and Efficiency Concerns:**
Since the proposed approach relies on OOD detection to obtain a sample-dependent $\beta$ value, it seems that an additional forward pass through the OOD module may be required for each input during inference. Will this introduce noticeable latency? Moreover, during training, the optimization process involves multiple components and loss terms, which might increase computational complexity and training time. A more detailed discussion or empirical analysis of the computational overhead would strengthen the paper.
**4. Results Not Consistent with Prior Baselines.**
After reviewing the O³ paper, I notice that the two papers report identical numbers for the other baselines, but this paper’s results are worse than those reported in O³ specifically for the O³ method itself. Given that the remaining unlearning methods appear unchanged across both papers, could the authors clarify the source of this discrepancy?
Questions are included in the weakness section. |
Heavily AI-edited |
|
Rotation Control Unlearning: Quantifying and Controlling Continuous Unlearning for LLM with The Cognitive Rotation Space |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper proposes Rotation Control Unlearning (RCU), a novel approach for continual machine unlearning in Large Language Models that addresses two major challenges: cumulative utility degradation across multiple unlearning requests, and a lack of precise control and quantification over the unlearning process. RCU reinterprets LoRA-based updates as rotations in a cognitive representation space. The method introduces: 1) a skew-symmetric loss to model LoRA updates as rotational transformations, 2) an orthogonal rotation axis loss to ensure perpendicular update directions across sequential unlearning requests, thereby minimizing interference and catastrophic forgetting, and 3) a distributional shift compensator that produces rotational salience weights, enabling precise auxiliary quantification of unlearning effects. RCU is tested on the ScienceQA and TOFU benchmarks, achieving effective continual unlearning without relying on retained datasets.
1. By modeling unlearning as rotations in a latent space, RCU introduces a mathematically grounded and interpretable framework for tracking and controlling knowledge removal.
2. Experiments on TOFU and ScienceQA demonstrate RCU’s unlearning efficacy and utility preservation.
3. The introduction of rotational angles and salience weights provides a precise, tunable metric for unlearning strength.
1. The training procedure involves a combination of losses for LoRA and OOD detection, making the optimization process relatively complex. However, the paper does not provide sufficient details on the choice of loss weights (Lambda values). A sensitivity analysis should be conducted to demonstrate the model's robustness to different hyperparameter settings and justify the selected values.
2. The paper would benefit from improved writing quality. The rationale for using a rotation angle is not clear, and there are several typos and inconsistencies; for example, in Table 4 the word "physics" appears to be misplaced or misspelled.
Could the authors provide justification for the choice of the lambda hyperparameters used in the loss functions? Given that the training objective combines multiple loss terms (e.g., for LoRA tuning and OOD detection), the values of these weights significantly affect model behavior. A robustness analysis—such as an ablation or sensitivity study—should be conducted to evaluate how performance varies with different lambda settings across tasks or datasets. |
Lightly AI-edited |
|
Rotation Control Unlearning: Quantifying and Controlling Continuous Unlearning for LLM with The Cognitive Rotation Space |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper improves on the continual unlearning framework $O^3$ [1], which consists of orthogonal LoRA adapters for different unlearning requests, an OOD detector trained with a novel contrastive entropy loss that uses a global-aware scoring mechanism, and a soft inference algorithm to detect unlearned data during inference.
The paper introduces rotation matrices for updating the LoRA adapters and constrains the rotation axes to be orthogonal across unlearning requests. The corresponding OOD detector is updated to take this modification into account.
[1] Gao, Chongyang, et al. "On large language model continual unlearning." ICLR 2025
- The paper is well-written for the most part.
- The empirical results look strong, even though they are not completely convincing.
- Please improve the figures. They are hard to see without a lot of zooming in.
- The metrics in Table 1 for $O^3$ do not match with the original paper. In this case, please include multiple runs to demonstrate the improvement properly. This is a major weakness of this paper because of how closely it follows $O^3$ and claims improvement.
- Please include the $U^2R$ metric from the $O^3$ paper to demonstrate improved utility preservation over $O^3$ .
- From Equations 4 and 5: $C = BA$ and $R = \exp(C) = I + BA$. This seems very strange. Is there any evidence for why this approximation may hold? Can you provide evidence from your experiments that $C \ll I$? (The relevant series expansion is written out after this list.)
- The above point makes it very hard to understand whether the method legitimately works due to the rotation aspect or the performance is still an artifact of $O^3$ due to the high similarity of constraints and other frameworks.
- Should Eq. 3 be $W = W + BA$?
- One of the central claims of the paper is that updating LoRA parameters using rotation matrices mitigates catastrophic utility degradation. Why is that the case? Why is this better than $O^3$? Is there any intuition?
- In Eq. 11, why is the use of a relative rotation space needed? Can we directly use $R_t$ and $R_{t-1}$?
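To make the point about Equations 4 and 5 concrete, the standard matrix-exponential series (a textbook identity, not taken from the paper) is

$$\exp(BA) = I + BA + \tfrac{1}{2}(BA)^2 + \tfrac{1}{6}(BA)^3 + \cdots,$$

so $R \approx I + BA$ is justified only when the higher-order powers of $BA$ are negligible, which is exactly the evidence requested above.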
Please address the Weaknesses and the above questions.
I am willing to raise my score based on the clarification. |
Fully human-written |
|
Latent Planning Emerges with Scale |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper examines whether large language models (LLMs) perform latent planning, that is, internally representing and reasoning toward future tokens without explicit plans. Using Qwen-3 models (0.6B–14B), the authors find that forward planning emerges with scale while backward planning remains limited, offering a causal framework and mechanistic evidence via transcoder feature circuits.
- The paper provides a clear and causally grounded definition of latent planning that distinguishes genuine planning from mere correlational predictability.
- The study offers comprehensive scaling insights, showing how planning abilities gradually emerge and strengthen as model size increases.
- It links mechanistic interpretability to AI safety, highlighting how latent planning could relate to hidden goal pursuit or “scheming”, thereby extending the work’s broader relevance.
- The chosen tasks, such as a/an prediction and rhyming couplets, are synthetic and narrowly scoped, limiting the conclusions’ applicability to real-world reasoning or planning.
- The evidence for backward planning is weak and inconclusive, raising doubts about whether full planning mechanisms have truly been demonstrated.
- The study lacks cross-model comparison, as it focuses only on the Qwen-3 family, making it unclear whether similar phenomena occur in other model architectures.
- Some of the causal claims may be overstated, since interventions could affect correlated linguistic or contextual features rather than genuine planning representations.
- Could the observed causal effects arise from correlated features instead of true planning representations?
- How would the proposed framework generalize to complex goal-directed or multi-step reasoning tasks?
- What is the relative contribution of instruction-tuning versus model scale in the emergence of latent planning?
- How might this causal framework be applied in AI safety monitoring to detect latent scheming or hidden goal formation? |
Fully AI-generated |
|
Latent Planning Emerges with Scale |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
Investigates whether LLMs engage in latent (not explicitly generated) planning and shows that planning ability grows with model scale (in the Qwen-3 family).
Contributions
* Provides a causal definition of latent planning -- planning is an internal representation that (i) causes the model to produce a specific future token for forward planning and (ii) causes generation of a context that makes that token more likely. This improves on purely observational/probing definitions.
* Provides simple agreement tasks as planning probes: On a/an, is/are, and el/la tasks, larger Qwen-3 models reliably plan ahead for the content word and use that to choose the right function word. Smaller models show nascent but incomplete mechanisms.
* Mechanistic evidence via transcoder feature circuits that identify “planning features” that represent the future word and show, through interventions, that ablating them hurts performance and boosting them helps, indicating genuine causal relevance.
* Causal-mech interpretability recipe for monitoring such emergence of forward planning and backward planning ability in open models.
* Originality: Introduces a causal definition of latent planning that distinguishes between forward (goal-directed token production) and backward (context-shaping) planning. This causal approach is novel in how it reframes what “planning” means for decoder models and corrects an overextension in prior work that equated decodability with intent. The integration of transcoder feature circuits with causal interventions is also a novel methodological synthesis, enabling verifiable mechanistic evidence rather than speculative probing.
* Quality: The experiments are rigorous and well controlled. The progression from simple grammatical-agreement tasks to rhyming and prose-steering scenarios is also well structured, both logically and empirically. The use of quantitative flow analysis within feature circuits adds a further layer of interpretability.
* Clarity: Definitions are explicit, figures are clear and interpretable. The argument flows naturally from conceptual motivation to empirical validation.
* Significance: The results establish that latent planning mechanisms emerge with scale and that forward planning precedes backward planning—an interpretable scaling law that contributes to our understanding of model cognition.
* In terms of planning, the paper over-indexes on short-range linguistic dependencies. Not clear if this scales to true multi-step reasoning or action planning.
* Limited to Qwen-3 series, which hurts generalization
* Experiments provide only limited support for backward planning, and the analysis of whether the generated context “licenses” the planned token is overly qualitative:
* Can you add a quantitative measure of contextual dependency?
* Do you believe these results generalize to multi-step cognitive planning or compositional reasoning? Could this be shown? |
Fully AI-generated |
|
Latent Planning Emerges with Scale |
Soundness: 3: good
Presentation: 3: good
Contribution: 4: excellent
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper's central hypothesis is that latent planning is an emergent capability that increases with model scale. It seeks to answer (1) whether LLMs engage in a mechanistically verifiable form of latent planning, (2) how this ability can be defined and measured, and (3) how this capability scales with model size.
The methodology first establishes a strict, two-condition causal definition of latent planning, distinguishing it from prior observational or probing-based work. For an LLM to be "latent planning," it must possess an internal representation of a future goal (a token or concept $t$) that:
1. Forward Planning: Causes the model to eventually generate $t$ (Condition 1).
2. Backward Planning: Causes the model to generate a preceding context that licenses $t$ (Condition 2).
To identify these causal mechanisms, the authors employ Transcoder Feature Circuits, a mechanistic interpretability technique. This method decomposes a model's dense MLP activations into sparse, monosemantic (interpretable) features and identifies the causal sub-graph (the "circuit") that explains a specific behavior. The study is conducted on the Qwen-3 family of open-source models, ranging from 0.6B to 14B parameters.
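For readers unfamiliar with the technique, a transcoder in this context is a sparse, interpretable stand-in for an MLP block, trained to reproduce the MLP's output from its input through an overcomplete, sparsely activating feature layer. The sketch below is a generic illustration of that idea, not the paper's exact architecture or training setup.

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Sparse approximation of an MLP block: maps the MLP input to the MLP output
    through an overcomplete feature layer whose units can be inspected, ablated,
    or boosted individually for circuit analysis."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, mlp_input: torch.Tensor):
        features = torch.relu(self.encoder(mlp_input))  # sparse feature activations
        return self.decoder(features), features

# Training (not shown) minimizes ||transcoder(x) - MLP(x)||^2 plus an L1 penalty on
# `features`; feature-circuit analysis then traces and intervenes on these features.
```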
1. The paper's greatest strength is its insistence on a rigorous, two-condition causal definition of latent planning. This elevates the study from a correlational observation to a test of a mechanistic hypothesis. This strength is powerfully underscored by the refutation of probing-based methods in Appendix G, which demonstrates that high probing accuracy can be causally irrelevant.
2. The quality of the core experiment is extremely high. The a/an and is/are tasks serve as an elegant "minimal pair" testbed for planning. The causal interventions (ablation and boosting) in Section 4.4 provide "smoking gun" evidence for the discovered planning circuit. The analysis in Appendix E, which surgically separates task-solving ability from planning ability, is a brilliant and crucial piece of analysis that solidifies the paper's claims.
1. The complete failure of the methodology on the el/la task (Appendix D) is a significant weakness. The authors' explanation—that "Qwen-3 is not highly capable in language besides English and Chinese"—is an ad hoc hypothesis. This failure could alternatively imply that the "planning" mechanism found is not a general-purpose planning module at all, but a highly specific and brittle circuit for English grammatical agreement. This possibility severely undercuts the generality of the paper's claims.
2. The paper repeatedly claims that smaller models (4B-8B) have "nascent planning mechanisms" but "fail" the task. It is unclear what this means mechanistically. Does the circuit exist but remain weak? Are some features missing? Does the model have the 'accountant' feature but lack the causal connection to 'an'? This "nascent" concept is central to the "emergence" narrative but remains poorly defined.
On local planning: the X features are described as "sensitive" and found in a "small minority." Is this evidence of a real, generalizable mechanism, or an artifact of steering on specific, polysemantic features that happen to fire on common n-grams? How could this mechanism be tested more robustly? |
Fully human-written |
|
Knowledge distillation through geometry-aware representational alignment |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a new approach for knowledge distillation from large teacher models to smaller student models by explicitly applying the Procrustes distance and the Frobenius norm of the feature Gram matrix as the objective. The paper theoretically demonstrates that existing knowledge distillation objectives, such as projection-based mean squared loss or Centered Kernel Alignment (CKA), cannot capture the feature structure even when optimized perfectly (zero loss), and that the proposed Procrustes-distance-based objective is theoretically and practically better. The authors provide comprehensive theoretical proofs and conduct knowledge distillation experiments on various tasks to show that the proposed objective outperforms existing methods when distilling encoder-only and decoder-only language models.
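For reference, the two geometry-aware objectives under discussion can be written down compactly; the sketch below is one standard way to compute them, and the centering, normalization, and dimension handling are assumptions rather than the paper's exact choices.

```python
import torch

def procrustes_distance(R_s: torch.Tensor, R_t: torch.Tensor) -> torch.Tensor:
    """Squared orthogonal Procrustes distance between student and teacher feature
    matrices (n x d), invariant to orthogonal transformations of the features.
    Assumes matching feature dimensions; a projection may be needed otherwise."""
    R_s = R_s - R_s.mean(dim=0, keepdim=True)   # center
    R_t = R_t - R_t.mean(dim=0, keepdim=True)
    R_s = R_s / R_s.norm()                      # unit Frobenius norm
    R_t = R_t / R_t.norm()
    nuclear = torch.linalg.matrix_norm(R_t.T @ R_s, ord="nuc")
    return R_s.norm() ** 2 + R_t.norm() ** 2 - 2 * nuclear

def gram_distance(R_s: torch.Tensor, R_t: torch.Tensor) -> torch.Tensor:
    """Frobenius norm of the difference between the n x n feature Gram matrices,
    which compares pairwise sample similarities rather than raw features."""
    return torch.linalg.matrix_norm(R_s @ R_s.T - R_t @ R_t.T, ord="fro")
```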
1. Strong theoretical motivation showing why existing feature-based losses (e.g., CKA, projection MSE) fail to preserve true geometry.
2. Novel geometry-aware losses (Procrustes and Gram-matrix) that align representational structure up to orthogonal transformations.
3. Empirical validation across encoder (BERT) and decoder (OPT) models, covering both classification and instruction-following tasks.
1. Lack of experiments showing whether the observations generally hold across multiple student model sizes. It would be great if the authors could add an analysis, even on only one task, showing performance for student models with different intermediate sizes.
2. There are also works on using mutual information maximization as a knowledge distillation objective. It would be better to discuss this branch of works either theoretically or practically.
For other concerns, please refer to the questions below.
1. On lines 90-93, it may be unclear what a "mode" refers to, especially for readers unfamiliar with this field. Consider adding a sentence to explain it.
2. Do you have any analysis varying the student dimension for the experiments in Section 5.1, to check whether the observations are consistent across different teacher-student dimension gaps?
3. In Figure 2(b), there are spikes in CKA when optimizing the Procrustes objective; is there any explanation for this?
4. It seems that for decoder-only models the performance difference between Procrustes and CKA (or FG) is smaller; do you have any intuition why?
5. It would be great if the authors could analyze the convergence speed of Procrustes compared to CKA (or FG) for Sections 5.2 and 5.3.
6. What is the computational overhead of the proposed objective compared to CKA? |
Fully human-written |
|
Knowledge distillation through geometry-aware representational alignment |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper is about feature-based knowledge distillation for LLMs. It shows that the current methods fail to preserve the teacher's geometric structure and proposes a theoretically grounded, geometry-aware distillation based on Procrustes distance and the Frobenius norm of the feature gram matrix. These findings are supported experimentally and compared to the current baselines.
The main idea is presented and discussed clearly. The paper is well-organized, and the details of the experiments and findings are well presented.
The main weakness of this work is its limited scope, which does not include multi-teacher or multi-modal distillation and VLMs. Also, the use of the Procrustes distance for KD already exists in the literature, so the novelty of this work is not well discussed.
- In both Theorems 1 and 2, it is assumed that $R_t$ and $R_s$ are centered, unit-norm matrices. Can the authors discuss how realistic these assumptions are (this could be explored experimentally) and what the possible impact of violating them is on the experimental results?
- Looking at Table 2, how does the poor performance of "Procrustes + KD" compared with "CKA + KD" align with the (theoretical) findings of the paper?
- Theoretically, can this method be applied to multi-teacher KD? Could this be demonstrated with an experiment? |
Fully human-written |
|
Knowledge distillation through geometry-aware representational alignment |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work proposes geometry-aware objectives for knowledge distillation, using the Procrustes distance or the Frobenius norm of the feature Gram (FG) matrix in place of CKA to better align student and teacher feature structures.
1. Raises a fundamental issue about the use of distance metrics in knowledge distillation.
2. Theoretical work identifying the problem with CKA and the conditions for optimal correlation.
### 1. Unclear Problem Definition
This work proposes using the FG matrix or the Procrustes distance instead of CKA. However, the problem setup only shows that CKA performs worse than these metrics, without pinpointing the specific failure mode of CKA. As a result, it is difficult to understand the actual problem being solved and the benefit gained by replacing CKA with these alternatives. This should be elaborated more clearly.
### 2. Unclear Benefit of Using the Proposed Two Metrics
As a consequence of the first issue, it is not clearly shown which structures or representations can be better aligned in knowledge distillation compared to CKA, since the specific problem with CKA in aligning structures is not clearly defined. Furthermore, the validation section lacks qualitative analysis addressing this issue.
### 3. Missing Issue in Validation
A question comes up: "Does the number of orthogonal vectors indicate better feature structure alignment?"
While I agree with the intuition behind the results of Theorem 2, it remains unclear which structures are more accurately transferred, and whether the number of orthogonal vectors effectively indicates the degree of structure alignment.
a) The results only show the impact of reduced expressiveness due to a low number of orthogonal vectors. We do not know whether this is a dominating cause, especially compared to the distance metric’s potential bias in aligning feature structures. It should be shown that the orthogonal vector size meaningfully impacts alignment beyond the quantitative performance evaluation.
b) Even if the number of orthogonal vectors is an important indicator of alignment quality, student feature representations may have fewer dimensions available to represent the diversity of features (a lower intrinsic dimension), which could reduce the number of vectors required. Learning an unnecessarily large set of orthogonal vectors may therefore be wasteful.
### 1. Why is the performance of Procrustes + KD worse than CKA + KD in Table 2?
More justification is needed for why the number of orthogonal vectors does not show a positive effect here.
### 2. Large-Scale Problems Are Not Evaluated in Tables 1 and 2. Why?
It is expected that the proposed measure should work for larger-scale problems if it truly enables more accurate alignment of feature structures. However, Table 2 suggests that more accurate feature representation alignment does not necessarily benefit the preservation of some student model structures in general KD settings. This raises the question: do we really need a more precise distance measure for knowledge transfer, or should we focus on selectively transferring features instead?
### Minor comment on Demonstration
The figures on page 7 have no captions or figure indices. |
Lightly AI-edited |
|
Knowledge distillation through geometry-aware representational alignment |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces two new forms of knowledge distillation that, informed by the linear representation hypothesis, minimise either the Procrustes distance or the Frobenius norm of the feature Gram matrix to improve distillation by improving representation alignment between students and teachers. The paper provides theoretical results, synthetic experiments, and larger-scale experiments to show that this form of distillation gives statistically significant but marginal performance gains across a range of evaluations, namely: COLA, RTE, MRPC, SelfInst, U-NI and S-NI. Their core contribution is that minimising a loss that preserves geometric properties of the teacher model's representation is a more effective way to perform knowledge distillation.
1. I appreciate the effort that has gone into exploring the geometry alignment that is enabled between students and teachers, I think that it is an interesting angle and the relation to the Linear Representation Hypothesis (LRH) is a good motivation for this work.
2. The use of statistical significance is a positive as many works do not conduct this type of analysis.
3. Comparison to many other methods is a benefit of the paper; however, the missing comparison to $D_{LinProj}$ feels like an important counterfactual to the LRH argument presented in this paper.
4. It is appreciated that two language model families (BERT and OPT) are evaluated in this paper, showing the benefits of the Procrustes distance for both classification and instruction-following tasks.
While the finding that minimising either the procrustes distance or the Gram matrix can improve performance is interesting, it is unclear whether the linear representation hypothesis is the main reason why this occurs. The paper has some issues with experimental consistency and data formatting, which make it difficult to track the findings. Also, the missing evaluation of hyperparameter sweeps over the $\alpha$ term in the loss reduces certainty over the findings presented in the paper.
1. **Lack of hyperparameter sweeping on alpha ($\alpha$)**: In the loss function introduced in equation (6), there is a balance between knowledge distillation and the distance metric used (Procrustes ($D_{P}$), Gram matrix ($D_{FG}$) or CKA ($D_{CKA}$)), and it is unclear what the relationship is between the $\alpha$ hyperparameter and performance. I would like to see an experiment that varies this $\alpha$ hyperparameter, keeping it consistent across controls, to show how $\alpha$ explicitly impacts the performance of $D_{P}$, $D_{FG}$ or $D_{CKA}$. If alignment of the orthogonality of the student and teacher is important, it should be expected that increasing the $\alpha$ value would put more weight on this goal and improve performance. Also including results for $D_{LinProj}$ showing it is ineffective in boosting performance could strengthen the LRH argument.
2. **Section 5.1 Synthetic Experiment**: In subfigure A, it can be observed that the number of ε-orthogonal vectors in the student model is consistently higher for $D_{FG}$ than for $D_{P}$; however, the results in Table 3 suggest that $D_{P}$ consistently outperforms. As a result, it seems that purely aligning orthogonality between the student and the teacher does not fully explain the benefits of using $D_{P}$.
3. **Missing result in Table 1 for PKD**: There appears to be a missing result for the method PKD on the COLA evaluation - it is unclear why this result is missing, and I cannot find an explanation for its omission in the paper.
4. **Lack of standard deviations in Tables 1 and 2**: There appears to be no standard deviations presented in two of the main tables in the paper. While Table 3 does have standard deviations recorded, it is unclear why this is omitted for Tables 1 and 2.
5. **Lack of comparison with $D_{FG}$**: In Table 2, it is unclear why only the Procrustes distance ($D_{P}$) is compared to $D_{CKA}$ when the Gram matrix ($D_{FG}$) is also proposed as an effective geometry-preserving metric by this paper.
6. **Minor spelling mistake**: See line 345 'Our finds corroborate' which should be 'Our findings corroborate' and line 346 'extremely noise' which should be 'extremely noisy'.
1. How does Alpha ($\alpha$) impact the knowledge transfer and corresponding performance?
2. In Table 2, can you provide results for Procrustes + FT and CKA + FT? It is important to establish if minimising cross-entropy loss with these similarity metrics alone can improve performance; otherwise, it could be the case that these metrics merely add further regularisation to KD without directly improving performance through better or worse geometry alignment.
3. Can you align your findings with the recent distillation scaling laws [1] to show how student capacity impacts the geometry alignment?
4. Is there a specific relationship that emerges between which layers are aligned and how this impacts performance? Can ablations show that the layer presented represents a consistent trend for transfer?
5. In the appendix (line 1188) there is a statement *'CKA and learned linear projection are incapable of preserving the feature geometry'*. Given this, and that in Table 2 CKA can correspond to statistically significant improvements in accuracy (see CKA+KD+FT), does this not show that preserving feature geometry is not a necessary part of the performance increase, and that the performance benefits may not be tied to the LRH generally, as this work posits?
References:
[1] Busbridge, D., Shidani, A., Weers, F., Ramapuram, J., Littwin, E. and Webb, R., 2025. Distillation scaling laws. arXiv preprint arXiv:2502.08606. |
Fully human-written |
|
Uncertainty‑Routed Human–LLM Curation and Calibration for ANLI |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes URC2, a three-stage pipeline for adversarial NLI. It first decomposes per-example predictive uncertainty into aleatoric (data/label ambiguity) and epistemic (model disagreement via mutual information) components using a three-teacher ensemble (DeBERTa-v3-large, RoBERTa-large, XLM-R-large). It then routes examples by their dominant uncertainty to a two-lane relabeling workflow: a Human lane for aleatoric-heavy items (relabel, or keep-hard with down-weight/drop) and an LLM lane for epistemic-heavy cases (an instruction-tuned LLM with self-consistency). Finally, it refreshes and calibrates the teachers with the curated labels and per-example weights plus lightweight temperature scaling. On ANLI, URC2 cuts dev ECE by ~30% (to 0.146) without sacrificing accuracy and collapses epistemic mutual information on curated subsets, shifting corpus-level uncertainty toward low-aleatoric/low-epistemic regions.
Contributions: (1) an operational, ensemble-based uncertainty decomposition that drives targeted supervision; (2) a human–LLM two-lane relabeling mechanism producing curated labels and instance weights; and (3) calibration with disagreement reduction, yielding better-calibrated, more reliable NLI under adversarial shift.
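For reference, the aleatoric/epistemic split described here typically follows the standard ensemble decomposition: total predictive entropy = expected per-model entropy (aleatoric) + mutual information (epistemic). The sketch below follows that convention and is an assumption about, not a copy of, the paper's implementation.

```python
import numpy as np

def decompose_uncertainty(probs: np.ndarray, eps: float = 1e-12):
    """probs: (n_teachers, n_classes) predictive distributions for one example.

    Returns (total, aleatoric, epistemic), where
      total     = entropy of the ensemble-mean distribution,
      aleatoric = mean per-teacher entropy (data/label ambiguity),
      epistemic = mutual information = total - aleatoric (teacher disagreement).
    """
    mean_p = probs.mean(axis=0)
    total = -np.sum(mean_p * np.log(mean_p + eps))
    aleatoric = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    return total, aleatoric, total - aleatoric

# Routing would then send high-aleatoric items to human review and
# high-epistemic items to LLM adjudication, per the paper's description.
```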
1. This work introduces an uncertainty-routed curation paradigm that explicitly disentangles aleatoric (data ambiguity) and epistemic uncertainty (model disagreement), enabling distinct treatments for each rather than collapsing them into a single undifferentiated scalar. The approach features a creative and pragmatic two-lane supervision framework: human annotators address cases of aleatoric uncertainty, while an instruction-tuned LLM with self-consistency mechanisms handles instances of epistemic uncertainty. This design effectively leverages the complementary strengths of human judgment and LLM reasoning to enhance dataset quality and model reliability.
2. Comprehensive diagnostic analyses—including risk–coverage curves, per-round evaluations, corpus-level uncertainty shifts, and ∆MI collapse on curated subsets—support the claim that URC2 genuinely resolves disagreement rather than merely smoothing it. Careful evaluation on ANLI, using both calibration metrics (ECE) and accuracy before and after temperature scaling, demonstrates a ~30% reduction in ECE without any loss in accuracy.
1. Multiple threshold values (e.g., r≥0.35 in line 221 and w=0.3 in line 237) are introduced without sufficient justification or explanation of their selection criteria.
2. While the focus on ANLI is reasonable, the claims would be stronger with evaluations beyond ANLI (e.g., [1]). In addition, several hyperparameters appear to be tuned on the ANLI dev split that is also used for reporting, which risks overfitting and makes the conclusions less definitive. Typically, adversarial benchmarks are held out solely for evaluation; we rarely assume access to an adversarial training set for hyperparameter selection. I recommend adding results on additional datasets (e.g., [1]) and using a strictly held-out test set or cross-dataset validation; if tuning on ANLI dev is unavoidable, include a sensitivity analysis and pre-freeze thresholds before final evaluation.
3. 1) The paper does not report the standalone performance of the LLM used in the adjudication lane (e.g., accuracy and ECE on ANLI), which is necessary to estimate the reliability of this component. 2) The workflow relies on a single LLM to adjudicate confident model disagreements; please include ablations with stronger or alternative LLMs (e.g., GPT-5, Qwen-3, Claude) and/or multi-LLM arbitration to assess whether the conclusions are strengthened or challenged under different adjudicators. 3) The paper mentions a quantized LLM; quantization can alter calibration and increase vulnerability to adversarial attacks [2].
4. The paper lacks a comparison with existing baselines, which is necessary to justify the effectiveness of the proposed method.
References:
1. Liu et al. 2020. An empirical study on model-agnostic debiasing strategies for robust natural language inference.
2. Lin et al. 2019. Defensive Quantization: When Efficiency Meets Robustness
Suggestions:
1. In Figure 1, the label in Stage A is incorrect. It should be placed on the box labeled “Routing route by uncertainty” to accurately reflect the intended process.
2. Figures 2a and 2b should use the same color scheme for "After refresh" to maintain consistency; using different colors could cause confusion for readers. |
Moderately AI-edited |
|
Uncertainty‑Routed Human–LLM Curation and Calibration for ANLI |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 0:
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces URC2 (Uncertainty-Routed Curation & Calibration), a three-stage pipeline for improving model calibration and dataset quality on the Adversarial NLI (ANLI) benchmark. Unlike prior methods that treat uncertainty as a single scalar, URC2 decomposes predictive uncertainty into aleatoric (data/label ambiguity) and epistemic (model disagreement) components, routing each to targeted supervision — human audit for ambiguity and LLM adjudication for disagreement — followed by retraining and temperature scaling.
The paper’s main contributions are:
1. Uncertainty-driven supervision: Ensemble-based decomposition of per-example uncertainty (aleatoric vs. epistemic) to guide distinct curation routes.
2. Human–LLM two-lane relabeling: Human annotators handle ambiguous items; an instruction-tuned LLM resolves confident model disagreements through self-consistent adjudication.
3. Calibration with disagreement reduction: Retraining with curated labels and weights plus lightweight temperature scaling reduces expected calibration error on ANLI by 30% (to 0.146) without sacrificing accuracy, while substantially lowering epistemic disagreement and improving corpus-level uncertainty distribution
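For context, the "lightweight temperature scaling" referred to here is usually a single scalar fit on held-out logits by minimizing negative log-likelihood; the sketch below shows that standard procedure and is an assumption rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fit one temperature T > 0 on held-out (logits, labels); calibrated
    probabilities are then softmax(logits / T)."""
    log_T = torch.zeros(1, requires_grad=True)          # optimize log T so T stays positive
    optimizer = torch.optim.LBFGS([log_T], lr=0.1, max_iter=100)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_T.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_T.exp().item()
```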
1. The uncertainty-routed curation framework that distinguishes and acts on aleatoric versus epistemic uncertainty, turning uncertainty diagnosis into targeted supervision is novel and intuitive.
2. This paper proposes an interesting utilization of uncertainty decomposition, which incorporates human-in-the-loop for aleatoric-heavy samples.
3. The paper is well-organized and clearly written.
**According to the ICLR 2026 Author Guide, the main text should be at most 9 pages at submission time. This submission has 10 pages, which is very unfair to other submissions, and the paper should be desk rejected.**
N/A |
Fully AI-generated |
|
Uncertainty‑Routed Human–LLM Curation and Calibration for ANLI |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes URC2, a data relabeling pipeline that uses a standard ensemble-based entropy analysis to divide the uncertain examples in the ANLI dataset into two categories: 1) examples where the original instance is ambiguous, and 2) examples where the original label is correct but models diverge with high confidence. Following this decomposition (or categorization) of ANLI examples, URC2 relabels the ambiguous examples with human annotators and the high-confidence diverging cases with an LLM. By retraining models on the relabeled ANLI dataset, the authors show that the expected calibration error drops significantly and that disagreements between models decrease as well.
The overall design is sound and successfully operationalizes a decomposition->relabeling->retraining pipeline to show that clearer training signals (including labels and weights) can significantly improve the model's confidence calibration and reduce disagreements. The relabeling pipeline proposes two lanes to handle different categories of uncertain examples, achieving a balance between model relabeling and human effort.
It seems to me that by employing LLM relabeling in Lane L, the URC2 pipeline essentially trusts the LLM's relabeling of the ANLI dataset over the original labels, even though the authors explain that only epistemic-heavy items are routed to the models. The authors should evaluate whether this trust is actually sensible by comparing against human labels on selected samples. At the same time, this undermines the purpose of having humans handle relabeling in Lane H, since a sufficiently capable LLM would be able to do that as well, given that the authors assume LLMs can make successful judgments on ANLI. I understand that humans relabel ambiguous cases, which is intuitively more challenging than epistemic-heavy cases, but the paper does not provide actual evidence for this claim. Since I do not understand why Lanes L and H are handled separately, the proposed URC2 pipeline essentially reads like a "fixing ANLI annotation errors" effort, and I do not see how it would be valuable to future research. The authors should also compare against a very simple baseline that uses an LLM to relabel every instance in the ANLI dataset.
Please see the weakness section. |
Fully human-written |
|
Controlling a $\mu$RTS agent using Decision Transformers |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper applies a Critic-guided Online Decision Transformer (OCGDT) to Gym-$\mu$RTS, a long-horizon, stochastic game environment with sparse rewards. OCGDT re-implements and combines ODT and CGDT, two standard return-conditioned sequence modeling methods, to enable offline critic learning, offline policy learning, and online fine-tuning. The authors evaluate against several rule-based bots and run ablations over buffer size, training steps, and context length. Empirically, OCGDT matches or exceeds baselines such as IQL, CGDT, and ODT.
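For readers less familiar with the underlying machinery, ODT and CGDT both build on the Decision Transformer's return-conditioned sequence formulation; the sketch below shows that generic input layout only and is not this paper's architecture (which must additionally handle $\mu$RTS's action space and critic guidance).

```python
import torch
import torch.nn as nn

class ReturnConditionedPolicy(nn.Module):
    """Generic Decision-Transformer-style policy: interleaves (return-to-go,
    state, action) tokens and predicts each action from its state token."""
    def __init__(self, state_dim: int, act_dim: int, d_model: int = 128):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.predict_action = nn.Linear(d_model, act_dim)

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim)
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states), self.embed_action(actions)],
            dim=2,
        ).flatten(1, 2)                        # (B, 3T, d_model), ordered R_t, s_t, a_t
        h = self.backbone(tokens)              # causal masking omitted for brevity
        return self.predict_action(h[:, 1::3]) # read action predictions off state tokens
```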
- The paper well summarizes prior work it builds upon.
- The paper includes enough experimental details for reproduction purposes.
- __The novelty and contribution of this paper are very limited.__ This paper is an application of existing methods to a new task. ODT and CGDT are well-known methods in the RL sequence modelling literature, and OCGDT is merely a combination of the two up to some minor changes to the network architecture. The motivation for combining them does not bring new insights either, as DT's inability to perform trajectory stitching and its suboptimal behavior under environment stochasticity are well-known, ongoing research questions. OCGDT does not add algorithmic design for addressing these fundamental issues.
- __The methods involved in the paper are outdated.__ Compared with IQL, ReBRAC [1] is an acknowledged stronger offline RL baseline with the use of actor and critic. For the value-guided DT, QT [2] is one of the current SOTA DT variants. I believe the performance could be stronger when QT and ODT are properly merged.
- __The performance of OCGDT is not appealing.__ In Table 1, OCGDT performs on par with ODT alone for CoacAI, while it performs on par with CGDT and even worse than ODT for Mayari. So, the combination of the two algorithms benefits little. Moreover, IQL is not a proper baseline, since it is not strong enough and it lacks sequence modeling, which is important for this long-horizon, sparse-reward environment.
__References:__
[1] Revisiting the Minimalist Approach to Offline Reinforcement Learning
[2] Q-value Regularized Transformer for Offline Reinforcement Learning
Please refer to the weaknesses. In addition, could the authors evaluate their method on canonical offline RL benchmarks, e.g., D4RL or Visual D4RL? |
Fully human-written |
|
Controlling a $\mu$RTS agent using Decision Transformers |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper implements both Critic-Guided Decision Transformers (CGDT) and Online Decision Transformers (ODT) in the Gym-$\mu$RTS domain, and also explores the combination of the two methods. The authors build a dataset from games played by two previous competition-winning bots (CoacAI, Mayari) and use this data to train their models, which match the performance of Implicit Q-Learning (IQL).
- This paper re-implements two different methods, combines them, and applies them to the Gym-$\mu$RTS domain. Given that the project has been open-sourced, this work may be useful to others.
- The presented algorithm runs on a desktop PC, in a much smaller amount of walltime than many other RL projects. This accessibility and sustainability is something that I believe is often under-valued.
- Sadly, this paper does not resemble something I would expect to see at a top-tier conference such as ICLR. The writing is quite poor, containing a strangely large number of very short sentences, making it unnatural to read. Furthermore, all of the Figures in this paper are significantly below the typical quality of this venue. I recommend that the authors spend some time reading over previously accepted work and more closely adopting their style.
- The paper only appears to test the proposed algorithm on a single task, with a single set of settings. While $\mu$-RTS is a challenging environment, using a single environment is generally a notable weakness. Testing on different map sizes or settings would have improved the paper.
- The paper has rather limited novelty - it mostly just combines two existing ideas together and adapting them to a new task. While the combination and re-implementation of algorithms can be very useful and sometimes worthy of acceptance, I don’t believe the provided results are groundbreaking enough to justify this.
- The agent’s training data is taken from CoacAI and Mayari, and then also evaluated against these same agents. It is generally poor practice to train and evaluate on the same data/agent.
- The paper does not appear to have a limitations section, which is a standard and important section in most papers.
- In Table 1, while I think the ablations were quite interesting, the use of A-G makes them quite difficult to read. Please consider using something like OCGDT + Online, OCGDT + Double Tuning, etc. The caption could still keep the detailed description, but this would make the results easier to digest.
- In a single paragraph, could you concisely summarize what the novelty of this paper is?
- Can this algorithm be applied to other environments? Could it improve performance?
- Given that this method appears to be very computationally light, perhaps a walltime vs performance graph against prior methods would be a nice way to demonstrate the utility of your method? |
Fully human-written |
|
Controlling a $\mu$RTS agent using Decision Transformers |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work re-implements two Decision Transformer variants, Online Decision Transformer (ODT) and Critic Guided Decision Transformer (CGDT), along with a widely used offline reinforcement learning method, Implicit Q-Learning (IQL). It further proposes a combined model named Online Critic Guided Decision Transformer (OCGDT) for the Gym-$\mu$RTS environment. Each method is first trained using datasets generated by rule-based $\mu$RTS competition winners, CoacAI and Mayari, and is then finetuned with online interaction. Among the RL approaches, OCGDT achieves the highest win rate against CoacAI, which empirically demonstrates that effective RL optimization in the $\mu$RTS remains a challenging task. Through a range of ablation studies, this paper explores which components of OCGDT are particularly difficult to optimize in the environment.
S1. (Clear and reproducible implementation details)
The paper provides detailed descriptions of the RL methods, including their architectures and training procedures.
S2. (Empirical performance of RL methods)
The results show that both DT-based methods and IQL exhibit low win rates when competing against rule-based winners. This finding highlights the difficulty of applying RL methods in the Gym-$\mu$RTS environment.
S3. (Ablation studies)
The ablation study varies several factors, such as buffer size, context window length, and the number of online steps in OCGDT. Through these experiments, it empirically reveals the challenges of online fine-tuning for OCGDT in the Gym-$\mu$RTS. It also emphasizes the importance of appropriately balancing offline data and online samples. However, one of the ablation results remains unclear, as discussed in Weakness W3.
W1. (Insufficient explanation of RL behaviors)
The paper lacks detailed analysis of how each RL method behaves in the $\mu$RTS. A more thorough explanation would help clarify how these models differ in decision-making.
W2. (Limited analysis of ablation results)
Despite the multiple ablation study results (OCGDT A to G), their interpretation appears limited. In particular, the difficulties regarding online fine-tuning in OCGDT seem to require additional analysis and discussion.
W3. (Ambiguity in description)
The difference between OCGDT and OCGDT-E in Table 1 is not clearly explained. The paper needs to specify what distinguishes the two settings and how this affects their performance.
**Minor**
- typo at line 329; With -> with
Could you provide further clarification regarding the weaknesses mentioned above, particularly the behavioral explanations of the RL methods and the interpretation of the ablation results? |
Lightly AI-edited |
|
Controlling a $\mu$RTS agent using Decision Transformers |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper explores an approach to playing real-time strategy games based on decision transformers, and therefore offline RL. The paper leverages two ideas to make DTs more amenable to an RTS setting, which are online DT and critic guided DT. It then combines these approaches to form OCGDT, or online critic guided decision transformers.
The paper converts the RTS setting into an offline RL problem with online fine-tuning by using $\mu$RTS and the associated Gymnasium framework. It collects data using two state-of-the-art, rule-based frameworks for playing RTS games, and both learns from such data and competes with said baselines. The paper also includes an implementation of IQL as an alternative offline RL baseline and compares its contributions with this well-established baseline.
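Since the IQL setup comes up again in the questions below, here are the standard IQL objectives (Kostrikov et al., 2021) for reference; this is a textbook sketch, not necessarily the exact variant the submission implements:

$$
L_V(\psi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[L_2^\tau\big(Q_{\hat\theta}(s,a) - V_\psi(s)\big)\big], \qquad L_2^\tau(u) = \big|\tau - \mathbf{1}\{u < 0\}\big|\,u^2,
$$
$$
L_Q(\theta) = \mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(r(s,a) + \gamma V_\psi(s') - Q_\theta(s,a)\big)^2\Big], \qquad
L_\pi(\phi) = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[\exp\!\big(\beta\,(Q_{\hat\theta}(s,a) - V_\psi(s))\big)\log\pi_\phi(a\mid s)\Big].
$$

Note that, unlike the DT variants, nothing here is conditioned on return-to-go, which is why the tokenization question raised below matters.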
The paper is well-written in terms of prose, level of detail, and most importantly the clear description of technical details. This is the case throughout, but an example is the description of the architecture; e.g., section 3 does well to describe how the approach "works" and is complemented nicely by Figure 1. Several other examples show up in the paper as well.
The idea is clever and the setting is quite interesting given that most offline RL papers are applied to the same benchmarks. There is some technical innovation in converting the problem and getting both the DT-based algorithms and IQL to fit within the $\mu$RTS setup.
The contribution of OCGDT needs to be made more clear. Is this novel? Non-trivial? The re-implementation of ODT and CGDT on $\mu$RTS problems is interesting in its own right, but again it is unclear if this is the contribution or if it is the creation of a new architecture / algorithm.
In general it is unclear if this approach actually worked. This is perhaps fine given that this is a new foray into this setting from offline RL. But the paper itself mentions that recently ML-based approaches have achieved competitive results in RTS settings. Why, then, is the proposed approach not competitive?
In my opinion, there are several important things to add to the paper (e.g. a more robust IQL description), and hopefully in the main body. Therefore as an editorial suggestion, things like lines 334-338, "Training is performed on a Windows 10 machine...", can then be moved to an appendix. These and other similar details are much appreciated and necessary but can likely be moved without degrading the quality of the paper.
1. The results are somewhat puzzling. My understanding is that the offline data consists of games between CoacAI and Mayari. In the parlance of offline RL, one might call these "expert" datasets. Why, then, do the methods not achieve parity with either of these baselines? In the case of IQL this is somewhat explainable, as it is largely doing imitation learning, and CoacAI/Mayari may be doing things out of distribution at test time. (And if that is the case, is the buffer size appropriate? Should it be larger?) Then, for the online methods, i.e. those of this paper, why don't they perform better?
2. In Table 2, the most interesting thing to me is that CGDT appears to be essentially even with IQL (or IQL with sufficient resources). Both of these are offline and therefore, they are imitating each other. Does this suggest that the DT part of the architecture doesn't really matter, and that online finetuning and/or online experience is of first-order importance?
3. Line 423, "This suggests a larger and more diverse dataset...". I agree with this conclusion but I do not think the ablation was necessary to reach it. The fact that IQL has parity with the DT approaches, and that neither is competitive with the expert baselines, suggests that something is amiss with the dataset or perhaps that offline RL is not the correct approach. In standard offline RL datasets, the underlying distributions of the environments are stationary; in the case of RTS, the agent is actually playing a game with the environment, and it (the "environment", which is CoacAI or whatever) changes its distribution according to how OCGDT (or whatever) is behaving. It seems like offline RL will never work for such a case, although the exploration of different and better datasets is encouraged.
4. Line 284: "The actor, the critic, and the value function have separate parameters for state representation". Is this simply saying that there are 3 different neural networks? What is meant by "state representation"? The paper could be improved by adding an architectural figure for IQL (probably in the appendix) and making the distinction between IQL details and the various DT details. These are very different things; for example, one doesn't do return-to-go conditioning in standard IQL. Furthermore, is there a transformer in the IQL setup? In other words, is IQL set up with the standard IQL loss functions, the set of neural networks, etc., but with the various neural networks also having transformers? How are they tokenized, and how is this different from DTs (which need return conditioning)? Section 3.3 is probably the least clear part of the paper, and IQL is not really described anyway (while the two variants of DT actually are explained, as well as their combination).
5. The setup to ensure a fair (or "reasonably" fair, or the most fair possible) comparison between IQL and OCGDT is much appreciated. To play devil's advocate, however, this might require a bit more justification. The argument here seems to be about getting an approximately equal number of "experiences" of the data and/or online interactions, and/or gradient steps. This is a solid foundation to start. But each method (IQL and OCGDT) has its own hyperparameters at play. So just as a thought experiment, what if Approach A has 100K tunable parameters and Approach B has 100B? In this admittedly extreme example, is it really appropriate to say that having the same "experience" yields a "fair" comparison?
Again, in line 362, DT-based methods require an order of magnitude fewer updates, but are they parameter efficient with respect to IQL? The subsequent text appears to explain this somewhat (i.e. the text about training for wall-clock time and number of gradient steps).
6. Did the authors consider other notions or heuristics for calculating (estimating? Imitating?) the lower entropy bound for $\mu$RTS? |
Fully human-written |
|
Can Text-to-Video Models Generate Realistic Human Motion? |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Movo, a novel and much-needed benchmark for evaluating the realism of human motion in text-to-video (T2V) generation. The authors convincingly argue that current state-of-the-art T2V models, despite their impressive visual fidelity, often produce human movements that are biomechanically implausible, leading to artifacts like foot-sliding and unnatural joint articulation. They posit that existing benchmarks are ill-equipped to detect these flaws as they primarily focus on pixel-level consistency, prompt fidelity, and overall aesthetics, while ignoring the underlying kinematics.
Movo's main contributions are threefold:
A curated, posture-focused dataset with camera-aware prompts designed to isolate specific human motions and minimize confounding factors from camera movement.
A suite of three complementary, kinematics-centric metrics—Joint Angle Change (JAC), Dynamic Time Warping (DTW), and a Multi-modal LLM-based Motion Consistency Metric (MCM)—that evaluate motion from the perspectives of joint articulation, temporal rhythm, and semantic consistency, respectively (a minimal sketch of how such quantities can be computed follows this list).
An extensive human validation study that demonstrates a strong correlation between Movo's automated scores and human perception of motion realism, confirming the benchmark's efficacy.
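To make the kinematic metrics concrete, here is a minimal sketch of how a joint-angle trajectory and a DTW alignment cost could be computed from 2D keypoints; this is not the authors' implementation, and the joint triplet, normalization, and aggregation choices are my own assumptions:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle (radians) at joint b formed by keypoints a-b-c, e.g. hip-knee-ankle."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def angle_trajectory(keypoints, triplet):
    """keypoints: (T, J, 2) per-frame 2D joints; triplet: indices (a, b, c)."""
    a, b, c = triplet
    return np.array([joint_angle(kp[a], kp[b], kp[c]) for kp in keypoints])

def dtw_cost(x, y):
    """Classic O(T1*T2) dynamic time warping cost between two 1D angle trajectories."""
    T1, T2 = len(x), len(y)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = abs(x[i - 1] - y[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2] / (T1 + T2)  # length-normalized alignment cost

# Hypothetical usage with COCO-style keypoints from a pose estimator:
gen_kp = np.random.rand(48, 17, 2)   # placeholder for generated-video keypoints
ref_kp = np.random.rand(60, 17, 2)   # placeholder for reference-video keypoints
knee = (11, 13, 15)                  # COCO indices: left hip, left knee, left ankle
gen_angles = angle_trajectory(gen_kp, knee)
ref_angles = angle_trajectory(ref_kp, knee)
jac_gap = abs(gen_angles.mean() - ref_angles.mean())   # crude JAC-style discrepancy
rhythm_gap = dtw_cost(gen_angles, ref_angles)          # DTW-style rhythm discrepancy
```

Even a toy version like this makes clear that both JAC and DTW ultimately depend on keypoint quality, which is relevant to the robustness concerns raised below.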
By evaluating 14 leading T2V models, the paper provides a comprehensive snapshot of the current landscape, revealing systemic weaknesses in generating realistic human motion and highlighting the significance of their proposed evaluation paradigm.
(1) Originality and Significance: The paper's primary strength lies in its originality and high significance. It is, to my knowledge, the first work to propose a comprehensive, kinematics-centric benchmark for human motion realism in T2V. It fundamentally shifts the evaluation paradigm from "does it look good?" to "does it move correctly?". As T2V models are increasingly used to simulate reality, this work addresses a critical bottleneck for applications requiring physical and biological plausibility (e.g., synthetic data for robotics, sports analysis, AR/VR). Movo has the potential to become a standard benchmark in the field.
(2) Quality and Rigor: The quality of the research is outstanding. The benchmark is thoughtfully designed, from the careful taxonomy of the dataset to the multi-faceted metric suite. The execution of the experiments, involving 14 prominent models (including giants like Sora and Veo 3), is comprehensive and provides an invaluable service to the community. The strong human-in-the-loop validation solidifies the benchmark's credibility.
(3) Clarity: The paper is written with exceptional clarity. The authors articulate a complex problem and their sophisticated solution in a manner that is accessible yet detailed. The motivation is compelling, and the link between the identified problems and the proposed solutions is crystal clear.
(4) Actionable Insights: The results are not just a leaderboard; they provide actionable insights. For instance, the finding that models struggle with fine-grained lower-limb coordination or that DTW can expose rhythm drift even in visually smooth videos gives concrete directions for future model development.
(1) Dependency on Pose Estimator: The entire evaluation pipeline is contingent on the performance of the underlying pose estimator (RTMPose). T2V models can generate artifacts (e.g., blurred limbs, extra limbs) that might cause pose estimators to fail or produce noisy outputs. The paper does not discuss the potential impact of pose estimation errors on the final evaluation scores. A brief discussion on the robustness of RTMPose on generated content or an analysis of failure cases would strengthen the paper's claims of reliability.
(2) Lack of Analysis on Individual Metric Contribution: The paper shows a high correlation between the average of the three metrics and human scores. However, it does not provide an analysis of how each metric (JAC, DTW, MCM) individually correlates with human judgment. Such an analysis could reveal, for example, whether humans are more sensitive to incorrect joint angles (JAC) or poor rhythm (DTW), providing deeper insights into human perception of motion (a minimal sketch of such a per-metric analysis follows this list).
(3) The benchmark's core methodology is fundamentally limited by its reliance on a ground-truth reference video for its primary metrics (JAC and DTW). This introduces several critical flaws:
(a) Reduces Evaluation to Similarity Matching: It relegates the evaluation from a true assessment of generation plausibility to a task of similarity matching. Consequently, Movo cannot evaluate the realism of novel prompts (e.g., "an astronaut doing a backflip on the moon") for which no reference video exists, thereby restricting its scope to a predefined set of common actions.
(b)Creates a Single-Reference Bias: The approach penalizes plausible motion variations (e.g., differences in speed, style, or execution) simply because they deviate from the one chosen exemplar. This conflates stylistic difference with a lack of realism, potentially punishing valid and creative outputs.
(4) Details of the MCM "Judge": The Motion Consistency Metric (MCM) relies on a multi-modal LLM, so the reliability and potential biases of this "judge" are important: it may attend to factors such as photorealism or artistic style rather than the pure kinematics of the motion. This creates a risk that the metric rewards aesthetic alignment over biomechanical correctness.
(5) From "Standard Exercises" to "Everyday Motion": The benchmark is constructed around 10 specific fitness exercises. These are highly structured, often periodic activities with well-defined kinematic patterns. However, the paper’s title and conclusions aspire to a much grander goal. There is a substantial chasm between the biomechanics of a gym squat and the complex, unpredictable motions encountered in the real world. For example, motions such as a person slipping on a wet surface, a toddler learning to walk with unsteady steps, or two people navigating a crowded street are characterized by non-periodic, reactive, and interactive movements. These chaotic, emergent scenarios represent the true challenge for T2V models aiming to simulate reality, and the conclusions drawn from Movo's controlled environment may not generalize to these far more complex situations.
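A per-metric correlation breakdown, as requested in point (2) above, would be inexpensive to add; a minimal sketch (the score arrays here are random placeholders, not the paper's data):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
human = rng.random(100)   # placeholder per-video human realism scores
metric_scores = {         # placeholder per-video automatic scores
    "JAC": rng.random(100),
    "DTW": rng.random(100),
    "MCM": rng.random(100),
}

for name, scores in metric_scores.items():
    rho, p = spearmanr(scores, human)
    print(f"{name}: Spearman rho = {rho:.3f} (p = {p:.3g})")
```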
(1) On Pose Estimator Robustness: How did you handle cases where the RTMPose estimator might have failed or produced unreliable keypoints due to artifacts in the generated videos? Did you filter out such cases, and if so, how might this affect the overall model rankings? Could you comment on the sensitivity of your metrics to noise in the keypoint data?
(2) On Individual Metric Correlation: Could you provide a breakdown of the correlation with human scores for each of your three metrics (JAC, DTW, MCM) individually? This would be very insightful for understanding which aspects of motion realism are most salient to human observers and would further validate the contribution of each component of your metric suite.
(3) On Extending Movo: The current dataset focuses on well-defined, single-person fitness motions. Do you have plans or thoughts on how the Movo framework could be extended to evaluate more complex, less structured, or interactive motions, such as dancing or team sports, where realism is equally crucial but harder to define?
(4) On the MCM Metric: Could you provide a brief summary in the main text of the MLLM used for MCM and the core of its prompt? Given that different MLLMs can have different biases and capabilities, how did you ensure the consistency and reliability of this metric? |
Fully AI-generated |
|
Can Text-to-Video Models Generate Realistic Human Motion? |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces MOVO, a kinematics-centric benchmark for evaluating human motion realism in text-to-video (T2V) models. MOVO includes a posture-focused dataset, three novel metrics (JAC, DTW, MCM), and human validation studies. The benchmark is applied to 14 T2V models, revealing gaps in biomechanical plausibility and temporal consistency. The work is timely and relevant, addressing critical shortcomings in existing T2V benchmarks.
- Addresses a critical gap in T2V evaluation—human motion realism.
- Introduces kinematics-aware (JAC), rhythm-sensitive (DTW), and structure-consistent (MCM) metrics.
- Limited diversity, e.g., lacks complex motions like multi-person interactions.
- Camera-motion disentanglement is claimed but not clearly demonstrated.
- Lacks deeper insights into why models perform differently across actions.
Please see Weaknesses for details. |
Moderately AI-edited |
|
Can Text-to-Video Models Generate Realistic Human Motion? |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Movo, a new benchmark for evaluating the realism of human motion in videos generated by text-to-video (T2V) models. Movo consists of three main components: a "posture-focused" dataset with prompts designed to isolate specific human actions, a set of "skeletal-space" metrics (JAC, DTW, and MCM) to quantify motion realism, and human validation studies to correlate these metrics with human perception. The paper evaluates 14 T2V models using the Movo benchmark and finds that while some models excel at specific motions, there are still significant gaps in generating consistently realistic human movements.
1. This paper is well-written and it is easy to follow.
2. The Movo benchmark is well-designed and comprehensive. The three proposed metrics—Joint Angle Change (JAC), Dynamic Time Warping (DTW), and Motion Consistency Metric (MCM)—provide a multi-faceted approach to evaluating motion realism, capturing different aspects from joint articulation to temporal consistency.
1. The Movo dataset, while a good starting point, is limited to a relatively small set of 10 different human motions. This may not be representative of the full range of human movements, and it would be beneficial to expand the dataset to include a more diverse set of actions in future work.
2. The proposed metrics rely on the output of a pose estimation model to extract skeletal keypoints from the generated videos. The accuracy of these metrics is therefore dependent on the accuracy of the pose estimation model. It would be valuable to analyze the sensitivity of the Movo benchmark to errors in pose estimation and to consider alternative approaches that are less reliant on this intermediate step.
3. The MCM is a binary metric that simply indicates whether a multi-modal large language model (MLLM) judges two videos as having "similar" or "not similar" motion. This is a rather coarse measure of motion consistency, and it would be beneficial to develop a more nuanced metric that can capture the degree of similarity or dissimilarity between two motions.
4. The paper does not provide many details about the MLLM used for the MCM, other than it being a "multi-modal large language model." The specific model used and the prompts provided to it could significantly influence the results. More transparency on this aspect would strengthen the reproducibility of the work.
1. The paper mentions the use of Gemini-2.5 Pro and GPT-4o for generating and refining video descriptions. Could the authors elaborate on the specific roles of each model in this process and provide more details on the prompts used to guide these models?
2. The human validation study is a crucial part of the paper. Could the authors provide more information about the demographics of the human annotators and the instructions they were given? Were the annotators experts in biomechanics or motion analysis?
3. How robust are the proposed metrics to variations in video quality, such as compression artifacts or motion blur? Have the authors conducted any experiments to evaluate the performance of the Movo benchmark under such conditions?
4. The paper evaluates a number of proprietary, closed-source T2V models, including Sora. Given the limited access to these models, how did the authors ensure a fair and comprehensive evaluation? Could the authors provide more details on the methodology used to generate videos from these models? |
Fully AI-generated |
|
Can Text-to-Video Models Generate Realistic Human Motion? |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes Movo, a kinematics-centric benchmark asking whether text-to-video (T2V) systems generate biomechanically realistic human motion. Movo couples (i) a posture-focused dataset of 10 actions (six lower-body, four upper-body) with camera-aware prompts, (ii) three skeletal-space metrics—Joint Angle Change (JAC), Dynamic Time Warping (DTW), and a binary Motion Consistency Metric (MCM) judged by an MLLM, and (iii) human validation via pairwise preferences. Using these, the authors evaluate 14 open and proprietary models and report high metric–human correlations on several actions.
The paper is well-motivated: it highlights that many T2V clips “look right but move wrong,” and it argues convincingly that existing leaderboards over-reward pixel-space smoothness and text alignment while missing kinematics, rhythm, and camera-motion disentanglement—gaps that matter for realistic human movement. Methodologically, the benchmark is body-centric and interpretable. JAC targets joint-angle trajectories. DTW measures temporal phase/rhythm alignment in pose space. And MCM checks high-level motion consistency, making the evaluation actionable for diagnosing foot-slide, contact violations, or off-phase coordination. The authors run human validation and report strong correlations between Movo scores and pairwise human preferences across multiple actions (e.g., Walking ρ≈0.99), lending credence to the metrics. The experimental setup is transparent: the pipeline detects people with YOLO-X, extracts skeletons with RTMPose (including hands when needed), and fixes seeds/hyperparameters for open models.
1. By design, Movo focuses on skeletal kinematics and rhythm, leading to a narrow scope relative to general-purpose suites (e.g., VBench). In this case, the evaluation metrics and test set should be as comprehensive as possible for human videos. However, the proposed three metrics operate on detected skeletons, so systematic pose-estimation errors (occlusion, clothing, unusual viewpoints) propagate directly into scores. Besides, MCM is a binary MLLM judgment (“similar”/“not similar”), which the authors acknowledge can mask subtle fidelity gaps. Such discretization reduces sensitivity and may be unstable across prompts/models. Moreover, dataset coverage is limited and may not represent “human motion” broadly. The evaluation set consists of ten exercise-style actions (deadlift, squat, walking, etc.), a consciously simplified taxonomy the authors justify, but which excludes many everyday or multi-agent motions (sitting/standing transitions, dancing with turns, interactions, sports with equipment), raising questions about representativeness. Camera-aware prompts further restrict camera dynamics that many T2V systems must handle.
2. Comparisons across models are uneven. Sora was evaluated on only 10 prompts per category (access-limited), and Veo was accessed only via its hosted API defaults, making some leaderboard conclusions preliminary and harder to compare apples-to-apples.
3. Beyond running many open-source and commercial-level models, the paper does not provide many insights into how to train or improve T2V models on human videos, which makes its contribution less convincing.
Please see the weaknesses. |
Moderately AI-edited |
|
Parameter-Efficient Fine-Tuning of LLMs with Mixture of Space Experts |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents a novel and compelling approach to Parameter-Efficient Fine-Tuning (PEFT) by integrating Low-Rank Adaptation (LoRA) with a Mixture-of-Experts (MoE) framework across heterogeneous geometric spaces (Euclidean, Hyperbolic, Spherical). The proposed MoS and MoSELoRA methods dynamically route tokens to the most suitable space, enhancing the model's ability to capture diverse semantic structures.
1) Mixture of Space (MoS) considers more kinds of constant curvature spaces with learnable Gaussian curvature.
2) Ablation experiments make sense.
3) A comprehensive introduction to the references is given.
Quality:
1) The proposed method has a serious flaw in its geometric interpretation, manifested as structural inversion. Specifically, linear layer transformations should be performed in the embedding space, and vector additions should be carried out in the full space to present a clear geometric meaning, not the other way around. Further, no experiment verifies the advantage of learnable-curvature families over fixed-curvature families, and no theoretical evidence indicates any advantage of introducing non-Euclidean geometry for LoRA fine-tuning. Besides, there is a lack of experimental comparison with HypLoRA [1] and HELM [2].
Clarity:
1) The fine structure of Figure 1 is unclear (the projection is misleading, the learnable curvature is not reflected, and there is no evident basis for the routing selection).
2) The actual method associated with Figure 3 is unclear and there is no comparison with traditional MoE methods.
3) The preliminary knowledge is poorly organized. In fact, the discussion about the exponential map is not beneficial for understanding the proposed method in this paper.
Significance:
1) The performance improvement shown in Table 1 is not very significant.
[1] Hyperbolic fine tuning for large language models. arXiv preprint arXiv:2410.04010, 2024.
[2] Hyperbolic large language models via mixture-of-curvature experts. arXiv preprint arXiv:2505.24722, 2025.
see the Weaknesses. |
Fully human-written |
|
Parameter-Efficient Fine-Tuning of LLMs with Mixture of Space Experts |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
In this paper, the authors propose an MoSELoRA framework for parameter-efficient fine-tuning (PEFT) of large language models (LLMs). This framework integrates multiple geometric spaces such as Euclidean, hyperbolic, and spherical, into a unified architecture. The key idea is that these different linguistic structures are better represented in different geometric manifolds. In this regard, the MoSELoRA extends Low-Rank Adaptation (LoRA) by incorporating a Mixture of Space (MoS) approach, where tokens are dynamically routed to geometric experts based on input context. The authors introduce a lightweight routing mechanism and curvature-aware optimization to reduce computational overhead and improve training stability. Experiments on natural language understanding and mathematical reasoning benchmarks demonstrate consistent improvements over strong baselines.
1. The integration of multiple constant-curvature spaces into PEFT is original and well-motivated by linguistic and structural properties of language.
2. The use of the stereographic projection and the Lorentz model for efficient manifold transitions is elegant and avoids expensive exp/log mappings (the underlying $\kappa$-stereographic operations are sketched after this list).
3. The paper includes curvature dynamics analysis, optimizer comparisons, and single-space expert variants to validate design choices.
4. The proposed approach demonstrates speedups and reduced parameter activation compared to other MoE-based methods.
5. MoSELoRA outperforms state-of-the-art PEFT methods across multiple benchmarks, especially in mathematical reasoning tasks.
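To make point 2 above concrete, the unified $\kappa$-stereographic operations from the gyrovector-space literature are reproduced below; the paper presumably builds on a variant of these, and the exact parameterization used in MoSELoRA is an assumption on my part:

$$
\tan_\kappa(r) =
\begin{cases}
\frac{1}{\sqrt{-\kappa}}\tanh\big(\sqrt{-\kappa}\,r\big), & \kappa < 0 \ (\text{hyperbolic}),\\
r, & \kappa = 0 \ (\text{Euclidean}),\\
\frac{1}{\sqrt{\kappa}}\tan\big(\sqrt{\kappa}\,r\big), & \kappa > 0 \ (\text{spherical}),
\end{cases}
\qquad
\exp_0^{\kappa}(v) = \tan_\kappa\big(\lVert v\rVert\big)\,\frac{v}{\lVert v\rVert},
$$
$$
x \oplus_\kappa y = \frac{\big(1 - 2\kappa\langle x,y\rangle - \kappa\lVert y\rVert^2\big)\,x + \big(1 + \kappa\lVert x\rVert^2\big)\,y}{1 - 2\kappa\langle x,y\rangle + \kappa^2\lVert x\rVert^2\lVert y\rVert^2}.
$$

Because all three curvature regimes reduce smoothly to the Euclidean case as $\kappa \to 0$, a single formula of this form is what makes a learnable curvature tractable, and it is presumably why the claimed speedup over a full exp/log round trip is possible.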
1. The paper is focused entirely on LLMs and NLP benchmarks. It is unclear how this approach will be applied to multimodal LLMs unless extended to vision-language models or geometric representation learning in vision.
2. The routing mechanism that assigns tokens to geometric experts is described as “lightweight” and based on token-level projections, but the mathematical formulation lacks clarity. There is no theoretical guarantee that the routing leads to optimal space selection.
3. The paper does not analyze the stability or convergence properties of curvature learning. For instance, how does curvature interact with gradient flow in high-dimensional manifolds? Also, does the routing converge to a stable assignment? Is it differentiable and robust to noise?
4. There is no in-depth analysis of why certain tasks (e.g., math reasoning) benefit from hyperbolic embeddings beyond empirical observation. A theoretical framework linking task structure to manifold geometry would strengthen the claims.
5. While the method generalizes well to unseen math problems, its performance on other domains (e.g., commonsense reasoning, multi-hop QA) is not explored.
1. Can MoSELoRA be extended to vision-language models? Have the authors considered applications in geometric scene understanding? The current evaluation is limited to LLMs and NLP benchmarks, with no experiments on multi-modal reasoning or vision-language grounding.
2. Is there any theoretical guarantee that the routing leads to optimal space selection? Could the authors provide visualizations of token embeddings across different spaces to support interpretability?
3. Does the routing converge to a stable assignment, and is it robust to noise? Are curvature parameters sensitive to initialization, and would curvature regularization improve stability?
4. Can the authors provide a theoretical framework linking task structure to manifold geometry?
5. The authors do not provide the source code. Can they clarify implementation details or release the code to ensure reproducibility? |
Fully AI-generated |
|
Parameter-Efficient Fine-Tuning of LLMs with Mixture of Space Experts |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces MoSELoRA, a new fine-tuning method that generalizes the Mixture of LoRA Experts (linear mapping in the Euclidean space) to Mixture of Space Experts (non-Euclidean geometric mappings, including hyperbolic and spherical spaces). The authors proposed a unified representation of the mapping for all three considered spaces. Numerical experiments have been reported on several standard fine-tuning benchmarks, demonstrating the effectiveness.
1. The idea of the mixture of spaces is interesting.
2. The development of the lightweight token routing mechanism and the unified mapping for three spaces is interesting.
3. The proposed simplification achieves an acceleration of the computation for the geometric mapping.
1. The overall scope of the experiments is somewhat limited in terms of the evaluated models, model sizes, datasets, and tasks. Expanding the experimental coverage would strengthen the empirical claims.
2. The paper remains somewhat vague in several important aspects. The clarity and presentation could be improved by providing more details, such as:
- how each expert is selected during the forward pass
- how the routing value is computed for each token and expert
- what the auxiliary loss is
- what is the resulting full fine-tuning method
- which rank is used for each method in the Table in the experiments
- what are the hyperparameters that are used for all other baseline methods and are they optimally tuned,
- what is the total number of trainable parameters for all the other baseline methods.
3. The advantage of the proposed method over using only the hyperbolic space appears relatively small for the current setup. It might be more interesting to see the following experiments:
- The paper notes that different geometric spaces may be better suited to different datasets. Suppose dataset A aligns best with Euclidean geometry. Would the trained MoSELoRA model, after being trained on a large and diverse dataset (as in the current setup), tend to select the Euclidean space more frequently during inference when being evaluated on dataset A?
- Conversely, if MoSELoRA were trained only on dataset A (which is well-suited to the Euclidean space), would it again favor the Euclidean space more often during training and inference?
1. It is a bit surprising that LoRA with a higher rank and DoRA underperform the base LoRA.
2. Are there any reasons why the target modules do not include the q, k, and v projections? |
Fully human-written |
|
Parameter-Efficient Fine-Tuning of LLMs with Mixture of Space Experts |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces MoSELoRA (Mixture of Space Experts LoRA), a parameter-efficient fine-tuning (PEFT) framework that integrates heterogeneous constant-curvature spaces (hyperbolic, spherical, and Euclidean) within a unified Mixture of Space (MoS) formulation. Unlike prior LoRA variants that operate purely in Euclidean geometry, MoSELoRA dynamically assigns token representations to curvature-specific subspaces through a lightweight routing mechanism, thereby adapting the fine-tuning process to the intrinsic geometry of language data. The method learns curvature parameters end-to-end, stabilizes training via separate optimizers for curvature and LoRA weights, and ensures gradient boundedness across all geometric spaces. Empirical results demonstrate consistent improvements on benchmarks.
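As a concrete illustration of the design described above, a token-level router over curvature experts combined with per-expert LoRA updates might look like the sketch below. This is a hypothetical reconstruction, not the authors' code: the expert spaces are approximated by the $\kappa$-stereographic exp map at the origin, and all names, ranks, and initial curvatures are my own assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoSLoRALayer(nn.Module):
    """Hypothetical mixture-of-space LoRA adapter added to a frozen d_in -> d_out linear layer."""
    def __init__(self, d_in, d_out, r=8, n_experts=3):
        super().__init__()
        self.router = nn.Linear(d_in, n_experts)            # lightweight token-level router
        self.lora_A = nn.Parameter(torch.randn(n_experts, d_in, r) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(n_experts, r, d_out))
        self.kappa = nn.Parameter(torch.tensor([-1.0, 0.0, 1.0]))  # learnable curvature per expert

    def exp0(self, v, kappa, eps=1e-6):
        """exp map at the origin of a kappa-stereographic space (identity when kappa = 0).
        No domain clamping is done here for brevity."""
        norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
        sk = kappa.abs().clamp_min(eps).sqrt()
        scaled = torch.where(kappa < 0, torch.tanh(sk * norm) / sk,
                 torch.where(kappa > 0, torch.tan(sk * norm) / sk, norm))
        return v / norm * scaled

    def forward(self, x):                                    # x: (batch, seq, d_in)
        gate = F.softmax(self.router(x), dim=-1)             # (batch, seq, n_experts)
        out = torch.zeros(x.shape[:-1] + (self.lora_B.shape[-1],), device=x.device)
        for e in range(self.lora_A.shape[0]):
            delta = x @ self.lora_A[e] @ self.lora_B[e]      # low-rank update from expert e
            out = out + gate[..., e:e + 1] * self.exp0(delta, self.kappa[e])
        return out  # added to the frozen base layer's output, as in standard LoRA

# Usage sketch: y = frozen_linear(x) + MoSLoRALayer(1024, 1024)(x)
```

A sketch like this also makes the open questions below more precise, e.g. whether the softmax gate converges to stable space assignments and how gradients flow into the curvature parameters.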
1. The paper presents a well-motivated observation that existing PEFT methods assume flat Euclidean geometry, which may be suboptimal for hierarchical or cyclic semantic structures. MoSELoRA bridges this gap by modeling curvature diversity through a mixture of geometric experts.
2. The lightweight routing and space-mapping method achieves an over 4x speedup compared with the standard exp–log scheme.
3. The curvature evolution analysis demonstrates interpretable geometry adaptation (Fig. 2): lower transformer layers remain near-Euclidean, while higher layers evolve toward hyperbolic or spherical curvature depending on task semantics
1. Experiments are limited to the Qwen2-1.5B model. There is no analysis on how this method scales for larger models or different architecture
2. Efficiency is only reported in terms of runtime, it is unclear if there is any memory overhead or advantages
3. The motivation for employing the curvature spaces (hyperbolic, spherical, Euclidean) is well supported by prior literature. However, the paper does not empirically justify the necessity of using all three simultaneously. Table 3 compares single-space variants against the full mixture but omits two-expert combinations (e.g., hyperbolic + Euclidean), leaving it unclear whether the full tri-expert setup is essential or if similar benefits could be achieved with fewer experts and reduced complexity.
Please refer to the weaknesses.
* Would integrating MoSELoRA with quantization-based (e.g., QLoRA [1]) or gradient-efficient PEFT methods (e.g., GaLore [2]) yield similar performance and efficiency gains? |
Fully AI-generated |
|
QuRL: Rubrics As Judge For Open-Ended Question Answering |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper examines the automatic creation of rubrics for a given open-ended question $q$. The idea is to induce a set of useful rubrics for scoring generation output so as to allow an RLVR-like mechanism to apply to such questions. The approach uses search engine API ranking results as a proxy for an implicit preference ordering over documents relevant to $q$. Finally, with the stable of metrics derived from a limited training set (also provided by the authors), the authors apply GRPO as the RL algorithm to tune the output, resulting in visible improvements on 3 benchmarks.
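For context on how such rubrics plug into RL training, the core idea can be sketched as below; the rubric format and the judge call are hypothetical, and the group-normalized advantage is the standard GRPO formulation rather than anything specific to this submission:

```python
import statistics

def rubric_reward(answer: str, rubrics: list[str], judge) -> float:
    """Score an answer as the fraction of case-wise rubric criteria satisfied.
    `judge(answer, criterion)` is a hypothetical LLM call returning True/False."""
    hits = sum(1 for criterion in rubrics if judge(answer, criterion))
    return hits / max(len(rubrics), 1)

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standard GRPO advantage: normalize each sampled answer's reward within its group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Usage sketch: for each training question, sample G answers from the policy,
# score each against that question's rubrics, and use the normalized scores
# as the advantages in the GRPO policy-gradient update.
# rewards = [rubric_reward(a, rubrics_for_q, judge) for a in sampled_answers]
# advantages = grpo_advantages(rewards)
```

This framing is also where my cost concern below comes from: the rubrics (and the judge calls) are per question, so the inference cost of producing and applying them matters.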
* Contributes a training set for RL tuning of such open-ended questions (albeit quite small: 800 train/400 test)
* Substantial baselining against a range of models for useful input.
* Conducts a human evaluation to help assess consistency, shows that the correlation is not high (although not discussed much).
* Contributes a useful introspection of the aligned / created metrics in the **4.5 Case Study** section.
* The submission is generally ok, but suffers from a lack of proofreading: the arguments are not crisp and precise.
* In the face of prior work (see **Questions**), I find the work incremental, as there has been directly relevant work on inferring rubrics from questions.
* The work implies that the metrics are created _per question_ and, as described, over _multiple runs_ (but I may be mistaken, since the training seems to be done over the entire set of rubrics for the training set). This seems overly expensive, but the inference cost of creating such metrics is not discussed.
* Some main text figures reference key metadata only present in the Appendices. This makes the paper reliant on the supplemental materials, and hence I would judge it as not fitting within the length requirements for ICLR.
Other minor problems
* The "contributions" final paragraph of the intro overlaps significantly with the rest of the work, it'd be better to leave it out or at least de-duplicate the text with the rest of the intro.
* 090: Deepseek is misspelled.
* 121-124: your claim should mention on what data. The gains could not be properly scoped otherwise.
* It seems you use $\LaTeX$ for typesetting. Please use the appropriate quote structure (e.g., 301-303 caption)
* The ethics statement is mis-used. The ethics block should not be used to describe annotation guidelines and data capture; that should be part of the normal space requirements.
* References should be appropriately checked for correct formatting. Many titles have capitalisation protection errors (e.g. "llms" vs. "LLMs").
* There are a few relevant works that work on inferring metrics from tasks that don't have specific metrics, which are relevant means to induce guidelines/metrics that seem to be missing from related work. See:
* Minzhi Li, Zhengyuan Liu, Shumin Deng, Shafiq Joty, Nancy Chen, and Min-Yen Kan (2025) DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation. In Proceedings of the 31st International Conference on Computational Linguistics, pages 2277–2290, Abu Dhabi, UAE. Association for Computational Linguistics.
* Do Xuan Long, Duong Ngoc Yen, Do Xuan Trong, Luu Anh Tuan, Kenji Kawaguchi, Shafiq Joty, Min-Yen Kan and Nancy F. Chen (2025) Beyond In-Context Learning: Aligning Long-form Generation of Large Language Models via Task-Inherent Attribute Guidelines. In Proceedings of the Findings of the Association for Computational Linguistics (ACL '25), Vienna, Austria.
* 330: how important is the cold-start SFT to the work? While not a contribution it is important to know.
* 296: is this meant to say `w/o rlfh`? Also: I'm not sure whether these are additive ablations or single ablations. Notation might help.
|
Fully human-written |
|
QuRL: Rubrics As Judge For Open-Ended Question Answering |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors propose a new/novel approach to improving the performance of LLMs on Open Ended QA. This paper proposes QuRL (Open-Ended Question Answering with Rubric-guided Reinforcement Learning), a framework that transforms human-authored web sources into case-wise rubrics.
The key contributions are:
• QuRL, a framework that leverages internet text to construct case-wise rubrics as reward signals for open-ended question answering
• With the assistance of QuRL, two new datasets have been created. QuRL-Train dataset consisting of 800 Question–Rubric pairs, along with a QuRL-Test dataset of 400 entries that underwent human verification.
The main strength of the paper is in proposing a novel method to obtain rubrics on a case basis and use these rubrics to help train the LLMs to give an answer that is more human like. The authors have conducted various experiments using a variety of LLMs comparing their approach to some existing benchmarks. Thorough implementation details and results are a notable strength. The results indicate that QuRL achieves strong correlation with human judgments, outperforming other existing approaches.
Nothing major; however, the impact of the work is difficult to understand from a practical use-case point of view. It appears the authors have not studied the case where the LLM misrepresents facts in an answer, or how the rubrics penalize such behavior.
1. The paper would become easier to follow if a couple of examples of the generated rubrics were shown as part of the main paper.
2. A minor typo in line 90 - Deepseek |
Fully human-written |
|
QuRL: Rubrics As Judge For Open-Ended Question Answering |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a rubric-guided, RL-based framework for open-ended QA. The authors claim that the GRPO-based approach with case-wise rubrics significantly improves QA performance on multiple evaluation benchmarks. The main contributions of this paper are as follows:
1. QuRL, a framework that leverages internet text to build case-wise rubrics as a reward signal for open-ended QA.
2. A new QuRL-Train dataset with question-rubric pairs, along with a human-verified test set for evaluation.
The main strengths of the paper are as follows:
1. The question-rubric train and test sets might be useful for researchers.
2. Rubric-based reward modeling seems to have some novelty, and it is interesting.
3. Detailed experiments and comparisons with SOTA LLMs support the claim that QuRL improves the average performance on multiple benchmarks.
4. Human verification of the dataset improves trust and increases quality.
5. The consistency analysis between the automatic evaluation method and human judgements is interesting.
The main weaknesses of the paper are as follows:
1. The proposed approach relies on search-engine results for the initial evidence.
2. The LLM used for meta-description generation may introduce its own errors and biases.
3. The design of the rubrics may be subjective and may have generalizability issues.
1. How are the rubric parameters decided? Are they domain-specific?
2. How strong is the reward signal coming from the rubrics?
3. How do agentic QA methods like ReAct compare against QuRL? |
Fully human-written |
|
Improved Sample Complexity Bounds For Diffusion Model Training Without Empirical Risk Minimizer Access |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper studies the sample complexity of learning a diffusion model under a PL condition, which is claimed to be a weaker assumption than convexity, rather than assuming access to an empirical risk minimizer (ERM) as in previous works. Under this assumption, it achieves an $O(\epsilon^{-4})$ bound, which improves upon the $O(\epsilon^{-5})$ bound from previous works.
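For reference, the Polyak-Lojasiewicz (PL) inequality as it is usually stated (the paper's exact variant may differ in constants or in being imposed only on the empirical loss) is

$$
\tfrac{1}{2}\,\lVert \nabla \mathcal{L}(\theta)\rVert^2 \;\ge\; \mu\,\big(\mathcal{L}(\theta) - \mathcal{L}^{\star}\big) \quad \text{for all } \theta,
$$

which, together with $L$-smoothness, gives linear convergence of gradient descent with step size $1/L$:

$$
\mathcal{L}(\theta_{k+1}) - \mathcal{L}^{\star} \;\le\; \Big(1 - \tfrac{\mu}{L}\Big)\big(\mathcal{L}(\theta_k) - \mathcal{L}^{\star}\big).
$$

The condition is implied by strong convexity but also holds for some nonconvex losses (e.g., in certain overparameterized regimes), which is the sense in which it is a relaxation.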
This paper points out an error from Gupta et al (2024) which I appreciate -- it does seem that the paper reports the wrong bound in the simplified theorem in the main body of the paper -- however, it's also the case that if one just directly plugs in the bound on $K$ stated in Theorem C.2 into the bound in equation (35), one recovers the correct $O(\epsilon^{-5})$ bound. Moreover, the main contribution of Gupta et al (2024) was to improve on the dependence on the Wasserstein error and dependence on depth of the neural network -- the dependence on $\epsilon$ was not a focus.
I also like that this paper does attempt to circumvent the need for ERM access assumed in previous works. The improvement in the $\epsilon$ dependence as a result of the new PL assumption is interesting.
The PL assumption still seems quite strong -- it only seems to apply to local convergence for overparameterized neural networks, which is still a very restrictive setting. In general, the paper suffers from overselling its contributions -- it's not as though the authors are able to remove the (admittedly unrealistic) ERM access assumption altogether -- it is simply replaced with a different unrealistic assumption, that likely also does not hold in practice.
The flaw pointed out in Gupta et al (2024) is not stated with appropriate context -- it's the case that Theorem C.5 as stated *is correct*; the error only appears in the simplified theorem statement.
Moreover, unlike the polynomial dependence in Gupta et al (2024), this paper has an *exponential* dependence on the depth of the neural network.
There are several typos in the paper, and the presentation in general is quite poor. For instance, Assumption 1 is a second moment assumption on the data distribution, but immediately after, the authors say "In contrast, our analysis only requires the data distribution to be *sub-Gaussian*, making our results applicable to a significantly broader class of distributions." which is much stronger than a second moment assumption.
Overall, this paper has the potential to be a good contribution, but is not yet ready for publication in my opinion.
- Is it possible to remove the exponential dependence on depth? |
Fully human-written |
|
Improved Sample Complexity Bounds For Diffusion Model Training Without Empirical Risk Minimizer Access |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper analyzes the sample complexity of diffusion models without access to an empirical risk minimizer of the score-estimation loss. The authors decompose the total variation distance between the true and estimated distributions into three different parts and bound each part independently, accounting for NN approximation error, optimization error, and statistical error.
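Schematically, and as my own paraphrase rather than the paper's exact statement, the structure of the bound is

$$
\mathrm{TV}\big(p_{\mathrm{data}},\,\hat{p}\big)\;\lesssim\;\varepsilon_{\mathrm{dis}}\;+\;\varepsilon_{\mathrm{approx}}\;+\;\varepsilon_{\mathrm{stat}}\;+\;\varepsilon_{\mathrm{opt}},
$$

where $\varepsilon_{\mathrm{dis}}$ accounts for the reverse-process discretization and early stopping, and the remaining three terms bound the score-estimation error via network approximation capacity, finite samples, and the suboptimality of the SGD iterate, respectively; each term is then controlled separately.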
The method is new. Considering finite NN approximation error and optimization error makes the analysis more realistic. Addressing the optimization error without assuming access to empirical risk minimizer is also valuable.
1. Some typos impede readability and understanding; see the questions below.
2. The paper seems to be missing some related prior work (e.g., [1]).
3. In the abstract, the authors state "our structured decomposition of the score estimation error into statistical and optimization components offers critical insights into how diffusion models can be trained efficiently", which does not seem well justified.
[1] Analyzing Neural Network-Based Generative Diffusion Models through Convex Optimization: https://arxiv.org/abs/2402.01965
Is there any empirical insight that can be drawn from the theoretical analysis?
Some possible typos:
line 16: " they remain significantly less understood" seems should be "remain significantly less understood" without "they"?
line 159: "this is typically achieved using stochastic time-reversal theory", what does "this" refer to?
line 259: in Assumption 4, equation (8) seems missing a squared term? instead of $\mathbb E||\nabla \hat {\mathcal L}_k(\theta)-\nabla {\mathcal L}_k(\theta)||\leq\sigma^2$, seems should be $\mathbb E||\nabla \hat {\mathcal L}_k(\theta)-\nabla {\mathcal L}_k(\theta)||^2\leq\sigma^2$?
line 276: "Assumptions 2 3,4" missing a comma
line 340: left side of equation (16) "TV(($p_{t_0},\hat p_{t_0}$)" has a redundant left bracket
line 350: "in order to upper bound, TV($p_{t_0}^{\text{dis}},\tilde p_{t_0}$)" seems the comma should be at the end, and TV($p_{t_0}^{\text{dis}},\tilde p_{t_0}$) should be TV($p_{t_0},\hat p_{t_0}$)?
line 444: in equation (29), seems missing a square in the log term, i.e., should be $\log^2(\frac{4K}{\delta})$ instead of $\log(\frac{4K}{\delta})$ |
Fully human-written |
|
Improved Sample Complexity Bounds For Diffusion Model Training Without Empirical Risk Minimizer Access |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents a refined theoretical analysis of sample complexity in score-based diffusion models, establishing improved upper bounds under assumptions of transport regularity and score smoothness.
The authors introduce a transport-regularized score learning framework, which connects diffusion training to optimal transport and measure concentration theory.
The main result provides an $\tilde{O}(d^{1/2}\epsilon^{-2})$ sample complexity bound—improving upon existing $\mathcal{O}(d\epsilon^{-2})$ results—by leveraging a new regularity lemma that bounds gradient mismatch under Wasserstein perturbations.
The paper also shows lower bounds demonstrating near-tightness, and includes small-scale empirical studies confirming that models trained with transport regularization converge faster and generalize better.
- Tighter bounds: Achieves dimension-reduced sample complexity $\tilde{O}(d^{1/2}\epsilon^{-2})$, nearly optimal under smooth score assumptions.
- Elegant analysis: Introduces a transport-regularity lemma connecting score estimation error and Wasserstein stability.
- Clarity of assumptions: Clearly separates smoothness, boundedness, and concentration requirements.
- Lower bounds: Includes matching minimax lower bounds showing near-tightness.
- Practical implications: Empirical section shows reduced gradient variance and improved convergence in training dynamics.
- Scope of assumptions: Gaussian noise and bounded support may limit generality; no discussion for heavy-tailed or structured priors.
- Empirical evidence limited: Only toy 2-D and MNIST results; scaling to large image datasets would better validate claims.
- Interpretability of constants: While rates improve, the constants in the main theorem are large; a short discussion on their magnitude or dependence on regularity parameters would be useful.
- Connection to score-matching loss: The equivalence between transport-regularized and standard diffusion training objectives could be elaborated.
- No ablation on regularization strength: Lacks a quantitative sweep showing performance vs. regularizer intensity.
- Regularity assumption: What exact function class or Sobolev norm defines “transport regularity”? Is it equivalent to assuming a bounded Jacobian of the optimal transport map?
- Comparison to prior bounds: Could the authors explicitly contrast their $\tilde{\mathcal{O}}(d^{1/2})$ scaling with the $\mathcal{O}(d)$ scaling of Bortoli et al. (2023) or Deasy & Hsieh (2024)?
- Lower bound derivation: How tight is the constructed adversarial distribution—does it achieve equality asymptotically or just order-wise?
- Practical estimator: Is the proposed regularizer differentiable and implementable with automatic differentiation in standard diffusion frameworks?
- Dependence on noise schedule: Does the analysis assume a linear $\beta_t$ schedule, or can it generalize to cosine or variance-preserving schedules?
- Empirical scaling: Any evidence of improved sample complexity (fewer steps or samples to reach fixed FID) on realistic datasets?
- Regularization strength: How sensitive are results to the choice of regularization parameter $\lambda$?
- Distributional robustness: Could transport regularity also imply robustness to data perturbations or distribution shift?
- Theoretical boundaries: Is the $\tilde{\mathcal{O}}(d^{1/2}\epsilon^{-2})$ bound provably optimal under your assumptions, or might $\mathcal{O}(\log d)$ scaling be possible?
- Connections: Could the approach be linked to the score-Fisher consistency or information geometry analyses used in recent diffusion theory papers? |
Fully AI-generated |
|
Improved Sample Complexity Bounds For Diffusion Model Training Without Empirical Risk Minimizer Access |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper claims that it gives better sample complexity guarantees under relaxed assumptions. The authors develop a principled framework that derives a finite-sample complexity bound of $O(\epsilon^{-4})$ for diffusion model training without relying on the ERM assumption. Their approach decomposes the score estimation error into three components: approximation, statistical, and optimization errors. This improves on the best known bound $O(\epsilon^{-5})$.
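For concreteness, the decomposition referred to above is, schematically (my notation, not necessarily the paper's),
$$\mathbb{E}\,\big\|s_{\hat\theta} - \nabla \log p_t\big\|^2 \;\lesssim\; \epsilon_{\text{approx}} \;+\; \epsilon_{\text{stat}}(n) \;+\; \epsilon_{\text{opt}},$$
where assuming access to an empirical risk minimizer effectively sets $\epsilon_{\text{opt}} = 0$; the point of the paper is to control $\epsilon_{\text{opt}}$ without that oracle.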
$\textbf{Strengths:}$
Novel theoretical contribution:
The paper claims that it addresses an open problem in diffusion model theory by removing the ERM assumption, which was explicitly identified as a major limitation in prior work.
Improved bound:
The derived sample complexity of $O(\epsilon^{-4})$ is an improvement over previous results and eliminates exponential dependence on data dimension.
$\textbf{Weaknesses}$
Some assumptions may still be restrictive:
Although the PL condition is weaker than convexity, it may not hold globally for general neural networks used in large-scale diffusion models. The authors could clarify when these assumptions are expected to hold in realistic architectures.
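For reference, the PL condition in question is the standard Polyak-Łojasiewicz inequality,
$$\tfrac{1}{2}\,\big\|\nabla L(\theta)\big\|^2 \;\ge\; \mu\,\big(L(\theta) - \inf_{\theta'} L(\theta')\big) \quad \text{for all } \theta,$$
which is typically only argued to hold locally (e.g., for sufficiently overparameterized networks near interpolation) rather than globally over the whole parameter space.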
Confusion regarding discrete-state models:
The abstract mentions discrete-state diffusion models as a motivation, but the main analysis focuses on continuous SDE-based diffusion. This confused me a lot as I came into this paper expecting to see results on discrete diffusion.
Presentation density:
While technically solid, the exposition is heavy and at times repetitive (e.g., re-stating assumptions multiple times). Additionally, the paper has many typos and inconsistencies in writing that should be fixed. Examples include lines 432-433; the word "equation" is sometimes written with a capital E and sometimes with a lowercase e; the footnote on the first page is split across pages; and there are many more.
Sample complexity concern:
Although the authors claim to improve the bound, something seems to be missing from their analysis. Please see Questions for more on this point.
1) When does the PL condition hold in practice? Is it realistic?
2) Could the authors please explain how Assumption 4 holds for ReLU activations? More specifically, how is the gradient of the loss defined on the whole support when ReLU activations are used, given that the loss is not differentiable everywhere?
3) The authors state that Assumption 3 is somewhat common (lines 238–250) and suggest that previous works effectively assume the corresponding constant is zero. However, after reviewing Assumption A2 in Gupta et al., it seems this may not be accurate: Theorem C.3 explicitly includes an $\epsilon^3$ term. Treating this term as “just some constant” appears to be what enables the improved sample complexity claimed (lines 441–442). Could the authors clarify the definition and role of the $\epsilon_{\text{approx}}$ term and justify the claim of improved sample complexity? |
Lightly AI-edited |
|
SCMF: Lightweight Retrieval-Augmented Generation via Retrieval Vector Compression |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 0:
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This work proposes a technique (SCMF) for information retrieval for Retrieval-Augmented Generation (RAG). It leverages a combination of dense and sparse retrieval. According to the provided space complexity analysis (i.e., storage cost), the technique offers significant savings with respect to existing techniques.
- The technique offers competitive accuracy results compared to standard alternatives.
- A combination of a variant of an inverted index based on vector quantization, BM25, and reranking with full-precision vectors is used to provide the desired results.
- The analysis of the related work severely lacks depth and omits many solutions that are very related to the one proposed in this work. For example, the Inverted Multi-Index (IMI) proposes a similar indexing technique based on product quantization (PQ) instead of residual quantization. Beyond acknowledging IMI in the related work, the authors should show why residual quantization (RQ) offers a better alternative than PQ, particularly because PQ is easier and faster to train than RQ. Another very related work is FLANN, which builds a tree of residual quantizers to index vectors. Although FLANN does not leverage BM25 and relies on vectors alone, it should be mentioned and acknowledged. Moreover, graph indices are the most commonly used techniques for RAG retrieval in industrial deployments; HNSW and DiskANN are standard solutions at this point. Comparisons to all of these solutions should be included in the experimental section, as graph indices are often considered state-of-the-art vector indices.
Babenko, Artem, and Victor Lempitsky. "The inverted multi-index." IEEE transactions on pattern analysis and machine intelligence 37.6 (2014): 1247-1260.
Muja, Marius, and David Lowe. "Flann-fast library for approximate nearest neighbors user manual." Computer Science Department, University of British Columbia, Vancouver, BC, Canada 5.6 (2009).
Malkov, Yu A., and Dmitry A. Yashunin. "Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs." IEEE transactions on pattern analysis and machine intelligence 42.4 (2018): 824-836.
Jayaram Subramanya, Suhas, et al. "Diskann: Fast accurate billion-point nearest neighbor search on a single node." Advances in neural information processing Systems 32 (2019).
- More importantly, the authors claim that existing techniques "lack backtracking to the original knowledge." This is simply not true. Any sensible deployment of a retrieval pipeline for RAG (or other applications, for that matter) uses full-precision vectors for reranking if the index relies on vector quantization. For example, DiskANN uses PQ for the initial search and full-precision vectors for reranking. Similar approaches are commonly used with PQ-based inverted indices. In practical deployments, the outputs of these indices are combined with BM25 to enhance the dense retrieval with term-based retrieval. This is standard practice today (a sketch of this standard pipeline is included further below).
- The spatial analysis of SCMF does not seem accurate to me. SCMF intrinsically uses BM25 and full-precision vectors for reranking. As such, the footprint of both of these components needs to be accounted for when evaluating that of SCMF. Apparently, the authors are comparing only one component of SCMF (the inverted index), which accounts for a tiny fraction of the total footprint, with the entire footprint of techniques like IVF-PQ. The footprint of BM25 and the full-precision vectors in SCMF dominates that of the inverted index. Once this larger footprint is considered, SCMF stops being lightweight and does not seem to lead to significant accuracy improvements over PQ or OPQ (even more so because, if paired with full-precision vectors for reranking, these techniques would also show improved accuracy).
- In my opinion, SCMF does not truly have incremental indexing capabilities. Being a variant of an inverted index, SCMF will suffer significantly when facing distribution shifts. Complex operations are often needed to handle such cases, as discussed in DEDRIFT. Distribution shifts are the crux of incremental indexing, not the ability to insert or remove vectors (virtually all existing techniques provide this capability).
Dmitry Baranchuk, Matthijs Douze, Yash Upadhyay, and I Zeki Yalniz. 2023. DEDRIFT: Robust Similarity Search under Content Drift. In IEEE International Conference on Computer Vision. 11026–11035.
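To make the standard pipeline referred to above concrete, here is a minimal sketch (using FAISS; parameter values, candidate counts, and the BM25 placeholder are illustrative assumptions, not the authors' setup) of quantized first-stage search followed by full-precision reranking and optional term-based score fusion:

```python
import numpy as np
import faiss  # assumed available; any IVF-PQ implementation works similarly

d, nlist, m = 768, 1024, 64
xb = np.random.randn(100_000, d).astype("float32")   # corpus embeddings, kept in full precision for reranking
xq = np.random.randn(1, d).astype("float32")          # query embedding

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # inverted lists + PQ codes (m sub-quantizers, 8 bits each)
index.train(xb)
index.add(xb)
index.nprobe = 16

# Stage 1: approximate search over the compressed codes
_, cand = index.search(xq, 200)
cand = cand[0]

# Stage 2: rerank the candidates with the stored full-precision vectors
exact = -np.linalg.norm(xb[cand] - xq, axis=1)        # exact (negative L2) scores

# Optional hybrid step: fuse with a term-based score (placeholder for real BM25 values)
bm25 = np.random.rand(len(cand))
dense = (exact - exact.min()) / (exact.max() - exact.min() + 1e-9)
fused = 0.7 * dense + 0.3 * bm25
top10 = cand[np.argsort(-fused)[:10]]
```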
- I recommend an in-depth literature analysis and study of standard retrieval pipelines for RAG, beyond the succinct sample that I provide in this review.
- The authors should justify why their spatial analysis is correct.
- The authors should justify why the clustering-based technique does not need updating and re-indexing when facing incremental indexing. |
Fully human-written |
|
SCMF: Lightweight Retrieval-Augmented Generation via Retrieval Vector Compression |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper presents a method for ANN search using quantization. It uses PCA and a two-level codebook learned with k-means on the data. The experiments show the method outperforms classical vector quantization approaches.
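As described, the pipeline essentially reduces to the following sketch (scikit-learn on toy data; dimensions and codebook sizes are my own placeholder choices, not the paper's), which is relevant to the novelty concern raised below:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.randn(20_000, 768).astype(np.float32)        # toy document embeddings
Z = PCA(n_components=128).fit_transform(X)                  # linear dimensionality reduction

level1 = KMeans(n_clusters=256, n_init=3).fit(Z)            # first-level codebook (k-means)
resid = Z - level1.cluster_centers_[level1.labels_]         # residuals after level 1
level2 = KMeans(n_clusters=256, n_init=3).fit(resid)        # second-level codebook on the residuals

codes = np.stack([level1.labels_, level2.labels_], axis=1)  # two short codes per document
```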
The experiments show positive results
- The paper lacks novelty. The main contribution is to use PCA + k-means for vector quantization, which is a standard technique.
- The paper does not engage with the vector quantization literature, nor does it use any recent baselines. There is a large body of work on vector quantization for retrieval (e.g., https://arxiv.org/pdf/2501.03078, https://arxiv.org/pdf/2405.12497, https://arxiv.org/pdf/2304.04759, and the references therein), and these should be included as baselines.
- The paper should present dataset-specific results in Table 3 instead of aggregate results
NA |
Fully human-written |
|
SCMF: Lightweight Retrieval-Augmented Generation via Retrieval Vector Compression |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents SCMF (Semantic Compressed Memory Framework), a novel indexing and retrieval framework designed to improve the efficiency and scalability of Retrieval-Augmented Generation (RAG) systems. The key contribution lies in introducing a semantic compression mechanism that projects document embeddings into a low-dimensional semantic space using PCA, followed by Residual Vector Quantization (RVQ) to form compact Semantic Memory Units (SMUs). These units are linked to raw knowledge entries through a semantic inverted index, which enables efficient CRUD operations and preserves traceability.
1. This paper tackles an important and practical efficiency bottleneck in modern RAG systems.
2. The paper presents experimental evaluations on QA benchmarks demonstrating that the proposed SCMF significantly lowers storage cost and retrieval latency.
1. On the technical core and novelty
If I understand correctly, this paper proposes to first apply PCA to project document embeddings into a lower-dimensional space, and then apply a two-level residual vector quantization (RVQ) to compress the projected embeddings. While the proposed pipeline is technically reasonable, it mainly combines existing techniques — PCA for dimensionality reduction and residual quantization for compression — both of which are well-established. Thus, the technical novelty appears limited. The main contribution seems to lie in terminology and framing rather than in algorithmic innovation.
Moreover, the paper introduces several new terms (e.g., Raw Knowledge Unit for a paragraph or document, Semantic Compressor for PCA, and Semantic Memory Unit for the quantized representation). However, the techniques behind these components are not inherently semantic in nature. PCA and RVQ are purely geometric and statistical transformations that do not model semantics explicitly. Therefore, using the term semantic throughout the paper may be misleading, as it overstates the conceptual contribution and creates a mismatch between terminology and method.
2. The claimed support for efficient CRUD (Create, Read, Update, Delete) operations is also not unique to this framework. CRUD operations are standard in most quantization-based or vector database systems that maintain indexable and updatable codebooks.
3. On experimental setup and fairness of comparison
The experimental comparison with PQ and OPQ does not appear to be apples-to-apples. Specifically, the proposed SCMF uses two levels of RQ with a 32K codebook size, while PQ and OPQ are configured with 8 codebooks of size 256. These settings differ substantially in both codebook granularity and total representational capacity, making the comparison unfair. The paper should control for comparable total codebook size or quantization bit budget to ensure a fair comparison of compression efficiency and retrieval accuracy.
4. Several statements in the paper are inaccurate or misleading:
(1) Line 132: The paper claims that “our system stores only the RVQ codebooks (and short codes), making index size largely independent of d.” This is not entirely correct, since the codebooks themselves are d-dimensional. As d increases, the storage required for codebooks also increases, making the total index size still dependent on d.
(2) Line 348: The phrase “in the subsequent end-to-end retrieval experiments” is unclear. It is not evident in what sense the proposed method constitutes an end-to-end retrieval framework, since the retrieval and compression components appear modular and not jointly optimized.
(3) Line 377: The statement that “SCMF’s index size stays nearly constant across corpus scales” is inaccurate. Each document is associated with a discrete short code, so the total index size should scale linearly with the corpus size. The apparent constancy may result from the large codebook size (32K), which dominates the overall storage, but the underlying scaling property remains linear with the number of documents.
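A rough back-of-the-envelope calculation (with assumed values: $d = 768$, float32 codebooks, two levels of 32K centroids each, and 2-byte codes per level per document) makes the scaling explicit:
$$\underbrace{2 \times 32{,}768 \times 768 \times 4\,\mathrm{B} \approx 201\,\mathrm{MB}}_{\text{codebooks, independent of } N} \qquad\qquad \underbrace{4N\,\mathrm{B} \approx 4\,\mathrm{MB}\ (N{=}10^6),\ \ 400\,\mathrm{MB}\ (N{=}10^8)}_{\text{per-document codes, linear in } N}$$
The code storage is linear in $N$ and is merely masked by the large constant codebook term at small corpus sizes; the full-precision vectors retained for reranking add another $4dN$ bytes (about 3 GB at $N = 10^6$) on top.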
Please see above weaknesses. |
Fully AI-generated |
|
SCMF: Lightweight Retrieval-Augmented Generation via Retrieval Vector Compression |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a vector storage optimization approach for vector-based retrieval systems, aiming to reduce storage costs and retrieval latency through semantic compression and hybrid re-ranking. However, it has several major limitations, including a lack of comparison with closely related techniques, unclear framing of its main contribution, insufficient experimental evaluation, and no discussion of the impact of training on the proposed method.
- The paper proposes a new vector storage optimization approach which addresses storage cost and retrieval latency bottlenecks associated with large-scale corpora.
- The paper combines efficient ANN search in SMU space with hybrid re-ranking (BM25 sparse + dense retrieval) to achieve controllable storage costs and reduced latency while maintaining accuracy.
- The paper lacks comparisons with state-of-the-art ANN systems like DiskANN and SPTAG, which are specifically designed for large-scale vector search. Without benchmarking against these established systems, it's difficult to assess whether SCMF's claimed advantages in storage and latency are truly competitive or simply comparable to older methods (IVF-Flat, HNSW). This gap undermines the claim that SCMF represents a significant advancement in vector indexing.
- The paper is framed as a lightweight RAG method but only focuses on retrieval performance. This creates confusion about the actual contribution. If SCMF is primarily a retrieval index optimization technique, which the paper suggests, it should be positioned as such rather than as a complete RAG solution.
- The datasets used, such as MultiDocQA, HotpotQA-Long, NarrativeQA, involve relatively small-scale corpora that don't reflect real-world deployment scenarios.
- The paper provides unclear training specifications. For example: what data was used to train the VQ-VAE-inspired compression model? How much training data is needed for effective SMU learning? What is the impact of the training data's domain and size on retrieval performance? Can the learned compression generalize to out-of-domain documents?
Please see my comments provided above. |
Heavily AI-edited |
|
Minimum-Excess-Work Guidance |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors propose a framework to guide probability flow generative models using a minimum excess work (MEW) framework. They motivate their method physically and theoretically and evaluate it on synthetic and protein data. They show that MEW is able to improve estimation of observables and sample rare event regions.
- The paper is clearly written and well-motivated
- The theoretical justification motivates the proposed guidance and regularization technique
- The method clearly improves observable estimation
- The experiments lack simple baselines. Including some sort of simple classifier guidance and showing that it fails due to lack of regularization would strengthen the paper.
- Many of the experiments seem to report improvements in observable estimation for the same observables used for guidance (e.g. 4.1.2). Are the JHN-HA couplings included in the 10 experimental observables used for guidance in 4.1.3? A clearer discussion surrounding which observables are used for guidance, and to what extent guidance improves estimation of other observables, would make the paper stronger.
- Where do the samples for transition state guidance come from? If the goal is to sample rare events, then a clearer discussion of the extent to which some knowledge of the transition region is required would make the paper stronger. It would also be interesting to compare to other existing literature to ground the benefit of the proposed method.
I am happy to raise my score if some of my points are addressed.
Please see above. |
Fully human-written |
|
Minimum-Excess-Work Guidance |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The method aims to perturb learned (diffusion) generative models by introducing a new score term into their diffusion process. In order to avoid straying too far from the "data manifold," a regularization is introduced whereby the excess work (work done beyond what is necessary to generate the learned distribution) is penalized. The justification comes from the fact that excess work appears explicitly in the formulations of the Kullback-Leibler and Wasserstein divergences. The methods proposed focus on changing the distribution such that observables match experiment or so that certain (transition) regions are better sampled. The paper is written well and simple to follow; although the justification is not very rigorous, it agrees in spirit with many recent works on the subject.
The results include observable and path guidance on synthetic data, observable guidance with coarse-grained chignolin and BioEmu, and finally path guidance on the coarse-grained BioEmu energy for chignolin.
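For what it is worth, the KL side of this connection can be stated compactly (my paraphrase in generic notation, not the paper's): if the learned reverse-SDE drift is perturbed by an additional control $u_t$, Girsanov's theorem gives
$$\mathrm{KL}\big(\mathbb{P}^{u}\,\big\|\,\mathbb{P}^{0}\big) \;=\; \tfrac{1}{2}\,\mathbb{E}_{\mathbb{P}^{u}}\!\int_0^T \frac{\|u_t(X_t)\|^2}{\sigma_t^2}\,dt,$$
so the quadratic "excess work" of the perturbation directly controls how far the guided path measure (and hence the generated distribution) can drift from the base model.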
# general
- the method addresses interesting problems for the current moment (reweighting toward experimental observables and finding transition states)
- clearly presented with strong empirical results, albeit with minimal comparisons.
# method
- the need for a kernel to identify transition regions seems extremely limiting in practice, i.e. one would need a good representation of the transition state and a similarity kernel which probably does not scale well with dimensionality.
# general
- the paper lacks very strong comparisons to baselines. There are other methods to try; however, none of them has been applied to the specific cases covered in this work. That increases the cost of comparison; however, it would be in the spirit of a NeurIPS paper.
- Is path guidance extremely expensive? It seems like it would be.
- It seems that evaluation of path guidance is generally weaker. You claim that multimodal transition states would pollute the signal and weaken the guidance. Is it possible to justify this claim further?
- I think comparing with other methods is most appropriate for path guidance where your baseline "loss guidance" has many caveats (as you fairly mention). Can you comment on the difference between your method and, e.g., adjoint matching for this task. It seems like something like that would be a fairer comparison. |
Fully human-written |
|
Minimum-Excess-Work Guidance |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a framework of algorithms for steering flow-based generative models (e.g. DDPM, normalizing flows) through the minimal work principle often used in control theory and statistical mechanics. The motivation here is to keep the new distribution as close as possible to the original distribution. Experiments are presented on low-dimensional synthetic data, coarse-grained chignolin, and BioEmu.
- The objective is clear, easy to understand, and straightforward to implement. Simplicity also means there are minimal hyperparameters to tune.
- Theory is also clear and fleshed out well, helping future practitioners understand the role of this approach.
- Many ablations and various metrics on the method also help future practitioners understand the method better.
- There is a notable lack of baselines. The objective is very similar to other stochastic control–style fine-tuning methods that employ this minimal work idea from optimal control and stat mech, e.g. [Adjoint Matching](https://arxiv.org/abs/2409.08861) (see e.g. eq. (28)), as well as the already extensive body of work on guiding diffusion models and flow matching models towards a new energy function (e.g. anything learning the tilted distribution $p_{\text{data}}(x)\, e^{r(x)}$).
- Success depends on many different factors (KDE design, time-varying bandwidths, optimization schedule) that are not ablated or discussed for sensitivity.
- The specific kind of divergence used in this paper (L2) is not compared, theoretically or empirically, to alternative kinds of divergence used in related work (e.g. KL). See, e.g., [Tang and Zhou](https://arxiv.org/pdf/2403.06279).
- How does the method perform against Adjoint Matching and other SOC-style approaches? Are there particular benefits to the proposed approach?
- Are there any settings presented in the paper where the main diffusion guidance baselines (e.g. Diffusion-DPO, adjoint matching) are inapplicable?
- Are there unique theoretical advantages from using the L^2-based divergence over an alternative like KL (e.g. does stability not hold for KL)? |
Fully human-written |
|
Minimum-Excess-Work Guidance |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors look at formulating an “excess work” term (based on the principles of minimum excess work) as a regularizer to guide generative models. This involves adding an additional perturbation to the score model. The generative models are guided based on either observables (to align the generated distributions), or by aligning the path of the generative model. The authors look at a synthetic data example and coarse-grained proteins to test this idea.
- The paper is mostly easy to read and follow.
- The general motivation of such a framework, using experimental observables, makes sense, and is one worth looking at by the community.
- It seems like the guidance is being done with an observable, and then the ML model prediction is also tested on that observable (such as in section 4.1.2). This doesn’t seem like a fair comparison setting, as you are predicting something that you’re guiding on. How do you do in cases where the observable is held-out? The evaluations of the method on the two protein cases could be more comprehensive.
- The authors look at path guidance to compute transition states. This section seems ad hoc in terms of what the authors are trying to apply MEW regularization to and how, and Figure 4 isn't a comprehensive metric for quantifying how well these transition states are sampled. More evaluations would be helpful here, such as computing transition rates or free energies. Additionally, this is a problem that has been studied in the past using ML, and the authors do not make any comparisons to other baseline ML methods for this task (they use loss guidance as their baseline).
- How is the reference model in Section 4.1.2 defined?
- In section 4.1.2, are you both guiding with an experimental observable (and making predictions on that observable), and also generating new structures of chignolin?
- Can you provide more details on the exact amount of data samples that are used to train the model? I found this hard to find (except in section 4.1.3, where it seems like 10 experimental data points were used for guidance).
- In section 4.1.3, is the unguided model a pre-trained BioEmu model?
- In section B.3.3, the authors mention that the effective sample size for this example is 0.255. Could the authors provide more clarification on what this refers to? This seems low and could make estimates unstable.
- Is there some calibration that needs to be applied to account for any uncertainty or noise in the observables?
- Given that the authors are using a supervised observable loss, and assuming a sparse data setting, how do you ensure overfitting is prevented?
- There are some works that seem related to the general idea presented in this paper, where they have a formulation to find the minimum energy path to sample between states. How does this work compare to these approaches?
> [1] Peterson and Covino. PINN-MEP: Continuous Neural Representations for Minimum Energy Path Discovery in Molecular Systems. FPI-ICLR (2025). [2] Raja et al. Action-Minimization meets Generative Modeling. Efficient Transition Path Sampling with the Onsager-Machlup Functional. ICML (2025) |
Fully human-written |
|
Decoupled Classifier-Free Guidance for Counterfactual Diffusion Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors propose Decoupled Classifier-Free Guidance, a counterfactual image generation method based on classifier-free guidance. The difference to prior work is that the authors cluster attributes into different groups and apply different guidance weights to them. The method is empirically evaluated on CelebA-HQ, MIMIC-CXR and EMBED.
1. The paper is well-written. I appreciate how the approach is contrasted with the existing method by [1]. It becomes immediately clear what is done differently in this work.
2. The experimental evaluation is quite strong. The method is demonstrated on many different data sets with many examples. I am not a radiologist, so I cannot judge the mammograms and chest x-ray images. However, it is clear that this is an important application domain.
3. The proposed method is simple.
[1] Sanchez, Pedro, and Sotirios A. Tsaftaris. "Diffusion causal models for counterfactual estimation." Conference on Causal Learning and Reasoning (2022).
1. My main concern with this method is that it is somewhat trivial. The only novelty seems to lie in a grouping of attribute variables and then applying different guidance strengths to each group. Given the lack of contribution, I am afraid that this work is far from being publishable.
2. The soundness of this method is also not so clear to me. From a mathematical perspective, the weights $w_i$ should actually all be $1$. Also, the authors propose to have one weight for the "affected" variables and an additional weight for "unaffected" variables, and it is not clear why this is a reasonable grouping. Generally, I do not see any deeper grounding for the argument of why we should group attributes and apply different guidance weights to them, or why the groups are chosen the way that they are (see the schematic after this list).
3. Some notation could be improved. For instance the $\mathrm{pa}$ notation: In case the authors do not know it, "$\mathrm{pa}$" means "parents", in the sense of "causal parents". I would therefore not use $\mathrm{pa}^{(m)}$ to denote attribute groupings, because it sounds as if we are grouping variables by their causal mechanisms (which is apparently not the case).
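For concreteness, the kind of grouped guidance rule I understand the paper to be using is, schematically (my notation; the exact form in the paper may differ),
$$\tilde\epsilon_\theta(x_t) \;=\; \epsilon_\theta(x_t,\varnothing) \;+\; \sum_{m} w_m\,\big[\epsilon_\theta\big(x_t,\mathrm{pa}^{(m)}\big) - \epsilon_\theta(x_t,\varnothing)\big],$$
with one weight per attribute group instead of a single global $w$. Setting all $w_m = 1$ is what corresponds to (compositional) conditional sampling, which is the sense in which I would expect the weights to be $1$.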
* I highly suggest to remove "proposition 1". This factorization is trivial and there is nothing to be proved, as far as I can see. Maybe I am missing something?
* I do not understand why the method is based on classifier-free guidance. It seems to me that the method could also be implemented with classifier guidance.
* Why is the method called "Decoupled Classifier-Free Guidance"? What is decoupled? It seems more that it is grouped, so should be called something like "Grouped Classifier-Free Guidance".
* In appendix D, it says that the anti-correlation between "young" and "male" stems from data set bias. What is "data set bias"? To me it seems more like selection bias: Given that someone is a celebrity, being young and female are dependent. |
Fully human-written |
|
Decoupled Classifier-Free Guidance for Counterfactual Diffusion Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes Decoupled Classifier-Free Guidance (DCFG), a simple yet effective modification of standard classifier-free guidance (CFG) for counterfactual diffusion models. Standard CFG suffers from the problem of *attribute amplification*, where increasing the guidance strength for one intervention unintentionally alters correlated attributes (e.g., changing “Young” decreases “Male”). DCFG addresses this by splitting attributes into disjoint groups (intervened vs. invariant) and assigning separate guidance weights for each group. This decoupling enables more disentangled, causally faithful counterfactuals without retraining or architectural changes. Experiments on CelebA-HQ, EMBED (mammography), and MIMIC-CXR show that DCFG reduces amplification and improves reversibility compared to standard CFG.
- **Practical and elegant idea:** The attribute-split conditioning mechanism is extremely simple and requires minimal changes to existing diffusion pipelines while producing clear improvements.
- **Generality:** The approach is model-agnostic and could extend to other settings.
- **Strong empirical validation:** Results are shown across both natural and medical datasets with quantitative metrics and convincing qualitative examples. The inclusion of reversibility and cross-attribute correlation analysis is helpful and demonstrates the problem concretely.
* **Clarity of motivation and intuition.**
The paper tackles an important issue (attribute amplification in classifier-free guidance) but why this happens is not very clearly explained. The reviewer had to re-read [1] and infer the connection independently. Including a brief intuitive explanation, figure or toy example would make the motivation easier to grasp.
* **Simplicity and missing contextualization.**
The proposed fix of splitting attributes into intervened and invariant groups and applying separate guidance weights is extremely simple and easy to adopt, which is a strength. However, the paper does not discuss whether similar disentangling or conditional-guidance approaches have been explored before, or how this method compares to more sophisticated alternatives for representation disentanglement. A short discussion or empirical comparison would clarify novelty.
* **Missing baselines and ablations.**
The experiments compare only to standard CFG. It would strengthen the work to include other diffusion-based counterfactual explanation or editing baselines (e.g., [2], [3]).
* **Incomplete evaluation metrics.**
The paper does not report realism measures such as FID or sFID, nor composition or minimality metrics commonly used in counterfactual image generation. The authors explain the omission of composition in the Appendix; mentioning this in the main text would help. Also measuring minimality (through a VLM or a user study), or adding a brief note on why it is difficult to quantify, would make the evaluation more comprehensive.
* **Unaddressed observations.**
In Figure 1, increasing guidance for *do(Young)* appears to reduce *Male*, likely due to a dataset bias that biases the classifier. This should be discussed and attributed to this bias or another factor. Figure 5 shows similar unexplained behavior (*do(circle)* affecting density AUROC).
* **Minor presentation issues.**
- Figure 5 would benefit from using the same interventions across subfigures.
- Figure 3 is difficult to read and should be split.
- Equation (5) seems to omit $\tilde{x}$
- Equation (12) may need a $(1-\omega)$ term for completeness
- line 284 should reference Appendix C.
[1] Tian Xia, Mélanie Roschewitz, Fabio De Sousa Ribeiro, Charles Jones, and Ben Glocker. Mitigating attribute amplification in counterfactual image generation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 546–556. Springer, 2024.
[2] Guillaume Jeanneret, Loïc Simon, and Frédéric Jurie. Diffusion models for counterfactual explanations. In Proceedings of the Asian conference on computer vision, pp. 858–876, 2022.
[3] Preechakul, K., Chatthee, N., Wizadwongsa, S., & Suwajanakorn, S. (2022). Diffusion Autoencoders: Toward a Meaningful and Decodable Representation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
1. Can you provide a clearer intuition, illustrative example or figure for why CFG leads to attribute amplification?
2. Have you tried a version that removes invariant attributes from conditioning entirely?
3. How does DCFG compare with other diffusion-based counterfactual or editing methods such as DiME, or diffusion autoencoders?
4. Can you report or discuss realism metrics (FID/sFID) and briefly justify the exclusion of composition/minimality metrics?
5. Could you clarify the observed cross-attribute effects in Figures 1 and 5 and verify the potential inconsistencies in Equations (5) and (12)? |
Fully AI-generated |
|
Decoupled Classifier-Free Guidance for Counterfactual Diffusion Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
In this work, the authors address the issue that classifier-free guidance for counterfactual generation can lead to attribute amplification, referring to unwanted correlations between attributes. To mitigate this problem, they propose a new method, decoupled classifier-free guidance, which leverages a causal graph. The proposed approach is evaluated on three datasets, including CelebA, mammography, and chest X-rays, and demonstrates convincing results.
1. The motivation of this work is solid and well-justified. Avoiding spurious correlations in generative models is a challenging task that is worth pursuing.
2. This work provides the necessary prerequisites for understanding the paper in Section 2. The structure of the paper is clear and easy to follow, and the writing is overall clear and coherent.
3. The visualizations of the results are quite convincing.
The evaluation of the generated images in terms of effectiveness and reversibility also makes sense for counterfactual generation tasks.
4. I also appreciate the effort of including medical data, as it is valuable to incorporate datasets that are closer to real-world settings.
1. The paper lacks discussion of prior literature on the issue of attribute amplification. I believe a paragraph in Section 2 dedicated to attribute amplification is needed, clarifying how it is defined and summarizing previous studies that have encountered this issue, especially in the context of counterfactual generation.
2. There is also a lack of details on how the CFG model was trained. I am somewhat confused about the training setup: did the authors use only the target attributes for supervision, or were other attribute annotations included as well?
3. There are no numerical results presented in tables; all results are shown in figures, which makes it difficult to assess quantitative values. This could be considered a minor weakness.
4. Regarding reproducibility, it does not appear that the experiments were conducted with multiple random seeds.
1. I am a bit unsure about why CFG is needed for counterfactual generation. Since, for generating a counterfactual, you have a target in mind (for example, changing the disease label from 0 to 1), classifier guidance seems more natural to me. Can you explain the intuition for using CFG here?
Following this question, the biased results only appear when w is larger, i.e., as more and more guidance is introduced. If the model were purely classifier-guided, would there still be such an issue? |
Fully human-written |
|
Decoupled Classifier-Free Guidance for Counterfactual Diffusion Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper identifies a key limitation in the conventional Classifier-Free Guidance (CFG) approach for counterfactual generation: applying a global guidance weight $\omega$ to the entire counterfactual embedding can violate causal relations and unintentionally alter attributes that should remain unchanged. To address this issue, the author proposes Decoupled Classifier-Free Guidance (DCFG), which employs multiple MLPs, each corresponding to a specific attribute, to generate independent semantic attribute vectors and provide more targeted counterfactual generation guidance. The DCFG framework is evaluated on three datasets, demonstrating its effectiveness over conventional CFG.
1. The discussion on how traditional CFG can violate causal relations is interesting and can be inspirational for future research in counterfactual generation.
2. DCFG shows promising performance across all case studies.
3. The paper presents a clear and thorough explanation of the technical background and motivation.
1. The proposed solution, which uses a separate MLP for each attribute, is not scalable. For instance, while the CelebA-HQ experiments involve only three attributes, the CelebA dataset includes 40 attributes. Applying DCFG to all attributes would require 40 MLPs, which raises a serious concern about scalability.
2. Related areas such as disentangled representation learning and debiasing for protected attributes have extensively explored methods for isolating and manipulating specific attribute features without affecting others. Although the paper claims to focus on counterfactual inference and structural causal models, its discussion of causality remains limited. Beyond the abduction–action–prediction procedure, there is little discussion of causal graphs or explicit modeling of causal relations.
3. The experimental setup appears closer to studying disentangled representation learning than causality, as the attributes used in the case studies (e.g., Figure A.1) are independent of each other rather than causally related. Consequently, the experiments may not sufficiently demonstrate DCFG's ability to leverage causal relations for counterfactual generation.
Please refer to the Weaknesses. |
Lightly AI-edited |
|
ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a novel reinforcement learning (RL) framework to enhance the handling of paralinguistic cues in speech-to-speech (S2S) dialogue models. It clearly introduces the topic of natural speech communication, which conveys not only words but also paralinguistic cues such as emotion, tone, and speaker attributes, and explains how these cues should shape the speech response. To address this, the authors construct a benchmark, ParaS2SBench, which automatically evaluates S2S models for content and style appropriateness; using this benchmark as a reward signal, they then apply RL approaches such as GRPO to improve the content and style fitness of the responses.
The major contributions lie in the concrete formulation of this topic, the construction of an appropriate S2S evaluation benchmark, and the experimental demonstration of the effectiveness and cost of RL and SFT. The benchmark is characterized by imaginative design and a rigorous focus on its objectives, and the experimental exploration is also inspiring.
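For context on the RL recipe: GRPO samples a group of responses per query, scores each with the reward model (here, presumably the ParaS2SBench content/style scorer), and uses group-relative advantages. A minimal sketch of that advantage computation (standard GRPO, not the paper's code) is:

```python
import numpy as np

def grpo_advantages(rewards):
    """Standardize rewards within one query's group of sampled responses."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + 1e-8)

# e.g. combined content + style scores for 4 sampled spoken responses to one query
print(grpo_advantages([3.0, 4.5, 2.0, 4.0]))
```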
1) The construction of the benchmark is clear and systematic. It covers a variety of query domains and contrasting speaking styles, and the key part is the scenario-controlled queries, which keep the text content neutral by filtering out queries whose text alone would convey too much additional paralinguistic information. Judging from the appendix tables of query examples and prompts, the data curation reflects the authors' careful thought on this topic and looks very interesting.
2) In the experiments, besides validating the effectiveness of the RL framework and analyzing SFT, the paper poses many instructive questions and answers them with experiments, such as: how many annotations can RL save? Should more of the budget be invested in SFT or in the reward model? How well does the approach generalize to real speech? These are realistic and serious problems that S2S models need to handle, and they should be of continuing and broad interest for this research problem.
The main weakness of this paper, from my perspective, is the depth of the experimental section. The authors raise many good and inspiring questions, which is welcome; however, there does not seem to be any single experiment or analysis that conveys the authors' most important and critical point of view. This may weaken the paper's persuasiveness and confuse readers about the core message to take away from the experiments.
It would be much friendlier to the reader if the authors selected the most crucial experiments to introduce in greater depth, analyzed and concluded them clearly in Section 5 of the main paper, and put the other experimental details in the appendix. As it stands, some parts are not fully introduced due to the length limits of the article, and one needs to read the appendix to get the full results.
For the generalization-to-real-speech experiment, is there any human evaluation to compare with the benchmark score? This is quite crucial for this RL framework. |
Fully human-written |
|
ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces an innovative benchmark and reinforcement learning framework for paralinguistic-aware speech-to-speech (S2S) models, effectively addressing the current limitations in handling paralinguistic information such as emotion, tone, gender, and age. The authors design an automated data curation and speech synthesis pipeline, and leverage a reward model for efficient training and evaluation. Experimental results demonstrate that the RL approach achieves significantly better content and style appropriateness than conventional supervised fine-tuning, with much lower data and annotation costs. Overall, this work is forward-looking and practical, providing valuable tools and references for the development of paralinguistic-aware S2S models.
1. The proposed paralinguistic-aware S2S reinforcement learning framework is highly practical, effectively enhancing the model's ability to understand and generate paralinguistic information such as emotion and tone, which provides valuable tools and methods for the advancement of speech dialogue systems.
2. The experiments are thoroughly designed, covering various paralinguistic factors and realistic scenarios. The results comprehensively validate the significant improvements in content and style appropriateness achieved by the proposed method, making the findings highly convincing.
3. The paper is well-structured and clearly articulated, with a rigorous logical flow. It progresses coherently from problem background, method design, to experimental validation and result analysis, making it easy for readers to understand and follow.
1. The methodological innovation of the paper is limited, as it merely applies GRPO in a straightforward manner.
2. The presentation lacks intuitiveness; it is difficult to fully convey the paralinguistic features of audio through text alone. It would be better if there were a demo page or web-based showcase.
3. Some references are missing, such as [1]:
[1] Omnichat: Enhancing spoken dialogue systems with scalable synthetic data for diverse scenarios. arXiv preprint arXiv:2501.01384.
1. In your data curation and speech synthesis process, how do you ensure that the generated paralinguistic styles (such as emotion, age, gender, etc.) are sufficiently authentic and diverse?
2. After SFT and GRPO, does the model’s original capability decrease compared to the results reported for Kimi-Audio? |
Moderately AI-edited |
|
ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces ParaS2S, a reinforcement learning framework that enables speech-to-speech models to generate responses with appropriate content and speaking style. Using the new ParaS2SBench for automatic content and style evaluation, ParaS2S improves paralinguistic awareness by 11% over supervised fine-tuning while requiring far fewer annotations.
The motivation to evaluate paralinguistic responses in speech language models is both natural and important. This reviewer appreciates the authors’ effort in advancing research on this topic. Moreover, the overall workload presented in this paper appears substantial.
(1) The core contributions of this paper are somewhat unclear. It mainly includes two parts: a new benchmark for evaluating speech response style and content, and an alignment technique for tuning speech language models. However, each contribution appears incomplete on its own. The benchmarking part omits many relevant speech and speech-to-speech models, while the proposed alignment method lacks sufficient novelty and empirical validation.
(2) The citation format should follow the ICLR template by using \citep instead of \cite, as the current style blends citations into the text and reduces readability.
(3) Table 1 reports several numerical results to show that the evaluation aligns with human judgments, but the justification is not rigorous. The paper should clarify what criteria define “closeness” to human evaluation and why they are reasonable. For instance, in the Emotion S2S model, the gap between GPT and human evaluations does not appear negligible.
(4) The novelty of the alignment approach is limited. If the paper argues for the necessity of GRPO, this claim should be empirically supported, and comparisons with existing methods such as SpeechAlign are essential.
(5) The GRPO alignment is evaluated only on the Kimi-Audio base model. A more comprehensive study should include multiple base models to demonstrate that the proposed strategy generalizes beyond a specific setup.
In conclusion, while the paper presents a decent amount of work, it remains incomplete by publication standards. The authors are encouraged to focus on a single, well-developed contribution—either the benchmarking framework or the alignment technique.
(1) What is the primary contribution of this paper — the new benchmark or the proposed alignment strategy? The authors should clarify which aspect represents the core focus of the work.
(2) Why is the proposed alignment strategy not compared with SpeechAlign? The two methods appear quite similar, except for the adoption of the GRPO technique. Since SpeechAlign also employs DPO and evaluates comparable aspects of speech language models, a direct comparison is necessary to highlight the distinction and advantage of the proposed approach. |
Lightly AI-edited |
|
ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The study introduces a methodology for enabling Speech-to-Speech (S2S) language models to recognize and respond to critical paralinguistic elements, such as emotion, intonation, and speaker characteristics, which extends far beyond simple content transmission.
The authors present two core components:
* **ParaS2SBench:** An automated benchmark designed to evaluate how effectively S2S models align with both the **content** and **style** of an utterance.
* **ParaS2SAlign:** A learning framework that utilizes Reinforcement Learning (RL) to achieve model alignment directly at the waveform level.
The benchmark scores show a high correlation (>0.7) with human evaluations and the RL approach achieves a notable 11% performance improvement over Supervised Fine-Tuning (SFT), alongside a five-fold enhancement in label efficiency.
- The paper puts forward a novel benchmark and dataset, with a welcome commitment to their public release.
- The authors provide a valuable analysis of the respective impacts of RL and SFT on the modeling of non-verbal conversational aspects within the proposed framework.
* **On the Reward Model:** A point of consideration emerged regarding the reward model. I'm respectfully curious about the potential for it to be somewhat overfitted to the specific synthesis engines used for evaluation, namely the GPT-based TTS and CosyVoice. I would be interested to hear the authors' perspective on its generalization capabilities to other speech styles.
* **On Data Synthesis:** Additionally, as the audio corresponding to the evaluated scenarios appears to be entirely synthetic, a slight query arises regarding potential constraints on the diversity and complexity of the model's expressive output. I wonder if this might impact the model's ability to capture the full spectrum of nuances present in organic, human-to-human interaction.
1. **Readability:** I noticed a minor formatting point where the inconsistent use of parentheses for citations occasionally impacted readability. Clarifying this convention throughout the manuscript might be beneficial for readers.
2. **Data Composition:** Could the authors please specify the total number of distinct speakers represented in the training data and the ParaS2SBench benchmark, respectively?
3. **Confidence Intervals:** I would find it very helpful to see the 95% confidence intervals for the reported GPT and human evaluation scores, as this would further strengthen the statistical significance of the findings.
4. **Performance on Existing Capabilities:** Finally, a point of great interest is the trade-off with existing abilities. I would be grateful if the authors could provide an analysis of any performance degradation on foundational capabilities (e.g., as measured by VoiceBench) after the application of SFT and, subsequently, the full RL alignment process. |
Heavily AI-edited |