|
What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration? |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces MATHLENS, a benchmark designed to disentangle the subskills of multimodal reasoning—specifically perception, reasoning, and their integration—in the context of geometry problems.
The authors argue that existing benchmarks often rely on aggregate accuracy, which obscures the specific capacities where models fail or improve.
MATHLENS is built from symbolic specifications to generate aligned annotations, including visual diagrams, textual descriptions, perception probes, and multimodal questions, ensuring consistency and robustness.
Through controlled experiments on open multimodal models, the paper reveals that different training approaches have uneven effects on these subskills.
1. MATHLENS provides a rigorous framework for decomposing multimodal reasoning into perception, reasoning, and integration, addressing a gap in existing evaluations. The use of symbolic semantics ensures controlled and reproducible annotations.
2. The paper thoroughly evaluates multiple model families across diverse settings, including robustness tests with semantic diagram modifications. This allows for nuanced insights into how training strategies affect specific capacities.
3. MATHLENS demonstrates strong alignment with datasets like MathVista and MathVerse, enhancing its credibility and utility for the community. The release of data and tools promotes reproducibility and further research.
See Questions.
1. Have you explored preliminary strategies to improve integration? If so, what were the results? If not, what directions do you prioritize for future work?
2. MATHLENS relies on geometry problems, which have well-defined symbolic representations. For tasks with ambiguous symbolic mappings, how would you adapt MATHLENS’s decomposition framework to ensure consistent evaluation of perception, reasoning, and integration?
3. The diagram modifications test structural changes, but how would MATHLENS perform under real-world perturbations like occlusions or lighting variations? |
Moderately AI-edited |
|
What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration? |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the problem that aggregate accuracy metrics are insufficient for evaluating and understanding the progress of Multimodal Large Language Models (MLLMs) on complex reasoning tasks. By comparing a model's performance across perception probes, text-only reasoning, and the full multimodal question, the authors decompose errors into failures of perception, reasoning, or integration (defined as failing the joint task despite succeeding on perception and reasoning individually). The methodology also involves generating semantic variations of diagrams to test robustness.
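(For illustration only: a minimal sketch, assuming boolean per-problem outcomes on the aligned probes the review describes, of how such an error decomposition can be computed; the paper's actual protocol may differ.)

```python
# Hypothetical sketch of the perception / reasoning / integration decomposition,
# assuming boolean outcomes per problem; not the paper's code.
def decompose_error(passed_perception_probes: bool,
                    passed_text_only_reasoning: bool,
                    passed_multimodal_task: bool) -> str:
    if passed_multimodal_task:
        return "correct"
    if not passed_perception_probes:
        return "perception"   # required facts were not extracted from the diagram
    if not passed_text_only_reasoning:
        return "reasoning"    # facts available in text, inference still fails
    return "integration"      # both subskills succeed in isolation, joint task fails
```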
- Error analysis reveals RL shifts errors to integration, providing evidence-based guidance for future training, unlike aggregate benchmarks.
- Use of symbolic states ensures equivalence across modalities, supporting valid isolation of subskills with high diagnostic value.
- The evaluation of robustness using semantic-level diagram modifications, rather than just pixel-level noise.
- The primary analysis and all major findings are derived exclusively from the geometry domain. While this allows for tight control, it leaves the generalizability of the conclusions open.
- The accuracy of the error decomposition hinges on the assumption that the perception probes are exhaustive.
- The paper presents many empirical comparisons but does not report confidence intervals or use statistical tests to confirm the significance of the observed differences.
- Regarding the error decomposition, how did you ensure the set of perception probes for each problem was comprehensive? Is it possible that some errors classified as "integration" failures are in fact subtle perceptual failures not captured by the probes?
- What compute was used for evaluations, and how might nondeterminism affect API models? |
Lightly AI-edited |
|
What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration? |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
They introduce MATHLENS, a benchmark designed to disentangle the subskills of multimodal reasoning while preserving the complexity of textbook-style geometry problems. The benchmark separates performance into three components: Perception (extracting information from raw inputs), Reasoning (operating on available information), and Integration (selecting relevant perceptual evidence and applying it within reasoning).
1. The paper is clearly written with demonstrative figures.
2. The idea of separating the reasoning process of MLLMs is interesting.
3. The analysis of the results of different methods is detailed and sheds light on the inner workings of MLLM reasoning.
1. The perception probes test for a finite set of atomic facts. A model might correctly identify all the probed facts but fail to perceive another crucial, un-probed visual detail. This perceptual failure would be misclassified as an integration failure.
2. The primary benchmark, MATHLENS, is composed entirely of geometry problems. It cannot capture the full scope of MLLM reasoning. The add-on MATHLENS-GENERAL cannot maintain the same rigor.
See weaknesses. |
Fully human-written |
|
What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration? |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 2: reject
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a math benchmark to evaluate MLLMs through different lenses rather than a single one. Specifically, they present and make publicly available a dataset called MathLens for geometry to perform evaluation based on three aspects: perception, reasoning, and integration, enabling evaluation beyond a single accuracy score. The authors then performed several fine-tuning analyses and made these main observations: (1) RL boosts perception, (2) textual SFT indirectly strengthens perception through reflective reasoning, (3) integration is the weakest skill among all, and finally (4) robustness varies (e.g., RL vs. SFT).
1. Paper is well-written and easy to follow. The core idea is interesting and novel.
2. A public release of the dataset for the community is nice and will help future research.
3. Extensive experiments and analyses were provided. For instance, the robustness analysis investigates changes to the diagram (e.g., rotation).
1. Single data source is just very limited -- only FormalGeo-7K was used as the basis of the data. This makes it hard to trust the findings IMO (e.g., relation between RL and SFT).
2. The core analyses were performed using open-weight models, and it's unclear how the fine-tuning of closed-source models such as Gemini would make any difference.
3. Integration measurement is done indirectly -- this could be conflated with other latent failures, like context-length limitations.
1. As mentioned in the paper, integration is the main bottleneck in structured geometry. Is this true for real-world less-structured data as well?
2. Since integration is the weakest skill, what training or architectural changes can be made to improve it? |
Fully human-written |
|
Target Before You Perturb: Enhancing Locally Private Graph Learning via Task-Oriented Perturbation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper studies graph neural networks under local differential privacy and, for the first time, introduces a task-relevant optimization mechanism. The authors compare their approach with existing LDP methods that protect node features and demonstrate a better trade-off between privacy and utility.
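(For context only: a generic sketch of one classical (ε, δ)-LDP feature-perturbation step, the Gaussian mechanism with the standard noise calibration valid for ε < 1; the paper's actual perturbation mechanisms differ and are not reproduced here.)

```python
import numpy as np

# Classical Gaussian mechanism applied locally to a node feature vector.
# `sensitivity` is the L2 sensitivity of the (clipped) features; calibration valid for eps < 1.
def gaussian_ldp_perturb(x: np.ndarray, eps: float, delta: float,
                         sensitivity: float) -> np.ndarray:
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return x + np.random.normal(0.0, sigma, size=x.shape)
```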
1. In terms of novelty, the paper is the first to propose a multi-stage perturbation mechanism guided by task-relevant feature selection.
2. The proposed method achieves a superior privacy–utility trade-off compared to existing approaches.
3. The paper is well-structured, clearly written, and the experimental results are easy to follow.
1. The proposed method includes an additional server-side aggregation step that merges results from two rounds of perturbation, whereas the baselines do not. Therefore, it is unclear whether the observed improvement in the privacy–utility trade-off arises from the proposed LDP mechanism itself or from the aggregation process on the server.
2. The authors compute task-relevant features based on the first-round perturbed data and then perturb these features again in the second round. Intuitively, this means the most important features are perturbed twice, and the noise magnitude is larger than that of existing single-shot mechanisms. The authors should clarify why this design leads to higher utility rather than degradation.
3. The paper lacks a clear definition of what aspects are protected under LDP in the main text, only stating in experiments that feature privacy is considered. Since GNNs may involve protecting features, edges, or labels, the lack of explicit scope may cause confusion.
4. The paper does not evaluate resistance to feature inference attacks, which is an important aspect of verifying practical privacy protection. Such experiments are strongly recommended.
See the issues discussed in the “Weaknesses” section above. |
Lightly AI-edited |
|
Target Before You Perturb: Enhancing Locally Private Graph Learning via Task-Oriented Perturbation |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents a new locally private graph learning framework from a task-oriented graph learning perspective (TOGL). It contains three phases: locally private feature perturbation, task-relevant attribute analysis, and task-oriented private learning. Extensive experiments demonstrate TOGL's substantial utility improvements over existing baselines.
1. Well-structured and clearly written.
2. This paper emphasizes the urgent need to connect local differential privacy (LDP) with downstream tasks to achieve better utility, and empirically demonstrates its importance.
3. This paper provides fundamental theoretical proof and analysis, showing the correctness of its use of LDP.
1. This paper does not contribute to the LDP part, only designing a task-oriented attribute selecting mechanism in the server to benefit downstream tasks. Phase I is a one-time perturbation, no different from LPGNN (Sajadmanesh & Gatica-Perez, 2021).
2. The presentation of Phase III in Figure 2 is misleading. According to Algorithm 2, the selected attributes $S^*$ and hyperparameter $\rho$ do not directly affect the LDP, but utilize the LDP's post-processing invariance properties, ensuring strict privacy guarantees for subsequent processing.
3. There is no summary of task-oriented methods. Is LPGNN a task-oriented method?
- If not, why? And what special adjustments are needed for different tasks (node classification and link prediction) compared to the baselines?
- If it is, then the contribution of this paper will be diminished. Overall, the method in this paper is similar to LPGNN in its approach, as both utilize embedding and labels to constrain task performance.
4. The LDP mechanisms of PM, MB, and SW lack an explanation of the coefficient $\delta$, which is only described for the Gaussian mechanism.
5. The accuracy in Figure 6 was normalized, which may overemphasize the differences between methods. It is recommended to show actual ablation study results.
6. The interpretation in Figure 9 is weak, casting doubt on the method's utility. The results show that random feature selection achieves near-suboptimal results when $\rho$=0, indicating that random diversity is more helpful. However, when $\rho$=1, the algorithm relies entirely on task-relevant effects (approximately 30%), almost losing its inference ability for downstream tasks, indicating that this module contributes little.
7. Attack experiments are lacking to demonstrate that the method's privacy guarantees are not compromised to address the second challenge in line 88.
8. The lack of open-source code and insufficient reproducibility reduce the credibility of this work.
1. How is Equation 7 used in Algorithm 2 to represent the SMA mechanism?
2. Are the six LDP mechanisms implemented by changing the perturbation mechanism based on the LPGNN framework? Please clarify.
3. Do the LDP mechanisms share the same set of parameters in the same dataset? For example, $K$, $\rho$, etc.
4. Which mechanism is described as the state-of-the-art (SOTA) in Figures 4, 5, and 6?
5. Why can the analysis of parameter $K$ be an ablation study in Figure 7? |
Lightly AI-edited |
|
Target Before You Perturb: Enhancing Locally Private Graph Learning via Task-Oriented Perturbation |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents the Task-Oriented Graph Learning (TOGL) framework for locally private graph learning under Local Differential Privacy (LDP) constraints. The paper advocates that, instead of considering random dimensions of node attributes for perturbation, which provides privacy at a cost in utility, one should identify task-specific features. To this end, the authors introduce the notion of “target then perturb” for LDP. TOGL follows a three-stage pipeline: in the first stage, node features are perturbed locally using LDP to satisfy privacy requirements, and the server then denoises the perturbed features through neighborhood aggregation. In the second stage, the server identifies the top-m task-relevant feature dimensions from the denoised representations using either Fisher Discriminant Analysis (FDA) or Sparse Model Attribution (SMA). Finally, in the third stage, a second round of LDP perturbation is performed to balance privacy and utility. The authors have performed evaluations on 6 small- to medium-scale datasets in the main paper and 2 additional ones in the appendix.
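(Illustration: a sketch of one standard per-dimension Fisher score that a phase-two, top-m selection could use; the paper's exact FDA-based criterion may differ.)

```python
import numpy as np

# Per-dimension Fisher score: between-class scatter over within-class scatter,
# computed on (denoised) node representations X with labels y.
def fisher_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)

def top_m_dimensions(X: np.ndarray, y: np.ndarray, m: int) -> np.ndarray:
    return np.argsort(fisher_scores(X, y))[::-1][:m]
```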
1. The flow of the introduction, along with the motivation for the proposed framework, is good. Overall, the paper is well-motivated and nicely written.
2. The three-stage framework is intuitive and easy to understand.
3. TOGL demonstrates strong utility improvements compared to baseline LDP methods.
4. The method performs well across various GNN architectures.
1. I believe the authors should at least mention experiments on large-scale datasets and robustness evaluations in the main text.
2. The method relies on access to task-specific signals, which may not always be practical in real-world scenarios.
3. The motivations for using FDA and SMA as feature-selection modules should be discussed, along with an analysis of how sensitive the algorithm is to this choice.
4. Could the authors also provide fairness evaluations for the baselines in Table 6?
5. There should be a discussion on the selection of the hyperparameter $\rho$ for practical deployment.
6. The neighborhood aggregation used for denoising may be detrimental for heterophilic datasets.
See weakness. |
Fully human-written |
|
LDLCC: Label Distribution Learning-based Confidence Calibration for Crowdsourcing |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper focuses on the problem of confidence calibration in crowdsourcing. Traditional crowdsourcing label aggregation methods (such as MV, DS, and LAGNN) primarily focus on label accuracy but ignore the inconsistency between the aggregation confidence and the label accuracy rate (i.e., miscalibration). The paper defines confidence calibration for crowdsourcing and proposes LDLCC (Label Distribution Learning-based Confidence Calibration), a calibration method based on label distribution learning.
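(For reference, the standard supervised-learning calibration notions from Guo et al. (2017) that this line of work builds on, with $M$ confidence bins $B_m$ over $N$ instances:)

$$\Pr\!\big(\hat{Y}=Y \,\big|\, \hat{p}=p\big)=p \ \ \forall p\in[0,1], \qquad \mathrm{ECE}=\sum_{m=1}^{M}\frac{|B_m|}{N}\,\big|\mathrm{acc}(B_m)-\mathrm{conf}(B_m)\big|.$$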
1. Define the confidence calibration problem in the crowdsourcing scenario and point out its essential differences from supervised learning calibration
2. Introducing "Label Distribution Learning" into the calibration task is a novel idea
3. A two-stage strategy (sharpening + regression learning) is proposed, taking into account both noise robustness and distribution modeling capabilities
4. The evaluation covers not only ECE but also the actual gains for downstream tasks; it includes complete ablation experiments to verify the necessity of each module.
1. Why not use temperature-scaled softmax or other calibration output layers? More thorough argumentation is needed.
2. The actual crowdsourcing noise is non-Gaussian, non-independent, and category-dependent (such as some categories being easily confused). This assumption is too strong and weakens the theoretical support.
3. Lack of comparison with the latest calibration methods: Although compared with TS/LS/FL, these are classic methods in supervised learning. In recent years, unsupervised/weakly supervised calibration methods (such as Dirichlet-based calibration, Zong et al., AAAI 2024 - this article has been cited but not used as a baseline) should be included in the comparison.
4. The verification of downstream tasks is single: It was verified using only one noise correction method, CLNC, and was only demonstrated on the Music dataset. It should be extended to more downstream tasks (such as semi-supervised learning, robust training) and datasets.
5. Computational overhead and scalability were not reported: The time complexity of LDLCC is O(N²(M + log N)) (Appendix A), which may not be feasible on large-scale data. However, the experiment did not discuss running time or memory usage.
6. The literature review is somewhat insufficient: Although Zong et al. (2024) was mentioned, the essential differences between it and LDLCC were not discussed in depth (Zong uses the Dirichlet distribution to model and predict uncertainty, while LDLCC is based on regression learning label distribution). Recent works on crowdsourcing uncertainty modeling (such as Bayesian aggregation with uncertainty quantification) have not been cited.
Same as weaknesses |
Lightly AI-edited |
|
LDLCC: Label Distribution Learning-based Confidence Calibration for Crowdsourcing |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes LDLCC (Label Distribution Learning-based Confidence Calibration), a method designed to address the problem of confidence miscalibration in crowdsourced label aggregation. LDLCC introduces a two-stage framework—label refinement and label distribution learning—to improve calibration when ground-truth labels are unavailable. Experiments on multiple datasets and aggregation methods demonstrate improved calibration and downstream performance.
- The paper proposes a new and meaningful problem, extending calibration research to the crowdsourcing setting where true labels are not directly available.
- The authors design the LDLCC algorithm and verify its effectiveness across multiple benchmarks and label aggregation methods, including downstream validation, which enhances the practical relevance of the approach.
The algorithmic description is unclear and incomplete---the motivations for using specific techniques and key concepts are missing. For example, why is sample filtering and label refinement necessary? Why use the average confidence as a threshold? What is label distribution learning? Why does the translation invariance of softmax make the network overconfident? Besides, the validation for Q3 and Q4 is too limited, as it is only performed on the Music dataset with the MV method. More extensive experiments across datasets and aggregation methods would strengthen the reliability of the results.
Please refer to the weaknesses. |
Lightly AI-edited |
|
LDLCC: Label Distribution Learning-based Confidence Calibration for Crowdsourcing |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper tackles the issue of miscalibration in crowdsourced label aggregation. To address this, the authors formally define confidence calibration in the crowdsourcing setting and introduce a Label Distribution Learning-based Confidence Calibration (LDLCC) framework. Specifically,
- LDLCC first identifies high-confidence instances by combining confident learning with a $K$-nearest-neighbor consistency check, and refines their label distributions through a sharpening procedure.
- It then trains a regression network to learn calibrated label distributions by minimizing the mean squared error (MSE) between the predicted and refined distributions.
1. The paper studies the problem of miscalibration in the context of label aggregation for crowdsourcing.
2. The paper proposes LDLCC to address the problem.
3. The paper includes experimental results to validate the proposed method.
### Weaknesses, Detailed Comments, and Questions:
1. The mathematical formulation of calibration in Eqs. (1)–(2) is not rigorous. Specifically, the *probability* in Eq. (1) and the *expectation* in Eq. (2) are not defined with respect to any explicit probability measure or sample space. These expressions should be written explicitly to clarify the underlying stochastic model.
2. Weak connection to the crowdsourcing setup.
- Although the paper claims to “formally define confidence calibration for crowdsourcing,” the definitions of *perfect calibration* (Eq. 1) and *Expected Calibration Error (ECE)* (Eq. 2) are mathematically identical to those used in standard supervised learning (Guo et al., 2017).
- Furthermore, although Section 3 introduces the crowdsourced noisy labels $\mathbf{L}_i$, the subsequent sections, including the calibration formulation (Eqs. 1–2) and algorithm design (Sections 4.1–4.2, Algorithm 1), operate solely on the aggregated label distributions $\mathbf{P}_i$. The paper does not explain why the proposed framework is *specific* to crowdsourcing.
- In fact, if the crowdsourcing context were removed, most of the analysis and algorithmic structure would remain unchanged. Thus, the method does not exploit key properties of crowdsourced data such as label sparsity, annotator heterogeneity, or instance-dependent noise.
3. In Section 4.2 (Eqs. 11–13), the authors assume additive Gaussian noise $\epsilon\sim\mathcal{N}(0,\sigma^2\mathbf{I})$ to argue that “the effect of noise on the MSE loss is a fixed constant, which means that the MSE loss is relatively robust to noise.” This assumption is unrealistic in the context of crowdsourced labels:
- Real annotation noise is highly *instance- and annotator-dependent*, rather than i.i.d. Gaussian.
- The variable $\mathbf{P}_i$ itself results from an aggregation process, not a direct observation, so the additive noise model is conceptually inconsistent.
- Consequently, the claimed robustness of the MSE loss to noise lacks theoretical justification under the actual data-generating mechanism. (A short derivation of the constant-offset argument under the Gaussian assumption is given after this list.)
4. The three datasets (Music, LabelMe, Income) are small, and none represent modern large-scale crowdsourcing or complex perceptual labeling tasks (e.g., CIFAR-10H, AMT-based image classification).
Moreover, several experiments, particularly those related to Q4 in Sec. 5.1 and Sec. 5.3, report results only on the Music dataset, without explaining why the other datasets were excluded.
The experimental evidence therefore does not sufficiently support claims of robustness or generality.
5. While the paper targets an underexplored problem, the proposed LDLCC framework offers limited methodological novelty and does not contribute new theoretical insights or algorithmic principles beyond existing work.
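(Regarding weakness 3: the constant-offset claim holds only under zero-mean noise independent of the prediction; writing the regression target generically as $\mathbf{y}\in\mathbb{R}^d$,)

$$\mathbb{E}_{\epsilon}\big[\|\mathbf{y}+\epsilon-f(\mathbf{x})\|^2\big]=\|\mathbf{y}-f(\mathbf{x})\|^2+2\,\mathbb{E}[\epsilon]^{\top}\big(\mathbf{y}-f(\mathbf{x})\big)+\mathbb{E}\big[\|\epsilon\|^2\big]=\|\mathbf{y}-f(\mathbf{x})\|^2+\sigma^2 d,$$

which is exactly what breaks when $\epsilon$ is instance- or annotator-dependent, since the cross term no longer vanishes.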
See above. |
Fully AI-generated |
|
Intention-Conditioned Flow Occupancy Models |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work leverages recent advances in generative modeling, specifically flow-matching algorithms, to tackle the pre-training and fine-tuning paradigm for RL models. This is done by learning occupancy models over the state space using flow matching (as done in Farebrother et al. (2025)) while taking into account the fact that large pretraining datasets usually contain a mixture of intentions, since they are collected by different users. The proposed approach InFOM explicitly models the intention as a latent variable using a VAE, which is used to condition the occupancy model. During fine-tuning, the method generates intention-conditioned Monte-Carlo estimates of the critic from sampled future states and then distills them into a single critic via an upper-expectile loss. The authors show that across 36 state-based and 4 image-based tasks, InFOM matches or outperforms strong pretrain-and-finetune baselines, reporting a 1.8× median return gain and a 36% success-rate increase, with particularly large gains on harder manipulation domains.
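(A schematic of the fine-tuning step described above, in our notation rather than the paper's: with $K$ future states sampled from the intention-conditioned occupancy model $p_\theta$ and a learned reward $r_\phi$, the Monte-Carlo critic estimate and an upper-expectile distillation loss take roughly the form)

$$\hat{Q}_z(s,a)\approx\frac{1}{1-\gamma}\cdot\frac{1}{K}\sum_{k=1}^{K} r_\phi(s_k),\quad s_k\sim p_\theta(\cdot\mid s,a,z),\qquad \mathcal{L}(\psi)=\mathbb{E}_{z}\Big[\big|\tau-\mathbb{1}\{u<0\}\big|\,u^{2}\Big],\ \ u=\hat{Q}_z(s,a)-Q_\psi(s,a),$$

where $\tau>0.5$ pushes the distilled critic $Q_\psi$ toward an upper expectile over sampled intentions, an implicit form of generalized policy improvement.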
* The paper is well written and easy to follow
* The experimental results showing the proposed approach outperforms the baselines in most of the case with the gap widening on more difficult tasks
* A large number of baselines have been included and sufficient experimental detail is provided.
* I really like the qualitative analysis of the learnt latent intention model.
The proposed approach builds on top of Farebrother et al. (2025), which uses flow matching to learn occupancy models, by explicitly modeling intention as a latent variable. An ablation or baseline comparing downstream performance with intention conditioning of the occupancy model versus without it seems to be missing.
Is there prior work on using other generative modeling approaches, specifically diffusion models, for learning occupancy models? If yes, could the authors provide some intuition on the comparison between using diffusion versus flow matching for learning occupancy models? |
Fully human-written |
|
Intention-Conditioned Flow Occupancy Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes InFOM, a method for unsupervised pre-training in offline reinforcement learning that learns a latent-variable generative model of long-horizon state occupancy conditioned on inferred user intentions. By combining variational inference with flow matching, the authors model the discounted occupancy measure. During fine-tuning, they use implicit generalized policy improvement (GPI) via expectile distillation. The method shows strong empirical gains over prior pre-training approaches.
1. Novel integration of flow matching with intention-aware occupancy modeling.
2. Strong empirical results: consistent gains across diverse domains, including challenging sparse-reward and vision-based tasks.
3. Well-motivated design choices, especially the use of SARSA-style bootstrapping and expectile-based implicit GPI to stabilize training.
1. Choice of Prior over Intentions Lacks Justification. The paper assumes a standard Gaussian prior $p(z)=N(0,I)$ for the latent intention $z$. While common in VAEs, this choice may be suboptimal for modeling user intentions, which are often discrete or categorical (e.g., “pick”, “place”, “navigate to A”).
Suggestion: The authors should consider discrete latent variables (e.g., via Gumbel-Softmax) to improve interpretability and align better with the semantics of “intentions” in multi-task or goal-directed settings; a minimal sketch of the discrete alternative is given after this list. Ablations comparing continuous vs. discrete priors would strengthen the modeling claim. The current Gaussian prior may encourage over-smoothed representations, potentially conflating distinct behavioral modes (as hinted in Fig. 4, where InFOM already shows better clustering, but could it be sharper with discrete codes?).
2. Incorrect ELBO Formulation: The parameter $\lambda$ should be at least one, not arbitrary. In equation (3), the coefficient $\lambda$ must satisfy $\lambda \geq 1$ to guarantee that the derived expression is a lower bound.
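(A minimal PyTorch sketch of the discrete-intention alternative suggested in weakness 1; the intention count and feature dimension are assumptions, and InFOM itself uses a continuous Gaussian latent.)

```python
import torch
import torch.nn.functional as F

num_intentions = 16                                # assumed number of intention codes
encoder = torch.nn.Linear(128, num_intentions)     # 128-dim transition features (assumed)

features = torch.randn(32, 128)                    # batch of encoded (s, a, s') pairs
logits = encoder(features)
z = F.gumbel_softmax(logits, tau=1.0, hard=True)   # differentiable one-hot-like intention samples
```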
Q1: How well does the flow occupancy model generalize to out-of-distribution intentions?
Q2: Is there a risk that the generative occupancy model produces unrealistic futures for novel $z$, harming policy learning? |
Heavily AI-edited |
|
Intention-Conditioned Flow Occupancy Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The work provides a way to pre-train a general RL occupancy model from an unlabeled dataset, using a technique close to the recently introduced Temporal-Difference Flows combined with variational inference over the "intention" latent variable, which helps learn the final occupancy model. Given the pretrained flow occupancy model, the authors propose a way to train a new policy and its Q-value using generalized policy improvement and a Q-value estimate obtained from a reward model and an occupancy-measure model. The authors provide extensive experimental validation of their pre- and post-training pipeline across various robotics benchmarks, as well as various ablation studies on the algorithmic choices.
- The paper is very well-written and describes the whole training pipeline in great detail.
- The approach bridges the well-developed flow-matching literature with an RL pretraining and subsequent fine-tuning, which distinguishes this approach from prior work on TD-flows.
- The approach of modeling the policy intentions as latent variables seems to be fresh and interesting, since it doesn't require any additional supervision, and distinguishes this approach from Multi-Task RL.
- Strong performance on many benchmarks;
- The final algorithm combines four (4) neural networks and, at first glance, looks extremely complicated, which can be prone to the accumulation of errors;
- In Appendix C.1, in the derivation of the ELBO loss, does inequality (c) (line 1155) require $\lambda \geq 1$ to hold?
- Did the newly trained reward, Q-value, and policy models use the features learnt by the occupancy model?
- The improved effect of using the behavior cloning suggests that the dataset consists of high-quality data that is worth being "cloned". What is the issue if the dataset consists only of highly suboptimal but diverse rewards? And how strongly does the diversity of the dataset influence the pre-training performance?
- What are the results on the occupancy measure pretraining performance separately, compared to a standard TD-flows without intention decoding? |
Fully human-written |
|
Intention-Conditioned Flow Occupancy Models |
Soundness: 3: good
Presentation: 1: poor
Contribution: 3: good
Rating: 2: reject
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The present work leverages flow-matching models to predict trajectories of future states for RL tasks to help the actor find better policies.
Extensive experiments on gym and robotic manipulation benchmarks show promising results.
- The idea of leveraging "intent" for multi-step generation is appealing, especially in the multi-agent setting or for iterative refinement with human input.
- Extensive experiments show practicability of the method.
Major
After several readings, I fail to understand how the algorithm works. It should not be that hard to understand:
- flow matching to predict future states is straightforward given trajectories, either conditioned on or guided by observed states.
- the predicted latent variable has to be used somehow by the policy, either by boosting a pre-trained one or by training one from scratch.
However, the main body of the paper discusses Q-functions which in the end are conditioned on the intention. This leads the reader to infer that the intention variable is inferred and fed to the agent along the actor's trajectories.
My wild guess: since the policy is mainly trained via a behavioral cloning loss (see Algorithm 2 in the appendix), which is barely mentioned in the main body, using the predicted state-actions, the flow matching is merely doing data augmentation on the offline dataset.
I thus think that the abstract is misleading; Sections 2 and 3 should be rewritten entirely to focus on the method rather than the intention (ironically). Algorithm 2 is rather involved and should be explained fully in the main body of the paper, and all losses should be introduced clearly.
Minor
1- Differentiating through an ODE solver is associated with a citation to Park et al. 2025b. However, this is a problem already accounted for in the normalizing flow literature, see FFJORD https://arxiv.org/abs/1810.01367 or even Neural ODE https://arxiv.org/abs/1806.07366. Considering that Ricky Chen is a major contributor to both the flow matching and normalizing flow literature, it is odd to cite Park et al. for this aspect.
2- "The deterministic nature of ODEs equips flow-matching methods with simpler learning objectives and faster inference speed than denoising diffusion models (Lipman et al., 2023; 2024; Park et al., 2025b)" is problematic for two reasons. I doubt that the training objective is simpler for flow matching compared to diffusion models, if by that the authors mean the target function to learn or even the numerical stability of the loss. It is true however that the numerical stability of the *inference* is better for FM compared to DDPM. It is also true that the MSE-based loss of FM is more numerically stable and lighter than the KL-based loss of ODE-based normalizing flows such as FFJORD, see for instance https://arxiv.org/abs/2107.07232. The second reason is that Park et al. 2025b has little to do with the statement.
1- How does the inference work? No algorithm is provided.
2- How do you intend to use this method for pre-trained policies? |
Fully human-written |
|
REAR: Scalable Test-time Preference Realignment through Reward Decomposition |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes REAR, a test-time method for aligning large language models with user preferences without further training. The key idea is to decompose the implicit reward into question-related and preference-related components and to rescale the preference term at inference time. The authors show that REAR can be computed as a linear combination of log-probabilities and integrated into existing test-time scaling (TTS) algorithms such as best-of-N sampling and diverse verifier tree search (DVTS). Experiments on several preference-alignment benchmarks (PrefEval, Multifaceted, PingPong) demonstrate modest gains compared to existing TTS and test-time alignment baselines.
- The paper is well written and clearly structured, with theoretical derivations and reasonable experimental design.
- The proposed formulation is efficient and lightweight, requiring no extra training or reward models.
- The idea of using log-probabilities or policy scores for implicit reward shaping at test time has been explored in prior work. The contribution mainly reformulates known ideas in a slightly different analytical framing. The proposed reward decomposition essentially feeds the reward model (or policy probability) with different segments of the same input (question vs. question + preference) to derive a rescaled score. This approach, while intuitive, lacks genuine novelty or theoretical depth.
- In the second paragraph of the introduction, the authors state that "However, existing TTS research has predominantly focused on domains such as mathematics and coding". However, test-time alignment has long been studied, even before the prevalence of TTS, including papers such as:
1. Args: Alignment as reward-guided search.
2. Inference-time language model alignment via integrated value guidance.
3. Fast Best-of-N Decoding via Speculative Rejection.
- Performance gains over baselines are modest, often within small margins and without strong qualitative differentiation.
- The method remains heuristic: while mathematically presented as a decomposition, it does not provide clear theoretical or empirical evidence that the “preference” and “question” components can be distinctly separated in practice.
See weaknesses. |
Fully AI-generated |
|
REAR: Scalable Test-time Preference Realignment through Reward Decomposition |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces REAR, a framework for aligning with user preferences at test time, specifically targeting subjective and open-ended tasks. REAR decomposes the reward function into question-related and preference-related components, enabling dynamic re-weighting of user preference without retraining. The authors show that REAR can be efficiently formulated as a linear combination of policy probabilities, making it tractable and compatible with TTS approaches such as best-of-N (BoN) sampling and Diverse Verifier Tree Search (DVTS).
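(To make the TTS compatibility concrete, a minimal best-of-N reranking sketch; `generate` and `rear_score` are placeholders, not the paper's API.)

```python
# Sample n candidate responses and keep the one with the highest REAR-style score.
def best_of_n(question, preference, generate, rear_score, n=8):
    candidates = [generate(question, preference) for _ in range(n)]
    return max(candidates, key=lambda y: rear_score(y, question, preference))
```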
1. REAR is plug-and-play, and requires no external models or retraining, making it attractive for deployment.
2. REAR outperforms both token-level preference-alignment baselines (e.g., Amulet, Linear Alignment) and TTS methods.
3. The method is grounded in a reinforcement learning formulation and provides detailed proofs.
1. Performance of REAR depends on the choice of the λ parameter, which may require tuning for different questions or preferences.
2. If user preferences are not expressed in words (for example, are only implicit, behavioral, or external), the REAR method as proposed would not function without modification.
3. Heavy reliance on LLM-based evaluation for some tasks raises concerns about evaluation robustness and objectivity.
1. Hyperparameter selection of λ: How robust is the method to the choice of λ across different tasks and datasets? Is there a principled or automated way to select λ at inference time, possibly in the absence of validation data?
2. It is insufficient to use only the "helpfulness" score to measure general response quality in the section "Analysis on Generated Responses"; moreover, "helpfulness" can itself be one of the preferences to consider. |
Fully human-written |
|
REAR: Scalable Test-time Preference Realignment through Reward Decomposition |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes an idea of reward decomposition that is used in test-time inference for the personalized preference task. The idea is simple: decompose the whole reward into a question-related reward and a preference-related reward. Then, the authors use the policy probability $\pi(a|s)$ as a proxy for the reward $r(s,a)$.
Based on this reward decomposition, the authors use it in two test-time algorithms, best-of-N and DVTS. Experiments demonstrate the effectiveness and efficiency of their reward design.
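(A hedged reconstruction of the score's shape implied by this description, not necessarily the paper's exact formula: the question-related part is proxied by $\log\pi(y\mid q)$, the preference-related part by the increment the stated preference $p$ adds, and only the latter is rescaled,)

$$r_{\text{REAR}}(y\mid q,p)\;\approx\;\log\pi(y\mid q)\;+\;\lambda\Big(\log\pi(y\mid q,p)-\log\pi(y\mid q)\Big),$$

with $\lambda$ controlling how strongly the stated preference is weighted at test time.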
The reward decomposition strategy is simple and easy to understand, while the effectiveness is demonstrated by the experimental results.
1. The authors tried to use some mathematical derivation to show the depth of their approach; however, I feel these components were not well stated. For example: From (5) to (7), it feels like nothing substantial is explained. Lemma 3.1 also appears quite abrupt — it introduces a seemingly fancy formula, but its meaning is unclear, and in the end, everything just circles back to (7). Note that Lemma 3.1 is merely an intermediate result and does not offer any theoretical guarantee. It only shows that your policy is an optimizer of a certain expression, making the interpretation of this relation crucial. However, I don't find it particularly insightful or helpful for understanding your algorithm’s design. In fact, it ultimately leads to equation (7), which is equivalent to (5).
2. What truly matters is Lemma 3.2, as it informs the reader how to compute the reward. However, the authors merely state that the reward can somehow be replaced by the policy probability. I believe this is a critical step in the algorithm’s design, yet the paper provides no intuition or explanation for this substitution. Although proofs are included in the appendix, they also fail to offer any meaningful intuition.
3. The presentation could be improved for better clarity and coherence. For example:
(a) What is the REAR score, and how is it defined? (I can roughly infer its meaning, but the paper should state it explicitly.)
(b) What is $\hat{r}_{REAR}$ in Lemma 3.2?
(c) In Proposition 2.1 and several other places, you should distinguish between the symbols '=' (equality) and ':= / \triangleq' (definition). This is especially important when your notation deviates from standard conventions. For instance, in Equation (3), your Q function is not the one that I learned. If you want to define a new Q function (or the soft Q-function, as you call it), you should use a definition symbol rather than the equality sign.
4. After analyzing the paper, I find the novelty quite limited. It decomposes the reward into two reward terms, replaces them with two policy probability terms, and uses the result as guidance in two common test-time inference methods. The mathematical derivations make little sense and sometimes disrupt the reading flow of the paper.
see weakness. |
Fully human-written |
|
REAR: Scalable Test-time Preference Realignment through Reward Decomposition |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents a test-time scaling method for preference alignment in LLMs in the case of non-verifiable rewards. The paper makes the assumption that the preference is specified in-context. The authors decompose rewards into question-related and preference-related components, then derive a score based on policy probabilities that can be integrated with best-of-N sampling and DVTS. While the core idea has merit, the paper suffers from significant theoretical gaps, questionable experimental design, and overclaimed contributions.
1. The "realignment" framing is intuitive: base models have implicit preferences from training that may not match specific user needs.
2. The paper correctly identifies that test-time scaling (TTS) has been limited to verifiable domains (math, coding) and extending it to subjective preference alignment is a worthwhile research direction.
1. What is α? Is it:
- A task-specific constant?
- A property of how the model was trained?
2. This paper would have been much easier to understand if it were presented as "Test-Time Preference Alignment via Policy Interpolation". Overcomplicating a simple method doesn't add value to the paper.
3. Experimental Design Lacks Statistical Rigor. Statistical significance of the results is not reported.
Please see weaknesses. |
Fully human-written |
|
Interactive Agents to Overcome Underspecificity in Software Engineering |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper investigates how large language model (LLM) agents handle underspecified instructions, where bug reports or feature requests omit critical details. Building on the SWE-Bench Verified dataset, the authors include three settings: (1) Full, where complete issue descriptions are given; (2) Hidden, where GPT-4o-generated summaries remove key information; and (3) Interaction, where agents can query a simulated user (GPT-4o) that has full task details under the Hidden setting. The study evaluates four models across three capabilities: detecting missing information, asking clarification questions, and leveraging answers to solve tasks. Results show interaction boosts success rates by up to 15.4% over the non-interactive Hidden setting, though performance still lags behind fully specified inputs. Most models fail to detect underspecificity without explicit prompting, with only Claude Sonnet 3.5 achieving 84% accuracy under moderate encouragement, while Claude and DeepSeek ask more specific, information-rich questions and Llama tends to pose generic ones.
1. The paper explores a critical issue with current LLMs: they typically cannot recognize underspecificity in user queries.
2. The paper clearly defines underspecificity as “missing information that prevents an expert from producing a correct fix,” grounding it in the SWE-Bench Verified rubric rather than using vague notions of ambiguity
3. The study divides performance into three measurable capabilities: 1) detecting underspecificity, 2) asking targeted questions, and 3) leveraging responses, which enables more diagnostic evaluation rather than a single overall score
4. Resolve-rate improvements are validated using Wilcoxon signed-rank tests to confirm significance across models.
1. The paper admits that naturally underspecified GitHub issues often still contain concrete technical cues (error messages, file references, conversational fragments), whereas the generated summaries mainly remove details, which may exaggerate the severity of underspecificity and may bias the task toward “missing vital context” rather than “ambiguous intent.”
2. The paper mentions using the OpenHands agent environment but gives minimal explanation of how the agent framework is structured or how its components (planning, editing, execution, and interaction) collaborate.
3. In Section 3.3, the discussion of navigational gains lacks clarity: what counts as navigational information, why it matters for task success, and whether such behavior mirrors how real developers seek information.
4. The simulated user based on GPT-4o is insufficiently validated; the paper provides no evidence that GPT-4o’s responses align with real-user behaviors or communication patterns. There is also a lack of analysis of the correctness of the responses to the clarification questions generated during the interaction setting.
5. In Section 5, information gain is defined as 1 – cosine similarity between the summarized task and the cumulative knowledge after interaction. This metric may overestimate improvement, as asking many irrelevant questions could still yield a high score. Would it be more accurate to calculate the score between the fully specified knowledge and the cumulative knowledge after interaction?
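(A minimal sketch of the information-gain metric as described in weakness 5; `embed` stands for any sentence-embedding function and is an assumption, not the paper's choice.)

```python
import numpy as np

def information_gain(summary_text: str, cumulative_text: str, embed) -> float:
    a, b = embed(summary_text), embed(cumulative_text)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cos   # 1 - cosine similarity, as defined in Section 5 of the paper
```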
1. How is the agent framework structured?
2. What exactly constitutes navigational information, why is it important for task success, and does this behavior reflect how real developers seek information?
3. How well does the simulated GPT-4o user align with real-user behaviors and response patterns, and how accurate are its answers to clarification questions during interaction?
4. Would it be more appropriate to compute information gain between the fully specified knowledge and the cumulative knowledge after interaction, rather than using the summarized task as the reference? |
Lightly AI-edited |
|
Interactive Agents to Overcome Underspecificity in Software Engineering |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper evaluates how well LLM-based agents handle underspecified instructions in software engineering tasks. Using SWE-Bench Verified as a foundation, the authors create synthetic underspecified versions of GitHub issues and test whether models can (1) detect missing information, (2) ask effective clarifying questions, and (3) leverage interactions to solve tasks. They evaluate four models (Claude Sonnet 3.5, Claude Haiku 3.5, Llama 3.1 70B, Deepseek-v2) across three settings: fully specified issues, underspecified issues without interaction, and underspecified issues with a simulated user proxy.
1. The paper addresses a practically important problem. Real-world task descriptions are often incomplete, and understanding how agents handle this is valuable.
2. The experimental design is generally rigorous.
1. The most significant weakness is the lack of human validation for the synthetic underspecified issues. The authors use GPT-4o to generate summaries but provide no evidence that these summaries would actually prevent human experts from solving the tasks. Are the findings representative of real underspecification?
2. The classification of missing information into only "informational" and "navigational" details is overly simplistic. The authors mention "multiple, interdependent gaps" in real tasks but do not provide a formal taxonomy or analyze which types of underspecification are most challenging.
You cite several papers on ambiguity resolution and clarification questions, but don't compare against them. Do you think those methods can be applied to the datasets for comparison?
Interactive approaches require multiple API calls, longer execution times, and user attention. Can you provide cost analysis? |
Lightly AI-edited |
|
Interactive Agents to Overcome Underspecificity in Software Engineering |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper asks how well agents perform when faced with an underspecified or ambiguous question, where the ideal solution/user experience would require eliciting additional information from the user. The paper repurposes an existing benchmark, SWE-bench, and turns it into interactive/hidden-information benchmarks by rephrasing and processing the inputs with LLMs. The paper asks three well-formulated research questions, which are in turn answered.
1. In my view, RQ1 and RQ2 are really interesting and well-scoped. I appreciate the experimental design, which attempts to build comparable evaluations across "Hidden", "Interaction" and "Full" settings. The primary results in Figure 3 are super interesting, and I believe potentially very influential in the field of software-engineering evaluations. However, I would really like to see a more comprehensive set of models here.
2. The construction of the dataset is well-described and I appreciate the careful analysis in section 2.1 that elicits qualitative differences between human-written underspecified problem statements and the synthetically rewritten issues.
1. Outdated models: Both proprietary (Claude-3.5) and open-source models (DeepSeek-v2 and Llama-70b) are unfortunately rather behind the state of the art, given the rate of progress in recent months. Just for Claude models alone, we've seen Sonnet-3.7, Sonnet-4, and Sonnet-4.5 in the meantime. For this paper to be relevant for a conference presentation, I fear that it would really require updated results from more up-to-date frontier models. This would also make the claims around open-weight vs. proprietary models more relevant. In fact, I am intrigued whether the type of "interaction" gap in the paper still remains with the current generation of models.
2. I find the results and methodology for RQ3 "Can the LLM generate meaningful and targeted clarification questions?" slightly underwhelming, since the main message seems to be that Llama 3.1 70b is not very good at question answering. The evaluation metrics do not really seem to be able to reveal fine-grained differences in question-asking abilities. The qualitative analysis is nice, but in my opinion lacks comprehensive documentation of full problem statements and agent traces featured in the appendix.
3. The paper relies quite heavily on SWE-bench for the original problem statements, which is heavily targeted by model creators. I think an analysis of how the results would transfer to more niche software engineering settings would be very interesting.
> We did not evaluate on naturally underspecified SWE-Bench examples because they lack the paired ground truth (complete specifications) necessary for causal measurement of interaction impact.
Would it be possible to infer the complete specification from the known `gold_patch` and `test_patch`? How would this approach compare to the chosen approach in the paper?
> However, there are naturally occurring underspecified issues that are similarly vague as well (django_django-13952, django_django-15744, pytest-dev_pytest-7283, sphinx-doc_sphinx-9467, sympy_sympy-12977 are some specific examples)
I would encourage the authors to feature and discuss these examples in the appendix, and simply refer to the appendix here. This is a rather distracting amount of information for page 3. |
Fully human-written |
|
Similarity-Constrained Reweighting for Complex Query Answering on Knowledge Graphs |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a new extension to complex query answering (CQA) over knowledge graphs (KGs), which it names similarity-constrained CQA (SimCQA). The similarity constraint can be applied to any variable within a query. To address this new problem, the paper proposes SCORE (Similarity-COnstrained REweighting), which works via a logit-space reweighting mechanism that introduces only two new hyperparameters.
1. It is praiseworthy to introduce the novel generalization of similarity constraints.
Notably, the SimCQA extension allows for similarity constraints on intermediate variables. Overall, the paper addresses a realistic but previously unstudied case in CQA.
2. SCORE has good interpretability. Unlike black-box neural methods, SCORE’s update mechanism is transparent and traceable to individual preference contributions.
3. The methodology of SCORE, such as its logit-space reweighting mechanism, is generally easy to comprehend and follow; the paper also provides some theoretical results to show its soundness.
4. The code and datasets are provided.
1. The setting of the similarity function is far too simple. The paper just uses a binary True/False classification to determine whether one entity is similar to another in the problem setting. This is quite simple, but a bigger problem is that the paper does not explain very clearly how the ground truth of similarity is decided. To my understanding, the paper uses other answers from the same query as ground truth, which can be very problematic.
2. The performance is highly dependent on the backbone model, since the method only introduces two new hyperparameters. I also found that the experimental performance falls behind NQR in pairwise accuracy and only slightly outperforms very simple baselines like MeanCosine. Therefore, I doubt the effectiveness of SCORE.
Perhaps using numerical attributes (which are provided in NELL and other datasets) instead of Boolean similar/not-similar labels would be a better alternative. Have you considered that? |
Fully human-written |
|
Similarity-Constrained Reweighting for Complex Query Answering on Knowledge Graphs |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces a method for complex query answering (CQA) on knowledge graphs, where entity similarity constraints are exploited to guide the solver towards solutions that are consistent with such constraints. In particular, the method extends a previous one in order to exploit an entity similarity constraint defined on arbitrary variables, rather than on only the answer entity (also called target). While the method is rather simple, it turns out that it can achieve a better CQA accuracy with little computational overhead.
I have found the writing and the plots to be extremely clear. The writing slowly introduces concepts when they are needed, with many examples and sufficiently detailed notation. Overall, I have also appreciated the simplicity of the proposed method, as well as the execution of the experiments. From a quick look, it looks like the code can help reproduce all the presented results.
Although the CQA benchmarks used in the paper are particularly established in the community, some queries can suffer from link leakage from the training set. For instance, most 2p complex queries can actually be decomposed into the simplest task--link prediction (1p queries)--if one also considers the training triples at test time. This means that ranking metrics for certain query types are actually inflated. Recently, [A] showed this problem and proposed a new set of much more challenging CQA benchmarks, where all triples in a query are missing from the observed knowledge graph and therefore must be predicted. Evaluating the baselines and the proposed method on these other recent benchmarks could strengthen the value of the obtained empirical results. Although I do not expect a huge difference w.r.t. the presented results, I suggest the authors evaluate their method on these other datasets as well.
[A] C. Gregucci, B. Xiong, D. Hernández, L. Loconte, P. Minervini, S. Staab, A. Vergari. Is Complex Query Answering Really Complex? ICML 2025.
- The definition of similarity-constrained complex queries assumes that there exists a single similarity constraint on one of the free variables (e.g., see Eq. 6). Do you think this framework can be further extended to a setting where several of the free variables take part in a similarity constraint?
- What is the "unconstrained" baseline reported in Figure 2? Is it evaluated on the same SimCQA dataset? I do not understand why the unconstrained baseline performs better than the NQR baseline, which instead takes the similarity constraints into account. Could you please clarify this aspect? |
Fully human-written |
|
Similarity-Constrained Reweighting for Complex Query Answering on Knowledge Graphs |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a new method for complex query answering. Unlike traditional settings, it focuses on scenarios where similarity constraints are applied to either intermediate nodes or answer nodes within complex queries. The key idea is to represent the potential answers of each variable as fuzzy sets, and then perform Similarity-Constrained Reweighting to adjust these fuzzy vectors accordingly. Overall, the method is conceptually simple yet addresses an important and underexplored problem.
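To make the logit-space reweighting idea concrete, here is a minimal hypothetical sketch of how a similarity-constrained adjustment of a fuzzy answer vector might look. The function name, the binary preference mask, and the bonus/penalty rule are my own assumptions for illustration, not the paper's actual SCORE update.

```python
import numpy as np

def reweight_fuzzy_vector(logits, preference_mask, alpha=2.0, beta=1.0):
    """Hypothetical similarity-constrained reweighting in logit space.

    logits:          pre-sigmoid membership scores of every candidate entity
                     in the fuzzy set of one variable
    preference_mask: 1.0 for entities judged similar to the preference set,
                     0.0 otherwise (a stand-in for the binary similar /
                     not-similar signal used in the paper)
    alpha, beta:     the "two new hyperparameters", interpreted here as a
                     bonus strength and a penalty strength
    """
    adjusted = logits + alpha * preference_mask - beta * (1.0 - preference_mask)
    return 1.0 / (1.0 + np.exp(-adjusted))  # squash back to [0, 1] memberships

# Toy example: five candidate entities, the 2nd and 4th satisfy the constraint.
memberships = reweight_fuzzy_vector(
    np.array([2.0, 0.5, -1.0, 0.0, 1.5]),
    np.array([0.0, 1.0, 0.0, 1.0, 0.0]),
)
print(memberships)
```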
The paper is clearly written and easy to follow. The writing quality is excellent, and the authors address a novel and meaningful problem—introducing similarity constraints on both intermediate variable nodes and answer nodes in complex query answering.
The inclusion of theoretical analysis enhances the soundness and credibility of the proposed algorithm, while the experimental results convincingly demonstrate its effectiveness across different datasets and settings.
The core idea of the paper is simple and intuitive, with the primary contribution being the introduction of the Similarity-Constrained Reweighting mechanism. However, this contribution appears somewhat limited in scope.
The experimental section requires significant improvement for better clarity and completeness. First, although the paper introduces a new problem setting, it does not provide sufficient details on how existing methods (such as SCORE, NQR, and other baselines) were adapted to this new domain. This information is essential and should be described explicitly.
Second, the presentation of experiments is not very clear. Both the experimental setup and the performance analysis are difficult to follow. The section would benefit from a thorough revision to improve organization, explanation, and readability.
no |
Heavily AI-edited |
|
Similarity-Constrained Reweighting for Complex Query Answering on Knowledge Graphs |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a new CQA task that considers soft constraints over any variable, which extends previous works that only constrain the free variable. To address this new setting, the paper proposes a new re-weighting method that interferes with the symbolic search based on the new constraints; the method is lightweight, has linear complexity, and involves only two parameters. The experiments show that this new method has advantages over the previous method that considers only the free variable and over a naive baseline.
1. Propose a new setting for CQA: consider the additional soft constraints over intermediate variables.
2. Propose a light and effective method to rerank the answer sets based on the new soft constraints; the results are good.
1. In my opinion, for a paper proposing a new task, it is key to introduce the motivation and application of the new setting. After reading this manuscript, I know that soft constraints over intermediate variables are new compared with existing methods. However, I do not know why this new setting is needed. In real situations, it is hard to prepare such a preference set for re-weighting.
2. The presentation of the method is not clear. Though this manuscript provides the details of the re-weighting operation, it lacks an explanation of how this operation is integrated with the symbolic search process. Is the re-weighting operation applied at each update of the fuzzy vectors?
3. The compared baselines are limited and are all symbolic search methods. It is important to include more baselines for a new task. Query embedding methods are a mainstream line of work for complex query answering, so I suggest the authors include some classic query embedding methods.
Typos:
I found some potential typos in this manuscript and list them below:
1. Equation 1 in Line 147: this formula lacks the existential quantifier $\exists$ required by the semantics of the natural language question.
2. Equation 3 in Line 161: the expression $\exists v_{\neg i} \in \mathcal{E}$ is only correct for two variables. Please consider the general formula for an arbitrary number of variables.
3.Line 69: e as a symptom,” a user |
Fully human-written |
|
DAL: A Practical Prior-Free Black-Box Framework for Non-Stationary Bandits |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper focuses on a classical problem, that of learning in non-stationary bandits. The idea, essentially, is to augment a standard bandit algorithm with a "change detector". Classical bandit algorithms have their theory (and presumed applications) developed under the stationarity assumption, which is not necessarily true in practice. Departures from stationarity can take the form of both abrupt and gradual changes. The authors propose a framework based on (1) detecting changes of distribution by considering shifts in mean action rewards and (2) forced exploration according to a schedule, forcing the bandit algorithm to essentially "drift" in state space. The mean-shift detection is done by choosing an "appreciable" mean shift, exploiting some structure of the problem in deciding on which one. Some theoretical results on regret are provided.
Strengths: this is a nice problem, and one that has been considered by many authors over the years. The approach, while fairly simple, is effective. The experiments seem to be justifiable and demonstrate the performance of the method.
Weaknesses: the paper is not so easy to digest and understand at times. The tuning of the methods seems challenging, and the authors do not convince the reader otherwise. No details on the construction of the covering set are provided, for instance.
Questions: what if the process contains a mix of abrupt and gradual changes? Can this method be augmented with memory, allowing it to return to previous regimes instead of effectively starting from scratch every time?
N/A |
Fully human-written |
|
DAL: A Practical Prior-Free Black-Box Framework for Non-Stationary Bandits |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
For non-stationary bandits, most existing methods — such as restart, weighted/discounted, and sliding-window methods — can achieve good empirical performance and near-optimal regret guarantees. However, they rely on strong prior knowledge about the non-stationarity of the environment. In contrast, MASTER achieves optimal regret without requiring such prior knowledge, but it is very complex: it runs many learners in parallel, which makes it hard to use in practice and often weak in experiments.
This paper focuses on the piecewise-stationary setting and proposes a black-box method that achieves (near-)optimal regret and strong empirical performance. The method keeps a small covering set of arms, occasionally pulls arms from this set to detect changes, and restarts the base learner when a change is detected. This removes the need to know the degree of non-stationarity and avoids maintaining many parallel learners.
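As a reading aid, the following is a rough sketch of the kind of detection-and-restart loop described above, assuming a base learner with select/update methods and a per-arm change detector. The interfaces, the uniform exploration schedule, and the restart rule are placeholders, not the paper's exact DAL algorithm or its GLR-style test statistic.

```python
import random

def detection_augmented_loop(env_pull, base_learner_factory, detector_factory,
                             covering_set, horizon, explore_prob=0.05):
    """Illustrative detection-augmented bandit loop (not the paper's DAL).

    env_pull:             arm -> reward (the environment)
    base_learner_factory: () -> stationary learner with .select() / .update()
    detector_factory:     () -> change detector with .add(reward) -> bool
    covering_set:         small set of arms used for forced exploration
    """
    learner = base_learner_factory()
    detectors = {arm: detector_factory() for arm in covering_set}
    for t in range(horizon):
        if random.random() < explore_prob:        # forced exploration step
            arm = random.choice(list(covering_set))
        else:                                     # follow the base learner
            arm = learner.select()
        reward = env_pull(arm)
        learner.update(arm, reward)
        # Feed covering-set pulls to the detectors; restart on a detection.
        if arm in detectors and detectors[arm].add(reward):
            learner = base_learner_factory()
            detectors = {a: detector_factory() for a in covering_set}
```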
1. The method provides an algorithm with theoretical guarantees that does not rely on prior knowledge of the environment, and it also shows strong empirical performance.
2. The method is general: it acts as a black-box change detector that can be wrapped around different types of bandit algorithms, and it works across multiple bandit settings.
1. The method does not provide theoretical guarantee for the drifting case. This is expected, because the change-detection mechanism is designed for abrupt changes, not for drifting changes. The paper only shows empirical performance on drifting, but bandits are primarily a theoretical setting, so having a matching optimal regret guarantee there is important and is currently missing.
2. Compared to MASTER, this paper’s analysis in the piecewise-stationary setting relies on an extra assumption: changes in the environment must be separated by a sufficiently long stable period. This assumption appears inside Theorem 4.4, but it is not stated clearly as its own assumption. I suggest the authors make this assumption explicit and discuss it up front. Otherwise, the comparison to prior work (MASTER) is not fair, and the assumption feels too hidden.
1. The paper repeatedly uses the broad term “non-stationary bandits,” but after reading the paper, the theory really only covers the piecewise-stationary case. For drifting, there is no matching theoretical analysis, but only experiments. By this standard, any prior piecewise-stationary bandit method could also run on a drifting simulation and then claim to solve “non-stationary bandits,” which would be an overclaim. Since the proposed method is not specifically designed for drifting, I believe the paper (including the title) should make it explicit that the setting is piecewise-stationary, not general non-stationary.
2. Prior work on piecewise-stationary bandits already has prior-free detection-and-restart methods. It is not yet clear to me what the real difficulty is in turning those approaches into a black-box wrapper, and how this paper goes beyond that in a substantive way.
I would be happy to raise my score if the authors can make the requested revisions and clarify these points. |
Fully human-written |
|
DAL: A Practical Prior-Free Black-Box Framework for Non-Stationary Bandits |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes Detection Augmented Learning (DAL), a parameter-free, black-box framework for non-stationary bandits. DAL takes a stationary bandit algorithm and a change-point detection subroutine as inputs, and, through a forced-exploration mechanism, it adapts to non-stationary environments without requiring prior information. Extensive experiments on multiple benchmarks are conducted to validate the effectiveness of the proposed framework.
- This paper proposes a general framework for non-stationary bandits and establishes order-optimal regret guarantees in the piecewise-stationary setting. For the drifting case, the paper provides partial insights.
- The experimental evaluation is _thorough and diverse_, covering multiple bandit setups and including realistic datasets, which enhances the practical significance of the work.
- From a theoretical perspective, the main idea of augmenting a stationary bandit algorithm with a change-point detection module has been explored in prior work, limiting the conceptual novelty.
- Although the framework is claimed to extend naturally to contextual bandits, this case is not rigorously analyzed.
- The analysis for the drifting case remain limited, which constrains the overall contribution of the framework.
- Some assumptions, such as those in Proposition 4.2, require clearer justification or guidance on how they can be verified in practice.
- How sensitive is DAL to the choice of covering set $A_e$ in large continuous action spaces?
- DAL depends critically on a GLR-type change detector, but the implementation specifics are not fully described, e.g., what are the exact test statistics and thresholds used for triggering restarts? How are false alarms controlled?
I find the paper’s practical relevance to be stronger than its theoretical depth, and I would appreciate it if the authors could clarify the points raised above. |
Lightly AI-edited |
|
DAL: A Practical Prior-Free Black-Box Framework for Non-Stationary Bandits |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work focuses on the regret minimization problem in non-stationary bandits. It proposes the DAL technique to detect unknown changes in the environment. Both numerical experiments and theoretical analysis are presented.
1. Many related works are discussed.
2. Numerical experiments are done in various datasets.
This work presents a set of numerical results and a set of analytical results, but neither fully convinces me of the superiority of the algorithm. I wonder what the key contribution/focus of the work is. Some key concerns are listed below:
1. Abstract: It is claimed that 'DAL accepts any stationary bandit algorithm as input', while the propositions/theorems (e.g., Theorem 4.4) come with some assumptions/conditions. This is somewhat confusing.
1. Line 28: It is claimed that 'MABs fall into ... PB, NPB, CB'. I feel this classification may not be entirely proper. For example, contextual bandits can also be viewed as a parametric setting from some perspective.
1. Algorithm 1: I think the algorithm is a key contribution of this work, while the pseudocode is not that easy to understand.
1. What is $N_e$?
1. When will $D( \ldots )=\text{detection}$ (in line 6)?
1. Many subplots in Figures 1 and 2 present the regret/reward of only a portion of the discussed algorithms. Do the missing algorithms perform better than DAL? An explanation would be appreciated.
1. Proposition 4.2: It is a bit unusual that the Lipschitz constant $BL_u$ does not affect the bound on $|V_T|$. Some explanations are appreciated.
1. Theorem 4.4 comes with many conditions/assumptions that are not discussed. Besides, how the regrets stated in the paragraph beginning at Line 414 are obtained is not clear. Some explanations here are also appreciated.
Besides, here is one minor suggestion:
1. The algorithms should be arranged in the same order in the legend box for Figures 1 and 2.
See *Weaknesses* above. |
Fully human-written |
|
Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces a hierarchical benchmark that evaluates MLLMs' ability to understand and reason about orientation. The paper evaluates multiple MLLMs on the benchmark and shows that they perform poorly.
* The motivation behind the curated dataset is well justified. The curated dataset is diverse and contains common objects that MLLMs would have seen during pre-training.
* The paper has a clear presentation of the dataset and complete evaluation of the popular open-source and closed-source MLLMs.
* This dataset has the potential to be extended to 3D. If so, it would be very beneficial for active learning, robotic pre-training, etc.
* I have a concern about whether the "counter-clockwise" and "clockwise" are consistently defined. In a 3D setting, when talking about rotation, we always need to specify the direction of the z-axis. But such information is not provided in the dataset.
* I also have a concern about whether "face toward" is well defined. This clearly requires the described object to have a "face" that is visually decidable. If the object is a human, it is simple. But for other objects like tables or sofas, this language may not apply.
* This is not a weakness. Since the tasks proposed in the paper mostly require 3D reasoning, it may make the dataset stronger if 3D point clouds or depth maps are also provided for the simulated images.
No question. |
Fully human-written |
|
Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes DORI, a diagnostic benchmark for orientation understanding in MLLMs across four dimensions (frontal alignment, relative orientation, rotational transformation, canonical orientation).
It uses standardized MCQ prompts (with a Cannot be determined option) and reports that models handle coarse judgments better than granular angles; token-based fusion appears stronger than linear projection.
LoRA fine-tuning on DORI reportedly transfers to external spatial benchmarks.
The paper cleanly isolates orientation understanding into complementary abilities (frontal, relative, rotational, canonical) and tests them with a coarse–granular design that probes both category and precise angle reasoning.
The benchmark is well engineered, and broad model coverage exposes consistent weaknesses.
Findings are actionable, making DORI a practical diagnostic tool for geometry-sensitive applications.
- Since prompting is part of the measurement apparatus, please ablate the components to quantify their contribution and ensure models aren’t over-relying on the scaffold rather than vision.
- Ground-truth fidelity and metric design: While synthetic sources yield precise angles, human-annotated natural images can have ambiguous frontality (e.g., symmetric furniture) and unknown camera intrinsics, which may distort a fixed discrete angle taxonomy.
- Architectural claims need stronger controls: The observation that token-based integration > linear projection is compelling but potentially confounded by pretraining data or instruction tuning.
- Human study scale and reporting: Human evaluation covers 30 examples per type with seven experts. This is useful but small.
- Have you tried free-form numeric responses (regression-style), quantized at evaluation time? Do model rankings persist? Please share results with permuted answer choices and with the examples section removed, to quantify prompt-component effects.
- For the claim that token-based fusion > linear projection, can you provide experiments with the same visual backbone and identical instruction-tuning, changing only the fusion scheme? Any results with feature token counts swept to test capacity vs mechanism? (If it's not possible, that's understandable)
- Would you consider scaling the human study to 300–500 items with crowdworkers + expert adjudication, and report results to better anchor the human–model gap? |
Heavily AI-edited |
|
Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents the DORI benchmark, developed to specifically evaluate how well Multimodal Large Language Models (MLLMs) understand object orientation. DORI uses a cognitive-science-informed approach, assessing orientation perception across multiple facets, including how objects face the viewer, how they change with rotation, their orientation relative to other objects or viewpoints, and their typical 'right-side-up' state. The evaluation includes both basic categorical questions and more demanding fine-grained angular questions. It is applied to a substantial number of real and synthetic images (over 13k images from 14 sources) with structured prompts. Experiments involving 18 MLLMs indicated difficulties on this dataset, particularly in making precise orientation judgments versus simpler classifications. Notable performance declines occur when tasks require understanding rotations or shifts in perspective. The findings suggest current models may lack robust internal mechanisms for representing and reasoning about object orientation.
1. The most notable contribution is the collected dataset. The benchmark's hierarchical structure, which decomposes orientation-related questions into four dimensions (frontal alignment, rotational transformations, relative orientation, and canonical orientation), is quite inspiring. The inclusion of both coarse and fine-grained questions allows a more comprehensive assessment of model proficiency.
2. The paper effectively identifies and addresses a limitation of existing MLLM evaluations, namely the inability to assess object orientation understanding separately from general spatial reasoning. This dataset helps evaluate MLLMs' object orientation understanding in more detail.
3. DORI is constructed from a quite diverse set of images (13,652 images from 14 sources), which include both real-world and synthetic data. The evaluation is conducted across 18 different MLLMs, providing a quite holistic benchmark evaluation.
1. The presentation of the paper can be improved. Limited examples are provided for the VQA questions involving canonical orientation. I have remaining concerns about these types of questions, since canonical orientation or frontal alignment itself might remain inherently ambiguous for certain object types, like symmetric ones. This might introduce noise into the ground truth and evaluation, and I am interested in seeing how these cases are addressed in more detail.
2. Another limitation is the absence of empirical validation showing that performance on DORI actually correlates with MLLM capabilities in real-world applications like robotic manipulation or autonomous navigation. The paper claims relevance but doesn't demonstrate a predictive link between benchmark scores and success on applied tasks requiring orientation understanding.
3. The paper's writing can be improved. Some tables exceed the text width. In the main experiment tables, using vertical rules to separate the different tasks is recommended. Overall, there are some minor presentation issues in the paper.
Please refer to the weaknesses.
Will this dataset be released? |
Fully human-written |
|
Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This work introduces DORI, a benchmark designed to evaluate the orientation perception ability of current multimodal large language models (MLLMs). DORI comprises 13,652 images from 14 sources, forming a total of 33,656 samples. It assesses four key aspects of object orientation understanding: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation comprehension. The results show that even the best-performing models achieve only 54.2% accuracy on coarse-level tasks and 33.0% on fine-grained orientation judgments, with performance degrading significantly on tasks involving reference frame shifts or compound rotations.
1. The proposed benchmark addresses an important problem.
2. The writing is generally clear and easy to follow.
3. The related work section is detailed and clearly explains the limitations of existing benchmarks.
4. The proposed benchmark is novel in terms of its practical usability.
1. Figure 1 is not very clear and does not effectively convey the definitions of the four task categories. In particular, the examples for rotational transformation and relative orientation appear very similar, making it difficult to distinguish between them.
2. Although the paper cites many works from related fields such as cognitive science to explain how humans understand rotation, the definitions of the four subproblems lack clear logic and structure. The relationships among them are not well articulated, leaving it unclear whether the proposed categorization is both complete and necessary.
3. For some objects, the front face is inherently ambiguous (e.g., a table). Although the paper mentions that specific prompt designs are used to define the front face for tested models, such strategies cannot fully resolve these ambiguities. This raises concerns about the correctness and answerability of certain questions.
4. Based on the above, I suspect that some samples may be ambiguous. However, the paper does not describe any quality control process to ensure dataset accuracy. For a benchmark, it is generally expected that every sample be manually verified to guarantee correctness.
5. While the paper evaluates several state-of-the-art models, it omits important models such as the InternVL series, and for the Qwen family, only the 3B variant is tested without including larger models.
6. Lines 418–419: The observed difference may stem from variations in training data, so this conclusion should not be drawn too hastily.
7. Table 5: Please correct the label from “GPT-4 O” to “GPT-4o.”
8. The paper claims that the proposed systematic approach isolates orientation understanding from scene perception skills and minimizes confounding factors such as object recognition difficulty, scene clutter, linguistic ambiguity, and contextual distractions that affect existing benchmarks. However, no experiments or examples are provided to substantiate these claims.
1. In Section 3.1, the definition of the viewing plane is not clearly explained. |
Lightly AI-edited |
|
From Traits to Circuits: Toward Mechanistic Interpretability of Personality in Large Language Models |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper discusses personality in LLMs, pioneering the application of mechanistic interpretability to analyze the models themselves. The authors discovered that the identified circuits are functionally complete, structurally sparse, and heavily dependent on early MLP layers, which act as causal bottlenecks.
This is a thoroughly analyzed paper. While it does not introduce a novel methodology, it rigorously analyzes and investigates personality circuits. This approach uncovers many phenomena previously unknown to the community and refutes the conjecture that "personality is a globally diffuse property."
The paper's definition of personality is oversimplified. I do not believe the Big-Five model is sufficient to encapsulate personality, which could also include, for example, dark personality traits (e.g., the Dark Triad) or aspects such as values, beliefs, and motives. The "personality circuits" identified in this paper actually correspond to "Big-Five trait circuits," rather than the broader "personality circuits" as claimed. The authors should conduct further analysis on these aspects to reach a more definitive conclusion.
It is difficult for the paper to prove that these personality circuits are exclusive; they are very likely key components that are also involved in executing other semantic tasks.
The study was conducted on two relatively small and older LLMs. It remains unknown whether the conclusions can be generalized to state-of-the-art models, especially MoE-based architectures. An analysis of models like Qwen-3-30B-A3B and Qwen-3-235B-A22B would significantly enhance the paper's contribution.
The template used to identify personality is highly structured. It is unclear whether similar phenomena persist in more realistic, user-focused tasks such as free-form conversation, extended dialogues, or long-form writing. |
Lightly AI-edited |
|
From Traits to Circuits: Toward Mechanistic Interpretability of Personality in Large Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper studies whether personality in LLMs may similarly be realized through structured internal computation paths. The authors come to the conclusion that "only a compact set of attention heads and MLP units appears necessary for encoding and expressing each trait across different models".
The research is on a very timely and interesting topic. The results are convincing and should be interesting for a broad community of researchers.
Though the authors correctly mention that LLMs simulate certain behaviour or personality, they do anthropomorphize LLMs in other parts of the text. This is an unfortunate but minor problem in my opinion.
How general is your approach, and would it be applicable to bigger models? In particular, how can we be sure that the conclusions will hold for models that are an order of magnitude bigger and go through a significantly longer post-training phase for alignment? |
Fully human-written |
|
From Traits to Circuits: Toward Mechanistic Interpretability of Personality in Large Language Models |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper investigates whether the personality of large language models can be traced to identifiable transformer circuits. Using tools from the mechanistic interpretability community, the authors identify a small set of sparse nodes in a small, pretrained LLaMA model (LLaMA-2-Chat, 7B) that are responsible for generating answers on the proposed Trait-Trace dataset. Ablation studies and causal-intervention analyses show that certain nodes within these circuits can substantially influence LLM performance on the Trait-Trace task.
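For readers less familiar with circuit-style analyses, the snippet below sketches the general flavor of such an ablation check in PyTorch: keep a candidate set of units, mean-ablate the rest, and measure how much of the trait-consistent prediction survives. The module/unit granularity, the mean-ablation choice, and all names are my own illustrative assumptions; the paper itself relies on EAP-IG and its own metrics, and modules are assumed to return plain tensors.

```python
import torch

def circuit_score(model, inputs, target_ids, keep_units, all_units):
    """Toy faithfulness check: mean-ablate every unit outside the candidate
    circuit and measure the surviving trait-consistent logit."""
    handles = []
    for module, units in all_units.items():
        drop = [u for u in units if u not in keep_units.get(module, [])]
        if not drop:
            continue

        def hook(mod, inp, out, idx=drop):
            out = out.clone()
            out[..., idx] = out[..., idx].mean()  # mean-ablate non-circuit units
            return out

        handles.append(module.register_forward_hook(hook))
    try:
        with torch.no_grad():
            logits = model(**inputs).logits[:, -1, :]
    finally:
        for h in handles:
            h.remove()
    # Average logit assigned to the trait-consistent target token.
    return logits.gather(-1, target_ids.unsqueeze(-1)).mean().item()
```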
- The perspective of identifying interpretable circuits in Transformer models that are causally responsible for personality-like behaviors is interesting and could have important implications for safety, alignment, and the development of better chatbots.
- The paper is generally well written and easy to read.
- A major weakness lies in the evaluation. The study relies on a newly proposed Trait-Trace dataset generated by GPT-4o that focuses on single-word reactions to vignettes/trait prompts. All circuit-discovery and causal-intervention experiments depend on this fragile single-word reaction task. It is unclear what the task actually measures—the discovered circuits may merely capture distributional shifts in certain personality-related words rather than any higher-level notion of personality in LLMs. Generalization tests are essential. For example, under causal interventions/steering, do circuits discovered on Trait-Trace transfer to more complex settings (e.g., dialogue generation, storytelling, or psychometric evaluation items)? Given the authors’ access to trained psychology graduate students, such evaluations seem feasible. Demonstrating this would better justify the claim that the identified circuits reflect personality rather than confounding word-distribution shifts.
- The Trait-Trace task design is too simple. The template “I’m {p}, regarding {s}, I feel very {r}” biases lexical and affective choices, making the discovered circuits specific to particular word choices rather than to general personality constructs. It remains unclear what construct this task is evaluating.
- Limited conceptual insight. As framed, one could likely find circuits or sparse subgraphs for almost any language-model behavior. The authors should better demonstrate—or at least discuss—why these circuits matter and why the discovered early-layer MLP features align with human intuitions.
If you replace the prompt with random fillers unrelated to personality, would the intervened circuits still induce a similar shift in the output-logit distribution? |
Lightly AI-edited |
|
From Traits to Circuits: Toward Mechanistic Interpretability of Personality in Large Language Models |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper explores whether personality traits (based on the Big Five model) can be localized as identifiable “circuits” within large language models (LLMs). The authors construct a synthetic dataset, TRAITTRACE, containing prompts that express high or low levels of each trait and corresponding trait-consistent reactions. Using Edge Attribution Patching with Integrated Gradients (EAP-IG), they identify minimal subgraphs within the model that preserve performance on trait-consistent response prediction. Results suggest that small, sparse circuits can reproduce the full model’s behavior, that high and low levels of traits share many nodes but differ in edge directions, and that early MLP layers serve as bottlenecks for trait information.
This paper tries to move beyond behavioral probing toward the mechanistic interpretability of social/psychological constructs.
## Weaknesses and Suggestions
### 1. The motivation is weak.
The authors justify this work through an analogy with neuroscience, arguing that personality traits in humans arise from neural circuits and therefore may also emerge as “trait circuits” in LLMs. However, this analogy is conceptually flawed. Human traits are latent psychological dimensions, not localized neural entities, and the connection to artificial circuits is purely metaphorical.
---
### 2. Ethical statement is missing.
Because the paper draws direct analogies between human brain circuits and model activations, it risks **anthropomorphism**, suggesting that LLMs “possess” personality traits or human-like psychology. Such framing requires careful ethical consideration and a clear disclaimer, but no ethical statement is provided. The authors should explicitly acknowledge these limitations and clarify that their findings do not imply genuine human-like cognition.
---
### 3. Experimental rigor is low.
Only two small instruction-tuned models (LLaMA-2-7B-Chat and Phi-2) are tested, without including base or larger models. This makes it difficult to assess whether the findings generalize across training phases or scales of LLMs. Including additional models or verifying whether similar trait circuits emerge in non-chat variants would significantly strengthen the paper. I understand that the paper's goal is to discover the circuits inside LLMs, but to evaluate the quality and validity of the dataset, I recommend that the authors also provide an evaluation of other methods, such as pure prompting.
---
### 4. Prompt and task design are conceptually flawed.
The Big Five traits are continuous spectra, but the dataset reduces them to binary self-descriptions (e.g., “I am high in openness” vs. “I am low in openness”). This introduces strong lexical cues and risks capturing superficial associations between trait names and responses rather than genuine trait inference. A more realistic approach would involve inferring traits from open-ended essays or autobiographical texts. This method is one of the canonical methods to evaluate the personalities of humans [1].
For methodological reference, see [2].
---
### 5. Novelty is limited.
The technical approach, combining edge-attribution patching with pruning, is a direct application of existing interpretability methods. The main novelty lies in dataset curation, but the dataset curation is not rigorous enough.
---
### 6. Causal interpretation is overstated.
In Section 6.3, the authors conduct causal intervention analysis. However, the key question, whether these circuits truly represent personality traits rather than lexical correlations, remains unresolved. Without ruling out such confounders, it is premature to claim that the identified subgraphs mechanistically encode traits.
[1] McAdams, Dan P. "Narrative identity." Handbook of identity theory and research. New York, NY: Springer New York, 2011. 99-115.
[2] Suh, Joseph, et al. "Rediscovering the latent dimensions of personality with large language models as trait descriptors." arXiv preprint arXiv:2409.09905 (2024).
1. **Evaluation details.** The details of the evaluation are not fully provided. Words or phrases like "procrastinating" are divided into 5 tokens, and the LLM response can be a full sentence, but how the authors check the overlap between the references and the LLM-generated tokens is not fully explained.
2. **Dataset details.** Please provide more details of the curated dataset to evaluate the quality. |
Lightly AI-edited |
|
Action Chunking Proximal Policy Optimization for Universal Robotic Dexterous Grasping |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes an action chunking framework for learning universal dexterous grasping, and introduces the use of a state value function instead of an action value function to mitigate the challenge posed by extremely high-dimensional action spaces. Comprehensive simulation experiments are conducted to demonstrate the effectiveness of the proposed design.
1. The proposed method achieves a success rate comparable to the state of the art, without relying on residual policies or curriculum learning, and demonstrates significantly higher efficiency than other approaches.
2. A comprehensive ablation study is conducted to validate the effectiveness of the proposed method.
1. In other papers, PPO typically performs poorly; however, in this work, PPO achieves even higher performance than PPO with curriculum reported elsewhere. If the PPO implementation in this paper is inherently better than in previous works, the additional improvement provided by the proposed method may not appear significant. Could the authors clarify the reason behind this?
2. In the supplementary demonstrations, the dexterous hand appears to oscillate after grasping objects, suggesting that using action chunking may not clearly reduce jittering.
See weakness. |
Lightly AI-edited |
|
Action Chunking Proximal Policy Optimization for Universal Robotic Dexterous Grasping |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a reinforcement learning algorithm, Action Chunking Proximal Policy Optimization (ACPPO), for general-purpose grasping with high-degree-of-freedom robotic hands. The method extends PPO by incorporating action chunking, which outputs multiple actions at each step, enabling temporally coherent exploration without introducing high-dimensional Q-functions. Experiments on the DexGraspNet dataset demonstrate that ACPPO outperforms existing PPO-based methods, achieving higher success rates (95.4%) and faster training (2.3× speedup).
1. The paper provides a comprehensive literature review and includes a reasonably broad selection of baseline methods for comparison.
2. The ablation studies are comprehensive, including experiments on chunk size, decision frequency, and comparisons with ACFQL, which collectively validate the design choices.
1. The contribution of this work is rather incremental. The method mainly incorporates action chunking into PPO, without offering substantial conceptual novelty or broader insights.
2. The paper lacks sufficient depth and clarity. The presentation is somewhat disorganized, making it difficult to follow the main ideas and contributions.
3. The improvement in performance is modest: ACPPO achieves only a slight increase over the best baseline (ResDex, 94.6% → 95.4%). The main contribution lies in simplifying the training process rather than delivering a substantial performance gain.
1. The ablation study indicates that the optimal chunk size is $h=2$, which seems to contradict the claimed benefits of action chunking. Given that the typical control frequency in prior work is $60\,\mathrm{Hz}$, a chunk of two steps corresponds to a very short time interval (roughly $2/60 \approx 33$ ms).
2. The reported success rate of vanilla PPO (over 90%) is inconsistent with previous work [1].
3. In the supplementary videos, the simulator still exhibits noticeable jitter, and there is little improvement in motion coherence, which does not align with the authors’ claims.
4. Although the paper claims that action chunking encourages temporally coherent exploration, the comparison with vanilla PPO shows only marginal improvement in training efficiency.
[1] Yinzhen Xu, et al. Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy, CVPR 2023 |
Lightly AI-edited |
|
Action Chunking Proximal Policy Optimization for Universal Robotic Dexterous Grasping |
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper studies PPO with an action-chunking actor for learning universal dexterous grasping policies. Experiments on DexGraspNet show that the method outperforms both previous RL-based methods and the vanilla PPO without action chunking.
- RL with action chunking is an important problem in robot learning, as imitation learning with action chunking has shown great promise.
- The choice of PPO for action-chunking RL makes sense, avoiding the need to learn the high-dimensional Q-function.
- The algorithm achieves very good performance on DexGraspNet, surpassing prior RL methods.
- The good result of vanilla PPO without any curriculum learning design (achieving 90+% success rates, as shown in Tables 3 and 4) is strange. Prior works (UniDexGrasp, UniDexGrasp++, and ResDex) show that directly running PPO on 3200 DexGraspNet objects results in poor success rates (typically 10%~60%) due to the inherent multi-task optimization challenge.
- Even if the results are reliable, the improvement of PPO with action chunking over vanilla PPO is quite minor (from 90% to 93%). Simply increasing the training time of vanilla PPO may lead to the same gain.
- The claim that "ACPPO improves exploration with temporally coherent, smooth actions" (lines 311-318) may not be true. Since the policy parameters are trained from scratch and the Gaussian randomness for each action dimension is independent, the policy is not encouraged to output temporally coherent actions regardless of whether action chunking is used. The cited works (lines 307-310) are either imitation learning or offline-to-online RL works, and their policies are initially trained from high-quality demonstrations where actions are smooth and temporally coherent. Therefore, these works cannot support the same claim in this paper under the RL-from-scratch setting.
- The experimental results are not comprehensive. Though action chunking may not succeed in locomotion tasks, the ACPPO proposed in this paper is not specifically designed for the grasping task. As a general-purpose algorithm design, evaluation on more tasks (such as manipulation tasks in the Meta-World benchmark) is quite necessary. On the other hand, if the paper focuses on the specific dexterous grasping task, it is necessary to include vision-based results, compare with more recent methods like RobustDexGrasp, and do some sim-to-real experiments.
- Please see my concerns in Weaknesses.
- The issue of action-chunking policies being less reactive can be addressed by receding-horizon control during inference (e.g., predict the whole action chunk but execute only the first action). Can this be integrated into the RL algorithm to address dynamic tasks such as locomotion? |
Fully human-written |
|
Action Chunking Proximal Policy Optimization for Universal Robotic Dexterous Grasping |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes **Action Chunking PPO (ACPPO)** for high-DoF dexterous grasping, where standard PPO fails due to poor exploration. The authors identify that prior action chunking (ACRL) methods are intractable in this domain because they require learning a high-dimensional $Q(s, a_{chunk})$. ACPPO avoids this by modifying the PPO objective to use a **chunked importance sampling (IS) ratio** $\rho_{t,h}^{ch}(\theta)$ while retaining a simple $V(s)$ critic and standard GAE. This novel on-policy formulation achieves state-of-the-art results on DexGraspNet, training 2.3x faster than prior work without auxiliary augmentations.
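To make the objective concrete, below is a minimal sketch of a PPO clipped surrogate with a chunk-level importance ratio; treating ACPPO's $\rho_{t,h}^{ch}$ as the exponentiated sum of per-action log-probability differences is an assumption based on the summary above, not a verified reproduction of the paper's exact objective.

```python
import torch

def chunked_ppo_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Sketch of a PPO clipped surrogate with a chunk-level ratio.

    new_logp / old_logp: [batch, h] per-action log-probabilities of the
    h actions in a chunk, all predicted from the same decision state.
    advantages:          [batch] GAE advantages at the decision state.
    """
    ratio = torch.exp(new_logp.sum(-1) - old_logp.sum(-1))   # chunk-level ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

Under this reading, the log-ratio is a sum of h per-action terms, so its variance grows with the chunk length, which connects directly to the clipping-related concerns raised in the weaknesses below.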
* **Novel Problem-Solving:** The paper clearly identifies the $Q(s, a_{chunk})$ bottleneck that makes prior ACRL methods infeasible for high-DoF robotics and proposes an elegant on-policy solution.
* **State-of-the-Art Performance:** ACPPO achieves SOTA results on the complex DexGraspNet benchmark.
* **Exceptional Training Efficiency:** The 2.3x training speedup, achieved without any auxiliary mechanisms, highlights the efficiency of the core algorithmic change.
* **Strong Ablation Study:** The ablation showing that simple action repetition diverges effectively proves the contribution is the chunked optimization, not just a lower decision frequency.
* **Mismatch Between Motivation and Results:** The central premise of the paper is that action chunking provides temporally coherent exploration. However, the empirical results (Table 4) show that the best performance is achieved with a minimal chunk size of $h=2$. Performance degrades at $h=3$ and $h=4$, and collapses to 0% at $h=8$. A chunk of 2 barely qualifies as "temporally coherent" and undermines the core motivation. The paper would be far more convincing if it could demonstrate SOTA performance with a more substantial chunk length (e.g., $h \ge 5$).
* **Limited Methodological Novelty:** The proposed change, while elegant, is a single modification to the PPO objective. The core components (actor-critic architecture, $V(s)$ critic, GAE) are identical to PPO. The contribution is essentially replacing the per-step IS ratio $\rho_t$ with a chunked IS ratio $\rho_{t,h}^{ch}$. For an ICLR submission, this level of technical contribution is borderline.
* **Shallow Analysis of Failure Mode:** The paper's analysis of the $h=8$ failure is shallow, attributing it to a "coarse" decision frequency (a behavioral explanation). A more critical optimization-based analysis is missing. A far more likely culprit is the **instability of the chunked importance sampling ratio**. The variance of an IS ratio grows exponentially with the sequence length ($h$). It is highly probable that for $h=8$, the ratio $\rho_{t,h=8}^{ch}$ is so volatile that it is *always* outside the PPO clipping range $[1-\epsilon, 1+\epsilon]$. This would cause the clipping mechanism to **zero out the gradient signal**, making learning impossible. This suggests that on-policy, IS-based methods like PPO may be fundamentally unsuited for large chunk sizes—a critical limitation that is not addressed.
* **Unverified Theoretical Claims:** The authors claim (Eq. 12) that their formulation reduces the "tail bias" from GAE. This is a key justification for their specific objective function, yet it is presented without any empirical validation. An experiment measuring the advantage estimation bias (e.g., relative to Monte Carlo returns) for PPO vs. ACPPO would be required to substantiate this claim.
* **Missing Diagnostic Experiments:** Given the hypothesis about the clipping mechanism, a crucial experiment is missing: a plot of the **percentage of clipped updates** as a function of the chunk size $h$. This single experiment would provide vital insight into the algorithm's optimization dynamics and likely confirm why $h=8$ fails.
* **Limited Scope:** The empirical validation is confined to a single, albeit difficult, task family. The authors admit in Section 6.4 that the method would likely fail in dynamic tasks (e.g., locomotion), limiting its generality.
1. How do the authors reconcile the "temporal coherence" motivation with the empirical fact that the minimal chunk $h=2$ is optimal?
2. Please provide a plot of the **percentage of clipped updates** vs. chunk size $h$ (from $h=1$ to $h=8$). Does this confirm the hypothesis that the $h=8$ gradient is zeroed out by the clipping mechanism?
3. Can the authors provide empirical evidence for their claim (Eq. 12) that ACPPO reduces GAE's "tail bias"? A direct measurement of advantage estimation error would be convincing.
4. Given the IS variance issue, do the authors believe that on-policy, IS-based methods like PPO are fundamentally a poor choice for action chunking with $h > 2$, and that this problem requires an off-policy formulation? |
Fully AI-generated |
|
Boost the Identity-Preserving Embedding for Consistent Text-to-Image Generation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper tackles identity preservation in text-to-image (T2I) diffusion. It observes that a cross-frame, identity-bearing direction exists in the text-encoder embeddings. Building on this, the authors propose BIPE, a training-free, plug-and-play framework with two parts: Adaptive Singular-Value Rescaling (adaSVR) and Union Key (UniK). The paper also proposes DiverStory, a benchmark using varied natural-language prompts (not a single template), and reports gains on ConsiStory+ and DiverStory with moderate runtime/memory overhead.
- Operating purely on text embeddings makes BIPE easy to attach to SDXL-like pipelines; the paper also shows a video case (Wan 2.2).
- The IPemb observation (leading singular directions capture identity) is plausible and supported by attention-map probes.
- On ConsiStory+, BIPE achieves the best CLIP-T and VQA, with identity metrics close to the best and better efficiency than training-heavy baselines; ablations indicate complementary roles for adaSVR and UniK.
- DiverStory highlights robustness to varied natural-language prompts which is a realistic setting often under-tested.
- Evidence suggests BIPE helps on both template-based and diverse prompts, but several core claims and implementation details are insufficiently justified (see “Questions”).
- The empirical methodology is mostly standard, but ablations don’t fully isolate design choices (e.g., sensitivity to the weighting temperature, the role of per-layer SVD).
1. Is any finetuning performed anywhere (text encoder/adapters)? If truly training-free, please correct the Table-1 flag; if not, specify what is trained and where.
2. Do UniK keys/values come from adaSVR-enhanced embeddings (as in the main text) or the original embeddings (as suggested in the appendix)? Please standardize and report the performance delta between the two setups.
3. Provide per-layer SVD dimensions and a wall-clock/VRAM profile that separates the costs of adaSVR vs. UniK, and how these scale with number of frames (N) and subject token count.
4. Include sweeps for the temperature $\tau$ in Eq. (3) and the number of frames (N); additionally, report robustness to token selection ([EoT] vs. subject tokens) and layer-wise on/off.
5. Any human studies on identity consistency under Diverse Prompts? What is the release timeline/spec for DiverStory to enable community verification? |
Moderately AI-edited |
|
Boost the Identity-Preserving Embedding for Consistent Text-to-Image Generation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes BIPE, a training-free, plug-and-play framework that improves subject identity consistency in multi-image text-to-image generation by operating purely on text embeddings. BIPE has two components: adaptive singular-value rescaling (adaSVR), which spectrally amplifies identity-preserving directions in subject and [EoT] token embeddings across every layer of the text encoder, and Union Key (UniK), which concatenates cross-attention keys from all prompts while using per-frame values to align attention without leaking full values across frames. Experiments on ConsiStory+ and a new Diverse Prompts benchmark, DiverStory, show strong text alignment and competitive identity consistency with low memory and runtime overhead, and the authors also illustrate integration into Wan 2.2 for cross-video consistency.
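To illustrate the kind of operation adaSVR appears to perform on the selected token embeddings, here is a minimal, illustrative sketch; the exponential weighting, the temperature `tau`, and the energy-matching step are assumptions for illustration and may differ from the paper's exact formulation:

```python
import numpy as np

def rescale_identity_directions(E, tau=1.0):
    """Illustrative singular-value rescaling of a token-embedding matrix.

    E   : (num_tokens, dim) embeddings of the selected subject / [EoT] tokens.
    tau : temperature controlling how strongly the leading directions are boosted.
    Returns an embedding matrix of the same shape, with the leading singular
    directions amplified and the total energy matched to the input.
    """
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    # Softmax-like weights over singular values: larger values get boosted more.
    w = np.exp((S - S.max()) / tau)
    w /= w.sum()
    S_boosted = S * (1.0 + w)
    # Energy matching: keep the Frobenius norm of the original embeddings.
    S_boosted *= np.linalg.norm(S) / np.linalg.norm(S_boosted)
    return (U * S_boosted) @ Vt

# Toy usage: 8 tokens with 16-dimensional embeddings.
E = np.random.randn(8, 16).astype(np.float32)
E_boosted = rescale_identity_directions(E, tau=0.5)
print(E.shape, E_boosted.shape)   # (8, 16) (8, 16)
```

The sketch only shows why such an operation can remain training-free and cheap: it is a per-layer linear algebra step on a small token matrix, with no learned parameters.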
Originality is solid: rather than new networks or retraining, the work identifies and boosts an intrinsic identity-preserving component in text embeddings and enforces consistency via key-sharing in cross-attention, which is simple and broadly applicable. Quality is supported by clear math for adaSVR with energy-matched normalization, principled token selection for subject and padding tokens, and a practical 1/N weighting of extra key-value pairs to control dominance and cost. Clarity is generally high, with an end-to-end pipeline and ablations that isolate adaSVR vs UniK contributions. Significance is promising since BIPE is architecture-agnostic, requires no additional data or training, and achieves strong alignment and competitive identity metrics with near-base latency, while DiverStory broadens evaluation beyond templated prompts.
The paper claims BIPE is training-free, yet Table 1 marks BIPE as not training-free on both ConsiStory+ and DiverStory, which conflicts with the text and should be corrected or explained. The evaluation emphasizes SDXL as the default and shows case studies with Wan 2.2, but broader quantitative tests on additional backbones would better support the architecture-agnostic claim. Identity consistency is mostly measured by CLIP-I and DreamSim with background removal; a small human study or per-attribute identity analysis would strengthen conclusions on visual identity. Finally, while the method uses only a subset of tokens in UniK to cap compute, sensitivity to the number and type of shared tokens, and scaling with the number of frames N, are not systematically profiled.
a) Please reconcile the training-free claim with Table 1, which currently lists BIPE as not training-free. If this is a typesetting error, clarify and update; if not, explain what part of BIPE requires training.
b) How does BIPE scale in runtime and memory with the number of frames and with the count of shared keys in UniK? A plot of latency and VRAM vs N and vs number of shared subject/[EoT] tokens would help practitioners.
c) Beyond SDXL and the Wan 2.2 illustration, can you report quantitative results on at least one non-CLIP text encoder or a DiT-based T2I backbone to substantiate the architecture-agnostic claims?
d) Could you add sensitivity studies for adaSVR’s temperature and the decision to include [EoT] alongside subject tokens, plus an ablation on the 1/N weighting strategy?
e) DiverStory is valuable; can you provide statistics on prompt diversity and subject types, along with plans and licensing for release, so the community can reproduce and extend your results?
f) The limitations note that BIPE does not accept external identity references; can you outline how BIPE would integrate with reference-image encoders or identity embeddings while retaining the training-free property?
Fully AI-generated |
|
Boost the Identity-Preserving Embedding for Consistent Text-to-Image Generation |
Soundness: 3: good
Presentation: 1: poor
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors propose BIPE (Boost Identity-Preserving Embedding), a training-free method for consistent text-to-image generation. The approach focuses on identity-preserving embeddings (IPemb) and introduces two techniques: Adaptive Singular-Value Rescaling (adaSVR) and Union Key (UniK). AdaSVR applies singular value decomposition to amplify identity-related components. UniK enhances consistency by concatenating cross-attention keys from all frame prompts. BIPE uses SDXL as the base model and is evaluated on ConsiStory+ and a newly proposed DiverStory benchmark.
1. The paper analyzes and visualizes the relationship between identity-preserving embeddings and attention mechanisms focused on the subject.
2. The authors design the DiverStory benchmark, which employs varied natural language prompt formulations rather than relying on a single fixed template as in ConsiStory+.
3. The paper provides numerous visual examples to illustrate results.
1. The motivation is not entirely clear. The authors claim that previous works overlook the fact that identity-relevant embedding components are already implicitly encoded within the aggregated textual embeddings of a full frame-prompt sequence. However, this limitation does not seem particularly significant, nor is it obvious that it would strongly affect results.
2. The description of the method is difficult to follow and not clearly structured. It required substantial time and effort to understand the novelty of BIPE and how it differs from 1Prompt1Story. Nevertheless, the proposed approach appears quite similar to 1Prompt1Story. For instance, the UniK component in BIPE seems analogous to Prompt Consolidation (PCon) in 1Prompt1Story, as both combine all prompts. Likewise, adaSVR in BIPE appears to resemble Singular-Value Reweighting (SVR) in 1Prompt1Story. The primary difference seems to lie in the explicit use of IPemb in BIPE. However, the distinction between explicit use in adaSVR and implicit use in SVR is not clearly explained.
3. The paper contains several typos. For example, $\bar{V}_i$ should be $\tilde{V}_i$ in Eq. (5). In Table 1, “Train-Free” should be “Train”, or “✓” and “×” should be interchanged.
Since BIPE is applied to video generation in the experiments, would it be more accurate to use the term “consistent visual generation” rather than “consistent text-to-image generation”? |
Lightly AI-edited |
|
Myna: Masking-Based Contrastive Learning of Musical Representations |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper focuses on music representation learning. It follows a contrastive learning framework, with its main contributions being the use of a Vision Transformer (ViT) as the backbone model and the application of token masking. Furthermore, considering the characteristics of music analysis, the authors extend the approach into a hybrid model that incorporates vertical filters to better capture the frequency-related features of spectrograms. Through this relatively simple training strategy, the proposed model achieves competitive performance on several downstream tasks compared to models that require more than 5× the training time and parameters. Overall, the paper is well written and presents a solid contribution to efficient representation learning for music.
The proposed use of ViT and token masking seems promising in music representation learning.
The paper is easy to read, and the presentation of the proposed method, experimental design, and results is convincing.
The proposed method seems to be applicable only to clip-level MIR tasks. I would be interested in the authors' opinion (perhaps a discussion) on how the proposed architecture could be applied to frame-level tasks as well.
I also wonder about the effect of patch-size variations on performance. For example, would diverse patch sizes such as 4x4, 96x2, 128x3, 32x32, or hybrids of them affect the results?
Fully human-written |
|
Myna: Masking-Based Contrastive Learning of Musical Representations |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
In this paper, the authors propose a new training method for representation learning of music audio. The proposed method includes aggressive input masking, which seems to allow avoiding pitch shifting as an augmentation step, keeping the model aware of pitch and key information. The experiments show that, even without finetuning on the downstream tasks' training sets and despite its smaller size and training data, Myna outperforms many other methods.
- Good performance
- Parameter-efficient
- Trained on a public dataset only
- The proposed method is simple
- Limited novelty: some core changes, such as using a ViT and a masked autoencoder, are already proposed in other, similar work, including in the audio domain.
- Although the performance is strong, the margin is rather modest, not outstanding.
- I don't think we should call the audio processor used here a "tokenizer", no matter how overused the word is in the community. It does not tokenize (i.e., turn the input into a discrete representation) at all, and the naming is all the more confusing because some architectures do in fact discretize the input audio.
Fully human-written |
|
Myna: Masking-Based Contrastive Learning of Musical Representations |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes MYNA, an efficient, masking-based contrastive learning framework for musical representation. It replaces traditional music augmentations with a high-rate token masking (90%) on mel-spectrograms. This heavy masking significantly reduces the number of input tokens, allowing for large batch sizes (4096) on a single GPU, and achieves an efficiency gain over prior contrastive methods. The authors also introduce a hybrid patching scheme (combining vertical and square patches) to capture complementary features (general purpose vs. pitch structure). The model is pretrained on the public AudioSet music subset. Myna achieves competitive performance with larger private models and establishes a new public-data SOTA.
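For readers unfamiliar with the setup, a minimal sketch of how high-rate token masking could produce two contrastive views of the same clip is given below; the patch size, the sampling scheme, and the pairing strategy are assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def random_token_mask(num_tokens, keep_ratio=0.1, rng=None):
    """Return indices of the tokens kept after random masking."""
    rng = rng if rng is not None else np.random.default_rng()
    n_keep = max(1, int(round(num_tokens * keep_ratio)))
    return rng.choice(num_tokens, size=n_keep, replace=False)

def contrastive_views(spec, patch=(16, 16), keep_ratio=0.1, rng=None):
    """Split a mel-spectrogram into patches and draw two masked views.

    spec : (freq_bins, time_frames) array; both dims assumed divisible by the patch size.
    Returns two (n_keep, patch_h * patch_w) token matrices from the same clip,
    which a contrastive loss would pull together after encoding with a ViT.
    """
    rng = rng if rng is not None else np.random.default_rng()
    ph, pw = patch
    F, T = spec.shape
    patches = spec.reshape(F // ph, ph, T // pw, pw).transpose(0, 2, 1, 3)
    tokens = patches.reshape(-1, ph * pw)               # (num_tokens, patch_dim)
    idx_a = random_token_mask(len(tokens), keep_ratio, rng)
    idx_b = random_token_mask(len(tokens), keep_ratio, rng)
    return tokens[idx_a], tokens[idx_b]

# Toy usage: a 128-bin, 256-frame spectrogram, 16x16 patches, 90% of tokens masked.
view_a, view_b = contrastive_views(np.random.randn(128, 256), keep_ratio=0.1)
print(view_a.shape, view_b.shape)   # (13, 256) each: ~10% of the 128 tokens kept
```

The efficiency argument is visible even in this toy version: with 90% of tokens dropped, the encoder sees roughly one tenth of the sequence length, which is what permits the large batch sizes on a single GPU.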
1. The mask-only approach is simple and allows single-GPU large-batch training (batch size 4096), which translates to an 85x increase in efficiency over traditional contrastive methods like CLMR. The model achieves competitive average scores (68.6 for Myna-Hybrid) with MERT-95M, and surpasses public baselines like MERT-95M-public and MULE.
2. The hybrid patch design improves key detection (achieving SOTA among self-supervised methods) by integrating frequency-sensitive vertical patches. The method retains pitch sensitivity by avoiding traditional data augmentations (e.g., pitch shifts), which is beneficial for tasks like key detection.
1. Table 1 mixes public and private data baselines (e.g., MERT-330M) without transparently clarifying the training resource budgets.
2. The claim that "90% masking performs best" is not strongly supported by Figure 4. This is due to two issues: (a) performance differences across high masking ratios look marginal and lack verification of statistical significance; (b) the "average across all four benchmarks" curve is mathematically questionable, as it combines different metrics from different tasks.
3. The model's poor performance on EmoMusic is attributed to short clip length, a hypothesis that needs empirical verification.
1. It would be helpful if Table 1 were explicitly partitioned to clearly distinguish models trained on public data from those trained on private or internal corpora.
2. Could you provide supplementary figures showing the performance curves across different masking ratios for each of the four downstream tasks (MTT, GiantSteps, GTZAN, and EmoMusic)? |
Lightly AI-edited |
|
CHyLL: Learning Continuous Neural Representations of Hybrid Systems |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper studies the problem of learning hybrid dynamical systems, that is, systems combining continuous flows and discrete mode switches. Unlike existing approaches, the authors propose CHyLL (Continuous Hybrid System Learning in Latent Space), a method that learns directly from time-series data, avoiding major scalability bottlenecks of explicit segmentation or event detection. The central insight is topological: the discontinuities of hybrid systems arising from mode switching can be “glued” to form a piecewise smooth quotient manifold, on which the overall flow becomes spatially continuous. CHyLL operationalizes this known theoretical result by: (1) learning a singularity-free neural embedding of the quotient manifold in a higher-dimensional latent space, (2) concurrently learning the continuous flow within this embedded space, and (3) enabling prediction and control of hybrid dynamics without explicitly modeling discrete modes or resets. Experiments show that CHyLL can reconstruct hybrid system flows with high accuracy, recover topological invariants, and be applied to stochastic optimal control problems.
(1) I find the paper generally well written, aside from a few minor issues listed below. I like the clarity of the motivation and the transparency about what is novel, and I appreciate the balance in presenting both intuition and technical complexity.
(2) To my knowledge, CHyLL appears genuinely novel in its topological formulation of hybrid system learning. The reinterpretation of mode switches as gluing operations forming a quotient manifold, together with the learned embedding that makes hybrid flows continuous, is a non-standard path. While some ideas overlap with Neural ODE extensions and manifold-learning methods, no prior work explicitly connects hybrid system topology, latent manifold embedding, and continuous neural flow learning in a single approach.
(3) I believe that making the embedding theorem (Simic et al. 2005) from differential topology operational within Neural ODEs can inspire new methods that up to now were not able to successfully tackle hybrid systems.
(4) Experimental setup is generally appropriate.
(1) Presentation:
- Section 3: I find that the introduction of main concepts like guards and resets should be smoother. Before jumping to formal definitions, the authors could use Figure 2 to introduce these concepts informally first, to build intuition (e.g., in simple terms, what is the role of $q$).
- Section 2: I feel that the related work section would be easier to parse if it came after the intuition and notation on hybrid systems in the current Section 3.
- Levenberg-Marquardt: Due to the unclear experimental conclusions (lack of standard deviations), this aspect could be either dropped or emphasised more strongly. The computational overhead of using it should also be discussed.
(2) Experiments: Standard deviations across trials are missing in the experiments of Section 6.2. This makes it hard to conclude which version of the proposed method (with or without LM) performs better, and diminishes the impact of the conclusions.
(3) Minor:
- should read "data" in line 182
- "be" lacking in line 189
- notation should be ${\cal L}_c(\theta)$ in line 299
While my current score reflects the above weaknesses, I am happy to revise it if the rebuttal is successful.
Given the good performance of DynamicAE of Lusch et al. 2018 in the experiment of Section 6.3, what do the authors think of adding a discussion on combining CHyLL with Koopman-based methods? Namely, instead of parametrising the vector field and using Eq. (5) to evolve the latent dynamics, a Koopman approach would learn a linear evolution. Also, since DynamicAE is known to fail at modelling evolution over longer time horizons, one could think of combining CHyLL with representation learning for Koopman/transfer operators, e.g., the references below, to learn appropriate representations of hybrid systems. (A toy sketch contrasting the two latent evolution rules is included after the references.)
Han et al. Deep learning of Koopman representation for control, IEEE CDC 2020.
Kostic et al. Learning invariant representations of time-homogeneous stochastic dynamical systems, ICLR 2024.
Kostic et al. Neural conditional probability for uncertainty quantification, NeurIPS 2024.
Jeong et al. Efficient Parametric SVD of Koopman Operator for Stochastic Dynamical Systems, arXiv preprint, 2025.
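To make the suggestion above concrete, here is a toy sketch contrasting the two latent evolution rules; all function and variable names are hypothetical, and this is not code from CHyLL or the cited works:

```python
import numpy as np

def evolve_neural_ode(z0, f_theta, dt, steps):
    """Neural-ODE-style evolution: integrate a learned latent vector field (schematic)."""
    z = z0
    for _ in range(steps):
        z = z + dt * f_theta(z)          # explicit Euler, purely for illustration
    return z

def evolve_koopman(z0, K, steps):
    """Koopman-style evolution: repeatedly apply a learned linear operator K (schematic)."""
    return np.linalg.matrix_power(K, steps) @ z0

# Toy usage with a linear "vector field" A, so the two rules coincide exactly.
m = 4
A = 0.1 * np.random.randn(m, m)
z0 = np.random.randn(m)
z_ode = evolve_neural_ode(z0, lambda z: A @ z, dt=0.01, steps=100)
z_koop = evolve_koopman(z0, np.eye(m) + 0.01 * A, steps=100)
print(np.allclose(z_ode, z_koop))        # True: Euler steps of a linear field form a linear map
```

The toy check only shows that, for a linear field, explicit Euler steps coincide with repeated application of a linear operator; the interesting question for the authors is which parametrization copes better with the glued latent geometry over long horizons.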
Fully human-written |
|
CHyLL: Learning Continuous Neural Representations of Hybrid Systems |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
CHyLL presents a groundbreaking method for learning hybrid system dynamics directly from time-series data, eliminating the need for mode switching or event detection. Its core innovation lies in reformulating the discontinuous state space into a continuous quotient manifold—the hybrifold—by topologically gluing guard surfaces via reset maps. A dual-phase training strategy separately optimizes the continuous encoder/flow and the discontinuous decoder. Evaluations on systems like the bouncing ball and torus show CHyLL's superior prediction accuracy and its ability to identify correct topological invariants.
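For concreteness, the gluing construction described above can be written (in my own notation, which may not match the paper's) as the quotient

$$ M \;=\; \Big(\coprod_{q \in Q} D_q\Big) \Big/ \big(x \sim R_{q \to q'}(x), \;\; x \in G_{q \to q'}\big), $$

where $D_q$ are the per-mode domains, $G_{q \to q'}$ the guard sets, and $R_{q \to q'}$ the reset maps; the hybrid flow becomes continuous on $M$ because a trajectory point hitting a guard and its reset image are identified as the same point.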
This article is quite abstract, drawing on profound mathematical theories to guide the methodology design. It addresses a very practical problem and has achieved promising results in the selected experimental examples.
1. Although this article cites many theorems and employs sophisticated mathematical frameworks, its contributions are primarily concentrated on methodological design, and the theoretical contribution is not sufficiently substantiated. Perhaps the authors could further incorporate theoretical analysis or proofs regarding their methods (such as the design of the loss in Eq. 4).
2. The experiments in the paper all use simulated data from simple examples. There is an absence of benchmarking on publicly available real-world datasets.
1. Although profound mathematical theorems are cited in the paper, I have concerns regarding their rigor. For instance, Theorem 1 states that the manifold is piecewise smooth, while Theorem 2 requires the manifold to be $C^r$; what is the relationship between these two conditions? Additionally, I seem unable to find the precise definition of 'piecewise smooth'.
2. Could there be performance evaluations on some real-world public datasets? |
Moderately AI-edited |
|
CHyLL: Learning Continuous Neural Representations of Hybrid Systems |
Soundness: 3: good
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors propose CHyLL, a method to learn continuous representations of hybrid systems in latent space. The method ensures the representations become continuous in a higher-dimensional latent space, then fits a latent ODE there and decodes back, avoiding mode labels, event functions, or segmentation. The contribution is the gluing/conformal/anti-collapse losses used to learn this manifold from time-series data alone, with demos (bouncing ball, torus/Klein bottle topology checks, and a control task).
The paper is clear and well-motivated, and the pipeline and objectives are easy to follow.
The route explored is worthwhile as a method to modeling hybrid dynamics that generalizes across systems.
The proposal of learning a glued quotient manifold for hybrids via gluing + conformal losses is nice, and the training setup seems sensible (e.g. curriculum, anti-collapse, LM projection).
1. Although latent encoding for ODEs is not a new avenue, the quotient/gluing idea is a nice addition. However, as the authors hint at in the paper, learning the correct 'glued' space might be hard in principle, and the lack of guarantees is concerning.
2. The experimental scope is a little narrow, with toy problems/examples, and only a few comparative methods. The results for the ball juggling with MPPI experiment also show a deep Koopman baseline achieving a lower mean tracking cost, without much of a detailed explanation.
3. It's hard to comment on the scalability and robustness of the approach since there are no results under sensor noise/partial observability, many-guard/mode systems, or higher-dimensional robotics experiments.
4. The authors should expand further on the limitations of the proposed method, and its failure modes, with respect to the existing literature.
1. Can you provide any diagnostic or bound (e.g., jump norms across detected guards, encoder/decoder Jacobian conditioning near resets) that indicate the learned latent is truly continuous, or conditions under which it fails?
2. How sensitive is performance to the gluing/conformal/anti-collapse weights and latent dimension? It would be valuable to include a sweep or at least failure modes when turning each off.
3. Can you expand further on why deep Koopman wins on mean cost, and whether CHyLL improves with different horizon/MPPI settings or controller?
4. How do you think the method would behave under more realistic sensor noise or partial observability? |
Fully human-written |