|
WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a method to fine-tune a multimodal large language model (MLLM) to produce prompt-aware embeddings for audio and video. Starting from Qwen2.5-Omni, the authors first learn a BEATs-based adapter to ingest an additional audio input. They then apply contrastive training with LoRA using two objectives: bidirectional cross-modal retrieval and a QA objective that treats source-modality-to-text matching as retrieval. Text embeddings are taken from the last token of the last layer. For non-text modalities, the method conditions on a prompt, aggregates the last tokens from all layers, and passes them through a fusion module. Experiments show improved retrieval performance over other MLLMs, and the fine-tuning also improves the base MLLM's performance on text generation tasks.
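To make the embedding path concrete, one reading of the non-text branch is roughly the sketch below (module names, dimensions, and the use of concatenation for the cross-layer aggregation are assumptions, not details confirmed by the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LastTokenFusion(nn.Module):
    """Aggregate the last-token hidden state from every decoder layer of the MLLM
    and fuse them into a single prompt-aware embedding (illustrative sketch only)."""
    def __init__(self, num_layers: int, hidden: int, out_dim: int):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(num_layers * hidden, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, hidden_states):  # list of [batch, seq, hidden] tensors, one per layer
        last_tokens = [h[:, -1, :] for h in hidden_states]
        return F.normalize(self.fusion(torch.cat(last_tokens, dim=-1)), dim=-1)

def text_embedding(hidden_states):
    # Text side, per the summary above: simply the last token of the last layer.
    return F.normalize(hidden_states[-1][:, -1, :], dim=-1)
```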
The task formulation and model design are clear and sensible. The experiments are comprehensive and the results strongly support the effectiveness of the proposed method. The presentation is clear.
In the evaluation benchmarks, retrieval tasks appear to be reported in a single direction only, for example text-to-audio on Clotho. Reporting both directions would provide a fuller picture.
The use of the term “QA” is potentially confusing. During training, the QA objective is a source-modality-to-text contrastive setup, while in evaluation some QA tasks are executed via text generation with the MLLM. The same ambiguity appears for other evaluation benchmarks. It would help to specify, for each evaluation task, the exact inference procedure.
1. What is the architecture of the BEATs aligner, and is it similar to the fusion module?
2. How do text prompts vary across training tasks and evaluation tasks? |
Fully human-written |
|
WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces WAVE, a unified multimodal embedding model capable of embedding text, audio, video, and audio-video inputs. WAVE is validated on a variety of video, audio, and text retrieval tasks. Furthermore, WAVE is capable of robustly using instructions to improve multimodal embedding performance.
- Introducing audio to multimodal retrieval is relatively novel. The performance of the model is quite strong.
- Using a concatenation of the last-token hidden states from many layers is a novel method for vector embedding in this setting.
- The manuscript is clear and well-written.
- The approach is very similar to [1], except that an omni-model is used instead of a VLM.
- The paper is missing implementation details relating to the model architecture, including how inputs are templated into the LLM backbone and the resolution and visual tokenization used in the vision encoder.
- The paper does not compare their proposed pooling strategy with mean pooling with bidirectional attention, which tends to outperform last-token pooling strategies [2].
- The evaluation of audio-visual retrieval is limited. The reviewer can only find a single task that tests WAVE’s ability to retrieve audio-visual items (the RET split of MMEB-v2-Video), and only in one direction. Furthermore, it is unclear whether MMEB-v2-Video is designed to have audio used in its retrieval task. This could result in the audio track trivializing the task or being useless.
- It is unclear whether WAVE’s superior video retrieval performance comes from its different techniques, its larger-scale video retrieval training, or its specialization in video retrieval (i.e., no image retrieval).
Minor:
- Potentially missing related work: the use of “distractor” answers for QA training seems very similar to [3].
Several questions can be found in the above weaknesses section; in addition:
- How were the “distractor” answers used during QA training created? |
Fully human-written |
|
CAP: Improving the Robustness of LLM-as-a-Judge Against Adversarial Score Manipulation via Comparative Augmented Prompting |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the vulnerability of LLM-as-a-Judge systems to adversarial score manipulation attacks by proposing CAP (Comparative Augmented Prompting), a defense framework. The core idea of this method is to integrate comparative assessment principles into absolute scoring scenarios by using a TUTOR LLM to generate high-quality and low-quality reference samples as anchors, while employing activation vector modification techniques to ensure consistency in reference sample quality. Experiments on two datasets validate the effectiveness of CAP in defending against both white-box and black-box attacks on open-source and API-based models.
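For clarity, the "activation vector modification" appears to be standard activation steering; a minimal sketch of that mechanism is given below (the hook placement, how the direction is estimated, and the α values are assumptions rather than details taken from the paper):

```python
import torch

def add_steering_hook(layer: torch.nn.Module, direction: torch.Tensor, alpha: float):
    """Shift a chosen layer's hidden states along a 'standard quality' direction so the
    Tutor LLM's generated reference is pushed toward high (alpha > 0) or low (alpha < 0) quality."""
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * unit.to(hidden.device, hidden.dtype)
        return (steered,) + tuple(output[1:]) if isinstance(output, tuple) else steered

    return layer.register_forward_hook(hook)

# The steering direction itself would be estimated by contrasting activations of high- and
# low-scored reference texts, and alpha_h / alpha_l would control the two reference qualities.
```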
**1. Important and Practically Significant Research Problem:** The paper addresses the security issues of LLM-as-a-Judge systems, which represents a critical challenge in the current automatic evaluation field. The authors clearly demonstrate the severity of manipulation attacks, showing that adversarial attacks can inflate scores from 2.7 to 4.3. This research has strong practical value.
**2. Comprehensive Experimental Design:** The experiments cover 5 judge models (2 open-source + 3 API-based) and 2 tutor models, testing white-box attacks (AdvSuffix) and black-box attacks (DSI, BED), and include ablation studies, adaptive attack testing, and efficiency analysis.
**3. Clear Writing:** The paper is well-written with clear logic and smooth flow.
**1. Lack of Statistical Significance Verification:** The paper does not report statistical significance verification, including repeated runs and significance testing, so it is impossible to determine whether the observed differences exceed the range of random fluctuation. The SummEval dataset contains 100 source documents, each with 16 machine-generated summaries, and the TopicalChat dataset contains 60 conversational contexts, each with 6 machine-generated responses. These datasets are relatively small in scale, so results from a single run may be influenced by sample selection, making repeated experiments particularly important for verifying result stability. Moreover, the paper uses multiple large language models as judges and tutors, whose generation process is inherently stochastic, which makes repeated runs even more necessary to verify experimental validity.
**2. Relatively Simple Adaptive Attack Design:** The adaptive attack in Section 5.6 only uses prompts to "ignore reference texts," which is a relatively basic attack strategy. Meanwhile, Table 7 shows that in some cases, the adaptive attack effect is even lower than standard attacks (e.g., FlanT5-XL: 12% vs 49%), further indicating that the adaptive attack design is insufficient.
**3. Insufficient Parameter Selection and Hyperparameter Sensitivity Analysis:** The method involves multiple key parameters, but the sensitivity analysis is insufficient. Although Appendix B mentions sensitivity analysis, and Tables 12 and 13 display the αh and αl values under different dataset and model combinations, the main text does not adequately explain how these values were selected (through grid search optimization) and why different combinations require different parameter values. Although Appendix D.3 mentions selecting α values by "calculating the mean and variance of the standard references", and Figure 10 shows score changes under different α values, the explanation is insufficient. What are the selection criteria? How is the quality of high-standard and low-standard references balanced?
**4. Insufficient Discussion of Method Limitations:** The paper lacks in-depth discussion of the method's applicable scope and failure scenarios. There is insufficient analysis of cases where defense effectiveness is poor in Tables 2-3 (such as Gemini-2.0 + DSI: 33%). There is no discussion of the impact of tutor model quality on defense effectiveness. Standard vector identification requires constructing a scoring reference set, which may be difficult to obtain in certain domains.
See Weaknesses |
Moderately AI-edited |
|
CAP: Improving the Robustness of LLM-as-a-Judge Against Adversarial Score Manipulation via Comparative Augmented Prompting |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces CAP (Comparative Augmented Prompting), a defense framework designed to improve the robustness of LLM-as-a-Judge systems against adversarial score manipulation. Motivated by the observation that comparative assessments are inherently more robust than absolute scoring, CAP integrates comparative principles into the absolute scoring process. Specifically, it employs a Tutor LLM to generate high-quality and low-quality reference outputs, which are refined via activation vector steering to serve as sample-specific anchors.
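To make the mechanism concrete, the judge prompt is presumably augmented along the following lines (the wording below is purely illustrative and not the paper's actual template):

```python
# Illustrative CAP-style judge prompt: the two Tutor-generated references serve as
# sample-specific anchors for the absolute score.
CAP_JUDGE_PROMPT = """You are scoring a {task} response on a 1-5 scale.

High-quality reference (should score about {high_score}):
{high_reference}

Low-quality reference (should score about {low_score}):
{low_reference}

Score the candidate below by comparing it against both references.
Candidate:
{candidate}

Score:"""

tutor_high = "..."   # Tutor LLM output steered toward high quality
tutor_low = "..."    # Tutor LLM output steered toward low quality
candidate = "..."    # the (possibly adversarially manipulated) response under evaluation

prompt = CAP_JUDGE_PROMPT.format(task="summarization", high_score=5, low_score=1,
                                 high_reference=tutor_high, low_reference=tutor_low,
                                 candidate=candidate)
```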
1. The paper provides a fresh perspective by importing comparative assessment principles into absolute scoring defense. This bridging insight is conceptually elegant and experimentally justified.
2. Figures and algorithmic descriptions are intuitive (especially Figure 3 illustrating the CAP workflow). The writing is generally clear and logically structured.
1. Efficiency and scalability – The approach requires an additional Tutor model invocation per evaluation, leading to 10–30× slower inference (Table 5). Although the paper acknowledges this, there is no exploration of smaller Tutors or precomputed reference caching. A study on how Tutor size or layer choice affects robustness vs. cost would make the work more practical.
2. Limited baseline diversity – The baselines are restricted to Perplexity-based detection and Chain-of-Thought prompting. Since score manipulation overlaps with broader prompt-injection/jailbreak attacks, comparisons with simple sanitization or rewriting-based purification defenses are missing and could clarify whether CAP brings unique advantages beyond input preprocessing.
There are some typos, e.g., in Line 167, 'together with an expert reference (), typically produced by human': the parenthetical reference appears to be empty, and 'by human' should read 'by humans'.
1. Transferability of standard vectors: Can the extracted steering direction learned on one dataset (e.g., SummEval) generalize to another (e.g., TopicalChat)? In real-world evaluation, user inputs vary widely—would CAP require re-estimating standard vectors per task/domain?
2. CoT baseline setup: The Chain-of-Thought prompt used here seems to focus on multi-step reasoning rather than explicit defense reasoning. Prior work such as “Unraveling the Mystery: Defending Against Jailbreak Attacks via Unearthing Real Intention” suggests first summarizing user intent before response. Did you test such CoT variants that explicitly incorporate intention extraction or self-verification steps? |
Fully AI-generated |
|
CAP: Improving the Robustness of LLM-as-a-Judge Against Adversarial Score Manipulation via Comparative Augmented Prompting |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes CAP, which addresses the issue of adversarial score manipulation that may occur in LLM-as-a-Judge, and injects comparative principles into absolute score evaluation to defend against this issue. Specifically, CAP utilizes high-score and low-score preference pairs generated by a TUTOR LLM, which are modified through activation vectors, as reference examples to guide robust scoring.
1. The paper provides a preliminary study to investigate comparative evaluation versus absolute evaluation, and proposes a preference data generation scheme to generate high-quality and low-quality example anchors for models in comparative evaluation.
1. $\textbf{Justifications of the key design choices are weak.}$ The paper provides insufficient justification or rationale behind its key method design:
- The paper does not justify the choices made for "standard references generation" in Section 4.2 and Section 4.3.
- Arbitrariness of standard vector thresholds: The paper sets the scoring thresholds for high/low standard reference texts as the 80th and 20th percentiles of the samples generated by the TUTOR model, but fails to explain why these two percentiles are the optimal choices, and there is also a lack of explanation and analysis regarding the number of generated samples.
2. $\textbf{Significant issue of efficiency trade-off is under-explored.}$ CAP has a critical practical limitation — extremely high efficiency costs. Although the paper acknowledges this trade-off, it does not conduct sufficient exploration or scenario-based analysis on it:
- The quantitative gap is obvious: As can be seen from Table 5 (TopicalChat dataset) and Table 9 (SummEval dataset), compared with the "no defense mechanism" scenario, CAP increases the evaluation time of small open-source models by 40 to 70 times. For example, the FlanT5-XL model takes only 4.0 seconds to process a single sample without defense, but 162.4 seconds when CAP is enabled (CAPₗ configuration) and 283.5 seconds with the CAPₘ configuration; even for API-based models like ChatGPT-3.5, CAP adds approximately 100 seconds of extra time per sample.
- Lack of optimization exploration: The paper describes this efficiency loss as "a reasonable price to pay for robustness" but does not explore improvement schemes that can enhance efficiency — such as reusing reference anchors for similar samples (instead of generating unique references for each sample), using a TUTOR model with a smaller parameter scale, or caching activation vectors. Without such optimizations, CAP is completely impractical in large-scale evaluation scenarios.
3. $\textbf{Types of tasks for evaluation are limited.}$ The scope of experimentation is relatively narrow, which makes it hard to assess the performance of CAP outside the tested tasks. All experiments focus on two types of tasks — text summary evaluation (SummEval dataset) and dialogue response evaluation (TopicalChat dataset). The paper does not apply CAP to other high-risk LLM-as-a-Judge scenarios, such as code generation evaluation or factual accuracy evaluation. For tasks where evaluation criteria are subjective or domain-specific, the definition of "comparative reference text" may be more difficult to delineate, and the effectiveness of CAP in such tasks has not been verified.
4. $\textbf{Lack of sufficient empirical analysis on preference data generation.}$ Specifically:
- In Related Work, there is no review on the related methods for generation of preference data.
- In the ablation study, the comparative results on preference data quality are insufficient. The only comparative baseline is W-CAP, which uses different instructions to make the model generate high-quality and low-quality summaries. This comparison with CAP may not be able to illustrate the effectiveness of Standard Vector Identification. Additional comparative experiments are needed, such as experiments using the High-Standard Score and Low-Standard Score as preference data in the process of Standard Vector Identification, and experiments with other preference data generation schemes.
- The paper can also benefit from a case study comparing the preference data generated by the proposed method with those generated by other relative methods.
5. $\textbf{This paper also has some presentation issues.}$ For example:
- Multiple errors in the direction of double quotes, Lines 92 and 100.
- The title of the prompt in Line 864 is incorrect.
- In the experiment of Section 2, why is it reasonable to compare scores and probabilities in Figure 2?
Please see the weaknesses above. |
Fully human-written |
|
Structuring Hidden Features via Clustering of Unit-Level Activation Patterns |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a novel way to align features across the different layers of a neural network. The author proposes to collect a latent-embedding buffer during training, which contains embeddings across different samples, positions, and layers, and then clusters these embeddings to create "feature anchor" points. The author then defines an auxiliary loss that constrains the latents of the neural network to match these feature anchors. The author claims that this auxiliary loss makes the model more explainable in two ways: 1. features across different layers are more aligned; 2. a model trained with the auxiliary loss produces better unsupervised segmentation maps when Grad-CAM is applied.
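To check this understanding, the procedure is presumably close to the following sketch (k-means is used here for concreteness, the loss form is a guess, and the rank-based preprocessing discussed in the weaknesses below is omitted):

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def build_anchors(buffer: torch.Tensor, num_anchors: int) -> torch.Tensor:
    """buffer: [num_collected, dim] latent embeddings gathered across samples,
    positions, and layers; the cluster centers act as the 'feature anchor' points."""
    km = KMeans(n_clusters=num_anchors, n_init=10).fit(buffer.detach().cpu().numpy())
    return torch.as_tensor(km.cluster_centers_, dtype=buffer.dtype, device=buffer.device)

def anchor_loss(latents: torch.Tensor, anchors: torch.Tensor) -> torch.Tensor:
    """Auxiliary term pulling each latent toward its nearest anchor."""
    nearest = anchors[torch.cdist(latents, anchors).argmin(dim=1)]
    return F.mse_loss(latents, nearest.detach())

# total_loss = classification_loss + lambda_aux * anchor_loss(latents, anchors)
```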
The author presents a novel way to make neural networks more interpretable.
The method presented by the author is not post-hoc, unlike many other interpretability works.
The author shows that their method aligns better with class-level segmentation maps when Grad-CAM is applied.
The author only provides a baseline of a ViT trained with the standard classification loss, but does not compare their method with other methods that improve model interpretability.
The model does
The author does not provide a standard evaluation (classification accuracy) comparing standard ViT training and the model trained with their auxiliary loss.
There are many moving parts in the design, such as ranking as preprocessing and group rank, and the author does not provide enough ablation studies to show that these designs are necessary.
Can the author read my summary to check whether my understanding is correct? If not, please tell me and explain how the model actually works.
I would guess that training with the auxiliary loss improves the model's interpretability but hurts the model's performance; if so, by how much?
The ranking part is confusing to me. Why does the author use "rank" to preprocess the latent embeddings, and then something like normalization?
Fully human-written |
|
Structuring Hidden Features via Clustering of Unit-Level Activation Patterns |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a self-supervised learning framework aimed at improving the interpretability of deep neural networks by structuring hidden feature representations. The method operates at the hidden-unit level, clustering activation patterns across data samples and imposing a structure-aware regularization that encourages cross-layer feature reuse and the emergence of representative anchor units.
1. Structured feature representation is an interesting topic. The paper does introduce a hidden-unit-level approach to organizing features, though it may remain complex.
2. The combination of clustering hidden units and enforcing structure via a regularization objective is conceptually interesting and aligns with efforts to improve interpretability through learned representations.
1. The evaluation relies heavily on Grad-CAM++ metrics. Gradient-based attribution methods are known to have limitations (especially in deep networks) and can produce misleading explanations. This raises concerns about whether the reported scores in interpretability are meaningful.
2. The paper does not adequately position itself relative to prior explanation methods. Many traditional explanation methods are missing, such as TCAV (Kim et al.) and DINO. While these methods are not specifically relevant to "structure", they are helpful for understanding features.
3. The paper does not convincingly demonstrate that structured representations are helpful for downstream tasks. If not, the advantage of structured features over existing feature characterization methods can be the key. But this part is missing in the current scope.
4. I am curious about the impact of the structure-aware regularization on feature dynamics and learning efficiency.
See above |
Lightly AI-edited |
|
Structuring Hidden Features via Clustering of Unit-Level Activation Patterns |
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
Deep neural networks develop complex and unstructured internal representations, often creating redundant features that are difficult to interpret. This paper introduces a self-supervised regularization method to better organize hidden features, enabling their reuse across layers and increasing feature diversity within layers. This approach improves interpretability, makes better use of network resources, and may enhance generalization performance.
The method has two main components: First, it identifies redundant features through cross-layer clustering. Second, it implements a structure-aware regularization that encourages the reuse of one unit per cluster through residual connections while allowing other units to learn complementary features.
The authors tested their approach on three datasets: a synthetic task, CIFAR-10, and ImageNet, using variants of the ViT architecture. They developed new metrics to measure feature reuse and diversity, while utilizing previously proposed metrics for interpretability and performance. Compared to standard training methods, their results showed better feature organization and interpretability.
- The paper presents its ideas clearly and comprehensively, with excellent organization and complete details.
- The approach is novel, introducing efficient methods to reduce computational costs without compromising effectiveness. The use of group ranked transformation for clustering helps reduce sensitivity to magnitude differences. The evaluation framework and analysis metrics are well-designed.
- The concept of enabling precise unit-level feature reuse across layers while utilizing residual layers is particularly novel.
- The paper's primary weakness lies in its limited experimental scope. While the presented results are promising, a broader evaluation across diverse datasets, model architectures, and network layers would better demonstrate the method's generalizability and practical impact. Enhanced visualizations of feature organization across multiple layers would also strengthen the paper's empirical validation.
- The introduction of multiple hyper-parameters without detailed ablation studies makes it challenging to determine optimal settings for future applications.
- Is the structure loss calculated at the sample level?
- When clustering flattened representations across token positions and layers, multiple units from different token positions can end up in the same cluster, but only one anchor unit is selected per cluster. Would selecting multiple anchor units per cluster for each unique token position improve results?
- How does the method handle cases where the same unit position from different layers appears in different clusters but has the lowest index in each? This could create multiple loss computations for the same residual stream position.
- How does this method perform with text inputs, where token counts vary and token positions can have significantly different representations? A discussion is required on the applicability across domains. |
Lightly AI-edited |
|
variCOT: Variational Inference for Implicit Chain-of-Thought in Language Models |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a variational training method for latent reasoning in LLMs. The key idea is to use control tokens to cast an LLM backbone into the prior, the posterior, the CoT decoder, and the answer decoder. These different pathways are tied together with an ELBO, where the CoT decoder and the answer decoder are assumed to be conditionally independent. Three different ways of injecting the latents into the decoders are compared, and the authors claim an innovation in adding latents via an extra cross-attention layer. The authors conducted experiments based on GPT2 and LLaMA3.2, and report results on some reasoning benchmarks.
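For concreteness, the extra cross-attention injection is presumably something like the block below (the zero-initialized gate, the choice of the latent as query with the token states as key/value, and how the updated latent feeds back into decoding are assumptions on my part):

```python
import torch
import torch.nn as nn

class GuidedLatentBlock(nn.Module):
    """Per-layer latent update: the latent thought vectors Z attend to the layer's
    token states (Z as query, token states as key/value), modulated by a gate."""
    def __init__(self, hidden_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # starts close to the base model's behavior

    def forward(self, z, h_self):                  # z: [B, K, H], h_self: [B, T, H]
        probed, _ = self.attn(query=z, key=h_self, value=h_self)
        return z + torch.tanh(self.gate) * probed
```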
The idea of formulating latent reasoning as variational inference is interesting.
The exploration of different architectural design is inspiring.
1. Confusing presentation: There are several places in this paper that conflict with each other. For example, in Proposition 2.3, one of the factors is $p(Y^r|X^q, Z)$. However, in Fig 1, 3rd row, the blue squares, which represent $Y^r$, seem to be generated solely from $Z$. I also find the "horizontal" vs "vertical" (and the "hybrid" of them) taxonomy very hard to understand.
2. Unsatisfying experiment results: In Table 1, the proposed model only occasionally outperforms the baseline CoT-SFT. The authors can argue that their method is more efficient. But the question is, if we give the proposed model more compute budget, can it outperform the baseline?
3. Inadequate ablation: The idea of sharing the backbone for the prior, posterior, and decoder is interesting. However, the authors' experiments cannot justify that the gain comes from variational inference. For example, how about ablating the pathway of the posterior (2nd row in Fig. 1) and directly using the latents from the 1st row in decoder training? Note that the ELBO in Theorem 2.4 is very prone to posterior collapse.
4. Critical related work missing: Phan et al. are probably the first to discuss the relation between latent reasoning and variational inference, and they definitely deserve credit. Kong et al. also introduced latent variational inference to LLMs and, as far as I know, are the first to use cross-attention for latent injection. Liu et al. also introduced continuous latent reasoning and demonstrate the scaling property of the latent dimension.
Phan et al., Training chain-of-thought via latent-variable inference, NeurIPS'23
Kong et al., Latent Thought Models with Variational Bayes Inference-Time Computation, ICML'25
Liu et al., Deliberation in Latent Space via Differentiable Cache Augmentation, ICML'25
1. What is the actual role of the 3rd row of Fig. 1? Are prompt tokens used for this row? Did the authors try ablating it?
2. What are the training dynamics like? Would sharing the network across all these rows make the training unstable? Does the backbone degrade after this training?
3. The authors didn't mention this in their experiment section, but did they train from scratch or post-train the GPT2 and LLaMA3.2 models?
Fully human-written |
|
variCOT: Variational Inference for Implicit Chain-of-Thought in Language Models |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes variCoT, a method to accelerate chain-of-thought (CoT) reasoning in language models by performing reasoning in a continuous latent space rather than through explicit token generation. The method uses strategic control tokens for single-pass training and lets the latent Z guide generation by training small MLPs on top of backbone models. The authors claim a 2.5× inference speedup while "matching or exceeding" explicit CoT accuracy on mathematical and commonsense reasoning benchmarks.
The authors make a good case for using a latent Z to guide reasoning and try to achieve single-stage training. The approach provides a 2.5× inference speedup with an ~80-90% reduction in token generation, addressing an important issue in the field.
* Computing the posterior via a simple feedforward MLP on top of a pre-trained backbone is naive. A proper VAE inference should refine Z iteratively based on how well it reconstructs the data, but the current approach is "discriminative" rather than "generative" (optimization-based), which is suboptimal for structured inference.
* It is hard to believe that a token-level posterior can lead to performance gains. The prior literature on VAE LMs makes clear that fine-tuning pretrained autoregressive models as VAEs leads to posterior collapse; autoregressive and variational objectives are fundamentally incompatible. I think the authors should provide more insight and explanation on why the proposed architecture can achieve this.
* In the experiments, the paper claims to "match or exceed explicit CoT accuracy", but Table 1 shows performance degradation. Also, the gap between the two base-model sizes points to a scalability issue.
* The reconstruction evaluation is severely inadequate. The paper only reports ROUGE-1 and BLEU-1 (both unigram metrics) without baselines; standard practice in text generation requires multiple ROUGE variants and perhaps human evaluation.
* The training method is also unclear to me. I'm not sure whether the backbone is frozen or fine-tuned. If frozen, how can tiny MLPs learn proper VAE features from representations optimized for a different objective? This part should include more details.
* The presentation of the paper can be improved. The figures and the algorithm/notation could be clearer.
* Regarding the KL term, since the β weight is very small (only 0.01), I wonder whether the model really uses the VAE framework meaningfully.
* Are all pretrained GPT-2/LLaMA parameters updated, or only the newly added modules (cross-attention, prior/posterior heads)? Please provide exact numbers of trainable vs frozen parameters.
* Can you add more ablation studies to isolate whether the gain comes from the VAE objective or just the cross-attention architectural change?
* The paper doesn't provide any study of the posterior distribution. I wonder whether the authors can provide latent-space visualizations, samples from the prior vs. the posterior, and KL values during training.
* Why does the simplest baseline, CoT-SFT (standard fine-tuning with explicit reasoning), achieve better accuracy?
Fully human-written |
|
variCOT: Variational Inference for Implicit Chain-of-Thought in Language Models |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper presents variCoT, a novel and principled framework for implicit Chain-of-Thought (CoT) reasoning. The work addresses the high inference latency of explicit CoT, which requires the autoregressive generation of many intermediate tokens. While other implicit CoT methods exist, they often rely on heuristics or multi-stage knowledge distillation.
The key contribution of variCoT is to formalize the unobserved reasoning trace as a continuous stochastic latent variable, Z. This enables end-to-end training within a unified Evidence Lower Bound (ELBO) objective, which jointly optimizes a prior $p(Z|X^q)$ and an approximate posterior $q(Z|X^q, Y^r, Y^a)$.
The framework is shown to match or exceed the accuracy of explicit CoT on reasoning benchmarks (e.g., GSM8K, CommonsenseQA) while delivering a significant 2.5x inference speedup by generating only the final answer. The model also retains the ability to reconstruct the explicit rationale Y^r from Z, providing a valuable mechanism for interpretability.
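For concreteness, under Assumption 2.2 the training objective presumably takes the familiar β-weighted ELBO form (a paraphrase, using the notation from the summary above):

```latex
\mathcal{L}_{\text{variCoT}}
= \underbrace{\mathbb{E}_{q(Z \mid X^q, Y^r, Y^a)}\big[\log p(Y^r \mid X^q, Z)\big]}_{\mathcal{L}_{\text{reasoning}}}
+ \underbrace{\mathbb{E}_{q(Z \mid X^q, Y^r, Y^a)}\big[\log p(Y^a \mid X^q, Z)\big]}_{\mathcal{L}_{\text{answer}}}
- \beta\, \mathrm{KL}\big(q(Z \mid X^q, Y^r, Y^a)\,\big\|\,p(Z \mid X^q)\big)
```

At β = 1 this is a valid lower bound on log p(Y^r, Y^a | X^q) under the conditional-independence assumption; the small β = 0.01 used in training is what motivates the posterior-collapse question raised below.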
1. Principled Probabilistic Framework: The paper's primary strength is its departure from heuristic-based methods. By grounding the latent reasoning process in a formal variational inference framework (the ELBO objective), the authors provide a theoretically sound and elegant foundation for end-to-end optimization.
2. Strong Empirical Results & Efficiency: The method is not merely theoretical; it delivers strong practical results. Achieving a 2.5x inference speedup (Figure 3) while matching the accuracy of the much slower explicit CoT-SFT baseline on GSM8K is a significant and compelling result.
3. Reversible Reasoning for Interpretability: A major drawback of most implicit reasoning methods is that the process becomes an opaque "black box." The inclusion of the reasoning decoder p(Y^r | X^q, Z) and the strong reconstruction results (Table 2) is a key advantage, offering a path to interpretability that competing methods lack.
1. Validity of Assumption 2.2: The entire ELBO decomposition (Theorem 2.4) hinges on Assumption 2.2, which posits that the explicit rationale Y^r and the final answer Y^a are conditionally independent given the latent trace Z. This is a strong assumption, and its validity is questionable; the linguistic formulation of a rationale likely provides constraints that aid in answer generation, and vice-versa. The paper would be strengthened by a justification or analysis of this assumption.
2. Fixed-Size Latent Bottleneck: The method trades the variable-length bottleneck of explicit rationale tokens for a fixed-size latent bottleneck (where K=6 is used in experiments). The sensitivity analysis (Figure 6) confirms K is a critical hyperparameter. This raises a scalability concern: it is unclear whether this small, fixed-size trace can scale to problems requiring significantly more reasoning steps than those in the benchmark datasets.
3. Scalability to Large Models: All experiments are conducted on relatively small models (GPT-2 and LLaMA3.2-1B). The stability and effectiveness of VAE-style training on frontier models (e.g., 70B+) remains an open question. Furthermore, the addition of a new cross-attention mechanism at every Transformer layer introduces a non-trivial computational cost that is not fully analyzed.
4. Incomplete Efficiency Analysis: The paper's efficiency analysis is one-sided, focusing exclusively on the clear win in inference latency. The paper is silent on:
(1) Training Efficiency: The training protocol (Algorithm 1) and the augmented architecture (per-layer cross-attention) are almost certainly more computationally expensive than the CoT-SFT baseline. This trade-off is not discussed.
(2) Data Efficiency: The model is trained on the same large augmented dataset (385K samples) as the baselines. No analysis is provided to suggest that this more complex variational objective improves sample efficiency or could learn effectively from less data.
1. Analysis of Posterior Collapse: A discussion on posterior collapse—a primary challenge for VAEs—is notably absent. The paper appears to use two mechanisms to mitigate this: (1) a β=0.01 coefficient on the KL divergence, and (2) a dual-reconstruction objective (L_reasoning and L_answer). A formal analysis of these mechanisms is needed. How sensitive is model performance to the choice of β? Is the dual-loss objective a sufficient regularizer on its own?
2. Justification for Query-based Guidance: The authors use the latent Z as the Query and the text representations H_self as the Key/Value, implying the "reasoning state" is actively probing the text for information. Could the authors elaborate on this important design choice? Did they experiment with the reverse configuration (i.e., H_self as Query) and, if so, what were the results?
3. Discussion of Concurrent Work: The "Related Work" section should be updated to situate the paper against highly relevant concurrent research. Specifically, "Latent thought models with Variational Bayes inference-time computation" (ICML'25) appears to be a parallel effort, also proposing a variational Bayes framework for learning "latent thought vectors." A discussion comparing and contrasting the variCoT framework with this "LTM" approach (e.g., differences in architecture, objective, or optimization) is essential to properly clarify this paper's unique contributions. |
Heavily AI-edited |
|
variCOT: Variational Inference for Implicit Chain-of-Thought in Language Models |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes variCoT, a variational framework for implicit Chain-of-Thought (CoT) reasoning. The key idea is to model a continuous latent reasoning trace Z and optimize a single ELBO that jointly trains the prior, posterior and decoder. The method claims CoT-level accuracy with faster inference and the ability to reconstruct rationales when desired.
The ELBO decomposition with a KL to a question-conditioned prior is clear and aligns training and inference paths.
Guided latent reasoning, which uses Z as cross-attention queries with per-layer gates, is a smart design for unifying the prior, posterior, and decoders without multi-stage distillation or external modules.
* The conditional independence assumption may be strong in practice, especially for tasks where rationales constrain answer style or calibration; the paper does not test violations of this assumption. In particular, when Z is learned from a pre-trained base model, it is harder for the latent variable to capture abstract information.
* Although the "guided latent reasoning" mechanism makes sense, the whole method is incremental. The paper does not introduce a fundamentally new inference method; the novelty lies largely in packaging/unifying these pieces into a single-pass training recipe.
* The evaluation is modest in breadth and rigor. The numbers show marginal improvements. It is unclear how the results translate to large-scale LLMs, where implicit CoT behavior and inference costs differ. The paper also does not include an ablation study on which part of the design leads to the improvement.
* The VAE and posterior design is shallow. A simple MLP on top of a pretrained backbone state parameterizes the posterior, with no iterative refinement tied to reconstruction. As we know, latent-variable models often suffer from posterior collapse. The paper does not provide diagnostics (e.g., KL usage, mutual information, latent utilization) or a clear explanation for why collapse is avoided here.
* Figures, algorithm boxes, and notation could be clearer about where the posterior head sits, how Z flows across layers, and how the gates are applied. More evaluation results (harder metrics) could also be added to the experiment section.
* For interpretability, can you provide more analysis of Z? It would be great to show the distribution of Z and the generations under different values of Z.
* For efficiency, can you report throughput under batched decoding and long-context settings, and include cost curves vs. target accuracy?
* How sensitive is performance to the β coefficient and the number/shape of latent embeddings K beyond what Figure 6 shows? Can you provide trends on posterior collapse or over-regularization?
Fully AI-generated |
|
Holistic Prompting: Joint Reasoning with Reusable States and Shortcut Discovery |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes Holistic Prompting, a prompting framework that enables Large Language Models (LLMs) to reuse intermediate reasoning results both within and across problem instances. Existing multi-step reasoning frameworks, such as Chain-of-Thought (CoT) and Tree-of-Thoughts (ToT), usually use trajectory-based state representations. Each state encodes the full reasoning history, preventing the reuse of partial reasoning outcomes and leading to redundant computation. To address this, Holistic Prompting constructs a shared state space of intermediate thoughts, supporting cross-instance reuse and shortcut discovery between solved and unsolved subproblems.
The proposed framework is empirically evaluated on two tasks (Game24 and retrosynthetic planning), showing improved success rates.
This paper introduces a unified framework for reasoning reuse and shortcut discovery, which conceptually bridges CoT/ToT-style prompting with retrieval-augmented reasoning paradigms.
- Limited Evaluation. The experiments focus on two specialized domains. For example, Game24 is a quite old synthetic dataset (used in ToT). As most results in the experiments and the Appendix are reported on this dataset, it remains unclear whether the proposed method can be applied to more practical domains, such as tool-use tasks [3] and coding tasks [2].
- Comparison with Retrieval-Augmented Methods. The paper claims conceptual similarity to Retrieval-Augmented Generation (RAG) but does not include direct comparisons or ablations against RAG-based baselines that could also leverage reusable intermediate results [1].
[1]. Buffer of thoughts: thought augmented reasoning with large language models. NeurIPS 2024.
[2]. SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution.
[3]. ToolRL: Reward is All Tool Learning Needs.
The authors are encouraged to address the concerns above. |
Fully human-written |
|
Holistic Prompting: Joint Reasoning with Reusable States and Shortcut Discovery |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors focus on a "memoization" opportunity in LLM reasoning. Instead of making the LLM do its reasoning from scratch for each problem, they aim to discover and reuse common intermediate steps/results.
- The proposed approach connects unsolved problem instances to already-explored reasoning paths from other samples, which is a good contribution. It is similar to dynamic programming and can support more efficient decoding and token usage.
- The results show that while the success rates are similar, the required steps (in their domain captured as reaction and molecule nodes) are fewer.
- The error analysis and ablation results are good.
- The effectiveness of this approach depends on the presence of reusable sub-structures in the targeted task class; it is not clear how much overhead is introduced when overlap is minimal (or non-trivial to adapt).
- Aggressive pruning, while controlling complexity, risks prematurely discarding valuable reasoning paths for atypical instances, potentially missing correct or novel solutions. The authors should think of situations where this can happen.
- There is a general assumption of using the LLM for batch or clustered problem solving rather than one-shot, highly individualized queries, potentially limiting its applicability in interactive or open-ended settings. (This is fine, but it needs to be acknowledged and mentioned).
- It would have been ideal if the authors could have connected this to RAG architectures and talked about situations where sub-structures are re-used even across problem settings or domains.
- Please address the points raised in the weaknesses section. |
Lightly AI-edited |
|
Holistic Prompting: Joint Reasoning with Reusable States and Shortcut Discovery |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Holistic Prompting (HP). The authors argue that conventional "trajectory-based state representations", where each state encodes its entire reasoning history, are redundant and prevent the reuse of intermediate computations, especially when tasks share overlapping subproblems. HP addresses this by processing multiple problem instances jointly within a shared And-Or graph structure, utilizing "collapsed states" that are Markovian and self-contained. This representation allows different reasoning paths to converge on and reuse identical subproblems, both within a single sample and across different instances. A core innovation of HP is an active "shortcut-discovery" mechanism, a type of inverse search that finds actions to connect existing unsolved subproblems to known, previously solved states, thereby aggressively pruning the search. Experiments demonstrate HP's effectiveness in the Game24 math puzzle and retrosynthetic planning.
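At its core, the reuse mechanism reads as memoization over canonical collapsed states shared across instances, roughly as in the sketch below (names are illustrative; the actual And-Or graph bookkeeping and the active, inverse-search shortcut discovery are richer than this exact-match lookup):

```python
from typing import Callable, Dict, Hashable, List, Optional

def solve(state: Hashable,
          expand: Callable[[Hashable], List[Hashable]],   # candidate thoughts proposed by the LLM
          is_goal: Callable[[Hashable], bool],
          solved: Dict[Hashable, list],                    # shared across problem instances
          depth: int = 0, max_depth: int = 8) -> Optional[list]:
    """Depth-limited search over collapsed (Markovian) states with cross-instance reuse."""
    if state in solved:          # subproblem already solved while handling another instance
        return solved[state]
    if is_goal(state):
        return []
    if depth >= max_depth:
        return None
    for nxt in expand(state):
        sub = solve(nxt, expand, is_goal, solved, depth + 1, max_depth)
        if sub is not None:
            solved[state] = [nxt] + sub      # memoize the sub-solution for later reuse
            return solved[state]
    return None
```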
1. The paper presents a novel idea on reusing reasoning states.
2. The proposed method is efficient in terms of tokens generated compared to ToT
- The proposed method achieves better performance than ToT
4. The shortcut discovery to intentionally arrive at already solved paths is interesting
1. The methodology seems to require intermediate states that are exactly the same, so that different tasks lead to common intermediate states whose previous solutions can be reused. Such tasks are rare, and the work only evaluates on two specific tasks.
2. The presentation is not clear. The descriptions are filled with jargon rather than simple, intuitive explanations or illustrations of the underlying meaning.
3. The proposed methodology lacks a memory component. Thus, it is forced to solve all input problems simultaneously and cannot solve problems consecutively while storing intermediate results from prior steps.
1. What other types of artificial or real-world problems can the proposed method solve?
2. Can a memory module be designed to store intermediate results for later reuse? |
Fully human-written |
|
Holistic Prompting: Joint Reasoning with Reusable States and Shortcut Discovery |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a new reasoning method for LLMs as an alternative to chain of thought (CoT) and tree of thought (ToT). The novel contribution is to build a graph of thoughts which is shared across input samples, where edges are built between intermediate reasoning states such that the states can be reused across different samples, thereby cutting down the length of reasoning traces. The states can only be reused as exact matches (as opposed to clusters or other abstractions). Two experiments are presented, the first a simple arithmetic problem using LLMs as the base predictor and the second a chemical synthesis problem, where existing domain-specific predictors were used instead of LLMs, due to LLMs giving high errors in this domain. The results on the arithmetic problem noticeably outperformed CoT and ToT with significantly fewer intermediate states and model calls. In the chemistry task, it matched the already high performance of the existing baselines but with fewer intermediate states.
The paper seeks to tackle an important problem and the idea of reusing intermediate reasoning states across inputs is definitely promising. However, in its current form, I am not convinced that this method will allow such an architecture to actually scale to standard LLM text-based reasoning problems, due to a combination of my intuition about the architecture and the lack of results on complex text-based domains. If the equality test was abstracted into some form of clustering or high-level concept correspondence, that may be a different story as this could potentially compress complex state spaces. For now, I don’t believe the method is competitive.
* The paper is clearly written
* It tackles the important and well-motivated problem of intermediate state representation and reuse in LLM reasoning
* The main methodological contribution, a sample-shared graph allowing reuse seems novel, though bear in mind that some very recent (last couple of months) methods tackle the reuse problem (e.g. metacognitive reuse, cross-question method reuse)
* The scope seems very limited. Since the matched intermediate states must be very/exactly similar and are low-level states without any abstraction, it is hard to see how this method could extend beyond problems with very simple input and intermediate token sequences. If the intermediate states were whole paragraphs or even sentences, how could these be reused at all?
* This paper is presented as a method for reasoning over LLMs, yet the second experiment didn’t use LLMs at all. If the problem precludes LLMs, you might as well use a domain-specific method rather than reasoning.
* I’m not sure why Game24 was tested with simple baselines, without considering higher performing ones like Graph of Thoughts or Self-discover, which may also be more relevant methodologically. It’s reasonably likely these methods would have matched your performance.
See Weakness section. |
Fully human-written |
|
Fair Graph Machine Learning under Adversarial Missingness Processes |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper studies graph fairness when demographic attributes are partially observed under adversarial missingness processes.
It introduces BFtS (Better Fair than Sorry), a 3-player adversarial learning framework combining a GNN classifier, a fairness adversary, and an imputation adversary. BFtS imputes worst-case sensitive attributes to make fairness evaluation robust under adversarial missingness. Both theoretical analyses and empirical results demonstrate superior tradeoffs between fairness and accuracy compared to baselines.
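To fix ideas, one plausible instantiation of the 3-player game is the alternating scheme sketched below (the loss forms, update order, batch fields, and the α, β weighting are assumptions, not verified details of BFtS):

```python
import torch
import torch.nn.functional as F

def bfts_step(classifier, fairness_adv, imputer, batch,
              opt_cls, opt_adv, opt_imp, alpha=1.0, beta=1.0):
    """One alternating update of the three players (illustrative sketch)."""

    def adv_loss():
        # Cross-entropy of the fairness adversary predicting (observed + imputed)
        # sensitive attributes from node embeddings; low loss = high bias estimate.
        emb, _ = classifier(batch.graph, batch.x)
        s_hat = imputer(batch.graph, batch.x)                 # imputed P(s = 1), in (0, 1)
        s = torch.where(batch.s_mask, batch.s_obs, s_hat)     # keep observed values
        pred = fairness_adv(emb).clamp(1e-6, 1 - 1e-6)
        return -(s * pred.log() + (1 - s) * (1 - pred).log()).mean()

    # 1) Fairness adversary: get better at recovering sensitive attributes.
    opt_adv.zero_grad(); adv_loss().backward(); opt_adv.step()

    # 2) Imputation adversary: choose imputations that make the bias estimate worst-case,
    #    i.e. easiest for the fairness adversary to exploit.
    opt_imp.zero_grad(); (beta * adv_loss()).backward(); opt_imp.step()

    # 3) Classifier: fit the labels while making sensitive attributes hard to recover.
    _, logits = classifier(batch.graph, batch.x)
    task = F.cross_entropy(logits[batch.train_mask], batch.y[batch.train_mask])
    opt_cls.zero_grad(); (task - alpha * adv_loss()).backward(); opt_cls.step()
```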
- The work identifies a realistic yet overlooked issue: adversarial missingness of sensitive attributes, which can mislead fairness evaluations in graph learning, and formalizes two adversarial missingness problems (AMAFC, AMADB)
- Theorems 2 and 3 clearly demonstrate that BFtS minimizes worst-case bias and approximates robust fairness
- Extensive evaluations are conducted across synthetic and real-world datasets. Empirical results consistently show superior fairness–accuracy trade-offs and robustness under limited or missing sensitive data
- The paper is well organized and easy to follow
- The practicality of adversarial missingness is vague. Whether a value is missing or not seems difficult for an adversary to control. If the adversary can deliberately drop some values, then modifying those values would seem to be an even stronger adversary to consider. It would be helpful to add more discussion of the practical scenarios of adversarial missingness in real-world cases.
- It would be helpful to add an introduction to the threat model setting in the main text.
- The bilevel optimization raises concerns about training stability. Although the authors provide the loss curve in Figure 6, it seems that Figure 6 is plotted with only a few points. It would be helpful if training stability could be shown more thoroughly in the experiments.
- It seems that the proportion of missing sensitive attributes is more than 50% of the total nodes in Figure 8. It would be helpful to see more experimental results under more stealthy attack settings.
Please see Weaknesses |
Fully human-written |
|
Fair Graph Machine Learning under Adversarial Missingness Processes |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses a critical and underexplored problem in fair graph machine learning: the effect of adversarial missingness of sensitive attributes on fairness-aware Graph Neural Networks (GNNs). The authors argue that prior fairness methods assume that sensitive attributes are either fully available or missing completely at random (MCAR), which is unrealistic in practice. To overcome this limitation, the paper introduces Better Fair than Sorry (BFtS), a 3-player adversarial framework that jointly learns a node classifier, a fairness adversary that predicts sensitive attributes from learned embeddings, and an imputation adversary that imputes missing sensitive attributes to approximate the worst-case fairness scenario. Moreover, the authors provide theoretical analysis showing that BFtS corresponds to a min–max optimization that minimizes classifier bias under worst-case imputations.
1. The paper has principled theoretical grounding, clear definitions, and a basic robustness justification.
2. The paper proposes an innovative framework whose 3-player adversarial setup elegantly unifies imputation, fairness estimation, and classification.
3. Extensive experiments are conducted on both synthetic and real-world benchmarks, and robustness is shown under varying degrees of missingness.
1. The 3-player adversarial training could be computationally heavy, particularly on large-scale graphs. A detailed runtime or memory comparison would strengthen the empirical analysis.
2. The study focuses on Demographic Parity (ΔDP) and Equality of Opportunity (ΔEQOP); their definitions are recalled after this list. It would be beneficial to include other fairness notions (e.g., Equalized Odds, Counterfactual Fairness) for completeness.
3. While the adversarial imputation approach is conceptually powerful, there is little discussion of how it performs under realistic partial observability (e.g., when only 5–10% of sensitive data is known).
4. The intuition behind the imputation adversary’s learned behavior could be further explored, perhaps via visualization or sensitivity analysis.
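For reference, the two metrics in question are the standard ones:

```latex
\Delta_{\mathrm{DP}} = \bigl|\,P(\hat{Y}=1 \mid S=0) - P(\hat{Y}=1 \mid S=1)\,\bigr|,
\qquad
\Delta_{\mathrm{EQOP}} = \bigl|\,P(\hat{Y}=1 \mid Y=1, S=0) - P(\hat{Y}=1 \mid Y=1, S=1)\,\bigr|
```

where S is the (possibly missing) sensitive attribute and Ŷ the predicted label.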
1. How sensitive is BFtS to hyperparameter tuning, especially the fairness weight (α) and imputation adversary weight (β)? Could the authors provide empirical or theoretical guidance on selecting them?
2. Does the degree-bias assumption for adversarial missingness generalize to graphs with highly non-homophilous structures?
3. Could BFtS be extended to handle multi-valued or continuous sensitive attributes rather than binary ones?
4. Is there any evidence of mode collapse or instability during the 3-player adversarial training, and if so, how is it mitigated? |
Fully AI-generated |
|
Fair Graph Machine Learning under Adversarial Missingness Processes |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper investigates the problem of fair graph learning when sensitive attributes are missing under an adversarial missingness process. The authors propose Better Fair than Sorry (BFtS), a three-player adversarial framework involving a graph classifier, a bias discriminator, and a missing-value imputer. The method aims to enhance fairness robustness by simulating worst-case imputations. Both theoretical arguments and empirical evaluations are presented, demonstrating that BFtS achieves superior fairness–accuracy trade-offs on multiple graph datasets compared to existing methods.
1. Novel and important problem
The paper targets a realistic setting where sensitive attributes are not missing at random, which is often overlooked in existing fair graph learning literature. Formulating this as an adversarial missingness problem is both intuitive and practically meaningful.
2. Methodological soundness
The three-player adversarial design is conceptually well-motivated and integrates ideas from fairness, robust optimization, and adversarial learning into a unified framework. The training procedure is clearly described and the objectives are well defined.
3. Comprehensive experiments
The evaluation covers both synthetic and real-world datasets, reporting multiple fairness and accuracy metrics. The results consistently show that BFtS outperforms baseline methods under different missingness settings.
1. Limited theoretical depth
The theoretical results provide general insights but remain high-level. The proofs are brief and do not include convergence or generalization guarantees for the proposed min–max training objective. A more formal analysis of the optimization dynamics would strengthen the paper.
2. Comparison to related methods
While the paper compares BFtS with existing fair graph learning approaches, it could more clearly articulate how its mechanism differs from other fairness-aware imputation or robustness frameworks. The conceptual novelty may appear incremental without deeper discussion.
3. Experimental diversity
The adversarial missingness is modeled primarily through degree bias, which may not capture all possible real-world scenarios. Including other structural or attribute-based missingness patterns would make the empirical evaluation more convincing.
4. Fairness metrics and discussion
The choice of fairness metrics (Demographic Parity and Equality of Opportunity) is standard, but the paper could briefly justify why these particular measures were selected and whether the method generalizes to others.
5. Presentation details
Some notations are inconsistent between equations, and the visual presentation of a few figures could be improved for clarity.
Please see Weaknesses 1–4 above.
Fully AI-generated |
|
Hierarchical Speculative Decoding through Training-Free Slim-Verifier |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces HVSD, which refines the verification stage of speculative decoding into a three-tier paradigm. Specifically, the authors incorporate a slim-verifier that can directly execute accept, reject, or resample operations on tokens for which it demonstrates high confidence. This design effectively reduces how often the full verifier is called.
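To make the three-tier idea concrete, here is a toy sketch of such a hierarchical accept/reject loop (my own illustration with hypothetical inputs and a single confidence gate `delta`; it is not the paper's actual algorithm, and it omits the resampling path and the layer-skipping construction of the slim verifier):

```python
import random

def hierarchical_verify(draft_tokens, draft_probs, slim_probs, full_probs, delta=0.9):
    """Toy three-tier verification: draft_probs/slim_probs/full_probs are lists of
    dicts mapping token -> probability at each position (hypothetical interface)."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p = max(draft_probs[i].get(tok, 0.0), 1e-12)   # drafter's probability of the token
        if max(slim_probs[i].values()) >= delta:
            q = slim_probs[i].get(tok, 0.0)            # confident slim tier verifies
        else:
            q = full_probs[i].get(tok, 0.0)            # otherwise defer to the full model
        if random.random() < min(1.0, q / p):          # standard speculative accept rule
            accepted.append(tok)
        else:
            break                                      # reject: resample from the verifying tier (omitted)
    return accepted
```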
1. The core idea of this paper holds up, and the experimental results substantiate its effectiveness.
1. The writing clarity of this paper is insufficient. Although I generally understood the main idea, I could not fully follow the main points that Sections 3 and 4 attempt to explain. The authors seem to be detailing the construction of the slim-verifier $q'$; I suggest they first present the high-level idea of this part to aid understanding.
2. From a methodological perspective, the proposed HVSD is a lossy method that approximates the target model's output via layer skipping. In theory, its performance (e.g., accuracy) should be no higher than that of lossless methods (e.g., standard SD). However, the experimental results show that HVSD consistently outperforms standard SD. The authors should provide a detailed explanation for this counterintuitive result.
3. In HVSD, tokens for which the slim-verifier's confidence is low must be processed by the full verifier. If the sequence of tokens generated by the draft model contains tokens with varying confidence levels (i.e., partly high confidence, partly low confidence), how will the system handle this? Specifically, if verification is performed with a draft tree (e.g., drafting 64 tokens at once) or in batches, the probability that all drafted tokens are high-confidence is extremely low; does this mean the full verifier will need to be called in almost every verification step? If so, this would severely limit the overall acceleration.
4. The proposed HVSD seems difficult to directly apply to the latest speculative decoding methods (e.g., EAGLE). In EAGLE, every draft needs to reuse the target model's information from the last verification. If the full verifier is skipped in a certain verification, then the reusable information for the next draft will be missing. Will this affect the generation quality of the next draft tokens, and further affect the overall speedup ratio?
Please see the weakness. |
Lightly AI-edited |
|
Hierarchical Speculative Decoding through Training-Free Slim-Verifier |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Hierarchical Verification for Speculative Decoding (HVSD) — a three-tier, training-free speculative decoding framework that integrates a lightweight Slim-Verifier between the drafter and the full verifier. The work aims to overcome inefficiencies in conventional two-tier draft–verify frameworks by reducing unnecessary large-model calls for tokens that can be verified by a smaller intermediate model.
- The hierarchical three-tier framework generalizes speculative decoding beyond binary verification, backed by an information-geometric justification using KL projection.
- The KL-based derivations are mathematically sound, though in my opinion they take up too much of the paper.
- Practical and easy to integrate into existing inference systems.
- line 060: "s li m" -> "slim"
- line 315: "This blockwise Under a fixed skipping ratio r" -> what does it mean?
- The authors chose T5 as an evaluation model, which seems a bit too old for benchmarking. It would be better to use mainstream models such as Llama and Qwen.
- The benchmark tasks are split between the two models; instead, all benchmarks should be evaluated on all models.
- Other methods such as EAGLE 1/2/3 should also be compared.
- How are the gate thresholds $\delta_{1}$ and $\delta_{2}$ tuned?
- Why are the benchmark scores higher than the baselines' even though the proposed method is lossy?
Fully human-written |
|
Hierarchical Speculative Decoding through Training-Free Slim-Verifier |
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes a three-tier speculative decoding framework that inserts an intermediate **slim-verifier** (denoted \(q'\)) between a small drafter \(p\) and the full target model \(q\). The slim-verifier is constructed by skipping layers of \(q\) and is adapted at runtime via **Dynamic Layer-Skipping Adaptation (DLSA)**. The decoding pipeline is hierarchical: tokens drafted by \(p\) are first checked by \(q'\). Accepted tokens are emitted immediately. Rejected tokens trigger one of two actions: either a local regeneration at \(q'\) when its confidence is sufficient, or a deferral to the full model \(q\) for a lossless verification step. The authors also introduce a token-wise **lossy** acceptance gate at the intermediate tier and provide a KL-style decomposition motivating the insertion of \(q'\). Experiments on Gemma-2 and T5 across summarization, translation, QA, reasoning, and coding report lower rejection and \(2\times\)–\(3.3\times\) speedups over non-draft baselines, with gains over single-tier SD baselines on many tasks.
1. The paper builds on the observation that many rejected draft tokens have the potential to be accepted, which could reduce expensive target-model verification.
2. The authors provide sufficient theoretical analysis to support their claims.
3. The presentation and the figures are clear and easy to understand.
1. The ideas of both "multi-tier speculative decoding" and "layer-skip speculative decoding" are well explored in related work. The motivation of accepting potentially correct draft tokens to reduce verification cost is straightforward, but the proposed method seems more like an integration of cascade and layer-skip speculative decoding methods.
2. **Theoretical gap:** KL‑projection guarantees rely on convex U, but the implemented U (skip masks) is discrete/non‑convex. The theory thus motivates but does not **justify** the claimed optimality for DLSA’s search space.
3. HVSD achieves a better speedup ratio at the cost of losing the theoretical losslessness of speculative decoding, which is especially important in real-world applications.
4. Missing related work. Prior work [1] has already explored the idea of accepting potentially correct draft tokens. The lack of these baselines weakens the novelty and the evidence.
5. The experiments and evaluation are limited to T5-series and Gemma-series models. The authors should provide more experiments on state-of-the-art open-source LLMs, e.g., Llama 3 and Qwen 3, and extend the comparison to the latest speculative decoding methods, e.g., training-based EAGLE-3 [2] and training-free PEARL [3].
[1] Bachmann, Gregor, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Schönfeld, Ali Thabet, and Jonas Kohler. "Judge decoding: Faster speculative sampling requires going beyond model alignment." *arXiv preprint arXiv:2501.19309* (2025).
[2] Li, Yuhui, Fangyun Wei, Chao Zhang, and Hongyang Zhang. "Eagle-3: Scaling up inference acceleration of large language models via training-time test." *arXiv preprint arXiv:2503.01840* (2025).
[3] Liu, Tianyu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, and Xiao Sun. "Pearl: Parallel speculative decoding with adaptive draft length." *arXiv preprint arXiv:2408.11850* (2024).
1. Could you please provide more experiments on some extremely difficult tasks? Will HVSD significantly decrease the model performance?
2. Could you please provide measurement of the *DLSA search* time or any extra kernel launches for q′ construction? How many skip‑mask candidates are evaluated per context and per sequence? What is the wall‑clock % spent in DLSA when reporting end‑to‑end latency?
3. KV/cache reuse. Do you share KV caches or hidden states between q′ and q? If not, what’s the measured overhead, and how would cache reuse change the results? |
Fully human-written |
|
Single-Step Bidirectional Unpaired Image Translation Using Implicit Bridge Consistency Distillation |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a novel one-step method called Implicit Bridge Consistency Distillation (IBCD) for unpaired Image-to-Image translation by extending the consistency distillation framework to concatenated source-target PF-ODE obtained via Dual Diffusion Implicit Bridge (DDIB). In addition to this, the authors propose using a Distribution Matching loss with an adaptive reweighing scheme based on the introduced distillation complexity proxy to enhance the realism of samples, and a cycle-consistency loss to improve faithfulness. The proposed approach beats the corresponding baselines across multiple commonly used unpaired image-to-image benchmarks.
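For readers unfamiliar with the distillation objective involved, the standard (single-trajectory) consistency distillation loss, which IBCD reportedly extends to the concatenated DDIB trajectory, has the form (a generic sketch, not the paper's exact bidirectional formulation):
$$
\mathcal{L}_{\mathrm{CD}}(\theta)=\mathbb{E}\Big[\,d\big(f_{\theta}(x_{t_{n+1}},t_{n+1}),\,f_{\theta^{-}}(\hat{x}_{t_{n}},t_{n})\big)\Big],
$$
where $\hat{x}_{t_n}$ is obtained from $x_{t_{n+1}}$ by one step of the teacher's PF-ODE solver, $\theta^{-}$ denotes a stop-gradient (EMA) copy of the student, and $d(\cdot,\cdot)$ is a distance such as LPIPS.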
1. The application of Consistency Distillation to the unpaired image-to-image translation is novel and conceptually interesting.
2. The method proposed in the paper enables simultaneous bidirectional training and one-step inference.
3. The experimental section is extensive, with convincing quantitative and qualitative results, including ablations, failure cases, and a user study.
4. The MRI Contrast Translation experiments suggest potential applicability beyond standard image translation tasks, which strengthens the paper’s general interest.
1. The current quantitative comparison omits Optimal Transport methods, such as NOT [1] and/or ASBM [2], which is a significant methodological gap.
2. The two-stage training pipeline raises questions about efficiency and stability. As indicated in Table 4, the training of the first IBCD-only stage consumes the majority of the training time, while the second stage, which enables a better trade-off in the end, accounts for less than 20% of the total training steps. This imbalance suggests that the student initialisation may be suboptimal and that the training dynamics could be unstable when combining objectives.
3. The benefit of adding the DMCD and DMCD & Cycle losses to IBCD-only in Figure 3 is not clearly demonstrated. Since DMCD should enhance target-distribution realism and cycle consistency should enforce source-target faithfulness, the visual distinctions should be more evident.
References:
- [1] Neural Optimal Transport
- [2] Adversarial Schrodinger Bridge Matching
1. How does bidirectional training affect performance compared to a unidirectional IBCD model? Does it improve the final trade-off between realism and faithfulness, or does it introduce additional instabilities?
2. What motivates the two-stage training design and the rapid convergence of the second stage? Why is joint training (IBCD + non-adaptive DMCD + Cycle) from the start not feasible?
3. How long does the student model take to converge, and how does its total training time compare to the teacher model’s training time?
4. Could you please provide parameter counts for both teacher and student models, and for the diffusion-based baselines used in the image translation experiments?
5. Could you expand on the MRI Contrast Translation setup? Specifically, how was the IBCD teacher model trained, and did the diffusion-based baselines share the same teacher initialisation? |
Fully human-written |
|
Single-Step Bidirectional Unpaired Image Translation Using Implicit Bridge Consistency Distillation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes a novel approach, called IBCD, for solving unpaired image-to-image (I2I) problems. The method suggests training a one-step bidirectional translation map via consistency distillation of the DDIB (Denoising Diffusion Implicit Bridges) trajectory (which consists of the two composed diffusion ODE trajectories: source$\to$ noise and noise $\to$ target). Additionally, the authors propose using DMD loss with adaptive weighting to enhance the realism of the map's outputs and cycle-consistency loss to improve the input-output alignment. The authors empirically validate the importance of the components in the toy 2d experiment and in the unpaired image-to-image translation problems. The method yields superior results compared to the GAN-based and diffusion-based baselines on the AFHQv2 and CelebA-HQ translation benchmarks.
1) Combination of the cycle-consistency loss with DMD on the outputs and DDIB distillation is novel;
2) Adaptive weighting of the DMD loss looks promising and demonstrates efficiency in the toy experiment;
3) The method has efficient one-step inference;
4) The method outperforms the GAN-based baselines and most of the diffusion-based baselines (except the teacher DDIB, where IBCD has better alignment but worse FID).
1) Comparison with the diffusion-based baselines raises questions about fairness. While DDIB teacher and IBCD share the same class-conditional EDM backbone trained by the authors, the results reported in ILVR, EGSDE, and CycleDiffusion are obtained with the discrete-time DDPM backbone introduced in ILVR in 2021. I think unifying the backbone for all the sampling-based diffusion methods is essential for the fair comparison;
2) Several baselines are missing. The authors report quite a comprehensive amount of GAN-based and diffusion-based baselines, but it is essential to perform comparison with optimal transport (OT)-based baselines. GAN-based methods are older and are typically outperformed by diffusion-based methods (thus, I believe, their relevance is limited), while diffusion-based methods often suffer from lower input-output alignment compared to the one-step counterparts (and one of the advantages of IBCD is better alignment compared to e.g. DDIB teacher model). I appreciate adding UNSB, but comparing IBCD to such methods as e.g. NOT [1] and DIOTM [2] would greatly benefit the paper in terms of positioning against one-step baselines;
3) An important related work [3], which proposes to modify the DMD procedure for image-to-image scenarios, is missing;
4) The method description sometimes seems overloaded and overcomplicated (e.g. Equations 9, 10, 11);
5) The method would greatly benefit from studying higher-dimensional problems or a more diverse set of problems e.g. class- or prompt-conditional I2I translation, or translation between different types of domains.
[1] Neural Optimal Transport
[2] Improving Neural Optimal Transport via Displacement Interpolation
[3] Regularized Distribution Matching Distillation for One-step Unpaired Image-to-Image Translation
In Table 3, the authors present the effect of different components of the method on performance in terms of FID and PSNR. The improvement in input-output alignment is expected after adding the cycle-consistency loss. However, the adaptive DMCD strategy seems to be designed to enhance the realism of samples. Could you please explain why it has a pronounced effect on alignment while slightly harming realism?
Fully human-written |
|
Single-Step Bidirectional Unpaired Image Translation Using Implicit Bridge Consistency Distillation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The work proposes a single-step bidirectional unpaired image-to-image translation model that combines consistency distillation and distribution matching distillation for a DDIB teacher model. Key contributions include the formulation of consistency distillation for PF-ODE trajectories obtained with the DDIB model and the extension of the DMD framework to the proposed consistency distillation. The proposed IBCD model is evaluated on cat-to-dog, wild-to-dog, and male-to-female unpaired image-to-image translation problems. IBCD is compared with adversarial methods, including CUT and UNSB, and diffusion methods, including EGSDE, CycleDiffusion, SDDM, SDEdit, and the teacher model DDIB. These image-to-image translation models are evaluated using FID, density, and coverage for image realism, inference time and NFE for inference efficiency, and PSNR and SSIM for input-output similarity. The results show IBCD's advantages over the teacher model DDIB in terms of inference efficiency and input-output similarity, and over other adversarial and diffusion models in terms of image realism.
1) Surprisingly, according to Table 2 and Table 7, the proposed method has an inference time even lower than that of GANs, while achieving better realism and input-output similarity metrics. This brings diffusion models much closer to practical use in unpaired image-to-image translation problems.
2) The novel adaptive DMCD loss greatly improves the input-output similarity, as demonstrated by Table 3 and Figure 7.
3) The extensive evaluation of IBCD with different image-to-image translation metrics in Table 2 is also supported by a user study and perceptual measures in Table 6. The method is also compared with the LLM-based GPT-Image-1 model, which reveals the limitations of large foundation models in zero-shot editing for domain-specific problems.
4) Figure 8 studies the realism-similarity trade-off of the proposed model by balancing the DMCD and cycle-consistency losses.
1) According to Table 4, distilling the DDIB model into IBCD requires more than 200k training steps. It remains unclear how fast and stable IBCD's distillation is compared to the training of other unpaired image-to-image translation models.
2) As pointed out in many prior works on image-to-image translation [1, 2, 3], diversity remains an important characteristic of image-to-image translation models for multimodal pairs of domains. Even though the authors provide standard deviations of the image quality metrics, the study lacks a diversity evaluation for the proposed method.
3) As pointed out in [1], optimization of cycle-consistency losses struggles for pairs of image domains with a large complexity difference, where the bijection assumption does not hold, for example in the sketch-to-image problem. The evaluation protocol lacks such pairs of image domains.
4) The work lacks an explanation of IBCD's inference efficiency compared with GAN-based image-to-image translation models, e.g., in terms of the number of parameters and model sizes, which makes the reported speed advantage unexpected.
[1]. Augmented CycleGAN: Learning Many-to-Many Mappings from Unpaired Data. ICML-2018.
[2]. Multimodal Unsupervised Image-to-Image Translation. ECCV-2018.
[3]. StarGAN v2: Diverse Image Synthesis for Multiple Domains. CVPR-2020.
1) Can you comment on the training time and stability of the distillation for IBCD compared with baselines?
2) Can you quantify the diversity of IBCD model and baselines, for example, following the MUNIT approach [2]?
3) Can you comment on the number of training parameters and model size of IBCD compared to other baselines? The result of IBCD, which is 5 times faster than StarGAN-v2, looks impressive and I would like to understand this difference.
4) Can you comment on the applicability and performance of IBCD on pairs of image domains where the image-to-image translation is inherently multimodal and the bijection assumption does not hold, for example, sketch-to-image translation?
5) Can you comment on the choice of the distance function $d$ and how it affects the results? For example, some methods employ the $L_2$ distance instead of LPIPS in the consistency loss [4].
6) The DMD loss seems to have been applied in previous work on diffusion-based unpaired image-to-image translation. I suggest the authors discuss the relation between their implementation of the DMD loss and the one in [5].
[4] Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion. ICLR-2024.
[5] Regularized Distribution Matching Distillation for One-step Unpaired Image-to-Image Translation. Structured Probabilistic Inference & Generative Modeling workshop of ICML 2024. |
Fully human-written |
|
Single-Step Bidirectional Unpaired Image Translation Using Implicit Bridge Consistency Distillation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors propose a distillation approach for a previously introduced method for domain translation with unpaired data, the Dual Diffusion Implicit Bridge (DDIB). DDIB utilizes two pre-trained diffusion models (one per domain), or one conditional diffusion model pre-trained on multiple domains, to "concatenate" the PF-ODE between the first domain and the noise distribution with the PF-ODE between the noise distribution and the second domain. This enables the creation of a PF-ODE that flows from the first domain through the noise distribution to the second domain. The method proposed by the authors, Implicit Bridge Consistency Distillation (IBCD), is, in essence, a combination of consistency distillation and distribution matching distillation, adapted to distill the "concatenated" PF-ODE in both directions between the first and second domains, with the addition of a CycleGAN-style loss. The authors evaluate their method on toy data as well as unpaired image translation on the AFHQ and CelebA-HQ datasets, and compare it with previous unpaired image-to-image translation methods, including the OpenAI foundation model.
- The authors propose the adaptation of a combination of consistency and DMD distillation for DDIB.
- A wide list of competitor methods is considered, including the OpenAI foundational model.
- The authors show that the developed model outperforms competitors on quality of generation.
- A user study is provided.
- The proposed approach is more of an engineering combination of previously proposed distillation methods applied to an existing method for unpaired domain translation.
- While the authors show the superiority of their method over competitors in generation quality, this result is mainly due to the use of a strong teacher model, which also outperforms the competitors. However, there is no comparison of the parameter counts and training times of these models. There is a possibility that the teacher model outperforms other baselines because it uses significantly more parameters and compute, and that the distilled version does the same for the same reason.
- I suspect that in the comparison of inference time in Table 7, the use of different batch sizes may not be fair, since there may be a constant overhead for each call of the neural network; hence, methods that use larger batch sizes may look better in the img/sec ratio. Since the key part of all methods is to make some number of forward passes of the neural network, comparing one forward pass of each network with the same batch of images, multiplied by NFE, would better demonstrate the computational complexity of each model.
- Some other baselines based on SB theory are missing, such as, but not limited to, [1, 2].
[1] Shi Y. et al. Diffusion schrödinger bridge matching //Advances in Neural Information Processing Systems. – 2023. – Т. 36. – С. 62183-62223.
[2] Mokrov P. et al. Energy-guided Entropic Neural Optimal Transport //The Twelfth International Conference on Learning Representations.
Could the authors provide more details on the number of parameters of all methods? |
Fully human-written |
|
Fundamental bounds on efficiency-confidence trade-off for transductive conformal prediction |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper asks how small a joint prediction set can be when doing transductive conformal prediction at a given confidence. It proves lower bounds on the expected set size. The main asymptotic message is that to maintain any non-trivial confidence, the expected joint set must grow roughly like $\exp(n\,H(Y\mid X))$, where $H(Y\mid X)$ is the conditional entropy. A non-asymptotic refinement adds a second-order "dispersion" term that depends on the variance of $\log P(Y\mid X)$. The authors also give an achievability result in an oracle setting where $P(Y\mid X)$ is known, and they analyze a special case where all test points share one true label by connecting to classical hypothesis testing with the generalized Jensen–Shannon divergence. They include a small MNIST toy study to illustrate trends.
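Schematically, and under the standard information-spectrum heuristic (this is my paraphrase of the reported shape of the bound, not the paper's exact statement or constants), the converse says that any procedure with joint miscoverage at most $\alpha$ must satisfy
$$
\log \mathbb{E}\big[\,|\mathcal{C}(X_{m+1}^{m+n})|\,\big]\;\gtrsim\; n\,H(Y\mid X)\;+\;\sigma\sqrt{n}\,\Phi^{-1}(1-\alpha)\;-\;O(\log n),
$$
where $\sigma$ is the dispersion (the standard deviation of $\log P(Y\mid X)$) and $\Phi$ is the standard normal CDF.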
Puts a clean information-theoretic lens on transductive conformal prediction and tries to sharpen earlier entropy-based limits with a finite sample expansion. The reduction to hypothesis testing in the same-label case is neatly explained and hooks into known optimal tests.
The statements are precise, the asymptotic and finite-blocklength styles are consistent, and the proofs trace to standard tools like information density bounds, Berry–Esseen, and method of types. The paper is careful about what is guaranteed and where the constants come from.
Key quantities like efficiency rate, dispersion, and the role of conditional entropy are defined clearly. The contrast between joint confidence and per-point Bonferroni is made explicit and illustrated. The same-label section is self-contained and readable.
Joint guarantees are relevant in batch certification, safety screening, and ranking. Having even pessimistic lower bounds helps practitioners understand why transductive sets often balloon as n grows. The work could become a common reference when teams debate whether transductive guarantees are worth the price in set size.
The headline asymptotic bound reaffirms the conditional entropy barrier already highlighted in recent work on conformal efficiency. The finite sample refinement with a dispersion term is welcome, but the experiments show a persistent gap that closes slowly, which makes it hard to see the practical sharpening.
The achievability result assumes oracle access to $P(Y\mid X)$. That turns the task into thresholding products of true class probabilities and inevitably matches the converse to first and second order. This is informative theoretically but not actionable. The paper stops short of proposing any implementable transductive procedure that approaches the bound when $P(Y\mid X)$ is learned with error.
Efficiency is measured only by expected set size. In transductive practice, teams often optimize other surrogates like false coverage proportion, cost-weighted set size, or rank-based utility. The bounds are said to “extend” to other notions, but those are deferred. Without at least one nontrivial worked out alternative, the generality claim feels thin.
The MNIST label noise toy is not enough. It uses simple noise models where $H(Y\mid X)$ is tractable, then shows that Bonferroni blows up. That result is unsurprising (see the union-bound note after these weaknesses). There is no stress test on real transductive use cases such as ranking or batch classification with covariate shift, no study of how well one can estimate $H(Y\mid X)$ and the dispersion from data, and no attempt to check tightness against a strong transductive algorithm rather than a Bonferroni baseline.
Dispersion is defined as the standard deviation of $\log P(Y\mid X)$. This could be quite useful as a diagnostic, yet the paper does not show how to estimate it reliably from finite data or how it correlates with observed set growth. The reader is left without a recipe to turn the bound into an engineering rule of thumb.
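As a side note on the Bonferroni baseline discussed above (standard material, not part of the submission): the reason it is expected to blow up is the union-bound construction behind it, which forces each per-point set to be run at level $\alpha/n$,
$$
\Pr\Big(\exists\, i:\ Y_i\notin \mathcal{C}_{\alpha/n}(X_i)\Big)\;\le\;\sum_{i=1}^{n}\Pr\big(Y_i\notin \mathcal{C}_{\alpha/n}(X_i)\big)\;\le\; n\cdot\frac{\alpha}{n}\;=\;\alpha,
$$
so each marginal set becomes increasingly conservative as $n$ grows.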
Can you give a theorem or a corollary that shows your finite sample lower bound strictly dominates prior entropy-only bounds over a clear domain of alpha and n, with explicit constants? A small synthetic where you can compute both exactly would make the gain concrete.
Suppose $P(Y\mid X)$ is approximated by a calibrated classifier with a known risk or Bregman divergence to the truth. Can you translate that misspecification into a slack on the achievability side, even if loose? A bound of the form "the gap grows as $\epsilon$ to some power" would be valuable.
Pick one alternative efficiency metric, for example expected rank of the true label inside the joint set or a budgeted FCP, and carry your full derivation through to a nontrivial corollary. This would support the claim that the framework covers broader measures.
Can you sketch an extension of the finite sample bound to continuous X with mild regularity on scores, perhaps using empirical process tools instead of types? Even a simplified theorem for plug-in density ratios would widen the impact.
Provide a practical estimator and a confidence band for $H(Y\mid X)$ and the dispersion from held-out data, with a study of bias and variance. Then compare the predicted lower bound against the observed growth on at least one real task. This would turn the theory into a planning tool.
Compare your bound with the joint set sizes produced by a modern transductive method that is more nuanced than Bonferroni, across a range of n and calibration sizes. Highlight the regimes where the bound is close to achievable and where there is a big gap. |
Fully AI-generated |
|
Fundamental bounds on efficiency-confidence trade-off for transductive conformal prediction |
Soundness: 3: good
Presentation: 1: poor
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
In the standard setup for conformal prediction, we guarantee coverage for each test point separately. This paper addresses the problem of joint prediction sets, where the guarantee is on joint coverage: all labels within a batch must fall in their prediction sets for the event to count as coverage. The authors pose two questions: one on the information-theoretic bound on the trade-off between set size and joint coverage, and the other on how we can find an optimal way to leverage the entire labeled set (including the training set) for conformal prediction without sacrificing the coverage guarantee (conventionally, this guarantee breaks if we introduce training points into the process).
The authors provide lower bounds for this setup, showing that as the number of points over which the decision is jointly made increases, either the set size grows drastically or the guarantee vanishes to zero.
Furthermore, they discuss the specific case where the joint prediction is over a single label, i.e., the case where a single data point is examined several times for robustness.
*Disclaimer.* I should note that due to the presentation issues I mention in the weaknesses, I could not fully follow the paper.
1. The targeted problem is very interesting (although its naming overlaps with full conformal prediction). Being able to provide a joint guarantee better than the bounds from the Bonferroni correction is practically useful.
2. The theory of the paper is concretely written, and (to the best of my understanding while reading the paper) all theorems are proved solidly.
3. The connection of this work to robustness (which should not be mistaken for adversarial or probabilistic robustness) is very interesting.
**Introduction is not easy to read.** The introduction is written with solid mathematical notation and there are no issues with that, but it is not friendly to a non-expert reader. I would suggest elaborating more on the comparison of inductive and transductive conformal prediction, for example, and offering a brief one-sentence explanation of what is meant whenever a new concept is introduced. One really helpful way to improve readability is to directly say that "by transductive conformal inference on a batch of data points we are interested in the probability that all labels are within all prediction sets". I know that formally this can be inferred from the text, but explicitly saying it in the introduction speeds up reading considerably.
The authors introduce two interesting questions in Section 2, but I cannot see any trace of those questions (specifically the second one) in the introduction.
Even the setup with all-equal labels is not presented as such in the introduction. It would be better if the authors mentioned the robust prediction application when introducing this setup for the first time.
**General readability.** The paper (due to the subject) is already not easy to read, and the authors sometimes introduce new notation without defining it, for instance $P_e$ in line 48. The authors also do not provide any intuitive message for the theorems they prove (for example, in Theorem 3.1 I cannot parse what the theorem is trying to say about the joint probabilities).
**Limited Experimental Setup.** The paper only reports numerical results on the MNIST dataset. I do not count this as a negative point in my score, as this is a theory paper. The question remains how the results could be expressed as a lower bound for any dataset. If the authors provide a clear algorithmic approach to deriving the bounds for any dataset, I would increase my score.
1. In line 54, is the term $P(Y_{m+1}^{m+n} \mid X_{m+1}^{m+n})$ equal to the product of all conditional probabilities from $m+1$ to $m+n$? If so, can you elaborate on why it is not a function of the number of elements? From reading your proof I assume this is because the value $\alpha$ already encodes the number of elements, but I am not sure why.
2. Is the entropy mentioned in Theorem 3.2 written in terms of the true confidence? Are there any bounds on the accuracy of the model?
3. What is $\delta$ in Theorem 3.4? Also, is there any intuitive understanding of the other variables $\sigma$ and $\rho$? What do they encode?
4. How can your results be expressed as a guaranteed upper bound on the joint guarantee for any dataset? Is this bound affected by the number of classes or the quality of the model? Or is there any need to exactly specify the ground truth $p(y\mid x)$ to derive these bounds?
Fully human-written |
|
Fundamental bounds on efficiency-confidence trade-off for transductive conformal prediction |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This submission is essentially split into two separate parts: first, it provides bounds on the set size for "transductive" CP by assuming a known data distribution; and then it provides a set-based extension of a known result by Gutman on a special case of the problem.
The first part builds on standard information-theoretic tools, while the second leverages standard results on binary hypothesis testing with training data.
Experimental results are provided for the first part only.
The two parts of the submission, while limited on their own, offer a useful reference for researchers interested in "transductive" CP (defined here as the problem of producing a set prediction for a batch of test samples).
The paper is clearly and formally written, and useful pointers are provided to the literature.
The analysis in Section 3 essentially assumes knowledge of the data distribution. While it is true that this assumption yields upper bounds on the true performance of transductive CP, it is also the case that the assumption appears to completely hide the role of the calibration data size m.
Furthermore, the results of the analysis appear to be rather expected and limited in scope. The typical set of a sequence of i.i.d. variables grows exponentially with the entropy, and so must also any prediction set with non-vanishing coverage. The result in Theorem 3.4 is also a refinement of the same idea.
The setting studied in Section 4 is a direct extension of Gutman's work on binary hypothesis testing with training data. In fact, most of the section is devoted to reviewing existing results, and the new contribution follows directly by reframing the problem as one of set prediction.
No experimental results are provided for the setting studied in Section 4.
1) How can the analysis in Section 3 be extended to provide insights on the role of the size of the calibration dataset?
2) Can Section 4 be rewritten to address directly the case with any number of hypotheses?
3) If so, can Theorem 5 be simplified to provide clearer insights into the average prediction set size?
4) How do the results in Section 4 connect to the theorem in Section 3?
5) Can experimental results be provided to relate the material in Section 3 and Section 4? |
Fully human-written |
|
Fundamental bounds on efficiency-confidence trade-off for transductive conformal prediction |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This is a theoretical paper that characterizes the trade off between coverage and set size of transductive conformal prediction in the classification setting. In transductive conformal prediction, the goal is to construct prediction sets for $n$ test points such that all test labels fall within the joint prediction set with probability at least $1-\alpha$. The paper proves both asymptotic limits and non-asymptotic bounds for the “efficiency rate”, a normalized measure of the size of the transductive prediction set.
The paper makes some interesting connections between conformal prediction and information theory and applies some interesting tools.
As someone who does not have a strong background in information theory or familiarity with Gutman’s test, I found it hard to follow section 4.
1. Theorem 3.1: Can you add the interpretation of this result in words? This applies to all theorem statements.
2. Do you view this work as more than simply “theory for the sake of theory”? Can this theory eventually lead to work that will inform practice?
3. On line 53, you write “When all test points share the same label, a scenario relevant to safety-critical applications…” — what is an example?
Typos/stylistic comments:
* There are multiple places where \citep is used where \citet should be used instead
* I would mention somewhere that what you call “confidence” is commonly called “coverage” in the conformal prediction literature
* Line 53: “same label” -> “same unknown label”
* Line 143: Capitalize “in”
* Line 174-175: asymptotically is used twice in this sentence
* Line 175: “In case” -> “In the case”
* Line 245: remove “setup”
* Line 249-250: I would replace the “=“ with “:=“
* Line 281: “prediction sets a single” -> “prediction sets with a single”
* Line 283: math mode error
* Theorem 4.4: “M class”? Should this be “M-class setting”? |
Fully human-written |
|
Nested Hash Layer: A Plug-and-play Module for Multiple-length Hash Code Learning |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper studies supervised hash code generation, with the idea of generating hash codes of multiple lengths in a single model. The proposed approach makes it unnecessary to train separate models for different hash code lengths. Two additional techniques, Dominance-Aware Dynamic Weighting and Long-short Cascade Self-distillation, are adopted to address conflicts among training objectives and to improve the performance of short codes.
1. It is interesting to generate hash codes with multiple lengths in a single model.
2. The proposed method shows good performance, as demonstrated in the experiments.
3. Dominance-Aware Dynamic Weighting and Long-short Cascade Self-distillation are well motivated and reasonable.
1. The "PLUG-AND-PLAY MODULE" is overclaimed as the proposed module still need to trained with loss functions.
2. The proposed method does not address a more important issue, i.e., how to seek optimal code lengths for different tasks.
3. It might be necessary to compare against hash code expansion and compression methods, as they also generate hash codes with different lengths.
4. The efficiency comparison is not fair, i.e., it compares the time to train one NHL model against the time to train five separate models.
5. The cascade self-distillation adopts a well-studied distillation strategy; thus it is not novel and does not show a significant performance enhancement, as shown in Table 2.
1. the "plug-and play" claim should be justified.
2. Could the proposed method be applied to seek optimal code lengths for different tasks? discussions can be added.
3. Please provide discussion or detailed comparison against code compression methods.
4. The efficiency advantages might be over-claimed. |
Fully human-written |
|
Nested Hash Layer: A Plug-and-play Module for Multiple-length Hash Code Learning |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper presents a plug-and-play module for multi-length hash code learning. The proposed Nested Hash Layer (NHL) projects the DNN features into one long-length representation and builds multiple hash codes from it. To address the gradient conflict issue, hand-crafted weights are calculated to adjust the gradient direction. In addition, the long hash codes are leveraged to guide short hash code learning through gradient stopping. The proposed NHL improves training speed and can be applied to various deep hashing approaches. Performance boosts are observed on CIFAR-10, ImageNet, and MSCOCO.
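Purely to illustrate what such a nested multi-length head could look like, here is a minimal sketch under the assumption that shorter codes are prefixes of the longest projection (my own illustration; the paper's actual construction may differ):

```python
import torch
import torch.nn as nn

class NestedHashHead(nn.Module):
    """Hypothetical nested hash head: one projection to the longest code length,
    with shorter codes taken as prefixes before sign() binarization."""
    def __init__(self, in_dim, lengths=(16, 32, 64, 128)):
        super().__init__()
        self.lengths = sorted(lengths)
        self.proj = nn.Linear(in_dim, self.lengths[-1])

    def forward(self, features):
        logits = self.proj(features)                       # (batch, max_length)
        return {k: torch.sign(logits[:, :k]) for k in self.lengths}
```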
1. Multi-length feature learning is an interesting topic, especially for image hashing.
2. The idea of nested hash codes is reasonable, and the gradient constraint is deliberately designed according to the principle of short-hash alignment.
3. Plugging NHL into various deep hashing methods yields performance gains.
4. The paper is well-organized and easy to follow.
1. My biggest concern with this work is the technical contribution, which is limited. The design of the gradient weighting is very intuitive yet hard to interpret. Even though the probability of anti-domination is high, it is hard to judge whether the influence of the hand-crafted weights in Eq. (7) is positive or negative. The self-distillation loss should also be discussed: the reason for only applying distillation between adjacent code lengths (from k+1 to k) is not clear. Some ideas are very similar to previous work such as MRL, and the overall formulation is not elegant.
2. The dimension curse (sharp performance decrease) is not a common issue, since it only occurs for DSH; there is no such issue for the other methods.
3. There should be some investigation or analysis of the hash code distribution once the multi-length hashing has been well learned.
4. The evaluation should be conducted on larger datasets and stronger architectures (e.g., ViT) to demonstrate generalization ability, instead of only evaluating on old-school baselines.
Please see the weaknesses. |
Fully human-written |
|
Nested Hash Layer: A Plug-and-play Module for Multiple-length Hash Code Learning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses limitations of traditional deep supervised hashing models in large-scale image retrieval, which only generate single-length hash codes: this creates an efficiency-effectiveness trade-off, requires training multiple models to find an optimal length, and ignores relationships between different-length codes. It proposes the Nested Hash Layer (NHL), a plug-and-play module that generates multiple-length hash codes simultaneously in a nested structure. To tackle optimization conflicts from multi-objective learning, the paper introduces a dominance-aware dynamic weighting strategy for gradient adjustment; it also proposes a long-short cascade self-distillation method, where long hash codes guide shorter ones to improve overall code quality.
The paper identifies a practical limitation in traditional deep supervised hashing—i.e., the inefficiency of training multiple single-length models to find an optimal hash code length—and targets it with a plug-and-play module (NHL), which aligns with the need for flexible, low-overhead solutions in large-scale image retrieval. The proposed long-short cascade self-distillation also addresses the understudied relationship between different-length hash codes, and the reported training speedup (5–8x) and performance gain (3.4%) suggest potential practical utility.
The abstract provides no details on how the nested structure of NHL generates multiple-length codes or how the dominance-aware dynamic weighting strategy adjusts gradients. This lack of technical transparency makes the method unreproducible and unconvincing.
The core idea of multi-length hash code learning is not novel, and the paper fails to articulate how NHL advances beyond prior efforts. The 3.4% average performance gain is also modest and unsupported by analysis of when/why NHL outperforms existing solutions.
The paper does not discuss NHL’s drawbacks—e.g., whether the nested structure introduces computational overhead at inference time, how it handles extremely long/short code lengths, or its robustness to noisy data. This one-sided presentation lacks scientific objectivity.
How exactly are shorter hash codes derived from longer ones in NHL’s nested structure (e.g., truncation, learned sub-structures)?
Which specific baseline hashing models were used to compare NHL’s 5–8x training speedup?
What standard datasets were tested to measure NHL’s 3.4% performance gain? |
Heavily AI-edited |
|
Nested Hash Layer: A Plug-and-play Module for Multiple-length Hash Code Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes NHL, a replacement for the traditional hash layer that produces multiple code lengths in a single model via a nested structure. To mitigate training conflicts among objectives of different lengths, it introduces a dominance-aware dynamic weighting strategy. To transfer information from longer to shorter codes, it employs a long-short cascade self-distillation scheme. Empirically, across CIFAR-10, ImageNet100, and MSCOCO, the paper indicates ~5-8× training speed-ups while maintaining or improving retrieval accuracy vs. the base models trained per length. The method is designed in a plug-and-play manner, demonstrating compatibility with multiple hashing backbones. The training protocol monitors each per-length objective and saves parameters when each L_k reaches its minimum, which the authors argue contributes to stability and efficiency.
- NHL is plug-and-play and directly replaces the traditional hash layer, enabling multi-length code generation in one model without redesigning the backbone.
- The paper formalises domination gradients over nested parameters and provides a closed-form dynamic weighting (Eqs. 6–7) to keep shorter-length objectives from being overwhelmed. That is to say, Eqs. (6)–(7) act as an analytical conflict regulator across multi-length objectives. They detect when gradients from different hash lengths start to point in opposite directions (anti-domination) and dynamically rescale the offending loss so that the overall update remains aligned with the dominant, consensus direction. When gradients already agree, the weights stay at 1 to ensure no unnecessary damping. The beauty lies in their closed-form efficiency: the adjustment is computed directly from inner products and norms of gradients at the hash layer, requiring no iterative optimization or tuning. Conceptually, it's like an automatic "traffic controller" that keeps shorter and longer code objectives from interfering, while maintaining stability and efficiency (see the illustrative sketch after this list). This balance of mathematical rigor, interpretability, and negligible overhead makes Eqs. (6)–(7) one of the most technically elegant parts of the paper.
- The paper reports ~5–8× speedups with average accuracy improvements across multiple models/datasets. This supports the claim that NHL improves both efficiency and effectiveness within a single training run.
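A rough, simplified rendering of that conflict-regulation idea in code (my own illustration; the paper's actual Eqs. (6)–(7) define a specific closed form that this sketch does not reproduce):

```python
import torch

def conflict_aware_weights(losses, shared_params, eps=1e-12):
    """Compute per-objective weights on shared (hash-layer) parameters:
    objectives whose gradient opposes the summed direction are damped."""
    flat_grads = []
    for loss in losses:
        grads = torch.autograd.grad(loss, shared_params, retain_graph=True)
        flat_grads.append(torch.cat([g.reshape(-1) for g in grads]))
    total = torch.stack(flat_grads).sum(dim=0)
    weights = []
    for g in flat_grads:
        cos = torch.dot(g, total) / (g.norm() * total.norm() + eps)
        # aligned gradients keep weight 1; anti-aligned ones are scaled toward 0
        weights.append(1.0 if cos >= 0 else float(torch.clamp(1.0 + cos, min=0.0)))
    return weights
```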
- The dynamic re-weighting is explicitly computed only on NHL parameters, not the full network. The paper states "we don't consider the full network weights and focus on the parameter in NHL." Consequently, while the results show consistent improvements across architectures and datasets, suggesting no practical instability upstream, the analysis does not report backbone-level gradient diagnostics. Any claim of "resolving cross-length interference" should be scoped to the hash layer or be supported by backbone-level checks (e.g., cosine similarity between $\nabla_{\theta_F} L_k$ for different lengths, or an ablation that extends the weighting to $\theta_F$ and measures the incremental benefit).
The reason is that multi-objective interference often arises throughout the network. By restricting the weighting to NHL, one cannot rule out residual clashes in earlier layers. The paper demonstrates strong end-to-end performance, which is good, but it doesn't isolate or measure whether backbone gradients still conflict. This limits how broadly the reader can interpret "conflict mitigation". That is to say, it is proven at the hash layer and suggested empirically for the full model, but not causally pinned down in the backbone.
---
- The derivation provides closed-form expressions and notes computational complexity and ~11.15% per-step overhead, but there is no ablation that compares this dominance-aware rule against simpler baselines (e.g., static per-length weights, uncertainty weighting) in otherwise identical settings to causally attribute the gains to the proposed weighting (as opposed to, say, self-distillation or nested design alone). The paper’s math and presentation are clear, but component-wise attribution is underdeveloped.
In simple terms, the paper does include ablation variants such as NHL-basic, NHL w/o D, NHL w/o L, and Full NHL, showing that the full version performs best and that the dominance-aware weighting contributes most. However, these ablations are aggregated averages across datasets and bit lengths, and they lack a controlled comparison against simpler weighting strategies (e.g., fixed equal weights, uncertainty weighting, or GradNorm) under identical architectures. Moreover, there is no per-bit or per-dataset breakdown showing how much each component contributes at 16, 32, 64, or 128 bits.
Hence, while the evidence suggests the proposed weighting helps, it doesn’t causally isolate Eq. (6)–(7)’s effect from the influence of other design factors (nested structure, self-distillation, or checkpointing trick). In other words, I know the entire system works, but not precisely why or how much each part matters.
A more rigorous attribution would involve:
(a) Controlled substitution tests: replacing Eq. (6)–(7) with a static weighting scheme or another known multi-objective method (e.g., GradNorm) while holding all other parts constant; and
(b) Per-length diagnostic tables: showing performance gain by bit length (e.g., 16 / 32 / 64 / 128 bits) for each variant, to see whether dynamic weighting primarily benefits shorter codes or improves all lengths uniformly.
Adding such analyses would convert the current descriptive ablation into a causal attribution study, clearly demonstrating that Eq. (6)–(7), and not auxiliary mechanisms, drives the reported gains.
---
- The proposed NHL is evaluated entirely within symmetric deep supervised hashing settings, where query and database encoders share parameters. Could the authors clarify whether NHL can be extended to asymmetric retrieval frameworks, where query and database encoders differ or where database codes are optimised separately? If such an extension is not straightforward, it would be helpful to explicitly state this boundary in the paper, since the current framing as a “plug-and-play universal module” could be interpreted as supporting a broader range of hashing paradigms than those tested.
Please see above. |
Fully AI-generated |
|
Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The work proposes an inference-time scaling method that integrates two key components: 1) Dynamic MoE, which allows direct adjustment of the number of active experts during inference, and 2) Expert Configuration Inheritance, which maintains consistent expert usage within a reasoning trajectory while permitting variation across different runs. The paper systematically studies how the number of activated experts at test time influences the model’s reasoning behavior, and finds that varying the number of activated experts produces complementary solution sets, offering a new source of diversity in addition to output sampling.
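To make the expert-count knob concrete, here is a minimal, generic PyTorch-style sketch of a top-k router in which k is a runtime argument rather than a fixed architectural constant (an illustration of the idea only, not the authors' implementation or any specific model's actual routing code; all names are mine):

```python
import torch
import torch.nn as nn

class TopKRouter(nn.Module):
    """Generic MoE router sketch: k (number of active experts) is passed
    at call time, which is the knob Dynamic MoE exposes at inference."""
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor, k: int):
        logits = self.gate(x)                       # (num_tokens, num_experts)
        topk_vals, topk_idx = logits.topk(k, dim=-1)
        weights = torch.softmax(topk_vals, dim=-1)  # renormalize over the k chosen experts
        return weights, topk_idx                    # used to combine the k expert outputs

# A DES-style search would roll out trajectories under several k values
# (e.g., k in {4, 6, 8, 10}), keep k fixed along each trajectory, and let
# a verifier score the resulting candidates.
```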
- The paper studies a new perspective of test-time search specifically for MoE models, which would have a large audience given people's interests in MoE LLMs.
- The paper is well written.
- As shown in Table 1, the proposed method doesn’t show significant improvement compared to the best-of-N strategy.
- The work explores different expert counts at inference time. However, the number of activated experts is typically fixed during training, so varying it introduces a parameter distribution shift that does not align with the training behavior. The paper does not address this concern through qualitative case studies, quantitative experiments, or theoretical proof.
- This inference-time scaling method is not applicable to closed-source models.
The paper mentions "DES enhances accuracy and stability without additional cost." How do we prove its stability compared to other baselines? |
Fully human-written |
|
Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes Dynamic Experts Search (DES), a test-time scaling strategy tailored for Mixture-of-Experts (MoE) LLMs. The key idea is to treat the number of activated experts k at inference as a controllable search dimension, rather than a fixed architectural constant. DES has two components: (1) Dynamic MoE, which exposes k as a knob during generation to induce structural diversity in reasoning trajectories; and (2) Expert Configuration Inheritance, which keeps k consistent along a single trajectory but varies it across trajectories so the search can both maintain stability within a path and explore diverse configurations across paths.
The empirical picture is fairly comprehensive: DES is evaluated across several MoE models (Qwen3-30B-A3B, Ling-lite-1.5, OLMoE-1B-7B-Instruct, DeepSeek-V2-Lite-Chat), two verifiers (Qwen2.5-Math-PRM-7B and Llama3.1-8B-PRM-Deepseek-Data), and multiple benchmarks in math (MATH500, AIME24/25, GSM8K, SVAMP), code (HumanEval, LiveCodeBench), and knowledge (LiveBench reasoning). Against Best-of-N, Beam Search, and DVTS, DES consistently improves accuracy and precision at similar reported cost. Ablations suggest that (i) exploring multiple k values increases the chance of hitting a configuration that yields a correct solution, and (ii) inheriting k along a trajectory filters out unpromising configurations as the verifier prunes candidates, increasing pass@N and final accuracy. The paper also argues that DES does not simply “use more experts,” showing the average activated experts during search does not exceed the default.
Overall, the paper pushes a simple but compelling idea: leverage the latent structural flexibility of MoE at test time to unlock complementary solution sets that temperature-based sampling alone does not reach.
- The central idea is fresh and timely: moving beyond output-level diversity (temperature/top-p) to architecture-aware diversity by controlling MoE’s activated expert count. It’s a natural fit for the growing prevalence of MoE models.
- Methodologically simple and easy to reason about: treating k as a test-time knob plus a straightforward inheritance rule that gives implicit configuration-level credit assignment along the search.
- Empirical breadth and consistency: many models, two verifiers, and diverse benchmarks with sensible, consistent improvements over multiple baselines. The gains are not tied to one model family or evaluation domain.
- Thoughtful ablations: the paper separately probes the contributions of exploring different k and inheriting k; the violin plots and step-wise views help argue that improvements are not a byproduct of trivially activating more experts.
- Practical upside: if exposing k can be standardized in inference stacks (e.g., vLLM), DES could be a “drop-in” test-time enhancement that avoids retraining and scales with budget.
- Cost accounting is under-specified for MoE: the paper mainly uses generated tokens as a proxy for computation, but per-token cost scales with k in MoE. Even if average k doesn’t exceed the default, small shifts matter. A fairer comparison would report FLOPs or at least “token-count × average-k” (and preferably align average k across methods or normalize results by FLOPs).
- Reproducibility and implementation detail gaps: changing top-k routing at inference isn’t universally exposed; capacity factors, load-balancing, and token dropping can complicate things. The paper should document how k is controlled in practice (per-layer or global, capacity handling, any router noise) and quantify the throughput/latency/memory impact of varying k.
- Evidence on “complementary solution sets” could be tightened: the Jaccard analysis is persuasive, but I would like to see stronger statistics across seeds/difficulty strata and a low-temperature or fixed-seed setting isolating structural (k) diversity from sampling diversity.
- Baselines could be stronger on diversity: comparisons to more diversity-focused search variants (temperature schedules, top-p sweeps per branch, stochastic beam, nucleus-beam hybrids, or recent foresight/token-level diversity methods) would better establish DES’s advantage.
- Decision-rule inconsistency: the text alternates between PRM-best and majority vote for final selection. The main results should lock in one rule (and report the other as supplemental), along with an analysis of where the two diverge.
- Sensitivity analysis: results depend on the initial k set, T, M, and temperature. A more systematic study with CIs would help practitioners know how to set these knobs and how robust the gains are to reasonable changes.
- PRM reliance and potential leakage: the paper should explicitly discuss possible overlaps between PRM training data and the eval sets, and include a sanity check with an alternate verifier or a pure answer-checking setup to show robustness.
- Minor editorial issues: a duplicated paragraph in the appendix on math answer extraction; one appendix table labels a verifier as a “policy model”; and some notation (e.g., Top_M) could be made more precise. These are easy fixes but worth cleaning up.
- The idea is interesting and promising, but the implementation and cost accounting need to be clearer. For reproducibility and fair comparison, I’m especially looking for the following.
1) k implementation and control
- How exactly do you expose and tune the MoE router’s top‑k in your inference stack (vLLM/Transformers)? Is k global across all MoE layers or configurable per layer? When k changes, do you adjust capacity factor, token dropping, or router noise to keep routing stable?
- Does changing k require edits to model weights or router kernels? Please provide the minimal diffs/flags for reproducibility.
2) Compute cost accounting
- Since per-token FLOPs scale with k in MoE, token count alone isn’t sufficient. Please report FLOPs or a reliable surrogate (e.g., tokens × average-k × number of MoE layers; see the sketch after this item), and plot accuracy vs. FLOPs under cost normalization.
- Provide throughput/latency/memory as a function of k (tokens/s, per‑sample latency, peak VRAM), and show how DES affects system efficiency across context lengths.
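Regarding point 2), a minimal sketch of the kind of surrogate I have in mind (the formula and all constants below are illustrative assumptions, not values from the paper):

```python
def moe_cost_surrogate(num_tokens: int, avg_k: float, num_moe_layers: int,
                       d_model: int, d_ff: int) -> float:
    """Rough per-sample compute surrogate for an MoE decoder: every token
    activates avg_k expert FFNs in each MoE layer, and each expert FFN costs
    roughly 2 * d_model * d_ff multiply-accumulates per token. Attention and
    shared dense layers are omitted, so this is only a relative measure."""
    return num_tokens * avg_k * num_moe_layers * 2 * d_model * d_ff

# Hypothetical comparison at an equal token budget:
cost_des = moe_cost_surrogate(40_000, avg_k=7.2, num_moe_layers=48, d_model=2048, d_ff=768)
cost_bon = moe_cost_surrogate(40_000, avg_k=8.0, num_moe_layers=48, d_model=2048, d_ff=768)
print(cost_des / cost_bon)  # < 1 would mean DES is cheaper despite equal token counts
```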
3) Capacity and routing behavior
- As k increases, do you see more capacity overflow or token dropping? What capacity factor and load‑balancing settings do you use? Please report drop rates vs. k and any mitigation.
- Do you add router noise at inference? If so, how does it interact with k and affect reproducibility?
4) Decision rule consistency
- The paper alternates between majority voting and PRM‑best. Which rule is primary for the main results? Please add an agreement analysis when they diverge, with an error taxonomy and sensitivity to budget N.
5) Structural vs. sampling diversity
- The Jaccard analysis is compelling. Can you isolate structural (k) diversity from sampling diversity by fixing seeds or using very low temperatures while varying only k? Please report statistics across multiple seeds and difficulty strata, with significance tests (a minimal sketch of the overlap metric is given after this item).
- Include per‑problem case studies where changing k flips failure to success, and describe the associated routing/load changes.
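For point 5), the overlap statistic can be computed directly from the sets of problems solved under each expert-count setting; a minimal sketch (the problem IDs below are made up):

```python
def jaccard(solved_a: set, solved_b: set) -> float:
    """Jaccard similarity between the sets of problem IDs solved under two
    expert-count settings; low values support the claim of complementary
    solution sets beyond what sampling diversity alone provides."""
    union = solved_a | solved_b
    return len(solved_a & solved_b) / len(union) if union else 1.0

solved_k6 = {1, 4, 7, 9}    # e.g., problems solved with k = 6 at a fixed seed
solved_k8 = {1, 2, 7, 11}   # problems solved with k = 8 under the same seed
print(jaccard(solved_k6, solved_k8))  # ~0.33 here
```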
6) Diversity‑oriented baselines
- To strengthen claims over diversity‑aware methods, please add or clarify: stochastic beam variants, temperature/top‑p schedules per subtree, nucleus‑beam hybrids, token‑level diversity (e.g., Phi‑decoding), and stronger DVTS configurations. Match budgets under a common cost normalization.
7) Hyperparameter sensitivity and robustness
- Run systematic sweeps (with CIs) over the initial k set (range and granularity), T, M, and temperature. How sensitive are gains to these knobs? What happens as n (number of initial k values) is reduced?
- Report step‑wise average k and its variance under DES, and verify the claim that “average k does not exceed default” across datasets and budgets.
8) PRM dependence and potential leakage
- Audit for overlap between PRM training data and evaluation sets (MATH/AIME, GSM8K, LiveBench, etc.). If overlap exists, how do results change after filtering?
- Add robustness tests with an alternative verifier and a pure answer‑checker (no PRM) to assess DES’s reliance on PRM scoring quality.
9) When and why DES helps
- Can you correlate problem characteristics with preferred k (e.g., routing entropy, gate‑margin distributions, reasoning length/depth)? Any evidence that certain layers benefit more from larger/smaller k?
- Does per‑layer heterogeneous k (or a schedule that adapts k across steps) further improve performance, and is it stable?
10) Fairness of “thinking mode” comparisons
- Are comparisons equalized by FLOPs or only by tokens? Please report accuracy vs. FLOPs and pass@N, and time‑to‑first‑correct where applicable. Clarify stopping criteria and any length penalties.
11) Reproducibility details
- Will you release code/configs to toggle k at inference? Please provide seeds, prompts, router settings, vLLM flags, and any CUDA/kernel constraints. Also fix minor editorial issues (duplicated math answer‑extraction paragraph, verifier mislabeled as “policy model,” precise definition of Top_M in Eq. (4)).
12) Beyond MoE
- Do you see analogous “structural knobs” for dense models (e.g., dynamic depth/width, selective channels in grouped‑FFN) where the DES paradigm transfers? Any preliminary evidence?
- Overall, I like the motivation and direction. If you firm up the implementation and cost details above, the conclusions will be stronger and the work easier to reproduce and extend. |
Fully AI-generated |
|
Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work introduced a test-time scaling method for reasoning LLMs with the Mixture-of-expert architecture.
Different from traditional strategies that focuses on generating different reasoning traces, the authors proposed to unlock more flexibilities of the reasoning model by considering different configurations of the number of experts activated in the reasoning process. This idea is well motivated in Figure 1 - the correlation plots showed that activating different experts can significantly impact the type of questions the model can correctly answer.
In experiments, the authors explore a simplified strategy to demonstrate the effectiveness of the method. Under a given computation budget, the model starts from a uniformly distributed set of expert counts; for each task, the policy model helps search for the optimal expert count, and each reasoning trace uses a fixed number of experts.
Experiments show that the proposed method has pros and cons. With a low computation budget, the method performs worse than standard beam search. With a higher computation budget, the proposed method achieves better performance than baselines because each number-of-experts setup can be explored more thoroughly.
# Very Interesting Topic
- The idea of exploring different expert setups is very interesting and also well motivated by experiment results.
- The experiment setup is well defined and comparisons are pretty fair.
- Experiments show that under certain conditions the proposed method can outperform standard beam search and selected baselines
- The ablation studies provide helpful insights about the proposed method
# Claims / clarifications
- Section 4.2.1 claims that the proposed method "constantly outperform baselines on both accuracy and precision", but this claim is only true when the model is provided with a massive computation budget.
- When compared to thinking mode, it's unclear which sampling method is used, and there are no experimental results on DES + thinking
# Effectiveness
- I think Figure 5 is the core observation of this work - when there is enough computation budget to explore different expert-count setups, the model achieves better performance. As a result, I think the proposed method relies heavily on larger budgets and won't be effective in applications that require efficient inference.
# Utilization / scaling [not weakness, just my opinion]
- I feel that as model size increases, the observation that different expert setups strongly differentiate which questions are answerable will become weaker. As a result, the proposed method might not scale to bigger models, and in my opinion the authors should highlight in the limitations that this method only works for smaller models and might not contribute to the R&D of larger models
- I'm not asking the authors to experiment with frontier-level open-source models
# Depth of the research
- I'm very excited to read the motivation and the problem statement, and I believe there are very interesting problems in this space. Fixing the number of experts across layers & tokens seems over-simplified compared to the motivation of this paper.
- I understand the search space for expert number shouldn't be num_layer * num_token * num_expert, but the authors could explain or cite related studies to give readers better intuition about how to narrow down the search space.
see weakness |
Fully human-written |
|
Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a new Test-Time Scaling (TTS) strategy that leverages the structural flexibility of Mixture-of-Experts (MoE) models to improve reasoning performance. The proposed method, Dynamic Experts Search (DES), dynamically adjusts the number of activated experts during inference and maintains consistent configurations along reasoning paths through an inheritance mechanism, thus balancing stability and diversity in exploration. Experiments across multiple MoE-based LLMs and reasoning benchmarks show that DES achieves higher accuracy and precision than prior TTS baselines without increasing too much computational cost, offering an architecture-aware perspective on improving LLM reasoning.
1. Paper is well written and easy to follow.
2. Substantial experimental results are presented to show the effectiveness and efficiency of the proposed method.
3. The proposed method is simple and novel.
1. Some technical details seem not fully logical. Please see my questions below.
2. The proposed method should be used with a trained reward model, which is not always available in all reasoning domains. This is an obvious limitation.
1. Why do you think using different numbers of experts at each step of generation would waste compute? It is true that if we keep k fixed for a complete trajectory generation, different k induces different results. However, it does not follow that, within a single trajectory, employing a different k at each step would lead to inferior results compared to an optimal fixed k. I feel the motivation for expert configuration inheritance should be strengthened.
2. What if now we are using DES to reason over some tasks without existing reward models? |
Fully human-written |
|
Style2Shape: Image Style Guided 3D Shape Material Generation |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
TLDR: A complex, three-stage framework generating PBR materials from a single image.
This paper proposes Style2Shape, which generates PBR materials for 3D models from a single reference image. Extending PSDR-Room (which retrieves procedural materials and optimizes parameters), this work adds texture generation via image editing models to capture fine-grained appearances. However, the approach assumes albedo and roughness/metallic are independent (problematic), relies on multiple external dependencies, and uses multiple ad-hoc components that create a fragile cascade.
- Tries to combine image-generated textures with procedural patterns.
- Transfer results look interesting.
- Overly complex and fragile cascade system with multiple ad-hoc components: a failure in any stage causes irreversible harm. Three sequential stages with heavy external dependencies (part segmentation, image editing model, RGB-X decomposition, procedural library, differentiable renderer).
- Problematic independence assumption: Stage 2 separately retrieves roughness/metallic and generates albedo, but in reality these are correlated. Inconsistency between these two components will make the material physically implausible. And to really judge whether the reflectance makes sense, we need to see a view-varying video.
- Assumes simplistic textures - the method may not generalize to other types of textures.
- "Editability" is claimed but not demonstrated - no actual editing examples are shown.
- The method has to find the best pose of the 3D shape so that the image can capture most of the parts.
1. How will this work inspire and stimulate future work in this direction?
Fully human-written |
|
Style2Shape: Image Style Guided 3D Shape Material Generation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
Style2Shape is a framework for generating PBR materials for 3D models conditioned on a reference image. It combines procedural materials with generative techniques to balance physical plausibility and visual fidelity. The pipeline has three stages: (1) structure-guided appearance transfer, (2) hybrid PBR material initialization, and (3) differentiable material optimization. The method is compared against two baselines, PSDR-Room and Material Palette, and reports superior performance.
1. Clear motivation. Procedural materials provide physical correctness but have limited expressivity, while generative textures are expressive but not guaranteed to be physically valid. The proposed hybrid approach is well motivated.
2. Empirical results indicate better visual fidelity than the baselines.
1. Several relevant texture generation methods are missing, for example TexGaussian and Meta3DGen.
2. The paper compares only to two procedural methods, omitting many generative approaches with the same problem setting, for example RomanTex, Hunyuan3D-2.0, TEXGen, SyncMVD, and Paint3D. Without these, the evaluation is not fully convincing.
3. The ablation focuses on incorporating generated textures, leaving other design choices under-explained. For example, how does the VGG term contribute in Exp(9)? How much does the Progressive Optimization Strategy help compared to direct optimization?
4. The paper claims preprocessing via SAMesh extends to objects beyond the six categories, but no substantive examples are provided, aside from the dress in Fig. 12 which lacks distinct material parts. In addition, semantic segmentation does not guarantee material homogeneity within segments, and it may not merge spatially separated regions of the same material. This limits generalization.
5. The method assumes an optimal viewpoint that covers all material regions. This can hold for simple objects such as tables and chairs, but often fails for complex geometry such as human characters and garments, which constrains applicability.
6. Section indices are inconsistent. There is no index before RELATED WORK, and MATERIAL GENERATION FOR 3D OBJECT is incorrectly labeled as 1.1.
TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting
Meta 3d texturegen: Fast and consistent texture generation for 3d objects
RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis
Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation
TEXGen: a Generative Diffusion Model for Mesh Textures
Text-guided texturing by synchronized multi-view diffusion.
Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models
How many objects were used for quantitative evaluation, and is this the same set used in the user study?
In Table 1, why does performance degrade for the bag category?
“For small material segments where texture details are less critical, we assign uniform materials without retrieval to maintain computational efficiency.” What uniform materials are assigned in practice, and how are their parameters chosen? |
Fully human-written |
|
Style2Shape: Image Style Guided 3D Shape Material Generation |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Style2Shape, a novel and comprehensive framework for generating editable, physically-based rendering (PBR) materials for a 3D model from a single reference image. The core contribution is a hybrid material representation that synergistically combines procedural materials and AI-generated textures. The authors argue that this approach leverages the strengths of both: procedural materials ensure physical correctness of reflectance properties (e.g., roughness, metallicity), while generative models capture arbitrary and complex visual appearances from the reference.
The central contribution—a learnable blend of procedural materials and generated textures—is a powerful and elegant idea. It effectively combines the strengths of both paradigms, achieving results that are simultaneously physically plausible and visually faithful to an arbitrary reference.
The three-stage pipeline is very well-designed. Stage 1 intelligently solves the critical domain gap problem. Stage 2 provides a strong initialization that is crucial for the success of the high-dimensional optimization. Stage 3 uses a progressive optimization strategy to robustly converge to a high-quality solution. This structured approach demonstrates a deep understanding of the problem's complexities.
A major strength is that the final output consists of standard, editable PBR texture maps. This makes the generated assets directly usable in modern 3D engine or rendering software.
My comments primarily echo and expand upon those points, framing them as areas for improvement.
The proposed "maximin" criterion for optimal viewpoint selection is logical, but it does not address scenarios where some material segments are inherently occluded from any single viewpoint (e.g., the inside of a cup, the underside of a complex chair). The paper does not specify how materials are generated for segments that are not visible from the chosen optimal view. This omission represents a potential failure point for complex geometries.
The success of Stage 1 heavily relies on a state-of-the-art, prompt-guided image editing model (referred to as "GPT-Image-1"). The paper would be strengthened by demonstrating that the framework is robust to the choice of the image model, for instance, by showing results with open-source alternatives like ControlNet or InstructPix2Pix.
The framework's material retrieval and subsequent optimization are contingent on the quality of the decomposed BRDF maps from Stage 2. The paper cites RGB-X, but in practice, single-image intrinsic decomposition models often struggle with complex, non-uniform lighting, strong cast shadows, or highly specular surfaces. The real-world performance of such models can be suboptimal, which could lead to inaccurate initialization of roughness and metallic properties, potentially hindering the optimization process or leading to physically incorrect results. The paper would benefit from a discussion on how errors from this decomposition stage propagate and how the framework might mitigate them.
Could the authors clarify the specific model used for the structure-guided image editing? Have they experimented with publicly available models (e.g., based on Stable Diffusion with ControlNet)? How sensitive are the final results to the quality of the image generated in this first stage?
The viewpoint selection method aims to maximize visibility, but for complex models, it's inevitable that some material segments will remain unseen from any single optimal viewpoint. How does the framework handle material generation for these occluded parts? Are they assigned a default material, do they inherit properties from adjacent visible segments, or is there another mechanism in place?
Given that models like RGB-X can produce artifacts or inaccurate maps under challenging lighting, how does the system perform when the initial Irgh and Imet maps are noisy or incorrect? Does the final differentiable optimization stage have the capacity to correct for a poor initialization stemming from decomposition errors, or does it typically converge to a suboptimal result? |
Fully AI-generated |
|
Style2Shape: Image Style Guided 3D Shape Material Generation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Style2Shape, a framework designed to generate materials from a single reference image. The pipeline comprises three stages: (1) aligning 3D geometry with the reference image, (2) initializing material parameters through a retrieval mechanism, and (3) refining procedural materials via differentiable rendering. The system is engineering-complete, and the presented results demonstrate the effectiveness.
However, the core modules and overall pipeline strongly overlap with MaPa (SIGGRAPH 2024): both use 2D diffusion generation as a bridge to produce aligned images, both rely on a procedural material library for retrieval, and both refine procedural parameters with differentiable rendering. The primary distinction lies in the input modality: Style2Shape accepts an image as input, while MaPa is conditioned on text. The authors do not clearly validate its unique advantages or contributions over MaPa.
While the system is well-engineered, the overlap in methodology limits its originality. I think the contribution may fall short of the innovation threshold expected at ICLR.
- A complete three-stage pipeline that jointly addresses multiple subproblems in alignment, initialization, and refinement
- Novelty
The paper has high overlap with MaPa and lacks direct comparison or justification of unique contributions.
- Clarity
The paper’s exposition could be improved. For instance, in line 43, it points out the drawbacks of procedural material generation, yet in line 46, it begins to praise its advantages, which creates a logical inconsistency and may confuse readers.
- Experiments
1. The paper proposes a three-stage pipeline, yet lacks corresponding ablation studies to verify the importance of each component. For example, there is no experiment showing the effect of misalignment when the input image is not properly aligned with the geometry.
2. In line 36, the authors mention the issue of specular ambiguity as a motivation for introducing procedural modeling, yet the presented results do not include materials with strong specular highlights, such as metals, so the performance of the proposed method in such cases is unclear.
3. The paper lacks comparison to several relevant methods, such as MaterialMVP (ICCV 2025), which also supports an image as the condition and generates object materials. A comparative analysis would be beneficial to highlight the strengths of the proposed formulation and make the paper more solid.
- Figure
Some of the text in the figures is low-resolution and difficult to read. The authors are encouraged to improve the rendering quality or resolution.
- Technical limitation
The paper attempts to use text prompts to generate seamless patterns. However, it cannot guarantee true seamlessness with only text-prompt constraints.
1. In the supplementary materials, is Style2Shape/images/bag-005/seamless_0 intended to be the output SVBRDF folder? If so, it is evident that the normal map in this directory is incorrect. Moreover, the result is not truly seamless, as I mentioned.
2. The normal maps depicted in Fig. 12 appear excessively smooth. It is unclear whether this is due to a downsampled resolution in the visualization or if the predicted normals themselves inherently lack high-frequency detail.
3. Most of the results are demonstrated on furniture models with relatively simple geometry. Is it possible to apply the method to complex models and other object types (like a metallic dragon)?
Lightly AI-edited |
|
Automating the Refinement of Reinforcement Learning Specifications |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper considers the problem of automatically refining logical specifications in order to help specification guided reinforcement learning algorithms. The main intuition is that when the specification is very coarse, these algorithms find it hard to learn effective policies. So they propose identifying problematic specifications and refining them to help the algorithms converge faster and also help with guided exploration.
This work uses the SpectRL specification logic which can be represented as a graph that captures different ways to satisfy the specification. They present several types of refinement procedures that modify the graph and these procedures consist of refining goal/target regions, or adding additional intermediate target regions.
In their experiments they show that their method greatly helps specification guided RL algorithms to learn effective policies in large gridworld environments as well as robotic manipulation task with obstacles.
This is one of the first few works to consider the problem of automatic refinement of RL specifications based on collected feedback from the training of policies. This is a fundamental issue because if the specification is too coarse grained then algorithms would find it hard to effectively explore the state space and learn good policies.
The problem they consider is studied in depth and many refinement techniques are proposed that are sound. The benchmarks they consider are also interesting. This work also opens up many interesting related directions that can be explored.
While the contributions of the paper are substantive, they could be presented better. The introduction could be expanded to give further intuition about the problem being solved. Specifically, the notion of an abstract graph is never introduced informally even though it is central to the paper. Perhaps it would be helpful to take the example in Figure 1 and present in some detail what the refinement procedures would produce and how they would make the learning task easier. Similarly, logical specifications for RL are also never introduced.
The related work section can also be organized better into paragraphs.
1. Are there possible failure modes where the refinement procedure would follow a wrong chain of refinements that make the learning task much harder? Perhaps this deserves a short discussion?
2. In the current algorithms, the different refinements are applied in a specific order. Do you imagine situations where this order can be detrimental?
3. Why or why not dynamically choose which refinement to apply at each step? |
Fully human-written |
|
Automating the Refinement of Reinforcement Learning Specifications |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
Specification guided reinforcement learning often fails when initial logical task descriptions and their labeling functions are too coarse. The paper proposes $\mathrm{AutoSpec}$, a framework that refines SpectRL specifications through an exploration driven loop and four refinement procedures, while guaranteeing that any trajectory that satisfies the refined specification also satisfies the original specification. The method integrates with existing algorithms and is demonstrated with $\mathrm{DIRL}$ and $\mathrm{LSTS}$, where refinements help recover learnability on tasks that were previously hard to solve.
- The paper addresses an important challenge in formulating specifications for RL. Automatically refining predicates and specifications provided to the agent is a promising direction, particularly because crafting appropriate predicates is difficult and loosely defined specifications can be hard to satisfy.
- The framework is integrated with established specification guided algorithms and the experiments illustrate how the refinements interact with the different exploration strategies of $\mathrm{DIRL}$ and $\mathrm{LSTS}$.
- The empirical scope is narrow. Only two domains are considered, n-Rooms and PandaGym, which limits the evidence for scalability and diversity of specifications. A broader study that samples many predicate regions or includes a less contrived multi-room world would strengthen the case.
- The specifications tested are quite limited (only 1 or 2 refinements needed per specification). A sample-driven testing approach (say, randomly chosen predicate regions) or a less contrived 100-room example would be more convincing of the scalability of the approach.
- I appreciate the carefully chosen experiments for an intuition of what is happening, but some further generalization studies would help (e.g. more than 2 refinements needed and whether $\mathrm{AutoSpec}$ covers the search space appropriately).
- When there are no successful samples, certain refinements cannot be computed, as observed for $\mathrm{LSTS}$ on the complex specification. The approach is sound but not complete and it may fail to find a refinement even if one exists.
- `AvoidRefine` only enlarges the avoid set or equivalently reduces the safe set, without permitting relaxations when the avoid region is overly conservative. Algorithm 3 defines the refined safe region by removing the convex hull of recent failure states, which can bias the learner away from potentially optimal paths if the initial avoid labeling is narrow or misaligned. This is acceptable in most situations, but the onus is on the user specifying the initial predicate regions to start with conservative definitions. A discussion of when to relax an avoid constraint would be valuable.
1. What is the computational overhead of $\mathrm{AutoSpec}$ in the reported settings, relative to running $\mathrm{DIRL}$ or $\mathrm{LSTS}$ alone? A wall clock comparison and a complexity view in terms of the number of edges and sampled trajectories per refinement would help readers assess practical costs.
2. How do the procedures behave when a specification needs several consecutive refinements? Is there an observed depth beyond which refinements fail to improve satisfaction probability or become unstable? |
Lightly AI-edited |
|
Automating the Refinement of Reinforcement Learning Specifications |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces AutoSpec, a framework designed to automatically refine logical specifications for reinforcement learning (RL) tasks. While logical specifications can guide RL agents towards complex goals, a common issue is that coarse-grained or under-specified definitions may prevent agents from learning useful policies.
The core idea of AutoSpec is to use an "exploration-guided strategy" to automatically search for a more detailed specification. This "refined" specification is stricter than the original but provides additional guidance to the RL algorithm, making the learning process easier. Crucially, the framework guarantees "soundness": any trajectory satisfying the new specification must also satisfy the original coarse one. Theoretical justifications are provided. Experiments demonstrate that agents using specifications refined by AutoSpec can solve more complex control tasks than before.
1. The issue of "specification too coarse to learn from" is a significant practical hurdle in specification-based RL. This paper tackles this problem directly, which is highly valuable.
2. The primary contribution of AutoSpec is its ability to refine specifications without human intervention. This greatly lowers the barrier to using logical specifications, which might otherwise require extensive manual tuning by domain experts. Also, theoretical justifications are provided to show the framework doesn't just modify specifications arbitrarily; it guarantees that the refined specification is a valid "subset" of the original. This is crucial.
3. The refinement procedures on abstract graphs are clear and intuitive, and the order of refinements is also reasonable. Overall, the algorithm is easy to follow.
1. I am not fully convinced by the need for the name *SpectRL* as a separate specification logic. To me, it is just a standard subset of Linear Temporal Logic (LTL), and clarifying this clearly in the paper would be sufficient. We do not really need separate notation.
2. The paper mentions that AutoSpec searches for a refinement in order. How large is this search space for each environment, and what is the overhead of this additional procedure? These are not mentioned in the paper.
3. As discussed in the paper, the reliability of the AutoSpec are heavily dependent on its base algorithm. This limitation is understandable, yet it would be better if additional discussion and proposals can be provided.
4. As I mentioned in weakness 1, this work only discusses a subset of LTL, while other works have already proposed some insights and solutions to the problem [1, 2]. Please consider adding a discussion of these works and giving your own insights (an illustrative formula for this fragment is given after this list).
5. This is minor: I noticed that two important concepts mentioned in the abstract are never mentioned again in the main text. Please explain what "under-specified" and "exploration-guided strategy" mean in the main text. I understand these summarize later components, but it would be better to make them explicit.
[1] Qiu et al. Instructing goal-conditioned reinforcement learning agents with temporal logic objectives. NeurIPS 2023.
[2] Jackermeier and Abate, DeepLTL: Learning to Efficiently Satisfy Complex LTL Specifications for Multi-Task RL. ICLR 2025.
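For concreteness, a typical sequential reach-avoid objective of the kind this fragment expresses can be written directly in standard LTL (my own illustrative formula, not one taken from the paper):

$$
\varphi \;=\; \mathbf{F}\big(\mathit{goal}_1 \wedge \mathbf{F}\,\mathit{goal}_2\big) \;\wedge\; \mathbf{G}\,\neg\mathit{obstacle},
$$

i.e., reach $\mathit{goal}_1$, afterwards reach $\mathit{goal}_2$, and always avoid $\mathit{obstacle}$. Since such specifications are plain LTL, the discussion in [1, 2] of LTL and $\omega$-regular objectives is directly relevant.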
1. See weaknesses 2, please explain the search space and overhead of AutoSpec.
2. See weaknesses 3, please explain the difficulty of refinement in AutoSpec.
3. See weaknesses 4, can you discuss [1] and [2] in the paper, and provide some insights on whether AutoSpec can be applied to $ \omega $-regular LTL specifications?
4. It seems that AutoSpec detects the failures of a policy and then performs refinement, which is good. However, is it possible to perform "active" refinement rather than "passive" refinement? This could be very interesting.
Fully human-written |
|
Automating the Refinement of Reinforcement Learning Specifications |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors propose a method for automatically refining specifications defined using the SpectRL framework and that are used for specification-guided reinforcement learning. Their technique produces a provable refinement of the original specification (a trace satisfying the refined specification implies that the trace also satisfies the original specification). By using the refined specifications to retrain the RL policies, the authors observe that the newly trained policies have higher specification satisfaction rates. In other words, better specifications result in better policies.
- Problem statement is relevant, interesting, and well defined.
- The authors do a good job of explaining the preliminary material needed to understand their work.
- The results show that their method significantly improves the performance of the re-trained policies to meet the original specification after refinement.
- Some of the presentation of the AutoSpec framework is lacking... specifically Figure 2 really doesn't clarify what the PastRefine refinement procedure does. What's the relationship between the two parts of the figure? It also leaves a ton of open whitespace, which looks sloppy.
- It would be nice to have visuals on how each of the refinement procedures is working, not just PastRefine.
- The experimentation and presentation of it is lacking significantly
- Only use 2 experimental setups (n-Rooms and PandaGym)
- There are 2 tunable hyperparameters (the probability threshold and the number of traces to sample) the details of which are never mentioned for their experiments.
- They never discuss the cost of doing the specification refinement (how long does it take? does retraining the policy take as long?) This would mean training 2x since we have to train all over again to integrate specification refinement.
- They never describe how many times the experiment was attempted. Did they train a bunch of different policies and try it multiple times? Or are the results they show just from training one policy for each of the specification-guided RL algorithms mentioned and trying their framework on it? I am assuming the latter, which is limited experimentation in my opinion.
- They don't compare to any other specification refinement or generation techniques for specification-guided RL.
- Figure 4 is a plot of "Best Path Cost" on the y-axis vs. "Number of Timesteps" on the x-axis. They never introduce what path cost is or what it means to have best path cost. How should these plots be interpreted?
- Figure 5 doesn't have any labels on the x and y axes, so it is unclear what results are being demonstrated.
Minor comments / typos:
- Missing space between guarantees and citation (Lechner et al.) on pg. 2, line 104
- I believe the "AddRefine: Introducing Waypoints." part should be given a new line in section 3.1. All of the other specification refinement procedures are given their own new lines when introduced.
The paper introduces a compelling strategy for improving specification-guided RL by refining the specifications, but it lacks strong experimentation to be convincing. While apparently theoretically sound, much more experimentation would be necessary: more experimental setups; trials on more specification-guided RL algorithms and multiple independently trained policies; a stronger ablation study than that shown in Section 4.3, again over more runs; experiments controlling the tunable hyperparameters; comparisons to other related methods; and better presentation of the results.
Please address and discuss weaknesses above. |
Fully human-written |
|
STDDN: A Physics-Guided Deep Learning Framework for Crowd Simulation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This work proposes STDDN, a physics-guided deep learning framework for crowd simulation. The authors introduce the continuity equation from fluid dynamics as a strong physical constraint and design a density-velocity coupled dynamic graph learning module. They show that STDDN achieves significantly superior simulation performance compared to SOTA methods.
1. The authors propose a network of time-space decoupled differential equations combined with the continuity equation, which helps the predicted trajectories respect macroscopic physical laws.
2. In the experiments, Tables 1 and 2 clearly illustrate the trajectories and verify the main results of the paper.
1. The proposed method uses Neural ODEs to solve $\rho$, but there are many similar ideas, and the use of Neural ODEs in trajectory prediction is also a very common approach.
2. The proposed method utilizes constraints based on the continuity equation, but the specific implementation of this constraint within the Neural ODE framework requires a more detailed explanation (the standard form of the constraint is sketched after this list).
3. The authors conducted many experiments, but it seems necessary to evaluate each subset of the dataset separately and to compare with newer baselines and methods. The current baselines only reach the year 2024, and given the trajectory datasets used by the authors, a large number of SOTA pedestrian trajectory prediction methods have not been compared.
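For reference, the standard closed-system continuity equation and one common way of imposing it as a soft penalty are (this is the textbook form plus my own notation for the penalty; the paper's actual implementation may differ, which is exactly what question 1 below asks about):

$$
\frac{\partial \rho}{\partial t} + \nabla\!\cdot\!(\rho\,\mathbf{u}) = 0,
\qquad
\mathcal{L}_{\text{mass}} \;=\; \Big\lVert \frac{\partial \hat\rho}{\partial t} + \nabla\!\cdot\!(\hat\rho\,\hat{\mathbf{u}}) \Big\rVert_2^2,
$$

where $\hat\rho$ and $\hat{\mathbf{u}}$ are the predicted density and velocity fields on the grid. Whether the constraint enters inside the ODE dynamics themselves or only as a loss term should be stated explicitly.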
1. How is the continuity equation incorporated into the Neural ODE solution process? It requires a more detailed explanation.
2. The detailed parameters used when solving the Neural ODE in torchdiffeq are not disclosed.
3. Figure 1 contains many typos.
⦁ For example, $Gin$($Gout$) should actually be $G_{in}$($G_{out}$).
⦁ The input to Microscopic seems to be $\rho^0$.
⦁ The depictions of DDM and CGD in the figure are also too simple.
4. Should the use of the loss function in Eq 10 be more explicit? Eq 8 does not seem to be included in it.
5. Are there more granular comparative tests, such as what the results were for ETH/HOTEL/ZARA1/ZARA2/UNIV, respectively?
6. Should ADE and FDE also be reported for general trajectories? |
Fully human-written |
|
STDDN: A Physics-Guided Deep Learning Framework for Crowd Simulation |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes STDDN, a novel framework for crowd simulation that addresses the common issues of error accumulation and physical inconsistency in long-term predictions. Its core contribution is the unique integration of a macroscopic physical law—the continuity equation from fluid dynamics—with a microscopic deep learning model for trajectory prediction. By using a Neural ODE to model crowd density evolution, STDDN enforces a strong physical constraint during training. Experiments show that STDDN not only achieves state-of-the-art accuracy but also significantly reduces inference latency compared to leading methods.
The paper's primary strength is its originality in creating a macro-micro coupled framework. Using the continuity equation to regularize trajectory prediction is a conceptually novel and powerful idea for this field. The quality of the work is good, supported by rigorous and comprehensive experiments that convincingly demonstrate superior performance in both accuracy and efficiency over strong baselines. The paper is also written with exceptional clarity.
- **Lack of Direct Physical Metrics**: The paper claims to improve physical realism by avoiding issues like congestion and collisions, but it fails to provide direct quantitative evidence. The evaluation relies on general error metrics (MAE/OT), which are insufficient proxies. The work would be much stronger if it included systematic measurements and comparisons of collision rates, obstacle penetration rates, or density extremum analysis to directly support its core claims (a minimal example of such a metric is sketched after this list).
- **Training Cost**: While the paper rightly emphasizes its fast inference speed, it completely neglects to discuss the training cost. The use of a Neural ODE likely makes the training process computationally expensive and slow. An additional analysis should be included in the paper.
- **A minor issue**: the table on page 8 has a wrong caption: "**Figure** 4".
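As a concrete example of the kind of direct physical metric requested above, a pairwise collision rate can be computed from the simulated trajectories in a few lines (a minimal sketch; the array layout and radius threshold are assumptions, not the paper's protocol):

```python
import numpy as np

def collision_rate(traj: np.ndarray, radius: float = 0.3) -> float:
    """traj: (T, N, 2) array of simulated agent positions over T frames.
    Returns the fraction of frames in which at least one pair of agents
    is closer than 2 * radius (an illustrative personal-space threshold)."""
    T, N, _ = traj.shape
    hits = 0
    for t in range(T):
        dists = np.linalg.norm(traj[t, :, None, :] - traj[t, None, :, :], axis=-1)
        np.fill_diagonal(dists, np.inf)          # ignore self-distances
        hits += int((dists < 2 * radius).any())
    return hits / T
```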
- The fluid dynamics assumption is a strong prior. Could you clarify the intended scope of your method? In which crowd scenarios (e.g., panic, counterflow) might this assumption become a limitation?
- Given the model's sensitivity to grid size, can you offer any practical guidelines or a more principled approach for selecting this crucial hyperparameter for new scenes? |
Moderately AI-edited |
|
STDDN: A Physics-Guided Deep Learning Framework for Crowd Simulation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes STDDN (Spatio-Temporal Decoupled Differential Equation Network), a novel physics-guided deep learning framework for crowd simulation. Unlike prior microscopic or purely data-driven approaches, STDDN introduces a Neural ODE formulation guided by the continuity equation from fluid dynamics, thereby coupling macroscopic density evolution with microscopic trajectory prediction. The model integrates three modules — Differentiable Density Mapping (DDM), Continuous Cross-Grid Detection (CGD), and Node Embedding (NE) — to ensure differentiability and physical consistency. Experiments on four real-world datasets (GC, UCY, ETH, HOTEL) show that STDDN significantly improves both simulation accuracy and inference speed compared with state-of-the-art baselines such as SPDiff and PCS.
1、Novel Integration of Physics and Deep Learning:
The paper introduces a principled way to integrate the continuity equation into deep models for crowd simulation. This macro–micro coupling via Neural ODE is both original and physically interpretable.
2、Methodological Sophistication:
The DVCG module cleverly connects density and velocity fields through a graph structure, while the DDM and CGD modules effectively address gradient discontinuity and cross-grid flux detection. These designs are mathematically sound and technically detailed.
3、Interpretability and Physical Consistency:
The approach offers clear interpretability grounded in physics, addressing a key limitation of previous purely data-driven models that violate conservation laws
1、The proposed model enforces strict mass conservation through the continuity equation, implying that the total population density within the target spatial domain remains constant over time. However, in realistic datasets and surveillance scenarios, the number of pedestrians in view is not fixed: new individuals may enter the scene, and others may leave. Such open-world dynamics inherently violate the closed-system assumption of the continuity equation. Without explicit treatment of source or sink terms (i.e., inflow/outflow of mass) or adaptive boundary conditions, the model may experience cumulative density drift or numerical instability, particularly when crowd density fluctuates significantly. The authors are encouraged to clarify whether boundary inflows are modeled, or to discuss potential modifications to better handle non-conserved population scenarios (see the sketch after this list).
2. The ablation study provides useful insights, particularly regarding the contributions of the ODE solver and the mass-constraint loss. Both components appear meaningful; however, the current experimental setup only uses discrete outputs in the loss computation. As a result, the experiments do not adequately demonstrate the benefit of continuous-time modeling enabled by the ODE formulation. To strengthen this section, I suggest decomposing the “w/o ODE” setting into two variants:
(1) Purely autoregressive training, as mentioned in the paper (“trained using purely autoregressive methods”).
(2) Discrete neural network replacement for the ODE, where the ODE solver is replaced with a discrete neural module that still leverages the combined loss function including the mass-constraint term.
Such a refinement would better isolate the contribution of the continuous-time ODE formulation from the general modeling capacity and loss design, making the ablation analysis more convincing; a minimal sketch of the distinction between these variants is given below.
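A minimal sketch of the distinction, written in generic PyTorch; all module and function names are hypothetical and this is not the authors' implementation. The continuous rollout integrates a learned density derivative with a fixed-step RK4 solver, while the discrete variant replaces the solver with a residual next-step update:

```python
import torch
import torch.nn as nn

class DensityRHS(nn.Module):
    """Hypothetical learned right-hand side d(rho)/dt = f(rho) on a flattened density grid."""
    def __init__(self, grid_cells: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(grid_cells, hidden), nn.Tanh(),
                                 nn.Linear(hidden, grid_cells))

    def forward(self, rho: torch.Tensor) -> torch.Tensor:
        return self.net(rho)

def rollout_continuous(rhs: nn.Module, rho0: torch.Tensor, steps: int, dt: float) -> torch.Tensor:
    """Continuous-time formulation: integrate d(rho)/dt with a fixed-step RK4 solver."""
    rho, out = rho0, []
    for _ in range(steps):
        k1 = rhs(rho)
        k2 = rhs(rho + 0.5 * dt * k1)
        k3 = rhs(rho + 0.5 * dt * k2)
        k4 = rhs(rho + dt * k3)
        rho = rho + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        out.append(rho)
    return torch.stack(out)

def rollout_discrete(step_net: nn.Module, rho0: torch.Tensor, steps: int) -> torch.Tensor:
    """'w/o ODE' variant: the solver is replaced by a discrete residual next-step update."""
    rho, out = rho0, []
    for _ in range(steps):
        rho = rho + step_net(rho)
        out.append(rho)
    return torch.stack(out)

def combined_loss(pred: torch.Tensor, target: torch.Tensor, lambda_mass: float = 1.0) -> torch.Tensor:
    """Data term plus a soft mass-conservation penalty on the total density at each step."""
    data_term = ((pred - target) ** 2).mean()
    mass_term = ((pred.sum(dim=-1) - target.sum(dim=-1)) ** 2).mean()
    return data_term + lambda_mass * mass_term
```

Comparing rollout_continuous and rollout_discrete under the same combined_loss, and additionally with lambda_mass = 0 for the purely autoregressive case, would isolate the contribution of continuous-time integration from that of the loss design.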
Can the authors explain why the fluid-dynamics prior improves results on low-density datasets such as ETH and HOTEL, where the continuum assumption barely holds? |
Fully AI-generated |
|
STDDN: A Physics-Guided Deep Learning Framework for Crowd Simulation |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes STDDN (Spatio-Temporal Decoupled Differential Equation Network), a novel physics-guided deep learning framework for crowd simulation.
STDDN explicitly combines microscopic trajectory prediction with macroscopic density evolution by embedding the continuity equation from fluid dynamics into a Neural ODE structure.
The model separates local trajectory dynamics from global density fields, enabling physical consistency and stable long-term simulations.
Experiments on four real-world crowd datasets (GC, UCY, ETH, HOTEL) show that STDDN outperforms prior physics-guided baselines such as SPDiff and PCS in both accuracy and inference speed.
1. Good motivation for coupling micro- and macro-level dynamics.
The paper’s main contribution is conceptually sound. By using the continuity equation as a bridge between trajectory prediction and density evolution, STDDN unifies local motion modeling with global flow consistency.
2. Physically meaningful ODE formulation.
The introduction of a Neural ODE to simulate density evolution is well justified. It provides continuous-time reasoning while enforcing conservation principles, addressing a key limitation of purely data-driven models that tend to accumulate errors over time.
3. Strong empirical performance.
Across four datasets, STDDN shows consistent gains over all baselines, including both physics-based and deep learning methods. The improvements in both accuracy and latency demonstrate that the proposed framework is practically beneficial.
4. Interpretability and efficiency.
The method retains interpretability through its physically grounded formulation while remaining computationally tractable, which is uncommon in physics-guided models.
1. Limited experimental diversity.
Although the method is tested on multiple datasets, all belong to similar crowd domains. It would strengthen the generality claim to include different physical systems, such as vehicle or swarm simulation.
2. Ablation breadth.
The ablation study is informative, but it would be useful to show how performance changes under different ODE solvers or with alternative coupling strengths between the density and trajectory modules; a sketch of such a solver sweep is given after this list.
3. Minor missing citations for ODE-based trajectory forecasting.
The paper would benefit from acknowledging prior studies that have already explored ODE formulations for trajectory or crowd prediction, such as Social ODE: Multi-agent Trajectory Forecasting with Neural Ordinary Differential Equations (ECCV 2022) and Improving Transferability for Cross-Domain Trajectory Prediction via Neural Stochastic Differential Equation (AAAI 2024).
These works share conceptual overlap in embedding physical dynamics into continuous differential frameworks.
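Regarding point 2 above, a compact sketch of what a solver ablation could look like, assuming the learned density dynamics are exposed as a torchdiffeq-compatible module; all names are hypothetical, and torchdiffeq is an external library not claimed by the paper:

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # external dependency: pip install torchdiffeq

class DensityRHS(nn.Module):
    """Hypothetical stand-in for the learned density dynamics d(rho)/dt = f(t, rho)."""
    def __init__(self, grid_cells: int = 32, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(grid_cells, hidden), nn.Tanh(),
                                 nn.Linear(hidden, grid_cells))

    def forward(self, t: torch.Tensor, rho: torch.Tensor) -> torch.Tensor:
        # odeint passes (t, y); the dynamics here are time-invariant.
        return self.net(rho)

rhs = DensityRHS()
rho0 = torch.rand(1, 32)                      # initial density on a flattened 32-cell grid
t_grid = torch.linspace(0.0, 1.0, steps=10)   # evaluation times

for method in ["euler", "rk4", "dopri5"]:     # fixed-step vs. adaptive solvers
    rho_traj = odeint(rhs, rho0, t_grid, method=method)
    print(method, rho_traj.shape)             # downstream: compare accuracy and latency
```

Reporting accuracy and wall-clock latency for each solver would make the ablation more complete.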
Please see the weakness section |
Fully AI-generated |