|
CLIP as a Prior Teacher: Breaking the Label Dependency in Semi-Supervised Learning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper attempts to address a well-known problem in SSL: the model's performance is heavily dependent on the quantity and quality of the limited labeled data. The authors claim to break this dependency by introducing CaPT. The core idea is to concurrently train a fully fine-tuned unimodal network on images and a parameter-efficiently fine-tuned (PEFT) CLIP model. These two models supervise each other via an entropy-weighted co-pseudo label. The results show that CaPT achieves state-of-the-art (SOTA) performance across multiple SSL benchmarks, especially in extremely low-label settings.
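To make the co-supervision mechanism concrete, below is a minimal sketch of how an entropy-weighted co-pseudo-label could be formed from the two models' predictions; the inverse-entropy weighting and all names here are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def co_pseudo_label(logits_vis, logits_clip, eps=1e-8):
    """Fuse the unimodal visual model's and the PEFT CLIP model's predictions,
    down-weighting whichever is less confident (higher entropy).
    Hypothetical form; CaPT's exact weighting may differ."""
    p_vis = F.softmax(logits_vis, dim=-1)
    p_clip = F.softmax(logits_clip, dim=-1)
    h_vis = -(p_vis * (p_vis + eps).log()).sum(-1, keepdim=True)   # per-sample entropy
    h_clip = -(p_clip * (p_clip + eps).log()).sum(-1, keepdim=True)
    w_vis, w_clip = 1.0 / (h_vis + eps), 1.0 / (h_clip + eps)      # lower entropy -> larger weight
    z = w_vis + w_clip
    return (w_vis / z) * p_vis + (w_clip / z) * p_clip             # soft co-pseudo-label
```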
1. The authors have tested CaPT on a wide range of benchmarks, including USB, ImageNet, and several fine-grained datasets, covering various scenarios of label scarcity, label noise, and class imbalance.
1. The core contribution of this paper is severely overclaimed. CaPT is, in my opinion, nothing more than a simple combination of several existing ideas, such as co-training, adapter-tuning, and mixup.
2. The authors make the assertion that their work breaks the label dependency. In reality, they have merely replaced the dependency on high-quality labels with a dependency on a high-quality CLIP prior. This is laid bare in Appendix N: when CLIP performs poorly on the FGVCAircraft dataset, CaPT's performance is low as well.
3. Entropy-based weighting is naive. Did you explore any other, more robust weighting strategies?
See Weaknesses. |
Fully human-written |
|
CLIP as a Prior Teacher: Breaking the Label Dependency in Semi-Supervised Learning |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces a semi-supervised learning approach based on an asymmetric teacher–student scheme that uses CLIP as the guidance model. To mitigate known CLIP biases, the method combines consistency regularization with a lightweight fine-tuning strategy to keep compute overhead modest. A theoretical analysis explores learning with scarce labels, linking label quantity/quality to training dynamics and expected performance. Experiments cover multiple image-classification datasets, with ablations and visualizations that examine the contribution of each component.
- Coherent design and solid empirical gains: The integration of CLIP within an SSL pipeline is thoughtfully engineered; ablations suggest each component contributes meaningfully, and the reported results surpass the listed baselines.
- Clear exposition and positioning: The manuscript is generally easy to follow, and the related-work section situates the method well among comparable SSL approaches.
- Comprehensive experimentation: The empirical section is broad, includes analyses of the proposed regularization and fine-tuning, and emphasizes practical efficiency.
- Interesting theoretical motivation: The analysis connecting pseudo-label quality to labeled-data quality—and to how prototypical the labeled samples are—is insightful and adds value to the overall contribution.
- Questionable generality of the “framework” claim: CLIP differs meaningfully from modern VLMs, and CLIP itself is comparatively dated. Without evidence that the approach transfers cleanly to stronger/modern VLMs—or to other tasks—the contribution feels more like a targeted CLIP-based recipe than a broadly applicable framework. Demonstrating adaptability (e.g., with a second teacher family or a distinct task) would strengthen the novelty and impact.
- Scope limited to CLIP-based image classification: While effective in this setting, the study does not explore alternative tasks beyond classification. The paper does not claim multi-modality; however, discussing or lightly probing extensibility (even in a small-scale study) would strengthen the practical generality of the approach.
- Theory presentation could be clearer (minor but actionable):
- Introduction: grammar around **line 50** needs a pass.
- Symbols should be explicitly defined when first used: $\epsilon_n$, $r$, and $\hat{y}$.
- Tightening these points would make the connection between assumptions and the training pipeline more transparent.
- Teacher swappability: How readily can the teacher be replaced with stronger CLIP variants or contemporary vision–language models? Are there stability or calibration issues when doing so?
- Beyond classification: What modifications (if any) are needed to extend the method to detection/segmentation or image–text retrieval? Were any preliminary attempts made?
- Sensitivity to prompts and thresholds: How sensitive is performance to text-prompt choices, pseudo-label thresholds/temperatures (if applicable), and augmentation strength? |
Fully AI-generated |
|
CLIP as a Prior Teacher: Breaking the Label Dependency in Semi-Supervised Learning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper focuses on CLIP-based semi-supervised learning. First, the paper demonstrates through theoretical and empirical analysis that performance is limited by the quantity and quality of labeled data. Then, it proposes a new method called CLIP as a Prior Teacher (CaPT), encompassing three modules: a pseudo-label module based on ViT, an adapter tuning module, and an ensemble module that combines the predictions of the first two modules. Experimental results validate the effectiveness of the proposed method.
The studied problem of semi-supervised learning with CLIP is an important and interesting research topic.
The experimental results are very comprehensive and validate the effectiveness of the proposed method.
The proposed approach seems to be a direct combination of the FixMatch approach (module A) and the parameter-efficient fine-tuning approach (module B). Although the co-training technique is interesting, directly combining off-the-shelf approaches may weaken the paper's contributions.
The paper's layout can be improved. First, it is unusual to include a theorem in the introduction. Second, it is not rigorous to refer to the different modules simply as "A," "B," and "C." There are also typos, such as "though" instead of "through" in line 229. Furthermore, the augmentation in Eq. (2) is written as an addition, which is not always valid: many augmentations cannot be implemented by simply adding one feature vector or tensor to another.
Although the co-training scheme is effective, involving a ViT and a CLIP model together is much more complex than the compared methods.
Theorem 1 seems irrelevant to the motivation and the proposed approach. First, it is obvious that the classifier's performance will be inferior with less data, without the need for any theoretical analysis. Second, the analyzed quantity is the nearest-prototype pseudo-label error, which is different from the error of the classification model. Third, a larger upper bound does not necessarily indicate a larger actual label error.
Please see "weaknesses". |
Lightly AI-edited |
|
CLIP as a Prior Teacher: Breaking the Label Dependency in Semi-Supervised Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces CaPT (CLIP as a Prior Teacher), a novel semi-supervised learning framework that leverages the strong generalization ability of CLIP to reduce the dependency of SSL methods on labeled data.
The key idea is to treat CLIP as a prior teacher, combining its zero-shot semantic knowledge with a unimodal visual learner through an asymmetric co-training mechanism.
The paper also provides theoretical insights into label dependency in SSL and demonstrates significant performance gains on standard benchmarks under extreme low-label conditions.
1. The paper formalizes label dependency in SSL and clearly articulates why existing methods fail when labeled data are extremely scarce.
2. The asymmetric co-training between CLIP and the visual model is simple yet well-motivated, enabling complementary learning between prior knowledge and data-driven adaptation.
3. Extensive experiments on CIFAR, STL, and ImageNet subsets show consistent improvements over strong SSL baselines (FixMatch, FreeMatch, RegMixMatch, etc.), especially in 1-shot and 2-shot settings.
1. While some ablations are included, it would be useful to see results with other multimodal priors (e.g., SigLIP, EVA-CLIP) to confirm generality.
2. The paper focuses on SSL baselines but could better position itself against few-shot or distillation-based methods.
See Weaknesses. |
Fully AI-generated |
|
CLIP as a Prior Teacher: Breaking the Label Dependency in Semi-Supervised Learning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes a semi-supervised learning pipeline that leverages CLIP’s prior knowledge in a co-training framework where CLIP is updated with parameter-efficient fine-tuning (e.g., adapters) rather than full model tuning. Experiments on common SSL benchmarks show performance gains.
1. The paper is built on a clear motivation, with supporting theory showing that pseudo-label error grows with prototype bias and fewer labels, formalizing a well-known intuition.
2. The paper proposes a practical strategy to incorporate CLIP in SSL that balances efficiency (adapter tuning, feature-level Mixup) and reliability (co-training + entropy-weighted labels)
1. Domain dependence of CLIP priors: where CLIP is strong (e.g., natural images like CIFAR), gains are intuitive; where CLIP is weaker or off-distribution (e.g., EuroSAT and many medical domains), benefits may diminish and are harder to guarantee.
2. Technical contributions feel like a careful combination of known pieces (co-training, PEFT adapters, entropy-weighted pseudo-labels, Mixup)
NA |
Moderately AI-edited |
|
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces OmniSpatial, a comprehensive and challenging benchmark designed to evaluate the spatial reasoning of VLMs. Grounded in cognitive psychology, OmniSpatial features over 8.4K manually curated question-answer pairs across four key categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking. Experiments show that even SOTA VLMs struggle significantly, performing far below human accuracy. To bridge this gap, the authors propose two novel strategies, PointGraph and SpatialCoT, which leverage structured scene graphs and multi-view synthesis to improve the model’s reasoning capabilities.
This paper’s most significant contribution is exposing that top AI models fail at complex spatial reasoning where humans excel, clearly defining a crucial and challenging direction for future VLM research.
It presents the highly original concept of SpatialCoT, which enhances reasoning by simulating mental imagery. This creative fusion of 3D novel-view synthesis with chain-of-thought prompting represents a significant conceptual advance for tackling view-dependent and perspective-taking tasks.
The work is distinguished by its quality, evident in the meticulous manual creation of its 8.4K question-answer pairs, which achieved a high inter-annotator agreement and transparent evaluation across a wide spectrum of leading AI models.
The paper demonstrates that models fail on complex tasks but does not offer a deep analysis of the reasons. Without a breakdown of specific error types, the work provides limited actionable guidance for researchers to develop targeted architectural improvements.
Given the performance gap between frontier models and humans, it's important to consider whether current methods can help VLMs catch up. If not, what future research or scaling approaches could bridge this gap? |
Moderately AI-edited |
|
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces OmniSpatial, a large-scale benchmark designed to evaluate comprehensive spatial reasoning in vision-language models (VLMs). It organizes tasks into four key categories, including dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking, covering 50 subtypes and 8.4K manually curated QA pairs. The benchmark integrates multiple data sources (web, cognitive tests, driving exams, and prior embodied datasets) with high annotation consistency (Krippendorff’s α = 0.84).
The authors further propose two methods to enhance VLM spatial reasoning:
1. PointGraph – providing explicit scene graphs for spatial structure.
2. SpatialCoT – enabling multi-view reasoning using novel-view synthesis (InstantMesh).
They benchmark 36 VLMs (GPT-4.1, Gemini-2.5, Qwen-VL, InternVL, etc.) and show that while leading reasoning models (e.g., o3, Gemini-2.5-pro) achieve ≈56% accuracy, human performance reaches 92%. Fine-tuning on OmniSpatial improves performance (+7.8 points) and transfers modestly to other spatial benchmarks (e.g., VSI-Bench +2 points)
1. Focused and Systematic Scope
The paper maintains a clear focus on spatial reasoning, defining it precisely, covering its cognitive dimensions, and avoiding unnecessary general multimodal extensions. This conceptual focus makes OmniSpatial a coherent and practically usable benchmark.
2. Rigorous Manual Curation:
- The dataset is human-annotated, multi-sourced, and cross-validated with strong inter-annotator agreement, addressing common weaknesses of synthetic or template-based datasets.
1. Lack of Deep Analysis or Failure Studies
The paper could benefit from qualitative examples showing why models fail (e.g., depth reasoning errors, frame-of-reference confusion, or temporal misalignment)
2. Marginal Quantitative Gains
The improvements from PointGraph and SpatialCoT are modest (≈1–2 points per dimension), raising questions about their practical impact.
1. How do PointGraph and SpatialCoT interact? Are their improvements additive or overlapping?
2. Can the authors provide qualitative examples illustrating typical model errors (e.g., misinterpreting object orientation, inconsistent frame of reference)? |
Fully AI-generated |
|
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents OmniSpatial, a new and comprehensive benchmark aimed at evaluating higher-level spatial reasoning in VLMs beyond basic left–right or counting tasks. It provides 8.4K human-curated QA pairs across four categories—dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking—covering 50 task types. Evaluating 36 models shows that state-of-the-art VLMs achieve only 56% accuracy, far below human performance, with notable weaknesses in geometric reasoning and non-egocentric perspective shifts. The paper also introduces PointGraph and SpatialCoT as two strategies to improve spatial reasoning, both yielding modest gains.
- The paper is well written and easy to understand.
- The dataset construction is solid and carefully annotated by humans.
- The evaluation is comprehensive.
**Training Data Leakage Concern**
- While the dataset is manually curated, some sources (e.g., web images, exam-style questions) may overlap with model pretraining corpora. A clearer discussion on leakage mitigation, measurement, and dataset decontamination would strengthen the benchmark’s credibility.
**Compute Cost of SpatialCoT**
- The proposed SpatialCoT relies on multi-view synthesis, which appears computationally expensive. A discussion of its runtime, resource requirements, and potential lightweight or more practical alternatives would improve the clarity of its applicability.
**Lack of Discussion on Related Works**
- I have seen prior works that also incorporate structured spatial information through text-based scene representations (e.g.,[1]). The PointGraph idea seems to be closely related to this one. It would be appropriate to acknowledge and discuss such related methods when introducing PointGraph in Sec. 3.3.1 to better position the contribution.
**Missing Error Bars in Reporting Results**
The main table does not present confidence intervals, variance, or statistical testing. As this is a benchmark paper, stronger evidence of robustness and significance is needed. Reporting standard deviations or significance tests can better support the claims and ensure results are reliable.
Overall, I believe the paper could be a good contribution to the community, and I would be happy to reconsider the score if the above concerns are satisfactorily addressed.
- To what extent can models answer correctly without looking at the images? Since most questions are binary or 4-way multiple choice, some may be solvable from textual priors alone (e.g., Fig. 3: “I am entering a highway, I will encounter a ‘Give Way’ sign”). Have the authors evaluated a text-only baseline to isolate true visual reasoning?
- How is PointGraph different from existing methods like [1]?
- Can the authors provide an estimated compute overhead of SpatialCoT and discuss practicality for deployment?
- How do the authors assess or mitigate potential data leakage, especially for web- or exam-derived content that may exist in model training corpora? Is there any decontamination or measurement of overlap?
[1] Wang et al., Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models, NeurIPS 2024 |
Moderately AI-edited |
|
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes a new Visual Question Answering (VQA) benchmark, OmniSpatial, designed to comprehensively evaluate spatial reasoning capabilities. The benchmark highlights a broad and in-depth coverage of spatial relation tasks. The authors carefully design four key dimensions of spatial reasoning to be evaluated: perspective-taking, spatial interaction, dynamic reasoning, and compositional understanding. These four categories present significant challenges and substantially advance the complexity of spatial evaluation for current vision-language models (VLMs).
The proposed benchmark provides a comprehensive evaluation across diverse conditions, supported by a large collection of curated web images. The paper conducts experiments across multiple model series and sizes, including reasoning, closed-source, and open-source VLMs, while also providing a human baseline for comparison.
In addition, the authors investigate two approaches aimed at improving VLM spatial understanding on this benchmark, PointGraph and SpatialCoT. Both yield consistent improvements across different VLMs.
- The proposed benchmark introduces a new and challenging evaluation setting that explores aspects of spatial reasoning rarely addressed in previous datasets. It is notably more complex and comprehensive than prior benchmarks.
- The question annotations involve a human-in-the-loop process to ensure clarity, answer uniqueness, and the resolution of ambiguous spatial references.
- The evaluation includes a wide range of VLMs—covering reasoning-focused, open-source, closed-source, and human baselines—demonstrating the benchmark’s broad coverage, thorough experimental setup, and comprehensive comparisons across models.
- Results demonstrate the significant shortcomings of VLMs across different types of spatial relations.
- The paper also introduces two promising approaches, PointGraph (which incorporates an explicit scene graph as input) and SpatialCoT (which generates multi-view points from a given image to provide diverse spatial perspectives). These methods consistently improve model performance across different VLMs.
- The paper shows that fine-tuning models with this dataset has potential for transferability to other VLM benchmarks.
- The paper is well-written and includes clear illustrations that help the audience understand the proposed spatial relation benchmark and its evaluation scope.
- There is no qualitative analysis of failure cases. Investigating these failures would strengthen the paper further. Providing a few examples and categorizing the errors could help reveal which aspects of reasoning need improvement—such as perception, logical reasoning, or consistency.
- The paper only demonstrates the effectiveness of SpatialCoT on the perspective-taking task. How does this approach affect performance on other task types? This raises some concern that it may make the model perform worse on tasks that do not require perspective taking.
- Minor issue: Table 4 is never mentioned in the discussion.
- In the question generation process, is a fixed template used, or are LLMs involved? If the process relies on template-based questions, would it be possible to incorporate LLMs to increase the diversity of question types? If LLMs are already used, the paper should explicitly say so. |
Fully human-written |
|
RA-SpaRC: Robust Adaptation with Sparse Plus Low-Rank Compressors |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes RA-SpaRC, a novel initialization strategy for parameter-efficient fine-tuning that automatically balances sparse and low-rank components when adapting large pretrained models. Concretely, RA-SpaRC adopts a compressor-based framework and defines a compressor which optimally decomposes gradient updates into sparse and low-rank components under a fixed computational budget. It further contributes an efficient alternating projection algorithm to automatically determine the best rank–sparsity trade-off as well as a compressor quality metric that guarantees loss reduction during optimization.
Experiments on LLaMA-2-7B, Qwen2.5-7B, and T5-base models across NLU, NLG, and code reasoning tasks show consistent improvements over LoRA, LoRA-One, and RoSA, without extra memory and with moderate initialization cost.
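To make the compressor idea concrete, one plausible way to write the budgeted sparse-plus-low-rank decomposition of a gradient $G \in \mathbb{R}^{m \times n}$ is the following; this exact formulation is inferred from the summary above and is not quoted from the paper.

$$
\min_{r,\,s,\;U \in \mathbb{R}^{m \times r},\,V \in \mathbb{R}^{n \times r},\,S \in \mathbb{R}^{m \times n}} \; \big\| G - (U V^{\top} + S) \big\|_F^2
\quad \text{s.t.} \quad \|S\|_0 \le s, \qquad r\,(m+n) + s \le B,
$$

where $B$ is the fixed parameter budget and the rank-sparsity split $(r, s)$ is chosen automatically by the compressor rather than fixed in advance as in prior hybrid methods.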
* The proposed method identifies the shortcomings of fixed-ratio sparse + low-rank hybrid PEFT methods and comes up with an elegant formulation to adaptively tune this.
* The paper reformulates PEFT initialization as applying a compressor on gradients, unifying sparse and low-rank initialization under a unified lens.
* The authors implement a custom kernel for avoiding the materialization of dense matrices, therefore achieving real speedup compared to naive unoptimized implementations.
* The evaluation has the potential to be improved.
- First off, I would recommend that a more complete experimental setup is provided, which would enhance understanding and reproducibility.
- Breadth-wise, results are mainly focused on dense LLMs and text-based tasks. Applying the technique to other networks, e.g., ViTs, would help showcase the generality of the method.
- Depth-wise, the paper’s evaluation section mostly focuses on aggregate results (accuracy, efficiency) rather than offering deeper insights into why or how the adaptive rank–sparsity allocation interacts with different layers of large language models (LLMs). Coupling this with different budgets would be even better.
* The paper focuses on better initialization, but adaptation dynamics during training (e.g., stability or convergence rate) are not deeply analyzed.
* Though the initialization is efficient, the adaptive search may still be costly for very large models or frequent reinitialization (e.g. adapter soups).
* How would the authors propose their solution be applied on non-dense, multi-branch models, like MoE or hybrid-attention structures?
* How does the method perform under quantized pretrained backbones?
* How does RA-SpaRC interact with alternative modern optimizers (e.g., Muon)? Is the initialization's benefit amplified or diminished?
* How does the technique behave against DoRA?
* Does adaptive allocation generalize across tasks, or must it be recomputed for every fine-tuning run?
* Could the compressor be integrated dynamically during training rather than only at initialization? |
Fully human-written |
|
RA-SpaRC: Robust Adaptation with Sparse Plus Low-Rank Compressors |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces RA-SpaRC (Robust Adaptation with Sparse plus Low-Rank Compressors), a new hybrid PEFT initialization strategy that overcomes the need to manually tune the ratio of parameters between low-rank and sparse components, through an automatic parameter allocation mechanism. The paper shows that RA-SpaRC outperforms classic and hybrid PEFT methods in extensive experiments across multiple models.
The paper is a pleasure to read and the idea seems interesting and promising, offering new insights into the performance of different PEFT initialisation methods and PEFT methods themselves. Still, the empirical improvements seem limited, and it is not fully clear if the empirical evaluation is completely fair.
- The paper is well presented and conveys the message well.
- The proposed method seems novel in its approach.
- The claims are supported by theory and experiments.
- The proposed method advances the PEFT research, introducing a new hybrid approach.
- The average improvement in Table 1 is marginal, especially when compared to LoRA-One.
- Results in Table 2 are more significant, however comparisons may not be fair when considering classic PEFT methods (i.e. LoRA, LoRA-GA, LoRA-One), as for LLaMA-2 results are not reported for 4.746% parameters, but only for 0.297%, which marginally improves for GSM8K. In addition, Table 2 omits results for PISSA, which seems a relevant competitor.
- In Section 4.4 the paper acknowledges the initialization and training times as core limitations of the proposed method, where the overhead is respectively 3.75x to 5.33x (depending on the model) and 1.05x to 1.35x. The paper fails to show the trade-offs between performance gains and time overhead, which should be the overarching goal.
- how does PISSA compare?
- is there a favourable time-accuracy trade-off for the proposed method?
- why are the models (Sect. 4.2) fine-tuned only on a sample?
- what is the accuracy for the experiments in Fig. 2/3? |
Fully human-written |
|
RA-SpaRC: Robust Adaptation with Sparse Plus Low-Rank Compressors |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces RA-SpaRC, a new initialization strategy for Parameter-Efficient Fine-Tuning (PEFT) that addresses the limitation of existing hybrid methods like RoSA requiring manual tuning of sparse and low-rank components.
RA-SpaRC uses an adaptive allocation mechanism based on gradient analysis to automatically balance these components within a fixed parameter budget, outperforming LoRA, its variants, and RoSA across multiple benchmarks.
The paper addresses an important problem of developing an initialization strategy that enables flexible, automatic, and effective budget allocation for Robust Adaptation.
To this end, they propose RA-SpaRC (Robust Adaptation with Sparse plus Low-Rank Compressors), an initialization strategy for sparse plus low-rank fine-tuning, which could dynamically assign the ratio of low-rank and sparse parts according to the gradient information of different tasks and models.
The proposed method is very pragmatic and practical.
The proposed method was examined in comparison with different LoRA variations in several benchmarks.
Notation should be revised and redundancy in some terms should be fixed.
The proposed method performs only on par with state-of-the-art LoRA variations in some experiments.
The initialization and running times of the proposed method are considerably larger than those of the baseline vanilla LoRA.
Theoretical analyses of several properties of the proposed method, such as the convergence rate, were not provided.
Have you theoretically compared the convergence rate of your proposed method with that of the other LoRA variations?
The accuracy boost is smaller for GSM8K than for HumanEval in Table 2. Could you please elaborate on this result? |
Fully human-written |
|
RA-SpaRC: Robust Adaptation with Sparse Plus Low-Rank Compressors |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes RA-SpaRC, which first computes the initial gradient on a small data batch and, under a fixed parameter budget, employs a binary search to explore different ranks. It allocates the remaining budget to sparsity, selects the rank $r^*$ that minimizes reconstruction error, and constructs the sparse matrix $S$ from the $s^*$ largest-magnitude elements of the residual.
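As a reading aid, here is a rough sketch of the allocation step described above; the binary search is simplified to a linear scan over candidate ranks, and all names and the budget accounting are assumptions rather than the paper's implementation.

```python
import numpy as np

def sparc_allocate(G, budget, rank_candidates):
    """Hypothetical sketch: choose the rank/sparsity split that best reconstructs
    the gradient G under a fixed parameter budget (linear scan instead of the
    paper's binary search, for clarity)."""
    m, n = G.shape
    U, sv, Vt = np.linalg.svd(G, full_matrices=False)
    best = None
    for r in rank_candidates:
        s = budget - r * (m + n)            # budget left for the sparse part
        if s < 0:
            continue
        L = (U[:, :r] * sv[:r]) @ Vt[:r]    # rank-r approximation of G
        R = G - L                           # residual
        S = np.zeros_like(G)
        if s > 0:
            idx = np.argsort(np.abs(R), axis=None)[-s:]  # s largest-magnitude entries
            S.flat[idx] = R.flat[idx]
        err = np.linalg.norm(G - L - S)
        if best is None or err < best[0]:
            best = (err, r, s)
    return best  # (reconstruction error, r*, s*)
```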
1. The proposed method can automatically allocate the ratio between the low-rank and sparse components, which reduces the complexity of hyper-parameter tuning.
2. The paper shows many results on various benchmarks with different models, which demonstrate that it outperforms LoRA and its variants.
1. Since the paper determines the ratio between low-rank and sparse components based on a mini-batch, it should demonstrate whether this ratio remains stable or changes when the samples in the mini-batch vary.
2. From Tables 1 and 2, the performance gap between RA-SpaRC and LoRA-One is minor, while RA-SpaRC incurs higher initialization time. This raises questions about its practical benefit. Moreover, Table 2 should include results for LoRA-One using the same number of parameters as RA-SpaRC when evaluated on LLaMA-2-7B.
3. In Table 4, the performance difference between SpaRC and SVD is minimal. The authors should therefore provide experimental results for the proposed method without the sparse compressor to better isolate its contribution.
4. Since RoSA is the main baseline for comparison, the paper should report results for RoSA under additional configurations to more clearly demonstrate the effectiveness of the proposed adaptive ratio allocation.
See the weakness |
Fully human-written |
|
UniOMA: Unified Optimal-Transport Multi-Modal Structural Alignment for Robot Perception |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes UniOMA, a unified optimal-transport based framework for multimodal structural alignment that addresses the "structural alignment gap" in existing contrastive learning approaches. The key insight is that while InfoNCE-style objectives achieve statistical alignment between modalities, they fail to preserve intra-modal geometric relationships, leading to embeddings that are statistically correlated but structurally inconsistent. UniOMA augments standard contrastive losses with a Gromov-Wasserstein (GW) distance-based regularization that enforces structural coherence across modalities by:
- Computing intra-modal similarity matrices for each modality and estimating a dynamic GW barycenter as structural consensus
- Aligning each modality's embedding-space geometry to this consensus via weighted GW distances
- Learning modality-specific weights that quantify each modality's contribution to the structural consensus
The approach is evaluated on robotic perception tasks across vision, force, tactile, proprioception, and audio modalities, demonstrating improvements in downstream tasks while maintaining interpretability through learned modality weights.
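Schematically, the objective described above can be read as a contrastive term plus a weighted GW structural term; the notation and the exact normalization below are assumptions based on this summary rather than the paper's formulation.

$$
\mathcal{L} \;=\; \mathcal{L}_{\text{contrastive}} \;+\; \alpha \sum_{m=1}^{M} w_m \, \mathrm{GW}\big(K_{z_m}, \bar{K}\big),
\qquad
\bar{K} \;\approx\; \arg\min_{K} \sum_{m=1}^{M} w_m \, \mathrm{GW}\big(K_m, K\big),
$$

where $K_m$ and $K_{z_m}$ are the intra-modal similarity matrices of modality $m$ (in the data and embedding spaces, respectively), $\bar{K}$ is the dynamically estimated GW barycenter acting as the structural consensus, and $w_m$ are the learned modality weights.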
- Novel Problem Identification: The paper clearly identifies and formalizes the "structural alignment gap" - a fundamental limitation where InfoNCE objectives achieve statistical dependence but fail to preserve intra-modal geometry.
- Strong Theoretical Motivation: Figure 1 provides an effective illustration of the key theoretical motivation—the limitation of InfoNCE-based alignment methods in preserving intra-modal structural relationships, even when achieving overall statistical alignment. The figure clearly demonstrates how correlated structure within modalities can be lost, supporting the necessity of structure-aware regularization. This theoretical insight is further contextualized with concrete examples from robotics, establishing the practical importance of addressing this challenge in real-world applications.
- Theoretically Grounded Approach: Strong theoretical foundation connecting Gromov-Wasserstein distances to multimodal alignment
- Comprehensive Experimental Validation: Evaluation across diverse robotic modalities (vision, force, tactile, proprioception, audio), Multiple downstream tasks including regression, classification, and cross-modal retrieval, Consistent improvements when adding GW regularization to existing methods (Pairwise, Symile, GRAM)
- Comprehensive Ablations, Analysis and Visualizations: The paper provides thorough ablation studies, particularly the unequal modality sampling scenarios that demonstrate practical robustness. The analysis of learned modality weights is particularly compelling, showcasing the model's ability to adaptively handle imbalanced modalities. Figure 3e is especially convincing evidence that the theoretical foundations of UniOMA are working as intended—the framework genuinely learns to weight modalities appropriately based on their informational content and availability. This adaptive redistribution of weights when modalities are downsampled provides both practical value and theoretical validation, demonstrating that the GW-barycenter approach isn't just mathematically elegant but actually captures meaningful structural relationships that translate to improved performance.
- Computational Overhead: The iterative GW barycenter computation and optimal transport estimation introduce significant computational cost during training. While mini-batch approximations are used, scalability to very large datasets remains unclear, and there is limited analysis of computational complexity in practice beyond algorithmic complexity bounds.
- Hyperparameter Sensitivity: The framework introduces multiple hyperparameters (regularization weight α=1000, kernel scales γ, barycenter iterations Tmax=5). There is limited sensitivity analysis provided, particularly for the choice of kernel similarity measures across different modalities.
- The paper could benefit from a discussion and comparison with the rich literature on missing modality learning, which addresses related challenges of preserving modality-specific information while modeling shared information. Prior works grounded in Partial Information Decomposition emphasize explicitly modeling and disentangling unique versus shared modality information. Although these works do not explicitly discuss intra-modality topology or structural consistency, the concept of “modality-specific” information seems like a different theoretical view of the same goal. Including these references and discussing their relationship to the proposed approach would strengthen the contribution and contextualize the novelty relative to relevant fields. Notable examples include:
- Wang et al. (2023), "Multi-modal learning with missing modality via shared-specific feature modelling" (CVPR)
- Nguyen et al. (2025), "Robust: Leveraging redundancy and modality-specific features for robust multimodal learning" (IJCAI)
Clarifications:
- In Figure 2, why are the corresponding similarity matrices between the data space and the embedding space not aligned? For example, $K_{x_1}$ could be aligned with $K_{z_1}$, and $K_{x_2}$ with $K_{z_2}$. Wouldn't aligning each modality's similarity matrix individually to its embedding counterpart better encourage the encoders to capture the topological structure, while still maintaining O(M) complexity, rather than relying on a combined consensus? Moreover, the encoders are already aligned across modalities through L_c.
- Lines 260-262 mention that modalities such as vision and force–torque are in incomparable metric spaces but have meaningful internal geometries, which is crucial in robotics. Could this claim be clarified with a concrete example to illustrate this point? Like a task or data instance when this would be the case
- In line 264, what does the bold "1" represent in the notation?
- In Table 1, what do the bolded and orange-colored numbers signify? Additionally, what measure of uncertainty is reported with the ± (standard deviation, variance, confidence interval)?
Thought Experiments/Extensions:
- Scalability Concerns: How does the computational overhead scale with dataset size and number of modalities? What are the practical limits for real-time robotic applications where inference speed is critical?
- Generalization Beyond Robotics: How effective would this approach be in non-robotic multimodal domains (e.g., vision-language, medical imaging), and are there any issues that may arise? Demonstrating functionality in other domains would really extend this work’s contributions.
- Alternative Consensus Strategies: Why choose a single barycenter as consensus rather than multiple cluster centers? Could hierarchical or mixture-based consensus improve performance for complex structural relationships? |
Fully AI-generated |
|
UniOMA: Unified Optimal-Transport Multi-Modal Structural Alignment for Robot Perception |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a framework designed to achieve geometry-aware alignment across multiple heterogeneous modalities such as vision, force, tactile, proprioception, and audio in robotic perception. The method introduces a Gromov–Wasserstein (GW) distance–based regularization to augment conventional contrastive objectives (e.g., InfoNCE). Specifically, UniOMA computes intra-modal similarity matrices to represent modality-specific geometric structures and aligns them through a dynamically learned GW barycenter that serves as a shared “structural consensus.” This barycentric formulation reduces the pairwise coupling complexity from $O(M^2)$ to $O(M)$ and allows adaptive modality weighting via learnable coefficients. Experiments on four multimodal robotics benchmarks (VFD, VFP, MuJoCo Push, VAT) show consistent gains across regression, classification, and retrieval tasks compared to pairwise and higher-order contrastive baselines (CLIP, Symile, GRAM). Ablations suggest improved robustness to asynchronous sampling and interpretability via modality weights.
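For orientation, the sketch below shows how a GW discrepancy between two modalities' intra-batch similarity structures could be computed with the POT library; the RBF kernel, uniform sample weights, and function names are assumptions, and UniOMA's learned barycenter consensus and per-modality weights are omitted.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def gw_structural_gap(feat_a, feat_b, gamma=1.0):
    """Gromov-Wasserstein discrepancy between the intra-batch similarity structures
    of two modalities (illustrative only; UniOMA additionally estimates a barycenter
    consensus over all modalities and learns modality weights)."""
    def rbf(X):
        d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
        return np.exp(-gamma * d2)                 # batch x batch similarity matrix
    Ka, Kb = rbf(feat_a), rbf(feat_b)
    n = Ka.shape[0]
    p = q = np.full(n, 1.0 / n)                    # uniform weights over batch samples
    return ot.gromov.gromov_wasserstein2(Ka, Kb, p, q, loss_fun='square_loss')

# toy usage: a batch of 32 samples with 128-d vision and 16-d force embeddings
gap = gw_structural_gap(np.random.randn(32, 128), np.random.randn(32, 16))
```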
- The paper identifies and formalizes the structural alignment gap in multimodal contrastive learning, an under-discussed but practically relevant issue.
- Integration of Gromov–Wasserstein barycenters into the multimodal alignment objective is mathematically principled and computationally efficient (O(M) scaling).
- The theoretical novelty is limited; the framework primarily combines existing OT/GW formulations with standard contrastive objectives.
- The ablation studies, while illustrative, lack statistical rigor (no error bars or repeated trials).
- Comparisons are restricted to InfoNCE-based baselines; recent large-scale multimodal foundation models (e.g., ImageBind, CLIP4Clip) are not directly benchmarked.
- The computational cost of computing GW barycenters, though claimed mitigated, is not thoroughly analyzed (no runtime or memory comparison).
- The claim of “scaling naturally to 3+ modalities” is empirically modest, tested only on up to 3 modalities per benchmark.
- How sensitive is the method to the choice of kernel functions (RBF vs. TCK) for constructing similarity matrices?
- Could the authors provide runtime analysis or scalability benchmarks comparing UniOMA with pairwise CLIP and GRAM?
- How does the learned barycentric consensus behave qualitatively—does it correspond to physically interpretable intermediate structures?
- Can UniOMA extend beyond robotics to vision–language–audio domains, and would the same kernels apply?
- Does the GW regularizer introduce convergence instability or require curriculum scheduling during training? |
Fully AI-generated |
|
UniOMA: Unified Optimal-Transport Multi-Modal Structural Alignment for Robot Perception |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper augments contrastive multimodal alignment with a Gromov–Wasserstein (GW) barycenter regularizer that encourages each modality’s batchwise similarity structure to agree with a shared consensus. The method is meant to scale to ≥3 modalities and is evaluated on several robotics-flavored datasets (vision/force/tactile/proprioception), reporting improvements when the GW term is added to common contrastive objectives.
1. Clear articulation of the structural alignment gap in contrastive learning and a tidy objective that is easy to plug into existing losses.
2. Sensible idea for many-modality settings (barycenter vs. O(M²) pairwise couplings), with interpretable per-modality weights.
3. Robotics tasks with underused modalities (force/tactile) are a good target domain.
The theoretical component largely instantiates known pieces of OT/GW (trace-style alignment, barycenter regularization) within a standard contrastive framework, without new guarantees or analysis (e.g., identifiability, convergence behavior with stochastic batches, or conditions under which the barycenter preserves task-relevant geometry). Derivations and definitions appear to repackage established GW formulations rather than introduce genuinely new theory. As a result, the contribution feels incremental on the theory side.
Empirically, the paper shows consistent but mostly modest gains, and the evaluation lacks the depth needed for an ICLR-level claim:
- Ablations are thin: no systematic study of kernel choices/γ sensitivity, solver settings, or the effect of the barycenter update frequency.
- Compute & practicality: no clear reporting of training overhead vs. baselines (wall-clock/epoch, peak memory, GW iterations), which matters for practitioners considering per-batch GW.
- Missing-modality robustness: the narrative highlights this use case, but there’s no explicit drop-a-modality at inference stress test.
- Baselines & breadth: comparisons miss stronger or more recent structural-preserving or OT-regularized approaches; it’s hard to conclude that the proposed regularizer is the best option among peers.
See Weaknesses. |
Fully AI-generated |
|
UniOMA: Unified Optimal-Transport Multi-Modal Structural Alignment for Robot Perception |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
In this paper, the authors propose an enhancement to existing InfoNCE-based contrastive learning frameworks by incorporating modality-specific intrinsic relationships. The motivation stems from the observation that conventional contrastive methods, which primarily align paired cross-modal data, often fail to preserve the inherent structural topology within each individual modality. To address this limitation, the authors introduce a Gromov–Wasserstein distance–based regularization, which explicitly maintains intra-modality geometric consistency while aligning multiple modalities. The proposed method is applied to representation learning across vision, force, and tactile modalities—domains that are relatively underexplored yet critical for robotic perception and manipulation. Experimental results demonstrate that the approach significantly improves both representation quality and downstream task performance, validating the effectiveness of incorporating modality-specific structural alignment into contrastive learning.
1. The paper is well-written and clearly structured. The proposed idea of preserving intrinsic relationships when aligning representations across different modalities is both intuitive and promising. It provides a thoughtful perspective on improving cross-modal contrastive learning.
2. The authors’ claims are well-supported by both qualitative and quantitative evidence. The case study (Fig. 1) effectively illustrates the underlying issue, while the main experimental results convincingly demonstrate the method’s performance advantages.
3. The incorporation of Gromov–Wasserstein (GW) distance as a regularization term is well-motivated and promising. The resulting framework is flexible and can be seamlessly integrated into existing InfoNCE-based contrastive learning methods, enhancing their ability to capture modality-specific structural information.
1. The main concern with this paper lies in its experimental setup. Aligning representations between vision and low-dimensional, robotics-related modalities (such as tactile signals or end-effector (EEF) positions) may not be conceptually sound. Visual observations inherently contain richer, high-level semantic information — including environmental context, object appearance, and background — whereas proprioceptive or tactile data capture only limited, low-dimensional physical aspects of the same scene. Aligning these representations risks degrading the generality and expressiveness of the visual embeddings, as the model may overfit to the less informative modalities. Similarly, the task of aligning vision and audio modalities seems somewhat unnatural in the given robotic context, and its motivation should be better justified.
2. The paper’s central idea of maintaining intrinsic structural relationships is conceptually appealing, but modeling these relationships in high-dimensional visual feature spaces remains an open challenge. The choice of RBF kernel to define distances in such complex, semantically rich spaces is not particularly promising — while it has shown effectiveness in early low-level computer vision applications, it may not adequately capture the nuanced semantic geometry of deep visual embeddings. A deeper discussion or alternative strategies (e.g., learned metrics or graph-based structures) would strengthen this aspect.
3. It is unclear why the proposed method is designed specifically for three or more modalities. The idea of using Gromov–Wasserstein regularization to enhance two-modality contrastive learning (e.g., in vision–language pretraining) could be impactful, given its broader applicability and relevance to large-scale multimodal learning. Exploring or discussing such extensions could significantly increase the practical and theoretical contribution of this work. The key remaining challenge would be to properly model intra-modal structures for high-semantic modalities like vision and language, which would make the approach more meaningful and generalizable.
N/A |
Fully AI-generated |
|
ZoomV: Temporal Zoom-in for Efficient Long Video Understanding |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors address the challenge of long-video understanding by proposing a mechanism to select the most relevant frames in a video to answer a given prompt. They introduce ZoomV, a method that leverages TemporalLink, an additional module for MLLMs designed to embed timestamp information into timestamp tokens, which are then linked to visual tokens. Their second contribution is TemporaLight, an approach that utilizes the model's reflection to assess the relevance of a selected time window for a given query, providing a confidence score. By considering multiple windows with varying reflection confidence, the window with the highest confidence can be selected for the video understanding task. This approach effectively concentrates computational effort on the most pertinent input frames. The authors apply their method on top of LLaVA, InternVL and Qwen2.5-VL and show improvements on MVBench, MLVU, LongVideoBench, VideoMME and LVBench. They also evaluate their method on temporal grounding benchmarks such as Charades-STA, ActivityNet-Caption and ReXTime.
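For clarity, the window-selection step described above can be summarized by the schematic sketch below; the function names, the window representation, and the way confidence is obtained are hypothetical, not taken from the ZoomV implementation.

```python
from typing import Callable, List, Tuple

TimeWindow = Tuple[float, float]  # (start_sec, end_sec); representation is assumed

def zoom_in(windows: List[TimeWindow],
            query: str,
            score_window: Callable[[TimeWindow, str], float],
            answer_from: Callable[[TimeWindow, str], str]) -> str:
    """Pick the temporal window the model itself judges most relevant to the query
    (a stand-in for the self-reflection confidence), then answer using only that
    window's frames, concentrating compute on the pertinent segment."""
    best_window = max(windows, key=lambda w: score_window(w, query))
    return answer_from(best_window, query)
```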
- The paper is well written, and the presented methods, TemporalLink and TemporaLight, are sound and relevant.
- Introducing self-correction to select the frames to use is an interesting idea, sometimes MLLMs are indeed better when used as judges.
- There are a number of missing baselines: other training-based methods such as LongVU, LongVA, and Frame-Voyager, and training-free methods such as Adaptive Keyframe Sampling (AKS). Overall, I am concerned by the lack of comparison with other methods and by the very short related-work section, which does not cite the most common papers on video frame selection for MLLMs.
- Would have appreciated a deeper study of efficiency, with a better discussion of training cost/time and inference time versus other methods such as LongVU or other frame-selection methods such as AKS.
- Would also have liked a more in-depth study of the self-reasoning for window selection. Are some models better at this, or did you observe the same results for LLaVA-Video, InternVL and Qwen?
Is there any reason why you did not compare your method with similar training based methods for frame selection? |
Fully human-written |
|
ZoomV: Temporal Zoom-in for Efficient Long Video Understanding |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper proposes ZoomV, a query-aware temporal zoom-in framework designed for efficient and accurate long video understanding. It retrieves relevant events and their associated temporal windows as candidates, and selects higher-confidence temporal windows as the LVLM's final input to provide the answer. It conducts experiments on temporal grounding benchmarks as well as long video understanding benchmarks to demonstrate the effectiveness of the proposed method.
1. The paper is clearly written and easy to follow.
2. The proposed approach is reasonable and methodologically sound.
3. Experiments are conducted on both temporal grounding and long video understanding benchmarks to demonstrate the effectiveness of the method.
1. One of the main contributions claimed by the paper is the confidence-based temporal grounding approach. However, this concept has already been introduced in TimeSearch [1]. Therefore, it cannot be regarded as a novel contribution of this work. Moreover, the authors have not properly cited TimeSearch to acknowledge prior work.
2. The technical novelty appears limited, as the main modification involves adding textual timestamps to each frame embedding, which was already employed in models such as Eagle2.5 [2] ([Eagle2.5 implementation](https://github.com/NVlabs/Eagle/blob/047e51070e8976978376cb828f7af92323c0f8ef/Eagle2_5/deployment/inference.py#L85))
3. The method does not seem consistently effective across all video benchmarks, and the improvement is marginal on several benchmarks such as MVBench and VideoMME.
4. Since the paper positions its approach as an agent-style method, it should also include comparisons with recent video agent frameworks such as Video-RAG [3].
5. Given that the model is fine-tuned on a recent backbone (Qwen2.5-VL), which already exhibits strong temporal grounding capabilities, it would be more convincing to compare against recent models fine-tuned on the same base, such as VideoChat-R1 [4] and Time-R1 [5].
6. The hierarchical search is not a novel idea; it has already been explored in TimeSearch [1], UniTime [6] and VideoChat-R1.5 [7]. These works should be discussed in the related work and experiments.
7. Since the proposed methods are fundamentally based on temporal grounding, the paper should include a discussion of the temporal grounding task in the related work section.
8. The paper emphasizes efficiency in its title; however, it does not provide a comprehensive analysis of efficiency compared to the base models.
[1] TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding, arXiv:2504.01407.
[2] Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models, arXiv:2504.15271.
[3] Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension, NeurIPS 2025.
[4] VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning, arXiv:2504.06958.
[5] Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding, NeurIPS 2025.
[6] Universal Video Temporal Grounding with Generative Multi-modal Large Language Models, arXiv:2506.18883.
[7] VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception, arXiv:2509.21100.
1. The improvement on VideoMME is very limited, for example only 0.1 on Qwen2.5-VL and no improvement on MVBench, while the method gains 11.3 on LVBench. It seems the method does not generalize to all video benchmarks. Could you explain why it improves LVBench by a large margin but is not effective on VideoMME?
2. Both VideoMME and LVBench are long-video understanding benchmarks that contain thousands of frames. However, the proposed method only samples 64 frames, which results in a substantial loss of visual details throughout the video and may prevent accurate grounding on evidence frames. Have the authors experimented with increasing the number of sampled frames? This could better demonstrate the effectiveness of the proposed approach.
3. The evaluation involves recursively exploring video frames, meaning that the total number of processed frames exceeds 64. How many frames are explored on average?
4. Considering the increased number of processed frames and the computational overhead, is it entirely fair to compare the results with the base model under a 64-frame input setting? A fairer comparison would be against the model’s officially reported best performance, for example, Qwen2.5-VL achieves 70.2 on MLVU and 65.1 on VideoMME.
5. What about the inference efficiency on long-video benchmarks like VideoMME compared with the base models? |
Fully human-written |
|
ZoomV: Temporal Zoom-in for Efficient Long Video Understanding |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces ZoomV, an agentic-style approach to long video understanding that differs from most existing methods, which typically rely on multiple specialized models for different sub-tasks. Instead, ZoomV offers an end-to-end framework using a single model. The approach is inspired by neural cognition, mimicking how humans selectively “zoom in” on relevant visual content to enhance understanding and reasoning. Experimental results demonstrate significant gains on video QA benchmarks, including state-of-the-art performance on the challenging LVBench, as well as strong improvements in temporal grounding tasks.
* The motivation is clear and well defined: the paper addresses long video understanding, which is challenging due to the large number of frames and the loss of temporal context. It explains why existing solutions (uniform sampling, token sparsification, multi-model agents) are insufficient.
* It presents a novel human-inspired framework, modeled on how humans watch videos and on human self-reflection capabilities.
* End-to-end single-model design, which is a very important point for video agents.
* ZoomV achieves state-of-the-art results on the challenging LVBench and competitive results on other video QA benchmarks when compared to video agents based on models with many more parameters (like GPT).
* Results show that performance on short-video datasets is not sacrificed when using this method designed for long video understanding.
* The paper presents ablation studies on how its modules can be beneficial for long video understanding.
* The model is not training-free when compared to other video agents like VideoTree.
* I found Section 3.3 somewhat difficult to follow. It would be helpful to provide additional details or clearer explanations of the key steps and reasoning in this section to improve readability and allow the reader to better understand the contribution.
* Experiments on InternVL3 would be valuable to add, even though my decision does not depend on this.
* The baselines in Table 2 seem a bit weak. Would it be possible to include stronger baselines, even if they are not fully designed for temporal grounding?
* The paper lacks an analysis of runtime overhead for very long videos compared to models other than VideoTree. While the paper highlights the advantage of not downsampling, the scaling behavior and trade-offs could be benchmarked more explicitly.
* I noticed a possible inconsistency in the evaluation protocol: for multiple-choice questions you ask “which choice is correct,” whereas for open-ended questions you ask whether the “temporal window is correct.” Could you clarify why different criteria are used for these two settings? Why can't you use $I_{tf}$ for both?
* What is the number of parameters of the LLaVA-Video, Qwen2.5-VL, and InternVL2.5 models? I assume you are using the 72B versions, right? |
Fully human-written |
|
Faster Vision Transformers with Adaptive Patches |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes Adaptive Patch Transformers (APT), which accelerate Vision Transformers (ViTs) by replacing uniform patch splitting with content-aware, multi-granularity patching based on entropy calculation. The method uses a Resize + ZeroMLP mechanism to fuse features from different scales into a unified embedding space, significantly reducing the input token count. The key contributions include achieving a drastic throughput speedup (up to 50% on ViT-H) while maintaining accuracy across classification and dense prediction tasks, and ensuring fast, stable adaptation of fine-tuned models via the zero-initialized fusion layer.
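For context, my reading of the entropy-driven patching corresponds to the following minimal sketch (my own illustration, not the authors' code; the 8-bit intensity range, the patch-size range, and the threshold value are assumptions):

```python
import numpy as np

def patch_entropy(patch, bins=256):
    # Shannon entropy of the pixel-intensity histogram (8-bit grayscale assumed).
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def adaptive_patchify(img, max_patch=32, min_patch=8, thresh=4.0):
    # Quadtree-style subdivision: keep a large patch if its entropy is low,
    # otherwise split it into four quadrants, down to `min_patch`.
    patches = []
    def split(r, c, s):
        if s <= min_patch or patch_entropy(img[r:r + s, c:c + s]) < thresh:
            patches.append((r, c, s))
            return
        h = s // 2
        for dr in (0, h):
            for dc in (0, h):
                split(r + dr, c + dc, h)
    H, W = img.shape
    for r in range(0, H - max_patch + 1, max_patch):
        for c in range(0, W - max_patch + 1, max_patch):
            split(r, c, max_patch)
    return patches

img = (np.random.rand(224, 224) * 255).astype(np.uint8)
print(len(adaptive_patchify(img)), "adaptive patches vs", (224 // 8) ** 2, "uniform 8x8 patches")
```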
1. Strong Experimental Validation: The paper features a comprehensive set of ablation studies across various tasks, effectively demonstrating the efficacy of the proposed mechanisms.
2. Significant Efficiency and Generalization: APT delivers substantial throughput improvements (up to 50% on ViT-H) on large models, exhibits fast convergence (1 epoch fine-tuning), and shows robust generalization across classification and dense prediction benchmarks.
3. Clarity and Presentation: The paper is well-structured, and the figures are high-quality.
1. Missing Comparison with Similar Methods: The evaluation is incomplete because it fails to include a direct comparison against methods addressing the same task, such as MG-ViT [1] and PPT [2]. A detailed analysis of the methodological and empirical differences among APT, MG-ViT, and PPT would substantially strengthen the paper.
2. Hyperparameter Dependency: The performance is sensitive to the entropy threshold, which appears to be a manually tuned hyperparameter. This dependency might complicate achieving optimal efficiency across different downstream tasks, as the definition of "salient information" can vary significantly between tasks.
3. In object detection tasks, does the use of entropy to determine patch size risk ignoring subtle object boundaries? Entropy measures pixel intensity distribution diversity, which may not perfectly align with semantically critical edges, especially when compared to gradient-based measures.
4. For higher-resolution images, which naturally result in a larger number of base patches, could the authors explore further patch fusion/aggregation operations after the initial adaptive patching step, particularly when several adjacent low-entropy patches exhibit similar entropy scores?
5. When the patch size changes, should the entropy threshold change as well? That is, should each patch size correspond to its own entropy threshold?
[1]Zhang Y, Liu Y, Miao D, et al. MG-ViT: a multi-granularity method for compact and efficient vision transformers[J]. Advances in Neural Information Processing Systems, 2023, 36: 69328-69347.
[2]Wu, Xinjian, et al. "Ppt: Token pruning and pooling for efficient vision transformers." arXiv preprint arXiv:2310.01812 (2023).
Refer to the Weaknesses. If these concerns are well addressed, I will raise the rating to a positive one. |
Lightly AI-edited |
|
Faster Vision Transformers with Adaptive Patches |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces an adaptive patching (tokenization) method for ViTs. The main idea is to measure the entropy (how variable the pixel values are) of local regions of images and use finer (smaller) patches to tokenize the regions with higher variation. The patches of different sizes are resized (and split) to the same size for embedding. This reduces the number of tokens compared to a ViT using the same-sized patches for the whole image. The method is adapted to different ViTs and used in different vision tasks such as ImageNet classification, VQA, and object detection. It generally shows a better accuracy-efficiency trade-off than the original ViTs and some efficient ViTs.
- This paper is well written overall; the adaptive patching idea (larger patches for low-entropy regions, smaller for high-entropy regions) is intuitive and easy to follow. The entropy formulation and hierarchical quadtree patchification are clearly described, with alternatives noted for the appendix.
- The zero-initialized MLP lets the model incorporate high-res details without hurting initialization, enabling quick convergence from existing ViTs.
- The proposed method APT plugs into several ViT backbones and tasks, including classification, VQA, detection, and segmentation. It also works with window attention (e.g., EVA/ViTDet).
- APT reports 40–50% throughput gains on large models/resolutions while matching accuracy, and also some speedups on dense tasks, achieving a better accuracy-efficiency trade-off than several well-known efficient ViTs such as EViT and ToMe.
- The paper re-implements layer-level merging baselines with FlashAttention for a fairer comparison (and shows APT outperforms across compute budgets).
- Table 3 shows +22–26% throughput on LLaVA-1.5 (7B/13B), with some metrics slightly down (e.g., VQAv2 −0.6 for 13B) and others on par or up; overall the Pareto looks close but not strictly better across all benchmarks. Table 4 shows +14–30% throughput on detection with essentially unchanged mAP/AP50. This is positive, but the improvements are less decisive than in classification and would benefit from a Pareto plot analogous to Fig. 4 for these tasks.
- Currently APT uses entropy to measure the variation of pixels in image regions. It would be beneficial to add ablations for different measures, such as the standard deviation of the pixels or local frequency content (e.g., DCT-band energy); see the illustrative sketch after this list.
- The writing of the experimental setup could be clearer, specifically on the difference between Full Fine-Tuning and Short Fine-Tuning (Section 4.2). It is sometimes confusing what the pre-trained MAE refers to: is it a model trained only with masked autoencoding, or one trained with both masked autoencoding and classification?
- For dynamic input sizes (Section 3.3), you concatenate the tokens of a batch of images into a single sequence and use block attention. Why not pad each image's sequence to the same length?
- In the input-level merging baselines, what do you mean by saying Resizing represents a stronger version of Quadformer? It seems that Resizing is a variant of APT with the zero-initialized layer removed, and is not really related to Quadformer. A similar question applies to "Random". A clearer explanation is needed.
- For the dense prediction tasks (Section 4.3), you mention training only the newly added components. Does this mean training only the conv layers and the ZeroMLP, as in Fig. 3? For example, for LLaVA, the language model and the projection layer are frozen; is that correct? |
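Regarding the ablation on alternative measures mentioned above, a toy comparison that illustrates why the choice matters (the patch contents, bin count, and 4x4 low-frequency DCT block are my own illustrative choices, not values from the paper):

```python
import numpy as np
from scipy.fft import dctn

def entropy(patch, bins=64):
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def highfreq_energy(patch, keep=4):
    # Energy outside the lowest `keep` x `keep` block of DCT coefficients.
    c = dctn(patch, norm="ortho")
    return float((c ** 2).sum() - (c[:keep, :keep] ** 2).sum())

# Two toy 16x16 patches: a smooth intensity ramp vs. a sharp step edge.
ramp = np.tile(np.linspace(0.0, 1.0, 16), (16, 1))
step = np.zeros((16, 16)); step[:, 8:] = 1.0
for name, patch in [("ramp", ramp), ("step", step)]:
    print(f"{name}: entropy={entropy(patch):.2f}, std={patch.std():.2f}, "
          f"high-freq energy={highfreq_energy(patch):.2f}")
# The smooth ramp gets a much higher entropy than the step edge, even though the
# edge is the semantically "harder" region; std and DCT-band energy rank them differently.
```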
Fully human-written |
|
Faster Vision Transformers with Adaptive Patches |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes the Adaptive Patch Transformer (APT): it adaptively employs multiple patch sizes within the same image based on content, using larger patches for flat/redundant regions and smaller patches for detail-rich areas, thereby reducing input tokens and boosting throughput. Core implementation: multi-scale histogram entropy (Eq. (1)) serves as the compressibility metric, while hierarchical thresholds (ω) determine whether to "stop" at a given level or continue subdividing. For large patches, two information paths are used simultaneously: the embedding resized to the base dimensions and the sub-patch embeddings aggregated via a convolution, fused through a zero-initialized MLP. On the inference/training side, sequence packing with a block-diagonal mask accommodates variable sequence lengths, accelerated by FlashAttention. Experiments show that full ImageNet training and 1-epoch fine-tuning achieve 20%–90% throughput gains while maintaining accuracy, with accelerated performance also reported on LLaVA VQA, COCO detection, and ADE20K segmentation. The authors also report that, without token reduction, APT incurs non-zero overhead, and that the zero-initialized connections deliver the most stable "plug-and-play" fine-tuning convergence.
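For readers less familiar with sequence packing, the block-diagonal mask idea amounts to something like the sketch below (a generic re-implementation of the concept, not the authors' code; the token counts are made up):

```python
import torch

def block_diagonal_mask(seq_lens):
    """Boolean attention mask for a packed batch: token i may attend to token j
    only if both belong to the same image's token sequence.

    `seq_lens` are per-image token counts, which vary under adaptive patching.
    """
    total = sum(seq_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in seq_lens:
        mask[start:start + n, start:start + n] = True
        start += n
    return mask

# Three images whose adaptive patching produced different token counts.
print(block_diagonal_mask([3, 2, 4]).int())
```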
1. Comparison tables are provided across four task categories—classification, VQA, detection, and segmentation—clearly demonstrating how variable tokens are transformed into regular feature maps for dense prediction tasks.
2. Key methodological formulas and structural diagrams are presented, along with component ablation studies (zero initialization vs. non-zero/residual; system overhead without compression) to facilitate reproducibility and pinpoint sources of improvement.
1. The adaptive patch size relies on a hierarchical entropy threshold (Eq. 1, §3.1), which is fixed and manually tuned for each scale. The paper gives no data-driven method to set these thresholds, and poor choices can cause information loss or accuracy drops.
2. To use FlashAttention, some baselines were re-implemented with modifications like disabling weighted attention (§4.1). These changes may alter results, so runtime and throughput comparisons might not be fair without full implementation details or code release.
3. The reconstruction method (§3.3) repeats large-patch features to form dense grids, which may create block artifacts and hurt small-object accuracy. The paper reports only overall mAP/mIoU and shows no failure examples, leaving fine-detail loss unverified.
4. Table 6 shows that APT is slower than the baseline when no compression is applied, suggesting preprocessing overhead from entropy computation and token packing. The paper omits CPU/GPU timing breakdowns, and speed gains are smaller than FLOPs reductions, implying memory or pipeline bottlenecks.
5. The method mainly benefits high-resolution, large-model settings. Key parameters (thresholds ω, binning strategy, search range) are missing, and code is unreleased, making reproduction and fair comparison difficult.
1. How are the hierarchical thresholds (ωᵢ) determined for different datasets or tasks? Are they manually tuned or selected automatically?
2. How is the histogram entropy computed — what bin settings and value ranges are used? Have other texture or frequency-based criteria been compared?
3. What is the preprocessing overhead of multi-scale entropy computation and token packing on CPU and GPU? Are these times included in the reported runtime measurements?
4. Since token counts vary across samples, what is the distribution of sequence lengths, and how does this variation affect throughput and latency?
5. Can the object detection results be further analyzed by object size (small, medium, large) to better understand performance on fine-grained details?
6. Does the “repeat 2^{2i}” reconstruction step cause aliasing or checkerboard artifacts? Has any boundary or contour accuracy been evaluated to confirm visual quality?
7. How are positional embeddings handled for variable patch sizes and packed sequences? Is there any scale-aware interpolation or adjustment applied?
8. What is the parameter and memory overhead introduced by the zero-initialized MLP fusion layer, and how stable is it during longer fine-tuning?
9. For baselines adapted to support FlashAttention, can the authors provide a detailed list of implementation changes and verify that all models were tested under identical conditions?
10. Does the aggressive merging of smooth regions lead to errors when fine-grained or background details are required for prediction? Has any failure case analysis been performed?
**If you address my concerns, I will consider raising my score.** |
Fully AI-generated |
|
Faster Vision Transformers with Adaptive Patches |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The manuscript proposes Adaptive Patch Transformers (APT), which use multiple patch sizes within the same image processed by a Vision Transformer (ViT). Larger patch sizes are allocated to more homogeneous areas, while smaller patches are allocated to more complex ones. The proposed approach accelerates the ViT by about 40%.
Entropy is used as a measure of a patch's compressibility, with lower entropy indicating higher redundancy.
Patch aggregation is also employed, aggregating embeddings from sub-patches back to the base patch size.
Token merging is done at the input level: input-level merging reduces tokens directly from image patches before they enter the model. The method is also compared to layer-level token merging.
* A more efficient and adaptive transformer model is proposed, where adaptation means that complex regions are processed in more detail with smaller patches, while less complex regions are processed with larger patches.
* Extensive experimental results are provided.
* Lack of theoretical analysis
* Lack of computational analysis
How does the adaptive patch size scheme work with other transformer models, such as the Swin Transformer?
Z. Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV 2021
Could other methods be used to detect complex regions that warrant smaller patches? |
Fully human-written |
|
Vision-on-Demand: Efficient Visual Language Understanding with Intermittent Attention |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work proposes Vision-on-Demand (VoD), a method that accelerates inference for VLMs by sparsifying the image-text and image-image interactions without dropping tokens. It incorporates cross-attention layers to allow text tokens to gather information from the image tokens, and additionally enables full self-attention layers for information refinement. With evaluations on the LLaVA-OV model, the proposed method is shown to obtain better results than other acceleration methods.
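For reference, my understanding of the text-to-image cross-attention component is roughly the following sketch (a generic illustration; the dimensions, the tanh gate, and the zero initialization are my assumptions and may not match the paper's design):

```python
import torch
import torch.nn as nn

class TextToImageCrossAttention(nn.Module):
    """Minimal sketch: text tokens query visual tokens as read-only context."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init gate (assumption), keeps the base LLM intact at start

    def forward(self, text_tokens, visual_tokens):
        ctx, _ = self.attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        return text_tokens + torch.tanh(self.gate) * ctx  # visual tokens are left unchanged

text = torch.randn(2, 16, 512)     # (batch, N_t, dim)
vision = torch.randn(2, 729, 512)  # (batch, N_v, dim)
print(TextToImageCrossAttention(512)(text, vision).shape)
```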
- The idea of removing the inter-connection of image and text tokens is interesting for accelerating VLMs.
- The proposed method is shown to obtain good results on the target tasks.
- It seems that only LLaVA-OV is used for evaluation; other models (such as the Qwen-VL series) should also be included to show that the proposed method works well across different models.
- The proposed method changes the model architecture and needs to be trained. It remains unclear how well it can generalize to different scenarios, since the training usually requires a specific training set.
- I wonder how much of the FLOPs savings translates into actual latency improvement; an evaluation of this aspect is missing.
(Please refer to the weakness part.) |
Fully human-written |
|
Vision-on-Demand: Efficient Visual Language Understanding with Intermittent Attention |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper targets the efficiency of vision-language models, particularly addressing the high number of visual tokens. Its main approach is to minimize processing on visual tokens within the LLM decoder. First, it argues that for “easy” VLM tasks, text tokens do not require frequent attention to visual tokens. Second, using CKA analysis, it shows that visual tokens for “easy” VLM tasks do not change significantly across the LLM layers, leading to the conclusion that self-attention among visual tokens is only necessary for more challenging VLM tasks. Based on these observations, the paper proposes the Vision-On-Demand (VoD) method, which applies text-to-image and image-to-image attention only in a few layers of the LLM. Experiments in the paper demonstrate more than an 8× theoretical FLOPs reduction compared to the baseline, with comparable accuracy.
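For readers unfamiliar with the CKA analysis mentioned above, the linear variant (Kornblith et al., 2019) reduces to the following generic computation (not code from the paper):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two sets of token features.

    X, Y: (n_tokens, dim) activations from two layers; values near 1 indicate
    highly similar representations.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
tokens_layer_0 = rng.normal(size=(729, 64))
tokens_layer_k = tokens_layer_0 + 0.1 * rng.normal(size=(729, 64))  # nearly unchanged features
print(round(linear_cka(tokens_layer_0, tokens_layer_k), 3))          # close to 1.0
```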
- The research problem addressed by the paper (efficiency challenges caused by the high number of visual tokens) is crucial for practical applications of vision-language models.
- The observations presented in the paper are interesting:
- Text-to-image attention is sparse, and for some easier tasks, it is minimal, suggesting that such attention is not needed throughout the network.
- The original visual features produced by the vision encoder remain largely unchanged for some easier tasks, indicating that image-to-image attention is necessary only for certain tasks and in a few layers.
- The proposed method, Vision-On-Demand (VoD), is well-motivated and intuitive.
- The results show a significant FLOPs reduction while maintaining accuracy.
- For a consistent comparison, different baselines are reimplemented to eliminate other confounding effects.
- One major issue with the paper is its positioning relative to baselines. Throughout the paper, the number of visual tokens ($N_v$) and text tokens ($N_t$) is not mentioned. What input resolution is considered for the LLaVA-OV baseline [1]? Are you using a full $3\times3+1=10$ patch configuration, equivalent to $10\times729=7290$ visual tokens? This seems like an extreme baseline to report all FLOPs savings against. More recent works, such as [2] (not discussed in the paper), use the same Qwen LLM as this work but produce significantly fewer visual tokens (e.g., 256 instead of 7290). It remains unclear whether the proposed method benefits efficient vision-language models such as [2].
- The presentation in Sections 4.2.3 and 4.3 is confusing and incomplete. Many important details are moved to the appendix, while the main paper focuses on less critical information. For example, how pseudo-labeling is done is part of the proposed algorithm but is missing from the main text, whereas dataset mixtures and training hardware details are included. A major reorganization is needed to improve the presentation of the main algorithm. See detailed comments below:
- It would be clearer if a “configuration” were defined first. Perhaps as the arrangement of CA and SA layers in a given LLM architecture.
- Section 4.2.3, Step 1: What does “maximum number of cross-attention” mean? Does it refer to an optimal number for achieving good accuracy?
- Section 4.2.3, Step 1: It is mentioned that $L_{CA} = L_{SA} = L/3$ was found empirically. However, experiments supporting this observation are neither included nor referenced.
- Section 4.2.3, Step 2: This step is not clearly defined, and supporting experiments are not included or referenced.
- Section 4.2.3, Step 3: What does “randomly selecting” mean? Does it imply that for each optimization step a new random configuration is sampled? Please clarify.
- Section 4.2.3 step 3: What do “these viable configurations” refer to? Are these configurations identified in Stage 2? Please clarify.
- Section 4.3: It is unclear what exactly is logged. For a dataset with $N$ samples and a model with $L$ layers, what must be logged? What is the final pseudo-label? Does it represent a layer configuration per sample? How many inference runs per sample are required to determine this? Please clarify.
- Section 4.3: The paper mentions that directly learning the router is unstable. Are there experiments demonstrating this? Please reference and explain.
- Results in Table 3 (upper part) show insensitivity to the number of cross-attention layers, but it is also argued that accuracy saturates at 8 CA layers. Please clarify and include results for fewer CA layers, including 0.
- The paper briefly mentions actual runtime latency at the end of Section 6, yet this is a critical metric. Please include an accuracy vs. time-to-first-token plot for baselines and different VoD variants, including VoD-TR, across various compression ratios.
Minor issues:
- Lines 265–269 discuss the choice of positional encoding via convolution without sufficient detail. This paragraph seems out of place and unrelated to the main contribution. It could be moved to the appendix.
- Lines 478–479 contain typos: “VoD uses” and “VoD trains.”
References:
[1] Li, Bo, et al. “LLaVA-OneVision: Easy Visual Task Transfer.” arXiv preprint arXiv:2408.03326 (2024).
[2] Vasu, Pavan Kumar Anasosalu, et al. “FastVLM: Efficient Vision Encoding for Vision-Language Models.” Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
- Figure 1: For the proposed method (VoD), the actual FLOPs depend on each sample due to routing. Are you showing averaged FLOPs reduction rates? If so, why are they the same for both the Hard and Easy datasets?
- Are all results in the paper, except those in Table 4, based on the "Universal model" when referring to VoD and VoD-TR?
- What model is used for the analysis in Figure 2?
- Figure 2: How is the attention score calculated? Is it averaged over all heads per layer? Also, is it averaged over all samples in the corresponding dataset? If so, what is the variance? Does the mentioned “saw-tooth” pattern appear for all heads in the model?
- What does the distribution in Figure 4a represent? Please provide details of the random visual token dropout process. Additionally, annotating tasks as Hard/Easy in Figure 4a would clarify the basis of this categorization.
- What is the significance of the result shown in Figure 4b? It is not discussed in the text.
- In line 359, is the vision encoder (SigLIP) frozen during stages 1 and 2? |
Fully human-written |
|
Vision-on-Demand: Efficient Visual Language Understanding with Intermittent Attention |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a new method, Vision-on-Demand (VoD), for building efficient vision-language models. Starting from the observation that different tasks require very different amounts of compute and interaction for visual tokens, the authors propose a new VLM architecture consisting of regular LLM layers, cross-attention layers, and self-attention layers (the vision encoder and projector remain the same). The cross-attention layers allow text tokens to attend to visual tokens, and the self-attention layers allow bidirectional interaction between text and visual tokens for better feature updates. Additionally, the model is trained with an internal routing mechanism in which a special token is used to predict the optimal inference configuration. The model is trained in a similar way to LLaVA-OV, and architecture configurations are dynamically sampled during training so that the model can handle different inference conditions.
The method is grounded in a very practical but often overlooked observation: there is no universal visual token compression method, hence an adaptive inference strategy should deliver optimal performance. The proposed VoD method addresses this by allowing the model to select the optimal inference path for every example.
Compared to the LLaVA-OV baseline and other visual token pruning methods, VoD achieves clear performance gains on both easy and hard tasks with much better FLOPs savings, showing its effectiveness. Moreover, VoD is orthogonal to visual token pruning methods and can be combined with them to further save compute.
My main concern is the generality of VoD. The authors mention that VoD has self-attention layers that allow image-to-image, text-to-image, image-to-text, and text-to-text interactions, meaning this is bidirectional attention without a causal mask. As a consequence, for every new token being generated, all tokens' self-attention needs to be recomputed. When the generation becomes long, e.g., in long CoT settings, this would incur a very large computational overhead. Thus I am not sure this method can actually save computation in general (see the rough counting sketch below).
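The rough counting sketch referenced above (illustrative query-key pair counts only, with hypothetical prompt sizes; the actual overhead depends on VoD's exact layer placement and kernels):

```python
# Query-key interaction counts (proportional to attention FLOPs) when generating
# T new tokens on top of a prompt of P tokens.
def causal_with_kv_cache(P, T):
    # Each new token attends once to all previous tokens; past states are cached.
    return sum(P + t for t in range(1, T + 1))

def bidirectional_full_recompute(P, T):
    # If a layer is bidirectional, earlier tokens' states depend on later tokens,
    # so each step must re-run attention over the whole sequence.
    return sum((P + t) ** 2 for t in range(1, T + 1))

P = 7290 + 64  # hypothetical: visual tokens + text prompt tokens
for T in (128, 1024):
    ratio = bidirectional_full_recompute(P, T) / causal_with_kv_cache(P, T)
    print(f"T={T}: ~{ratio:.0f}x more query-key pairs than causal decoding with a KV cache")
```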
The authors mention that both cross-attention and self-attention layers are inserted into the LLM. Per my understanding, this adds layers/parameters to the model. In other words, VoD actually has more parameters and greater layer depth than LLaVA-OV, even though vision tokens do not pass through the original LLM layers. This gives VoD an unfair advantage (larger modeling capacity) over the baseline methods, so the results and gains on those benchmarks should be taken with a grain of salt.
Can the authors clearly state how many more parameters/layers VoD has, and how much more compute it requires when generating sequences of different lengths (e.g., 1k, 2k, 4k, 8k)? |
Fully human-written |
|
Vision-on-Demand: Efficient Visual Language Understanding with Intermittent Attention |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes Vision-on-Demand (VoD), a new paradigm that does not reduce visual tokens but instead allocates computation on demand in large vision-language models (LVLMs) through an "intermittent attention" mechanism: a small number of lightweight cross-attention layers provide static visual context for the text, self-attention layers are inserted only at key layers to refine high-resolution visual tokens, and a unified network is trained to support different compute budgets. A lightweight policy network is then introduced to dynamically select the number of self-attention layers based on sample complexity, achieving sample-level adaptive inference. VoD is orthogonal to existing token compression methods and can be combined with them for further acceleration. Experiments show that VoD achieves SOTA accuracy on multiple vision-language benchmarks while reducing inference FLOPs by 8.6 to 18×, with especially clear advantages on difficult tasks requiring high-resolution, fine-grained visual understanding.
- The paper points out that existing "visual token compression" methods suffer a sharp performance decline on fine-grained tasks due to information bottlenecks. For the first time, it considers the efficiency issue from the new perspective of retaining all high-resolution visual information and only sparsifying the compute layers, providing a direction orthogonal to the mainstream for the efficient use of LVLMs.
- The method decouples the compute layers into low-cost cross-attention (read-only visual context) plus a small number of self-attention layers (on-demand refinement of visual tokens), and trains a unified network together with a lightweight routing network to achieve sample-level dynamic layer selection. This not only significantly reduces FLOPs without losing information, but also composes seamlessly with existing token compression, balancing accuracy and efficiency.
- The method requires repeatedly inserting and skipping self-attention and cross-attention layers within the LLM, as well as training additional routing networks to determine the number of layers per sample. This leads to a complex implementation and high engineering/deployment difficulty, and requires careful tuning of both the training and inference processes, significantly increasing system implementation and maintenance costs.
- Could the proposed method be adapted to inference frameworks such as vLLM or SGLang? How would it be implemented on top of FlashAttention?
- The models used in the experiments are too small. It would be better to also evaluate larger models with over 7B parameters.
See above. |
Fully human-written |
|
Graph-Driven Uncertainty Quantification in Text-to-Image Diffusion Models |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents a novel framework for uncertainty quantification (UQ) in text-to-image diffusion models.
The core proposal is to construct a graph where nodes represent different diffusion models and weighted edges represent their architectural and output similarity.
The paper introduces three strategies to propagate uncertainty across this graph: an iterative message-passing scheme (IPU), a spectral method leveraging the graph Laplacian for smoothness (SGUP), and a method that aggregates influence along all graph paths (PUI).
The authors provide extensive theoretical analysis covering convergence and complexity and report state-of-the-art performance on seven datasets, showing improvements over standard UQ baselines in both calibration and image quality metrics.
- Novel Conceptual Framework: The central idea of representing relationships between generative models as a graph to propagate and refine uncertainty is highly original and provides a new perspective on UQ beyond single-model or simple ensemble methods.
- Comprehensive Multi-Strategy Approach: The paper introduces three distinct and well-motivated propagation strategies (IPU, SGUP, PUI). This provides a flexible toolkit that can be adapted to different needs, such as iterative refinement, global smoothness, or detailed path-specific analysis.
- Strong Empirical Results: The experimental validation in Table 1 shows significant and consistent improvements over established baselines (Ensemble Sampling, MC Dropout, Bayesian Diffusion) across a wide range of UQ and image quality metrics (PICP, UCE, FID, LPIPS).
- Thorough Theoretical Analysis: The authors provide eight theorems with proofs covering the convergence, complexity, and smoothness properties of their methods. This attempt to ground the framework in solid theory is a commendable strength.
- Fundamental Mismatch between Methodology and Implementation:
The paper's core premise is critically undermined by a contradiction between the described methodology and the actual implementation.
Section 3.2 explicitly defines nodes as distinct diffusion models (e.g., DDPM, LDM).
However, the Appendix (A.3, lines 745-747) states that the experiments construct graphs over 'diffusion timesteps using cosine similarity of intermediate latent embeddings' for each sample.
This suggests the experiments were run on internal states of a single model, not a graph of different models. This is a severe flaw that makes the paper's central narrative misleading.
- Misalignment with Current Research Directions:
The paper frames the UQ problem as propagating uncertainty between different model architectures.
However, the key challenge in text-to-image UQ is widely considered to be semantic uncertainty arising from prompt ambiguity.
The work fails to compare against or even cite recent, highly relevant baselines like PUNC (Franchi et al., 2024), which directly addresses this problem and is a critical point of comparison.
- Technically Flawed Graph Construction:
The edge weighting formula in Equation (5) is mathematically problematic and poorly justified.
The formula simplifies to $w_{ij} = \min(S_{\mathrm{arch}}, S_{\mathrm{out}})$, which is an arbitrary way to combine two similarity scores.
Furthermore, the theory presented in Section 4 (e.g., Theorem 1) relies on the weight matrix being row-stochastic, but the proposed edge weighting scheme does not guarantee this property, creating a disconnect between the theoretical claims and the method itself.
- Questionable Practical Utility:
The stated premise of requiring a graph of multiple, distinct diffusion models to perform UQ for a single generation is computationally prohibitive and practically infeasible for most real-world scenarios.
This raises serious questions about the utility of the proposed framework as described.
- Opaque Custom Metrics:
The paper's novel evaluation metrics, SGU-Score (Eq. 38) and PSUI (Eq. 39), are not clearly defined.
The definitions rely on unexplained terms like the operator $F_G$ and the variables $\sigma_u$ and $\sigma_v$, making it impossible to fully understand or trust the results reported using these metrics.
---
- General Limitations:
The paper's limitations section is well-considered.
However, it does not address how the graph construction process itself—the choice of models (or timesteps) to include and the specific similarity metrics used—could introduce its own biases, potentially leading to uncertainty estimates that are systematically skewed.
- Could you please clarify the critical discrepancy between your methodology and implementation?
Was the graph constructed over different diffusion models as described in Section 3.2, or over the internal timesteps of a single model as suggested in Appendix A.3?
The validity of the paper's contribution hinges on this clarification.
- Can you provide a formal justification for the edge weighting formula in Equation (5), which resolves to $w_{ij} = \min(S_{\mathrm{arch}}, S_{\mathrm{out}})$?
Why is this a meaningful way to combine architectural and output similarity, compared to other standard methods like a weighted average?
- How do you ensure that the weight matrix W meets the row-stochastic requirement for the convergence proof in Theorem 1?
The edge weighting scheme in Equation (5) does not inherently satisfy this property (a standard row normalization is sketched after these questions).
- Why was the recent and highly relevant baseline PUNC (Franchi et al., 2024), which was one of the first works to systematically address prompt-based uncertainty in T2I models, omitted from your experimental comparison?
- In the definitions for SGU-Score and PSUI (Eqs. 38-39), could you please formally define the Graph Fourier Transform operator $F_G$ and the variables $\sigma_u$ and $\sigma_v$ used in the PSUI formula? |
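Regarding the row-stochasticity question above, one standard normalization (my suggestion, not something stated in the paper) would be

$$\tilde{w}_{ij} = \frac{w_{ij}}{\sum_{k} w_{ik}}, \qquad \text{i.e., } \tilde{W} = D_w^{-1} W \text{ with } D_w = \operatorname{diag}\Big(\textstyle\sum_{k} w_{ik}\Big),$$

which guarantees $\sum_j \tilde{w}_{ij} = 1$ for every node; the authors should state whether Theorem 1 is proved for $W$ or for such a normalized matrix.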
Fully AI-generated |
|
Graph-Driven Uncertainty Quantification in Text-to-Image Diffusion Models |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper models a graph over diffusion models or components to propagate uncertainty and introduces three strategies for this: Intrinsic and Propagated Uncertainty coupling, Spectral Graph Uncertainty Propagation, and Path-Specific Uncertainty Influence. It provides convergence statements and reports improvements on several datasets.
The paper explores an area that is not theoretically well explored: uncertainty quantification in text-to-image diffusion models.
It uses concepts from graph theory to model the behavior of diffusion models.
1. The paper is at times ambiguous. How is the graph constructed for diffusion models? Is it built over a single model, or over latents during denoising? Both are mentioned in the text (L.103, L.758 say v_i is a diffusion model; L.747 says diffusion timesteps).
2. The related work section discusses UQ in broad diffusion contexts but omits prior work specific to T2I UQ (e.g., Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation, Franchi et al.).
3. The definition of architectural similarity (L.126) seems over-simplified. Different architectural components generally do not have an equal effect.
4. L.138 is hard to justify as written. The term implicitly assumes a correspondence between the i-th image generated by each model. Also, L.133 mentions using the images themselves. In that case, pixel $L_2$ distance is a poor proxy for semantic uncertainty.
5. L.148 appears problematic. The architectural term is at most 1, while FID can be much larger depending on the feature extractor. So, the expression in L.148 collapses to the architectural difference.
6. L.273 seems inconsistent. The $U_{\text{prop}}$ term depends on the $U_{\text{total}}$s (not on the $U_{\text{prop}}$s) via the weights. A form like $U_{\text{total}} = (1-\gamma)U_{\text{intrinsic}} + \gamma W U_{\text{total}}$ would be more consistent with the stated propagation (see the note after this list).
7. In L.347, $D_w^{-1} W$ is not the normalized graph Laplacian. The normalized Laplacian is $I - D_w^{-1/2} W D_w^{-1/2}$.
8. The uncertainty maps in Fig. 2 are not clearly defined. They do not appear to be uncertainty in image space. Are they derived in a latent space? The surrounding discussion does not clearly connect to the maps.
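As a follow-up to point 6 (referenced there): if the propagation were written in the self-consistent form suggested above, it would admit an explicit solution whenever the spectral radius of $\gamma W$ is below one, which would also make the convergence claims transparent:

$$U_{\text{total}} = (1-\gamma)\,U_{\text{intrinsic}} + \gamma W U_{\text{total}} \;\Longrightarrow\; U_{\text{total}} = (1-\gamma)(I - \gamma W)^{-1} U_{\text{intrinsic}} = (1-\gamma)\sum_{k \ge 0} (\gamma W)^{k} U_{\text{intrinsic}}.$$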
1. In L.181, what are the output categories when the model is a T2I generator?
2. Why does solving Equation 24 lead to Equation 25? Where does $b$ come from?
3. The method targets uncertainty quantification. Could you explain what the reported quality metrics are suggesting in this context? Why is LPIPS treated as higher-is-better? Also, sharpness seems flipped: a larger gradient usually implies sharper, not less sharp.
4. In L.682, how were unambiguous, ambiguous, and OOD prompts defined and separated?
5. L.751 mentions a user study. Where are its design and results reported? |
Fully human-written |
|
Graph-Driven Uncertainty Quantification in Text-to-Image Diffusion Models |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors introduce a novel graph-based framework to quantify uncertainty in text-to-image generation models. They model each diffusion model as a node in a graph and use strategies like Intrinsic-and-Propagated Uncertainty Coupling, Spectral Graph Uncertainty Propagation, and Path-Specific Uncertainty Influence to measure uncertainty. The method captures both local and global uncertainties.
1. The paper addresses an important and timely topic—uncertainty modeling in large generative models.
2. The paper proposes a novel approach with potentially valuable contributions to uncertainty estimation in diffusion-based architectures.
1. It is unclear how the metrics in Table 1 are computed. Are the authors training new models or evaluating pre-trained ones? Please clarify the experimental setup.
2. It is not clear how uncertainty contributes to improved model performance. Is uncertainty incorporated into the loss function during training or into the inference process? Typically, uncertainty in diffusion models is used for OOD detection or hallucination identification, so additional explanation is needed.
3. The paper introduces graphs of diffusion models but does not provide concrete examples. Each node is described as a diffusion model and edges represent similarity between models. Do such graphs exist in practice, or are they purely conceptual for this work?
4. Missing citations: Prior works [1, 2] already estimate uncertainty in diffusion models and should be cited for completeness.
5. Formatting issue: Several equations appear inline or too close to surrounding text (see Eqs. 11–16). These should be properly separated for readability.
6. The Results section lacks sufficient methodological detail. The paper discusses outcomes but does not explain how the results were obtained or what experimental setup produced them.
[1] Berry, Lucas, Axel Brando, and David Meger. "Shedding light on large generative networks: Estimating epistemic uncertainty in diffusion models." The 40th Conference on Uncertainty in Artificial Intelligence. 2024.
[2] Berry, Lucas, et al. "Seeing the Unseen: How EMoE Unveils Bias in Text-to-Image Diffusion Models." arXiv preprint arXiv:2505.13273 (2025).
Please see weaknesses. |
Lightly AI-edited |
|
3D Medical Image Segmentation with Anatomy-Guided Conditioning from Surrounding Structures |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a method called Anatomy-Guided Conditioning (AGC), which integrates anatomical context from surrounding structures into segmentation networks. By leveraging signed distance maps of neighboring anatomy, AGC aims to enhance segmentation accuracy in 3D medical images. The method is applied to the decoder part of an encoder-decoder architecture, allowing for effective integration of anatomical priors without requiring significant modifications to the original network. The authors demonstrate AGC’s effectiveness across various datasets and different segmentation backbones.
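For concreteness, the kind of signed distance map AGC appears to condition on can be computed as in the sketch below (a generic construction for illustration; the paper's exact sign convention, normalization, or clipping may differ):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(binary_mask, spacing=(1.0, 1.0, 1.0)):
    """Signed Euclidean distance map of a 3D binary structure.

    Positive outside the structure, negative inside, approximately zero at the boundary.
    """
    outside = distance_transform_edt(~binary_mask, sampling=spacing)
    inside = distance_transform_edt(binary_mask, sampling=spacing)
    return outside - inside

mask = np.zeros((32, 32, 32), dtype=bool)
mask[10:20, 10:20, 10:20] = True  # toy "organ"
sdm = signed_distance_map(mask)
print(sdm.min(), sdm.max())  # negative inside, positive away from the structure
```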
1. The paper introduces a method for incorporating anatomical priors using signed distance maps, offering a clear advantage over traditional binary masks or coordinate embeddings.
2. AGC is shown to work effectively with a variety of encoder-decoder architectures without requiring modifications to the backbone, making it an easy-to-implement and scalable solution for many segmentation tasks.
3. The evaluation covers a wide range of datasets and diverse segmentation models, providing robust evidence of AGC's generalizability.
Lack of Clear Analysis of Computational Overhead.
The paper does not provide a detailed or clear analysis of the computational overhead introduced by AGC. While the proposed method is promising in terms of performance, it is important to understand the trade-off between its benefits and its impact on training and inference efficiency. Including this analysis would help assess the practical feasibility of the method in applications.
1. For practical deployment, it is crucial to assess how AGC affects aspects like training speed, inference time, and memory consumption. The authors should include a more explicit discussion on these aspects and provide quantitative results in the form of tables comparing the computational cost (e.g., training time, inference time, memory usage) with and without AGC, as this would provide a clearer picture of the additional overhead introduced by the method.
2. Could the authors provide a more detailed breakdown of the impact of using signed distance maps versus binary masks or other prior-conditioning methods? Does AGC offer a significant advantage over these alternatives? I can see the rationale for using signed distance maps to represent anatomical structures, but I hope the authors can provide a clearer explanation. |
Fully AI-generated |
|
3D Medical Image Segmentation with Anatomy-Guided Conditioning from Surrounding Structures |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper focuses on a 3D medical image segmentation pipeline for thin, low-contrast, complex structures. The authors propose an anatomy-guided conditioning module, an anatomically driven inductive bias for segmentation networks. The anatomical priors are extracted from the TotalSegmentator dataset and encoded as signed distance maps (SDMs), which are then fed into a feature modulation block in the decoder for segmentation prediction.
- The paper utilized a pre-defined comprehensive dataset for extracting the anatomical priors and used it for further processing.
- The method is evaluated in many complex structure datasets, such as arteries, visceral adipose tissue, head and neck organs, etc.
- Anatomically guided approaches are widely adopted in medical image segmentation, including hierarchical and topological methods. The work should clarify why the proposed anatomy-guided conditioning is superior, and more ablation studies may be needed to explain its effectiveness.
- Typically, a method designed around structure-aware metrics does not perform as well on general anatomies. How does the method perform directly on the TotalSegmentator dataset or other whole-body CT datasets?
- The TotalSegmentator dataset is mainly generated by nnU-Net rather than purely from human annotation. I understand this work's assumption is to use priors from TotalSegmentator, but this may create a circular-validation issue with respect to the gold standard, since the learned priors also come from AI models.
- The baseline UNETR should also be considered a hybrid model, like Swin UNETR and nnFormer; this is a minor issue, but it would be good to clarify.
Questions: see the Weaknesses section. |
Fully human-written |
|
3D Medical Image Segmentation with Anatomy-Guided Conditioning from Surrounding Structures |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes AGC for 3D medical image segmentation, which integrates signed distance maps of surrounding anatomy to introduce anatomical priors that improve accuracy, boundary delineation, and topology without additional training. However, both the practical utility of AGC and the completeness of the experiments seem limited.
AGC introduces anatomical priors via signed distance maps, improving segmentation accuracy and topology-aware metrics on some existing models.
1. The equations lack sequence numbers.
2. There is no baseline evaluation against the latest approaches, such as the CNN models nnU-Net [1] and nnWNet [2] and the SAM variants MedSAM2 [3] and H-SAM [4].
3. AGC relies on the full TotalSegmentator model to inject anatomical priors, which introduces additional computational overhead and slows inference. Stacking multiple models severely limits practical applicability; reporting parameters, FLOPs, and inference time is essential.
4. The datasets used, while suitable for assessing topological improvements, are small-scale and omit common 3D medical segmentation benchmarks such as BraTS2024 [5] and FLARE2022 [6].
Reference:
1. Isensee F, Jaeger P F, Kohl S A A, et al. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation[J]. Nature methods, 2021, 18(2): 203-211.
2. Zhou Y, Li L, Lu L, et al. nnWNet: Rethinking the Use of Transformers in Biomedical Image Segmentation and Calling for a Unified Evaluation Benchmark[C]//Proceedings of the Computer Vision and Pattern Recognition Conference. 2025: 20852-20862.
3. Ma J, Yang Z, Kim S, et al. Medsam2: Segment anything in 3d medical images and videos[J]. arXiv preprint arXiv:2504.03600, 2025.
4. Cheng Z, Wei Q, Zhu H, et al. Unleashing the potential of sam for medical adaptation via hierarchical decoding[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024: 3511-3522.
5. de Verdier M C, Saluja R, Gagnon L, et al. The 2024 brain tumor segmentation (brats) challenge: Glioma segmentation on post-treatment mri[J]. arXiv preprint arXiv:2405.18368, 2024.
6. Ma J, Zhang Y, Gu S, et al. Unleashing the strengths of unlabelled data in deep learning-assisted pan-cancer abdominal organ quantification: the FLARE22 challenge[J]. The Lancet Digital Health, 2024, 6(11): e815-e826.
1. What does $q$ denote in the equation (lines 173–176)?
2. Why are anatomical priors transformed into scale and shift parameters instead of using concatenation or other feature fusion strategies? |
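To make question 2 concrete, scale-and-shift (FiLM-style) conditioning differs from concatenation roughly as in this sketch (a generic illustration with hypothetical channel counts, not AGC's actual block):

```python
import torch
import torch.nn as nn

class FiLMConditioning(nn.Module):
    """Feature-wise scale-and-shift from a prior map, as opposed to channel concatenation."""
    def __init__(self, prior_ch, feat_ch):
        super().__init__()
        self.to_gamma_beta = nn.Conv3d(prior_ch, 2 * feat_ch, kernel_size=1)

    def forward(self, feats, prior):          # feats: (B, C, D, H, W), prior: (B, P, D, H, W)
        gamma, beta = self.to_gamma_beta(prior).chunk(2, dim=1)
        return feats * (1 + gamma) + beta     # concatenation would instead grow C and change the backbone

feats = torch.randn(1, 16, 8, 8, 8)
prior = torch.randn(1, 1, 8, 8, 8)            # e.g., a signed distance map channel
print(FiLMConditioning(1, 16)(feats, prior).shape)
```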
Lightly AI-edited |
|
How Learning Dynamics Drive Adversarially Robust Generalization? |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper investigates robust generalization within the PAC-Bayesian framework. By assuming a Gaussian posterior over the model parameters and a quadratic approximation of the loss, the authors derive an upper bound on the robust generalization gap. This bound connects the generalization behavior to the Hessian of the empirical adversarial loss at a local optimum, as well as to the mean and covariance of the posterior. The paper further analyzes how the parameters of this Gaussian posterior evolve during SGD with Polyak momentum, thereby linking training dynamics to robust generalization.
The paper approaches robust generalization from an appealing dynamic perspective—focusing on how the optimization trajectory and posterior evolution influence generalization—rather than adopting a static hypothesis-space view based on Rademacher complexity or other capacity measures.
The connection between posterior evolution and curvature-based generalization offers an interesting conceptual direction, potentially bridging PAC-Bayesian analysis and training dynamics.
1. Unclear contribution and novelty. The paper does not clearly describe its theoretical novelty compared to prior PAC-Bayesian analyses of adversarial robustness (e.g., [Viallard et al., 2021](https://proceedings.neurips.cc/paper/2021/file/78e8dffe65a2898eef68a33b8db35b78-Paper.pdf); [Mustafa et al., 2019](https://ml.cs.rptu.de/publications/2023/computing_non_vacuous_pac_bayes.pdf); Xiao et al., 2023). It is unclear how the presented bounds improve on these previous results.
2. Concerns regarding the quadratic loss assumption. Assumption 3.5 enforces $\hat{R}_{\rm adv}(w, S)$ to be a quadratic function for any choice of $S$. Taking $S = \{(x,y)\}$, this assumption implies that the adversarial loss itself is a quadratic function of $w$ for any $(x,y)$:
$\hat{R}_{\rm adv}(w, S) = \ell_{\rm adv}(w; x, y) = \ell_{\rm adv}(w^*; x, y) + \frac{1}{2}(w - w^*)^{\top} H (w - w^*)$
This severely departs from practical settings where $\ell_{\mathrm{adv}}$ involves deep neural networks and non-quadratic losses such as cross-entropy.
3. Limited explanatory power for robust generalization. The paper claims to shed light on robust overfitting (Rice et al., 2020), yet the derived bounds do not appear to explain when robust overfitting occurs or disappears. For instance, adversarial training achieves good robust generalization on MNIST (Madry et al., 2019) and for small perturbation radii or large datasets, whereas robust overfitting is prominent only under certain regimes. The presented bounds (Theorems 3.7, 4.5, and 4.7) do not capture how data distribution, perturbation radius, or sample size affect the generalization behavior. Moreover, when the perturbation radius approaches zero, the framework fails to recover standard generalization phenomena (e.g., CIFAR-10 models generalizing well under clean training but not under adversarial training).
4. Writing and presentation issues. Several statements are vague or lack sufficient justification. Ad hoc terms are introduced (e.g., “propagation term,” “injected term”) without formal definition or motivation. Key assumptions (e.g., stationarity, steady state) are not clearly stated or connected to the analysis. See “Questions” below for specifics.
5. Missing key references. The paper omits several relevant studies on robust generalization and algorithmic stability, including:
(1) Yue Xing et al. On the algorithmic stability of adversarial training.
(2) Jiancong Xiao et al. Stability analysis and generalization bounds of adversarial training.
(3) Runzhi Tian et al. Algorithmic Stability Based Generalization Bounds for Adversarial Training.
(4) Daniel Cullina et al. Pac-learning in the presence of evasion adversaries.
(5) Shaopeng Fu et al. Theoretical analysis of robust overfitting for wide dnns: An ntk approach
(6) Viallard et al., A PAC-Bayes analysis of adversarial robustness
(7) Mustafa et al., Non-vacuous PAC-Bayes bounds for models under adversarial corruptions.
Proper discussion of these works would better situate the contribution and clarify the incremental advance.
Minor Issue:
Lemma 3.4 restates the closed-form expression for the KL divergence between Gaussians, which is standard and could be omitted or moved to an appendix.
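For reference, the standard closed form in question, for $m$-dimensional Gaussians $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_0, \Sigma_0)$, is
$$\mathrm{KL}\big(\mathcal{N}(\mu_1, \Sigma_1)\,\|\,\mathcal{N}(\mu_0, \Sigma_0)\big) = \frac{1}{2}\Big[\operatorname{tr}(\Sigma_0^{-1}\Sigma_1) + (\mu_0 - \mu_1)^{\top}\Sigma_0^{-1}(\mu_0 - \mu_1) - m + \ln\frac{\det \Sigma_0}{\det \Sigma_1}\Big],$$
which is textbook material and does not require a separate lemma in the main text.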
>However, these bounds often abstract away from the actual optimization trajectory and adopt simple isotropic Gaussian posteriors for tractability, overlooking structural properties of the learned model that are crucial for explaining generalization.
Could the authors elaborate on how existing PAC-Bayesian bounds "_abstract away_" from the actual optimization trajectory?
In addition, please clarify what specific limitations prior works face in explaining robust generalization. Is the key issue primarily the use of isotropic Gaussian posteriors, or are there other underlying factors? If the isotropic assumption is central, please explain why it limits explanatory power.
>PAC-Bayes bounds offer general guarantees but lack fidelity to the learning dynamics, whereas curvature-based approaches provide qualitative insight without rigorous predictive guarantees.
What specific _guarantees_ are being referred to ? What does “lack fidelity” mean in this context? Please make these terms precise and support the claim with references.
>we use d, m to denote the dimensions of the input space ${\cal S}$.
Typo: the input space should be denoted by ${\cal X}$ not ${\cal S}$.
In Assumption 3.5, the statement "for any $w \sim {\cal Q}$" — since ${\cal Q}$ is Gaussian — appears equivalent to "for any $w \in {\mathbb R}^m$"? If so, the assumption is independent of ${\cal Q}$; please clarify.
The remark for Lemma 3.6 (Line 188-190) merely restates the formula. Could the authors elaborate on the insight or interpretation of this result?
Regarding the local optimum $w^*$ in Assumption 3.5, since $w^*$ depends on ${\cal S}$, it should arguably be treated as a random variable, rather than a constant as stated at Line 202?
At Line 220, please cite relevant references for SGD with Polyak momentum and clarify whether the theoretical analysis extends to standard SGD.
In Lemma 4.2,
>suppose the posterior Q reaches a steady state with stationary mean
Could you justify this assumption and define what “steady” or “stationary state” precisely means?
Does ${\cal Q}$ denote the marginal distribution of $w _ T$ after T SGD steps, or the conditional distribution of $w_t$ given $w_{t-1}$? Clarifying this would help interpret Theorem 4.5.
For remark 4.8, please define and justify the introduced terms “propagation term” and “injected term.” |
Fully human-written |
|
How Learning Dynamics Drive Adversarially Robust Generalization? |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper studies PAC Bayes guarantees for adversarial loss. The authors start with a general upper bound in terms of quantities related to the posterior, then make it more specific to Gaussian posteriors. Subsequently, they study what that posterior looks like in two regimes of SGD with momentum on the adversarial loss (I think -- see below). They verify the functionality of their bounds experimentally.
The paper is written extremely clearly, and the development is very logical. I genuinely enjoyed reading the paper. The results are nice and insightful. I am not close enough to the literature to evaluate how different they are from existing results, but they are interesting and the approach is well-motivated.
I see how the proposed Gaussian model is in fact less restrictive than models in previous work, but I am curious if the authors can comment on its limitations. I also have a few additional questions that are in the section below.
1. What is the quantifier over epsilon in Defn 3.1? Shouldn't this be the eps-adv risk or something?
2. What role does the geometry induced by the metric in which the perturbation is bounded play in the generalization bounds?
3. Is the analysis for SGD on the standard (non-adv) loss, followed by evaluation of that predictor on the adv loss? or SGD run on the adv loss? I think the latter but want to confirm. |
Fully human-written |
|
How Learning Dynamics Drive Adversarially Robust Generalization? |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper studies robust generalization and uses the PAC-Bayesian framework to derive a generalization bound for adversarial training. Specifically, the authors use a second-order approximation and consider SGD with momentum to investigate how the learning rate, Hessian structure, and gradient noise affect the mean and covariance of the posterior distribution, which is consistent with and validated by numerical experiments.
++ The paper is generally well-written and the overall framework is not difficult to follow.
++ Under the second-order approximation, it is novel to derive the closed-form posterior covariance under both the constant learning rate regime and the learning rate decay regime.
++ The analysis is relatively comprehensive: it considers many factors that may affect the generalization error, including the momentum mechanism, the gradient noise, the Hessian structure, and the learning rate. The theoretical analysis is qualitatively consistent with the numerical experiments.
1. The major concern is the restrictive assumptions:
* Assumption 3.3: I do not think the posterior distribution after adversarial training for general deep neural networks is a Gaussian distribution. Perhaps the authors could instead assume that the posterior distribution is a mixture of several super-Gaussian distributions, as the probability density will generally concentrate during training, and different initialisations will lead to converged parameters near different local minima.
* Assumption 3.5 is only applicable when $w$ is close to $w^*$. This is somewhat in contradiction with Assumption 3.3. When we use a Gaussian distribution as the random initialization, the parameter distribution in the early steps is close to a Gaussian distribution, but $w$ is far away from $w^*$. On the other hand, in the late phase of training when $w$ is close to $w^*$, the distribution of the parameters is not Gaussian. Perhaps the authors could add further assumptions, such as those used in lazy training, which would mean the parameters do not move much during training. However, this may introduce additional conditions.
2. There is a considerable gap between the theory and practice. The theoretical analysis does not utilise any unique properties of adversarial training. For example, in practice, we see a larger robust generalization gap when the magnitude of the adversarial perturbation is larger (i.e., larger $\epsilon = \|\delta\|_p$), yet I do not see how this variable affects the generalization gap bound (the standard formulation is recalled below for reference). In an extreme case, when the adversarial perturbation's magnitude is zero, will the bound in Theorem 4.7 reduce to the analysis of normal training? What makes the results special for adversarial training? I think this part needs further elaboration.
3. The experiments are not comprehensive. The authors compare the performance of adversarial training and AWP, which covers the factors of Hessian structure (AWP prefers a flatter minimum) and learning rate (both use learning rate decay). However, the effect of gradient noise (which is mentioned in the abstract and the introduction) is not adequately discussed or studied. In addition, it would be better to compare the empirical generalization gap with the theoretical one given by Theorem 4.5. The results in Table 1 and Table 2 are not convincing enough.
Minor issues:
1. Based on the analyses in the appendix, $\rho_i$ in Equation (14) actually depends on $\eta_2$; the authors should clearly indicate this in the main text to avoid confusion, because the right-hand side should be independent of $k$ when $\eta_1 = \eta_2$.
2. Some missing related literature:
* The convergence of adversarial training: "On the Convergence and Robustness of Adversarial Training" (2019) "On the loss landscape of adversarial training: Identifying challenges and how to overcome them (2020)."
* More literature about robust overfitting: **The authors should compare with the bounds in these works technically** "Non-vacuous Generalization Bounds for Adversarial Risk in Stochastic Neural Networks" (2024) "On the impact of hard adversarial instances on overfitting in adversarial training" (2024).
In general, I think the research in this work can contribute to the machine learning community, but the manuscript is not ready for publication given the concerns above. I welcome the authors to address my concerns during the rebuttal and will reconsider my rating afterwards.
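For reference, weakness 2 above refers to the standard formulation of the adversarial (robust) loss; assuming the paper follows the usual definition, the relevant quantities are
$$\ell_{\mathrm{adv}}(w; x, y) = \max_{\|\delta\|_p \le \epsilon} \ell(w; x + \delta, y), \qquad \hat{R}_{\mathrm{adv}}(w, S) = \frac{1}{|S|} \sum_{(x,y) \in S} \ell_{\mathrm{adv}}(w; x, y),$$
so setting $\epsilon = 0$ recovers the clean loss, and one would expect the bound to recover a standard (non-adversarial) generalization analysis in that limit.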
The questions are pointed out in the weakness part. Please answer them one by one.
1. [Weakness 1] How are the assumptions satisfied in practice? Is it possible to provide a weaker and more generic assumption in contrast to Assumption 3.3 and Assumption 3.5? I believe these assumptions work well for a convex problem like linear regression, but I do not see how they are satisfied in deep neural networks.
2. [Weakness 2] What makes the results special for adversarial training? How do adversarial perturbations (especially their magnitude) affect the generalisation bound in your theorem?
3. [Weakness 3] More experiments to validate the effect of gradient noise are needed (e.g., using different batch sizes). In addition, it would be better to compare the empirical generalization gap with the theoretical one given by Theorem 4.5.
4. Please pay attention to some minor issues and missing literature pointed out above. |
Fully human-written |
|
ConstrainPrompt: Code-Based Assurance of Prompt-Defined Constraints |
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper presents ConstrainPrompt, a verification pipeline that separates constraint compliance from semantic quality to ensure LLM outputs satisfy hard control constraints. ConstrainPrompt identifies hard control constraints (e.g., format, lexical inclusion, length limits) from prompts, organizes them into a logical evaluation tree that enforces a global-to-local validation order, and compiles this tree into executable Python validators. Experiments show that ConstrainPrompt outperforms the LLM-as-a-judge baseline across three models in both constraint compliance accuracy and violation rationale quality.
1. The method solves a meaningful task. It ensures that the generated results satisfy the prompt-defined hard control constraints, for example, format, lexical inclusion, and length limits.
2. The guard-first, coarse-to-fine ordering enforces logical precedence between global and local constraints, improving robustness and interpretability.
3. The method demonstrates strong performance in terms of compliance accuracy and violation rationale.
4. The manuscript is easy to follow, and the method is clearly described and defined.
1. LLM-as-a-judge is a weak baseline. There are many works on agentic workflows that address the same issue. Also, constrained decoding is a typical approach, but the paper did not compare against it.
2. Current instruction-tuned models can already align well with complex output requirements (e.g., GPT-4.1 performs much better than GPT-4o on instruction following [1]); moreover, instruction-tuned models can align with any kind of output requirement. At the same time, ConstrainPrompt does not describe how to generalize to output requirements beyond PromptSet (line 93). In other words, the constraints are not generalizable.
3. The evaluation relies on a single benchmark with only 61 records (Line 355), which cannot represent real-world diversity. Further experiments on related benchmarks are necessary.
4. The manuscript mentions (line 55) that one way to check output constraints is a rule-based script, but no such baseline is included in the experiments. Also, the evaluated models are limited to only three powerful LLMs, which is not enough; it is not clear whether the pipeline can benefit smaller LLMs.
5. Extraction, tree synthesis, and code generation all rely on LLMs (Sec. 3); there is a risk that they introduce bias or errors from the LLM itself. Related failure cases are not discussed.
6. The generated validators could add considerable computational overhead, yet no efficiency analysis is provided.
[1] https://openai.com/index/gpt-4-1/
As current instruction-tuned models increasingly follow prompts accurately, how does ConstrainPrompt + base model compare against instruction-tuned models alone in terms of constraint compliance?
For constraint categories beyond PromptSet, how does ConstrainPrompt generalize?
Fully human-written |
|
ConstrainPrompt: Code-Based Assurance of Prompt-Defined Constraints |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors study prompts inside programmatic LLM applications, which tend to have hard constraints expressed in natural language. They introduce a method to turn natural language constraints into an executable verification function. Their method, ConstrainPrompt, starts by extracting a list of "code-verifiable" constraints (e.g., format checks, numerical checks, or the absence of certain tokens), each of which may be conditioned on some criteria. Then, ConstrainPrompt sorts the constraints by scope from coarse-grained to increasingly fine-grained, dependent, or lower-level ones. The authors define this as a tree, though (as discussed below) it appears to me that this is just an ordered list of possibly conditional checks, with early exit on first failure.
The authors also introduce a dataset of real prompts from such LLM systems and, for a small number of them, collect corresponding LLM outputs and annotate them with violations of the hard constraints specified in the prompts. They use this data to investigate patterns of constraints and failures in real systems, and the smaller annotated subset to compare their method, ConstrainPrompt, against simply asking an LLM to judge model outputs against (hard constraints from?) the original prompts. Across three models, the authors observe large gains in quality along two axes: accuracy of detecting compliance and attributing failures correctly.
This work explores an understudied problem: more carefully defining and evaluating the reliability of prompts inside programmatic LLM systems, particularly along the axis of explicit constraints in the prompts. It does so in a way that I think can help future work: the larger dataset, the taxonomy produced, and the smaller annotated data can be a sensible starting point for multiple future projects in this area.
The method introduced is simple and might be a good starting point for methods in this space, and the gains against LLM judges appear substantial, not to mention that code-based validators are likely cheaper and more interpretable than judges. Though the data and the scope of constraints are quite small, these types of constraints are nonetheless almost ubiquitous, so the problem studied can still have a reasonable amount of impact.
The scope of the study (only a handful of types of hard constraints) and the amount of data labeled for the evaluation (61 examples?) are alarmingly small. While I commend the authors for their transparency in describing their process, the filtering applied is substantial, e.g. keeping only "templates that contain only one user–input placeholder, which simplifies controlled input synthesis". It can be hard to ascertain how difficult all of this really is, especially as models and judges get better or the constraints become more complex.
The method described uses a "tree", but as described in the summary, this binary tree is more simply characterized as an (ordered) list of conditional checks with early exit. While it is of course a valid tree, "a list with early exit" is perhaps the better description, since it is simpler than what the design space of "trees" might evoke; a hypothetical sketch of what this amounts to is given below.
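To make this concrete, here is a minimal, hypothetical sketch of such an ordered, early-exit validator in Python; the constraint names and checks are invented for illustration and are not the paper's actual generated code.

```python
import json

def _json_obj(out):
    # Return the parsed JSON object, or None if parsing fails or the
    # top-level value is not an object.
    try:
        obj = json.loads(out)
        return obj if isinstance(obj, dict) else None
    except json.JSONDecodeError:
        return None

# Ordered from global/coarse guards to local/fine checks.
CHECKS = [
    ("output parses to a JSON object", lambda out: _json_obj(out) is not None),
    ("has a string 'summary' field",   lambda out: isinstance(_json_obj(out).get("summary"), str)),
    ("summary is at most 50 words",    lambda out: len(_json_obj(out)["summary"].split()) <= 50),
    ("does not contain 'TODO'",        lambda out: "TODO" not in out),
]

def validate(output: str):
    # Early exit: the first failing check supplies the violation rationale,
    # so coarse guards shadow the finer checks that depend on them.
    for name, check in CHECKS:
        if not check(output):
            return False, f"violated: {name}"
    return True, "all constraints satisfied"
```

Framed this way, the design question is mainly how the checks are ordered and compiled automatically, which is consistent with the "list with early exit" reading.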
The baselines are not necessarily most convincing. Were the judges "engineered" to align with the complete specification of the authors' intent from these evaluations? For example, the system is designed to prioritize certain types of violations over others (e.g., coarse-grained ones and focus on hard constraints). Is the LLM judge informed of all that? This matters for the Violation Rationale output and probably also for Constraint Compliance Accuracy. Why can't modern LLM judges check all these extremely simple constraints? Perhaps modern reasoning models, which are not particularly new anymore, can do this out of the box? The reason this concern matters is that it appears that the authors want to argue that their method is superior to simple judges, so the reasons for this argument need to be clearly argued or supported.
See weaknesses. |
Fully human-written |
|
ConstrainPrompt: Code-Based Assurance of Prompt-Defined Constraints |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper focuses on post-generation validation rules for LLMs. It proposes ConstrainPrompt, a pipeline that automatically extracts constraints from the user input, deduces a tree-like structure for validation, and uses an LLM to generate the final code-based validation test. The paper also builds a new benchmark for evaluating how accurately ConstrainPrompt can validate outputs generated from real-world prompts. In the evaluation, the paper shows that ConstrainPrompt outperforms vanilla LLM-as-a-judge. The ablation study also shows the effectiveness of the judgement tree.
- The paper focuses on a novel problem that is prevalent for people who adopt LLMs into their workflows.
- The paper is presented well, with nice flow and comprehensive quantitative evaluation.
- The importance of the problem is somewhat questionable. Normally people derive prompts and then handwrite validation rules as a one-time effort. ConstrainPrompt mostly handles syntactic checking, which already does not take much effort to write by hand.
- The evaluation is missing some details. For example, how many times was the evaluation run? It is also unclear which model generated the outputs for the benchmark.
- Why is the problem important? How is ConstrainPrompt much better than human-written validation, which is only a one-time effort per prompt?
- What is the fluctuation of the result? How does the fluctuation affect the result of the ablation study?
- Which model is used to generate the outputs for the benchmark? |
Fully human-written |
|
ConstrainPrompt: Code-Based Assurance of Prompt-Defined Constraints |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces ConstrainPrompt, a verification pipeline for code-verifiable, semantically agnostic constraints in LLM output. ConstrainPrompt induces code-verifiable constraints from natural language prompts, synthesizes a logical evaluation tree to determine order and scope, and compiles it into an executable verifier that returns a deterministic pass/fail decision with human-readable justification and provenance. On paired data of real-world prompts and model outputs, the proposed approach consistently outperforms an LLM-as-a-judge baseline, significantly improving constraint compliance accuracy and violation justification.
1. By introducing the evaluation tree, ConstrainPrompt can effectively separate global analysis and local checks, and respect the order from coarse to fine, reducing the common misjudgments and omissions of LLM-as-a-judge.
2. ConstrainPrompt automates the extraction and compilation of natural-language constraints into executable code, enabling deterministic, reproducible validation that is immune to the inconsistency and subjectivity of human or LLM-based judges.
3. The method demonstrates significant and consistent improvements over LLM-as-a-judge across multiple state-of-the-art models, with gains of up to 39.5% in accuracy and 93.4% in violation rationale quality, underscoring its practical utility.
1. The pipeline heavily depends on powerful LLMs (like GPT-4o, Claude Sonnet) for constraint extraction, evaluation tree synthesis, and code generation. This raises concerns about the method's generalizability and accessibility. The paper does not demonstrate that the pipeline remains robust when using weaker models (e.g., smaller open-source models).
2. The core of this method is to only process "code-verifiable" constraints. However, this filtering step itself is judged by an LLM, which could become a source of error and a single point of failure. If the filter misclassifies a constraint (e.g., incorrectly judging a code-verifiable constraint as non-verifiable, or vice versa), the entire verification process becomes incomplete or inaccurate. There is a lack of deterministic guarantees for this critical step.
3. The paper primarily compares ConstrainPrompt against a simple “LLM-as-a-judge” baseline. However, a comparison with carefully engineered, hand-crafted rule-based validation systems would be more compelling. Such a comparison would more clearly measure the advantages and disadvantages of this automated approach in terms of accuracy and efficiency compared to human-expert-built, task-specific validators.
1. Your approach relies on state-of-the-art models like GPT-4o or Claude Sonnet for code generation. Have you evaluated the performance degradation of various stages of your approach (particularly constraint extraction and code generation) when using less powerful (e.g., 7B-13B parameter size) open-source models like Llama or Qwen?
2. The core of this approach relies on a "code verifiability" filter determined by the LLM. Have you evaluated the accuracy of this filter itself? In your research, have you encountered cases where the entire verification process failed due to misjudgments by the filter (e.g., filtering out verifiable constraints or retaining unverifiable constraints)?
3. The paper compares ConstrainPrompt to a "LLM-as-a-judge" baseline and demonstrates significant improvement. Have you considered comparing your approach to hand-crafted, task-specific verifiers? Such a comparison would better illustrate your approach's accuracy advantage.
Lightly AI-edited |
|
TopoFormer: Topology Meets Attention for Graph Learning |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces TOPOFORMER, a novel framework that integrates topological data analysis (TDA) with transformer based graph learning. The central idea is to capture multi scale structural information from graphs using a topological sequence representation that can be directly processed by a transformer encoder.
The key component, TopoScan, converts a graph into an ordered sequence of topological tokens. Each token represents information extracted from overlapping slices of a graph filtration, such as degree or heat kernel signature. This design enables the model to preserve both local and global topological properties while avoiding the computational cost of traditional persistent homology.
Formally, the authors prove an ℓ1-stability theorem showing that small perturbations in the filtration function lead to bounded changes in the generated sequence, ensuring robustness of the representation.
The resulting TopoFormer model combines TopoScan with a transformer backbone to perform graph classification and molecular property prediction. It achieves state of the art or near state of the art results on more than fifteen datasets, including IMDB, REDDIT, MUTAG, PROTEINS, and several MoleculeNet benchmarks.
Originality:
The paper introduces an inventive way to integrate topological reasoning with transformer architectures for graph learning. The proposed TopoScan mechanism is a fresh formulation that replaces persistence diagrams with sequential topological tokens derived from overlapping filtration slices. This approach is conceptually novel because it allows topological information to be represented in a transformer compatible format, overcoming long standing barriers between topological data analysis and deep attention models. The use of interlevel filtrations and the demonstration that they capture multi scale graph structure are creative contributions that extend beyond incremental improvement.
Quality:
The paper demonstrates high technical and experimental quality. The theoretical section provides a sound stability guarantee (the $\ell_1$ stability theorem) that ensures robustness to small perturbations, while the empirical results convincingly show strong and consistent performance across both graph classification and molecular property prediction tasks. The authors also perform comprehensive ablations that isolate the effect of filtration type, window size, and token sequence length. Comparisons with a wide set of baselines, including topological and transformer based methods, confirm the reliability of the findings. The method is computationally efficient and well engineered.
Clarity:
The writing is clear, well organised, and pedagogical. The motivation is established early, mathematical definitions are presented cleanly, and diagrams effectively illustrate the construction of topological sequences and their flow into the transformer. The appendix provides sufficient detail for reproducibility. The balance between topological intuition and algorithmic implementation makes the paper accessible to both theoretical and applied audiences at ICLR.
Significance:
The work is significant in both theoretical and practical terms. It provides a general framework for incorporating topological information into neural architectures without requiring heavy persistent homology computation, making it scalable to large graphs. The approach has broad applicability beyond the tested datasets, offering a template for topology aware transformers in other relational domains such as biological networks, material science, and social graphs. By demonstrating that topological structure can be embedded as a sequence and effectively processed through self attention, the paper establishes a promising direction for future research in structure aware representation learning.
Overall:
A technically rigorous, clearly written, and conceptually innovative paper. It combines theoretical insight with practical relevance, resulting in a contribution that meaningfully advances the integration of topology and modern deep learning.
1. Limited theoretical depth beyond stability
While the inclusion of an ℓ1 stability theorem demonstrates that TopoScan is robust to small perturbations, the theoretical analysis does not go far enough to explain why the proposed representation preserves meaningful topological invariants or how it compares in expressive power to standard persistent homology (PH) based methods. The paper would be stronger with a formal comparison of representational capacity, such as bounding the information loss between TopoScan token sequences and PH barcodes, or connecting the proposed construction to existing frameworks like Graph Filtration Learning (Hofer et al., NeurIPS 2020) or Stable Rank Vectors (Chazal et al., 2021). A clearer theoretical bridge between TopoScan and persistent homology would make the contribution more foundational rather than heuristic.
2. Dependence on hand crafted filtration functions
The approach still relies on predefined filtrations (for example degree, curvature, or heat kernel). The choice of filtration has a notable influence on performance, as shown in the ablation studies, yet no adaptive mechanism is proposed. This limits generality and introduces dataset specific tuning. The authors could strengthen the work by introducing a learnable or task aware filtration module, or at least by exploring gradient based parameterisation of the filtration functions to allow end to end optimisation.
3. Limited interpretability analysis
Although the paper claims that TopoScan tokens are interpretable and capture multi scale structures, there is no concrete analysis showing what the model learns. For instance, visualising the transformer’s attention weights mapped back to graph substructures (for example motifs, cycles, or communities) would provide stronger evidence that the model captures meaningful topology. A few case studies or qualitative visualisations would greatly improve interpretability and help validate the topological claims.
4. Incomplete generalisation evaluation
The experiments focus primarily on molecule and social graph benchmarks, which are standard but relatively homogeneous. Testing on non molecular heterogeneous graphs (for example citation networks or dynamic temporal graphs) would help confirm that the method generalises to other graph structures. Since TopoScan claims to be a general topological sequence generator, demonstrating this versatility would make the paper’s impact broader.
5. Efficiency claims lack concrete quantitative support
The paper argues that TopoScan avoids the cubic computational complexity of persistent homology, yet runtime improvements are reported qualitatively rather than quantitatively. Providing explicit runtime tables or scaling plots, such as comparing training and inference times against PH based baselines like PersLay or TopoGCL, would substantiate the scalability advantage and enhance the practical credibility of the method.
1. Theoretical clarification on representation power
Could the authors explain in more detail how the TopoScan representation compares in expressiveness to persistent homology? Specifically, does the sequential encoding preserve the same critical topological information that persistence diagrams capture, or does it approximate it? A small empirical or theoretical comparison could help clarify what kind of information may be lost or transformed in the conversion to token sequences.
2. Learnable or adaptive filtrations
Have the authors considered learning the filtration function directly rather than fixing it a priori? For example, one could parameterise the filtration with trainable weights that adapt during training, similar to Graph Filtration Learning (Hofer et al., 2020). If so, what are the main challenges in integrating such a module with TopoScan, and could it potentially improve generalisation across datasets?
3. Sensitivity to TopoScan parameters
The results depend on the number of slices $N$ and the window width $m$ used in TopoScan. Could the authors provide a sensitivity analysis or heuristic guideline for selecting these parameters? Understanding whether performance is robust to parameter variation would increase confidence in the stability of the framework.
4. Attention interpretability and visualisation
Can the authors show examples of which graph substructures the transformer attends to when processing the topological token sequences? For instance, does the attention mechanism focus on regions corresponding to high curvature, cycles, or clusters? Such visual evidence would strengthen the claim that TopoFormer is both interpretable and topology aware.
5. Broader evaluation and generalisability
Would the authors consider evaluating TopoFormer on non molecular heterogeneous datasets such as citation or temporal graphs? Since TopoScan is proposed as a general representation, results on more diverse graph types could demonstrate broader applicability and robustness.
6. Quantitative runtime and scalability analysis
The paper claims that TopoScan avoids the heavy computational cost of persistent homology. Could the authors provide explicit runtime benchmarks comparing TopoFormer with PH based baselines like PersLay or TopoGCL on large datasets? This would substantiate the claim of improved efficiency.
7. Relation to existing topological and transformer models
How does TopoFormer conceptually differ from recent hybrid approaches such as TopoGCL (Zhao et al., 2023) or Graphormer (Ying et al., 2021)? A more explicit discussion of what new design principle TopoFormer introduces beyond these works would help clarify its unique contribution to the literature.
8. Empirical validation of the stability theorem
Can the authors provide an experiment demonstrating the stability property empirically, for example by perturbing the graph structure or filtration and measuring the variation in output embeddings or predictions? Such a demonstration would connect the theoretical result with observable robustness in practice. |
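As a concrete illustration of what the last question asks for, below is a minimal sketch of such an empirical stability check. The function `topo_sequence(G, f)`, the bounded-noise model, and the parameter names are assumptions made for illustration, not components of the paper.

```python
import numpy as np

def empirical_stability(G, f, topo_sequence, eps=0.01, trials=20, seed=0):
    # Perturb the filtration values by bounded noise (sup-norm <= eps) and
    # measure the l1 distance between the resulting token sequences.
    rng = np.random.default_rng(seed)
    base = np.asarray(topo_sequence(G, f), dtype=float)
    gaps = []
    for _ in range(trials):
        f_pert = {v: fv + rng.uniform(-eps, eps) for v, fv in f.items()}
        pert = np.asarray(topo_sequence(G, f_pert), dtype=float)
        gaps.append(np.abs(base - pert).sum())
    return float(np.mean(gaps)), float(np.max(gaps))
```

Plotting the mean and maximum gap against the perturbation level would directly show whether the observed variation stays within the bound promised by the stability theorem.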
Fully AI-generated |
|
TopoFormer: Topology Meets Attention for Graph Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces TopoFormer, a framework for injecting topological information into graph Transformers by turning a graph plus a filtration into a short, fixed-length sequence of “topology tokens.” Instead of running standard persistent homology and then vectorizing diagrams, the authors propose Topo-Scan: they slide overlapping windows over a node/edge filtration and, for each window, record four inexpensive quantities. Multiple filtrations (degree, HKS, Ollivier–Ricci) produce multiple sequences, which are encoded independently and fused to create a final graph representation. A tailored stability result shows these interlevel Betti sequences change in a controlled way under small perturbations of the filtration, which helps justify the construction.
Empirically, the method is evaluated on 9 graph classification datasets and 7 molecular property prediction tasks. It achieves first or second place on most small/medium benchmarks and remains competitive on the larger OGBG-MOLHIV (refer to questions). The ablations are well designed: they hold filtrations fixed and swap in PH+MLP, PH+Transformer, and the proposed Topo-Scan, which cleanly attributes gains to the sequence-based topological view rather than to feature choice. The authors also report concrete extraction times and argue their approach avoids the “early saturation” commonly seen in PH.
I find the core idea clean and promising. The proposed method addresses real challenges in applying TDA to graphs (global PH and saturation), and nicely enough it does so by simplifying in a well-motivated fashion. The resulting sequences naturally fit Transformer architectures and I think this is a clean and reusable concept.
The comparisons to PH+MLP and PH+Transformer under the same filtrations isolate the benefit of the proposed representation rather than conflating it with different signal sources. This supports the central claim.
The empirical results are strong even though some common benchmarks are surprisingly missing. The model is tested on many graph classification datasets and several molecular property prediction tasks, and it is consistently at or near the top on the small/medium ones, showing the idea is not tuned to a single benchmark.
**Architecture**
The part of the paper that introduces the “new” architecture for using multiple filtrations and then combining them at the end is poorly motivated. It is also not clear if you consider this just something you tried or an important part of the contribution of this submission.
Table 12 does not show any convincing evidence that using multiple filtrations in this fashion has any real benefit over using only one filtration. A discussion of this would be important and appropriate where the ablation is mentioned in the main body.
Moreover, the description of the architecture suffers from some over-the-top phrasing. For instance: "To enhance generalization and mitigate overfitting, we integrate regularization techniques, such as dropout and weight decay, ensuring robustness across diverse graph learning tasks."
For one, these techniques of course do not "ensure" anything; additionally, they are well established, and it should be made clear that these techniques are standard.
**Experimental**
The experiments focus on small benchmarks, and none of the commonly studied benchmarks of [1] are considered. Especially in the graph transformer literature these are some of the most standard and widely analyzed benchmarks, so it is somewhat disappointing to see them missing.
On OGBG-MOLHIV, Table 3 does not represent the state of the art on the dataset, cf. https://ogb.stanford.edu/docs/leader_graphprop/#ogbg-molhiv.
**Minor Notes:**
Please order the columns in Tables 2, 5, and 6 consistently.
[1] Dwivedi, Vijay Prakash, et al. "Benchmarking graph neural networks." Journal of Machine Learning Research 24.43 (2023)
At the end of 4.3 you mention that this architecture is particularly well suited for Graph Foundation Models. What makes you claim this? I do not see any empirical evidence to back this up. Simply working well on multiple scales (which is also only demonstrated in a limited way) is not sufficient reason to make such a claim.
How did you decide on the list of models in Table 3? It seems to leave out many top performing models of the ogb leaderboard (link above).
Are the results in Table 2, and Table 3 all for the exact same filtration setup (OR curve + HKS for Table 2, atomic weight + OC curvature for Table 3)? |
Fully human-written |
|
TopoFormer: Topology Meets Attention for Graph Learning |
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces TOPOFORMER, a framework for graph-level representation learning. The core contribution is "Topo-Scan," a module that decomposes a graph into an ordered sequence of topological tokens.
The authors claim that standard Persistent Homology (PH) suffers from an "early saturation" problem on graphs. They propose Topo-Scan as a lightweight solution. Instead of the standard cumulative filtration, Topo-Scan uses a simple "sliding window" to "slice" the graph. For each slice, it computes basic invariants (Betti-0, Betti-1). This process creates an ordered sequence fed into a standard Transformer.
The authors claim this simple method solves the "early saturation" problem and achieves SOTA results on graph classification and molecular property prediction.
The paper is well-organized and easy-to-follow. It also provides a lot of experimental results to understand the empirical behavior of this method. The Figures and tables are carefully presented. Additionally, the paper is well-motivated, clearly identifying the "early saturation" problem , and its core originality comes from combining topological slicing with a standard Transformer architecture .
The paper's central claims are undermined by a significant gap between its motivation and its empirical results. The core contribution (Topo-Scan) is a conceptually simple modification, but the experiments fail to provide statistically significant evidence that this modification offers a meaningful advantage over the standard methods it claims to improve upon.
1. The core idea is trivial and lacks justification: The paper's main technical contribution, Topo-Scan, replaces the standard cumulative sublevel filtration $V_i = \{v: f(v) < a_{i}\}$ with a slicing window $V_i = \{v: a_{i} < f(v) < a_{i+m}\}$. This is a conceptually simple modification (see the sketch after this list).
2. Though the authors justify the contribution of solving early saturation by showing the empirical Betti-0 comparison with increasing thresholds, the trends only differ at the very late stage (the last 10 steps on IMDB-B, and only from roughly the 72nd to the 82nd step on IMDB-M). The relation of this phenomenon to the eventual performance on property prediction tasks is unclear. Table 5 is the key experiment meant to provide justification, comparing TOPOFORMER (Topo-Scan + Transformer) directly against "PH-TR" (standard PH + Transformer). However, the paper's own data shows no statistically significant improvement.
For example, on IMDB-B, where Figure 3 implies that early saturation is addressed by Topo-Scan, the performance with the HKS filtration function is (76.8, 3.97) for PH-TR and (77.9, 5.72) for TOPOFORMER. The two intervals overlap almost entirely, which gives no evidence that Topo-Scan is better than standard PH.
3. The comparison on molecular property prediction is misleading: the results in Table 4 are based on an unfair comparison. The model used is "TOPOFORMER*", which is a hybrid model fusing ECFP fingerprints. However, not all the cited methods use this additional feature. Such a comparison exaggerates the advantage. It would be natural to ask whether PH-TR + ECFP gives similar results, but this is not provided either.
4. The paper claims to be "lightweight" and "efficient". However, Appendix C.3 (Table 10) shows that the preprocessing for the SOTA results (using Ollivier-Ricci) takes 339 minutes (>5.5 hours) on OGBG-MOLHIV. This is still a significant cost and limits the applicability of the method to larger-scale datasets.
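To make point 1 concrete, the two constructions differ only in how the node set of each slice is selected. Below is a minimal sketch, assuming networkx, a node-level filtration given as a dict `f` from nodes to floats, and a list of thresholds; this is an illustration of the definitions quoted above, not the authors' implementation.

```python
import networkx as nx

def betti(G):
    # For a graph: beta_0 = number of connected components,
    # beta_1 = |E| - |V| + beta_0 (rank of the cycle space).
    b0 = nx.number_connected_components(G)
    b1 = G.number_of_edges() - G.number_of_nodes() + b0
    return b0, b1

def cumulative_filtration(G, f, thresholds):
    # Standard sublevel filtration: V_i = {v : f(v) < a_i}.
    return [betti(G.subgraph([v for v in G if f[v] < a])) for a in thresholds]

def sliding_window(G, f, thresholds, m):
    # Topo-Scan-style slicing: V_i = {v : a_i < f(v) < a_{i+m}}.
    return [
        betti(G.subgraph([v for v in G if thresholds[i] < f[v] < thresholds[i + m]]))
        for i in range(len(thresholds) - m)
    ]
```

The change amounts to swapping one membership condition for another, which is why the burden of justification falls on the empirical comparison in Table 5.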
1. The SOTA claim in Table 4 relies on TOPOFORMER* (which includes ECFP features), while many baselines do not . This is an unfair comparison. To justify the method, can the authors provide results for a "PH-TR + ECFP" baseline?
2. The paper's core efficiency claim is that Topo-Scan achieves "multi-fold runtime... gains" by avoiding PH's "global boundary-matrix reductions" . Can the authors provide experimental runtime data for this specific claim? |
Fully human-written |
|
TopoFormer: Topology Meets Attention for Graph Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work introduces TOPOFORMER, a graph representation framework that integrates topological structure into Transformer architectures. The core component, Topo-Scan, generates compact sequential topological tokens by sliding over node/edge filtrations and extracting slice-wise invariants (Betti numbers, node/edge counts). This avoids traditional persistent homology pipelines and costly persistence diagram computations. The authors provide stability guarantees, demonstrate scalability, and report state-of-the-art results on graph classification and molecular property prediction benchmarks.
1. The paper proposes a framework to inject TDA signals into Transformers without persistence diagrams, which articulates limitations of classical PH pipelines (saturation, vectorization design burden) and motivates a practical alternative.
2. The paper provides stability guarantees for the proposed topological sequences, linking back to interlevel persistence.
3. The proposed method avoids heavy PH computation and vectorization. It emphasizes parallelizable slice computation with predictable runtime.
1. Although the paper emphasizes advantages over classical persistent homology pipelines, the empirical comparison against methods that incorporate learnable filtrations remains limited (e.g., Horn et al., 2021; Immonen et al., 2024).
2. While the proposed Topo-Scan framework leverages fixed scalar functions to generate topological slices, the methodology may still exhibit sensitivity to the choice of these filtration signals. The current study provides limited analysis of this dependency, and a deeper investigation—particularly involving learnable or data-driven filtration functions—could strengthen the robustness and interpretability claims.
3. The approach demonstrates competitive performance across multiple graph classification and molecular property prediction benchmarks. However, on large-scale datasets such as OGB-MOLHIV, the framework does not achieve the top reported results, suggesting potential room for improvement in scaling to very large graphs or addressing complex molecular tasks relative to specialized architectures.
A few relevant contributions to learning persistent-homology-based representations, e.g., [1,2], are not included in the references.
[1] de Surrel et al. "RipsNet: a general architecture for fast and robust estimation of the persistent homology of point clouds." ToGL 2022.
[2] Yan et al. "Neural approximation of graph topological features." NeurIPS 2022.
1. Could Topo-Scan be extended to learn filtration functions end-to-end? How would stability guarantees extend to such a case?
2. For large-scale graphs (e.g., OGB-MOLHIV, REDDIT), what are memory/time trade-offs compared to graph Transformers with pooling?
3. Beyond Betti numbers, would incorporating additional topological descriptors (e.g., persistence landscapes) meaningfully improve results? |
Fully AI-generated |
|
TopoFormer: Topology Meets Attention for Graph Learning |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes a sliding window in place of the standard sublevel set filtration for the computation of persistent homology. The resulting simplicial complexes (graphs in this case) necessarily form an *ordered* sequence, which can be processed with a transformer architecture for downstream tasks.
The sequence of graphs is summarized by a set of topological features (Betti numbers and the number of nodes/edges) and used as the input tokens for a transformer architecture, which is a natural choice.
Overall, the idea, though maybe simple, is well thought out, natural, elegantly executed, and thoroughly evaluated with great care for detail. The code is provided, well documented, and easily accessible, which is great to see! The reviewer believes that the use of a sliding window for the computation of persistent homology (though no longer persistent per se) is interesting and novel, which is encouraging given the results on the benchmark datasets. This perspective is also of interest beyond the scope of the specific choices made in the paper.
Overall I am very positive about this work. The first of the method's two minor weaknesses is the choice of vectorization scheme (one could choose a more expressive statistic than $\beta_{0}$ or $\beta_1$).
Adding a discussion there could be good. The second minor weak point is that a discussion of the current limitations is missing, and some remarks on future work and the current challenges would be beneficial to readers. For instance, most methods in TDA have been developed specifically for filtrations and would no longer apply in this scenario.
- Choice of vectorization scheme, as three hard-coded numbers. Have you considered using either learned filter functions or more elaborate vectorizations such as persistence images?
- What motivated you to use the Ollivier–Ricci curvature and Heat Kernel Signature? Have you considered other methods as well, such as learning the filtrations? |
Fully human-written |
|
Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper studies Tool-Integrated Reasoning (TIR) for LLMs and proposes a new framework, Tool-Light, to resolve the problem of excessive tool calls during reasoning. The key designs of the proposed framework are dataset construction and multi-stage fine-tuning: for dataset construction, Tool-Light employs two sampling strategies, and for multi-stage fine-tuning, it introduces a two-stage training method. To support the effectiveness of Tool-Light, extensive empirical studies are conducted.
- The authors provide extensive experiments to demonstrate the effectiveness of **Tool-Light**, particularly through the experiments in **Figure 1**, which clearly illustrate the motivation behind the work.
- The paper is clearly written and well organized, making it easy to follow.
- **Figure 1** lacks sufficient explanation. For instance, the meaning of *Token Index* and the specific roles of *Steps 1–4* are unclear.
- In **Figure 4(c)**, it is not specified whether the response length includes the tool-calling part. The figure shows that **Tool-Light** produces shorter responses than **Tool-Star**, yet the examples in the appendix (Examples 1 and 2) suggest otherwise.
- In **Line 257**, the computational complexity is claimed to be $O(n\log m)$, but the derivation of this result is not provided.
- The authors adopt two sampling strategies but do not include an ablation study to compare their effects.
- The experiments are conducted only on the 7B model. To further validate the effectiveness of **Tool-Light**, the reviewer encourages testing on both smaller (e.g., 3B) and larger (e.g., 72B) models.
- In **Line 80**, the authors state that *Pre-Aligned DPO Training* can reduce redundant tool calls. However, it is unclear why this is the case, as *Pre-Aligned DPO Training* does not appear to include mechanisms that explicitly control the number of tool calls.
- It is recommended to report the standard deviation across multiple runs to demonstrate the robustness of the experimental results.
Answer questions in Weaknesses. |
Moderately AI-edited |
|
Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents Tool-Light, a framework designed to enhance the efficiency of tool-integrated reasoning (TIR) in large language models. The core empirical finding demonstrates that invoking external tools causes significant shifts in downstream token-level uncertainty, as measured by entropy. Based on this observation, the authors propose: (i) an entropy-guided sampling procedure for constructing training data, and (ii) a two-stage self-evolved Direct Preference Optimization (DPO) pipeline comprising Pre-Aligned and On-Policy phases. Experimental evaluation across ten benchmarks covering mathematical reasoning and multi-hop question answering shows comparable or superior accuracy while reducing redundant reasoning and improving both efficiency and necessity of tool usage.
1. The paper is well-structured with clear writing that facilitates comprehension of the proposed methodology.
2. The investigation of information entropy changes during tool invocation processes is particularly interesting. The authors effectively leverage these insights to guide their methodological design, providing a principled foundation for their approach.
3. While some components of the proposed data construction and self-evolved DPO framework draw upon established techniques, the overall approach remains intuitive and theoretically sound.
4. The authors conduct thorough experiments demonstrating the effectiveness of their method. The entropy distribution analysis particularly convincingly shows that the approach achieves lower entropy distributions, validating the theoretical motivation.
1. The analysis of entropy in tool invocation has been explored in prior work, and the conclusions drawn are not particularly surprising or groundbreaking.
2. The proposed techniques, including entropy-based sampling and evolved DPO, represent relatively incremental advances rather than significant methodological innovations.
3. While the core idea is interesting, the evaluation is restricted to DPO. The work would be significantly strengthened by demonstrating applicability to other preference learning methods such as GRPO or PPO, which would provide stronger evidence for the generalizability and robustness of the approach.
N/A |
Moderately AI-edited |
|
Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work addresses inefficiencies in Tool-Integrated Reasoning (TIR), where LLMs overuse, underuse, or mismanage tool calls. The authors analyze tool-usage dynamics through information entropy, finding that tool call results significantly influence reasoning entropy patterns. Building on this insight, they propose Tool-Light, a framework for more efficient and accurate TIR. Tool-Light combines self-evolved dataset construction—using vanilla and entropy-guided sampling with strict positive-negative selection—with a two-stage training scheme of Supervised Fine-Tuning and Self-Evolved DPO. Experiments on 10 datasets (e.g., AIME24, MATH, HotpotQA, etc) show that Tool-Light notably enhances both the efficiency and accuracy of tool-integrated reasoning.
1. Tool-integrated / agentic reasoning is an important topic with many potential practical applications.
2. The information-entropy-based analysis of model behavior after tool calls is novel and insightful.
3. The experimental evaluation is comprehensive, and the performance gains are clear.
1. Please discuss more related work from the self-evolved preference learning perspective, such as Zeng et al., "Evolving LLMs' Self-Refinement Capability via Synergistic Training-Inference Optimization," and Su et al., "Trust Region Preference Approximation: A Simple and Stable Reinforcement Learning Algorithm for LLM Reasoning."
2. Clarify whether the information entropy observation generalizes across all tools, and list the specific tools used in the experiments for completeness.
Please see the above Weaknesses. |
Fully human-written |
|
Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper studies "tool intelligence" in LLM agents and targets two failure modes: overthinking (too many tool calls / overly long chains) and underuse (skipping necessary tools). It first runs a pre-experiment that tracks token-level entropy over reasoning traces and observes characteristic patterns around tool-call moments. Building on this, it proposes Tool-Light, a two-stage pipeline: (1) SFT to establish a reasonable tool-using policy, then (2) self-evolved DPO using curated preference pairs that favor accurate, tool-efficient traces. At inference time, it adds entropy-guided branching that expands alternatives at high-uncertainty steps while keeping compute controlled. Experiments span math and knowledge-intensive QA with web search and a code interpreter, reporting accuracy alongside two bespoke metrics, Efficiency (tool economy) and Necessity (using tools when needed). Results show competitive or better accuracy with fewer tool calls and shorter outputs compared to strong baselines (e.g., Tool-Star, Search-R1), plus initial ablations on sampling knobs. The paper positions entropy as a practical signal for both training supervision and test-time exploration control.
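For clarity on the inference-time mechanism summarized above, the following is a hedged sketch of entropy-guided branching as I understand it; the function names, the fixed branch width, and the top-k selection are my assumptions rather than the authors' implementation.

```python
# Expand alternative continuations only at the k most uncertain decoding steps,
# forking each selected step into its `width` most likely alternative tokens.
import torch
import torch.nn.functional as F

def select_branch_points(step_logits: torch.Tensor, k: int = 3) -> list[int]:
    """step_logits: [seq_len, vocab_size]; return indices of the k highest-entropy steps."""
    logp = F.log_softmax(step_logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)  # [seq_len]
    return torch.topk(entropy, k=min(k, entropy.numel())).indices.tolist()

def branch(prefix_tokens: list[int], step_logits: torch.Tensor,
           k: int = 3, width: int = 2) -> list[list[int]]:
    """Fork the trace at each selected step; each fork would then be re-decoded to completion."""
    candidates = []
    for pos in select_branch_points(step_logits, k):
        alt_tokens = torch.topk(step_logits[pos], k=width).indices.tolist()
        for tok in alt_tokens:
            candidates.append(prefix_tokens[:pos] + [tok])
    return candidates
```

Branching only at high-entropy steps is what keeps the extra compute bounded relative to exhaustive beam-style expansion.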
(1) Actionable diagnostic. Using token-level entropy to profile traces is simple, implementation-light, and yields intuitive visualizations to reason about when/why tools are called.
(2) Modular training recipe. The SFT → self-evolved DPO pipeline is straightforward to adopt on top of popular backbones and existing tool-use frameworks; no exotic infra required.
(3) Competitive results. Shows strong performance relative to recognized baselines while reducing tool calls—evidence that “lighter” tool use need not sacrifice accuracy.
**Weaknesses**
(1) Missing details. The pre-experiment omits decoding settings: neither the temperature, logit scaling, nor the sampling-vs.-greedy choice is reported, and these choices change entropy levels and trends. Please specify the exact decoding configuration used to measure entropy and justify it. The entropy study is said to span "multiple QA datasets," but the figure does not list them, and it is unclear whether the same model/datasets are reused later to train Tool-Light; please enumerate the datasets in Figure 1 and clarify any reuse. For the method, entropy decides branching at the top-k steps and uses top-i prefix averages, yet the values of k and i and their stability are not reported.
(2) Missing ablations. Table 2 varies only the loop count and a few sampling knobs. Missing ablations include: (i) a no-entropy (vanilla-sampling) baseline, (ii) an entropy-only sampling variant, (iii) β sensitivity in DPO, (iv) the choice of reference policy, and (v) branch-width sensitivity.
(3) “Multi-tool” over-claim. Most experiments use only two tools (web search + code interpreter), and several evaluations appear to be single-tool setups. There’s no test on unseen tool types or cross-tool composition tasks. As a result, the paper does not yet support broad “multi-tool generalization” claims.
**Suggestions**
(1) Control for length. Lower entropy often comes with shorter or more templated outputs. Your own results show Tool-Light reduces sequence length (Fig. 4). The same factor could explain both “lower entropy” and “fewer tool calls.” Please report entropy at fixed token positions or use length-normalized entropy.
(2) Link entropy to correctness. "Low-entropy chains use fewer tool calls" may reflect early commitment, not better answers. Please test whether entropy predicts correctness. Report AUROC for path-level mean or area-under-entropy, controlling for tool-call count (a small sketch of this check, together with the length normalization suggested in (1), follows this list).
(3) Stress test with noisy tools. Overthinking and underuse show up when tools are imperfect. Add controlled corruptions: vary retrieval precision/recall, inject code stderr/noise. Show how Tool-Light adapts tool frequency and preserves accuracy vs. Tool-Star/Search-R1. This directly probes the “analysis paralysis” claim.
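To make suggestions (1) and (2) concrete, here is a small sketch of both checks; the assumed data format (one per-token entropy array and a 0/1 correctness flag per trace) and the toy numbers are my own, not results from the paper.

```python
# Length-normalized entropy per trace, then AUROC testing whether entropy
# predicts answer correctness. Toy data stands in for real traces.
import numpy as np
from sklearn.metrics import roc_auc_score

def length_normalized_entropy(token_entropies: np.ndarray) -> float:
    """Average per-token entropy, so longer traces are not mechanically 'lower entropy'."""
    return float(token_entropies.mean())

rng = np.random.default_rng(0)
# 100 toy traces of varying length; the first half is 'correct' and lower-entropy by construction.
traces = [rng.normal(1.0 if i < 50 else 1.4, 0.3, size=rng.integers(50, 300))
          for i in range(100)]
correct = np.array([1] * 50 + [0] * 50)

scores = np.array([length_normalized_entropy(t) for t in traces])
auroc = roc_auc_score(correct, -scores)  # lower entropy -> predicted correct
print(f"AUROC(length-normalized entropy -> correctness): {auroc:.3f}")
```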
(1) Entropy dips after tool calls.
Could the “drop before the next tool call” be driven by inserting long, deterministic tool outputs (i.e., context length/format effects) rather than better reasoning? What happens if you replace tool results with semantically equivalent, length-matched paraphrases? Does the pattern remain?
(2) Scope to multi-tool.
Do the pre-experiment findings hold in true multi-tool settings, not just single-tool or search-heavy cases? Please show the same entropy analysis when multiple heterogeneous tools are available. |
Fully AI-generated |