|
Transformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
In this paper, the authors study how simplified one-layer transformers with position-only attention trained via gradient descent can learn a class of teacher models defined through a bilinear structure combined with a non-linearity. Included in this class of models are, among others, convolution layers with average pooling, graph convolution layers, sparse token selection, and group sparse linear predictors. Several experiments on synthetic and real-world data are carried out to confirm the theoretical results.
The paper is well written and well organized. The proofs are complex, and they seem sensible to me, even though I could only go over them superficially.
My main concerns with the paper are the strong theoretical assumptions and what I perceive to be a somewhat unfair comparison with previous works.
Regarding the first point, the assumption that the attention matrix only depends on the position of the tokens is extremely strong. In fact, this simply means that the attention matrix does not depend on the input at all, since the positions are the same for all inputs (in Eq. (2.4), P is fixed and independent of X). With this assumption, the paper neglects what is arguably the main feature of attention, namely attending to other tokens depending on their value. Moreover, the architecture then precisely mimics the teacher class considered (compare Eq. 2.1 with Eq. 2.4), so it is not surprising that the simplified architecture can correctly learn this class (though I concede that proving this fact is still challenging).
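To make the comparison concrete, this is the schematic form I have in mind (my own paraphrase of Eqs. 2.1 and 2.4, not the paper's exact notation):
$$
\text{teacher: } \; y = F\big(\mathbf{V}^{\star}\mathbf{X}\,\mathbf{S}\big),
\qquad
\text{student: } \; f(\mathbf{X}) = F\big(\mathbf{V}\mathbf{X}\,\mathrm{softmax}(\mathbf{P})\big),
$$
where $F$ is the non-linearity and $\mathbf{P}$ is a learned matrix of positional attention logits that is identical for every input $\mathbf{X}$. In other words, the student is the teacher class with $\mathbf{S}$ replaced by a learned, input-independent mixing matrix, which is why learnability of this class by this architecture is less surprising than it might first appear.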
Regarding the second point, I find the way the authors compare to some previous work quite misleading. Specifically, the authors claim that their work generalizes (Wang et al., 2024), which also considers how a simplified transformer architecture learns the sparse token selection task. However, the two works differ in two important respects. Firstly, in (Wang et al., 2024), the fact that the attention only focuses on the positional information of X is shown to be learned during gradient descent, not assumed from the start. Secondly, and more importantly, in (Wang et al., 2024) the attention matrix depends on the input. In fact, in their work, the sparse subset of indices over which X is averaged is sampled randomly and given as an input to the transformer; the transformer then selects the correct indices of X through the attention layer based on this input. In this paper, instead, the subset of indices over which the average is taken is presumably fixed before training for all inputs. This is a much simpler task, which explains why the attention is chosen to be input-independent. Comparing the two works, and in particular their convergence rates, therefore seems unfair.
I would like the authors to address the main concerns described above. In particular, I'd like the authors to discuss how their work remains relevant despite the strong theoretical assumptions (in particular, the input-independent attention), and to discuss more thoroughly how their assumptions compare with those of the previous works they benchmark against (Wang et al., 2024; Zhang et al., 2025).
As a minor side question, I see that S in (2.1) is assumed to have all non-zero entries equal to 1/K. Can your approach be generalized to non-uniform S? What are the main difficulties in that case? |
Fully human-written |
|
Transformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models |
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The work demonstrates both theoretically and empirically that a modified version of Transformers can mimic a collection of well-known models (GCN, CNN, etc.). Briefly, this modified version replaces the separate query and key matrices of traditional Transformers with a single parameter and introduces position-only attention weights. Importantly, the authors provide provable polynomial-time convergence rates for the gradient descent-based minimization of the population loss for Gaussian data. On top of this, the theoretical claims are well-positioned within the broader literature and clearly supported by a comprehensive empirical analysis.
All in all, the main document is nicely written and appears to provide significant contributions to the theoretical understanding of Transformers. Nonetheless, the supplementary material is very dense, and it is difficult to assess the correctness of the theoretical results. Lemma C.2, for instance, states that the rows of the learned value matrix are always a positive multiple of the rows of the true value matrix, and its proof is long and convoluted. Also, it is unclear why the data's dimensionality should be larger than a polynomial function of $M$ and $K$, and why many of the presented inequalities break when $D = K$ (see weaknesses below).
1. The main document is very well written; the notation is clean and easy-to-follow.
2. The proposed convergence bounds are, AFAIK, the tightest known for this specific type of problem (which, as remarked by the authors, has been extensively studied lately).
3. The presented empirical results strongly agree with the introduced theoretical analysis.
4. Assumptions for both the Theorems and experiments are clearly stated (e.g., Gaussian data for in-distribution learning, exponential data for out-of-distribution learning, hypothesis class for optimization, Gaussian noise for the target, etc.).
My main concern regards the lack of a clear explanation for the results obtained through extensive and dense calculation, as I detail below. I will be more than happy to raise my score if the following points are addressed.
1. Convergence rates of $\mathcal{O}(1 / T)$ (non-tight) are well-known for strongly convex problems (or, more broadly, for Polyak-Lojasiewicz-type problems); the standard statements I have in mind are written out after this list. However, such conditions are clearly not satisfied by the squared objective function in this work. Which other regularity conditions do the authors think have allowed these tight bounds to exist?
2. Lemma C.2, as I mentioned earlier, seems to be a strong result, and its proof is broken up into several pages with many hand-waved computations. Could the authors provide a short and intuitive explanation for this result? What is the meaning (if any) of $C_1$, $C_2$, and $C_3$?
3. Also, the proofs look invalid when $D = K$, that is, when $\mathbf{S}$ is a simple pooling matrix - this might be the case, e.g., when $g = [D]$ in Example 2.3. What is the fundamental reason for this constraint (if there is one)?
4. On the same note, the bounds also seem to fail when $K = 0$, i.e., when all entries in $\mathbf{S}$ are identically zero. As I understand it, this should be the easiest scenario to learn. Could the authors briefly discuss this restriction?
5. In line 1088, it is stated that an inequality follows from Lemma E.1, which only derives a few equalities. Could the authors please elaborate on this point?
6. Theorem 3.1's bound on $T$ depends linearly on the inverse of the quadratic norm of $\mathbf{V}^{\star}$, while Theorem 3.2's depends linearly on the quadratic norm of $\mathbf{V}^{\star}$; that is, the dependence is inverted between the two results. Why is there such a difference between the in-distribution and out-of-distribution results?
7. Could the authors provide the slopes of the linear curves fitting the tails of Figures 2(a) and 2(b)? Also, do the authors plan to release the code?
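For reference, the standard statements behind point 1 (gradient descent with step size $1/L$ on an $L$-smooth objective $\mathcal{L}$):
$$
\tfrac{1}{2}\|\nabla \mathcal{L}(\theta)\|^{2} \;\ge\; \mu\big(\mathcal{L}(\theta) - \mathcal{L}^{\star}\big)
\;\Longrightarrow\;
\mathcal{L}(\theta_{T}) - \mathcal{L}^{\star} \;\le\; \big(1 - \tfrac{\mu}{L}\big)^{T}\big(\mathcal{L}(\theta_{0}) - \mathcal{L}^{\star}\big),
$$
whereas smooth convexity alone only gives $\mathcal{L}(\theta_{T}) - \mathcal{L}^{\star} \le \frac{L\|\theta_{0} - \theta^{\star}\|^{2}}{2T}$. Neither condition obviously holds for the squared objective considered here, hence my question.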
Please refer to the questions above.
On top of that, the calculations heavily rely on the Gaussianity of the data. When the data deviates slightly from a Gaussian (say, a t-distribution or a Gumbel distribution), how do the authors expect the results to behave? Would the polynomial convergence rate be preserved, for instance? |
Fully human-written |
|
Transformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work presents a theoretical analysis of the optimization dynamics of learning a class of teacher models via a single-layer transformer. This class of models has a bilinear structure and is general enough to recover the results of previous theoretical analyses. In particular, when optimizing the population least-squares loss, the authors show a tight convergence rate, improving upon previous analyses. An additional result on mild OOD generalization as well as experimental results are provided.
- The results provide a tight convergence rate of $\Theta(1/T)$ when training on the population loss, improving previous analyses while generalizing the setting.
- Experimental results for different examples of teacher models are provided, which supports the theoretical analysis.
- The setting is restricted to the population loss and to Gaussian data.
- The generalization of the teacher class is nice, but ultimately it still boils down to sparse selection from the input.
- Is there any example of a teacher model in this class that does not do sparse selection as a subtask? If not, what are the challenges in characterizing such a class that does something other than sparse selection?
- I know this does not change the takeaway much since experimental results are provided, but what are the challenges in obtaining sample complexity guarantees? I suspect it would require some online-SGD-type analysis; could the authors elaborate on this?
I am mostly curious about the first question, and I am happy to raise my score if the authors address it sufficiently. |
Fully human-written |
|
Generalizable and Consistent Granular Edge Prediction |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Granular Edge Prediction, extending binary edge detection to predict granularity levels reflecting perceptual saliency. Main contributions: (1) SGED dataset with 376K SAM-generated images and graph-based refinement for consistency; (2) GEPM model with Edge Consensus Loss; (3) new consistency metrics. The method achieves competitive zero-shot performance across four benchmarks. However, the work lacks human validation, inherits SAM's biases toward object boundaries, and shows limited advantages over supervised methods.
1. The granular edge prediction task is well-motivated, addressing the inherent subjectivity in edge annotation with clear practical applications.
2. Creating 376K images addresses severe data scarcity (only 600 in existing datasets), enabling robust training and generalization.
3. Table 1 shows competitive cross-dataset results, with Multicue ODS of 0.843 approaching supervised methods (0.904).
1. There are no human studies validating that the predicted granularities align with human perception.
2. This is a critical omission for a paper claiming to predict "human-recognized" edge granularity.
3. Some figures are hard to read, and the granularity reduction (36→6) is under-motivated in the main text.
1. Can you provide human studies validating predicted granularities?
2. What is the performance with supervised fine-tuning?
3. When is zero-shot granular prediction preferable to supervised binary detection? |
Moderately AI-edited |
|
Generalizable and Consistent Granular Edge Prediction |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work is the first to systematically define and address the "granular edge prediction" task. It constructs the first large-scale, structured granular edge dataset (SGED), introduces a novel edge consensus loss, and proposes a comprehensive evaluation framework, providing a new paradigm for subsequent edge perception, controllable generation, and other tasks.
This work constructs a large-scale synthetic dataset for granular edge prediction, where each edge is labeled with a quantized granularity level, and introduces a graph-based edge representation to enforce consistency in edge granularity across the dataset. The approach develops a novel edge consensus loss to enforce granularity consistency within individual edges, and proposes a comprehensive evaluation framework, including granularity-aware edge evaluation and two quantitative metrics to assess the consistency of granular edge prediction.
1. The paper mentions "making it particularly valuable and has potential for applications where edge prominence varies," which shows that the authors recognize the importance of downstream applications. However, the paper does not include any specific downstream experiments.
2. The SGED dataset relies on the Segment Anything Model (SAM) to generate synthetic edges. However, the core capability of SAM is to detect object boundaries, so the dataset insufficiently captures "non-object edges" (such as textures, contour boundaries, and edges arising from strong material differences).
The core value of granular prediction is to support downstream tasks. Have the authors tried applying the granular output of GEPM to depth estimation or artistic rendering? Are there any empirical results demonstrating "performance improvement in downstream tasks"? |
Fully human-written |
|
Generalizable and Consistent Granular Edge Prediction |
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper constructs a synthetic dataset for granular edge detection and proposes an efficient model that outputs edge maps of multiple granularity levels in a single forward pass, treating the task as a multi-class prediction problem. In addition, the paper introduces an edge consensus loss to enforce granularity coherence within edges, as well as granularity-aware edge evaluation to demonstrate the effectiveness of the proposed approach.
1. The dataset is built by first extracting object boundaries using SAM and then performing refinement via graph-based representation, resulting in a large-scale granularity-aware edge detection dataset.
2. The proposed formulation effectively avoids the need for multiple forward passes required by previous methods to infer different granularity levels. By converting the problem into a multi-level classification task, the method can produce all granularity results with a single inference.
3. A new loss function and evaluation metric are introduced, which can potentially promote further research in granularity-aware edge detection.
1. Choice of granularity levels. The dataset is initially annotated with 36 granularity levels, but experiments later quantize these into 4 or 6 levels due to distribution imbalance (line 325–327). (1) How were the *original 36 levels* chosen? What advantage does the approach of annotating 36 levels *first and then merging* have over *directly annotating fewer granularity levels*? (2) When the granularity level is extremely small (level = 1), how does this differ from conventional deterministic edge detection? (3) In Fig. 12, one would expect that the smallest granularity level preserves only instance-related boundaries without background clutter. However, the results still appear influenced by background textures. Could the authors comment on this?
2. Model architecture clarification. The main paper does not clearly describe the model architecture. It only becomes clear in Appendix Table 3 that it follows a diffusion-model U-Net backbone. Were pretrained diffusion model weights used?
3. Inconsistent cross-dataset generalization. In Table 1, MuGE trained on SGED performs worse on NYUD compared to zero-shot inference from models trained on BSDS. However, performance improves on BIPED and Multicue. (1) Does this suggest that SGED yields weaker cross-domain generalization than BSDS? (2) Similarly, in Table 7, the performance gap between naive SAM and GEPM is small on BSDS and NYUD but large on BIPED and Multicue. Why does SGED training benefit some datasets but not others? (3) How does the current SOTA DiffusionEdge model perform when trained on SGED and tested on BSDS?
4. Model size comparison. In Table 7, what architecture does *naive SAM* refer to? Is it SAM-ViT-B? The base model used in the paper contains substantially more parameters than both SAM-ViT-B and the EfficientNet models used in MuGE. A direct parameter comparison table would clarify fairness in evaluation.
5. Class imbalance handling (Eq. 2). Since larger granularity levels have fewer annotated samples, does Eq. 2 account for this imbalance? Would class-balanced weighting or focal-type reweighting help further alleviate the imbalance noted in line 325–327?
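To make point 5 concrete, one standard option would be class-balanced weighting of the cross-entropy in the style of Cui et al. (2019). The sketch below is purely illustrative and is not the paper's Eq. 2; the per-level pixel counts `counts` are assumed to come from the SGED label statistics.

```python
import torch
import torch.nn.functional as F

def class_balanced_weights(counts, beta=0.999):
    # "Effective number of samples" weighting: rarer granularity levels get larger weights.
    counts = torch.as_tensor(counts, dtype=torch.float)
    effective_num = 1.0 - torch.pow(beta, counts)
    weights = (1.0 - beta) / effective_num
    return weights / weights.sum() * len(counts)   # normalized so the mean weight is 1

def balanced_granularity_loss(logits, targets, counts):
    # logits: (N_pixels, num_levels) granularity logits; targets: (N_pixels,) level labels
    w = class_balanced_weights(counts).to(logits.device)
    return F.cross_entropy(logits, targets, weight=w)
```

Focal-style reweighting could additionally down-weight well-classified pixels on top of this.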
The strength of this paper lies in proposing a dataset, introducing a granularity consistency loss, and transforming prediction tasks of different granularities into multi-class classification tasks.
The main issue, however, lies in the experiments. The results on BSDS are nearly comparable to those of SAM's zero-shot performance. Additionally, the constructed dataset was expected to be entirely at the instance level when the granularity is set to 0, but this requirement has not been fully met. |
Lightly AI-edited |
|
Generalizable and Consistent Granular Edge Prediction |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a new edge detection paradigm, termed Granular Edge Prediction (GEP), which redefines edge detection from binary classification into a granularity-aware prediction problem. Instead of simply detecting whether a pixel belongs to an edge, the proposed framework predicts an edge granularity level that reflects edge consistency and perceptual thickness.
To support this task, the authors construct a large-scale synthetic dataset called SGED, generated via SAM 2 segment masks and multi-level granularity transformations. They also design a novel Generalizable Edge Prediction Model (GEPM) equipped with a graph-based edge representation ensuring per-edge consistency, and an Edge Consensus Loss that enforces distributional agreement along the same edge.
1. Results show that GEPM achieves performance near or better than that of supervised methods in zero-shot settings on several benchmarks.
2. The Edge Consensus Loss based on Jensen–Shannon divergence effectively enforces intra-edge consistency in predictions.
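As a rough illustration of what such a consensus term can look like, here is a minimal sketch (my own reconstruction, not the authors' exact formulation; the per-pixel edge grouping `edge_ids` is an assumed input): penalize, for every edge, the mean Jensen–Shannon divergence between each pixel's predicted granularity distribution and the edge's average distribution.

```python
import torch

def edge_consensus_loss(logits, edge_ids, eps=1e-8):
    """Encourage pixels on the same edge to predict the same granularity distribution.

    logits:   (N_pixels, num_levels) granularity logits for edge pixels
    edge_ids: (N_pixels,) integer id of the edge each pixel belongs to
    """
    p = logits.softmax(dim=-1)                       # per-pixel distributions
    loss, count = logits.new_zeros(()), 0
    for e in range(int(edge_ids.max()) + 1):
        pe = p[edge_ids == e]
        if pe.shape[0] < 2:
            continue
        m = pe.mean(dim=0, keepdim=True)             # edge-level "consensus" distribution
        m2 = 0.5 * (pe + m)                          # mixture used by the JS divergence
        js = 0.5 * (pe * ((pe + eps) / (m2 + eps)).log()).sum(-1) \
           + 0.5 * (m  * ((m  + eps) / (m2 + eps)).log()).sum(-1)
        loss = loss + js.mean()
        count += 1
    return loss / max(count, 1)
```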
1. SGED is entirely synthetic; the paper lacks rigorous validation of how well its granularity annotations align with human perceptual judgments.
2. The distinction between “granular consistency” and previously studied “multi-level edge fusion” (e.g., in UAED or MuGE) could be more clearly articulated.
3. The quality of the SGED dataset depends heavily on SAM’s segmentation quality, which may introduce structured bias.
1. How closely do the generated granularity annotations approximate human perceptual edge thickness? Have user or psychophysical studies been considered?
2. Could SGED overfit to SAM’s segmentation priors, limiting real-world generalization?
3. What is the training cost (GPU hours, parameter count), and how scalable is the proposed GEPM framework?
4. Would combining GEP with downstream tasks (e.g., boundary-aware segmentation) improve overall performance?
5. How sensitive is the model to the granularity level discretization (e.g., 36 vs. 10 levels)? |
Fully AI-generated |
|
None to Optima in Few Shots: Bayesian Optimization with MDP Priors |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper targets the important problem of black-box optimization where only few evaluations are possible, but optimization trajectories of related tasks are available to speed up the optimization. The paper uses a prior-fitted neural network (PFN) as a model in Bayesian optimization. The main contribution is a new method that incorporates optimization trajectories on existing tasks as an MDP prior into the PFN while employing MAML to adapt the trained PFN to the target task.
* Tackles an important established problem setting highly relevant to the ICLR community
* Usually PFNs are trained with synthetic data; incorporating actual evaluations is an interesting research direction and a good fit for the tackled problem setting
* Combining MAML and PFN is an interesting methodological contribution
* Follows best practices for reproducibility
* The experimental setup, including baselines, protocol, and datasets, is described in sufficient detail.
* Sufficient discussion on related works that use optimization trajectories of related tasks
* Clearly structured and understandable writing
* The paper does not prominently discuss the limitations of its method and experimental design
* The ablation study is not convincing. The results of the ablation study as discussed in the text may not hold in the target range of 20 evaluations. Results are very noisy here, and the un-ablated algorithm is not the preferred choice. The ablation study is also not performed on the Covid-B and Cancer-B benchmarks. Why?
* The approach does not take categorical parameters into account. What about integer parameters? This limits its applicability.
* It is unclear how the chosen related tasks affect the approach and evaluation. How many related tasks are necessary? Do some approaches work better for different numbers of related tasks? An explicit evaluation of this would strengthen the paper. What about the case of misleading related tasks? How related are your tasks?
* The authors use a different number of iterations for each benchmark (20, 40, 90). To keep the evaluation consistent, showing up to, e.g., 20 evaluations everywhere (which is their target setting) would be better. For example, showing 40 evaluations for the cancer dataset, where your method performs worse than TNP at 20, seems hand-picked; what was the reason here? See also my comment on the ablation study.
* How were the final hyperparameter settings of your method chosen? The appendix lists a search space of hyperparameters for your method. Did you tune them? How? Did you also tune the baselines?
* The regret plots for Covid / Cancer (Figure 3) only have one y-axis tick label ($10^{-1}$), making it impossible to tell the approaches apart at all.
* It is unclear how beneficial meta-learning across related tasks is compared to other speed-up techniques for BO (e.g., multi-fidelity, cost-aware, or use of expert priors). An empirical evaluation would be best, but a discussion in related work is also missing. This is lacking in prior works, too.
Minor
* Line 180 "It [PFNs] outperforms traditional GP (see Figure 1)": while making an argument for the use of PFNs over GPs, you are comparing a meta-learned PFN to a standard GP, which is not an apples-to-apples comparison.
* A small discussion of NN-based BO in general would be helpful. In particular, a discussion on PFNs, their inherent meta learning capabilities and their existing applications to BO would be beneficial.
* The x-axis of Figure 4 is illegible
* The paper shows normalized regret. How meaningful are the improvements in absolute terms for the respective benchmark settings?
* How many seeds were used?
* You follow Maraval et al. in selecting 6 of the 16 HPO-B problems; how were they selected? How would the results look on all the problems? |
Fully human-written |
|
None to Optima in Few Shots: Bayesian Optimization with MDP Priors |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes PROFBO, a procedure-informed Bayesian optimization framework aimed at solving black-box optimization problems in very few evaluations (typically $T \le 20$). The central idea is to leverage optimization trajectories from related source tasks, modeled as MDP priors, so that the target-task BO does not have to learn an efficient search policy from scratch. Concretely, the method (i) trains lightweight DQN agents on source tasks to generate optimization trajectories, (ii) embeds these trajectory distributions $p(T^{(i)})$ into a prior-fitted Transformer (PFN) to act as a BO surrogate, and (iii) fine-tunes this PFN with MAML and positional encodings to transfer procedural knowledge while avoiding overfitting to specific source-task temporal patterns. At test time the BO loop is standard (context $\to$ posterior $\to$ acquisition), but the surrogate now reflects trajectory-informed priors. Experiments on two new real-world few-shot drug discovery benchmarks (Covid-B, Cancer-B) and on HPO-B show that PROFBO achieves lower regret and better early ranks than meta-surrogate baselines (META-GP, FSBO, TNP) and recent trajectory/meta-BO methods (MAF, NAP, OptFormer), while being more training-efficient than NAP.
* Clear and motivated problem setting: BO in regimes where $T \le 20$ and evaluations are expensive, which is where standard asymptotic BO results are less useful.
* Procedural transfer via MDP priors is novel in this particular PFN + MAML setup and allows the surrogate to internalize good search patterns rather than only response surfaces.
* The PFN backbone gives a principled way to do single-pass Bayesian inference over contexts, leading to faster inference than GP posteriors and enabling the use of non-GP priors.
* Ablation studies isolate the contribution of MDP priors, MAML, and positional encodings and show that each of them improves early-iteration regret.
* Strong empirical results on newly proposed real-world discrete/continuous drug-like tasks (Covid-B, Cancer-B) and on HPO-B, outperforming several state-of-the-art meta/few-shot BO baselines.
* Training-time comparison with NAP shows the proposed modular two-stage training (RL for priors, supervised PFN fine-tuning) is more efficient than end-to-end RL with transformers.
* The approach depends critically on the availability and quality of source-task optimization trajectories; the paper does not quantify how performance degrades when source tasks are few, noisy, or mismatched with the target.
* The MDP prior is learned with per-task DQN on (possibly) large discrete action spaces, and although the authors optimize it (subset of actions, batched GPU generation), this can still be expensive in domains without precollected meta-data.
* The method is benchmarked mostly on structured or tabular/meta datasets with fixed embeddings (26D molecule embeddings, HPO-B); it is less clear how the approach would behave on high-dimensional continuous design spaces where actions cannot be discretized so easily.
* No acquisition-function–level comparison is made under exactly matched hardware/time budgets; some of the reported gains could be due to PFN’s fast forward pass rather than the MDP prior per se.
* There is no formal analysis of negative transfer: in principle a trajectory prior that encodes a suboptimal policy could bias the PFN and hurt few-shot performance.
1. How sensitive is PROFBO to the number and diversity of source tasks? For example, if only a small subset of Covid-B problems is available for training, does the method still outperform TNP/META-GP on the held-out problems?
2. In Section 4.2, the MDP defines the action space as candidate points from the dataset. How would PROFBO be instantiated for continuous domains where we cannot enumerate actions and cannot easily train DQN on a discrete subset?
3. The PFN head outputs a discretized bar distribution over $y$. When computing EI/UCB from this surrogate, are the baselines given access to the same acquisition budget (number of candidate points scored per iteration)?
4. The two-stage training (GP-like pretrain, then MDP-prior fine-tune with MAML) is argued to prevent overfitting to trajectory order. Can the authors show a small example where training only on trajectory priors actually harms OOD target tasks?
5. For the drug-discovery tasks, can the authors confirm that target tasks are not leaked into the MDP-prior training stage (i.e., that there is a strict meta-train/meta-test split at the trajectory level)? |
Fully AI-generated |
|
None to Optima in Few Shots: Bayesian Optimization with MDP Priors |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces the ProfBO method, which is a prior-fitted network model for transfer learning / metalearning in BO. A key aspect of the method is that for a particular transfer learning problem instance, the PFN is fine-tuned using the source tasks to produce an MDP prior that trains the model on optimization trajectories. DQN in particular is used for modeling optimization trajectories, and a MAML-style approach is used for incorporating them into fine tuning.
The focus of the paper is on settings where a small (20ish) number of evaluations can be made on the target task, and so metalearning from prior tasks is important. The paper shows good empirical performance in this setting on some new test problems and standard benchmarks.
* The main contribution of the paper is the MDP prior and the associated incorporation of optimization trajectories into a PFN framework. This is novel and quite interesting. The general strategy also seems like it could be useful outside the transfer learning / metalearning setting that is the focus of the paper.
* The paper also introduces new problems that can be used for evaluating transfer learning methods that are based on real-world problems and seem that they will be useful for future work.
* The ablation study does a good job of showing the importance of various aspects of the optimization trajectory learning and providing insight into how that should be handled.
* Feasibility of learning optimization trajectory policy on source tasks: The method includes learning a DQN policy network on the source tasks during fine-tuning. As I understand it, this requires being able to make new evaluations of the source tasks, and in particular, not just using whatever optimization trajectory you happen to have from some earlier optimization on this task. Is that correct? The typical assumption in metalearning for BO is that the data you have from each source task come from some previous run of BO or the like on that source task, so the number of points per source task would be in the neighborhood of 20-40. How many datapoints per source task were used here? Appendix C.2 says "In Cancer-B, we utilize three meta-training datasets (6T2W, NSUN2, RTCB), comprising 437,634 evaluations in total, and two meta-test datasets (WHSC, WRN), totaling 291,756 evaluations." Does that mean that all O(10^5) points were used as the dataset for learning the DQN on the source tasks? If so, it would present a serious problem for the motivation of the paper. I cannot think of many realistic scenarios where one would be able to run orders of magnitude more evaluations on the source tasks than on the target task. In particular, it does not seem to be the case in the scenarios used to motivate the paper: "In practice, these related source tasks can be the docking scores of a set of molecules evaluated on different receptors." If for one receptor (target task) it takes hours/days to do an evaluation, it is implausible that it would take only seconds/minutes to do the evaluation for some different receptor.
* The paper is broadly framed as being about few-shot learning, but it is really about transfer learning. These are not the same thing; in few-shot learning one generally may not have access to the similar source tasks required by the method. Even the title of the paper, "None to optima", may be confusing: is it really "None" when source tasks are required?
* The potential for negative transfer is an important issue in any transfer learning / metalearning method, and is not investigated in the paper at all. I'd like to see what happens to performance as unrelated tasks are added as source tasks.
* Understanding when trajectory information is helpful: The MDP prior and model for optimization trajectories is the core contribution of the paper. Internally to the paper, the claim that this is important/valuable seems well-supported by the ablation studies, which show that eliminating either positional encoding or the MAML training algorithm significantly deteriorates performance. However, prior work has come to a different conclusion. The paper for the NAP method (Maraval et al., 2023) has a whole section (Property 3.2) claiming that history-order invariance is important for Meta-RL generally, and that positional encoding should not be used. NAP performs nearly as well as ProfBO. So while removing positional encoding degrades ProfBO, other methods (NAP) intentionally exclude positional encoding and yet perform nearly as well as ProfBO with positional encoding. What is the source of this seeming discrepancy?
* The language around optimization trajectories and sampling strategies seems confused. Section 4 states "Under the GP assumption in standard BO, at iteration $t$, all queries in $D_{t-1}$ are treated as i.i.d. uniformly distributed, ignoring the fact that $D_{t-1}$ is actually a BO trajectory." This makes it sound like the GP is making distributional assumptions on the X input locations, which is not the case. I think the text is trying to get at the notion of exchangeability, which is a related albeit different concept.
* Number of samples: The paper is framed very strongly around 20 iterations being the upper bound for the number of samples. E.g., page 1 "fewer than 20 evaluations"; page 3 "within a few shots, e.g. $T \le 20$"; page 4 "$T$ can be fewer than 20", etc. Then suddenly in Section 5.2 this is changed to "within 20 or 40 iterations." This is of course because the method did not perform particularly well at 20 iterations on the Cancer problem and required 40 iterations to beat TNP. This raises the obvious question of what happens on the Covid problem if the number of iterations is taken out to 40, and furthermore suggests that there should be some acknowledgement that in some settings more iterations are required.
* It would be helpful to understand when/why more iterations are required. The HPO-B problems were run out to 90 iterations because that's what was done in the work this builds most directly on. We see that there are significant improvements from iteration 20 to iteration 90. So while it may be correct to say that ProfBO is the best of the methods at iteration 20, it is not correct to say that ProfBO has solved the problem well in fewer than 20 evaluations; the regret is still very high if one were using ProfBO and had to stop after 20 iterations. Is well-performing few-shot learning just not possible here? When is it possible?
* Please clarify how many points are being used as the dataset for each source task. If this is more than the 20-40 budget used in the target task, please justify. |
Fully human-written |
|
None to Optima in Few Shots: Bayesian Optimization with MDP Priors |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents ProfBO (Procedure-informed Bayesian Optimization), a framework that accelerates black-box optimization by leveraging priors learned from previous optimization trajectories on related tasks. ProfBO models a prior distribution of past optimization trajectories using a Markov Decision Process (MDP), enabling it to transfer knowledge gained from past optimization runs on related tasks to new tasks. The authors evaluate ProfBO on synthetic and real-world benchmarks, demonstrating that ProfBO consistently outperforms baselines from the literature, achieving faster convergence and lower regret within a very small number of function evaluations. This work shows that leveraging knowledge from past optimization of related tasks can substantially improve sample efficiency in Bayesian optimization.
Originality:
ProfBO introduces an original and conceptually elegant idea: learning procedural priors over past optimization trajectories to improve few-shot Bayesian optimization. The method treats prior optimization runs as Markov Decision Processes, enabling it to capture the dynamics of optimization itself. This “procedure-informed” view is a novel way to transfer optimization knowledge and differs from existing approaches.
Quality:
The paper is technically strong and empirically thorough. The methodology is well-motivated. Experiments are extensive, covering a large number of benchmark tasks and comparisons against baselines from the literature. ProfBO consistently achieves faster convergence and lower regret across tasks within small evaluation budgets.
Clarity:
The paper is clearly written and well-structured. The main components of the ProfBO method are clearly motivated and explained. The narrative is easy to follow, and the paper effectively communicates both intuition and implementation details.
Significance:
ProfBO demonstrates that procedural priors can improve sample efficiency of Bayesian optimization. This is significant in real-world domains such as drug design, materials design, etc. where sample efficiency is important.
While ProfBO demonstrates strong relative performance across benchmarks, including real-world-inspired tasks such as the COVID and Cancer benchmarks, the paper does not provide discussion or qualitative analysis of the meaningfulness of the obtained solutions. It remains unclear whether the best-found objective values correspond to scientifically relevant or near-optimal outcomes in these applications, or primarily represent improvements over baselines in a normalized performance sense. Adding brief commentary or case examples illustrating how close these solutions are to practically desirable or known-good results would strengthen the empirical claims and highlight the real-world impact of the method.
1: Could the authors provide discussion or intuition on why ProfBO achieves larger improvements over baselines on certain tasks but smaller gains on other tasks? Maybe leveraging prior optimization trajectories is more useful for some tasks than others? Do the authors have any intuition for why that is or what types of tasks ProfBO might be most useful for?
2: In experiments and method development, did the authors encounter any scenarios when procedural priors from source tasks are poorly aligned with the target? Is it possible that such a scenario could lead to ProfBO having worse performance than other methods rather than better performance?
3: For the biomedical benchmarks, do the best-found objective values correspond to scientifically meaningful or near-optimal solutions? Adding qualitative insight or examples would help contextualize the results. |
Fully AI-generated |
|
Constructing coherent spatial memory in LLM agents through graph rectification |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents a novel framework, LLM-MapRepair, for enabling large language model (LLM) agents to build and maintain coherent spatial memory during textual navigation tasks.
Unlike prior works that rely solely on LLM contextual reasoning, the proposed approach incrementally constructs a topological graph of the environment and introduces mechanisms for error detection, localization, and repair.
1. Unlike prior LLM-based navigation works that depend purely on context-window reasoning, this study formalizes a persistent, structured memory mechanism that records graph evolution and enables rollback and repair.
2. Current LLMs cannot sustain consistent spatial understanding over long reasoning horizons due to context limits and forgetting.
Emulating human-like incremental mapping and introducing self-repair mechanisms are convincingly argued as necessary steps for robust embodied reasoning.
1. Evaluation is restricted to text-based static environments and lacks validation in dynamic or multimodal contexts.
2. The link to embodied or real-world robotic tasks could be emphasized more clearly.
3. The evaluation domain is narrow, so the generality of the proposed approach for other tasks (e.g., visual navigation, embodied dialogue) remains uncertain.
4. There is no definition or quantitative analysis of map scale. The paper never specifies how many nodes or edges each constructed navigation graph typically contains, nor does it provide any complexity analysis or scaling study.
As a result, it remains unclear how large a map the proposed framework can effectively handle before memory usage, repair cycles, or LLM context length become bottlenecks.
See Weaknesses. |
Fully AI-generated |
|
Constructing coherent spatial memory in LLM agents through graph rectification |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 4: excellent
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The authors propose a framework for repairing navigation graphs constructed by LLMs.
The authors have a really great idea vis-à-vis their framework, which, in addition to its intended use for repair, can also be used to diagnose problems in LLM navigation.
*Figure 2 needs more explanation
* The authors propose a framework and then run an experiment that is confusing; it is not clear what they hope to accomplish. I recommend that in future work they focus on showing why LLMs perform best/work across their three sections and also run some sensitivity analyses on the Edge Impact Score, particularly in light of complexity.
*They also mention they want to enable LLMs to handle complex long texts, but I would focus on this after they have a more stable approach.
*Why would someone need to use your framework?
*How could the results be used for future improvements?
*What are the limitations of your framework?
*How sensitive is your approach/scores to prompt variability? |
Fully human-written |
|
Constructing coherent spatial memory in LLM agents through graph rectification |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work proposes LLM-MapRepair, a framework that enables large language models to build and maintain consistent spatial maps from textual navigation instructions. Instead of relying solely on context reasoning (which leads to memory limits and inconsistencies), the method incrementally constructs a topological graph and repairs it through three integrated stages: conflict detection, error localization, and version-controlled repair. A central contribution is the introduction of a Version Control mechanism that records every graph edit for rollback and causal tracing, and an Edge Impact Score that prioritizes low-risk repairs based on reachability and path usage. Evaluated on a refined version of the MANGO benchmark, the approach significantly improves map correctness and robustness, especially for long-horizon navigation with accumulated inconsistencies. The experiments highlight the importance of history-aware reasoning for maintaining coherent spatial memory in LLM agents.
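For readers who want the Edge Impact Score spelled out, here is my rough reconstruction of the idea in code (the exact definition is the authors'; the reachability/usage terms, the weight `alpha`, and the `path_usage` bookkeeping structure are illustrative assumptions of mine):

```python
import networkx as nx

def edge_impact_score(graph, edge, path_usage, alpha=0.5):
    """Illustrative impact score for modifying `edge` = (u, v) in the spatial map.

    graph:      nx.DiGraph built so far from the navigation text
    path_usage: dict mapping (u, v) edges to how many recorded trajectories used them
    A higher score means the edge is riskier to touch, so low-score edges are repaired first.
    """
    u, v = edge[0], edge[1]
    # Reachability term: fraction of nodes reachable from u that are lost if the edge is removed.
    pruned = graph.copy()
    if pruned.has_edge(u, v):
        pruned.remove_edge(u, v)
    before = len(nx.descendants(graph, u))
    after = len(nx.descendants(pruned, u))
    reach_loss = (before - after) / max(before, 1)
    # Usage term: how often the edge appears in previously validated navigation paths.
    usage = path_usage.get((u, v), 0) / max(sum(path_usage.values()), 1)
    return alpha * reach_loss + (1 - alpha) * usage
```

If the paper's actual score follows roughly this shape, the usage term simply orders repairs so that heavily relied-upon edges are edited only when the evidence is strong.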
The paper:
- Clearly identifies and motivates the problem of inconsistent spatial memory in LLMs
- Proposes a novel and inventive solution: a modular and interpretable framework (LLM-MapRepair) for detecting and fixing map inconsistencies
- Introduces Version Control for persistent, history-aware reasoning, enabling rollback and causal tracing
- Defines an Edge Impact Score to prioritize low-risk, high-impact repairs using a principled heuristic
- Provides strong experimental validation with clear ablations showing complementary effects of each module
- Cleans and refines the MANGO benchmark, improving evaluation quality for future work
- Some of the figures make the approach easier to follow, and the approach itself is conceptually well-motivated
- The paper is hard to read in its current form. Some examples below:
- The paper uses not-so-common terms without defining them, e.g., “local spatial cognition,” “cognitive biases,” “structural integrity,” and “contextual pressure.”
- In the introduction, the solution appears before the setup: what exactly are nodes and edges? I was only able to figure it out after reading the methods section.
- Problem formulation and evaluation criteria are not specified clearly.
- The motivation for using LLMs (as opposed to simpler map builders) is not provided.
- Evaluated on a single dataset, and method appears tailored to it.
- No baseline comparisons beyond GPT-4o, making it difficult to assess the method’s relative performance.
- The paper lacks a deeper analysis explaining the method’s behavior, limitations, or challenges in solving the problem.
- Why do we need an LLM for this mapping problem at all? What capabilities are essential here?
- The paper motivates with "human-like cognition", but what evidence suggests human spatial memory is graph-based?
- Fig. 1 mentions forgetting in the LLM context; why would information in-context be "forgotten"?
- If graph errors stem from LLM reasoning, how does version control fix them? Maybe shortening the reasoning context? Please motivate more clearly.
- L158: How do 1k observation steps exceed the context? What is the average number of tokens per observation?
- The conflict taxonomy is confusing; aren't directional conflicts fundamentally topological?
- "Delayed/entangled/silent" conflicts are introduced, but how is it tied to the solution later?
- Why is "usage" a factor in resolving edge conflicts if all conflicts will be resolved anyway? How is it computed?
- Why assume the true error lies between detected conflicts and their LCA?
- The paper mentions Conflict Revelation Gain, but replaces it with a score function defined earlier. Why is that a good approximation?
- How is each individual component of the score computed? |
Lightly AI-edited |
|
REFLEX-Med: Reinforcement for Label-Free Explainability in Unified Medical Reasoning |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors propose to use GRPO to improve the visual grounding and cross-modal provenance (without labels for these) of medical MLLMs. They evaluate their method on the medical VQA task, showing improved answer quality and faithfulness and claiming reduced hallucination and attention drift. Experiments on cross-modality (across medical imaging modalities) and zero-shot generalization are also presented.
1) Provides a label-free framework for improving visual grounding and provenance, thereby aiming to reduce hallucination in large medical VLMs.
2) The framework shows good results in answer utility, cross-modal performance, and zero-shot performance, and could be easily transferred to different backbones and settings.
1) The presentation of the paper could have been much better. The text is hard to follow for various reasons, some of which are listed below:
a) The density of custom terminology is high; the core ideas are obscured by the constant use of these terms.
b) The structure of the paper could be improved so that the reader can follow it easily; it is needlessly convoluted.
c) Some of the mathematical notation is not defined (e.g., $c$ and $G$ in line 231).
d) More detailed captions for some figures (Figures 3, 6, and 7) would help each figure be understood on its own.
e) Some of the framework's techniques/processes are described redundantly throughout the paper.
2) The paper claims to resist "attention-think drift" without providing any substantial evaluation of this claim.
3) The paper makes claims about the reasoning capabilities of the framework; this is briefly analyzed in a subsection through the <think> component of the model's responses, but the details of quantitatively evaluating this <think> component, which would actually show the model's reasoning, are not presented.
4) Some recent and relevant papers that employ GRPO for medical reasoning are mentioned in the related work section (MedVLM-R1, MedReason), but they were not used as baselines, and no justification for why they were not used as baselines is provided.
1) The paper uses many thresholding parameters. How sensitive is the framework to the selection of these parameters? Was any such study performed?
2) Is there any analysis of the sensitivity of the framework to the choice and quality of the frozen judge? |
Fully human-written |
|
REFLEX-Med: Reinforcement for Label-Free Explainability in Unified Medical Reasoning |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces REFLEX-Med, a RL framework designed to improve the explainability of medical VLM without relying on costly human-annotated rationales. The core idea is to instantiate "label-free explainability" through two verifiable prerequisites: faithful visual grounding (where the model looks) and bi-directional cross-modal provenance (why it looks there).
To achieve this, the authors propose two novel reward signals computed by a frozen, pre-trained medical VLM acting as a "judge":
1. A **Visual Fidelity Reward ($R_{VFR}$)**, which encourages the policy's attention (saliency map) to align with the attention of an "anchor" (ground-truth) statement.
2. A **Bi-modal Provenance Reward ($R_{BPR}$)**, which enforces semantic consistency in the embedding space between the model's generated answer, the anchor text, and the input image.
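As a concrete reading of these two rewards, here is a minimal sketch of how I understand them (my own reconstruction from the paper's description; the binarization threshold, the IoU form, and the cosine-similarity margins are assumptions, not the authors' exact definitions):

```python
import torch
import torch.nn.functional as F

def visual_fidelity_reward(policy_saliency, anchor_saliency, thresh=0.5):
    """IoU between binarized saliency maps (from the frozen judge) for the policy's
    statement and the anchor statement; shapes (H, W), values in [0, 1]."""
    p = (policy_saliency > thresh).float()
    a = (anchor_saliency > thresh).float()
    inter = (p * a).sum()
    union = ((p + a) > 0).float().sum()
    return (inter / union.clamp(min=1.0)).item()

def bimodal_provenance_reward(ans_emb, anchor_emb, img_emb, tau_tt=0.8, tau_it=0.3):
    """Reward 1 only if the generated answer agrees with the anchor in text space AND
    stays consistent with the image in the judge's joint embedding space."""
    tt = F.cosine_similarity(ans_emb, anchor_emb, dim=-1)
    it = F.cosine_similarity(ans_emb, img_emb, dim=-1)
    return float((tt > tau_tt) & (it > tau_it))
```

Under this reading, both signals are computable from the frozen judge and the gold answer alone, which is what makes the framework label-free beyond standard VQA supervision.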
1. **Originality & Significance:** The paper proposes a novel and important shift in medical VLM explainability, moving from subjective (and annotation-hungry) rationales to objective, "label-free" verifiable criteria. The core concept of using a frozen judge to reward faithful grounding ($R_{VFR}$) and semantic provenance ($R_{BPR}$) is a creative and promising approach.
2. **Problem Formulation:** The work correctly identifies a critical failure mode of current VLMs ("attention-think drift") and proposes a concrete mechanism to penalize it. The goal of instantiating "process verifiability" (where and why the model looked) is highly relevant for high-stakes domains like medicine.
3. **Methodology:** The design of the two reward functions is intuitive and directly maps to the stated goals. Using a frozen judge to provide a stationary reward signal and prevent reward hacking is a sound design choice within an RL framework.
1. **Unexplained Catastrophic Performance on PathVQA:** The most glaring weakness is the model's performance on the PathVQA dataset, as shown in Table 1. The Qwen2.5-VL (SFT) baseline achieves 87.8% (closed) and 79.3% (open). In contrast, REFLEX-Med-7B scores 80.9% (closed) and a shockingly low 30.3% (open). This is a massive performance degradation on an in-domain dataset. The paper fails to acknowledge, analyze, or explain this result. This strongly suggests that the proposed reward framework may be fundamentally flawed or, at best, highly detrimental to specific modalities like pathology. This single result undermines the paper's primary claims of improving answer quality.
2. **Unvalidated Reward Signal Quality:** The methodology critically depends on the frozen BioMedCLIP judge providing accurate saliency maps and meaningful embeddings. The paper provides zero evidence that BioMedCLIP is a reliable judge, especially for saliency. Medical grounding models are notoriously unreliable outside of the domain they were trained on (e.g., CXR). If the judge produces low-quality masks for CT or pathology images, the $R_{VFR}$ signal is optimizing the policy for noise, which would explain the poor performance on PathVQA. The authors must validate the judge's performance before using it as a source of truth.
3. **Outdated and Limited Experimental Setup:**
* **Baselines:** The baselines are missing more recent, state-of-the-art VLMs, such as those from the InternVL series or the newer Qwen-VL models.
* **Judge Model:** BioMedCLIP is an outdated choice. Newer, more powerful medical foundation models (e.g., BIOMEDICA) trained on far larger and more diverse datasets exist and would almost certainly provide a more reliable reward signal.
* **Scale:** The experiments are limited to 3B and 7B models.
4. **Marginal Improvements in Ablations:** As seen in Figure 4, the improvements from adding the $R_{VFR}$ and $R_{BPR}$ rewards are often minimal. For example, in the rightmost panel (Train on X-Ray), the test accuracy on MRI for the full model is 93.0%, while the ablation (w/o R-VFR + R-BPR) is 92.7%. A 0.3% gain is not a compelling argument for the added complexity of the method, especially given the catastrophic failure on PathVQA. The low gains on MRI data also raise questions about the judge's effectiveness on this modality.
1. Can you please provide a detailed explanation for the massive performance drop on the PathVQA dataset (Table 1) when applying REFLEX-Med, compared to the simple SFT baseline? Why does your method perform so much worse (87.8% -> 80.9% c, 79.3% -> 30.3% o)?
2. How did you validate the quality of the saliency maps generated by the frozen BioMedCLIP judge? Can you provide quantitative or qualitative evidence that these masks are accurate, especially for the non-CXR modalities (PathVQA, CT, MRI)? Is it possible that your model is simply learning to match a *bad* set of saliency maps?
3. The improvements in the cross-modal ablation (Figure 4) are very marginal, especially for the MRI modality (e.g., 0.3% gain in the right panel). Why do you think the gains are so small? Does this suggest the judge model is ineffective on MRI, or that the rewards themselves have limited impact?
4. Could you clarify if the RL implementation is online or offline? The use of GRPO and sampling from the policy suggests an online setup, but this is not explicitly stated. |
Fully AI-generated |
|
REFLEX-Med: Reinforcement for Label-Free Explainability in Unified Medical Reasoning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
Motivation: medical VQA models often give plausible answers without image-grounded reasoning (“answer-right, look-wrong”), and existing explainability methods rely on scarce rationale labels.
Proposal: REFLEX-Med, a GRPO-based RL framework with a Visual Fidelity Reward that aligns text-conditioned saliency with an anchor from the gold answer and a Bi-modal Provenance Reward that enforces text–text and image–text agreement using a frozen VLM.
Results: consistent gains over vanilla GRPO on VQA-RAD, SLAKE, and PathVQA, improved cross-modality transfer, and qualitatively tighter attention maps, with ablations showing both rewards matter.
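Following the description of VFR and BPR above, here is a minimal sketch of how I read the two rewards, assuming the frozen judge is a CLIP-style medical encoder that exposes a text-conditioned binary saliency map and text/image embeddings; the `judge` interface and the default thresholds are my assumptions, chosen only to illustrate the thresholded-IoU and cosine-margin logic, not the authors' implementation.

```python
import numpy as np

def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """IoU between two binary saliency masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def vfr_bpr(judge, image, policy_statement, anchor_statement,
            tau_iou=0.5, tau_tt=0.8, tau_it=0.5):
    """Hypothetical VFR/BPR computation with a frozen judge.

    `judge.saliency(image, text)` is assumed to return a binary mask, and
    `judge.embed_text` / `judge.embed_image` to return embedding vectors.
    """
    # Visual Fidelity Reward: saliency overlap with the gold-answer anchor statement.
    m_policy = judge.saliency(image, policy_statement)
    m_anchor = judge.saliency(image, anchor_statement)
    r_vfr = 1.0 if iou(m_policy, m_anchor) >= tau_iou else 0.0

    # Bi-modal Provenance Reward: text-text and image-text agreement must both hold.
    e_policy = judge.embed_text(policy_statement)
    e_anchor = judge.embed_text(anchor_statement)
    e_image = judge.embed_image(image)
    r_bpr = 1.0 if (cosine(e_policy, e_anchor) >= tau_tt and
                    cosine(e_policy, e_image) >= tau_it) else 0.0

    # These bonuses are added on top of the usual format/answer rewards for GRPO.
    return r_vfr, r_bpr
```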
1. The paper addresses a timely and important problem in medical vision–language modeling, namely improving answer grounding without extra process supervision.
2. VFR optimizes IoU between text-conditioned saliency masks from a frozen medical VLM, and BPR enforces text–text and image–text agreement through explicit thresholds. The pipeline is straightforward to implement and uses only answer labels, avoiding rationales, region annotations, or segmentations by deriving anchors and embeddings from the gold answers.
3. Declarativizing questions and answers into canonical statements yields a single interface for computing saliency and embeddings, which unifies closed-ended and open-ended VQA under one policy.
4. The paper reports results across multiple medical VQA benchmarks and modalities with in-domain and out-of-domain tests, includes cross-modality transfer analyses, and provides ablations that isolate the contribution of VFR and BPR.
1. Faithfulness is assessed through a single frozen medical VLM judge for both saliency and embeddings. Improvements could reflect increased agreement with that judge rather than truthfulness to the image. There is no external grounding metric or human assessment of localization to break this circularity.
2. The saliency maps come from the judge without calibration. IoU between two unvalidated masks might not correlate with clinical localization quality. The paper lacks any quantitative localization benchmark or sanity checks on the saliency mechanism.
3. BLEU, ROUGE, and BERTScore are known to be poorly aligned with clinical correctness in free-form medical text. Without clinically grounded scoring or exactness checks on key entities, the reported open-ended gains may overstate clinical utility.
1. Since VFR and BPR use a single frozen judge for both saliency and embeddings, the policy may align to that judge rather than the image and evaluation can become circular. How do you demonstrate that the gains reflect real grounding? Do results persist when you replace the judge with a different model after training?
2. The method uses fixed hyperparameters for the saliency quantile and thresholds. A robustness analysis to these choices would improve the contribution, as this could affect learning dynamics and reported gains. |
Moderately AI-edited |
|
REFLEX-Med: Reinforcement for Label-Free Explainability in Unified Medical Reasoning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper presents an RL fine-tuning framework for medical VLMs that aims to make explanations auditable without rationale labels. The method converts each (question, answer) pair into a declarative statement and computes two signals from a frozen medical vision-language encoder (BioMedCLIP): (1) a Visual Fidelity Reward (VFR) that grants a bonus if the text-conditioned saliency for the model’s statement sufficiently overlaps (IoU) with the saliency for an “anchor” statement built from the dataset ground-truth answer; (2) a Bi-modal Provenance Reward (BPR) that requires both text-text and image-text cosine similarities to exceed margins. These rewards are added to the conventional format and answer rewards and optimized with curriculum GRPO (i.e., closed-ended first, then open-ended QA). Experiments are performed on multiple datasets, including both in-domain and out-of-domain evaluation.
1. The paper is well-organized and well-written.
2. The motivation is sound.
3. The experiments are conducted on six datasets.
1. The comparison with previous works should be improved.
- Label-Free RL has been widely explored [1][2][3][4].
- The core contributions of this work are the proposed faithful visual grounding and bi-directional cross-modal provenance rewards for RL training; however, these have already been introduced in previous studies [5][6][7].
2. The definition of “Label-free” is not clear, as the ground-truth anchors are provided during training.
3. The main claim of this paper is that the proposed method can provide auditable and faithful explanations.
- However, the paper does not include experiments to support this claim. For example, for faithfulness, the authors should evaluate the saliency maps against human-annotated ROIs.
- In addition, external benchmarks for hallucination and robustness are missing, which weakens the core anti-hallucination argument.
4. The experimental comparisons and ablations are incomplete.
- Medical RL methods are not included.
- The ablation study of the curriculum setting is missing.
5. It is unclear how the saliency map $S(I, t)$ is computed.
Refs:
[1] Absolute Zero: Reinforced Self-play Reasoning with Zero Data, ArXiv, 2025.
[2] Learning to Reason without External Rewards, ArXiv, 2025.
[3] Maximizing Confidence Alone Improves Reasoning, ArXiv, 2025.
[4] Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO, ArXiv, 2025.
[5] Grounded Reinforcement Learning for Visual Reasoning, ArXiv, 2025.
[6] X-VILA: Cross-Modality Alignment for Large Language Model, ArXiv, 2024.
[7] Reinforced Cross-modal Alignment for Radiology Report Generation. ACL, 2022.
See weaknesses. |
Lightly AI-edited |
|
REFLEX-Med: Reinforcement for Label-Free Explainability in Unified Medical Reasoning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces REFLEX-Med, a reinforcement learning framework designed to provide process verifiability for unified medical VQA without requiring human annotations. The authors argue that current explainability methods, such as post-hoc saliency maps or chain-of-thought approaches that demand extensive annotations, have inherent flaws. To address this, the paper proposes two core verifiable pre-conditions: "faithful visual grounding" and "bi-directional cross-modal provenance." Specifically, the method employs a frozen medical vision-language model as a "judge" to compute two novel reward signals, a Visual Fidelity Reward (VFR) and a Bi-modal Provenance Reward (BPR). These rewards are used to fine-tune an LVLM via curriculum learning and the GRPO algorithm. Experimental results demonstrate that the proposed method outperforms baselines on several in-domain and out-of-domain medical VQA benchmarks.
1. The paper tackles a problem of critical importance and significant challenge in the medical AI domain. In high-stakes clinical scenarios, it is crucial for a model not only to provide the correct answer but also to offer a reasoning process that can be audited and trusted by physicians.
2. The paper conducts extensive experiments across multiple standard medical VQA datasets, comparing the proposed method against a range of baselines and demonstrating its superiority on several metrics.
3. The paper is generally clearly structured, and easy to follow. The authors effectively articulate the problem background, motivation, and the proposed methodology.
Despite its strengths, I have several major concerns regarding the novelty of the methodology, the rigor of the experimental evaluation, and the completeness of the exposition.
1. **Limited Novelty of the VFR**: The core idea of VFR, enforcing visual grounding consistency by comparing the IoU of attention maps, is not a new concept. This technique has been widely used in computer vision and multi-modal learning, for instance, in Grad-CAM [1] and its variants, as well as in conditional image-text embedding networks [2]. Furthermore, the idea of "text-visual consistency" as a regularization or reward mechanism has been explored in prior work [3-4]. Consequently, the contribution of this component feels more like an application of existing techniques rather than a fundamental innovation.
2. **Lack of Comparative Experiments**: The authors mention that the GRPO algorithm has already been applied to medical VQA tasks. However, the experimental comparison section lacks a direct comparison with existing GRPO-based medical VQA methods [5-7]. This makes it difficult for readers to accurately assess the true performance gain of the proposed method over the most relevant state-of-the-art work.
3. **Clarity Issues and Lack of Symbol Definitions**: The paper's clarity suffers in several key areas. Symbols such as $y_i$, $\pi_\theta$, and $c_i$ on page 5, lines 231-232, and $r_i$ in Equation (10) appear to be used without a clear prior definition, which hinders comprehension. Moreover, Equation (8) introduces a LoopTight term, but the paper fails to explain its specific function, design rationale, or how it is utilized within the algorithm.
4. **Insufficient Justification and Support for Reward Design**: Both designed rewards, VFR and BPR, use indicator functions. The paper claims this "stabilizes group-standardized advantages," but provides no theoretical derivation or experimental evidence to support this crucial assertion; using continuous values such as the IoU or cosine similarity directly as rewards would be a more natural choice (see the sketch after this list). In addition, the reward design introduces several key hyperparameters ($\tau_{IoU}=0.5, \tau_{tt}=0.8, \tau_{it}=0.5$), and the paper provides no basis for selecting these specific thresholds.
5. **Incompleteness of Ablation Studies**: The current ablation study only tests the scenario where VFR and BPR are removed simultaneously. This is insufficient for understanding the individual contribution of each reward component. A more comprehensive ablation study should include: 1) Experiments where only VFR is removed, and only BPR is removed. 2) An ablation on the choice of the "medical judge" model. 3) An ablation on the curriculum learning strategy.
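To make point 4 concrete, here is a small illustration contrasting the paper's thresholded (indicator) rewards with the continuous alternative suggested above, and showing how either would enter a GRPO-style group-standardized advantage. The group size, the weighting in the continuous variant, and the example scores are purely illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def indicator_reward(iou, cos_tt, cos_it, tau_iou=0.5, tau_tt=0.8, tau_it=0.5):
    # Thresholded variant as described in the paper: 0/1 bonuses.
    r_vfr = float(iou >= tau_iou)
    r_bpr = float(cos_tt >= tau_tt and cos_it >= tau_it)
    return r_vfr + r_bpr

def continuous_reward(iou, cos_tt, cos_it):
    # A natural alternative: use the raw scores directly as dense rewards.
    return iou + 0.5 * (cos_tt + cos_it)

def group_standardized_advantages(rewards):
    # GRPO-style advantage: standardize rewards within a sampled group.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: a group of 4 sampled responses (iou, cos_tt, cos_it) for one prompt.
scores = [(0.62, 0.85, 0.55), (0.48, 0.90, 0.40),
          (0.71, 0.70, 0.60), (0.30, 0.82, 0.52)]
r_ind = [indicator_reward(*s) for s in scores]
r_cont = [continuous_reward(*s) for s in scores]
print(group_standardized_advantages(r_ind))   # coarse: few distinct advantage values
print(group_standardized_advantages(r_cont))  # smoother ranking of the responses
```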
**References**:
[1] Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization, In ICCV 2017.
[2] Conditional Image-Text Embedding Networks, In CVPR 2018.
[3] Learning from Observer Gaze: Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition, In CVPR 2024.
[4] Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? Computer Vision and Image Understanding, 2017.
[5] MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning, In MICCAI 2025.
[6] MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs, arXiv, 2025.
[7] Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models, arXiv, 2025.
The questions are provided above. |
Lightly AI-edited |
|
REFLEX-Med: Reinforcement for Label-Free Explainability in Unified Medical Reasoning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes REFLEX-Med, a new reinforcement learning framework for VLMs. It uses a CLIP model to compute (1) an image-text saliency map for a fine-grained visual alignment reward, and (2) text-text and text-image similarity scores for a cross-modal semantic reward. The paper evaluates the proposed method on both in-domain and out-of-domain datasets and shows an improved overall performance against existing baselines, demonstrating the effectiveness of the proposed method.
1. The proposed method is shown to be effective through extensive experiments. The evaluation includes as many as 7 datasets from different sources and with different focuses, and the proposed method demonstrates a non-trivial overall improvement over a same-size baseline model with large-scale pre-training. This is quite impressive, considering the simplicity of the proposed method (in a positive way).
2. The proposed idea of using a frozen CLIP model for multi-modal reward is convincing. Different from a discrete accuracy reward or text-only reward, the proposed method takes the image input into consideration and measures the correlation between the model output and the image.
3. The core code is provided in the supplement.
First and foremost, the paper has obviously modified the page margins: the bottom margin is increased, while the left and right margins are decreased. It is unclear whether the paper gained or lost space from this modification, but I believe it is a clear violation of the conference formatting requirements, which clearly specify the page margins. Given that, I think I have no choice but to reject this paper. Still, I do have some further comments about the paper's weaknesses, listed below.
1. While the experimental results are impressive, the paper seems to overstate its contribution, from the reviewer's point of view. The paper claims the proposed method can help avoid hallucination and attention-think drift, and further improve the faithfulness of the reasoning and explainability.
However, the proposed rewards rely only on the text output (the model statement $t_1$), and they are computed via a stand-alone CLIP model. This leads to two problems: **(1)** Is the CLIP model reliable as a judge? All the proposed rewards are computed as semantic similarity in the CLIP model's embedding space, which can be error-prone in the first place. The chosen BioMedCLIP is clearly not pre-trained for fine-grained text-image alignment, making the saliency map unreliable as well. There is also no fact-check or direct chain-of-thought quality assessment, which makes the claim of avoiding hallucination questionable. **(2)** The rewards are indirect quantities in the GRPO optimization, which means that optimized rewards do not guarantee better reasoning or explainability, only higher semantic similarity between the model output and the input text-image pair. *Ultimately, GRPO optimizes the model in a direction that generates output more similar to the anchor text, rather than improving the reasoning.* This could be fine in terms of improving performance, but no proof or evaluation shows that it helps explainability.
2. The paper claims the proposed method is a **label-free** solution, but this point is not very solid. Like all the baselines mentioned in the paper, the proposed method uses the same VQA data, where the ground-truth answer text is necessary. Of course, the proposed method does not need fine-grained local correspondence maps or heatmaps, but none of the baselines or commonly used methods need these additional annotations either. To better validate this point, one may want to compare with a baseline that does require such annotations.
3. The proposed method also claims it can improve the visual grounding capability. However, from the limited visual example in Figure 6, it is not obvious that the model is really capable of visual grounding. Figure 3 looks nice, but it is the saliency map for the CLIP model, rather than the actual attention of the VLM. Moreover, optimizing the VFR reward only improves the quality of the saliency map, but not the internal attention of the VLM. Providing more visual examples and reasoning results could help answer this question.
4. It might just be a misunderstanding on the reviewer's side, but it would be great if some of the points could be clarified better. For example, how is the saliency map computed? Also, the paper would be much easier to follow if the frozen CLIP reward model were introduced earlier, rather than only through a vague description.
1. Can you provide some more visualization, like Figure 3 and Figure 6? The reviewer is very interested in the quality of the visual grounding for both the CLIP model and the final VLM.
2. When computing the text embedding for the model output, does it include the thinking part of the output? Or is it just about the answer part?
3. Also, the reviewer wonders what happens when the question asks about global information, e.g., the imaging modality, and how the grounding reward helps in this case. In addition, for a yes/no question, what happens if the question itself is wrong? |
Fully human-written |
|
TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes TimeSqueeze, a method for efficient long sequence time-series forecasting using dynamic patch interpolation. It learns to compress long input sequences into a small number of informative patches by interpolating around learnable query points with soft Gaussian weights. This reduces computational cost while maintaining forecasting accuracy. TimeSqueeze achieves strong results on standard benchmarks and can be plugged into existing transformer models.
The paper introduces TimeSqueeze, a novel module for time-series forecasting that performs dynamic patch interpolation to reduce sequence length while preserving important temporal structure.
This adaptive downsampling technique, which uses learnable interpolation kernels, significantly reduces computational cost while maintaining competitive or superior forecasting accuracy.
The approach is modular and model-agnostic, allowing it to be integrated into various transformer-based backbones without architectural changes.
The authors provide extensive experimental validation across standard benchmarks, demonstrating clear gains in efficiency (speed and memory) alongside strong predictive performance.
While the method is effective at reducing computational overhead, it lacks discussion on how the dynamic patching and interpolation process affects interpretability or transparency of the learned representations.
The paper does not offer a formal theoretical framework to understand the trade-offs between compression rate and information loss, especially in rapidly changing signals. The exclusion of non-transformer baselines, such as state-space or statistical hybrid models, weakens the comparative rigor of the evaluation.
Could the authors provide a qualitative or quantitative analysis of the learned patching structure, and whether it consistently adapts to different temporal patterns such as trends, seasonality, or sudden shifts?
Is the model capable of handling irregularly sampled or incomplete time-series, or does it require preprocessed, evenly spaced inputs for effective interpolation?
Is this method suitable for financial time series data, which are influenced by many factors and are notoriously hard to forecast? |
Fully AI-generated |
|
TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
TimeSqueeze presents a well-executed and impactful approach to efficient long-context time-series modeling via dynamic patching. Its hybrid design and strong empirical results make it a valuable contribution. However, the work could be strengthened by broader comparisons with adaptive compression methods and a more in-depth analysis of the learned representations.
1. The introduction of TimeSqueeze, which dynamically combines point-level fine-grained encoding with adaptive patch-level compression, is novel and well-motivated. The dynamic patching mechanism based on relative deviation effectively addresses the limitations of fixed-size patching and enables content-aware compression.
2. The paper demonstrates compelling efficiency gains (up to 20× faster training and 10× faster inference) while maintaining competitive forecasting performance with state-of-the-art point-embedding models like Time-MoE in both zero-shot and full-shot settings across multiple benchmarks.
3. The authors provide extensive experiments, including comparisons with strong baselines, detailed ablation studies, and analyses of the impact of pre-trained context length and compression rates, which convincingly validate the design choices and scalability of the proposed method.
1. While the paper compares with fixed-patching methods and point-embedding models, it does not include comparisons with other adaptive or learned compression strategies from recent literature (e.g., learned chunking or entropy-based methods), leaving the relative advantage of the proposed patching criterion less fully contextualized.
2. Although the paper shows that longer pre-trained contexts improve performance, the analysis is limited to performance curves without deeper investigation into what temporal structures or dependencies are better captured, or how the dynamic patching interacts with long-range modeling.
Please see weaknesses! |
Fully AI-generated |
|
TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes TimeSqueeze, a hybrid forecasting architecture that fuses point-wise and dynamic, content-aware patch-wise representations for efficient time series forecasting, particularly in long-context regimes. TimeSqueeze introduces a lightweight state-space encoder to extract fine-grained features, followed by an adaptive patching mechanism that groups time points into variable-sized patches based on local signal complexity. This yields a variable-resolution sequence processed by a Transformer (MoE) backbone. Experiments demonstrate substantial improvements in computational efficiency versus point-embedding baselines, while maintaining strong (often comparable) forecasting accuracy across zero-shot and full-shot scenarios on established long-range benchmarks.
1. TimeSqueeze innovatively combines state-space encoders with adaptive dynamic patching for time series, addressing a well-known bottleneck of fixed patching and inefficient context scaling in Transformer models. This hybridization enables granular feature preservation where needed and aggressive compression elsewhere, a nuanced but underexplored trade-off.
2. The dynamic patching strategy is clearly described and mathematically motivated (see the explicit formulation of the patch boundary condition on Page 4). The model architecture (Figure 1) is systematically illustrated, showing integration points for patching, unpatching, and multi-horizon prediction.
3. Numerous ablation studies probe each critical component (patching, encoder type, positional encoding), and visualizations (Figures 8-11) clarify how patch sizes adapt responsively to data domains.
1. While Section 3 briefly gestures at point-wise decomposability, the model's explicit capability for multivariate or exogenous feature forecasting is only cursorily addressed, with no empirical analysis contrasting, e.g., approaches like Crossformer or TimeXer. This reduces clarity on generality and limits the scope of significance, especially for real-world multivariate forecasting use cases.
2. Potential Hyperparameter Sensitivity: The choice of patching threshold $\tau$ is acknowledged as data-dependent and requires tuning. However, the ramifications (e.g., stability, transferability across domains, optimal selection strategies) are not robustly quantified—raising concerns about practical usability and the risk of model brittleness. The only discussion occurs at a high level in the Conclusion, and more rigorous empirical explanation (e.g., full-sweep results for a range of $\tau$ on multiple datasets) is absent.
3. While the patching function is presented cleanly, some important aspects remain underspecified—for instance, the way boundaries are handled when signals have sudden global changes, or how minimum/maximum patch sizes interact with highly nonstationary regions (see dynamic patching on Page 4 and visualization in Figures 8–10). Additionally, the role of variable-length unpatching in SSM/Transformer composition warrants deeper theoretical and implementation clarity. The claim of strict causality preservation could be accompanied by formal or simulation evidence to eliminate any ambiguity.
4. While Figure 5 (Appendix D) plots the MSE versus compression rate, there is little theoretical guidance or model explaining these trends, nor discussion of limits of the dynamic patching regime in catastrophic or highly non-stationary settings.
1. How robust is the threshold $\tau$ selection across datasets with highly variable information density? Are there scenarios where patch boundary assignments lead to overcompression or undercompression? Please provide more empirical analysis, including out-of-domain or adversarial examples.
2. Does the architecture generalize robustly to multivariate or exogenous-variable tasks, as per the settings in Crossformer or TimeXer? What adjustments (if any) are required for such use cases?
3. Are there practical deployment scenarios (e.g., real-time forecasting in resource-constrained environments) where the patch boundary computation or unpatching steps impose bottlenecks? What is the end-to-end wall-clock speedup, including patching/merging steps? |
Fully AI-generated |
|
TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting |
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The authors propose wrapping patch-based time series forecasting foundation models (focusing on Time-MoE) in a state space model encoder-decoder architecture that reduces the number of patches passed to the inner model. The experiments provided suggest that this does not significantly affect predictive metrics, but speeds up pretraining.
- The idea of wrapping a transformer model in an SSM encoder-decoder structure to speed it up is well-motivated and could have applications in other areas that use transformers.
- The paper is clear and well-written.
- Thorough ablations are provided.
- The time series forecasting models cited and evaluated against are out of date. Baseline evaluations are taken from Time-MoE (ICLR 2025) and have not been updated to include more recent models, such as Sundial [1] and Moirai-MoE [2], that have advanced the state-of-the-art in the meantime. As such, this paper's claim of state-of-the-art performance is not demonstrated. Posting models on the GIFT-Eval benchmark [3] has become common practice in this area, and many more recent performant models can be found there. (Ideally, the Chronos evaluations would also be updated to at least use Chronos-Bolt, released Nov. 2024.)
- The specific focus on Time-MoE in the method design and evaluations limits the impact of the paper, given the advances in the field since its publication. Without further discussion and evaluation, it's not necessarily clear that this method could be applied on top of more recent models and perform as well.
- Only five evaluation datasets are used - this is extremely limited and not in line with recent papers in this area. Even among earlier papers, MOMENT, TimesFM, Moirai, and Chronos all use dozens of datasets in their evaluations. Again, GIFT-Eval is an example of a large evaluation set that has become commonly used.
- A compelling explanation is lacking for placing patch boundaries based on the difference between neighbouring samples. For a seasonal input, this means focusing on the areas between peaks and troughs, but it's not clear why they should be treated as more relevant. Empirical justification could be all that's available, but it would strengthen the paper if some insight could be provided.
- It's not clear to me that it's very impactful to speed up pretraining of time series foundation models without improving other aspects of performance, given that they tend to be relatively cheap to train as far as foundation models go, and the zero-shot capability means pretraining only has to be done once. One possible benefit would be allowing more scaling of model size, but the results suggest that doing so does not help performance.
[1] Liu et al. "Sundial: A family of highly capable time series foundation models" ICML 2025.
[2] Liu et al. "Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts" ICML 2025.
[3] https://huggingface.co/spaces/Salesforce/GIFT-Eval
- Ablations compare to using a fixed patch length of 4 but the dynamic patching seems to prefer using patch length 2 in many cases - have you evaluated a fixed patch length of 2? (Acknowledging that this reduces the speedup benefits.) |
Fully human-written |
|
TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a hybrid forecasting architecture that combines the strengths of point-wise and patch-based embeddings through dynamic time series compression. It comprises a lightweight state-space encoder that uses point-wise embeddings to process full-resolution time series and extract fine-grained temporal features. An adaptive patching module then prunes these features using variable-sized patches, assigning smaller patches to information-rich regions and larger patches to redundant segments.
S1. This paper presents a hybrid forecasting architecture to incorporate dynamic, content-aware patching for adaptive compression in time series.
S2. The experimental findings validate the computational efficiency of the proposed method.
1. Time series data often exhibit periodic and trend patterns. Relying solely on single-step differences between adjacent samples to determine boundaries may be insufficient for capturing periodic boundaries or trend changes.
2. The patching mechanism determines boundaries by comparing the absolute difference between adjacent samples with the local average power within a sliding window. Could the authors clarify how this criterion effectively distinguishes between information-rich and redundant regions? (A sketch of how I read this rule is given after this list.)
3. The boundary selection depends on the design of the sliding window, yet the paper does not clearly specify whether the window is overlapping or non-overlapping.
4. Experimental results show that model performance decreases as the compression ratio increases, which may be due to excessive information loss caused by over-compression. Intuitively, using a smaller compression ratio might improve performance, but the paper does not provide corresponding experiments.
5. The current experimental results show limited forecasting performance, and the paper does not include comparisons with recent Time Series Foundation Models (TSFMs), such as Sundial, LightGTS, and VisionTS.
6. The paper lacks an analysis of patch distribution across different datasets. It would be valuable to examine how patch length, density, or boundary frequency vary among datasets with different statistical characteristics.
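To make points 2 and 3 concrete, the following is a minimal sketch of how I read the boundary rule, assuming a non-overlapping window for the local average power and a single relative threshold `tau`; this is my reconstruction under those assumptions, not the authors' code.

```python
import numpy as np

def patch_boundaries(x: np.ndarray, window: int = 16, tau: float = 0.5):
    """Mark a boundary after sample t when |x[t+1] - x[t]| exceeds
    tau times the local average power of the current window.

    Assumptions: 1-D series, non-overlapping windows, power = mean(x^2).
    """
    boundaries = []
    for t in range(len(x) - 1):
        start = (t // window) * window
        local_power = np.mean(x[start:start + window] ** 2) + 1e-8
        if np.abs(x[t + 1] - x[t]) > tau * local_power:
            boundaries.append(t + 1)
    return boundaries

# Example: boundaries cluster around the step change, not the flat segments.
x = np.concatenate([np.zeros(32), 2.0 + 0.01 * np.random.randn(32)])
print(patch_boundaries(x, window=16, tau=0.5))
```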
See Weaknesses. |
Moderately AI-edited |
|
TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes dynamic patching and lightweight state-space encoder for time series prediction. Experiments are conducted on 5 datasets with zero-shot and finetuning settings.
1. This paper is easy to follow.
2. Experiments are conducted on 5 commonly used datasets.
3. The experiments are conducted with zero-shot and finetuning settings.
1. The claimed main contribution of this paper is to "incorporate dynamic, content-aware patching". However, this has been well studied in time series foundation models, such as [1-3], which propose dynamic and/or content-aware patching. The claimed contributions and novelty are therefore limited.
[1] HDMixer: Hierarchical Dependency with Extendable Patch for Multivariate Time Series Forecasting. AAAI 2024.
[2] Irregular Multivariate Time Series Forecasting: A Transformable Patching Graph Neural Networks Approach. ICML 2024.
[3] LightGTS: A Lightweight General Time Series Forecasting Model. ICML 2025.
2. Key experimental comparisons with baselines [1-3] are missing. I briefly checked the fine-tuning setting, where the performance of this work is worse than [3], and the efficiency of this work is also worse than [3].
3. The authors 'validate TimeSqueeze across diverse zero-shot forecasting benchmarks, achieving performance on par with state-of-the-art point embedding models while delivering up to 20× faster training and 10× faster inference'. Why not compare efficiency with patch embedding models?
4. Does the improvement come from downsampling pretraining data as it reduces the bias?
5. No code for reproducibility.
see weaknesses. |
Fully human-written |
|
TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes TimeSqueeze, a dynamic patching architecture for efficient long-context time series forecasting. The model addresses the trade-off between fine-grained temporal resolution and computational scalability. TimeSqueeze introduces a two-stage hybrid representation: 1. A lightweight state-space encoder extracts fine-grained temporal features from the full-resolution time series. 2. An adaptive patching module dynamically adjusts patch sizes, assigning smaller patches to regions with complex temporal variations and larger ones to stable segments. This variable-resolution representation allows the Transformer backbone to process fewer tokens without losing critical information, improving both efficiency and accuracy.
1. The methodology is well-motivated and clearly integrated into the forecasting framework.
2. The paper is clearly written and conceptually intuitive.
1. The experimental validation is limited, as the evaluations are conducted only on the Time-MoE architecture, which restricts the generality of the conclusions.
2. The overall architecture of TimeSqueeze largely builds upon the Time-MoE framework — equations (2–4) are directly inherited from the original Time-MoE paper — and the idea of dynamic patching has already been explored in several prior works.
3. The efficiency comparison with Time-MoE is not entirely fair, since Time-MoE is intentionally designed to maximize model capacity by using point-wise rather than patch-based embeddings. Therefore, a more appropriate efficiency analysis should include lightweight baselines such as SparseTSF, TimeBase, or DLinear.
4. The full-shot forecasting experiments lack strong state-of-the-art baselines such as CycleNet, TQNet, TimeBase, or DUET, which makes it difficult to assess the claimed superiority of the proposed model.
Have you considered extending the TimeSqueeze architecture to handle multi-dimensional time series data, such as those involving spatial-temporal correlations? |
Fully AI-generated |
|
GenCompositor: Generative Video Compositing with Diffusion Transformer |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces GenCompositor, a diffusion-transformer-based framework for generative video compositing, enabling insertion, removal, and harmonization of dynamic foreground objects within videos. The method takes as input a masked background video, a mask video, and a foreground video, along with user-defined trajectory and scale controls. The model architecture follows an MMDiT design, similar to those used in FLUX, SD3, or HunyuanVideo, with full self-attention, and includes:
- A background preservation branch to maintain scene consistency,
- A DiT fusion block, which concatenates background and foreground tokens for fusion instead of using cross-attention, and
- An Extended Rotary Position Embedding (ERoPE) to mitigate layout misalignment artifacts (a sketch of how I read this mechanism follows this summary).
Furthermore, a new dataset, VideoComp, containing 61K video triplets, is curated for training. The qualitative results are visually strong and demonstrate convincing object insertion and harmonization, though the quantitative metrics show only modest improvements over existing methods.
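For concreteness, here is a minimal sketch of how I understand ERoPE: condition (foreground) tokens receive rotary position indices shifted beyond the background range, so the two layout-unaligned streams do not collide in position space before full self-attention. The 1-D simplification, the offset scheme, and the function names are my assumptions rather than the paper's exact formulation.

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0):
    """Standard 1-D RoPE angles for integer positions (simplified)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]  # (seq, dim/2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate feature pairs of x (seq, dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Background tokens keep positions [0, N); foreground tokens are shifted by N
# (shown along one axis for simplicity) so the two streams never share indices.
N, dim = 8, 16
bg, fg = torch.randn(N, dim), torch.randn(N, dim)
bg_rot = apply_rope(bg, rope_angles(torch.arange(N), dim))
fg_rot = apply_rope(fg, rope_angles(torch.arange(N) + N, dim))  # extended range
tokens = torch.cat([bg_rot, fg_rot], dim=0)  # fused sequence for full self-attention
```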
- **Strong engineering effort:** The system is well-implemented and technically solid. The architecture is clean and the pipeline is carefully designed.
- **Good visual quality:** Qualitative and video results are impressive, showing smooth motion and coherent integration of foreground and background.
- **Clear writing and presentation:** The paper is well-organized, easy to follow, and supported by clear figures and a detailed video presentation.
- **Practical relevance:** The task aligns well with real-world video editing workflows and could be useful for production or creative pipelines.
1. **Limited conceptual novelty**:
- The core components are incremental modifications of existing DiT architectures.
- The background preservation branch resembles ControlNet-style conditioning, which is also stated in the paper.
- ERoPE is a simple positional embedding shift; while useful, it is not conceptually innovative.
Overall, the work feels like a strong system-level integration rather than a conceptual contribution.
2. **Inadequate quantitative evaluation**:
- The paper reports only frame-level metrics (PSNR, SSIM, LPIPS, CLIP). These are not sufficient for video generation; video-level metrics such as FVD or KVD are missing.
- Quantitative improvements are relatively small, making it difficult to assess significance.
3. **Weak baseline comparisons**:
- The compared baselines (Harmonizer, VideoTripletTransformer) are both video harmonization methods, not true generative compositing or conditional video generation systems.
- DynVFX could be used for comparison. Although it cannot directly take reference objects as input, it can add objects through text prompts, which would also strengthen the paper's results.
4. **Dataset concerns**:
- The construction of the new VideoComp dataset is clearly described, combining 409K source videos into 61K compositing triplets via a semi-automatic pipeline (SAM2 + CogVLM + Qwen). While this is a solid contribution, the paper provides limited analysis of dataset characteristics such as category diversity, motion distribution, or domain coverage, which limits understanding of how well the dataset covers diverse motion and scene types. Including a brief quantitative summary would strengthen this part.
5. **Conceptual framing and scope**:
- The paper frames generative video compositing as a new task, but it heavily overlaps with existing controllable video generation and video inpainting setups.
- The results without the fusion block are much worse. Could this be due to the limited representational power of the VAE encoder? I would like to see additional results where a stronger feature extractor, such as DINO or CLIP, is used for the cross-attention variant, as this could provide more meaningful semantic conditioning.
- The model depends strongly on mask and trajectory inputs. How flexible is it when these are imprecise or unavailable? Can it generalize to more unconstrained compositional setups?
- How well does it handle multiple objects or occlusion?
- The paper mentions luminance augmentation for harmonization. Is it sufficient for handling complex illumination changes, or does it fail in extreme lighting conditions? |
Fully AI-generated |
|
GenCompositor: Generative Video Compositing with Diffusion Transformer |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
GenCompositor introduces generative video compositing, an automated task to inject dynamic foreground elements into background videos via user-specified trajectories/scales. Its core is a Diffusion Transformer pipeline with three key components: a lightweight background preservation branch ensuring input-output consistency, DiT fusion blocks with full self-attention for foreground-background token fusion, and Extended Rotary Position Embedding (ERoPE) addressing layout misalignment. Trained on the 61K-video VideoComp dataset, it outperforms video harmonization (e.g., Harmonizer) and trajectory-controlled generation (e.g., Tora) methods in PSNR, SSIM, and motion smoothness. Ablation studies validate the necessity of each component. GenCompositor enables seamless, interactive video editing with realistic foreground-background interactions.
1. The work addresses an underexplored yet challenging setting: interactive, adaptive injection of foreground identity and motion into target videos, enabling fine-grained control over size, trajectories, and other dynamic attributes.
2. The showcased results demonstrate strong perceptual quality and temporal coherence, suggesting the method’s practical utility.
3. The release of implementation details and code substantially improves reproducibility and facilitates future research.
4. The construction of the 61K-pair VideoComp dataset represents a significant data-engineering contribution and a useful resource for benchmarking video composition.
1. In my opinion, the innovation of the modules is more an engineering improvement than a fundamental breakthrough. Specifically:
- For Sec.4.1, the proposed Input Conversion primarily involves operations like scaling and expanding the masked foreground object, which are quite common in existing methods.
- For Sec.4.2, the use of a black area to indicate the modified background and a binarized mask video to represent the masked area is also a common practice in prior work.
2. Although ERoPE appears to be a promising idea for handling layout misalignment, the authors do not provide sufficient elaboration on this design.
3. Additionally, it is suggested to revise Fig.3(A) to align with the paper's description by using a mask video instead of a foreground video, better demonstrating the unique application of ERoPE.
1. Sec.4.3 proposes a new token-wise concatenation method. However, what differentiates it from in-context learning approaches like VACE [Jiang et al., 2025]? Perhaps I missed some details, but I did not find a detailed explanation in the ablation study or supplementary materials.
2. After reviewing the supplementary material on your data construction, I have a question: how do you ensure that the foreground data strictly adheres to the user-provided trajectories, especially under variations in angle and direction? This raises concerns about how well the claimed controllability generalizes. |
Lightly AI-edited |
|
GenCompositor: Generative Video Compositing with Diffusion Transformer |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a new video editing task called generative video compositing and proposes the first feasible solution, GenCompositor. Based on the Diffusion Transformer architecture, this method can automatically composite dynamic elements from a foreground video into a background video according to user-specified trajectories and dimensions.
1. This work is the first to introduce the "generative video compositing" task, which aims to automatically composite foreground video elements into a background video using generative models while supporting user control over attributes such as trajectory and scale.
2. A complete architecture is proposed, including a background preservation branch, DiT fusion blocks, and Extended Rotary Position Embedding (ERoPE), aiming to address the three main challenges: background consistency, foreground injection, and user control.
1. Appendix F shows that extending RoPE along height, width, or temporal dimensions yields nearly identical training loss curves, leading authors to conclude these three directions are equivalent. However, this dimension-agnostic behavior seems contradictory to the inherent spatial-temporal characteristics of videos. If ERoPE truly models spatial layout relationships, spatial extensions (height/width) should outperform temporal extension since they directly encode spatial proximity. Conversely, if it captures motion dynamics, temporal extension should be more effective given the causal and directional nature of video frames. The complete equivalence across all three dimensions suggests that ERoPE may simply function by expanding the position encoding to avoid feature conflicts, rather than genuinely modeling spatial-temporal interactions between foreground and background. Could the authors clarify this dimension-independence phenomenon and explain whether simpler alternatives (such as learnable position offsets for different video sources) might achieve comparable results?
2. The practical usage requires SAM2 for foreground segmentation and optical flow algorithms for trajectory tracking, and these preprocessing steps may introduce errors and increase system complexity, yet the paper does not discuss how these errors propagate and accumulate to affect the final results.
3. The paper only compares its method with two video harmonization approaches and two trajectory-controlled generation methods, but fails to include comparisons with recent state-of-the-art methods in related tasks such as video object insertion, which share similarities with generative video compositing.
Please refer to the weaknesses. |
Lightly AI-edited |
|
GenCompositor: Generative Video Compositing with Diffusion Transformer |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes GenCompositor, a generative video compositing method that can adaptively inject a foreground video, under user control, into a background video. The proposed framework is built on a diffusion transformer pipeline, allowing users to control the size and trajectory of the foreground objects and to blend them seamlessly into the background video. To validate the effectiveness of the proposed framework, the paper also curates a dataset called VideoComp. The experiments compare against both trajectory-controlled and non-trajectory-controlled video generation methods.
Overall, this paper proposes a new task, i.e., generative video compositing, which has not been well addressed before. It differs from existing work on video compositing in that it utilizes generative models to complete the task; the proposed task is thus similar to the existing one but tackles the problem from a different technical angle.
In addition to the task, the paper shows some strong points in curating the dataset required to evaluate on this task. If released, this should help further advance the research in this field. The paper offers a good motivation for why this problem is important and presents the components in the pipeline relatively well. Below, I will show the weak points that need particular attention.
The experimental visualization is not entirely clear. For instance, looking at Figure 1, what is the foreground object in the top-left example of the figure? The depicted trajectory is also unclear in this case. It would be nice to have high-resolution, good-quality results. Similarly, in Figure 7, the trajectory is not clearly depicted.
The paper mentions the interaction between the added foreground object and the background, citing the explosion effect in Figure 1(a) as an example. This looks very interesting. However, I don't believe the paper presents sufficient technical detail to explain why that happens. Usually, we would expect the added element to have a static presence, yet this claim shows that the newly added element can adapt to the background content. This part is not supported by the technical details, or at least not clearly presented; it seems related to mask inflation, but not entirely.
Regarding the input conversion, the user trajectory is used to generate the mask video and the masked video; the trajectory itself is not directly encoded. It would be nice to state directly what the conditions "c" in Eqn. (2) are.
Based on the workflow of GenCompositor in Figure 2, the masked video and its corresponding mask video are concatenated as the input. This indicates that GenCompositor deals with layout-aligned conditions. In Section 4.4, where ERoPE is discussed, the problem shifts to layout-unaligned conditions. This creates a gap in understanding, as the workflow in Figure 2 does not incorporate layout-unaligned conditions. Would layout-unaligned conditions change the workflow of the proposed GenCompositor? This point is confusing.
The role of the inflated mask could be better clarified. Initially, I thought the mask regions were expanded to enlarge the foreground object in case of an imperfect foreground mask, since a Gaussian filter is used to inflate the mask video. Looking at Figure 8, however, I noticed that the inflated mask seems to be generated naturally as the foreground object changes.
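For reference, here is a minimal sketch of what I assume mask "inflation" with a Gaussian filter means: blur the binary mask and re-binarize at a low threshold so the masked region grows slightly. The sigma and threshold values are hypothetical, and the paper's exact procedure may differ.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def inflate_mask(mask: np.ndarray, sigma: float = 3.0, thresh: float = 0.1) -> np.ndarray:
    """Grow a binary mask by Gaussian blurring and re-thresholding.

    A low threshold after blurring keeps pixels near the original boundary,
    so the inflated mask covers a slightly larger region than the input.
    """
    blurred = gaussian_filter(mask.astype(np.float32), sigma=sigma)
    return (blurred > thresh).astype(np.uint8)

# Example: a 9x9 square mask grows by a few pixels on each side.
mask = np.zeros((64, 64), dtype=np.uint8)
mask[28:37, 28:37] = 1
print(mask.sum(), inflate_mask(mask).sum())  # the inflated mask has more pixels
```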
In summary, the paper tackles an interesting task, and the presented pipeline is a feasible way to address it. There are technical contributions in the proposed pipeline, as well as in the ERoPE component, and the proposed dataset is also a good contribution, so the overall impression is positive. However, the paper also raises the concerns listed above, which need to be addressed well; the responses to these concerns will be critical for the final decision.
1. The interaction between the added foreground object and the background is unclear. How this is related to the technical component in the proposed pipeline needs to be justified.
2. The input conversion procedure is unclear and needs to be further explained.
3. The relationship between layout-unaligned conditions and its proposed pipeline is not clearly justified.
4. The role of the inflated mask and its generation process needs to be better explained. |
Fully human-written |
|
CooperTrim: Adaptive Data Selection for Uncertainty-Aware Cooperative Perception |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents CooperTrim, an adaptive, uncertainty-aware feature selection framework for cooperative perception. The framework leverages conformal prediction to estimate temporal uncertainty as a measure of feature relevance and employs a data-driven adaptive mechanism to select an appropriate quantity of shared features based on environmental complexity. CooperTrim is plugged into several cooperative perception methods and extensively evaluated on semantic segmentation on the OPV2V dataset, demonstrating significant reductions in bandwidth usage without sacrificing accuracy.
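To ground the discussion below, here is a minimal sketch of how a conformal temporal-uncertainty gate for feature selection could look, assuming split-conformal calibration over per-location temporal residuals of BEV features; the calibration procedure, score definition, and names are my assumptions, not necessarily CooperTrim's exact design.

```python
import numpy as np

def conformal_threshold(calib_scores: np.ndarray, alpha: float = 0.1) -> float:
    """Split-conformal (1 - alpha) quantile over calibration nonconformity scores."""
    n = len(calib_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return float(np.sort(calib_scores)[min(k, n) - 1])

def temporal_scores(feat_prev: np.ndarray, feat_curr: np.ndarray) -> np.ndarray:
    """Per-location nonconformity score: mean absolute temporal change across channels."""
    return np.abs(feat_curr - feat_prev).mean(axis=0)

rng = np.random.default_rng(0)

# Calibration: feature pairs from frames where only sensor noise changes.
calib_prev = rng.normal(size=(64, 256))          # (channels, BEV locations)
calib_curr = calib_prev + 0.1 * rng.normal(size=(64, 256))
q = conformal_threshold(temporal_scores(calib_prev, calib_curr), alpha=0.1)

# Deployment: most locations are static, a few change a lot (e.g., a new object).
prev = rng.normal(size=(64, 256))
curr = prev + 0.1 * rng.normal(size=(64, 256))
curr[:, :16] += 2.0
share_mask = temporal_scores(prev, curr) > q     # transmit only "surprising" features
print(q, share_mask.mean())                      # fraction of locations selected for sharing
```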
**Compelling Motivation and Scope:** The paper focuses on the bandwidth-accuracy trade-offs in cooperative perception, arguing for temporally and contextually adaptive feature selection that is not covered by static or threshold-based approaches.
**Effectiveness of Sub-modules:** CooperTrim's integration of conformal temporal uncertainty estimation with a cross-attention-based selection mechanism is well described, addressing both feature relevance and adaptivity. The detailed elaboration on the training strategies also aids reproduction.
**Concrete Adaptivity Insights:** The claim of scene adaptivity is validated by the qualitative results given in Fig. 4.
**Major Weaknesses:**
1. Although more components are incorporated, the proposed method remains a threshold masking mechanism. The adaptivity claim needs more quantitative validation. For example, in Fig. 4 (left), a convincing result would be to show that the IoU curve is relatively stable, or at least more stable than the BW curve ("complexity" curve). The current result cannot prove that the adaptivity benefits the final results.
2. Robustness against localization error and latency is not discussed.
**Minor Weaknesses:**
1. The method is only tested on OPV2V, which is a relatively simple simulated dataset.
2. More network-efficient baselines are expected to be included. Presenting results for the object detection task can also help enhance comparisons with previous methods.
1. Can the method adjust the bandwidth requirement during operation, or is a new model (a newly learned threshold generator) needed to cope with new network conditions?
2. There may be a delayed response to new traffic patterns, since historical information is used to generate the threshold. Are there any results on the influence of the temporal window size?
3. How well would CooperTrim's adaptivity generalize to other perception tasks (detection, tracking, etc.)? In those tasks, the ROI is much sparser.
See the weaknesses section for more details. |
Fully human-written |
|
CooperTrim: Adaptive Data Selection for Uncertainty-Aware Cooperative Perception |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes an adaptive data selection framework called COOPERTRIM for cooperative perception in autonomous agents. The main idea is to exploit the temporal continuity of the environment to identify relevant features and avoid transmitting redundant or static information. This reduces the communication bandwidth required while maintaining comparable accuracy to existing selection strategies. The proposed framework uses a conformal temporal uncertainty metric to measure feature relevance and a data-driven mechanism to determine the amount of data shared. The evaluation shows significant bandwidth reduction and improved IoU compared to other selection strategies. However, there are some limitations, such as no ablation study on choosing optimal thresholds, only one simulated dataset was used, and the method has not been evaluated on real-world datasets. Additionally, the threshold-based method may not be robust in scenarios where only minor changes occur in the scene. The method is currently only evaluated on segmentation tasks and further evaluation on detection tasks would demonstrate its generalizability. Finally, the impact of the computation cost after introducing this data selection to collaboration perception models needs to be evaluated. Overall, the idea of using temporal uncertainty for data selection is interesting and the theoretical proof is sound, but further research is needed to address the limitations mentioned above.
1. The idea of data selection by using temporal uncertainty is interesting.
2. The theoretical proof is sound.
3. The reduction in communication bandwidth consumption on segmentation tasks is evident.
1. No ablation studies on how to choose the optimal thresholds.
2. Only one simulated dataset is used; no real-world dataset is evaluated.
3. What is the mathematical expression of the distance function? What is the underlying rationale for using this distance function?
4. How robust is the threshold-based method? For example, in a scenario where the overall scene does not change much but a small object appears (e.g., a new pedestrian emerges), the temporal uncertainty would probably be small; how will the system respond in this case?
5. Although the method is advantageous for segmentation tasks, it should be better evaluated on detection tasks as well to demonstrate its generalizability.
6. Apart from bandwidth consumption, one important factor is how the computation cost changes after introducing this data selection into collaborative perception models. What is the processing speed (evaluated in FPS)? Can this method be used in a real-time driving system (typically more than 100 FPS)?
Please refer to the weaknesses. |
Heavily AI-edited |
|
CooperTrim: Adaptive Data Selection for Uncertainty-Aware Cooperative Perception |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
# What the paper does
COOPERTRIM is an adaptive data-selection framework for cooperative perception that uses temporal uncertainty to decide what features to share and how much to share under bandwidth limits.
# Key idea
* Compute a conformal, temporal uncertainty signal by comparing the current encoded features $F_t$ with the previous fused features $F_{t-1}^{\text{fused}}$; uncertainty indicates where collaboration helps most.
* Use uncertainty-guided attention to score relevance per channel/region and apply adaptive thresholds to (i) select features and (ii) determine the sharing quantity frame-by-frame (a rough sketch of this selection step is given after this list).
* Train with an ε-greedy–inspired regimen to balance exploration/exploitation so the model learns robust selection under bandwidth constraints.
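To pin down my reading of these two steps, below is a minimal sketch of how the conformal gate and an adaptive cutoff could be wired together. The nonconformity score, the calibration set, and the fixed `budget` are my own simplifying assumptions (the paper learns a data-driven cutoff instead), so this illustrates the data flow rather than the authors' implementation.
```python
import torch

def select_features(F_t, F_fused_prev, calib_scores, alpha=0.1, budget=0.3):
    """Hypothetical sketch of uncertainty-gated feature selection.

    F_t:          current encoded BEV features, shape (C, H, W)
    F_fused_prev: fused BEV features from the previous frame, shape (C, H, W)
    calib_scores: 1-D tensor of nonconformity scores from a calibration set
    alpha:        miscoverage level for the conformal quantile
    budget:       fraction of candidate cells allowed through the cutoff
    """
    # Temporal nonconformity: per-cell discrepancy between the current frame
    # and what the previous fusion already explains.
    score = (F_t - F_fused_prev).abs().mean(dim=0)              # (H, W)

    # Conformal-style gate: keep cells whose score exceeds the (1 - alpha)
    # quantile of the calibration scores.
    q = torch.quantile(calib_scores, 1.0 - alpha)
    candidate = score > q                                        # bool (H, W)

    # Adaptive cutoff: among candidates, transmit only the top `budget` fraction.
    mask = torch.zeros_like(candidate)
    k = int(budget * candidate.sum().item())
    if k > 0:
        flat = (score * candidate).flatten()
        mask.view(-1)[flat.topk(k).indices] = True

    return F_t * mask.unsqueeze(0), mask                         # sparse features to share
```
In the paper, the fixed `budget` is replaced by the learned, query-dependent mask cutoff, which is what makes the selected quantity adaptive frame-by-frame.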
1. The temporally driven, uncertainty-aware communication scheme is conceptually clear and well structured: it measures discrepancies between the previous fused representation and the current features, applies conformal quantile thresholding to select candidates, and then uses attention with an adaptive mask cutoff to decide both what to transmit and how much—focusing bandwidth on high-value regions.
2. The $\epsilon$-greedy training schedule provides a practical stabilizer under bandwidth constraints: intermittent full-feature updates interleaved with predominantly masked updates smooth optimization and reduce variance, yielding stronger performance than standard-deviation–only uncertainty baselines and curriculum-style fine-tuning.
3. The method is readily portable: as a drop-in component for cooperative semantic segmentation backbones (e.g., CoBEVT, AttFuse, DiscoNet), it delivers consistent improvements at equal or lower communication budgets.
1. Lack of comparison with asynchrony-robust methods (e.g., CoBEVFlow). While the task settings may differ, CoBEVFlow demonstrates that estimating BEV flow and propagating prior features can effectively counter temporal variation; this capability should be considered—either as a baseline or as a complementary design—when claiming advantages in time-varying scenes and realistic, asynchronous communications.
2. Single-benchmark evaluation. Experiments are confined to OPV2V, which limits external validity. Broader evidence across datasets (e.g., DAIR-V2X, V2X-Sim, OPV2V-Async) and tasks beyond semantic segmentation would strengthen generality claims.
3. “Conformal” is used primarily as quantile gating rather than as standard conformal prediction with finite-sample coverage guarantees. The paper lacks formal coverage analyses or mismatch bounds, so the terminology risks overstating the method’s theoretical assurances.
4. Limited system- and communication-layer characterization. Reported metrics focus on bandwidth ratios/Mbps and IoU, with no measurements of end-to-end latency, packet loss/retransmissions, congestion behavior, or the computation/runtime overhead introduced by attention and masking under different hardware budgets. Deployment-level compute-communication trade-offs thus remain underexplored.
1. Benchmark against asynchrony-robust methods and/or integrate BEV flow.
Can you compare to CoBEVFlow or prepend a BEV-flow pre-alignment module, reporting IoU–bandwidth trade-offs under controlled time offsets (e.g., ±50/100/200 ms) on OPV2V/OPV2V-Async? Does your two-threshold policy still add gains beyond flow alone?
2. Report system- and network-level metrics under realistic conditions.
Measure end-to-end latency (encode→select→transmit→align→fuse→decode), packet loss/retransmissions, and congestion behavior across link budgets (e.g., 3/6/12 Mbps) and loss rates (0–10%). Plot IoU–bandwidth–latency curves and characterize degradation/fallback under losses.
3. Quantify computational overhead and deployment feasibility.
Detail added FLOPs/memory and per-frame latency from attention/masking on embedded automotive hardware (e.g., Jetson/SoC) and desktop GPUs. Compare compute-communication trade-offs against feature-compression/distillation baselines at equal accuracy.
4. Establish cross-dataset and cross-task generalization.
Evaluate beyond OPV2V (e.g., DAIR-V2X, V2X-Sim, OPV2V-Async) and beyond semantic segmentation (detection/occupancy/tracking). Include fine-tuned and zero-shot transfers, reporting full IoU–bandwidth curves to substantiate external validity.
I would support acceptance provided the authors satisfactorily resolve all identified concerns. |
Fully AI-generated |
|
Probing to Refine: Reinforcement Distillation of LLM Reasoners via Explanatory Inversion |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper studies how to distill not only answers but also reasoning procedures into smaller models. It combines an “explanatory probing” data construction with a reinforcement-learning stage that rewards multi-turn, structured explanations before producing the final answer. Across a wide set of tasks and baselines, the method reports consistent gains, with readable presentation and clear motivation.
1. The paper targets a timely and important problem: moving beyond pattern imitation in distillation toward robust reasoning.
2. The narrative flows well, design choices are motivated, and figures/tables are easy to follow.
3. Empirical coverage is broad: it includes many competitive baselines and diverse task families, and comparisons are thorough and generally fair.
4. The combination of explanation-oriented probes with an RL objective that prefers structurally coherent multi-turn dialogs is a neat, conceptually coherent idea that fits the stated goal.
I think the central question remains unclear: does EI teach “understanding,” or is it primarily stronger data augmentation? The evidence presented is largely behavioral (end-task accuracy on in-domain and held-out sets). This does not disentangle genuine conceptual acquisition from targeted exposure to templated probe distributions. In particular:
1. The DSU/structural reward is still an outcome-level signal (full probe dialog > partial). It does not by itself show that the model internalizes transferable rules, as opposed to learning to perform longer, EI-style rituals.
2. If EI fosters understanding, predictions should change in directionally correct ways under counterfactual edits (flip a premise, rename variables, swap symbols, introduce irrelevant modifiers). The paper lacks such invariance/causality diagnostics that would separate concept use from surface patterning.
3. The paper reports little on intermediate-step faithfulness/validity (are the stated steps actually correct?), forward↔reverse consistency (e.g., reversal or bidirectional tasks), or error localization (does EI reduce spurious but fluent steps?). Such process metrics would directly bear on “understanding” rather than augmentation.
See weaknesses. |
Fully AI-generated |
|
Probing to Refine: Reinforcement Distillation of LLM Reasoners via Explanatory Inversion |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes the ExGRPO knowledge distillation framework, which aims to enhance the reasoning ability of large language models (LLMs) by combining Explanatory Inversion (EI) with reinforcement learning (ExGRPO) and to distill them effectively into smaller student models. Experimental results show that the student model distilled with this method achieves significant performance improvements on multiple datasets. Compared with existing distillation methods, ExGRPO reduces pattern memorization and improves the reasoning ability of the student model. In conclusion, I find the paper's argument relatively clear, and I would give it a score of 6.5 or 7.
1.Diverse interpretive probes generated by EI force student models to understand the logic of questions rather than simply memorizing answers, thereby improving their reasoning ability.
2.ExGRPO, through reinforcement learning and dialogue structure utility rewards, encourages student models to maintain consistency and coherence throughout multi-turn reasoning, which helps them understand and apply complex reasoning structures.
3.Compared to traditional knowledge distillation methods, ExGRPO significantly improves student model performance on cross-distribution tasks, especially demonstrating stronger generalization ability when faced with different domains or unseen data.
4. By using the data augmentation generated by EI, student models can achieve high reasoning ability with less data and fewer training rounds, significantly improving training efficiency; this is especially suitable for tasks with limited data.
1. Using EI probes significantly increases training costs. Could the authors reduce the number of EI-based training samples so that the training cost matches the baselines, in order to better quantify the contribution of EI?
2. The models all appear to be distilled from Gemini-1.5-Pro. Will ExGRPO still perform well with a relatively weak teacher model?
3. The authors compare many distillation methods and EI probes. ExGRPO's design is ingenious, but the paper does not seem to compare it with existing RL methods to highlight its adaptability and effectiveness.
Same as above |
Fully human-written |
|
Probing to Refine: Reinforcement Distillation of LLM Reasoners via Explanatory Inversion |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces a novel framework for distilling reasoning capabilities from large language models (LLMs) into smaller, efficient student models. It addresses the limitations of traditional knowledge distillation, which often leads to pattern memorization and poor generalization. To this end, the authors propose Explanatory Inversion (EI), which asks the teacher model to generate explanatory probes that challenge the student model to explain the logic behind answers.
The distillation pipeline has three stages:
1. Generate EI probes using a teacher model.
2. Supervised fine-tuning (SFT) on curated EI-augmented data.
3. Reinforcement learning via ExGRPO on structured probe dialogues with a Dialogue Structure Utility (DSU) bonus (a rough sketch of how such a bonus could enter the reward is given below).
The student models are Gemma-7B-it and Qwen2.5-7B-Instruct, and the teacher model is Gemini-1.5-Pro. The authors test on 8 different reasoning tasks (SQA, GSM8K, ANLI, among others). The proposed method improves over the SOTA distillation method RevThink by 6.02%. The ablation study shows that the SFT warm-up and the $L_{\text{SFT-aux}}$ regularization are crucial for stable RL training.
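To fix my understanding of stage 3, here is a minimal sketch of how a Dialogue Structure Utility bonus could be folded into a GRPO-style reward with group-relative advantages. The exact correctness check, the turn-completeness heuristic, and the weighting are illustrative assumptions on my part, not the paper's definitions.
```python
def exgrpo_reward(dialogue_turns, final_answer, gold_answer,
                  k_expected=3, dsu_weight=0.2):
    """Hypothetical reward for one sampled rollout in an ExGRPO-style setup.

    dialogue_turns: list of (probe, response) pairs produced by the student
    final_answer:   student's final answer string
    gold_answer:    reference answer string
    k_expected:     number of probe turns a full EI dialogue should contain
    dsu_weight:     weight of the Dialogue Structure Utility bonus
    """
    # Outcome reward: correctness of the final answer (exact match for simplicity).
    outcome = 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

    # DSU bonus: reward engaging in the full k-turn probing dialogue with
    # non-trivial responses, rather than skipping straight to the answer.
    completed = sum(1 for _, resp in dialogue_turns if len(resp.split()) >= 5)
    dsu = min(completed, k_expected) / k_expected

    return outcome + dsu_weight * dsu


def group_advantages(rewards):
    """GRPO-style group-relative advantages: normalize rewards within a group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]
```
This only fixes notation for the summary above; the paper's actual DSU definition and reward weighting may differ substantially.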
The authors provide a comprehensive evaluation, including in-distribution and out-of-distribution benchmarks, and a solid case study that demonstrates improved reasoning on math and commonsense tasks, with structured logic and fewer distractor errors.
The authors introduce novel methods, Explanatory Inversion and ExGRPO, which combine structured explanatory probes with reinforcement learning.
The paper compares against zero-shot and SFT baselines, but does not include recent reasoning-focused RL distillation methods (e.g., [Divide-or-Conquer](https://aclanthology.org/2024.findings-emnlp.145.pdf), [CoT-Evo](https://arxiv.org/abs/2510.13166v2), and [On-Policy Distillation](https://arxiv.org/abs/2306.13649)).
The quality of the explanatory probes heavily depends on Gemini-1.5-Pro. This raises concerns about scalability and reproducibility for researchers without access to such a strong teacher model. Could the authors try open-source alternatives, such as Llama 70B, for probe generation?
The paper highlights success cases but does not deeply analyze where ExGRPO fails. |
Fully human-written |
|
Probing to Refine: Reinforcement Distillation of LLM Reasoners via Explanatory Inversion |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work proposes a new framework for distilling robust reasoning from large language models into smaller ones. It introduces Explanatory Inversion (EI) to combat pattern memorization by prompting students to explain their reasoning, and Explanatory GRPO (EXGRPO) to enhance generalization via a reward for coherent reasoning. On 12 datasets, the method improves student model performance, training efficiency, and generalization ability.
- The paper is clear in writing and presentation, which is easy to follow.
- The idea is intuitive and explores the reasoning-chain constructions of LLMs, which helps the student model learn the underlying principles rather than surface patterns from the dataset.
- The results are strong and comprehensive, with significant improvement margins and many ablation studies.
- The paper uses a Dialogue Structure Utility Bonus when the student engages in the full k-turn probing dialogue, which leads to better overall outcomes than a partial dialogue. How do the authors prevent reward hacking arising from this reward bonus?
- Why do the authors pick the GRPO-based objective for the RL training? Is there any intuition or reason behind that?
- How does EXGRPO training efficiency compare with other baseline methods?
See weaknesses. |
Fully human-written |
|
Probing to Refine: Reinforcement Distillation of LLM Reasoners via Explanatory Inversion |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a model distillation method. It first constructs a high-quality EI training set, ensuring each problem includes a reasonable reasoning expansion, preserves the original logic, and has appropriate difficulty; SFT is then performed on this data. It then introduces ExGRPO, designing rewards based on a Dialogue Structure Utility Bonus to carry out reinforcement-learning-based distillation. Systematic experiments on two student models (Gemma-7B-it and Qwen2.5-7B-Instruct) show improvements over strong baselines on multiple OOD evaluations.
1. The method is well designed and addresses practical issues in model distillation.
2. The paper proposes the Explanatory Inversion strategy and carefully engineers a large set of prompts.
3. The paper introduces the Dialogue Structure Utility Bonus as the reward in reinforcement learning; this design is somewhat innovative.
4. The paper provides thorough comparative experiments and analyses.
1. Data generated via the Explanatory Inversion strategy is essentially a form of data augmentation; this part appears largely engineering-oriented, and the core idea is not particularly new, so the academic contribution is limited.
2. Evol-Instruct [1] presents a method for progressively generating complex instructions from simple ones. Although it does not target model distillation, it bears similarities to the paper’s Explanatory Inversion; the paper should add discussion of such related works.
[1] WizardLM: Empowering large pre-trained language models to follow complex instructions
1. Mainly those noted under “Weaknesses.”
2. Line 359: “ablation” is misspelled. |
Lightly AI-edited |
|
Probing to Refine: Reinforcement Distillation of LLM Reasoners via Explanatory Inversion |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes ExGRPO, a reinforcement-learning–based distillation framework that combines Explanatory Inversion and a Dialogue Structure Utility reward to enhance the reasoning capability and generalization of student LLMs. Empirical results are reported on 12 reasoning datasets, suggesting gains over distillation and data-augmentation baselines.
- The work addresses a genuine problem: retaining reasoning ability during LLM distillation.
- A relatively large evaluation suite with 12 datasets and OOD tests is included.
- The paper is well organized with clear figures and pseudo-formal derivations.
- The so-called Explanatory Inversion resembles existing reverse-reasoning or bidirectional augmentation ideas (e.g., “A→Q” vs. “Q→A” reversals) rather than a fundamentally new concept.
- No ablation quantifies whether improvements stem from RL fine-tuning, extra teacher tokens, or the EI data itself. Table 2 mixes multiple knobs (SFT, RL, DSU) but does not isolate the effect of “explanatory probing.”
- Even the teacher’s “Zero-shot-EI” performance improves, implying that EI augmentation changes the test distribution itself; this raises the possibility of data leakage or prompt-format bias.
- The statistical significance of the improvements is unreported.
- The filtering pipeline (Eq. 1–2) relies on teacher predictions but gives no statistics on rejection rates or dataset sizes after filtering.
Please see Weaknesses |
Moderately AI-edited |
|
BridgeRAG: A Framework for Reasoning over Partitioned Knowledge Graphs |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 2: fair
Rating: 0:
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper BridgeRAG: A Framework for Reasoning over Partitioned Knowledge Graphs presents a framework for querying over multiple documents, evaluated in the context of multi-hop question answering. The documents are first transformed into knowledge graphs using an NLP pipeline that extracts named entities as well as relations. However, the system does not use the graphs per se, as the LLMs are not equipped to deal with data in graph format. Therefore, the resulting graph is summarized using an LLM (both at the entity level and at the document level), and the rest of the processing is done on the generated text.
The authors of the paper do not clearly delineate their own work from existing tools and methods. It is therefore very difficult to accurately delineate their contributions.
The contributions as stated by the authors consist in proposing a framework for doing RAG across multiple documents and introducing three mechanisms to this end: knowledge predigestion (creating entity-level and document-level summaries based on the information in the knowledge graph); hybrid entity linking (matching the entities across documents using encodings of their summaries and an LLM-based SAME_AS score) and dynamic working memory (which deals with expanding the graph starting from the original query and documents).
The paper has potential; however, the way it is written makes it difficult to understand and to replicate. For example, the idea of creating a document manifest, i.e., a list of all entities followed by a document summary, is interesting. The planning also seems to matter according to the ablation study, but unfortunately it is not described in a way that can be understood.
The paper is written in a tabloid style, with unnecessarily bombastic wording, making it more difficult to understand what the authors describe. The authors should revise their prose and stick to the scientific style of writing, focusing on describing facts and offering clear explanations.
Some terms are used throughout the paper without ever being properly introduced: e.g. entity saturation, contextual noise pollution, contextual purity. Adding quotes around a term does not explain it.
The paper is difficult to read, as the authors do not use citations properly: please use \parencite{smith2020} -> (Smith, 2020) when the citations are not part of a sentence, and \textcite{smith2020} -> “Smith (2020)…” when the citation is part of a sentence.
Moreover, the authors do not clearly delineate their own contribution from the different off-the-shelf tools they use.
The methodology section is written extremely superficially, without mentioning important aspects of the framework being proposed. The explanations offered are mostly hand-wavy and of no use for understanding the methods.
I would recommend a careful rewrite of the paper, based on the questions below.
In Fig. 1, subfigure 3, the bridged graph can easily become a monolithic graph: it only needs a couple of links added for all nodes to be completely reachable from any node of the graph. This raises the question: how do you test/ensure that the connectivity in the graph remains bridge-style and does not become monolithic?
The methodology is unclear:
- What information about an entity is used to generate the entity summary?
- How is N_1(e) defined?
- What prompt is used to generate the summary?
- Which LLM was used in the experiments? (this appears only in the appendix)
- What prompt is used to generate the document summary? Is the document summary more than a concatenation of the individual entity summaries?
- How are the Partitioned Knowledge Graphs created? Which knowledge base was used?
- What is a hybrid entity linking mechanism? What entity linker was used? Is this an off-the-shelf entity linker?
- What is a multi-source weighted router? Is this an existing tool or something that you developed?
- What termination condition is used for the guided retrieval?
- Which independent NLP pipeline is used in S. 3.2.1? Is this something the authors created, or an existing library? If it is an existing library, please add proper citations.
- In Section 3.2.1, the KG construction is unclear: is the knowledge graph G_i constructed only based on the information from the document, d_i, or is there other external knowledge from a knowledge base that is used (maybe which is part of the NLP pipeline)?
- In section 3.2.2, I am assuming that the advanced sentence embedding model is, again, a third-party library which is not properly cited.
- In the deep semantic adjudication section, how is the output of the LLM calibrated – e.g. how is the scoring mechanism trained?
- Why not directly use an entity linker (e.g., BLINK, https://github.com/facebookresearch/BLINK) rather than this ad-hoc hybrid approach?
- In Section 3.3.1, what is the Reciprocal Rank Fusion algorithm? Is this something the authors designed, or existing work?
- In Section 3.3.1, what is the algorithm/method for deciding if a document is primary or auxiliary?
- In Section 3.3.2, what is the difference between the mining strategy for primary and for auxiliary documents? It is not explained at all.
- In Section 3.3.3., what is a critique function?
- In Section 3.3.3, how does the agent formulate a precise sub-question?
- In Section 4.1, where does the 2WikiMultiHopQA-Subset come from?
- In Section 4.1, the evaluation metrics are atypical for a question answering task. The authors must present them in more detail. |
Fully human-written |
|
BridgeRAG: A Framework for Reasoning over Partitioned Knowledge Graphs |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper tackles a key pain point in KG-based RAG for real-world, multi-document settings: if you build a separate KG for each document, you keep context clean but can’t answer questions that span documents; if you merge everything into one big KG, you get entity collisions (“Apple” vs “apple”) and noisy retrieval. To resolve this “contextual purity vs knowledge connectivity” dilemma, the authors propose BridgeRAG, a two-phase framework. Offline, it “pre-digests” documents and uses hybrid (LLM + embedding) entity linking to build trustworthy SAME-AS links across otherwise isolated document-level KGs. Online, an LLM agent does iterative, multi-hop reasoning over these linked partitions: a multi-source weighted routing module picks the most relevant documents, and a Dynamic Working Memory (DWM) keeps only the highly relevant facts in context. Experiments on multi-hop QA show BridgeRAG retrieves cleaner cross-document evidence and beats prior KG-RAG setups.
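To make the offline linking step concrete, here is a minimal sketch of a two-stage SAME-AS linker of the kind the summary describes (embedding recall followed by LLM adjudication). The thresholds and the `embed` / `llm_same_as_score` helpers are my own placeholders, not the paper's implementation.
```python
import numpy as np

def link_entities(entities_a, entities_b, embed, llm_same_as_score,
                  sim_threshold=0.80, llm_threshold=0.7):
    """Hypothetical two-stage SAME-AS linking across two document-level KGs.

    entities_a/b:      lists of dicts with 'name' and 'summary' fields
    embed:             callable mapping text -> 1-D numpy embedding
    llm_same_as_score: callable mapping (summary_a, summary_b) -> score in [0, 1]
    """
    links = []
    emb_a = [embed(e["summary"]) for e in entities_a]
    emb_b = [embed(e["summary"]) for e in entities_b]

    for i, ea in enumerate(entities_a):
        for j, eb in enumerate(entities_b):
            # Stage 1: cheap embedding recall on the entity summaries.
            sim = float(np.dot(emb_a[i], emb_b[j]) /
                        (np.linalg.norm(emb_a[i]) * np.linalg.norm(emb_b[j]) + 1e-8))
            if sim < sim_threshold:
                continue
            # Stage 2: LLM adjudication decides whether the two mentions
            # refer to the same real-world entity.
            if llm_same_as_score(ea["summary"], eb["summary"]) >= llm_threshold:
                links.append((ea["name"], eb["name"], sim))
    return links
```
The point of the two stages is that the cheap embedding pass prunes most pairs so the expensive LLM adjudication only runs on plausible candidates, which is also why the reliability of the summaries matters so much (see the first weakness below).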
1. The paper clearly formulates the core KG-RAG problem as “partitioned isolation vs. cross-partition linking,” which is exactly what happens in multi-document enterprise / report / case-file settings. Combining offline high-fidelity entity linking (to avoid KG pollution) with online, agentic, step-by-step reasoning plus DWM is a nice division of labor and more realistic than doing everything at query time. Using shared named entities as the main “bridges” between documents is intuitive, inspectable, and lets the authors control noise better than naïve graph merging.
2. Empirical results demonstrate the effectiveness of the proposed approach.
1. If NER/coreference/linking is wrong or sparse (domain jargon, long-tail entities), the “bridges” will not form and cross-document reasoning degrades. This means the proposed approach relies heavily on the correctness of intermediate steps. The authors should analyze whether error accumulation poses a risk in real-world scenarios.
2. The offline “knowledge pre-digestion” + dual-verification linking step is extra machinery that must be re-run whenever documents change, which may limit use in highly dynamic corpora. It is also time-consuming in practice; a more detailed latency analysis should be provided to characterize this efficiency overhead.
3.Entity-as-bridge is a strong assumption. Some cross-document reasoning relies on events, schemas, or implicit relations not surfaced as named entities. The current design might under-retrieve those cases unless extended with relation/schema-level linking.
Please refer to the weakness part |
Fully AI-generated |
|
BridgeRAG: A Framework for Reasoning over Partitioned Knowledge Graphs |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper “BridgeRAG: A Framework for Reasoning over Partitioned Knowledge Graphs” presents a RAG architecture for multi-document reasoning. It builds isolated knowledge graphs for each document and links them through a hybrid entity-linking mechanism combining embeddings and LLM verification. During inference, an agent performs iterative retrieval and reasoning using a Dynamic Working Memory (DWM) that refines queries into sub-questions when information is insufficient.
Strengths: (1) Comprehensive Framework Integration: BridgeRAG effectively combines multiple components (entity linking, document summarization, and iterative reasoning) into a cohesive architecture that addresses both contextual isolation and cross-document connectivity in multi-document RAG.
(2) Strong Empirical Performance: The framework demonstrates notable improvements on multi-hop question answering benchmarks, suggesting its potential for enhancing reasoning accuracy and retrieval precision in complex knowledge integration tasks.
Weaknesses: (1) In the Partitioned KG Construction and Pre-Digestion stage, each document contains a large number of entities, and the total number across documents is even higher. Invoking the LLM to summarize every entity neglects the issue of computational cost, leading to significant resource waste as the number of KG nodes increases. It is recommended to include experiments analyzing time consumption and computational overhead. Moreover, while the idea of leveraging broad world knowledge and contextual cues from entity summaries to disambiguate whether two entities refer to the same real-world object is sound, this approach causes the number of LLM invocations to grow rapidly with the number of candidate entity pairs, making it impractical for large document collections.
(2) The Dynamic Working Memory (DWM) module in BridgeRAG is presented as the “core innovation” of the framework, but from a rigorous review perspective, its novelty is limited. The first step of DWM, using IsSufficient to determine whether the current context adequately answers the question, cites Liu et al. (2025) and follows the established “self-evaluation and reflection” paradigm seen in Self-RAG (Asai et al., 2024), Reflexion (Shinn et al., 2023), and Critic-CoT / Self-Consistency (Kumar et al., 2025). The second step, generating refined sub-questions based on prior context, is likewise a common query refinement strategy that is already well established, e.g., in RQ-RAG (Chan et al., 2024) under its “query refinement through LLM planning” approach.
(3) The construction of cross-document experimental datasets is insufficiently described. Given that typical LLMs handle around 5,000 tokens effectively, the rationale for the chosen dataset configuration remains unclear. It would be more convincing to include cross-document evaluations involving longer contexts—e.g., 100,000-token settings—to justify the dataset design.
(1) Regarding computational efficiency: How does BridgeRAG handle the significant computational overhead introduced by invoking the LLM to summarize every entity and perform dual-verification linking across documents? Could the authors provide quantitative analyses of time and resource consumption to demonstrate scalability on large knowledge graphs?
(2) On methodological originality: Given that the Dynamic Working Memory (DWM) module closely follows established self-reflection and query refinement paradigms (e.g., Self-RAG, RQ-RAG), what concrete methodological innovations distinguish BridgeRAG’s DWM from these prior approaches beyond architectural integration?
(3) About dataset construction and scalability: Could the authors elaborate on the design rationale for the chosen multi-document datasets and provide evidence that BridgeRAG remains effective in much larger or longer-context settings (e.g., >100K tokens), where cross-document reasoning becomes more challenging and realistic? |
Heavily AI-edited |
|
Robust Detection of Directional Adversarial Attacks in Deep Neural Networks for Radiological Imaging |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
In the field of radiology, deep neural network (DNN) models have been widely used to detect abnormalities in radiographic images. However, these models can be easily misled by carefully crafted adversarially manipulated samples, revealing a lack of robustness. To address this issue, this paper proposes a simple yet effective method for detecting adversarial-attacked samples. The method performs a “re-attack” on an unknown sample by adding noise of a specific intensity and analyzes the change in the model’s final prediction probabilities to distinguish clean samples from attacked ones. Experiments conducted on three mainstream radiographic image datasets, using various types of adversarial samples (FGSM, BIM, etc.), demonstrate the effectiveness of the proposed method. The results show that their approach can successfully differentiate adversarial-attacked samples from clean samples, exhibiting impressive performance and strong potential for clinical application.
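To make the detection rule concrete, here is a minimal sketch of how the "re-attack with random noise and compare predictions" test could look. The noise form, the thresholds, and the averaging over several draws are assumptions on my part rather than the authors' exact procedure (the paper reports that a noise power of about 3% works best for most models).
```python
import torch

@torch.no_grad()
def flag_adversarial(model, x, epsilon=0.03, delta=0.5, n_trials=5):
    """Hypothetical detector: compare predictions before and after random noise.

    model:    classifier returning logits, already in eval mode
    x:        input image batch, shape (B, C, H, W), values in [0, 1]
    epsilon:  strength of the random perturbation
    delta:    threshold on the prediction-probability shift
    n_trials: number of random-noise draws to average over
    """
    p_clean = torch.softmax(model(x), dim=-1)
    top = p_clean.argmax(dim=-1, keepdim=True)       # originally predicted class

    shifts = []
    for _ in range(n_trials):
        # Random sign noise of magnitude epsilon (the exact noise distribution
        # is an assumption here).
        noise = epsilon * torch.sign(torch.randn_like(x))
        p_noisy = torch.softmax(model((x + noise).clamp(0.0, 1.0)), dim=-1)
        # Shift of the probability assigned to the originally predicted class.
        shifts.append((p_clean.gather(-1, top) - p_noisy.gather(-1, top)).abs())

    mean_shift = torch.stack(shifts).mean(dim=0).squeeze(-1)     # (B,)
    return mean_shift > delta        # True = flagged as likely adversarial
```
Averaging over several noise draws is my own addition to reduce variance; a single draw, as the paper may use, would make the test even cheaper.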
1. The proposed method is simple, effective, and training-free. Therefore, it can be seamlessly integrated with existing neural network–based radiographic anomaly detection approaches, offering strong generalizability and practical potential.
2. The proposed method demonstrates robust performance against various types of adversarial attacks, highlighting its strong resilience.
3. The authors provide an interpretation of their method's effectiveness from the perspective of representation learning; heatmap visualizations are used to illustrate how the method influences the deep learning models.
Although this paper proposes a seemingly simple and effective method for detecting adversarial attacks on radiographic images, I believe it still has several weaknesses. The main issues are as follows:
1. The authors' research motivation and the scientific questions to be addressed should be clearly defined and stated in the Introduction of the paper, e.g.: "...(Based on the above practical considerations), distinguishing between clean samples and adversarial samples is a research problem worth exploring in (related research)".
2. Although Section 1.2 is devoted to discussing adversarial attacks in medical imaging, there are relatively few references that simultaneously address both key themes: medical imaging and adversarial attacks. This section appears to focus more on adversarial-attack-related content rather than on their applications in medical imaging.
3. ICLR focuses on representation learning, but the representation-learning approach proposed in this paper appears to be limited to feeding medical images into a pre-trained network — a representation-learning strategy that seems too simplistic for ICLR’s scope. Besides, although ICLR also welcomes simple and effective methods, this paper lacks broader and deeper experiments and analyses of its proposed method.
4. The proposed method appears to require setting noise-intensity thresholds and prediction-difference thresholds specific to a particular dataset, neural network, and attack type, which may hinder its generalization to out-of-distribution medical imaging data.
5. This paper contains many ambiguous descriptions, and some figure and table captions lack sufficient explanation. The presentation of data is also not standardized (e.g., using long blocks of text instead of tables or charts), which seriously affects the readability of the paper. For details, please refer to the questions I raised in the Questions section. Note that the issues I raised represent only a portion of the problems in this paper. The authors should carefully examine this particular weakness.
1. The case study shown in Figure 1 requires further improvement. It should provide a more detailed explanation of the potential outcomes that such adversarial samples may cause within the network.
2. All external data should be accompanied by reliable source citations. E.g. in Section 1.1: “In 2022 alone, more than 28 million patient records were exposed in healthcare-related data breaches in the US, with imaging data often among compromised content.” Moreover, the related research methods discussed in Section 1.4 are outdated and should be supplemented with more recent studies.
3. It appears that the Related Works section could be divided into two parts: “Applications of adversarial attacks in medical imaging” and “Methods for defending against attacks in medical imaging.” To improve readability, these two parts should be more clearly separated. I suggest that Section 1.2 could focus on adversarial attacks, while Section 1.3 could focus on defense methods. Moreover, incorporating the medical imaging–related content into the Related Works section might be more appropriate.
4. This paper lacks broader and deeper experiments and analyses of the approach. E.g. (1) Is this strategy effective for other visual tasks in medical imaging, such as detection and segmentation? (2) The datasets used in this paper appear to be binary-classification problems, does the strategy remain effective for multi-class problems? (3) A deeper, more persuasive analysis is needed to explain why the proposed method works.
5. In Section 2.4, although the authors propose manually setting a threshold for prediction differences, the meaning of the parameter $\theta$ remains unclear, making Table 2 extremely difficult to interpret. Is the author implying that samples exceeding $\theta$ are considered anomalous, or that samples exceeding the manually set prediction-difference threshold are? In Table 2, does the second subcolumn of each column represent $\theta$ or the manually defined prediction-difference threshold?
6. In Section 3.2, DF causes the prediction accuracies of ResNet-50, ViT-B16, and Inception-V3 to change by 99%, 99%, and 62.6%, respectively. What types of networks were used for the other attack methods, and what specific changes occurred? It would be clearer to present this information in a table.
7. In the 3rd paragraph of Section 3.2, the authors mention that “The most effective universal power for all types of models was 3%.” However, they subsequently state that “So, the added changes to the image on average were smaller than for the other types of attack, resulting in a best epsilon of random noise for each model attacked by DeepFool was 0.01.” Is there a contradiction here, or are the referents of the two statements unclear? In addition, the experimental results in this section should be presented in a table to improve readability.
8. The caption of the bubble plot presented in Figure 4 is unclear. For example, what do the radii of the bubbles represent? Does each bubble correspond to a single sample?
9. It seems that the Brain Tumor dataset is merely a binary classification problem (healthy vs. tumor). Therefore, the conclusion mentioned in Section 3.3 is rather obvious: the predictions of most adversarial samples tend to move toward the correct class, since there are only two possible labels in this context. What I am more concerned about is whether the proposed method can ensure that, in multi-class tasks, the predictions of attacked samples still move toward the correct class—either uniquely or toward the majority of correct classes among many categories.
10. Although this paper focuses on defending against adversarial attacks and proposes a simple and effective method for detecting adversarial examples, similar studies should exist in the broader image domain. In Section 3.3 & Table 2, the paper does not seem to compare its approach with other image-based adversarial detection methods, making the baseline unclear. |
Lightly AI-edited |
|
Robust Detection of Directional Adversarial Attacks in Deep Neural Networks for Radiological Imaging |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a framework for detecting directional adversarial attacks in medical imaging models. The approach compares model predictions before and after adding small random noise, assuming clean images are more stable under such perturbations than adversarial ones. Experiments on three medical datasets (ChestX-Ray, Brain MRI, Retinal Fundus) using ResNet50V2, InceptionV3, and ViT-B16 report detection accuracies up to 99.8% with low false positives.
1. The paper addresses an important and safety-critical problem: detecting adversarial manipulations in medical imaging models used for diagnosis. With deep learning increasingly adopted in clinical practice, ensuring reliability against such attacks is both timely and significant.
2. The proposed method uses only test-time perturbations and prediction differences, making it easy to add as a post-hoc verification step without retraining or model changes. This design fits well with real world medical AI deployment and regulatory constraints.
1. Limited Novelty: The method is very close to earlier prediction-difference and stability-based detection approaches [1, 2], which already use small perturbations to test model confidence. The claimed focus on “directional attacks” adds little that is new, since the method does not actually model or exploit directionality; adding random noise and thresholding are generic operations.
2. Insufficient Theoretical Foundation: The idea that random noise can “neutralize” adversarial perturbations is not theoretically justified. No analytical proof or clear reasoning shows that unstructured noise can reliably counter structured attacks, making this assumption largely heuristic [3].
3. Experimental Design Concerns: Reported accuracies above 99 % seem implausible for this task and are not well-supported by methodological details. It is unclear how thresholds were tuned, raising possible data leakage concerns. Standard robustness metrics (e.g., AUROC, TPR@5%FPR) and variance reporting are missing, which makes the results hard to trust.
4. Lack of Analytical Depth: The paper provides almost no formal analysis of why prediction differences should distinguish clean from adversarial samples. The math is descriptive rather than explanatory, unlike prior work such as Mahalanobis or energy-based detectors that include probabilistic reasoning.
[1] Guo, Feng, et al. "Detecting adversarial examples via prediction difference for deep neural networks." Information Sciences 501 (2019): 182-192.
[2] Lee, Kimin, et al. "A simple unified framework for detecting out-of-distribution samples and adversarial attacks." Advances in neural information processing systems 31 (2018).
[3] Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).
1. Could the authors clarify how sensitive the method is to the choice of random noise strength (ε)? Would slightly different noise magnitudes or distributions (e.g., Gaussian vs. uniform) significantly change performance?
2. How would your method perform against stronger recent detection baselines (e.g., Energy-based OOD detection, or LogitNorm)? |
Fully AI-generated |
|
Robust Detection of Directional Adversarial Attacks in Deep Neural Networks for Radiological Imaging |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a lightweight adversarial detection framework for medical imaging systems that identifies directional adversarial attacks that intentionally mislead models toward a specific incorrect class. The proposed method introduces a random noise–based detection mechanism: it analyzes prediction differences between clean vs. noisy-clean images, and adversarially attacked vs. noisy-attacked images, to distinguish genuine from manipulated images. Experiments were conducted on three datasets, using multiple deep learning backbones under common adversarial attacks.
The framework’s core idea, leveraging prediction variance after random perturbation, is elegant and training-free. It cleverly repurposes noise sensitivity as a proxy for adversarial trace detection without retraining models.
Low false-positive rates (≈ 2–4%) make it practical for medical deployment.
Covers four adversarial attacks (FGSM, BIM, PGD, DeepFool) under multiple epsilon values and models.
Incorporates datasets from major imaging modalities (X-ray, MRI, fundus).
The approach is largely empirical; no theoretical guarantee or formal proof explains why random noise restores original class predictions under adversarial perturbations.
Tested only on 2D image classification; not validated for large 3D imaging datasets (CT volumes, fMRI). Multi-class or regression tasks (e.g., lesion segmentation) are untested.
The fixed threshold may be suboptimal for other datasets or attack strengths, which could impact generalization.
na |
Moderately AI-edited |
|
Robust Detection of Directional Adversarial Attacks in Deep Neural Networks for Radiological Imaging |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper focuses on adversarial example detection that leverages subsequent attacks using random noise. Specifically, the paper focuses on the Chest X-ray for computer-aided medical image analysis. The detection primarily relies on the prediction gap between clean/adversarial images and their noise-perturbed counterparts. In addition to the chest X-ray image context, the paper further explores the generalization of the proposed method in the context of eye fundoscopy and brain MRI. Experiments are also conducted across diverse vision backbones.
1. The topic is interesting. Investigating adversarial threats and their countermeasures in the context of medical image analysis is important in the current society.
2. The visualizations are well presented and organized, and they visually support the claim that small noise nudges attacked images back toward the clean class focus.
3. The paper explores diverse diagnosis tasks in the context of medical image analysis. Furthermore, experimental results across diverse backbones are given.
1. The motivation seems weak. Although it might be intuitive that clean samples and their adversarial counterparts exhibit different behavior when perturbed with noise, an empirical analysis on a specific dataset or some theoretical analysis would further strengthen the paper; otherwise, the correctness of this motivating assumption is not justified.
2. The visualizations in the figure could be further improved. The authors could consider enhancing the pixel intensity of the adversarial noise to obtain a better visualization. In addition, some illustrative annotations would also strengthen the motivation.
3. AutoAttack and adaptive-attack results are missing. In addition, robustness against higher attack strengths (perturbation radii) should be reported.
4. It appears that no equations are given in this paper. The paper would benefit from mathematical analysis.
5. The paper lacks explicit and empirical comparisons with adversarial fine-tuning and adversarial purification.
1. Can the proposed method be extended to multimodal architectures? In other words, would the proposed detection method still be applicable to multimodal large language models, e.g., CLIP or medical CLIP?
2. Would the proposed method be improved via an adaptive (learned) detection threshold compared with a fixed threshold?
3. I would suggest adding a discussion about the practicality of adversarial attacks for medical image analysis: since medical diagnosis machines are typically offline, how would an attacker create adversarial medical examples in practice? |
Fully human-written |
|
Towards Generalizable Implicit In-Context Learning with Attention Routing |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces In-Context Routing (ICR), a novel approach to improve large language models' in-context learning capabilities without using explicit demonstration examples. ICR extracts generalizable patterns from multi-domain in-context learning by identifying Principal ICL Directions (PIDs) through PCA on attention representations. These patterns are applied via a learnable router that modulates attention logits based on input queries, enabling effective zero-shot inference. Unlike existing vector-based implicit ICL methods that inject task-specific vectors into residual streams, ICR operates at the attention mechanism level, providing better generalization. Experiments on 12 datasets show ICR consistently outperforms baselines, particularly excelling on out-of-domain tasks while maintaining computational efficiency comparable to zero-shot inference.
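To make the mechanism concrete, here is a minimal sketch of the two pieces as I understand them: PCA over attention representations collected from multi-domain ICL prompts to obtain the PIDs, and a low-rank, query-conditioned bias added to the attention logits. The shapes, the gating, and the scaling are my own simplifications, not the paper's exact formulation.
```python
import torch

def extract_pids(attention_reps, r=8):
    """Hypothetical PCA step: attention_reps is (N, d), with rows collected from
    multi-domain ICL prompts; returns the top-r principal directions, shape (r, d)."""
    X = attention_reps - attention_reps.mean(dim=0, keepdim=True)
    # PCA via SVD; rows of Vh are the principal directions.
    _, _, Vh = torch.linalg.svd(X, full_matrices=False)
    return Vh[:r]


def routed_attention_logits(Q, K, pids, gate):
    """Add a low-rank, query-conditioned bias to one head's attention logits.

    Q, K:  (T, d) query/key matrices of one attention head
    pids:  (r, d) principal ICL directions
    gate:  (r,) routing weights produced from the input query (assumed given)
    """
    d = Q.shape[-1]
    logits = Q @ K.T / d ** 0.5                       # standard attention logits
    # Project queries and keys onto the PIDs and re-weight by the router's gate,
    # nudging attention toward the structure seen in few-shot prompts.
    bias = (Q @ pids.T) @ torch.diag(gate) @ (K @ pids.T).T / d ** 0.5
    return logits + bias
```
Here the routing weights `gate` are simply assumed to be given; in the paper they are produced by a learnable router from the input query, which is what makes the modulation adaptive.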
1. ICR operates at the attention logits level rather than post-hoc residual stream injection, which is more aligned with how ICL fundamentally works through attention mechanisms
2. The paper provides rigorous theoretical grounding using the Spiked Covariance Model and Davis-Kahan theorem to explain why PCA on multi-domain ICL bases can extract generalizable patterns.
3. Instead of additive vector interventions, ICR modulates attention through low-rank modifications to query-key interactions.
4. Novel use of PCA to extract reusable structural directions from cross-domain attention representations
1. OOD Design Issues:
The division into "near-OOD" and "far-OOD" seems subjective. For example, why is MRPC (paraphrase detection) considered "near" while CB (NLI) is "far"? Both involve sentence-pair understanding. The "OOD" tasks are still mostly classification/QA tasks from standard NLP benchmarks. True OOD would include fundamentally different task types (e.g., structured prediction, generation, mathematical reasoning). The paper trains on 5 diverse datasets (AGNews, SST-2, TREC, CSQA, PIQA) which already cover sentiment, QA, and classification. This makes the "generalization" less impressive since the model has seen similar task types during training.
2. The technical contributions are relatively incremental: The core idea of routing attention through PCA-extracted directions is reasonable, but the execution lacks the technical depth and innovation expected for a top-tier venue. A stronger contribution would involve more sophisticated pattern extraction, adaptive routing mechanisms, or novel theoretical insights about ICL.
3. The quality of PIDs heavily depends on the diversity and quality of initial ICL prompts, but no guidelines are provided for this critical step
1. The experiments only test on 7B-8B models. How does ICR scale to larger models (70B+) where ICL behavior might be fundamentally different?
2. No analysis of how PID dimensionality (r) should scale with model size or task complexity
3. The computational cost of extracting PIDs grows with the number of domains, but this overhead isn't thoroughly analyzed or compared with few-shot ICL, which requires no training but incurs additional cost at inference. |
Fully AI-generated |
|
Towards Generalizable Implicit In-Context Learning with Attention Routing |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces ICR, which extracts Principal ICL Directions from attention and adaptively injects them into logits via a lightweight router. Experiments show ICR outperforms prior implicit ICL, remains stable on out-of-distribution tasks, and achieves strong efficiency. It offers a new paradigm with few parameters, zero-shot generalization, and cross-task reusability.
- This paper introduces the new paradigm of attention routing, shifting implicit ICL from residual injection to low-rank bias at the logits level, demonstrating clear novelty.
- It achieves consistent gains on open-source models such as Llama2, Qwen2.5, and Llama3.1, showing strong generality and reusability.
1. The evaluation is limited to classification and reasoning tasks, lacking assessment on open-ended QA and long-context reasoning.
2. Experiments are only conducted on 7B/8B models, without validation on larger-scale LLMs.
3. The router relies solely on a fixed MiniLM encoder for query representations, without examining whether alternative encoders could affect routing quality and generalization.
see weaknesses |
Moderately AI-edited |
|
Towards Generalizable Implicit In-Context Learning with Attention Routing |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper focuses on reconstructing the attention patterns of few-shot inputs in zero-shot settings. Specifically, the paper proposes the In-Context Routing (ICR) method, which adds a bias term to the attention logits produced under zero-shot inputs in order to reconstruct the attention patterns observed in few-shot scenarios. Extensive quantitative experiments demonstrate that the proposed method outperforms a wide range of modern baselines, especially on out-of-domain tasks.
1. The authors propose a novel framework called attention routing, which enables automatic additive steering of attention logits and can be broadly applied across various scenarios. The attempt to explicitly control LLM behavior through a mechanistic understanding is, in my view, revolutionary for the field of interpretability. This is my primary reason for recommending the acceptance of this paper.
2. Based on the above attention routing framework, the authors further introduce the In-context Routing (ICR) method. Through extensive quantitative experiments on sufficiently diverse datasets and model types, the authors demonstrate that ICR outperforms multiple baselines.
3. The analysis section provides insightful details about the proposed method, strengthening the claim that ICR provides generalizable attention shaping. In particular, the authors find that the reshaped attention scores can capture reasoning-oriented tokens, thereby confirming the soundness of the original motivation behind ICR.
1. ICR relies on gradient-based training over relatively large datasets, together with several tricks, to facilitate attention routing. It also introduces an external text encoder to compute the two key control gates in the method, and is optimized with a complex loss function. This design may contradict the low-resource spirit of ICL and undermine overall usability. Furthermore, to my knowledge, the authors neither discuss how the performance of this additional text encoder affects ICR, nor provide sufficiently convincing ablation results to confirm the effectiveness of each loss component (e.g., in lines 2-4 of Table 4, ablating some loss terms does not harm accuracy). I consider attention routing to be an elegant framework, but relying on a bulky auxiliary module seems less than ideal.
At the same time, this raises concerns regarding the paper's main results (Tables 1 and 2): many of the provided baselines (such as TV and FV) involve substantially lower computational costs than ICR, making the comparison somewhat unfair. Although the authors claim that ICR exhibits good generalization and reusability, I would like to see at least a comparison of computational cost to strengthen the credibility of these results.
Moreover, from another perspective, since ICR already uses gradient-based training, it would be reasonable to train $\Delta \mathbf{A}$ directly. I hope the authors can include such an experiment to demonstrate that their manual selection of the $\Delta \mathbf{A}$ basis is not redundant.
2. Mechanistically, the authors employ an external text encoder ($E(\cdot)$) to predict two key gating units within the ICR framework. These gating units are closely related to the internal structure of the LLM (e.g., selecting the important heads, as the authors mention in Section 5.3). A crucial question therefore arises: does $E(x)$ actually contain information about the LLM's internal structure, or is it merely an irrelevant variable? A simple experiment could address this by ablating $E(x)$ into random vectors (a minimal sketch is given after this weaknesses list). If the former is the case, how is this information then extracted by the two parameters $\theta$? This would be an interesting analysis, yet the authors skipped it.
3. There are several writing issues that make the paper somewhat difficult to follow, but I believe that this does not significantly affect my overall judgment of the paper.
1. Line 52. "out-of-domain (OOD)" is ambiguous. You seem to mean that the query lies outside the distribution of the demonstrations, but another possible interpretation is that "the query lies outside the pre-training distribution". Understanding this is crucial to following your motivation, so I recommend clarifying it to eliminate the ambiguity. Also, the specific experimental setup of Fig. 1 should be described (perhaps in the appendix).
2. Line 115. This paragraph is somewhat unclear. I don’t fully understand the causal link in “Such additive interventions cannot structurally control how information flows, and thus often remain tied to task-specific representations.” I can understand that using task vectors for steering cannot *explicitly* control the information flow (i.e., attention scores, although not absolutely, since injecting certain components into the attention query could indirectly alter the attention scores), but I don’t see how this leads to being “tied to task-specific representations.” If I have missed something, I apologize.
3. I suggest that the authors explain how each introduced symbol is grounded. For example, the symbol $\alpha$ in Equation (3) is confusing; it is not clear what it denotes until Sec. 3.2 introduces it as a parameter to be trained.
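For clarity, the ablation I suggest in point 2 of the weaknesses could be as simple as the following sketch; the function name and shapes are placeholders of my own, not the authors' API.

```python
# Replace the encoder output E(x) with a random vector of matching dimension and
# L2 norm, keeping the trained router fixed, and compare downstream accuracy.
import numpy as np

def ablate_encoder_output(e_x, seed=0):
    """e_x: (d_enc,) encoder embedding of the query. Returns a random vector with
    the same dimension and norm, destroying any query-specific information."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(e_x.shape)
    return noise * (np.linalg.norm(e_x) / np.linalg.norm(noise))
```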
1. The authors seem to attribute all the benefits of ICL demonstrations to local attention effects within the query's tokens (i.e., dynamically filtering task-relevant signals through attention scores). However, as far as I know, additional attention behaviors such as induction heads perform global attention operations from the demonstrations to the query. ICR clearly cannot reconstruct such attention patterns, since it is conducted under zero-shot inputs, yet the method still outperforms vanilla few-shot prompting. This might prompt a new perspective on the mechanism of ICL. I would like to ask how the authors interpret this phenomenon, and whether they could expand their discussion of such mechanisms in the paper.
2. The analysis of layer/head importance (Fig. 4, left and middle) appears to include only the later layers. Could you release the results for all layers? This seems to suggest that specific attention heads drive the zero-shot inference and can thus be improved by attention routing; it would therefore be interesting to see the detailed distribution of such heads. |
Fully human-written |