|
Exploring Cross-Modal Flows for Few-Shot Learning |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the challenge of achieving precise cross-modal alignment in vision-language models (VLMs) for few-shot learning. The authors argue that existing parameter-efficient fine-tuning (PEFT) methods—such as prompt tuning, adapter-based, and LoRA-based approaches—perform only a "one-step" adjustment of features, which is insufficient for complex datasets where modalities are highly entangled. To overcome this limitation, the authors propose Flow Matching Alignment (FMA), a model-agnostic framework that leverages flow matching theory to enable multi-step feature transformation. FMA incorporates three key designs: coupling enforcement to preserve class correspondence, noise augmentation to mitigate data scarcity, and an early-stopping solver for efficient and accurate inference. Extensive experiments on 11 benchmarks show that FMA consistently improves performance, especially on challenging datasets, and integrates seamlessly with various backbones and PEFT methods.
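For reference, my reading of the core training objective is a standard conditional flow-matching loss with the coupling fixed by class labels (the notation below is mine, and the paper's exact formulation may differ):
$$
x_t = (1-t)\,x^{\mathrm{img}} + t\,x^{\mathrm{txt}}_{y}, \qquad
\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t \sim \mathcal{U}[0,1]} \left\| v_\theta(x_t, t) - \left(x^{\mathrm{txt}}_{y} - x^{\mathrm{img}}\right) \right\|_2^2,
$$
where $x^{\mathrm{img}}$ is an image feature, $x^{\mathrm{txt}}_{y}$ is the text embedding of its ground-truth class $y$ (coupling enforcement), and $v_\theta$ is the learned velocity field that is integrated over multiple steps at inference.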
1. Novel application of flow matching to cross-modal alignment in few-shot learning, moving beyond generative tasks.
2. Effective design choices (e.g., early-stopping solver, noise augmentation) that address practical challenges in training and inference.
1. No analysis of computational overhead or inference latency introduced by multi-step transformation.
2. Ablation studies do not explore the sensitivity of performance to hyperparameters like inference steps.
3. The early-stopping strategy uses a fixed step count rather than a sample-adaptive criterion, which may limit optimality.
1. Could the authors provide per-dataset results in Section 4.2 (GENERALIZATION ABILITY), where only average performance is reported?
2. Was any exploration done into adaptive early-stopping criteria (e.g., based on feature discriminability) rather than a fixed step count?
3. How does FMA perform in cross-modal retrieval or other downstream tasks beyond classification, given its alignment-focused design?
4. Could the authors provide more intuition or theoretical insight into why coupling enforcement preserves class-level correspondence in high-dimensional feature spaces? |
Fully AI-generated |
|
Exploring Cross-Modal Flows for Few-Shot Learning |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a method to improve alignment performance between modalities in cross-modal models.
It claims that existing methods fail to align well on challenging datasets because they attempt one-step alignment,
and proposes a multi-step approach to align the embedding vectors of the two modalities.
Specifically, it performs flow matching to transform from images to the distribution of text embedding vectors.
- This paper is well-written and well-structured.
- Alignment of embedding vectors across multiple modalities is an important research topic.
- Flow-matching alignment seems novel. However, its necessity is questionable, and it might just be a combination of existing techniques.
- In the experiments, the proposed method outperforms baselines on classification tasks. However, as written in Weaknesses, it is unclear whether the evaluation is well-designed to confirm the claims.
- The motivation for multi-step adjustment is unclear.
First, the definition of the "one-step adjustment" blamed for the poor performance of existing methods is ambiguous.
For example, is the claim that PEFT's optimization objective function is inappropriate, or that optimization is insufficient due to difficult learning?
Figure 2 discusses PEFT's characteristics compared to LP, but the validity of using LP as a baseline for this discussion is unclear.
It is unclear how this connects to the statement: “these methods try to adjust their general aligned multi-modal distribution towards the golden distribution by one rectification step.”
- The experimental setups are insufficiently described, resulting in a lack of reproducibility.
For example, it states that velocity networks are learned, but I could not find a description of the specific structure of the velocity networks.
There is no definition of $\sigma^2(\cdot)$. There also seems to be no report of the number of steps M for the proposed method on each dataset.
In addition, there is no evaluation of statistical significance.
- The baseline varies depending on the evaluation.
While Table 1 compares against 8 baselines, Table 2 has one baseline and Table 3 has five.
Specifically, the baseline compared in Table 2 is one of the weaker baselines among those appearing in Table 1.
Although there are practical limitations on the number of experiments, comparing against the strongest baseline yields more convincing results.
- The proposed method seems computationally expensive.
It requires preparing velocity networks and performing multiple updates during inference (Algorithm 2).
How does the computational cost compare to the CLIP-Adapter with two linear layers? How does it compare to PEFT?
Since the performance improvement over CLIP-LoRA is only 0–2%, the heavy inference cost makes the proposed method less useful.
- Minor issues
- The page is densely filled with text, making it difficult to read.
The absence of a single line of space before and after figures and tables, such as the caption for Figure 4, violates the template.
- What is the definition of one-step adjustment? Does it mean that the objective function is set once or that the optimization is only one step? Are Fig. 1(b)-(d) optimal embeddings in some sense?
- In Fig. 2, isn't it a bit simplistic to conclude that PEFT is weak on challenging datasets based on LP?
Couldn't one also conclude that LP is strong on more challenging datasets?
- What happens if stronger methods are used as baselines in Table 2? Also, did you check the standard deviation of the results and their statistical significance?
- How about the comparison of computational cost? |
Fully human-written |
|
Exploring Cross-Modal Flows for Few-Shot Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a new framework called Flow Matching Alignment (FMA) to improve feature alignment between visual and textual modalities in few-shot learning. The authors observe that current parameter-efficient fine-tuning (PEFT) methods—such as prompt tuning, adapter tuning, and LoRA—perform only one-step feature adjustments, which are insufficient for complex datasets where image–text features are highly entangled. FMA leverages the multi-step rectification ability of flow matching by learning a cross-modal velocity field that iteratively transforms image features toward corresponding text embeddings, achieving finer alignment. To ensure stability and correctness, the method incorporates three key designs: coupling enforcement (to maintain class correspondence), noise augmentation (to mitigate data scarcity), and an early-stopping solver (to prevent over-transformation during inference). Experiments across 11 benchmarks and multiple backbones show that FMA consistently outperforms existing PEFT methods.
1. FMA introduces flow matching to few-shot learning. By formulating traditional PEFT methods as one-step updates, FMA enables more precise iterative alignment between visual and textual features. As argued by the authors, FMA better handles entangled multimodal distributions, especially in challenging datasets.
2. The framework is architecture-independent and can be integrated with various pre-trained vision-language models (e.g., CLIP, CoOp, LoRA) without altering their internal structures.
1. The multi-step flow matching process requires iterative training and inference, which increases computational cost compared to traditional one-step PEFT methods, potentially limiting scalability for large datasets or real-time applications.
2. The method relies on carefully chosen parameters such as the number of inference steps, step size, and noise schedule. Suboptimal tuning can lead to degraded performance or instability during alignment. This is especially concerning because flow matching originates from generative modeling, and its adaptation to supervised classification tasks lacks rigorous theoretical grounding in terms of convergence and optimal stopping criteria.
1. As discussed in the weakness part, is there any theoretical guarantee of convergence and stability of FMA? Given that flow matching originates from generative modeling, what are the theoretical conditions under which FMA ensures convergence to the correct class-aligned distribution in supervised learning settings? |
Fully AI-generated |
|
Exploring Cross-Modal Flows for Few-Shot Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper reframes PEFT as a one-step adjustment problem that fails on difficult datasets and proposes Flow Matching Alignment (FMA), which learns a velocity field to iteratively transport image features toward their ground-truth text features. It introduces coupling enforcement to preserve class correspondence, noise augmentation to combat data sparsity and manifold collapse, and an early-stopping solver (ESS) that classifies from intermediate states to avoid late-stage drift. FMA is plug-and-play across CLIP and multiple PEFT backbones and shows consistent gains on 11 benchmarks, especially on difficult datasets, with ablations supporting each component.
The diagnosis of one-step PEFT limitations is convincing; the method is simple, modular, and effective across backbones; experiments are comprehensive; and the early-stopping insight is well-supported by empirical phenomena.
Lack of formal guarantees for coupling assumptions, reliance on validation to set inference steps, missing comparisons with higher-order ODE solvers, coarse difficulty metric, and limited analysis of compute trade-offs and failure modes.
1. Can ESS be made adaptive without validation tuning (e.g., stop on sufficient logit margin, small velocity norm, or diminishing logit gains)?
2. How is $\sigma(x_t)$ chosen in practice, and how sensitive are results to the noise magnitude/schedule? Any benefit from uncertainty- or density-adaptive noise?
3. Did you try higher-order or adaptive ODE solvers (Heun/RK) to reduce truncation error and improve margins at the same step budget?
4. Is fixed pairing too rigid for multi-modal classes? Would transporting toward multiple positive prototypes or a class subspace help?
5. Have you considered adding a discriminative loss on intermediate states (e.g., contrastive/margin) to align transport with classification throughout the trajectory?
6. Can you report detailed overhead (velocity net size, train/infer time, average ESS steps) and scaling with number of classes? |
Fully AI-generated |
|
Sequential Diffusion Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces sequential diffusion language models that (1) reformulate the block-diffusion training objective so that all inputs within a block are masked tokens, and (2) enable parallel training through a specially designed attention mask. During inference, the authors employ greedy decoding and self-speculative decoding to generate and sample tokens in parallel. The results show strong performance compared to the base model along with faster inference.
1. The paper proposes two novel language models that enable parallel decoding while achieving superior performance compared to previous diffusion language models.
2. The proposed approach is methodologically sound, and the resulting models demonstrate strong empirical performance.
1. I wonder whether this model can still be called a diffusion model. The only thing left in the current DLM that relates to the term `diffusion` is that different tokens are randomly masked at different timesteps. However, the training objective proposed in Equation 3 omits all timestep-related design and simply sums over several mask tokens; it has no relationship with the theory of diffusion models. Can the authors clarify why this model can still be called a diffusion model?
2. Important comparisons and ablations are missing. The most straightforward baseline to compare with is block diffusion. The training objective and the core idea are derived from that paper, but no comparison is shown. What benefits does the new training objective bring? What would be the effect of removing the timestep-dependent random masking?
3. Missing comparison with multi-token prediction and speculative baselines. This paper is very similar in spirit to extending an AR model to predict multiple tokens (the Medusa series) and then optionally applying speculative decoding (the self-speculative decoding part). Do you have any comparisons with those baselines?
1. How much extra training overhead would be introduced by applying this special attention mask, given that the token sequence becomes longer than the original one?
2. Why would the performance on Math be higher than LLaDA/Dream by a large margin? I suspect this is not a benefit of the new training recipe, but rather of different tricks in the instruction-tuning stage.
Fully human-written |
|
Sequential Diffusion Language Models |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes sequential diffusion models. SDLMs perform inference by predicting a block of $D$ tokens ahead in parallel, all at once, from masked tokens. This gives more FLOPs per prediction than, e.g., multi-token prediction. Predictions are then truncated based on sequence confidence (Longest Prefix Decoding). A couple of options are explored for confidence functions: product of probabilities, product of entropy-related quantities, and a self-consistency check drawn from Blockwise Parallel Decoding. Training is the same as Block Diffusion, relying on fixed block sizes.
Results demonstrate that SDLMs offer speedups over the original autoregressive model. No comparisons are made to speculative decoding or multi-token prediction methods.
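For concreteness, a minimal sketch of the longest-prefix acceptance rule as I understand it, using the product-of-probabilities confidence (the threshold name `tau` and the always-accept-one-token fallback are my assumptions, not the paper's code):
```python
import torch

def longest_prefix_accept(block_probs: torch.Tensor, tau: float = 0.9) -> int:
    """block_probs: (D,) probabilities of the D tokens decoded in parallel for one block.
    Accept the longest prefix whose cumulative product of probabilities stays above tau."""
    cum_conf = torch.cumprod(block_probs, dim=0)    # confidence of each prefix length
    accepted = int((cum_conf >= tau).sum().item())  # cum_conf is non-increasing, so this counts the prefix
    return max(accepted, 1)                         # assumed fallback: always emit at least one token
```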
The method leads to speedups over sequential decoding, but so will any speculative decoding method with a decent acceptance rate. The writing was clear.
The main weakness is a lack of comparison to spec decoding methods. I assume this weakness has been anticipated by the authors. Since speculative decoding has been optimized by the community, I would not expect SDLMs to be faster on wall-clock time. A reasonable win would be to show an improved accept length per prediction. Faster wall-clock time as well would of course be great. Eagle-3 claims to have speedups of around 3.5-4.4x and accept lengths around 5-6 tokens on Llama 8b. The results in Table 3 are not far off in terms of accept lengths, but are not an apples-to-apples comparison.
One main benefit of diffusion models is any-order inference, e.g. infilling. It's possible that if pure spec-decoding cannot be beat in the sequential setting, infilling may be the main benefit of blending diffusion models + speculative inference.
The main discussion point is comparison to speculative decoding methods. In the weaknesses section, I suggested comparing to Eagle 3. |
Fully human-written |
|
Sequential Diffusion Language Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work proposes Sequential Diffusion Language Models (SDLM), a framework that aims to unify autoregressive decoding and diffusion decoding through a next sequence prediction (NSP) objective. SDLM/NSP adopts block-wise token generation (e.g., generating 4 tokens at a time). It further uses a confidence-based or consistency-based length function to identify a usable, longest prefix of the generation -- making the generation block-wise but variable-length as well. SDLM is obtained by continued training from regular autoregressive LLMs (e.g., Qwen-2.5). The experiments compare SDLM with autoregressive and diffusion LMs, showing inference speedups with reasonable performance as well.
This paper addresses an important problem in block-wise token generation for LLMs, making the usually fixed block size variable-length. The proposed SDLM can be obtained by continued training from regular LLMs. The experiments show reasonable inference speed-ups. The writing is clear and well-organized.
(1) Relevance to diffusion models. Though this work frames SDLM as a diffusion model, it doesn't seem to have a "diffusion" procedure (e.g., with a noising procedure, interpolating between a clean and noise distribution, etc.). Instead, each prediction block is always fully masked with a masking probability of 1. This makes the connection to diffusion language models questionable. The prediction length in each block is also much shorter (D=4).
(2) Baselines in the multi-token prediction domain. This work compares with regular language models (Qwen) and also two diffusion language models (LLaDA, Dream). However, this work should have a stronger connection with the multi-token prediction works. Direct comparisons with Medusa [1] and work like [2] should be most relevant.
(3) The performance drop. The performance drop of SDLM relative to regular LMs of the same model size is non-trivial. Table 1 shows that when D=4, there is on average a 2.2- to 7.3-point performance drop. Though the method offers 1.89x to 2.39x speed-ups, this gap needs more justification. Also, does the average number of tokens per pass directly translate into the same amount of acceleration in practice (e.g., once advanced inference frameworks like vLLM, speculative decoding, etc. are factored in)?
[1] Cai et al., 2024. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads.
[2] Gloeckle et al., 2024. Better & Faster Large Language Models via Multi-token Prediction.
Does the method work for $D \gg 8$? I assume it could be hard for general tasks, but would it work for tasks that require copying large chunks?
Fully human-written |
|
Sequential Diffusion Language Models |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors propose to cheaply finetune autoregressive LMs for diffusion decoding of contiguous token blocks. They extend standard block diffusion inference to variable-length block sizes that adapt to confidence. The authors show a 5x speedup over the prior dLLM LLaDA with improved benchmark performance and a 2x speedup over Qwen, scaling their models up to 32B parameters.
- **Technical novelty**: The authors overcome the limitations in existing block diffusion models by proposing adaptive block sizes at inference, while introducing entropy-normalized confidence and self-speculative decoding for dLLMs
- **Strong experimental results**: The authors show 5x speedup over LLaDA and 2x speedup over Qwen with limited SFT examples, scaling up to 32B parameters. Further, the benchmark performance improves substantially over prior dLLMs LLaDA and Dream. The released models will positively impact the research community
- **Thorough benchmarking and ablations**: The authors provide thorough analysis across benchmarks, showing comparable performance to AR SFT with better throughput. The authors also carefully assess the importance of block size, confidence metric, logit shifting, attention mechanism, and speculative decoding and show that their proposed recipe improves downstream performance
- **Unclear whether proposed Next Sequence Prediction modeling objective corresponds to a valid NELBO**
- **Ambiguous training algorithm**: It is unclear how the sequence is split into blocks at random positions (line 213), or why noised blocks are randomly inserted in the training input sequence (line 250). Training pseudocode is not provided.
- **Contiguous block decoding prevents long-range token interactions**: The proposed decoding strategy is restricted to generate contiguous blocks, preventing long-range token interactions and non-causal token conditioning, which are advantages of diffusion.
- **Missing ablation on the importance of contiguous block decoding**: It is unclear whether contiguous block decoding is necessary, which is a core contribution over [1]
- **Missing experimental comparison to multi-token prediction**: The proposed decoding strategy generates the longest contiguous block of tokens possible in each decoding step, which is very similar to multi-token prediction (MTP) approaches [2,3] as noted in Appendix B. Given their close similarities, the authors should show experimental comparison to MTP methods to better motivate their approach
- **Overstated contribution**: The authors claim that standard block diffusion training [1] is not scalable (lines 100-101), while the authors propose to overcome this by initializing from an AR checkpoint. However, the proposed training algorithm (Section 3.3) has comparable efficiency to [1]. Further, [1] could easily be directly extended to initialize from an AR checkpoint
- **The proposed parallel training is very similar to the vectorized training in [1]**, which is not cited or compared against in the text (lines 245-258). [1] also proposes parallel training to predict all blocks using FlexAttention kernels
- **Presentation/clarity**: Various parts of the paper feel rushed with numerous grammatical errors/missing definitions
- line 96, 108: "with only 3.5M training data"
- line 40: "benefit from the diffuse nature in efficiency"
- line 194: "shifted-prediction objective" undefined in the section
- line 200: $\hat{X}$ is undefined
- line 169: $\alpha_t$ undefined and the forward noise process $q$ defined informally
- line 214: confusing notation: $Y_T^i$ corresponds to a block of $D$ tokens, but the superscript refers to indexing at position $i$?
[1] Arriola et al. Interpolating between autoregressive and diffusion language models. ICLR 2025
[2] Cai et al. Medusa: Simple llm inference acceleration framework with multiple decoding heads. ICML 2024
[3] Gloeckle et al. Better & Faster Large Language Models via Multi-token Prediction. arXiv 2024
- Can the authors provide pseudocode for the training and inference algorithms?
- Does the proposed next sequence prediction objective (Eq 3) correspond to a valid NELBO?
- Can the authors show the importance of contiguous block decoding via ablation?
- Can the authors show the effectiveness of their approach relative to multi-token prediction [2,3]?
- Can the authors detail the differences between the proposed parallel block diffusion training and the vectorized training from [1]?
- In addition, the choice of $\tau \in [0.82, 0.98]$ in Table 1 seems heuristic; can the authors clarify how $\tau$ was picked?
[1] Arriola et al. Interpolating between autoregressive and diffusion language models. ICLR 2025
[2] Cai et al. Medusa: Simple llm inference acceleration framework with multiple decoding heads. ICML 2024
[3] Gloeckle et al. Better & Faster Large Language Models via Multi-token Prediction. arXiv 2024 |
Fully human-written |
|
Semi-Supervised Dataset Condensation with Dual Consistency Trajectory Matching |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces Semi-Supervised Dataset Condensation (SSDC), and a method, Semi-Supervised Dual Consistency Trajectory Matching (SSD), to address it. The method employs a two-stage knowledge distillation framework. First, a teacher model is trained on the full dataset using a semi-supervised learning (SSL) technique to generate accurate pseudo-labels for the unlabeled data. Instead of matching the teacher's potentially noisy training trajectory, SSD trains a student model on the full pseudo-labeled dataset using a novel dual consistency regularization loss. This loss enforces both inter-model consistency (matching the teacher's predictions) and intra-model consistency (maintaining stable predictions under augmentation). The stable training trajectory of this student model is then used to optimize the synthetic dataset via trajectory matching. Experiments on MNIST, Fashion-MNIST, and CIFAR-10 show that SSD consistently outperforms existing supervised condensation methods adapted to this new semi-supervised setting.
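For reference, my reading of the dual consistency objective, written as a rough sketch (the specific divergences, the weak/strong augmentation pairing, and the weights `lam_inter`/`lam_intra` are my guesses rather than the authors' exact design):
```python
import torch
import torch.nn.functional as F

def dual_consistency_loss(student, teacher, x_weak, x_strong, pseudo_y,
                          lam_inter=1.0, lam_intra=1.0):
    """Cross-entropy on teacher pseudo-labels, plus inter-model consistency
    (student matches teacher predictions) and intra-model consistency
    (student is stable across augmentations of the same sample)."""
    s_weak, s_strong = student(x_weak), student(x_strong)
    with torch.no_grad():
        t_weak = teacher(x_weak)
    loss_ce = F.cross_entropy(s_weak, pseudo_y)
    loss_inter = F.kl_div(F.log_softmax(s_weak, dim=-1),
                          F.softmax(t_weak, dim=-1), reduction="batchmean")
    loss_intra = F.mse_loss(F.softmax(s_strong, dim=-1), F.softmax(s_weak, dim=-1))
    return loss_ce + lam_inter * loss_inter + lam_intra * loss_intra
```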
The paper is easy to follow.
1. The novelty lies in the specific combination and application of existing techniques to the new SSDC problem. The core components—trajectory matching (MTT), semi-supervised learning with pseudo-labels (FixMatch), and consistency regularization—are all established concepts. The primary contribution is the architectural design that synthesizes these ideas into a framework that successfully stabilizes trajectory matching in a semi-supervised context. The dual consistency loss is a nice refinement but is conceptually similar to losses used in knowledge distillation and SSL. Also, the idea has already been studied in previous works [1][2][3].
2. Experiments:
- Baseline Comparisons: The paper compares SSD against supervised dataset condensation methods (DC, DM, MTT, etc.) applied to the small labeled portion of the data. While SSD's superior performance is expected and demonstrates the value of using unlabeled data, a more insightful baseline would be to apply these supervised methods to a dataset where the unlabeled data has been pseudo-labeled by the teacher (the "naive approach" mentioned in the intro). This would directly isolate the benefit of the dual-consistency student trajectory from the benefit of just having more (pseudo-labeled) data. The current "MTT" baseline seems to do this, but the description is slightly ambiguous ("generated via FixMatch on the entire dataset"). Clarifying this setup is important.
- Limited Dataset Diversity: The experiments are conducted on three standard, relatively simple image classification benchmarks (MNIST, Fashion-MNIST, CIFAR-10). While sufficient for a proof of concept, the true challenge of SSL and condensation often appears in more complex, fine-grained, or long-tailed datasets, e.g., ImageNet-1K, COCO, VOC, etc. Evaluating SSD in such scenarios would provide a stronger test of its robustness.
- Computational Cost: The proposed method involves three distinct training stages: (1) training a teacher model, (2) training a student model to generate trajectories, and (3) optimizing the synthetic dataset via trajectory matching. This is a highly complex and computationally intensive pipeline. The paper lacks a clear analysis of this overhead compared to baselines. While the resulting condensed set is efficient to train on, the cost of creating it appears substantial. Also, the better the teacher, the better the performance. Please consider using weak-to-strong strategies.
- Missing baselines: Trajectory matching is too outdated a baseline. Please conduct experiments with matching-based SOTA methods like NCFM [4], etc. In particular, NCFM is more efficient than MTT-based methods; applying your recipe to distribution matching could be a new direction. Decoupling-based baselines are also missing, including the training-free RDED [5] and the training-based SRe2L [6].
[1] Yu S F, Yao J J, Chiu W C. Boost self-supervised dataset distillation via parameterization, predefined augmentation, and approximation[C]//The Thirteenth International Conference on Learning Representations. 2025.
[2] Joshi S, Ni J, Mirzasoleiman B. Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks[J]. arXiv preprint arXiv:2410.02116, 2024.
[3] Lee D B, Lee S, Ko J, et al. Self-Supervised Dataset Distillation for Transfer Learning[C]//The Twelfth International Conference on Learning Representations.
[4] Wang S, Yang Y, Liu Z, et al. Dataset distillation with neural characteristic function: A minmax perspective[C]//Proceedings of the Computer Vision and Pattern Recognition Conference. 2025: 25570-25580.
[5] Sun P, Shi B, Yu D, et al. On the diversity and realism of distilled dataset: An efficient dataset distillation paradigm[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 9390-9399.
[6] Yin Z, Xing E, Shen Z. Squeeze, recover and relabel: Dataset condensation at imagenet scale from a new perspective[J]. Advances in Neural Information Processing Systems, 2023, 36: 73582-73603.
Please see weaknesses.
The quality of this paper could be greatly improved if all the relevant experiments and related works were compared and discussed properly.
Fully human-written |
|
Semi-Supervised Dataset Condensation with Dual Consistency Trajectory Matching |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper extends trajectory-matching dataset condensation (MTT) to a semi-supervised setting: a teacher trained with unlabeled data produces pseudo-labels; a student is trained with dual consistency (teacher–student and augmentation consistency); the synthetic set is optimized by matching the student’s training trajectory on real vs. synthetic data. The paper claims improved accuracy and efficiency on small image classification benchmarks.
1. Clear motivation to exploit abundant unlabeled data in condensation.
2. The dual-consistency idea is simple and reasonable for handling noisy pseudo-labels.
3. Writing and figures make the pipeline easy to follow.
1. The method mainly combines known pieces (MTT-style trajectory matching + standard semi-supervised ingredients) and reads as an adaptation rather than a new principle.
2. Practical value for broader tasks is unclear if each task would require its own distilled set.
3. No result shows one fixed distilled set transferring across different architectures trained from scratch (cf. cross-architecture evaluations reported in DSA and M3D).
4. No wall-clock time, hardware/memory, or timing comparisons are reported.
5. Scale is small. Results are limited to MNIST/Fashion-MNIST/CIFAR-10.
6. Table 1 shows M3D not always > DM (e.g., CIFAR-10); setup alignment is unclear.
7. Minor presentation issues. Typos (e.g., 408 “12.2These results”).
The experimental support feels too light; closing the gap (fixed-set cross-architecture transfer, larger-scale data, and efficiency numbers) would likely require substantial additional work, so my current rating is 2 (reject).
1. Is the distilled set intended for one task and one model family? What kind of transfer should readers expect across tasks and architectures?
2. Under a comparable compute/data budget, why prefer synthesizing a dataset over directly fine-tuning/KD of the student? If the distilled set is student-specific, the conceptual benefit beyond model-centric adaptation is not evident.
3. Could you comment on why a fixed distilled set is not evaluated across different architectures trained from scratch? If the set is student-specific, what is the intended use case?
4. Given the small-scale benchmarks, how should readers assess the claim that the synthesized dataset can stand in for a large-scale dataset?
5. What explains the M3D vs. DM numbers (e.g., CIFAR-10 at higher IPC) relative to M3D’s usual gains—are there configuration or protocol differences the reader should be aware of? |
Lightly AI-edited |
|
Semi-Supervised Dataset Condensation with Dual Consistency Trajectory Matching |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a new task: Semi-Supervised Dataset Condensation (SSDC), which aims to condense a mixture of labeled and abundant unlabeled samples into a small, informative synthetic labeled dataset. To ensure robust performance and stable trajectories, the student model generates training trajectories on the entire dataset using a dual consistency regularization loss. Experiments show that the proposed method consistently outperforms supervised DC baselines on MNIST, Fashion-MNIST, and CIFAR-10.
1. This work is the first one to study the SSDC problem.
2. SSD consistently and significantly outperforms several supervised DC baselines (DC, DSA, DM, MTT, M3D) across three different datasets and can be used with various SSL pre-training methods.
1. The novelty of the proposed method is limited. Intra-model consistency and inter-model consistency have been used in previous SSL papers [1,2,3,4].
2. The presentation is not good. For example, there are no equation labels. In line 276, two terms are the same. Why is $L_{ref}$ used for the student network, while $L^s_{ce}$ is used for the reference model?
3. Missing latest baselines for dataset condensation, e.g. DATM [5], DANCE [6], D3S[7]. Specially, SSDC with pseudo-labels can be seen as dataset distillation with domain shift [7].
[1] Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning. NeurIPS 2016
[2] Temporal ensembling for semi-supervised learning. ICLR 2017
[3] Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. NeuIPS 2017
[4] Deep Co-Training for Semi-Supervised Image Recognition. ECCV 2018
[5] Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching. ICLR 2024
[6] DANCE: Dual-View Distribution Alignment for Dataset Condensation. IJCAI 2024
[7] Large Scale Dataset Distillation with Domain Shift. ICML 2024
See above. |
Fully human-written |
|
Semi-Supervised Dataset Condensation with Dual Consistency Trajectory Matching |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a new problem setting in the dataset condensation literature, semi-supervised dataset condensation, where a small amount of labeled data and a large amount of unlabeled data are available. Under this setup, the authors propose semi-supervised dual consistency trajectory matching, building on Matching Training Trajectories (MTT; Cazenavette et al., 2022), which updates synthetic data so that the parameter trajectory of a student model matches the training trajectory of real data (teacher model). Specifically, the method trains teacher models under semi-supervision and student models with a dual consistency loss. Experimental results demonstrate that the proposed approach achieves strong performance.
- This paper introduces a new problem setup in the dataset condensation literature.
- The proposed method is simple and easy to implement.
- The method demonstrates strong performance under the experimental settings presented in the paper.
- This paper lacks clarity:
- In L270–271, could the authors clarify the statement “Using the trajectory of the teacher model for dataset condensation leads to unstable results”? Could the authors also provide empirical evidence supporting this claim?
- In L207, what does $N$ represent in $\hat{\theta}_{t+N}$? As far as I understand, $N$ denotes the number of supervised data points, i.e., $|D^l|$.
- In L294, the authors use $N$ to refer to rounds rather than the number of data points, which is confusing. What does a “round” mean here? Is it the number of iterations used to update the student models $p_S$?
- In L226 (line 5 of Algorithm 1), I believe it should be $t=0$ instead of just 0.
- In L227 (line 6 of Algorithm 1), what does $T^+$ denote?
- In L299–301, what is $p_R$? Is it the same as $p_S$ (the student model)?
- Although this paper introduces a new problem setup, its technical contribution is limited. The main algorithm is essentially the same as MTT (Cazenavette et al., 2022), with semi-supervised teacher models.
- The motivation behind the proposed dual consistency is not persuasive. It is unclear why this component is necessary for the given problem setup.
- The experimental setup is too limited and small-scale. The baseline method, MTT (published in 2022) conducted experiments on larger datasets such as TinyImageNet (64×64) and ImageNet subsets (128×128).
- The main intuition appears to be that better teacher models under semi-supervised settings will lead to better synthetic data. If this is the case, could the authors provide the performance of the teacher models?
see the weakness |
Fully human-written |
|
Semi-Supervised Dataset Condensation with Dual Consistency Trajectory Matching |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a novel task, Semi-Supervised Dataset Condensation (SSDC), and proposes a novel method named Semi-Supervised Dual Consistency Trajectory Matching (SSD). The goal is to synthesize a small, labeled dataset from a mixture of labeled and unlabeled samples, enabling efficient supervised learning without relying on fully labeled data. In particular, a teacher model and a student model are employed. The teacher model is trained on both labeled and unlabeled data using SSL to generate reliable pseudo-labels. The student model aims to achieve teacher–student agreement while enhancing its robustness under perturbations. As a result, the synthetic dataset is optimized via trajectory matching between the student's training dynamics on the real and synthetic data. Extensive experiments show that the method is effective at distilling a small dataset for supervised training.
- The paper introduces semi-supervised dataset condensation, which is a realistic topic that has lots of real-world applications.
- The proposed SSD elegantly combines semi-supervised learning with trajectory matching, effectively exploiting unlabeled data while maintaining stability during training.
- The SSD consistently outperforms both supervised dataset condensation and self-supervised condensation baselines across multiple benchmarks and architectures.
- While the method is intuitively sound, the paper lacks formal theoretical analysis. It would be beneficial to demonstrate how dual consistency contributes to the convergence or stability of trajectory matching.
- The proposed method combines multiple modules, such as teacher–student training, consistency regularization, and trajectory matching. The computational cost of these modules and how they interact with each other are unclear. Moreover, the experiments are conducted on very small datasets, so scalability to large datasets remains unclear. Since the study aims at dataset condensation, the absence of large-scale experiments makes the condensation claims unconvincing.
- Moreover, this paper does not examine how the ratio of labeled to unlabeled data affects condensation quality, nor whether unlabeled-only condensation is feasible within this framework.
Please see the weaknesses. |
Lightly AI-edited |
|
Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a Turing test study for speech-to-speech (S2S) dialogue systems, evaluating 9 LLMs against human speakers. Using a gamified online platform, the authors collect 2,968 human judgments across 1,486 dialogues in English and Chinese. None of the tested systems passed the Turing test, revealing a persistent gap in human-likeness. To understand the causes, the paper develops a taxonomy of 18 human-likeness dimensions, spanning semantic, paralinguistic, emotional, and persona-related traits. Crowd annotations show that while S2S systems perform near human levels in semantic coherence, they fall short in prosody, emotional expression, and conversational naturalness. Finally, the paper proposes an interpretable AI judge, which is a finetuned LLM that predicts human-likeness with strong transparency and accuracy, outperforming both humans and baseline AI judges, either prompted or LoRA-finetuned.
- This paper presents the first formal Turing test for S2S dialogue systems, extending evaluation beyond text to spoken interaction, which is an impactful direction given recent advances in conversational AI.
- The paper convincingly shows that the bottleneck of S2S dialogue systems is no longer semantic understanding but rather paralinguistic and emotional expressivity, which is an under-explored dimension in S2S research, offering valuable insights for improving S2S design.
- The interpretable AI judge is a standout contribution, which provides a reproducible and scalable framework for automatic S2S evaluation, with decent human-machine discrimination accuracy.
- The gamified Turing test platform may attract casual participants who do not conduct the human-machine discrimination carefully. This paper does not clearly describe participants’ quality-control mechanisms such as attention checks, response-time filtering, etc. This could bias the Turing test results.
- It is unclear how many failed Turing test cases arise because LLMs avoid human disfluency cues (or otherwise smooth over deficiencies of human speech). This is an easy-to-detect feature but a minor issue, since superior speech fluency is actually preferred by users in real-time applications. It would be better to track the cause of each Turing test failure and analyze the rate of such minor causes versus more severe ones.
- Any additional discussions or experimental results to resolve the above weaknesses?
- Would shorter dialogues (e.g., 20-second versus 60-second dialogues) have more chance to pass the Turing test? |
Fully human-written |
|
Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents a study testing whether current speech-to-speech (S2S) models, e.g., GPT-4o's advanced voice mode, can pass a Turing test of conducting human-like conversations. The authors first constructed S2S dialog datasets of human–model conversations, as well as speech synthesized with TTS models. After that, approximately 3K human judgements were carried out, measuring the human-likeness of current S2S models. To measure human-likeness, the authors defined 18 dimensions such as memory consistency, use of fillers, etc.
The authors show that humans can easily identify human–model dialogs, so current S2S models cannot pass the Turing test. The authors also show that off-the-shelf AI models cannot serve as a judge for the test; finetuning an off-the-shelf AI on the authors' created dialog datasets enables it to better judge human–model dialogs.
- This is the first evaluation of S2S models from the Turing test perspective, which is new and interesting.
- Evaluation details are well presented with clear takeaway messages, and the paper is easy to follow.
- It seems the conclusion, i.e., that current S2S models cannot pass a Turing test, is very much expected. E.g., VoiceBench [1] has shown that current S2S models still largely lag behind their text counterparts. The main contribution then seems to be the "Turing Test" framing. I think the authors should more clearly demonstrate how this paradigm provides more insight than existing S2S benchmarks like VoiceBench.
- The generalization ability of the judge model finetuned on the authors' created datasets is unknown. It is very much expected that a judge model finetuned in-domain (e.g., on the 18 evaluation dimensions) performs best at judging these dialogs. The authors should test the judge model's correlation with humans on other real-world OOD dialogs, beyond the pseudo-human datasets, to ensure generalization ability.
- Some claims seem contradictory (cf. Q2).
[1] VoiceBench: Benchmarking LLM-Based Voice Assistants
- Line 140: The datasets seem to contain both English and Chinese. Do the 28 participants from 10 countries all speak both languages?
- Lines 278-279 describe current S2S models as limited in aspects like topic understanding, while Line 332 claims that S2S models have largely solved the foundational challenges of understanding and generating clear and coherent dialogue turns. Don't these claims contradict each other? How do you define semantic versus paralinguistic tasks?
Fully human-written |
|
Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper targets speech‑to‑speech (S2S) dialogue and proposes a Turing‑style evaluation. It builds a dataset that includes human–human, human–machine, and pseudo‑human (TTS‑synthesized) conversations, and uses a game‑based human study to judge whether current systems “sound human.” The headline finding is that none of the evaluated systems pass. To explain why, the authors introduce a fine‑grained human‑likeness taxonomy and crowd annotations, showing that shortcomings lie in paralinguistics (rhythm, intonation, stress, fillers, breath), emotional expressivity, and a “mechanical persona,” rather than semantics. Off‑the‑shelf AI judges are unreliable; authors therefore propose an interpretable evaluator that first produces ordinal scores on the taxonomy dimensions and then makes a transparent linear human‑vs‑machine decision.
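To make the two-stage design concrete, my understanding of the judge amounts to something like the following (the logistic head and helper names are my assumptions; only the 18-dimension ordinal scoring and the linear, attributable decision come from the paper):
```python
import numpy as np

def linear_judge(dim_scores: np.ndarray, w: np.ndarray, b: float):
    """dim_scores: (18,) ordinal human-likeness scores predicted for one dialog.
    w, b: weights and bias of the transparent linear head.
    Returns P(human) and the per-dimension contributions used as the explanation."""
    contributions = w * dim_scores           # how much each dimension pushes toward "human"
    logit = contributions.sum() + b
    p_human = 1.0 / (1.0 + np.exp(-logit))   # assumed logistic squashing of the linear score
    return p_human, contributions
```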
1. Focused problem and clear protocol. The work targets the central question for speech‑to‑speech (S2S): Do these systems actually sound human in multi‑turn dialogue? Instead of testing isolated sub‑skills, the study frames evaluation as a Turing‑style decision under realistic interaction. The tri‑part setup, human–human, human–machine, and a TTS‑based pseudo‑human control, gives a clean yardstick for what “human‑like” means. Bilingual coverage and multiple everyday topics reduce overfitting to any single style and make results more comparable across systems. The recording and interaction procedures are standardized, improving internal validity and making cross‑system contrasts meaningful.
2. Interpretable automatic judge. The proposed evaluator is intentionally two‑stage: first map dialogs to ordinal scores on the human‑likeness dimensions, then apply a linear decision with symmetry regularization. This keeps the prediction space aligned with how humans actually rate speech (ordered categories), while the linear head provides transparent attribution: which dimensions pushed a sample toward “human” vs. “machine.” The design is modular, portable across collections, and produces diagnostics that engineers can act on (e.g., prosody shaping, disfluency modeling, persona calibration).
3.Memorable headline result. The paper lands a crisp, communicable takeaway: contemporary S2S systems still fail a Turing‑style test. That single sentence is easy for the community to remember and cite, and it reframes progress: sounding human is not simply a by‑product of better recognition or text generation. Because the result was obtained under a matched protocol with both human–human and synthesized controls, it carries weight beyond a one‑off demo and can serve as a reference point for future work.
1. Application‑heavy, limited theoretical novelty. The main novelty lies in system integration rather than theory. The core claim, that semantics alone cannot sustain effective speech interaction, is treated as an empirical observation, not a theoretical insight. Adding paralinguistic cues (prosody, affect, persona) targets known gaps, long discussed in TTS and affective computing. The work validates their importance but does not explain underlying mechanisms or interactions, nor does it offer a general theoretical framework.
2. HCI-leaning narrative. The main text emphasizes human-study design, demographics, and HCI logistics (recruitment, task design) more than theory, ablations, and generalization, while skimming key ML details (architecture ablations, training dynamics, hyperparameter sensitivity). This balance may misalign with ICLR.
3. Limited external generalization and statistics. Training and evaluation are tightly coupled to the same data collection protocol, raising distribution-shift concerns. Evaluation relies on a single-threshold binary decision without calibration, uncertainty quantification, or decision-boundary analysis. Missing ablations across acoustic conditions, speaker populations, and interaction settings leave generalization in doubt. Statistical reporting should add uncertainty intervals, significance tests, calibration curves, and failure-mode taxonomies to substantiate reliability and scope. Also need more systematic ablations to clarify the theoretical footing of ODL, justify the 18‑dimension design and strengthen label reliability.
Q1. Theoretical Positioning of ODL
What is the precise formal correspondence between ODL and classical ordinal regression models (cumulative-link, threshold models)? What does ODL add beyond these baselines in parameterization or inductive bias? Which properties are guaranteed by construction versus empirically observed?
Q2. Necessity of 18 Dimensions
Why exactly 18 dimensions rather than a single score or reduced factorized set? Provide dimension-wise correlation analysis, clustering structure, and direct comparisons with (i) single-score model and (ii) low-factor model to demonstrate irreducibility.
Q3. Annotation Reliability and Expert Impact
Need to report inter-rater reliability for dimensions. Quantify how expert edits change label distributions and downstream judge performance. Test measurement invariance across languages and subgroups. |
Fully AI-generated |
|
Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper performs a Turing test experiment with multiple speech-to-speech models. Multiple versions of the experiment are tested using controlled conditions that attempt to minimize potential confounds. To collect more data, a mobile app is released as well. Conversations were also annotated along 18 dimensions. The results show key gaps in all S2S models. Using these insights, a smaller speech model is fine-tuned to predict these ratings and then classify whether a given dialog is human-to-human or human-to-S2S, attaining very high accuracy while still being interpretable.
- Clean experimental setup that controls for multiple confounds and tests a diverse suite of S2S models, including TTS models using LLM-generated text. Experiments demonstrate key gaps.
- Rich analysis of why models fail the test through a multifaceted analysis of the conversations qualities. This analysis involved collecting human perceptions using crowdsourcing
- Experiments to see if other audio models could pass the test, with key gaps in existing models. Proposes new model and design to get an interpretable explanation, showing that these features (when correctly identified) are reliable indicators of gaps in current system
- Released platform and game to collect new annotations, helping grow data and potentially incorporate more models in the future
- Very well written paper with clear motivation, analysis, and visuals.
- The biggest gap to me was in the lack of details around the annotation for conversation qualities. These are barely mentioned in text, so I was expecting to see a much more detailed report in B.5. However, important questions are hard to answer, such as who annotated (which platform?), how many annotators were there, did annotators agree on these qualities, how much were annotators paid, or what quality controls were present, if any. Given the importance of this data for your results and later test-taking model, more details are needed to assess the quality of the data and for future replicability.
- The paper itself is very dense (though well written). However, the space constraint has pushed many details to the appendix which hinders readability at times.
- How was the annotation performed (see questions above) |
Fully human-written |
|
MODE: Learning compositional representations of complex systems with Mixtures Of Dynamical Experts |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces MODE (Mixture of Dynamical Experts), a new framework designed to model complex systems, like those in computational biology (e.g., cell differentiation) where the behavior shifts dramatically over time or across different states. Unlike traditional models (like Neural ODEs) that assume a single, smooth governing rule, MODE uses a mixture-of-experts approach. It learns multiple simple, interpretable equations (i.e., experts) and a gating function that dynamically chooses the right equation for a given state. This allows MODE to successfully model systems that bifurcate or transition between different operating regimes.
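For reference, my reading is that the learned dynamics amount to (notation mine)
$$
\frac{dx}{dt} \;\approx\; \sum_{k=1}^{K} \pi_k(x)\, \Theta(x)\, \xi_k ,
$$
where $\Theta(x)$ is a library of candidate terms (as in SINDy), $\xi_k$ are the sparse coefficients defining expert $k$'s equation, and $\pi_k(x)$ is the neural gating function (a softmax over experts) that selects which regime governs a given state.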
- MODE’s formulation as a mixture of sparse dynamical regressors with neural gating is conceptually simple yet powerful. It bridges classical sparse regression (e.g., SINDy) and modern mixture models, resulting in interpretable and flexible dynamics decomposition.
- The challenges introduced in L37-43, with an interesting example of RNA sequencing, make sense. The problem statement ("Consequently, a large body of research at the intersection of computational biology and data-driven dynamical systems has been devoted to the modeling of snapshot data.") is also reasonable.
- The model’s ability to discover latent regimes without supervision is well demonstrated, highlighting its potential for uncovering hidden cellular states or transitions from high-dimensional, noisy biological measurements.
- The evaluation spans from controlled synthetic systems (bistable, predator–prey, Lorenz) to biological switching processes and real single-cell data, which demonstrates the model’s versatility and robustness.
- Figure interpretation and clarity. It is unclear whether the x-axis in Figure 1 represents time. The visualization appears to suggest that blue cells evolve into red and green cells over time. In the overlapped region (left panel), do the red and green cells physically interact, or does their spatial overlap simply obscure individual dynamics? I would appreciate further clarification on how overlapping dynamical regimes introduce modeling challenges and whether this overlap is a visualization artifact or a true physical mixture.
- Is the research problem novel? The general problem of learning systems governed by multiple, switching, and partially unknown dynamics has been extensively studied in the literature (e.g., Graph Switching Dynamical Systems, ICML 2023). It is not fully clear what unique challenge this paper addresses beyond existing frameworks for switching or hybrid dynamical systems.
- Is MODE technically novel? The MODE framework appears conceptually similar to standard mixture-of-experts models with sparse MAP estimation. The objective in Equation (3) essentially corresponds to a weighted MAP formulation, and the comparison baselines (GMM, supervised MLP, NODE) are arguably too weak or mismatched for a fair assessment. It would strengthen the paper to demonstrate MODE’s advantage over stronger baselines explicitly designed for switching or compositional dynamics.
- Motivational gap regarding branching systems. The paper motivates MODE by emphasizing that biological systems may bifurcate into multiple branches. However, it is not clear why existing models could not be independently trained for each branch (e.g., fitting separate NODEs for red and green trajectories). If MODE’s advantage lies in discovering branches without supervision, this distinction should be emphasized more clearly.
- How do we know how many experts will be required? The number of experts (K) seems to correspond to the number of distinct dynamical regimes, which in practice may require prior domain knowledge. It remains unclear how MODE performs when K is misspecified or when the true number of regimes is unknown, which could be an important consideration for real-world biological applications.
- In Line 185, the authors assume that each sample is governed by one expert, implying no interaction across regimes. If that assumption holds, why not simply learn from data after regime transitions are completed, rather than during overlap? Clarifying this design choice would help justify the need for mixture modeling during ambiguous transitions.
- Equation (4) introduces pi_s(x) as the expert assignment distribution, but it is unclear how this distribution behaves in practice. Are the mixture weights highly non-uniform across regimes, and how sensitive are results to the entropy or balancing regularizers?
Please see the weaknesses. |
Heavily AI-edited |
|
MODE: Learning compositional representations of complex systems with Mixtures Of Dynamical Experts |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors propose a unified perspective to address key limitations of existing methods for modelling complex mixtures of overlapping behavioural regimes: the inability of flow-based models to handle branching dynamics, the computational complexity of switching system inference, and the lack of interpretability in many neural approaches. The resulting Mixture Of Dynamical Experts (MODE) framework shows improved performance over baselines like Gaussian Mixture Models, Neural ODEs, etc. across a range of idealised and real-world benchmarks.
1. The paper clearly identifies the failure of traditional flow-based models (like NODEs) in handling complex, overlapping dynamical regimes, using Figure 1 effectively to show how these models improperly "average" distinct fates.
2. The MODE framework is great for its simplicity, combining an intuitive Mixture of Experts (MoE) approach with SINDy-based regressors. This grounds the model in interpretable, sparse symbolic equations, which is a significant advantage for scientific applications.
3. The "Related Work" section is thorough, correctly positioning the paper by contrasting it against standard dynamical systems, specific computational biology flow models, switching systems, and other MoE approaches.
I am not an expert in computational biology, so here are the weaknesses I underline, mostly relating to the machine learning methodology:
1. A priori $K$ selection: The most critical limitation of this work is that the number of experts, $K$, must be manually specified corresponding to the number of dynamic regimes. This is impractical for real-world discovery tasks where $K$ is a primary unknown.
2. The paper's solution to trajectory crossing (a known failure of NODEs) is clear. It would be strengthened by contrasting its discrete mixture-based approach with continuous, augmentation-based methods like ANODE [1].
3. Gradient-based optimisation of the gating network is a known challenge (see MixER [2]). The paper's regularization solution (Eq. 4) to prevent the gating network from collapsing is interesting. It could be compared to other methods, such as the K-Means initialization used in MixER.
4. A claim in L297-298 that MODE "does not rely on phase space geometry" appears to contradict the model's formulation, which explicitly uses the state $x$ (the geometry) to parameterize both the gating function $\pi(x)$ and the expert dynamics $f_{\Theta_{s}}(x)$. This needs clarification.
### Minor issues:
- L235: "dynamical"
- L236: "on"
### References:
- [1] Dupont et al., "Augmented Neural ODEs", NeurIPS 2019
- [2] Nzoyem et al., "MixER: Better Mixture of Experts Routing for Hierarchical Meta-Learning", SCOPE@ICLR 2025.
1) Is the claim that NODEs "average" switching zones (L067) an empirical observation of a common training failure, or a more fundamental theoretical limitation of fitting a single vector field to multi-modal velocity data? Could you compare with ANODE, for instance?
2) There is minimal quantitative comparison to Neural ODEs, even though they are the subject of much of the criticism in the Related Work. This relates to my question in 1).
3) Eq (4) displays two losses that aim to achieve two opposite things. I have two questions concerning this:
- It is not clear how the load balancing term prevents expert collapse (despite the additional definition in L1055)
- Could you please provide an ablation for this?
4) Regarding ablation studies, one examining the polynomial basis of SINDy is much needed, especially when the oscillator uses functions outside the dictionary (Goldbeter experiment). |
Fully human-written |
|
MODE: Learning compositional representations of complex systems with Mixtures Of Dynamical Experts |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces a new method called MODE (Mixture of Dynamical Experts) to address the challenge of modeling complex systems, particularly in computational biology. It focuses on the difficulties posed by sparse and unordered "snapshot" data that often exhibit multiple overlapping behavioral regimes and branching dynamics. MODE tackles this gap by using a gated mixture model that decomposes complex, ambiguous flows into distinct dynamical components. The authors demonstrate that MODE outperforms existing approaches in unsupervised classification of dynamical populations under noise and limited sample sizes, using both synthetic and real-world single-cell RNA sequencing data.
1) The paper's abstract, introduction, and related work are written very well and motivate the method.
2) They discuss and prioritize method interpretability, which is crucial for scientific applications.
3) They applied the method to both real world and synthetic data.
4) They explicitly explained the distributions and the assumptions about (most) variables.
1) The method itself should be described in more detail, i.e., you state the model, but the fitting procedure / algorithm could be more clearly explained.
2) Model-wise, from what I understand, it seems like the authors made an implicit assumption that each snapshot reflects a single state (i.e., comes from one most likely expert). I would assume that during its cycle, a cell's evolution may be governed by multiple experts simultaneously, e.g., experts that capture division-related signals as well as growth signals, which may reflect different dynamics and could be combined into a more flexible expert.
3) While I recognize that this paper's main goal is to describe the method, I miss a short discussion of what the method tells us about the biology (rather than only, e.g., predictions). E.g., how far in advance can you predict future differentiation?
4) In the related work, there is a missing discussion on decomposed dynamical system models [1,2] and their extension to long term forecasting [3].
5) With respect to Figure 4, you talk about SINDy but I cannot see it presented.
6) The method assumes that the basis functions that define the dynamics (Z) are known and are polynomials. It is built on setting these a priori, which requires choosing/knowing the appropriate polynomial basis.
7) Can you clarify what B is? (Unless I missed it, I cannot find where it is defined.)
8) Small typo: the title of Section 3 should be "Method", not "Methods".
[1] Mudrik, N., et al. (2024). Decomposed linear dynamical systems (dlds) for learning the latent components of neural dynamics. JMLR
[2] Chen, Y., et al. (2024). Probabilistic decomposed linear dynamical systems for robust discovery of latent neural dynamics. NeurIPS
[3] Mudrik, N., et al. (2024). LINOCS: Lookahead Inference of Networked Operators for Continuous Stability. TMLR
1) How would you model a case where each snapshot evolves by the rules of multiple co-active experts that govern its dynamics?
2) Did you try looking at the 2nd or 3rd most likely expert at every time point? Maybe you would reveal differences within what appears to be the same cell state across snapshots, which could further reveal their fate even earlier in time?
3) How much does the degree of the polynomials affect the result? Is there a way to fit / infer the basis Z rather than setting it as a hyperparameter?
4) What can the method tell us about the biology / biological processes in future datasets that existing methods cannot? |
Fully human-written |
|
MODE: Learning compositional representations of complex systems with Mixtures Of Dynamical Experts |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors present MODE (Mixture of Dynamical Experts), a framework that decomposes complex and noisy biological dynamics into components. MODE enables unsupervised discovery of dynamical regimes and accurate gene expression forecasting across regime transitions. Applied to synthetic and single-cell RNA sequencing data, MODE effectively distinguishes proliferation and differentiation dynamics and predicts cell fate commitment.
1. The manuscript is clearly written and well organized.
2. The experiments cover both simulated and real-world datasets, providing comprehensive validation.
1. The advantage, or main goal, of developing MODE is not that meaningful. There are many RNA velocity models, like VeloVAE (Gu et al., ICML 2022) and LatentVelo, that are explicitly designed to model cell lineage bifurcation. This significantly reduces the novelty and practical impact of this study.
2. The benchmarking results are not convincing, since GMM and spectral methods are quite simple and may not be suitable for complicated data. The authors should compare performance with other methods mentioned in the related work, like flow-matching-based methods (Meta Flow Matching) and perhaps some RNA velocity models like scVelo (Bergen et al., 2020). It would also be helpful to compare with other MoE models, like DynMoE (Guo et al., 2025).
3. Scalability, which is crucial for real single-cell RNA sequencing data, is questionable. The U2OS cell line dataset contains only about 1,000 cells, which is a very limited number in practice. It would be helpful if the authors could run the method on a larger dataset, like the mammalian organogenesis dataset (Cao et al., 2019), which also has branching differentiation trajectories. I also suggest including metrics like cross-boundary direction correctness (CBDir, Qiao et al., 2021) for the real dataset.
1. Why didn't you just use traditional precision, recall and F1 metrics for evaluation?
2. How much compute time does the MODE model need in each of your experiments?
3. When generating the data, how will different noise levels affect the model performance? |
Fully human-written |
|
MODE: Learning compositional representations of complex systems with Mixtures Of Dynamical Experts |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces MODE (Mixture Of Dynamical Experts), a mixture-of-experts framework for snapshot dynamical data that jointly (i) clusters heterogeneous dynamical regimes and (ii) forecasts across regime switches/branching. Each expert is a sparse symbolic regressor, combined by a gating distribution and per-expert isotropic noise. Empirically: (i) on elementary dynamics, MODE yields NMI/ARI ≈ 0.96–1.00, strongly outperforming GMM and spectral clustering and approaching a supervised MLP; (ii) on synthetic forecasting tasks, MODE achieves lower Wasserstein distances than MLP/SINDy and commits to branches; (iii) on U2OS scRNA-seq, a 2-expert MODE matches FUCCI cycle vs. exit AUC = 0.98 over 10 seeds.
1. This paper proposes a snapshot-trained MoE for dynamics with interpretable SINDy-style experts, gating regularizers to avoid collapse, and a stochastic rollout that commits to fates, which I believe is new for this field.
2. The method is evaluated on strong and fair elementary benchmarks, which confirms its performance.
3. The paper is well written: the objective (Eq. 3), regularizers (Eq. 4), rollout (Eqs. 5–6), and data generation (appendices) are explicit.
1. The decomposition between the expert field and the stochastic term is not probed.
2. Results focus on low-dimensional synthetic systems (2–3D) and PCA-5 for scRNA. Please add OOD tests.
1. Please consider adding switching-model baselines (Switched Flow Matching; Neural MJP; mixture-NODEs with gate). Use the same K and similar parameter counts. It would be interesting to see the performance gaps.
2. MODE improves W2 but slightly loses W1,x to MLP (0.1363 vs 0.1284). Please explain the trade-off and add per-axis ablations. |
Moderately AI-edited |
|
Map as a Prompt: Learning Multi-Modal Spatial-Signal Foundation Models for Cross-scenario Wireless Localization |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes SigMap, a multimodal foundation-model framework for wireless localization featuring (1) a periodicity-aware adaptive masking pretraining scheme tailored to CSI, and (2) a “map-as-prompt” mechanism that encodes 3D maps as geometric prompts for parameter-efficient finetuning. Experiments on DeepMIMO (O1-3p5) show gains for single/multi-BS localization and some few-shot cross-scenario transfer.
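As a concrete illustration of the "map-as-prompt" pattern described in the summary, the sketch below encodes map-node coordinates with a tiny GNN into a few soft prompt tokens that are prepended to frozen CSI token embeddings; only the prompt GNN and a small head are trainable. All sizes, the GNN design, and the frozen backbone are assumptions for illustration, not SigMap's actual architecture.

```python
# Illustrative sketch of a "map-as-prompt" conditioning pattern: a small GNN turns
# 3D map nodes into soft prompt tokens for a frozen signal backbone.
import torch
import torch.nn as nn

D, P, T = 64, 4, 16                        # hidden dim, prompt tokens, CSI tokens

class TinyMapGNN(nn.Module):
    """One round of mean-neighbor message passing over map-node coordinates."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(3, D)
        self.update = nn.Linear(2 * D, D)
        self.to_prompt = nn.Linear(D, P * D)

    def forward(self, xyz, adj):           # xyz: (N, 3), adj: (N, N) in {0, 1}
        h = torch.relu(self.embed(xyz))
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        msg = adj @ h / deg                 # mean aggregation of neighbor features
        h = torch.relu(self.update(torch.cat([h, msg], dim=-1)))
        return self.to_prompt(h.mean(0)).view(P, D)   # pooled -> P prompt tokens

backbone = nn.TransformerEncoder(nn.TransformerEncoderLayer(D, 4, batch_first=True), 2)
for p in backbone.parameters():             # pretrained signal backbone stays frozen
    p.requires_grad_(False)

gnn, head = TinyMapGNN(), nn.Linear(D, 2)   # only prompt GNN + head would be trained

csi_tokens = torch.randn(1, T, D)           # stand-in for embedded CSI patches
xyz, adj = torch.randn(8, 3), (torch.rand(8, 8) > 0.5).float()
prompt = gnn(xyz, adj).unsqueeze(0)         # (1, P, D)
feats = backbone(torch.cat([prompt, csi_tokens], dim=1))
print("predicted 2D location:", head(feats.mean(1)))
```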
I think how the authors use a GNN to generate the prompt is a great innovation. It cleverly borrows the idea of prompt tuning from LLMs, encoding 3D map information into lightweight soft prompts that guide a large signal foundation model. This fundamentally addresses the problem of model adaptation in new environments.
Weaknesses
1. The introduction has too many paragraphs; although it compares with many existing works, the logic is not clear, and it does not effectively position the work done in this article relative to existing works. In addition, the shortcomings cited for many existing studies, such as an inability to capture high-dimensional features, are not sufficient to demonstrate the inadequacy of those works.
2. The experimental setup is relatively narrow and the validation depth is insufficient. Evidence is mostly from a single ray-tracing world, and there is only one cross-scenario experiment. The existing experiments are hard-pressed to fully demonstrate the general applicability of the proposed model. Apart from error metrics, are there any other experiments that demonstrate the effectiveness of the proposed model?
3. How can the claimed advantages (limited labeled samples, parameter efficiency, interpretability) be better demonstrated? For example, the model is claimed to be parameter-efficient, but the comparison of training time, memory usage, and inference complexity is insufficient.
4. Insufficient ablation experiments.
All of the paper's training and testing is based on DeepMIMO, a simulation dataset; it lacks validation on data collected in the real world.
Real-world signals are filled with noise, dynamic interference, and complex propagation effects that simulators cannot fully replicate. It is a significant unknown whether the clean physical laws learned from the simulator can maintain high performance in a noisy real-world environment.
The current "Map-as-prompt" design primarily encodes the environment's geometry by processing 3D coordinates with a GNN. It does not encode material information: the model does not know whether it is facing a concrete wall that absorbs signals or a glass curtain wall that reflects them. |
Fully human-written |
|
Map as a Prompt: Learning Multi-Modal Spatial-Signal Foundation Models for Cross-scenario Wireless Localization |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes SIGMAP, a transformer backbone pre-trained with cycle-adaptive masked modeling on Channel State Information, then fine-tuned with a learned geographic prompt from a 3D map via a GNN. The paper claims three main contributions: (1) cycle-adaptive masking to break periodic shortcuts in CSI; (2) map-as-prompt conditioning using 3D geometry; (3) parameter-efficient adaptation with strong cross-scenario generalization.
The experiments demonstrate substantial improvements over other baselines on both single- and multi-BS localization, as well as strong generalization performance.
1. The self-adaptive masking and GNN map-as-prompt strategies form a novel and meaningful combination for the indoor localization task. The experimental results show significant advantages over other baselines.
2. During fine-tuning, only prompt GNN and projection head are trained, while the backbone is kept frozen. This makes the model efficient and handy for deployment.
3. The algorithm achieves consistent metric gains across different tasks, and the improvements are substantial.
1. The paper asserts good generalization ability, but it is not intuitively clear why the algorithm achieves this. The model is not trained using meta-learning or transfer-learning techniques. The paper also lacks experimental comparisons to modern baselines that target generalization in indoor localization, e.g., [1].
2. The paper does not discuss how the quality or degradation of the 3D map could adversely affect the performance of the model. Illustrations of the 3D map used are needed, and more ablation studies on the quality of the 3D map are desirable.
[1] Gao, Jun, et al. "MetaLoc: Learning to learn wireless localization." IEEE Journal on Selected Areas in Communications 41.12 (2023): 3831-3847.
1. Could the authors explain why the model achieves good generalization abilities to new environments? Since the algorithm is not trained using meta-learning or transfer learning, I am curious about how the model learns to generalize.
2. Could the authors give an example of the 3D Map used in the paper? Can the authors discuss how the quality of the 3D map would affect the model’s performance? |
Fully human-written |
|
Map as a Prompt: Learning Multi-Modal Spatial-Signal Foundation Models for Cross-scenario Wireless Localization |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents SigMap, a prompt-based architecture for cross-scenario wireless localization that integrates masked autoencoding with geographic and topological maps serving as soft prompts. The model introduces a cycle-adaptive masking mechanism designed to align with the cyclic nature of Channel State Information (CSI) signals, thereby improving feature learning during pretraining. Evaluated within simulated DeepMIMO environments, SigMap demonstrates strong generalization capability and achieves parameter-efficient few-shot adaptation. The approach aims to bridge the gap between environment-specific training and scalable localization across diverse wireless scenarios.
(1) The idea of using maps as prompts is both innovative and practical. By embedding spatial priors directly into the learning framework, the model can better understand geographic context without requiring explicit supervision or heavy parameterization. This approach provides a lightweight yet effective way to integrate domain knowledge into data-driven models.
(2) The proposed cycle-adaptive masking strategy effectively leverages the inherent periodic and structural characteristics of CSI signals. This allows the pretraining process to focus on more informative segments of the data, improving robustness and representation quality, especially when dealing with noisy or incomplete measurements.
(3) The demonstration of few-shot adaptation using a frozen backbone is impressive, as it highlights the model’s ability to generalize with minimal retraining. This efficiency in adapting to new environments or conditions suggests that SigMap could serve as a versatile foundation for scalable wireless localization systems, reducing computational and data requirements during deployment.
(1) The absence of real-world evaluation limits the impact of the results. Without validation on empirical datasets or publicly available benchmarks such as CSI-Bench, it is difficult to assess how well the approach generalizes beyond simulation. This gap weakens the practical relevance of the presented findings.
(2) The paper’s claim of developing a “foundation model” for wireless localization appears overstated. While the architecture shows potential for generalization within simulated settings, it lacks evidence of robustness across devices, propagation environments, or hardware variations, all of which are critical for real-world applicability.
(3) Although the system integrates several established components—masked autoencoders, vision transformers, and graph-based prompting—the overall architectural contribution feels incremental. The novelty lies more in the combination and application context rather than in introducing fundamentally new mechanisms or model designs.
(4) The work asserts interpretability through the use of map prompts but does not provide supporting analysis. Visual or quantitative evaluation of how the prompts influence model predictions would strengthen the paper’s interpretability claims and offer deeper insights into model behavior.
(5) The scalability of the proposed approach remains uncertain. The paper does not explore how the framework performs when applied to large-scale or densely connected map graphs, which are common in real-world urban deployments. Understanding such scalability constraints is important for practical use in complex environments.
(6) While the paper relies on ray-tracing–based wireless simulation, this approach—though widely used—offers limited novelty unless extended with advanced modeling such as diffuse scattering, dynamic environments, or hybrid physics–ML calibration. The current setup would benefit from stronger validation or augmentation to better capture real-world propagation complexity.
Can the authors report scalability experiments by evaluating SigMap on larger or denser map graphs, or by simulating more complex urban propagation conditions, to objectively assess how the method performs in real-world large-scale deployments and justify its practical robustness? |
Fully AI-generated |
|
Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes SCIQL, an offline reinforcement learning algorithm designed to learn policies that optimize task reward while exhibiting specific behavioral styles. Building upon IQL, SCIQL extends it to the style-conditioned setting and introduces a Gated Advantage Weighted Regression (GAWR) mechanism to balance the two advantage terms.
The proposed GAWR mechanism and sub-trajectory labeling provide a simple yet effective way to integrate style supervision into offline RL. Empirical results on Circle2D and HalfCheetah environments show that SCIQL consistently achieves higher style alignment scores compared to the baselines.
1. The problem formulation is conceptually unclear. If style alignment and task reward are inherently conflicting, the objective should be to balance the trade-off between the two. However, the current formulation seems to sacrifice task reward to increase style conformity, which raises the question of whether this trade-off is explicitly modeled.
2. Given that style alignment and task reward clearly conflict, as shown in Section 5.3, the evaluation might be better framed in a Pareto-optimality context rather than using single averaged metrics. Without such discussion, it is difficult to interpret whether improving style alignment at the cost of lowered reward constitutes genuine progress.
3. The paper defines style labels as discrete categories obtained via predefined labeling functions. Could the authors clarify why a discrete formulation was chosen over a continuous style representation? Using continuous representations might allow smoother interpolation between styles and potentially improve generalization to unseen or mixed style combinations.
4. The evaluation is restricted to the toy Circle2D and HalfCheetah environments, which are relatively simple and low-dimensional. It would strengthen the work to include results on more diverse environments, such as other MuJoCo or Atari tasks, or humanoid-style control tasks where stylistic variations are more naturally expressed.
5. It would be valuable to assess whether the proposed method can extrapolate (or interpolate) to unseen style labels or novel combinations of style labels that were not encountered during training.
1. Is z a multi-dimensional vector aggregating multiple criterion-specific labels, or a single discrete label? If it is the former, the description around lines 180-190 should be revised to clarify how multiple criterion labels are annotated and used in z.
2. Minor typos. Line 453: "Twhile" -> "while". Line 169: the quotation mark " is reversed. |
Fully human-written |
|
Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a method for style alignment in the offline RL setting using implicit Q-learning and advantage-weighted regression. Styles are defined using hard-coded labeling functions, whose outputs are then used as a reward to learn a style value function. This value function is combined with a task value function (independent of style) to train a style-aligned policy. Experiments are conducted on the Circle2D and HalfCheetah environments, showing a significant performance advantage over baselines such as SORL and SCBC. Ablation experiments demonstrate how different temperature parameters prioritize task performance and style alignment.
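For reference, here is a minimal sketch of one plausible form of the gated advantage weighting, based on my reading of the paper's Eq. (12); the exact functional form, coefficients, and temperatures in the paper may well differ, so treat every constant below as an assumption.

```python
# A guess at the structure of a gated advantage weighting (not the paper's exact
# Eq. (12)): the task advantage only contributes once the style advantage is high.
# Policy extraction would regress actions with these per-sample weights.
import numpy as np

def gawr_weights(adv_task, adv_style, beta_task=1.0, beta_style=1.0, k=5.0):
    """Gate the task advantage by how well the style is already satisfied."""
    gate = 1.0 / (1.0 + np.exp(-k * adv_style))        # sigmoid of style advantage
    score = beta_task * gate * adv_task + beta_style * adv_style
    return np.exp(np.clip(score, -10, 10))              # AWR-style exponential weight

adv_task = np.array([0.5, 0.5, -0.2])
adv_style = np.array([1.0, -1.0, 0.0])                  # aligned / misaligned / neutral
print(gawr_weights(adv_task, adv_style))
```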
* The proposed solution is quite simple and sound.
* The effectiveness of the proposed method is clear.
* I think the presentation can be improved if the authors moved some of the plots in the appendix to the main paper.
* Some details in the method can be better explained.
* I find the need to tune the temperature parameter and its sensitivity a downside of the proposed method.
* In (12), can you add a textual explanation of the equation? Is the gating saying that if the style advantage is high enough such that the sigmoid output is 1, then you can incorporate the task advantage? In theory, the advantage function at optimality is zero ($\max_{a}Q(s, a) = V(s)$), so the sigmoid output is 0.5 and you are still using a small weight on the task reward advantage.
* In the results, you did not include an in-depth explanation of the different datasets. Can you explain how you expect the method to behave differently for different datasets? From Table 1, it looks like halfcheetah-vary performs worse on the baseline methods than the other halfcheetah datasets. Why?
* Can you comment on the sensitivity of the temperature parameters?
* I would suggest moving some of the plots in the appendix to the main paper so that people understand what style means.
* (Minor) there are a lot of typos in the paper. Please fix. |
Fully human-written |
|
Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a new view of the stylized policy learning problem as a generalization of goal-conditioned RL and introduces the SCIQL algorithm, which uses hindsight relabeling and a Gated Advantage Weighted Regression (GAWR) mechanism to optimize task performance.
This paper provides a unified formulation of behavioral style learning via programmatic sub-trajectory labeling, and introduces the SCIQL+GAWR framework that effectively balances style alignment and task performance in the offline RL setting.
The reliance on hand-crafted style labeling functions constrains scalability to more abstract or subtle styles and may require domain expertise when applied to complex environments. The algorithmic pipeline is relatively intricate, increasing implementation burden, and evidence on large-scale real-world or high-dimensional robotic systems remains limited.
The proposed approach relies on hand-crafted sub-trajectory labeling functions; how scalable and generalizable is this design to tasks where styles are abstract, high-level, or difficult to encode programmatically?
While the method demonstrates strong performance in simulated benchmarks, there is no evaluation on real-world systems or higher-dimensional robot control tasks. Can the authors comment on the expected practicality and robustness of SCIQL in real settings?
The overall pipeline introduces multiple components and optimization stages; how sensitive is the method to hyperparameters, and can the authors provide an ablation isolating the contributions of each module to ensure that improvements are not due to increased model complexity?
The approach assumes accurate style labels from the labeling functions. How does performance degrade under noisy or imperfect style annotations, and can the method handle ambiguous or overlapping style categories?
The paper positions programmatic style labeling as scalable, but could the authors discuss potential avenues for extending the framework to automatically learn style representations, or integrate human feedback when labeling heuristics are insufficient? |
Fully AI-generated |
|
Out-of-Distribution Robust Explainer for Graph Neural Networks |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes ORExplainer, an out-of-distribution (OOD) robust post-hoc explainer for graph neural networks (GNNs). It introduces Weighted Energy Propagation (WEP) based on node energy scores to suppress unreliable OOD nodes and enhance explanation reliability. The method provides stable, in-distribution–focused subgraph explanations even under noisy or OOD conditions. Extensive experiments on synthetic and real-world datasets demonstrate that ORExplainer outperforms existing explainers in both robustness and fidelity.
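To illustrate the mechanism being evaluated, the sketch below combines a common energy score for node logits (negative log-sum-exp, as in energy-based OOD detection) with a neighbor-averaged propagation step and an exponential down-weighting of high-energy nodes. The constants, the number of propagation steps, and the final weighting rule are illustrative assumptions rather than ORExplainer's exact formulation.

```python
# Illustrative sketch only: an energy score per node plus the neighbor-averaged
# propagation described for Eq. (3); the down-weighting rule is an assumption.
import numpy as np

def energy(logits):
    """Higher energy ~ more OOD-like (E(x) = -logsumexp(logits))."""
    m = logits.max(axis=1, keepdims=True)
    return -(m.squeeze(1) + np.log(np.exp(logits - m).sum(axis=1)))

def weighted_energy_propagation(e, adj, k=2):
    """Average each node's energy with its neighbors' mean energy, k times."""
    deg = adj.sum(axis=1).clip(min=1)
    for _ in range(k):
        e = 0.5 * e + 0.5 * (adj @ e) / deg
    return e

# Four confident (in-distribution-like) nodes and one low-confidence node.
logits = np.random.default_rng(0).normal(size=(5, 3)) * np.array([[3], [3], [3], [3], [0.2]])
adj = np.array([[0,1,1,0,0],[1,0,1,0,0],[1,1,0,1,0],[0,0,1,0,1],[0,0,0,1,0]], float)
e = weighted_energy_propagation(energy(logits), adj)
node_weight = np.exp(-e)   # one plausible way to suppress high-energy (OOD) nodes
print("propagated energy:", e.round(2), "\nexplanation weights:", node_weight.round(2))
```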
1. The proposed Weighted Energy Propagation (WEP) effectively suppresses OOD influence, offering a simple yet principled way to enhance explainer robustness.
2. Extensive experiments on diverse datasets show consistent improvements in both robustness and fidelity over existing methods.
1. The overall writing quality could be improved. For example: (1) the caption of Figure 2 is not expressed clearly; (2) the font style of the embedding matrix Z in Preliminaries should be unified; (3) abbreviations such as “out-of-distribution (OOD)” appear repeatedly across Introduction, Related Work, and Our Proposed Method; and (4) subsection formatting should be made consistent.
2. In Figure 2, the CE loss should also have an arrow pointing to ORExplainer, and the main diagram could benefit from more detailed and polished visual design.
3. The experimental section could be strengthened by adding quantitative analyses of the energy mechanism to demonstrate its contribution and effectiveness.
1. The method ensures robustness by suppressing OOD message passing, but such OOD information may sometimes carry useful signals. It would be valuable to discuss how these potentially informative OOD components could be better utilized. |
Lightly AI-edited |
|
Out-of-Distribution Robust Explainer for Graph Neural Networks |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes ORExplainer, a post-hoc explanation model designed to generate robust and reliable explanations for GNNs in the presence of out-of-distribution (OOD) nodes. The method introduces an Energy-Score mechanism to prioritize in-distribution (ID) nodes while suppressing OOD influence. The paper evaluates the approach on both synthetic and real-world datasets and reports improved robustness of explanations under several OOD settings. The topic is timely and relevant to trustworthy graph explainability. However, the paper contains several writing, methodological, and conceptual issues that need to be clarified before the contribution can be properly assessed.
1. Addresses an important and underexplored problem—robust explainability under graph OOD conditions.
2. The use of Energy Scores for node importance is intuitively reasonable.
3. The experiments include both synthetic and real-world datasets, demonstrating practical relevance.
1. Writing clarity issues.
Some expressions are grammatically or semantically unclear. For example: Line 126: should read “The graph used for explanation …” instead of the current phrasing. Line 129: should use “The model f is a node classifier …” rather than “The GNN f is a …”.
2. Unclear generation of OOD settings.
Lines 301–307 only describe how synthetic node OOD and real-world feature OOD are generated. It remains unclear:
How are node OODs created in real datasets?
How are feature OODs constructed in synthetic datasets?
For synthetic “unseen-label” OOD, is the largest label simply treated as unseen?
Furthermore, the experimental design is inconsistent: synthetic datasets only test structural OOD, while real datasets only test feature and unseen-label OOD. Why not evaluate all three OOD types on both domains to support the claim of general robustness?
3. Questionable assumption on excluding OOD nodes.
Lines 199–202 state that explanations should consist mainly of ID nodes, with OOD nodes excluded or down-weighted. However, this assumption may fail when OOD nodes directly cause prediction errors. In such cases, removing them may hide the model’s failure mechanism rather than provide a faithful explanation. The authors should discuss this limitation explicitly.
4. Missing extension discussion.
Could the proposed ORExplainer framework be extended to graph-level classification tasks? Since the method currently focuses on node-level explanations, a discussion of potential extensions would be valuable.
5. Target-node OOD scenario.
If the target node itself is OOD, can ORExplainer still produce a reliable and meaningful explanation? This scenario seems practically important, but is not analyzed in the paper.
See Weaknesses. |
Moderately AI-edited |
|
Out-of-Distribution Robust Explainer for Graph Neural Networks |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper studies the problem of generating robust post-hoc, instance-level explanations for Graph Neural Networks (GNNs) in dynamic settings, where new nodes/edges at test time can introduce out-of-distribution (OOD) noise and outliers, undermining existing XAI methods that assume distributional consistency. To address this challenge, the authors propose ORExplainer, which incorporates Energy Scores to capture structural dependencies, prioritize in-distribution nodes, and suppress the influence of OOD nodes during explanation generation. Experiments across synthetic and real-world datasets with varied OOD injection strategies show that ORExplainer delivers more reliable and robust explanations than prior approaches, and the implementation is released for reproducibility.
1. This paper proposes ORExplainer, which provides robust and verifiable instance-level explanations for GNNs under test-time OOD scenarios in dynamic graphs. The problem setting is novel, and the method is supported by solid experimental results and theoretical analysis.
2. The paper is generally readable and reasonably well structured.
1. Some experimental results appear to be insufficiently comprehensive. For example, Figure 3 shows only the BA-Community and Cora datasets.
2. The paper appears not to report the accuracy metrics commonly used by prior explainers such as GNNExplainer.
1. Some experimental results appear insufficiently comprehensive. For example, Figure 3 includes only BA-Community and Cora; please expand to additional datasets or justify this selection.
2. The paper does not report the accuracy metrics commonly used by prior explainers (e.g., GNNExplainer). Please justify this choice and, if appropriate, include those metrics for comparison.
3. In Table 1 and Table 3, some results appear less than ideal. Please explain the underlying reasons. |
Fully human-written |
|
Out-of-Distribution Robust Explainer for Graph Neural Networks |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes an OOD-robust explainer for GNNs and introduces a weighted energy propagation (WEP) mechanism that suppresses unreliable OOD nodes while emphasizing in-distribution information. The authors also provide a theoretical analysis connecting WEP to a diffusion process. Experiments on synthetic and citation datasets show improved explanation stability and robustness compared to existing explainers.
1. The presentation of the paper is clear.
2. In the theoretical section, they try to connect the proposed mechanism with diffusion-based energy minimization, showing some effort toward providing interpretability and analytical grounding.
1. While the paper identifies three types of OOD nodes in Introduction, their definitions appear somewhat overlapping, as all are described in the context of new or injected nodes. It is unclear whether these categories are mutually exclusive or represent different perspectives of the same phenomenon. Clarifying the conceptual distinctions would improve the overall clarity of the problem setup.
2. The authors mention that existing explainers assume a fixed graph, but the failure modes of prior works are not clearly articulated. A deeper analysis of why such assumptions reduce robustness, and how the proposed method directly addresses these issues, would make the motivation more convincing.
3. Eq. (3) defines WEP as a simple average between a node's own energy and its neighbors' energies. However, the authors do not explain why this linear diffusion operation enhances robustness, or how many propagation steps k are used.
4. The theoretical derivation assumes that ID and OOD nodes have disjoint energy intervals. This assumption seems strong, and it is unclear whether it holds for real-world graphs where ID and OOD distributions may overlap.
5. Theorem 5.2 casts WEP as a lazy substochastic diffusion, showing that minimizing propagated energy reduces visits to high-energy nodes. However, this finding largely restates the intuitive effect of averaging over neighbors and does not offer a deeper theoretical guarantee of robustness or faithfulness. This analysis reads more like a mathematical restatement of the mechanism than a theory.
6. The experiments are conducted on small-scale datasets, which may not adequately test scalability or robustness on larger and more complex graphs. Including larger datasets would make the evaluation more convincing.
Please refer to Weaknesses part. |
Moderately AI-edited |
|
IPGO: Indirect Prompt Gradient Optimization for Text-to-Image Model Prompt Finetuning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Indirect Prompt Gradient Optimization (IPGO), a parameter-efficient framework for prompt-level finetuning in text-to-image (T2I) diffusion models. IPGO enhances prompt embeddings by injecting learnable prefix and suffix embeddings, optimized via gradient-based methods with low-rank approximation, rotation, and stability constraints (orthonormality, range, and conformity). Unlike prior approaches, IPGO requires no modification to the diffusion model or text encoder and operates with far fewer parameters, enabling efficient, prompt-wise optimization at inference. Experiments across three datasets and three reward models (aesthetics, image-text alignment, and human preference) show that IPGO consistently outperforms baselines, such as TextCraftor and DRaFT-1, while using fewer parameters.
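For context, the sketch below illustrates the general prompt-embedding optimization pattern that IPGO instantiates: learnable prefix/suffix embeddings concatenated with frozen prompt embeddings and updated by gradient ascent on a differentiable reward, with a simple clamp standing in for the range constraint. The placeholder reward, sizes, and the omission of the low-rank/rotation parameterization are simplifications, not the paper's actual procedure.

```python
# Minimal sketch of prompt-embedding optimization (not IPGO's exact parameterization):
# learnable prefix/suffix embeddings are updated by gradient ascent on a reward.
import torch

d, n_tok, n_fix = 768, 8, 4                  # CLIP-like dim, prompt len, prefix/suffix len
prompt_emb = torch.randn(n_tok, d)           # frozen text-encoder output for the prompt
prefix = torch.zeros(n_fix, d, requires_grad=True)
suffix = torch.zeros(n_fix, d, requires_grad=True)
target = torch.randn(d)                      # stand-in for a reward model's preferred direction

opt = torch.optim.Adam([prefix, suffix], lr=1e-2)
for step in range(50):
    full = torch.cat([prefix, prompt_emb, suffix], dim=0)            # conditioning sequence
    reward = torch.cosine_similarity(full.mean(0), target, dim=0)    # placeholder reward
    (-reward).backward()                     # ascend the reward
    opt.step(); opt.zero_grad()
    with torch.no_grad():                    # crude stand-in for the range constraint
        prefix.clamp_(-1, 1); suffix.clamp_(-1, 1)
print("final placeholder reward:", float(reward))
```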
+ The proposed method is efficient and it achieves better results with fewer parameters and lower hardware requirements than baselines such as TextCraftor and DRaFT-1.
+ The proposed method appears applicable to existing T2I diffusion models and reward functions without modifying the underlying model or text encoder.
+ The proposed method outperforms several baselines across multiple datasets and reward types. It also provides ablation studies to validate the contribution of each design choice and constraint.
- Native image generation models (e.g., VAR [a], BAGEL [b]) that do not include a text encoder have become popular recently. It is not clear how the proposed method can be applied or generalized to these recent SOTA methods.
[a] Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction, 2024
[b] Emerging properties in unified multimodal pretraining, 2025
- It is not clear how well the proposed method handles cases where inference prompts are long and detailed (e.g., > 150 words). In recent methods, LLM prompt rewriting/expansion is a commonly used strategy that turns a short input prompt into a detailed long prompt before it is fed to the model. It is not clear how well the proposed method applies to such cases with detailed prompts.
- SOTA performance is claimed by the submission, while the most recent methods included for comparison are TextCraftor (2024) and DRaFT-1 (2023), which are somewhat dated.
Please refer to the detailed questions raised in Weakness section above. |
Fully human-written |
|
IPGO: Indirect Prompt Gradient Optimization for Text-to-Image Model Prompt Finetuning |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes IPGO (Indirect Prompt Gradient Optimization), a method for aligning text-to-image diffusion models with reward functions through optimization of prompt embeddings. The approach adds learnable prefix and suffix embeddings to the original prompt embeddings, parameterized through rotated low-rank approximations. Instead of using explicit KL regularization like traditional RL-based alignment methods, IPGO relies on three embedding-space constraints to prevent reward hacking: orthonormality of the embedding basis, range constraints on coefficients ([-1,1]), and conformity (mean preservation with original prompt embeddings). The method operates as a test-time optimization approach, optimizing each individual prompt separately for multiple epochs. Experiments are conducted on prompts from COCO, DiffusionDB, and Pick-a-Pic datasets using Stable Diffusion v1.5 and three reward models (CLIP alignment, LAION aesthetics, HPSv2 human preference). IPGO is compared against six baselines, including training-based methods (TextCraftor, DRaFT-1, DDPO) and training-free methods (DPO-Diff, Promptist), showing improvements over competing methods when evaluating on the reward that is optimized.
- **Novel parameterization approach for prompt embedding optimization**: The method combines prefix-suffix embeddings with rotated low-rank parameterization and three constraints (orthonormality, range, conformity) to keep optimization within a meaningful embedding region. The linguistic motivation for prefix-suffix design is intuitive, and the approach allows preserving original prompt semantics while adding learnable content.
- **Ablation studies** The paper provides valuable ablations showing task-dependent importance of constraints (e.g., range crucial for aesthetics, orthonormality for alignment) and demonstrating that the parameterization significantly outperforms naive unconstrained optimization. Ablations on prefix/suffix lengths reveal that equal lengths work best and longer isn't necessarily better.
Major weaknesses, experimental design (I would be open to reconsidering my score if these concerns are addressed):
- **No evaluation on general T2I benchmarks**: The evaluation is limited to the same reward models used for optimization, which provides no evidence that IPGO improves general image quality beyond simply overfitting to the metric. Without testing on standard T2I benchmarks that assess compositionality and attribute binding (e.g., GenEval, T2I-CompBench), or conducting human studies, or minimally cross-validating rewards, the reported gains could reflect reward hacking rather than genuine improvements. The risk of exploiting reward model biases is substantial, especially given the multi-epoch per-prompt optimization and the absence of explicit KL regularization to maintain fidelity to the original model distribution.
- **Experimental design does not align with the test-time optimization paradigm**: The experimental design is not fully convincing to me: IPGO requires multiple optimization epochs per generated image, yet it is compared against training-based baselines (e.g., TextCraftor, DRaFT-1) that offer instant inference after a one-time training cost. While this comparison is interesting, I would argue the main comparison should be against other test-time techniques, including wall-clock time as one axis. Currently, it's unclear what the compute vs performance trade-off looks like. While the main comparison would be Promptist, other test-time optimization techniques are in my view the main competing methods to IPGO (e.g., noise selection [1] (Best-of-N, or over paths) or noise optimization [2,3]).
- In my opinion, reporting wall-clock time (and GPU memory), as well as performance on some metric that is disentangled from the optimized metric (e.g., GenEval), compared against more test-time techniques, is needed to accurately assess the performance of IPGO, which the current evaluation doesn't sufficiently cover.
Minor weaknesses:
- **Limited justification for specific design choices**: Many of the method's core components lack rigorous justification and appear arbitrary. The rotation parameterization is motivated by a simplified 2D analysis that does not transfer to the high-dimensional setting and is empirically contradicted by results where it harms alignment scores. Likewise, the constraint formulations, such as the `[-1,1]` range and the preservation of the mean, are presented as heuristics without theoretical motivation or ablation against alternatives. While the paper shows these components contribute to performance, it fails to provide a clear rationale for why these specific choices are optimal.
- The majority of experimental evaluations are only on SD1.5, which is far behind the models currently used in practice (SD3, FLUX, ...) in terms of performance.
[1] Ma et al. "Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps". CVPR 2025.
[2] Tang et al. "Inference-Time Alignment of Diffusion Models with Direct Noise Optimization". ICML 2025
[3] Eyring et al. "ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization". NeurIPS 2024.
- **Generalization across noises**: Does an optimized prefix/suffix for a given prompt generalize to different initial noise seeds, or must the optimization be re-run for every new image generation? What is the typical wall-clock time for this inference-time optimization?
- **Interpretability of Embeddings**: Have you attempted to project the learned prefix/suffix embeddings back into the vocabulary space to see if they correspond to interpretable words or concepts? What does IPGO learn to "say" to improve aesthetics or human preference?
- **Rotation's Role**: The rotation component appears to have a task-dependent effect (improving aesthetics but not alignment in the ablation). Do you have an intuition for why this is the case? Could the rotation be made adaptive based on the reward function? |
Fully human-written |
|
IPGO: Indirect Prompt Gradient Optimization for Text-to-Image Model Prompt Finetuning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a prompt tuning method for human or aesthetic preference alignment. The main idea of this paper is to fine-tune the prefix and suffix of the given prompt in the embedding space; therefore, the number of trainable parameters can be greatly reduced. In addition, the authors introduce low-rank approximation and a rotation transform on the trainable embedding space, whose benefits are also demonstrated in the experiment section.
- Strengths
1) Similar to prefix prompt tuning, this paper fine-tunes prefix and suffix embeddings to align with human preference using a reward model.
2) In addition, the authors introduce rotation and a low-rank approximation (although, as argued below, it is not truly low-rank) to improve the expressiveness of the trainable embeddings.
- Weaknesses
1) In the related work, the authors claim that, compared to PEZ and Textual Inversion, they do not require access to the original image. However, I think this comparison is not meaningful, as it depends on the task: PEZ requires original images because it targets the prompt discovery task.
2) In Equation (3), the dimensions of Z and E are not aligned, so they cannot be multiplied directly.
3) The reason for decomposing the prefix or suffix embedding into low-rank matrices is not clear. In addition, according to Table 8, m is set to 300 while the prefix or suffix length is 10, so this is not a low-rank approximation or parameter-efficient strategy but rather over-complete dictionary learning. The authors should correct this claim thoroughly.
4) Given a reward loss function, how are the embeddings optimized? Can they be optimized directly via backpropagation? More details should be provided in the paper.
See the weakness section for details. |
Fully human-written |
|
IPGO: Indirect Prompt Gradient Optimization for Text-to-Image Model Prompt Finetuning |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces Indirect Prompt Gradient Optimization (IPGO), a parameter-efficient method for aligning text-to-image (T2I) diffusion models with various reward objectives, including aesthetics, CLIP alignment, and human preference. Instead of modifying the diffusion model's backbone or text-encoder parameters, IPGO optimizes a few learnable prefix and suffix embeddings added to the original prompt's text embeddings. These embeddings are optimized via a constrained, gradient-based procedure incorporating low-rank approximation and rotation parameterization. The framework is training-efficient (0.47M parameters), is evaluated on multiple datasets, and shows consistent improvements of 1–3% over strong baselines.
1. **Clear Writing:** The paper is clearly written and well-structured. The motivation and methodology are presented in an organized and accessible manner, making it easy to follow and pleasant to read.
2. **Novel and Effective Method:** The proposed approach introduces a novel optimization-based framework for prompt refinement, which effectively enhances text-to-image alignment and image quality. Extensive experiments showcase the effectiveness of the IPGO method.
1. **Limited Exploration on Large-Scale Models:** While the method targets parameter-efficient learning, most results are on backbones with relatively small text encoders (e.g., SD). It remains unclear how well the approach scales to larger systems and richer text stacks (Flux). Evaluating on modern large diffusion models—adding both quality gains and compute/latency/memory trade-offs to Table 6—would strengthen the claim of broad effectiveness and practical scalability.
2. **Weak Justification for Using Both Prefix and Suffix:** The paper claims that using both prefix and suffix embeddings improves performance. However, as shown in Table 5, configurations with only prefix or only suffix embeddings (e.g., (10, 0) or (0, 10)) achieve nearly identical CLIP scores (0.286 vs. 0.289). This marginal difference does not strongly justify the necessity of employing both prefix and suffix components simultaneously. Authors may need to compare (10,10) with (20,0) or (0,20) to ensure a fair comparison with the same learnable parameters.
see weakness |
Lightly AI-edited |
|
Color3D: Controllable and Consistent 3D Colorization with Personalized Colorizer |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Color3D, a framework for colorizing both static and dynamic 3D scenes (represented by 3D/4D Gaussian Splatting) from grayscale inputs. Its core idea is to avoid the multi-view inconsistency of applying 2D colorizers independently by instead personalizing a single colorization model per scene. This is done by selecting and colorizing one key view, then fine-tuning a pre-trained colorizer (using adapters) on augmented versions of this single view to learn a scene-specific, deterministic color mapping. This personalized colorizer is then used to colorize all other views/frames consistently. A dedicated Lab color space Gaussian representation with a warm-up strategy is used to improve reconstruction fidelity.
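For intuition, the per-scene personalization stage described above can be summarized by a fine-tuning loop of roughly the following shape. This is my own sketch with made-up modules and augmentations (the paper uses a pretrained colorizer with adapters and its own augmentation scheme), so it should be read as an illustration rather than the authors' code.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Stand-in "colorizer": maps a 1-channel luminance image to 2 chrominance channels.
base_colorizer = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                               nn.Conv2d(16, 2, 3, padding=1))
for p in base_colorizer.parameters():
    p.requires_grad_(False)                      # base model stays frozen

adapter = nn.Conv2d(2, 2, 1)                     # lightweight trainable adapter (illustrative)

augment = T.Compose([T.RandomResizedCrop(128, scale=(0.5, 1.0)),
                     T.RandomHorizontalFlip()])

key_L  = torch.rand(1, 1, 128, 128)              # luminance of the single key view
key_ab = torch.rand(1, 2, 128, 128)              # chrominance of the colorized key view (reference)

opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)
for step in range(200):
    pair = augment(torch.cat([key_L, key_ab], dim=1))   # augment L and ab consistently
    L_aug, ab_target = pair[:, :1], pair[:, 1:]
    ab_pred = adapter(base_colorizer(L_aug))
    loss = nn.functional.l1_loss(ab_pred, ab_target)    # learn the scene-specific color mapping
    opt.zero_grad(); loss.backward(); opt.step()

# At inference, every other view/frame's luminance is passed through the same
# personalized colorizer, which is what enforces cross-view color consistency.
```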
1. The paper is well-written and easy to follow. The proposed method can be adapted to various colorization models, making it highly practical and valuable for real-world applications, especially in AR/VR scenarios.
2. The experimental results in the LLFF, Mip-NeRF 360, and D-NeRF colorization settings show Color3D's good performance, not only on static 3D scenes but also on dynamic 4D scenes. Also, the method demonstrably outperforms existing alternatives across multiple metrics (FID, CLIP Score, Matching Error), showing superior consistency, color vividness, and alignment with user intent.
1. It seems that the requirement to fine-tune a personalized colorizer for every new scene adds a non-trivial computational cost (∼8 minutes per scene) compared to a generic, one-time-trained model, limiting its scalability for large-scale applications.
2. The entire color propagation relies on a single key view. If this view is unrepresentative or lacks critical scene elements, the colorizer's generalization may be hampered, potentially leading to incomplete or less vibrant colorization in occluded regions.
The paper shows that the personalized colorizer is robust to viewpoint changes from the key view. How does it handle novel views containing objects or textures that are semantically similar but visually different (e.g., another type of chair)? Would it incorrectly transfer learned colors or struggle to color them plausibly? |
Fully AI-generated |
|
Color3D: Controllable and Consistent 3D Colorization with Personalized Colorizer |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper presents Color3D, a unified framework for colorizing both static and dynamic 3D scenes reconstructed from monochrome inputs. Rather than colorizing multiple views independently, which causes cross-view inconsistency, the method colorizes one key view using any off-the-shelf 2D colorization model, then fine-tunes a per-scene personalized colorizer to propagate the learned color mapping to all other views or time steps. Experiments on LLFF, Mip-NeRF 360, and DyNeRF datasets demonstrate consistent and controllable colorization across viewpoints and time, with improvements in FID, CLIP score, and Matching Error compared to baselines.
**High controllability and flexibility.**
The system supports multiple control modalities: reference-based colorization, language-conditioned colorization, and automatic default color prediction.
**Computational practicality.**
Despite using fine-tuning, the reported per-scene personalization time (~8 minutes) is relatively efficient compared to retraining full colorization networks. The framework does not require full 3D model retraining, and adapters make it lightweight enough for scene-level deployment.
**Limited generalization beyond scene-specific fine-tuning.**
The reliance on per-scene personalized colorizer tuning implies that a new model must be trained for each scene. This restricts scalability for large datasets or interactive applications. The approach is elegant but computationally heavy when many scenes must be processed.
**Inductive bias assumption not rigorously examined.**
The claim that a single-view-trained colorizer generalizes to novel viewpoints via inductive bias remains empirical. The paper could benefit from deeper theoretical or diagnostic analysis to substantiate why the learned mapping remains consistent under large view changes.
**Limited evaluation.**
The evaluation datasets are LLFF (static), Mip-NeRF-360 (static), and DyNeRF (dynamic). Although promising results are shown, the scale of the evaluation is still limited. Also, for dynamic scenes there are additional evaluation dimensions, such as temporal color consistency, which are missing from the current experiments.
1. How stable is the personalized colorizer’s output under large camera baselines (e.g., >60° change)? If still good, what are the fundamental factors that may contribute to this effect?
2. How sensitive is performance to errors in key-view selection?
3. How could this approach be generalized to feed-forward-style 3D generation models, i.e., with per-scene optimization removed?
Lightly AI-edited |
|
Color3D: Controllable and Consistent 3D Colorization with Personalized Colorizer |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
Color3D proposes a method for colorizing monochromatic images in 3D representations for static and dynamic scenes. Naive implementations, like colorizing 2D multi-view images independently and then reconstructing the 3D scene, lead to severe cross-view inconsistencies. Recent methods that distil the color information into a 3D representation sacrifice controllability and often have desaturated colors in the final output.
Key contributions of the proposed methodology are as follows:
- The key idea is to colorize (automatic/reference-based/prompt-based) a single "key" view which is the most informative one and then fine-tune a scene-specific colorization network on that view.
- This scene-specific colorizer learns a deterministic color mapping for the scene and is then applied to all other views/frames, enforcing cross-view and cross-time color consistency .
- Finally, the colorized views (with known luminance and predicted chrominance) are fused into a Gaussian Splatting representation in Lab color space.
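As a note for readers less familiar with Lab-based colorization, the luminance/chrominance split in the last bullet can be illustrated with a few lines of code. This is only a toy recombination of a known L channel with predicted ab channels (my own illustration; the paper instead fuses the colorized views into a Gaussian Splatting representation kept in Lab space).

```python
import numpy as np
from skimage.color import lab2rgb

H, W = 64, 64
L_known      = np.random.rand(H, W) * 100.0             # luminance from the monochrome input, in [0, 100]
ab_predicted = (np.random.rand(H, W, 2) - 0.5) * 60.0   # chrominance predicted by the colorizer

lab = np.concatenate([L_known[..., None], ab_predicted], axis=-1)
rgb = lab2rgb(lab)   # geometry/brightness comes from L, color comes only from ab
```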
Experiments on standard benchmarks (LLFF, Mip-NeRF 360, DyNeRF) and "in-the-wild" legacy videos show that Color3D produces vivid and consistent colorizations.
- **[S1] Technical Novelty:** The per-scene colorization for a single view is a novel idea which achieves consistent colorization across views. The robust technical pipeline achieves this consistency through key design choices. Specifically, utilizing a pre-trained 2D colorization encoder (DDColor) preserves the generalization capability of the model. The use of the Lab color space leads to more stable results. Finally, warm-up training of the 3D representation on the Luma ($\text{L}$) channel first ensures the model establishes a strong geometric structure before introducing complex color information.
- **[S2] Practical Applications:** The authors show results on "in-the-wild" multi-view images and historical video (Fig.6), producing vivid and plausible colors while maintaining consistency.
- **[S3] Controllability**: The proposed method allows users to control the colorization using text descriptions or reference images. Earlier methods did not offer this type of user control.
- **[S4] Thorough Experimentation for Key-View Selection:** The authors performed detailed experiments for the "Key-View Selection" module. Results in Fig. 7 demonstrate that this module is critical for better colorization.
- **[W1] Limited Comparison to Recent Methods:** The authors mention "ChromaDistill" in Related Work, but do not perform a quantitative comparison. For complete experimentation, it is also necessary to quantitatively compare the results against a video colorization baseline.
- **[W2] Cross-View Consistency**: ChromaDistill utilized a long-term and short-term view-consistency metric for measuring geometric consistency. However, the authors do not report this metric. The manuscript would benefit from including it.
- **[W3]** As the method requires per-scene training, it incurs extra training overhead.
**Typos:**
- L74: It should be "aim" instead of "aims".
- L857: It should be "suffers" instead of "suffer".
- L347: It should be "entire" instead of "entir".
- **[Q1]** Have the authors tried fine-tuning on _more_ than one view? Using two or three colorized views might improve coverage for large scenes. Is there a reason the approach is limited to one view?
- **[Q2]** How does the proposed method handle motion for dynamic scenes? If an object that was not visible in the key view appears later in the scene, how is its color determined? Does the generative augmentation simulate such cases? This is not clear in the current manuscript.
Fully human-written |
|
Color3D: Controllable and Consistent 3D Colorization with Personalized Colorizer |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
1. This paper focuses on the task of controllable and consistent 3D scene editing. It aims to address the limitations of previous methods that often lack precise controllability and multi-view color consistency.
2. The authors propose Color3D, a two-stage framework. In the first stage, a text-to-image diffusion model is used to modify the reference view’s color according to user input. In the second stage, Score Distillation Sampling (SDS) is applied to enforce 3D consistency. Unlike prior work, the method performs Gaussian Splatting optimization in the LAB color space, which helps maintain color consistency across different viewpoints.
3. Experiments on Color-Edit-3D dataset show that Color3D achieves state-of-the-art results. The ablation study highlights the importance of the Color Consistency Regularization (CCR) module, which provides a clear performance gain when included.
1. The method is technically sound and well-motivated.
2. The experiments are thorough and provide strong empirical validation.
1. The computational cost is unclear — how long does the optimization take during the second stage?
2. The main novelty seems to lie in optimizing in the LAB color space instead of RGB. While this is a reasonable choice for editing tasks, the contribution may be somewhat limited in scope.
3. It would be interesting to see how the approach performs under challenging lighting conditions (e.g., scenes with lamps or strong reflections).
Please address the issues listed in the weaknesses section. |
Heavily AI-edited |
|
MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces MedInsightBench, a new benchmark for evaluating medical analytics agents on the ability to discover multi-step, clinically meaningful insights from multi-modal medical data (particularly pathology images and reports). The benchmark contains 332 curated medical cases and 3,933 verified insights, each annotated with corresponding questions, evidence, and analysis goals.
The authors further propose MedInsightAgent, a multi-agent framework comprising three components:
Visual Root Finder (VRF) – identifies salient visual features and generates initial analytical questions.
Analytical Insight Agent (AIA) – answers questions and synthesizes insights using a specialized pathology model (PathGen-LLaVA).
Follow-up Question Composer (FQC) – iteratively generates deeper and complementary questions to extend reasoning.
The paper introduces an automated evaluation protocol for insight discovery (Recall, Precision, F1, and Novelty) and provides comprehensive experiments comparing MedInsightAgent to state-of-the-art LMMs (GPT-4o, GPT-5, Qwen2.5-VL, Deepseek-VL2, InternVL3-38B) and ReAct-based frameworks.
Results show that MedInsightAgent (Qwen2.5-VL backbone) achieves the best performance across all metrics, substantially improving insight novelty and interpretability.
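For orientation, the insight-level recall/precision/F1 presumably reduce to matching generated insights against ground-truth insights and counting; the snippet below shows that bookkeeping with a placeholder matcher (whatever semantic matching the benchmark actually uses is not reproduced here, so treat this purely as an illustration of the metric arithmetic).

```python
from typing import Callable, List

def insight_prf(generated: List[str], ground_truth: List[str],
                is_match: Callable[[str, str], bool]) -> dict:
    """Set-matching style precision/recall/F1 over insights.

    `is_match` is a placeholder for the benchmark's semantic matcher
    (e.g., an LLM judge or an embedding-similarity threshold).
    """
    matched_gt = sum(any(is_match(g, t) for g in generated) for t in ground_truth)
    matched_gen = sum(any(is_match(g, t) for t in ground_truth) for g in generated)
    recall = matched_gt / len(ground_truth) if ground_truth else 0.0
    precision = matched_gen / len(generated) if generated else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

# Example with a naive keyword matcher (purely illustrative):
naive = lambda g, t: len(set(g.lower().split()) & set(t.lower().split())) >= 3
print(insight_prf(["tumor shows perineural invasion", "margins are negative"],
                  ["perineural invasion is present in the tumor"], naive))
```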
Novel task definition: Insight discovery in multimodal medical contexts is a new and important evaluation dimension.
High-quality dataset: Expert-verified, diverse, and hierarchically structured (goals → questions → insights → evidence).
Robust evaluation: Four complementary metrics (Recall, Precision, F1, Novelty) enable both quantitative and qualitative assessment.
Strong baseline coverage: Includes several frontier LMMs and a ReAct comparison.
Agent design clarity: The modular decomposition (VRF–AIA–FQC) provides transparency and potential for transfer to other domains.
Comprehensive analysis: Includes ablation, human evaluation, redundancy statistics, and case studies illustrating improvement in reasoning depth.
Domain limitation: All data originates from TCGA pathology, focusing heavily on cancer cases. The generalizability to other medical imaging types (radiology, histology, ophthalmology) remains untested.
Limited human expert benchmarking: While correctness and rationality are validated, there is limited comparison of agent-generated insights to expert-generated analytical reports.
Lack of longitudinal reasoning: The framework operates on single-image cases without temporal patient data, which would better test clinical reasoning.
Scalability and cost: The multi-agent setup and web retrieval modules increase inference latency and resource cost, though this is not quantified.
Novelty metric dependency: Novelty scoring relies on textual distance metrics (ROUGE/G-Eval) rather than domain novelty validation by clinicians.
1-How does MedInsightAgent handle contradictory evidence across multi-modal sources (e.g., text vs. image)?
2-Could the Follow-Up Question Composer benefit from reinforcement learning or self-critique loops to adaptively decide iteration depth instead of a fixed parameter p?
3-Have you evaluated zero-shot transfer of the MedInsightAgent to non-pathology modalities (e.g., radiology, dermatology)?
4-How consistent are the insight novelty metrics with human perception of novelty? Any inter-rater reliability studies?
5-Would you consider releasing a smaller public subset of MedInsightBench with de-identified samples for community benchmarking? |
Fully AI-generated |
|
MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces MedInsightBench, a new benchmark dataset for evaluating the ability of large multi-modal models to discover multi-step, deep insights from multi-modal pathology data. The authors also propose MedInsightAgent, a multi-agent framework designed to improve insight discovery, and show that it outperforms baseline LMMs on their new benchmark.
The paper's primary strength is addressing an important gap in existing evaluations.
The experimental comparisons are limited, the methodology for dataset creation and evaluation lacks transparency, and the true novelty of the agent framework's contribution is unclear:
1. The evaluation compares MedInsightAgent against LMMs-only and a single general-purpose agent framework (ReAct), while the paper's own "related works" section lists numerous domain-specific medical agent frameworks (e.g., MedAgentsBench, AgentClinic). Some of these works should be included as baselines.
1. The paper repeatedly states that human verification or human experts were used to curate the dataset. However, it provides no details on the verifiers' qualifications (e.g., were they board-certified pathologists?), the verification protocol, or the inter-annotator agreement.
2. The "Insight Novelty" metric, a key part of the proposed evaluation framework, is poorly explained and methodologically questionable. Appendix states that correct insights are those with a G-Eval score > 5, an arbitrary threshold that seems inconsistent with all reported G-Eval scores in Table 3. It then describes a process where incorrect insights are re-evaluated for novelty. This process is opaque and makes it difficult to trust the reported novelty scores.
3. The ablation study shows the IAT has the greatest impact on performance. It is therefore unclear how much of MedInsightAgent's performance gain comes from its novel agentic orchestration versus simply using a powerful, specialized tool that other baselines (like ReAct) were not given access to.
Please see the weaknesses above. |
Lightly AI-edited |
|
MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper presents MedInsightBench, a benchmark framework for evaluating large multimodal models (LMMs) and agent-based systems in multi-step medical insight discovery. The benchmark includes 332 cancer pathology cases and 3,933 expert-validated insights, integrating medical images, reports, and analytical goals into a unified evaluation scheme. The authors further propose a three-stage multi-agent framework, MedInsightAgent, comprising a Visual Root Finder, an Analytical Insight Agent, and a Follow-up Question Composer. Through visual analysis, question generation, external knowledge retrieval, and multi-turn reasoning, the framework enhances interpretability and insight depth. Experiments demonstrate that MedInsightAgent outperforms mainstream LMMs (e.g., GPT-4o, GPT-5, Qwen2.5-VL) in both F1 and Novelty metrics, highlighting the limitations of current models and their potential for improvement in medical insight generation.
1. The dataset is well-designed, balancing image quality, analytical objectives, and question–insight pairing, which ensures strong systematicity and evaluation value.
2. MedInsightAgent adopts a multi-round chain structure (Root Question → Insight → Follow-up), effectively enhancing the depth and diversity of insights while improving interpretability.
3. The benchmark introduces four complementary metrics—Insight Recall, Precision, F1, and Novelty—offering a more rigorous and comprehensive evaluation than prior text-matching–based methods.
1. The insight generation and validation process depends heavily on manual proofreading, which may limit scalability, consistency, and efficiency when applied to larger or more diverse medical datasets.
2. Although the multi-agent framework (MedInsightAgent) is conceptually interesting, its algorithmic design remains largely engineering-driven, lacking explicit optimization objectives, convergence proofs, or theoretical analysis of complexity.
3. The mathematical formulations (Eq.1–3) mainly describe procedural steps rather than formal optimization goals, reflecting limited theoretical depth and rigor.
4. The definition of “insight” is semantically ambiguous; while the authors propose F1 and Novelty metrics, they do not clearly distinguish between linguistic novelty and genuine medical discovery, and some examples resemble report paraphrasing.
5. The experiments, though extensive, lack statistical significance testing, detailed error analysis, and cross-domain generalization evaluation, which weakens the reliability of the reported improvements.
6. Comparisons with other medical agent frameworks (e.g., MedAgentsBench, AgentClinic) are insufficient, and the integration potential with large medical foundation models such as Med-PaLM M is not explored.
7. The inter-agent communication mechanism, including message passing and reasoning order, is under-specified, making it difficult to reproduce or verify the cooperative reasoning logic.
8. The benchmark focuses primarily on cancer pathology and single-round insight generation, without testing transferability to other modalities (e.g., radiology or endoscopy) or multi-stage clinical reasoning, which limits generalization and real-world clinical relevance.
1. How do the authors ensure causal consistency across each reasoning step within the multi-agent framework? Are there potential issues of pseudo-insights or reasoning drift during multi-turn inference?
2. How is the baseline for measuring “Original vs. Innovation” defined? Could linguistic diversity be mistakenly identified as genuine insight innovation?
3. How is the correlation between images and reports quantitatively validated, and does error propagation across modalities affect the accuracy of the final insights?
4. Was any inter-rater agreement analysis conducted to assess annotation reliability? Could LLM-assisted insight generation introduce semantic noise or bias?
5. If MedInsightAgent were to be applied to other medical domains (e.g., radiology or multi-organ pathology), would the Root Question module need to be redesigned or re-trained for domain adaptation? |
Fully AI-generated |
|
MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces **MedInsightBench**, a new multimodal benchmark consisting of 332 cancer pathology cases (≈3.9 k annotated insights) that pairs whole‑slide images with structured analytical goals, question‑insight pairs, and difficulty levels. It also proposes **MedInsightAgent**, a three‑module multi‑agent pipeline (Visual Root Finder, Analytical Insight Agent, Follow‑up Question Composer) that leverages image summarisation, web retrieval, and a pathology‑fine‑tuned LMM (PathGen‑LLaVA) to generate multi‑step medical insights. Experiments compare several state‑of‑the‑art large multimodal models (GPT‑4o, GPT‑5, DeepSeek‑VL2, Qwen2.5‑VL‑32B‑Instruct, InternVL3‑38B) and two agent baselines (ReAct, MedInsightAgent) on the benchmark using four automatically computed metrics: Insight Recall, Insight Precision, Insight F1, and Insight Novelty (original vs. innovation scores). Ablation studies remove individual MedInsightAgent modules to assess their impact.
1. **Novel Benchmark Idea** – The focus on *multi‑step insight discovery* rather than single‑turn QA is a worthwhile gap in current multimodal evaluation.
2. **Dataset Construction Pipeline** – The authors describe a fairly detailed pipeline (WSI down‑sampling, OCR‑based report extraction, LLM‑assisted insight generation, human verification) and provide some quality analyses (correctness, rationality, coherence).
3. **Agent Architecture** – The three‑module design is clearly motivated and the paper includes a full algorithmic description, making the system reproducible in principle.
4. **Comprehensive Baselines and Ablations** – A wide range of recent LMMs are evaluated, and a ReAct‑style agent baseline is included for comparison. The paper also quantifies the contribution of each MedInsightAgent component, showing measurable performance drops when modules are removed.
- There seems to be a mismatch between ground truth (largely extracted from pathology reports) and model inputs during evaluation. Many “ground-truth insights” (e.g., HPV/p16 status, node counts, margins, R-status, IHC panels) cannot be inferred from an H&E image alone, especially after whole-slide downsampling to PNG. In Table 7 and case studies, several insights are report-only facts. If the benchmark input at test time is Goal + Image (as Table 2 indicates), a substantial subset of ground-truth is fundamentally unanswerable from the provided modality, confounding recall/precision/F1 and making negative findings uninterpretable. Unless I missed something, the benchmark input appears to exclude the report text at evaluation time; this undermines the “multi-modal” positioning and obscures what is measurable.
- Baseline parity is not ensured. MedInsightAgent uses an additional domain-specific image-analysis tool (PathGen-LLaVA) and web retrieval; ReAct is restricted to a computation module and web search. If multi-tool access improves performance, ReAct should be given equivalent tools to isolate the effect of the orchestration strategy rather than tool availability.
- Limited human evaluation and unclear rigor: The paper mentions 10 human experts for 100 data points and a separate 100-sample data quality audit, but provides no inter-annotator agreement, precise scoring rubric, or confidence intervals. Claims like “strong concordance with human judgments” lack quantified evidence (e.g., Pearson/Spearman/Kendall correlations, bootstrap CIs). There is also no statistical significance testing, confidence intervals, or variance reporting across methods. Reported gains (e.g., G-Eval F1 improvements of ~0.05–0.06) may be within evaluator noise.
- I also couldn't figure out if PathGen-LLaVA was exposed to similar TCGA distributions (potential leakage) and how cases were partitioned. There is also no compute/latency/cost reporting for the agent loops.
- Case studies show MedInsightAgent often produces general, textbook-like statements (e.g., “perineural invasion suggests aggressive tumor”), which can inflate “novelty” under the current metric but do not demonstrate image-grounded discovery. Without human verification that each output is supported by the image at the provided resolution, improvements may reflect better phrasing rather than better clinical insight. Consequently, the conclusion that “higher F1 corresponds to greater novelty” may be an artifact of the novelty scoring pipeline rather than a real causal relation.
- The claims “first comprehensive benchmark for medical insight discovery” and “strong concordance with human judgments” are not sufficiently supported. Prior work targets medical agents and multimodal evaluation; the novelty claim too should be carefully scoped to “pathology image insight discovery with goal/question/insight structure.”
1. Inputs: At evaluation time, do models see only Goal + Image, or also the report text? If only Goal + Image, how do you justify including report-only insights (e.g., HPV status) in ground truth and metrics?
2. Image fidelity: What downsampling ratios, target resolutions, and magnifications are used? Are multi-scale tiles or high-power patches provided? How do you ensure image sufficiency for cellular-scale findings?
3. Ground-truth filtering: Did you filter insights to those image-inferable from H&E at the provided resolution? If not, what fraction of insights are inherently non-inferable from the image alone?
4. Human validation: Who were the human experts (qualification, number per sample)? Please report inter-annotator agreement (e.g., Cohen’s/Fleiss’ kappa) and confidence intervals for human scores.
5. Novelty metric: How many “novel” insights were spot-checked by pathologists? What fraction were truly image-grounded? Please report a blinded human audit on a random subset.
6. Baseline parity: Why not equip ReAct with PathGen-LLaVA and identical tools to isolate orchestration benefits? Conversely, evaluate MedInsightAgent without PathGen-LLaVA to separate tool vs. agent effects.
7. Statistical rigor: Please report variance, CIs, and tests of significance (e.g., bootstrap) for Table 3 and Table 4. Also provide human-vs-automatic evaluator correlations with CIs.
8. Data release and splits: Will the dataset, code, and prompts be released, including a clear train/val/test split and case IDs? Any overlap between PathGen-LLaVA training data and TCGA cases used here?
9. Safety: Will you add explicit guidance that outputs are research-only and not for clinical decision making?
10. Claims: Please precisely scope the “first benchmark” claim and provide a systematic comparison distinguishing MedInsightBench from MedRepBench, 3MDBench, and other agent-based medical datasets. |
Fully AI-generated |
|
Learning When to Be Uncertain: A Post-Hoc Meta-Model for Guided Uncertainty Learning |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
Summary
This paper introduces GUIDE (Gradual Uncertainty Refinement via Noise-Driven Curriculum), a novel post-hoc evidential meta-model for improving uncertainty quantification (UQ) in pretrained deep learning models without retraining or architectural modifications. The key idea is to explicitly teach a model when and how much to be uncertain, thereby addressing misplaced confidence — a common limitation in existing post-hoc UQ methods.
GUIDE operates in two main stages:
1. Saliency Calibration Stage:
The pretrained (frozen) model undergoes a relevance propagation analysis (via Layer-wise Relevance Propagation, LRP-ϵ) to identify salient intermediate features. This yields both layer-level relevance scores and input-level saliency maps, which determine the layers and spatial regions most critical for prediction.
2. Uncertainty-Guided Training Stage:
GUIDE attaches a lightweight Dirichlet-based evidential meta-model that consumes features from selected salient layers. Using the previously derived saliency maps, it generates a noise-driven curriculum — progressively corrupting salient input regions to simulate distributional shifts.
The model is trained with a soft-target loss combining uncertainty regularization and a Self-Rejecting Evidence (SRE) penalty, ensuring uncertainty increases monotonically with corruption while confidence remains justified.
Theoretical analysis (Theorem 1) provides guarantees on Fisher information retention, showing that GUIDE preserves a bounded fraction of the base model’s informative structure under the saliency selection mechanism.
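For readers unfamiliar with evidential heads, the standard Dirichlet formulation that such a meta-model builds on maps non-negative evidence to concentration parameters and yields a closed-form uncertainty mass. The sketch below shows only this generic computation; the specific activation, head architecture, SRE loss, and curriculum used by GUIDE are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def evidential_outputs(logits: torch.Tensor):
    """Standard evidential-deep-learning quantities for K-class logits of shape (B, K)."""
    evidence = F.softplus(logits)          # non-negative evidence per class
    alpha = evidence + 1.0                 # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)
    probs = alpha / strength               # expected class probabilities under the Dirichlet
    K = logits.shape[-1]
    uncertainty = K / strength.squeeze(-1) # vacuity: high when total evidence is low
    return probs, uncertainty

logits = torch.randn(4, 10)                # e.g., outputs of a meta-model head on 10 classes
probs, u = evidential_outputs(logits)
print(u)                                   # values in (0, 1]; close to 1 means "I don't know"
```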
Key Contributions
1. A fully post-hoc evidential meta-model that explicitly learns when to be uncertain using guided curricula rather than passive calibration.
2. Saliency-based layer selection eliminating manual design choices and ensuring feature relevance consistency.
3. Noise-driven curriculum learning, progressively teaching uncertainty behavior aligned with model sensitivity.
4. Theoretical guarantees for saliency coverage and information retention.
5. Extensive experimental validation across multiple in-distribution (ID), out-of-distribution (OOD), and adversarial benchmarks showing robust, state-of-the-art results.
Empirical Results
GUIDE was benchmarked against intrusive and post-hoc UQ baselines including ABNN, EDL-Head, Whitebox, and EMM, across datasets such as MNIST, CIFAR10/100, SVHN, and Oxford Flowers → Deep Weeds.
• ID Accuracy: Comparable to baselines (≈ 99% on simple datasets; ≈ 90% on CIFAR tasks).
• OOD/Adversarial Coverage: GUIDE achieves the lowest coverage (e.g., ≤ 8% OOD, ≤ 5% adversarial), outperforming others by large margins.
• AUROC: GUIDE consistently achieves >94% on OOD and adversarial detection (up to 96% on MNIST → FashionMNIST), exceeding all intrusive and post-hoc baselines.
• Calibration: Expected calibration error (smECE) reduced to 0.061, compared to 0.317 (pretrained) and 0.193 (EMM).
• Robustness: Maintains high AUROC (>90%) across perturbation strengths and attack types (L2PGD, FGSM, Salt-and-Pepper).
Impact and Positioning
GUIDE is positioned as a non-intrusive, architecture-agnostic, and computationally lightweight solution that bridges the gap between model confidence and predictive reliability. Unlike earlier post-hoc approaches (which reshape outputs), GUIDE actively instructs the model through a principled uncertainty curriculum, yielding better calibration and robustness under distributional shifts.
The approach has strong implications for safe and trustworthy AI, particularly in high-stakes domains like healthcare, autonomous systems, and human-in-the-loop robotics.
Overall Evaluation (Summary KPI)

| Criterion | Assessment |
| --- | --- |
| Originality | High: introduces guided uncertainty curricula in post-hoc UQ. |
| Significance | Strong: directly impacts deployment reliability and calibration. |
| Technical Quality | Excellent: rigorous derivation, clear algorithmic pipeline, theoretical support. |
| Clarity | Very good: figures and pseudo-code are interpretable, though dense in notation. |
| Empirical Evaluation | Comprehensive: multiple datasets, attacks, and ablations. |
| Reproducibility | Strong: open-source repository available. |
| Potential Weakness | May depend on LRP assumptions; sensitivity to noise schedule hyperparameters not fully explored. |
The paper demonstrates a well-balanced and well-executed contribution to the growing area of uncertainty quantification (UQ) and reliable deep learning. It succeeds in combining conceptual novelty with solid empirical performance, offering a practical and interpretable framework that is relevant to both research and deployment contexts.
Originality
• The central idea — teaching a pretrained model when to be uncertain via a saliency-guided, noise-driven curriculum — is distinctive among post-hoc uncertainty methods.
• While it draws from known concepts (e.g., evidential learning, saliency mapping), the paper’s integration of saliency-based layer selection with curriculum-style uncertainty learning is an innovative synthesis not previously explored in this form.
• The Self-Rejecting Evidence (SRE) loss adds a novel regularization mechanism encouraging monotonic uncertainty behavior with respect to perturbation strength.
Quality
• Methodologically rigorous and technically consistent, the proposed GUIDE framework is both theoretically motivated (Fisher information retention theorem) and empirically validated across multiple datasets and uncertainty scenarios.
• The experiments are comprehensive, spanning in-distribution, out-of-distribution, and adversarial settings, and consistently demonstrate GUIDE’s superiority or parity with state-of-the-art baselines.
• The ablation studies, calibration metrics (ECE, AUROC), and robustness analyses collectively support the validity of the paper’s claims.
Clarity
• The writing style is clear and structured, balancing technical precision with readability.
• The stepwise exposition (motivation → framework → theoretical justification → empirical results) makes the methodology accessible to readers with diverse backgrounds in UQ and evidential deep learning.
• Figures illustrating the uncertainty refinement process and saliency-based layer selection are informative, though some could be more tightly integrated with text explanations.
Significance
• The contribution is practically significant: GUIDE is post-hoc, lightweight, and architecture-agnostic, making it applicable to real-world AI pipelines without retraining or invasive model access.
• The framework addresses one of the most persistent challenges in deep learning — overconfidence under distributional shift — with a method that enhances both calibration and interpretability.
• The approach bridges human-in-the-loop AI, explainability, and uncertainty estimation, aligning with broader research directions in trustworthy AI, which is central to ICLR’s evolving research priorities.
Summary of Strengths

| Dimension | Assessment |
| --- | --- |
| Originality | Creative synthesis of evidential learning, saliency analysis, and guided curricula. |
| Quality | Strong theoretical and empirical foundation; well-executed experiments. |
| Clarity | Generally clear exposition and strong visual explanations. |
| Significance | High practical relevance for post-hoc model reliability and safety. |
While the paper presents a creative and empirically validated approach to post-hoc uncertainty quantification, it exhibits several conceptual, methodological, and presentation-related weaknesses that limit its theoretical depth and generalizability. The following are specific, constructive observations aimed at strengthening the work for future revisions or journal extensions.
1. Limited Theoretical Grounding of “Learning When to Be Uncertain”
• The central notion of learning when to be uncertain remains intuitively compelling but mathematically underdeveloped.
The method demonstrates empirical behavior consistent with this idea but does not provide a formal probabilistic framework or causal model linking saliency-driven noise to epistemic uncertainty.
• The proposed Fisher information retention theorem is a useful analytical step but only partially supports the conceptual claim—it ensures informational consistency, not causal justification for uncertainty behavior.
• To strengthen rigor, future versions could introduce a formal uncertainty measure evolution model (e.g., monotonic entropy gradient or causal sensitivity analysis) or link GUIDE’s curriculum mechanism to Bayesian learning theory (e.g., Kendall & Gal, 2017; Malinin & Gales, 2018).
2. Dependence on Saliency and Potential Bias
• The reliance on Layer-wise Relevance Propagation (LRP-ϵ) introduces methodological fragility:
• LRP performance and interpretability vary across architectures and data modalities (e.g., CNNs vs. transformers).
• The paper does not evaluate whether GUIDE’s performance depends heavily on the chosen saliency technique.
• An ablation comparing different saliency methods (Grad-CAM, Integrated Gradients, DeepLIFT) would clarify whether the framework’s improvements arise from the general saliency mechanism or specific LRP behavior.
• This dependence may reduce reproducibility across architectures beyond those tested.
3. Insufficient Hyperparameter and Sensitivity Analysis
• The paper introduces several tunable components — notably:
• Noise schedule parameters in the saliency-guided curriculum,
• Regularization weights (λ₁, λ₂) in the Self-Rejecting Evidence loss, and
• Layer selection thresholds (τ).
• However, no sensitivity analysis is provided to assess how these affect performance or stability.
• This omission weakens claims of robustness and generality, especially for a method promoted as post-hoc and lightweight.
• Future work should present a systematic exploration of these hyperparameters, ideally visualized as performance landscapes or uncertainty–accuracy trade-offs.
4. Incremental Conceptual Novelty
• Although the method performs well, its conceptual novelty is moderate: it integrates established elements (evidential learning, saliency analysis, curriculum noise) rather than introducing fundamentally new uncertainty theory or model structure.
• Similar principles appear in prior works on post-hoc meta-modeling (Postels et al., 2021), selective prediction (Geifman & El-Yaniv, 2019), and uncertainty calibration through adversarial exposure (Mukhoti et al., 2023).
• The authors could strengthen originality by clearly positioning GUIDE as a “structured synthesis” rather than a new paradigm, and by articulating where exactly it diverges conceptually or empirically from these precedents.
5. Generalization to Modern Architectures
• All experiments are performed on mid-scale CNN-based image classifiers (MNIST, CIFAR, SVHN, Flowers), which are suitable for controlled studies but insufficient for establishing scalability or architectural generality.
• The approach’s compatibility with transformers, diffusion models, or multimodal architectures—which dominate current ICLR topics—is untested.
• Extending GUIDE to large-scale or multimodal settings would substantiate its relevance to the broader ICLR audience.
6. Limited Real-World or Cross-Domain Evaluation
• Although GUIDE improves calibration and OOD robustness, no experiments are conducted in domain-shifted or real-world datasets (e.g., corrupted CIFAR, ImageNet-C, or medical imaging benchmarks).
• Including at least one realistic scenario (e.g., sensor noise, class imbalance) would better demonstrate the method’s reliability beyond clean benchmarks.
7. Minor Presentation and Clarity Issues
• Some mathematical derivations are densely presented and could benefit from expanded intuition or intermediate explanations.
• Figures showing calibration improvement and uncertainty maps are insightful but small, making quantitative differences difficult to assess visually.
• A concise visual summary (e.g., flow diagram of the GUIDE pipeline with saliency/noise progression) would improve readability for interdisciplinary readers.
Overall Assessment:
The paper is strong in execution but could significantly improve by deepening theoretical grounding, expanding generalization experiments, and clarifying dependence on design choices. These refinements would elevate the work from a high-quality empirical contribution to a conceptually mature framework deserving of top-tier recognition.
1. On the Theoretical Framing of “Learning When to Be Uncertain”
• Could the authors formalize the concept of learning when to be uncertain beyond its intuitive description?
For example:
• Is there a measurable quantity (e.g., monotonic relationship between noise level and entropy or epistemic evidence) that supports this claim?
• How does the proposed Self-Rejecting Evidence (SRE) loss encourage this behavior mathematically — does it impose any guarantee of monotonic uncertainty growth under perturbation?
• A short theoretical or empirical analysis of this relationship would make the claim substantially stronger.
2. On the Role and Robustness of LRP in GUIDE
• GUIDE relies heavily on Layer-wise Relevance Propagation (LRP-ϵ) for both saliency selection and curriculum construction.
• How sensitive is GUIDE’s performance to the specific saliency method used?
• Have the authors tested alternative saliency measures (e.g., Grad-CAM, Integrated Gradients, or SHAP)?
• Could GUIDE’s uncertainty refinement fail if the saliency signal is noisy or misaligned (as often happens in deeper models)?
• Including an ablation or at least a qualitative comparison would help clarify whether the saliency mechanism is integral or replaceable.
3. On Hyperparameter Sensitivity
• The method introduces multiple hyperparameters:
• λ₁, λ₂ (regularization weights),
• τ (saliency threshold), and
• noise schedule parameters (corruption magnitude or rate).
• Could the authors provide sensitivity curves or variance estimates showing GUIDE’s stability with respect to these parameters?
• This would be particularly useful to assess GUIDE’s reliability as a “plug-and-play” post-hoc method.
4. On Comparative Baselines and Fairness
• How do the authors ensure fair comparison with existing methods like Deep Ensembles, MC-Dropout, or Temperature Scaling?
• Were all models trained or calibrated using identical datasets and computational budgets?
• Since GUIDE is post-hoc, how is its runtime and memory footprint compared to ensemble-based methods?
• Including a table summarizing computational cost vs. performance could clarify GUIDE’s real-world efficiency.
5. On Theoretical Result (Theorem 1 – Fisher Information Retention)
• Theorem 1 provides a bound on retained Fisher information under saliency selection.
• Could the authors elaborate on the assumptions underlying this bound (e.g., independence or linearity of selected features)?
• Is this guarantee empirical (observed in finite data) or asymptotic (as dataset → ∞)?
• How tight is the bound in practice — do the authors have empirical measurements correlating Fisher information loss with performance degradation?
6. On the Relationship Between GUIDE and Calibration Methods
• GUIDE seems related in spirit to post-hoc calibration (e.g., Temperature Scaling, Platt Scaling, Isotonic Regression) but introduces a learning component.
• How does GUIDE differ conceptually and practically from these calibration techniques in terms of the underlying uncertainty model (Dirichlet vs. softmax temperature)?
• Could the authors provide a brief comparison or unified framework situating GUIDE among existing calibration approaches?
7. On Scalability to Modern Architectures
• Have the authors tested GUIDE on transformer-based architectures (e.g., ViT, DeiT) or multimodal models?
• If not, do they anticipate challenges due to the saliency extraction step or computational scaling?
• This would be important to determine whether GUIDE generalizes beyond CNN-based settings, which are becoming less central in current ICLR research.
8. On Dataset Diversity and Real-World Scenarios
• The experiments are limited to canonical image datasets (MNIST, CIFAR, SVHN, Oxford Flowers).
• Could the authors discuss potential extensions or ongoing work on realistic, domain-shifted, or corrupted datasets (e.g., CIFAR-C, ImageNet-C, or medical images)?
• This would strengthen the claim of GUIDE’s utility in safety-critical applications.
9. On Interpretability and Human Alignment
• Since GUIDE integrates saliency-driven learning, it implicitly aligns with explainability goals.
• Have the authors considered whether the generated saliency maps or uncertainty overlays are interpretable by human users (e.g., can humans trust GUIDE’s “when uncertain” behavior)?
• A small user study or interpretability evaluation could add practical value to this aspect.
10. On Open-Source and Reproducibility
• The paper claims that code will be made available.
• Could the authors clarify the current status of code release and whether pretrained models, saliency scripts, and evaluation setups will be publicly accessible?
• Reproducibility is especially important for a post-hoc framework advertised as “lightweight and widely applicable.”
The authors’ clarifications on the theoretical basis, robustness to saliency and parameters, and scalability of GUIDE could significantly enhance the perceived rigor and impact of this work. A strong rebuttal addressing these points with additional experiments or analysis would likely improve its overall evaluation. |
Fully AI-generated |
|
Learning When to Be Uncertain: A Post-Hoc Meta-Model for Guided Uncertainty Learning |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes GUIDE, a novel framework for guided uncertainty estimation that teaches a pretrained model when and how to be uncertain without requiring retraining or architectural modifications. The method combines saliency calibration and a noise-driven curriculum to enhance reliability and out-of-distribution (OOD) awareness in existing deep networks.
The paper provides solid theoretical grounding, supported by an information-theoretic analysis that links saliency-guided learning to the preservation of Fisher information during uncertainty calibration.
Empirical results demonstrate state-of-the-art performance across diverse in-distribution and out-of-distribution settings (e.g., CIFAR-10 → SVHN, MNIST → FashionMNIST, Oxford Flowers → DeepWeeds). GUIDE consistently outperforms prior baselines (e.g., temperature scaling, Mahalanobis, DUQ, and EMM).
The method is validated primarily on small- to mid-scale datasets (MNIST, CIFAR-10, Oxford Flowers). It remains unclear whether GUIDE scales effectively to large-scale foundation models (e.g., CLIP, ViT-L, or LLaVA).
The motivation for selecting the specific saliency method (LRP) is only briefly discussed. It would be beneficial to clarify whether other attribution techniques (Grad-CAM, Integrated Gradients) would yield similar benefits.
Figures do not include error variance analysis across multiple random seeds.
The theoretical analysis is centered on Fisher information preservation during saliency calibration, which provides a local view of uncertainty propagation. Could the authors clarify whether this framework generalizes to non-local uncertainty effects—for instance, when prediction uncertainty arises from feature interactions beyond the saliency-identified regions? |
Fully AI-generated |
|
Learning When to Be Uncertain: A Post-Hoc Meta-Model for Guided Uncertainty Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper discusses GUIDE, a new framework for uncertainty quantification built on an existing (pre-trained) network. GUIDE uses the layers of the existing network to determine salient layers and connects them to a meta-model network for uncertainty-guided training. Using the GUIDE framework, the authors show an improvement in uncertainty estimation over the state of the art. They demonstrate this on ID vs. OOD datasets as well as under adversarial attacks.
Main contributions:
- The use of a layer importance measure, computed via a saliency metric, to select the layers (a small sketch of such threshold-based selection is given after this list)
- Construction of a curriculum (a dataset of data points corrupted with noise based on the saliency metric), which is used in the loss function for uncertainty-guided training of the meta-model
- The authors have carried out extensive experiments across different datasets and adversarial methods
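Regarding the first contribution, my understanding is that the saliency-based layer selection boils down to a cumulative-coverage rule of roughly the following kind. This is my own sketch; the relevance scores themselves would come from the relevance propagation step, and the threshold corresponds to the paper's coverage parameter.

```python
def select_salient_layers(relevance: dict, tau: float = 0.9) -> list:
    """Pick layers in decreasing relevance until cumulative coverage reaches tau."""
    total = sum(relevance.values())
    selected, covered = [], 0.0
    for name, r in sorted(relevance.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(name)
        covered += r / total
        if covered >= tau:
            break
    return selected

# Example with made-up per-layer relevance mass:
print(select_salient_layers({"conv1": 0.05, "block2": 0.25, "block3": 0.45, "block4": 0.25}))
```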
- The paper is presented in a clear way and is well written.
- The paper also shows extensive experiments using many datasets.
- Although relevance propagation is a well-established way to describe explainability in neural networks, here it is used as a means to train a meta-model that estimates uncertainty. In that sense, the authors describe a novel and original way to use it for uncertainty quantification. This has very promising results compared to other techniques (such as Bayesian NNs).
- The paper has a strong mathematical foundation.
- The main article focuses on the main findings and a clear explanation. The authors have done a good job of separating the main findings from the detailed descriptions of the theorems and additional experiments, which are given in the appendix.
- I noticed that the authors have struggled with the page limit due to the size of some tables (e.g., Table 1). I would suggest leaving out a column in Table 1 (e.g., EMM+curric).
- The use of intermediate layers for uncertainty estimation is not completely novel; it has been explored in [1], [3]. Uncertainty has also been estimated using Intermediate Layer Variational Inference [2].
- Some explanations at the end of the article are very short, which gives the impression of unfinished work. E.g., the adversarial attack analysis is very relevant but needs more explanation.
[1] Ameer et al., Enhancing adversarial robustness with randomized interlayer processing, 2024, Expert Systems with Applications.
[2] Ahmed et al. 2021. Real-time Uncertainty Estimation Based On Intermediate Layer Variational Inference. In Proceedings of the 5th ACM Computer Science in Cars Symposium (CSCS '21).
[3] https://arxiv.org/abs/2012.03082
- The definition of the uncertainty target: what is the rationale behind this formula?
- Why is the training done first on clean targets and only later on noise-corrupted targets? Does this not make the problem harder to learn? What is the effect of this training schedule?
- How many layers are selected to reach the cumulative relevance coverage threshold? What is the impact of this threshold on the number of selected layers?
- The authors mention in the results section that intrusive methods have a large coverage but that their OOD detection is high. The authors' method reports an OOD coverage below 10%, but it is not clear to me why this is good; in my opinion, a high OOD coverage would mean that the model can detect OOD samples correctly. Could the authors clarify this? I propose elaborating on the explanation of OOD coverage.
- The authors claim that GUIDE is architecture-agnostic; however, for the method to work, one needs to be able to calculate the relevance values.
Fully human-written |
|
Learning When to Be Uncertain: A Post-Hoc Meta-Model for Guided Uncertainty Learning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces GUIDE (Guided Uncertainty Learning Using a Post-hoc Evidential Meta-model), a post-hoc non intrusive method for uncertainty quantification that can be applied to pretrained deterministic neural networks without retraining them.
- The paper is easy to read
- The topic can be relevant, and the approach is interesting
**Motivation and definition of post-hoc uncertainty:** The motivation behind the paper, per the abstract, is that existing post-hoc approaches inherit "misplaced" confidence or reshape the predictions (through temperature scaling, for instance, I believe). However, these motivations are not developed or explained in the later sections of the manuscript, which makes the rationale behind GUIDE somewhat unclear. While at first this seemed to me a minor presentation issue, I then noticed that the authors' definition of post-hoc is ambiguous. From the text, one is led to believe that the definition refers to approaches acting on a pretrained network; however, the authors list as post-hoc the method proposed in [int_ref_1], which does perform end-to-end training. This makes the framing of the whole paper confusing and leaves the description of the limitations of existing post-hoc approaches (i.e., misplaced confidence and reshaping) extremely vague.
**Results discussion**: The discussion of the results is somewhat reductive and should be developed further, especially with respect to Table 1. For instance, the results on the CIFAR10–CIFAR100 and Oxford Flowers–DeepWeeds settings are rather weak for GUIDE, yet this outcome is neither mentioned nor explained.
**Adversarial setup:** I understand that the attacks are simply used to produce OOD samples, but please refrain from using plain PGD or FGSM attacks [ext_ref_1]. There are stronger and more widely accepted approaches, such as AutoAttack (AA) [ext_ref_2]. Also, please specify the hyperparameters used, such as the number of iterations. It is quite possible that the current configurations produce extremely suboptimal solutions to the adversarial attack optimization problem.
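For reference, this is a minimal sketch of what a clearly specified evaluation could look like using AutoAttack (assuming the standard `autoattack` package interface; `model`, `x`, `y`, and the epsilon value are placeholders, not the authors' setup):
```python
# Sketch of an AutoAttack-based evaluation with explicit hyperparameters.
# Assumes a classifier `model` returning logits, inputs `x`, and labels `y`.
from autoattack import AutoAttack

def generate_adversarial_ood(model, x, y, eps=8 / 255, batch_size=128):
    model.eval()
    # 'standard' runs the full parameter-free ensemble of attacks.
    adversary = AutoAttack(model, norm="Linf", eps=eps, version="standard")
    return adversary.run_standard_evaluation(x, y, bs=batch_size)
```
Reporting eps, the norm, and the attack version in this way would make the adversarial OOD results reproducible.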
**Other issues, not necessarily minor:**
- *References*: I suggest associating the Laplace approximation with this pioneering work [ext_ref_3].
- *Figure 6:* I cannot understand which dataset is used in this figure.
- *Citation style*: Please use \citet and \citep as appropriate.
[int_ref_1]: Sensoy, Murat, Lance Kaplan, and Melih Kandemir. "Evidential deep learning to quantify classification uncertainty." Advances in neural information processing systems 31 (2018).
[ext_ref_1]: Carlini, Nicholas, et al. "On evaluating adversarial robustness." arXiv preprint arXiv:1902.06705 (2019).
[ext_ref_2]: Croce, Francesco, and Matthias Hein. "Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks." International conference on machine learning. PMLR, 2020.
[ext_ref_3]: MacKay, David JC. "A practical Bayesian framework for backpropagation networks." Neural computation 4.3 (1992): 448-472.
- Could the authors clarify what is the definition of post-hoc uncertainty?
- Could the authors clarify what is precisely the motivation behind the work? Can the authors better articulate what gap GUIDE fills that existing post-hoc or evidential approaches do not?
- Why is Evidential Deep Learning [int_ref_1] considered post-hoc here, despite being trained end-to-end?
- Could the authors expand their discussion on the results? Could they include clear discussion on weaker results, such as those on CIFAR10-CIFAR100 and Oxford Flowers-DeepWeeds, and explain why GUIDE performs less effectively in those settings?
- Could the authors specify all attack hyperparameters (e.g., step size, number of iterations) and, specifically, also the perturbation radius used in table 1? |
Fully human-written |
|
Property-Driven Protein Inverse Folding with Multi-Objective Preference Alignment |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes ProtAlign, a multi-objective preference-alignment framework for protein inverse folding that optimizes developability properties without compromising designability. The method uses a semi-online DPO loop: generate rollouts at higher temperature, score them with property predictors, construct pairwise preferences per property, then train offline with an adaptive preference margin to reconcile conflicts among objectives. Instantiated on ProteinMPNN as MoMPNN, the approach is evaluated on CATH 4.3, de novo backbones from RFDiffusion, and realistic de novo binders; results show developability gains while maintaining or improving structural consistency relative to strong baselines.
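For concreteness, my reading of the core objective is standard DPO with a per-pair margin derived from the property deltas; the sketch below is my own reconstruction (the function name, the beta value, and the toy inputs are assumptions, not the authors' code):
```python
# Minimal sketch of a DPO loss with a per-pair preference margin, as I read Eq. 4:
# pairs with a larger auxiliary property gap are required to have a larger reward gap.
import torch
import torch.nn.functional as F

def margin_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, margin, beta=0.1):
    """logp_*: policy log-likelihoods of the preferred (w) / rejected (l) sequences;
    ref_logp_*: the same quantities under the frozen reference model;
    margin: precomputed per-pair margin from the property deltas (assumed)."""
    chosen_reward = beta * (logp_w - ref_logp_w)
    rejected_reward = beta * (logp_l - ref_logp_l)
    # Standard DPO logit, shifted by the pair-specific margin.
    logits = chosen_reward - rejected_reward - margin
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-likelihoods for a batch of four preference pairs.
lw, ll, rw, rl = (torch.randn(4) for _ in range(4))
m = torch.tensor([0.0, 0.1, 0.2, 0.5])
print(margin_dpo_loss(lw, ll, rw, rl, m))
```
If this reading is correct, the margin is a fixed per-pair offset computed once from the property predictors, which connects to my concern below about policy drift.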
- Method is simple and general: multi-objective DPO with an adaptive preference margin to mitigate conflicts across properties; the training pipeline evenly samples pairwise entries across properties and alternates rollout and training for efficiency.
- Practical semi-online training decouples rollout/evaluation from optimization, enabling batch computation and easier deployment while retaining online exploration benefits.
- Evaluations are broad and application-relevant: crystal redesign, de novo backbones, and realistic binder design; the study systematically integrates developability metrics into inverse-folding evaluation beyond amino acid recovery.
- The presentation style is good, with nice-looking figures and easy-to-follow-up narration styles.
- Limited ablations on multi-objective weights and margin settings. It might be helpful to quantify how the weights, temperature, and margin thresholds shape the Pareto front and to provide transferable default configurations, since the paper relies heavily on these settings.
- The adaptive preference margin m(yw,yl) is precomputed from auxiliary property deltas and then kept fixed during training. This is simple and fast, but it cannot react if the policy distribution drifts, predictors recalibrate, or property trade-offs evolve; the “right” margin may change as the frontier moves.
- Pair construction may over-represent "easy wins" and under-sample ambiguous regions. Preference pairs are formed by sorting rollouts and pairing the top half against the bottom half, with a delta threshold to drop uncertain pairs (a sketch of this scheme as I read it is given below). While this stabilizes supervision, it can bias learning away from the decision boundary where the frontier is decided. Active pair mining (hard-negative selection) or uncertainty-aware sampling could help the model learn more from the ambiguous region and reduce label imbalance across properties.
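A minimal sketch of this pairing scheme as I read it (the function name, the threshold value, and the toy scores are my own placeholders, not the authors' implementation):
```python
# Sketch of per-property preference-pair construction: sort rollouts by the property
# score, pair the top half against the bottom half, and drop ambiguous pairs whose
# score delta falls below a threshold.
def build_preference_pairs(rollouts, scores, delta_threshold=0.05):
    """rollouts: list of sequences; scores: property score per rollout (one property).
    Returns (winner, loser) preference pairs."""
    order = sorted(range(len(rollouts)), key=lambda i: scores[i], reverse=True)
    half = len(order) // 2
    top, bottom = order[:half], order[half:]
    pairs = []
    for w, l in zip(top, bottom):
        if scores[w] - scores[l] >= delta_threshold:   # drop uncertain pairs
            pairs.append((rollouts[w], rollouts[l]))
    return pairs

# Toy example: six rollouts scored on a single property (e.g., predicted solubility).
seqs = ["A", "B", "C", "D", "E", "F"]
sol = [0.9, 0.2, 0.55, 0.5, 0.8, 0.1]
print(build_preference_pairs(seqs, sol))
```
Active pair mining would replace this fixed top-vs-bottom pairing with pairs selected near the delta threshold, where the labels are most informative.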
- Can the weights across properties and the adaptive margin be tuned online using objective-improvement rates to more reliably approach a Pareto front across backbones and lengths?
- What is the effect of the number of rollouts and sampling temperature on the stability of training and final metrics in the semi-online loop, given that the paper uses a higher temperature for exploration but evaluates at a lower temperature for ProteinMPNN-family models? |
Moderately AI-edited |
|
Property-Driven Protein Inverse Folding with Multi-Objective Preference Alignment |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper addresses the challenge that protein inverse folding models must balance designability (recovering a backbone) with developability properties (e.g., solubility, thermostability). The authors propose ProtAlign, a multi-objective preference alignment framework that fine-tunes pretrained models using a semi-online Direct Preference Optimization (DPO) strategy. The method uses a flexible preference margin to mitigate conflicts between competing objectives and constructs preference pairs using in silico property predictors. Applying this to ProteinMPNN yields MoMPNN. Experiments on CATH 4.3 crystal structures , de novo backbones , and binder design scenarios show that MoMPNN enhances developability properties without compromising structural fidelity compared to baselines.
This method improves developability metrics using a preference alignment framework, which does not require additional specific, curated datasets of experimentally validated proteins.
The authors evaluate MoMPNN on a strong set of tasks beyond standard sequence recovery. This includes redesigning CATH 4.3 crystal structures, designing sequences for de novo generated backbones, and a practical de novo binder design scenario. This rigorous evaluation demonstrates the method's utility in realistic design workflows where other baselines show performance degradation.
It would be better to also report these metrics on the ground-truth sequences, since the metrics are computed by prediction models and are therefore only approximations.
The full names of the abbreviations used in the tables are missing from the captions.
The temperatures used at inference differ across baselines, resulting in a potentially unfair comparison. A fair comparison would use either greedy decoding (no temperature sampling) or the best point on each method's temperature–performance curve; at a minimum, results should be reported under one identical temperature.
Is there a typo in Eq. 4? The index k appears in the formula for m, yet m does not seem to depend on k.
An explanation of the relationship between L and L_MO is needed.
Why is the AAR of ProteinMPNN on the CATH 4.3 test set 0.39, which seems lower than most reproductions of this model (e.g., 0.44 on CATH 4.3 reported in ProteinInvBench)? If this number is an underestimate, would the reported results then indicate a significant compromise in AAR?
RL-based preference methods like ProteinDPO for inverse folding are discussed in the related work section. Why are they not compared as baselines? They should be the most closely related baselines.
Regarding the semi-online training strategy, is the preference dataset $\mathcal{D}_k$ at iteration t cumulative (containing all rollouts from iterations $1 \dots t$), or is it replaced entirely by the new rollouts?
The paper provides a compelling comparison against a "Weighted-score DPO" baseline in Appendix A.2, showing MoMPNN is more stable. Can the authors provide more intuition on why the flexible margin (Eq. 4 ) achieves better and more stable multi-objective optimization compared to directly optimizing on a weighted sum of preference scores?
The model is trained on protein monomers but evaluated on a de novo binder design task, which involves protein complexes. Did the authors observe any specific failure modes or performance issues at the binder-target interface, given that the model was not explicitly trained for complex-specific properties? |
Fully human-written |