|
Fairness-Aware EHR Analysis via Structured Missing Pattern Modeling and Adversarial Low-Rank Adaptation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes FEMALA, a two-stage framework for fair EHR analysis that tackles bias from structured missingness. It uses a dual-encoder to model both temporal data and missing patterns, then applies adversarial low-rank adaptation (LoRA) to fine-tune for fairness, achieving state-of-the-art accuracy and fairness.
* **Missingness as Signal:** Innovatively treats missingness patterns as an informative signal, not just noise to be imputed.
* **Stable Two-Stage Design:** The "learn first, correct later" approach using adversarial LoRA achieves a superior accuracy-fairness trade-off.
* **Extensive Empirical Evaluation:** Performs a fairly extensive evaluation across baselines and datasets.
* **SOTA Claim:** The method claims SOTA performance in a few places, yet the only recent baseline is one from 2024, which may not be the best comparison either (due to that study's focus on multimodal data).
* **Fixed Segmentation:** Relies on fixed-length segmentation, which may be less effective than event-based or adaptive methods.
* **Simple Missingness Encoding:** The method for encoding global missingness patterns (time/feature averages) is relatively simple.
* **Remaining Biases:** The model still struggles to fully mitigate deep-rooted biases related to sensitive attributes.
- Do the authors claim that their results are better than FLMD and FFVAE in Fig. 3? |
Moderately AI-edited |
|
Hierarchical Quantized Diffusion Based Tree Generation Method for Hierarchical Representation and Lineage Analysis |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
Single-cell analysis represents one of the major breakthroughs in recent bioinformatics, generating enthusiastic expectations for elucidating cellular differentiation mechanisms and their applications in regenerative medicine and artificial organs. This paper proposes a novel deep learning-based approach for the data-driven differentiation structure (i.e., hierarchical structure) inference task for such single-cell analysis data, as well as for more general hierarchical structure inference tasks. Traditionally, analytical methods such as visualization techniques, clustering, and factor models have been the standard for differentiation structure tasks. However, deep learning-based methods, particularly those based on Variational Autoencoders (VAEs), have recently gained prominence due to their effectiveness. Even the most advanced methods face limitations, and this paper makes significant progress, especially regarding the module dependency of branching structures inherent in existing approaches. The authors quantitatively demonstrate that their proposed method delivers substantial practical progress by conducting a large-scale, comprehensive investigation on both the subject single-cell data and widely used benchmark datasets in machine learning.
- This paper achieves very solid progress in line with the latest trends in the structural inference task. Specifically, it presents a novel solution using hierarchical codebooks and a stochastic diffusion model to address the issue of unstable learning caused by module dependencies in the branching structure of hierarchical architectures—a problem encountered in recent state-of-the-art VAE-based methods.
- The experiments in this paper are exceptionally robust and comprehensive, providing extremely strong evidence for practical effectiveness. Particularly for single-cell analysis data, the supplementary materials detail the preprocessing procedures, successfully appealing to a broader audience beyond bioinformatics specialists. Furthermore, for readers more interested in standard machine learning tasks, the paper also provides baselines on popular datasets.
- I have some concerns regarding the novelty or effectiveness of the hierarchical codebook (HCB), one of the key components of the proposed methodology. Specifically, I find it difficult to follow at a concrete level how the HCB effectively resolves the issue of module dependency on branching in hierarchical structures, which the authors highlight as a focus in prior research. I will elaborate further in the questions section.
**Effectiveness of Hierarchical Codebook**
I understand the weakness of existing VAE research requiring separate configurations for the representation of each branch in the hierarchical structure (binary tree). Intuitively, as one goes deeper into the hierarchy, observational data clues become sparse, making learning extremely difficult. The authors' Hierarchical Codebook (HCB) appears to be a new approach that addresses this weakness in existing research. I understand this overall framework is very promising, but I could not clearly discern from the text how HCB specifically overcomes the weaknesses of existing research. Section 3.2 appears to model parent-child relationships in a conventional manner (where the code vector of a child node approaches that of its parent node); a small sketch of what I mean by this conventional formulation is given after the references below. For example, this is commonly used in Section 3 of [Adams+, NeurIPS2010] and Section 3 of [Lakshminarayanan+, AISTATS2016] (apologies, my background may bias my references toward statistical modeling, but this seems to be a common strategy in optimization-based formulations as well). I have reread Section 1's introduction and Section 3's specific model design multiple times. While I broadly agree with the authors' motivation for introducing HCB (overcoming the weaknesses of VAE-type models), I cannot accurately discern why HCB is a particularly effective idea for achieving that goal. Based on these considerations, my questions are as follows:
- Is it possible to provide a qualitative explanation of how the authors' HCB offers “unique, standout advantages” over other hierarchical modeling approaches for addressing the problem of data sparsity toward the leaves of the hierarchical structure?
Or is it that while the HCB idea itself is one of the standard approaches in hierarchical representation, it has been experimentally confirmed (I commend the authors' extremely large-scale and comprehensive experiments across diverse data) to demonstrate outstanding performance?
[Adams+, NeurIPS2010] Adams, R. P., Ghahramani, Z., & Jordan, M. I. (2010). Tree-structured stick breaking for hierarchical data. Advances in Neural Information Processing Systems, 23.
[Lakshminarayanan+, AISTATS2016] Lakshminarayanan, B., Roy, D. M., & Teh, Y. W. (2016). Mondrian forests for large-scale regression when uncertainty matters. In Artificial Intelligence and Statistics, pp. 1478-1487.
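For concreteness, the conventional parent-child coupling I refer to above can be written, in my own illustrative notation (not taken from the paper under review), either as a hierarchical shrinkage prior or as a regularizer pulling each child code toward its parent:

```latex
% Illustrative notation only (my own), not taken from the paper under review.
\begin{align}
  % Hierarchical prior: a child's code vector is centered on its parent's code vector,
  \mathbf{c}_{v} \mid \mathbf{c}_{\mathrm{pa}(v)} &\sim \mathcal{N}\!\bigl(\mathbf{c}_{\mathrm{pa}(v)},\, \sigma^{2} I\bigr), \\
  % or, in an optimization setting, a regularizer pulling children toward their parents:
  \mathcal{L}_{\mathrm{hier}} &= \lambda \sum_{v \neq \mathrm{root}} \bigl\| \mathbf{c}_{v} - \mathbf{c}_{\mathrm{pa}(v)} \bigr\|_{2}^{2}.
\end{align}
```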
**Relevance to the supertree construction problem**
To the best of my knowledge, problems explicitly addressing the sparsity inherent in hierarchical structures—namely, the requirement for existing VAE-based approaches to have separate modules for each branch—appear to have long been discussed as a significant research topic in the field of bioinformatics, specifically as the supertree construction problem. The authors do not appear to discuss this topic either in the main text or supplementary materials (apologies if I missed it), but isn't this a relevant issue? In the context of single-cell analysis, data scarcity is a fundamental challenge, not just at the terminal nodes of hierarchical structures. For instance, acquiring single-cell analysis data for specific human organs is costly, limiting available datasets. This motivates the use of single-cell analysis data from other organisms (such as mice, chosen for similar biological characteristics). However, naturally, the surface-level observations (broad trends in gene expression levels) of these datasets differ significantly. Consequently, the approach of attempting to capture a consensus tree (supertree) between the hierarchical structure of humans and that of another organism emerges. My impression is that the unified code book the authors aim to capture with HCB shares a fundamental similarity in motivation and core principles with this supertree construction problem. Perhaps if the authors were to discuss this point, the paper might gain greater persuasive power for readers in the traditional bioinformatics field. (This point does not directly affect my impression or evaluation of the paper, so the authors are free to consider it without concern. If it seems unrelated, feel free to disregard it.) |
Fully human-written |
|
Hierarchical Quantized Diffusion Based Tree Generation Method for Hierarchical Representation and Lineage Analysis |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes HDTree, a hierarchical diffusion-based framework designed for hierarchical representation learning and lineage analysis. The method integrates a hierarchical vector-quantized codebook with a quantized diffusion process, enabling the model to capture multi-level dependencies among data points and generate biologically meaningful hierarchies. Unlike previous VAE-based models (e.g., TreeVAE), which require branch-specific modules, HDTree employs a unified hierarchical latent space that enhances both stability and generative capacity. Comprehensive experiments on general-purpose datasets and single-cell datasets demonstrate the superiority of HDTree in clustering accuracy, tree purity, and lineage reconstruction. The results show consistent improvements in both representation quality and biological interpretability, highlighting the model’s potential as a powerful tool for hierarchical modeling and generative analysis in biological data. Overall, the work is conceptually solid, well-motivated, and empirically convincing.
S1. The paper addresses a meaningful and increasingly important topic. It is particularly relevant to single-cell data modeling, which remains a major challenge in computational biology and generative modeling.
S2. Extensive experiments across both general and domain-specific datasets show clear performance gains over existing baselines, validating both the stability and scalability of the approach.
S3. The paper evaluates multiple aspects—tree structure purity, clustering accuracy, reconstruction loss, lineage consistency, and computational efficiency. This provides a convincing and multidimensional assessment of HDTree’s strengths.
S4. The proposed combination of hierarchical vector quantization with diffusion processes may eliminate the need for branch-specific networks while maintaining high flexibility and generative accuracy.
**Concerns**
C1. The Method section is written in a very direct “component-by-component” manner, explaining what each module does but not why each design choice is necessary or how it contributes to solving the stated problems. For instance, when the authors argue that previous methods “require specialized network modules for each tree branch,” it would strengthen the explanation if they discussed alternative perspectives (e.g., whether a shared backbone with dynamically extended subnetworks, similar to continual learning, could achieve similar adaptability). Adding this type of reasoning would help readers understand the technical logic and design motivation more deeply.
C2. The manuscript would benefit from polishing to improve readability and layout. In several places, multiple bolded labels \textbf{XXX.} appear within a single paragraph, which disrupts the flow. These should ideally start as separate paragraphs or be converted into sub-headings. Moreover, some overly technical derivations or implementation details could be moved to the Appendix to enhance readability in the main text.
C3. Figure 1 currently does not clearly differentiate the three comparative frameworks or visually convey why the proposed HDTree offers a tangible improvement. The figure could better highlight the distinctions and illustrate the hierarchical structure more intuitively.
Please mainly respond to C1. |
Heavily AI-edited |
|
Hierarchical Quantized Diffusion Based Tree Generation Method for Hierarchical Representation and Lineage Analysis |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes HDTree, a new generative model for hierarchical data, specifically aimed at single-cell lineage analysis. The core problem is that existing methods are unstable as they require branch-specific network modules. HDTree addresses this by combining three components: (1) a standard encoder, (2) a unified Hierarchical Tree Codebook that quantizes latent representations into discrete paths, and (3) a quantized diffusion decoder that generates data conditioned on these paths. The model is optimized with a composite loss including a soft contrastive learning loss, a hierarchical quantization loss, and the diffusion loss. The authors demonstrate that this approach can be used for both lineage trajectory analysis (by finding shortest paths in the codebook graph) and conditional data generation. Experiments on general and single-cell datasets show it outperforms SOTA methods in clustering, tree structure fidelity, and lineage alignment.
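As a rough mental model of the unified hierarchical codebook described in this summary (a sketch under my own assumptions, not the authors' code), each latent can be quantized level by level, with the candidates at each level restricted to the children of the previously chosen code, so that the resulting discrete path is what conditions the diffusion decoder:

```python
import numpy as np

def quantize_path(z, codebooks, children):
    """Illustrative sketch: map a latent z to a root-to-leaf path of code indices.

    codebooks[l] is an (n_l, d) array of code vectors at level l;
    children[l][i] lists the level-(l+1) codes reachable from code i at level l.
    """
    path, codes = [], []
    candidates = range(len(codebooks[0]))          # at the root, all level-0 codes are allowed
    for level, book in enumerate(codebooks):
        cand = np.array(list(candidates))
        idx = cand[np.argmin(np.linalg.norm(book[cand] - z, axis=1))]
        path.append(int(idx))
        codes.append(book[idx])
        if level + 1 < len(codebooks):
            candidates = children[level][idx]      # restrict the next level to this node's children
    return path, np.stack(codes)                   # the path is what would condition the decoder

# Tiny example: 2 levels, binary branching, 4-dimensional latents.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(2, 4)), rng.normal(size=(4, 4))]
children = [{0: [0, 1], 1: [2, 3]}]
path, codes = quantize_path(rng.normal(size=4), codebooks, children)
print(path)  # e.g., [1, 3]
```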
The core architectural idea of using a unified hierarchical codebook to condition a diffusion model is a strong and stable alternative to prior VAE-based methods that required branch-specific modules.
The model demonstrates consistently strong performance across a wide range of tasks and datasets, outperforming SOTA methods like TreeVAE in clustering (Table 1, Table 2) and, impressively, even beating a semi-supervised method on lineage ground truth alignment (Table 3).
The ablation study (Table 4) is effective, clearly demonstrating that the novel components (HTC, SCL, HQL) are all critical to the model's success. The large performance drop without the HTC (A2) is particularly convincing.
The method is computationally efficient in training time compared to competitors, especially TreeVAE and methods requiring expensive offline clustering (tSNE/UMAP+Agg) on large data (Table 5)
The evaluation is performed on a downsampled test set of 10,000 points for any dataset larger than this. This is a major weakness. The paper claims performance on large datasets (e.g., Weinreb, 130k cells; ECL, 838k cells) but never evaluates on them (at full scale). The justification (clustering metrics are slow) is an evaluation choice, not a model limitation, and it undermines the claims of scalability.
The trajectory inference method (Sec 3.4) is not a pure application of the learned tree. It requires constructing a new graph by adding k-nearest-neighbor edges within each level of the tree. This introduces a new hyperparameter k (which was tested in Appendix C) and makes the lineage analysis less interpretable, as it is not solely dependent on the learned hierarchy.
The model's complexity seems high. It requires three separate loss functions, each with its own hyperparameters. This may make the model difficult to tune and reproduce.
The paper admits that the diffusion decoder is "computationally expensive during sampling", which is a well-known diffusion model issue but still a practical limitation for the data generation task.
1. Regarding the test set downsampling: Since the model is trained on up to 100k-300k points (Table L.5), why not report evaluation metrics (like reconstruction loss, -RL) that don't require expensive clustering, but do run on the full, large test sets? This would provide a true measure of scalability.
2. In the trajectory analysis (Sec 3.4), what is the justification for the penalty term P^(L-1) in Eq. 8? This seems to manually enforce hierarchical preference, which one might expect the learned tree structure to handle on its own. How sensitive is the lineage analysis (Table 3) to this value P?
3. The Hierarchical Quantization Loss (Eq. 5) is confusing. What is the set z in the definition? Is this the set of all z_i in the batch? Please clarify the "consistency term" in plain language.
4. How was the number of hierarchy levels L=10 chosen? This seems like a critical parameter, but there is no sensitivity analysis provided for it. How does performance vary with a shallower or deeper tree? |
Fully AI-generated |
|
RaanA: A Fast, Flexible, and Data-Efficient Post-Training Quantization Algorithm |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper tackles post-training quantization by combining RaBitQ and AllocateBits. In doing so, the authors formulate a vector quantization approach that utilizes per-layer mixed precision.
strengths:
- the empirical results seem promising.
weaknesses:
- the notation of Assumption 4.1 (which is not really an assumption) is inappropriate. Several symbols are used without being defined, and the statement is based on some K > 0 that does not appear in the theorem/assumption.
- Corollary 4.2 is unprofessionally stated. High-probability events should not be described as holding with probability "at least 0.99", which is a subjective amount and can be either good or bad depending on the context. A proper way to introduce high-probability events is by lower bounding the probability by 1-\delta, where \delta is a scalar that actually affects the bound on the error (a generic template is sketched after this list).
- the gcd is not needed for (5) to hold. Is this decorative math?
- overall the paper is poorly written and its clarity should be significantly improved.
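To illustrate the statement style requested in the remark on Corollary 4.2 above, a generic template (not the paper's actual result; the constant C and the bit count B are placeholders) would be:

```latex
% Generic template for a high-probability guarantee (not the paper's actual bound):
% for any failure probability $\delta \in (0,1)$, with probability at least $1-\delta$,
\begin{equation}
  \bigl\| W x - \widehat{W} x \bigr\|_2
  \;\le\;
  C \,\| W \|_F \,\| x \|_2 \,\sqrt{\frac{\log(1/\delta)}{B}},
\end{equation}
% so that the chosen $\delta$ explicitly appears in, and tightens or loosens, the error bound.
```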
questions:
- Can the mixed precision be done intra-layer, such as in FGMP (Hooper et al.)? |
Fully human-written |
|
RaanA: A Fast, Flexible, and Data-Efficient Post-Training Quantization Algorithm |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces RaanA, a novel post-training quantization (PTQ) framework designed to enhance the inference efficiency of large language models (LLMs).
RaanA addresses key limitations of existing PTQ methods, such as heavy reliance on calibration data and inflexible bit-width allocation by integrating two core components:
(1) RaBitQ-H: A highly efficient variant of the RaBitQ vector quantization algorithm, adapted for LLMs by replacing the computationally expensive random rotation with a Randomized Hadamard Transformation (RHT).
(2) AllocateBits: An optimal bit allocation algorithm that formulates bit-width assignment across layers as an integer programming problem, solved exactly via dynamic programming.
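For intuition about the summary above, here is a minimal sketch of how such a layer-wise bit allocation under a total bit budget can be solved exactly with dynamic programming. It is an illustrative reconstruction, not the authors' implementation, and the per-layer sensitivity costs `cost[i][b]` are assumed to be precomputed (e.g., estimated quantization errors per layer and bit-width).

```python
# Illustrative sketch only: exact bit allocation via dynamic programming.
# cost[i][b] = estimated quantization error of layer i at bit-width b (assumed given).
# Assumes total_budget >= num_layers * min(bit_choices).

def allocate_bits(cost, bit_choices, total_budget):
    """Minimize total cost subject to the sum of assigned bits being <= total_budget."""
    num_layers = len(cost)
    INF = float("inf")
    # dp[used] = minimal cost when exactly `used` bits have been assigned so far.
    dp = [0.0] + [INF] * total_budget
    choice = [[None] * (total_budget + 1) for _ in range(num_layers)]
    for i in range(num_layers):
        new_dp = [INF] * (total_budget + 1)
        for used in range(total_budget + 1):
            if dp[used] == INF:
                continue
            for b in bit_choices:
                if used + b > total_budget:
                    continue
                cand = dp[used] + cost[i][b]
                if cand < new_dp[used + b]:
                    new_dp[used + b] = cand
                    choice[i][used + b] = b   # remember the bit chosen for layer i
        dp = new_dp
    # Pick the best reachable budget and backtrack the per-layer bit-widths.
    best_used = min(range(total_budget + 1), key=lambda u: dp[u])
    bits, used = [], best_used
    for i in reversed(range(num_layers)):
        b = choice[i][used]
        bits.append(b)
        used -= b
    return list(reversed(bits)), dp[best_used]

# Example: 4 layers, candidate bit-widths {2, 4, 8}, average budget of 4 bits/layer.
costs = [{2: 9.0, 4: 2.0, 8: 0.5},
         {2: 1.0, 4: 0.6, 8: 0.2},
         {2: 7.5, 4: 1.8, 8: 0.4},
         {2: 0.8, 4: 0.5, 8: 0.1}]
bits, err = allocate_bits(costs, [2, 4, 8], total_budget=16)
print(bits, err)  # sensitive layers receive more bits under the same total budget
```

The run time is O(#layers × budget × |bit choices|), which is negligible next to the quantization itself.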
This paper demonstrates significant originality by creatively bridging advanced vector quantization techniques from database systems (RaBitQ) with the practical demands of LLM PTQ.
The development of RaBitQ-H is a key innovation, replacing a computationally prohibitive random rotation with a Randomized Hadamard Transformation to make the method viable for high-dimensional model weights. The AllocateBits algorithm also presents a novel, principled approach to mixed-precision quantization, moving beyond simple heuristics to an optimal, solvable integer program.
The quality of the work is high, supported by extensive and rigorous experiments on major model families (LLaMA, Qwen) across multiple bit-widths. The results are highly competitive with state-of-the-art methods, particularly in the challenging ultra-low-bit regime (~2 bits).
The clarity is commendable; the paper is well-structured, and the algorithmic contributions are precisely defined, with a clear separation of the overall framework, bit allocation, and core quantization components.
Its significance is substantial. RaanA directly tackles critical deployment barriers for LLMs—high computational cost and data dependency—by offering a fast, data-efficient, and flexible solution. The demonstration of effective zero-shot calibration is especially impactful, potentially removing the need for carefully curated calibration datasets altogether and enhancing the practicality and accessibility of LLM quantization.
The paper's primary weakness is the suboptimal implementation of its core algorithm.
(1) RaBitQ runs as a CPU-bound bottleneck, which contradicts the goal of a "fast" framework and limits practical utility. A GPU implementation is crucial for true speed competitiveness.
(2) The evaluation could be more comprehensive. While perplexity is standard, the absence of inference latency and memory footprint measurements on actual hardware is a significant omission for a method claiming efficiency gains. Reporting wall-clock quantization time for baselines would also contextualize the claimed speed advantage.
(3) The layer-wise bit allocation, while an improvement, is noted as a sub-optimal constraint itself. The authors identify finer-grained (e.g., column-wise) allocation as future work, but not exploring a simple, coarser alternative (e.g., grouping layers by type/sensitivity) leaves a clear, actionable avenue for immediate improvement unexplored in the current evaluation.
There are two questions:
(1) What are the fundamental algorithmic operations in RaBitQ that make it challenging to port to GPU, and are there known parallelization strategies (e.g., using CUDA kernels for the core quantization steps) that could be applied?
(2) Could you provide measurements of the end-to-end inference latency (e.g., time to generate the first token and time per subsequent token) for RaanA-quantized models compared to the fp16 baseline and a key baseline like GPTQ or Quip#? This should be done on a standard hardware setup (e.g., a single A100). |
Fully AI-generated |
|
RaanA: A Fast, Flexible, and Data-Efficient Post-Training Quantization Algorithm |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes RaanA, a novel post-training quantization (PTQ) framework for large language models. The framework combines two main components: (1) RaBitQ-H, a variant of the RaBitQ vector quantization algorithm adapted for LLM quantization by replacing expensive random rotations with efficient Randomized Hadamard Transformations, and (2) AllocateBits, an algorithm that optimally allocates bit-widths across layers based on estimated quantization sensitivity using dynamic programming. The authors evaluate RaanA on LLaMA and Qwen models using perplexity on WikiText2 and C4 datasets. Results show RaanA achieves comparable or better perplexity, particularly in the extreme 2-bit regime, while being significantly faster and requiring far less calibration data.
- This paper proposes RaanA, which can quickly determine the bit allocation for each layer using only a small amount of calibration data. Compared with heuristic methods, it also offers better interpretability.
- RaanA requires very few quantization resources. For example, it takes only 301.74 seconds to complete the entire PTQ process for LLaMA2-7B.
- This paper lacks novelty, as the proposed Randomized Hadamard Transformation (RHT) has already been widely used in quantization [1] and is considered a common trick (a minimal sketch of the transform is given after the references below).
- This paper only reports perplexity experiments on Wikitext2 and a C4 subset, lacking broader evaluations such as MMLU and math. As a result, it fails to demonstrate that the proposed method does not suffer from overfitting when using a small calibration dataset.
- As a weight-only mixed-precision quantization method, the experimental results presented in this paper do not outperform previous approaches. For instance, in Table 1, RaanA fails to surpass OmniQuant at average bit levels of 2.1, 3.1, and 4.1 (note that OmniQuant uses a group size of 128, with average bits of 2.125, 3.125, and 4.125). Moreover, OmniQuant was published two years ago, and this paper does not include comparisons with more recent methods[2].
[1] Ashkboos, Saleh, et al. "Quarot: Outlier-free 4-bit inference in rotated llms." Advances in Neural Information Processing Systems 37 (2024): 100213-100240.
[2] Li, Yuhang, et al. "GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration."
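For context on the RHT point in the first weakness above, the following is a minimal sketch of a randomized Hadamard transform (random sign flips followed by a fast Walsh-Hadamard transform); this is the generic construction, not RaanA's specific implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fwht(x):
    """Fast Walsh-Hadamard transform (length must be a power of two)."""
    y, n, h = x.copy(), x.shape[0], 1
    while h < n:
        for i in range(0, n, h * 2):
            a = y[i:i + h].copy()
            b = y[i + h:i + 2 * h].copy()
            y[i:i + h] = a + b
            y[i + h:i + 2 * h] = a - b
        h *= 2
    return y

def randomized_hadamard_transform(x, signs):
    """y = H D x / sqrt(d): random sign flips followed by a Hadamard rotation."""
    return fwht(signs * x) / np.sqrt(x.shape[0])

# The transform is orthogonal, so it preserves norms while spreading outliers
# across coordinates -- the property rotation-based quantization methods rely on.
d = 8
signs = rng.choice([-1.0, 1.0], size=d)
x = np.zeros(d); x[3] = 5.0                      # a vector with one large outlier
y = randomized_hadamard_transform(x, signs)
print(np.linalg.norm(x), np.linalg.norm(y))      # norms match; the energy is now spread out
```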
- Have you compared the results of RaanA and other methods, such as OmniQuant, using the same calibration dataset?
- For other issues, please refer to the Weaknesses section. |
Lightly AI-edited |
|
CroCoDiLight: Repurposing Cross-View Completion Encoders for Relighting |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes CroCoDiLight, which repurposes the pre-trained CroCo encoder for photometric tasks. The authors hypothesize that CroCo implicitly learns lighting information through cross-view completion training on image pairs with varying illumination. The method is demonstrated on tasks including lighting stabilization in timelapse, temporal upsampling, shadow removal, and intrinsic decomposition, trained on datasets two orders of magnitude smaller than CroCo's original training data.
- **S1.** Novel insight that cross-view completion models implicitly learn photometric understanding. The hypothesis that CroCo must estimate and manipulate lighting to complete masked patches across views with varying illumination is interesting and well-motivated.
- **S2.** Efficient learning paradigm requiring datasets two orders of magnitude smaller than original CroCo training. This demonstrates that photometric knowledge is already embedded in the pre-trained encoder and only requires extraction rather than learning from scratch.
- **S3.** Demonstrates feasibility of repurposing cross-view completion foundation models for photometric tasks, opening a new direction for leveraging geometric pre-training for appearance-related downstream applications.
- **W1.** My main concern is the paper's positioning and the scope of investigation. The paper is framed as an application showcase (e.g., lighting stabilization in timelapse, temporal upsampling, shadow removal, intrinsic decomposition, etc.), but its core contribution is the insight into repurposing foundation models. It would be much stronger if repositioned as a systematic investigation (similar to Probe3D) into which and how pre-trained vision foundation models capture photometric properties, and why. The current study is confined to CroCo, missing a crucial comparative analysis against other foundation models. An investigation should include:
- Other two-view encoders (e.g., DUST3R, MAST3R) and matchers (e.g., RoMa, GIM).
- Single-view models known for strong correspondence (e.g., DINOv2, DINOv3).
- Multi-view models where two-view is a special case (e.g., VGGT, Pi-3).
- Such a comparison would provide more generalizable insights into how different pre-training objectives (cross-view, contrastive, etc.) contribute to learning photometric understanding.
- **W2.** Mixed results on quantitative evaluations and missing some evaluations.
- We only have quantitative results for shadow removal (Table 1) and intrinsic decomposition (Table 2), while the other applications (lighting stabilization in timelapse, temporal upsampling) lack any quantitative benchmarks.
- The intrinsic decomposition results are state-of-the-art (Table 2), while shadow removal is not (Table 1). This is acceptable if the paper is repositioned as an investigation (per W1), where the goal is demonstrating feasibility rather than beating every SOTA. However, under the current paper's narrative, it is difficult to justify the advantage of CroCoDiLight over other specialized methods.
- That is, if we shift the paper's focus from application showcase to systematic investigation, we don't need to provide quantitative results for all applications, and we don't necessarily need to beat every SOTA.
Overall, the paper's insight (cross-view completion models implicitly learn photometric understanding) is valuable, but the paper's positioning and scope are not strong enough to provide a comprehensive investigation of this insight. I welcome the authors' response to address these concerns.
- **Q1.** (Related to W1, authors may answer together) Have you experimented with other two-view foundation models like DUST3R or MAST3R (which build on CroCo)? What about single-view and multi-view models?
- **Q2.** (Related to W2, authors may answer together) Could you provide quantitative metrics for lighting stabilization and temporal upsampling tasks? For example, comparing temporal coherence metrics or perceptual quality against baseline interpolation methods? Or is that something beyond the scope of this paper? |
Lightly AI-edited |
|
CroCoDiLight: Repurposing Cross-View Completion Encoders for Relighting |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work presents CroCoDiLight, which leverages the supposedly inherent lighting-disentanglement capability within CroCo to modify and repurpose CroCo for delighting / relighting tasks. To achieve this, the authors introduce two networks that explicitly separate CroCo's patch embeddings into a lighting vector and lighting-invariant latents, then recombine them, demonstrating that photometric understanding is already embedded in CroCo's representations and can be efficiently extracted for explicit control and various relighting tasks such as interpolation between lighting conditions, shadow removal, and albedo estimation.
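To make the disentangle-and-recombine idea concrete, the following is a minimal PyTorch-style sketch; the module names, dimensions, and pooling choice are my own assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DelightRelightSketch(nn.Module):
    """Illustrative only: split patch embeddings into (lighting, intrinsics), then recombine."""
    def __init__(self, dim=768, light_dim=1024):
        super().__init__()
        self.to_light = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, light_dim))
        self.to_intrinsic = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))
        self.relight = nn.Linear(dim + light_dim, dim)

    def delight(self, patch_tokens):                       # (B, N, dim) CroCo-style patch embeddings
        light = self.to_light(patch_tokens).mean(dim=1)    # (B, light_dim) global lighting latent
        intrinsic = self.to_intrinsic(patch_tokens)        # (B, N, dim) lighting-invariant latents
        return light, intrinsic

    def forward(self, patch_tokens_a, patch_tokens_b):
        _, intr_a = self.delight(patch_tokens_a)
        light_b, _ = self.delight(patch_tokens_b)
        # Relight image A's intrinsics with image B's lighting latent.
        cond = light_b.unsqueeze(1).expand(-1, intr_a.shape[1], -1)
        return self.relight(torch.cat([intr_a, cond], dim=-1))  # tokens for a decoder to render

tokens_a = torch.randn(2, 196, 768)
tokens_b = torch.randn(2, 196, 768)
relit = DelightRelightSketch()(tokens_a, tokens_b)
print(relit.shape)  # torch.Size([2, 196, 768])
```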
- The paper is well-written and easy to follow. Good writing.
- The paper starts with a strong observation & hypothesis, recognizing the (possible) inherent capability within the original CroCo paper that its encoder implicitly encodes illumination information, which enables delighting & relighting during its novel view reconstruction task, and extending it to the hypothesis that this capability can be explicitly harnessed to achieve photometric tasks such as relighting/shadow removal/delighting. I believe the work is well-motivated and tackles an interesting question about the nature of CroCo and its representation.
- The authors offer a simple and intuitive solution to the problem (though this might point to the lack of novelty, as I would mention in the weakness section) by including a latent vector that disentangles lighting from geometry during the training phase. The method is simple and straightforward, effectively achieving its goal of lighting disentanglement as shown in the results.
- The original CroCo paper was a representation learning paper, focused on pre-training the model to be generally more suitable for various 3D / NVS downstream tasks from a simple two-view reconstruction loss. However, it seems that this work is more focused on training a model towards each specific downstream task (relighting / shadow removal / intrinsic image decomposition), which makes this work more closely aligned with existing relighting methods, of which there are already many. Yet, in this view, the core method of this paper (adding a separate latent vector for style and teaching the model to change the 'style' (lighting) of the image) very closely resembles previous GAN methods that achieve similar goals in generative scene modeling, and it seems to lack novelty. Is there a more general implication of this method that may be relevant to representation learning / other downstream tasks, as was the case for the original CroCo?
- The method requires datasets with identical geometry under different lighting, significantly limiting available training data. While synthetic datasets like HyperSim could be used, they introduce domain gap issues. How the authors address this fundamental limitation remains unclear.
- Lighting manipulation requires "walking" through latent space, making it difficult to achieve specific desired lighting conditions. The paper lacks a demonstration of how users can specify target lighting or achieve reproducible, controllable results without another scene that has the desired lighting. Can this point be further elaborated?
- What advantages does this method have over diffusion-based relighting methods, especially IC-Light (ICLR 2025), whose lighting can be controlled with a text prompt and which can be applied to various domains beyond scene imagery (i.e., including human faces, etc.)? Please elaborate.
Please see Weaknesses section. |
Fully human-written |
|
CroCoDiLight: Repurposing Cross-View Completion Encoders for Relighting |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper explores whether CroCo encoders, originally trained for cross-view completion with geometric objectives, also implicitly learn photometric representations due to illumination variations in training pairs. The authors propose CroCoDiLight to make this knowledge explicit through a delighting transformer that disentangles CroCo patch embeddings into a single lighting latent and intrinsic patch latents, along with a relighting transformer R that recombines them and a single-view decoder D for high-quality RGB reconstruction. Training uses only 57k pixel-aligned image pairs with different illumination, two orders of magnitude less than CroCo's training data, as claimed in the paper. The method demonstrates applications in lighting interpolation, timelapse stabilization, shadow removal, and albedo estimation. Results show state-of-the-art on IIW among methods not trained on IIW, though the shadow removal and reconstruction metrics are good but not fully convincing.
The hypothesis that CroCo learns implicit photometric understanding is intuitive and the paper validates it convincingly. The delight-relight framing is elegant.
Needing only 57k pairs versus CroCo's 5.3 million is a strong practical advantage, especially given that aligned multi-illumination data is scarce. The paper shows strong albedo results- achieving 14.3% WHDR on IIW without training on IIW is impressive and suggests the intrinsic latents capture meaningful scene properties.
The paper covers multiple downstream tasks and provides extensive qualitative results. The failure case analysis in Appendix F is honest and valuable. The timelapse stabilization and lighting interpolation demos are compelling and showcase practical utility, though there are some limitations, such as shadow motion not being entirely smooth and the method struggling with sharp shadow boundaries that move rapidly across scenes.
Sharp shadow handling (shading effects, cast shadows, etc.) is inadequately addressed: this is my biggest concern. The method uses a single global lighting latent, which fundamentally cannot capture the geometric information needed for sharp shadow boundaries. Sharp shadows arise from point lights and hard occluders; they encode precise light direction, occluder position, and surface geometry. A single image-space vector cannot represent this information, especially when shadow boundaries need to move correctly across multiple frames. The evidence is scattered throughout:
Fig. 17 shows direct shadows being replaced with ambient occlusion
Section 5.1 notes shadow motion during interpolation is "not entirely smooth" and Section F.1 admits tiles fully in shadow fail.
Most successful examples (Figs. 3-4, 8-10) show soft shadows, diffuse lighting, or outdoor scenes with gradual illumination changes
The timelapse examples work well for slow sun movement creating soft shadow transitions, but would likely fail for a person walking past a lamp creating sharp moving shadows
Tiling artifacts are unresolved: Section 3.5 and Fig. 16 show color inconsistencies from the sliding-window approach. The paper mentions potential fixes (Poisson blending, global reference tile) but does not implement them. Why present solutions but not evaluate them? The lighting latent being "optimally used" at 448×448 is a fundamental design limitation, which significantly undermines the high-resolution claims. Shadow removal metrics don't really match the qualitative results.
Fig. 17 shows cases where your method is "working better" by removing additional shadows, but this also suggests the model isn't learning what the benchmark defines as shadow removal in my opinion.
Limited architectural justification: why a single lighting latent, and why D=1024?
The paper does not provide ablations on:
Multiple lighting latents per image/tile (which would help with local lighting)
Lighting latent dimensionality (is 1024 dimensions necessary? wasteful?)
Spatial lighting maps vs. single vector
The Table 3 ablation uses a much simpler baseline (just linear embedding + DPT head), making it unclear whether gains come from CroCo features or better architecture. A fairer comparison would use the same I/R architecture without CroCo pretraining.
Image-space lighting is a fundamental limitation: Section 5 and Appendix C mention the lighting latent works in "image space" not "world space" but don't explore the implications. This means:
The method can't handle viewpoint changes
It can't reason about 3D light positions or directions
It's brittle to even small camera motion
Shadows will appear in wrong positions if the camera moves slightly
Can you quantify performance degradation as shadow sharpness increases? Even a simple analysis binning test images by edge gradient magnitude or manually annotating hard vs. soft shadows would help establish the method's scope.
Why not implement the color correction solutions you mention (Poisson blending, global reference) and show results? This seems critical for addressing both the metrics gap and the tiling artifacts.
A dimension ablation (dimensionality of S0) would help understand what information is being compressed.
For the world-space vs. image-space issue- did you try encoding light direction or position explicitly? Even rough geometric cues might help.
How does the method handle colored lighting versus colored surfaces? The disentanglement seems like it would be ambiguous.
Lack of extensive ablation studies:
Multiple lighting latents per image/tile (which would help with local lighting)
Lighting latent dimensionality (is 1024 dimensions necessary? wasteful?)
Spatial lighting maps vs. single vector |
Lightly AI-edited |
|
CroCoDiLight: Repurposing Cross-View Completion Encoders for Relighting |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors introduce a method to disentangle CroCo latent representations into two components: a global latent vector capturing illumination, and patch-wise latent vectors representing the intrinsic properties of the scene. The model is trained in a self-supervised manner using pixel-wise aligned image pairs taken under different lighting conditions, guided by per patch cross-lighting and intrinsic consistency losses. They demonstrate that the disentangled latent space can be effectively leveraged for novel tasks such as interpolating between lighting conditions, shadow removal, and albedo estimation.
I found the approach of using an encoder-decoder architecture inspired by the Croco architecture to disentangle illumination from intrinsic scene representation both interesting and original. The design of the self-supervised training framework, particularly the proposed losses, appears well thought out and conceptually sound.
The paper demonstrates, through a range of tasks—including lighting interpolation, shadow removal, albedo estimation, and intrinsic image decomposition—that the proposed disentanglement approach is effective. The latent representations prove useful for handling these diverse downstream tasks. While the model does not outperform state-of-the-art methods specifically tailored for each task, the results are nonetheless promising, and the visual examples are rather convincing.
It is probable that leveraging the pretrained CroCo v2 model is beneficial, because the model was trained on a large set of image pairs captured under varying lighting conditions. Additionally, photometric augmentations applied during training likely made the model more robust to lighting changes. Still, the experiments in the paper do not completely prove this. How would the model work if the MAE, DINOv2, or DINOv3 encoder were used and disentangled instead of the CroCo encoder? Would the model perform less well on the downstream tasks?
Also, the ablation in Table 6 raises a concern: the architectures of the two compared models are no longer identical, so while the claim is presumably true, it is not directly shown that the gain comes from the pre-trained model rather than from the architecture choice. It would also have been insightful to evaluate a model that retains the CroCo encoder architecture but is initialized from scratch. While the training dataset is indeed smaller, the learning task seems simpler than hidden-content reconstruction, making such an experiment worthwhile. Note also that the performance gain from CroCo pretraining is significant only for the intrinsic image decomposition task, and much smaller for shadow removal.
I like the illustration and the narrative in Figure 1 as it effectively conveys how the CroCo model implicitly learns to extract content information from the second image and appearance information from the first, guided by the training data. This raises an interesting question: could the two models be integrated to jointly learn both the reconstruction and the disentangled latent representations, assuming a suitable training set (e.g., image triplets)? Such a unified approach—where mask content reconstruction and disentangled latent representations are jointly learned—could potentially enhance consistency and improve performance not only on the downstream tasks explored in this paper (e.g., shadow removal, albedo estimation, lighting interpolation), but also on geometric tasks such as 3D reconstruction. |
Fully human-written |
|
Learning from What the Model Forgets: Prototype-Guided Patch-wise Replay for Medical Image Segmentation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes an end-to-end framework for medical image segmentation that mines moderately forgettable (hard-positive) samples to reduce false negatives and improve boundary accuracy. Specifically, it introduces CLIP-based text embeddings to guide prototype learning for semantically richer features, defines a multi-metric difficulty measure to score prototypes, and uses an online forgettable sample bank to dynamically store and replay difficult samples for curriculum-like retraining. Experiments on five public datasets show improvements over baselines. Ablations show each module’s contribution.
1. This paper focuses on a previously underexplored problem and introduces moderately forgettable sample mining guided by CLIP semantics.
2. This paper proposes a multi-metric prototype-based score that balances geometric and probabilistic cues.
3. Extensive experiments and ablations.
1. All experiments rely on 2D patch training, even for 3D datasets.
2. The online bank and prototype updates likely introduce overhead.
1. Motivation & Novelty
1.1 Clarity of “Moderately Forgettable Samples”
The concept of “moderately forgettable samples” is central to this paper, but its definition remains informal. The authors should provide a clearer, quantitative criterion to distinguish these samples from easy or noisy ones. Moreover, it remains unclear how the proposed method guarantees that the identified samples correspond to clinically meaningful hard positives rather than mislabeled or ambiguous regions.
1.2 Technical Novelty and Contribution.
The proposed components (text-guided fusion, prototype-based scoring, and a replay memory bank) individually build upon well-established ideas. The authors should better highlight what is fundamentally new in their formulation or analysis compared with prior works on hard-sample mining, prototype learning, or CLIP-based semantic guidance, especially to appeal to a broader ICLR audience beyond medical imaging.
2. Method
2.1 Section 3.2 (Prototype-Based Scoring)
(1) Line 235, “Our approach addresses these limitations by leveraging semantically-enhanced prototypes to provide both computational efficiency and semantic-aware patch-level scoring.”
This claim requires justification. How does semantic enhancement improve efficiency rather than add overhead? A brief complexity analysis or runtime comparison would clarify this point.
(2) Line 278, “The four terms are normalized by the number of pixels to bring them to a comparable scale.”
Please analyze how this normalization affects the relative weighting among metrics, particularly for organs of different sizes. Could this bias the difficulty estimation toward small or large structures?
2.2 Sect. 3.3 Forgettable Sample Bank
How does performance vary with different bank sizes, and what trade-offs exist between memory cost, sample diversity, and replay stability? A more systematic guideline or sensitivity curve would strengthen this part.
3. Experiments
3.1 The paper shows strong overall results but does not clearly isolate the effect of CLIP-based semantic fusion on prototype learning. Visualization or quantitative analysis would help demonstrate how text guidance improves the representation quality.
3.2 How sensitive is the framework to the choice of frozen CLIP backbone (ViT-B/32 vs ViT-L/14)?
3.3 The method introduces additional modules. What is the computational overhead (memory and runtime) compared with nnU-Net baselines? This will clarify the practicality of deploying the framework in real clinical workflows. |
Fully AI-generated |
|
Learning from What the Model Forgets: Prototype-Guided Patch-wise Replay for Medical Image Segmentation |
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a medical image segmentation framework that utilizes a hard-sample patch-wise replay method guided by prototypes, and incorporates CLIP text embeddings for encoder-decoder feature fusion in the U-Net architecture. While the motivation of addressing hard samples near the decision boundary is relevant, the paper suffers from limited methodological originality and lacks convincing justification for its core components.
1. Addressing the issue of hard-positive samples / hard samples that are near the decision boundary is an important direction in the field.
2. The authors conduct experiments across five datasets, covering different anatomical structures and modalities.
Lack of novelty in the core method designs: the idea of using CLIP text embeddings to facilitate medical image segmentation has been heavily explored in recent years (e.g., [1-3]), and there are no significant differences suggesting that this approach is innovative. Moreover, the TGF module can be regarded as a type of attention-gated mechanism.
[1] CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection (ICCV 2023)
[2] PCNet: Prior Category Network for CT Universal Segmentation Model (TMI 2024)
[3] Text-driven Multiplanar Visual Interaction for Semi-supervised Medical Image Segmentation (MICCAI 2025)
1. My biggest concerns are set out in the weaknesses section.
2. Although the authors state that "The core question is not architectural but strategic: how to define sample difficulty", adapt the same 2D patch-based framework, and test the proposed strategy only on the nnU-Net backbone, I think it is important to validate the generalization ability of the strategy across different architectures (such as those compared in Table 1), and to avoid possible over-optimization problems.
3. There may be a large domain gap between CLIP's natural-image text descriptions and medical images, and the performance gain may simply be the result of adding a powerful, high-dimensional, pre-trained feature vector that acts as a strong form of regularization or feature enrichment, rather than providing true *semantic guidance* derived from the text input. Without deeper analysis demonstrating that the text features align with medical concepts (e.g., via visualization or linear probing), the claim of semantic guidance is unconvincing (as stated around lines 198-200).
4. If evidence is provided for point (3), will medically tuned/oriented CLIP variants provide better performance under the framework?
Minor:
1. Provide baseline results for the ablation study (Table 3)
2. Missing highlight (bold results for DSC) for PROMISE2012 in Table 1 (line 351-352) |
Fully human-written |
|
Learning from What the Model Forgets: Prototype-Guided Patch-wise Replay for Medical Image Segmentation |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents a prototype-guided, CLIP-informed framework for medical image segmentation that identifies and replays moderately forgettable samples (patches that lie near decision boundaries and are prone to being forgotten during training). The approach combines three modules:
- Text-Guided Fusion (TGF), which incorporates CLIP text embeddings to guide visual prototype formation.
- Prototype-Based Scoring (PBS), which measures sample difficulty via intra-/inter-class distances and confidence-based metrics (a rough sketch is given after this summary).
- Forgettable Sample Bank (FSB), which maintains and replays informative samples to reinforce learning.
Experiments on five public datasets (KiTS2023, BraTS2020, ACDC, FLARE2021, PROMISE2012) show consistent gains in Dice and sensitivity, and lower Hausdorff distances than baselines like nnU-Net, Attention U-Net, and MambaUNet.
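For reference, a prototype-based difficulty score of the kind described above could look roughly like the following simplified sketch; the weighting and exact terms are my own assumptions, not the paper's PBS definition.

```python
import numpy as np

def difficulty_score(patch_feats, patch_probs, labels, prototypes, w=(1.0, 1.0, 1.0)):
    """Simplified sketch of a prototype-based patch difficulty score.

    patch_feats: (N, d) pixel/feature embeddings of one patch
    patch_probs: (N, C) softmax probabilities for the same pixels
    labels:      (N,) ground-truth class per pixel
    prototypes:  (C, d) class prototypes (e.g., text-guided)
    """
    n = len(labels)
    d_true = np.linalg.norm(patch_feats - prototypes[labels], axis=1)            # intra-class distance
    d_all = np.linalg.norm(patch_feats[:, None, :] - prototypes[None], axis=2)   # (N, C) to all prototypes
    d_all[np.arange(n), labels] = np.inf
    margin = d_all.min(axis=1) - d_true                  # inter-class margin (small/negative = hard)
    conf = patch_probs[np.arange(n), labels]             # confidence on the true class
    # Higher score = harder patch: far from its own prototype, small margin, low confidence.
    return w[0] * d_true.mean() - w[1] * margin.mean() + w[2] * (1.0 - conf.mean())

rng = np.random.default_rng(0)
feats, probs = rng.normal(size=(64, 16)), rng.dirichlet(np.ones(3), size=64)
labels, protos = rng.integers(0, 3, size=64), rng.normal(size=(3, 16))
print(difficulty_score(feats, probs, labels, protos))
```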
- The paper addresses an important task in medical image segmentation: how make models robust at low-contrast regions.
- The proposed method is innovative and effective: particularly, using CLIP text embeddings for *training-time* guidance and using PBS and FSB to keep the training focus on hard cases.
- Comprehensive evaluation across diverse datasets. The Result section is also informative. Strong performance compared to baselines.
- The writing is clear and easy to follow.
- The explanation of how CLIP contributes during training is unclear. The statement “CLIP semantic guidance provides discriminative information beyond visual appearance” is overly general and does not specify the mechanism by which CLIP influences feature learning. It would strengthen the paper to include feature-space visualizations (e.g., t-SNE or UMAP plots) comparing models trained with and without text-guided fusion, to demonstrate the effect of CLIP guidance on representation structure. In addition, the discussion of prompt design is limited. It would be useful to analyze how different prompt formulations affect training and whether the observed performance gain is robust to prompt variation.
- Although CLIP is frozen, its bias toward natural image semantics may limit robustness in rare or pathology-heavy datasets. Some comparison with medical-domain text encoders (e.g., MedCLIP, BioCLIP) would clarify sensitivity to text priors.
- Table 3 lumps PBS and FSB together in some configurations. Independent ablations would better clarify each module’s role. For this, the authors may consider comparing PBS with existing sample-scoring methods by substituting one of them for PBS in the framework.
- Regarding $Score^b$ and forgettable samples: although the intuition behind the formulation of $Score^b$ is clear, its relationship to the true forgetting frequency (as per Toneva et al., 2019) is not quantitatively demonstrated. A correlation plot or ablation on actual forgetting events would strengthen the claim.
- Were there any experiments done on prompt design? How sensitive is the method to different prompts?
- Would domain-specific encoders (e.g., MedCLIP, BioLinkBERT) provide similar or better benefits than CLIP?
- About $Score^b$, were there experiments exploring or tuning the weights assigned to its components?
- Have the authors compared PBS against standard hardness metrics such as loss magnitude, gradient norm, or prediction entropy to show unique benefit? |
Heavily AI-edited |
|
Learning from What the Model Forgets: Prototype-Guided Patch-wise Replay for Medical Image Segmentation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes a prototype-guided patch-wise replay strategy for medical image segmentation: (1) CLIP-based text–image fusion to incorporate semantic priors, (2) prototype-based scoring to identify moderately forgettable samples, and (3) an online replay buffer to revisit them during training. The method is simple to implement and is evaluated on five datasets (≤5 classes). Ablations and sensitivity analyses are clear; improvements are consistent but generally small.
1. Clear, readable paper with a straightforward method.
2. Well-designed ablations and sensitivity studies that isolate replay frequency, prototype size, and CLIP fusion.
3. Consistent (though small) gains across datasets without heavy architectural changes.
1. The absolute Dice improvements are marginal. The paper needs multi-seed runs with statistical tests to establish significance.
2. Evaluation scope is narrow (five small-class datasets), missing large multi-organ benchmarks (BTCV, AMOS, TotalSegmentator v2) to test scalability and class-wise robustness.
3. Baselines are incomplete: missing strong or hybrid models (TransUNet [1], MedNeXt [2], EMCAD [3], etc.).
4. No discussion and comparison with established prototype- or memory-replay methods for segmentation or continual learning under a shared protocol.
5. CLIP text encoder is frozen and general-domain; fine-tuning or using BioMedCLIP may improve alignment.
6. Only 2D UNet-style backbones are tested; impact on pretrained hybrids/transformers (TransUNet [1], EMCAD [3]) is unknown.
7. Impact on interactive foundation models (e.g., Med-SAM) is untested but promising.
[1] Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L. and Zhou, Y., 2021. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306.
[2] Roy, S., Koehler, G., Ulrich, C., Baumgartner, M., Petersen, J., Isensee, F., Jaeger, P.F. and Maier-Hein, K.H., 2023, October. Mednext: transformer-driven scaling of convnets for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 405-415). Cham: Springer Nature Switzerland.
[3] Rahman, M.M., Munir, M. and Marculescu, R., 2024. Emcad: Efficient multi-scale convolutional attention decoding for medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11769-11779).
1. Are the gains statistically significant over multiple seeds? Please report mean±std and appropriate significance tests per dataset.
2. How does the method perform on large multi-organ datasets, such as BTCV, AMOS, TotalSegmentator v2, including per-class results under strong imbalance?
3. What is the effect of integrating the replay mechanism into pretrained hybrid models (TransUNet [1], EMCAD [3])?
4. Does fine-tuning CLIP's text encoder or swapping to BioMedCLIP improve results?
5. How does this approach compare to established prototype and memory-replay baselines under an identical training pipeline?
6. Could replay be combined with Med-SAM to guide interactive segmentation? |
Fully AI-generated |
|
RefineBench: Evaluating Refinement Capability in Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces RefineBench, a new benchmark with 1,002 problems across 11 domains designed to evaluate the refinement capabilities of Large Language Models. It uses a novel checklist-based framework to test two modes:
1. Self-Refinement (no feedback, $f_t = \emptyset$)
2. Guided Refinement (with feedback $f_t$)
The primary contribution is the finding that even frontier LMs like Gemini 2.5 Pro struggle significantly with self-refinement, showing minimal gains (e.g., +1.8%) across iterations. However, in guided refinement, these same models effectively use targeted feedback to achieve near-perfect scores. This suggests LMs possess refinement abilities but lack the direction of what to fix.
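My reading of the two evaluation modes, as rough pseudocode (the `model.generate` and `judge.passes` interfaces are hypothetical stand-ins, not the benchmark's actual API):

```python
def refine(model, judge, problem, checklist, guided, max_turns=5):
    """Sketch of the two refinement modes as I understand them."""
    answer = model.generate(problem)
    for _ in range(max_turns):
        if guided:
            # Guided refinement: feedback f_t lists the checklist items the answer failed.
            failed = [item for item in checklist if not judge.passes(answer, item)]
            if not failed:
                break
            answer = model.generate(problem, previous_answer=answer, feedback=failed)
        else:
            # Self-refinement: f_t is empty; the model must find and fix its own errors.
            answer = model.generate(problem, previous_answer=answer, feedback=None)
    return answer
```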
1. This paper introduces RefineBench, a high-quality benchmark for evaluating refinement in complex, non-verifiable domains like law and humanities, moving beyond simple math problems.
2. The quality of this benchmark is very high, using real-world problems and a novel checklist-based evaluation framework that was rigorously validated by Ph.D. domain experts (96.1% appropriateness).
3. The authors identify a key bottleneck: LMs can improve with feedback but are fundamentally poor at self-refining because they cannot identify their own errors.
1. The "self-refinement" failure mode is not precisely identified. The paper concludes models "lack direction on what to fix". However, the evidence (e.g., Figure 6) suggests the model failed to identify that a problem existed at all. This is a failure of self-verification or error-detection. Models aren't necessarily unable to fix errors, but rather they incorrectly conclude their initial answer is already "complete and correct" and thus stop trying to refine.
2. The "Guided Refinement" setting likely overstates true refinement capability by effectively testing instruction-following. The feedback provided is not a realistic, high-level critique. Instead, it's a list of explicit, atomized commands derived directly from the failed checklist items (e.g., "The response should accurately..." in Appendix K), which is a much simpler task for LMs.
See weakness. |
Heavily AI-edited |
|
RefineBench: Evaluating Refinement Capability in Language Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces RefineBench, a benchmark aimed at probing whether LLMs can perform self-refinement, either independently or with external guidance. In this context, external guidance refers to evaluation checklist items that the model previously failed on. The authors evaluate 37 LLMs and highlight several insights: (1) LLMs generally struggle to refine their own responses, though thinking models show slightly better self-refinement; (2) LLMs can refine themselves when given failed checklist items, but still fail to address issues that are not explicitly pointed out in partially guided setups.
1. The benchmark is well-curated, covering a wide range of topics and domains. The manual quality control process also seems solid.
2. The evaluation spans a large number of models, showing a commendable level of comprehensiveness.
3. The findings are interesting - especially the comparison between thinking models and standard ones. As the paper notes, whether refinement itself is beneficial has been extensively studied and debated in prior work, but revisiting this question in the context of reasoning models is valuable.
1. I have concerns about using the same checklist for both external guidance and evaluation. Could this create potential leakage, where models optimize for missing checklist items instead of genuinely improving quality? It's unclear whether the provided guidance leads to real improvement or just better checklist completion.
2. Discussion of related work is strangely organized. The CriticBench line of work seems most relevant and should probably be introduced earlier in Section 2. In contrast, the part on multi-turn benchmarks feels less directly connected and can be toned down.
The analysis on whether test-time scaling helps refinement is intriguing. However, it's limited to Gemini-2.5-Pro in Figure 5. It would be great to expand the analysis and see whether similar trends hold across other reasoning models. |
Lightly AI-edited |
|
RefineBench: Evaluating Refinement Capability in Language Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents a multi-topic refinement benchmark with university-level problems and evaluation checklists. It describes the dataset curation and evaluation process, as well as extensive experiments with LLMs. This is a fantastic paper, with one significant letdown -- there is no human evaluation/verification of the evaluation pipeline -- which really weakens the results.
- Very interesting problem.
- The dataset is a significant contribution.
- Human evaluation of checklist generation.
- Summary statistics of the dataset and comparison to other benchmarks are included and well-presented.
- Extensive experiments.
I really only found one letdown in this paper, but it is a big one -- there is no human evaluation/verification of the evaluation pipeline. For the paper to be strong, there needs to be human verification of a sample of the end-to-end evaluation process. How do we know how good the LLMs are at comparing the answer to the checklist and providing good feedback? This is instrumental to understanding the results.
I also think that a good baseline would have been to compare the success with human feedback (desirable).
(line 190 -- step 3) -- If this was manually reviewed by the authors, why did you need LLMs to create the checklists from reference answers?
-- Why did you not have human verification of the eval pipeline, and can it be reasonably added? |
Fully human-written |
|
RefineBench: Evaluating Refinement Capability in Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper presents RefineBench, a benchmark of 1002 problems across 11 domains. Each problem includes a checklist to help evaluators assess LLM responses consistently and accurately. The dataset combines verifiable STEM tasks with non-verifiable, free-form tasks. Experiments show that guided refinement enables most LLMs to reach correct answers after multiple turns, whereas self-refinement does not achieve comparable gains.
- The paper introduces a new benchmark with a relatively large problem set and clear per-problem checklists, enabling more reliable evaluation of LLMs’ reasoning abilities.
- The analyses are clear and highlight that self-refinement remains challenging, particularly due to LLMs’ difficulty in identifying specific errors and determining how to adjust initial answers.
- The study uses GPT-4.1 as the sole evaluator, which may introduce bias. Incorporating a second independent LLM-as-judge or human auditing would strengthen the evaluation.
- For problems that originally include images, textual descriptions may omit important details. Expanding the benchmark to a multimodal setting would address this limitation.
1. The paper finds a key bottleneck is that LLMs struggle to pinpoint detailed issues or the direction of correction from the initial response. Do the authors have insights or proposals on how to improve this?
2. The paper notes that many existing benchmarks focus on math or symbolic reasoning rather than open-ended questions, yet RefineBench still contains a moderate share of math and math-like tasks. How are truly open-ended tasks represented, and could their proportion be increased? |
Lightly AI-edited |
|
Learning Generalized Hamiltonian Dynamics with Stability from Noisy Trajectory Data |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
In this work, the authors propose a probabilistic framework for learning generalized Hamiltonian dynamics from noisy trajectory data. The method is based on symplectic Gaussian processes with random Fourier features and trained with variational Bayesian inference, where the training loss is augmented by regularizations for enforcing soft physical constraints such as energy conservation, volume conservation and Lyapunov stability and can be solved numerically via gradient descent-ascent (GDA). Experiments demonstrate that the method outperforms prior approaches on a few basic Hamiltonian systems including conservative, dissipative and externally-forced ones.
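Concretely, the class of dynamics considered appears to be of the standard generalized form (my notation; the paper's exact parameterization of the dissipation and forcing terms may differ):

$$\dot{x} = \big(J - D\big)\,\nabla_x H(x) + F(t), \qquad J = -J^\top,\; D \succeq 0,$$

with the conservative case corresponding to $D = 0$, $F \equiv 0$, the dissipative case to $D \neq 0$, and the externally-forced case to $F \not\equiv 0$.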
The learning of dissipative and externally forced Hamiltonian systems within a probabilistic framework is novel to my knowledge.
1. The main novelty of the proposed method compared to prior works such as Tanaka et al. (2022) and Ross and Heinonen (2023) seems to lie in the several penalty terms for softly enforcing the respective physical constraints, together with a min-max formulation of the optimization problem for balancing the different loss terms. It is a bit limited for a publication in ICLR, in my opinion.
2. Does the energy conservation constraint actually make sense when we consider dissipative and externally-forced Hamiltonian systems?
3. My understanding of port-Hamiltonian systems is that they usually refer to interconnected networks of subsystems with force exchanges, which are different from what the authors consider in this paper. It may be clearer to name them as "forced" or "externally-forced" Hamiltonian systems instead.
4. The presentation of the paper, including the quality of the figures, can be improved. The analysis of the experiment results is also rather limited compared to prior works such as Tanaka et al. (2022).
See item 2 in the weakness section above. |
Fully human-written |
|
Learning Generalized Hamiltonian Dynamics with Stability from Noisy Trajectory Data |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a Gaussian process–based learning algorithm for generalized Hamiltonian systems. The method incorporates three soft constraints—on energy conservation, volume preservation, and Lyapunov stability—to better capture the physical structure and stability properties of the system. Experiments on multiple dynamical systems show that the proposed approach achieves superior accuracy and robustness compared to existing methods.
- The authors propose a learning algorithm based on Gaussian processes for generalized Hamiltonian systems and introduce three soft constraints to make learning more effective. This aspect appears to be novel.
- The authors verify the effectiveness of the proposed method in multiple dynamical systems, demonstrating its robustness particularly in long-term prediction.
- The proposed learning process merely adds three regularization terms representing physical constraints to the conventional SSGP method, and thus appears to offer no significant technical contribution.
- The experimental results are encouraging, but a more thorough analysis is required, including consideration of uncertainties.
- The motivation for introducing the three soft constraints needs to be clarified. The SSGP is trained to follow Hamilton's equations. Since SSGPs represent Hamiltonian vector fields, they guarantee at least energy and volume conservation. Therefore, soft constraints on these properties (at least energy and volume) may not be necessary when modeling vector fields. On the other hand, when learning from trajectory data, cumulative errors due to numerical integration are introduced, and soft constraints may be effective in absorbing this error. Please add the above discussion to clarify the motivation for the proposal.
- After learning, what values did the terms in equation (11) and the hyperparameters take?
- One of the advantages of Gaussian processes is their ability to handle uncertainty, so it would be better to demonstrate their effectiveness in this regard through experiments.
- Please define J and D. |
Lightly AI-edited |
|
Learning Generalized Hamiltonian Dynamics with Stability from Noisy Trajectory Data |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a Gaussian process framework for learning different kinds of non-conservative Hamiltonian systems. The main idea is to use a relaxed Hamiltonian framework, and control the model by regularisation towards conservation and stability.
The paper is sufficiently original in considering generalised Hamiltonian systems. However, the GP methodology is quite basic, and the overall Hamiltonian GP approach has already been established in earlier works.
The clarity of the paper is overall good, and the paper is easy to follow.
The results show superior performance over baselines, which is a good achievement.
The math presentation could be improved. The ELBO and the joint distribution are quite oddly presented: the joint distribution shows no connection between x_ij and x_0, nor any connection between W and x. There is also no theta. I don't think this notation is sufficiently rigorous.
The paper oddly shows no system fits in the main text, which makes it quite difficult to get a good intuition of what is happening, how much data is used, or what the predictions look like. In the appendix there are visualisations, which seem to show a quite different picture from the tables. Figs. 4-6 all show that there is basically no difference between SSGP and the proposed method (!), and the GP methods produce really strange and really strong error patterns. I suspect that there is something wrong in the implementation of this method, or that there are some serious misidentification issues in these models.
Finally, it's difficult to see in what ways the proposed method is significant. There is no "real-world" use case, and all the experiments are low-dimensional, simple systems that we could probably model better by conventional means.
See above |
Fully human-written |
|
Learning Generalized Hamiltonian Dynamics with Stability from Noisy Trajectory Data |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a unified and probabilistic framework for learning generalized Hamiltonian dynamics (conservative, dissipative, and port-Hamiltonian) from noisy and sparse trajectory data. The core of the method is to model the Hamiltonian function as a probabilistic surrogate using a sparse Gaussian Process approximated with Random Fourier Features (RFF). To ensure physical plausibility and improve long-term stability, the authors introduce a multi-term loss function that combines the standard evidence lower bound (ELBO) for data-fitting with soft regularization terms enforcing energy conservation, phase-space volume conservation, and Lyapunov-style stability. A key contribution is the use of a Gradient Descent-Ascent (GDA) algorithm to automatically balance the weights of these loss terms, treating it as a min-max optimization problem. Experiments on several benchmark Hamiltonian systems demonstrate that the proposed method achieves superior performance in both short-term accuracy and, most notably, long-horizon forecasting compared to state-of-the-art baselines.
- Principled and Unified Framework: The paper presents a unified approach to a complex problem. By parameterizing the three distinct classes of Hamiltonian dynamics within a single RFF-based GP framework, the authors provide a systematic way to tackle a broad range of physical systems. The probabilistic nature of the model is well-justified and naturally handles the challenges of noisy and sparse observations.
- Novel Automated Loss Balancing: The use of GDA to automatically learn the Lagrangian multipliers (λ) is a significant contribution. Balancing multiple, often competing, loss terms is a notoriously difficult hyperparameter tuning problem. The proposed min-max optimization framework offers a principled and automated solution, which enhances the method's practicality and robustness. This is a valuable technique that could be adopted in other multi-task or physics-informed learning settings.
- Comprehensive Experimental Validation: The authors conduct a thorough empirical evaluation across all three classes of Hamiltonian systems. The comparison with strong baselines (HNN variants, SSGP) is fair and clearly highlights the benefits of the proposed method. The ablation studies on noise levels and individual loss components further strengthen the paper's claims and provide valuable insights into the model's behavior.
- Scalability Concerns: While RFFs improve the scalability of GPs, the experiments are conducted on relatively low-dimensional systems (1D or 2D position spaces). It is unclear how the computational cost and performance of the method, particularly the GDA optimization, would scale to higher-dimensional phase spaces (e.g., many-body systems) or datasets with very long trajectories. A discussion on the computational complexity with respect to the phase space dimension d and the number of RFF features M would be beneficial.
- Stability and Nuances of GDA: The GDA for min-max optimization can be notoriously tricky to train and may not always converge to a desirable equilibrium. The paper mentions this but could benefit from a more in-depth discussion. For instance, in Table 1, the "Ours (Equal)" variant sometimes slightly outperforms the "Ours (GDA)" variant. This raises a question about the stability and reliability of the GDA optimization. Is it sensitive to learning rates or initialization? When and why might a simpler weighting scheme be sufficient or even better?
- Assumptions on D and F(t): The framework makes simplifying assumptions about the structure of the dissipation matrix D (diagonal, only affecting p) and the external force F(t) (also only affecting p). While reasonable for the chosen benchmarks, this limits the generality of the approach. A brief discussion on how the framework could be extended to handle more complex, unknown, or state-dependent dissipation and forcing structures would strengthen the paper.
- Regarding the GDA balancing: Could you comment on the training dynamics and stability of the GDA approach? As noted in the weaknesses, the equally-weighted version sometimes outperforms the GDA-balanced one in Table 1. Does this suggest that the GDA optimization is sometimes getting stuck in a suboptimal local minimum, or that for some tasks, a simpler balance is more effective?
- Regarding the choice of constraints: The paper applies the same set of regularizers (Energy, Vol, Lyap) across all system types, though their physical meaning changes (e.g., energy is conserved in one case and dissipated in others). The implementation detail "the conservative laws can be enforced by integrating the conservative part only" is key. Could you elaborate on this in the main text? How exactly is L_Energy (Eq. 8) adapted for dissipative and port-Hamiltonian systems, where the total energy is not expected to be conserved? Is it applied only to the J∇H component of the flow?
- Regarding the Lyapunov loss (L_Lyap, Eq. 10): The paper states that for Hamiltonian settings, you can take V=H and α=0. However, the loss term ReLU(d/dt H(x(t))) penalizes any increase in energy. For a port-Hamiltonian system with external energy input, the energy H is expected to increase. How is this apparent contradiction handled? Does the GDA mechanism learn to down-weight this loss term (λ_1 -> 0) in such cases? |
Fully AI-generated |
|
Goal-driven Bayesian Optimal Experimental Design for Robust Decision-Making Under Model Uncertainty |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors propose an experimental-design method that targets minimal expected loss in a decision of interest and considers parameter constraints. They focus on two particular decision problems: choosing quarantine rates during an epidemic, and choosing a dosing rate for administering a drug. They demonstrate the performance of their method on these two problems.
Originality: this is unclear to me, despite quite a lot of effort to work it out.
Quality: the high-level problem (Section 2 up to Equation 4) and the main empirical results (Figure 2) are clear.
Clarity: the writing is understandable at a low level.
Significance: decision-oriented BED is an important direction.
I’m struggling to understand the proposed method. I agree with Equations 3-4, which align with existing decision-oriented objectives (eg, Bernardo & Smith, 1994; Bickford Smith et al, 2025; Huang et al, 2024; Neiswanger et al, 2022; Raiffa & Schlaifer, 1961) if we set $\rho = \mathbb{E}$ as the authors here do in Equation 5, making $h[p(\theta|\xi,y)] =\min_{q \in \mathcal{Q}} \mathbb{E}_{p(\theta|\xi,y)}[J(q,\theta)]$ the key quantity to target, where $J$ is a loss function, $q$ is an action, and $\theta$ is an unknown ground-truth variable.
Things get confusing thereafter because the authors actually consider $J$ not being a function of $\theta$ while placing constraints on $\theta$, leading to Equation 5. Mathematically it’s unclear to me how changing beliefs over $\theta$ lead to a change in the minimising $q$, unless there is some $\theta$ assigned nonzero weight by the prior and zero weight by the posterior: all that matters with regard to $\theta$ is that the constraints are met, and these are set upfront, before any experimentation.
Even if there’s some way this works out, it’s unclear to me why the constraints on $\theta$ should not just be thought of as implying an updated belief state, $p'(\theta|\xi,y,\mathcal{C})$ for constraints $\mathcal{C}$, produced by applying the constraints and renormalising. If we had this updated belief state and a $J$ that depends on $\theta$ then I think we’re back to the setup from past work.
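Concretely, the belief update I have in mind is just

$$p'(\theta \mid \xi, y, \mathcal{C}) \;\propto\; p(\theta \mid \xi, y)\,\mathbb{1}\{\theta \in \mathcal{C}\},$$

after which the decision quantity becomes $h[p'] = \min_{q \in \mathcal{Q}} \mathbb{E}_{p'}[J(q,\theta)]$, i.e., we are back to the standard decision-theoretic setup cited above.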
Aside from these methodological issues, it looks to me like the proposed method is not compared against existing methods, even though the authors promise to “compare GoBOED with standard BOED baselines”. I think the authors should be comparing against EIG maximisation as well as other non-parameter-oriented methods (eg, Huang et al, 2024; Kandasamy et al, 2019).
Finally I think there is a general inflation of the paper’s novelty. The goal-oriented aspect of the work is not new (see for example the citations for decision-theoretic methods). Considering the intersection between experimental design and control is not new (eg, Anderson et al, 2023; DeGroot, 2004; Mesbah & Streif, 2014). Studying SIR and pharmacokinetic models is not new (eg, Ivanova et al, 2021). The lack of novelty would be fine if there were a compelling contribution otherwise, but this is very unclear to me.
---
Anderson et al (2023). Experiment design with Gaussian process regression with applications to chance-constrained control. Conference on Decision and Control.
Bernardo & Smith (1994). Bayesian Theory. John Wiley & Sons.
Bickford Smith et al (2025). Rethinking aleatoric and epistemic uncertainty. ICML.
DeGroot (2004). Optimal Statistical Decisions. John Wiley & Sons.
Huang et al (2024). Amortized Bayesian experimental design for decision-making. NeurIPS.
Ivanova et al (2021). Implicit deep adaptive design: policy-based experimental design without likelihoods. NeurIPS.
Kandasamy et al (2019). Myopic posterior sampling for adaptive goal oriented design of experiments. ICML.
Mesbah & Streif (2014). A probabilistic approach to robust optimal experiment design with chance constraints. arXiv.
Neiswanger et al (2022). Generalizing Bayesian optimization with decision-theoretic entropies. NeurIPS.
Raiffa & Schlaifer (1961). Applied Statistical Decision Theory. Division of Research, Harvard Business School.
Can you show how different beliefs over $\theta$ lead to different optimal $q$?
If so, can you show why constraints cannot just be applied as a belief update over $\theta$?
How well do the abovementioned baseline methods work?
Can you confirm that Equation 2 is correct? I don’t think it matches any estimators from Foster et al (2019). |
Fully human-written |
|
Goal-driven Bayesian Optimal Experimental Design for Robust Decision-Making Under Model Uncertainty |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents GoBOED, a framework that unifies Bayesian Optimal Experimental Design (BOED) and robust optimal control for decision-making under model uncertainty. Instead of focusing solely on parameter uncertainty reduction (via expected information gain), the proposed method explicitly optimizes experiments to reduce uncertainty that most impacts downstream decisions. The framework uses variational inference for amortized posterior approximation, convex optimization for tractable robust control, and differentiable decision layers to enable end-to-end gradient-based training. Applications to epidemic management (SIQR model) and pharmacokinetic (PK) control demonstrate the method’s capacity to identify flexible, near-optimal experimental designs that balance decision quality and acquisition cost.
1. **Conceptual contribution**: The idea of integrating BOED with robust control in a single differentiable pipeline represents a novel and meaningful conceptual advance over classical information-theoretic approaches, which often overlook decision performance.
2. **Practical relevance**: Applications in epidemic control and pharmacokinetics are both timely and compelling, demonstrating generalizability across domains requiring safe and cost-aware experimental scheduling.
3. **Computational tractability**: The amortized inference and differentiable control layers reduce the sampling and computational overhead that typically affect BOED, addressing a key bottleneck in the literature.
4. **Empirical results**: Case studies show interpretable outcomes (e.g., near-optimal observational “windows”), emphasizing the trade-off between informational and operational objectives.
1. **Clarity and structure**:
- The motivation is unclear to me. The introduction section mentions many challenges, including the computational challenges in BOED and parameter uncertainty issues in robust control, but it is unclear which challenge this paper aims to address. It would be helpful if the authors could indicate which section addresses each challenge.
- For the problem formulation, the combined objective in equation (4) is interesting, but there is little interpretation provided. The overall goal remains unclear, and my uncertainty about which problem (BOED, robust control, or both) this paper seeks to address is not resolved even after reading Section 2.
- The exposition is often dense and notation-heavy, particularly in Sections 3–4. Long derivations (e.g., Eq. (11)) could benefit from a clearer explanation of the high-level logic before diving into detailed formulations.
- I think Figure 2 can be significantly improved by labeling the notations directly on the corresponding charts, rather than relying on the text. It would also be very helpful to provide a complete algorithm block for the proposed method.
- Some passages conflate notations from BOED and control optimization, making it difficult to distinguish design variables ($\xi, \xi^*, \xi^\star$), control variables ($q$), and variational parameters ($\phi$) on first read. A summary table of notations would help.
1. **Limited comparison baselines**: Only classical EIG-based BOED is compared. Including a *decision-focused baseline* (e.g., Expected Predictive Information Gain) would provide a fairer benchmark to demonstrate decision-aware benefits.
2. **Computational efficiency validation**: The authors claim to propose a computationally efficient method, but this claim is not validated by experimental evidence.
Please see the weaknesses for my concerns. In particular, I would like the authors to provide a clearer explanation of the overall procedure, for example by including algorithm blocks and clearer diagrams. |
Fully AI-generated |
|
Goal-driven Bayesian Optimal Experimental Design for Robust Decision-Making Under Model Uncertainty |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes Goal-driven Bayesian Optimal Experimental Design, a framework that optimizes experimental designs to minimize downstream decision costs rather than just maximizing EIG on parameters as in the traditional BOED. The approach combines variational inference for posterior approximation with convex optimization for robust control under parameter uncertainty, using chance constraints or CVaR to handle uncertainty in the constraints. The authors apply differentiable optimization layers (cvxpylayers) to enable gradient-based design selection and demonstrate the framework on two simulated examples: epidemic management using an SIQR model and pharmacokinetic dose optimization.
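For context, the differentiable-convex-optimization building block works roughly as follows (a minimal, hypothetical sketch using cvxpylayers, with made-up dimensions and a placeholder objective rather than GoBOED's actual control problem):

```python
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

n = 4                              # illustrative decision dimension (hypothetical)
q = cp.Variable(n, nonneg=True)    # control decision, e.g., dosing or quarantine rates
c = cp.Parameter(n)                # cost coefficients derived from the posterior (assumed)
b = cp.Parameter(n)                # constraint levels standing in for robust/chance constraints
problem = cp.Problem(cp.Minimize(c @ q + cp.sum_squares(q)),
                     [q <= b, cp.sum(q) <= 1.0])
layer = CvxpyLayer(problem, parameters=[c, b], variables=[q])

c_t = torch.rand(n, requires_grad=True)   # would come from the variational posterior
b_t = torch.ones(n)
q_star, = layer(c_t, b_t)                 # optimal decision, differentiable w.r.t. c_t
q_star.sum().backward()                   # gradients can flow back to design/inference parameters
```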
1. The paper studies an important problem in BOED that maximizing EIG may not optimize decision-making objectives, and demonstrates this on two concrete applications (epidemic management and pharmacokinetic control).
2. The proposed approach is technically sound, and the use of chance constraints and CVaR for robust optimization looks reasonable to me.
1. The proposed framework primarily applies existing techniques without domain-specific adaptation. Goal-oriented BOED is well-established in prior work (as acknowledged in the related work section), and the variational BOED framework with importance sampling is directly adopted from [1] without justification or specific adjustments for the two applications. The paper should better position its contribution, either as novel methodology or as an application study demonstrating feasibility in specific domains.
2. The author only compares the proposed method against traditional EIG-based BOED, not other goal-oriented approaches such as [2] which would fit the experimental setting. Additionally, the paper only tests on two applications (SIQR, PK) without evaluating on standard BOED benchmarks commonly used in the literature (e.g., source localization problems).
3. The experimental results do not sufficiently demonstrate the value of goal-oriented BOED, especially in the SIQR setting. Could the authors elaborate more on this part?
4. The paper lacks crucial experimental details including training procedures (optimizer, learning rate, epochs, batch size), model architecture specifications, and key hyperparameters. More importantly, no ablation studies are provided to justify design choices such as: importance sampling vs direct VI sampling, impact of N (posterior samples), sensitivity to $\eta$, or the architecture choices.
[1] Foster, Adam, et al. "Variational Bayesian optimal experimental design." Advances in neural information processing systems 32 (2019).
[2] Smith, Freddie Bickford, et al. "Prediction-oriented bayesian active learning." International conference on artificial intelligence and statistics. PMLR, 2023.
1. Please see the questions in the Weakness part.
2. The authors only conduct single-step experimental design over small discrete spaces. Why use amortized inference in this setting? Amortization is essential for sequential BOED where posteriors must be computed repeatedly across multiple rounds, but for single-step designs, evaluation of all candidate designs would be computationally feasible and simpler. Can you provide justification or computational cost comparisons demonstrating why amortization is necessary for your experimental setting?
3. In the experiments section, the authors mentioned that they estimate EIG using nested Monte Carlo with 5,000 outer samples and 3,000 inner samples for the marginal likelihood. Can you elaborate on how these specific numbers were selected? Have you conducted any ablation studies to determine the optimal sample sizes? |
Lightly AI-edited |
|
Goal-driven Bayesian Optimal Experimental Design for Robust Decision-Making Under Model Uncertainty |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces GoBOED, an integrated framework combining Bayesian Optimal Experimental Design (BOED) with convex optimization–based decision-making under uncertainty. The main idea is to choose experiments that improve downstream decision quality, not merely parameter accuracy. While the concept of linking BOED to decision-aware control is worthwhile, the paper’s claims of “robustness under model uncertainty” and its empirical contributions are overstated relative to what is demonstrated.
The idea of linking Bayesian Optimal Experimental Design (BOED) directly to downstream decision-making is conceptually important, addressing a genuine gap between information-theoretic design and decision-focused inference. The framework combines variational inference (VI) with differentiable convex optimization (via cvxpylayers) to enable end-to-end gradient flow from experiment design through control decisions. This is a clean engineering contribution that improves computational tractability. The use of amortized VI and differentiable convex optimization allows efficient gradient-based optimization over both experiment designs and decision variables without repeated posterior refits.
The approach can, in principle, be applied to multiple convex decision problems, demonstrated with epidemiological (SIQR) and pharmacokinetic (PK) case studies. The main ideas are well structured, with clear visuals (e.g., Fig. 1) illustrating how the BOED and decision layers interact.
While the paper’s central idea is conceptually appealing, its current presentation overstates robustness. The method addresses parameter uncertainty within a fixed model rather than broader forms of model misspecification. The robustness achieved through chance constraints or CVaR is useful but conventional, and the paper should more clearly define its scope as parameter-uncertainty-aware rather than fully model-uncertainty-robust. The framework’s differentiable structure is well executed but not fundamentally new in either BOED or robust optimization.
Empirically, the examples in Figure 2 highlight an interesting divergence between EIG-optimal and decision-optimal designs, particularly the emergence of a broader, flatter near-optimal window under risk-sensitive criteria. This is an intriguing and potentially valuable observation. However, the paper does not quantify or interpret why this difference matters or what practical benefit arises from using GoBOED instead of traditional BOED. A more systematic analysis, such as measuring control cost improvements, robustness to posterior misspecification, or constraint-violation frequency, would make the contribution more convincing.
Terminology and exposition could be clearer. Concepts such as risk functional, chance term, and the “discrepancy” corrected through importance sampling are introduced without rigorous definition. Moreover, since the entire method depends on the posterior quality from variational inference, the absence of any evaluation of posterior calibration or its effect on decision reliability leaves an important gap.
The current evaluation relies on a single-shot design, which limits the interpretability of the framework. Extending the experiments to an iterated design, where updated posteriors inform subsequent measurement choices, would better demonstrate the claimed benefits of goal-driven experimental design. Visualizing posterior updates would also provide concrete evidence of how decision-aware objectives reshape uncertainty, rather than inferring these effects indirectly from control costs. Although benchmarking new methods is challenging, the authors could still compare their approach across multiple time points and posterior evaluations for both models, and contrast the results with plain EIG optimization. Posterior evolution could be illustrated visually or through calibration metrics such as L-C2ST.
The literature review misses seminal work and blurs distinctions between related methods.
- The separation between Kleinegesse and Foster’s MI-based BOED methods is overstated; both rely on similar bounds, though Kleinegesse optimized a critic. The paper should also specify which bound is optimized, since MI bias and variance depend on that choice.
- Foundational references such as Lindley (1956) and Barber–Agakov (2003) are missing.
Finally, the literature review gives only a very high-level overview of the field of BOED, hinting at a misunderstanding of some fundamental concepts. Careless exposition makes it difficult for readers to place this paper in the context of prior work, so I recommend edits to the introduction and the background on BOED.
1. The framework relies heavily on how the posterior distribution changes under different design choices, but no posterior visualizations are provided. Could the authors show what the posteriors look like for the single-shot examples presented? Even a qualitative comparison between the EIG-optimal and decision-optimal designs would help clarify how the decision-aware objective shapes posterior uncertainty.
2. How would the proposed approach behave in an iterated experimental design setting, where posteriors from earlier measurements inform the next design choice? This seems especially relevant for the SIQR example, where measurements could be taken over multiple time points. Would the decision-aware design criterion lead to different sequences of measurement times compared to classical EIG? A short discussion or pilot experiment illustrating this would strengthen the paper’s argument for real-world applicability. |
Fully AI-generated |
|
Time series saliency maps: Explaining models across multiple domains |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper provides a novel explainability method for the time series domain. Specifically, the method is based on an extension of the popular saliency method Integrated Gradients to incorporate multiple domains that can be derived through an invertible, differentiable transformation from the time domain. Importantly, the proposed method maintains the sensitivity and implementation invariance properties of IG.
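For reference, my understanding of the construction is standard IG composed with the invertible, differentiable transform $T$ (the paper's exact formulation may differ): writing $z = T(x)$ and $\hat z = T(\hat x)$ for a baseline $\hat x$,

$$\mathrm{IG}^{(T)}_i(x) = \big(z_i - \hat z_i\big) \int_0^1 \frac{\partial f\big(T^{-1}(\hat z + t\,(z - \hat z))\big)}{\partial z_i}\,dt,$$

i.e., ordinary IG applied to $f \circ T^{-1}$ in the transformed domain, which is why the sensitivity and implementation-invariance axioms carry over.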
- The paper proposes a saliency method for time series which does not solely focus on the time domain, but can also integrate latent features such as frequencies
- The presentation of the method is easy to follow
- The paper provides open access to the method in the form of a Python package
- The paper completely lacks references in the introduction. This is not in line with good research practice. It is unclear to the reader whether observations and statements are taken from the literature or are a novel contribution by the paper. Importantly, the observations that existing saliency maps fail in the time-series domain and that other features, e.g., those stemming from the frequency domain, carry relevant information are not novel and have been shown before in the literature (e.g., [1], [2], [3]). This renders Proposition 1, without proper citations, almost plagiarism. Section 3.2 is therefore unnecessary. I am not raising an ethics flag at the moment, but this aspect is, in my opinion, sufficient for rejecting the paper without further evaluation.
- The paper does not discuss the many restrictions and limitations of IG and their counterparts in the proposed method.
- The paper states that the method is applicable to many domains. However, all argumentation and derivation (e.g., section 4.2) solely focus on the frequency domain. More explanation and examples are needed here to evaluate the usefulness of the method.
- The method requires domain knowledge to specify the domain of interest. In practice, such knowledge might not exist, or if it does, might be limited. This can lead to dangerous misinterpretations of the explanations and wrong decision-making. It would be desirable for a novel explainability method to directly infer saliency across many (unspecified) domains, potentially uncovering so-far-unknown important features, instead of suffering limitations similar to those of existing time series saliency methods on the time domain.
- The paper does not discuss the experimental results or the limitations of the proposed method. Overall, the paper seems unfinished.
- The experimental section only focuses on three examples with specific ML methods. Here, a model-agnostic evaluation would be beneficial.
[1] Schröder, Maresa, Alireza Zamanian, and Narges Ahmidi. "Post-hoc saliency methods fail to capture latent feature importance in time series data." International Workshop on Trustworthy Machine Learning for Healthcare. Cham: Springer Nature Switzerland, 2023.
[2] Schröder, Maresa, Alireza Zamanian, and Narges Ahmidi. "What about the Latent Space? The Need for Latent Feature Saliency Detection in Deep Time Series Classification." Machine Learning and Knowledge Extraction 5.2 (2023): 539-559.
[3] Theissler, Andreas, et al. "Explainable AI for time series classification: a review, taxonomy and research directions." IEEE Access 10 (2022): 100700-100724.
- Section 4.2: How is the known failure mode of IG addressed in other domains besides the frequency domain?
- How are failure modes/limitations of IG addressed in the proposed method for general ML models (not only CNNs)?
- How can the method integrate multiple domains at the same time? Importantly, how can it detect + explain interactions between the domains? |
Fully human-written |
|
Time series saliency maps: Explaining models across multiple domains |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes the Cross-domain Integrated Gradients method, which extends Integrated Gradients (IG) to any reversible and differentiable transformation domain (including the complex domain), providing more semantic and insightful explanations for time series models.
1. Wide applicability: The attribution framework is not specific to any particular transformation and applies to arbitrary invertible transformations.
2. Solid theoretical contribution: Extends the IG method to the complex domain.
3. Compelling & Diverse Applications: The method shows significant application potential in multiple scenarios, such as medical and other general time series applications.
Overall, I think the strengths of this paper are very prominent, and the weaknesses are minor in comparison. Here are a few small concerns.
1. The main text of the paper lacks quantitative experimental comparisons, and the structure should be adjusted by moving the section in Appendix F to the main text.
2. Appendix D indicates that this method is a generalization of [1]. Can the authors compare the two visually to see whether the actual effect is consistent with the theory?
[1] Johanna Vielhaben, Sebastian Lapuschkin, Grégoire Montavon, and Wojciech Samek. Explainable AI for time series via virtual inspection layers. Pattern Recognition, 150:110309, 2024.
1. Since this method can be applied to all reversible transformations, is it suitable for the currently popular flow-based generative models? What would the computational burden be in practical applications?
2. What are the errors of this method for differentiable but irreversible transformations? Is it possible to correct for them? |
Fully human-written |
|
Time series saliency maps: Explaining models across multiple domains |
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work is focused on the explainability of black-box models in the context of time series models. The key insight is that the semantically meaningful information might not always be found in the time domain, but in other domains such as the frequency domain. To address this, a generalization of the well-known Integrated Gradients method is proposed such that explanations can be presented in different domains. The proposed methodology is analyzed and evaluated on 3 time series analysis tasks.
1. A clear idea that is well motivated.
2. A through theoretical analysis of the proposed methodology.
3. A nicely written and well-structured manuscript.
1. The novelty is low. The idea of providing explanations in a different domain is already established. The paper mentions the Virtual Inspection Layers of Vielhaben et al. [1], but does not include it as a baseline, even though [1] also evaluated Integrated Gradients with their Virtual Inspection Layers. Furthermore, several other works [2, 3] have presented methodology for providing explanations in domains other than the time domain.
2. The experimental evaluation is limited. The proposed methodology is tested on 3 datasets, but no baselines are provided, and the quantitative evaluation is limited. Compared to existing works [1, 3], where numerous datasets are used and a wide range of explainability metrics are evaluated, the evaluation in this work does not give insights into the usefulness of the proposed method.
- [1] Vielhaben et al., Explainable AI for time series via Virtual Inspection Layers, Pattern Recognition, 2024
- [2] Brüsch et al., FreqRISE: Explaining time series using frequency masking, NLDL, 2025
- [3] Brüsch et al., FLEXtime: Filterbank Learning to Explain Time Series, Explainable Artificial Intelligence, 2025
1. How does the proposed method quantitatively compare to [1, 2, 3] in terms of established explainability metrics like faithfulness, localization, complexity, and robustness?
2. Apart from being specific to Integrated Gradients, how does the invertible transform introduced here to transfer between domains differ from the transform introduced in [2]?
- [1] Vielhaben et al., Explainable AI for time series via Virtual Inspection Layers, Pattern Recognition 2024
- [2] Brüsch et al., FreqRISE: Explaining time series using frequency masking, NLDL, 2025
- [3] Brüsch et al., FLEXtime: Filterbank Learning to Explain Time Series, Explainable Artificial Intelligence, 2025 |
Fully human-written |
|
Time series saliency maps: Explaining models across multiple domains |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors advance explainability for time series methods by innovating on the Integrated Gradients method from 2017. Their proposal is called "Cross-domain Integrated Gradients", based on the fact that their method works on any invertible transformation. The user of the XAI method can thereby choose whichever transformation yields the domain most suitable for explanations. The authors demonstrate qualitative feasibility using the Fourier Transform, Independent Component Analysis, and Seasonal-Trend decomposition.
1. The paper tackles an important and timely problem of developing XAI methods for time series. This field is in its infancy and underdeveloped, with the majority of XAI methods being developed on image data.
1. The authors provide both a TensorFlow and a PyTorch open-source library for their method.
1. Strong mathematical foundation for their method.
1. The method works, based on Figure 2-4, and is tested using three different transformations, in three different data domains.
As far as I can see, the paper has only one major weakness, which is its positioning in the existing state of the art. If the authors can address this, I will definitely consider changing my recommendation.
References to relevant prior literature on time series explainability are lacking in section 2, e.g., there are only two papers from 2024 in related work.
The experimental section is quite limited and is primarily qualitative. The only quantitative experiments I could find are in Appendix F, where the authors present insertion/deletion and compare that to time IG. There should be comparisons included with existing time-frequency analysis XAI methods. Comparisons should be made, e.g., to https://proceedings.mlr.press/v265/brusch25a.html, which also only assumes invertible transformations and therefore seems relevant.
1. Equation 1: $\xi$ can easily become negative; is that an issue?
1. Figure 1: Why is the peak of the orange distribution not located at 4 Hz, as specified by Eq. 1?
1. Equation 2: I'm a bit unsure about the notation in the integral with $x+t(x-\hat{x})$, does this mean the partial derivative of $f$ is evaluated at that point, and then you integrate over $t$?
1. Line 352: $x(t) = a_1 \cos(2\pi · \xi_{hr}\cdot t + \phi) + a2 \cos(2\cdot \pi(2\xi_{hr})\cdot t + \phi))$. Small esthetic typo: should it be $2\pi$ without the cdot in the second cosine?
1. Figure 2: The blue dashed (as opposed to solid) lines make this figure more difficult to read, but I may be wrong.
1. Line 389: What does the $\rightarrow$ mean in this context? |
Fully human-written |
|
CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a fine-tuning method that leverages the text-image alignment (cross-attention maps) formed at smaller timesteps to explicitly calibrate the learning at larger timesteps. The goal of the paper is to address the persistent challenge of poor alignment between the text prompt and the generated images.
1. While a lot of prior works focus on improving image-text alignment during inference, its interesting to see this paper talk about providing explicit supervision for modeling fine-grained text-image correspondence during training instead.
2. The main idea of Cross-Timestep Self-Calibration is novel. It moves beyond conventional losses by introducing a self-supervised signal derived internally from the model's behavior at different levels of noise.
1. The method seems to add computational complexity, and the qualitative results do not seem strong enough to demonstrate the utility of the proposed approach. For example, in the first half of Figure 4, the jar in the 5th column is just floating in the air, the banana in the 6th column looks unnatural, and there is leakage of green onto the banana.
2. The authors choose $t_{tea}=0$ in the final setup, but use $t_{tea}=1$ while motivating the overall approach in Figure 1. I wonder whether $t_{tea}=0$ would give meaningful differentiation across the attention maps for various objects, unless I am missing something.
1. How reliable is the Part-of-Speech Tagger for complex prompts?
2. The authors mention in L203-204 that nouns (denoting objects or entities) are extracted and their attention maps are considered. I wonder whether the attributes of objects, e.g., "yellow" for a yellow object, should be considered as well.
3. The real role of the introduced autoencoder is unclear to me. While a reconstruction task is used to prevent mode collapse, why would alignment in the compressed space be better than pure pixel-level alignment? |
Fully human-written |
|
CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses the challenge of imprecise text-image alignment in text-to-image diffusion models, proposing Cross-Timestep Self-Calibration (CTCAL) based on the observation that alignment degrades with increasing (more noisy) timesteps.
CTCAL uses reliable cross-attention maps from smaller (less noisy) timesteps to calibrate larger-timestep learning, with components like noun-prioritized attention selection, pixel-semantic joint optimization, and subject response regularization, plus timestep-aware weighting for integrating with diffusion loss.
Model-agnostic CTCAL works with diffusion-based (e.g., Stable Diffusion 2.1) and flow-based (e.g., Stable Diffusion 3) models.
Experiments on T2I-CompBench++ and GenEval show CTCAL improves alignment in attributes, object relationships, and compositions, with user studies confirming better visual and semantic fidelity.
1. Targeting a validated issue: It addresses the measurable degradation of text-image alignment with increasing diffusion timesteps, supported by cross-attention map visualizations.
2. Model agnosticism: It seamlessly integrates into diverse text-to-image diffusion models, including diffusion-based (e.g., SD 2.1) and flow-based (e.g., SD 3) approaches.
3. Comprehensive validation: It is rigorously tested on T2I-CompBench++/GenEval benchmarks and user studies, with no trade-offs in image diversity or aesthetic quality.
1. Limited novelty. It is essentially an integration of existing techniques rather than a breakthrough: using cross-attention for alignment, filtering non-semantic tokens, and combining multi-loss terms are all well-explored in prior diffusion model optimization works. Token mapping in the attention map has been well explored in previous works, during either inference or training. For example, DreamO [1] explores routing constraints in the DiT structure to distinguish multiple subjects, and AnyStory [2] explores multiple-subject injection in SDXL with attention-map restrictions.
2. Fragile Noun Selection undermines core supervision. CTCAL’s performance depends on its POS-based filter, which relies on Stanza to extract "spatially meaningful" nouns. However, the paper admits this filter is flawed: it incorrectly includes non-spatial nouns (e.g., directional terms like "top" or "left") that lack semantic-spatial correspondence. The proposed workaround—an ad-hoc blacklist—is not generalizable, and the authors only mention "using LLMs for semantic filtering" as a future direction.
Reference:
[1] DreamO: A Unified Framework for Image Customization
[2] AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation
Please check the weaknesses. |
Lightly AI-edited |
|
CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper tackles the gap that text–image correspondence tends to be strong at small timesteps (low noise) but weak at large timesteps. It proposes a training-time self-calibration scheme (CTCAL): for each sample, a "teacher" small-timestep cross-attention map supervises the "student" large-timestep map. The loss combines pixel-level and semantic-level attention alignment, a subject-response balancing term, and a timestep-aware weighting that emphasizes harder (noisier) steps. On SD 2.1 and SD 3, the method improves compositional alignment and prompt faithfulness on standard benchmarks.
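To make the supervision signal concrete, here is a minimal sketch of what such a cross-timestep attention-calibration loss could look like (my own reconstruction, not the authors' code): the tensor shapes, the MSE/KL choices for the pixel- and semantic-level terms, the stop-gradient on the teacher map, and the linear timestep weighting are all assumptions.

```python
import torch
import torch.nn.functional as F

def cross_timestep_calibration_loss(attn_student, attn_teacher, t_student, T=1000):
    """Align a large-timestep ("student") cross-attention map with a small-timestep
    ("teacher") map from the same sample and prompt.

    attn_student, attn_teacher: (B, N_tokens, H, W) maps over selected (e.g., noun) tokens.
    t_student: (B,) student timesteps; T: total number of diffusion steps.
    """
    teacher = attn_teacher.detach()                     # no gradient through the teacher map
    # "Pixel-level" alignment: per-token spatial maps should match elementwise.
    pixel_loss = F.mse_loss(attn_student, teacher)
    # "Semantic-level" alignment: compare normalized spatial distributions per token.
    log_p_student = F.log_softmax(attn_student.flatten(2), dim=-1)
    p_teacher = F.softmax(teacher.flatten(2), dim=-1)
    semantic_loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    # Timestep-aware weighting: emphasize noisier (harder) timesteps; a simple linear
    # schedule stands in for whatever schedule the paper actually uses.
    w = (t_student.float() / T).mean()
    return w * (pixel_loss + semantic_loss)
```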
1. **Clear problem framing:** Directly targets cross-timestep misalignment.
2. **Simple supervision signal:** Reuses model-internal cross-attention as a self-supervised "teacher".
3. **Comprehensive design:** Pixel + semantic attention alignment and subject balancing are sensible; timestep-aware weighting matches the difficulty profile.
4. **Empirical gains:** Consistent improvements on compositional/prompt-following metrics.
1. **Diversity risk from attention supervision.**
Using small-timestep attention to shape large-timestep behavior might bias the model toward more deterministic layouts and reduce output diversity. I wonder whether there is a drop in diversity or mode collapse in generation. A metric analysis or visualization would help.
2. **Dataset construction and generalization.**
Training data is curated from T2I-CompBench-like prompts via reward-driven selection. This raises concerns about overfitting to that prompt style or metric. Please evaluate on broader benchmarks or metrics (e.g., FID, CLIPScore, HPS, ImageReward) to demonstrate the generalization.
3. **Positioning vs reward-based post-training (ReFL / GRPO family).**
While CTCAL focuses on improving cross-timestep consistency during training, recent post-training methods such as ReFL or GRPO also enhance text–image alignment by optimizing explicit reward signals. It would strengthen the paper to clarify how CTCAL compares with or complements these approaches—both conceptually and empirically. A short discussion or a compute-matched comparison (e.g., ReFL vs. CTCAL under similar reward setups) would help readers understand whether CTCAL offers distinct benefits or can be combined with reward-based fine-tuning.
4. **Complexity and scalability of the training recipe.**
The approach combines several components (noun-focused maps, pixel+semantic alignment, subject regularization, timestep-aware weighting). It’s not obvious how robust this recipe is when scaling to larger/faster backbones. More evidence of training stability and a brief report of training cost or efficiency would make the method’s practicality more convincing.
Please check the weaknesses. |
Fully AI-generated |
|
What happens when generative AI models train recursively on each others' outputs? |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper describes work investigating the effects of recursive training on model-generated data (plus combinations of human-authored data) with LLMs. The authors motivate the need for research in this direction by highlighting several often-overlooked realities, such as the fact that proprietary LLMs are trained mostly on internet-scraped data and that these datasets overlap substantially. The most important motivation the authors emphasize is that future LLMs will evidently be trained with LLM-generated (their own) data. The authors back this with existing literature from learning theory on model collapse. In order to further shed light on the potential good or bad consequences of this model-data interaction across tasks, the authors formalize an interactive training pipeline controlled by alpha (the fraction of new data per iteration) and beta (the public-private data partition). Results from a small-scale experiment with two model providers (K=2), derived from the Llama and OPT model architectures, with t = 15 iterations show that setting alpha and beta to 0.5 (an equal partition) yields seemingly optimal results across tasks (science QA and math QA) compared to other values. Setting alpha to 0, denoting the use of purely human-generated data, results in degradation in task generalization, while setting it to 1, denoting purely LLM-generated data, results in equal degradation on the original task.
Overall, the paper does present a simple and understandable method for potentially simulating model collapse and model interactions when training on LLM-generated data, but my main issue centers on grounding the experiments with more rigor, such as exploring K=10/20/50 or t=50. See the feedback below for other concerns with the paper.
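To make the pipeline concrete, here is a toy sketch of how I understand the alpha/beta schedule to govern data composition across iterations; `train`, `generate`, `public_data`, `private_data`, and `n_new` are placeholders, and whether data accumulate across rounds or training continues from previous weights are my assumptions rather than claims about the authors' implementation.

```python
import random

def recursive_training_sim(K, T, alpha, beta, public_data, private_data,
                           train, generate, n_new=1000):
    """Toy simulation of K providers recursively training on a mix of human data
    and each other's generated outputs.

    alpha: fraction of each round's new training data that is model-generated.
    beta:  fraction of the initial (round-0) data drawn from the shared public pool.
    train(model, data) and generate(model, n) are placeholder callables.
    """
    # Round 0: a beta/(1-beta) mix of public and provider-specific human data.
    n_public = int(beta * n_new)
    datasets = [random.sample(public_data, n_public) +
                random.sample(private_data[k], n_new - n_public)
                for k in range(K)]
    models = [train(None, datasets[k]) for k in range(K)]

    for _ in range(T):
        # Pool of synthetic data produced by all providers in the previous round.
        synthetic_pool = [x for k in range(K) for x in generate(models[k], n_new)]
        for k in range(K):
            n_synth = int(alpha * n_new)
            new_data = (random.sample(synthetic_pool, n_synth) +
                        random.sample(datasets[k], n_new - n_synth))
            models[k] = train(models[k], new_data)  # continuing from prior weights is an assumption
            datasets[k] = datasets[k] + new_data    # accumulation is an assumption of this sketch
    return models
```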
The paper proposes a simple yet intuitive method for exploring training data dynamics of LLMs as shown in Figure 1. I found the paper to be fairly readable and the way the paper motivates the need for investigating recursive training from model-generated data to be useful in contextualizing the study and support for the experiments. I believe the model collapse research community may find this paper's results to be beneficial and interesting.
While I appreciate the simplicity and readability of the paper, there are some issues that I found that the authors can use to improve the quality/rigor of the study:
First, the current experimental setup seems shortsighted to me, using only two model providers (K=2). Likewise, in terms of iterations, a realistic scenario would involve repeatedly training on model-generated data over a longer horizon, say t = 30, 50, or even 100. The goal is to rigorously investigate how far performance can converge, or whether phenomena similar to grokking can appear. Model providers these days release extremely fast (almost a new model every 3-4 months); hence I believe a more realistic setup is needed for the study. Likewise, a larger K such as 10 could also be explored to further investigate the effects of training on a diverse collection of models.
I believe the paper lacks an equally thorough discussion of the implications of training data dynamics grounded in the results. For example, the study shows that using a perfect split of 0.5 for alpha and beta seems to produce optimal results, but this seems very idealistic and tied to the current experimental setup. More realistic setups might be more nuanced and highly dependent on factors such as training data quality, task diversity, etc. How can the study account for this? This part is underdeveloped in the paper.
I suggest the authors balance the structure of the paper by prioritizing the experimental results. The current paper's experiments and discussion are both pushed back to the last two pages and feel quite rushed/limited when you read them. The first four pages motivating the challenge could be condensed further to prioritize the results. I would also appreciate a more expanded and clearer discussion of task generalization, as this seems underdeveloped as well.
Please improve your references and cite published articles. In the introduction alone, the main citations supporting interdisciplinary adoption of generative AI are mostly blogs and websites, ignoring more qualified literature or peer-reviewed prior work.
“While some amount of mixing improves model performance on previously-unseen tasks, homogenization occurs for $D^*$ at all $\alpha$ and for $\tilde{D}_k$ when $\alpha > 0$—everywhere it can” - this sentence is confusing; can you please clarify and expand it? |
Fully human-written |
|
What happens when generative AI models train recursively on each others' outputs? |
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper studies how generative language models (LLMs) may interact recursively through training data that include other models’ outputs.
1. It introduces a formal framework with two parameters — the synthetic data ratio ($\alpha$) and the initial data weight ($\beta$) — to describe cross-model data mixing.
2. Theoretical analysis (Sec. 3) under a generalized linear model shows convergence and bias–variance behavior depending on $\alpha$ and $\beta$.
3. Empirical results (Sec. 4; Fig. 3–5) using OPT-350M and LLaMA-1B demonstrate that moderate mixing ($\alpha=\beta=0.5$) improves both models’ performance but leads to representation homogenization.
The paper also concludes with a discussion of the implications for model diversity and long-term ecosystem dynamics (Sec. 5).
1. Clear and meaningful problem setting: The paper addresses a timely and practically significant question — how recursive data interactions among generative models affect learning stability and diversity. The motivation and background are well-articulated (Sec. 1–2), making the research goal both relevant and understandable.
2. Comprehensive and interpretable theoretical framework: The proposed formalism based on the parameters $\alpha$ and $\beta$ (Sec. 3) systematically captures cross-model data mixing. The accompanying bias–variance and convergence analysis provides solid conceptual grounding for the empirical findings (Fig. 2–3). Even without verifying every derivation, the overall reasoning is coherent and accessible.
3. Exceptional clarity and readability: The writing is well-structured and accessible to readers beyond the immediate subfield. Explanations, figures, and notation are consistently clear, enabling a broad audience to grasp the motivation, methodology, and conclusions (Sec. 1–5).
1. Limited novelty in the modeling of cross-model interaction: The description of data-mediated interactions between models (Sec. 3) is clear and well-structured but largely descriptive. While it helps readers understand the setup, this section mainly formalizes an intuitive process rather than introducing a new mechanism or theoretical insight. As a result, the contribution of this part feels limited in terms of originality.
2. Gap between theoretical modeling and practical relevance: Most of the paper focuses on theoretical modeling and proofs (Sec. 4–5). Although the derivations appear sound, the connection to real-world large-scale training scenarios remains weak. The introduction of parameters $\alpha$ and $\beta$ is conceptually useful, yet in practice, their exact values or ratios are difficult to estimate or control during continuous training. The conclusions drawn from the linear or generalized linear setting may not easily transfer to nonlinear or high-dimensional models. In essence, while the problem definition is good and the $\alpha$–$\beta$ reasoning is meaningful, it is unclear how the theory can concretely guide actual large-model training.
3. Experimental validation is narrow and idealized: The experiments (Sec. 4–5) mainly serve to verify the theory, but they do not provide further insights into realistic settings. Only two medium-sized models (OPT-350M and LLaMA-1B) and two datasets (SciQ, GSM8K) are used, with highly controlled data composition. The synthetic data are assumed to represent model outputs cleanly, without considering realistic mixtures of human and synthetic text (I know this is acknowledged in the limitations section). Scaling experiments or additional ablations (e.g., varying model size, task diversity, or realistic data proportions) would make the findings more convincing.
Overall, the experimental content is rather insufficient. The question itself is meaningful, but the paper does not provide much insight in terms of conclusions. However, considering that this might be a theoretical paper, it is difficult for me to assess the practical value of such a theory. Therefore, I would lower my confidence to mitigate the possible impact of this uncertainty.
You may refer to the content in the “weaknesses” section. If you can address the doubts raised there effectively, I will consider giving a higher score.
I hope valuable work will not be overlooked.
For example, in addition to theoretical explanations based on existing assumptions, it would be great if you could highlight some unique insights proposed in this paper.
Perhaps the paper already includes them, but I did not notice. |
Heavily AI-edited |
|
What happens when generative AI models train recursively on each others' outputs? |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors consider what happens when a collection of models is iteratively trained on a combination of public data, private data, and data generated by all the other models. They begin by arguing that this setup reflects reality by surveying training datasets for a variety of generative AI models. Then, they present a simplified linear regression model for this setting. They derive the bias and variance of the collection of models after $t$ training iterations. They find that the models all converge most efficiently when about half of their initial data is public and about half of their data in future iterations consists of prior-generation outputs. This prediction is validated in experiments training OPT and Llama, where each one has a private dataset (SciQ or GSM8k). Models are able to do transfer learning from each other's outputs.
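As a concrete illustration of the simplified setting, one can simulate the recursion and track how far each provider's estimate drifts from the true parameter. This is my own toy reconstruction: the Gaussian design, noise scale, least-squares refitting, and the way synthetic labels are generated by a randomly chosen provider are all assumptions, and the public/private (beta) split is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, T, K = 5, 200, 15, 2
alpha = 0.5          # fraction of synthetic labels per round (beta split omitted for brevity)
theta_star = rng.normal(size=d)

def fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Round 0: each provider fits on its own human-labeled data.
thetas = []
for k in range(K):
    X = rng.normal(size=(n, d))
    thetas.append(fit(X, X @ theta_star + rng.normal(scale=0.5, size=n)))

for t in range(1, T + 1):
    new_thetas = []
    for k in range(K):
        X = rng.normal(size=(n, d))
        y = X @ theta_star + rng.normal(scale=0.5, size=n)
        # A fraction alpha of the labels is "synthetic": produced by a random provider's
        # previous-round model instead of the ground-truth parameter.
        synth = rng.random(n) < alpha
        teachers = rng.integers(0, K, size=n)
        y[synth] = ((X[synth] * np.asarray(thetas)[teachers[synth]]).sum(axis=1)
                    + rng.normal(scale=0.5, size=int(synth.sum())))
        new_thetas.append(fit(X, y))
    thetas = new_thetas
    print(t, [round(float(np.linalg.norm(th - theta_star)), 3) for th in thetas])
```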
- The overall paper, presentation, and writing quality is high. This is a very well-executed research project.
- The research topic of recursive training dynamics with multiple models is important, timely, and interesting
- The model is well-designed (Figure 1 and Section 4 are great; I wish they had come two pages sooner)
- The experimental setup is clever
- What we learn in the multi-model setting is limited and follows expectations: the dynamics are just as we've seen in single-model collapse with accumulation, though different private data can lead to some transfer learning across the population
- The paper could do a better job of providing the intuitive takeaways from the theorems.
- The paper is fairly verbose and repetitive; the core of the paper doesn't begin until page 5
### Overall evaluation
This is a tricky paper to evaluate, as it's a very high-quality paper, but what we learn feels limited. Perhaps other reviewers will feel differently. This paper definitely deserves to be published and does contribute to the area of model collapse. The structure of the paper could be improved, getting to the contributions more quickly and providing clearer takeaways from the theory.
1. What exactly are the takeaways from the theoretical analysis? We see that under certain conditions, each model's estimate converges to the true parameter. Is Figure 2 the real takeaway from this section?
2. In addition to loss, were there also the same patterns in model accuracy on SciQ and GSM8k?
### Comments
- It's a bit jarring for the paper to go from discussing generative AI for so long and then jump to a linear regression model. The rationale for this simplification makes sense, but should be mentioned/justified earlier in the abstract/intro (e.g., in the context of related work).
- Section 3 has a lot of text, references, and tables for some well-known facts about model training. It could be summarized in a paragraph
- the motivation in the intro for why it matters to study recursive training among multiple models vs just one single model is lacking. The intro just says it has "received little attention," but doesn't explain why it matters that there are multiple models rather than just one. Are there new and different dynamics that occur? Otherwise we could just assume it's basically the same and that studying a single model training recursively is sufficient to understand what happens with multiple.
- Some of the intro and related work was repetitive.
- some in-text citations missing an author (gen, 2022), (app, 2024)
- from the related work, it wasn't clear what it means for all models to have a "bound in error $\pi^2/6$"
- Line 100: "long term effects" is unclear; what is meant is iterative training dynamics, right?
- Line 157: ... this doesn't seem overlooked. It seems like a widely understood fact that many models use CommonCrawl, arXiv, GitHub, Wikipedia, etc
- Figure 1 is very nice. |
Fully human-written |
|
What happens when generative AI models train recursively on each others' outputs? |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper investigates the phenomenon of data-mediated interactions among different generative AI models. Specifically, it studies what happens when different generative models are trained on each other's outputs, a realistic scenario given the increasing prevalence of AI-generated content on the internet. The authors first review evidence that modern large language models (LLMs) are trained on overlapping, internet-scraped datasets that increasingly contain synthetic text from other models. Building on this, they formalize an iterative, interactive training framework where multiple entities train models using mixtures of public, private, and synthetically generated data. The authors then give a theoretical analysis in a linear regression setting and derive closed-form dynamics for bias, variance, and convergence properties, showing that cross-model data sharing can promote homogenization while sometimes improving efficiency. Experiments using OPT-350M and LLaMA 3.2-1B fine-tuned on distinct tasks (SciQ and GSM8K) simulate multi-model interactions and confirm theoretical predictions: moderate mixing ($\alpha = \beta = 0.5$) yields the best balance.
The paper concludes that recursive cross-model training can both diversify and homogenize generative models.
- A novel problem. The paper's most significant contribution is the formalization of a new unstudied problem: "data-mediated interaction" within a multi-model systems. This shifts the research focus from the standard "model collapse" setting (a single model consuming its own outputs) to a more realistic and complex scenario where multiple heterogeneous models coexist and interact by training on a shared data pool containing each other's outputs.
- Comprehensive Methodology: The paper supports its claims with a comprehensive methodological approach that provides both theoretical and empirical evidence. The authors develop a formal theoretical model (a linear regression setting) to make analytical predictions about the system's long-term dynamics and (2) validate these predictions with a set of well-designed experiments using large language models (OPT-350m and Llama 3.2 1B).
- Drawback of the whole setting. The theoretical and empirical framework assumes that fine-tuning data for each model is randomly sampled according to ($\alpha-\beta$) proportions (new vs. old, public vs. private). However, in practice, major foundation models rely heavily on highly curated, high-quality fine-tuning datasets that are explicitly designed to avoid noise or low-quality synthetic data. This mismatch between the model’s random-mixing assumption and real-world fine-tuning practices limits the external validity of the results — particularly the conclusions about homogenization and performance degradation under synthetic data reuse.
- The theoretical analysis relies entirely on a linear regression model with Gaussian assumptions, which limits its generality to real-world large-scale nonlinear generative models. Although the authors cite “universality” results, the mapping from this toy model to practical LLM training remains speculative.
- After examining the released code (sft-config), it appears that each fine-tuning round uses a very small effective data volume: the batch size is 2, with gradient accumulation over two steps and only 100 optimization steps per generation. This means that each “generation” sees at most a few thousand training tokens, which is extremely small compared to realistic fine-tuning scales for modern LLMs. Consequently, the observed trends in “cross-model interaction” may reflect under-trained or noisy optimization dynamics rather than genuine long-term convergence effects. Moreover, the experiments involve only K = 2 interacting models and explore just three discrete values for both $\alpha$ and $\beta$ ({0, 0.5, 1}), providing too coarse a sampling to fully characterize the theoretical phase behavior. These limitations substantially weaken the empirical support for the paper’s broader claims about multi-model ecosystems.
- The paragraph beginning with 'Step 3: Model Updates' (lines 180-188) is misleading. The paper's description of "model updates" incorrectly claims that successive generations of models such as GPT-1/2/3/4 and LLaMA-1/2/3 are typically trained by initializing with the previous generation's weights. Merely using the same datasets to train a family of models does not imply that they directly descend from one another.
- The proofs contain several typographical and notational issues that impede verification and, in a few places, likely invalidate steps. See more in Questions.
- Please kindly consider Weakness 1. The interaction of different models is a novel problem, but the analysis framework in this paper is a little weak. Could you improve the problem setting and make it closer to reality?
- There is a grammar mistake in the paper's title. Use \textbf{each other's} instead of \textit{each others'}.
- I have a question about the proof of Lemma 1 (Appendix E.4). In line 1020, the derivation implicitly equates $(\Pi S \Pi)$ with $(I_K \otimes \underline{S})$. These two matrices are not equal. $(\Pi S \Pi)$ is a dense block matrix (specifically $\frac{1}{K}(J \otimes \underline{S})$). $(I_K \otimes \underline{S})$ is a block-diagonal matrix. If the equation holds for the spectral norm, I'd appreciate it if you could provide a detailed calculation.
- Typos. Here I list several obvious mistakes; please proofread the whole paper to improve the quality of the presentation.
- Line 276: its most recent parameter estimate $\hat\theta_{t-1,k}$ instead of $\hat\theta_{t-1,t}$.
- Line 291: in the rightmost term, $y_{t1}$ instead of $y_{tk}$.
- The definition of $S_*$ differs between the main text (Section 5) and Appendix E. |
Fully AI-generated |
|
Are complicated loss functions necessary for teaching LLMs to reason? |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 0:
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper looks into GRPO training for reasoning models. It tests three variants of GRPO: (1) positive-only reward, (2) GRPO without importance sampling, and (3) naive REINFORCE.
N/A
* There is a severe lack of novelty in the paper. The proposed RGRA is essentially GRPO without importance sampling.
* The paper is poorly composed; the results are not well organized. Figure 1 occupies an entire page without any accompanying analysis in the caption. Tables 1 and 2 are also poorly formatted, lacking proper bolding and explanations for abbreviations. From this standpoint alone, the paper feels far from complete.
* There is almost no discussion regarding the differences between RGRA and GRPO. Why does removing importance sampling and not using a clipping objective lead to better training?
* There is no comparison with related works at all (e.g., Dr.GRPO, DAPO, etc.).
* Overall, the paper lacks a clear motivation, shows limited novelty, and provides insufficient analysis of the results. Major revisions are needed. |
Fully human-written |
|
Are complicated loss functions necessary for teaching LLMs to reason? |
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper studies the different components of the GRPO (Group Relative Policy Optimization) loss and their role in improving reasoning in LLMs. The authors break down the loss into its main components—negative feedback, PPO-style clipping, and advantage estimation—and test simplified versions such as positive-only GRPO, their proposed REINFORCE with Group Relative Advantage (RGRA), and direct REINFORCE. Their experiments on small models (Qwen2.5-0.5B/1.5B and Llama3.2-1B) show that removing negative feedback leads to collapse and reward stagnation, while PPO clipping can be dropped without hurting performance.
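For concreteness, a schematic sketch of the objectives being compared (my own paraphrase, not the paper's code; the clipping threshold, the standard-deviation normalization, and the omission of the KL term are assumptions):

```python
import torch

def group_relative_advantages(rewards):
    """rewards: (G,) scalar rewards for G rollouts of the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def positive_only(advantages):
    """Positive-only variant: drops the negative-feedback signal entirely."""
    return torch.clamp(advantages, min=0.0)

def grpo_token_loss(logp_new, logp_old, adv, eps=0.2):
    """GRPO/PPO-style clipped surrogate for one rollout.
    logp_new, logp_old: (L,) per-token log-probs; adv: scalar group-relative advantage."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.minimum(unclipped, clipped).mean()

def rgra_token_loss(logp_new, adv):
    """REINFORCE with a group-relative advantage: no importance ratio, no clipping."""
    return -(adv * logp_new).mean()
```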
* Systematic studies like the one that the paper conducts is generally important for the community, especially for understanding RL post-training for LLMs.
* Authors test on two different model families and take care in evaluating on a comprehensive set of benchmarks split across Chinese/English and math/other subject domains.
* The model scale and setting (<=1.5B parameter models with LoRA fine-tuning) are limited, and it's unclear whether the findings extrapolate to larger model scales and full fine-tuning.
* In particular, prior work [1] seems to show a different result: positive-only reinforcement can be competitive with GRPO/PPO provided verifiable rewards are used and poor prompts are filtered. The findings from Xiong et al. are from larger models (7B-70B), which underscores the potential limitations of the model size and setting studied in this work.
* The reported results lack confidence intervals, which seem necessary to draw strong conclusions like the ones made in this work (in particular, how significant is the performance delta between RGRA and GRPO?). I'm sympathetic to the authors' limited compute constraining them to their training setup, but multiple seeds and further performance analysis (e.g., pass/majority@k performance) would strengthen their results.
While the paper’s goal of simplifying GRPO is well-motivated, the evidence feels too narrow and limited to support its strong claims about the necessity of negative feedback.
Minor: Some areas in the manuscript need `\citep` (Lines 169 and 263, to name a few).
[1] Xiong, Wei, et al. "A minimalist approach to llm reasoning: from rejection sampling to reinforce." arXiv preprint arXiv:2504.11343 (2025).
* The training dataset appears quite small (around 1.8k examples). Could the authors clarify why this size was chosen, and whether they observed any sensitivity to data scale?
* In the positive-only advantage setup, do the authors ensure that each batch contains enough positively rewarded samples for stable gradient estimation?
* In Figure 1(a), both REINFORCE and RAFT collapse only for the Qwen 0.5B model. Do you have an explanation for why this smaller model is unstable compared to the 1.5B and 1B variants? Could this have been mitigated with a cold start stage?
* Regarding clipping, what $\epsilon$ value was used for GRPO runs, and roughly what fraction of updates were actually clipped during training?
* The authors' results seem to differ from Xiong et al., who find that positive-only RAFT remains competitive with GRPO. Could you comment on the key differences in setup (e.g., model scale, filtering, or reward structure) or clarify the discrepancy? |
Fully human-written |
|
Are complicated loss functions necessary for teaching LLMs to reason? |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper investigates the necessity of complex loss functions, specifically Group Relative Policy Optimization (GRPO), for enhancing the reasoning capabilities of Large Language Models (LLMs). The authors conduct a systematic analysis of GRPO, an algorithm that combines group-relative advantage estimation, PPO-style clipping, and KL regularization.
The paper identifies two key findings:
Negative feedback is essential. Training solely on actions that outperform a baseline (i.e., positive-only advantages) or using simpler rejection sampling (RAFT) leads to training instability, performance collapse, and a failure to elicit reasoning behaviors.
PPO-style constraints are unnecessary. The analysis demonstrates that PPO-style components, such as policy ratio clipping, are not required to improve mathematical reasoning performance or maintain training stability.
Experiments across standard mathematical benchmarks indicate that RGRA achieves stable training dynamics and demonstrates stronger performance than the more complex GRPO, surpassing it in 17 out of 27 task comparisons.
This paper's originality is strong. The authors challenge the assumed necessity of all components within the successful GRPO framework. This "less is more" approach, which rigorously questions the utility of established components like PPO-style clipping, represents an original and valuable methodological contribution. The claims are substantiated by a comprehensive and robust body of empirical evidence, including extensive quantitative benchmarks across multiple model families and languages (Tables 1-3), crucial analysis of training dynamics and stability (Figure 1), and insightful qualitative analysis of emergent reasoning behaviors (Figure 2). The experimental setup is sound and provides convincing validation for the paper's conclusions.
The paper's primary weakness lies in its significant overgeneralization of claims from a narrow and limited experimental setup. The central conclusion that PPO-style constraints are "unnecessary" for teaching reasoning is drawn exclusively from experiments on small-scale models, ranging from 0.5B to 1.5B parameters. PPO's clipping mechanism was precisely designed to ensure stability during large, high-variance policy updates, which are a far greater concern in the state-of-the-art 70B+ models. The paper provides no evidence that its findings would hold in a large-scale setting, thus failing to adequately support its ambitious and broad claims.
Furthermore, the dismissal of baseline methods like RAFT and positive-only GRPO as inherently unstable is unconvincing. Their catastrophic collapse (shown in Figure 1) is observed on a minuscule training dataset of only 1,800 instances, which is highly susceptible to reward-hacking. More importantly, the paper fails to provide evidence of rigorous hyperparameter tuning for these baselines. Their collapse could simply be an artifact of a poorly chosen learning rate or insufficient KL regularization, rather than a fundamental flaw in the methods themselves. Without a proper hyperparameter sweep to find the most stable configuration for these baselines, the paper's conclusion is not fully substantiated.
The paper's central claim that PPO-style constraints are "unnecessary" is derived from experiments on relatively small-scale models (0.5B to 1.5B). Given that PPO's clipping mechanism was specifically designed to ensure stability for large scale policies with high variance updates, what justification or evidence can you provide that this finding will generalize to the 70B+ or 100B+ models where such stability constraints are traditionally considered critical?
The experiments are confined entirely to mathematical reasoning, a domain characterized by sparse and verifiable binary reward signals (i.e., correct/incorrect). How do you anticipate the stability of RGRA (which lacks clipping) will hold in standard alignment scenarios (e.g., helpfulness, safety) that rely on dense, non-stationary, and often noisy rewards from learned preference models?
The paper concludes that methods ignoring negative feedback (RAFT, GRPO-pos) are fundamentally unstable, citing their rapid collapse on a small 1,800-instance training set. Could you elaborate on the extent of the hyperparameter search (e.g., learning rate) conducted for these baselines? How can you be certain this collapse is an inherent flaw of the methods, rather than an artifact of sub-optimal tuning or a simple case of reward-hacking on a dataset small enough to be easily exploited?
The results suggest RGRA does not just match, but often outperforms GRPO. Since the key difference is the removal of the PPO clipping and policy ratio terms, what is the mechanism for this performance improvement?
The paper shows that standard REINFORCE with direct rewards collapses, even on the 1.5B model, which you use to underscore the necessity of advantage estimation. Could you clarify the tuning process for this specific baseline? Does this result definitively prove that any REINFORCE-style method without advantage estimation is doomed to fail in this setting, or could this collapse also be sensitive to hyperparameter choices? |
Lightly AI-edited |
|
Are complicated loss functions necessary for teaching LLMs to reason? |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 0:
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper does an ablation over the components of the GRPO loss function, namely advantage clipping, negative examples, and KL regularization.
This paper studies an important question in RL post-training, namely which components are required in the loss function to get the models to perform well. Based on their findings, the authors propose RGRA for LLM post training.
There are several major weaknesses with this paper. To begin, the framing of the paper is an ablation over the main components of the GRPO loss. However, there are several key components missing from this ablation:
- as far as I understand, the authors do not sweep over the hyperparameters of any of the baselines they run. Critically, for an ablation over components of GRPO, they do not sweep over the number of rollouts, nor over the number of off-policy steps taken by the algorithm (I am referring to doing multiple gradient steps over a set of rollouts for a given batch)
- the authors start from pretrained models, which can confound the results. Namely, [1] shows that the KL regularizer can have different effects based on the pretraining data
- the baselines are quite trivial, especially with respect to the length of the chain of thought required to solve the prompts. In general, the terms in the loss start to be significant for long chains or as the number of off-policy steps taken by the algorithm grows
Based on the above comments, I believe proposing RGRA is not currently backed by empirical results.
[1] Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
No questions. |
Fully human-written |
|
Model-Agnostic Text Condensation with Coherence Awareness |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes PInR, a model-agnostic framework for text dataset condensation. It optimizes representative and diverse particles in the embedding space using Stein variational principles and then converts them into coherent text through an invert-and-refine process. The key novelty lies in explicitly enforcing textual coherence, beyond readability, to improve reasoning-oriented tasks. Experiments on AG-News, SST-2, GSM8K, and Quora-QuAD show that PInR consistently outperforms existing methods, achieving better accuracy and distributional similarity with strong robustness across both understanding and reasoning benchmarks.
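For context, a Stein variational (SVGD-style) particle update of the kind invoked here looks roughly as follows. This is a generic sketch: the RBF kernel, bandwidth, step size, and `score_fn` (the gradient of the log target density over the embedding space) are placeholders rather than the paper's actual objective.

```python
import numpy as np

def svgd_step(particles, score_fn, step=0.1, bandwidth=1.0):
    """One Stein variational update: the kernel-weighted score term pulls particles
    toward high-density regions (representativeness), while the kernel-gradient term
    repels them from each other (diversity).

    particles: (m, d) embedding-space particles; score_fn: (m, d) -> (m, d)."""
    diff = particles[:, None, :] - particles[None, :, :]          # (m, m, d)
    K = np.exp(-(diff ** 2).sum(-1) / (2 * bandwidth ** 2))       # (m, m) RBF kernel
    gradK = diff / bandwidth ** 2 * K[:, :, None]                 # d k(x_i, x_j) / d x_j
    phi = (K @ score_fn(particles) + gradK.sum(axis=1)) / len(particles)
    return particles + step * phi
```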
1. The paper is well-motivated and introduces coherence as a fundamental property in text condensation.
2. Its theoretical foundation using Stein variational optimization is principled and general.
3. The experimental evaluation is extensive and convincing, showing clear improvements across tasks.
4. The framework is modular and reusable and bridges embedding-based optimization with text-level refinement effectively.
1. The method relies heavily on external LLM APIs for refinement, which challenges reproducibility and model-agnostic claims.
2. The definition of coherence is heuristic rather than formally measurable.
3. Evaluation is limited to medium-scale datasets, and the approach may struggle with longer or multi-turn sequences.
4. Privacy and efficiency analyses are discussed but not rigorously demonstrated.
1. How sensitive is the refinement quality to the specific API used, and could weaker open-source models achieve similar results?
2. How might the framework handle long or multi-turn textual data where coherence is more complex? |
Fully AI-generated |
|
Model-Agnostic Text Condensation with Coherence Awareness |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces PInR, a novel framework for model-agnostic text condensation that explicitly integrates coherence as a key property alongside representativeness and diversity. The method optimizes “informative particles” in a semantic embedding space and converts them back into discrete text through an Invert-and-Refine (InR) process using an API-based coherence refinement. Experiments on understanding (AG-News, SST-2) and reasoning (GSM8K, Quora-QuAD) tasks show consistent gains over prior methods like DaLLME, MGD3, and Aug-PE. The authors also analyze privacy implications, computational cost, and ablation on coherence.
1. Introducing coherence as an explicit property in text condensation is both intuitive and impactful, especially for reasoning tasks. It moves beyond the typical focus on readability and directly links data quality to logical soundness.
2. The two-stage PInR pipeline is well-motivated and mathematically grounded. The use of kernelized updates and variational inference offers a principled way to approximate text distributions.
3. PInR does not depend on specific downstream architectures, and experiments span both classification and reasoning models.
1. The refinement stage relies on commercial LLM APIs (e.g., GPT-3.5), which raises questions about reproducibility and fairness of comparison, even though the authors mention cost control.
2. The Stein particle optimization in high-dimensional embedding spaces may become computationally expensive. Runtime or memory analyses are missing, and comparisons with lighter methods (e.g., clustering-based condensation) on efficiency would be valuable.
See above. |
Heavily AI-edited |
|
Model-Agnostic Text Condensation with Coherence Awareness |
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the problem of model-agnostic text condensation (MaTC) by introducing "coherence" as a critical property, arguing it is more stringent than mere "readability" and essential for downstream tasks. The paper then proposes PInR, a two-stage framework to generate a small dataset that embodies three key properties: representativeness, diversity, and coherence. The first stage optimizes a set of "particles" in the semantic embedding space using a Stein-based method to ensure representativeness and diversity. The second stage, Invert-and-Refine (InR), decodes these particles into initial text and then uses an API-assisted refinement process to enforce coherence while ensuring the final text remains semantically faithful to the optimized particle. Experiments across text understanding and reasoning benchmarks demonstrate that PInR outperforms existing SOTA baselines.
1. The PInR framework is logically designed. It decouples the problem by first optimizing for representativeness, diversity in a continuous embedding space and then separately enforcing coherence in the discrete text space.
2. The ablation study provides empirical support for the paper's central hypothesis. Removing the refinement step leads to a drastic performance degradation, especially on reasoning datasets, confirming the necessity of this component.
1. The framework's effectiveness is critically dependent on powerful, black-box APIs at every stage. It uses text-embedding-ada-002 as the encoder and outsources its core "coherence" contribution to a strong generative API. Without any ablations on open-source models, it's impossible to tell if the PInR method is good, or if the APIs are just doing all the work.
2. The "warm start" efficiency argument seems to be flawed. The paper's own inverted text example in Appendix B.2 is nonsensical and logically flawed. This suggests the API isn't refining this input but is just ignoring it and re-generating from scratch, which invalidates the entire efficiency claim.
3. The Random baseline is exceptionally strong and, in several ICL settings for reasoning, it outperforms PInR. Given that Random is computationally trivial, the practical value of this complex, multi-stage, API-dependent synthesis method is questionable.
The authors should respond to and provide explanations for the weaknesses mentioned above. |
Lightly AI-edited |
|
Model-Agnostic Text Condensation with Coherence Awareness |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors propose a novel method for text condensation, where the original dataset is converted into embeddings, condensed samples are generated in the embedding space, and these condensed samples are inverted back into text. The proposed method contains two main contributions. First, it introduces a novel condensed sample generation technique that optimizes condensed samples to make their distribution closer to that of the original samples. Second, they propose an invert-and-refine approach using an API to improve the coherence of the inverted text. Experiments demonstrate that the proposed method achieves the best performance compared to existing text condensation methods and core set selection methods.
1. The authors focus on coherence in text condensation, which is a crucial aspect of inference tasks. While DaLLME (Tao et al., 2024) considered the readability of condensed samples, it did not address coherence. This paper is the first to explicitly take coherence into account.
2. The authors propose a method to optimize condensed samples so that their distribution becomes closer to that of the original samples, introducing an invert-and-refine approach using an API. These methods are simple yet intuitive and contribute to performance improvements.
3. The proposed method achieves superior performance compared to existing approaches, including coreset selection (Random) and three condensation methods (DaLLME, MGD3, and Aug-PE). Text condensation is a particularly promising technique in an era of rising model training costs, and its potential impact is substantial.
1. The authors should conduct a more comprehensive comparison with coreset selection methods. Although their method addresses the issue of coherence loss, this problem arises specifically from generating synthetic data. In contrast, coreset selection methods do not suffer from this issue because the samples are drawn from real training data, which inherently ensures coherence. In particular, (Maekawa et al., 2025b) demonstrated that K-centers and Herding outperform random selection as coreset selection strategies. To clarify the importance of generating synthetic data, the authors should include comparisons with K-centers and Herding.
2. Similarly, the authors should compare INVERT-AND-REFINE with a simple nearest-neighbor search. Specifically, they should evaluate performance by selecting the real data point closest to the optimized embedding, which can be formulated as: $\tilde{x}_j = \arg\min_{x} d(\psi(x), \tilde{e}_j) \quad \text{s.t. } x \in \mathcal{T}_o$ (a minimal sketch of this baseline is given after this list). Since this approach selects samples directly from the real training set, it naturally ensures coherence. Evaluating this strategy would help clarify the importance of generating synthetic data.
3. The difference between the condensed-sample optimization method represented by Equation (2) and existing condensed-sample optimization methods is not explained. The authors should describe which condensed-sample optimization methods DaLLME, MGD3, and Aug-PE employ and explain why the method represented by Equation (2) is superior to these existing approaches.
4. The compared method MGD3 is a condensation method for images. The authors should clarify how they apply it to text condensation.
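As referenced in point 2 above, a minimal sketch of that nearest-neighbor baseline; the `encode` function and the Euclidean distance are placeholders for whatever embedding model and metric would actually be used.

```python
import numpy as np

def nearest_real_samples(optimized_embeddings, real_texts, encode):
    """For each optimized particle, return the closest real training text, which is
    coherent by construction because it is selected rather than synthesized.

    optimized_embeddings: (m, d) particles from the condensation stage.
    real_texts: list of n original training texts.
    encode: callable mapping a list of texts to an (n, d) embedding matrix.
    """
    real_embeddings = encode(real_texts)                                   # (n, d)
    dists = np.linalg.norm(
        optimized_embeddings[:, None, :] - real_embeddings[None, :, :], axis=-1
    )                                                                      # (m, n)
    return [real_texts[j] for j in dists.argmin(axis=1)]
```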
Please add explanations regarding the weaknesses. |
Lightly AI-edited |
|
Perishable Online Inventory Control with Context-Aware Demand Distributions |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper considers an online perishable inventory problem with contextual information. Because of perishability there is no inventory carryover, so the problem setup is more akin to a newsvendor inventory problem in a bandit setting. The authors consider three types of demand estimation and derive regret bounds correspondingly: linear demand, neural-network-parametrized demand, and a kernel-density-based demand model.
- Online inventory control is an important application domain, and the authors also conduct experiments on real data.
- The problem setting regarding the demand information is very strong, i.e., full demand information without demand censoring. Given that censored feedback is a vital issue in inventory control, the setup is too simple and not realistic. Importantly, a key prior work considers censored demand feedback, and this is also considered in the non-contextual case.
- The overall organization lacks clarity. There is no unified framework covering the three types of demand functions considered; instead, the paper is a collection of existing estimation-accuracy results for three offline settings, with straightforward adaptation to the online setting.
[1] Ding J, Huh WT, Rong Y. Feature-based inventory control with censored demand. Manufacturing & Service Operations Management. 2024 May;26(3):1157-72.
[2] Zhang W, Li C, Qin H, Xu Y, Zhu R. Thompson Sampling for Repeated Newsvendor. arXiv preprint arXiv:2502.09900. 2025 Feb 14.
- Given the current setting with full information and no demand censoring, how is it qualitatively different from contextual continuum-armed bandit problems? Are there any particular challenges or methodological advances in this work?
- Can the authors articulate more on the motivation for context-dependent noise? Given that the mean demand is context-dependent and the newsvendor loss has a simple form, what is the necessity of this particular modeling choice? It would also be helpful to give some examples of the noise distribution.
- Line 8 in Algorithm 2 appears to contain a typo. Does this affect the proof argument?
- The numerical study lacks natural baseline comparisons. How would a generic online quantile regression approach work? Some more implementation details, such as the neural network size and kernel bandwidth, would also help with reproducibility. |
Fully human-written |
|
Perishable Online Inventory Control with Context-Aware Demand Distributions |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses online contextual inventory control for perishable goods under a more realistic setting where both the expected demand and noise distribution depend on observable features. The authors propose an algorithm combining demand mean estimation (via ridge regression or neural networks) with kernel regression for CDF estimation. They establish a minimax regret lower bound of $\Omega(\sqrt{dT} + T^{(p+1)/(p+2)})$ and prove their algorithm achieves near-optimal regret $\tilde{O}(\sqrt{dT} + T^{(p+1)/(p+2)})$ for linear demand and $\tilde{O}(\sqrt{\alpha T} + T^{(p+1)/(p+2)})$ for nonlinear demand. Under additional regularity conditions, the exponential term improves to $p\sqrt{T}$.
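To ground the algorithm description, here is a schematic sketch of the plug-in decision step as I understand it; `mean_hat` would come from the ridge/neural-network stage, and the Gaussian kernel, bandwidth, grid resolutions, and cost values `h`, `b`, `M` are illustrative assumptions rather than the paper's choices.

```python
import numpy as np

def nw_cdf(z_query, z_hist, residuals, bandwidth=0.5, n_grid=200):
    """Nadaraya-Watson estimate of the conditional noise CDF at feature z_query,
    built from historical features z_hist (n, p) and estimated residuals (n,)."""
    w = np.exp(-np.linalg.norm(z_hist - z_query, axis=1) ** 2 / (2 * bandwidth ** 2))
    w = w / (w.sum() + 1e-12)
    grid = np.linspace(residuals.min(), residuals.max(), n_grid)
    cdf = np.array([(w * (residuals <= y)).sum() for y in grid])
    return grid, cdf

def plugin_order(mean_hat, z_t, z_hist, residuals, h=1.0, b=2.0, M=50.0, n_c=500):
    """Order quantity minimizing the plug-in newsvendor cost (overage h, underage b)
    via a simple grid search with Riemann-sum integration."""
    grid, cdf = nw_cdf(z_t, z_hist, residuals)
    ys = np.linspace(0.0, M, n_c)
    Q = np.interp(ys - mean_hat, grid, cdf, left=0.0, right=1.0)   # Q_hat(y - D_hat)
    dy = ys[1] - ys[0]
    costs = [h * Q[:i].sum() * dy + b * (1.0 - Q[i:]).sum() * dy for i in range(n_c)]
    return ys[int(np.argmin(costs))]
```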
I am very familiar with both the inventory control literature and contextual bandit/online learning theory. I have carefully verified the mathematical arguments and compared with related work in both areas. The technical content is sound, and my assessment focuses primarily on presentation, practical applicability, and positioning relative to existing literature.
References
[1] Ding, J., Huh, W. T., & Rong, Y. (2024). Feature-based inventory control with censored demand. Manufacturing & Service Operations Management, 26(3), 1157-1172.
[2] Bu, J., Simchi-Levi, D., & Xu, Y. (2020, November). Online pricing with offline data: Phase transition and inverse square law. In International Conference on Machine Learning (pp. 1202-1210). PMLR.
[3] Ao, R., Jiang, J., & Simchi-Levi, D. (2025). Learning to Price with Resource Constraints: From Full Information to Machine-Learned Prices. arXiv preprint arXiv:2501.14155.
⦁ Clear and compelling motivation: The introduction effectively argues why standard homoskedastic assumptions are unrealistic, with concrete examples (umbrella sales, binomial demand) illustrating when heteroskedasticity naturally arises.
⦁ Strong theoretical completeness: The paper provides both minimax lower bounds and matching upper bounds (up to logarithmic factors), offering a complete characterization of the problem's fundamental difficulty. The analysis elegantly decomposes the problem into mean estimation and non-parametric regression components.
⦁ Novel algorithmic approach: The core idea of using hierarchical elimination to ensure conditional independence while combining contextual bandits techniques with kernel regression is creative and technically sound.
⦁ Extension to nonlinear demand: The treatment of nonlinear demand via neural networks and NTK analysis represents a significant advance, as prior online inventory control literature has focused primarily on linear models.
⦁ Empirical validation: Experiments on both synthetic and real-world (M5 dataset) data demonstrate practical superiority over OSGD baselines, particularly under heteroskedastic conditions.
Algorithmic clarity and intuition:
⦁ The three if-else conditions in Algorithm 2 lack detailed intuitive explanations. More high-level discussion of the scenarios corresponding to each condition and the motivations for the associated actions would significantly improve accessibility.
⦁ The connection to hierarchical elimination from the bandit literature (Auer 2002, Chu et al. 2011) could be explained more explicitly to clarify what is borrowed versus what is novel for this specific inventory control setting.
Analysis of OSGD failure modes:
⦁ While the paper claims OSGD fails under context-dependent noise, a more concrete analysis of how and why OSGD deteriorates would strengthen the motivation. Does it incur linear (in $T$) regret, or merely a worse rate? Specific examples demonstrating failure would be valuable.
Strong assumptions:
⦁ Known feature mapping $\psi(x_t)$: The assumption that the practitioner knows the correct feature mapping determining noise dependence is quite strong. The paper acknowledges this briefly but does not adequately address the practical challenge of discovering this mapping. How sensitive is performance to misspecification of $\psi$?
⦁ Over-parameterization for neural networks: The width requirements ($w = \tilde{\Omega}(\text{poly}(T, K, \lambda^{-1}))$) for the NTK analysis are extremely large and impractical. While the experiments show small networks work well, the theory-practice gap deserves more discussion.
Limited comparison with related work:
⦁ The paper would benefit from more detailed comparison with the contextual bandit literature (Auer 2002, Chu et al. 2011, Han et al. 2025, Wen et al. 2025) to make the unique difficulties of the inventory control setting clearer.
⦁ Discussion of how offline data or machine learning oracles (references [2], [3] in the review) could improve the minimax results would provide valuable practical direction.
Experimental limitations:
⦁ Real-world experiments focus only on food items from the M5 dataset. Evaluation on other categories or datasets would strengthen generalizability claims.
⦁ The synthetic experiments use relatively simple noise structures (linear, binomial, sinusoidal). More complex heteroskedastic patterns could better stress-test the algorithm.
Missing practical guidance:
⦁ How should practitioners determine the feature mapping $\psi$ in practice? The umbrella example suggests domain knowledge, but more systematic approaches would be valuable.
⦁ Guidance on bandwidth selection and hyperparameter tuning for the kernel regression component is limited; a generic illustration of what such guidance could look like is sketched below.
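To make concrete the kind of guidance I have in mind, here is a generic leave-one-out cross-validation heuristic for choosing the Nadaraya–Watson bandwidth. This is standard nonparametric practice, not something proposed in the paper; the function names and the Brier-type scoring rule are my own choices.

```python
import numpy as np

def loo_cv_bandwidth(Z, residuals, candidate_bandwidths, eval_points):
    """Leave-one-out CV for the Nadaraya-Watson CDF bandwidth (generic heuristic).

    Z: (t, p) noise features, residuals: (t,) estimated noise samples,
    candidate_bandwidths: iterable of positive floats,
    eval_points: grid of noise values at which CDF accuracy is scored.
    """
    t = len(residuals)
    best_bw, best_score = None, np.inf
    for bw in candidate_bandwidths:
        score = 0.0
        for i in range(t):
            w = np.exp(-0.5 * (np.linalg.norm(Z - Z[i], axis=1) / bw) ** 2)
            w[i] = 0.0                       # leave observation i out
            w = w / max(w.sum(), 1e-12)
            for e in eval_points:
                # Brier-type score: squared error of the held-out CDF estimate
                # against the indicator 1{residual_i <= e}.
                cdf_hat = float(np.sum(w * (residuals <= e)))
                score += (cdf_hat - float(residuals[i] <= e)) ** 2
        if score < best_score:
            best_bw, best_score = bw, score
    return best_bw
```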
See above. |
Fully AI-generated |
|
Perishable Online Inventory Control with Context-Aware Demand Distributions |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper studies online contextual inventory control with perishable goods when both the demand mean and the residual noise distribution depend on observable features. The key departure from prior work is to allow heteroskedastic, context-dependent noise via a $p$-dimensional feature vector $z_t$ (possibly a transform or subset of $x_t$), which makes the optimal order quantity no longer linear in $x_t$ and breaks the usual OSGD approach that relies on linearity under i.i.d. noise. The authors propose a plug-in, estimation-to-decision algorithm: estimate the conditional mean demand $f^*(x_t)$ with either ridge regression (linear case) or over-parameterized neural networks (nonlinear case), then estimate the conditional CDF of the noise $Q(\cdot; z)$ using a Nadaraya–Watson estimator built from estimated residuals and (possibly noisy) features, and finally choose the inventory $c_t$ by minimizing the corresponding plug-in newsvendor objective $\tilde{\ell}_t(c)=h\int_0^c \hat{Q}_t(y-\hat{D}_t)\,dy + b\int_c^M \big(1-\hat{Q}_t(y-\hat{D}_t)\big)\,dy$. Theoretically, for linear $f^*$, they prove a regret upper bound $\tilde{O}(\sqrt{dT} + T^{\frac{p+1}{p+2}})$ and a matching minimax lower bound $\Omega(\sqrt{dT}+T^{\frac{p+1}{p+2}})$, identifying $p$ as an intrinsic dimension of the noise's context dependence. For nonlinear $f^*$, they train a sufficiently wide and deep network (NTK regime) and obtain regret $\tilde{O}(\sqrt{\bar{d}\,T}+T^{\frac{p+1}{p+2}})$, where $\bar{d}$ depends on the effective dimension of the NTK kernel and the function norm of $f^*$. Under an additional Fourier-decay condition on the feature/noise distributions, the nonparametric term improves to $\tilde{O}(\sqrt{pT})$. Experiments on synthetic settings and the M5 dataset show that the proposed method outperforms an OSGD baseline when the noise is heteroskedastic or the mean is nonlinear.
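As a clarifying aside (my own restatement of a standard newsvendor fact, not the authors' derivation): wherever $\hat{Q}_t$ is continuous, differentiating the plug-in objective gives
$$\frac{d}{dc}\tilde{\ell}_t(c) = h\,\hat{Q}_t\big(c-\hat{D}_t\big) - b\Big(1-\hat{Q}_t\big(c-\hat{D}_t\big)\Big),$$
so the first-order condition is the critical-fractile equation $\hat{Q}_t\big(c_t-\hat{D}_t\big) = \frac{b}{b+h}$, i.e., at interior solutions $c_t = \hat{D}_t + \hat{Q}_t^{-1}\big(\tfrac{b}{b+h}\big)$: the chosen inventory is the estimated mean demand plus a context-dependent quantile of the estimated noise distribution. This makes clear why the quality of the conditional CDF estimate, and not only of the mean estimate, drives the regret.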
(1) Handling nonlinear demand in contextual inventory control via neural networks: a regret bound in the NTK/over-parameterized regime that ties complexity to an effective dimension and the function norm of $f^*$.
(2) Rigorous proofs of matching upper and lower regret bounds $\tilde{O}(\sqrt{dT}+T^{\frac{p+1}{p+2}})$, closing the information-theoretic gap.
(3) Empirical results support the claim that OSGD breaks down under heteroskedastic noise or nonlinear means.
The authors are encouraged to disagree with my points; convincing rebuttals will lead to a higher evaluation.
(1) The novelty of the nonlinear part lies mostly in assembling well-known tools. The paper's nonlinear extension is interesting for this domain, but methodologically it relies on standard statistical covering arguments and NTK-based local linearization; the hypothesis space here is not especially exotic, and the complexity control mostly follows known templates. The value lies in adapting these tools to perishable inventory with heteroskedasticity, rather than in a fundamentally new learning technique.
(2) The assumption stack is heavy for the nonlinear case. The NTK analysis requires very wide networks, carefully chosen stepsizes, and a non-singular NTK over the random contexts. The required over-parameterization and effective-dimension conditions may be far from practice, even if small networks work empirically; the theory may overstate what is needed in real deployments.
1. On the role and selection of $p$: In practice, how should one choose or validate the feature map $z_t$ and its dimension $p$? For instance, when using $\hat{z}_t := \hat{f}_t(x_t)$ (i.e., $p=1$), are there diagnostics that indicate when this is insufficient, and would you recommend simple expansions like $z_t = [\hat{f}_t(x_t), x_{t,j}]$ for a few coordinates $j$?
2. Sensitivity to local density via $\hat{f}_a(z)$: The pointwise error bounds scale with $1/\hat{f}_a(z)$, and you control this effect through a probability argument. Empirically, did you observe instability when $z_t$ visits low-density regions (e.g., early rounds or rare contexts)? Do you implement any floor on $\hat{f}_a(z)$ or adaptive bandwidths to stabilize the estimates?
3. Disclosure of LLM usage: Since an explicit statement on LLM usage is required for this work, could you include a paragraph describing it? |
Heavily AI-edited |
|
Perishable Online Inventory Control with Context-Aware Demand Distributions |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper studies an online demand learning problem with perishable items where the decision maker (DM) observes a context $x_t$, selects an inventory level $c_t$, and then observes the full demand $D_t = f_\star(x_t) + \epsilon_t$.
Unlike previous works, this paper assumes that the noise follows a context (feature)-dependent conditional distribution
$\epsilon_t \sim Q(\cdot; z_t)$, where the feature $z_t$ is a transformation of the context $x_t$. This setting departs from previous works that assume i.i.d. or homoskedastic noise. They establish the minimax regret bound for the linear expected demand model with heteroskedastic noise. Moreover, under a mild regularity condition (a Fourier-smoothness assumption), the regret bound improves to $\tilde{O}(\sqrt{\alpha T} + p\sqrt{T})$. Additionally, they also derive a regret upper bound for the nonlinear model.
1. Feature-dependent noise modeling.
- The key novelty lies in modeling heteroskedastic, feature-dependent noise rather than assuming i.i.d. noise. This is a realistic modeling choice, and the paper clearly motivates how noise variability depends on contextual features in practical demand systems.
2. Tight regret bound for online demand learning with context-dependent noise.
- The regret bound clearly decomposes into a parametric term $\tilde{O}(\sqrt{\alpha T})$ and a nonparametric term $\tilde{O}(T^{(p+1)/(p+2)})$. In particular, for the linear case, this bound matches the minimax regret lower bound.
The theoretical results appear technically sound, and no major issues were found.
However, several aspects of the presentation could be improved.
1. Justifying that Assumption 1 (Lipschitz CDF) is a standard and mild regularity assumption (as commonly used in nonparametric statistics) would be helpful.
2. A comparison with Ding et al. (2024) would strengthen the discussion. Ding et al. (2024) assumes i.i.d. noise and achieves $O(d\log T)$ regret, while this paper obtains $O(\sqrt{dT})$ under feature-dependent heteroskedastic noise.
Explaining this difference—e.g., that $\sqrt{dT}$ arises from jointly estimating $\theta^*$ and the noise structure (as noted in Appendix A.1)—would improve interpretability.
3. The experimental section is somewhat limited. For example, evidence that the real-world data actually exhibit heteroskedastic noise is needed. The explanation of Figure (a) also needs to be more detailed.
1. Comparison to Ding et al. (2024)
- Does the factor $\sqrt{T}$ arise from estimating the true heteroskedastic noise? Why does your result improve over Ding et al. (2024) by a factor of $\sqrt{d}$? Can you compare your results with Ding et al. (2024)?
2. Numerical results
- Why does the NN estimator sometimes outperform the Ridge regression estimator, while in other cases the Ridge estimator performs better, as observed in Figures (b) and (c)? |
Fully human-written |