ICLR 2026 - Reviews

SubmissionsReviews

Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 15899 (21%) 4.43 3.58 3687
Heavily AI-edited 3233 (4%) 4.22 3.59 2990
Moderately AI-edited 7082 (9%) 4.20 3.61 2722
Lightly AI-edited 16648 (22%) 4.15 3.68 2746
Fully human-written 32938 (43%) 4.13 3.62 2917
Total 75800 (100%) 4.21 3.62 3026
Title Ratings Review Text EditLens Prediction
Frequency-Balanced Retinal Representation Learning with Mutual Information Regularization Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The presented work postulates that there is a discrepancy between the most salient visual features in medical images and those learned by current representation learning techniques. In particular, the authors claim that the prevalent masked autoencoder disregards high-frequency features related to retinal pathology in color fundus photographs. To demonstrate this, they rank image patches by the amount of high-frequency content. Subsequently, they show that patches with reduced high-frequency content shape latent space structure while providing substantially less information for downstream tasks. In response to this observation, the authors propose RetMAE, an extension of the loss function of the established masked autoencoder. RetMAE regularizes the latent space by maximizing its mutual information with embeddings of patches with increased high-frequency content. In experiments using four ophthalmological datasets, the authors show that RetMAE outperforms various baselines that rely on a basic masked autoencoder. This effect persists when auxiliary signals, such as text or a pre-trained vision encoder, are included as learning signal, albeit in a diminished capacity. - The work’s main motivation that current pre-training paradigms for vision transformers result in suboptimal feature extractors when applied to medical images is very interesting. The authors convincingly support this hypothesis in a set of initial experiments (Figure 1, Section 4, Supplementary Material A1), showing that salient image features differ in natural and medical images, and that masked image modeling has an inductive bias towards low-frequency features. - The provided solution to this problem is theoretically well founded and experimentally shown to improve performance. As such, it has the potential to benefit the large scientific community in the field of medical image analysis. - The clinical application of ophthalmological image analysis is very well selected. Many retinal diseases manifest as small pathologies, resulting in high-frequency image features. Furthermore, there are several prominent works that have used large-scale pre-training of masked autoencoders to derive ophthalmological foundation models. - The paper is clearly structured, nicely illustrated and generally well written. Additionally, the authors include extensive supplementary material that provides in-depth technical detail about the method and experimental setup. - The motivating hypothesis that medical images contain more salient high-frequency image features is only explored for color fundus photographs. The importance and reach of the work would substantially increase if similar findings were shown for other types of data. Similarly, the efficiency of the proposed solution is only demonstrated for color fundus photographs so that it is unclear whether the proposed method seamlessly translates to other settings or requires extensive hyperparameter tuning for both the extraction of high-frequency information and the loss weighting. - The proposed solution is highly complex. In particular, it relies on estimation of mutual information via an Donsker-Varadhan estimator, which is known to numerically unstable (Belghazi, Mohamed Ishmael, et al. "Mutual information neural estimation." International conference on machine learning. PMLR, 2018). I could envision that conceptionally much simpler solutions exist that emphasize high-frequency features. For example, one could adjust the masking scheme to prioritize patches with increased high-frequency content (similar to Xie, Jiahao, et al. "Masked Frequency Modeling for Self-Supervised Visual Pre-Training." The Eleventh International Conference on Learning Representations). Alternatively, one could provide the high-pass-filtered image as additional input (Wang, Wenxuan, et al. "Fremim: Fourier transform meets masked image modeling for medical image segmentation." Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2024). The authors should discuss the aforementioned works in more detail and include them as baselines. - At the moment, the performance of the proposed method is only quantified via linear probing using the latent representations. I believe that full fine-tuning should also be conducted considering that the ultimate downstream performance matters most in applied domains such as medical image analysis. - Considering that most ophthalmological foundation models make their training code and weights public, I believe that the authors should strongly consider doing the same. - Several core concepts of the paper are only very briefly introduced or require consultation of the supplementary material. I suggest that the section on the interpretation of the masked image modeling objective through the lens of a Lagragian is slightly extended so that it can be understood without consulting previous works by Huang et al. and Tishby and Zaslasky. Similarly, the frequency score calculation should be briefly introduced in the main manuscript considering its vital importance, instead of only being introduced in the supplementary material. - Additionally, I struggled with the notation on several occasions. Already in the very first mathematical paragraph, the variable $N$ is overloaded, $D$ not introduced, and mutual information $I$ is not defined. Later, the use of $X$ varies to signal whether it denotes input or decoded image tokens. The authors should carefully parse the manuscript once again, aiming to improve the clarity of its mathematical passages. - On a minor note, the acronym CKA is not introduced at its first appearance in the introduction section. Fully human-written
Frequency-Balanced Retinal Representation Learning with Mutual Information Regularization Soundness: 1: poor Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes a frequency-balanced masked autoencoder framework (RetMAE) that enhances retinal representation learning by introducing a high-frequency mutual information regularizer to emphasize clinically critical high-frequency structures while suppressing redundant low-frequency content. This paper does present a clear motivation that high-frequency structures are clinically important in retinal imaging, and it attempts to incorporate this insight through a mutual-information-based regularizer. And the presentation is good. (1) According to Figure 2, the loss L_hmi proposed in this paper and the loss L_rec in the original MAE appear to optimize the latent feature in two fundamentally different directions. Specifically, L_rec aims to optimize 𝑧 such that it can reconstruct the full Image from Image_mask1. In contrast, L_hmi optimizes 𝑧 to enable the reconstruction (or transformation) of Image_mask2 from Image_mask1, where mask2 is generated through the high-frequency masking strategy proposed by the authors. This is clearly contradictory and constitutes the most significant issue. (2) Were all evaluations in Table 1 conducted on the standard MAE model? The results shown in this table seem to indicate that the standard MAE already has a strong ability to represent high-frequency information, which contradicts the authors’ claim of “under-encoding high-frequency diagnostic structures.” For example, in the T_low row of Table 1, even after masking 25% of high-frequency information, the CKA value remains as high as 0.990, indicating that the model still retains stable reconstruction capability for high-frequency components. Conversely, in the T_high row, when high-frequency information is used as input, the model shows low CKA because it cannot reconstruct the full image, which is expected. However, the AUROC increases, demonstrating that high-frequency information is highly discriminative; when low-frequency, lesion-irrelevant content is removed, the model’s prediction accuracy improves. Therefore, the evidence presented in Table 1 may actually support the importance of high-frequency features rather than demonstrating the insufficiency of MAE in encoding them. I believe the authors have not conducted a sufficiently thorough investigation in this aspect. (3) As highlighted in Comment (1), there is a potential conflict between the two loss terms. The authors should explicitly report the training weights assigned to each loss or provide a sensitivity analysis (e.g., a performance graph under different loss weight configurations) to demonstrate the impact of the loss balance on model performance. (4) The innovation of the proposed method appears to be limited, as it essentially adds a frequency-based loss on top of MAE, while the use of high-frequency representations to capture lesion-related features in retinal images has already been explored in numerous prior studies. If the authors can provide a clear theoretical justification or empirical evidence resolving the contradictions I raised—particularly regarding the compatibility of the two loss objectives and the interpretation of Table 1—I am willing to reconsider my rating and increase my score accordingly. Were all evaluations in Table 1 conducted on the standard MAE model? Lightly AI-edited
Frequency-Balanced Retinal Representation Learning with Mutual Information Regularization Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper focuses on the high-frequency information in fundus images that is clinically relevant. The authors introduced an information-theoretic auxiliary supervision into the MAE pretraining paradigm to guide the encoder toward clinically important regions, without requiring architectural modifications. The overall logic—from problem identification to the proposed solution and experimental validation—is clear and well-structured. 1. The paper presents a clear and coherent logical flow from motivation and problem formulation to analysis, proposed solution, and experimental validation, making it easy to follow and understand. 2. It offers new and interesting insights into model training for ophthalmic applications. 3. The theoretical derivation of the proposed method is sound, and the approach itself is easy to reproduce. 1. Limited novelty. The inspiration of this work appears to be directly derived from Huang, Tao, et al. “Learning Mask Invariant Mutual Information for Masked Image Modeling.” arXiv preprint arXiv:2502.19718 (2025). Although this paper is cited, I would still like the authors to explicitly clarify which parts of the current method are independently proposed. 2. I have concerns regarding the generalizability of the proposed approach. (i) The low-pass filtering property originates from the ViT architecture itself, and this phenomenon is not unique to MAE. (ii) The application scenario in this work is limited to color fundus photography. From the perspective of developing a robust retinal foundation model, MAE is not the only viable paradigm even within the image modality. For example, VisionFM, which follows the iBOT framework, also builds a powerful image encoder. From the standpoint of understanding the mechanism of MAE, this paper does not provide new insights. The authors need to further elaborate on the substantive contribution of their work. 3. The performance evidence is limited. The chosen downstream tasks are relatively few and of low difficulty (e.g., binary classification of DR, glaucoma, and AMD). Considering that the motivation of this work focuses on clinically relevant high-frequency lesions, the authors are encouraged to validate their method on more challenging tasks to substantiate its claimed clinical value. 1. This paper introduces an additional high-frequency contextual feature constraint into the latent space of MAE. Some previous studies have imposed constraints directly on the masking strategy (e.g., image-entropy-based masking). I encourage the authors to discuss this line of work to further highlight the value of their proposed method. 2. According to Figure 3, the performance of RetMAE appears to saturate after pretraining on approximately 12.8k images. Increasing the data size beyond this point seems to yield no significant improvement. The authors attribute this to saturation of model capacity and computation, which is a reasonable explanation. However, given that 12.8k is far smaller than the typical data scale used for foundation model pretraining—and that the employed encoder architecture has been shown in other domains to effectively utilize much larger datasets—this phenomenon remains concerning. The authors should provide a more convincing explanation for this observation. 3. Not all retinal lesions exhibit high-frequency characteristics—for example, large hemorrhages or retinal detachments. I would like to see visualizations of such cases to better understand how the proposed method behaves under these conditions. Lightly AI-edited
Frequency-Balanced Retinal Representation Learning with Mutual Information Regularization Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a frequency-aware masked autoencoder (MAE) for unsupervised pretraining on retinal images, called RetMAE. This is accomplished by including a high-frequency regularization term in the objective function that reduces low-frequency redundancy. The authors main claim is that the diagnostic information in retina images are encoded in high frequencies, and thus better representing these areas yields better downstream accuracy. There is a section on evaluating the frequency bias in MAE representations applied to fundus images in which the paper presents centered kernel alignment (CKA) and linear probing as evidence. The experimental section utilizes five publicly available datasets and compares against two other MAE-based approaches as well as a vision-language baseline. Although the paper doesn't contribute a significant new algorithmic framework, its approach in providing frequency-based context latents to guide the representation learning paradigm for applications in which frequency bias is an impediment could be a valuable contribution. There are, however, a few areas of both theory and practice that need clarification. On theory: 1- What is the purpose of using a variational autoencoder with a fixed variance Gaussian mapping? From theorem 1, it reduces the reconstruction error to minimizing the conditional in eq 2. However, it is not clear if this enforced constraint is warranted. Is this constraint enforcing any aspect of the regularization framework? 2- The choice of using $\mathcal L_{MINE}$ as the context-alignment training objective is not quite clear. In other words, why does estimating MINE maximize the conditional? On Application: 1- Does this framework extend beyond retinal fundus images? Could other factors than frequency bias be incorporated in the regularization objective? 2- How much computational complexity is added to the problem by incorporating the proposed high-frequency regularization? 3- How does the pretrained encoder fare in a formal classification tasks rather than the employed linear probing? Additional suggestions: 1) use the figure in appendix A instead of Figure 1 in the main paper. The figure from the appendix better justifies the frequency bias of retinal fundus images as compared to natural images, e.g. ImageNet. 2) CKA and its relevance to the claimed frequency bias should be explained more clearly. The approach in utilizing a bias term as regularization to improve representation learning in certain application is interesting. This approach could be potentially significant for applications that are not based on natural images. Better discussion is needed to connect the theoretical aspect of the work (MI) with the practical tools utilized (MINE estimation). Provided in the summary. Fully human-written
InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a pipeline to create instruction following samples in low-resource languages. In specific, instructions in a high-resource language are first generated and input into a large language model that translates these instructions and generates corresponding response in the low-resource language. These candidate instruction-following samples are evaluated by a large language model checker with retrieval-augmented generation, where samples labeled with errors are further refined by human experts. Authors have created instruction-following datasets in three low-resource languages using the proposed pipeline. Experimental results show that models tuned on these datasets are significantly better than the ones tuned on machine-translation samples. - This paper investigates an important problem: how to create a large amount of high quality instruction following samples in low-resource languages? - The instruction following datasets generated in three low-resource languages will be helpful to the low-resource NLP community. - **Limited Generalization**: This pipeline involves a large language model with reasonable performance on the low-resource language and some human experts for evaluation and correction, which makes it hard to scale and generalize to some low-resource languages. Given a low-resource language, this pipeline may be not applicable for all large language models performing bad or none suitable human experts. On the other hand, the number of instruction following samples is constrained by the budget to hire human experts. - **Missing Evaluation of RAG Checker**: This method uses a RAG checker to filter out low-quality samples. However, they do not evaluate the effectiveness of this checker, which makes the quality of accepted samples questionable. - **Missing Baseline for Comparison**: Some important new pipelines to create instruction following samples in low-resource languages are not cited or comparison [1, 2]. References [1] Li, C., Yang, W., Zhang, J., Lu, J., Wang, S., & Zong, C. (2024, January). X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions. In ACL (Findings). [2] Köksal, A., Thaler, M., Imani, A., ÜstĂŒn, A., Korhonen, A., & SchĂŒtze, H. (2025). Muri: High-quality instruction tuning datasets for low-resource languages via reverse instructions. Transactions of the Association for Computational Linguistics, 13, 1032-1055. 1. Does the RAG checker good in evaluating low-resource language instruction-following samples? Are there any problems with 85.8% samples marked "Accepted without correction"? 2. What is the advantage of your method comparing other beseline methods? How do they perform on the three low-resource languages? Fully human-written
InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces **InstructLR**, a scalable and modular pipeline to generate high-quality instruction datasets for **low-resource languages (LRLs)**. The approach leverages large language models in high-resource languages (such as French) to generate seed instructions, translates and adapts them to the target low-resource language, and applies a **dual-layer quality filtering mechanism**—an automated RAG-based correction system followed by human validation. Using this pipeline, the authors produce three 50k-scale instruction datasets (Zarma, Bambara, and Fulfulde) and demonstrate through extensive automatic and human evaluations that fine-tuning open-source LLMs on these datasets substantially improves instruction-following capabilities and downstream performance (e.g., NER) in these languages. - **Timely and important topic:** Addressing LLM accessibility for under-resourced languages is a highly relevant problem with social and scientific impact. - **Complete and scalable approach:** The paper presents an end-to-end framework, from seed instruction generation to human validation, which is reusable across languages and domains. - **Clarity and reproducibility:** The pipeline is clearly described and supported by well-chosen examples and figures. The authors also emphasize cost-efficiency and open licensing, which makes the work practically impactful. - **Empirical thoroughness:** The experiments are extensive, involving multiple models and metrics, and include both automatic and human evaluations, adding credibility to the results. - **Writing quality:** The manuscript is well-written and easy to read, with clear motivation and well-organized experimental sections. - **Limited novelty:** While the framework integrates translation, RAG-based filtering, and human validation effectively, these components are individually standard. The main contribution is the *composition* of these techniques rather than a new algorithmic insight. - **Experimental focus:** The experiments primarily show that models fine-tuned on the resulting datasets perform better than baselines. This is known fact. However, they do not deeply analyze *the pipeline itself*—for instance, how translation quality, RAG corrections, or human validation quantitatively affect final performance. A more ablation-style study would have better demonstrated the pipeline’s internal efficacy. - **Applied rather than exploratory:** The paper provides solid engineering value but remains on the applied side; it does not explore new theoretical or modeling questions in instruction tuning. - While the paper shows that InstructLR fine-tuning improves performance in the target languages, how does this affect *related* languages (e.g., mutual benefit for typologically similar LRLs)? - Does fine-tuning on these new datasets lead to degradation in performance for high-resource languages such as French? - Can the authors provide results comparing model performance on the *original untranslated* instruction set before and after InstructLR fine-tuning? This would clarify whether the model truly improves in cross-lingual understanding or only specializes in the generated instruction style. - How sensitive is the overall quality to the translation step? Have the authors evaluated how errors in translation propagate through the pipeline and influence filtering success rates? Fully AI-generated
InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a comprehensive pipeline, InstructLR, to automatically and efficiently generate high-quality instruction-tuning datasets for low-resource languages (LRLs), focusing on Zarma, Bambara, and Fulfulde. The framework integrates: Seed instruction generation in a high-resource language (e.g., French); LLM-based translation and response generation directly into the target LRL; Dual-layer quality filtering, combining automated RAG-based checking and human validation. 1, Tackles multilingual equity by addressing a pressing issue: lack of instruction datasets for African and other under-resourced languages. 2, The dual-layer filtering pipeline (RAG-based automatic correction + human validation) is novel and pragmatic. 3, Framework demonstrated across three distinct LRLs, showing language-agnostic and reusable properties. 4, Quantitative gains (BLEU +20–30, ROUGE, METEOR) and human preference results clearly substantiate claims. 1, Relies on Gemini and GPT-4o for initial generation; this undermines reproducibility and scalability in low-resource contexts. 2, All three LRLs are French-contact African languages, so claims of language-agnosticism remain under-tested. 3, Only five Zarma and one Bambara annotators—too few to ensure dialectal or sociolinguistic representativeness. 4, The dual-layer filtering ensures fluency but not factual correctness, leaving potential hallucination propagation unaddressed. 5, The framework is more engineering-driven than theoretically motivated; lacks a clear linguistic or data-centric theoretical foundation. 6, Evaluations are limited to BLEU/ROUGE/NER; lacks instruction-following generalization on realistic multi-turn tasks. 7, Heavy reliance on French-based seed instructions may embed Western or francophone biases into LRL outputs. 8, BLEU/ROUGE are weak proxies for instruction-following quality, especially across languages with divergent morphology. 9, 500 samples per language is not enough for statistical robustness; lacks confidence intervals for inter-annotator consistency. 10, Some pipeline components (RAG knowledge base construction, FAISS index details) are under-specified for replication. 11, The MT-Seed baseline may be too simplistic; missing comparisons to existing multilingual instruction datasets (e.g., Aya, BELLE, or Multilingual Alpaca). - Fully AI-generated
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a simple, effective, and principled method called AlphaSteer, which steers the activations of LLMs to refuse malicious prompts while retaining maximum utility for benign ones. Specifically, AlphaSteer defines an explicit objective for this goal and derives an efficient approach to achieve it without exhaustively retraining model parameters for safety alignment. The experimental results demonstrate the effectiveness of the proposed method. - The presentation is clear and easy to follow. - The idea is simple yet principled: the goal of this paper is rigorously defined, and the proposed approach to achieve it is both efficient and well-justified. In particular, introducing the concept of having zero effect on benign prompts (rather than explicitly maximizing utility, such as the log-likelihood of outputs) is a reasonable formulation. - The experimental results are strong, at least within the scope of the setups presented in this paper. I think this paper is already strong, but the following points could further improve it: - The proposed method appears lightweight (mainly involving SVD computation and matrix multiplication in a full-batch manner). However, in my view, it is still data-driven. It would therefore be helpful to compare this approach with a fully data-driven baseline — for example, a simple supervised fine-tuning model trained to generate refusals for malicious prompts in the same dataset $\mathcal{D}_m$. Although such a baseline might overfit $\mathcal{D}_m$, it would still highlight the advantages of the proposed method. Even if the baseline performs better, AlphaSteer would remain preferable due to its efficiency. - AlphaSteer introduces some additional computational overhead (which appears marginal), but it would be useful to discuss this overhead in more detail — particularly in comparison to the baseline (i.e., only computing the refusal vector $r$). - In certain cases (e.g., Llama-3.1-8B-Instruct on Math and GSM8K), AlphaSteer actually improves utility. This suggests that AlphaSteer might have a regularization effect (e.g., the input $h_b$ being influenced by $\tilde{\delta}$ when moving out of the null space). Providing intuition or analysis for this phenomenon could further support the claim that AlphaSteer enhances both safety and utility. - The paper studies the effect of the steering strength $\lambda$ in Figure 11. Could an optimal $\lambda$ be derived using a similar objective formulation? See the weaknesses Lightly AI-edited
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes an activation steering method with a learnable refusal vector to defend against jailbreak attacks in LLMs. The learnable vector is optimized to balance the trade-offs between utility and safety. Experiments are carried out on three open-source LLMs with recent jailbreak attacks and utility benchmarks to show the effectiveness of the proposed defense. - The proposed defense achieves a better utility score (even slightly better than standard models on average). - The paper shows theoretical grounding on its optimization of the learnable refusal vector. - The proposed method achieves a better defense success rate on average against recent jailbreak attacks. - The paper is well written and easy to read. - The contribution may be limited as there are other existing learnable activation-steering methods considering before ICLR submission deadline. The general learnable activation steering methods include: [1] https://arxiv.org/abs/2505.20309v2 (version 1 released in May 2025) [2] https://arxiv.org/abs/2506.03292 (hypernetwork-based steering) [3] https://aclanthology.org/2024.findings-emnlp.479.pdf The reviewer skips the paper after September 2025. - The experiments are not rigorous. Better attacks, such as "do anything now" [a], AdvPrompter [b], are not used for evaluation. [a]https://arxiv.org/abs/2308.03825 [b]https://arxiv.org/pdf/2404.16873 - Case study (RQ3) should be an in-depth analysis rather than showing an example of (ReNeLLM). - The generalization ability of the learned refusal vector is not clearly explained, although there are experimental results on generalization without math data in the appendix (D.4). Minor: - The caption of Fig. 4 is missing. - The small graphs in the supplementary materials are not readable. - Activation steering is known to introduce safety and alignment risks. How does the proposed method guarantee not to introduce other safety and alignment risks other than jailbreak attacks at hand? - The steering vector may not generalize well beyond the defined settings or prompt types. What is the expected generalization? - How does the proposed method guarantee the learned steering direction is reliable? (Fidelity) - The design of the prompts may affect the steering direction. What is the variance? How $D_b$ and $N_m$ are constructed? - The limitations say the effectiveness is unknown for large reasoning models. What about small reasoning models such as Phi-3? Fully human-written
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint Soundness: 3: good Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. AlphaSteer introduces a learnable activation-steering method that keeps a model’s normal behavior intact while strengthening its tendency to refuse harmful requests. It first carves out a space that represents benign behavior and minimizes any steering there, then learns an adaptive “refusal direction” from activation data so the model gently shifts toward safe responses only when prompts are malicious. Across multiple open instruction models and a range of common jailbreak attacks, it raises defense success while largely preserving compliance and standard task performance, outperforming prior refusal-vector baselines. 1. The method grounds activation steering in a clear linear-algebraic framework: (1) preserve utility by projecting benign activations into a learned (near) null-space, and (2) enhance safety via an adaptive, data-driven refusal vector estimated in closed form. 2. Across diverse jailbreak families, the approach delivers state-of-the-art (SOTA) defense success on malicious prompts while maintaining (or minimally impacting) compliance and standard-task performance on benign prompts—consistently outperforming refusal-vector baselines and contemporary steering methods under comparable settings. 3. Clear geometry-focused visualizations (activation trajectories, norm-separation) and ablations (layer choice, steering strength, linear vs. MLP) justify each design choice and make the mechanism easy to audit and reproduce, strengthening both clarity and credibility. 1. The proposed method includes introduction of the computation of null-space projection matrix, but does not show whether the new computation is costly. For showing effective practical usage, it would help to compare computation with existing baselines. For example, Surgical [1] offers Inference time and Memory comparison. 2. The evaluation solely depends on GPT-4o model as LLM-for-judge for DSR (Defense Success Rate) and CR (Compliance Rate), while having no justification for the model selection. Although it is based on GPT-4 not GPT-4o, WIldGuard[2] shows that guard-specific models can serve as better judges. You might want to include other guard-specific models as independent judges, and report how the results change for further validation. [1] Wang, X., Hu, C., Röttger, P., & Plank, B. (2024). Surgical, cheap, and flexible: Mitigating false refusal in language models via single vector ablation. arXiv preprint arXiv:2410.03415. [2] Han, S., Rao, K., Ettinger, A., Jiang, L., Lin, B. Y., Lambert, N., ... & Dziri, N. (2024). Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms. Advances in Neural Information Processing Systems, 37, 8093-8131. 1. Please state more details about the content and intent deduplication method in C.1. Lightly AI-edited
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint Soundness: 4: excellent Presentation: 3: good Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes AlphaSteer, a theoretically grounded activation steering method that addresses the safety-utility trade-off in LLM defenses against jailbreak attacks. Unlike conventional activation steering that indiscriminately applies refusal direction vectors to all prompts, AlphaSteer learns a transform matrix which produces steering vectors which are nearly zero for benign prompts (via null-space constraints for utility preservation) while maintaining refusal vectors for malicious prompts (via linear regression for safety enhancement). The method requires no additional post-training and demonstrates significant improvements in safety across multiple jailbreak attacks while maintaining general model capabilities. - Strong theoretical foundation with principled learning objectives based on null-space constraints and linear regression, providing clear mathematical grounding for the approach. - Addresses a critical limitation of existing activation steering methods with an elegant solution that treats benign and malicious prompts differently. - Comprehensive experimental evaluation across multiple jailbreak attacks (GCG, AutoDAN, PAIR, etc.) and utility benchmarks demonstrating consistent improvements. - Well-written with clear motivation and flow - Strong results vs existing baselines - The paper would benefit from more theoretical analysis of when and why the null-space constraint successfully preserves utility, and under what conditions it might fail. - I think the paper would benefit from more details on how AlphaSteer is learned for the experiments to give a better sense of cost/scalability Does a transform matrix always have enough capacity to adequately learn when the difference between malicious and benign? Is AlphaSteer easy to trick if the attacker is aware ahead of time? How does AlphaSteer perform against adaptive attacks where an adversary has knowledge of the learned steering vectors? Can the null-space constraints be circumvented by adversaries? What is the computational overhead of learning AlphaSteer vs existing methods? Under what conditions does the null-space constraint fail to preserve utility? Are there specific types of benign prompts that the authors observe still lose utility after AlphaSteer? How much is this affected by things like training set size. [Figure 1] How is this plot created? By my understanding at this point in the paper, should the vanilla benign/malicious distributions be the same between Surgical and AlphaSteer? To me it looks like the benign vanilla distributions are different for surgical and alphasteer, why is that? [98] Not a big deal, but it says recent studies and the first citation is from 1969. [101] Extra space? [366] This claim is too strong as Table 1 contricts the fact that 'AlphaSteer yields superior defense success rates across all the jailbreak attacks' [Table 1 and 2] Can you discuss why you believe AlphaSteer underperforms on certain benchmarks/models compared to the baselines? [411] The CAST papers claims that there is only a small increase in refusal rate for harmless prompts, can you explain why it is misclassifing math problems as malicious prompts, this seems surprising to me. Fully human-written
Ultra-Fast Inverse Tone Mapping via Gain Map-based LUT Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes GMLUT, a gain-map-based LUT framework for real-time inverse tone mapping (ITM). It combines a bilateral grid, an image-adaptive LUT, and a lightweight neural modulator to achieve locally adaptive, ultra-fast HDR reconstruction from SDR inputs. The method reports strong runtime efficiency (6.2 ms on 4 K) and modest performance gains over prior LUT-based and lightweight deep approaches. 1. Practical significance: The proposed pipeline achieves exceptional inference speed with low memory and FLOPs, making it highly suitable for deployment on low-power edge devices. 2. Comprehensive experiments: The paper includes extensive quantitative and qualitative comparisons, ablations, and both synthetic and real-capture datasets. 3. Reproducibility and clarity: The technical presentation and dataset description are clear, and the results are consistent across evaluation domains (linear, PQ, HDR metrics). 4. New dataset. This paper introduces a new dataset for ITM 1. Limited novelty. The method largely integrates existing ideas—Gain-Map representation, bilateral grids, and LUT-based enhancement—into a unified framework. The contribution is mainly engineering-oriented, showing solid system design but few new conceptual insights. 2. Marginal improvement on some datasets. Gains over strong baselines such as GMNet or ITMLUT are sometimes modest (≈ 0.2–0.3 dB in PQ-PSNR) and occasionally lower on real-world scenes. This raises doubt about the generality of the improvement. 3. Missing broader justification. While speed is highlighted, the paper could better discuss trade-offs between model complexity and perceptual quality, or compare with hardware-accelerated real-time HDR solutions. Please see Weakness and answer all the concerns in it. Fully AI-generated
Ultra-Fast Inverse Tone Mapping via Gain Map-based LUT Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes GMLUT that predicts Gain Map for high-resolution inverse tone mapping. It employs three image-adaptive operators: bilateral grids, LUT, and neural modulator to address local tone-mapping degradations. Besides, this paper constructs a 8K dataset with SDE-GM pairs for training and evaluation. Extensive experiments demonstrate the effectiveness and efficiency of the proposed GMLUT. 1. Instead of HDR values or LUT, the proposed method learns a color gain map, mitigating quantization artifacts. The motivation is interesting, and the experiment results demonstrate this manner can effectively generate high-quality outcomes while requiring minimal computational overhead. 2. This paper constructs a large-scale 8K dataset with SDR-GM pairs, promoting the ITM task. 1. The proposed GMLUT employs: (a) bilateral grids for local adaptation, (b) image-adaptive LUTs for SDR-to-GM translation, and (c) a lightweight neural modulator for GM refinement. The output of LUT and neural modulator is under supervision, yet the output of bilateral grids misses supervision. How to ensure it meets the expected goals? 2. As the paper noted, LUT suffers from the quantization issue. However, the GMLUT also employs a LUT. How does it address this issue? 3. What are the predicting resolution of the three operators (bilateral grid, LUTs, neural modulator) ? If predicting in a low resolution, how to restore the details? And, a high resolution requires additional computational overhead. See weakness. Fully human-written
Ultra-Fast Inverse Tone Mapping via Gain Map-based LUT Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes GMLUT, which learns to transform the standard dynamic range (SDR) images to Gain Maps (GMs) for efficient inverse tone mapping (ITM). This method combines Look-Up Tables (LUTs) for SDR-to-GM transformation, bilateral grids for local adaptation, and a light-weight neural modulator for GM refinement. A notable contribution is the curation of a new dataset consisting of over 8,000 synthetic SDR-GM pairs and a small-scale real-captured test set. The method is demonstrated to be fast and effective on its proposed test sets. **Clarity and Reproducibility:** The paper is generally written well and easy to follow. The commitment to release both code and dataset strengthens its practical value and reproducibility. **Dataset Contribution:** Constructing a large-scale dataset for the challenging ITM task is appreciated, which may facilitate future research in this field. **Performance-Efficiency Balance:** The results demonstrate an impressive balance, providing competitive HDR reconstruction quality while maintaining high inference speed for high-resolution images. Unfortunately, despite these strengths, the paper currently lacks a strong establishment of its core novelty and rigorous evaluation. **Weak Motivation and Unclear Technical Novelty:** The paper builds upon the SDR-to-GM formulation proposed by Liao et al. (2025). The primary extension appears to be the move from a single-channel to a three-channel GM representation. However, the motivation for this design choice is not explained. ***(1)*** Why is a three-channel GM representation necessary or superior? What specific limitations of the single-channel formulation does it address? A clear narrative is missing, making the core contribution feel more like an incremental engineering adjustment than a principled scientific advancement. ***(2)*** The overall architecture is a combination of well-established components (LUTs, bilateral grids, a small neural network). The paper does not articulate a novel insight that justifies this specific assembly. What is the unique conceptual synergy between these elements that solves a key problem in ITM that previous methods could not? Without this insight, the method risks being perceived as a straightforward pipeline of existing techniques rather than a novel solution to this well-established problem. **Insufficient Evaluation:** The evaluation is currently confined to the proposed new dataset, which limits the claims of generalizability and state-of-the-art performance. ***(1) Lack of Comparison on Established Benchmarks:*** A critical issue is the failure to evaluate on existing public ITM benchmarks such as HDRTV1K, HDRTV4K, and the dataset of Liao et al. (2025). To make a convincing claim about overall performance and generalization, the method is expected to be tested on these independent datasets, while the performance on the proposed dataset alone is not sufficient to establish the method's broader effectiveness. ***(2) Incomplete and Unconvincing Ablation Study:*** The ablation study in Table 4 is insufficient to reflect the contribution of each component. **--->** It lacks an important baseline of a single-channel GM variant to justify the three-channel design. **--->** Simply removing bilateral grids and neural modulator does not justify their novelties. More ablations are required, e.g., comparisons with other/existing bilateral processing mechanisms and internally ablated variants of the neural modulator (e.g., by removing the channel-wise parameters), to highlight the contributions of novel designs. In addition, the Q_max is not involved in the ablations. Is it important? **--->** Efficiency metrics are missing from the ablation table (Table 4). For a method that highlights efficiency, it is essential to show how each component affects the performance-speed trade-off. **Dataset Rigor and Broader Impact:** The new dataset is a strength, but its construction and potential impact require further justification. ***(1) Train-Test Split Justification:*** The extremely high train-test split ratio (around 28:1) is not a common practice and requires explicit justification. A discussion on how the 200 synthetic and 82 real SDR-GM pairs were selected to ensure they are representative of the dataset's diversity and complexity is necessary for a reliable evaluation. ***(2) Broader Utility of the Dataset:*** The dataset seems tailored to the proposed 3-channel GM formulation. To demonstrate its value as a community resource beyond this specific paper, it would be highly impactful to show that this dataset can be used to improve other existing ITM methods. Is it possible to take advantage of this dataset to produce a better performance for other methods? **Justification for Recommendation:** This paper presents a practical and efficient method for ITM and provides a valuable new dataset. The performance-efficiency trade-off is compelling. However, the paper currently lacks a clear narrative regarding its core conceptual novelty and a rigorous enough evaluation to support its claims. The authors are suggested to reframe their contribution around a clearer scientific insight, provide comprehensive evaluations on established benchmarks, and conduct thorough ablations that include efficiency metrics. **Conceptual Motivation:** What is the specific hypothesis behind using a three-channel GM representation instead of the single-channel one from Liao et al. (2025)? What quantitative or qualitative advantages does this design offer, and can the authors demonstrate this with a direct comparison? **Generalization Performance:** To substantiate the claim of state-of-the-art performance, can the authors provide evaluation results on the established benchmarks? This is critical for assessing the method's true capability for HDR reconstruction. **Ablation Rigor:** Could the authors expand Table 4 to include: (1) a single-channel GM baseline, (2) variants that alter the bilateral grid and neural modulator, (3) a baseline without Q_max, and (4) the corresponding efficiency metrics (e.g., inference time/FLOPs) for each ablated variant? This is essential to ground the design choices in empirical evidence. **Dataset Impact:** Beyond its use in this paper, have the authors explored using this new dataset for other ITM methods? Is it possible to combine this synthetic dataset with existing real-world data to further boost the performance? Fully AI-generated
Ultra-Fast Inverse Tone Mapping via Gain Map-based LUT Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents a novel and fast solution for inverse tone mapping (ITM) based on gain map-driven Look-Up Table (LUT). This approach simultaneously outperforms previous state-of-the-art methods in image quality (PQ-PSNR) and offers a substantially faster inference speed. The proposed architecture mitigate the limitations of standard LUTs by proposing a sophisticated approach by including bilateral grids (enabling crucial local adaptation), image-adaptive LUTs, and a neural modulator to improve the inverse tone mapping process. This design choice is empirically validated: the experimental results consistently demonstrate that the proposed method is superior to prior works, not only in terms of objective image quality but also in its remarkable efficiency and low computational complexity. 1. The entire pipeline relies heavily on features extracted from a thumbnail image which are subsequently used across multiple specialized modules (grid generation, LUT generation, and neural modulation). However, the specific design and structure of the encoder responsible for generating these features are entirely missing. Given that each subsequent module has a distinct functional purpose (e.g., generating spatial grids vs. generating color mapping parameters), they likely require specialized or differentiated feature representations. The absence of details on the encoder design makes it impossible to evaluate whether the extracted features are suitable for these diverse tasks. 2. The mathematical expression in the second part of Equation (2) needs to be clarified as currently presented, it is difficult to interpret the intended operation. 3. This paper does not clearly explain the interdependence between the grid generation and LUT generation modules. A detailed mathematical or logical description of how the outputs of one module inform the inputs or constraints of the other is necessary. More details, including the specific mathematical operations, are required for the grid generation module. 4. The experiments are limited to a single dataset that was constructed internally by the authors. For a convincing demonstration of a state-of-the-art inverse tone mapping technique, validation against at least one widely recognized, publicly available benchmark dataset is essential. 1.Ablation Network Architecture: The corresponding network architectures (pipeline diagrams) for the specific ablation experiments are missing. For example, when evaluating the impact of the grid/LUT components, how exactly was the rest of the architecture modified or removed? 2. Thumbnail Image Size: The overall complexity is tied to the use of a thumbnail image for feature extraction. Please provide an analysis and discussion on the impact of varying the input size of the thumbnail image used by the encoder. Did the authors explore different sizes, and how did this affect the trade-off between speed and PQ-PSNR? 3. LUT Size: Similarly, the size of the Look-Up Table (LUT) is a core parameter. Please discuss the motivation behind the chosen LUT size and provide an ablation on how different LUT dimensions influence both the mapping precision (quality) and the resulting inference time (complexity). Fully AI-generated
Ultra-Fast Inverse Tone Mapping via Gain Map-based LUT Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces GMLUT, a fast and lightweight framework for inverse tone mapping that uses Gain Map encoding together with learnable LUTs to convert SDR images into HDR. Instead of predicting HDR values directly, it estimates an 8-bit Gain Map and applies a few adaptive operators derived from a low-resolution version of the image to restore HDR details efficiently. The paper presents GMLUT, an efficient inverse tone mapping framework that integrates Gain Map encoding with learnable LUTs for real-time SDR-to-HDR conversion. The approach is simple yet effective, delivering high perceptual quality with extremely low computational cost. Please refer the Questions. If the authors can address the following points, it would greatly help me better understand and evaluate the paper. 1. The overall design appears quite similar to deep bilateral filtering pipelines, where a lightweight network processes a low-resolution thumbnail before propagating results to high resolution. The novelty over existing bilateral grid methods is not clearly explained. 2.It is also unclear which color space (e.g., BT.2020) the constructed HDR dataset uses; a color gamut comparison like Fig. 1 in [GamutMLP CVPR 2023] would make the work more complete. 3.The authors should clarify whether a single RAW exposure from the RAISE dataset provides sufficient dynamic range for HDR–SDR pair generation, as multi-exposure fusion (e.g., Adobe Indigo) is typically required to capture the full luminance range. 4.Moreover, SDR images often contain dark noise and highlight saturation, as noted in HDRCNN and UltraFusion, but the paper does not analyze how such degradations may affect the bilateral grid’s robustness. Testing on those datasets and adding visualizations could strengthen the analysis. 4.Since the paper claims that Gain Map learning reduces banding artifacts compared with direct HDR regression, an explicit visual comparison would be helpful. 5.Perceptual error visualization could also be improved using HDR-VDP distortion maps instead of raw RGB residuals. 6.Finally, perceptual metrics such as PU-PSNR and PU-SSIM would be more appropriate than standard PSNR/SSIM. 7.Several important references are also missing: HDR image reconstruction from a single exposure using deep CNNs UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion Gain-MLP: Improving HDR Gain Map Encoding via a Lightweight MLP GlowGAN: Unsupervised Learning of HDR Images from LDR Images in the Wild LEDiff: Latent Exposure Diffusion for HDR Generation Revisiting the Stack-Based Inverse Tone Mapping Single-Image HDR Reconstruction by Learning to Reverse the Camera Pipeline (also built its dataset from RAISE) Lightly AI-edited
Amodal SAM: Open-World Amodal Segmentation Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses the challenge of open-world amodal segmentation, where models need to predict complete object shapes including occluded regions and generalize to novel objects and contexts. The authors propose Amodal SAM, a framework that extends SAM’s capabilities to amodal segmentation while preserving its generalization ability. This framework improves across three aspects: a lightweight Spatial Completion Adapter that enables the model to reconstruct occluded regions, a Target-Aware Occlusion Synthesis pipeline that generates diverse synthetic amodal training data to solve the scarcity of amodal annotations, and novel learning objectives that enforce regional consistency and holistic topological regularization. Extensive experiments show Amodal SAM achieves state-of-the-art performance on multiple image and video amodal segmentation benchmarks. 1. The paper is well-structured and clearly written, ensuring good readability. 2. Video Extension: As the first attempt to tackle open-world video amodal segmentation via SAM, the work successfully extends the method’s applicability beyond images and validates its generalization. 3. The authors provide comprehensive experimental validation. 1. Lack of a Related Work Section: Relevant studies are only scattered in the Introduction and experimental comparisons. A systematic review is missing, making it difficult to clearly position the work’s incremental contributions against existing literature. 2. Insufficient Investigation of Prior Works: When discussing video amodal segmentation, it overlooks the first work that addressed this task and validated it on open-world datasets[1]. This omission weakens the paper’s ability to demonstrate the novelty of its own video amodal segmentation efforts. 3. Unspecified Sources for Baselines in Table 3. The baseline methods included in Table 3 do not have corresponding literature sources. 4. Although Appendix A.2.1 verifies the impact of the number of Spatial Completion Adapters on performance, it does not conduct in-depth exploration in two key aspects: (1) how the insertion positions of SCA influence the model’s ability to reconstruct occluded regions; (2) comparative experiments between SCA and other mainstream adapter structures in the context of amodal segmentation tasks. [1] Self-supervised Amodal Video Object Segmentation. 1. How does the model perform on the open-world dataset mentioned in [1]? 2. How were the experimental results of each baseline method in Table 3 obtained? [1] Self-supervised Amodal Video Object Segmentation. Lightly AI-edited
Amodal SAM: Open-World Amodal Segmentation Soundness: 4: excellent Presentation: 3: good Contribution: 4: excellent Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. Amodal SAM extends the Segment Anything Model to perform open-world amodal segmentation, predicting both visible and occluded object regions even for unseen categories. It introduces a Spatial Completion Adapter for occlusion reasoning, a Target-Aware Occlusion Synthesis pipeline for large-scale training without manual labels, and new consistency losses for structural coherence. The model achieves state-of-the-art performance on multiple benchmarks, making it a strong framework for enabling amodal segmentation in open-world scenarios. The paper extends SAM to perform amodal segmentation, predicting both visible and occluded object regions. It introduces a Spatial Completion Adapter (SCA) with new losses and leverages the TAOS synthetic dataset for training, showing strong technical quality. There’s a typo in Figure 2: it labels the component as “Amodal Spatial Attention Module (SAM)”, which is confusing. 1. Which components or models were actually retrained or fine-tuned during the development of Amodal-SAM? For example, was the base SAM frozen while only the Spatial Completion Adapter (SCA) was trained, or were other parts of the model updated as well? 2. After fine-tuning with the Spatial Completion Adapter (SCA) for amodal segmentation, does the model maintain the original SAM’s segmentation capability on normal (visible-only) tasks, or is there any performance degradation in standard segmentation scenarios? Moderately AI-edited
Amodal SAM: Open-World Amodal Segmentation Soundness: 3: good Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes Amodal SAM, a framework that extends SAM to predict both visible and occluded regions of objects, aiming to improve generalization in open-world amodal segmentation. The authors introduce a lightweight Spatial Completion Adapter to infer hidden regions, a Target-Aware Occlusion Synthesis pipeline that generates synthetic occlusions from the SA-1B dataset without manual labeling, and new loss terms promoting regional consistency and topological coherence. Evaluations on six benchmarks spanning both image and video amodal segmentation (KINS, COCOA, COCOA-cls, MP3D-Amodal, FISHBOWL, and MOViD-A) show consistent gains over several prior methods in both closed- and cross-domain settings. While the method demonstrates solid empirical performance, its core ideas substantially overlap with prior work. The proposed TAOS pipeline closely parallels existing synthetic occlusion generation strategies such as Amodal-LVIS in SAMEO and the mixed real–synthetic approach in SAMBA, both of which already integrate similar data synthesis procedures into SAM-based models (neither is cited). Moreover, the paper omits direct comparisons to these closely related and publicly available baselines — pix2gestalt, SAMEO, and SAMBA — which weakens the empirical validation and makes it difficult to substantiate claims of state-of-the-art performance or novel methodological contribution. The paper is relatively well written and easy to follow. The proposed approach is sound. The proposed design offers a natural extension from image to video amodal segmntation. A minimal ablation study is reported. The authors completely omit the most relevant works in the literature on open-world amodal segmntation. Specifically, they do not cite, discuss or compare to pix2gestalt [1], SAMEO [2], and SAMBA [3]. Moreover the dataset and methodological contributions are minimal compared to SAMEO and SAMBA, which also fine-tune SAM for amodal segmntation using synthetically "generated" occlusions. [1] Ozguroglu, E., Liu, R., Surís, D., Chen, D., Dave, A., Tokmakov, P., and Vondrick, C. “pix2gestalt: Amodal Segmentation by Synthesizing Wholes.”, CVPR'24 [2] Tai, W.-E., Shih, Y.-L., Sun, C., Wang, Y.-C. F., and Chen, H.-T. “Segment Anything, Even Occluded.”, CVPR'25 [3] Liu, Z., Qiao, L., Chu, X., Ma, L., and Jiang, T. “Towards Efficient Foundation Model for Zero-shot Amodal Segmentation.”, CVPR'25 Please discuss your work's relationship to the state-of-the-art methods for the problem you are trying to address. Update novelty claims accordingly. Compare to these methods on the datasets used in their paper using the same metrics. Fully human-written
Amodal SAM: Open-World Amodal Segmentation Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The submission focused on the task of amoda segmentation, which predict the object shapes including occluded regions. Specifically, the paper proposed to extend SAM to a amodal SAM with several proposed modules. The proposed Spatial Completion Adapter reconstructed the occluded regions, the Target-Aware Occlusion Synthesis pipeline generated training data, and the proposed training loss enhance the consistency on regional and topological aspects. Extensive experiments on various datasets are provided to show the effectiveness of the proposed method. 1. The task of open world amodal segmentation is interesting. 2. The idea of building amodal segmentor based on SAM is reasonable and make sense. 3. The performances of proposed method are shown on various benchmark datasets, and outperform baselines by a large margin. 1. The insight of proposed regional consistency is problematic. Why are the visible and occluded regions of the same object expected to exhibit similar characteristics like appearance and texture patterns? The occluded regions should be in the characteristics of occluder, instead of target. This is a factual and fatal issue. 2. The critical details of proposed TAOS is unclear. How to employ VLM to evaluate the generated occlusion and eliminate invalid data? How to promise that the generated amodal mask is complete, if the selected target is already occluded before synthesis? 3. The critical details of proposed holistic topological regularization is unclear. How to compute the Eqn. 10? The detailed formulations are also missing in A.1.2. 4. The experiments is in-depth enough. How to balance the loss terms in Eqn. 7? Why is there no hyperparameter? 5. The performances seem not cost-effective enough against previous works, bencause the proposed method relies on a large number of data and a heavy foundation model. Related analysis on complexity is also missing. 6. Some closely related works on amodal segmentation are missing, and thus the discussion and experiments are not extensive enough.  > 1. "pix2gestalt: Amodal segmentation by synthesizing wholes." 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 2024. > > 2. "Amodal segmentation through out-of-task and out-of-distribution generalization with a bayesian model." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022. > > 3. "Amodal instance segmentation via prior-guided expansion." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37. No. 1. 2023. > > 4. "Amodal cityscapes: a new dataset, its generation, and an amodal semantic segmentation challenge baseline." 2022 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2022. Please see Weakness. Fully human-written
VIRTUAL CELLS AS CAUSAL WORLD MODELS: A PERSPECTIVE ON EVALUATION Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This perspective paper advocates for better evaluation of AI "virtual cells". The authors focus on causal evaluation of virtual cells, where these models are evaluated according to their ability to predict responses to interventions. The first third of the paper focuses on existing predictive approaches, datasets, and evaluation strategies, with a detailed review of generative approaches (e.g. based on autoencoders, GANs, and diffusion models) and recent biological foundation models (FMs). They subdivide these models based on their function and data modality, e.g. genomic FMs for predicting transcriptomic effects of genetic variation and Protein FMs for predicting protein structure and other properties of proteins. They describe commonly-used datasets, including large-scale atlases like Tahoe-100M to synthetic data generators like Splatter. Finally, they discuss evaluation metrics focused on predictive fit, e.g. the accuracy of sequence classification or cell state classification, and discuss crucial limitations of such metrics, e.g. that they do not reflect the ability of these models to generalize to unseen perturbations. The second third of the paper focuses on existing causal approaches, interventional data, and causal evaluation metrics. They outline different perspectives on what makes a method "causal", e.g. a mechanistic perspective that emphasizes biochemical interactions, a probabilistic perspective that emphasizes conditional independences and (conditional) invariances, and a counterfactual/interventionist perspective that emphasizes the outcomes of interventions. They review ODE-based models, causal machine learning, graphical approaches, and perturbation prediction approaches such as scCausalVI and CINEMA-OT, then review perturbation dataset such as Perturb-seq screens. They discuss various causality-relevant metrics, such as mechanistic alignment with known pathways, GRN recovery for graph-based approaches, and metrics related to the effect of perturbations. The final third of the paper focuses on a vision for the future, including the design of mechanistic models, and alternative causal evaluation approaches based on a better use of observational data, use of biological domain knowledge from sources such as Reactome, use of experimental metadata about batch effects, and uncertainty quantification. **Originality and significance:** The main theme of the paper is timely, given the current trends on virtual cells and foundation models in biology. While the emphasis on causal evaluation is becoming more apparent in recent works, a major novel contribution is bringing this discussion into a single paper with a quite detailed literature review. **Clarity and quality:** The main point of the paper is quite clear and convincing and the paper follows a logical structure in its presentation. The organization of existing literature is well-done and clearly supports the position argued for by the paper. ## Major weaknesses 1. **Overly repetitive:** I found the paper to be quite repetitive in some sections, e.g. the subsubsections "3.1.1 Metrics" and "3.3.2 Strategies" cover a lot of the same ground - "Mechanistic alignment" in Section 3.1.1 is closely related to "Pathway fidelity tasks" in Section 3.3.2. Within Section 3.1.1, structural intervention distance (SID) is repeated both within "Intervention Validity" and "Mechanistic Alignment". I think the paper would benefit from another round or two of refining the categorization and clearly articulating the main goal of each section, and how that goal is different from the goal of the other sections. 2. **Lack of concrete contribution:** As a perspective/review paper, this work accomplishes its goal. However, looking at the ICLR call for papers, it is unclear whether such papers are meant to be in-scope for the conference (though I may have missed something). To give the paper more substance, it would be very interesting if the authors had compared existing approaches on some of the causal evaluation metrics that they discussed to reinforce their message that non-causal approaches are not sufficient for the intended purpose of virtual cells. 3. **Text-heavy:** Even as a perspective/review paper, one weakness of the paper is how dense and text-heavy it is. The literature review covers a lot of ground, and citations make up a substantial portion of the overall text. Ideally, tables and figures would be used to make the paper more readable and organized, and more equations would be used so that the paper could serve as a reference to practitioners who wish to use the causal evaluation metrics described. ## Minor weaknesses 4. **Interventions vs. counterfactuals:** This point is slightly more of personal taste, but especially in biology, I believe *most* tasks that we care about can be thought about as measuring the effects of interventions, rather than generating counterfactuals. For example, when the authors introduce the third perspective on causality at the end of page 4, they say > "a *counterfactual* view highlights potential outcomes under interventions (e.g., asking how a tumor cell's transcriptome would change if KRAS were knocked out versus left intact)" To me, it is best to consider this question as an *interventional* one, e.g. as a conditional treatment effect: "given what I know about the cell, what do I predict *will happen* if I perform intervention X", rather than as a *counterfactual* question, which would usually be along the lines of "what *would have happened* if I had performed X in the past?". The distinction that I have in mind is that interventions are forward-looking, and hence have practical implications for treatment, whereas counterfactuals are backward-looking, and typically more relevant for ethical issues like assigning responsibility/blame (see [1] and responses for more background on the use of interventionist vs. counterfactual language). Throughout the paper, there seems to be a confusion between counterfactuals and interventions, e.g. on line 321, "counterfactual reconstruction error... compares predicted states against observed perturbation responses", which again I would say is interventional. Ideally, I would prefer if the paper stuck to the interventional language except when counterfactuals are specifically needed. At the very least, it would be good for the authors to acknowledge and discuss the difference between interventions and counterfactuals: the distinction is important and it would be a disservice to propagate any confusion to a biological audience which may not be as familiar with the issues. [1] Dawid (2020), "Decision-theoretic foundations for statistical causality" 1. How do you intend to reduce the repetitiveness of the paper? (see Weakness 1) 2. Is it possible to make the paper less text-heavy and easier to parse? (see Weakness 3) 3. How do you plan to address the difference between interventions and counterfactuals? (see Weakness 4). If needed, I am happy to discuss these points further in the discussion period. Fully human-written
VIRTUAL CELLS AS CAUSAL WORLD MODELS: A PERSPECTIVE ON EVALUATION Soundness: 3: good Presentation: 1: poor Contribution: 3: good Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors focus on virtual cell modeling, a field of research that builds models to simulate some aspects of biological cells, generally aiming to produce realistic cell states and transitions. While this field is growing, evaluation of virtual cells has remained predictive, focusing on correlation rather than causation. For a virtual cell model to truly represent how cells behave and interact, it needs to represent how components causally affect each other. To promote this shift in focus for the field, the authors first conduct a survey of current practice in virtual cell modeling, breaking down current practice in terms of modeling approaches, data, and evaluation metrics. They then also break down current practice in causal modeling by modeling approaches, data, and evaluation metrics and discuss how virtual cell modeling evaluation can be expanded to capture causality. The breadth of the presented survey is impressive. The authors did a great job of covering a range of virtual cell modeling methods, evaluation metrics, and causal modeling methods, allowing this paper to serve as a rich source of references. For the most part, the writing quality is high, making it an easy read. **Poor organization as a survey paper** While I think the authors have a strong handle on the literature in virtual cell modeling and causal modeling, and the pieces are mostly here for a solid survey paper, the way it is presented prevents the paper from really functioning as a useful survey. Ideally, after reading Section 2, I should have a decent understanding **(1)** The problem definition(s) of virtual cell modeling, **(2)** the broad classes of approaches that people have applied to virtual cell modeling, **(3)** what sort of predictions these approaches are being used to generate, **(4)** what types of data each of these modeling methods uses/is evaluated on, and **(5)** what metrics are used to evaluate each of these model/prediction types. However, breaking these down: **(1) The problem definition(s) of virtual cell modeling:** This doesn't seem to be covered at all. The only definition I can find of what the scope of "virtual cell modeling" is in the second paragraph of the introduction: "our vision of AI virtual cells is simulation-ready representations that reason about mechanisms, predict perturbation responses, and serve as in silico testbeds". However, this seems to be the authors ideal vision of what virtual cells should be, rather than how they are actually defined in the literature. From reading the paper, they seem to broadly be any model that focuses on making predictions about cell state and behavior. However, if that's the definition the authors are working with, that encompasses a wide range of possible task definitions, which doesn't really seem to be discussed anywhere. **(2)-(4) existing virtual cell modeling approaches, what predictions they produce, and what type of data they require:** are covered in part by Section 2.1. However, while 2.1 does a great job at provided a large number of references, there's not enough description or categorization for a survey paper. 2.1 seems to mostly group models into either "Autoencoder-based and conditional architectures" (paragraph 1) and "Biological FMs" (paragraph 2), with a few bold headers in paragraph 2 separating types of FMs. However, the vast majority of methods listed have very little information provided about them, making it very hard to parse for someone not particularly familiar with virtual cell modeling. The bold headings do break model categories down by what type of data they are trained on and provide a few words each describing what task they are performing/what they are modeling, which is useful context. However, the presentation method of a single dense paragraph makes the relevant information harder to extract. **(4) What types of data each of these modeling methods uses/is evaluated on:** is also partially covered by Section 2.2. However, it seems unlikely that all of the datasets described in 2.2 are applicable to all the models/applications discussed in 2.1. If a similar categorization scheme were used in both 2.1 and 2.2, it would be easier to map datasets to relevant modeling approaches. **(5) what metrics are used to evaluate each of these model/prediction types:** encounters similar difficulties to (4): while there are many categories of evaluation metrics discussed in 2.3.1, the categories here are again different from those presented in 2.1, making it hard to map evaluation metrics to modeling approaches. The categories in 2.3.1 (broken down by task type) actually seem like great categories that could be used in 2.1 as well. I think Section 2 really needs a table, the kind that is present in many survey papers. There are many ways to do, but something that defines clear categories of modeling methods and maps those to the type data they use, predictions they can produce, evaluation metrics typically used, etc. If the categories are in practice too blurred to make a clear distinction like that, or if, counter to how 2.1 seems, all of the modeling methods are flexible enough to run with any of the dataset types and evaluation metrics discussed, then that should be discussed and explicitly stated. The paper as-is seems to assume that the reader is already familiar with the virtual cell modeling domain. If that is the intended audience, and this lack of clarity isn't felt by the other reviewers, then you can discount these comments. However, if you're aiming for a broader audience, then the descriptions of modeling methods 2.1 aren't clear enough. For example, in the first sentence of 2.1, you list a group of methods that "interpolate from control to perturbed states" and another group that "enhance latent representations or capture combinatorial and differential perturbations." Both of these descriptions are far too vague for me to understand what they're referring to. Does "interpolate from control to perturbed states" mean that they take input data that includes example interventions (i.e., control) and learn to estimate what the resulting perturbed states would be? "Enhance latent representations" of what? Then in Section 2.3.1, there are sentences like "Detection metrics are applied to the prediction of genetic interaction", without actually saying what a "detection metric" is. --- **Unclear second contribution** The authors second contribution, in the 3rd paragraph of the introduction, is listed as "a taxonomy of causal evaluation metrics mapped to available perturbation datasets and benchmarks (Figure 1). However, Figure 1 doesn't seem to be taxonomy of evaluation metrics. It is instead labeled as a "Summary of our proposed framework as described in Section 4". However, I also don't see a clear framework in Section 4. Section 4 instead seems to consist of (1) an explanation that causality is important, (2) a list of 4 ways that virtual cell modeling can move in a causal direction, and (3) an explanation of the importance of representing uncertainty. These are all fine points, but none of them seem anywhere close to a new "framework". It looks like the proposed taxonomy may be the terms listed under "evaluation" in Figure 1? These do correspond to the first 4 bold headings in Section 3.3.1, which is helpful. However, I'm then not sure of "GRN Recovery" should also be in Figure 1. Also, the 2nd contribution also mentions that the provided taxonomy is "mapped to available perturbation datasets and benchmarks". However, the categorization used in 3.3.1 (and seen in Figure 1) isn't actually presented in the text of the paper until 3.3.1, and the perturbation datasets are discussed in Section 3.2, so, unless I'm missing something, the mapping between the taxonomy and the datasets is never made explicit. Section 4 falls victim to a similar issue as Section 2 (presenting a wide range of references in dense paragraphs, with models, data, and metrics presented with different categorizations, making it hard to link them together), but it was less of an issue for me here since I'm more familiar with that literature. Still, it would benefit from a similar reorganization as Section 2, ideally with some sort of table. --- **Miscellaneous issues** One of the arrows in Figure 1 is "Active Learning", a term that doesn't appear anywhere else in the paper that I can tell. In the conclusion, the authors state that they "emphasize uncertainty as a cross-cutting principle". However, uncertainty isn't really discussed until the very end of the paper (the final paragraph in Section 4.2 and Section 4.3), so it really doesn't come across as a "cross-cutting principle". In addition, those parts of the paper that do discuss uncertainty feel very disjointed. The "Uncertainty Quantification" section reads mostly like a list of techniques/approaches that deal with uncertainty/confidence in various ways (e.g., "Calibration", "Gaussian processes"), but without any discussion of how they can be applied to causal modeling or virtual cells. Only "simulation-based inference" actually has any real discussion. Section 4.3 is a bit more specific, but even still, it reads mostly as a future work section, suggesting that certain evaluation metrics could be extended in different ways. These are interesting points, but their hypothetical nature means that they really don't serve as a "cross-cutting principle". The discussion of "strategies" in Section 2.3.2 feels odd. While the concept of evaluation strategies as the authors describe them seems sound, all that's discussed in this section is "rank based metrics" (which seems like a type of metric, not a strategy for applying a metric) and "calibrations", which seems like something that should be done for a model but again, not really a "strategy" (at least, not as the authors seem to define a strategy). --- Ultimately, I think the pieces are here for a solid paper. However, the current presentation is significantly holding it back. The key insights of this paper (what the current practice is in virtual cell modeling, what alternate approaches are available in the causal modeling world, and what concrete actions can be taken to move virtual cell evaluation in a more causal direction) are buried in dense reference-heavy and description-light paragraphs, making them hard to glean without spending significant effort and limiting the likely impact of this work. No questions Fully human-written
VIRTUAL CELLS AS CAUSAL WORLD MODELS: A PERSPECTIVE ON EVALUATION Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This is a perspective/position paper arguing that evaluation of “AI virtual cells” must move beyond predictive fit to explicit causal assessment. The authors (i) survey recent predictive and causal approaches for virtual-cell modeling and (ii) propose a taxonomy and framework for causal evaluation built around four axes: Intervention validity, Counterfactual consistency, Trajectory faithfulness, and Mechanistic alignment, with uncertainty treated as a cross‑cutting principle. 1. Timely and well‑motivated problem framing. The distinction between predictive fit and mechanistic/causal validity is crisply articulated, with clear failure modes of purely predictive assessments (e.g., generalization to unseen interventions). 2. Actionable taxonomy of evaluation axes. The four axes and their associated metrics/strategies are specific enough to guide practitioners toward more probing tests (e.g., using SID/SHD, pathway fidelity, invariance‑based tests), not just MSE/LogFC. 3. Uncertainty as a first‑class concern. Treating calibration and distributional uncertainty as integral to causal evaluation (not an afterthought) is a valuable emphasis. 1. No instantiated benchmark or code. The paper proposes a taxonomy and mentions candidate datasets/benchmarks (Perturb‑seq, OP3, PerturBench), but does not release a concrete evaluation suite (tasks/splits/metrics scripts) or re‑evaluate representative models within the proposed framework. This limits immediate impact and testability. 2. Lack of empirical case studies. There is no demonstration that the proposed metrics change conclusions relative to standard predictive metrics (e.g., a re‑ranking of methods by intervention validity or SID on a public perturbational dataset). 3. Ambiguity at scale. Practical guidance for computing graph‑level metrics (e.g., SID/SHD) and pathway‑fidelity at genome scale with noisy annotations is limited; identifiability and confounding are acknowledged but not operationalized into robust scoring protocols. see above Fully AI-generated
Q-Learning with Fine-Grained Gap-Dependent Regret Soundness: 2: fair Presentation: 2: fair Contribution: 4: excellent Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper studies the online tabular RL problem. The authors show that UCB-Hoeffding (Jin et al., 2018) achieves a fine-grained, gap-dependent logarithmic regret bound. Such a result is the first among model-free algorithms. While there is an exception of AMB (Xu et al., 2021), the authors point out errors in the analysis of AMB and propose how to fix it, providing a correct fine-grained regret bound and its proof for the algorithm. The paper proposes a general framework for obtaining fine-grained, gap-dependent logarithmic regret bounds of model-free algorithms. The raised issues for AMB and the corrected version seem valid. Some core definitions for the analysis are not properly given. - Is $\eta_ i^{N}$, introduced in line 310, $\eta_ i \Pi_ {j=i+1}^N (1 - \eta_ j)$ or the $N$-th power of $\eta_ i$? It seems like it's the former but I could not find the definition. - What is the input state-action pair of $N_ h^k$ when written without one? It appears multiple times including in the definition of $\omega_ {h'+1}^k(h)$, but it is not clear from the context. - It seems like the definition of $\omega_ {h'+1}^k(h)$ requires a state-action pair as $k^i$ requires a state-action pair in its definition. However, the notation does not reflect the fact. Also, there is a problem of using $N_ h^k$ without definition, so I have no idea what $\omega_ {h'}^k(h)$ is supposed to represent. I hope there is a description about what $\omega_ {h'+1}^k(h)$ represents as it is hard to understand it intuitively. Partially due to these ambiguities, Lemma C.3 is not trivial. Also, it is really hard to see why the equation in line 372 is true. Could the authors explain how these equations are established? In addition, as I see that $\omega_ h^k = \mathbb{I}\lbrace (s_ h^k, a_ h^k) \in Z_ {\text{sub}, h}\rbrace$ in the end, which is a simple function, wouldn't plugging in this value from the beginning simplify the analysis a lot without needing to define multiple series of $\omega$? I will be happy to raise my score once these points are clarified. 1. While a general framework for fine-grained, gap-dependent bounds for model-based algorithms is known (Simchowitz & Jamieson, 2019; Dann et al., 2021; Chen et al., 2025), I see that the analysis in this paper is different from the one in Simchowitz & Jamieson (2019). What challenges are there in applying the techniques in these papers to the model-free setting? 2. I don't understand why UCLB is proposed when it is no better than UCB-Hoeffding both theoretically and empirically. If the goal is to show the generality of the framework, couldn't any other model-free algorithm be used, for instance, UCB-Bernstein or UCB-Advantage? 3. In Section 4, it is mentioned that the analysis of Xu et al. (2021) violates the martingale property. It is due to the fact that $h'$ in their work is also random? It is not clear what this part is trying to claim. For instance, what are the values of the expectation and conditional expectation described in this paper? 4. Under what MDP was the experiment conducted? Fully human-written
Q-Learning with Fine-Grained Gap-Dependent Regret Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents the fine-grained, gap-dependent regret bounds for model-free reinforcement learning in episodic tabular Markov Decision Processes. While previous algorithms achieved minimax worst-case guarantees, their gap-dependent analyses were coarse, depending on the smallest suboptimality gap rather than individual ones. The authors introduce a new analytical framework that explicitly separates optimal and suboptimal state-action pairs, enabling fine-grained regret analysis. They apply this framework to the well-known UCB-Hoeffding algorithm, deriving a tighter bound that matches known lower bounds up to polynomial factors, and propose a simplified variant, ULCB-Hoeffding, which achieves similar theoretical guarantees with improved empirical performance. The paper also revisits the non-UCB-based AMB algorithm, identifying key theoretical flaws and proposing a refined version that restores correctness, ensures valid concentration analysis, and achieves a rigorous fine-grained regret bound in this class. Experimental results on synthetic MDPs confirm the theoretical findings, showing that the refined algorithms outperform prior methods. Overall, the work provides a unified framework that advances theoretical understanding of model-free reinforcement learning by bridging worst-case and instance-dependent analyses. 1. Introduces a fine-grained decomposition that separately analyzes optimal and suboptimal pairs, tightening gap-dependent bounds. 2. Framework applies to both UCB-based and non-UCB-based algorithms. 3. Provides the first fine-grained regret bound for model-free RL; matches known lower bounds up to polynomial factors. 4. Identifies and corrects subtle theoretical flaws in prior work (AMB). 5. Experiments confirm theoretical improvements and show scalability across MDP sizes. 1. Results do not extend to function approximation or continuous-state RL. 2. The theoretical derivations are mathematically heavy and might be difficult for practitioners to follow or generalize. 3. Experiments are conducted on randomly generated MDPs rather than benchmark environments. 4. The bounds, while asymptotically tight, may not yield practical improvements in all regimes. 5. The experimental comparison includes few competing algorithms beyond AMB and its variants. 1. Can the proposed fine-grained framework extend to function approximation (e.g., linear or neural models)? 2. How do fine-grained regret improvements translate to real-world tasks (e.g., navigation, games)? 3. Are the polynomial factors in H (e.g., H^5 or H^6) necessary, or could further refinement reduce them? 4. Does the fine-grained analysis provide insight into exploration dynamics, beyond theoretical improvement? 5. Could this framework be combined with variance-dependent or adaptive analyses to yield sharper or adaptive regret bounds? 6. Could ULCB-Hoeffding’s structure be adapted to other algorithms (e.g., Q-learning with linear approximation)? Fully AI-generated
Q-Learning with Fine-Grained Gap-Dependent Regret Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This work establishes gap-dependent regret bounds for UCB-based model-free RL regarding individual suboptimality gaps $\Delta_h(s,a)$ instead of global one $\Delta_{\min}$. Besides, this work identifies the issues in analyzing the AMB algorithm, which is non-UCB-based model-free algorithm, on incorrectly applying concentration inequalities, and refines the algorithm and the analysis. 1 This work establishes the first and tight individual-gap-dependent regret bounds for model--free RL, which improves coarse global-gap-dependent coarse bound in the prior work. The core technical contribution lies in separating the analysis of of optimal and suboptimal state-action pairs. 2 This work refines and fixed the issues in the AMB algorithm and associated analysis, which provides a rigorous fine-grained regret bound for non-UCB-based algorithm. 1 The analysis is based on episodic tabular MDP. The theoretical guarantees do not extend to complex settings such as linear MDP, MDP with function approximation. 2 Model-based algorithms has shown to achieve fine-grained gap-dependent regret bound. This work addresses a theoretical gap where model-free algorithms were lagging behind model-based algorithms in this setting. The contribution may not be significant. 3 The regret bound has high dependency on the horizon $H$. 1 There lacks comparison between this work and model-based RL. For example: a) What is the technical difficulty in adapting techniques in model-based algorithms for deriving fine-grained gap-dependent regret bound? b) Comparison on the dependency on $H$, and potential improvement on $H$. 2 What is the advantage of ULCB-Hoeffding over UCB-Hoeffding? They achieve the same fine-grained gap-dependent regret bound and their analysis are also similar. Given that ULCB is considerably more complex, I don't see the necessity of introducing ULCB algorithm in the main context. 3 Comparing the last term in Eqn. (10) and Eqn (2), the non-UCB-based algorithm AMB achieves a better sample complexity in $H$. Is it due to the sharper analysis of AMB? Minor issues: 1* Line 226 "line 15 in Algorithm 2": line 15 should be replaced. 2* Line 310 Eqn. (5): inside indicator function, should $k$ be $k'$? The recursive definition is hard follow. It would be great if the authors could come up with intuitive way to explain the intuition if exists. 3* Line 951 "(1),." -> "(1)." Fully human-written
Q-Learning with Fine-Grained Gap-Dependent Regret Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper aims to establish fine-grained, gap-dependent regret bounds for model-free algorithms in episodic tabular MDPs. While such bounds exist for model-based methods, model-free approaches have been limited to coarse bounds dependent on the global minimum gap, $\Delta_{min}$. The authors provide a two-part affirmative answer: 1. For UCB-based algorithms, they develop an analytical framework that distinguishes between optimal and suboptimal state-action pairs. Using this, they derive the first fine-grained, gap-dependent regret bound for the classic UCB-Hoeffding algorithm. 2. For non-UCB-based algorithms, they revisit the AMB algorithm (Xu et al., 2021), identifying and correcting significant algorithmic and analytical flaws to propose a Refined AMB. They then provide the first rigorous fine-grained bound for a non-UCB-based method. Empirical results on synthetic MDPs validate the theory, showing that the refined algorithms outperform the original AMB. 1. A key strength is the identification and correction of critical flaws in the AMB algorithm (Xu et al., 2021). The identification of improper truncation and violation of martingale difference conditions is a valuable and clear-cut contribution. 2. The paper successfully applies a fine-grained analytical framework, separating optimal/suboptimal pairs, to the model-free setting, yielding the first fine-grained, gap-dependent regret bound for the widely-known UCB-Hoeffding algorithm. 3. The experiments clearly support the theory. The flawed AMB algorithm performs poorly, while the Refined AMB and UCB-Hoeffding all perform well and exhibit the logarithmic regret behavior predicted by the new theory. 1. The paper's primary weakness is its failure to position its analytical framework relative to recent model-based works that also achieve fine-grained bounds, e.g., Dann et al., (2021); Chen et al., (2025). The paper cites these works but does not discuss the technical challenges of adapting their analysis to the model-free setting. Without this comparison, the core technical contribution in Section 3.3 appears to be an incremental adaptation rather than a novel framework. 2. The derived bounds include $H^5$ and $H^6$ terms. While the focus is on the gap-dependence, this looseness in $H$ is a significant limitation of the current analysis and makes the bounds less tight, even if they match the lower bound except for the factors in $H$. 3. The inclusion of the ULCB-Hoeffding algorithm feels unnecessary. It achieves the same theoretical bound as UCB-Hoeffding in Theorem 3.3, but performs noticeably worse in the experiments shown in Figure 1, distracting from the two stronger, clearer contributions. 1. Please explicitly compare your analytical framework in Section 3.3 to the techniques used in model-based papers like Chen et al. (2025). What are the key technical novelties required to make this style of fine-grained analysis work for model-free Q-learning? What new challenges arise from this difference that your analysis overcomes? 2. Can you elaborate on the source of the large polynomial dependence on $H$? Is this an artifact of the analysis, e.g., from recursively applying bounds, and do you see a path to tightening it? 3. Could you provide a direct comparison between your final bound for Refined AMB (Eq 10, depending on $|Z_{mul}|$) and the bound for UCB-Hoeffding (depending on $|Z_{opt}|$)? Which bound is tighter, and under what conditions? Lightly AI-edited
FM-IRL: Flow-Matching for Reward Modeling and Policy Regularization in Reinforcement Learning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes an offline imitation learning framework in which a student policy learns from a reward model based on Flow Matching (FM). The authors begin by noting that the absence of an online FM policy learning mechanism limits the policy's generalization capability. The point of this paper is not to train a FM policy. Inspired by adversarial inverse reinforcement learning, this work leverage FM to develop an enhanced discriminator. A student policy is implemented as a simple MLP. The FM-based discriminator is trained to fit expert data while distinguishing it from the behavior generated by the student policy. The authors also observe that while several prior attempts have integrated online RL with diffusion models, these methods often suffer from training instability. I would recommend that the authors also discuss the relevant work DACER [1] in this context, but I generally agree with this statement based on my own experience. Furthermore, the paper elaborates on the inherent challenges of training FM policies online. Its key contribution lies in leveraging a powerful generative model to "infuse" knowledge into a simple policy, while the simple policy also learns online to prevent overfitting. Overall, this paper is interesting to me. However, I still have some concerns about the comprehensiveness of the technical details, which lower my confidence. [1] Wang, Yinuo, et al. "Diffusion actor-critic with entropy regulator." *Advances in Neural Information Processing Systems* 37 (2024): 54183-54204. This paper presents a well-motivated and novel approach, supported by a clear and logical structure. The proposed method demonstrates significant performance improvements over baselines. The authors also provide comprehensive discussion in the appendix, including answers to some possible questions, which greatly aids in understanding the methodological rationale. I am not clear about whether the framework is easy to implement effectively or it requires tricks and careful hyperparameter tuning. Also, I am not sure about how the authors made the comparison fair (e.g., using common hyperparameters or network architectures, or fine-tuning each algorithm one-by-one). I noticed that the authors claim that the code will be made open-source, but I would appreciate some explicit discussion about such details. 1. Is there a learned value function or advantage estimation for the student MLP policy? Regarding the student policy loss (Equation (11)), the first term is the expected return. Does this mean that the student policy learning is identical to REINFORCE with adversarial rewards when $\beta = 0$? 2. How is the MLP student policy implemented? For example, PPO usually outputs a mean vector and uses a fixed scale to represent a Gaussian distribution, while SAC typically outputs both a mean vector and a per-dimension scale vector. There is also a policy class called amortized actors [2,3,4] which, though structurally an MLP, can express a multi-modal decision distribution. 3. The teacher FM can represent a multi-modal data distribution, but the student policy probably cannot (if it is Gaussian). In cases where the expert data is highly multi-modal (for example, the scenario discussed in Figure 1 of the DQL paper (Wang et al., 2022)), would the "infusing" encounters challenges? 4. Are there any tricks or hyperparameters not covered in the appendix? For example, only disc_lr is listed in Table 2. Is this a global learning rate that also applies to the student policy? What is the detailed network architecture of FM teacher? 5. How did you made the comparison with baselines fair? 6. What is $p_\theta$ and $T$ in equation (1)? They don't seem to have been explained. [2] Haarnoja, Tuomas, et al. "Reinforcement learning with deep energy-based policies." *International conference on machine learning*. PMLR, 2017. [3] Messaoud, Safa, et al. "S$^ 2$AC: Energy-Based Reinforcement Learning with Stein Soft Actor Critic." *12th International Conference on Learning Representations, ICLR 2024*. 2024. [4] Wang, Ziqi et al. "Learning Intractable Multimodal Policies with Reparameterization and Diversity Regularization." *Advances in Neural Information Processing Systems*. 2025. Fully human-written
FM-IRL: Flow-Matching for Reward Modeling and Policy Regularization in Reinforcement Learning Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Flow Matching (FM) for RL is emerging as a strategy of Imitation Learning, which, however, defaults to offline learning, which lacks an exploration mechanism and is upper-bounded by expert demonstration performance. This paper proposed a method to use FM for reward shaping and regularization in online RL, which involves a teacher-student style learning. The teacher FM model shapes a reward for the agent model learning and action regularization during the online RL phase. Empirical experiments on SOTA locomotion and navigation tasks showed that their method achieves higher generalization of learned policy and robustness to sub-optimal expert demonstrations in certain tasks. + This paper aims to tackle two fundamental challenges in online AIL: 1) where expert demonstrations are noisy or suboptimal ,and 2) traditional FM cannot adapt to an online setting + Preliminary work is clearly introduced to facilitate the introduction of the proposed method + The appendix offers an interesting theoretical discussion on why FM may offer advantages over conventional IRL reward models Vague contribution scope: * It looks to me that the main algorithmic novelty of this work is the integration of an FM model to replace the traditional IRL reward shaper, which feels incremental. Since the action regularization and the reward shaping methods can somewhat be traced back to prior work. Especially, the discriminator training objective is identical to GAIL. * The motivation for introducing FM into online IRL is relegated to the appendix rather than the main text. Although the argument there is compelling, the empirical evidence presented in the main paper does not strongly support these theoretical claims. * Limited empirical improvement: - Reported performance improvements are marginal on several benchmarks (e.g., Hopper, Maze, Ant-goal) - For noisy initial and goal state settings, all experiments are evaluated only on a single task, Hand-rotate. - The experimental result in Table 1 does not support the claim that FM-IRL overcomes the limitation of suboptimal expert data, as they are approximately or often below the expected return of demonstration data. - Can a traditional IRL discriminator be viewed as a special case of a Flow Matching model, perhaps implicitly defining a probability flow between expert and policy distributions? - What would be the algorithmic robustness in noisy settings for other locomotion/navigation tasks? Fully human-written
FM-IRL: Flow-Matching for Reward Modeling and Policy Regularization in Reinforcement Learning Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper aims to incorporate the benefits of online exploration into flow matching training by building on an inverse reinforcement learning framework. First, a teacher flow-matching model is trained on a static offline dataset. To enhance the expressiveness of the reward model, the teacher model’s flow-matching loss on agent-generated trajectories is used to measure the discrepancy between the agent distribution and the expert distribution. To avoid the instability of updating flow-matching policies through backpropagation through time or policy gradients, the method adopts a distillation objective that jointly integrates the rewards and teacher behavior into a simpler student policy. Experimental results show that, with additional online interactions, the proposed approach can mitigate the potential sub-optimality of offline datasets and outperform standard behavior cloning models trained solely on static data. 1. The paper is clearly written and well structured. 2. Extending the capabilities of flow-matching or diffusion policies on out-of-distribution area using online interatctions is an important research direction, since distributional shift is hard to be addressed by simply scaling static offline data. It is necessary to using additional online interatctions of model rollouts to mitigate some corner cases. 3. Improving the RL training stability for flow-matching or diffusion policies is important. Most current methods like directly maximizing differentiable rewards or policy gradients can be unstable. 1. `Lack of Novelty`. This paper appears to be a straightforward combination of three existing ideas: using diffusion losses as rewards [1], applying distillation methods to optimize flow-matching policies via reinforcement learning [2], and inverse reinforcement learning [3]. Although the authors claim novelty in being the first to use flow-matching loss as rewards, flow matching and diffusion models are essentially two sides of the same coin. Therefore, this contribution does not strike me as genuinely novel. Furthermore, the idea of using reinforcement learning to optimize a distilled, simpler policy is already well studied [2][4][5]. As a result, using distillation to enhance the stability of flow-matching RL training does not appear novel either. Taken together, these factors make the paper resemble a naïve “A + B + C” combination without a clearly original contribution. 2. `Weak Motivation`. If the authors had provided strong motivation for why the “A, B, C” components should be integrated, I could have acknowledged the contribution despite the limited novelty. Unfortunately, the current manuscript fails to do so. The authors identify the potential suboptimality of offline datasets in traditional flow-matching training as the core challenge, and aims to introduce additional online interactions to address this. However, the final objectives still primarily fit the static offline dataset. For example, rewards are higher when the agent’s behavior resembles that of the offline dataset, which only encourages in-distribution behavior. In addition, to improve training stability and mitigate inaccurate reward estimation on out-of-distribution areas, the authors introduce a distillation loss that encourages the student policy to mimic the teacher flow. This, again, is essentially behavior cloning. In my view, the proposed method reformulates the “mimic offline dataset” objective into “A + B + C jointly mimic the offline dataset.” This reformulation does not fundamentally address the challenge that the authors themselves emphasize. For this reason, I would recommend rejection. 3. `Marginal Improvements`. The experimental results show only marginal improvements over baseline models, which further limits the paper’s contribution. [1] Diffusion-reward adversarial imitation learning. 2024 [2] Flow q-learning. 2025 [3] Generative adversarial imitation learning. 2016 [4] Score Regularized Policy Optimization through Diffusion Behavior. 2024 [5] Diffusion Policies creating a Trust Region for Offline Reinforcement Learning. 2024 Please see weaknesses for details. Fully human-written
FM-IRL: Flow-Matching for Reward Modeling and Policy Regularization in Reinforcement Learning Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes an adversarial imitation learning (AIL) method, termed FM-IRL, using flow-matching (FM) models. The main idea is to combine the benefits of adversarial imitation learning (which leads to better generalization and policy performance than behavior cloning) and FM-based policies (which are more expressive than MLP policies with a Gaussian action distribution). Since directly optimizing FM-based policies using AIL in an online manner is difficult, they first train an FM-based policy to clone the expert, and then train an MLP policy using a combination of (i) AIL reward derived from the FM-based policy, and (ii) regularization to stay close to the FM-based policy. - The paper addresses an important research gap (i.e., how can we leverage flow-based policies for adversarial imitation learning) - FM-IRL design choices (using a smart parameterization for the FM discriminator model, and a regularization term to stay close to the teacher FM policy) are clearly presented - They show competitive performance against various AIL baselines, and report much better generalization to noise in the initial and goal states of the tasks - Loss of multimodality in the trained policy: The main motivation of the paper is to combine the expressiveness of FM-based policies (i.e., their ability to represent multimodal policies) with AIL. However, since the student policy trained by FM-IRL is actually a unimodal MLP policy, it is unclear if the full potential of multi-modal policies is leveraged. - Regularization in Eq. 11 may be mode-averaging: As a related point, I suspect the regularization term in equation 11 could lead to poor performance since it would encourage the unimodal MLP policy to spread its probability mass over multiple modes of the FM policy. It would be helpful to add an ablation study to test the benefit of the regularization. - Unfair comparison to FM-based baselines: The comparison in Section 4.3 is great to have, but it might be unfair due to different rewards being used. I believe you are comparing baseline methods that use the sparse reward of the environment with FM-IRL, which uses a discriminator-based reward obtained using expert trajectories. Could you run the FM-A2C, FFM-PPO, and FPO baselines with the discriminator-based reward? - Terminology (IRL vs. AIL): Even though the AIL problem is the same as IRL with a convex regularization, I would expect an IRL paper to have more empirical evaluation of the quality of the recovered reward, e.g., train a policy on the recovered reward and examine this policy's performance compared to the expert. In the absence of these experiments, I would recommend updating the name of the method to FM-AIL. None Fully human-written
Shift-and-Sum Quantization for Visual Autoregressive Models Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper identifies two major challenges when applying post-training quantization (PTQ) to Visual Autoregressive Models (VAR): (1) large reconstruction errors arising from quantizing the multiplication between attention scores and value tokens, especially at coarse scales where high attention scores are more common; and (2) a mismatch between codebook-entry probabilities and their sampling frequencies during calibration due to limited calibration data. To address these issues, the authors propose two components: a shift-and-sum quantization technique that duplicates and symmetrically shifts attentive tokens (those with high attention scores) to reduce quantization errors, and a calibration data resampling method that reassigns codebook entries to better match predicted probabilities. Experiments across multiple VAR depths and tasks—including class-conditional generation, inpainting, outpainting, and conditional editing—show that the proposed methods consistently outperform prior PTQ methods while maintaining low computational overhead. The approach achieves state-of-the-art PTQ performance on VAR and is complementary to existing methods like LiteVAR. - The two components proposed in the paper are well-designed to address the specific problems that arise in VAR quantization, and the paper clearly explains how they solve these issues. - The proposed method appears broadly applicable beyond VAR, with potential usefulness in visual generation and autoregressive modeling in general. - Given the nature of quantization research, more generic and widely applicable methods are preferable. However, the proposed approach is validated only on VAR, making the research scope narrow and potentially limiting its impact. - Recent PTQ research on transformer quantization is not discussed; the related work mainly covers older studies. Similarly, LiteVAR also focuses specifically on VAR quantization, which suggests that the overall scope of related work is limited. - Could the proposed method be evaluated on other transformer-based models to verify whether it generalizes and improves performance? Although applying it to plain autoregressive generation may be less straightforward, architectures that use multi-scale representations might benefit significantly. - The following models might be worth exploring: - OneFormer: One Transformer to Rule Universal Image Segmentation, CVPR 2023 - VGGT: Visual Geometry Grounded Transformer, CVPR 2025 - In BRECQ, the main PTQ evaluation table compares W4A4 and W2A4 settings. It would be interesting to see how the proposed method performs under these quantization settings compared to BRECQ. Can the authors provide results or insights on W4A4 and W2A4 performance? - How does the proposed method perform quantitatively on inpainting, outpainting, and class-conditional editing tasks? Since the current version mainly focuses on quantization for VAR, it would be valuable to include numerical performance metrics for these tasks, rather than relying solely on qualitative visual comparisons. Lightly AI-edited
Shift-and-Sum Quantization for Visual Autoregressive Models Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This work analyzes the post training quantization of VAR models and points out two VAR problems: 1. significant quantization errors from the multiplication between attention scores and value tokens. 2. a mismatch between the predicted probabilities over the entries of VQVAE codebook and their sampling frequencies during calibration (Line 71). The authors propose shift-and-sum quantization which can reduce error with $O(s/4n)$ bound (Theorem 1), and calibration data resample that can resolve the mismatch. The author also provide empirical validations to show that the method improves over BRECQ and is competitive with LiteVAR under W/A bit-widths. $\bullet$ Theoretical Contribution: this work formalizes coarse-scale attention make post-training quantization error worse, and the authors derived a variance expression (Eq. 8). $\bullet$ shift-and-sum kernel has tight error bound: the error bound is tight $|v-f_n(v;t_n) | \leq s/(4n)$ $\bullet$ the calibration fix is simple but effective: I think the resampling technique (probablity-matching) is easy to add to the top of existing post training quantization pipeline. $\bullet$ Experiments are comprehensive: the authors conduct experiments on multiple VAR depths, multiple bit-settings, standard metrics (IS/FID/etc.), and qualitative tasks (in/out- painting, editing) Eq. 8 relies on an unrealistic assumption: $\\{ \tilde{\epsilon\_i^a} \\}\_{i=1}^T$ and $\\{ \mathbf{\epsilon}\_i^v \\}\_{i=1}^T$ are independent, zero-mean random variables with variances $\sigma_a^2$ and $\sigma_v^2$ respectively. I checked the proof of Eq. 8 briefly, and I found the assumption is used at Eq. 22, where $\\mathrm{Var}[ \\sum\_{i} a\_i X\_i ] = \\sum\_{i} a\_i^2 \mathrm{Var}[X\_i] $ (covariance set to 0). If the assumption is dropped, it will not get the same closed form as introducing the covariance term. Furthermore, this assumption seems unrealistic to me, and one proof to break this assumption can be the following: We define $a_t$ and $\\hat{a}_t$ as the exact attention score and quantized attention score at timestep t respectively. Then, the quantization error is defined as $e_t := a_t-\\hat{a}_t$. By the definition of Softmax, $\\sum_i {a_i} = 1$ and $\\sum_i {\\hat{a}_i} = 1$ (softmax of quantized logits), thus we have $\\sum_i {e_i} = \\sum_i {a_i} - \\sum_i {\\hat{a}_i} = 0$. We suppose $\\{ e_1, \\dots, e_t \\}$ are independent and at least one had nonzero variance. Then we have $\\mathrm{Var}(\\sum_i e_i) = \\sum_i \\mathrm{Var}(e_i) > 0$. Since $\\sum_i {e_i} = 0$, we have $\\mathrm{Var}(\\sum_i e_i) = 0$. This leads to a contradiction. This proof has shown the independence assumption is unrealistic. $\bullet$ Can authors explain the practical validity of the assumption used in Eq. 8? $\bullet$ If the Eq. 8 assumption is removed, does it affect the main result, or does it merely complicate the derivation of the bounds? Fully human-written
Shift-and-Sum Quantization for Visual Autoregressive Models Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper addresses the challenge of efficient deployment of Visual Autoregressive Models (VAR) by focusing on Post-Training Quantization (PTQ), a technique that enables deep network compression using a small subset of calibration data. While PTQ has shown promise in conventional diffusion models generative models, its application to VAR remains underexplored, primarily due to two critical issues: * First, significant reconstruction errors arise from the multiplication of attention scores and value tokens in the VAR transformer, especially at coarse scales (low resolutions) where high attention scores are more concentrated—these errors propagate across subsequent finer scales and degrade final image quality. * Second, limited calibration data leads to a mismatch between the sampling frequencies of VQVAE codebook entries and their predicted probabilities, biasing quantization parameters and reducing quantization performance. To tackle these challenges, the paper proposes a PTQ framework tailored for VARs, consisting of two core components: Shift-and-Sum Quantization and Calibration Data Resampling. Extensive experiments validate the framework on ImageNet across four tasks: class-conditional image generation, image in-painting, out-painting, and class-conditional editing. Evaluations on VARs of varying depths (16, 20, 24, 30 layers) and different bit-widths show consistent improvements over baseline methods (e.g., BRECQ, LiteVAR) in metrics like IS, gFID, and FID2FP16. In fact, I have a good understanding of autoregressive models, but I am not an expert in the field of quantization. Please correct me if there are any issues with my descriptions. 1/ This paper mainly focuses on optimizing Post-Training Quantization for Visual Autoregressive Models. There is relatively little related work on autoregressive models, so this research is undoubtedly worthy of encouragement . 2/ This work has achieved promising results on ImageNet-256. It outperforms LiteVAR, and the performance improvement is even more significant when combined with LiteVAR. 3/ The analysis of "Reconstruction error across scales" is quite interesting. The authors found that quantization errors are more significant at early (coarse) scales, and based on this observation, they designed the Shift-and-Sum Quantization technique. 1/ I have a major question: Since the main purpose of this work is to improve the efficiency of generative models for deployment, why are there no experiments in the paper showing the speed performance or throughput performance of the VAR model after quantization? 2/ Currently, the experiments on VAR are only conducted at a resolution of 256. I am curious whether the results are consistent at higher resolutions. For example, at a resolution of 1024—admittedly, VAR itself has no experiments at 1024 resolution, but Infinity (the text-to-image model of VAR) has experiments at the 1024 resolution version, and it would be valuable to observe the experimental phenomena there. reference: Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis None Lightly AI-edited
Shift-and-Sum Quantization for Visual Autoregressive Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper applied post-training quantization (PTQ) for visual autoregressive models (VAR), including a shift-and-sum quantization method to reduce calibration data and a resampling strategy for calibration data to align sampling frequencies of codebook entries with their predicted probabilities. - The paper fills a the gap by explicitly identifying VAR-specific quantization challenges, i.e., the attention-value error amplification at coarse scales and the codebook frequency-probability mismatch. - The theoretical analysis is sufficient, e.g., the theoretical analysis (Theorem 1) that proves the error bound for the proposed shift-and-sum quantization. - Insufficient Analysis of Computational Overhead: The proposed shift-and-sum quantization introduces additional operations, such as shift, duplication, and aggregation, which may increase inference time and memory consumption. However, the paper lacks a thorough analysis or empirical evaluation of these computational costs. A detailed study on the overhead introduced by these operations is necessary to fully assess the practicality of the method. - Qualitative Results Show Noticeable Degradation: The qualitative results presented demonstrate a clear degradation compared to full-precision models. To better illustrate the trade-off between compression rate and generation quality, the authors should provide a comprehensive comparison of generated results across different bit-widths. Additionally, it would be beneficial to include trade-off curves comparing the proposed method with other baseline approaches. - Limited Evaluation Metrics: The paper primarily adopts FID and IS as evaluation metrics, which mainly assess the generation quality for inpainting and outpainting tasks. However, these metrics do not adequately capture the semantic alignment between the generated results and the conditional guidance. The authors should consider incorporating additional metrics or evaluation protocols to assess semantic consistency and alignment. Please refer to the weaknesses. Moderately AI-edited
FlowOpt: Fast Optimization Through Whole Flow Processes for Training-Free Editing Soundness: 3: good Presentation: 3: good Contribution: 4: excellent Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents a method for image inversion and editing with flow models. The idea is to optimize a $z_t$ (typically $z_T$) to reconstruct the input image. Since it is not feasible to propagate gradients through the entire denoising process, the optimization update omits the Jacobian term. So the update step becomes $z_0^{(i)} - y$, where $z_0^{(i)}$ is the image generated from the current state of the optimization, and $y$ is the input image. The optimization uses a small learning rate, and the paper shows that if the lr is not small enough, this process does not converge. - The method is novel, and it is initially surprising that it works. The authors provide an analysis and theoretical justification (but I do have concerns regarding the theoretical part, see weaknesses section). - The method itself is simple, and the paper presentation is clear. - The authors performed extensive evaluations against competing methods and the results are plausible (but I do have concerns here, see weaknesses section). - The limitations of the method are clearly discussed in the Appendix. - The method's results seem to adhere to the provided edit while staying well aligned with the original image in cases where competing methods fail. ### Major Concerns 1. The method requires a relatively large number of NFEs in order to provide an advantage over existing methods (e.g., FireFlow and UniInv) in reconstruction. 2. The authors present a theorem that guarantees the method's convergence under certain assumptions, however why and if these assumptions hold in practice is not clear. In addition, I think that the proof itself in Appendix F is potentially flawed, as explained next. Even assuming the condition holds, for the proof to hold we need to show that there exists some fixed $\kappa > 0$ such that the range in Eq. S8 exists. Otherwise, the limit argument is invalid for this claim. Furthermore, there exist many functions for which the condition holds, yet for any fixed $\kappa$ the range doesn't exist. Examples include $\tanh(x)$ where the supremum of $u_1$ and $u_2$ in the expression of $\eta_1$ is infinite and $x^3$ where the infimum of $u_1$ and $u_2$ in the expression of $\eta_2$ is $0$. Both functions satisfy the required condition with $\beta = 1$. 3. While the method compares with relevant inversion-based editing methods, there are also other approaches for text-based image editing, and it is not clear that the general framework (inversion + denoising with a different prompt) is the most effective one. For example, the method is not compared with Flux Kontext or Qwen Image Edit which are the SOTA text-based image editing models. ### Minor Concerns - Assuming that the Theorem holds, from the results in Appendix C it seems that the convergence is very slow, and in practice the initialization is crucial for the success of the method. It would be interesting to analyze convergence and performance when using other initializations, such as random noise or an interpolation between random noise and the final image. - Analysis of performance on few-step models is missing, even though they are potentially strong candidates to benefit from this method. - The method seems to support only appearance changes. - Why ReNoise is not included in the editing results? And no visual results of reconstruction are provided. - Showing other applications for this optimization framework will strengthen the paper. ### Final Note Despite these weaknesses, I find the paper overall good. I would be willing to raise my score if the authors address the issues related to the convergence claims and provide a more thorough discussion of the origins of the method's limitations. I would like to see more experiments that empirically support the claim of convergence to a unique solution from different initial conditions. If these cannot be provided, I would suggest the convergence guarantee claims to be removed from the paper. Methods that involve noise optimization, even gradient-free ones, can often produce inverted latents that don't exhibit properties of typical high dimensional Gaussian samples. Such properties may limit the editability of images generated by these latents (see e.g ReNoise, where the authors try to tackle this issue with regularization during optimization). I would like to see an analysis of the properties of the inverted latents found by this method, which may explain some limitations in editability, and perhaps hint towards a future solution for these limitations. The limitation for pose editing as presented in figure S16 is counter-intuitive. I would expect that using a larger number of optimization steps would make the edited image deviate less from the original image (as is seen in Figure 4 and Figure S17), and not the other way around. Fully human-written
FlowOpt: Fast Optimization Through Whole Flow Processes for Training-Free Editing Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper presents FlowOpt, a zero-order (gradient-free) optimization framework for training-free image editing with pretrained diffusion and flow models. Instead of backpropagating through the model or optimizing per timestep. That said, this work is exactly similar to existing literature, particularly FlowChef, and lacks comprehensive and community-standard evaluations. See details below. * Theorem 1 provides a sufficient condition on the step size under which the FlowOpt iterations provably converge. This formal analysis of convergence is a valuable addition to flow-based optimization literature, where most prior methods rely on heuristic step-size tuning. The novelty is limited. The proposed zero-order optimization across the full flow process is conceptually identical to FlowChef [1] (ICCV 2025, arXiv Dec 2024), which already introduced a gradient-free control framework with theoretical guarantees and broad task coverage (inversion, editing, and restoration). The main difference, introducing a step-size bound, is a modest theoretical insight rather than something novel or different. The work lacks comprehensive evaluation on community-standard editing benchmarks such as PIE-Bench [2], which is now widely adopted for fair comparison across inversion-based and inversion-free methods. The paper doesn’t clarify the conceptual distinction between FlowOpt and FlowChef, despite their almost identical formulations (both optimize the initial latent by approximating the flow trajectory without backpropagation). [1] “FlowChef: Steering of Rectified Flow Models for Controlled Generations,” ICCV 2025. [2] “Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code,” ICLR 2024. Can the authors clearly articulate the difference between FlowOpt and FlowChef, both theoretically and empirically? Heavily AI-edited
FlowOpt: Fast Optimization Through Whole Flow Processes for Training-Free Editing Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper addresses the task of editing images (and potentially other generative tasks) using pre-trained flow/diffusion models in a gradient-free manner. The key idea being, treating the entire sampling process as a "black box" instead of tweaking each sampling step individually (which is the case with many existing approaches) and using a zero-order optimisation approach. The paper is very well written. 1. This paper presents a clean idea of optimising the whole process rather than per-timestep manipulation. 2. The paper also presents a theoretical contribution: i.e., a sufficient condition on the step-size for convergence of the opimizer in this setting. 3. The edits looks visually appealing and demonstrate a good tradeoff between fidelity and edit strength. 1. Although the paper compares methods quantitatively and qualitatively, a user study is missing. 2. Paper doesn't really discuss how the Zero-order method performs with increase/decrease in dimension since zero-order methods may suffer from bad convergence with increase in dimension. 1. Why have the authors not compared against a gradient-based inversion baseline? Fully human-written
LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces a new LoRA method which explicitly tries to mimic full-finetuning dynamics. Crucially, the paper identifies the importance of matching the optimizer state in addition to the updates. This is accomplished via 1. alternating updates 2. gradient rescaling 3. momentum recalibration 4. second moment recalibration 5. projecting reconstructed AdamW update 6. approximating gradient clipping These "building blocks" ensure that in the limit of full-rank the full-finetuning dynamics are recovered. In the low-rank regime the experiments demonstrate improved performance over vanilla LoRA. The authors provide a principled derivation for a LoRA method meant to explicitly mimic full-finetuning updates. The approach recovers the correct dynamics in the full-rank limit. The authors provide extensive experiments showing the promise of the method especially for low-ranks. The method is practically efficient, it requires modest memory and runtime overhead, and is simple to implement. The experiments are only conducted with $r \leq 32$ and models with $\leq 8$B parameters. Second-moment calibration appears to have low-impact at a high cost, however it is still valuable to derive and test this idea. It is unclear if alternation is helpful or not. Do the authors have any intuition about when mimicking the full finetuning update is optimal or not? Fully human-written
LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes LoFT (Low-rank adaptation that behaves like Full fine-Tuning), a novel parameter-efficient fine-tuning (PEFT) method designed to closely approximate the optimization dynamics of full fine-tuning within a low-rank subspace. Building on the LoRA framework, LoFT introduces several key components: alternating updates, gradient scaling, first- and second-moment state calibration, projected full-model updates, and gradient clipping. Together, these allow LoFT to mimic AdamW’s optimizer behavior while maintaining the computational and inference efficiency of low-rank tuning. Empirical results across language (LLaMA-7B/2-7B/3-8B) and vision (ViT-Base) models demonstrate that LoFT consistently outperforms existing PEFT methods such as LoRA and DoRA, particularly under extreme low-rank constraints (e.g., rank ≀ 4). Ablation studies confirm that optimizer state calibration is critical to LoFT’s strong performance. **Strong conceptual motivation**: The paper identifies a previously underexplored source of suboptimality in LoRA — optimizer state misalignment — and provides a well-motivated correction grounded in optimization theory. **Methodological completeness**: The framework integrates multiple components (gradient projection, alternating updates, moment calibration) into a cohesive, well-defined optimizer (LoFT-AdamW), which provably reduces to full fine-tuning when rank = full. **Theoretical insight**: The analysis on matrix factorization clearly shows how LoFT recovers full fine-tuning dynamics, with formal smoothness guarantees and equivalence to alternating least squares in the special case. **Extensive empirical validation**: The experiments span multiple model families and domains, including large LLMs and ViTs, with clear, consistent performance improvements over LoRA and DoRA. **Careful ablation studies**: The paper convincingly demonstrates the necessity of each component, especially the importance of first-moment calibration for stable convergence. **Practical relevance**: LoFT eliminates the need to tune the LoRA scaling factor (α), reducing hyperparameter sensitivity and simplifying deployment. **Missing citation and discussion of concurrent work**: The Alternating Updates component (Building Block 1) reproduces an idea conceptually similar to AltLoRA [1], which independently proposed alternating optimization of low-rank factors to eliminate second-order coupling in LoRA updates. The absence of a citation or discussion of AltLoRA is a notable omission, especially since the “alternating update” mechanism is presented as a key innovation. This should be acknowledged as concurrent or parallel work, with clarification of LoFT’s additional contributions beyond AltLoRA (notably optimizer-state alignment). **Complexity and memory overhead**: While the paper discusses the cost of storing previous iterates and cross-terms, the actual scalability to very large models (≄70B parameters) remains untested; empirical results are limited to ≀8B models. **Presentation clarity**: The main text can be dense, with many mathematical expressions introduced in rapid succession. It would be helpful to include more detailed mathematical explanations or derivations in the appendix to improve readability and reproducibility. [1] Yu, Xin, et al. "AltLoRA: Towards Better Gradient Approximation in Low-Rank Adaptation with Alternating Projections." arXiv preprint arXiv:2505.12455 (2025). See Weaknesses. If the authors are willing to carefully clarify the relationship and differences between LoFT and AltLoRA during the rebuttal phase, I would be inclined to raise my score. Fully AI-generated
LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a new method called LoFT for parameter-efficient fine-tuning of large pre-trained models. LoFT extends the Low-Rank Adaptation (LoRA) approach by aligning the internal states of the optimizer (including momentum and second moments) with full fine-tuning, thereby attempting to reduce the accuracy gap typically seen between low-rank and full fine-tuning methods. The authors test their method across multiple language and vision tasks, showing performance improvements compared to previous low-rank adaptation methods, especially at very low ranks. They also discuss trade-offs in terms of memory usage and computational overhead, presenting simpler variants with lower overhead. - **Substantive technical contribution with theory.** The paper proposes a concrete improvement over standard LoRA-style adaptation and backs it up with clear derivations/analysis. The core ideas are technically motivated (e.g., aligning updates with full fine-tuning dynamics), and the method’s components are explained rather than presented as ad-hoc tricks. - **Broad empirical validation across domains.** Experiments cover multiple modalities/datasets (e.g., language and vision) and a range of ranks/settings, suggesting the approach is not narrowly tailored to a single task. - **Gap between theory and the strongest claim.** While the derivations are compelling, there remains a gap between the formal analysis and the paper’s strongest claim(s) (e.g., exact equivalence to full fine-tuning/AdamW under certain limits). A precise theorem with assumptions, or a more cautious phrasing, would strengthen the work. - **LLM evaluation is too basic.** The large-language-model experiments rely on relatively easy, small benchmarks. For a model like Llama-3-8B, a more representative LLM evaluation suite (e.g., code, math/reasoning, or long-context benchmarks) would be more convincing. Multi-seed runs with statistical reporting would further solidify the results. In Table 6, several DoRA results (e.g., r=4 with BoolQ=32.35, PIQA=7.13, Winogrande=0.00) are anomalously low and LoRA sometimes degrades as rank increases (e.g., r=4 worse than r=1 on PIQA/HellaSwag), suggesting hyperparameter/setup issues—could you explain these discrepancies? Heavily AI-edited
LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces LoFT, a new parameter-efficient fine-tuning method designed to make low-rank adaptation behave like full fine-tuning. Existing LoRA-based approaches mainly focus on gradient approximation but ignore the optimizer state misalignment, particularly in the first and second moments of the AdamW optimizer. LoFT explicitly aligns both gradients and optimizer states with full fine-tuning dynamics through six carefully designed components: alternating updates, gradient scaling, optimizer state calibration, second-moment alignment, projected full update reconstruction, and gradient clipping. Theoretical analysis proves that LoFT reduces exactly to AdamW when the rank equals the full dimension. Experiments on multiple large language models and vision transformers demonstrate that LoFT achieves higher accuracy and faster convergence than LoRA and DoRA while maintaining the same inference cost and number of trainable parameters. 1. The method directly addresses the optimizer state misalignment problem that has been largely overlooked in prior low-rank adaptation research. 2. The theoretical analysis is rigorous and provides a clear guarantee that LoFT degenerates to AdamW in the full-rank case. 3. The six-component design is systematic, and is validated through detailed ablation studies. 4. LoFT consistently outperforms LoRA and DoRA on both natural language and vision benchmarks, showing broad applicability. 1. The additional memory cost, which can reach about twenty-five percent compared to LoRA, is not fully analyzed for its impact on large-scale training. 2. Experiments are limited to models of eight billion parameters or smaller, leaving scalability to larger models unverified. 3. The effect of optimizer state projection on stability and convergence speed is discussed conceptually but lacks quantitative analysis. 4. The paper does not report concrete throughput or training speed measurements compared with LoRA or DoRA. 1. What is the quantitative impact of the additional memory requirement on training efficiency and GPU utilization? 2. Can LoFT be extended to other optimizers such as Muon that use different moment estimation mechanisms? Fully AI-generated
CLIP-TTA: Robust Test-Time Adaptation via Dual Regularization Beyond Optimal Transport Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a new test-time adaptation method named CLIP-TTA for VLMs that addresses the unreliable pseudo-labels of prior work like CLIP-OT. The authors introduce two losses: a cosine similarity loss to align image logits with text prototypes and an information maximization regularizer to encourage confident and diverse predictions. Experiments show that CLIP-TTA improves robustness against corruptions and domain shifts over current methods. 1. The method is presented clear and easy to understand. The component of CLIP-TTA is clearly organized by each section with correct and appropriate reference. 2. The paper demonstrates consistent performance gains over the primary baseline CLIP-OT across a wide array of benchmarks. 3. The authors provide detailed model analysis including ablation study, sensitivity of each different parameters and experimental settings. 1. The paper's primary motivation is to solve the "over-confidence" problem (high ECE) of the CLIP-OT baseline. However, the proposed core components: a cosine similarity loss and an information maximization loss, lack a direct theoretical link to this stated goal. The direct objective of $\mathcal{L}\_{cos}$ is to align features, while $\mathcal{L}\_{IM}$ aims to promote confident and diverse predictions to prevent model collapse. The paper fails to clearly articulate the theoretical chain of reasoning for why "alignment" and "preventing collapse" directly solve the problem of over-confidence, making the connection feel indirect and insufficiently supported. 2. The experimental validation omits several standard and challenging benchmarks. To better assess robustness, evaluation on ImageNet-C[1] is necessary. Furthermore, to test generalization on different data types, the paper would benefit from including fine-grained classification datasets from the CLIP zero-shot suite, such as the DTD[2] or EuroSAT[3]. 3. All experiments are conducted solely on the CLIP (ViT-B/32) backbone. To demonstrate the generalizability of the proposed dual-regularization approach, it should be tested on other vision-language model architectures, such as BLIP, to prove that the method is not just tailored to CLIP. 4. The paper's core methodological contribution is arguably incremental. The problem formulation (Eq. 1) is standard, and the optimal transport mechanism (Eqs. 2-8) is adopted directly from the CLIP-OT baseline. The primary novelty lies in adding two existing loss functions, constitutes a limited conceptual advance. [1] Hendrycks, Dan, and Thomas Dietterich. "Benchmarking neural network robustness to common corruptions and perturbations." arXiv preprint arXiv:1903.12261 (2019). [2] Cimpoi, Mircea, et al. "Describing textures in the wild." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014. [3] Helber, Patrick, et al. "Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12.7 (2019): 2217-2226. We would like to draw your attention to the recent, highly relevant paper by Lafon et al. (2025), "Cliptta: Robust contrastive vision-language test-time adaptation" (arXiv:2507.14312). (Already Cited in Section 2 in original paper) 1. The title "Cliptta" used by Lafon et al. is practically identical to your proposed "CLIP-TTA". Given this, are you concerned that this will create significant ambiguity and confusion for future researchers when citing and attempting to differentiate these two distinct methods? 2. Lafon et al. argue that gradient-based TTA can "degrade learned knowledge," and for this reason, they propose a gradient-free solution. How does your dual regularization specifically prevent this degradation? Lightly AI-edited
CLIP-TTA: Robust Test-Time Adaptation via Dual Regularization Beyond Optimal Transport Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces a novel method, CLIP-TTA, which adapts CLIP models for downstream tasks without requiring labeled data. To leverage unlabeled data during testing, CLIP-TTA employs optimal transport for pseudo-labeling and incorporates two regularization losses to prevent pseudo-label collapse. Experimental results demonstrate that CLIP-TTA enhances the performance of CLIP under distribution shifts. 1. The problem that how to utilize unlabeled data is central to several areas of machine learning, such as unsupervised and semi-supervised learning. A common approach is self-training, which alternates between assigning pseudo-labels and training the model with confident data. In our view, this paper improves the self-training framework by integrating optimal transport into pseudo-labeling, which is an interesting and inspiring idea. 2. Building upon the improved self-training framework, the paper introduces two regularization losses which use the confidence (or entropy) of predicted sample to prevent collapse. These regularization methods are straightforward and conceptually sound. 3. Figure 3 (Left) shows that the proposed method is not only effective but also efficient. 1. In my opinion, this paper is somewhat incremental and similar with CLIP-OT [1]. While it adds two regularization losses to improve the pseudo-labeling process, the novelty feels reduced compared to CLIP-OT. I would appreciate a more detailed comparison to highlight the differences between this work and CLIP-OT. 2. In Figure 5, hyper-parameters $\lambda_1$ and $\lambda_2$ have minimal impact on the average accuracy of CIFAR-10-C and CIFAR-100-C. I suggest the authors provide further explanation on the effectiveness of the proposed regularization losses. 3. CLIP-TTA updates only the visual encoder $\theta$ during the adaptation process, assuming that distribution shifts affect only the images. This assumption limits the scope of application for this method. 4. If I understand correctly, both regularization losses are computed on the output logits. The additional lines in Figure 2 seem unnecessary and make the framework more complex and difficult to understand. [1] Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation, ArXiv 24 See my questions in weakness. Fully human-written
CLIP-TTA: Robust Test-Time Adaptation via Dual Regularization Beyond Optimal Transport Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes CLIP-TTA, a test-time adaptation (TTA) method designed to address the issue of unreliable pseudo-labels generated by the previous CLIP-OT approach. CLIP-TTA introduces two key components: (1) a cosine similarity loss to align image features with textual prototypes, ensuring stable adaptation; and (2) an information maximization regularizer to encourage confident and diverse predictions, preventing model collapse. Extensive experiments across 7 benchmarks demonstrate competitive performance. - The paper is well-written and easy to understand. - The contribution of the paper is limited, as the proposed method framework is largely similar to OT-CLIP, with the addition of only two extra losses. Furthermore, there is no theoretical evidence provided to support how these losses contribute to the reduction of ECE.​ - The effectiveness of L_cos relies on the assumption of highly distinguishable text prototypes. However, in many fine-grained tasks, text templates are unable to differentiate between subclasses, which may lead to pushing the model toward incorrect priors. - Sensitivity to Hyperparameters. The method shows extreme sensitivity to hyperparameters, as shown in Figure 5. Different tasks exhibit strong dependence on hyperparameter settings, which undermines the robustness claimed by the paper. - The manuscript should include validation of CLIP-TTA on cross-dataset and cross-domain benchmarks, as this would make the method's claims more convincing. - The method should be tested on a broader range of TTA techniques (e.g., TPT) to demonstrate its effectiveness in reducing ECE, rather than being evaluated solely on OT-CLIP. Lightly AI-edited
HyperBatch: Scaling Contrastive Learning Batch Sizes by Two Orders of Magnitude Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces HYPERBATCH, a method for scaling contrastive learning to batch sizes two orders of magnitude larger than conventional approaches. While large batches are known to benefit contrastive objectives, modern backbone models make such scaling impractical due to memory constraints. The authors propose a three-step training procedure in which only selected parts of the model are trained on large batches, with gradients transferred through a modified backpropagation mechanism. Experimental results suggest that HYPERBATCH achieves faster convergence and higher accuracy than baseline methods. - The paper presents an interesting and practical idea addressing an important scalability limitation in contrastive learning. - It is well-written and easy to follow, with clear exposition of the method. - Figure 1 summarizes the proposed approach and helps with conceptual understanding. - The paper does not discuss limitations of the approach, such as potential instability or applicability constraints. - There is no released code or detailed training configuration, which limits reproducibility. - The absence of an ablation study makes it difficult to assess which components of the proposed method drive the reported gains. - Why is the proposed approach evaluated only in the context of contrastive learning? The method appears potentially applicable to other paradigms involving large batch training. - Why are experiments limited to audio–video pairs? Could the authors extend or at least discuss applications to other modalities (e.g., image–text or cross-language)? Moderately AI-edited
HyperBatch: Scaling Contrastive Learning Batch Sizes by Two Orders of Magnitude Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The authors propose HyperBatch, a three-phase training framework for scaling contrastive learning batch sizes by two orders of magnitude without additional memory requirements. Two key components are introduced: a memory-intensive backbone for feature extraction and a lightweight contrastive head for representation refinement. The framework consists of three phases: Pretrain for joint initialization with small batches, Adapt for training the head alone on large batches (1,000-50,000 samples) using cached backbone features, and Fuse for transferring large-batch gradients back to the backbone through microbatch-wise backpropagation. For gradient computation, gradient checkpointing is utilized for memory efficiency, and the InfoNCE loss is computed over the full concatenated batch rather than per-microbatch. The experimental results on AudioSet show improvements in audio-visual retrieval metrics compared to gradient accumulation baselines. 1. The paper correctly identifies that large batch sizes are crucial for contrastive learning performance, as they provide more negative samples and harder negatives that lead to better representations. Memory constraints that limit batch sizes are indeed a significant practical bottleneck in training large-scale contrastive models, and attempting to address this limitation is a valuable research direction. 2. While methods like MoCo, BYOL, and SimSiam have primarily focused on single-modality (vision) tasks, this work attempts to tackle the more challenging multi-modal setting with audio-visual contrastive learning. Multi-modal representation learning is increasingly important for real-world applications, and exploring memory-efficient training methods in this context is commendable. 1. Methodological inconsistency: The core claim of achieving large-batch training is questionable. During the Adapt phase, the backbone remains frozen and only the lightweight head is trained on cached, static features, which is fundamentally different from true large-batch contrastive learning where the backbone parameters are updated with gradients computed from large batches, such as MoCo. The approach more closely resembles post-hoc feature refinement rather than large-batch training. 2. Unclear gradient transfer mechanism: The Fuse phase attempts to backpropagate gradients from a head trained on frozen features (from Adapt phase) to update the current backbone. This creates a mismatch between the feature distribution the head was optimized for and the current backbone's output distribution. The theoretical justification for why gradients computed on outdated feature distributions should provide meaningful updates to the current backbone is absent. 3. Insufficient experimental validation: The evaluation is limited to a single dataset (AudioSet) and a single task (cross-modal retrieval). The paper lacks evaluation on well-established benchmarks such as ImageNet for visual representation learning, which would allow direct comparison with existing large-batch contrastive learning methods. The paper also lacks comparison with established memory-efficient contrastive learning methods such as MoCo, BYOL, or SwAV. Additionally, there are no downstream task evaluations to demonstrate whether the improvements in retrieval metrics translate to better representations for practical applications. 4. Unfair baseline comparison: The gradient accumulation baseline computes InfoNCE loss per microbatch, while the proposed method computes it over the full concatenated batch. This difference in loss computation, rather than the three-phase framework itself, could account for the observed improvements. A fair comparison would require both methods to use identical loss computation strategies. 5. Missing technical details and ablation studies: Critical implementation details are absent, including learning rate schedules for different phases, optimization hyperparameters, and the specific architecture of the contrastive head. The paper lacks ablation studies to identify which components contribute to performance gains. The asymmetric training steps (95k for Pretrain, 25k for Adapt, only 10k for Fuse) raise questions about convergence and optimal phase duration. 6. Limited theoretical analysis: The paper provides no formal analysis of convergence properties or theoretical guarantees that the proposed method approximates true large-batch training. The relationship between the solution obtained through this three-phase approach and that of standard large-batch training remains unclear. 1. Clarification on the core training mechanism: Could you provide a mathematical derivation showing how gradients computed from a head trained on frozen features (Adapt phase) provide valid optimization directions for the current backbone (Fuse phase)? Specifically, during Adapt, the head learns to process features from a frozen backbone with a fixed distribution. However, during Fuse, this same head is applied to features from an updated backbone that has evolved through training. How would you address this distribution shift between training and application of the head? What is the theoretical or empirical justification that the head's learned transformations on static features remain beneficial when applied to the evolving feature distribution during Fuse? 2. Regarding the baseline comparison: The evaluation in Section 4.1 appears to conflate the training framework's contribution with the loss computation strategy's contribution. The paper states that the baseline computes the InfoNCE loss separately on each microbatch, forming $b \times b$ similarity matrices. In contrast, the proposed method assembles "a single $B \times B$ similarity matrix over the full effective batch". To accurately isolate the performance gains attributable to the ''Pretrain-Adapt-Fuse'' framework itself, distinct from the known benefits of a larger negative pool, could the authors provide results for a baseline that also computes the loss over the full $B \times B$ concatenated batch? 3. Ablation studies: What happens if you: (a) continue training the backbone jointly with the head in Adapt instead of freezing? (b) use different ratios of training steps across phases? These ablations would help identify which components are essential. 4. Comparison with established methods: How does HyperBatch compare with MoCo, BYOL, and SimSiam? Could you also provide results on standard benchmark datasets like ImageNet to enable direct comparison with existing large-batch contrastive methods? 5. Convergence analysis: Why does the Fuse phase only run for 10k steps compared to 95k for Pretrain? Could you provide training curves showing loss/accuracy throughout all three phases? 6. Downstream task evaluation: Do the improvements in retrieval metrics translate to better performance on downstream tasks (e.g., classification, detection)? This would validate whether the learned representations are generally better or just optimized for retrieval. 7. Technical specifications: Could you provide the exact architecture of the contrastive head, learning rates for each phase, and optimizer settings? This information is crucial for reproducibility. Fully AI-generated
HyperBatch: Scaling Contrastive Learning Batch Sizes by Two Orders of Magnitude Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes HyperBatch, a framework to overcome memory limits when training large contrastive learning models. It uses a three-phase (Pretrain, Adapt, Fuse) approach where a lightweight head is first trained on massive batches, and then a novel backpropagation step fuses this large-batch gradient information back into the memory-constrained backbone, achieving the benefits of massive batches without the high memory cost. * The paper addresses an extremely important and common practical bottleneck in contrastive learning: how to effectively scale batch sizes under strict memory budgets. * The proposed three-phase (Pretrain-Adapt-Fuse) framework is novel, cleverly decoupling the large-batch loss computation (in a lightweight head) from the gradient updates of the memory-intensive backbone. 1. The framework's core is the Pretrain-Adapt-Fuse pipeline. However, the necessity of the separate Adapt phase is not fully ablated. The Fuse phase already trains the head on large batches. It is unclear if the Adapt phase is essential for stability or if a simpler Pretrain → Fuse pipeline would achieve comparable results. An ablation study removing the Adapt phase would clarify the framework's essential components. 2. The method is demonstrated on a single audio-visual retrieval task. The paper claims it is a "drop-in training scheme" applicable to any contrastive framework, but this broad claim is not substantiated. It would be significantly more convincing to see results on standard uni-modal benchmarks, such as SimCLR on ImageNet, to prove that the framework is truly general-purpose and not just tailored to the specific audio-visual architecture used. 1. Why was gradient accumulation (GA) chosen as the primary baseline rather than momentum encoders (such as MoCo) which also solve the large batch problem? What are the advantages of HyperBatch compared to MoCo? 2. Is the Adapt phase necessary? What would happen if we skipped the Adapt phase and went directly from Pretrain to Fuse? 3. You claim this is an "off-the-shelf" solution, but experiments are limited to audio-visual tasks. Have you tested the method on standard uni-modal benchmarks such as SimCLR on ImageNet to demonstrate its generalizability? Fully AI-generated
HyperBatch: Scaling Contrastive Learning Batch Sizes by Two Orders of Magnitude Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a three-phase contrastive learning framework — Pretrain, Adapt, and Fuse — to scale effective contrastive batch sizes by two orders of magnitude without additional hardware costs. The approach freezes the backbone after initial training and then trains a lightweight contrastive head on extremely large batches, later fusing gradients back into the backbone in a micro-batching manner. The authors claim improved performance and faster training on AudioSet by exposing the model to many more hard negatives than would otherwise fit in memory. 1. This paper tackles a practical and important limitation in contrastive learning—namely, the difficulty of scaling effective batch sizes due to memory constraints—and presents a simple, modular multi-phase training strategy to overcome it. The proposed Pretrain–Adapt–Fuse pipeline is clearly described, easy to integrate into existing contrastive frameworks, and does not require specialized hardware or distributed setup, making it accessible to practitioners. 2. The method takes advantage of freezing a backbone and training a lightweight contrastive head on very large batches, enabling exposure to many more hard negatives and improving representation quality. The authors motivate the approach well, provide intuitive reasoning about its benefits compared to naïve gradient accumulation, and demonstrate performance improvements on AudioSet retrieval tasks. The writing is clear and the implementation concepts are straightforward, which increases the practical value of the paper. 1. Novelty. While useful in practice, the contribution is conceptually incremental—it largely combines established training tricks such as backbone freezing, micro-batch gradient accumulation, and staged optimization, rather than introducing fundamentally new contrastive learning theory or algorithms. 2. Empirical evidence. The experimental validation is limited in scope, relying primarily on AudioSet without evaluation on standard large-scale vision benchmarks (e.g., ImageNet, CLIP-style settings) or across diverse architectures, raising questions about generality and scalability. Important baselines such as memory-bank approaches (e.g., MoCo queues) and distributed large-batch training are absent, making it difficult to isolate the benefit of the proposed method relative to modern contrastive systems. 3. Ablation analysis. The paper does not analyze potential degradation from freezing early layers, or efficiency trade-offs versus simply training longer or adding compute. The argument that off-diagonal negatives drive improvement is plausible but not rigorously demonstrated, giving the work a somewhat heuristic and engineering-driven character rather than a theoretically grounded advancement. Please see weakness for questions. Fully AI-generated
GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces GUI Knowledge Bench, a diagnostic benchmark that evaluates vision-language models (VLMs) on GUI-specific knowledge across three dimensions: interface perception, interaction prediction, and instruction understanding. The benchmark spans 6 platforms and 292 applications with 3,483 questions, revealing that current VLMs struggle with system state reasoning, action prediction, and task completion verification despite reasonable performance on basic widget recognition. 1. Comprehensive Coverage: The benchmark's scope is impressive - covering 6 platforms, 292 applications, and over 40,000 screenshots. This diversity enables robust evaluation across different GUI environments. 2. Interesting Empirical Analysis: The evaluation reveals clear patterns - models perform well on widget function recognition but struggle with system states and interaction dynamics. The confusion matrix showing bias toward "click" actions is insightful. 1. The benchmark mixes three distinct capabilities: knowledge, perception, and grounding. For example, for some of the problems, it involves clicking on a coordinate which requires strong grounding ability. The mix of various abilities make it harder to understand what causes the deficiency in GUI models. 2. The paper assumes VLMs should possess extensive prior GUI knowledge, but as mentioned by the authors, GUI interfaces are constantly evolving, and have impossible broad coverage. Why should the model possesses knowledge, rather explore and learn by itself in the environment. 3. Poor figure quality and inconsistent fontsizes. Multiple figures (2-5) contain small text. 1. Setting mimic real agent trial and error: Does using reflection style (multiple iterations, follow [1] settings) improve accuracy in GUI agent trials? 2. Does correlation exist between your GUI knowledge benchmark and other GUI agent benchmarks, and does possessing this knowledge lead to higher accuracy? [1] Wang, Xingyao, et al. "Mint: Evaluating llms in multi-turn interaction with tools and language feedback." Fully human-written
PreviousPage 38 of 1516 (75800 total rows)Next