ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 15899 (21%) | 4.43 | 3.58 | 3687 |
| Heavily AI-edited | 3233 (4%) | 4.22 | 3.59 | 2990 |
| Moderately AI-edited | 7082 (9%) | 4.20 | 3.61 | 2722 |
| Lightly AI-edited | 16648 (22%) | 4.15 | 3.68 | 2746 |
| Fully human-written | 32938 (43%) | 4.13 | 3.62 | 2917 |
| Total | 75800 (100%) | 4.21 | 3.62 | 3026 |
Each entry below lists the paper title, the ratings, the review text, and the EditLens prediction.
One Measure, Many Bounds: Bridging TV, Variance, and Mutual Information

Soundness: 2: fair | Presentation: 3: good | Contribution: 2: fair | Rating: 4: marginally below the acceptance threshold | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

This paper introduces a novel, one-parameter family of information-theoretic generalization bounds based on the vector-valued $L_p$-norm correlation measure, $V_{\alpha}$. The key idea is that by tuning the parameter $\alpha$, this framework can interpolate between several existing types of generalization measures. The paper highlights three key special cases:

1. $\alpha=1$: This regime recovers a well-known mutual information bound by Xu & Raginsky.
2. $\alpha \to \infty$: This regime leads to a worst-case deviation bound.
3. $\alpha=2$: This is presented as the main conceptual contribution. The framework yields a new, intuitive generalization bound that is directly controlled by the variance of the algorithm's output probability, $Var_S[p(w|S)]$.

Further, the authors introduce a new stability condition, "Adaptive Density Stability," to demonstrate a (strong) sufficient condition under which their bound is non-vacuous and achieves a standard $\mathcal{O}(1/\sqrt{n})$ learning rate. Finally, the authors empirically validate their framework, demonstrating that their $V_2$ (variance-based) bound is tighter (on a single task) than a contemporary Conditional Mutual Information bound.

- The paper is well written.
- The primary strength of this paper is the introduction of an elegant framework ($V_{\alpha}$) that unifies multiple, previously distinct information-theoretic bounds.
- The experiment in Section 6 (Figure 2) is compelling. Showing that the new $V_2$ bound is empirically tighter than the CMI bound from Steinke & Zakynthinou (2020) provides some evidence for the utility and non-triviality of this new measure. However, conclusively showing that this bound dominates the one by Steinke & Zakynthinou would require more experiments (even if they were simple examples).
- The $\alpha=2$ case, which directly links generalization error to the variance of the algorithm's output distribution ($Var_S[p(w|S)]$), is a significant and intuitive conceptual contribution.

- While Adaptive Density Stability is a useful theoretical tool, it presents a very strong, pointwise condition. The paper does not provide any concrete examples of common learning algorithms that are proven to satisfy this condition with the required $\gamma_n = \mathcal{O}(1/n)$ rate, which is necessary for the $\mathcal{O}(1/\sqrt{n})$ generalization bound.
- The framework's flexibility comes from the parameter $\alpha$. Figure 1 shows $\alpha=2$ as optimal for a simple Z-channel model. However, the paper does not provide a practical method or even a heuristic for selecting the optimal $\alpha$ for a given problem. Without this, it is unclear how to obtain the tightest possible bound from this framework in practice.
- The specialization at $\alpha=2$ (Theorem 4.1), which connects generalization to the algorithm's output variance $Var_{S}[p(w|S)]$, is a worthwhile contribution. However, the claim that this provides a fundamentally new information-theoretic perspective linking variance and stability to generalization may be slightly overstated (see questions).

- See weaknesses.
- Your work presents a case for using the algorithm's output variance, $Var_{S}[p(w|S)]$, as a direct measure of stability that bounds the generalization error. This is a valuable contribution, particularly the novel variance-based bound derived at $\alpha=2$. I am curious how your stability measure (in the $\alpha=2$ case) relates to a different, recently proposed variance-based condition for learnability. This recent work (Proposition 5.2 in the 2024 paper by Gastpar et al.) has shown that an algorithm is "$l_2$-estimable" (meaning that there exists a tight bound, in terms of the L2 error, on its population risk) if and only if the conditional variance of its *population loss*, given the sample, is small (i.e., $\mathbb{E}[var(L_{\mathcal{D}}(A(S))|S)] \le \epsilon$). Your measure focuses on the variance of the algorithm's output probability over different samples $S$, while this other work focuses on the variance of the *outcome* (the true loss) over different distributions $\mathcal{D}$ that could have generated $S$. Could you comment on the relationship, if any, between these two distinct but related variance-based notions of stability? I believe that situating your work in relation to this alternative perspective on variance could further strengthen your paper's contribution.
- What is the missing reference in l. 881?

References:
Gastpar, M., Nachum, I., Shafer, J., & Weinberger, T. Which Algorithms Have Tight Generalization Bounds? NeurIPS 2025.

EditLens Prediction: Fully human-written
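For context on the $\alpha=1$ endpoint discussed in the review above: the Xu & Raginsky mutual-information bound it refers to takes, for a $\sigma$-sub-Gaussian loss and $n$ training samples, the following standard form (background material, not a formula taken from the submission):

```latex
% Xu & Raginsky (2017): expected generalization gap under a sigma-sub-Gaussian loss.
\left|\,\mathbb{E}\big[L_{\mathcal{D}}(W) - L_{S}(W)\big]\,\right|
\;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(W;S)}
```

The review's $\alpha=2$ discussion essentially asks how a variance quantity such as $Var_S[p(w|S)]$ can play the role that $I(W;S)$ plays here.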
Mitigating Privacy Risk via Forget Set-Free Unlearning

Soundness: 2: fair | Presentation: 3: good | Contribution: 2: fair | Rating: 4: marginally below the acceptance threshold | Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This paper addresses a significant limitation in existing machine unlearning techniques, namely the requirement to retain access to the "forget set" (the data to be removed) during the unlearning process. The authors introduce the concept of partially-blind unlearning (PBU) and propose RELOAD, a framework that performs unlearning without direct access to the forget set. RELOAD leverages cached gradients from the final epoch of training and combines gradient ascent, selective weight reinitialization, and fine-tuning on retained data. The paper claims strong empirical results showing that RELOAD can efficiently approximate retraining from scratch and even outperform some forget set-dependent approaches, including applications to both image classification and large language models.

* The problem formulation of unlearning without the forget set is timely, novel, and practically relevant for privacy compliance (e.g., GDPR).
* Methodologically sound integration of gradient ascent, weight reinitialization, and fine-tuning into a coherent framework.
* Empirical results show promising efficiency and competitive performance, especially for large models such as Llama2-7B.

* Limited robustness to model updates: The method assumes availability of final-epoch gradients representing the entire training data. In real-world pipelines where models are fine-tuned or continuously updated, these cached gradients may no longer capture the forget set's influence, reducing unlearning effectiveness. Evaluating RELOAD under fine-tuning or continual learning scenarios is necessary.
* Unclear privacy guarantees: Although RELOAD is "partially blind," cached gradients can still leak sensitive information. Prior works (e.g., Geiping et al., 2020) show that gradient inversion can reconstruct data. The paper lacks a quantitative privacy leakage analysis to support its safety claims.
* Limited ablation and sensitivity study: While some ablations are included, deeper exploration of how hyperparameters (e.g., ascent rate, reset proportion) affect performance is missing. This limits confidence in robustness across settings.
* Storage overhead: The approach removes the need to store the forget set but introduces the requirement to store full-model gradients, which can be large for modern networks. The practical feasibility of gradient caching is not analyzed.

1. How would RELOAD perform if the model undergoes fine-tuning or incremental updates after initial training?
2. What is the approximate storage cost of $\nabla_\theta L(D)$ for large models such as Llama2-7B, and could this be mitigated via gradient compression?
3. Have the authors tested for information leakage from cached gradients using existing gradient inversion techniques?
4. How does RELOAD handle overlapping or correlated data between forget and retain sets?
5. Can the method scale to federated or distributed training settings where only local gradients are available?

EditLens Prediction: Fully AI-generated
Mitigating Privacy Risk via Forget Set-Free Unlearning

Soundness: 3: good | Presentation: 3: good | Contribution: 3: good | Rating: 6: marginally above the acceptance threshold | Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

The paper proposes RELOAD, a partially-blind unlearning (PBU) method that aims to remove the influence of a forget set without access to the forget data. The method includes three parts: 1) an ascent step using cached final-training gradients, 2) re-initialisation of parameters with low "knowledge value," and 3) fine-tuning on the retain set to maintain model utility. Experiments include classic unlearning, entity unlearning, and corrective unlearning.

1. The problem of forget-set-free unlearning is novel and well-motivated. RELOAD enables machine unlearning without retaining the raw forget set by using cached final-step gradients.
2. The empirical results are strong across benchmarks. RELOAD can preserve model utility while improving forget quality.
3. The method is efficient. For Llama-2-7B, it uses 7% of weights and <0.025% of retained data, finishing in 8 minutes on a single GPU. This result suggests that the method is practical.

1. The paper does not provide a theoretical justification for why a single gradient-ascent update using $\nabla_\theta L(D)-\nabla_\theta L(D_{\text{retain}})$ and selective re-initialization of low-KV weights can remove the influence of the forget data. As a result, it is unclear when RELOAD succeeds or fails beyond the reported scenarios.
2. The paper claims RELOAD will "allow user data to be immediately removed when a request for deletion is made, eliminating the continued accumulation of dataset risk," but RELOAD requires the retain set to compute gradients and to perform fine-tuning. This means that the retained data is still at risk, and an ideal method should not use any training data during the unlearning process.
3. The robustness of RELOAD across different numbers of requests is unclear. When the forget set is extremely small (e.g., unlearning only one data point), $\nabla_\theta L(D_{\text{forget}})=\nabla_\theta L(D)-\nabla_\theta L(D_{\text{retain}})$ becomes a very small residual between two nearly identical large gradients; this makes the ascent step near zero and may fail to truly forget that sample. In addition, when the forget request is large (e.g., 30% of the data), the ascent and re-initialization would degrade model utility, while fine-tuning may fail to recover the utility given the limited retain set.

Please see the Weaknesses section for all questions and clarification requests.

EditLens Prediction: Fully human-written
Mitigating Privacy Risk via Forget Set-Free Unlearning

Soundness: 3: good | Presentation: 3: good | Contribution: 3: good | Rating: 6: marginally above the acceptance threshold | Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This paper tackles the problem of machine unlearning without retaining the forget set, a long-standing practical limitation in enforcing the "right to be forgotten." The authors introduce RELOAD, a partially-blind unlearning algorithm that relies only on cached gradients from the final training step and the retained dataset, avoiding the need to store sensitive data. Through gradient ascent, selective reinitialization, and fine-tuning, RELOAD achieves performance comparable to retraining from scratch while being significantly more efficient. Experiments on image classification and large language models demonstrate promising results.

1. The paper formalises the partially-blind unlearning setting, which is a realistic and privacy-preserving variant of traditional unlearning. This addresses an important gap between regulatory demands (e.g., GDPR) and existing technical capabilities.
2. The motivation connecting dataset risk and model risk is conceptually strong and provides a clear societal justification for this work.
3. RELOAD elegantly combines gradient-based unlearning, structured sparsity, and fine-tuning in a simple yet effective pipeline. The "knowledge value" mechanism for selective reinitialization is particularly interesting.
4. Across diverse tasks (CIFAR, SVHN, and Llama-2), RELOAD shows comparable or better performance than methods requiring the actual forget set. Its efficiency (<8 min for Llama2-7B) suggests real practical potential.

1. Limited theoretical justification. The paper mainly relies on intuition and empirical validation. While the derivation of $\nabla L(D_{\text{forget}}) = \nabla L(D) - \nabla L(D_{\text{retain}})$ is sound, the guarantees of approximate unlearning (e.g., bounds on residual influence) are not formally analysed.
2. Dependence on cached gradients. Storing full-model gradients at the end of training may be expensive for large-scale models, potentially offsetting some of the claimed efficiency or privacy advantages.
3. While results are impressive, the experiments could be broadened, e.g., to include more realistic privacy benchmarks or human-sensitive data domains.
4. Some ablation studies (e.g., varying α in reinitialization, or using partial gradient caching) are deferred to the appendix but would strengthen the main text.

See weaknesses.

EditLens Prediction: Fully AI-generated
Mitigating Privacy Risk via Forget Set-Free Unlearning

Soundness: 3: good | Presentation: 3: good | Contribution: 2: fair | Rating: 4: marginally below the acceptance threshold | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Existing studies require the model provider to keep the data requested for unlearning until the unlearning is completed. This paper explores and defines a new unlearning setting, Partially Blind Unlearning (PBU), in which no direct access to the unlearning data is required. Under this setting, the authors implement a three-step method, RELOAD, that leverages cached gradients from the training stage and combines components from previous studies: it computes the gradient difference, performs a single gradient ascent step, and then fine-tunes on the retained dataset. The authors then provide comprehensive experiments on both classical unlearning tasks on vision models and unlearning tasks on language models.

1. The paper provides a new privacy-oriented perspective on the problem of unlearning. The authors make an important observation and define a new setting for exploring safe unlearning. The motivation is well explained.
2. The paper integrates several existing methods to solve a new and practically important problem. This yields a more modular and interpretable unlearning algorithm.
3. The evaluation is comprehensive and representative. The experiments benchmark the performance of RELOAD on both small-scale vision models and language models.

1. In Section 2.3, the authors state that the loss gradient on the forget dataset under the original model can be inferred by computing the difference between the loss gradient on the original dataset and that on the retained dataset. This, however, relies on several assumptions that are not explicitly stated, such as the assumption that the loss is computed additively across samples. I am concerned that for contrastive learning tasks or models trained with layer normalization, this formula may not function as intended. I would appreciate further discussion on this point.
2. While the overall unlearning method provides more interpretability compared to existing unlearning methods and achieves promising results, the methodological innovation is somewhat incremental. The core parts of the method, including gradient ascent, selective weight re-initialization, and fine-tuning, are existing methods. The contribution lies mostly in how these methods are combined under the PBU setting.

RELOAD also achieves good results in the baseline comparison, where the experimental setting is not restricted to PBU; it outperforms baseline methods that do use the forget set. If so, why is PBU important to this method?

EditLens Prediction: Fully human-written
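The additivity concern raised in the last review's first weakness can be made concrete with a small sketch. This is a reader illustration under the assumption of a per-sample additive loss, not RELOAD's actual code; the tiny linear model and data are placeholders:

```python
import torch

def dataset_grad(model, loss_fn, xs, ys):
    """Summed gradient over a dataset, flattened into one vector."""
    model.zero_grad()
    loss_fn(model(xs), ys).backward()             # loss is a plain sum over samples
    return torch.cat([p.grad.flatten().clone() for p in model.parameters()])

model = torch.nn.Linear(4, 2)
loss_fn = torch.nn.CrossEntropyLoss(reduction="sum")   # additivity holds for 'sum'

x, y = torch.randn(10, 4), torch.randint(0, 2, (10,))
x_retain, y_retain, x_forget, y_forget = x[:7], y[:7], x[7:], y[7:]

g_full   = dataset_grad(model, loss_fn, x, y)                # cached at training time
g_retain = dataset_grad(model, loss_fn, x_retain, y_retain)  # computable after deletion
g_forget = dataset_grad(model, loss_fn, x_forget, y_forget)  # never stored

# The identity holds only because the loss decomposes additively per sample;
# batch-coupled objectives (contrastive losses, batch-norm statistics) break it.
print(torch.allclose(g_full - g_retain, g_forget, atol=1e-5))  # True
```

With a batch-coupled objective (e.g., a contrastive loss over in-batch negatives) or batch-dependent normalization statistics, the subtraction no longer recovers the forget-set gradient, which is exactly the failure mode the review asks about.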
Global optimization of graph acquisition functions for neural architecture search

Soundness: 2: fair | Presentation: 3: good | Contribution: 2: fair | Rating: 2: reject | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

This paper proposes NAS-GOAT, a framework for globally optimizing graph-based acquisition functions in Bayesian optimization (BO) for neural architecture search (NAS). The authors formulate the graph search space, including reachability, shortest paths, and node/edge labels, as a mixed-integer program (MIP), enabling exact optimization of acquisition functions. The method generalizes prior graph BO formulations (e.g., BoGrape) to handle weakly-connected or disconnected DAGs common in NAS. Experiments on NAS-Bench-101, 201, and 301 show that NAS-GOAT efficiently finds near-optimal architectures, often outperforming or matching state-of-the-art baselines.

++ This method extends graph BO to NAS by relaxing the strong connectivity assumption of BoGrape.
++ Comprehensive experiments on three major NAS benchmarks under both deterministic and noisy settings demonstrate robustness and efficiency.

-- The MIP encoding for graph structures builds heavily on BoGrape, with the main adaptation being the relaxation of strong connectivity. While this is non-trivial, the paper could better highlight what specific constraints were modified or added to handle NAS-specific DAGs. Specifically, the claim that BoGrape is unsuitable due to strong connectivity is not followed by a clear explanation of how this is resolved beyond "generalizing the graph encoding."
-- I am afraid that this method is not a "plug-and-play" solution. The MIP model must be manually re-derived and re-implemented for each new search space topology. This creates a significant barrier to practical adoption and limits its applicability to new or evolving NAS problems.

1. I suggest the authors provide more analysis of the differences between this method and BoGrape. As far as I can tell, the contribution of this work lies in the adaptation of BoGrape to NAS tasks.

EditLens Prediction: Moderately AI-edited
Global optimization of graph acquisition functions for neural architecture search

Soundness: 2: fair | Presentation: 3: good | Contribution: 2: fair | Rating: 4: marginally below the acceptance threshold | Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

NAS-GOAT casts cell-based neural architecture search as a Mixed-Integer Program in which graph topology, reachability, shortest-path features, and a GP acquisition function are jointly optimized. The resulting MIP is solved to global optimality at every BO step, eliminating hand-crafted mutations and providing certificates of optimality under the surrogate model. Experiments on three public NAS benchmarks demonstrate competitive or superior query efficiency versus recent sampling-based or evolutionary BO baselines.

1. The paper is clearly written and easy to follow.
2. The authors design a complete set of conditions for the NAS graph space.
3. The code is supplied, and the hyper-parameters are reported.

1. The complexity of the method should be analyzed.
2. The main content of Theorem 1 is essentially a modeling plan for the graph space, yet it takes up too much space in the paper, which makes it uncomfortable to read. In addition, Theorem 1 does not need to be stated as a theorem.
3. The experiments are all conducted on NAS-Bench-101 through 301; it would be better to evaluate the method on more datasets. Besides, the method does not achieve SOTA in some cases.

See weaknesses.

EditLens Prediction: Fully human-written
Global optimization of graph acquisition functions for neural architecture search

Soundness: 2: fair | Presentation: 3: good | Contribution: 2: fair | Rating: 8: accept, good paper | Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This article proposes an equivalent representation of a general labeled graph in an optimization variable space, where each graph corresponds to a unique feasible solution. It further introduces a universal kernel formula to measure graph similarity, which is compatible with the proposed encoding. This method achieves global acquisition optimization based on graph Bayesian optimization in neural architecture search.

1. The paper proposes an equivalent representation of general labeled graphs in the optimization variable space, ensuring that each graph corresponds to a unique feasible solution. Moreover, it introduces a unified kernel formulation that quantifies the similarity between two labeled graphs at the levels of graph structure, node labels, and edge labels. The advantages over baselines are demonstrated on NAS-Bench-101, NAS-Bench-201, and NAS-Bench-301.
2. The formulas and derivation proofs in the article are very detailed and accompanied by complete code.

1. The benchmarks used (NAS-Bench-101, NAS-Bench-201, and NAS-Bench-301) are all from before 2022. Similarly, the baseline methods such as GCN, NAS-BOT, and NAS-BOWL are also from before 2021. No experiments were conducted on the latest benchmarks or with more recent baseline methods.
2. This paper lacks an analysis of the algorithm's time complexity.
3. The evaluated benchmark is limited to NAS, lacking experiments on real-world tasks, which makes the contribution relatively limited.

1. Could experiments be added on more recent and broader benchmarks and baselines?
2. Could an analysis of the algorithm's time complexity be provided?

EditLens Prediction: Moderately AI-edited
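To make the "graph search space as an MIP" idea these reviews discuss more concrete, here is a minimal reader sketch in PuLP. It is not the paper's NAS-GOAT formulation: the node count, the big-M acyclicity encoding, and the placeholder objective are illustrative assumptions standing in for the paper's reachability/shortest-path constraints and kernel-based acquisition function.

```python
from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, LpInteger, lpSum

n = 4                                        # number of cell nodes (assumption)
nodes = range(n)
prob = LpProblem("dag_cell_encoding", LpMinimize)

# Binary edge variables and integer topological-position variables.
e = {(i, j): LpVariable(f"e_{i}_{j}", cat=LpBinary)
     for i in nodes for j in nodes if i != j}
pos = {i: LpVariable(f"pos_{i}", lowBound=0, upBound=n - 1, cat=LpInteger)
       for i in nodes}

# Acyclicity: selecting edge i -> j forces node j to appear later in the ordering.
for (i, j), var in e.items():
    prob += pos[j] >= pos[i] + 1 - n * (1 - var)

# Placeholder objective standing in for the (far more involved) acquisition
# function a real graph-BO formulation would optimize over this encoding.
prob += lpSum(e.values())
prob.solve()
print({k: v.value() for k, v in e.items()})
```

The point of such an encoding is that every feasible assignment of the variables corresponds to exactly one DAG, so a MIP solver can search the architecture space with optimality certificates instead of hand-crafted mutations.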
Five-Mode Tucker-LoRA for Video Diffusion on Conv3D Backbones

Soundness: 1: poor | Presentation: 1: poor | Contribution: 1: poor | Rating: 2: reject | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

This paper proposes a 5D Tucker-LoRA adapter for parameter-efficient fine-tuning (PEFT) of Conv3D-based text-to-video diffusion models. 5D Tucker-LoRA contains 5-D weight updates for Conv3D (across output channels, input channels, temporal, height, width). The authors claim that this preserves the spatio-temporal geometry of video generation. However, the work suffers from critical gaps in innovative contribution to fully substantiate its claims.

1. This paper presents sufficient implementation details and is easy to follow.
2. This paper presents some theoretical properties of their method.

1. The biggest issue with this paper is that the problem it discusses and addresses is simply not a concern for current video generation models. The 5D-LoRA proposed in this paper is applicable to Conv3D, yet modern video generation architectures are almost all based on Transformer-based DiT (Diffusion Transformer) designs. These architectures do not have convolutional layers at all, so there is even less need for a convolution-based LoRA. The architectures discussed in this paper (AnimateDiff and VideoCrafter) are now considered outdated. The authors should give more consideration to DiT architectures (CogVideoX, Wan, HunyuanVideo).
2. The writing quality is clearly below the bar for ICLR. The introduction of this paper fails to explain why Conv3D-specific LoRA would be more effective than attention-based LoRA for temporal learning. When introducing the initialization strategy using Higher-Order SVD (HOSVD) and Higher-Order Orthogonal Iteration (HOOI) (line 136), the paper only mentions these techniques without properly citing their original works. The Preliminaries section covers core concepts of video latent diffusion, 3D convolutions, and parameter-efficient adaptation, but most of these topics lack relevant citations.

This work lacks impactful innovation, and the proposed 5D-LoRA is not applicable to modern DiT models. Moreover, the writing quality of this paper does not meet the standards for acceptance at ICLR.

EditLens Prediction: Fully human-written
Five-Mode Tucker-LoRA for Video Diffusion on Conv3D Backbones

Soundness: 2: fair | Presentation: 2: fair | Contribution: 2: fair | Rating: 2: reject | Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This paper introduces a Five-Mode Tucker-LoRA for parameter-efficient fine-tuning of text-to-video diffusion models built on Conv3D backbones. While conventional LoRA methods flatten convolutional kernels or use pseudo-3D (temporal-only) adapters, the proposed approach directly applies a Tucker decomposition to the 5-D convolution kernel.

1. The paper targets an under-explored but important design space: PEFT for video diffusion that respects Conv3D's intrinsic five-mode tensor structure.
2. The authors maintain a unified evaluation protocol across backbones (VideoCrafter, AnimateDiff) and provide detailed metrics (FVD, CLIP-T, VRAM, throughput).

1. Poor writing and presentation quality. The overall writing is rough and occasionally inconsistent, making it difficult to follow technical details in later sections. Figures (e.g., Fig. 2-4) are non-vector raster images with visible compression artifacts and low readability.
2. Weak and outdated baselines. The experiments compare only against early, relatively weak models (VideoCrafter and AnimateDiff). These backbones lag far behind current state-of-the-art systems such as WAN 2.1, SVD, or HunyuanVideo, which feature stronger visual quality and temporal consistency. As a result, the reported gains are difficult to interpret as meaningful progress on modern video diffusion.
3. Limited novelty. The paper's main contribution, applying Tucker decomposition to Conv3D kernels, is conceptually straightforward and built almost entirely on existing tensor algebra and LoRA principles. There is little true methodological innovation beyond adapting Tucker to 5D convolution. The theoretical propositions (parameter counts, monotonicity) are standard results from the classical tensor decomposition literature.
4. The paper evaluates only on FVD and CLIP-T, which are weakly correlated with human aesthetic or temporal preference. There is no user study, subjective rating, or perceptual evaluation to support the claim of "better visual coherence."

1. Substantially revise the writing and ensure figures are vector-based (PDF/SVG).
2. Replace outdated baselines with strong, recent models (e.g., WAN 2.1, SVD).

EditLens Prediction: Fully AI-generated
Five-Mode Tucker-LoRA for Video Diffusion on Conv3D Backbones

Soundness: 1: poor | Presentation: 1: poor | Contribution: 1: poor | Rating: 2: reject | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

The paper proposes 5D Tucker-LoRA, a structured low-rank adaptation method for text-to-video diffusion models. By learning a Tucker residual directly on the 5D Conv3D kernel, the method preserves the spatio-temporal structure of video kernels and allows flexible mode-wise rank control. It is evaluated on VideoCrafter and AnimateDiff, showing memory-quality trade-offs and faster attainment of practical FVD thresholds compared to pseudo-3D adapters.

* This paper aims to address an important problem: parameter-efficient fine-tuning for video diffusion, which has high computational and memory costs.
* The proposed 5D Tucker-LoRA method is conceptually appealing, as it allows flexible mode-wise rank control.

* The evaluation only compares against the pseudo-3D adapter (AD-2D). There are no experiments with other PEFT strategies such as naive LoRA, making it difficult to assess the true performance gains.
* No video examples or visual analysis are provided, making it difficult to judge perceptual improvements or temporal coherence.
* The paper does not explicitly show how 5D Tucker-LoRA improves the temporal dimension, which is a key aspect for evaluating video diffusion models.
* Other recent PEFT strategies for video diffusion are not considered, limiting understanding of where this method stands relative to the state of the art.
* The impact of temporal rank selection is not thoroughly explored, leaving questions about robustness and hyperparameter sensitivity.

* How does 5D Tucker-LoRA improve temporal modeling compared to 2D or pseudo-3D adapters? Can you provide quantitative or qualitative evidence?
* How does the method perform against other PEFT strategies, such as naive LoRA or full fine-tuning?
* Could you provide video examples to support qualitative evaluation, demonstrating temporal coherence and overall generation quality?
* The paper only compares temporal Tucker ranks ∈ {0, 1, 4} and claims r=4 is optimal. How sensitive are the results to temporal rank selection?
* Is 5D Tucker-LoRA compatible with higher-resolution videos or longer temporal sequences, and what is the associated computational overhead?

EditLens Prediction: Heavily AI-edited
Five-Mode Tucker-LoRA for Video Diffusion on Conv3D Backbones

Soundness: 2: fair | Presentation: 2: fair | Contribution: 3: good | Rating: 4: marginally below the acceptance threshold | Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This paper proposes a novel five-dimensional LoRA for convolutions. It decomposes all five dimensions of a Conv3D kernel (input, output, time, height, and width) in convolution-based video diffusion models and applies LoRA updates with an individual rank per mode. This preserves spatiotemporal structure and further enables flexible, independent adjustment of each mode. The proposed method is tested with various UNet-based video diffusion models for domain adaptation on the MSR-VTT dataset, where it outperforms traditional 2D LoRAs for convolutions.

- The proposed new LoRA for convolutions is novel and well motivated. It disentangles all five dimensions, not only preserving the original spatiotemporal structure without flattening, but also enabling flexible control of each dimension (for potential spatial/temporal decomposition).

- The proposed method is only compared to its own baseline, 2D LoRA, in the evaluation. More non-LoRA domain adaptation methods should be compared comprehensively. The raw base model's result is also needed, as the provided visualization might indicate that 2D LoRA even harms quality.
- The proposed method is limited to convolutions only, while most diffusion foundation models are adopting transformer architectures. For example, AnimateDiff combines spatial convolution with temporal attention, and only 2D LoRA can be applied there.

- How would each rank be heuristically adjusted in different cases, e.g., new datasets with different spatial/temporal complexity or domain gap, or different base-model video lengths/frame resolutions?

EditLens Prediction: Fully human-written
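As a concrete picture of the five-mode update described in the reviews above, here is a minimal PyTorch sketch of a Tucker-structured residual added to a frozen Conv3D kernel. It is a reader illustration, not the paper's implementation; the shapes, ranks, and zero-core initialization are assumptions.

```python
import torch

C_out, C_in, T, H, W = 64, 32, 3, 3, 3            # Conv3D kernel shape (assumed)
ranks = (8, 8, 2, 2, 2)                           # one Tucker rank per mode (assumed)

core = torch.nn.Parameter(torch.zeros(*ranks))    # zero core => adapter starts as a no-op
factors = [torch.nn.Parameter(0.02 * torch.randn(dim, r))
           for dim, r in zip((C_out, C_in, T, H, W), ranks)]

def tucker_residual(core, factors):
    # dW[o,i,t,h,w] = sum_{a..e} core[a,b,c,d,e] U1[o,a] U2[i,b] U3[t,c] U4[h,d] U5[w,e]
    U1, U2, U3, U4, U5 = factors
    return torch.einsum("abcde,oa,ib,tc,hd,we->oithw", core, U1, U2, U3, U4, U5)

W0 = torch.randn(C_out, C_in, T, H, W)            # frozen pretrained Conv3D kernel
W_adapted = W0 + tucker_residual(core, factors)   # only core + factors are trained
print(W_adapted.shape)                            # torch.Size([64, 32, 3, 3, 3])
```

The per-mode ranks make the control the reviews mention explicit: the temporal rank can be raised or lowered independently of the spatial and channel ranks, which a flattened 2D LoRA cannot do.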
Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models

Soundness: 2: fair | Presentation: 3: good | Contribution: 2: fair | Rating: 4: marginally below the acceptance threshold | Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This paper presents a mechanistic study of the chain-of-thought process of Large Reasoning Models (LRMs). The authors conduct controlled experiments to analyse whether LRMs generate answers via CoT reasoning or memory retrieval. Based on that, the paper answers three research questions:

RQ1: Do LRMs employ reasoning and retrieval simultaneously to derive answers?
RQ2: What factors influence the dominance of one capability over the other?
RQ3: How can we control the relative strength of these capabilities?

Building on those findings, the authors propose a novel fine-tuning framework that integrates memory unlearning with reinforcement learning and enhances generalizable reasoning capabilities.

- The experiments cover different model sizes, architectures, and training paradigms, which makes the conclusions relatively plausible.
- The paper employs a machine unlearning method to optimize the training process, which enhances generalizable reasoning capabilities.

Though the experiments are conducted thoroughly, the memory and reasoning perturbation mechanisms are not plausible enough and need further discussion; see the questions for details.

1. Reasoning perturbation is achieved by placing an erroneous answer at the end of the thinking phase. However, this design seems simplistic, since the transformer mechanism normally focuses on recent tokens (and attention sinks), which could make the LRM ignore the preceding thinking process and take shortcuts.
2. Memory perturbation fine-tunes QA pairs to change the LRM's memory. Although the authors claim that the fine-tuning is restricted to the relevant knowledge and minimizes side effects, I am not sure how this can be guaranteed; for example, it might deteriorate the reasoning capability of the LRM, which would in turn weaken the experimental conclusions. More ablations would be beneficial.
3. The paper disentangles reasoning and memory by whether the LRM is allowed to think or not. Though this is simple and effective, during the reasoning phase some knowledge could still be retrieved from the LRM's memory.
4. The FARL experiment lacks a discussion of the training dataset.

If those concerns are settled properly, I am willing to raise my score.

EditLens Prediction: Fully human-written
Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models

Soundness: 3: good | Presentation: 3: good | Contribution: 3: good | Rating: 6: marginally above the acceptance threshold | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Given the phenomenon that LRMs' reasoning traces often contradict their final answers, this paper investigates which factors influence reasoning ability, in terms of the conflict between retrieval of the LRM's knowledge (internal knowledge) and Chain-of-Thought (external prompting).

1. The authors find that both CoT and internal knowledge contribute to the final answers.
2. The authors investigate how several competing factors (model size, reasoning problem domain, reasoning-model training method) influence the reasoning abilities of LLMs. The authors also validate the "post-hoc explanation" phenomenon (models fabricate reasoning steps to justify a false answer), further showing that reasoning ability distilled from LRMs is not as reliable as that obtained via RL.
3. To validate the findings above, the authors suggest applying RL on reasoning-intensive datasets to prevent the model from retrieving results from its own knowledge. The authors propose training the model with a knowledge-unlearning method to forget specific knowledge (specifically, with GRPO and NPO).

1. The authors conduct sufficient empirical experiments to validate how CoT prompts and internal model knowledge influence the final results.
2. The authors conduct "post-hoc explanation" experiments to cross-validate the results of the previous methods.
3. The authors leverage an unlearning method, which adds NPO after GRPO, to demonstrate that weakening the knowledge ability of LRMs enhances the reasoning ability of LLMs, in terms of reasoning robustness and effectiveness. The authors perform an extra evaluation of reasoning-path quality.

1. Knowledge and reasoning ability may still be hard to decouple for analysis. The SFT attack may still harm the CoT reasoning ability: as the experiments in Table 1 demonstrate, SFT gives worse R-PSR than the original R1-Llama-8B. The authors merely state the assumption in line 150 that the impact would be small.
2. Some claims seem inappropriate:
   * In line 48, this research still does not seem to provide "a mechanistic understanding of how different capabilities jointly influence LRMs' answer generation".
   * From lines 360 to 362, it is not easy to understand the claim that "our findings reveal a challenge where the retrieval mechanism enables models to 'hack' the reward signal during RL and impair its effectiveness". As you will find if you estimate $\delta = \text{T-PSR} - \text{PER}$, which accounts for the LRM's own reasoning failure, it is even larger than PER.
3. The current study mainly focuses on multiple-choice QA; more open-ended problems could be studied in the future.

1. The section `Attention Patterns` seems unrelated to the other sections; in particular, its results are not used for RQ3. What is the purpose of this section?
2. Some experimental settings:
   * How did the authors obtain the results of Figure 2 on the different datasets?
   * What are the experimental settings of Table 1 (RQ3), given that the previous "perturbation attack" is only conducted on questions the LLM can correctly solve?
3. The authors aim to intervene on the "retrieval knowledge shortcut" to enhance the reasoning ability of models. However, the actual reasoning accuracy may decrease, since the model still needs the corresponding basic knowledge to obtain the results. What if the authors tested unlearning some intermediate knowledge used by the CoT, for the $y_r$ or $y$ results?
4. Since reasoning ability can be enhanced when LLMs unlearn the final-answer knowledge, could they achieve better accuracy on harder problems if they are allowed to access external knowledge?

EditLens Prediction: Fully human-written
Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models

Soundness: 3: good | Presentation: 3: good | Contribution: 3: good | Rating: 6: marginally above the acceptance threshold | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

This paper focuses on the two primary competing capabilities that influence the final answers of LRMs: deliberate reasoning through CoT and direct retrieval from internal memory. The paper identifies the key factors that determine the dominance of reasoning versus retrieval through controlled intervention experiments, and ultimately proposes a post-training method, FARL, to regulate the relative strength of these two capabilities.

1. The competitive interplay between reasoning mechanisms and memory retrieval in LRMs is an important and timely research topic.
2. The paper is well-structured and clearly organized.
3. I like the idea and design of FARL; it effectively validates the paper's conclusions and serves as a good contribution to the study.

1. In Section 3.1, I wonder whether SFT can serve as a reasonable method for directly modifying a model's memory. While SFT does increase the probability of target tokens during deep-layer processing, evidence from mechanistic interpretability research [1, 2] suggests that it does not directly alter the MLP-stored factual knowledge or modify the model's internal retrieval mechanisms. Therefore, the reliability of SFT as a means of memory intervention has a direct impact on the validity of the experimental conclusions in this work.
2. The investigation of the reasoning mechanism in LRMs is limited to injecting misleading cues into the CoT and observing the model's response across domains. However, the paper does not examine the **intrinsic** reasoning behavior of LRMs within those domains. For instance, in mathematical tasks, the interplay between the model's **inherent** mathematical reasoning and its memory recall mechanism is not explored.
3. Similarly, the study of retrieval mechanisms introduces new "memory" through SFT, rather than examining the model's inherent knowledge retrieval capabilities.
4. The experiments are restricted to multiple-choice tasks. I would like to see results on open-ended generation settings as well, as such tasks would more naturally reflect the model's reasoning and retrieval interplay in real-world scenarios.
5. I appreciate the problem studied in this paper. Although I guess some inspiration may come from *Competition of mechanisms: Tracing how language models handle facts and counterfactuals*, I see this as acceptable and good for me. Still, consistent with point 4, I would like to see results across more diverse task formats to strengthen the conclusions.

Overall, I appreciate the idea presented in this work and would be happy to reconsider my rating if the above concerns are addressed sufficiently.

Please see the weaknesses above.

EditLens Prediction: Lightly AI-edited
Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models

Soundness: 2: fair | Presentation: 3: good | Contribution: 2: fair | Rating: 2: reject | Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Large Reasoning Models (LRMs) generate answers using two competing mechanisms: a) Chain-of-Thought (CoT) reasoning and b) memory retrieval. These mechanisms can conflict, leading to inconsistencies between reasoning traces and final answers. To show that LRMs use both reasoning and retrieval simultaneously, the authors designed controlled perturbation experiments that perturb either reasoning or retrieval: the reasoning steps are subtly altered to be misleading or incorrect, or the model's memory is poisoned with misleading cues. They find that smaller or distilled models are more vulnerable to retrieval perturbations and might fabricate reasoning traces to justify retrieved answers. In comparison, larger models and those trained with reinforcement learning are more robust and reasoning-driven. Based on these experiments, the authors propose FARL (Forgetting-Augmented Reinforcement Learning) to: a) suppress retrieval shortcuts, b) enhance reasoning-dominant behavior, and c) improve generalization and robustness. It achieves a 47.8% improvement in CoT robustness, a 22.8% accuracy gain on in-domain tasks, and a 5.8% accuracy gain on out-of-domain tasks.

The methodology is rigorous, including controlled perturbation experiments and attention-head analysis, which strengthens the validity of the findings. Understanding the interplay between reasoning and retrieval is critical for advancing trustworthy AI, especially in high-stakes domains like math, logic, and scientific reasoning.

The paper primarily focuses on math and logic tasks to evaluate reasoning vs. retrieval. The evaluations are limited to multiple-choice QA and may not generalize to open-ended QA. Please add an evaluation on free-form answers with verifiable graders, such as [GeneralThought](https://huggingface.co/datasets/RJT1990/GeneralThoughtArchive).

R-PSR and T-PSR are correlational indicators, not causal evidence of pathway dominance. A misleading cue that flips an answer does not prove the answer was reasoning-driven. Can we add causal intervention experiments (e.g., targeted weight/activation ablations on putative retrieval heads; causal scrubbing on residual streams) to show counterfactual dependence of the final answer on each pathway?

Metrics like R-PSR and T-PSR are binary and may not capture nuanced interactions between reasoning and retrieval. For example, a model might partially rely on both mechanisms in a non-exclusive way. What if we added token-level attributions, such as logit-lens analyses over steps?

The extraction of answers and judgment of reasoning correctness occasionally falls back to GPT-4o-mini for answer extraction (Section 3.2). Does misclassification inflate perturbation success? I suggest that the authors add a human-validated subset or at least a majority-vote ensemble judge.

How does NPO work in FARL? Did you input all x and y to the NPO? How does it compel models to "forget" specific memorized answers?

Unify b_x (line 4) and x (line 9) in Algorithm 1 for easier understanding.

EditLens Prediction: Fully AI-generated
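The R-PSR/T-PSR discussion in the reviews above centers on perturbation success rates computed over items the model originally solves. The exact metric definitions are not reproduced here, so the following is a hypothetical reader sketch of such a flip-rate computation, not the paper's code; `answer`, `perturb`, and the stub model are made-up placeholders.

```python
from typing import Callable, List, Tuple

def perturbation_success_rate(
    items: List[Tuple[str, str]],                 # (question, gold_choice) pairs
    answer: Callable[[str], str],                 # model's answer for a prompt
    perturb: Callable[[str], str],                # e.g. append a misleading cue to the CoT
) -> float:
    # Restrict to items the model solves without any perturbation.
    solved = [(q, g) for q, g in items if answer(q) == g]
    if not solved:
        return 0.0
    # Count how often the perturbation flips the answer on those solved items.
    flipped = sum(answer(perturb(q)) != g for q, g in solved)
    return flipped / len(solved)

# Toy usage with a stub "model" that follows a misleading hint if one is present.
stub = lambda p: "B" if p.endswith("Answer: B") else "A"
items = [("Q1?", "A"), ("Q2?", "A"), ("Q3?", "B")]
print(perturbation_success_rate(items, stub, lambda q: q + " Answer: B"))  # 1.0
```

As the review notes, a high flip rate here is only correlational evidence about which pathway produced the original answer; it does not by itself establish causal dependence.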
Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation

Soundness: 2: fair | Presentation: 3: good | Contribution: 2: fair | Rating: 4: marginally below the acceptance threshold | Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

This paper extends Reinforcement Learning with Verifiable Rewards (RLVR) to open-ended tasks (e.g., creative writing, instruction following) via Verifiable Multiple-Choice Reformulation (VMR). RLVR originally excels in STEM tasks with clear ground truths but fails in open-ended ones. VMR restructures open-ended data into a multiple-choice format (one chosen, one rejected response) with random ordering to ensure verifiability. Experiments on 8 benchmarks show VMR-based RLVR outperforms baselines, with a 5.99-point average gain, enhancing LLM reasoning and performance and even surpassing some 32B-scale models.

1. Unlike RLVR's reliance on explicit ground truths, VMR transforms free-form data into verifiable multiple-choice pairs. It enables rule-based rewards without ambiguous evaluations, solving RLVR's inapplicability to open-ended scenarios.
2. Across 8 benchmarks, it achieves a 5.99-point average gain over the base model, with standout gains in creative writing. It even outperforms larger 32B-scale models, demonstrating its efficiency in enhancing LLM capabilities.
3. Random response ordering avoids positional bias. Compared to Baseline II (same data without VMR), it still gains, showing improvements stem from VMR's design, not just data scale, ensuring reliable training.

1. The method proposed in this paper solves the verification problem in open domains to a certain extent, but it faces significant issues in practical application. Extending this method to mathematical reasoning would seemingly require extremely high costs. The entire method relies on two candidate answers, and the verifier matches answers A and B with the ground truth (GT) option. How can this method be applied to mathematical reasoning where there is a unique GT? Is it necessary to forcibly construct an incorrect answer and then have the verifier make a judgment? This seems unreasonable and redundant.
2. An ideal experimental setup should involve training separately using RM-based datasets and VMR-based datasets with the same queries. In the current experimental setup, the method "combines RM-based and VMR-based datasets in equal proportion. RM-based queries are scored by the reward model, while VMR triples are verified using rule-based reward functions." This makes it impossible to decouple the roles of the reward model and the verifier, and I have doubts about the reliability of the experimental results.
3. The paper does not disclose the training equipment and time. Training a 14B-scale model with the training parameters described in Section 3 will incur extremely high costs. It remains questionable whether the cost-benefit ratio is sufficient to justify the promotion of this technology.
4. In the bar chart of Figure 1, some numbers overlap with each other.

Please refer to the Weaknesses.

EditLens Prediction: Lightly AI-edited
Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation

Soundness: 1: poor | Presentation: 1: poor | Contribution: 1: poor | Rating: 2: reject | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

The paper proposes Verifiable Multiple-Choice Reformulation (VMR), which converts preference pairs (chosen vs. rejected responses) into verifiable binary-choice questions. A rule-based verifier then provides exact rewards (1/0) based on whether the model selects the better response.

It is important and anticipated to extend RLVR to open-ended tasks.

- **The paper lacks methodological novelty, and the method is largely a straightforward combination of existing techniques** (e.g., RLVR, GRPO, and preference-based data formatting) without introducing significant conceptual or architectural innovation. While VMR is a useful engineering trick, it does not constitute a fundamental advance in reinforcement learning or reasoning modeling. The method section is also overly verbose, repeating well-known formulations without sufficient focus on what truly differentiates the proposed pipeline.
- The writing, particularly in the abstract and introduction, lacks focus and fails to clearly articulate the core problem, contribution, and significance. Key claims are buried in lengthy paragraphs, and the narrative does not effectively motivate why extending RLVR to open-ended tasks is non-trivial or why VMR is a principled solution.
- Inadequate formatting and scholarly presentation. The paper suffers from inconsistent or incorrect formatting, including improper citation styles, the placement of table titles, and the position of the **REPRODUCIBILITY STATEMENT**.
- Critical implementation details are missing. For instance: the number and type of GPUs used for training are not disclosed; training time, memory consumption, and computational cost are omitted; hyperparameter sensitivity and ablation studies (e.g., impact of the 1:1 RM/VMR data mix) are not provided.
- The paper over-relies on LLM-as-a-judge metrics. Most benchmarks (e.g., MT-Bench, AlpacaEval) use automated LLM-based evaluators, which are known to exhibit biases.
- The method is only validated on a single base model (DeepSeek-R1-Distill-Qwen-14B). It remains uncertain whether VMR's benefits transfer to other architectures, scales, or instruction-tuned models.

Please see weaknesses.

EditLens Prediction: Heavily AI-edited
Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation

Soundness: 3: good | Presentation: 3: good | Contribution: 3: good | Rating: 6: marginally above the acceptance threshold | Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This paper extends Reinforcement Learning with Verifiable Rewards (RLVR) from STEM domains (mathematics, programming) to open-ended tasks lacking ground-truth solutions (creative writing, instruction following). The key innovation is Verifiable Multiple-Choice Reformulation (VMR), which restructures preference data (chosen/rejected response pairs) into multiple-choice questions that can be verified using rule-based functions. For each query, the model is asked to choose between two randomly-ordered responses, and receives a binary reward based on selecting the better one.

- Novel extension of RLVR to open-ended domains where standard answers don't exist
- Sound mathematical formulation connecting VMR to the standard RLVR framework
- Clear problem motivation explaining RLVR's limitation in open-ended domains
- Addresses an important gap: extending RLVR beyond STEM domains

- The connection between multiple-choice discrimination and open-ended generation is assumed but not justified
- Only one base model tested (DeepSeek-R1-Distill-14B); crucial to test on other models
- Dependency on high-quality preference data limits applicability
- Heavy reliance on LLM-as-judge metrics, which have known biases

- How does VMR perform on models without built-in reasoning capabilities?
- Can you provide error bars or significance tests for the improvements?
- The reasoning density improvement is quite small. Is this statistically significant?
- How does the method perform when preference annotations disagree or are noisy?

EditLens Prediction: Fully AI-generated
Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation

Soundness: 2: fair | Presentation: 3: good | Contribution: 1: poor | Rating: 2: reject | Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

The paper tackles the challenges of applying RLVR to open-ended tasks like creative writing. The authors propose a training strategy called Verifiable Multiple-Choice Reformulation (VMR). This method restructures data from open-ended tasks into a multiple-choice format, which makes it possible to verify the answer. The experiments find that the proposed VMR improves the performance of LLMs on open-ended tasks, showing an average gain of 5.99 points over the baseline.

* This paper tries to tackle the problem that it is not easy to do RL training on open-ended questions, which is an important problem the community is trying to solve.
* The empirical results from the proposed method seem good, with a noticeable gain compared with the baselines.
* The paper is clear; readers can easily understand most of the concepts introduced.

* The reward verifies only whether the model selected the pre-labeled preferred response, not that the response is objectively better. For the RM-based subset (line 257), the labels themselves are produced by an automated reward model (URM-LLaMA-3.1-8B). Therefore, the pipeline still inherits RM bias/noise even though the training reward is rule-based. This undercuts the claim that they avoid RM issues (line 063, in Figure 2).
* Most reported wins depend on LLM-as-judge evaluation (e.g., MT-Bench, AlpacaEval-2, WildBench, CreativeWriting V3, ArenaHard 2.0, CreativeWriting), which can share stylistic biases with the training signal. There is no human evaluation to validate the improvement, making over-optimization to judge preferences a real risk.
* In the experiments, the authors compare with reward-model-scored RL baselines. There is no DPO/KTO (or other RLHF method) baseline trained on the same pairwise triples, despite those being the most obvious alternatives. Some gains could stem from the extra signal in pairwise data rather than the on-policy RL objective or the VMR prompt itself.
* The A/B candidates come from existing datasets, not from the current policy model, which makes it questionable whether the proposed method can really improve the LLM's generation quality.
* The proposed method is a form of RLHF; it is just like the actor-critic/PPO loop. The "RM-based dataset" uses open-ended queries whose rewards are assigned by a reward model (URM-LLaMA-3.1-8B). For VMR, each item has a human-labeled chosen vs. rejected answer. They convert it to A/B and give a binary reward (1/0) if the policy picks the chosen one (see Figure 3 and Eq. (9)). Functionally, that is RLHF with a degenerate reward model that returns 1 for the preferred option and 0 otherwise. The policy still maximizes expected reward from human preferences via policy gradient.

* Though the motivation of this paper is to transform open-ended questions into verifiable ones, I wonder about the necessity of doing so. For training LLMs with RL, is converting the open-ended questions a good and general enough solution?

EditLens Prediction: Fully human-written
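Putting the mechanism the reviews describe into code form: a preference pair is shuffled into an A/B question and the policy earns reward 1 only if it picks the human-preferred response. This is a reader sketch, not the paper's implementation; the prompt wording and function names are assumptions.

```python
import random

def make_vmr_item(query: str, chosen: str, rejected: str, rng=random):
    """Reformulate a preference pair into a verifiable two-option question."""
    texts = [chosen, rejected]
    rng.shuffle(texts)                                  # random order avoids positional bias
    options = list(zip("AB", texts))
    prompt = (
        f"{query}\n\nWhich response is better?\n"
        + "\n".join(f"({label}) {text}" for label, text in options)
        + "\nAnswer with A or B."
    )
    gold = next(label for label, text in options if text == chosen)  # assumes chosen != rejected
    return prompt, gold

def vmr_reward(model_choice: str, gold: str) -> float:
    """Rule-based binary reward: 1 if the policy selected the preferred response."""
    return 1.0 if model_choice.strip().upper() == gold else 0.0

prompt, gold = make_vmr_item("Write a haiku about rain.", "strong haiku ...", "weak haiku ...")
print(vmr_reward("B", gold))   # 0.0 or 1.0 depending on where the chosen response was shuffled
```

This also makes the last review's point concrete: the reward is a deterministic function of the preference label, i.e., effectively a degenerate reward model.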
Real-time Echocardiography Video Segmentation via Slot Propagation, Spatiotemporal Feature Fusion, and Frequency-phase Enhancement Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents a technically sound echocardiography segmentation model that addresses three critical challenges in cardiac ultrasound analysis through well-motivated innovations: the Context-Guided Slot Propagation (CGSP) mechanism for ambiguous boundary handling, spatiotemporal feature fusion for managing cardiac shape variations, and Frequency Phase Enhancement (FPE) for speckle noise reduction. - paper well written - clinical motivation significant - great segmentation performance - Insufficient Ablation on FPE Components. Difficult to assess the FPE module's design effectiveness and understand which frequency components (amplitude or phase) are more critical for noise suppression in echocardiography. E.g., using only phase to add weights is okay, but why? - Similar to FPE, CGSP lacks ablation and technical motivation for its specific design. - PKEchoNet seems to have excellent performance as well as FPS, while this method has a relatively small gain but a large FPS drop. The significance of the paper needs to be further justified. - In FPE, for Ma and Mp, the paper only states "Sigmoid(M∗) ∈[0, 1]", which is trivially true. But how are Ma and Mp initialized? (e.g., uniform across all frequencies vs. noise-informed initialization) Does it influence the final results? - Again, how was Ms initialized specifically? - Are both phase and amplitude filtering equally important? An ablation comparing (a) amplitude-only filtering, (b) phase-only filtering, and (c) joint filtering would clarify the individual contribution of each component and whether separate masks are necessary. - How does FPE compare against simpler spatial-domain denoising methods (e.g., learnable convolutions, attention mechanisms, or non-local means)? Is the frequency-domain approach demonstrably superior, or does it add unnecessary complexity? - In slot initialization, is N=2 sufficient? The paper mentions "foreground and background slots" suggesting N=2, but cardiac structures have multiple regions. - Why use K-Medoids specifically for feature selection? - How is T (number of propagation iterations) determined? Fully human-written
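For context on the FPE questions above, here is a minimal sketch of a frequency-domain filter with learnable amplitude and phase masks; the module name, mask shapes, and zero initialization are assumptions rather than the paper's actual design:

```python
import torch
import torch.nn as nn

class FrequencyPhaseEnhancement(nn.Module):
    """Sketch of an FPE-style block: learnable masks filter amplitude and phase."""
    def __init__(self, channels, height, width):
        super().__init__()
        w_freq = width // 2 + 1                  # rfft2 keeps roughly half the width
        # zero init => sigmoid gives a uniform 0.5 weight over all frequencies;
        # the initialization question raised in the review is left open here
        self.mask_amp = nn.Parameter(torch.zeros(channels, height, w_freq))
        self.mask_phase = nn.Parameter(torch.zeros(channels, height, w_freq))

    def forward(self, x):                        # x: (B, C, H, W) feature map
        spec = torch.fft.rfft2(x, norm="ortho")
        amp = spec.abs() * torch.sigmoid(self.mask_amp)        # amplitude filtering
        phase = spec.angle() * torch.sigmoid(self.mask_phase)  # phase re-weighting
        return torch.fft.irfft2(torch.polar(amp, phase), s=x.shape[-2:], norm="ortho")
```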
Real-time Echocardiography Video Segmentation via Slot Propagation, Spatiotemporal Feature Fusion, and Frequency-phase Enhancement Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. While the paper addresses an important and clinically relevant problem, the proposed method, however, lacks sufficient novelty. The core components of the architecture (slot-based propagation, spatiotemporal fusion, and frequency-domain enhancement) are either direct adaptations or incremental combinations of techniques that have already appeared in the literature, particularly in the medical image/video segmentation and foundation model adaptation domains. Below, I detail specific concerns regarding the methodology. 1. Well organized and clear motivation. While the paper addresses an important and clinically relevant problem, the proposed method, however, lacks sufficient novelty. The core components of the architecture (slot-based propagation, spatiotemporal fusion, and frequency-domain enhancement) are either direct adaptations or incremental combinations of techniques that have already appeared in the literature, particularly in the medical image/video segmentation and foundation model adaptation domains. ### 1. Limited novelty The paper claims that its "context-guided slot propagation (CGSP)" mechanism is a key innovation for separating foreground and background regions in noisy echocardiographic videos. However, slot-based representations for object-centric learning and video segmentation have already been extensively studied in previous works [1–8]. The authors have not clearly articulated how this manuscript differs from or advances beyond these prior studies. The SFF module aggregates features from reference and query frames using query-key-value attention, a pattern now very common in video segmentation, e.g., [6, 9, 10–14]. For example, XMem [9] and its successor XMem++ [6] already employ cross-frame attention with memory banks to fuse spatiotemporal context efficiently. The prototype-based matching in Eq. (12)–(16) closely resembles the feature correlation and readout mechanisms in STM [11], which is widely cited in video object segmentation, and similar designs can also be found in the medical imaging domain, e.g., [12], [13], [14]. Thus, the SFF module offers no novel architectural or theoretical departure from established paradigms. The FPE module applies FFT, modulates amplitude/phase with learnable masks, and uses IFFT to reconstruct features, a strategy that has seen multiple recent instantiations: [15] and [16] both exploit frequency-domain filtering or noise-robust tuning for image segmentation, explicitly addressing the generalization problem in ultrasound. Frequency-aware SAM variants, such as [17], already integrate frequency priors into SAM backbones for enhanced boundary delineation under noise, directly overlapping with the motivation of FPE. ### 2. Missing comparison with SOTA methods. Several SOTA methods are compared under inconsistent experimental conditions: The paper uses MiT-b2 (SegFormer backbone), which is significantly more powerful than the backbones used in many cited baselines (e.g., U-Net in early SAMUS variants, ResNet in XMem).
Yet, the authors do not re-implement or re-benchmark these methods with the same backbone for a fair comparison. ### 3. Minor Weaknesses - Missing Statistical Significance and Variance Reporting - High FLOPs and Parameter Count Undermine "Real-Time" Claim. As shown in Table 2, FESPNet has 370 GFLOPs and 34.3M parameters (~24× the FLOPs of SimLVSeg (3G) and ~3× those of PKEchoNet (158G)), yet it only achieves marginal mDice gains. Its FLOPs are even higher than those of Cutie (218G) and Swin-UMamba (340G), both of which are already considered heavy for real-time medical applications. ### 4. Typos - 053 acorss frames -> across frames - 228 a new feature map FS ∈ RK×H×W, where N is the number of slots, where is N? - 240 ,its featurer presentation H_{Si}∈R_L, where is L? [1] Locatello, F., Weissenborn, D., Unterthiner, T., Mahendran, A., Heigold, G., Uszkoreit, J., Dosovitskiy, A. and Kipf, T., 2020. Object-centric learning with slot attention. Advances in neural information processing systems, 33, pp.11525-11538. [2] Lee, M., Cho, S., Lee, D., Park, C., Lee, J. and Lee, S., 2024. Guided slot attention for unsupervised video object segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3807-3816). [3] Liao, G., Jogan, M., Hussing, M., Zhang, E., Eaton, E. and Hashimoto, D.A., 2025, September. Future slot prediction for unsupervised object discovery in surgical video. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 219-229). Cham: Springer Nature Switzerland. [4] Madan, S., Chaudhury, S. and Gandhi, T.K., 2024, November. Pneumonia Classification in Chest X-Ray Images Using Explainable Slot-Attention Mechanism. In International Conference on Pattern Recognition (pp. 271-286). Cham: Springer Nature Switzerland. [5] Deng, X., Wu, H., Zeng, R. and Qin, J., 2024. Memsam: Taming segment anything model for echocardiography video segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9622-9631). [6] Bekuzarov, M., Bermudez, A., Lee, J.Y. and Li, H., 2023. Xmem++: Production-level video segmentation from few annotated frames. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 635-644). [7] Jaegle, A., Borgeaud, S., Alayrac, J.B., Doersch, C., Ionescu, C., Ding, D., Koppula, S., Zoran, D., Brock, A., Shelhamer, E. and Hénaff, O., 2021. Perceiver io: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795. [8] Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A. and Carreira, J., 2021, July. Perceiver: General perception with iterative attention. In International conference on machine learning (pp. 4651-4664). PMLR. [9] Cheng, H.K. and Schwing, A.G., 2022, October. Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. In European conference on computer vision (pp. 640-658). Cham: Springer Nature Switzerland. [10] Maani, F., Ukaye, A., Saadi, N., Saeed, N. and Yaqub, M., 2024. SimLVSeg: simplifying left ventricular segmentation in 2-D+ time echocardiograms with self-and weakly supervised learning. Ultrasound in Medicine & Biology, 50(12), pp.1945-1954. [11] Oh, S.W., Lee, J.Y., Xu, N. and Kim, S.J., 2019. Video object segmentation using space-time memory networks. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9226-9235). [12] Wang, R. and Zheng, G., 2024.
PFMNet: Prototype-based feature mapping network for few-shot domain adaptation in medical image segmentation. Computerized Medical Imaging and Graphics, 116, p.102406. [13] Yuan, Y., Wang, X., Yang, X. and Heng, P.A., 2024. Effective Semi-Supervised Medical Image Segmentation With Probabilistic Representations and Prototype Learning. IEEE Transactions on Medical Imaging. [14] Kim, H., Hansen, S. and Kampffmeyer, M., 2025, September. Tied Prototype Model for Few-Shot Medical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 651-661). Cham: Springer Nature Switzerland. [15] Chen, L., Fu, Y., Gu, L., Zheng, D. and Dai, J., 2025. Spatial frequency modulation for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. [16] Wei, Z., Wu, C., Du, H., Yu, R., Du, B. and Xu, Y., 2025, September. Noise-Robust Tuning of SAM for Domain Generalized Ultrasound Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 476-486). Cham: Springer Nature Switzerland. [17] Kim, S., Jin, P., Chen, C., Kim, K., Lyu, Z., Ren, H., Kim, S., Liu, Z., Zhong, A., Liu, T. and Li, X., 2025. MediViSTA: Medical Video Segmentation via Temporal Fusion SAM Adaptation for Echocardiography. IEEE Journal of Biomedical and Health Informatics. Please find my comments above. Fully AI-generated
Real-time Echocardiography Video Segmentation via Slot Propagation, Spatiotemporal Feature Fusion, and Frequency-phase Enhancement Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents a method for the segmentation of echocardiography video. The method introduces three main components, including slot propagation to distinguish targets from the noisy background, a spatiotemporal feature fusion module to capture temporal information, and a frequency-phase enhancement module to extract semantic patterns. Experiments are conducted on two public echocardiography video datasets. The paper is clearly written and easy to follow. - The methodology lacks sufficient novelty. The proposed approach is a straightforward combination of three components, each of which has already been explored in prior video segmentation studies. The paper does not clearly demonstrate how this integration leads to new insights. - Table 1 shows that the method obtains only marginal performance gains over existing approaches, with merely 0.44% and 0.57% improvements in Dice on the two datasets. These minimal improvements further suggest that the contribution may primarily come from an empirical combination of existing techniques rather than a fundamentally new idea. - The paper emphasizes real-time capability as an important aspect of echocardiography segmentation. However, this claim is not well supported by the experimental evidence. In Table 2, the FPS of the proposed method is significantly lower than that of the second-best method, PKEchoNet, and the FLOPs of the proposed method are even higher than some SAM-based methods, undermining the claim of computational efficiency. - The experimental evaluation is limited to only two echocardiography datasets. This restricted scope raises concerns about the generalizability of the proposed approach and limits its potential interest for a broader audience. - Could the authors clearly describe the specific novelty or unique contribution of their approach beyond the straightforward integration of the three existing components? - Can the authors demonstrate practical benefits or downstream impact that justify the contribution despite the small numerical differences? - Could the authors clearly justify the computational advantage of the method? - Could the method generalize to other tasks or datasets? What adaptations would be required? Fully AI-generated
Real-time Echocardiography Video Segmentation via Slot Propagation, Spatiotemporal Feature Fusion, and Frequency-phase Enhancement Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This work introduces a method to segment the LV in echocardiography. The proposed model contains Frequency-Phase Enhancement and Context-Guided Slot Propagation modules, which handle noise and temporal consistency in echocardiography videos. They demonstrated reasonable performance on two well-known datasets. They introduced this combination of the Frequency-Phase Enhancement module and the Context-Guided Slot Propagation, which seems like a good way to handle noise and maintain temporal consistency in echocardiography videos. Another possible strength is that they do show strong quantitative results on the datasets they used, like outperforming several existing methods on those benchmarks. The authors presented several interesting ablation studies. 1- There is a discrepancy in the reported Dice scores for other methods. For instance, SimLVSeg reported an average Dice of 93.32 on the EchoNet dataset, whereas the value is reported as 91.38 in Table 1. That's a major problem if they're citing different numbers from the original paper without explaining why. 2- The introduced work is over-engineered and complex and hence may raise the issue of generalizability. They only evaluated on the CAMUS and EchoNet-Dynamic datasets, so the results might not extend to other datasets like HMC-QU or ACDC. That's a limitation on how widely applicable their findings are. 3- Some ablations are done on the much smaller dataset CAMUS, e.g., the number of iterations. This isn't ideal, as they could have used the larger dataset, EchoNet, for those tests, or both datasets to give a more balanced view. 4- No standard deviations are reported in their performance metrics, which makes it hard to know how consistent or variable their results are. And it would have been better to highlight the larger dataset results in the main text instead of the smaller dataset. I would switch tables 4 and 7. 5- The complexity of the model is not clear. The real-time performance claim might depend heavily on specific hardware, and the authors didn't elaborate on that. So it might not be as generalizable in terms of real-time use on different systems. Please address the 5 points in the weaknesses section. Fully human-written
EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes EchoX, a three-stage pipeline for speech-to-speech LLMs. The key idea is to use a frozen text-to-codec module to generate pseudo speech targets from intermediate semantic representations, with the goal of reducing the acoustic-semantic gap. The model also adopts unit-language tokens and a trigger-based streaming mechanism. Experiments on spoken QA benchmarks show competitive results with relatively modest training data. * Targets a real issue in SLLMs (knowledge degradation). * Unit-language tokens help reduce sequence length without hurting quality. * The data pipeline could be useful to the community, and the planned release is valuable. 1. The method feels close to a cascaded S2T + TTS system with an extra alignment step. The novelty and contribution are not very clear. 2. The main claim (mitigating the acoustic–semantic gap) is not convincingly demonstrated. I would expect clearer evidence showing reduced semantic degradation compared to text-only or other S2S approaches. In particular, the paper does not show that the gap between S2T and S2S is actually smaller than in other systems. For example, the reported S2T vs. S2S scores are 77.3 vs. 63.3, while VITA-Audio shows 75.6 vs. 68.0, indicating a narrower gap than that of the proposed model. The paper could add WER results and the S2T/S2S relative gap, as mentioned in [1], to show this clearly. 3. I have concerns about the fairness and stability of the test set. The accuracy on these spoken QA benchmarks is computed via keyword matching, which leads to inconsistent results across papers for the same model (e.g., GLM-4-Voice, LLaMA-Omni2). Some questions are also not suitable for testing, e.g., they rely on chemical symbols or other abbreviated forms, which makes the score highly sensitive to ASR and text normalization errors rather than the S2S generation quality itself. [1] Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model https://arxiv.org/pdf/2506.04518v1 Overall, the paper targets an important problem (the intelligence degradation introduced during modality conversion), and the dataset release is a valuable contribution to the community. However, I remain skeptical that the proposed method effectively addresses this issue. The evidence presented does not convincingly support the main claim. Below are several technical questions for the authors: 1. The Echo loss uses speech codec targets obtained by decoding the S2T hidden states into text via greedy search and then passing them through a frozen T2C model. Since the T2C module is pre-trained independently and does not update during Stage III, all speech-token supervision could be precomputed offline by running T2C on ground-truth transcripts. Could the authors clarify what representational benefit the online speech label generation provides? In particular, what gradient differences arise compared to purely offline codec labels? 2. Semantic–Acoustic Gap Quantification. Echo training is motivated by reducing the acoustic–semantic gap in representation space, yet the paper measures only downstream QA accuracy.
Could the authors provide quantitative metrics validating that H is closer to semantic spaces after Echo training than before (e.g., via cosine similarity, probing, or clustering entropy)? Without such evidence, it is unclear whether Echo training modifies representation geometry rather than simply regularizing the decoder. 3. In an AR setup with NTP loss and causal masking, the model can already leverage corrected previous text and speech tokens during training to predict the token at timestep t. When paired with offline text labels and corresponding TTS-generated codec tokens, the model receives clean supervision at every step. Given this, it is unclear what additional benefit the Echo loss provides beyond standard AR conditioning. Could the authors clarify what unique learning signal the Echo loss introduces that is not already captured by the lower-triangular causal mask and NTP supervision? 4. Could the paper add a benchmark like VoiceBench? It would strengthen the empirical evidence and make the results more convincing. Fully human-written
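To make question 1 above concrete, here is a rough sketch of the pseudo-label pipeline as the review describes it (the `generate_codec_tokens` call and all argument names are hypothetical stand-ins, and length alignment between hidden states and codec targets is glossed over):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def make_echo_targets(s2t_hidden, s2t_lm_head, t2c_model, tokenizer):
    """Greedy text decoded from the S2T hidden states is pushed through a frozen
    T2C model to obtain pseudo speech-token targets (names are illustrative)."""
    text_ids = s2t_lm_head(s2t_hidden).argmax(dim=-1)     # greedy text decoding
    text = tokenizer.batch_decode(text_ids, skip_special_tokens=True)
    return t2c_model.generate_codec_tokens(text)          # frozen text-to-codec

def echo_loss(echo_decoder, s2t_hidden, codec_targets):
    """Cross-entropy between the Echo decoder's prediction (conditioned on the
    S2T hidden states) and the T2C-generated codec tokens."""
    logits = echo_decoder(s2t_hidden)                      # (B, T, codec_vocab)
    return F.cross_entropy(logits.transpose(1, 2), codec_targets, ignore_index=-100)
```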
EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes EchoX, a three-stage training framework designed to mitigate the acoustic-semantic gap in Speech-to-Speech Large Language Models (SLLMs). Its core technical contribution is "Echo Training," where an Echo decoder, initialized from a pre-trained Text-to-Codec (T2C) model, is trained to generate speech tokens from the hidden states of a Speech-to-Text (S2T) LLM, using dynamically generated pseudo-labels from the same T2C model. The model also employs a denoising adapter and a streaming inference mechanism. The model is evaluated on knowledge-based QA benchmarks and shows competitive performance with only ~6k hours of training data. - The focus on intelligence degradation in SLLMs is a well-motivated problem. - The paper provides extensive details on the data pipeline and model configurations. - The paper fails to evaluate speech generation quality beyond accuracy on QA tasks. - The paper fails to discuss and contrast its approach with highly relevant work, such as SpeechGPT, and comparisons to other recent strong baselines like Qwen-Audio are missing. - The proposed method has several new components (Echo Training, Denoising Adapter). However, there is no rigorous ablation study to isolate the contribution of each. - While Figure 5 attempts to illustrate the acoustic-semantic gap, the analysis is superficial. The choice of words ("Hi", "Hello", "High") is simplistic and not representative of the complex semantic-acoustic interactions in real dialogue. - Why did you not include speech quality metrics (e.g., MOS, WER) to evaluate the generated audio? Without these, the claim of "mitigating the acoustic-semantic gap" is only partially supported. - How does EchoX quantitatively and qualitatively compare to SpeechGPT, which also uses a three-stage pipeline and unit tokens? - What is the performance drop if you remove the Denoising Adapter? What if you train the Echo decoder directly on ground-truth speech tokens instead of the T2C-generated pseudo-labels? Lightly AI-edited
EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a multi-stage training framework for speech-to-speech large language models (SLLMs) in order to bridge the acoustic-semantic gap. The proposal is based on the insight that, due to acoustic-semantic gaps, current SLLMs have not achieved on-par 'intelligence' between the text and speech modalities. The main contribution of the paper is that the proposed framework introduces a module to dynamically predict speech tokens based on semantic input, aiming to bridge the gap between output speech tokens and the semantic features. Originality: - The authors lay out a clear motivation for the proposal - assuming the acoustic-semantic gap causes the 'intelligence degradation' in SLLMs - and turn that insight into an architectural solution (Echo training). - The proposed framework utilizes a pretrained text-to-codec (T2C) model to generate pseudo speech tokens from text, which reduces the demand for speech-to-speech data, which is relatively scarce. - The proposed framework introduces a streaming inference mechanism with a read/write trigger. Clarity: - The multi-stage framework (S2T, T2C, joint S2T with Echo training) appears to be a straightforward, modular recipe that combines conventional methods in an organic way. It's easy to follow. 1. The claim. There should be a rigorous definition of "acoustic-semantic gap". Moreover, the claim of the "acoustic-semantic gap" leading to "intelligence degradation" is thin. Figure 1 itself is not a sufficient demonstration of the concept; more evidence should be provided to argue that a) the "acoustic-semantic gap" *causes* "intelligence degradation" and b) most (if not all) SLLMs with various choices of speech tokenizers have this "acoustic-semantic gap" issue. If such evidence exists in other literature, please cite it. 2. Potential design flaws. There are some design choices in the EchoX framework that are questionable; please find them in the 'Questions' section. 3. Experimental setup: there is a lack of detailed ablation studies and more general analysis. Please find details in the 'Questions' section. 4. Presentation needs to improve. For instance, formulas (1) and (3) represent the log-likelihood of the target sequence; however, as a loss function to minimize using gradient descent, it should instead be the *negative* log-likelihood. 1. Some speech tokenizers focus on semantics, some focus on acoustics, and others try to balance both. Of all these speech tokenizers, do all of them have this "acoustic-semantic gap" issue? Is it more of an issue for speech LLMs or for the choice of speech tokenizers? In other words, how can one prove that the EchoX training method could universally improve most speech LLMs? 2. In Section 2.4, why do you do greedy search to get pseudo text labels $X'$ when you have the ground-truth text? Wouldn't that cause error propagation? 3. In Table 4: what is EchoX without Echo training? Is it essentially a cascaded model? Why are the speech-to-text scores higher than the text-to-text scores on "Llama Questions"? 4. In Table 4: the details of 'interleaving' should be given; how did the authors make sure they are comparable? 5.
Lacking ablation studies: - The pre-trained modules such as the T2C and S2T LLMs need evaluation. - The effect of the denoiser. 6. If speaker variance is out of scope for this work, how would the authors argue that the proposed framework could generalize and close the 'acoustic-semantic gap' in real-world applications? Fully human-written
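For reference, the sign convention flagged under weakness 4 above is the standard sequence-level objective for a target sequence $Y=(y_1,\dots,y_T)$ and input $X$:

```latex
% negative log-likelihood (the quantity actually minimized by gradient descent)
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid y_{<t}, X\right)
```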
EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. Based on the observation that speech LLM training degrades the text capabilities of an LLM, the paper tries to address this issue with text-to-codec guidance during training of the speech LLM. The speech codecs are based on a grouping of HuBERT units. The paper proposes a multi-stage training strategy given a pre-trained text LLM. First, the model is trained for speech-to-text; then a text-to-speech-codec (T2C) module is trained; finally, the proposed Echo training uses the guidance from the T2C module. The model takes in input speech and passes it through the S2T LLM; outputs of this step are fed into both a text denoising layer and the T2C module. The training objective then combines consistency losses (between the text output and the T2C embeddings, and between the T2C outputs and the speech-token decoding outputs) with the S2T loss. The model is trained on about 6k hours of data, and the experimental evaluations suggest competitive performance on the Llama Questions, Web Questions, and TriviaQA speech QA tasks. The scores are slightly behind other models in the literature (e.g., GPT-4o-Realtime (Hurst et al., 2024), VITA-Audio (Long et al., 2025), MinMo (Chen et al., 2025b)) in both its 3B and 8B versions. The paper then demonstrates an example where the semantic-acoustic gap occurs (hello - hi - high). Originality * The way the Echo loss is constructed might be novel. At an intermediate level, the text tokens are mapped to speech tokens which guide the learning of the S2S model. Construction of the unit language is based on the Soundwave paper, the LLM backbone is based on Llama, and the decoder consists of some transformer layers; hence the model architecture is not particularly novel. Quality * The example given in Section 5.2 is successful at demonstrating the acoustic-semantic gap problem. * According to the experiments, with 6k hours of synthetic speech data it is possible to achieve competitive performance compared to other models requiring much more training data. Clarity * There is no major concern around the clarity of the writing. However, the flow of the paper could have been improved. Significance * The paper is relevant to the speech LLM community, which focuses on speech and text domain alignment issues. * The paper claims that the generated synthetic dataset will be made available, which may become a useful resource. Originality - The originality is limited to the design of the Echo loss. Architectural components and experimental design have been introduced before. Quality - Even though the results are competitive with other models trained on much more data, the results do not provide the latest SOTA. In addition, the three datasets may not necessarily reveal the general performance of the model on datasets with different difficulty levels. - The evaluation setup is somewhat limited. In particular, for S2S tasks, only the QA performance is presented. It would be informative to also see speech quality metrics such as MOS scores. - Since the paper is trying to address the loss of reasoning/intelligence capabilities, it would be good to demonstrate the performance on unseen speech tasks.
Clarity - The example discussed in Section 5.2 could have been shown in the introduction to better motivate/explain the problem that the model is addressing. 1. Did the authors measure speech quality metrics (e.g., MOS) for the S2S model in addition to the content evaluation? How would it compare to a cascade of S2T and TTS? 2. Did the authors check the generalization capabilities of the model to unseen tasks and datasets at different difficulty levels? 3. Do the authors have any comment on the use of synthetic speech data instead of real speech data in model training? Did the authors experiment with any real spoken QA datasets? 4. The simple example provided in Section 5.2 is informative; however, placing this analysis earlier in the paper may provide better motivation for the reader. 5. The Table 5 caption introduces the $R$ measurement, which was not discussed in the text before. Fully human-written
From Ambiguity to Verdict: A Semiotic-Grounded Multi-Perspective Agent for LLM Logical Reasoning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 1: You are unable to assess this paper and have alerted the ACs to seek an opinion from different reviewers. This paper proposes LogicAgent, a multi-perspective reasoning framework based on Greimas' Semiotic Square, and introduces RepublicQA, a new benchmark derived from Plato's Republic for evaluating logical reasoning under semantic ambiguity. LogicAgent operates through three stages: semantic structuring (constructing contraries and contradictions), logical reasoning (FOL-based deduction), and reflective verification (multi-perspective validation). Experiments show improvements of 6.25% on RepublicQA and 7.05% on existing benchmarks (ProntoQA, ProofWriter, FOLIO, ProverQA). 1. LogicAgent's use of the semiotic square to handle contrary (opposite) concepts, not just contradictory (true/false) ones, is a new and smart way to deal with ambiguity. 2. RepublicQA fills an important gap by testing reasoning on abstract philosophical concepts with college-level difficulty. The method is computationally heavy. It is slow and uses a very large number of tokens (avg. ~18.4k) for each query, making it costly to run. Please refer to the weaknesses part. Lightly AI-edited
From Ambiguity to Verdict: A Semiotic-Grounded Multi-Perspective Agent for LLM Logical Reasoning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces LogicAgent, a multi-perspective reasoning framework grounded in the Semiotic Square, designed to address the challenges of logical reasoning in LLMs when confronted with semantic ambiguity and abstract propositions. The framework operates by performing parallel deductions in FOL on a proposition, its contrary, and its contradictory, leveraging a multi-stage reflective verification mechanism to resolve inconsistencies. The authors also introduce RepublicQA, a new benchmark for this task characterized by high difficulty and semantic complexity derived from philosophical texts, on which their method achieves state-of-the-art performance, significantly outperforming strong baselines. - The core idea of integrating a structuralist semantic tool (the Semiotic Square) with symbolic logic to mitigate semantic ambiguity is novel and compelling. - The contribution of a new, manually annotated benchmark (RepublicQA) to address the lack of semantic complexity in existing datasets is valuable to the community. - The methodology section lacks formal rigor. The paper would be significantly strengthened by adding more precise mathematical statements or lemmas that detail the formal assumptions and boundary conditions required to migrate the semiotic square into classical FOL. It should include, for example, a formal definition of the "existential import check" and its application. - Reproducibility remains a concern. While the prompts are provided, the authors should add more concrete examples of side-by-side NL-to-FOL mappings. It is especially important to include complex cases involving nested quantifiers and negations, as these are critical for replicating the "Translator" module. - The necessity of the *full* four-point Greimas Square is questionable. The authors should provide a targeted ablation study comparing the full four-point structure against a simpler three-point structure---one using S1, not S1, S2---to justify the framework's complexity. - Some related works on LLM-based logical reasoning are missing; they should be compared with the proposed method or their differences discussed, e.g.: [1] Cumulative Reasoning with Large Language Models [2] DetermLR: Augmenting LLM-based Logical Reasoning from Indeterminacy to Determinacy See above. Lightly AI-edited
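To illustrate the structure under discussion, here is a toy sketch of the four corners of Greimas' square and a three-valued verdict aggregation; the reconciliation rule is invented for illustration and is not the paper's actual decision procedure:

```python
from enum import Enum

class Verdict(Enum):
    TRUE = "True"
    FALSE = "False"
    UNCERTAIN = "Uncertain"

def semiotic_square(s1, s2):
    """Four corners for a proposition s1 and its contrary s2 (not mere negation)."""
    return {"S1": s1, "S2": s2, "not_S1": f"not ({s1})", "not_S2": f"not ({s2})"}

def reconcile(verdicts):
    """Toy rule: accept S1 only if its own proof succeeds and neither the
    contradictory nor the contrary is also proved; conflicts yield Uncertain."""
    if (verdicts["S1"] is Verdict.TRUE
            and verdicts["not_S1"] is not Verdict.TRUE
            and verdicts["S2"] is not Verdict.TRUE):
        return Verdict.TRUE
    if verdicts["not_S1"] is Verdict.TRUE or verdicts["S2"] is Verdict.TRUE:
        return Verdict.FALSE
    return Verdict.UNCERTAIN
```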
From Ambiguity to Verdict: A Semiotic-Grounded Multi-Perspective Agent for LLM Logical Reasoning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper targets two challenges of existing works. First, existing works overlook the interplay between logical complexity and semantic complexity. Accordingly, the authors propose LogicAgent, which is based on the semiotic square and can jointly address logical and semantic complexity. Second, existing benchmarks lack logical semantic complexity, so a benchmark, RepublicQA, is proposed, with greater lexical complexity and structural diversity. Logical complexity and semantic complexity are indeed different perspectives of natural language content, and this paper makes an effort to address these two issues explicitly. The adoption of the 'Semiotic Square' looks novel and brings an interesting idea into the field. 1. The abstract could be improved to make it more accessible to readers who are unfamiliar with these logical terms. In addition, changing terms also makes the abstract less readable. For example, does the structural diversity refer to the logical complexity? Is the lexical complexity the same as the semantic complexity? 2. The format can be improved to avoid confusion. In lines 39 and 40, the 'In AI Cohen et al. xxxx' should be 'In AI (Cohen et al. xxxx)'. This problem appears many times in the paper. 3. The presentation is not good enough. At the very beginning it is stated that the interplay of semantic complexity and logical complexity is targeted by this work, but the following part does not clearly explain what this so-called 'interplay' is, how it is 'overlooked', and how it is addressed by this work. It is stated that "existing benchmarks remain confined to relatively simple and determinate semantics, often centered on everyday scenarios with shallow relations or logic problems that lack intricate semantic structure". However, there are also benchmarks designed for math, scientific reasoning, or pure logical reasoning. How about them? Also, some logical reasoning datasets use synthetic approaches to build complex logical structures, with adjustable logical complexity. Fully human-written
From Ambiguity to Verdict: A Semiotic-Grounded Multi-Perspective Agent for LLM Logical Reasoning Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes LogicAgent, a reasoning framework that marries Greimas' Semiotic Square with first-order logic (FOL), adds an existential-import check to avoid vacuous truths, and evaluates propositions with a three-valued scheme {True, False, Uncertain}. - Provides a new dataset, RepublicQA, with college-level difficulty. - Proposes LogicAgent, with good results on benchmarks. - Dataset Scale: The size of the newly proposed dataset RepublicQA is too small (n=200). This limited scale raises concerns about the statistical robustness of the findings and the dataset's general utility. - Benchmark Reporting: The results on "Other Benchmarks" are reported as an aggregate average. Could the authors provide a detailed, disaggregated breakdown of performance for each individual benchmark (e.g., ProntoQA, ProofWriter, FOLIO, and ProverQA)? - Dataset Generalizability: The decision to construct the RepublicQA dataset exclusively from a single source, Plato's "Republic," is questionable. This narrow domain scope inherently limits the dataset's diversity and generalizability. - Novelty of Methodology: Using Greimas' Semiotic Square and extending the evaluation space from a binary {True, False} to a three-valued scheme {True, False, Uncertain} appear to lack significant innovation. As far as I know, many existing works, particularly in probabilistic logic, have implemented similar "de-binarization" approaches to handle uncertainty. The authors need to better justify the novelty of their method against this prior art. My issues are delivered in the weaknesses section. Lightly AI-edited
When Unlearning Backfires: Partial Unlearning Increases PII Regurgitation and enables data extraction in Meta’s Llama 3.2 1B Soundness: 2: fair Presentation: 2: fair Contribution: 4: excellent Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This work studies the effect of targeted removal of a “core” forget set for unlearning that is not representative of the full intended unlearned topic. In particular, they study how unlearning a subset of the unlearning topic (in this case, the core Harry Potter book series text) impacts regurgitation of related information (such as blog posts about Harry Potter). They do this by reinforcing the Harry Potter-related information in the LLM and subsequently removing the added Harry Potter knowledge by altering the model’s logit distribution and then finetuning on generic alternatives. They find that partial unlearning results in increased prevalence of related information that contains personally identifiable information, such as the names, affiliations, and websites of blog post authors. - The fundamental motivation of this work, understanding the effects of partial unlearning in realistic setups, is very interesting and under-explored in current unlearning evaluations. - The results are novel and would be a contribution to the general unlearning community. - The authors test a single model on a single method for a single dataset. Why do they not compare multiple unlearning methods? What motivates the use of the one that the authors propose? How does it differ from Eldan and Russinovich’s? - The authors provide no intuition or discussion for why this behavior comes about. Is this a result of unlearning only the core text of Harry Potter without unlearning any auxiliary data? If the authors had partially unlearned other related sources (blog posts, fanfiction, and Wikipedia entries), would this behavior still exist? - Is the PII extractable (via red-teaming) even for the base model and the reinforced model before unlearning? Generally, while I find this work well-motivated and very interesting, the experiments are not thorough or generalizable. It is difficult to understand if the observed behavior is a result of partial unlearning, the specific data that the authors unlearned, the specific unlearning method that the authors used, or even simply the setup of their red teaming. Without comparing each of the model versions (original, reinforced, unlearned) across regular evaluations, Harry Potter-centered evaluations, and red-team attacks, it is hard to draw conclusions. Furthermore, very little justification is provided for the unlearning method that the authors use, and no comparison is made with any other unlearning methods or different core forget sets. I believe this paper would strongly benefit from both a more thorough empirical analysis of the observed behavior and a discussion of why this behavior may exist. - Missing citation line 97 - Can the authors make the distinction between their unlearning method and Eldan and Russinovich’s more clear? Fully human-written
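For readers unfamiliar with the unlearning recipe this review summarizes, here is a brief sketch of the Eldan & Russinovich (2023)-style logit offset used to build the "generic" fine-tuning targets (the coefficient and the exact label construction in the paper under review may differ):

```python
import torch

def generic_target_logits(v_baseline, v_reinforced, alpha=1.0):
    """Tokens whose logits rise after reinforcing on the forget set are pushed
    down relative to the baseline model's predictions."""
    return v_baseline - alpha * torch.relu(v_reinforced - v_baseline)

def generic_labels(v_baseline, v_reinforced, alpha=1.0):
    """Per-position pseudo-labels for fine-tuning the model onto generic
    alternatives (a sketch of the recipe, not the paper's exact code)."""
    return generic_target_logits(v_baseline, v_reinforced, alpha).argmax(dim=-1)
```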
When Unlearning Backfires: Partial Unlearning Increases PII Regurgitation and enables data extraction in Meta’s Llama 3.2 1B Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper investigates the unintended safety risks of partial unlearning in LLMs with a case study on Llama-3.2-1B-Instruct in the Who's Harry Potter [1] setting. By applying unlearning on only the Harry Potter novels, a subset of full Harry Potter knowledge source, the study simulates real-world unlearning where full training data is inaccessible. While the method effectively erases most Harry Potter references, it unexpectedly increases the model's tendency to produce unexpected outputs, such as PII, which the authors hypothesize results from the model regurgitating memorized training data. The paper highlights that partial unlearning can paradoxically undermine safety and calls for further research into the unintended consequences of unlearning methods. [1] "Who's Harry Potter? Approximate Unlearning in LLMs", Eldan & Russinovich, 2023. 1. **Clear methodology:** The paper follows a well-established unlearning setting, and provides detailed experimental parameters. 2. **Good conceptual framing:** The study highlights partial unlearning as a realistic setting, and attempts to connect the practical unlearning limitations with concrete safety risks such as PII regurgitation. 1. **Questionable conclusion:** The link between partial unlearning and increased PII leakage is not convincingly demonstrated. Section 4.2.1 appears to show two types of leakage: (1). public topic-related content (the Harry Potter fan blog example), and (2). genuinely unrelated PII (the university student example), which constitutes the true safety concern. I don't see evidence directly connecting (2). and partial unlearning from the study, and it remains plausible that similar leakage could occur even with comprehensive unlearning to a full forget set. - Questions: 1. Can you provide additional evidence or analysis supporting that PII leakage specifically results from partial unlearning? Have you tested whether similar leakage occurs under complete unlearning? 2. Can you quantify or categorize the leaked content to distinguish between benign and genuinely harmful PII leakage? 2. **Limited generality:** the experiments are limited to a single model, a single unlearning domain, and a single unlearning method, which constrains the generalizability of the conclusions. While the compute constraints are understandable, other methods such as RMU [1] and SatImp [2] can be implemented within the stated single A100 budget. - Questions: 1. Can you clarify why Llama-3.2-1B-Instruct was chosen over other models, and whether you expect the same behaviors at larger scales? 2. Is it feasible to conduct even a lightweight ablation on a different domain/unlearning method to verify whether the phenomenon generalizes beyond the speific setup in the paper? Some other weaknesses that are less important: 3. **Insufficient analysis:** The decision to withhold red teaming prompts and model weights is defensible for safety, but it limits reproducibility and may hinder a full understanding or verification of the proposed findings. - Questions: 1. 
Can you provide more qualitative and quantitative evaluations for Sections 4.2.1 and 4.2.2? In conclusion, I think the empirical scope and the provided analysis of the paper do not yet support the strength of its main conclusion. Strengthening the experimental breadth and clarifying the relation between harmful PII leakage and partial unlearning would substantially improve the paper's impact and credibility. [1] "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning", Li et al., 2025 [2] "Exploring Criteria of Loss Reweighting to Enhance LLM Unlearning", Yang et al., 2025 Please see the questions in weaknesses. Fully human-written
When Unlearning Backfires: Partial Unlearning Increases PII Regurgitation and enables data extraction in Meta’s Llama 3.2 1B Soundness: 1: poor Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. In this paper, LLM unlearning is performed to make a model forget about some domain of knowledge. The paper shows that when the model is then queried about this knowledge, instead of outputting its original knowledge, it tends to output information about people who are related to the original topic, which could be interpreted as a form of PII leakage. The premise of the paper is interesting. The question of what unintended consequences arise from unlearning is worth exploring. The experiments in this paper are highly limited, only exploring a single model (Llama 3.2 1B), a single unlearning method that is not very standard, and a single unlearning target (Harry Potter knowledge). With such a narrow scope, we do not know if these results generalize beyond this single model + method + dataset setup. In fact, it seems clear that this PII leakage happens in Harry Potter because of the amount of fanfiction on the web, and the fact that fanfiction tends to mention PII such as information about the fanfiction author. However, in other domains this would be less likely. So the principle is not that unlearning leads to PII leakage, but just that unlearning leads to the model outputting content from the most related data that was not unlearned. The only datasets used for evaluation are one dataset with 10 examples and a second dataset with 19 examples. Both are extremely small for any study. The overall writing of this paper is quite non-standard and would benefit from editing by an experienced author. It seems to also be a consequence of the fact that there are very few empirical results, so instead there is a great deal of exposition about background and analysis of individual model outputs. Finally, there are several minor formatting issues that should be addressed: * Line 032: The citation for Lucki et al. is not in the right place relative to the end of the sentence. * Line 079: Use $x$ in LaTeX instead of just plain x, when referring to a mathematical quantity named $x$. * Line 087: Use ` instead of ' for a forward single quote * Line 097: Missing citation * Line 097: Use \citet when the citation is part of the sentence, e.g., "\citet{shi2024} incorporated the unlearned model from \citet{eldan2023}..." * Line 215: These should be $v_{\text{baseline}}$ and $v_{\text{reinforced}}$ (similar to how they are in equation 1) * Line 291: Capitalize proper names While having a few such issues is not a problem, the consistent presence of these issues suggests a lack of care by the authors. Is 8% PII leakage high enough to be considered "often" (Line 335)? I ask especially because if this information does come from public fanfiction posts, these are things that the authors intentionally posted online. This doesn't seem as concerning as PII that is leaked by a third party and then memorized by an LLM. Would these results hold for other models, unlearning methods, and domains? A broader investigation of this could lead to a good conference submission. As is, the narrow scope of the work makes it at best suitable for a workshop focused on unlearning. Fully human-written
When Unlearning Backfires: Partial Unlearning Increases PII Regurgitation and enables data extraction in Meta’s Llama 3.2 1B Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper studies partial unlearning in LLMs, removing only a subset of the target data (the seven Harry Potter novels) from Meta’s LLaMA 3.2 1B using the Eldan & Russinovich (2023) method. The authors report that such incomplete unlearning surprisingly increases training-data regurgitation, including personally identifiable information (PII) from unrelated sources. * Unlearning on incomplete forget sets is realistic and highly relevant -- most practical LLM unlearning scenarios are partial. * The observation that partial unlearning can increase memorization is surprising and thought-provoking. Examples of PII leakage underline real safety risks and link unlearning to privacy and trustworthiness debates. * The paper is poorly structured and difficult to follow; results are anecdotal and not clearly quantified (the paper doesn't contain a single figure or table). The methodological description is verbose but lacks conceptual clarity, and most key design decisions are not justified. * The central claim -- that unlearning the exact Harry Potter books leads to the model producing Harry Potter content unprompted (Section 4.1.1) -- is counter-intuitive and potentially symptomatic of implementation or evaluation flaws. The intuition behind the result, and why it is so different from Eldan & Russinovich (2023), is unclear. It could potentially hint at issues with the experiment design or implementation. * The experimental setup is narrow, relying on a single text corpus (the Harry Potter novels) and a single small model (LLaMA 3.2 1B). This limits generality and makes it unclear whether the observed behavior would hold for other domains, scales, or architectures. Overall, while the topic is timely, the paper lacks the methodological rigor, quantitative depth, and clarity of exposition required for a strong empirical contribution. N/A Lightly AI-edited
ExpeSQL: An Experience-Guided Decompositional Search Framework for Text-to-SQL Soundness: 2: fair Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper targets the deployment of Text-to-SQL systems in enterprise environments where databases are often large, schema-complex and subject to frequent updates and domain shifts. The authors propose ExpeSQL, which combines divide-and-conquer decomposition, Best-of-N sampling, and iterative refinement with an experience repository that stores diagnostic feedback from failed queries. On BIRD-dev, ExpeSQL achieves 67.5% accuracy (vs Alpha-SQL's 68.2%) while reducing token generation by 87% and inference latency by 96%. 1. The paper aims to address deployment challenges in enterprise Text-to-SQL systems. Its focus on zero-shot, open-source compatibility is well-motivated and practically relevant. 2. The experience repository design is sensible: by storing structured traces at the sub-question level rather than just final SQL outcomes, the system can potentially perform more granular error diagnosis and targeted remediation. This decomposed memory structure aligns naturally with the divide-and-conquer framework and could enable interpretable debugging. 1. The paper does not clearly explain how sub-query–level critiques affect final SQL generation. Algorithm 1 shows sub-queries are executed and validated per node, but self-critique occurs only after full SQL composition. Section 2.3 mentions that agents “leverage long-term memory” to guide decomposition and synthesis, yet no example shows how stored remedies are retrieved or applied in later iterations. The impact of sub-query feedback on future reasoning remains unclear. 2. The proposed “sub-query critique” is essentially an additional diagnostic signal recorded after execution, not an independent reasoning or learning mechanism. The framework mainly combines known components, divide-and-conquer (MAC-SQL), self-reflection (Renze & Guven 2024), iterative refinement (ROUTE, MCTS-SQL, Gen-SQL), and Best-of-N sampling (SuperSQL, Alpha-SQL). Storing these diagnostic traces in an experience repository extends prior fine-grained error-analysis methods (Gen-SQL, SHARE) but does not introduce a substantively new paradigm. 3. The experiments do not isolate the effect of sub-query critique or experience storage. A key missing baseline is ExpeSQL with critique applied only to the final SQL, without sub-query analysis or experience replay, to test whether the fine-grained diagnostic signal provides measurable gains over standard self-reflection. Table 3 reports “Self-Reflection (+3.1%)” but conflates multiple factors, and the paper presents no per-iteration accuracy or retrieval-success statistics to substantiate the claimed benefits. see weakness Fully AI-generated
ExpeSQL: An Experience-Guided Decompositional Search Framework for Text-to-SQL Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper tackles the limitations of existing training-free Text-to-SQL approaches, which rely on long reasoning traces to ensure accuracy but suffer from high latency, while efficiency-oriented methods typically sacrifice correctness. To achieve a better trade-off between accuracy and efficiency, the authors propose ExpeSQL, an experience-guided multi-agent framework that decomposes complex natural language queries using a divide-and-conquer strategy and progressively refines them through Best-of-N selection, self-critique, and a reasoning experience cache. Experimental results show that ExpeSQL significantly reduces token generation and inference latency while maintaining accuracy comparable to Alpha-SQL, establishing a strong balance between performance and efficiency. * **Well-Motivated Objective.** While most existing works either focus solely on accuracy or sacrifice accuracy to improve efficiency, this paper explicitly aims to balance both speed and quality, addressing a key practical limitation of current Text-to-SQL systems. * **Novel Framework Design.** ExpeSQL introduces a multi-agent architecture that combines divide-and-conquer reasoning, self-evaluation, and iterative refinement. The reasoning traces are cached across iterations to reduce redundant generation, enabling more efficient self-evolution. * **Limited Efficiency Comparison.** Although Section 1 discusses Alpha-SQL’s inefficiency and MCTS-SQL’s limited accuracy, the experiments only report token cost comparisons with Alpha-SQL on a single open-source model. Even when testing on Llama 3.1-8B, the paper omits token and latency data. The overall efficiency relative to other fast test-time frameworks remains unclear, and the generalization of efficiency gains across model scales is insufficiently demonstrated. * **Missing Ablation on efficiency gain from Long-Term Memory.** The proposed reasoning-path caching with long-term memory is a central contribution, yet there is no quantitative ablation analyzing its benefit. Additional experiments—such as the distribution of refinement loops per task and the token savings with vs. without caching—would strengthen the justification of this design. * **Uneven Benchmark Comparison.** The paper’s motivation emphasizes balancing accuracy and efficiency between MCTS-SQL and Alpha-SQL. However, Table 1 reports MCTS-SQL results only on Qwen2.5-Coder-7B, whereas Alpha-SQL and ExpeSQL use larger 14B and 32B models. A fair comparison on the same backbone is necessary to substantiate the claimed improvements. * **Presentation Could Be Improved.** Since this work focuses on the accuracy–efficiency trade-off, visualizing the results in a 2D scatter plot (e.g., x-axis = token cost, y-axis = accuracy) would make the trade-off clearer. Moreover, presenting results by model family rather than mixing different backbones in one table would improve readability and fairness. * The authors note that Spider is relatively simple and therefore skip its evaluation, but Spider 2.0 [1]—released recently—includes more realistic and complex queries. 
How does ExpeSQL perform on Spider 2.0 compared to Alpha-SQL, MCTS-SQL, and other recent baselines? [1] Lei, Fangyu, et al. "Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows." *arXiv preprint arXiv:2411.07763* (2024). Lightly AI-edited
ExpeSQL: An Experience-Guided Decompositional Search Framework for Text-to-SQL Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. ExpeSQL introduces a zero-shot, open-source–compatible framework that enables self-evolving SQL generation through experience-guided refinement. It decomposes questions, generates verifiable sub-SQLs, and aggregates candidates via result-based filtering and majority voting. In case of an error, a Self-Critique module performs diagnostic backtracking and stores structured remedies in a persistent memory. This enables continuous improvement without parameter updates. The paper achieves 67.5% execution accuracy with open-source models, reducing token generation by up to 87% and inference latency by 96% compared to Alpha-SQL at similar accuracy. This is a highly valuable result for practical, real-world deployment. It leverages open-source models. The Self-Critique Validation Agent verifies the alignment of projected measures and dimensions with the natural language intent, which may be difficult and lead to falsely rejecting correct queries, triggering expensive and unnecessary refinement rounds and reducing efficiency. How robust is the system to agent hallucinations? In multi-round refinement, how many rounds are usually needed? Fully human-written
ExpeSQL: An Experience-Guided Decompositional Search Framework for Text-to-SQL Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes ExpeSQL, a zero-shot Text-to-SQL system that decomposes input questions into sub-questions, generates sub-SQLs, and refines them using a self-critique and diagnostic module. The method operates without fine-tuning and leverages an experience repository to improve over time. On the BIRD benchmark, ExpeSQL achieves 67.5% execution accuracy, outperforming previous zero-shot baselines while being significantly more efficient in token usage and latency. - Engineering strength: Well-structured modular framework combining decomposition, subquery voting, self-correction, and memory replay. - Competitive performance: Matches or exceeds prior zero-shot systems (e.g., Alpha-SQL) on BIRD and Spider, while reducing compute cost. - Comprehensive baseline comparison: Evaluated against strong SOTA methods (XiYan-SQL, Reasoning-SQL, CHESS, CHASE, etc.) across BIRD, Spider, and CHESS-SDS. - Limited novelty beyond prior agent-style LLM systems The architecture draws heavily on design patterns already explored in other domains (e.g., self-refinement, agent modularity, error memory in program synthesis and math QA). While novel in the Text-to-SQL setting, the contribution is largely an application of existing ideas rather than a conceptual advancement. - Insufficient ablation and diagnostic analysis The ablation study is limited to a few components and only reported on one dataset. Core modules such as self-critique, experience replay, and inter-node filtering are not evaluated in isolation. The system's iterative improvement claims lack supporting data or analysis. - No interpretability or reasoning trace evidence The paper does not include SQL examples, error case studies, or reasoning path comparisons that could help readers understand how and why the method improves correctness. This undermines the claim that ExpeSQL improves reasoning quality. - Missing broader benchmark coverage There is no evaluation on Spider 2.0 (the most realistic benchmark for multi-query workflows) or on multi-turn dialogue datasets like CoSQL or SParC. This limits insight into generality and real-world applicability. The core ideas are not new in the broader LLM literature, and the paper misses an opportunity to provide insights into how its components work or improve reasoning. A more thorough ablation, analysis of reasoning behavior, and demonstration of generality across diverse scenarios would have strengthened the contribution. Fully AI-generated
The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper “The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies” argues that much current research on LLM-based social simulations suffers from flawed experimental design, leading to unreliable claims about emergent social behavior. From a survey of over 40 studies, the authors identify six recurring methodological issues: lack of diversity, limited interaction, no memory, over-controlled prompts, agent awareness of the experiment, and lack of real-world grounding; and formalize these as the PIMMUR principles (Profile, Interaction, Memory, Minimal-Control, Unawareness, Realism). They then re-run five canonical social simulations under a PIMMUR-compliant framework, showing that many previously reported social phenomena fail to replicate, thereby establishing a set of methodological standards for credible LLM-based social research. - The paper introduces a clear and comprehensive methodological framework (PIMMUR) that defines six essential principles for conducting credible and valid multi-agent simulations with LLMs. It tries to counterbalance a recent trend of LLM-based simulations that try to replicate human social settings. - It provides a thorough literature survey covering over 40 existing studies, systematically identifying recurring methodological flaws and demonstrating the widespread lack of standardization in current LLM-based social simulation research. - The authors reproduce five prior works under the proposed PIMMUR framework, offering empirical evidence that many previously reported social phenomena fail to replicate when methodological biases are controlled. - The paper’s structure and empirical demonstrations make it both a diagnostic and constructive contribution, establishing much-needed standards for future multi-agent LLM research at the intersection of AI and computational social science. I did not identify any major methodological or conceptual flaws in the paper. The analysis is coherent, the argumentation is well-supported, and the proposed framework is clearly articulated and motivated by a comprehensive survey of prior work. The authors demonstrate experimental control in their replications, and the findings are presented transparently. However, the paper reads somewhat like a position or methodological perspective piece rather than a conventional empirical study. Although the replication of five prior experiments provides evidence for the paper’s claims, and that’s more than usually included in position papers, the core contribution of the PIMMUR framework is largely conceptual. It derives from the authors’ synthesis and reasoning rather than from formal ablation studies or quantitative validation showing how each principle individually affects simulation validity. In this sense, while the paper makes a valuable and timely contribution to the community, it sits somewhat at the boundary between empirical and theoretical work.
I appreciate its clarity and ambition and am generally in favor of acceptance, but I would defer to the area chair’s judgment on whether it fits best as a methodological position paper or a full empirical contribution in the ICLR program. Could the authors clarify how they operationalized compliance with each PIMMUR principle when evaluating the 40+ surveyed papers? For instance, was there a standardized rubric, multiple annotators, or inter-rater agreement to ensure consistency in labeling whether a study satisfied a given principle? I found the statement "The assessment of compliance is determined by the authors through discussion," but it lacks sufficient detail. Fully AI-generated
The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies Soundness: 2: fair Presentation: 4: excellent Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces the PIMMUR principles, including Profile, Interaction, Memory, Minimal-Control, Unawareness, and Realism, as conditions for reliable LLM-based social simulation. The analysis of existing literature shows that most current studies fail to satisfy these principles, making their results prone to artifacts of design, overfitting to prompts, or systemic biases that render findings unreliable. The effect of using PIMMUR on the simulation results is demonstrated using five social experiments. The implications for AI research and social science research are discussed. S1. This is timely research on the reliability of LLM-based social simulation, showcasing that the results observed in many existing studies are prone to artifacts of design, overfitting to prompts, or systemic biases. S2. A total of 41 works are investigated. S3. The paper not only analyzes whether existing studies satisfy the proposed principles but also evaluates the impacts of these principles on the simulation results. S4. The paper is well-written and easy to read. W1. The evaluation criteria in Section 4 are unclear (see Q1 below). W2. The evaluation itself may be biased in the sense that it favors simulations showcasing some potential (e.g., Park et al. (2023), which does not have a specific objective in terms of social science) rather than studying social phenomena or validating sociological theories. That might explain why Park et al. (2023) passed all the tests. W3. The paper reports that "LLMs tend to be overly strict, frequently labeling neutral instructions as instances of over-control." Given this, the accuracy of the LLM-as-a-judge design in the evaluation is questionable. W4. It is unclear how many runs are conducted for the simulations in Section 5, rendering the significance of the simulation results less convincing. W5. Some simulations involve multiple PIMMUR principles (e.g., Unawareness and Interaction in Section 5.2), but they are not investigated separately. An ablation study may help understand the effect of individual principles. W6. Another issue is that the sensitivity w.r.t. the prompt is not tested for the simulations in Section 5. Based on my experience with LLM-based simulation agents, some "expected" or "desired" experimental results, which conform to established theories, turn out to be the outcome of sensitivity to the prompt. If you paraphrase the prompt, such results might not be observed. Thus, it is suggested that the authors test a paraphrased version of the prompt (for "Original", "Ours", and "Reverse") to confirm that the gaps between "Original" and "Ours" are indeed caused by PIMMUR rather than prompt sensitivity. Q1. In Table 2a, "X denotes models can infer correctly, indicating a violence of this principle." How is it determined whether the model can infer correctly? Minor suggestions: Figure 2: "Demanding" -> "Demand". Fully human-written
The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces the PIMMUR framework, which proposes six methodological requirements (Profile, Interaction, Memory, Minimal-Control, Unawareness, Realism) for ensuring validity in LLM-based social simulations. The authors review 41 existing studies and argue that only a few satisfy all six principles. They also re-run several social simulation experiments under their framework, claiming that many previously reported behaviors disappear when the experiments follow PIMMUR. Although the paper is clearly written and the review is comprehensive, I have serious concerns about the correctness, necessity, and theoretical grounding of PIMMUR as well as the internal consistency between the authors’ claims and their own experiments. * The paper provides a well-written and wide-ranging survey of recent LLM-based social simulation studies, which is valuable for this rapidly growing interdisciplinary area. * The six proposed aspects capture several common methodological pitfalls and raise awareness about validity and reproducibility concerns. * The writing is fluent, the figures are clear, and the overall structure is easy to follow. 1. Overstated claim of necessity The authors assert that all six requirements are necessary conditions for credible LLM-based social simulation (“We formalize these six requirements as the PIMMUR principles and argue they are necessary conditions…”). However, this claim is neither theoretically grounded nor universally applicable. Social systems are complex, and different studies necessarily focus on specific facets. Depending on the research question, simplifying or omitting certain aspects can be both legitimate and necessary. For example, when using LLM agents to test prospect theory, modeling detailed interactions, memory, or realism may not be essential. Thus, these requirements should be treated as optional design dimensions, not rigid necessary conditions. 2. Logical inconsistency and self-contradiction The authors’ own experiments fail to satisfy their proposed principles. In Section 5, they mention how five requirements are satisfied but do not address Realism. In subsections 5.1 and 5.2, no empirical human data are used, which directly violates their own definition of Realism: “A simulation should use empirical data from real human societies as references rather than only reproducing simplified theoretical models.” This inconsistency weakens the credibility of their argument and demonstrates that the proposed framework is not realistically achievable even by the authors themselves. 3. Binary evaluation is arbitrary and misleading Table 1 applies a binary classification to 41 prior studies, yet many principles (for example, Profile) are inherently continuous. For instance, “Agents should have distinct backgrounds, preferences, or cognitive styles…” but how much heterogeneity is enough? Having richer profiles beyond names is only marginally better, and the distinction between sufficient and insufficient is subjective. This binary labeling oversimplifies complex design choices and may unfairly undervalue prior contributions. 4. 
Potential unfairness toward prior work By applying these rigid standards retroactively, the authors risk undervaluing earlier research that deliberately simplified assumptions for theoretical clarity or computational feasibility. Social simulation is a tool for understanding human and social behavior, and which dimensions to emphasize or simplify should depend on the research question, not on a universal checklist. 5. Limited coverage of existing frameworks The review omits several major open-source platforms that are central to current LLM-based social simulation research, including Yulan-OneSim (https://github.com/RUC-GSAI/YuLan-OneSim), AgentSociety (https://github.com/tsinghua-fib-lab/AgentSociety), SocioVerse (https://github.com/FudanDISC/SocioVerse). Including these frameworks would make the review more comprehensive and balanced. 1. You state that all six PIMMUR principles are necessary conditions for credible LLM-based social simulation. Could you clarify whether this claim is meant to be normative (a theoretical ideal) or empirical (a strict requirement that must always hold)? 2. In Section 5, the first two replication experiments do not use any empirical human data. How do you reconcile this with your own definition of Realism, which explicitly requires reference to real human data? 3. Table 1 applies a binary ✓/✗ evaluation for each principle. How did you determine the threshold between “satisfied” and “unsatisfied”? Could you provide quantitative or operational criteria to make these judgments reproducible? 4. Some simulation studies simplify aspects such as memory or realism intentionally to test specific theories. Would such studies still be considered “not credible” under your framework? If not, could you clarify which subsets of principles are context-dependent? 5. Several widely used simulation platforms (for example Yulan-OneSim https://github.com/RUC-GSAI/YuLan-OneSim, AgentSociety https://github.com/tsinghua-fib-lab/AgentSociety, and SocioVerse https://github.com/FudanDISC/SocioVerse) were not discussed. Could you comment on whether these frameworks satisfy or violate your proposed principles? Fully AI-generated
The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes the PIMMUR principles (Profile, Interaction, Memory, Minimal-Control, Unawareness, Realism) as methodological standards for evaluating LLM-based social simulations. Through reviewing 41 studies, the authors identify that most existing work violates multiple principles. They demonstrate that frontier LLMs can infer experimental hypotheses in 53% of cases and that 64% of instructions contain excessive steering. Five experiments were "replicated" under PIMMUR conditions, showing substantially different outcomes from original studies. S1. This work addresses methodological issues in the widely studied field of social simulation, accurately identifying methodological flaws in current LLM-based social simulations. S2. The replication of prior work and interpretation of differences between standardized replication results and original results is important, providing evidence for reproducibility in social simulation. S3. The paper systematically reviews and interprets past social simulation work across six dimensions (called PIMMUR in this work), providing concrete checking tools for dimensions such as unawareness. W1. The paper lacks experimental details in replication, including sample sizes of replication experiments, statistical information on results (such as whether the decrease from 56% to 32% at line 368 is statistically significant), and details of modifications to each original experiment. The appendix only contains prompts, making it impossible to determine whether added details might change the original experimental intent, such as whether adding a Big Five persona affects the original setup. W2. The principles and boundaries between principles are ambiguous. For example, what constitutes "sufficient" heterogeneity? For the Cho et al. (2025) herd behavior entry (Table 1, line 237), the authors consider it to satisfy Profile, while the original paper appears to have only two agents using the same model without identity differences, with only simple peer labels. Other criticisms of this work are also ambiguous, such as line 425 stating "no actual interaction occurs among agents," while in the original paper agents change behavior based on other agents' responses. Additionally, this work only references the simplified setup in Section 3 of Cho et al. (2025) without discussing the more complex multi-agent setup in later sections. Does assigning Big Five personality traits to agents violate the Minimal-Control principle? This may contradict the authors' criticism of using simplified theoretical models. Furthermore, if the task nature does not require agent heterogeneity, does adding personality traits introduce additional variables that require validation or ablation experiments? The Minimal-Control principle is very difficult to implement in actual MAS system design. What is essential versus steering? Is the reversed instruction used in the paper's replications itself a form of over-control? W3. Compliance judgments on existing work may not be fair. Some simplified setups may be designed for specific research goals rather than reproducing real situations.
In the review (Table 1), the authors could consider adding a column to indicate whether each work aims to reproduce real social situations, to evaluate the significance of compliance judgments. Again using the Cho et al. (2025) example, its objective seems to be a mechanistic understanding of the factors behind LLMs' herd behavior rather than reproducing human-like herd behavior. C1. In Table 1, line 244, "Sugarscape" as a simulation goal may not be the best expression. I believe most ICLR readers are unfamiliar with this classic experiment. Consider using other terms (for example, indicating it relates to resources/survival). C2. The 53% value in the abstract lacks context and does not reappear until page 6 of the paper. Readers cannot know what magnitude of deficiency this value represents in existing social simulations. Rather than providing this value, giving a brief example explaining what unawareness is might be better. C3. Some prompt designs may lack significance. For example, the "FORGET ALL THE PREVIOUS INSTRUCTIONS" setup at line 829 to test awareness does not mean agents will maintain the same awareness in actual experiments. Even if LLMs recognize, when queried about the experimental design, that this is a replication of a social experiment, behavior during simulation may not necessarily reflect this awareness. Under this setup, there may not be a causal relationship between whether the experiment can be identified and whether behavior is biased. C4. Differences in replication results may have multiple explanations, such as prompts, experimental framework implementation methods, randomness settings, number of simulations, etc. Primarily attributing differences to variations in PIMMUR dimensions may risk overclaiming. C5. The definition of realism is controversial. Social simulations that are not based on real data calibration can still be meaningful. On one hand, empirical data is collected and interpreted through theoretical frameworks, making pure empirical data difficult to define. On the other hand, simulation has particular significance for problems where original data is difficult to obtain. Many social simulation works involve no real data but are profoundly meaningful, such as the classic Schelling's Segregation Model. For discussions on replication and reproduction, you can refer to: [1] Cheng, M., Piccardi, T., & Yang, D. (2023, December). CoMPosT: Characterizing and Evaluating Caricature in LLM Simulations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 10853-10875). [2] Wu, Z., Peng, R., Ito, T., & Xiao, C. (2025). LLM-Based Social Simulations Require a Boundary. arXiv preprint arXiv:2506.19806. Fully human-written
Estimating structural shifts in graph domain adaptation via pairwise likelihood maximization Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper identifies and addresses the problem of conditional structure shift in graph domain adaptation. The authors propose Pairwise-Likelihood maximization for graph Structure Alignment (PLSA) for estimating and correcting conditional structure shift in node classification tasks. 1. The problem of CSS is important and interesting, and the problem setup is clear. 2. The alignment of the structure with divergent $p(y,y^{\prime})$ is novel. 3. The method is reasonable, and the theoretical results seem correct. 1. Lines 24-25 read like LLM-style polish; em dashes are not usually used in academic papers. Frankly speaking, I am not an expert in GNNs, so I strongly encourage the AC to add another expert or ignore this review. See weaknesses. Fully human-written
Estimating structural shifts in graph domain adaptation via pairwise likelihood maximization Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper studies graph domain adaptation (GDA) under conditional structure shift (CSS), a scenario where the conditional edge distributions given node labels differ across domains. The authors propose a unified theoretical framework and introduce Pairwise-Likelihood maximization for Graph Structure Alignment (PLSA), which estimates target connection probabilities via pairwise likelihood matching with a calibrated source predictor. Theoretical guarantees are derived under the Contextual Stochastic Block Model (CSBM), showing finite-sample error bounds based on matrix concentration inequalities for U-statistics. The paper provides a mathematically principled framework for conditional structure shift estimation. The identifiability analysis and finite-sample guarantees are technically sound and build upon nontrivial extensions of label shift theory to the graph domain. Using pairwise likelihood maximization for structural alignment is an elegant generalization of label-shift maximum-likelihood estimation to GDA. The unified view encompassing existing methods such as Structural Reweighting and Pair-Align adds theoretical coherence to an emerging subfield. 1. **Incorrect or overly strong assumption (Line 063):** The statement *“assuming that the joint distribution of features and labels are invariant across source and target domains”* is conceptually inconsistent with the GDA setting. * In graph domain adaptation, the core challenge arises because **the joint distribution $p(x, y)$** is *not* invariant across domains; otherwise, the task degenerates to a standard supervised setting. * The authors likely intend to isolate *conditional structure shift* by assuming $p(y)$ and $p(x|y)$ are invariant, but phrasing it as joint invariance misrepresents the GDA assumptions and should be corrected. 2. **Incomplete related work discussion:** The related work section omits recent state-of-the-art methods that are directly relevant for GDA under structural or spectral shift. [1] Pang, Jinhui, et al. "Sa-gda: Spectral augmentation for graph domain adaptation." Proceedings of the 31st ACM international conference on multimedia. 2023. [2] Fang R, Li B, Zeng Q, et al. On the Benefits of Attribute-Driven Graph Domain Adaptation[C]//The Thirteenth International Conference on Learning Representations. [3] Yang L, Chen X, Zhuo J, et al. Disentangled Graph Spectral Domain Adaptation[C]//Forty-second International Conference on Machine Learning. 3. **Limited experimental validation:** The experiments are restricted to synthetic CSBM data and the small-scale Airport dataset, which do not represent standard GDA benchmarks. * Commonly adopted datasets such as **Citation networks**, **MAG dataset**, and **BlogCatalog** are missing (Liu M, Zhang Z, Tang J, et al. Revisiting, benchmarking and understanding unsupervised graph domain adaptation[J]. Advances in Neural Information Processing Systems, 2024, 37: 89408-89436). * Without evaluations on these real-world benchmarks, it isn't easy to assess whether PLSA generalizes beyond the stylized CSBM scenario. * Additionally, ablation studies on calibration quality and sparsity sensitivity would strengthen empirical claims. 4.
**Broader applicability and assumptions:** The CSS-only assumption (Assumption 3.1) is quite restrictive. In practice, label shift and structure shift often coexist. Although Appendix B sketches a potential extension, the main text does not empirically demonstrate PLSA’s robustness under mixed shifts. See Weaknesses Fully AI-generated
Estimating structural shifts in graph domain adaptation via pairwise likelihood maximization Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents a unified framework to solve the conditional structure shift (CSS) problem and shows, via theoretical analysis, that existing GDA methods for CSS arise as special cases. Then, the authors propose a new method called Pairwise-Likelihood maximization for graph Structure Alignment (PLSA), which estimates the target connection probability by matching the distribution of features and edges through nodes in the latent space. 1. Sufficient and Solid Theoretical Analysis. This paper provides non-asymptotic error upper bounds under CSBM, explicitly pointing out the relationship between sample complexity, the number of classes, and calibration errors, which enhances the credibility and soundness of the method. 2. Good Performance in Sparse Graph Scenarios. PLSA uses unconditioned pairs (including non-edge pairs) rather than restricting to “edge” samples. From a statistical efficiency standpoint, this retains more information in sparse graphs. Experiments (Figures 1 and 2) show that in sparse settings, PLSA significantly outperforms Pair-Align, which uses only edge information. 1. The comparative experiments are not sufficiently comprehensive. Although the paper reviews some existing methods in the GDA field, it omits some important approaches (e.g., meta-learning–based or adversarial-training–based approaches), making it difficult to fully demonstrate the superiority of the model. 2. The article’s theoretical and methodological designs are largely based on the assumption that class priors and class-conditional feature distributions are domain-invariant (Assumption 3.1). This assumption weakens the method’s applicability to more complex real-world scenarios (e.g., when both label shift and feature distribution drift are present). 3. While the experimental results show that PLSA outperforms the baselines, no experiment is provided to analyze parameter sensitivity. 1. The re-sampling and re-weighting scheme proposed in Section 4.3 (using Bernoulli insertion/deletion for each class pair) introduces additional random noise. The paper does not analyze how this randomization affects the variance of downstream GNN training, nor does it compare the pros and cons of using expected weights (soft-weight) versus sampling. 2. It is recommended to further discuss the model’s time complexity, such as analyzing the efficiency of the proposed method on large-scale graph datasets. 3. For experiments on the Airport dataset, the graph structure is real but the node features are synthesized (the feature-label association is manually designed). Under real-world conditions with genuine node features, the model’s performance may be affected. It is suggested to include some datasets with authentic features for experimental analysis. Fully human-written
Estimating structural shifts in graph domain adaptation via pairwise likelihood maximization Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper handles the graph domain adaptation problem, specifically focusing on conditional structure shift (CSS) under the assumption of no shift in the conditional feature and label distributions. In particular, it improves upon previous work like Pair-Align by considering a more general edge distribution, including both existing edges and non-edges, via pairwise likelihood maximization. They also include an analysis bounding the estimation error of the importance weights, accounting for both the sample gap and the classifier miscalibration gap. - This paper focuses on a crucial point ignored by the previous literature on solving CSS via edge reweighting, specifically pointing out the importance of considering the full distribution of potential edges rather than conditioning on an edge existing; this essentially takes currently non-existent edges into account, which works well even in the sparse-graph case. - They form a unified and clear comparison with previous works by correctly positioning their contribution and its distinctions from previous weight estimation methods and the shift cases considered. - The estimation is also supported by an error bound, and they additionally consider the impact of a calibrated classifier beyond simply assuming an invariant conditional feature distribution. - Although the focus might be on the theoretical part and the paper verifies it via synthetic datasets, it would be better to add more real datasets, especially ones with sparser structure, to showcase the benefit of this new weight estimation. - Also, it would be better to motivate and highlight, in the introduction or before presenting the exact method, why previous methods are insufficient, using some dataset statistics such as how sparse the graphs are and how biased previous estimates can be in this case. - I believe it would be better to move Appendix B into the main text, covering both CSS and label shift; otherwise, this work reads more like a comparison to StruRW without label shift. Then, you might want to clarify how to ensure the ratio is not biased under label shift. - Based on my understanding, one additional benefit is that a calibrated classifier is considered here, on top of the Pair-Align method. To what extent do you think the benefit comes from this calibrated classifier, beyond the benefit of considering the full edge distribution with PLSA? Could there be an ablation study on this? - After obtaining the ratio used to adjust the source graph, the paper actually resamples rather than reweights the source graph, right? Can you evaluate the strengths and weaknesses of resampling compared to reweighting in this case? Fully human-written
CoFact: Conformal Factuality Guarantees for Language Models under Distribution Shift Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper addresses the critical challenge of providing statistical guarantees on the factuality of Large Language Model (LLM) outputs under distribution shift. The authors propose CoFact, a conformal prediction framework that employs online density ratio estimation (DRE) to adaptively reweigh calibration data, thereby maintaining factuality guarantees even when the exchangeability assumption is violated. Key contributions include: A novel framework combining conformal prediction with online DRE to handle continuous distribution shifts; Theoretical analysis establishing an upper bound on the gap between actual and target hallucination rates; WildChat+, a new dataset capturing real-world distribution shifts; Empirical validation on MedLFQA, WikiData, and WildChat+ demonstrating superior performance over baseline methods. The paper tackles a significant limitation of existing conformal prediction methods for LLMs: they rely on the exchangeability assumption, which is frequently violated in real-world applications. The integration of online DRE with conformal prediction is creative and well-motivated. The use of an ensemble of experts with geometric lifetimes to track evolving distributions is technically sound. The theoretical part is solid: Theorem 1 provides a rigorous bound showing that the hallucination rate gap converges to zero as $O(\max\{T^{-2/3}V_T^{2/3}, T^{-1/2}\} + 1/n)$. It also includes detailed experiments evaluating both simulated shifts (4 types) and real-world shifts (WildChat+). Assumption 1 requires that the conditional distribution $P(W|Z)$ remains unchanged while only the marginal $P(Z)$ shifts. This is quite strong and may not hold in many real scenarios (e.g., if model quality degrades over time or if certain types of prompts systematically elicit more hallucinations). The paper only compares against SCP and CondCP. What about other methods for handling covariate shift in conformal prediction? Figure 2: the legend is difficult to read. The paper doesn't discuss how to set $T$ in advance or what happens when the time horizon is unknown. How many calibration samples $n$ are needed in practice to achieve reasonable performance? The method requires knowledge of the feature representations $\phi(z)$. How should these be chosen for different applications? Statistical significance testing would strengthen the claims in Tables 2 and 3. There seems to be no ablation on key design choices (e.g., number of experts, expert lifetime schedule, choice of divergence function). How sensitive is performance to the feature representation? Heavily AI-edited
CoFact: Conformal Factuality Guarantees for Language Models under Distribution Shift Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. To theoretically control the rate of hallucination of large language models (LLMs) for their trustworthy use in safety-critical applications under dynamic environments, where the test distribution may differ from the calibration data used for training, the authors propose an online conformal prediction method based on an online density ratio estimation technique. Specifically, they assume that the test data distribution continuously changes over time, though not in an overly extreme manner (Assumption 1). Besides, for the definition of hallucination of the generated answer on a given prompt, following Mohri et al. (2024), the authors define the term as whether a filtered answer contains any hallucinated facts. In terms of the technical contribution compared to existing works on online conformal prediction under distribution shift, the authors assume more challenging problem set-ups. (1) Specifically, the authors assume a continuous distribution shift scenario for the test data distribution, unlike the single distribution shift scenario of Tibshirani et al. (2019). (2) Furthermore, they assume the correctness labels on the test data (i.e., whether a hallucinated sub-claim is contained in the filtered sub-claims) remain unrevealed even after the online evaluation. This is a more challenging scenario compared to existing online conformal prediction methods (Gibbs and Candes, 2021; Gibbs and Candes, 2024; Areces et al., 2025) in which correct labels are revealed, enabling them to be used for training in subsequent time steps. In addition to the theoretical guarantee on the upper bound of the gap between the target coverage level and the average hallucination rate, empirical results on existing and newly proposed benchmarks show that CoFact controls the hallucination rate to the desired degree. - It is a well-written paper that is easy to follow. - They tackle the online conformal prediction problem for the theoretical control of hallucination of LLMs for their trustworthy use in safety-critical applications, which is one of the primary interests of LLM users these days. Specifically, they propose a method that is valid under a more challenging set-up than existing works, one that most closely resembles dynamic real-world problems. [Weakness 1] Generalizability to the online batch setting: I understand that you assume a problem set-up where a single sample is provided at each time step for simplicity (Lines 185-186). Then, how would Eq. (8) look if you consider a general setting where a batch of samples may be provided at each time step? If this is not addressed, the proposed threshold estimator may not be usable in an online batch learning set-up. [Weakness 2] While existing online conformal prediction methods require additional information for training that is not available in the current problem set-up, it would be more informative to provide results of these methods equipped with the necessary information for training, since the baselines in the Experiments section all assume an i.i.d. data-generating process under the batch learning set-up.
As a concrete example, you may consider a problem set-up with (1) a single distribution shift scenario as in Tibshirani et al. (2019) and (2) access to the ground-truth labels in an online manner. While existing methods are expected to show superior performance compared to CoFact, since they utilize additional information for online threshold selection, I think this would be much more informative than just comparing with baselines that assume an i.i.d. data-generating process under the batch learning set-up. [Question 1] Shouldn't the "T" in Eq. (14) be substituted with "t"? Additionally, the term $\theta_t^\ast$ is not formally defined. [Question 2] How would Eq. (8) look if you consider a general setting where a batch of samples may be provided at each time step? If you can propose one, does it also enjoy the theoretical guarantee that Eq. (8) has? Please refer to [Weakness 1] in the Weaknesses section for details. [Question 3] Could you compare CoFact with existing online conformal prediction methods under the scenario that existing methods assume? Please refer to [Weakness 2] in the Weaknesses section for details. [Question 4] The following are typos to be addressed. (Line 35) Despite of => Despite (Line 77) CoFact bypass => CoFact bypasses (Line 106-107) The following expression seems awkward to me: "... that transforms the outputs of a black-box predictor into prediction sets..." (Line 115-118) $\alpha$ => $\beta$ or $\beta$ => $\alpha$ (Line 314-315) $\leq$ => $\geq$ (Line 269, Line 328) The same divergence function $\psi$ is defined differently. Fully human-written