|
SADA: Safe and Adaptive Inference with Multiple Black-Box Predictions |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes SADA (Safe and Adaptive Data Aggregation) — a novel framework for safe and adaptive semi-supervised inference that aggregates predictions from multiple black-box models (e.g., LLMs, deep networks) of unknown quality. The method aims to guarantee:
- Safety: Never performs worse (in mean squared error) than using labeled data alone — even if all predictions are poor.
- Adaptivity: If any one of the black-box predictions is highly accurate (without knowing which), SADA automatically leverages it to achieve semiparametric efficiency or a faster convergence rate.
SADA extends recent prediction-powered inference (PPI) work by Angelopoulos et al. (2023, 2024) to handle multiple predictors simultaneously, offering both theoretical guarantees and empirical validation.
Experiments on synthetic and real data (Wine reviews, Politeness datasets) show that SADA consistently outperforms naive, PPI, and PPI++ estimators — providing stable variance reduction and robust adaptation across scenarios.
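For context, the estimator family being generalized can be written compactly. The display below is an illustrative multi-predictor extension of the PPI++ mean estimator, not necessarily SADA's exact construction: $(x_i, y_i)$ are the $n$ labeled points, $\tilde{x}_j$ the $N$ unlabeled points, and $f_1,\dots,f_K$ the black-box predictors with weights $w_k$ chosen to reduce variance.

$$\hat{\theta} \;=\; \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \sum_{k=1}^{K} w_k f_k(x_i)\Big) \;+\; \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{K} w_k f_k(\tilde{x}_j)$$

Setting all $w_k = 0$ recovers the labeled-only estimator (the safety claim), while weights that concentrate on an accurate $f_k$ deliver the variance reduction (the adaptivity claim).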
- The paper is generally well-written, and the intuition behind SADA is clearly explained. The connection to PPI and PPI++ is made explicit, situating SADA as a generalization.
- Theoretical contribution. Extends prediction-powered inference to multiple prediction sources under semi-supervised learning.
- Experiments on synthetic data and two real-world tasks demonstrate consistent improvement over PPI/PPI++ and robustness to poor prediction quality.
- Limited experimental depth. The benchmarks, while well-chosen, are relatively small-scale. There is no demonstration of SADA’s scalability to larger datasets and higher-dimensional parameters.
- Assumption realism. The method assumes the loss $l_\theta(x; y)$ is convex. It also assumes multiple predictions with overlapping but uncorrelated noise, which may not hold when the predictors are correlated LLMs (e.g., GPT-4 and GPT-4o-mini). Empirical tests on correlated predictors would strengthen the case.
- The estimation of the optimal weights (line 323) depends on the whole set of N data points, which is not feasible for large N.
see above. |
Moderately AI-edited |
|
SADA: Safe and Adaptive Inference with Multiple Black-Box Predictions |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents a method building on prediction-powered inference (PPI) to perform unbiased inference when a small set of ground-truth training labels and an assortment of synthetic prediction functions are available on unlabeled data. Theoretically and empirically, they demonstrate that the method is no worse than a naive estimator and can adaptively converge quickly to a good synthetic estimator if one exists.
- seems like a fairly clear and well-scoped contribution generalizing PPI++ to multiple predictors
- as far as I know it is novel, although someone who has more deeply read all of the PPI literature may disagree
- theory and experiments back up the general points made about the estimator
Not a ton of major weaknesses here - not groundbreaking work, but it makes its point pretty cleanly afaict
- it is stated on L186 that predictions Y-hat do not need to have the same form either as each other or as Y. The examples of categorical and continuous are given, but it's not clear how broadly this extends. It feels as though they somehow need to be operated on similarly and can't be too different - for instance, it seems Y-hat can't be a free-text output of an LLM if Y is binary. Could use some clarification here
- In general, one useful baseline to look at here would be averaging the predictors and using that in PPI++ - this is a very natural thing to do when you have many predictors and would give a better sense I think of a strong ensemble-ish baseline
- Fig 2: Visually, we can see that the shape of the result curve in 2c is nice. It would be good to know, potentially in a table, if those results are actually better or worse than the individual PPI++ results
- would be good to run multiple seeds and show confidence bars in Fig 3/4, these experiments look promising but I want to know how significant the improvement is
small notes:
Line 36: should cite the LLMs (GPT, Llama, Deepseek)
L133: I get confused by column vector notations - clarify if this is the inner or outer product (the superscript cross product symbol you define)
Line 270: typo "propose" -> "proposed"
Fig 3: clarify that only the PPI lines differ between these subfigures, is that right? and the SADA lines are the same?
What is the theoretical relationship to taking a weighted combination of the predictors as a function, using that in PPI++ and then optimizing the weights? This is not a high priority question but would help with some deeper understanding of the relationship of this method to PPI++ |
Fully human-written |
|
TRANSPORT-BASED MEAN FLOWS FOR GENERATIVE MODELING |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work proposes integrating minibatch OT coupling into Mean Flow training. To ensure training efficiency, various OT computation methods are adopted, including Sinkhorn and Linear OT. The proposed method is evaluated on image generation (MNIST), shape-conditioned reconstruction (ShapeNet Chairs and ModelNet10), and image-to-image translation (FFHQ), demonstrating improvements over Mean Flow across these tasks.
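To make the coupling step concrete, here is a minimal sketch of minibatch OT pairing using the POT library; the function, the Sinkhorn regularization value, and the sampling-from-plan step are my own illustration under stated assumptions, not the paper's implementation.

```python
import numpy as np
import ot  # Python Optimal Transport (POT)

def ot_coupled_pairs(x0, x1, reg=0.05):
    """Pair source samples x0 with data samples x1 via a minibatch OT plan.

    x0, x1: arrays of shape (batch, dim). Returns x0 reordered so that
    (x0_paired[j], x1[j]) approximately follows the entropic OT coupling.
    """
    n = x0.shape[0]
    a = b = np.full(n, 1.0 / n)      # uniform marginals over the batch
    M = ot.dist(x0, x1)              # squared Euclidean cost matrix
    P = ot.sinkhorn(a, b, M, reg)    # entropic plan; ot.emd(a, b, M) gives the exact one
    # Sample one source index per target column from the plan:
    idx = np.array([np.random.choice(n, p=P[:, j] / P[:, j].sum()) for j in range(n)])
    return x0[idx], x1

# Usage: x0 is a noise minibatch, x1 a data minibatch; the coupled pairs then
# replace independently drawn (noise, data) pairs in the training objective.
x0 = np.random.randn(64, 16)
x1 = np.random.randn(64, 16) + 3.0
x0_paired, x1_paired = ot_coupled_pairs(x0, x1)
```

Swapping `ot.sinkhorn` for `ot.emd` or a linear OT approximation is where the efficiency trade-offs discussed in the review would show up.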
- Extending Mean Flow with Minibatch OT coupling seems to be an interesting approach to improve one-step generative models by leveraging the demonstrated effectiveness of Minibatch OT.
- Various experiments have shown that the introduction of OT coupling can improve the generation quality.
- The writing and presentation of the manuscript are overall clear and easy to follow.
However, with regard to the current manuscript, I still have several major concerns:
Limited Novelty: This work primarily combines Mean Flow (Geng et al. 2025) with Minibatch OT (Pooladian et al., 2023; Tong et al., 2023a). Since minibatch OT approaches have been thoroughly explored and proven effective, it is unsurprising that they improve generation quality over Mean Flow. The contribution to the research community appears limited.
That said, a thorough evaluation justifying the necessity of Minibatch OT for Mean Flow could still be a valuable contribution. However, I have several questions about the current evaluation.
Evaluation Tasks: The evaluation tasks do not follow standard protocols. The original Mean Flow paper uses ImageNet 256×256, a standard benchmark for image generation. This work evaluates only on MNIST for image generation, making it unclear whether the conclusions hold at scale. For 3D shape generation, the choice of a reconstruction task (using the input shape as conditioning) is unclear when unconditional generation could be evaluated instead, following the protocol in LION [A].
Evaluation Baselines: The manuscript primarily compares against Mean Flow alone. However, various works have explored techniques for accelerating inference, including adversarial distillation, consistency models, and shortcut models. The current evaluation does not include comparisons with these baselines.
Overall, I recommend rejection due to limited novelty and insufficient evaluation. The work primarily combines existing techniques (Mean Flow and Minibatch OT) in a straightforward manner, yielding expected rather than surprising improvements. The experimental evaluation uses non-standard benchmarks (MNIST instead of ImageNet 256×256, reconstruction tasks instead of unconditional generation) and lacks comparisons with relevant baselines such as adversarial distillation, consistency models, and shortcut models. A substantially revised submission with comprehensive benchmarking on standard tasks and comparisons against the broader landscape of one-step generation methods would be necessary to assess the true contribution of this work.
Based on these, I have the following questions:
- The evaluation for image generation is limited to MNIST, while the original Mean Flow paper demonstrates results on ImageNet 256×256, which is a standard benchmark in the field. Can you provide experiments on ImageNet 256×256 to demonstrate that OT-MF's improvements generalize to high-resolution, complex image generation tasks? Without this evaluation, it remains unclear whether the benefits of incorporating OT coupling hold at scale.
- The manuscript primarily compares OT-MF against Mean Flow. However, there are several other approaches for one-step or few-step generation, including adversarial distillation methods, consistency models, and shortcut models. Can you include comparisons with these baselines to better contextualize OT-MF's performance relative to the broader landscape of accelerated generative modeling techniques? This would help clarify the practical advantages of your approach. |
Lightly AI-edited |
|
TRANSPORT-BASED MEAN FLOWS FOR GENERATIVE MODELING |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes to improve the training of Mean Flow models by incorporating optimal transport. Specifically, minibatch OT methods (i.e. computing OT on the empirical distribution represented by the minibatch) define a coupling on the elements of a batch that is used to train the models. Intuitively, this results in straighter trajectories that should be simpler to learn with Mean Flows. Experiments on toy data, MNIST, shapes, and paired image-to-image translation are presented.
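As a rough illustration of where the coupling enters training, the sketch below shows a conditional flow-matching style loss on already-paired minibatches. This is a simplified stand-in that assumes flat feature vectors and a hypothetical `model(x_t, t)` signature; it is not the MeanFlow objective itself, which regresses an average velocity over a time interval rather than the instantaneous one.

```python
import torch

def flow_matching_loss(model, x0, x1):
    """Conditional flow-matching loss on pre-paired (x0, x1) minibatches.

    With independent pairing, the target x1 - x0 at a given x_t is highly
    multi-modal; under an OT coupling the paired straight paths cross far
    less, so the regression target is closer to a single straight direction.
    """
    t = torch.rand(x0.shape[0], 1)        # one time per sample, x0/x1: (batch, dim)
    x_t = (1 - t) * x0 + t * x1           # point on the straight interpolation path
    v_target = x1 - x0                    # constant velocity of that path
    v_pred = model(x_t, t)
    return ((v_pred - v_target) ** 2).mean()
```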
- Clear presentation of mathematics and motivation
- Good overview of relation to previous work
- Proposes a natural - yet novel - idea.
- A diverse set of experiments is performed and some improvements are shown
- While a natural idea, the method has limited novelty.
- The experiments are limited: MNIST is still a toy dataset, and the experiment is only performed on a latent space of size 4x4x4. All improvements are paid for with significantly longer training times per epoch. Therefore, it is unclear how useful the OT coupling really is here.
Typos/comments:
- Abstract, first word: Flow-matching -> Flow matching
- Page 3: “and introduces stochasticity into the model”. It does not introduce further stochasticity and I would remove that comment.
Missing references:
- Introduction: Mean Flow is equivalent to Flow Map Matching [1] and they should be cited together
[1] https://arxiv.org/abs/2406.07507 |
Fully human-written |
|
TRANSPORT-BASED MEAN FLOWS FOR GENERATIVE MODELING |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper integrates optimal transport and mean flows for better one- and few-step generation with geometry-aware learning. The authors claim the unified framework performs efficiently on a few different downstream tasks, such as a toy example, image generation, point cloud generation, and image-to-image translation.
1. The paper is well written and easy to follow.
2. The paper leverages OT coupling mechanisms and mean flows in the training objective to enforce globally optimal source–target alignments, leading to straighter and more stable transport trajectories.
3. The work achieves good performance on multiple tasks with few-step generation.
1. Limited novelty and incremental improvement: The main contribution lies in combining established OT techniques with mean flows. The paper lacks new theoretical developments or formal analysis to differentiate it from prior OT-regularized flow-matching approaches. The authors are encouraged to clarify what is theoretically or algorithmically new beyond the combination itself.
2. Incomplete baselines for one/few-step generation: The authors achieve better performance compared to MeanFlow and other OT techniques, but the lack of comparison with recent one/few-step SOTA generative models limits the impact of the proposed framework. Models such as consistency/distillation models or few-step diffusion or flow-matching baselines are not cited or discussed [1,2,3].
3. Lack of hyperparameter analysis: The paper does not discuss how the hyperparameters (e.g., batch size, learning rate) are selected. The paper should explain the model tuning procedure or strategy used prior to the experiments.
References:
[1] Song, Yang, et al. "Consistency models." (2023).
[2] Meng, Chenlin, et al. "On distillation of guided diffusion models." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023.
[3] Sauer, Axel, et al. "Adversarial diffusion distillation." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.
Note the weaknesses above
Fully human-written |
|
TRANSPORT-BASED MEAN FLOWS FOR GENERATIVE MODELING |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes optimal transport MeanFlow (OT-MF), which integrates OT-based sampling with the MeanFlow framework for one-step generative modeling. The training samples obtained via OT couplings (e.g., Sinkhorn or Linear OT) exhibit geometrically straighter trajectories compared to independent Gaussian-data pairs. This coupling strategy, when combined with the time-averaged MeanFlow formulation, can reduce inference steps while preserving the fidelity of the multi-step flow process. Experiments on diverse modalities (image generation, image-to-image translation, and point cloud generation) demonstrate that the proposed method shows improved sample quality, faster inference, and better trajectory alignment compared to vanilla MeanFlow.
- The proposed combination of OT and MeanFlow is well-motivated to improve trajectory straightness and sampling efficiency.
- The paper provides comprehensive background and related work on Flow Matching, OT-based methods, and MeanFlow.
- Experiments are conducted on diverse tasks and demonstrate consistent improvements in sample quality and efficiency.
- Incorporation of scalable OT solvers keeps the computational overhead manageable without major performance loss.
- Novelty is weak. The main idea (combining existing OT-based coupling with MeanFlow) is conceptually straightforward. While effective, it primarily extends prior techniques rather than introducing a fundamentally new theoretical framework.
- The proposed method still fails to accurately capture the data distribution, even in a simple 2D toy dataset (Figure 2).
- (Minor comment) The qualitative difference between MF and OT-MF is subtle in Figure 1 illustration. Using multiple (x0, x1) trajectories or average path visualizations could better highlight the ‘trajectory straightening’ effect.
- The proposed method primarily combines two existing ideas, OT-based flow matching and MeanFlow. Beyond this integration, is there any novel algorithms, theoretical contribution, or architectural enhancement introduced in the paper?
- In Table 1, why does Sinkhorn OT often outperform the proposed OT-MF in terms of Wasserstein-2 distance?
- Is this method still effective in higher-dimensional or sequential generative tasks, such as audio synthesis or text-to-video generation? |
Fully AI-generated |
|
v-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work contributes a novel dataset named v-HUB as a video humor understanding benchmark. It consists of minimally verbal short videos, and labels for three target tasks of caption matching, humor explanation, and open-ended QA. With the new dataset, this work empirically reveals that existing MLLMs suffer from the issue of strong bias toward text over visuals.
1. This work contributes a novel dataset that can evaluate the MLLMs’ ability of humor understanding from visuals. This dataset can be a valuable resource for this line of research to the community.
1. The dataset novelty is limited.
- Table 1 summarizes the novelty of the proposed v-HUB over existing humor video datasets. However, it contains highly exaggerated arguments.
- Surely, v-HUB is more visual-centric than the other datasets, but it is an overclaim that only v-HUB is O and the others are X, since this is a matter of degree, not a binary O/X decision. For example, ExFunTube contains many visual-centric datapoints, even if not all of them are.
- Also, another important novelty argument is that v-HUB supports three target tasks. However, once one has humor-explanation text labels for humor videos, it is almost automatic to transform those labels into the formats of the caption matching and QA tasks. Thus, it is not a notable contribution.
2. The experimental findings are quite predictable with little surprising novelty.
- It is a well-known issue that current multimodal models (1) are highly biased toward text information over visual one, and (2) often ignore subtle information in the visuals. These phenomena have been observed in almost all multimodal tasks.
- Likewise, the findings summarized in the bolded text in sections 4.2-4.3 offer little novel perspective on the multimodal task.
Based on points 1-2 (limited novelty), this work may be a better fit for a second-tier venue than for ICLR.
3. An ethics review is not possible for a reviewer.
- Humor videos on social media are highly likely to contain sensitive, offensive, disturbing, or private content. However, this work does not provide reviewers with the means to assess such issues in the dataset, as only four examples in Fig. 1 are viewable to a reviewer.
- The copyright issue on Charlie Chaplin’s Silent Films (one of the two main video sources) is not discussed.
Please refer to the Weaknesses. |
Fully human-written |
|
v-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces V-HUB, a visual-centric video humor understanding benchmark that aims to address the limitations of prior works which primarily relied on text inclusion for humor comprehension.
V-HUB emphasizes visual understanding by providing rich multimodal annotations and a diverse set of video types.
It supports multiple tasks, such as humor explanation, caption matching, and open-ended question answering, making it a comprehensive resource for studying humor in videos.
The benchmark highlights the complex, multimodal nature of humor and the challenges of modeling subjectivity and cultural context in computational humor understanding.
**Well-defined problem statement and novelty:** The paper clearly defines its problem and contributes a visual-centric benchmark for humor understanding, addressing the gap in prior benchmarks that relied on natural language cues.
**High-quality benchmark:** Each video is accompanied by rich annotations, including closed captions, textual descriptions, explanations, and humor-related elements. These extensive annotations enable multifaceted evaluation of humor understanding across different dimensions.
**Support for multiple tasks:** The benchmark is applicable to various tasks related to video and humor understanding, such as humor explanation, caption matching, and open-ended question answering.
**Annotation methodology:** Annotations are collected using a dual-caption strategy, which explicitly addresses the subjectivity of humor by incorporating multiple perspectives.
**Diverse video sources:** V-HUB includes both silent and user-generated videos, spanning combinations of visual, visual+text, visual+audio, and video+text+audio modalities. This diversity ensures broad coverage of humor types and scenarios.
**Ambiguity between description and explanation:** The example in Figure 3 shows minimal distinction between a description and an explanation—the main difference being phrases like “which viewers found very humorous.” A deeper analysis is needed to clarify the conceptual and functional differences between these two annotation types to help readers understand the necessity and distinct role of each.
**Limited annotation granularity:** Some videos contain humor concentrated in short segments rather than throughout the entire clip.
Identifying these time segments could improve humor understanding.
Also, for videos with multiple independent humorous moments, the benchmark could benefit from segment-level annotations, ensuring that each humor instance is represented distinctly for more accurate analysis and model training.
**Insufficient detail about background knowledge (L421):** The paper states that background knowledge aids humor understanding but does not specify what this knowledge entails.
Providing a concrete description or examples of such knowledge would enhance interpretability and reader comprehension.
1. What do the authors see as the key differences between description and explanation?
Were there any specific annotation guidelines to ensure they capture distinct aspects of humor?
2. Do all videos contain a single humorous moment, or are there cases with multiple independent humor instances?
If so, does the annotation process capture these moments individually or collectively?
3. What exactly constitutes the background knowledge mentioned in L421? Could the authors provide examples or clarify how it was incorporated into the evaluation process? |
Heavily AI-edited |
|
v-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces v-HUB, a visual-centric benchmark for evaluating multimodal large language models (MLLMs) on video humor understanding, addressing the gap in existing benchmarks that either rely on spoken language or require both video and linguistic cues to comprehend humor. The key findings of the paper include: MLLMs heavily rely on linguistic cues, struggle with active humor discovery, benefit slightly from audio + video, and perform worse on historically distant silent films than contemporary videos.
1. The paper performs a comprehensive comparison of existing datasets (e.g., NYCC, MUStARD, SMILE, ExFunTube) in the area of multimodal humor understanding and systematically diagnoses their inherent limitations.
2. This paper designs three complementary tasks: Caption Matching, Humor Explanation, Open-ended QA. They cover diverse cognitive dimensions of humor understanding—from deep video-text alignment to active humor discovery and general video reasoning.
1. The dataset scale is relatively limited, with only 900+ videos included. This small sample size may restrict the evaluation breadth. For instance, it fails to fully cover humorous scenarios across diverse cultural backgrounds or visual styles, making it difficult to disentangle whether the observed MLLM underperformance stems from inherent capability gaps or insufficient dataset coverage.
2. Additionally, this dataset may lack diversity (e.g., different types of humor, different topics of videos). This lack of transparent cultural-diversity validation restricts the generalizability of v-HUB for evaluating MLLMs on cross-cultural humor understanding tasks.
3. The above limited scale and diversity of historical humor samples undermine the statistical robustness of the claim regarding MLLMs’ poorer comprehension of historical versus contemporary humor, as the conclusion may be biased by small-sample variation rather than genuine model limitations.
4. The paper lacks explicit documentation of the parameter sizes for most evaluated models (e.g., Video-SALMONN-2, MiniCPM-2.6-o) and omits controlled experiments to isolate the impacts of model parameters versus architecture. While a few models (e.g., Qwen2.5-VL-72B) have parameter sizes specified, the absence of comparative tests between models with identical architectures but different parameter scales (or vice versa) makes it impossible to attribute performance differences to either pre-trained knowledge storage gaps (due to parameter size) or architectural design nuances (e.g., visual encoders, multimodal fusion modules, the pre-training methods of different models). This ambiguity weakens the depth of analysis into MLLMs’ humor understanding limitations.
5. The evaluation framework does not include comparative experiments on performance-boosting methods such as few-shot prompting, chain-of-thought (CoT), or fine-tuning. The paper cannot assess whether MLLMs’ visual-centric humor understanding capabilities can be activated or enhanced via these approaches.
6. While the paper mentions that some qualified annotators conducted three rounds of annotation for each video, it does not report quantitative metrics (e.g., Cohen’s Kappa coefficient, Fleiss’ Kappa) to measure inter-annotator agreement.
7. This work mainly uses traditional NLG automatic metrics to evaluate the ability of LLMs. Personally, I feel that these metrics are no longer sufficient to indicate whether the generated text conforms to human preferences.
1. v-HUB only contains 900+ videos and is sourced from a single X account and silent films by Charlie Chaplin. How do you address the concern that the small and undiverse dataset may limit the evaluation breadth, the diversity, and the statistical robustness of the claim about MLLMs' poorer comprehension of historical versus contemporary humor? What specific strategies could be adopted to expand the dataset to better disentangle inherent model capability gaps from dataset coverage limitations?
2. The paper claims v-HUB is "visual-centric" (99% of videos rely on visual cues) but provides no operationalized threshold for "visual reliance". How were annotators instructed to distinguish "visual as primary" from "visual as supplementary" when audio/text elements existed? Why was this definition not quantified, and how might ambiguity here affect the interpretability of the experimental results?
3. Most evaluated models (e.g., Video-SALMONN-2, MiniCPM-2.6-o) lack explicit parameter size documentation, and no controlled experiments were designed to isolate the effects of parameter scale versus architectural design on performance. Why were these critical details and experiments omitted? What experimental design (e.g., comparing models with identical architectures but different parameter sizes) would help attribute performance differences to pre-trained knowledge storage or architectural nuances?
4. No experiments on performance-boosting methods (e.g., few-shot prompting, chain-of-thought, fine-tuning) were included in the evaluation. Why were these approaches not tested, and how does this omission limit insights into MLLMs' potential for visual-centric humor understanding?
Fully human-written |
|
v-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper presents a visual-centric video humor understanding benchmark and gives a comprehensive evaluation of many SOTA MLLMs.
1. The paper collects humor videos from both User-Generated Funny Videos and films, introducing new humor video sources.
2. Compared to existing video humor benchmarks, this paper evaluates a new generation of MLLMs (newer, larger, wider), demonstrating improvements in model capabilities and introducing new challenges in humor comprehension.
1. **Novelty**: The humor videoQA data type is already included in MVBench [1] (Table 1, Action - Unexpected Action: "What unexpected event contributes to the humor in the video?"). The authors cite MVBench in this paper, but they do not seem aware that their topic is already covered by it.
2. **Repetitive work**: Upon more precise tracing of the MVBench source, this paper's overall conceptual framework **nearly entirely overlaps with the HumorQA subset within FunQA [2]**. This raises concerns about duplicate publication.
After a more precise comparison (FunQA vs v-HUB):
i) **Tasks**: Counter-intuitive timestamp - Caption matching, Title generation - Caption matching, Counter-intuitiveness reasoning - Humor explanation, FunQA-MCQA & Dialog subset - Open-ended QA.
ii) **Data size**: 1,769 (avg 7s) vs. 960.
iii) **Anno**: both annotate caption, description, explanation by human annotators.
iv) **Common result**: The models rely heavily on text cues and show weak visual reasoning in humor comprehension.
v) **Eval metrics**: BLEURT, GPT4 vs. BERTScore, METEOR
This looks more like a **coincidental repetition of research topics** (humor videoQA). However, given the existing weaknesses, even though this paper introduces a new benchmark and evaluation, **the omissions in its literature review are significant**, leading the authors to overestimate the novelty of their work and limiting the paper's potential for further advancement in MLLM humor comprehension.
[1] Li, K, et al. "Mvbench: A comprehensive multi-modal video understanding benchmark." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[2] Xie, B., et al. "Funqa: Towards surprising video comprehension." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.
1. Although the existing research topic is repetitive, the authors can still leverage their existing data for in-depth studies, such as model training, ablation experiments involving more modalities, and a thorough analysis of human cognitive patterns regarding humor.
2. Finally, **what are your thoughts and explanations regarding the overlap with HumorQA in FunQA**, and how do you plan to enhance and reconstruct the value of your paper? |
Fully human-written |
|
BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes BridgeDrive, a planning method based on diffusion bridges that addresses the theoretical inconsistency in prior anchor-based diffusion policies such as DiffusionDrive. By enforcing a symmetric and theoretically consistent forward–reverse process, BridgeDrive improves closed-loop performance. The paper further investigates temporal speed-waypoint vs. geometric path-waypoint representations and demonstrates that the geometric representation makes the planner less likely to violate route and lane constraints. Experiments on Bench2Drive show clear improvements over the prior state of the art.
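For readers less familiar with diffusion bridges, the core idea can be sketched with a Brownian bridge pinned at the ground-truth trajectory and the anchor. This is an illustrative schedule with a scalar noise level and unit horizon, and may differ from the paper's exact bridge (DDBM-style) formulation.

```python
import torch

def bridge_sample(x0, xT, t, T=1.0, sigma=1.0):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and xT (t=T).

    x0: ground-truth trajectory, xT: anchor, t: tensor in (0, T) broadcastable
    to x0's shape. The same process describes noising from x0 toward the anchor
    and denoising back, which is what makes the forward/reverse pair symmetric.
    """
    mean = (1 - t / T) * x0 + (t / T) * xT
    std = sigma * torch.sqrt(t * (T - t) / T)
    return mean + std * torch.randn_like(x0)
```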
1. By incorporating bridge diffusion, the paper builds an explicit and theoretically complete formulation that connects anchor information with trajectory generation in anchor-based diffusion policies. This directly addresses the theoretical shortcomings that arise when prior methods heuristically truncate the diffusion process and ignore boundary consistency.
2. The comparison between temporal and geometric trajectory parameterizations provides valuable insight for practical control and shows the superiority of geometric spacing.
3. The method achieves notable improvements on Bench2Drive, demonstrating strong closed-loop performance benefits.
4. Real-time inference and robust multi-modal trajectory generation contribute to the practicality of the approach.
1. The training and evaluation are limited to Bench2Drive simulation. Validation on larger-scale real human driving datasets, e.g. NavSim, would be crucial to support claims of real-world generalization.
2. The influence of scene context versus anchor selection on the final trajectory is insufficiently analyzed. More evidence is needed to show that the trajectory is shaped jointly by scene understanding and anchors, instead of the anchor having disproportionate dominance.
1. Could you provide quantitative or qualitative results confirming that both scene features and anchors significantly contribute to the trajectory generation (for example, attribution studies or case analyses)?
2. How do the number, diversity, and classification accuracy of anchors impact planning robustness? Any sensitivity or failure-case analysis?
3. Are there any experiments on large-scale datasets such as nuPlan or NavSim to support broader applicability claims? |
Lightly AI-edited |
|
BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces BridgeDrive, which adapts the Denoising Diffusion Bridge Model to generate geometric path waypoints from an anchor distribution, achieving state-of-the-art performance on the widely used Bench2Drive benchmark.
1. The paper points out that the problem with DiffusionDrive is that it does not follow a theoretically valid diffusion process. This issue has misled the community. The authors identified it and provided a solution.
2. Using Bridge Diffusion to solve this makes sense.
1. Given the authors' claim that geometric waypoints outperform temporal waypoints, it is better to validate whether this indicates a bias in the Bench2Drive benchmark. Therefore, verification across more benchmarks is highly recommended.
2. The architecture proposed in Figure 2 lacks ablation studies on its structural design.
3. The use of anchors greatly simplifies the true trajectory distribution, raising the question of whether using diffusion is truly necessary. A compelling comparison with SOTA anchor-based methods is needed to justify the adoption of bridge diffusion.
1. Why is the Think2Drive expert used as a baseline when diffusion-based methods appear to perform worse than other approaches? Using the same expert for all methods facilitates a fair comparison.
2. Is Bridge Diffusion more challenging to train than standard diffusion models, and does it still support both classifier-free guidance and classifier guidance? |
Lightly AI-edited |
|
BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
BridgeDrive introduces a diffusion bridge policy for closed-loop trajectory planning in autonomous driving. Unlike previous diffusion-based planners (e.g., DiffusionDrive) that are mainly tested in open-loop settings, BridgeDrive applies a symmetric diffusion bridge framework connecting expert priors and ground-truth trajectories. This design resolves the theoretical inconsistency of prior truncated diffusion models and enables more stable, consistent, and safety-oriented planning. Experiments on the Bench2Drive benchmark demonstrate state-of-the-art results, surpassing SimLingo by +1.8 in Driving Score and +5% in Success Rate, especially in challenging tasks like merging, overtaking, and traffic sign compliance.
1. Recent works that utilize diffusion models to enhance autonomous driving planning (e.g., DiffusionPlanner and DiffusionDrive) seem to experiment only on open-loop benchmarks, while this work extends the evaluation to a closed-loop benchmark, which is more challenging; I think this is a good contribution to the autonomous driving community.
2. The model BridgeDrive achieves SOTA results on the closed-loop benchmark Bench2Drive.
3. The authors honestly provide the "Limitations and future work" section.
Theoretical clarity – Some diffusion concepts (e.g., “truncated diffusion,” the exact bridge formulation) are insufficiently explained for general readers.
Fairness of comparison – DiffusionDrive was reimplemented for Bench2Drive, which might affect comparison reliability.
Comfort and smoothness – Prioritizes safety over comfort, leading to frequent braking behavior.
Dependency on LiDAR – Relies on LiDAR input, limiting adaptability to camera-only settings.
1. Could the authors explain in more detail why, as stated in lines 36-43, "DiffusionDrive that trying to leverage typical human expert driving behaviors introduces a theoretical inconsistency: its denoising process does not match the forward diffusion process that it is trained on, which diverges from the core principle of diffusion models and can lead to unpredictable behavior and compromised performance."
2. About Fig. 1: Intuitively, the starting point of the denoising process should be noisy, but the waypoints in the leftmost subfigure seem very smooth. So these waypoints are not noise sampled from a Gaussian, but clustered results from expert drivers' behavior?
3. How do you get the expert prior? Using human drivers' behavior data from open-loop benchmarks like nuScenes, or the "robot expert" trained via reinforcement learning on Bench2Drive?
4. About section 3.2, "DiffusionDrive with Truncated Diffusion". In this section, the authors mention the asymmetry between the forward "add noise" process and the reverse "denoise" process of DiffusionDrive, where the starting point of the forward process is the expert prior (anchor) while the endpoint of the reverse process is the ground-truth trajectory. I understand this. But what does "truncated" mean? In the paper (lines 129-131), you seem to be saying that timestep t is a pivot: before t we add some noise, and after t we start denoising, which is really confusing.
5. Your method is named "BridgeDrive"; where does the "bridge" show up? In line 191, you mention that you will learn a bridge model $p_\theta(x_t | x_T, z)$, so you actually bridge the starting point (ground-truth trajectory) and the endpoint (expert prior) of the diffusion process? And the endpoint is not a pure Gaussian-noise trajectory? And at inference, you directly denoise from the expert priors? That sounds plausible.
6. Finally, I think most of the ideas in this paper are based on the weaknesses of DiffusionDrive, so the comparison with DiffusionDrive is important. The problem is that DiffusionDrive only provides its code on NAVSIM and nuScenes, not Bench2Drive. So, to compare with DiffusionDrive, the authors can:
1. Implement your idea based on DiffusionDrive on NAVSIM and nuScenes, then test on the same testset.
2. Reimplement DiffusionDrive to Bench2Drive, and then test directly, as you have implemented your methods on Bench2Drive.
The authors choose the second option, and although they provide detailed notes on how they reimplemented DiffusionDrive on Bench2Drive, the fact is that reimplementing a method from one benchmark on another is challenging, and we cannot guarantee that no errors introduced during reimplementation caused a performance drop. So, I recommend the authors adopt the first comparison approach: DiffusionDrive has official code, so you can easily implement your method on top of it, avoid a cross-benchmark reimplementation, and obtain a more convincing result.
Fully human-written |
|
BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The authors present BridgeDrive, which proposes an anchor-guided diffusion formulation for closed-loop trajectory planning. BridgeDrive demonstrates SOTA performance on the Bench2Drive benchmark and provides useful insights on the parameterization of diffusion models. The only concerns are whether the experimental setup is fair to DiffusionDrive (scoring-based vs. predicted anchors), and whether the performance is highly correlated with how accurate the anchor prediction is.
- Strong empirical SOTA performance on the Bench2Drive benchmark
- Provides good insights on how to parameterize diffusion model outputs: geometry points versus waypoints
- The main comparison of this work is against DiffusionDrive, but the two methods differ significantly: DiffusionDrive scores across *all* anchors and selects the best, while BridgeDrive first classifies one anchor and then denoises around it. Please add results for the scoring-based formulation to Table 1 and Table 2
- Why do we care about whether the forward and reverse processes are symmetric, and how does that affect performance? Following the previous point, DiffusionDrive offers fast inference (2 denoising steps compared to 20).
- Couldn’t open the supp videos
- What if the classified anchors are wrong for BridgeDrive, and how does this affect the performance? Does the diffusion model assume that it always starts from the right anchor?
- Can BridgeDrive, for example, handle multiple anchors and then select the best result based on scoring? This is more practical and may potentially improve its out-of-distribution capabilities.
- How does the performance degrade when decreasing the diffusion timesteps for BridgeDrive? And why is DiffusionDrive's sampling not ~10x faster than BridgeDrive's, given that its number of diffusion steps is 10x smaller?
- Table 1 is missing DiffusionDrive^geo
Fully human-written |
|
EduVerse: A User-Defined Multi-Agent Simulation Space for Education Scenario |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents EduVerse, a user-defined multi-agent simulation framework for educational scenarios. It introduces a Cognition–Interaction–Evolution (CIE) architecture to simulate realistic classroom dynamics among virtual students and teachers. EduVerse enables customization of agents, environments, and interactions, while supporting human-in-the-loop participation. The authors evaluate the system through simulated and real classroom experiments in Chinese language teaching, showing that EduVerse can reproduce authentic teaching dynamics and capture long-term learning evolution. The platform demonstrates promising potential for educational research, intelligent tutoring, and social learning analysis.
Solid theoretical foundation – The CIE framework is conceptually well-motivated and systematically designed, combining cognitive modeling, social interaction, and evolution mechanisms.
Rich and diverse experiments – The authors conduct multiple experiments across different educational aspects (cognitive alignment, group interaction, long-term evolution), providing strong empirical support.
Unclear system details – The description of the system’s user interface and real-user interaction mechanism (how students and teachers use EduVerse) is vague and underdeveloped.
Limited explanation of real-world experiments – Although the paper claims real classroom validation, the implementation details of these experiments (e.g., how data were collected, how participants interacted) are not clearly stated.
Scalability and generalization – The experiments are confined to a specific subject (Chinese language classes), and the system’s adaptability to other domains remains untested.
Questions:
1. What does "IRF" mean? This abbreviation appears in the abstract without its full name being given first.
2. For the LLMs, you mention that you use "InternVL" and "GPT-4" (lines 216-219); do you fine-tune the LLMs on education data to get better results?
3. You mention that "EduVerse provides a human-in-the-loop interface that admits real students or teachers alongside virtual agents" (lines 281-282); how can students and teachers in the real world interact with the system? Is the user interface (UI) something like the UI of ChatGPT?
4. Are the names that appear in the experiment section, such as "Zhang Jie" and "Liu Li", the names of your simulated student agents or of real students?
5. Do you have more information about the experiments conducted in real-world classrooms? It seems that all the experiments in the experiment section are conducted in simulation.
Suggestions:
1. As the authors provide such a large appendix, I recommend adding a table of contents before the appendix.
Typos:
1. In Fig1, ②: Mr. Zhuvividly => Mr. Zhu vividly
2. In Fig1, circle a: Cognition Engin => Cognition Engine |
Lightly AI-edited |
|
EduVerse: A User-Defined Multi-Agent Simulation Space for Education Scenario |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes an LLM-based multi-agent simulation designed for educational settings. Specifically, they try to capture the dynamics of cognitive development and classroom interactions over time.
- Educational psychology theories back up the design of the simulation components
- The simulation captures a lot of factors that go into classroom dynamics, such as seating arrangements, varying personalities, emotions, etc.
- They performed rigorous experiments validating different aspects of the model.
- The results are quite promising. The interaction dynamics resemble behaviors observed in real classroom settings.
- Not sure if the authors can claim to be the first multi-agent simulation space for education since I found some existing papers that use multi-agent simulations in the education domain [a, b], aside from those already cited in the more comprehensive related works in the appendix. Granted that they are doing different things, "multi-agent simulation space for education" is broad enough to encompass their works as well.
[a] Xu, S., Wen, H. N., Pan, H., Dominguez, D., Hu, D., & Zhang, X. (2025, April). Classroom Simulacra: Building Contextual Student Generative Agents in Online Education for Learning Behavioral Simulation. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (pp. 1-26).
[b] Arana, J. M., Carandang, K. A. M., Casin, E. R., Alis, C., Tan, D. S., Legara, E. F., & Monterola, C. (2025, July). Foundations of PEERS: Assessing LLM Role Performance in Educational Simulations. In ACL 2025 Student Research Workshop.
- Evaluations on the temporal dynamics / trajectories are a bit weak in my opinion. They are not backed up by any data but only rather vague statements like "clear individual differences", "sustained positive affect", etc.
- The memory management and knowledge progression is also not quite clear. The authors mention that they are adjusted based on behavioral signals such as bloom level and response type. However, there does not seem to be very convincing validation of this design.
- Regarding the temporal dynamics experiments, it is not quite clear what we expect the curves to be. What is a valid trajectory and what is not? Is any curve that exhibits positive transitions / shifts valid? How well does this match reality?
- How do you manage the memory? How do you decide what gets stored and what gets forgotten? How well does this match realistic human student memory recall?
- How are the emotions being probed? Is it just through direct prompting, or do you also ask the agents to answer some kind of questionnaire similar to what we would give a human participant?
- Out of curiosity, are you able to capture problematic student behaviors? It would be very interesting to simulate interventions or management strategies for them. |
Fully human-written |
|
EduVerse: A User-Defined Multi-Agent Simulation Space for Education Scenario |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper focuses on reproducing realistic classroom dynamics. To achieve this, the authors present EduVerse, a novel user-defined multi-agent simulation platform that introduces a Cognition–Interaction–Evolution (CIE) architecture. This architecture models the long-term cognitive, emotional, and behavioral development of virtual agents within customizable classroom environments.
The human–agent interaction experiments provide valuable insights.
- The work attempts to address multiple aspects, including individual modeling, role-differentiated social interaction, and longitudinal instructional adaptation, but does not clearly explain them.
- The evaluation is vague.
- Please include the key metrics in the main paper instead of the appendix. This would improve both readers’ understanding and reviewers’ efficiency.
- Table 1 does not show how the simulation aligns with real classroom data. For example, IRF_rate on Lyrical Prose (0.336 vs. 0.486) contradicts the claim of only minor genre-specific variations.
- Figure 5 lacks a clear caption about the ablation study, making it difficult to follow the analysis and interpret the bar chart.
- Much of the analysis focuses on individual cases, while the main focus should be at the class level.
- The work is limited to Chinese language classes. Cross-domain or cross-linguistic experiments would strengthen the generalization of this work.
- Figure 4 is too small and hard to review.
Please see weaknesses. |
Lightly AI-edited |
|
EduVerse: A User-Defined Multi-Agent Simulation Space for Education Scenario |
Soundness: 3: good
Presentation: 3: good
Contribution: 4: excellent
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper presents a framework, EduVerse, for user-defined multi-agent simulation in the context of AI in education. The authors deployed EduVerse in middle school Chinese language classes with diverse educational tasks, rich emotional expression, and complex interaction structures. The authors also conducted empirical comparisons with existing frameworks.
- Timely topic focusing on the AI in education and LLM
- In-depth analysis of related work
- The proposed framework combines the cognitive, interactive, and
evolutionary dynamics of developmental agents in the context of AI in education
- Deployed in classrooms showcases the practical impact
- A human-in-the-loop interface allows real teachers and students to participate, enabling simulation, causal testing, and validation
When designing an intelligent tutoring system, it is crucial to take into account the subject domain. For example, students' cognition, help-seeking behavior, and peer discussion vary widely across math vs. writing an essay in literature vs. introductory programming.
The prior work by other researchers cited by the authors is also domain-specific. Could the authors explain how to adapt the framework to a specific subject domain with different question difficulties and knowledge bases?
Please see weaknesses
Fully human-written |
|
EduVerse: A User-Defined Multi-Agent Simulation Space for Education Scenario |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors introduce EduVerse, a "user-defined multi-agent simulation space" for virtual classrooms built around a Cognition–Interaction–Evolution (CIE) architecture layered over a Perception–Cognition–Action loop. Users can customize the seat graph/layouts, teacher/student agents, and sessions (multi-lesson trajectories). A human-in-the-loop interface lets real users join a simulated class. Figure 1 lays out the three components: user-defined environment, CIE agent modeling, and interaction/evolution experiments.
The authors’ core claim is the simulated instructional realism of a typical classroom (measured by IRF rates).
I think this is a very interesting idea with a good approach, but I have a lot of reservations about the claims made by the authors.
Focusing on the positives, I think the work itself is good. There are plenty of good uses for a simulator of this type, especially ones that involve a human in the loop.
I do like the modular CIE breakdown, the explicit teacher pacing controller, and that the tasks are already implemented. The range of evaluation criteria is good (IRF alignment, B/E/C distributions, small-graph network summaries, ablations, human–agent tasks, and a cross-session measure), even if I have some concerns about them.
My favorite part is probably the CIE-based agent modeling. I think there's a lot of potential in the ideas that the authors outlined here for how the process of teacher-led group discussion can play out.
While there's a lot to like about this paper, I think there are some pretty severe issues with the main claim:
- The authors position EduVerse as the "first" user-defined multi-agent classroom simulator. But they even acknowledge other pre-existing multi-agent classroom simulators in their own related work, and other general agent setups (i.e., AgentVerse) that already support role-based, IRF-style interactions.
- The Abstract and Table 1 frame IRF rates as “close” to real classes, but Table A4 shows sizeable divergences (e.g., Argumentative Essay, Lecture: 0.639 vs. 0.417 real). ESPECIALLY with a signal as noisy as teacher-led discussion in classes, I feel like it's hard to take any purely quantitative analyses at face value without some kind of qualitative evidence to back it up.
- There’s a lack of details about how many classes/schools were used as the comparison baseline and, again, who annotated the logs who could provide qualitative evidence as backup.
- The system labels its own cognition (Bloom) and emotion during the Monitor step, then reuses these labels for evaluation (BEC distributions). If im not misunderstanding, this is basically just the model asking itself if it's correct, which doesnt seem super reliable.
- The authors fine‑tune VLM backbones (InternVL/LLaVA/Qwen‑VL/MiniCPM) for text‑only style, trained on ~6k utterances. Why VLMs for text style control? The authors also report InternVL “achieved the highest scenario‑grounded performance,” but the metric and protocol aren’t shown.
- Not a huge negative, but a heads-up: in Figure 1, the middle section has “Cognition Engin” instead of “Cognition Engine”
- Did the authors inspect the generated EduVerse logs vs the conversation logs of a real classroom?
- Were all experiments/baselines drawn from the same classroom or different classrooms?
- What was the motivation for using VLM backbones for what seems, to me, a largely text-based scenario? |
Fully human-written |
|
Parameter-Efficient Subspace Optimization for LLM Fine-Tuning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces PESO (Parameter-Efficient Subspace Optimization), a unifying framework for parameter-efficient fine-tuning of LLMs grounded in classical subspace minimization. PESO connects methods like LoRA and GaLore to a principled exploration-exploitation paradigm, enabling memory-efficient optimization with provable convergence in the full parameter space. The authors instantiate PESO into two practical variants, PESO-LoRA-R and PESO-LoRA-T.
1. The PESO framework bridges PEFT with classical subspace minimization, offering an exploration–exploitation perspective and a unified Algorithm 1 that generalizes several existing methods.
2. PESO-LoRA-R and PESO-LoRA-T emerge as straightforward, practical special cases directly derived from the framework.
3. The paper presents theoretical guarantees for full-rank convergence under the stated assumptions.
4. The model is empirically evaluated through Llama-2-7B pre-training and multiple benchmark experiments.
1. Since the core theme of the paper revolves around exploration-exploitation, it would be natural to include targeted ablation studies, particularly examining the effects of restart frequency (K), rank (r), and related parameters.
2. Although the paper positions itself as a unifying framework, it lacks in-depth discussion and comparison with key baselines in this area, notably GaLore [1] and other state-of-the-art methods.
3. (Please correct me if I’m mistaken.) M appears to be defined inconsistently: once as a projection map and elsewhere as a subspace. The notation would benefit from clearer, more consistent presentation. Additionally, there are minor grammatical issues (e.g., line 38: “Therefore, updating the entire …”).
4. The alignment techniques are central to the proposed algorithm and should be discussed thoroughly in the main text, rather than being deferred to the appendix.
5. The model is evaluated primarily against LoRA variants, but several recent strong baselines, including GaLore [1], APOLLO [2], LDAdam [3], FiRA [4], etc., are missing. Moreover, SubTrack++ [5], which also explores identifying optimal subspaces via geometric insights, appears conceptually related to the exploration phase and warrants discussion.
6. The evaluation results are not fully convincing, as the mentioned baselines in point 5 typically outperform LoRA variants. This raises concerns about whether the proposed algorithms offer substantial improvements or meaningful advantages.
7. The computational efficiency of the proposed methods is not addressed; in particular, time and memory costs should be analyzed, given that SVD operations are often computationally expensive.
8. The proposed variants require clearer exposition in the main text, including detailed explanations and mathematical formulations of the optimizers and steps used in Algorithms 2 and 3. The current presentation includes repetitive content, while several important details are relegated to the appendix.
---
[1] Zhao et al., 2024. GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection.
[2] Zhu et al., 2025. APOLLO: SGD-like Memory, AdamW-level Performance.
[3] Robert et al., 2025. LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics.
[4] Chen et al., 2024. Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?
[5] Rajabi et al., 2025. SubTrack++: Gradient Subspace Tracking for Scalable LLM Training
Please refer to the weaknesses. |
Lightly AI-edited |
|
Parameter-Efficient Subspace Optimization for LLM Fine-Tuning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes PESO, a LoRA-like PEFT algorithm motivated by subspace optimization. PESO alternately explores new subspaces via low-rank SVD, as in GaLore, and then exploits each subspace via Adam updates. The paper also provides theoretical convergence proofs and conducts experiments to justify PESO's efficiency.
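For concreteness, my reading of the explore/exploit pattern is roughly the sketch below; the function name, the periodic-restart schedule, and the plain Adam-style moments are my own illustrative assumptions, not the paper's exact algorithm.

```python
import torch

def peso_style_step(W, grad, state, step, rank=8, refresh_every=200, lr=1e-3):
    """Illustrative explore/exploit update: periodically re-estimate a rank-r
    subspace from the current gradient via SVD (explore), then run Adam-style
    moment updates inside that fixed subspace (exploit)."""
    if step % refresh_every == 0 or "P" not in state:
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                          # explore: new subspace basis
        state["m"] = torch.zeros(rank, grad.shape[1], device=grad.device)
        state["v"] = torch.zeros(rank, grad.shape[1], device=grad.device)
    P = state["P"]
    g = P.T @ grad                                        # project gradient into the subspace
    state["m"] = 0.9 * state["m"] + 0.1 * g               # exploit: first moment
    state["v"] = 0.999 * state["v"] + 0.001 * g * g       # exploit: second moment
    update = state["m"] / (state["v"].sqrt() + 1e-8)
    W -= lr * (P @ update)                                # map the step back to full space
    return W
```

This is only my reconstruction from the summary; a precise statement of these steps in the main text would make the comparisons below easier to interpret.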
1. PESO achieves better parameter efficiency and performance compared to vanilla LoRA, as illustrated in the experiments.
2. The algorithm is new and a theoretical convergence proof is provided.
1. In Lines 109-111, the authors claim that "The resulting algorithm is, to our knowledge, the first memory-efficient method for LLM training with provable convergence to full-parameter optimality up to small errors, without additional assumptions such as explicit low-rankness of the solution." However, prior works have already established exact convergence rates for memory-efficient LLM training methods with standard or mild assumptions, including GoLore [arXiv:2410.11289] and LDAdam [arXiv:2410.16103], both of which were uploaded to arXiv one year ago. Consequently, given the non-diminishing convergence gap in Theorem 5.1 and the presence of these prior works, I strongly disagree with this claim.
2. The assumptions in the convergence analysis are too strong. Specifically, the approximation error $\delta_k$ can diverge if the gradient $G_k$ diverges. The present proofs cannot exclude the case where $\lim_{k\rightarrow\infty}\delta_k=\lim_{k\rightarrow\infty}\|G_k\|_F=\infty$, and thus I believe Assumption 4 is a strong assumption.
3. I think the improvements of PESO, as compared to the baselines in the experiments, are limited. Other subspace optimization algorithms such as GaLore [arXiv:2403.03507], GoLore [arXiv:2410.11289] , LDAdam [arXiv:2410.16103], Fira [arXiv:2410.01623] and Subtrack++ [arXiv:2502.01586] have similar memory efficiency and much stronger performance than LoRA. It is recommended to at least include some of these strong baselines in the experiments.
1. See Weakness 1. Can the authors provide more evidence to support the claim?
2. See Weakness 2. Can the authors give more detailed explanation on why Assumption 4 holds?
3. See Weakness 3. Is PESO empirically comparable to, or better than the memory-efficient baselines I mentioned? |
Fully human-written |
|
Parameter-Efficient Subspace Optimization for LLM Fine-Tuning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
In this paper, the authors introduce a unifying framework, Parameter-Efficient Subspace Optimization (PESO). The framework covers many existing methods, such as LoRA, and bridges them with the algorithms and theory of subspace optimization.
The strengths of this paper are summarized as follows:
1. It combines multiple Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, AdaLoRA, and GaLore, under a single mathematical view.
2. Theoretically, it gives the first proof of a full-parameter convergence guarantee for a memory-efficient fine-tuning method; the guarantee holds in the full model weight space.
3. The proposed framework, PESO, is practical. It is a plug-and-play design and can improve existing PEFT methods with very little modification. This seems very impactful for the field.
The weaknesses of this paper are summarized as follows:
1. The experimental results are based on T5-base and LLaMA-2-7B. It would be better if the authors could include more experimental results on additional models, such as LLaMA 3, and it would be more interesting to test models of different sizes.
2. The experimental results seem to focus on fine-tuning. It would be better if the authors could also consider full pre-training. Also, the paper primarily compares against LoRA-based baselines; it lacks evaluation or comparison against GaLore or GaLore variants, such as GoLore [1] and SARA [2].
[1] Yutong He, Pengrui Li, Yipeng Hu, Chuyan Chen, and Kun Yuan. "Subspace optimization for large language models with convergence guarantees." ICML'25.
[2] Haochen Zhang, Junze Yin, Guanchu Wang, Zirui Liu, Tianyi Zhang, Anshumali Shrivastava, Lin Yang, and Vladimir Braverman. "Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining". NeurIPS'25.
Please see the weaknesses. |
Fully human-written |
|
Parameter-Efficient Subspace Optimization for LLM Fine-Tuning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes Parameter-Efficient Subspace Optimization (PESO) — a unifying framework that connects modern parameter-efficient fine-tuning methods for large language models (LLMs), such as LoRA, with the classical theory of subspace optimization. PESO provides a principled foundation that interprets these methods through an exploration–exploitation trade-off in the subspace, leading to the design of new algorithms that are both memory-efficient and have strong convergence guarantees.
- Provides a framework that can cover some existing low-rank fine-tuning approaches
- The paper is well-written in general and easy to follow
While the paper claims contributions at the conceptual, theoretical, and empirical levels, these contributions appear insufficiently substantiated.
1. **Conceptual novelty**. The subspace minimization perspective is not new. This viewpoint has already been well established in GaLore [A1] and more recently revisited in Randomized Subspace Optimization (RSO) [A2]. In particular, the proposed framework in Equation (3) closely resembles RSO, where a low-rank variable $\xi$ is obtained by solving a subproblem and then added back to the base parameter $W$. The authors are encouraged to clearly articulate the distinctions between their framework and the RSO algorithm.
2. **Theoretical contribution**. The convergence analysis is weak and incomplete. Numerous existing works have provided both exact convergence guarantees and explicit convergence rates, such as RSO [A2], LDAdam [A3], SARA [A4], and RAC-LoRA [A5]. By contrast, the proposed algorithm only achieves convergence to a biased solution dependent on $\delta$, without demonstrating exact convergence and convergence rates. This is a major concern regarding the paper’s theoretical rigor.
3. **Experimental evaluation**. The empirical results are not comprehensive. The paper omits comparisons with recent strong baselines, including LDAdam, SARA, RAC-LoRA, and APOLLO [A6], which have demonstrated strong performance in both pre-training and fine-tuning settings.
4. **Assumptions**. In Lines 127–136, the authors argue that prior works rely on unrealistic assumptions such as $r < m$ or random projections. However, Assumptions 4 and 5 in this paper are themselves non-standard and not commonly adopted in the literature. It is therefore unconvincing to claim that the present assumptions are more natural or milder than those in existing studies.
[A1] GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
[A2] A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models
[A3] LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics
[A4] Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining
[A5] Randomized Asymmetric Chain of LoRA: The First Meaningful Theoretical Framework for Low-Rank Adaptation
[A6] APOLLO: SGD-like Memory, AdamW-level Performance
1. Clearly state the difference from existing subspace optimization methods such as RSO [A2]
2. Establish the exact convergence of the proposed algorithm. Establish the convergence rate of the proposed framework. Compare the rates with existing literature.
3. Conduct experiments with stronger baselines such as LDAdam, SARA, RAC-LoRA, and APOLLO. |
Fully human-written |
|
Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment |
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses Language Models’ reasoning inconsistencies stemming from probabilistic sampling by introducing a reinforcement learning framework, MACA. MACA utilizes multi-agent debate to self-generate preference data, designating trajectories aligned with the majority consensus as 'preferred' and dissenting trajectories as 'not preferred'. The model is then optimized on these self-supervised signals, using methods like DPO, to favor consensus-forming reasoning paths. The approach yields improvements in self-consistency and single-agent reasoning, and demonstrates generalization to unseen domains.
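To make the data-construction step concrete, here is a minimal sketch of how I understand the preference pairs to be formed from debate outcomes; the function and field names are mine, not the authors' code.

```python
from collections import Counter

def build_consensus_pairs(question, trajectories, answers):
    """Label trajectories by whether their final answer matches the majority vote,
    then pair consensus ('chosen') with dissenting ('rejected') trajectories for
    DPO-style preference optimization."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    chosen = [t for t, a in zip(trajectories, answers) if a == majority_answer]
    rejected = [t for t, a in zip(trajectories, answers) if a != majority_answer]
    return [
        {"prompt": question, "chosen": c, "rejected": r}
        for c, r in zip(chosen, rejected)
    ]
```

Framed this way, the failure mode in the first weakness below is immediate: a correct minority trajectory always lands in 'rejected'.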
1. This paper tries to address an important problem: the reasoning inconsistency of LMs.
2. The MACA framework is conceptually simple.
3. The paper provides a very detailed appendix.
1. The core mechanism fails in the face of “correct but minority” reasoning and actively rewards incorrect consensus.
2. The evidence for the central novelty claim is weak. As shown in Table 6, gains are marginal in 5 of 8 cases when MV(C) is compared fairly with DMV(C).
3. The method essentially trains the model to agree more strongly with what it already agrees on, creating a self-reinforcing echo chamber that may amplify a model's inherent biases.
See weaknesses. |
Fully human-written |
|
Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
In this work, the authors argue that self-consistency should be a desired property of well-aligned reasoning models, and they address this by introducing MACA, a reinforcement learning framework that post-trains models to prefer consensus-supporting trajectories using majority/minority outcomes from multi-agent debate.
The proposed MACA framework can relax the need for ground truth labels and uses majority outcomes as the learning signal.
1. Learning from multi-agent debate trajectories is explored in prior work [1], as well as using the majority answer as the learning signal [2, 3]. In [1], the author also shows different levels of learning from the consensus-supporting and dissenting trajectories. This limits the novelty of this work.
2. Post-training methods like GRPO naturally sharpen the distribution and improve pass@1 performance substantially. This is similar to the motivation of this work about internalizing consistency and has been proven to be very effective. The author should compare with ScPO [2] and TTRL [3] to show how many improvements are coming from multi-agent debate, or the proposed method even outperforms these baselines.
3. The training is conducted on the base model instead of the instruction-tuned version. Comparing with instruction-tuned models is necessary, and it will be more convincing if the model is trained from instruction-tuned checkpoints.
4. Limiting responses to 256 tokens does not make sense to me. This budget is not sufficient for tasks like MATH and GPQA. Although the budget increases to 512 in the appendix, it is still too small.
5. The fact that "Unsupervised majority-vote signal is comparable to ground-truth" as shown in the ablation study is established in [2, 3], again limiting the novelty of this work.
6. The improvements are overclaimed, since the trained performance is compared only against the base model. The improvements should instead be compared with external baselines such as [2] and [3].
7. On the note of "how many improvements are coming from multi-agent debate", it is also important to compare with [1].
[1] https://arxiv.org/abs/2402.01620
[2] https://arxiv.org/abs/2411.04109
[3] https://arxiv.org/abs/2504.16084
Please see weaknesses. |
Fully human-written |
|
Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces MACA, a self-supervised reinforcement learning framework for post-training LLMs to enhance self-consistency in reasoning. The MACA approach formalizes self-consistency as an intrinsic property of LLMs and utilizes multi-agent debate, where multiple clones of an LLM generate, critique, and revise reasoning trajectories. The framework generates training signals from these deliberative exchanges and optimizes LLMs using several objectives to internalize consensus rather than simple aggregation. Substantial empirical improvements are demonstrated over the base model and strong post-training baselines on mathematical, science, and commonsense reasoning benchmarks, with gains in both accuracy and self-consistency, generalization to unseen tasks, and efficiency in inference.
1. The MACA framework is not limited to GRPO; it can also be integrated with multi-agent RL and preference-based alignment objectives, suggesting broader applicability to LLM self-alignment settings.
2. The improvements in self-consistency observed on mathematical reasoning tasks transfer to science and commonsense benchmarks, indicating that the method generalizes beyond a single domain.
1. The core learning loop is closely aligned with recent test-time reinforcement learning approaches [1], in which multiple sampled reasoning trajectories are compared and the consensus outcome is used as a self-supervised preference signal to update the model. The main distinction in this paper is that the consensus signal is generated via multi-agent debate rather than independent sampling. However, this conceptual similarity should be made more explicit, and a direct comparison to test-time RL methods is necessary to clarify what is genuinely novel in the proposed contribution.
2. While the improvements over SFT baselines and previous post-training paradigms are clear, the paper would benefit from direct, quantitative comparison to more diverse non-MACA multi-agent aggregation schemes. It is unclear how much gain stems from the debate protocol itself versus just using more samples at training.
[1] Zuo et al., "TTRL: Test-time Reinforcement learning." 2025
See weaknesses |
Moderately AI-edited |
|
Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces MACA (multi-agent consensus alignment), a reinforcement learning framework that trains LLMs to be more self-consistent. MACA uses debate-derived consensus trajectories and trains LLMs via preference learning (DPO/KTO) or other objectives (GRPO, SFT).
Experiments show significant performance gains on math reasoning datasets and strong generalization to unseen reasoning domains, validating that MACA can improve self-alignment and elevate reasoning capabilities.
- The work is well-motivated, addressing LLMs' inconsistency when sampled multiple times. The proposed method is simple yet effective, using majority voting from multi-agent debates as a weak supervision signal to construct preference pairs for training.
- Thorough experiments and ablations yield valuable insights, such as "Multi-agent debate produces more informative training signals than single-round majority voting" and "Addressing consensus alignment through preference learning improves over GRPO and SFT".
No major weaknesses. A few questions are listed below.
- Regarding the scaling of debate settings, will **more agents** or **more rounds** or **heterogeneous agents** yield better consensus? Would scaling these dimensions lead to higher-quality pairwise training data?
- Cross-model transfer: If debate trajectories generated by a more capable LLM (e.g., Llama-8B) are used to train a smaller model (e.g., Llama-3B), how would this impact the smaller model’s **self-consistency** and accuracy?
- Presentation: In Figure 2, which post-training methods (DPO, KTO, or SFT) are being illustrated? In Table 4, what does “Debate” refer to in the single-agent setting? |
Lightly AI-edited |
|
Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes MACA (Multi-Agent Consensus Alignment), a post-training framework that “internalizes” self-consistency by using multi-agent debate to generate majority (consensus) and minority (dissent) trajectories, then optimizing the model with MV-DPO/MV-KTO/MV-GRPO or MV-SFT on those signals. The paper formalizes self-consistency via majority probability over sampled reasoning paths and measures agreement in multi-agent settings. Experiments on small LLMs (2B–8B) across math-heavy benchmarks report sizeable gains in sampling consistency and accuracy.
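For reference, the single-agent sampling-consistency quantity being formalized reduces, in its empirical form, to the majority fraction over sampled answers; a minimal sketch under that reading (names are mine, not the paper's):

```python
from collections import Counter

def sampling_consistency(answers):
    """Fraction of sampled reasoning paths whose final answer equals the modal
    (majority) answer; 1.0 means the model agrees with itself on every sample."""
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)

# e.g., 10 samples where 7 agree on "42" gives a consistency of 0.7
print(sampling_consistency(["42"] * 7 + ["40", "41", "43"]))
```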
1. The paper explicitly defines single-agent sampling consistency and multi-agent agreement, making the target capability measurable, and uses them consistently in analysis.
2. MACA reuses standard post-training objectives (DPO/KTO/GRPO/SFT) with debate-derived preference signals, and shows that MV-DPO/KTO tend to outperform SFT and scalar-reward baselines across several model/dataset pairs.
3. The paper compares debate-majority supervision to ground-truth labels and finds similar performance, and also tests training with/without peer context during debate—both helpful for understanding what drives gains.
1. Since self-consistency prompting (Wang et al., 2022) is a main comparator, I expected a training-time baseline that (i) uses self-consistency/majority vote to curate a dataset (e.g., majority-consistent rollouts), then (ii) finetunes/SFTs on that curated set—without multi-agent debate. This would test whether debate-derived signals truly add value over classical self-consistency data augmentation.
2. The training/inference cost versus gains isn’t quantified (GPU hours, wall-clock, debate throughput).
3. Most training is math-centric; generalization to GPQA/CSQA is interesting but still limited in breadth. Also, the “formalization of self-consistency” largely re-casts majority probability/consensus ideas already known from self-consistency and majority-vote literature; the novelty is primarily in using debate-derived preference pairs, which would be stronger if contrasted directly against the SC-curation+FT baseline noted above.
As in weaknesses |
Heavily AI-edited |
|
Boosting Federated Model Convergence with Anomaly Detection and Exclusion |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper studied the effect of anomaly detection and exclusion on learning efficiency in federated learning (FL). Via theoretical analysis, the authors demonstrated how FL anomaly exclusion mechanisms contribute to faster convergence of the global model. In addition, the authors introduced PID-MADE, which operates without requiring an estimate of the expected number of anomalies and achieves linear computational complexity. However, there exist some concerns, including the baseline selection and the poor performance of PID-MADE.
This paper studied the effect of anomaly detection and exclusion on learning efficiency in federated learning (FL). Via theoretical analysis, the authors demonstrated how FL anomaly exclusion mechanisms contribute to faster convergence of the global model. In addition, the authors introduced PID-MADE, which operates without requiring an estimate of the expected number of anomalies and achieves linear computational complexity.
I have some concerns as follows.
1. The aggregation strategies shown in Tab. 1 are incomplete; missing examples include 1) FedCDA: Federated Learning with Cross-rounds Divergence-aware Aggregation and 2) Federated Learning with Sample-level Client Drift Mitigation.
2. The proposed PID-MADE shows poor performance in Fig. 2 on FEMNIST. The authors should conduct experiments on more datasets, not just these four.
3. The baselines used in the experiments are Krum, MKrum, RFA, and Bulyan. These baselines are not enough, and the authors should add some state-of-the-art baselines.
4. In addition, the robustness of PID-MADE is not well evaluated in the experiments. The authors should evaluate the robustness of PID-MADE under state-of-the-art poisoning attacks.
1. The aggregation strategies shown in Tab. 1 are incomplete; missing examples include 1) FedCDA: Federated Learning with Cross-rounds Divergence-aware Aggregation, 2) Federated Learning with Sample-level Client Drift Mitigation, etc.
2. The proposed PID-MADE shows poor performance in Fig. 2 on FEMNIST. The authors should conduct experiments on more datasets, not just these four.
3. The baselines used in the experiments are Krum, MKrum, RFA, and Bulyan. These baselines are not enough, and the authors should add some state-of-the-art baselines.
4. In addition, the robustness of PID-MADE is not well evaluated in the experiments. The authors should evaluate the robustness of PID-MADE under state-of-the-art poisoning attacks. |
Fully human-written |
|
Boosting Federated Model Convergence with Anomaly Detection and Exclusion |
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes **PID-MADE**, a history-aware anomaly detection and exclusion rule for federated learning (FL). Each client gets a PID-style score (proportional + integral + derivative of its distance to a round centroid), and clients above a threshold $\tau = \bar u_t + k \sigma_t$ are excluded before aggregation. The method is claimed to converge faster than FedAvg/Krum/Bulyan/RFA on several small image datasets and one LLM MLM setting.
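To fix ideas, the scoring-and-exclusion rule as I understand it from the paper amounts to the following sketch; the gain names, the mean-of-updates centroid, and the history bookkeeping are my assumptions.

```python
import numpy as np

def pid_made_round(updates, history, kp=1.0, ki=0.1, kd=1.0, k=2.0):
    """Score each client with a PID-style combination of its current, cumulative,
    and most recent change in distance to the round centroid, then exclude
    clients whose score exceeds tau = mean + k * std of the round's scores."""
    centroid = np.mean(list(updates.values()), axis=0)
    scores = {}
    for cid, u in updates.items():
        p = np.linalg.norm(u - centroid)            # proportional: current deviation
        past = history.setdefault(cid, [])
        i = sum(past) + p                           # integral: cumulative deviation
        d = p - past[-1] if past else 0.0           # derivative: change since last round
        scores[cid] = kp * p + ki * i + kd * d
        past.append(p)
    vals = np.array(list(scores.values()))
    tau = vals.mean() + k * vals.std()
    kept = [cid for cid, s in scores.items() if s <= tau]
    return kept, tau
```

Written out like this, it is also easier to see how an adaptive attacker (see the weakness below) could shape its updates to stay under $\tau$ round after round.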
- The **problem is interesting and important**: robust FL with attention to **convergence speed**, not only empirical results.
- Method is **simple** and **practical** to implement; server cost $O(nd)$ is good for scale.
- **(Major) Lack of strong theory for a security-style contribution.**
The acceleration/convergence argument largely reduces to a **variance reduction** factor $\sqrt{\lvert G\rvert / \lvert A\rvert}$ after filtering outliers. This is close to what I would call the **trivial bound** you get by dropping extremes; this theory feels too weak.
- **Adaptive attacker not considered.**
The defense can likely be **gamed** by an adversary who shapes updates to keep the PID score under $\tau$ (e.g., keep the integral small, smooth the derivative, keep proportional spikes small). As argued by *“The Attacker Moves Second”* (Nasr et al.; different setting but a very relevant message), defenses must anticipate attacker adaptivity. Here, only static/simple attacks are tested.
- **Missing comparisons against SOTA robust FL methods.**
The paper does not compare to several recent SOTA algorithms specifically designed for Byzantine robustness under heterogeneity, such as:
- **Karimireddy, He, Jaggi (ICLR 2022)** — *Byzantine-Robust Learning on Heterogeneous Datasets via Bucketing*.
- **Allouah et al. (AISTATS 2023)** — *Fixing by Mixing: A Recipe for Optimal Byzantine ML under Heterogeneity*.
- **Allouah et al. (ICML 2024)** — *Byzantine-Robust FL: Impact of Client Subsampling and Local Updates*.
- **Gorbunov et al. (ICLR 2023)** — *Variance Reduction is an Antidote to Byzantines: Better Rates, Weaker Assumptions and Communication Compression*.
Without head-to-head comparisons, the empirical claims are not convincing for NeurIPS level.
- **No test against SOTA attacks.**
Evaluation is mostly **label-flip** and one simple LLM case with a single malicious client. There is no evaluation against **LIE (A Little Is Enough)**, **Fall of Empires**, or other adaptive/stealthy/colluding attacks (e.g., min-max/ALIE/AGR or angle-constrained attacks) that are known to be stronger. This weakens the empirical message.
See weaknesses. |
Lightly AI-edited |
|
Boosting Federated Model Convergence with Anomaly Detection and Exclusion |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper proposes PID-MADE, a proportional–integral–derivative-based anomaly detection and exclusion mechanism for federated learning (FL). The method computes a PID-style score for each client’s model update, using current, cumulative, and differential deviations from the global model. Clients exceeding a statistical threshold (Chebyshev or Gaussian-based) are excluded before aggregation. The authors claim the approach maintains $O(nd)$ complexity, requires no prior knowledge of the number of malicious clients, and even accelerates convergence compared to standard robust aggregation rules.
1. The problem formulation, motivation, and algorithmic steps are clearly described with pseudocode and complexity discussion.
2. The detection mechanism is lightweight, running in $O(nd)$ time, which makes it attractive for large-scale synchronous FL.
3. Addressing robust FL without knowing attacker proportions is relevant and valuable for real-world deployments.
4. The inclusion of convergence arguments and threshold derivations shows an effort to formalize the approach, albeit at a basic level.
1. The PID-based formulation is essentially a weighted temporal smoothing of client deviation scores. Similar temporal and distance-based anomaly detection mechanisms have appeared in many robust FL works (e.g., FLTrust). The “PID” framing is metaphorical rather than a genuine control-theoretic contribution without stability or control analysis.
2. The “convergence proof” relies on standard convex, bounded-gradient assumptions already sufficient for FedAvg; the PID terms are not meaningfully analyzed in that context. Claims of “accelerated convergence” are empirical and lack any formal rate improvement. The threshold derivation using Chebyshev or Gaussian statistics is standard textbook material, not a contribution. No analysis of false-positive or false-negative rates for client exclusion is provided.
3. Evaluations are limited to simple datasets (MNIST, CIFAR) with artificially induced label-flip or scaling attacks. No experiments under modern or stealthy attack models (e.g., backdoor, gradient manipulation, model replacement). Reported improvements are minor and lack statistical significance (no confidence intervals or variance reporting). Experiments primarily show faster convergence, but not improved robust accuracy, which is the real metric of interest for Byzantine resilience.
4. The proposed method assumes fully synchronous, homogeneous clients. System heterogeneity (asynchronous updates, delayed clients, stragglers) is completely ignored; the PID derivative term would be invalid when client updates arrive at different frequencies. Model heterogeneity is not supported as the distance metric presumes identical architectures and parameter shapes. Statistical heterogeneity (non-IID data) is treated only via trivial toy splits (2–3 classes per client). The method’s sensitivity to rare but legitimate client behavior is neither measured nor mitigated. In short, PID-MADE is not generalizable to realistic heterogeneous FL environments, precisely where robust aggregation is most needed.
5. Statements such as “faster convergence” or “universally applicable to LLM fine-tuning” are overreaching given the scale and simplicity of experiments. No ablation study shows the individual contribution of the P/I/D components or the sensitivity to their hyperparameters $K_p, K_I, K_d$. The claimed scalability and universality are speculative rather than demonstrated.
1. Can the authors explain what fundamentally new insight the PID structure provides beyond being a heuristic temporal extension of existing trust-score schemes?
2. How can the authors guarantee theoretical stability or convergence once updates are filtered dynamically by PID scores?
3. How would PID-MADE handle real-world FL heterogeneity, both statistical (non-IID) and system (asynchronous or straggler) cases?
4. Why are stronger or stealthier attacks (e.g., adaptive backdoor or model-replacement) not included to validate robustness claims?
5. What measurable advantage in robustness, convergence rate, or computational efficiency, does PID-MADE demonstrate over FLTrust under the same attack ratio and non-IID setting? |
Fully AI-generated |
|
Duet: Joint Exploration of User–Item Profiles |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces DUET, a framework designed to shift recommendation systems (RS) from traditional latent vector representations to interpretable, textual profiles for both users and items. The central problem it addresses is that the optimal format for these textual profiles is unknown, and handcrafting templates is often ineffective and misaligned with the final recommendation task.
Three stages are needed:
1. A Large Language Model (LLM) first distills raw user histories and item metadata into concise, informative textual "cues" that act as starting points for profile generation
2. In a single sequence-to-sequence pass, the framework expands these minimal cues into richer, more descriptive textual profiles. This unique step allows the model to explore various potential profile formats rather than being restricted to a single, predefined structure.
3. The generated user and item profiles are then optimized together using reinforcement learning.
Key contributions:
* a new paradigm for RS: moving from latent vectors to aligning users and items via natural language profiles in a shared semantic space. This is an innovative move that enables both human interpretation and downstream LLM/agentic systems.
* exploration as a driving force for recommendation. The work introduces a novel method that empowers an LLM to autonomously discover effective profile formats without relying on rigid, hand-engineered templates. This is achieved through the cue-initialization and self-prompting mechanism.
* empirical evaluation results show that DUET significantly outperforms strong existing baselines, confirming the effectiveness of both the textual alignment approach and the feedback-driven profile exploration strategy
originality:
* vector to text: The core idea of shifting from opaque latent vectors to interpretable textual profiles for both users and items is a significant contribution. I do not have a comprehensive view of existing work, though. To me, this joint modeling seems novel.
* The proposed three-stage DUET framework is a novel combination of techniques. The concept of "self-prompt construction" to allow the LLM to autonomously explore and discover optimal profile formats—rather than relying on rigid, handcrafted templates—is a particularly creative and original mechanism
* The use of RL to create a closed-loop system where downstream recommendation performance directly provides the reward signal for refining profiles is an elegant and powerful idea. This approach moves beyond offline reward models and directly optimizes for the end-task, addressing a key limitation in prior RL-based RS
Quality:
* The proposed method is well-designed and technically sound.
* Evaluation seems valid to me including the ablation study, though I am not that familiar with the baseline method or any of the state-of-art work.
Clarity:
The paper is clearly written, easy to follow.
Significance
the paper offers a forward-looking vision for the future of RS in the era of large language models.
* as I mentioned in the summary, it provides a foundation for agentic systems
* address a key LLM challenge: designing effective prompts and profile formats
* The demonstrated gains in accuracy and F1-score are significant enough to be of interest to both researchers and practitioners in the field.
* The experiments exclusively use rating prediction metrics (MAE, RMSE, Accuracy, F1) to evaluate performance. While this is a valid approach, modern recommendation systems are fundamentally ranking problems. The current evaluation does not assess how well the generated profiles perform in a more realistic scenario of ranking a large set of candidate items. Actionable Insight: The work would be significantly strengthened by including experiments on a re-ranking or full candidate retrieval task, using standard ranking metrics like nDCG, MAP, and Recall@K. This would provide more direct evidence of the profiles' effectiveness in a real-world setting.
* As the authors already pointed out, the computational cost would be very high.
* The use of RL (specifically GRPO) is central to the framework's success, but RL for LLMs can be notoriously unstable and sensitive to hyperparameter choices. The paper does not discuss the stability of the training process or the sensitivity to RL-specific hyperparameters. The paper would be more robust if it included an analysis of the RL training dynamics. For instance, showing the learning curves for reward and providing details on hyperparameter tuning would increase confidence in the method's reproducibility and stability. Comparing GRPO to another common policy optimization algorithm (like PPO) could also demonstrate that the gains are from the framework itself and not just the choice of a specific, state-of-the-art RL algorithm.
* Certain aspects of the methodology, particularly the novel components, could benefit from more detailed explanation.
e.g., The initial "cue-based initialization" stage is described at a high level. The prompt guiding the LLM to extract cues is quite broad (e.g., "Keep the description concise and avoid full sentences"). The quality and nature of these initial cues seem critical to the success of the subsequent stages, yet this is not analyzed. The paper could be improved by adding a qualitative analysis of the generated cues. Furthermore, a sensitivity analysis showing how the final profile quality is affected by variations or perturbations in the initial cues would provide a better understanding of the method's robustness.
* Is the success of the joint optimization stage fundamentally tied to GRPO, or is the framework general enough to work with other policy optimization methods like PPO? A response here would help clarify whether the core contribution is the closed-loop framework itself or the application of a specific state-of-the-art RL algorithm.
* The case study in Figure 4 provides a strong example of success. However, could you discuss potential failure modes of DUET? For instance, how does the framework handle users with very sparse histories, or users whose historical interactions contain conflicting or rapidly changing preferences? Understanding these limitations would provide a more complete picture of the framework's applicability.
* The paper motivates its work by stating that textual profiles establish a foundation for "future agentic recommendation systems". This is a very compelling vision. Could you please elaborate on what a downstream agentic system powered by DUET profiles might look like? A more concrete example would help solidify the long-term significance and impact of the proposed work. |
Fully AI-generated |
|
Duet: Joint Exploration of User–Item Profiles |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces DUET, a closed-loop framework leveraging Large Language Models (LLMs) for profile generation in the recommendation task. DUET includes three main steps: i) prompting LLMs to generate a short phrase capturing minimal yet informative user interests/item characteristics; ii) re-prompting LLMs to expand cues into richer user/item profiles; iii) aligning the generated user and item profiles for the rating prediction task using reinforcement learning. Experiments on three real-world datasets show stronger rating prediction performance for DUET than for representative baselines.
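Read as a pipeline, the three steps amount to something like the sketch below; `llm` is a hypothetical prompt-to-text callable, the prompts are my paraphrases, and the GRPO reward wiring in step iii) is only indicated in a comment.

```python
def duet_profiles(user_history, item_metadata, llm):
    """Sketch of the three-step DUET pipeline described above (illustrative only)."""
    # i) cue generation: distill raw data into short, informative phrases
    user_cue = llm(f"Summarize this user's interests in a short phrase: {user_history}")
    item_cue = llm(f"Summarize this item's key characteristics in a short phrase: {item_metadata}")

    # ii) self-prompt expansion: turn minimal cues into richer textual profiles
    user_profile = llm(f"Expand the cue '{user_cue}' into a descriptive user profile.")
    item_profile = llm(f"Expand the cue '{item_cue}' into a descriptive item profile.")

    # iii) the two profiles are fed to the recommender for rating prediction, and that
    # downstream performance supplies the reinforcement-learning reward (GRPO) used to
    # update the profile-generating policy (training loop omitted here).
    return user_profile, item_profile
```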
1. Motivation: The paper presents a well-motivated study that explores the use of large language models (LLMs) for informative profile generation. This approach effectively leverages the world knowledge embedded in LLMs to enrich user and item representations beyond traditional sparse interaction data.
2. Organization and Clarity: The paper is well structured and clearly written. The logical flow of sections facilitates comprehension, and the inclusion of concrete prompt examples enhances the reader’s understanding of the core methodology
3. Experimental Results: The proposed model, DUET, demonstrates substantial improvements in rating prediction accuracy compared to representative baselines. These results validate the effectiveness of the proposed profile generation approach.
1. While the motivation and empirical results are promising, the paper does not clearly articulate how the generated user and item profiles concretely advance the recommendation task. A more detailed analysis is needed to explain why these profiles lead to improved predictions, e.g., case studies, ablation experiments on specific prompt designs, or comparisons between profiles generated by DUET and those from baselines to highlight the unique advantages of the proposed approach.
2. The proposed cue-based mechanism for profile generation is conceptually interesting but not fully convincing. Since cues are often short phrases containing limited information, the resulting profiles may not faithfully reflect the true user preferences. For example, users sharing a similar cue might still have distinct underlying interests. The paper would benefit from a more in-depth discussion or empirical validation showing how DUET mitigates this issue and ensures that generated profiles remain representative and reliable.
3. It remains unclear why the paper focuses exclusively on the rating prediction task rather than ranking-based evaluation, which is typically more aligned with the goals of recommender systems: identifying and retrieving relevant items for users. Prior work (e.g., [a]) has highlighted the limitations of rating prediction for assessing recommender effectiveness. Furthermore, many of the chosen baselines were originally designed for ranking tasks, which makes the comparison less meaningful in the current setup. Including ranking-based evaluations would substantially strengthen the empirical validation and demonstrate the broader applicability of DUET.
[a] Cremonesi, P., Koren, Y., & Turrin, R. Performance of recommender algorithms on top-N recommendation tasks. RecSys 2010.
4. The paper acknowledges that the proposed method introduces additional complexity and computational overhead. However, no quantitative analysis is provided to assess the trade-off between performance gains and computational cost. A detailed efficiency study would provide valuable insight into the practicality of deploying DUET in real-world scenarios.
Please see the review. |
Moderately AI-edited |
|
Duet: Joint Exploration of User–Item Profiles |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents DUET, a framework for making textual user/item profiles for LLM-based recommenders. Instead of using fixed templates, which often don't work well, DUET learns the profiles. It starts by making short "cues" from raw data, expands those cues into full profiles, and then uses RL (with GRPO) to jointly optimize both user and item profiles based on how well they work for the actual recommendation task. Experiments on three datasets show this approach beats strong baselines.
The best part is moving beyond static templates. The "cue" to "profile" idea is a smart way to get around the pain of prompt engineering.
Another big strength is optimizing both user and item profiles together. Most work only focuses on the user. Using RL to align both based on task performance seems to be the right way to go and helps capture better user-item matches (like in Fig 4).
The technical details, like using GRPO and the fractional reward, are well-thought-out for applying RL to this problem. The strong results and clear ablation study really sell the idea.
My main worry is the computational cost. This looks very expensive. It needs LLM passes for profile generation and a full RL optimization loop. The paper mentions this in the limitations but doesn't give any analysis of training time or inference latency vs. the baselines. This is a big practical issue.
Also, the "exploration" part of the profile construction isn't very clear. Figure 3 makes it look like a single-pass generation (data -> cue -> prompt -> profile). How much "exploration" is really happening? Is it just a fixed refinement?
Can you give us some idea of the computational cost? How much slower is DUET to train and run compared to the baselines? This is a key practical concern.
Can you clarify what "exploration" means in the self-prompt construction step? Is it just a single-pass refinement, or is the model actually trying out different profile formats (e.g., through sampling)?
For Table 1, did the baseline models (KAR, PALR) also use generated item profiles? Or just user profiles? This is key for a fair comparison of the "joint" optimization. |
Fully AI-generated |
|
Duet: Joint Exploration of User–Item Profiles |
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This study proposes a framework that generates user and item textual profiles jointly and uses these profiles for the recommendation task. Experimental results show that the proposed method outperforms some prompting-based baselines.
(1) Writing is easy to follow.
(1) It would be better if the authors could further discuss results in Table 1. Table 1 shows that the proposed method significantly outperforms other baselines. However, I wonder if the superior performance is related to the RL phase where the model is optimized while other baselines are simply prompting-only methods. In fact, by checking the numbers in both Table 1 and Table 2, it seems that RL is the major factor that leads to the best performance. By adding Profile and Self-Prompt, the proposed method does not perform as well as other baselines.
(2) More quantitative experiments are necessary to support the discussion in Section 4.3.2. The authors discussed three potential reasons for the inconsistent performance of the proposed method across different user interaction lengths without providing quantitative support. Moreover, is the performance difference significantly large to make any conclusions?
(3) It would be better if the authors could improve the case study and expand it into large-scale quantitative experiments. It is natural that there are some good examples demonstrating the advantage of the proposed method. It is also not surprising if similar examples can be found in the responses generated by other baselines. But to articulate the advantage of the proposed method, it is necessary to show that such an advantage (1) is indeed the reason for the better performance, and (2) appears in the responses of the proposed method significantly more often than in the responses of other baselines.
(1) The authors mentioned in Appendix A.1.1 that hyperparameter details can be found in the code, but no code was shared. Can the authors provide more details about how the framework was trained?
(2) I wonder if the authors have considered the cold-start issue? How would the proposed framework address this issue? |
Fully human-written |
|
NoLoRA: Nonlinear Low-Rank Adaptation for Parameter-Efficient Fine-Tuning |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes NoLoRA, a nonlinear extension of LoRA that introduces an activation function and a learnable modulation vector between the low-rank matrices A and B. The claimed motivation is to overcome the linearity limitation of LoRA and to improve representational capacity while maintaining parameter efficiency. Experiments show improvements across a range of benchmarks.
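For readers unfamiliar with the construction, the update the paper describes, $\Delta W(x) = B(v \odot f(Ax))$, can be sketched as a small adapter module; the activation choice, initialization, and scaling below are my assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class NoLoRALinear(nn.Module):
    """Illustrative adapter: y = W x + (alpha / r) * B (v ⊙ f(A x))."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base                                   # frozen pretrained layer
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))       # zero init: starts as plain W x
        self.v = nn.Parameter(torch.ones(r))               # learnable per-rank modulation
        self.act = nn.GELU()                               # the nonlinearity f
        self.scale = alpha / r

    def forward(self, x):
        delta = (self.v * self.act(x @ self.A.T)) @ self.B.T
        return self.base(x) + self.scale * delta
```

With v removed (or fixed to ones), this collapses to the plain nonlinear adapter form discussed under the first weakness, which is why the overlap with prior work matters.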
- The paper is clearly written and easy to follow.
- The method is straightforward, with a simple addition of a nonlinear activation and a per-rank modulation vector.
- The empirical section covers several benchmarks and is rigorous.
`W1: Novelty concerns and missing citation of highly related prior work`
The core technical contribution, introducing a nonlinear activation between the two LoRA matrices, is not new. The AuroRA paper (https://arxiv.org/abs/2505.18738) presented an almost identical idea several months earlier: inserting a nonlinear mapping between (A) and (B) to enhance LoRA’s expressiveness, effectively treating the adapter as a miniature MLP.
In fact, NoLoRA’s formulation $ \Delta W(x) = B (v \odot f(Ax))$ reduces to AuroRA’s nonlinear adapter when the modulation vector (v) is removed. Table 6 of this submission even includes an ablation explicitly without (v), which is practically equivalent to AuroRA.
Yet, the paper does not cite AuroRA anywhere in the related work or discussion. This omission gives a misleading impression of originality and fails to situate the contribution in its proper research context.
`W2: Lack of conceptual novelty beyond a modulation term`
Once the nonlinearity is recognized as prior art, the only remaining addition is the elementwise modulation vector (v), which adds minimal expressiveness and negligible theoretical depth. The proposed update remains a trivial per-channel scaling of the activation output. This does not constitute a fundamentally new idea or mechanism.
`W3: No justification or insights provided`
The paper does not provide any solid theoretical or empirical argument explaining why the method works well. On line 234, the authors state:
“This analysis illustrates the improved expressiveness of our method and provides theoretical support for the empirical results.”
However, no such analysis or theoretical support is actually presented.
`W4: Experimental validation lacks rigor`
The experiments do not include AuroRA for comparison, even though it is the most relevant prior method. Moreover, the improvements attributed to the modulation vector are marginal, casting doubt on the significance of this component.
---
While the paper is well written and includes broad experiments, its main technical idea is essentially identical to previously published work (AuroRA), minus proper attribution. The remaining addition, a simple modulation vector, is minor and not conceptually sufficient to justify a new standalone paper.
Please refer to weaknesses |
Heavily AI-edited |
|
NoLoRA: Nonlinear Low-Rank Adaptation for Parameter-Efficient Fine-Tuning |
Soundness: 1: poor
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes NoLoRA (Nonlinear Low-Rank Adaptation), a parameter-efficient fine-tuning (PEFT) method that extends LoRA by introducing a nonlinear activation function and a learnable modulation vector into the low-rank update path. The authors claim this design enhances expressiveness while preserving parameter efficiency.
1. The core idea—enhancing LoRA with lightweight nonlinearity—is simple and aligns with the PEFT community’s goal of improving expressivity without sacrificing efficiency.
2. The empirical scope is broad, covering NLP, vision, and reasoning tasks, which suggests general applicability.
1. Lack of novelty: The proposal is extremely close to NEAT (Zhong et al., 2025). Both methods replace LoRA’s linear update with a nonlinear mapping. The paper fails to justify why this form is preferable or meaningfully distinct.
2. Unverified experimental claims: Most baseline numbers are borrowed from other papers with different settings (e.g., learning rates, seeds, data splits). For example, the GLUE results for LoRA and Adapter are marked as taken from Wu et al. (2024a). Without re-running all baselines under identical conditions, the reported gains may reflect implementation or tuning disparities, not intrinsic superiority.
3. Theoretical claims are vague: The statement that “for any smooth target weight update $\Delta W$, there exists a set of parameters $A, B, v$ such that $B(v \odot f(Ax))$ can approximate $\Delta W$ to arbitrary precision” is unsubstantiated. This would require $f$ to be a universal approximator, but with fixed low rank $r$, the expressivity is severely limited.
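A short sketch of the rank argument behind this point (notation assumed: $A \in \mathbb{R}^{r \times d}$, $B \in \mathbb{R}^{d \times r}$, $v \in \mathbb{R}^{r}$): for every input $x$,
$$B\bigl(v \odot f(Ax)\bigr) \in \operatorname{col}(B), \qquad \dim \operatorname{col}(B) \le r,$$
so the range of the learned update is confined to an at-most $r$-dimensional subspace and cannot approximate an arbitrary smooth $\Delta W$ whose image has dimension greater than $r$. The universal-approximation phrasing therefore needs, at minimum, a qualification on the rank of the target.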
1. What exactly makes NoLoRA better than NEAT? Are there any theoretical explanations?
2. Experimental fairness: Were all baselines (LoRA, PiSSA, MiLoRA, NEAT, etc.) re-implemented and tuned under identical conditions (same seeds, hyperparameters, data preprocessing)? If not, how can the performance gaps be attributed to architectural differences rather than tuning disparities? |
Fully AI-generated |
|
NoLoRA: Nonlinear Low-Rank Adaptation for Parameter-Efficient Fine-Tuning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This manuscript focuses on low-rank adaptation (LoRA), which suffers from limited effectiveness due to its linear adapter architecture. To overcome this expressiveness bottleneck, this paper advocates a nonlinear variant termed Nonlinear Low-Rank Adaptation (NoLoRA) that injects a nonlinearity and a vector modulation between the low-rank adapters to enhance the representational capacity. Experiments are conducted on commonsense reasoning, natural language understanding, image classification, and mathematical reasoning to demonstrate the superiority of NoLoRA.
1. LoRA is a highly popular and timely topic in parameter-efficient fine-tuning. The limitation of LoRA's linear structure is clearly presented.
2. Empirical evaluation on diverse tasks and models showcases promising results.
3. NoLoRA is lightweight, incurring negligible extra parameters.
1. The motivation behind the specific design in Eq. (6) is not clearly explained, and there is no theoretical justification supporting the claimed improvement in expressiveness.
2. The comparison omits several closely related LoRA variants that also incorporate nonlinear structures. For instance, MoRA [1] replaces linear mappings with compression and decompression functions, while HiRA [2] employs a Hadamard product with pretrained weights.
3. The experimental results lack error bars (e.g., standard deviation or confidence intervals) and do not report performance across multiple random runs.
4. While Table 4 compares the parameter counts of different approaches, it would also be informative to include measurements of actual fine-tuning time and memory overhead.
5. The first paragraph of Section 3.3 repeats similar sentences in the last paragraph of Section 3.2. Additionally, “Mixture” in line 59 should be “mix,” and there should be a space before “NEAT” in line 107.
[1] T. Jiang et al., "Mora: High-rank updating for parameter-efficient fine-tuning", arXiv preprint, 2024.
[2] Q. Huang et al., "HiRA: Parameter-efficient hadamard high-rank adaptation for large language models", in ICLR, 2025.
See weakness. |
Lightly AI-edited |
|
NO DARK DATA REQUIRED: BRIDGING THE GAP BETWEEN NORMAL AND LOW-LIGHT DETECTION VIA RETINEX DECOMPOSITION |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents an end-to-end object detection architecture inspired by Retinex theory. The design features a decomposition module that estimates reflectance and illumination in feature space, with a novel fusion approach to produce illumination-invariant features for object detection. The key distinguishing property is that the model is trained solely on normal-light data yet shown to be robust on both synthetic and real low-light/foggy datasets. The method is benchmarked on Pascal VOC, ExDark, and RTTS.
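For context, the classical Retinex model referenced throughout writes an observed image as the element-wise product of reflectance and illumination, $I(x, y) = R(x, y) \odot L(x, y)$; the submission's twist, as summarised above, is to approximate this factorisation on deep features rather than on pixels (the notation $R(x, y)$, $L(x, y)$ follows the usage in the weaknesses below).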
1. The model is trained exclusively on normal-light images, which is practically significant because it reduces the data curation burden.
2. The decomposition of deep features into reflectance and illumination via Retinex-inspired, feature-level processing is a distinctive integration.
1. The paper is difficult to read due to a combination of disorganized content and poor layout. The logical progression of ideas is unclear, and the unprofessional typesetting, evidenced by significant blank space on page 3, detracts from the work's credibility.
2. The mathematical descriptions in Section 3.2–3.3 (especially around the decomposition and fusion) are rather high level. Crucial details such as the explicit forms of the aggregation $\mathcal{A}(\cdot)$, sampling method for constructing $L(x, y)$ and $R(x, y)$, channel alignment techniques, normalization procedures, and whether the element-wise fusion is normalized or bounded, are unspecified.
3. The paper does not adequately differentiate its proposed methodology from existing Retinex decomposition/fusion techniques.
- Deep Retinex Decomposition for Low-Light Enhancement, BMVC18
- IniRetinex: Rethinking Retinex-type Low-Light Image Enhancer via Initialization Perspective, AAAI25
1. Can the authors provide explicit mathematical details for the aggregation and fusion steps, particularly a precise functional form for $\mathcal{A}$ and the normalization/activation used in $F_{i}^{\text{fused}}$? Are there learned or fixed weights, or dynamic selection mechanisms in fusion?
2. What is the process and rationale for selecting RepNCSPELAN4 blocks, and do alternative feature processing blocks (e.g., C2f, ELAN) materially affect performance? Quantitative ablation here would strengthen the architectural justification.
3. What are the failure points under extreme conditions (e.g., mAP vs. fog/darkness level)? Is there a critical threshold below which the proposed method degrades significantly earlier or later than the SOTA? |
Moderately AI-edited |
|
NO DARK DATA REQUIRED: BRIDGING THE GAP BETWEEN NORMAL AND LOW-LIGHT DETECTION VIA RETINEX DECOMPOSITION |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This manuscript attempts to introduce Retinex theory into the YOLO framework, but in reality, it is merely a patchwork of pre-existing concepts and methods.
The title of the manuscript is decent.
1. There are significant formatting problems, including large blank spaces on several pages (e.g., pages 3, 4, and 6) and improperly scaled tables (e.g., Tables 1, 2, and 3).
2. The proposed method is an unjustified assembly of classic methods from the vision field (YOLO and Retinex) with virtually no original design contributions.
3. The contributions summarized in the introduction are all based on existing methods, lacking any original design from the authors, and are poorly written. The author describes the method as an "AI model." The core of the work is merely applying different processing to feature maps of different scales within YOLO and claiming this constitutes a Retinex decomposition. This claim is unsubstantiated, and the author fails to provide a clear explanation, instead just restating concepts from Retinex theory and YOLO object detection.
4. The author's writing suggests a lack of familiarity with standard academic terminology in this field. For instance, summarizing their method as "an Artificial Intelligence (AI) solution" is not a phrasing I have encountered in computer vision or related fields.
5. The experiments in this manuscript are primarily compared against baseline and older methods, failing to include comparisons with the latest state-of-the-art (SOTA) approaches. Furthermore, only limited results are presented, which is insufficient to validate the effectiveness of the proposed method.
Please refer to Weaknesses. |
Moderately AI-edited |
|
NO DARK DATA REQUIRED: BRIDGING THE GAP BETWEEN NORMAL AND LOW-LIGHT DETECTION VIA RETINEX DECOMPOSITION |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This manuscript proposes a new end-to-end framework, trained on normal-light images, for object detection in low-light images. The proposed method separates images into reflectance and illumination, approximating this decomposition within the feature space. A multi-scale feature aggregation is introduced to learn illumination-aware representations.
1. Compared to multiple baselines, the method exhibits enhanced generalization under challenging lighting and weather conditions.
2. The framework attains high inference speed, enabling effective support for real-time applications.
1. The work as a whole lacks a distinct innovative core. Its technical approach is largely a direct integration of the YOLO model with Retinex theory, without substantial original theoretical advances or in-depth investigation of key technical bottlenecks.
2. The introduction fails to provide an explicit and structured summary of the study’s contributions. In academic writing, a clear statement of contributions serves as a "guide" for readers to quickly identify the work’s core value and differences from prior studies.
3. The paper’s formatting is extremely poor. This lack of rigor raises doubts about the quality of the paper’s content and the authenticity of its experiments.
Poor readability of the paper
Poor formatting of the paper. |
Moderately AI-edited |
|
Self-Improved Prior for All-in-One Image Restoration |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a new image restoration framework, Self-Improved Privilege Learning (SIPL), designed to address optimization instability and inter-task conflicts in all-in-one image restoration models. Built upon the concept of Privilege Learning (PL), the authors extend its use beyond training to inference via a lightweight module called Proxy Fusion, which incorporates a learnable Privileged Dictionary (PD). At inference, the model uses its own outputs as pseudo-privileged information, enabling iterative self-refinement. The method is shown to be architecture-agnostic, efficient, and broadly applicable, with strong results on multiple benchmarks, including multi-task, composite degradation, and out-of-distribution scenarios.
1. The extension of PL into test-time self-refinement using a learned dictionary is both conceptually interesting and practically effective.
2. The proposed Proxy Fusion mechanism is lightweight, plug-and-play, and well-motivated. It introduces minimal overhead while providing measurable improvements.
3. Comprehensive experiments across four challenging benchmarks (Three-Task, Five-Task, Deweathering, Composite Degradation) convincingly demonstrate SIPL’s effectiveness. Improvements are consistently reported across various restoration tasks, with particularly notable gains in composite degradation scenarios.
4. The paper includes detailed ablations dissecting PL and SIPL contributions, multi-step refinement behavior, and efficiency vs. performance trade-offs. These analyses are thorough and help clarify SIPL’s practical value.
1. The paper is difficult to read in many parts due to heavy terminology and dense writing.
2. While the empirical results are strong, the paper lacks a deeper theoretical analysis of why the proposed self-refinement via pseudo-PI is stable and effective. A formal justification or insight into training dynamics under SIPL would strengthen the claims.
3. The performance of SIPL at inference appears to depend heavily on the quality of the initial model output. If the initial restoration is poor, the pseudo-privileged signal may be too noisy to guide useful correction. This limitation is acknowledged but not explored further.
4. The denoising entry in Table 2 appears to be incorrect: 31.45 is bolded even though it is not the best value, although the corresponding SSIM is indeed higher.
1. How does the performance of SIPL degrade if the initial restoration is poor? Are there failure cases?
2. Is there any benefit to fine-tuning the Privileged Dictionary during inference, or is it always fixed?
3. What is the performance of SIPL under real-world degradations not covered by the benchmark datasets? |
Fully AI-generated |
|
Self-Improved Prior for All-in-One Image Restoration |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This research paper introduces a novel paradigm called Self-Improved Privilege Learning (SIPL) to address optimization instability and inter-task conflicts in all-in-one image restoration models when handling diverse and mixed degradations. Unlike conventional Privilege Learning, SIPL innovatively extends the utility of privileged information (PI) beyond the training phase into inference. Its core mechanism is the "Proxy Fusion" module, which incorporates a learnable Privileged Dictionary. During training, this dictionary distills high-quality priors from ground-truth features, and during inference, it leverages the model's preliminary outputs as pseudo-privileged signals for an iterative self-refinement loop. Experimental results demonstrate that SIPL significantly improves performance across various all-in-one image restoration benchmarks, particularly for composite degradation tasks, while offering broad applicability and computational efficiency.
1. SIPL breaks the limitations of traditional Privilege Learning by extending privileged information from the training phase to inference, enabling self-improvement at test time, which is a significant innovation.
2. Experimental results demonstrate that SIPL achieves substantial PSNR improvements across various image restoration tasks, including composite degradation, deraining, dehazing, and denoising, performing exceptionally well on complex composite degradations.
3. The SIPL framework, particularly the Proxy Fusion module, is designed to be seamlessly integrated with diverse backbone architectures (e.g., PromptIR, Restormer, NAFNet, AdaIR), enhancing its versatility and practicality.
4. The paper provides comprehensive ablation studies, deeply analyzing the contributions and performance of individual SIPL components, which enhances the credibility of the conclusions.
1. Deeper theoretical understanding needed: The paper points out the lack of a deeper theoretical understanding of the optimization dynamics of Privilege Learning in this context. Although stability is validated empirically, the absence of a solid theoretical foundation may limit further optimization and insight.
2. Additional training cost and inference latency: The method requires retraining each baseline with the plugged-in module, and this retraining is needed for every new model. The iterative refinement process also increases latency compared to single-pass baselines.
3. Native privilege learning, an important baseline, is largely missing. The main performance table only compares SIPL with the base model (PromptIR), omitting a comparison with the native privilege learning pipeline.
4. The learned dictionary, as the core component of the method, lacks the necessary analyses. Additional studies are encouraged, for example on the impact of the dictionary size.
Please refer to the weaknesses. |
Lightly AI-edited |
|
Self-Improved Prior for All-in-One Image Restoration |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Self-Improved Privilege Learning (SIPL), a novel framework for all-in-one image restoration that extends privilege learning (PL) into the inference stage. The key idea is to enable models to iteratively refine their outputs by using their own initial restorations as pseudo-privileged information. The authors propose a Proxy Fusion module with a learnable Privileged Dictionary (PD) to retain high-quality priors from privileged (ground-truth-derived) features during training and reuse them during inference. The method is claimed to be architecture-agnostic and can be integrated into various backbones like PromptIR. Extensive experiments across multiple benchmarks (three-task, five-task, deweathering, and composite degradation) show notable PSNR/SSIM gains and strong qualitative improvements.
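To fix ideas about the loop summarised above, here is a minimal sketch of inference-time self-refinement with a learned dictionary. The interface (the `model(x, prior=...)` call, the `encoder`, and the dot-product attention over dictionary atoms) is a hypothetical reading of the description, not the paper's actual API:

```python
import torch

@torch.no_grad()
def self_refine(model, encoder, privileged_dict, x_degraded, num_steps=2):
    """Hypothetical sketch: iterative refinement with a privileged dictionary.

    privileged_dict: (K, C) learnable atoms distilled from ground-truth-derived
    features during training; at inference the model's own output acts as a
    pseudo-privileged signal that queries the dictionary.
    """
    x_hat = model(x_degraded)                                  # initial restoration
    for _ in range(num_steps):
        q = encoder(x_hat)                                     # (B, C) pseudo-privileged features
        attn = torch.softmax(q @ privileged_dict.T, dim=-1)    # (B, K) soft retrieval
        prior = attn @ privileged_dict                         # (B, C) retrieved prior
        x_hat = model(x_degraded, prior=prior)                 # refine conditioned on the prior
    return x_hat
```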
1. The paper presents a creative extension of Privilege Learning by introducing an inference-time reuse mechanism. The idea of “self-refinement through pseudo-privileged signals” is conceptually elegant and distinct from test-time adaptation or self-ensembling.
2. The proposed Proxy Fusion and Privileged Dictionary are well-motivated and described with clear mathematical formulation (Eqs. 2–4). The training and inference procedures are systematically explained, and the iterative refinement mechanism (Eqs. 5–7) is logically sound.
3. Experiments cover diverse benchmarks with consistent improvements over strong baselines.
1. The framework lacks formal analysis of why the Privileged Dictionary enables stable self-refinement. The paper acknowledges this as a limitation but providing any theoretical intuition (e.g., gradient variance reduction) would strengthen the work.
2. While the paper differentiates SIPL from test-time adaptation, the boundary remains somewhat blurred. A more rigorous comparison (quantitative or procedural) to self-distillation or self-training methods could better situate SIPL in the broader landscape.
3. Although the overhead is smaller than ensembling, repeated refinement still doubles inference time. The practical trade-off between latency and improvement could be more thoroughly quantified (e.g., FLOPs, runtime).
4. All benchmarks are synthetic; it remains unclear whether pseudo-privileged refinement helps under real-world degradations (e.g., RAW noise, ISP artifacts).
5. The paper treats the PD as a black box; no visualization of learned atoms or similarity between retrieved priors and ground truth is shown.
1. What do PD entries represent visually or statistically? Can you visualize top-activated atoms or measure entropy / usage distribution?
2. How many refinement iterations are typically needed before saturation? Does performance ever degrade with more steps? Any evidence of oscillation?
3. If a PD is trained with one backbone (PromptIR), can it accelerate or improve another (Restormer) without retraining? This would demonstrate universality.
4. Have you tested SIPL on real-capture datasets (e.g., RainDS, LOL-V2, AIM-RealRain)? Does pseudo-privilege refinement still improve perceptual metrics (LPIPS/MUSIQ)? |
Fully AI-generated |
|
LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces LingOLY-TOO, a new benchmark for evaluating reasoning abilities in LLMs. It is built from Linguistics Olympiad problems. The key innovation is the use of "reasoning-equivariant permutations" to obfuscate the problem text. This process changes the orthography but is carefully designed to keep the underlying logical structure and solution steps unchanged. The authors show that while models perform reasonably well on the original problems, their performance drops significantly on the obfuscated versions. This suggests that models rely on prior knowledge and memorization rather than pure reasoning on the original tasks. The paper also includes detailed analysis on the effect of language resourcefulness and a human study.
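As a simplified illustration of the obfuscation idea only (the benchmark's actual rulesets are expert-designed at the grapheme level, respect phonological constraints, and leave proper nouns intact, none of which this toy captures):

```python
import random

def obfuscate(text: str, alphabet: str = "abcdefghijklmnopqrstuvwxyz", seed: int = 0) -> str:
    """Apply one bijective letter permutation consistently across a problem.

    Applying the same permutation everywhere keeps the puzzle's internal
    structure (shared stems, affixes, correspondences) intact while removing
    recognisable surface forms that a model might have memorised.
    """
    rng = random.Random(seed)
    shuffled = list(alphabet)
    rng.shuffle(shuffled)
    table = str.maketrans(alphabet + alphabet.upper(),
                          "".join(shuffled) + "".join(shuffled).upper())
    return text.translate(table)

# Only the target-language material would be permuted; English glosses and
# names of people or places are left unchanged in the benchmark.
print(obfuscate("kedi"), obfuscate("kediler"))  # the shared stem survives the mapping
```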
The core idea of the paper is excellent. Using orthographic obfuscation to create a "knowledge-free" test for reasoning is a very clever and direct way to tackle the problem of data contamination. This is a timely and important contribution.
The process for creating the permutations is very rigorous. I am impressed by the careful, manual design of the rulesets by experts to ensure the problems remain solvable. The validation by IOL medallists adds strong credibility to the method.
The experimental section is very comprehensive. The authors evaluated a wide range of models, including the latest reasoning-specific ones. The analysis goes beyond just overall scores to include "no-context" tests, robustness checks, and the correlation with language resources. The human study is also a valuable addition.
The use of exact match is simple but might be too strict. Sometimes, a model might have the correct reasoning but make a small mistake in formatting the final answer. Using only exact match could penalize such cases.
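One way such a softer metric could be operationalised, sketched under the assumption of free-form string answers (the normalisation and scoring choices here are illustrative, not a recommendation from the paper):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def partial_credit(pred: str, gold: str) -> float:
    """1.0 for an exact match, decaying with normalised edit distance."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if not gold:
        return float(pred == gold)
    return max(0.0, 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold)))
```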
The human study shows a small but noticeable performance drop (5.7%) for humans on obfuscated problems. This suggests that the obfuscation itself might add some cognitive load, making the problems slightly harder to parse, even for humans who don't rely on prior knowledge of the language.
The process of creating permutation rulesets relies heavily on manual expert work. This might make it difficult to scale the benchmark to a much larger size or to adapt it quickly to other domains.
1. Have you considered any evaluation metrics other than exact match (e.g., edit distance or partial credit schemes) that could capture instances where the model's reasoning is mostly correct but the final output has a minor error? What are the potential challenges in implementing such metrics for this benchmark?
2. The human study shows a 5.7% performance drop due to obfuscation. Could you discuss a bit more how we should interpret the model's performance drop in light of this? Specifically, how much of the model's drop might be attributed to the increased difficulty of processing the unfamiliar orthography, versus the removal of knowledge-based shortcuts?
3. The permutation rulesets are designed by experts. Do you think this method could be applied to other domains like mathematical reasoning or code generation? What would be the main challenges in designing "reasoning-equivariant permutations" for those domains? |
Fully AI-generated |
|
LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a challenging reasoning benchmark of 7k question answer pairs, built by applying grammeme-level obfuscations to Linguistic Olympiad problems. The motivation for building a new dataset is that models rely on prior language knowledge learnt via pre-training. This dataset is built to test out models' reasoning capabilities and removes any cues that could trigger memorized translations.
1. Release of a large (7k) benchmark dataset that clearly separates reasoning ability from memorized knowledge.
2. In-depth analysis: The authors conduct various analyses, such as the model's ability to reason, the effect of tokenization on uncommon characters, and the variance across different permutations.
3. Release of the dataset and code.
1. In the no-context setting, the difference between the original and the obfuscated dataset is very small; how do the authors reach a concrete conclusion from this?
2. To check the effect of uncommon characters on performance, why not replace the context with random but real tokens that are not part of the training set, similar to the ProntoQA dataset, where entities are replaced with a false ontology?
The authors manually created rulesets for this dataset, which makes it harder to extend this work to other datasets. Did the authors try using LLMs as annotators to see how feasible it is to extend the approach to other domains/datasets? |
Fully human-written |
|
LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces LINGOLY-TOO, a 6,995-QA benchmark built by applying expert, grapheme-level orthographic obfuscations to 82 UKLO problems, aiming to disentangle reasoning from knowledge/memorization. The authors define clear metrics (Mog vs. Mobf, plus robust variants), run broad model evaluations, and provide validation via auditors and a human RCT. Results show sizable drops under obfuscation (e.g., top model ≈0.60→0.48), correlations with language resourcedness, and modest gains from guided reasoning.
* Clear problem framing: measuring reasoning when shortcuts (knowledge, contamination) are minimized is timely and important.
* Methodological originality: the reasoning-equivariant, linguistically-informed permutations are thoughtful and non-trivial; the Turkish vowel harmony example nicely motivates rule design.
* Strong experimental design: multiple families of models, bootstrap analysis, no-context control, tokenization controls, and human study make the case persuasive.
1. **Insufficient Failure-Mode Analysis**
The paper documents performance declines but does not deeply probe *why* models fail on obfuscated problems (e.g., difficulty inferring morphemic patterns, inconsistent multi-step reasoning, or fallback to guessing). Please add a qualitative analysis of model outputs—contrasting common errors on obfuscated vs. original items—to connect performance gaps to specific reasoning deficits. Building on this, apply statistical tools (e.g., clustering) to categorize and quantify linguistic reasoning failures.
2. **Limited Accessibility of the Permutation Ruleset**
The permutation rules (Appendix B) are dense and lack a high-level summary in the main text. A concise overview—such as a table of key constraints and invariances, or a concept diagram illustrating how reasoning equivariance is preserved—would make the method more accessible, especially to readers without a linguistics background.
1. **On Failure Attribution (related to Weakness 1)**
Can the authors extend the analysis to *quantify* the causes of reasoning failure? For instance, by labeling dominant error types and reporting their prevalence across models and difficulty levels.
2. **On Ruleset Comprehensibility (related to Weakness 2)**
While we appreciate the authors’ effort on the permutation design, could the paper include auxiliary tables or diagrams that summarize the rules and constraints at a glance, to help non-linguist researchers quickly understand how reasoning equivariance is enforced? |
Fully AI-generated |
|
LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces LingOly-TOO, a challenging reasoning benchmark that obfuscates Linguistics Olympiad problems to avoid advantaging models that rely on shortcuts such as memorisation and knowledge. The obfuscations preserve the underlying solution logic while removing orthographic clues that could trigger patterns from memorisation or knowledge. Unsurprisingly, model performance drops drastically.
The authors define knowledge as information stored in model parameters after training, which captures linguistic, factual, and commonsense patterns useful for downstream tasks, and memorisation as models exploiting contaminated datasets by reporting answers previously seen in training.
The authors adapted 82 problems from the UKLO and obfuscated them to prevent models from relying on linguistic patterns. More specifically, they manually created a ruleset for each problem to generate valid permutations of targeted tokens, taking extra care to keep names of people, sacred places, etc. intact. The authors generate six valid permutations per problem and produce obfuscated versions by altering the problem text, yielding 6,995 question-answer pairs in total.
The authors propose two metrics: the average exact match score across all questions in all permutations and the average exact match across all questions in the unpermuted problem.
In terms of experiments, the authors evaluate 15 reasoning models. The performance varies between the original and obfuscated variants, with GPT and Claude being the most robust models. The authors do a detailed analysis to measure the gap between reasoning and knowledge. It would be great to add a case study to support the findings. In terms of metrics, it would be good to compare human performance on the task and include more details (e.g., how do they compare with LLMs) - I'm aware of Appendix L but would like to see more. Moreover, it would be fairer to assess the performance of models with pass@k.
Overall, I am lukewarm about this paper. Creating obfuscated variants of problems does not seem to relate to real-life tasks, even for measuring reasoning; I have the feeling that this benchmark focuses on an unrealistic problem rather than a real one. I appreciate the authors' analysis and experiments, and the paper is very well written and structured.
- Various models are used in the benchmark.
- Detailed analysis.
- I have the feeling that this benchmark focuses on an unrealistic problem and does not represent real-life tasks. We cannot necessarily expect LLMs to perform well on it.
- Please add a human evaluation and pass@k metrics.
See above. |
Fully human-written |
|
Finetuning-free Alignment of Diffusion Model for Text-to-Image Generation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a finetuning-free alignment framework for text-to-image diffusion models that avoids the computational cost and limited generalization of existing RLHF- or DPO-based fine-tuning approaches. Instead of modifying model weights, the authors reinterpret preference alignment as sampling from a reward-weighted distribution, showing that the aligned score function can be decomposed into the original diffusion model score and an additional guidance term derived from a learned reward model. Experiments on text-to-image benchmarks demonstrate that the method achieves comparable or superior alignment quality to state-of-the-art fine-tuning methods.
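For reference, the decomposition alluded to here is the standard reward-tilted one (written in generic notation, not copied from the paper): sampling from $p_{\text{aligned}}(x_0) \propto p_\theta(x_0)\exp\!\big(r(x_0)/\beta\big)$ gives
$$\nabla_{x_0} \log p_{\text{aligned}}(x_0) = \nabla_{x_0} \log p_\theta(x_0) + \tfrac{1}{\beta}\nabla_{x_0} r(x_0),$$
i.e. the pretrained diffusion score plus a reward-gradient term; at intermediate noise levels the second term becomes a time-dependent guidance signal, which is the role played by the learned guidance described above.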
S1. The paper derives a principled decomposition of the aligned score function into the diffusion model score plus a reward-based guidance term. The proposed method provides a conceptually elegant connection between preference learning and inference-time guidance, clarifying the relationship between RLHF/DPO-style methods and diffusion sampling.
S2. The proposed method modifies neither the diffusion model parameters nor the text encoder, making it model-agnostic and straightforward to integrate with existing text-to-image pipelines.
S3. The paper’s method achieves strong alignment performance while avoiding the heavy training overhead required by RLHF or DPO approaches. Combined with Stable Diffusion XL-Turbo, the method supports one-step T2I generation, making the overall pipeline suitable for practical use and user-interactive generation scenarios.
S4. The method consistently improves PickScore, HPS-v2, ImageReward, and Aesthetic score. Qualitative examples also show visually appealing improvements compared to the baselines.
W1. The proposed method assumes that the forward diffusion process remains unchanged when aligning the model to human preferences, meaning the aligned distribution $q(x_t | x_0)$ is assumed to match the original pretrained model’s noising process $p(x_t | x_0)$ (the standard forward marginal is recalled after this list). This assumption effectively preserves the base diffusion model’s denoising trajectory, which determines the global layout and object composition of the generated image. As a result, while the proposed method is well-suited for adjusting aesthetic properties or making small semantic refinements, it may struggle to generate plausible outputs for prompts requiring strong semantic correction, multi-object reasoning, or compositional control (e.g., enforcing spatial relations or specific attribute assignments). Can the paper’s method also handle such generation tasks?
W2. The guidance network outputs cannot be differentiated to denote where or which components of the image fail to match the textual specification. Consequently, the approach may struggle on prompts that involve explanations on multiple objects or spatial relations (*e.g.*, “to the left of,” “behind”).
W3. The authors use Stable Diffusion XL-Turbo for experiments. However, recent works adopt Transformer-based diffusion models beyond the U-Net-based Stable Diffusion XL. Is it possible to apply the proposed method to DiT-based models such as Stable Diffusion 3 or even FLUX?
W4. Because the diffusion backbone remains frozen, generated outputs closely reflect the inductive biases of the reward function. Are there any additional methods or strategies to alleviate the inductive biases of a given reward function, such as PickScore or Aesthetic score?
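For concreteness, the unchanged forward (noising) marginal that W1 refers to is, in standard DDPM notation (a textbook identity, not specific to this paper),
$$q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big),$$
so the aligned and pretrained models share the same noising trajectory and differ only through the reverse-time guidance term.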
Please check the weakness section. |
Fully human-written |
|
Finetuning-free Alignment of Diffusion Model for Text-to-Image Generation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents an approach to aligning text-to-image diffusion models with human preferences. Instead of relying on computationally expensive fine-tuning of the base model (such as DPO), the authors propose a lightweight, plug-and-play guidance mechanism. The key contributions include a diagnosis of why naive guidance methods often fail, attributed to the adversarial nature of the guidance signal, and a solution that trains a small, regularized guidance network to provide a stable, artifact-free signal for diffusion models.
- The finding that the adversarial nature of the guidance can lead to undesirable artifacts in the generated images is interesting.
- The ablation studies effectively demonstrate the effectiveness of each proposed component.
Despite the theoretical discussion in this work, solid experiments validating the proposed approach are still lacking.
- The method's effectiveness has not been validated across different diffusion model architectures, leaving its generalizability to other frameworks unclear.
- The method's performance has not been demonstrated on other datasets, which limits claims of general applicability.
- The paper lacks a sensitivity analysis for its newly introduced hyperparameters, making the method's robustness to parameter variations unclear.
Besides, there are formatting issues in Lines 72–74: the manuscript appears to contain white text, e.g., “ted in one or very few steps, the two samples would only exhibit small differences in details. SPM allows us to capture such detail differences and guide the diffusion model.”
This raises concerns about potential prompt injection targeting AI-assisted reviewers or, alternatively, author oversight in document preparation. |
Moderately AI-edited |