ICLR 2026 - Reviews

Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 15899 (21%) | 4.43 | 3.58 | 3687 |
| Heavily AI-edited | 3233 (4%) | 4.22 | 3.59 | 2990 |
| Moderately AI-edited | 7082 (9%) | 4.20 | 3.61 | 2722 |
| Lightly AI-edited | 16648 (22%) | 4.15 | 3.68 | 2746 |
| Fully human-written | 32938 (43%) | 4.13 | 3.62 | 2917 |
| Total | 75800 (100%) | 4.21 | 3.62 | 3026 |
Title Ratings Review Text EditLens Prediction
From Fragile to Certified: Wasserstein Audits of Group Fairness Under Distribution Shift Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces a Wasserstein Distributionally Robust Optimization (DRO) framework to certify group fairness under distributional shifts. - The paper tackles an important problem, i.e., the fragility of fairness metrics under distributional changes. This is a critical issue in several real-world applications. - By unifying multiple fairness notions, the paper provides a flexible and comprehensive tool for fairness certification. - The introduction of DRUNE offers a practical method for estimating worst-case fairness, making the approach applicable to real-world scenarios. - I did not read all the proofs in detail, but the results seem reasonable. - The analysis assumes that the empirical distribution accurately represents the population, which may be unrealistic in practice, especially for rare subgroups or OOD samples. How robust are the certificates if this assumption fails? - Although DRUNE is proposed for efficiency, the DRO optimization may remain computationally intensive for large-scale or high-dimensional datasets. Can the method scale in practice? - Certification relies on finite-sample empirical distributions. For small datasets or complex models, the finite-sample bounds may be loose, which is related to the phenomenon of fairness overfitting (Laakom et al., Fairness Overfitting in Machine Learning: An Information-Theoretic Perspective, ICML 2025), potentially leading to over-optimistic fairness guarantees. Are the finite-sample bounds reliable enough for your approach to certify fairness in practice, or could they overestimate true robustness? See section above. Lightly AI-edited
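For orientation, the worst-case certificate this review discusses typically takes the following form; this is the standard Wasserstein-DRO formulation with assumed notation, not an equation copied from the paper:

$$
\overline{U}_{\varepsilon}(h) \;=\; \sup_{Q \,:\, W_c(Q,\, \hat{P}_n) \le \varepsilon} U(h, Q),
$$

where $U(h, Q)$ is a group-fairness gap (e.g., a demographic-parity difference) under distribution $Q$, $\hat{P}_n$ is the empirical distribution, $W_c$ is the Wasserstein distance with ground cost $c$, and $\varepsilon$ is the shift budget; certifying fairness amounts to showing $\overline{U}_{\varepsilon}(h) \le \tau$ for a chosen tolerance $\tau$. The reviewer's first question then asks what happens when $\hat{P}_n$ itself is a poor proxy for the population.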
Weighted Deep Ensemble Under Misspecification Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This work conducts a theoretical study on Weighted Deep Ensembles, which allow unequal weighting coefficients across ensemble members. In this framework, the ensemble weights are learned through empirical risk minimization on a held-out validation dataset. - It appears to be well grounded in existing theoretical results for deep neural networks. In particular, Corollaries 1–3, which provide asymptotic error bounds for practical architectures such as MLPs, CNNs, and RNNs, are quite compelling. Of course, the practical usefulness of such theoretical results remains somewhat unclear, but that’s often the nature of theoretical work. - From a quick look, the derivations seem sound, and the experimental design appears reasonably solid. I particularly like that Table 4 highlights an important comparison with In-sample and Greedy Ensembles, and Figure 2 nicely shows convergence toward the oracle weights. - The method used in this work to determine the weighting coefficients for combining ensemble members’ predictions is a form of stacking (also known as stacked generalization, functional aggregation, and perhaps other related terms, as it has been referred to under various names in the literature). This approach has been extensively studied since the seminal works of Wolpert (1992) and Breiman (1996), with further theoretical developments by Van der Laan et al. (2007), Arsov et al. (2019), Chen et al. (2024), and others. However, this line of research is not discussed at all in the paper. The proposed weighted deep ensemble should explicitly position itself within the stacked generalization framework and clarify both the established findings in this area and its specific contributions in the context of deep neural networks. - Corollaries 1–3 are presented in a somewhat simplified form in the main text, and although Appendix B.4 offers a slightly more detailed version, it still appears insufficient. It would be beneficial to include a fully formalized version in the appendix that explicitly incorporates the necessary conditions outlined in works such as Jiao et al. (2023). While those formulations, as far as I know, involve a number of intricate and cumbersome assumptions, this work, as a theoretical contribution building upon them, should nonetheless aim to achieve a comparable level of rigor and completeness. - At present, there is neither empirical nor theoretical validation of the claimed “collective blindness.” The only supporting evidence is the conceptual illustration in Figure 1, which does not pertain to “deep” ensembles. While the authors claim that traditional deep ensembles may suffer from “collective blindness,” this assertion seems questionable given the experimental scale considered here, which can hardly be described as involving truly “deep” ensembles.
In my experience, in synthetic setups with small MLPs, ensembles trained from different random initializations through stochastic optimization (i.e., standard recipe for constructing deep ensembles) often exhibit limited diversity, which aligns with the notion of “collective blindness.” However, as the network depth and complexity increase, the highly non-convex nature of the loss landscape tends to induce substantial diversity among ensemble members, and even simple deep ensembles can perform remarkably well, as demonstrated by Fort et al. (2019). Hence, it becomes difficult to argue that “collective blindness” remains a meaningful concern in deep ensemble settings. - The experimental results seem rather limited to be considered a proper evaluation of a weighted “deep” ensemble. Given the computational constraints, it might be a good idea to extend the results of Wortsman et al. (2022) as a way to demonstrate larger-scale experiments. Their official codebase already provides checkpoints that can be directly used as ensemble components, so there is no need for additional training. Since they have already considered Uniform and Greedy Ensembles, it would suffice to simply add the Weighted Ensemble for comparison. --- - Wolpert (1992), Stacked generalization. - Breiman (1996), Stacked regressions. - Van der Laan et al. (2007), Super learner. - Arsov et al. (2019), Stacking and stability. - Chen et al. (2024), Error reduction from stacked regressions. - Fort et al. (2019), Deep ensembles: a loss landscape perspective. - Wortsman et al. (2022), Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. - Incomplete statement on line 164? - How were the oracle weights in Figure 2 obtained? - If the validation split (2 out of the 6:2:2 split) is used to “train” the weighting coefficients in WDE, then it is effectively being used as part of the ensemble “training” process. What if a standard deep ensemble is trained using the combined training and validation splits, since that data could alternatively be used to train the ensemble members rather than the ensemble weights? Lightly AI-edited
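Since several of the reviews above center on the mechanism of learning simplex weights by empirical risk minimization on a held-out validation split (i.e., stacking), a minimal sketch may help fix ideas. This is a generic numpy illustration with a softmax parameterization to keep the weights on the simplex; it is not the paper's implementation, and the member predictions and targets are synthetic placeholders.

```python
# Toy stacking: fit simplex weights for M ensemble members by minimizing
# validation MSE; softmax(theta) keeps the weights nonnegative and summing to 1.
import numpy as np

rng = np.random.default_rng(0)
n_val, M = 200, 4
preds = rng.normal(size=(n_val, M))                  # members' predictions on the validation set
true_w = np.array([0.6, 0.3, 0.1, 0.0])              # ground-truth combination (toy data only)
y = preds @ true_w + 0.05 * rng.normal(size=n_val)   # validation targets

theta = np.zeros(M)
for _ in range(2000):
    w = np.exp(theta) / np.exp(theta).sum()          # current simplex weights
    resid = preds @ w - y
    grad_w = 2 * preds.T @ resid / n_val             # d(MSE)/dw
    grad_theta = w * (grad_w - w @ grad_w)           # chain rule through softmax
    theta -= 0.5 * grad_theta

print(np.round(np.exp(theta) / np.exp(theta).sum(), 3))  # ≈ true_w, up to noise
```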
Weighted Deep Ensemble Under Misspecification Soundness: 2: fair Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The submission is concerned with weighted ensembles of neural networks. Weighted ensembles of neural networks are a relevant topic. ## A review and comparison with the state of the art is missing. First, weighted ensembles are in no way new. The basic idea goes back to stacking: David H. Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992. For a more neural network focussed paper see for example: Anders Krogh and Peter Sollich. Statistical mechanics of ensemble learning. Physical Review E, 55(1), 1997. Second, there are a lot of papers dealing with — theoretically well motivated — weighting of deep neural networks. For example: Andrés R. Masegosa. Learning under model misspecification: Applications to variational and ensemble methods. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, 2020. Luis A. Ortega, Rafael Cabañas, and Andres Masegosa. Diversity and generalization in neural network ensembles. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022. Third, there are also generalisation bounds for weighted ensembles, which are applicable to neural network ensembles: Andrés R. Masegosa, Stephan S. Lorenzen, Christian Igel, and Yevgeny Seldin. Second order PAC-Bayesian bounds for the weighted majority vote. In Advances in Neural Information Processing Systems (NeurIPS), 2020. Yi-Shan Wu, Andrés R. Masegosa, Stephan S. Lorenzen, Christian Igel, and Yevgeny Seldin. Chebyshev-Cantelli PAC-Bayes-Bennett inequality for the weighted majority vote. In Advances in Neural Information Processing Systems (NeurIPS), 2021. Hauptvogel, Igel. On Uniform, Bayesian, and PAC-Bayesian Deep Ensembles. arXiv:2406.05469 [cs.LG], 2024. In addition, I was missing a reference to Lars Kai Hansen and Peter Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001, 1990. ## The paper lacks mathematical rigour. Examples: Line 165: incomplete, meaningless statement. Seems like a part of the equation is missing. Lines 180-190: Statement lacks rigour: the example makes no sense without stating something about h. In expectation over all hypotheses? Line 229: Should this be $\hat{f}$ on the RHS? Condition 1: $\epsilon$ is not defined. ## There are several misleading statements. Theorem 1: The theorem only talks about n. Should there not be specific assumptions about n_train and n_val? E.g., does this hold for small constant n_val? Just consider the first three sentences: "Model misspecification in statistics arises […] inclusion of irrelevant variables […]. In such cases, the best possible approximation […] still maintains a significant approximation error from the true function": Could you please cite a rigorous theoretical result that states that adding irrelevant variables must cause a significant approximation error? "In deep learning, the neural networks are always assumed to be well-specified.": In general not true. I do not assume that - and I do not know anybody who does.
"To the best of our knowledge, this is the first study to offer a theoretical guarantee for weighted deep ensemble.": Clearly wrong; see the many references given above and the references therein as starting points. ## General comments Inherent misspecification: How much does this matter on digital computers (aka "in practice")? This should be discussed. While I think it is interesting to study weighted neural network ensembles, I have to say that I could not identify exciting novel insights in the manuscript. Theorem 1 does not come as a surprise and is not put into relation to other (e.g., PAC-Bayesian) bounds. ## Minor comments * "Ensemble methods is" -> "Ensemble methods are" * l 79: Strange references for deep ensembles. Why not Lars Kai Hansen and Peter Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001, 1990, and the later cited Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017? * I think the discussion of model misspecification should go along with a brief discussion of parametric vs non-parametric models. * Line 282: Why "without loss of generality"? see "Weaknesses" above Fully human-written
Weighted Deep Ensemble Under Misspecification Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper addresses a challenge in deep ensemble learning with model misspecification, where universal approximation does not hold, and proposes an optimal weighted ensemble approach. Typical deep ensembles suffer from collective blindness as they reinforce shared biases, while the proposed weighted deep ensemble strategy can achieve oracle-level optimality. Comprehensive theoretical analysis shows asymptotic bounds for the estimator for regression and classification compared to the best candidate model, and convergence under misspecification. Experiments on synthetic tasks show improvement under various misspecification scenarios. 1. Interesting problem formulation: Systematic categorization of misspecification in deep learning with rigorous definitions. 2. Rigorous theoretical guarantees: Provides a formal analysis of weighted deep ensembles with an asymptotic error bound matching the best candidate, and oracle optimality $R(\hat{w})/\inf_w R(w) \rightarrow 1$. Proofs are technically sound and leverage modern empirical process theory. 3. Comprehensive numerical validation: Experiments span all three misspecification types with nicely designed ablations. 1. The proposed algorithm is not new. The idea of a weighted ensemble with learned simplex weights obtained by validation risk minimization has been explored with similar theoretical guarantees, just not on neural networks. 2. The theory only proves a "no-regret" sense of guarantee--the weighted ensemble performs at least as well as the best expert asymptotically. But the authors did not investigate the ensemble gain under the weighted ensemble. This is especially important in the misspecification scenarios defined in the paper, where all models suffer from one or more sources of misspecification and are imperfect. The paper did not discuss how the weighted scheme affects the diversity or variance reduction that brings the ensemble gain under equal-weight averaging. Even under misspecification, candidate models may still have uncorrelated errors, which also explains the observed improvement in the numerical experiments. 3. Experiments are only done on synthetic datasets with shallow networks, which is good for demonstrating how different ensemble strategies perform under various misspecifications. It would be great to see how the algorithm works on standard small-scale vision benchmarks like CIFAR-10/100 or Tiny-ImageNet. 1. This paper is primarily motivated by the notion of collective blindness, but it is not discussed later in the analysis. Can we somehow formalize the collective blindness of equal-weight ensembles through some sort of error decomposition and get a quantitative improvement bound for WDE compared with equal-weight ensembles? 2. In the proof of Theorem 1, the authors leverage a Rademacher complexity term for the simplex with respect to both $M$ and $n$, $r\sqrt{\frac{2\log M}{n}}$. $M$ is omitted in the big-O as the number of ensemble members is finite, and the resulting bound becomes $\frac{r}{\sqrt{n}}$.
But in practice, the number of ensemble members should somehow scale as a function of $n$, and you can only safely omit it if $M$ grows sub-polynomially with $n$. The author should clarify this somewhere in the paper, as you are deriving asymptotic bounds with $n$ while treating $M$ fixed. 3. The paper asserted the convexity of the VRM objective in classification because the ensemble is an affine mapping of logs. But this is only true for post-softmax probabilities. If the ensemble averages the logits instead (common practice in ensemble learning), would this break the convexity? Fully human-written
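For reference, the validation-risk-minimization step and the no-regret guarantee that the reviews above discuss can be written compactly as follows; the notation is assumed here rather than copied from the paper:

$$
\hat{w} \;=\; \operatorname*{arg\,min}_{w \in \Delta^{M-1}} \; \frac{1}{n_{\mathrm{val}}} \sum_{i=1}^{n_{\mathrm{val}}} \ell\Big(\sum_{m=1}^{M} w_m \hat{f}_m(x_i),\, y_i\Big),
\qquad
R(\hat{w}) \;\le\; \min_{m} R(\hat{f}_m) \,+\, O\Big(r\sqrt{\tfrac{\log M}{n}}\Big),
$$

where $\Delta^{M-1} = \{w : w_m \ge 0,\ \sum_m w_m = 1\}$ is the probability simplex and the $O(\cdot)$ term is the Rademacher-complexity contribution discussed in question 2 above.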
Weighted Deep Ensemble Under Misspecification Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper first defines three different kinds of model misspecification, which, if they occur, lead in particular to traditional guarantees like universal approximation theorems for deep neural networks not holding. They furthermore argue that traditional ensembles, made up of models with identical architectures and each weighted equally, are also impacted by this issue, since if every submodel is biased, this will usually lead to highly confident, systematically biased predictions of the ensemble. To address this issue, they propose and analyze the weighted deep ensemble method, which trains ensembles consisting of different models and tunes the weights of the different ensemble members to minimize the error on the validation set. They prove that the ensemble achieves the convergence rate of the best model in the ensemble and empirically demonstrate the effectiveness of the method on synthetic datasets. - The story of the paper was relatively clear and the paper was well-organized. - By investigating the question of when the assumptions of traditional machine learning approximation results fail to hold, the paper is making progress on and bringing more attention to a very relevant question. - Furthermore, by providing the weighted ensemble method, they also provide a new way of addressing the issues they point out. The analyses of the theoretical properties (i.e., showing both asymptotic error bounds and asymptotic optimality in certain cases) of the weighted ensemble method are original and insightful. The paper claims that they are 'introduc[ing] [the] weighted deep ensemble method that learns the optimal weights'. At the same time, in the related work section, they state that 'recent studies have delved into weighted deep ensemble', but do not provide much more detail about these methods, although this would be relevant for judging what exactly is novel in the paper. Furthermore, it would be relevant and interesting to also see the performance of their method on non-synthetic datasets (and with models trained on these tasks) to investigate whether they can also provide significant advantages in real-world settings and potentially on settings without misspecification. 1. Could you clarify what kind of work has been done on weighted ensembles before? What are the key differences of your work from previous work on this topic? 2. In line 165, is $f_0(x)-g(\pi(\boldsymbol{X}))$ supposed to be $0$ almost surely? 3. Could you make the following statement more formal or illustrate it a bit more clearly: > As stated before, traditional deep ensembles may suffer from "collective blindness" in the presence of variable, structural, or inherent misspecification. 4. We use our validation data to fit the weights of the ensemble, correct? Would this, in the case of many different ensemble members, lead to overfitting on the validation data? Could we then still use the same validation data for hyperparameter tuning, etc.? 5. Why do we need Conditions 1 and 2 for Theorem 3? What would lead to the theorem not holding anymore if we relax these conditions? 6.
Why did you decide not to additionally test your methods on real-world datasets or more widely on non-misspecified settings? Fully human-written
Decoupling Global Structure and Local Refinement: Blueprint-Guided Scroll Generation with Direct Preference Optimization Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper addresses the task of generating long images using diffusion models. It proposes a two-stage method: generating a low-resolution (LR) "blueprint" image, then upsampling this blueprint into a high-resolution (HR) image. A key component is fine-tuning the models using Direct Preference Optimization (DPO). The proposed method (DRSPO) demonstrates superior results compared to previous methods like MultiDiffusion and SyncDiffusion. This appears to be the first work to apply DPO to enhance the quality of long scroll image generation. - I think the task of generating long scroll images is a somewhat solved problem. However, the paper compares its method only against relatively weak baselines. - We could generate directly with state-of-the-art models like FLUX.1-dev (which is mentioned in the paper) or generate a 512x2048 image with FLUX or Stable Diffusion 3.5 and then use an off-the-shelf super-resolution model (e.g., ESRGAN). This alternative pipeline might produce comparable or even superior results (see the sketch below). - The validity of the evaluation metrics (HPS v2, PickScore, CLIP Score) is questionable, as these off-the-shelf models may not be reliable for evaluating long (1024x4096), non-square images. While FLUX is included in the quantitative results (Table 1), it is notably absent from all qualitative comparisons. What is the actual qualitative performance of FLUX on this task? Fully human-written
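The alternative pipeline sketched in this review (wide base generation followed by off-the-shelf super-resolution) would look roughly like the following. This assumes the Hugging Face diffusers StableDiffusion3Pipeline API; the model id, prompt, and the super-resolution step are illustrative placeholders, not part of the paper under review.

```python
# Baseline pipeline: directly generate a wide 512x2048 image, then upscale.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")

base = pipe(
    "a traditional landscape scroll, mountains and rivers, ink painting",
    height=512, width=2048, num_inference_steps=28, guidance_scale=4.5,
).images[0]

# Upscale with any off-the-shelf super-resolution model (e.g., ESRGAN);
# left as a hypothetical helper here since SR package APIs vary.
# hr = esrgan_upscale(base)  # -> 1024x4096, matching the paper's target size
base.save("scroll_base_512x2048.png")
```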
Decoupling Global Structure and Local Refinement: Blueprint-Guided Scroll Generation with Direct Preference Optimization Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. - This paper proposes the Dual-Resolution Scroll Generation (DRSPO) framework, which decouples the process by first creating a low-resolution blueprint for global structural coherence, then refining it with high-resolution features for local detail. - To enhance quality, it integrates Direct Preference Optimization (DPO) at both generation stages and introduces a novel theoretical adaptation for applying preference tuning to region-based generation. - Experimental results confirm that the method effectively produces high-quality long scroll images with consistent global structure and fine-grained details, overcoming issues like content repetition. - The paper provides formulas, ablation experiments, and visualization results, though they are not very satisfactory. - Compared with the FLUX model: the methods compared in this paper are already too outdated. Additionally, the FLUX model has not undergone specific optimization for such tasks, making the comparison unfair. In my view, this paper is not suitable for the ICLR conference. Furthermore, from the visualization results, it can be seen that the results here are not advanced, with numerous artifacts and glitches. The small images also fail to show more details clearly. Therefore, I believe this paper still requires further improvements. In particular, the method mentioned in the paper is no longer novel in the field of text-to-image generation. Lightly AI-edited
Decoupling Global Structure and Local Refinement: Blueprint-Guided Scroll Generation with Direct Preference Optimization Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes the Dual-Resolution Scroll Generation with Preference Optimization (DRSPO) framework, which decouples global composition and local detail refinement for long scroll image generation. DRSPO first generates a low-resolution (LR) global blueprint and then produces high-resolution (HR) details based on this overall structure. By doing so, it successfully generates images that are globally coherent while maintaining fine-grained local fidelity. This paper's strengths are as follows. (1) By employing a low-to-high-resolution generation strategy, the method successfully achieves both global structural consistency and local detail refinement. (2) The paper improves generation quality for long scroll images by introducing Direct Preference Optimization (DPO). (3) The application of DPO to high-resolution generation is theoretically supported (though with certain approximations), and the authors explicitly discuss the limitations of these approximations, linking them to future research directions. This paper's weaknesses are as follows. (1) Although the model is trained using DPO that optimizes for the Aesthetic Score, the same metric is also used for evaluation, effectively leading to reward hacking. Moreover, the method’s performance on this score is worse than existing methods, raising doubts about whether the proposed approach truly improves generation quality. (2) Compared with MAD, the overall performance is similar or even inferior, despite MAD requiring no fine-tuning or DPO training. This raises concerns about the cost-effectiveness of the proposed method given its additional training overhead. (3) In Table 1, the color distinction among “1st”, “2nd”, and “3rd” is difficult to see, which reduces the readability of the results. My questions about this paper are as follows. (1) Have the authors tried fine-tuning directly on long scroll images rather than adapting pretrained models? (2) How is the Aesthetic Score computed for long scroll images? Is it calculated locally and then aggregated, or is there a global evaluation? (3) From the results in Table 2, the effect of DPO seems limited. Given that DPO requires additional training cost for diffusion model fine-tuning, one would expect more significant improvements. Why does DPO seem less effective than the control in this setting? Lightly AI-edited
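For context on the DPO component that the preceding reviews question, the standard preference objective (Rafailov et al., 2023) written for an image pair is shown below; the paper's region-based adaptation may differ, so treat this as the baseline form:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(c,\, x^{+},\, x^{-})}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(x^{+}\mid c)}{\pi_{\mathrm{ref}}(x^{+}\mid c)} \;-\; \beta \log \frac{\pi_\theta(x^{-}\mid c)}{\pi_{\mathrm{ref}}(x^{-}\mid c)}\right)\right],
$$

where $x^{+}$ and $x^{-}$ are the preferred and dispreferred images for prompt $c$, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ sets the preference margin. If preferences are scored by the same Aesthetic-Score model later used for evaluation, the reward-hacking concern raised in weakness (1) above follows directly.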
Decoupling Global Structure and Local Refinement: Blueprint-Guided Scroll Generation with Direct Preference Optimization Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper discusses an approach for generating big images, similar to MultiDiffusion but guided by a low-resolution image. "Direct Preference Optimization" is involved. 1. The fundamental idea looks correct – generating a low-resolution image and then diffusing at high resolution can produce larger results. 2. A dataset is explicitly collected to align with the goal. 1. The idea of generating a low-resolution image and then diffusing again was extensively discussed even long before MultiDiffusion and similar works. This part shouldn't be seen as a contribution of this work. 2. The DPO part is formulated with high verbosity, but in essence it can be achieved with a simple objective, like common score guidance (e.g., [github.com/vicgalle/stable-diffusion-aesthetic-gradients]). I also do not find Eq. 13-19 to be solid contributions of this work. 3. The experiments seem to compare against the wrong targets. This paper is a cascaded image generator, and it should at least compare to multi-stage generation. For example, generate low-resolution images and then re-diffuse them using MultiDiffusion from some intermediate denoising level at high resolution. Also compare against other native high-resolution generators like the open-source PixArt family (and even closed-source FLUX Pro 4K, etc.), as well as other cascaded generators like Stable Cascade. The current MultiDiffusion results look misleading to me. 4. Why use a canny image? Why not directly use the low-resolution image as the upscaling signal, as in other "upscaling"-type approaches? See weaknesses Fully human-written
O-Forge: An LLM + Computer Algebra Framework for Asymptotic Analysis Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces a CAS + LLM system for proving asymptotic expressions. The primary thrust of the system is to use an LLM to propose decompositions and then use a CAS to test their correctness. Due to the very unusual nature of the weaknesses, I don't have any strengths to comment on. This paper does not appear to contain any evidence as to its effectiveness. Section 5 begins "In addition to the above-mentioned case study of hard problems, we tested our tools on an extensive suite of around 40-50 easier problems, in order to study how well it performs on a diverse set of inequalities," but there is no mention of these "hard problems" anywhere in the paper. Additionally, for these easy problems, the paper presents three high-level takeaways but no actual evidence. Did I misunderstand something? Does this paper present evidence of its correctness? Fully human-written
O-Forge: An LLM + Computer Algebra Framework for Asymptotic Analysis Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents O-Forge, a system that integrates a large language model (LLM) with a computer algebra system (CAS) to aid in the verification of asymptotic inequalities. This is an instance of classical synthesis paradigms such as oracle-guided inductive synthesis, combining an inductive LLM with a deductive reasoning system to generate formal artifacts. The paper describes this as an “In-Context Symbolic Feedback Loop”. The LLM proposes domain decompositions (i.e., how to split a problem into manageable subdomains). The CAS (via Mathematica’s Resolve function) then verifies whether each subdomain satisfies the proposed inequality using first-order logic and quantifier elimination. The authors claim their tool can handle research-level asymptotic inequalities, going beyond standard competition-style problem solving by combining LLM creativity with CAS rigor. This is left to some subjective interpretation. Two case studies, an asymptotic AM-GM inequality and a series decomposition, are used to demonstrate the concept. The paper reads more like a concept demo or blog post than a rigorous scientific study, lacking detailed quantitative and ablation studies with proper baselines. Testing on a self-curated "suite of around 40-50 easier problems" and a few case studies falls far short of the evaluation expected of a research paper. Hybrid symbolic–neural approaches (e.g., Lean+LLM, AlphaProof, Autoformalization pipelines; see https://arxiv.org/abs/2310.17807, https://ieeexplore.ieee.org/document/10356332) have already explored the same broader paradigm with planning, code generation, and formal proof verification, not just heuristic checking. The authors present O-Forge as “revolutionary,” yet it is essentially prompting an LLM for suggestions and sending them to a formal tool - Mathematica. Crucial implementation details are omitted. How exactly is the LLM prompted? How is the “in-context feedback” loop structured? How is the decomposition quality evaluated or improved iteratively? Are there failure modes where the LLM produces incorrect decompositions, and how are these handled? Can you expand the experimental evaluation and share quantitative metrics (success rate, runtime, size of inequalities handled) over a larger benchmark suite to substantiate performance? Fully AI-generated
O-Forge: An LLM + Computer Algebra Framework for Asymptotic Analysis Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper introduces **O-Forge**, a workflow that couples a frontier LLM with a computer algebra system (CAS), specifically Mathematica’s Resolve, to prove asymptotic inequalities by (i) asking the LLM to propose a domain (or index) decomposition and (ii) using Resolve to verify each piece via first-order quantifier elimination. A toy inequality and a series bound from Tao are used as case studies; the paper claims the approach “moves beyond contest math” by offloading the creative split to the LLM and the verification to CAS. The system exposes a CLI and a website front-end; evaluation is described as ~40–50 “easier problems,” with qualitative takeaways about typical numbers of subdomains and the usefulness of leading-term simplifications. N.A. There are too many drawbacks in this paper. In general, this submission is more like a blog post instead of a rigorous paper as it lacks of enough and solid experiments and evaluations to demonstrate its claims. Some of the weaknesses are as follows. - **Lack of Novelty:** LLM + CAS could be viewed as part of Tool-use LLM research. The proposed idea is not novel. - **Insufficient rigor and experimental substance:** The “evaluation” consists of two main case studies plus ~“40–50 easier problems,” but there are no well-defined benchmarks, success metrics, or failure analyses (rates of correct/incorrect splits, wall-clock, query counts, sensitivity to prompts, comparisons to baselines like hand-crafted heuristics or SMT-based pipelines, etc.). As is, this reads more like a prototype report/blogpost than an ICLR-level empirical study. - **Underspecified method details:** Key pieces are missing or skeletal. For example, the “structured prompt” is left with placeholders (“describe the structure of the prompt”), and the code snippet for Resolve is fragmentary; the search over constants C is a coarse grid without justification or sensitivity analysis. This makes the approach hard to reproduce or evaluate scientifically. - **Heavy reliance on closed-source Mathematica without proof objects:** While Resolve is powerful, the paper acknowledges there is no proof term and asks the reader to trust a closed system; this undermines the claim of “rigorous verification,” especially for research-level math. No attempt is made to cross-check with open tools (e.g. SageMath) on a subset, or to export certificates. - **Reproducibility & accessibility concerns:** Running the system requires Mathematica and a frontier LLM API, making reproduction costly; even the authors’ ethics section notes access costs. There is a website, but that can’t substitute for open artifacts or independent verification. - **Scope creep vs. precise problem definition:** The paper oscillates between “asymptotic inequalities” and the specific case study. I could feel the motivation of this paper: making a true AI4Math tool or project to better help professional mathematicians instead of some fuzzy LLMs that could only do math questions. But there is no crisp task definition, no sanity check for the motivation, and no quantitative experiments to show the helpfulness and effectiveness. This vagueness makes it hard to judge whether this is something scientifically helpful or just some course project. 
- Please use \citep for non-subject/object citations. Fully AI-generated
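To make the propose-then-verify loop these reviews describe concrete: the CAS side reduces to one Resolve call per subdomain. Below is a minimal sketch assuming a local Wolfram installation that exposes the wolframscript CLI; the inequality and subdomain are illustrative, not taken from the paper.

```python
# Verify step: ask Mathematica's Resolve to decide a first-order statement
# over the reals (quantifier elimination) for one subdomain of a decomposition.
import subprocess

def resolve_over_reals(statement: str) -> bool:
    out = subprocess.run(
        ["wolframscript", "-code", f"Resolve[{statement}, Reals]"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip() == "True"

# Example: check f(x) = x^2 + 1 <= C*x^2 on the subdomain x > 10 with C = 2.
print(resolve_over_reals("ForAll[x, x > 10, x^2 + 1 <= 2 x^2]"))  # True
```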
O-Forge: An LLM + Computer Algebra Framework for Asymptotic Analysis Soundness: 1: poor Presentation: 1: poor Contribution: 2: fair Rating: 0: Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Following Terry Tao’s proposed algorithm, a system for asymptotic analysis is implemented. It combines an LLM for the creative part of sub-domain choice, and uses Mathematica for the rest of the steps - notably “Resolve” for proofs in specific sub-domains. 2 non-trivial asymptotic bounds were proven via this system. If it works well, it is applicable to a broad range of scientific endeavors. Final results are grounded via a reliable CAS system. A proof of concept is shown for useful mathematical results. Line 86: “(** describe the structure of the prompt**)”... This is unprofessional at best. Line 91: All the mentioned solvers need citations, as well as the Mathematica Resolve function - due to its central role in your system. Line 100: Fig. 1 is mentioned initially on page 2 but appears on page 4. Why? Line 101 (and multiple others): “o-forge.com”. At the top right corner of the front page of this website it clearly states: Created by Vijay Ganesh Ayush Khaitan Violating the ICLR guidelines regarding anonymity. Regarding the website itself, while I liked the UI and readily available examples, it sometimes returns weird results. For example, Example Series 1 returns a summary that contains a single word: “The”. I ran it a few times to make sure it’s not some random LLM aberration. Other times the output shows a Python error stack trace (“Execution Failed”). Seems underbaked and the code is unstable. Lines 109, 113, 119 (and others): Citations. You’re mentioning Prof. Tao many, many times - but he is a prolific mathematician. Point to the specific works you use. Line 123: While I personally agree with this claim on (most) interesting series bounds, it needs to be quantified (benchmark vs. other methods), or at least supported by multiple examples from the literature. Line 127: The novelty claim is unclear to me. It seems like an engineering project - implementing a (good) idea by Prof. Tao. While I agree that such an online tool can be useful for mathematicians around the world - and would like to encourage you to improve it and fix the bugs - the system does not constitute *scientific* innovation done by you. The generate (via LLM) -> verify (via CAS) approach was done in multiple projects (though not specifically in your use case) - so the core approach is also not novel on its own. If the strongest results are the proof of the specific use-case (which to my understanding is indeed novel, but I don’t know enough to say how impactful this singular result is), then perhaps consider submitting to a mathematical / experimental math journal? Line 295: I’m a supporter of readable papers and not-too-formal language, but this is too much. The paper is not your blogpost. Line 348: What do you mean “around 40-50”?? Section 5: You need concrete statistics for all the claims here. Currently it’s anecdotal. Section 6: There are multiple other works in AI for Math that apply combinations of CAS/code generation + LLMs. The current overview is quite limited. Line 256: This “elaborate Mathematica code” seems like an important part of the system - perhaps ~50% of its power? Line 270: Once in the entire process?
If I’m reading the outputs on your website correctly, there are recursive attempts to re-define the sub-domain partition, until you get “True” on all of them? Line 328: How is the output passed to Mathematica? Is there a specific format that you force it to use in its output? Do you parse the output text and translate it to Mathematica code? If so, how? Line 359: How do you count decompositions? Number of separate domains or number of boundaries? Line 367: Any ideas why this happens? How do you do this replacement? Fully human-written
From Offline to Online Memory-Free and Task-Free Continual Learning via Fine-Grained Hypergradients Soundness: 3: good Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper leverages hyper-gradients for continual learning. The key idea is to use prototypes as memories and use hyper-gradients for adaptive learning rate selection. The paper writing is clear. Experiments show the effectiveness of the proposed method under the proposed setting. Limited novelty: both the hyper-gradient and prototype-based memory are not new; they have been widely used in previous works for adapting learning rates (https://arxiv.org/pdf/1703.04782) and preventing forgetting (https://arxiv.org/pdf/2308.00301) already. Experiment setting and claims: This work claims to be online and memory-free; however, it uses cached prototypes, which are also just a form of memory. Without other methods using the same compute, memory, and storage, it is not fair to claim the performance gain. Also, even with a complex method implementation, the method is just a little bit better than simple ER, while relying on heavy hyper-parameter tuning, which is prohibitive in the online CL scenario. This makes the setup and method both far from practical. NA Fully human-written
From Offline to Online Memory-Free and Task-Free Continual Learning via Fine-Grained Hypergradients Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The core objective of this paper is to address a challenging problem in the field of Continual Learning (CL): how to successfully adapt effective memory-free algorithms from idealized offline (offCL) settings to more realistic and difficult online, task-free (onCL) environments. To solve the problem of gradient imbalance, the paper proposes its core innovation, Fine-Grained Hypergradients (FGH). This is a novel optimization technique based on the key ideas of: + Learning an independent, dynamic gradient weight for each parameter within the model. + Leveraging the gradient directions from two consecutive iterations to assess learning stability: if the directions are aligned, the update step is amplified; conversely, if they are opposed (indicating oscillation), the update is suppressed (see the sketch after this review). 1. The problem addressed by the paper, online, memory-free, task-free continual learning, is indeed a highly challenging and practically significant direction in the current field. 2. The combined framework proposed in this paper achieves outstanding performance in experiments, especially under the 'multi-learning-rate evaluation' paradigm designed by the authors, showcasing the robustness of their method. 1. The entire work can be viewed as an effective combination of two known techniques (prototypes and hypergradient descent), making the contribution more empirical than conceptual. The performance improvement from FGH largely stems from enhancing plasticity in the online setting; Equation (7) progressively increases the intra-task learning rate to boost plasticity, a mechanism that has been explored in prior work [1]. 2. Regarding catastrophic forgetting, the method essentially relies on prototype replay, which is also a common technique in previous literature. For a venue like ICLR, which seeks fundamental innovations, the weight of this contribution is insufficient. 3. The authors use Adam in their experiments. From a learning rate perspective, could Adam and FGH conflict? Is it possible for a situation to arise where Adam suggests a large learning rate while FGH suggests a small one? In other words, do FGH and Adam work synergistically, or is there a functional redundancy? Given the prevalence of Adam, the authors should have included a discussion on this. 4. The authors should provide a comparative experiment between a "global FGH" and the proposed "fine-grained FGH" to demonstrate the necessity of the fine-grained design. [1] Online Learning Rate Adaptation with Hypergradient Descent, ICLR 2018 See the weaknesses above. Lightly AI-edited
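The per-parameter rule summarized in this review (amplify the step where consecutive gradients agree in sign, suppress it where they oppose) is essentially a coordinate-wise form of hypergradient descent [1]. A toy numpy sketch under that reading, not the paper's exact FGH update:

```python
# Coordinate-wise hypergradient descent on a toy quadratic loss ||w - 1||^2:
# each learning rate grows where g_t and g_{t-1} share a sign, shrinks otherwise.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)            # parameters
lr = np.full(5, 1e-2)             # one learning rate per parameter
beta = 1e-3                       # hypergradient step size
prev_g = np.zeros(5)

for step in range(200):
    g = 2 * (w - 1.0)                              # gradient of the toy loss
    lr = np.maximum(lr + beta * g * prev_g, 1e-6)  # sign-agreement update
    w -= lr * g
    prev_g = g

print(np.round(w, 3))  # ≈ 1.0
```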
From Offline to Online Memory-Free and Task-Free Continual Learning via Fine-Grained Hypergradients Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper addresses the migration of offline memory-free continual learning (CL) methods to online, memory-free, and task-free CL scenarios. It introduces a prototype-based auxiliary memory module (P) and a fine-grained hypergradient mechanism (FGH) that dynamically mitigates gradient imbalance and learning-rate sensitivity. Experiments on CIFAR100, CUB, and ImageNet-R show consistent gains across multiple baselines under multi-learning-rate evaluation. The work is practically motivated and conceptually coherent, offering a bridge between offline and online CL paradigms. 1) The topic is timely and relevant, targeting the underexplored Offline→Online transition in CL with clear theoretical and practical significance. 2) The proposed P+FGH framework effectively addresses two core challenges of online CL — catastrophic forgetting and gradient imbalance — through a minimally intrusive and generalizable design. 3) Experiments are comprehensive, covering diverse datasets and learning-rate settings, demonstrating the method’s robustness and transferability. 1) The online scenario remains quasi-online, relying on pre-defined task splits rather than fully stream-based settings, limiting realism. 2) The novelty of both P and FGH is moderate: the prototype update mirrors CoPE (2021), and FGH lacks formal convergence or stability analysis and clear differentiation from prior hypergradient methods. 3) Recent baselines (e.g., PROL 2025, PMLR 2025) are missing, and parameter details (γ, β₁/β₂, Si-Blurry settings) are insufficiently reported, affecting reproducibility and fairness. 1) How would the proposed FGH behave under fully stream-based or class-reappearance settings? 2) Has γ been systematically tuned or theoretically analyzed for robustness across datasets? 3) Can the authors quantify FGH’s computational overhead compared to existing hypergradient or adaptive-LR optimizers? Fully AI-generated
Convergence dynamics of Agent-to-Agent Interactions with Misaligned objectives Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper studies the dynamics of two single-layer transformers with linear self-attention (LSA agents) alternately performing in-context gradient descent toward their own objectives. Theoretical analysis shows they fall into different regimes depending on different configurations of the misaligned objectives. Experiments on two LLMs doing in-context gradient descent validate the theoretical results. This paper shows an interesting angle to study agent-to-agent interactions through the in-context gradient descent of LSAs. The theoretical results show that misaligned objectives correspond to different behaviors. The experiment also generalizes this to LLMs (GPT-5), which validates the theoretical analysis. The paper is well-written. * Investigating multi-LLM-agent interactions is an important and emerging problem. Although this paper offers an interesting perspective, it builds on oversimplified settings that are not obvious to generalize. I appreciate the authors acknowledging this in the conclusion: "move beyond controlled linear tasks and examine these mechanisms directly in large-scale LLMs." However, I believe this should be an important point and is worth discussing in more detail. * There is some degree of over-claiming: the abstract reads as if the theory is developed for generic LLMs, while the theory is actually developed for LSAs. * Line 276: "These results suggest a concrete prompt-design principle for multi-agent systems..." Can the authors be more concrete about what specific results suggest these concrete prompt-design principles, and how? * This paper suggests (e.g., from Proposition 1 or Figure 1) that the multi-agent system has non-zero error w.r.t. each agent's respective objective (see the toy simulation after this review). This seems to imply that there is no benefit to using a multi-agent system. In the LSA in-context gradient descent setting, are there circumstances where the multi-agent system can be more beneficial than using each agent separately? Fully human-written
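The biased equilibrium this review points to can be reproduced in a few lines: two agents alternately take gradient steps on quadratic objectives with different targets, and both residuals plateau at nonzero values. A toy sketch with an identity prompt geometry and equal step sizes; this illustrates the regime, not the paper's LSA construction:

```python
# Alternating updates on a shared state z toward two misaligned targets.
import numpy as np

rng = np.random.default_rng(0)
u_star, w_star = rng.normal(size=3), rng.normal(size=3)  # the agents' targets
z = np.zeros(3)
eta = 0.1

for t in range(500):
    z -= eta * (z - u_star)   # agent 1 steps toward its own objective
    z -= eta * (z - w_star)   # agent 2 steps toward its own objective

# Both residuals settle near ||u* - w*|| / 2: neither agent reaches its goal.
print(np.linalg.norm(z - u_star), np.linalg.norm(z - w_star))
```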
Convergence dynamics of Agent-to-Agent Interactions with Misaligned objectives Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper theoretically studies agent-to-agent interactions, based on the in-context linear regression task, where both agents are driven by LLMs and may have misaligned objectives. The authors extend the previous work and formulation on single agents: during the interaction, agents take turns conducting approximately linear gradient updates toward their own goals, using each other's output as context or input. Utilizing fixed-point theory and spectral analysis, the authors analyze how prompt design and objective difference contribute to the biased equilibrium. They also identify conditions that lead to asymmetric convergence, where only one of the agents reaches its goal, and design a white-box attack algorithm accordingly. Experiments with a pretrained single-layer linear self-attention transformer and GPT-5 demonstrate the theoretical results and provide insights for understanding LLM-based multi-agent systems. First of all, the authors identify an important and timely problem in the field of multi-agent systems (MAS) involving large language models (LLMs). The unpredictability of LLM-driven MAS and their occasional underperformance compared to single-agent systems highlight the need for a deeper understanding of agent interactions. The paper's focus on characterizing agent-to-agent interactions and the analysis regarding the internal state updates is novel and addresses a gap in the literature. The theoretical analysis is rigorous and well-supported by mathematical proofs. The inclusion of both LSA and GPT-5 agents in the experiments strengthens the credibility of the results. Furthermore, it is nice that the authors link findings on asymmetric conditions to adversarial prompt design and white-box attacks, demonstrating the potential for malicious exploitation; this also opens up important discussions on LLM safety. Regarding the presentation, the paper is well-written and clearly structured. The authors provide comprehensive explanations of the theoretical framework and detailed derivations of key results. Detailed explanations after key results allow readers to easily follow and understand the findings. The paper discusses white-box attacks but does not delve into potential defenses. It would be beneficial to add a short discussion regarding strategies for eliminating or mitigating these attacks. Addressing these concerns would provide a more comprehensive understanding of the security implications and offer practical solutions for securing multi-agent systems. At the end of section 3, the suggestion to design a common goal for multi-LLMs is quite intuitive. It would be a lot better if the authors could further investigate this idea to provide more valuable insights and practical guidance. For example, they could explore specific techniques for aligning objectives or designing prompts that promote collaboration. This would enhance the paper's contribution to the development of effective multi-agent systems. - In section 3, we can derive the plateau levels if given u* and w*.
But, if we already know u* and w*, what is the point of using multiple LLM agents to interact and find the solution? - In Figure 3, it seems that the victim's error for LSA-trained agents is high, while the attacker's error is quite low; how come the attack success rate is lower than that with GPT-5? If this is correct, what are the possible causes? Is it due to differences in model complexity, training data, or other factors? A deeper analysis would enhance the understanding of the results. - The white-box attack may cause safety issues; are there any solutions or defense approaches? How can the multi-agent system be secured or made more robust? Further, can we derive from this paper some insights on what kind of work is suitable for MAS instead of single agents? Lightly AI-edited
Convergence dynamics of Agent-to-Agent Interactions with Misaligned objectives Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper investigates the convergence behavior of two large language model (LLM) agents that perform alternating in-context gradient updates under misaligned objectives, aiming to uncover bias propagation and stability issues in multi-agent interactions. The challenge lies in the strong coupling and low interpretability of multi-agent systems, which make it difficult to mechanistically describe how agents influence each other during in-context updates. The authors model each agent as a linear self-attention (LSA) network that executes gradient descent, formalize their interaction, and derive closed-form and spectral results for the convergence error. They further find that when objective misalignment and prompt-geometry anisotropy coexist, the system exhibits an asymmetric convergence phenomenon. The paper proposes a white-box adversarial prompt design algorithm and validates the “attacker converges, victim fails” behavior on both LSA and GPT-5-mini models. Overall, the paper is well structured with complete and detailed derivations; the direction is promising, though the experimental setup remains somewhat simplified and idealized. 1. The paper is the first to model multi-agent LLM interaction as an alternating in-context gradient optimization system, providing a mathematical formulation of inter-agent updates and a computable basis for analyzing bias propagation and convergence stability. 2. The mathematical derivations are complete and logically clear, with consistent notation, explicit assumptions, and boundary conditions; the appendix provides supplementary derivation details that enhance verifiability. 3. The proposed dynamical analysis framework is broadly applicable beyond dual-agent settings—it can extend to multi-agent alignment, model coordination, and optimization-based safety studies, offering a general theoretical tool for future work. 1. The experiments are conducted only on a synthetic linear-regression task, without testing agent interaction in realistic language scenarios such as reasoning, writing, or code generation. This limits the explanatory power of the results for real-world multi-agent LLM collaboration. 2. The study relies on the assumption that LLM inference is equivalent to in-context gradient descent; while analytically convenient, this assumption is not strictly true for real LLM reasoning and may weaken the practical relevance of the conclusions. 3. Although the paper formally defines prompt geometry, it lacks intuitive examples or visualizations that would help readers understand how the proposed geometric structure corresponds to actual prompt design and model behavior. 4. The writing still needs polishing: the main text frequently refers to GPT-5 while the experiments actually use GPT-5-mini, which should be clarified; moreover, the link in Appendix 7.3 is broken. 1. Is there sufficient empirical or theoretical evidence for treating LLM inference as in-context gradient descent? Does this assumption still hold on nonlinear tasks? 2. 
Given the large difference between the LSA model and real LLM architectures, how do the authors evaluate the impact of this simplification on the reliability of their conclusions? 3. Could the authors provide a case study of a large-model experiment? 4. How does "prompt geometry" correspond to linguistic structures? Could the authors give a few simple examples? (A toy numerical illustration is appended after this review.) 5. Could the paper include a more complex experimental scenario, such as realistic multi-agent cooperation (text co-writing, code generation, etc.)? 6. Do the authors plan to release the LSA experiment code or a minimal reproducible example? 7. If the attacker only has black-box access to the victim, is there an approximate attack strategy? Lightly AI-edited
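A minimal numerical illustration for question 4 above, under the in-context-gradient-descent reading the paper adopts: the empirical covariance of the in-context inputs acts as the preconditioner of the implicit update, so anisotropic prompts make some directions converge and others stall. The setup below (linear regression, noiseless labels, the specific scales) is an assumption-level sketch, not the paper's construction.

```python
import numpy as np

# In-context linear regression: one implicit GD step uses the prompt's
# empirical input covariance S = X^T X / n as its preconditioner, since
#   w <- w - eta * X^T (X w - y) / n  equals  w - eta * S (w - w_star)
# for noiseless labels y = X w_star. A prompt with small variance in some
# coordinate therefore leaves a persistent residual in that direction.

rng = np.random.default_rng(1)
d, n, eta, steps = 2, 256, 0.5, 50
w_star = np.array([1.0, -1.0])

for scales in [(1.0, 1.0), (1.0, 0.05)]:            # isotropic vs anisotropic prompt
    X = rng.normal(size=(n, d)) * np.array(scales)  # in-context example inputs
    y = X @ w_star                                  # noiseless labels
    w = np.zeros(d)
    for _ in range(steps):
        w = w - eta * (X.T @ (X @ w - y)) / n       # implicit preconditioned step
    print(f"scales={scales}: per-coordinate error = {np.abs(w - w_star)}")
```

With `scales=(1.0, 0.05)` the second coordinate's error barely moves over 50 steps, which is the kind of prompt-geometry effect the paper's asymmetric-convergence results formalize.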
Convergence dynamics of Agent-to-Agent Interactions with Misaligned objectives Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents a theoretical framework to analyze the interaction dynamics between two language model-based agents. It models the agent-to-agent interaction as an alternating optimization process, where each agent performs an in-context gradient update towards its own, potentially misaligned, objective. The authors provide a formal characterization of the convergence dynamics, showing that misaligned objectives result in a biased equilibrium where neither agent fully achieves its goal. The framework predicts these residual errors based on the objective gap and the geometry induced by each agent's prompt. Furthermore, the paper establishes the conditions for asymmetric convergence and proposes a constructive, white-box adversarial attack that allows one agent to achieve its objective while forcing the other to retain a persistent error. These theoretical results are validated with experiments on in-context linear regression tasks using both trained transformer models and GPT-5. 1. The paper offers a clean and analytically grounded model of in-context optimization between interacting agents. 2. The theoretical results (Propositions 1–3) are mathematically sound and provide clear geometric intuition for asymmetric convergence. 3. The analysis extends the “transformers-as-optimizers” view to a two-agent setting, which is conceptually novel and well aligned with the learning theory track. 1. The paper’s core theory assumes transformers implement in-context gradient-like updates (the “transformers-as-optimizers” view) and then analyzes coupled update dynamics under that assumption. However, the GPT-5 experiments do not test emergent in-context optimization — they prompt the model with the explicit gradient formula and treat GPT-5 as an arithmetic oracle. This weakens the experimental link to the paper’s foundational claim: the GPT-5 results demonstrate correct formula execution, not that modern LLMs naturally realize the assumed optimizer dynamics. 2. Figure 3 reports attack success rates using thresholds ε₁ and ε₂, but the manuscript never specifies these threshold values or how they were chosen. Without concrete ε values (or sensitivity analysis), the reported percentages are uninterpretable and non-reproducible: it is impossible to judge whether “85% success” reflects algorithmic failure modes, threshold arbitrariness, or genuine instability in the dynamics. 3. Algorithm 1 relies on a carefully aligned eigen-spike construction, but the experiments do not compare this design to simpler baselines (e.g., misaligned anisotropy, scalar scaling of S_U, or random high-anisotropy prompts). As presented, it is unclear whether the geometric construction is uniquely required for one-sided convergence or merely one of many ways to produce non-symmetric outcomes. 1. The theoretical analysis assumes each agent performs a linear gradient-like update, following the LSA approximation. How robust are the main results, especially Proposition 1 and Corollary 2, when the agent dynamics include moderate nonlinearities or higher-layer effects? 
It would be valuable to see whether asymmetric convergence still appears under more realistic conditions. 2. Proposition 2 and Corollary 3 rely on exact equalities such as $(I - \eta S_U) S_W \Delta = 0$. In practice, these conditions can only be approximately met. How sensitive is the observed asymmetric convergence to small violations of this condition? Do the effects decay smoothly, or is the phenomenon brittle? (A small numerical sensitivity check is sketched after this review.) 3. In the GPT-5 setup, the model is prompted with the exact gradient formula, so it is effectively executing a prescribed computation rather than demonstrating emergent optimization. Could you test a variant where GPT-5 infers the update rule from examples without seeing the explicit formula? Fully AI-generated
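A toy sensitivity check one could run for question 2. The alternating preconditioned-gradient system, the step ordering, and the eigen-spike construction below are assumptions loosely mirroring the reviews' description of the paper, not the paper's exact dynamics; the point is only to probe whether "attacker converges, victim fails" degrades smoothly as the exact-equality condition is violated.

```python
import numpy as np

# Shared iterate x; attacker (agent U) and victim (agent W) alternate
# preconditioned gradient steps toward different optima u*, w*.
rng = np.random.default_rng(0)
d, eta = 4, 0.1
w_star = rng.normal(size=d)
delta = rng.normal(size=d)              # objective misalignment
u_star = w_star + delta
S_W = np.eye(d)                         # isotropic victim geometry

def steady_errors(eps, T=2000):
    v = S_W @ delta
    v = v / np.linalg.norm(v)
    # Eigen-spike 1/eta along v, so (I - eta*S_U) S_W delta = 0 holds exactly.
    S_U = np.eye(d) + (1.0 / eta - 1.0) * np.outer(v, v)
    E = rng.normal(size=(d, d))
    S_U = S_U + eps * (E + E.T) / 2     # symmetric violation of the condition
    x = np.zeros(d)
    for _ in range(T):
        x = x - eta * S_U @ (x - u_star)    # attacker update
        err_u = np.linalg.norm(x - u_star)  # attacker error at its own turn
        x = x - eta * S_W @ (x - w_star)    # victim update
        err_w = np.linalg.norm(x - w_star)  # victim error at its own turn
    return err_u, err_w

for eps in [0.0, 1e-3, 1e-2, 1e-1]:
    eu, ew = steady_errors(eps)
    print(f"eps={eps:.0e}  attacker err={eu:.3e}  victim err={ew:.3e}")
```

In this toy, eps = 0 gives attacker error near zero (measured right after its update) and a persistent victim error of order $(1-\eta)\lVert\Delta\rVert$; how quickly the attacker's error grows with eps is exactly the brittleness question above.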
Knowledge-enhanced MCTS for LLM-based Medical Diagnosis Reasoning Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper presents a method for medical diagnosis with external knowledge, which builds a tree whose nodes are states and edges are actions (the states themselves are not explained in the paper). A diagnosis is made by expanding the tree until a specific terminal diagnosis is reached. The experimental results show that the method performs better than some baseline models, but it does not outperform recent LLMs. 1. The motivation of the work makes sense. 1. It is hard to see what is done in the method, partly because some mathematical notation is not defined. For example, $\mathcal{S}$ and $Q$ in Eq. (1) are not defined. 2. It is also unclear how the external knowledge is used in the method. It should somehow interact with the tree in Figure 1, but I cannot see this in the main text, and Algorithm 1 does not fully explain it. 3. I found a paper that seems relevant [Wang et al., "DiReCT: Diagnostic reasoning for clinical notes via large language models," NeurIPS 2024]. That paper uses a graph to explicitly specify the deductive process up to the terminal diagnosis. It would be nice to see a task-level, methodological, and experimental comparison with it. 4. It would be desirable to show how the identified diagnostic path makes sense to human experts, or to provide some qualitative evaluation on this point. 1. Since the medical domain often uses specialized vocabulary, surface-level retrieval may already work well. Is there an ablation study on this point? 2. What is $S$ in Eq. (2)? Fully human-written
Knowledge-enhanced MCTS for LLM-based Medical Diagnosis Reasoning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes Med-MCTS, a knowledge-enhanced diagnostic reasoning framework that integrates MCTS with external medical knowledge. The overall framework is reasonable and aligns with real-world disease diagnosis. It also appears easy to combine with any LLM to improve its reasoning ability, as Table 1 suggests. 1. I am curious about the accuracy of retrieving entities using embedding similarity. You mention the limitations of RAG in the paper, but I wonder whether cosine similarity can reliably retrieve the correct entities. Why not use a more rigorous retrieval strategy? 2. What happens if no similar entities are found in the knowledge graph? I am also confused about the hypothesis generation: symptoms, prescribed medications, and patient populations are all recorded in clinical practice guidelines or drug labels, so why is a hypothesis generation module needed? Even with only the KG, one could reason over the graph directly. Is this necessary? 3. Does the model's performance depend on which knowledge graph is used? 4. I would like to see a comparison between Med-MCTS and a KG-based RAG method. 5. What is the performance of pure LLMs? Please show the results of Qwen2.5-72B-Instruct when prompted directly. I am also curious why Med-MCTS does not outperform other models of similar size, such as HuatuoGPT-o1-72B. 6. Overall, the framework seems to lack novelty. It looks like a method that samples reasoning traces and then selects one based on the knowledge graph, and its performance depends heavily on the adopted knowledge graph. see weaknesses Fully human-written
Knowledge-enhanced MCTS for LLM-based Medical Diagnosis Reasoning Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes Med-MCTS, a test-time reasoning framework that combines Monte-Carlo Tree Search (MCTS) with knowledge-grounded retrieval for clinical diagnosis with LLMs. The key ideas are: * A clinically informed action space that mirrors physician workflow: (A1) Key Symptom Extraction -> (A2) Hypothesis Generation -> (A3) Evidence Verification -> (A4) Deductive Analysis, with explicit backtracking rules. * Knowledge-grounded search using a medical knowledge graph (about 44k entities / about 300k relations) with bidirectional retrieval: R1 (symptom -> disease) for hypothesis generation and R2 (disease -> symptom) for targeted verification, plus a similarity threshold $\tau$ to gate KG use. * A multi-dimensional path evaluator selecting the final diagnosis by combining *consistency*, *diversity* (pairwise path similarity), and *agent-based factuality*. Results. On four MCQ benchmarks (MedQA, MedBullets-4/5 options, JMED) and two backbones (Qwen2.5-7B/72B-Instruct), Med-MCTS improves over CoT, RAG, Self-Consistency, RAP, and rStar. For Qwen2.5-72B, it reaches 71.10/62.66/84.57/68.00% on the four datasets, outperforming domain-specific medical models and approaching GPT-4o. Ablations & cost. Removing KG retrieval (R1/R2) or A4 reduces accuracy; the full system performs best on a MedQA subset (72.5%). Inference is compute-heavy (~77–84 LLM calls; 121k–169k tokens/question). 1. Well-aligned action space that mirrors real diagnostic workflow (A1–A4), enabling interpretable search and principled backtracking. 2. Knowledge-grounded expansion via KG (R1/R2), addressing the limitations of surface-similarity RAG and strengthening factual grounding. 3. Consistent empirical gains across datasets and model scales; Med-MCTS with 72B outperforms domain-specific medical models and is competitive with strong general LLMs. 4. Ablations and budget study support claims that gains are not merely from heavier sampling; increased pass@k for SC saturates while Med-MCTS maintains an edge under similar budgets. 5. Transparency and practicality: the tree exposes "show-your-work" reasoning; a small expert study (50 cases) reports good correctness and interpretability (about 4.1–4.2 / 5). 1. Evaluator transparency: The diversity metric and multi-agent evaluation (MAE) are not fully specified (models/prompts; whether evaluators see the final answer; safeguards against self-confirmation). This is critical because the final selection depends on Eq. (4). 2. Reproducibility: * The KG identity, construction, licensing, and access are not named; the similarity threshold $\tau$ and embedding model $f_\theta$ are unspecified. Without these, re-running R1/R2 is difficult. * Hyperparameters for the scorer ($\lambda$) were tuned on a 50-question MedQA subset; robustness across datasets is not analyzed. 3. Statistical rigor: No CIs or significance tests for deltas over rStar/RAP; some gains are modest (≈1–2%). 4. Compute overhead: The approach is expensive at inference (~77–84 calls; up to ~169k tokens/question), which limits deployability; no adaptive-budget or early-exit mechanism is presented. 5.
Evaluation scope: All tasks are multiple choice; while JMED has more options (21) and is more realistic, fully open-ended differential diagnosis or free-text synthesis/citation tasks are not reported. 6. Fairness of baselines: The RAG baseline uses a textbook corpus; Med-MCTS uses a KG. The paper argues KG superiority, but a matched data source (e.g., a KG derived from the same textbooks) or RAG + KG hybrid baselines would strengthen the claim. 1. Evaluator specifics: * How exactly is Sim(t_i, t_j) computed (embedding space? sentence-level cosine? structure-aware)? (One plausible embedding-cosine reading is sketched after this review.) * For MAE, which models, prompts, and verifiers are used? Do verifiers see the final answer choices or only paths? Any adjudication tie-breaks? 2. KG details and reproducibility: * Which knowledge graph (name/source) is used? Will you release the KG (or mapping layer) to enable replication? What is the value of $\tau$ and the exact embedding model $f_\theta$? 3. Statistical reporting: Could you provide 95% CIs or bootstrap tests for Table 1/2 deltas (especially vs rStar/RAP), and per-dataset error bars? 4. Ablations beyond MedQA: The ablation in Table 3 uses 120 MedQA samples. Can you replicate it on JMED or MedBullets to show robustness of R1/R2 and A4 across distributions? 5. Efficiency / scaling: Any results with adaptive rollouts (e.g., early exit when the posterior concentrates), dynamic branching, or UCT parameter sweeps to trade accuracy vs cost? 6. Failure analysis & safety: Please include a qualitative error analysis (common failure modes) and a calibration or hallucination check for reasoning paths, beyond the small expert rating. 7. Fair baseline comparison: Could you run a KG-only RAG (no MCTS) and/or RAG + KG with the same KG to isolate the benefit of MCTS vs the data source? 8. Even though you leverage MCTS and the KG to improve factuality during generation, how could you audit the factuality of the internal process? It resembles infinite regress (aka "turtles all the way down"). Please elaborate on this. Fully AI-generated
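One plausible reading of the unspecified diversity term asked about in question 1, for concreteness: sentence-level embeddings of each serialized reasoning path with pairwise cosine similarity. The embedding source, the dimensionality, and the 1 − mean-similarity definition below are all assumptions for illustration, not the paper's specification.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def diversity(path_embeddings: list) -> float:
    """Diversity of a set of reasoning paths as 1 - average pairwise cosine."""
    n = len(path_embeddings)
    sims = [cosine(path_embeddings[i], path_embeddings[j])
            for i in range(n) for j in range(i + 1, n)]
    return 1.0 - float(np.mean(sims))

# Stand-ins for path embeddings (in practice, a sentence-embedding model
# applied to each serialized reasoning path):
rng = np.random.default_rng(0)
paths = [rng.normal(size=384) for _ in range(5)]
print(f"diversity score: {diversity(paths):.3f}")
```

Whether the paper intends something like this, a structure-aware alignment, or edit distance is exactly what the question asks the authors to pin down.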
Knowledge-enhanced MCTS for LLM-based Medical Diagnosis Reasoning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces Med-MCTS, a diagnostic reasoning framework that integrates Monte Carlo Tree Search (MCTS) with a medical knowledge graph. The core of Med-MCTS features a clinically inspired action space (A1–A4: key symptom extraction, hypothesis generation, evidence verification, and deductive analysis) that closely aligns with physicians' iterative diagnostic workflows. A bidirectional knowledge retrieval mechanism (R1: symptom-to-disease, R2: disease-to-symptom) provides structured and verifiable knowledge support during tree expansion and verification. Additionally, a multi-dimensional path discriminator—encompassing Consistency, Diversity, and Agent Evaluation—selects high-quality reasoning paths, replacing simple majority voting. Across benchmarks including MedQA, MedBullets, and JMED, Med-MCTS outperforms baselines such as CoT, RAG, Self-Consistency, RAP, and rStar on both Qwen2.5-7B and 72B models. Compared to domain-specific models, it achieves higher overall accuracy and approaches GPT-4o performance. Ablation studies highlight the substantial contributions of R1/R2, A4, and the discriminator. - The paper adapts Monte Carlo Tree Search (MCTS) to the domain of clinical diagnosis by introducing a clinically-inspired action space (A1–A4). It explicitly models backtracking forks based on the presence/absence and certainty/uncertainty of symptoms, which distinguishes it from conventional mathematical decomposition approaches. - The paper is well-written and clearly organized, following a logical structure (Method, Experiments, Ablation, Limitations). The framework's workflow is effectively illustrated with diagrams (Figures 1 and 2), and technical details are made accessible through clear mathematical formulas (for UCT and scoring) and pseudocode (Algorithm 1). - Lack of a Clear Reward Definition for MCTS: The paper states that it uses the UCT algorithm, which requires backpropagating a value Q(s,a), but it fails to explicitly define how the terminal reward for a rollout is calculated. It is unclear whether this reward is a binary score based on the ground-truth answer, a value derived from the model's confidence, or a scalar score from the multi-agent evaluator. This omission is a critical gap in the methodology that hinders reproducibility. (A generic UCT backup is sketched after this review to make concrete what must be specified.) - Insufficient Detail in the Multi-dimensional Scorer: The path selection mechanism is not fully specified. Specifically: (a) The implementation of the similarity function Sim(·,·) for calculating diversity is not described. (b) The acronym 'MAE' for 'Multi-Agent Evaluation' is ambiguous and can be easily confused with 'Mean Absolute Error'. (c) The paper only briefly mentions how the sub-scores are normalized and how their weights are tuned, lacking clear details on the measurement scales or calibration methods. - Ambiguity Regarding the Retrieval Corpus: The paper's methodology focuses on knowledge graph (KG) retrieval, yet the appendix describes a "Retrieval Corpus" based on textbook materials.
It is not clearly distinguished whether this text-based corpus is used exclusively for the RAG baseline or if it is also leveraged by the proposed Med-MCTS framework. This creates confusion about the precise knowledge sources used by the method. - Presentation and Referencing Errors: The paper contains several distracting presentation issues. There are multiple spelling errors (e.g., "Accuray," "Mult-Agent evalution," "selecte") and an incorrect cross-reference in Section 3.2, where the text describing the general MCTS structure points to Figure 2 (a specific case study) instead of a more general diagram. Please clarify the definition and value range of the reward used for backpropagation. Is it a binary correctness score (e.g., 1 for a correct terminal diagnosis, 0 otherwise), a value mapped from the confidence level in action A4, or a scalar score from the multi-agent evaluation? What is the specific implementation of the Sim function (e.g., text edit distance, cosine similarity of sentence embeddings, structural alignment)? For the 'MAE' score, what is its calibration and range (e.g., [0, 1]), and are the individual sub-scores normalized? Could you provide more complete results for the hyperparameter tuning of λ, such as the grid search outcomes or a sensitivity analysis with confidence intervals? Regarding the "Retrieval Corpus" described in the appendix: is the textbook retrieval used exclusively for the RAG baseline, or does Med-MCTS also utilize text retrieval during certain action phases? Moderately AI-edited
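To make the reward question above concrete, here is a generic UCT bookkeeping sketch showing exactly where the unspecified quantity enters: whatever the authors intend—binary correctness, A4 confidence, or an evaluator score—must be the scalar fed into the backup. This is standard MCTS/UCT, shown for reference, not Med-MCTS's actual algorithm.

```python
import math

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}   # action label -> Node
        self.N = 0           # visit count
        self.Q = 0.0         # running mean value

def uct_select(node: Node, c: float = 1.4):
    """Pick the child maximizing Q + c * sqrt(ln N_parent / N_child).
    Assumes every child has been visited at least once."""
    return max(node.children.items(),
               key=lambda kv: kv[1].Q + c * math.sqrt(math.log(node.N) / kv[1].N))

def backup(leaf: Node, reward: float) -> None:
    """Propagate a scalar terminal reward to the root via incremental means.
    The definition of `reward` is precisely what the paper leaves open."""
    node = leaf
    while node is not None:
        node.N += 1
        node.Q += (reward - node.Q) / node.N
        node = node.parent

# e.g., under a binary-correctness convention (one possible choice):
root = Node()
child = Node(parent=root)
root.children["A2: hypothesis"] = child
backup(child, reward=1.0)   # rollout reached the ground-truth diagnosis
print(root.N, root.Q)       # -> 1 1.0
```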
A unified perspective on fine-tuning and sampling with diffusion and flow models Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper studies training diffusion/flow models to sample from exponentially-tilted targets—covering both reward fine-tuning and sampling from unnormalized densities—through two lenses: stochastic optimal control (SOC) and thermodynamics. It (i) gives bounds on the lean adjoint (AM/AS), (ii) derives a common bias–variance view comparing AM/AS with TSM/CSM/NSM (showing variance pathologies for TSM/CSM), (iii) adapts CMCD/NETS with Crooks/Jarzynski identities to the tilting setting, and (iv) reports AM-based text-to-image fine-tuning experiments (SD-1.5/SD-3). 1. Lean-adjoint norm bound (Prop. 3.1). Under strong log-concavity of the base, the paper gives an explicit schedule-dependent decay bound on the adjoint's norm; in the Gaussian case it yields closed-form factors used in AM/AS analysis. This is positioned as theoretical support for AM/AS stability. 2. Unified bias–variance comparison across methods. With a shared weighting $w = \eta_t$, they show TSM/CSM have infinite variance (blow-ups at $t = 0/1$), while NSM admits finite bounds and AM/AS have a simple finite constant (table summarizing Föllmer, DDIM/DDPM, Rectified-Flow cases). This clarifies when KL-interpretable training is statistically well-posed. 3. Thermodynamics adapted to tilting. The paper derives tilting-aware versions of the controlled Crooks theorem, the escorted Jarzynski equality, and a NETS loss, all incorporating both the base score and the reward gradient—tools one could use for thermodynamic training/diagnostics in the tilting formulation. 1. SOC/unification is known. The exponential-tilting formulation and the memoryless noise-schedule condition enabling SOC (and schedule-agnostic inference) are prior results from Adjoint Matching; here they are restated. 2. Adjoint bound is narrow. The bound hinges on strong log-concavity of the base (or Gaussianity); the paper notes similar behavior is "expected" more generally but does not prove it beyond these assumptions. 3. No new training algorithm; limited empirics. Experiments only apply AM to SD-1.5/SD-3; the thermodynamics adaptations (CMCD/NETS, Crooks/Jarzynski) are not empirically validated here. Overall the work functions as a consolidation + analysis rather than a method contribution. Could you please clearly state what the take-away message of the paper is? To me, this submission is a useful consolidation with some clarifying theory, but offers little in the way of new algorithms or fundamentally new guidance principles. Moderately AI-edited
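For reference, the exponentially tilted target that the reviews of this submission keep referring to has the following standard form in the reward fine-tuning literature (generic notation, not this paper's exact definitions): a base distribution reweighted by a reward, equivalently the unique maximizer of a KL-regularized reward objective.

```latex
% Tilted target: base model reweighted by reward r with strength 1/lambda.
p^{*}(x) \;=\; \frac{1}{Z}\, p_{\mathrm{base}}(x)\, e^{\,r(x)/\lambda},
\qquad
Z \;=\; \int p_{\mathrm{base}}(x)\, e^{\,r(x)/\lambda}\, \mathrm{d}x,
% and equivalently the KL-regularized variational characterization:
p^{*} \;=\; \arg\max_{p}\;\; \mathbb{E}_{x \sim p}\big[r(x)\big]
\;-\; \lambda\, \mathrm{KL}\big(p \,\Vert\, p_{\mathrm{base}}\big).
```

Reward fine-tuning corresponds to $p_{\mathrm{base}}$ being a pre-trained diffusion/flow model, while sampling an unnormalized density can be cast in the same form by taking a tractable base and folding the remaining log-density into $r$.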
A unified perspective on fine-tuning and sampling with diffusion and flow models Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper discusses fine-tuning and sampling for diffusion/flow models by framing both as sampling from an exponentially tilted target, covering reward fine-tuning and unnormalized densities. It analyzes adjoint-based methods via a new norm bound on the lean adjoint and gives bias–variance decompositions that compare adjoint and score-matching losses, highlighting advantages for AM/AS and NSM. The authors conduct experiments with SD-1.5 and SD-3 using AM. I like the theoretical perspective of the paper: Prop. 3.1 is novel, and the variance comparison across many existing works is quite useful for the community (Table 1). The paper feels incomplete at its current stage, especially given the lack of experiments related to the theory developed, and it does not tell a coherent story. I make some suggestions in the Questions section which will hopefully be helpful. Proposition 3.1 is useful for demonstrating that the lean adjoint objective is more stable. It would be much more interesting, and potentially necessary, to compare the norm of the lean adjoint objective to that of the original adjoint objective, because then an empirical observation would be well aligned with a matching theoretical justification. As the AM authors mention, the adjoint objective is not stable to train. I strongly suggest that the authors elaborate more on the theory discussed in the paper: when I read it, I felt there was not enough discussion of its importance or of how it is relevant to the fine-tuning task, which makes the paper feel under-motivated. I am not familiar with the context of Section 4, but it feels disconnected from the rest of the AM story. In the Introduction, the authors claim that this paper "refines the techniques of Domingo-Enrich et al. (2025)." What exactly did this paper refine? Fully human-written
A unified perspective on fine-tuning and sampling with diffusion and flow models Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper studies the problem of training diffusion/flow models to sample from reward-tilted distributions, via fine-tuning or sampling of unnormalized densities. The work provides theoretical results to compare recent classes of methods developed to tackle such problems (namely adjoint-based fine-tuning schemes, and score-matching training methods applied over data obtained via inference-time simulation of the tilted distribution). It then adapts recent algorithmic schemes based on a thermodynamic interpretation to tackle the exponential tilting problem, and performs experimental evaluations. - The paper aims to provide mathematical understanding of high-relevance problems in diffusion and flow generative modeling. In particular, I believe that theoretical analysis comparing score-based methods for reward fine-tuning with adjoint-based methods is highly valuable. - Sec. 2 and 3 provide a fairly interesting unifying lens on the problem of computing a diffusion/flow model whose induced distribution matches a reward-tilted distribution. - The paper seems to provide a potentially interesting mathematical and algorithmic viewpoint on the aforementioned problem (but unfortunately one that is not sufficiently well structured and presented, as discussed in the following). - (main concern 1, presentation) On a high level, the paper is written extremely poorly (given the target audience). While I might expect other fields to appreciate unmotivated, purely mathematical results, computer science / machine learning requires considering computational and algorithmic aspects that are deeply neglected in the presentation of this work, leading to profound confusion in its exposition. I will try to mention a subset of examples in the following list: 1. (abstract, first line) "training diffusion models" ... "subsumes sampling..." and reward "fine-tuning...". Crucially, training/fine-tuning/sampling are different algorithmic problems, and training does not subsume the others. Training refers to learning a model from data, and fine-tuning to adapting an already available model. Sampling, in this context, typically refers to inference-time adaptation of the diffusion/flow process to sample from the tilted distribution, while here I believe it is used to mean classic sampling of an unnormalized density. The same confusion arises in multiple parts of the paper (e.g., lines 54-56). 2. (Sec. 2.2) The exponential tilting problem is defined as the task of modifying the model (line 117) such that it samples from the tilted distribution. Then, two main settings are presented: (i) reward fine-tuning and (ii) sampling. Crucially, (ii) does not require an initial model, as the optimal density $p^*$ does not depend on existing data or a pre-trained model (see the concern below). As a consequence, this problem setting seems wrong. 3. Sec. 4 aims to introduce an algorithm, but effectively it 'adapts' existing algorithms that do not seem to be introduced at all.
Concretely, it is customary to clarify the inputs/outputs of an algorithm, typically provide pseudocode, etc. Presenting mathematical results useful for algorithm design is not sufficient to claim that an algorithm is presented. In this case in particular, since the implicitly proposed scheme would be an adaptation of existing (not presented) algorithms, it is even more essential to clarify the proposed method. - (main concern 2, lack of motivation/clarity) Most contributions of the work are not properly motivated or explained sufficiently clearly. Some examples: 1. The contributions listed at the end of the abstract are very poorly connected with the introduced logic. 2. The same holds for the contributions list within the Introduction. 3. The paper mentions the thermodynamics framework and algorithm of Vargas et al. multiple times (e.g., line 62), but it seems to me that these are never presented sufficiently well, rendering the whole work very hard to follow properly. 4. Sec. 3 should start by introducing why that list of methods is presented. 5. Line 193: exactly why one would wish to bound the norm is not explained. Similarly, the presented bounds should be discussed further. - (main concern 3, problem setting) The presented 'thermodynamics' formulation is poorly presented, and therefore quite unclear. In particular, the section 'The thermodynamics formulation' (line 148) starts with 'Methods like X and Y were developed in a setting where...'. Crucially, it seems to me that lines 151-155 aim to reduce the tilted-distribution sampling problem to a specific sub-case of this setting, but effectively reduce only an irrelevant instance of the problem (e.g., when $p_{\mathrm{base}}$ is data-independent). It might also be the case that the authors were only trying to clarify the difference between the settings. Unfortunately this is unclear due to poor writing, but in any case, what exactly 'the thermodynamics formulation' is (and why thermodynamics?) remains unclear to me. - (main concern 4, method/experiments) Sec. 4 claims by its title to introduce an algorithm. This would arguably be the most practically relevant contribution of this work, which otherwise presents only theoretical results (also fine, but requiring a different evaluation). Besides not being explicitly presented, the algorithm does not seem to be experimentally evaluated at all. Given the very poor/imprecise writing (for the average audience of this conference), weakly motivated theoretical results, and an unclear and unevaluated algorithm, my current score is negative. Nonetheless, I believe this work might contain interesting ideas that would require further practical development (or clearer theoretical motivations and discussions) as well as significant rewriting. Given convincing clarifications of the points mentioned in the Weaknesses section of my review, I would be happy to increase my score. In particular, let me know if I misinterpreted/misunderstood any of the points (this is indeed possible due to the uncommon writing from a generative-modeling viewpoint, or at least from that of my sub-fields of expertise). Fully human-written
A unified perspective on fine-tuning and sampling with diffusion and flow models Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper attempts to produce a unified view of several existing diffusion/flow-based fine-tuning and sampling methods and to analyze some of their properties. More concretely, the paper proposes a disparate set of results: - A bound on the norm of the lean adjoint ODE for Adjoint Matching and Sampling (AM/AS), which potentially provides theoretical support for the empirical performance of these algorithms. - A bias-variance decomposition for both adjoint-based and score-matching algorithms. - An adaptation of thermodynamic formulations (CMCD, NETS) to the exponential tilting setting. - The attempt to present a wide array of related methods in a unified way is laudable. I appreciate Section 2, though it assumes familiarity with a lot of background work. - Detailed proofs are provided for the results, though I have not verified all of them closely. - Motivation: At a high level, it is not clear what specific problem the paper is trying to address, why that problem is important, and what the key idea is. As written, it seems to be a collection of theoretical results, but without a clear and/or convincing demonstration of their impact on a problem of interest. - Writing: The lack of clear motivation also makes the paper hard to follow and understand, as it is already quite dense and assumes familiarity with a significant amount of background knowledge. It is not clear how and which sections of the paper/background work are important for the provided results. Several theoretical results are stated, with proofs relegated to the appendix, without much discussion of why each result is important or what it unlocks. For example, what are the practical implications of the bound on the lean adjoint ODE? How can this bound be used to guide the design of new algorithms? How sensitive are the results to the choice of the reward model? Beyond speculation, what is the impact of deviations from strong convexity? Etc. - Experimentation: In general, the experimentation section is minimal. However, the lack of clear motivation and writing exacerbates the situation by making it difficult to assess what experimentation would be needed and whether the presented experiments are sufficient. The experimentation section fails to clearly state which results are supported by which experiments. Overall, I think the paper needs to be clearly reorganized and rewritten with a structure and presentation that make the answers to the aforementioned weaknesses obvious. In addition to some of the questions raised above in the Weaknesses section, please provide a clear description of what each presented theoretical result unlocks or offers beyond what is already known, how that result is valuable, and how it can be used to drive future explorations. Fully human-written
Learning to Undo: Transfer Reinforcement Learning under State Space Transformations Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes LUMOS, a method for transfer RL when source and target tasks differ only by a transformation of the state. It learns an undo map that converts target observations into the source feature space so a source policy can be reused (a toy version of this regression is sketched after this review). The authors provide theory for the linear case and extend the method to nonlinear settings. Experiments on continuous-control tasks show it is more sample-efficient than training from scratch. The paper is technically sound and provides a clear theoretical analysis under the linear-Q-realizable setting, with consistent assumptions and proofs. Experiments on multiple tasks validate the effectiveness of the method. The overall structure is logical and easy to follow, and the figures and algorithms are well presented. 1. The paper builds on the "undo map" formulation. While this setting is well-defined and analytically tractable, and the linearity assumption in the theory is acceptable, it represents a relatively narrow class of transfer scenarios, in which task dynamics are identical and differ only by an invertible transformation of the state. This strong assumption limits its contribution to current transfer RL research, where most problems are more complicated and no such deterministic "undo map" exists. The paper motivates the undo map using a few clean geometric and sensor transformation examples, but these settings may not capture the challenges of most transfer situations. 2. The paper repeats several key ideas (such as its contributions) too many times across multiple sections, with only minor variations. Streamlining these explanations would make the presentation clearer and more concise. 3. The paper does not include any ablation studies testing how key design choices affect the transfer results—for instance, how the choice of action sequence, target sample budget, or source policy quality affects the accuracy of the learned undo map. 4. The paper does not provide open-source code and includes limited experimental details, which reduces reproducibility and transparency. 1. How might the proposed method be extended or adapted to handle more general cases, where an exact undo map does not exist? 2. In the related work section, the paper claims that the method is practical and demonstrates strong empirical performance. Could you clarify how this comparison is fair, given that many related methods tackle more general transfer settings than the invertible-transformation case studied here? 3. The proposed method relies on accurately matching feature statistics between the source and target tasks. How sensitive is the approach to potential noise in the target observations? 4. In Equation 1, $\psi_h (a_{0:h-1})$ is defined as the sum of state features, but in equation (5) it is defined as the average of state-action features. This inconsistency is confusing. Lightly AI-edited
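To fix intuition for the core regression summarized above, here is a toy reconstruction of a linear undo map from matched feature statistics. The dynamics, the feature definition (per-rollout averaged states), and the use of shared noise as a stand-in for shared action sequences are all illustrative assumptions, not LUMOS itself.

```python
import numpy as np

# Source and target MDPs share dynamics; the target's observations are
# T_true @ x. We estimate U ~= T_true^{-1} by least squares on per-sequence
# averaged state features collected under matched action sequences.

rng = np.random.default_rng(0)
d, K, H = 3, 50, 10                       # state dim, #action sequences, horizon
T_true = rng.normal(size=(d, d))          # unknown invertible transformation
A = rng.normal(size=(d, d)) * 0.3         # shared linear dynamics

def rollout_stats(transform):
    """Average observed-state features over a rollout, one row per sequence."""
    stats = np.zeros((K, d))
    for k in range(K):
        seq = np.random.default_rng(k)    # same "action sequence" in both MDPs
        x = np.zeros(d)
        feats = []
        for _ in range(H):
            x = A @ x + seq.normal(size=d)     # action-driven drift (stand-in)
            feats.append(transform @ x)        # what the agent observes
        stats[k] = np.mean(feats, axis=0)
    return stats

psi_src = rollout_stats(np.eye(d))        # source observations are raw states
psi_tgt = rollout_stats(T_true)           # target observations are transformed
# Solve psi_tgt @ U.T = psi_src in least squares, i.e. U @ psi_tgt_k = psi_src_k:
X, *_ = np.linalg.lstsq(psi_tgt, psi_src, rcond=None)
U = X.T
print("recovery error ||U @ T - I|| =", np.linalg.norm(U @ T_true - np.eye(d)))
```

In this noiseless toy the recovery error is near machine precision, which is exactly what degrades once observation noise is added—question 3 above.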
Learning to Undo: Transfer Reinforcement Learning under State Space Transformations Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a method for transfer learning in RL that learns an "undo map" to transform target state spaces back to source state spaces via regression on state feature statistics. The transfer learning is studied only in the context where the observation space has changed and not the dynamics or rewards. Experiments are conducted on continuous control problems. * Transfer learning is an important problem in RL. * The proposed method is interesting and has potential. * I found several overclaims by the authors and occasionally misleading comparisons: 1. The setting studied here represents a very narrow and limited form of transfer learning. The authors only consider changes in the observation space (e.g., a change in viewpoint), whereas transfer learning in the broader literature often involves changes in rewards, dynamics, or both. Even within this limited setting, the assumed transformation must follow a specific linear structure, and the experiments are deliberately designed to satisfy this assumption (e.g., the sensor fusion experiments). The RGB-to-grayscale transformation mentioned in the introduction is missing from the experiments, and no experiments on naturally occurring transformations are provided. As a result, statements such as the following are clear overclaims: > (L463) Our method achieves significant gains in sample efficiency compared to learning target policies from scratch, as demonstrated across multiple challenging continuous control environments. 2. The authors repeatedly describe their method as a zero-shot transfer learning approach, yet it explicitly requires thousands of samples from the target task to learn the undo map. In the literature, zero-shot transfer refers to settings where no samples from the target task are available. 3. The authors claim novelty in proposing a new method for transfer learning (L80), yet the idea of using undo maps for transfer was already introduced by Gupta et al. (2020). The only apparent difference here is that regression on feature statistics is used instead of trajectory matching. 4. Empirical overclaims include statements such as the following. However, this comparison is made against learning from scratch, which is not a transfer learning algorithm and has no access to the source task or policy. > (L411) LUMOS-LIN is 143× more sample-efficient. 5. The paper compares the sample complexity of the proposed method against the exponential lower bounds of Weisz et al. (2021), even though that work concerns worst-case MDPs specifically constructed to be difficult. For linear MDPs—the focus of this paper’s theoretical section—sample complexity bounds are already significantly better (see: Jin et al., “Provably efficient reinforcement learning with linear function approximation,” COLT 2020). Thus, the authors’ claim of “strictly better sample efficiency” is misleading; the improvement is marginal and applies only to a very narrow problem class, which should be clearly stated. 6. The practical algorithm deviates considerably from the theoretical setting—for instance, it uses random action sequences instead of source rollouts. 
The paper provides no justification for why matching statistics should work in non-linear settings. Therefore, the following statement is also an overclaim: > (L67) Our algorithmic design is grounded in the setting where the undo map is linear and the source… * The paper has a limited experimental setup: 1. As noted above, the environments are often artificially constructed to satisfy the assumptions of the method. 2. The method is not compared against any prior algorithms from the transfer learning, meta-RL, or domain adaptation literature. The only baseline used is training from scratch, which raises major concerns about the validity of the claimed performance gains. At a minimum, the paper should include one baseline from each of these families to contextualize the proposed approach. Given that the transfer tasks involve only changes in the observation space, even simple methods such as domain adaptation or simple data augmentation could serve as competitive baselines. 3. The experiments use only three random seeds, which is insufficient for RL benchmarks. Moreover, it is unclear what the shaded regions in the plots represent. * Methodological issues: 1. Figure 7 shows that the method can completely fail when using random action sequences for certain tasks. This critical dependency is understated and not properly discussed. 2. The paper assumes an identifiable undo map from statistics, but the notion of identifiability is neither defined nor analyzed. * The paper is not well-placed within the extensive literature on transfer in RL, omitting several key references, including but not limited to: * Touati, Ahmed, Jérémy Rapin, and Yann Ollivier. "Does Zero-Shot Reinforcement Learning Exist?." The Eleventh International Conference on Learning Representations. * Touati, Ahmed, and Yann Ollivier. "Learning one representation to optimize all rewards." Advances in Neural Information Processing Systems 34 (2021): 13-23. * Rezaei-Shoshtari, Sahand, et al. "Hypernetworks for zero-shot transfer in reinforcement learning." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37. No. 8. 2023. * Beck, Jacob, et al. "Hypernetworks in meta-reinforcement learning." Conference on Robot Learning. PMLR, 2023. * Rakelly, Kate, et al. "Efficient off-policy meta-reinforcement learning via probabilistic context variables." International conference on machine learning. PMLR, 2019. * Arndt, Karol, et al. "Meta reinforcement learning for sim-to-real domain adaptation." 2020 IEEE international conference on robotics and automation (ICRA). IEEE, 2020. * Ju, Hao, et al. "Transferring policy of deep reinforcement learning from simulation to reality for robotics." Nature Machine Intelligence 4.12 (2022): 1077-1087. * Ingebrand, Tyler, Amy Zhang, and Ufuk Topcu. "Zero-shot reinforcement learning via function encoders." Proceedings of the 41st International Conference on Machine Learning. 2024. 1. What happens when the source policy is sub-optimal? Fully human-written
Learning to Undo: Transfer Reinforcement Learning under State Space Transformations Soundness: 1: poor Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a transfer learning method capable of recovering state-space transformations between target tasks $\mathcal{M}_t$ and source tasks $\mathcal{M}_s$. Two settings of the algorithm LUMOS are proposed: one with theoretical guarantees, another with empirical evidence. The paper is well written and presents a novel, although narrow, perspective on transfer learning in Reinforcement Learning. The theoretical results rely on a strong assumption given by Eq. 3. Practically, it does not seem feasible to tell beforehand whether the state space and dynamics of $\mathcal{M}_t$ are a transformation (linear or not) of the state space and dynamics of $\mathcal{M}_s$, which heavily limits the applicability of the method. The empirical results do not explore this weakness. As such, the paper's contribution seems reduced to finding a map between two MDPs' dynamics, presuming an oracle (and failing otherwise). Thus, the last sentence before the concluding sentence appears misleading. In the Related Work section, the authors point out limitations of related works regarding theoretical guarantees, while the proposed method itself covers only the narrow setting of linear state-space mappings. The arguments are not convincing given the lack of baselines in the empirical evaluation other than learning the task from scratch. The manuscript is overall well written, but some parts are a bit confusing, such as the open-loop definition in section 3.1 and short sentences like "We assume that $d \le H$". **Q1.** In section 2 (transfer learning) and algorithm 2, interacting with source tasks during transfer does not incur sample complexity. Would it be feasible to report metrics other than sample complexity to assess the cost of transferring (and of interacting with $\mathcal{M}_s$)? **Q2.** In the Related Work section, in the very last sentence, the authors claim that the work _"takes a step toward more principled and broadly applicable approach..."_. How does this claim hold under the strong assumption of linear state-space transformations? Fully human-written
Learning to Undo: Transfer Reinforcement Learning under State Space Transformations Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper studies transfer learning in reinforcement learning, focusing on when and how knowledge can be transferred between Markov Decision Processes (MDPs) to improve learning efficiency. The authors consider the setting where there exists an undo map between a source MDP and a target MDP, such that applying this map to the target's state space exactly recovers the source. The main contributions of the paper are: 1. Algorithm for learning the undo map: The paper proposes a method that learns this map via regression on state feature statistics collected from both source and target MDPs. Once learned, the map enables zero-shot transfer of the source policy to the target. 2. Theoretical analysis: When the undo map is linear and the source MDP is linearly realizable, the proposed approach achieves provably better sample complexity than learning the target MDP from scratch. 3. Empirical validation: Experiments on challenging continuous control tasks demonstrate that the approach achieves significantly higher sample efficiency than baseline methods. 1. Originality: Introduces the novel concept of an undo map to reverse state-space transformations, enabling zero-shot policy transfer. 2. Quality: Methodologically rigorous with theoretical guarantees in the linear setting. Algorithms are clearly defined, and experiments on diverse continuous control tasks show substantial improvements in sample efficiency. 3. Clarity: Well-structured and readable. Algorithms, definitions, and assumptions are clearly presented, with helpful figures illustrating key ideas. 4. Significance: Practically relevant, since learning high-quality policies from scratch typically requires millions of environment interactions. 1. Limited Baselines: Experiments mainly compare against learning from scratch. To better demonstrate the practical advantage of the proposed method, the paper should include comparisons with existing transfer learning RL methods, such as Chen et al. (2024) or other policy-transfer approaches. This would strengthen claims of empirical superiority. 2. Theoretical Scope: The theoretical analysis is restricted to linear undo maps and linearly-Q⋆-realizable MDPs. While LUMOS extends to non-linear transformations empirically, the lack of theoretical guarantees in this general setting limits the rigor of the contribution. Extending the bounds or providing analysis for non-linear settings would significantly enhance the impact. 3. Action Sequence Dependence: Both LUMOS-LIN and LUMOS require carefully chosen action sequences for effective learning of the undo map. The performance is sensitive to these sequences, but no principled method for discovering or optimizing them is provided. The work could include a more systematic approach for generating action sequences. [1] Chen, E., Chen, X., & Jing, W. (2024). Data-Driven Knowledge Transfer in Batch $Q^*$ Learning. arXiv preprint arXiv:2404.15209. Is the sample complexity upper bound of Algorithm 2 tight? Fully AI-generated
Enhancing Zero-Shot LLM Recommendations via Semantics and Collaborative Signals Soundness: 1: poor Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper studies how to leverage LLMs as zero-shot rerankers to address retraining cost and potential forgetting problems. The authors propose SCSRec, a framework that integrates semantic retrieval and collaborative signals, allowing an LLM to perform ranking without additional training. The paper is clearly written, and the motivation—exploring training-free LLM reranking as a practical solution in recall–rerank systems—has reasonable value. S1. The paper is clearly written and well structured, making the proposed framework and experimental setup easy to follow. S2. Using an LLM as a training-free reranker is a valuable idea that helps reduce retraining costs in typical recall–rerank recommendation pipelines. W1. Lack of meaningful baselines to verify the effectiveness of the proposed method. Only one conventional recommender model (CRM) and one LLM-based reranker are used. The comparison with the CRM baseline is of limited meaning, since the proposed method inherently adds content information that CRMs do not exploit. Meanwhile, the LLMRank baseline, as a reranker, often performs worse than the CRM itself. As a result, the experiments lack strong and competitive baselines to convincingly demonstrate the true advantage of the proposed approach. W2. Unclear motivation of the Semantic Retrieval module. The Semantic Retrieval module's motivation is not well justified. The design essentially adds a text-based recall path on top of the conventional collaborative filtering recall, then feeds all candidates to the LLM for re-ranking. As the main goal is to enhance the LLM's reranking capability, simply adding another recall path seems only marginally related to that objective. W3. Relatively limited technical novelty. The method follows a familiar paradigm in which the LLM re-ranks the candidate items based on user history and the scores predicted by the CRM. While practical, this design offers limited technical innovation over existing LLM-based recommenders. Please address the concerns raised in the Weaknesses part. Lightly AI-edited
Enhancing Zero-Shot LLM Recommendations via Semantics and Collaborative Signals Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Summary: This paper introduces SCSRec, a training-free framework for zero-shot large-language-model-based recommendation. It integrates (1) semantic representations from an off-the-shelf LLM, (2) multi-view semantic retrieval (user–user, item–item, user–item), and (3) heuristic ranking prompts incorporating collaborative signals (scores from a base CRM such as NCF, LightGCN, or SASRec). Experiments on multiple datasets show consistent improvements over both conventional recommenders and prior LLM-based approaches. Strengths + The motivation is clear. The paper aims to avoid the training workload of fine-tuning LLMs while overcoming the undesirable performance of prompt-only approaches. + The designed methodology is intuitive. The three components of SCSRec are well motivated. Weaknesses - The underlying conventional CRM (e.g., SASRec) still needs training. It is possible that the training cost of SASRec is even higher than that of efficient LLM fine-tuning approaches such as LoRA; this dampens the motivation and challenges the grounding of this paper. - Not enough LLM-based fine-tuning methods are used as baselines; the only one is OpenP5. Please consider comparing more LLM-based fine-tuning approaches, such as TALLRec [1]. - No LLM-based zero-shot method that does not rely on a conventional CRM is compared, such as TaxRec [2]. [1] Bao et al. TALLRec: An Effective and Efficient Tuning Framework to Align Large Language Model with Recommendation. [2] Liang et al. Taxonomy-Guided Zero-Shot Recommendations with LLMs. Please refer to the weaknesses. Fully human-written
Enhancing Zero-Shot LLM Recommendations via Semantics and Collaborative Signals Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper investigates performance enhancement in zero-shot recommendation systems based on large language models (LLMs), proposing a training-free framework named SCSRec. Its core innovation lies in integrating the semantic understanding capabilities of LLMs with collaborative signals from traditional recommendation models (CRMs) without fine-tuning the LLM. Performance gains are achieved through three key modules: first, it leverages LLMs to generate comprehensive item profiles based on their pre-trained knowledge and available item features; second, it designs multi-view semantic retrievers (user-user, item-item, user-item) to construct diverse and relevant candidate pools; third, it introduces heuristic ranking prompts to integrate CRM predictions, combining collaborative prior knowledge with semantic reasoning. Finally, the synergy between semantic reasoning capabilities and the collaborative filtering prior is thoroughly validated, demonstrating the proposed method's effectiveness across three public benchmark datasets and one industrial dataset. 1. This paper explores a recommendation paradigm independent of LLM fine-tuning, proposing the zero-training framework SCSRec. Through the innovative combination of "semantic retrieval + collaborative prompting," it effectively integrates the semantic understanding capabilities of LLMs with the collaborative filtering prior of traditional recommendation models (CRMs). This approach significantly enhances the performance of zero-shot LLM recommendations without requiring LLM fine-tuning, thereby avoiding its high cost. 2. The adaptive user indexing strategy proposed in this paper dynamically distinguishes and models users' short-term interests and long-term preferences by combining short-term time windows (e.g., recent 7-day activity detection) with long-term iterative aggregation (block-based historical summarization). This approach captures users' immediate behavioral signals while effectively mitigating noise and interest-drift issues in long-sequence histories, thereby constructing more timely and expressive user representations. 1. The paper cites the pain point of LLM recommenders suggesting non-existent items, but the Methods and Experiments sections fail to explain how SCSRec mitigates this issue. 2. The baseline comparison is weak, and the experimental results lack sufficient contrast, making the conclusions unconvincing. 3. The flowchart in Figure 1 for the second phase is difficult to understand intuitively. 4. The paper emphasizes that SCSRec achieves low cost, but without experimental quantification; presenting the cost savings in a table would be more persuasive. 1. In multi-view retrieval, the results from the three views are fused using RRF (a minimal RRF sketch is given after this review for reference). Have other fusion strategies (e.g., weighted averaging, learning-based fusion) been explored? Is RRF the optimal choice? 2. Could more comparisons with existing methods be included? 3. Does it also yield significant improvements on weak CRMs? Lightly AI-edited
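For reference, standard Reciprocal Rank Fusion (Cormack et al., 2009), which question 1 above asks the authors to justify against alternatives. The constant k = 60 is the conventional default; how SCSRec actually parameterizes RRF is not stated in the paper as summarized here.

```python
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Fuse ranked candidate lists: score(item) = sum_i 1 / (k + rank_i(item))."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example with the three retrieval views used by SCSRec (item IDs are made up):
user_user = ["i3", "i1", "i7"]
item_item = ["i1", "i9", "i3"]
user_item = ["i1", "i3", "i2"]
print(rrf([user_user, item_item, user_item]))  # items seen across views rise
```

A weighted variant (per-view weights on the 1/(k + rank) terms) would be the natural "weighted averaging" alternative the question mentions.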
Enhancing Zero-Shot LLM Recommendations via Semantics and Collaborative Signals Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper aims to bridge the performance gap between non–fine-tuned large language models (LLMs) and well-trained conventional recommender models on rerank tasks. It proposes SCSRec (Semantic and Collaborative Signal-enhanced Recommendation), a training-free framework that augments user profiles by modeling both short-term and long-term user interests. The system retrieves candidate items using LLM-based embeddings and applies a conventional recommender model for preliminary ranking. Finally, it introduces a prompt-based reranking stage that leverages collaborative signals from the conventional model to guide the LLM. The authors claim that the proposed prompt-engineering strategy offers a competitive and cost-effective alternative to model fine-tuning for real-world recommendation scenarios. **Scope:** The paper addresses a relevant issue — how to improve recommendation performance of LLMs without fine-tuning. The focus on training-free integration is practically meaningful given current computational constraints in industrial settings. **Logic Flaw and Inconsistent Motivation**: The authors claim a training-free framework for ranking small candidate sets of items. However, the framework consists of multiple stages, including item embedding generation, semantic embedding retrieval, and a conventional recommender model needed to inject CF signals. The framework is therefore not training-free, since the conventional recommender model requires training. This discrepancy appears throughout the paper, particularly in the introduction, related work, and methodology sections. **Lack of Literature Review**: The proposed framework is a simple combination of existing techniques, such as LLM-based embedding retrieval, user profile augmentation, user intention analysis, and prompt-based reranking [1,2,3,4]. **Experiments**: The paper lacks a comprehensive experimental evaluation. It does not provide comparisons with strong baselines, such as state-of-the-art LLM-based recommenders [7,8] or conventional recommender models [5,6]. **Typos**: Line 68: "e.g., ously." Table 1: the best performance is not bolded. [1] Ren, Xubin, et al. "Representation learning with large language models for recommendation." Proceedings of the ACM Web Conference 2024. 2024. [2] Lyu, Hanjia, et al. "Llm-rec: Personalized recommendation via prompting large language models." arXiv preprint arXiv:2307.15780 (2023). [3] Zhang, An, et al. "On generative agents in recommendation." Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2024. [4] Wang, Yu, et al. "Drdt: Dynamic reflection with divergent thinking for llm-based sequential recommendation." arXiv preprint arXiv:2312.11336 (2023). [5] Rajput, Shashank, et al. "Recommender systems with generative retrieval." Advances in Neural Information Processing Systems 36 (2023): 10299-10315. [6] Tanner, Carmen, et al. "Actions speak louder than words." Zeitschrift für Psychologie/Journal of Psychology (2015). [7] Bao, Keqin, et al. "Decoding matters: Addressing amplification bias and homogeneity issue for llm-based recommendation." arXiv preprint arXiv:2406.14900 (2024). [8] Li, Xinhang, et al. "E4srec: An elegant effective efficient extensible solution of large language models for sequential recommendation." arXiv preprint arXiv:2312.02443 (2023). Refer to the weaknesses mentioned above. Lightly AI-edited
Optimizing for Persuasion Improves LLM Generalization: Evidence from Quality-Diversity Evolution of Debate Strategies Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces a Quality-Diversity-based prompt optimization framework (DebateQD) which iterates debate strategies for LLMs through prompt evolution. The authors propose two swappable optimization objectives: persuasion, to convince a judge regardless of truth, and truth, to collaboratively reach the correct answer. Through empirical experiments across three model scales of Qwen, they show that persuasion-optimized strategies achieve smaller train-test generalization gaps in reading comprehension tasks for most test conditions. As the main contribution, they claim that the proposed framework produces optimized prompts that elicit better generalization compared to truth-based prompt optimization. The demonstration of applying prompt optimization strategies instead of SFT provides a useful data point about debate training. Although the paper shows empirical improvements in train-test generalization, the contribution is limited due to some obvious unexplained design choices and missing work. - QD category and seed prompt design - The paper mainly uses 7 distinct categories along with seed prompts that fit each category in the debate protocol. However, the paper does not justify why these categories were picked or how the prompts were chosen. How would the choice of categories and prompts impact the effectiveness of the technique? Would missing categories or a specific seed prompt reduce the train-test generalization as well? - Insufficient analysis of robustness to judge and mutator - This paper uses the same Qwen model as mutator and judge to avoid confounders. However, the success of this technique depends strongly on (1) how prompts are mutated through evolution and (2) how judges interpret and decide on the persuasion statements. Analyzing the framework's robustness to model family and reasoning capability is critical for the contribution to stand. Additional study of the requirements on judges and mutators would also help theorize the technique. Some other questions also remain unstudied, such as the impact of judge biases. - Coverage of only one domain: Evaluating this technique in a single domain is insufficient evidence for generalizable reasoning. Similar experiments should be conducted on other domains such as math problems, coding tasks, etc. - Insufficient discussion of compute and cost limitations: The proposed technique requires substantially more inference due to its iterative tournaments. The added computation cost and time should be discussed in the limitation section. Need more interpretation of results: Why is Truth optimization more advantageous in certain cases? Across different domains and test cases, how should persuasion optimization vs. truth optimization be selected? Fully human-written
Optimizing for Persuasion Improves LLM Generalization: Evidence from Quality-Diversity Evolution of Debate Strategies Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose a prompt-optimization evolutionary algorithm that evolves diverse debate strategies through tournament-style competition, swapping in a persuasion reward as the fitness function to enforce diversity and subsequently improve LLM generalization. The authors use Elo ratings to identify winning prompts. - Clear, novel methodology and setup - The authors show that the argumentative setup pressures transferable skills - The authors evaluated their methodology across multiple model sizes, up to 72B - It is not entirely clear to me whether persuasion training significantly outperforms the truthful strategy based on the accuracy results presented in the paper: the generalization-gap metrics lack interpretability, and the absolute accuracy differences appear less significant for the 72B model - How much does the persuasion fitness function influence the factual accuracy of answers? Can persuasion-based optimization be misaligned with truthful outputs? - What does 400 correspond to in Eqs. 2-3? (See the sketch below.) Fully human-written
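On the question about the constant 400: assuming the paper follows the standard Elo formulation, 400 is the logistic scale factor, chosen so that a 400-point rating gap implies roughly 10:1 expected odds. A minimal sketch, where the K-factor of 32 is a common default and not a value taken from the paper:

```python
# Illustrative Elo update under the standard formulation.
def elo_update(r_a, r_b, score_a, k=32):
    """Return updated ratings after one debate; score_a is 1 (win),
    0.5 (draw), or 0 (loss) for strategy A as judged by the LLM."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))  # 400 = scale
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

print(elo_update(1200, 1000, score_a=1))  # favorite wins: small gain
```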
Optimizing for Persuasion Improves LLM Generalization: Evidence from Quality-Diversity Evolution of Debate Strategies Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper explores how different objectives (such as persuasion vs. truth-seeking) affect the generalizability of LLM debates. Using a multi-round debate tournament setup and an evolutionary Quality–Diversity algorithm, the authors evolve debate strategies and evaluate them on reasoning benchmarks. They find that persuasion-oriented strategies, while less accurate on training data, generalize better to unseen questions. The paper is methodologically creative. The experimental design provides a scientific way to quantify argument quality. The authors’ analysis of behaviors and the link between persuasion and generalization offers novel insights for readers in this domain. Comment order: descending importance Comment 1. Persuasion evolution appears to select longer, more elaborate prompts. This raises two risks: (i) verbosity bias in LLM judging, and (ii) optimization pressure toward manipulative tactics that could mislead weaker judges. Restricting the length limit (as the authors state on page 15) is not enough. The authors can check the following: (1) Normalize or penalize per-turn and total transcript length and re-estimate the persuasion–truth generalization gap. Doing so, the authors can quantify how much “verbosity” contributes to the generalization-gap reduction. (2) Construct a small sample where the “intuitive but false” option is deliberately tempting. The authors can then check whether deceptive strategies hurt truth discovery when the debate is made intentionally confusing. Comment 2. How did the authors construct the seven behavioral families used in the QD framework? If they are manually chosen, this needs more justification. Further, several include ethically problematic personas (“you’re god,” “serial killer”). This mixes stylistic and moral dimensions, making it unclear whether the observed improvement comes from diverse reasoning styles or simply from judge bias toward certain tones (stemming from specific personas). It also risks unsafe debate styles. The authors can try the following: (1) Replace these fixed categories with continuous style embeddings (e.g., discourse-level feature vectors). If persuasion still outperforms truth strategies, that would show robustness beyond the handcrafted taxonomy. (2) Redefine behavioral families using neutral rhetorical language (e.g., evidence-first, balanced-counterargument, uncertainty-aware) and check whether the generalization gap persists. (3) I would also like to see a win-rate matrix between all family pairs. Comment 3. Without further retraining, the authors can evaluate evolved persuasion strategies directly on unrelated long-context reading datasets. I would expect some degradation in performance, but it would serve as a good benchmark of whether the strategies possess cross-domain generalizability. Comment 4. Using the same LLM as a judge when evaluating texts generated by the debate agents can be problematic, as the judge may prefer patterns that match its own text-generation style. Again, instead of retraining everything, the authors can perform a quick check by replacing the judge with a different LLM. Alternatively, the authors can perform independent human evaluations on a selected sample to ensure that the same generalization gain persists. Comment 5. Many in-line citations are not properly formatted (e.g., the citations in lines 129 and 134). These citations should be within parentheses. See the weaknesses above. Lightly AI-edited
Optimizing for Persuasion Improves LLM Generalization: Evidence from Quality-Diversity Evolution of Debate Strategies Soundness: 2: fair Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a framework for evolving debate strategies in LLMs by optimizing either for persuasion (winning debates) or truth (correctness on the underlying task). The authors show that persuasion-optimized prompts achieve smaller train–test generalization gaps and greater diversity than truth-optimized ones. They claim this suggests competitive pressure fosters more generalizable reasoning. - **[Clear experimental framing]** The comparison between persuasion and truth objectives under fixed debate protocols is well-defined. - **[Relevance]**: Addressing generalization and overfitting in LLMs is an important problem. - **[Experiments with large and small models]** The experiments cover multiple model sizes (7B–72B), although these experiments only cover a single dataset (QuALITY) and a single model family (Qwen-2.5). - **[Clarity/Transparency]**: The authors' methodology and settings are clearly outlined, and the Appendix details (prompts, tournament design, etc.) are well documented. - **[Limited novelty of optimization scheme]** The authors have two main contributions: an optimization framework and observations made within the context of that framework. The optimization framework itself lacks novelty, as the proposed "evolutionary" approach is largely an aggregation of existing methods (prompt evolution + Quality/Diversity + Elo tournaments). No new optimization or algorithmic principle is introduced. - **[Observational/causal ambiguity]** Regarding the authors' second main contribution (observations about the effects of diversity), it remains unclear whether improved generalization stems from the persuasion objective itself or from artifacts such as response length, verbosity, or mutation randomness. Given that there are many moving pieces affecting the generalizability of the models, and that the authors' primary contribution is an empirical observation that X causes Y, I would have expected more thorough experiments isolating the underlying relationship. - **[Weak ablations]** The above issue results from incomplete experimentation and analysis in the paper. For example, no experiments isolate the effect of QD vs. random mutation, or control for stylistic differences between persuasion and truth prompts. - **[Surface-level diversity claim]** The reported “diversity” is measured via embedding distance, not semantic or strategic difference, and may arise trivially from prompt mutation or simply from the effects of longer responses on embedding distance. - **[Lack of statistical significance]** Some of the authors' claims do not appear to stem from statistically significant results. For example, no confidence interval is given for diversity (Fig. 3), and for accuracy we really only see a difference in model performance on the train set (on the test set it appears that Persuasion and Truth result in nearly identical accuracy, except for the 32B model, which somehow performs worse than the 7B model). - **[Limited experimental scope]** While the authors do study different model sizes, results are limited to reading comprehension (QuALITY) using one model family (Qwen 2.5). Broader reasoning or truth-seeking tasks would strengthen the claim. - **[Limited optimization framework]** The authors attempt to make a broad, sweeping claim about the relationship between optimizing for persuasiveness vs. correctness, yet they only use a simple prompt-based optimization scheme. To make such a broad claim I would have expected to see multiple paradigms used. - **[Results not justifying claims]** When viewed in aggregate, the limited nature of the authors' experiments and analysis does not justify the broad claims that they make. While the abstract makes the paper and its findings sound quite grand, it is, in reality, a fairly shallow and incomplete investigation. I would have found it helpful to see both more thorough and broader experiments that get at the heart of the authors' claims about the relationship between generalization and persuasiveness. See Weaknesses Fully human-written
Identification of Task Affinity for Multi-Task Learning based on Divergence of Task Data Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper predicts task affinity for multi-task learning using tabular dataset statistics; static features characterize the relationships between tasks without the exhaustive pairwise joint training required by previous works. 1. The work avoids the combinatorial cost of training on all task pairs, enabling scalable and inexpensive estimation of pairwise affinities for large task sets. 2. The method efficiently identifies high-performing task groups. 1. The method is validated only on tabular MTL, so it is unclear whether the findings transfer to vision or NLP. Evaluations on standard vision multi-task benchmarks (e.g., Taskonomy) are needed to establish external validity. 2. Task similarity can evolve during training, but the proposed feature-based metric is essentially static and costly to refresh. Prior work such as Selective Task Group Updates for Multi-Task Optimization [1] and GRAD-TAE [2] can track changing task affinities without exhaustive joint training. [1] Selective Task Group Updates for Multi-Task Optimization (ICLR 2025) [2] Scalable Multitask Learning Using Gradient-based Estimation of Task Affinity (ACM SIGCOMM 2024) 3. The estimator uses a hand-picked subset of features. When indicators disagree, the decision rule and its justification are unclear. A more principled aggregation is needed to show that dependence on such statistics is appropriate. Please respond to these concerns: external validity beyond tabular data, including results on standard vision MTL benchmarks such as Taskonomy, and how your approach handles non-stationary task relations compared to methods that track changing affinities without exhaustive joint training. Also clarify how conflicting feature signals are resolved, and provide justification for relying on simple statistics. Lightly AI-edited
Identification of Task Affinity for Multi-Task Learning based on Divergence of Task Data Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a feature-based approach for predicting pairwise task affinity in multi-task learning (MTL) for tabular datasets. The paper includes evaluation on three different benchmark datasets with different characteristics, and the experiments cover multiple aspects including prediction accuracy, computational cost, and downstream task grouping performance. 1. The biggest limitation is that this approach only works for tabular datasets with shared input dimensions. The authors acknowledge this (line 64: "We assume a common input dimension p across all tasks"), but this severely limits applicability. Most interesting MTL problems in computer vision or NLP don't have this property. The paper should discuss more clearly when this assumption holds and provide examples beyond the three benchmarks tested. 2. The paper is quite empirical. While the hypothesis that "tasks with more similar data distributions benefit from joint training" is intuitive, there's little theoretical justification for why these specific features should predict MTL gains. Some analysis connecting feature values to properties that affect gradient-based optimization or representation learning would strengthen the work. 3. Some parts are repetitive (e.g., the motivation for feature-based prediction is stated multiple times). Some experimental details are unclear (e.g., how exactly is the train/test split done? Are results averaged over multiple random splits?) 4. Table 4: "Std Dev(σ)", why not just σ? 5. In Table 3, why does the optimal number of groups k vary so much (2-15)? How sensitive is the final performance to choosing k? 6. Can you provide more intuition or theoretical justification for why these specific features should predict MTL gains? For instance, why should Energy Distance be particularly predictive? (A sketch of the metric follows below.) See weaknesses. Fully AI-generated
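To make the Energy Distance question concrete, here is a minimal sketch of a sample (V-statistic) estimator of the squared energy distance between two task datasets. The synthetic data and dimensions are illustrative assumptions; the paper's exact estimator and preprocessing are not reproduced here.

```python
# Squared energy distance: D^2(X, Y) = 2*E||x - y|| - E||x - x'|| - E||y - y'||.
# It is zero iff the two distributions match, which is one intuition for
# why it could flag task pairs whose input distributions diverge.
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance_sq(X, Y):
    d_xy = cdist(X, Y).mean()  # E||x - y||
    d_xx = cdist(X, X).mean()  # E||x - x'|| (biased V-statistic: keeps diagonal)
    d_yy = cdist(Y, Y).mean()  # E||y - y'||
    return 2 * d_xy - d_xx - d_yy

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 5))   # task A inputs
Y = rng.normal(0.5, 1.0, size=(200, 5))   # shifted task B inputs
print(energy_distance_sq(X, Y))  # > 0 for mismatched distributions
```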
Identification of Task Affinity for Multi-Task Learning based on Divergence of Task Data Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes an approach to predict task affinity in multitask learning (how beneficial it is to train two tasks together), based on static, precomputed features of the task datasets, without requiring expensive joint training for all task pairs. - Instead of treating task relationships as a black box, the paper quantifies statistical and structural similarities between tasks using easily computed dataset-level metrics. These include measures of dataset size, input-space distances, distributional divergence (e.g., energy distance, feature mean gaps), and representation similarity (e.g., cosine similarity, PCA alignment). - The authors construct pairwise feature vectors from these metrics and train a quadratic regression model to predict MTL gains (see the sketch below), defined as the relative improvement in task performance when trained jointly versus independently. Crucially, the model is trained only on a small subset of task pairs with known ground-truth MTL gains. The framework was tested on three standard tabular MTL benchmarks: the School, Chemical, and Landmine datasets. The proposed model outperformed prior task-affinity estimation methods, such as TAG (gradient-based affinity estimation) and GRAD-TAE (gradient-projection model), in both prediction accuracy and computational efficiency. For example, the proposed method achieved correlations up to 0.58 with true MTL gains on the Landmine dataset, while requiring only a fraction of the training time needed by baselines. Moreover, when applied to task-group selection using beam search and semidefinite programming clustering, the predicted affinities led to superior task groupings with lower total loss compared to alternatives. - This work contributes a scalable and interpretable method for predicting task affinities using simple dataset-derived features, reducing the computation for MTL training. - It provides a practical pathway to automatic task grouping, particularly for tabular data with many tasks. The authors’ findings support the hypothesis that tasks with more similar data distributions yield more positive transfer, offering both theoretical and empirical validation. - The approach advances MTL research by improving efficiency and accuracy in task affinity prediction, enabling large-scale applications of multi-task learning. - The framework focuses on tabular data and extracts dataset features specific to tabular inputs. It would be better to understand how this framework can be applied in a generic setting, for example, in the scenarios of deep neural networks & image/text datasets. - The method predicts pairwise task affinity. Would it be possible to extend the framework to predicting higher-order task affinities, such as the multitask learning results when training on more than two tasks? - Is there any theoretical analysis on the prediction accuracy of using a linear model on the extracted features? Please see the weaknesses. Heavily AI-edited
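As a companion to the summary above, a minimal sketch of the affinity-prediction step as described: a degree-2 (quadratic) regression from pairwise dataset features to MTL gain, fit on a small labeled subset of task pairs. The feature names and data are hypothetical; only the quadratic-regression-on-few-pairs structure mirrors the summary.

```python
# Sketch: predict MTL gain for many task pairs from features of a few.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# rows: task pairs; cols e.g. [energy_dist, size_ratio, cosine_sim] (hypothetical)
pair_features = rng.random((40, 3))
mtl_gain = rng.normal(size=40)  # synthetic stand-in for joint-vs-solo gain

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(pair_features[:25], mtl_gain[:25])  # only a few pairs labeled
pred = model.predict(pair_features[25:])      # cheap estimates for the rest
print(pred[:3])
```

The appeal is that labeling 25 pairs (by actually training them jointly) is far cheaper than labeling all pairs, which is the scalability argument the summary makes.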
MetaRuleReasoner: Beyond Chain-of-Thought—Neural Rule-Based Reasoning for Reliable Mathematical Computation Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces MetaRuleReasoner, a neural rule-based reasoning approach that aims to address limitations of chain-of-thought (CoT) reasoning in mathematical computation. While CoT generates natural language reasoning steps probabilistically, the proposed approach applies learned computational rules systematically, with explicit rule application and verification at each step (illustrated in the sketch below). Experiments demonstrate that MetaRuleReasoner achieves 100% accuracy on multi-digit arithmetic tasks, while CoT models like GPT-4 show performance degradation as problem complexity increases. - The paper is well motivated. - The paper is easy to follow. - The formatting of this paper seems to violate ICLR's rules. - The baselines seem out of date; the authors should consider adding stronger baselines. - Why can MetaRuleReasoner achieve 100% accuracy across all tasks? Is this overfitting? - Can MetaRuleReasoner generalize to other math-related tasks? Fully human-written
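For readers unsure what "explicit rule application and verification at each step" could look like, a plain symbolic sketch for multi-digit addition follows. This illustrates only the general idea of stepwise rule application with a final check; it does not reproduce the paper's neural rule representation.

```python
# Symbolic sketch of rule-by-rule multi-digit addition with verification.
def add_with_rules(a: str, b: str) -> str:
    a, b = a.zfill(len(b)), b.zfill(len(a))    # align digit positions
    digits, carry = [], 0
    for da, db in zip(reversed(a), reversed(b)):
        s = int(da) + int(db) + carry          # rule: digit addition
        digits.append(str(s % 10))             # rule: write units digit
        carry = s // 10                        # rule: propagate carry
    if carry:
        digits.append(str(carry))
    result = "".join(reversed(digits))
    assert int(result) == int(a) + int(b)      # verification step
    return result

print(add_with_rules("478", "953"))  # '1431', verified exactly
```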
MetaRuleReasoner: Beyond Chain-of-Thought—Neural Rule-Based Reasoning for Reliable Mathematical Computation Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces "neural rule-based reasoning" as an alternative to chain-of-thought reasoning for mathematical computation in large language models. MetaRuleReasoner achieves 100% accuracy on multi-digit arithmetic tasks by applying learned computational rules systematically, while CoT models (including GPT-4) show performance degradation with increasing complexity. Unlike CoT's pattern-based approach that generates natural language explanations, neural rule-based reasoning decomposes problems into explicit rule applications with systematic composition and verification at each step. - The paper is clearly written. - The paper format violates ICLR'26 requirements. - Poor baselines. Experiments are conducted with non-reasoning LLMs (such as GPT-4, Google-PaLM, Qwen-72B). How does MetaRuleReasoner's performance compare to OpenAI's O-series models or Google Gemini Thinking? Please refer to the "Weaknesses" Section Lightly AI-edited
MetaRuleReasoner: Beyond Chain-of-Thought—Neural Rule-Based Reasoning for Reliable Mathematical Computation Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces MetaRuleReasoner, a neural rule-based reasoning system, as a distinct alternative that achieves systematic reliability through explicit rule application and complete domain coverage. Maybe nothing. 1. None of the figures or tables include hyperlinks. 2. The paper is not using the ICLR template. I’m unsure whether this alone warrants a desk reject, but it could introduce potential unfairness in overall paper length. 3. The motivation in the introduction (the three deficiency claims) has no supporting citations or experimental evidence. 4. The writing feels rushed, the references are very sparse, and most are from 2023–2024; given the rapid progress in CoT techniques, the paper’s timeliness is questionable. I don’t think this is an appropriate submission for ICLR 2026. 5. The method’s generality is clearly very narrow; the paper focuses on toy tasks, and the models are quite outdated—for example, it uses Llama 2 and Qwen—while the latest reasoning models are not considered at all. N/A Lightly AI-edited
MetaRuleReasoner: Beyond Chain-of-Thought—Neural Rule-Based Reasoning for Reliable Mathematical Computation Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes "neural rule-based reasoning" as an alternative to chain-of-thought (CoT) reasoning for mathematical computation in language models. The authors assert that their MetaRuleReasoner approach achieves perfect accuracy on arithmetic tasks where CoT methods exhibit performance degradation. The method purportedly integrates the reliability of symbolic AI with the flexibility of neural networks through systematic learning and application of explicit computational rules. 1. The paper successfully identifies genuine limitations of chain-of-thought approaches, especially concerning reliability in systematic domains such as arithmetic. 2. The conceptual differentiation between pattern-based and rule-based reasoning presents an interesting avenue worthy of investigation. 3. The manuscript demonstrates clear writing and solid structural organization. 1. Questionable technical originality: Although the paper frames its approach as fundamentally novel, the underlying concept of training neural networks to execute algorithmic procedures has established precedent. Comparable methodologies appear throughout the Neural Algorithmic Reasoning, Neural Program Induction, and Neuro-symbolic AI literature. 2. Constrained task environment: The evaluation concentrates on arithmetic, which while genuinely rule-based, constitutes an exceedingly narrow subset of mathematical reasoning. The paper provides no evidence of applicability to more sophisticated domains requiring higher-order reasoning capabilities. 1. **ADD MORE COMPARISONS PLEASE**. The paper positions itself as entirely separate from neuro-symbolic approaches while failing to adequately differentiate itself from recent developments in this domain. 2. **Missing ablation analysis**: The absence of component-wise performance analysis prevents understanding which elements contribute most significantly to the results. Please add more ablation analysis. Moderately AI-edited
Fairness-Aware EHR Analysis via Structured Missing Pattern Modeling and Adversarial Low-Rank Adaptation Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The manuscript presents a two-stage framework for fairness-aware prediction on EHR time series - FEMALA (Fairness-Aware EHR Analysis via Missing Pattern Modeling and Adversarial Low-Rank Adaptation). In Stage 1, the model jointly encodes EHR temporal dynamics with a Segment-Aware Temporal Encoder and structured missingness with a Structured Missing Pattern Encoder. These are then fused. In Stage 2, the model learns an autoencoder-based sensitive-attribute embedding (gender, race, insurance, age, marital status), estimates mutual information between this embedding and the task representation using MINE (see the sketch below), and then performs adversarial fine-tuning over LoRA adapters to minimize that MI while keeping task performance. In-hospital mortality (IHM) and 30-day readmission (READM) are the two tasks used in the experiments on MIMIC-III/IV. 1. Clearly defined two-stage system: Stage 1 creates a representation by jointly encoding EHR segments and structured masks. Stage 2 performs fairness-specific adaptation. It connects two ideas: modeling of structured missingness and fairness-aware adaptation. 2. The dual encoder with the Structured Missingness Encoder and the Missing-Pattern-Guided Adaptive Fusion module provides a principled way to turn structured missingness patterns into useful signal. The low-rank adversarial fine-tuning that minimizes mutual information with sensitive attributes is a parameter-efficient strategy that preserves accuracy while improving fairness. 3. Results on the MIMIC-III/IV datasets and two clinical tasks (IHM, readmission) show overall gains in predictive performance and fairness. Includes ablations. 1. Combining missingness modeling with fairness is valuable, but these components draw on known ideas (Transformers for time series, cross-attention fusion, MINE-based adversarial training). 2. The empirical scope is too narrow to support general conclusions: results are on two related MIMIC datasets and two binary tasks from the same health system. 3. The proposed MAF uses learned gating and attention over mask-informed weights. The exact dimensions and normalization of α are unclear. 4. The method assumes access to multiple categorical sensitive attributes (race, gender, insurance, age group, marital status) and encodes them. Although mentioned as future work, evaluating high-cardinality/continuous attributes is important. From a real-world practicality standpoint, these sensitive attributes may not be made available; the manuscript does not discuss settings where sensitive attributes are partially available or unavailable. 5. Fairness results are presented for EO and EDDI (group metrics). No per-patient or path-level analysis is made available. 6. Limited comparisons to missingness-as-signal baselines (dual-stream or mask-encoder approaches). 1. Could you please report train on MIMIC-III => test on MIMIC-IV (and vice versa) to evaluate whether fairness gains remain under dataset shift? 2. Since one-stage adversarial training is unstable, could you please share training curves for Stage 2 and report variance across runs to show the fairness gains are robust? 3. Could you please quantify the dependence between SME features and sensitive attributes? Fully human-written
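For context on the MINE-based Stage 2 described above, here is a minimal PyTorch sketch of the Donsker-Varadhan lower bound that MINE estimates; as summarized, FEMALA would minimize this estimate with respect to the task encoder (through LoRA adapters, omitted here) while a critic maximizes it. The dimensions and critic architecture are illustrative assumptions, not the paper's settings.

```python
# Donsker-Varadhan bound: I(Z; S) >= E_joint[T(z,s)] - log E_marg[e^T(z,s)].
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(64 + 16, 128), nn.ReLU(), nn.Linear(128, 1))

def mine_lower_bound(z, s):
    """z: task representations (B, 64); s: sensitive embeddings (B, 16)."""
    joint = critic(torch.cat([z, s], dim=1)).mean()
    s_shuffled = s[torch.randperm(s.size(0))]  # break pairing -> marginals
    marg = torch.logsumexp(critic(torch.cat([z, s_shuffled], dim=1)), dim=0) \
           - torch.log(torch.tensor(float(s.size(0))))
    return joint - marg  # MI estimate in nats

z, s = torch.randn(32, 64), torch.randn(32, 16)
print(mine_lower_bound(z, s).item())  # near 0 for independent inputs
```

The instability the question above raises is a known property of this min-max setup, which is one reason training curves and variance across runs would be informative.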
Fairness-Aware EHR Analysis via Structured Missing Pattern Modeling and Adversarial Low-Rank Adaptation Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The work proposes FEMALA (Fairness-Aware EHR Analysis via Structured Missing Pattern Modeling and Adversarial Low-Rank Adaptation), a two-stage framework for fair and accurate clinical outcome prediction from Electronic Health Records (EHRs). Recognizing that structured missingness in EHRs, non-random patterns of missing data correlated with demographic attributes, exacerbates algorithmic bias, FEMALA first jointly models observed clinical time-series data and structured missingness using a dual-encoder architecture with a novel adaptive fusion module. This enhances representation robustness and implicitly improves fairness. In the second stage, it applies adversarial fine-tuning via low-rank adaptation (LoRA) to minimize mutual information between task representations and sensitive attributes, effectively decoupling predictions from demographic bias while preserving accuracy. Evaluated on MIMIC-III and MIMIC-IV for in-hospital mortality and readmission prediction, FEMALA achieves state-of-the-art results in both AUROC (+2.3%) and fairness metrics like Equalized Odds (−2.9%), outperforming existing fairness-aware and imputation-based baselines. 1. A key strength of FEMALA is its recognition that missing data in Electronic Health Records (EHRs) is not random but systematically linked to demographic disparities, a phenomenon known as structured missingness. 2. FEMALA introduces a principled two-stage pipeline that separates representation learning from fairness enforcement, addressing a major limitation of existing methods that jointly optimize conflicting objectives. 1. The work restricts itself to binary classification tasks and categorical sensitive attributes, which narrows its applicability in real-world clinical settings where outcomes can be multi-class (e.g., disease severity levels) or continuous (e.g., length of stay), and where sensitive attributes may include continuous or high-cardinality variables such as socioeconomic status or zip code. This constraint limits the generalizability of FEMALA to more complex prediction scenarios commonly encountered in healthcare analytics and may require non-trivial architectural or methodological extensions to accommodate such cases. 2. Another weakness is the exclusive focus on multivariate time-series EHR data, excluding other rich modalities like clinical notes, imaging, or genomic data. Modern clinical decision support systems increasingly rely on multimodal inputs, and missingness patterns in these additional modalities could further compound fairness issues. By not addressing multimodal missingness or cross-modal bias propagation, the framework may overlook critical sources of disparity that arise when integrating heterogeneous data types, thus limiting its relevance in the context of contemporary multimodal clinical AI systems. 3. The evaluation, while thorough on MIMIC-III and MIMIC-IV, is confined to ICU populations from a single institution (Beth Israel Deaconess Medical Center), raising concerns about external validity. 
Algorithmic fairness is highly context-dependent, and biases manifest differently across healthcare systems, geographic regions, and patient demographics. Without validation on more diverse datasets—such as those from community hospitals, international cohorts, or outpatient settings—it remains uncertain whether FEMALA’s gains in fairness and accuracy would generalize beyond the specific structural and demographic characteristics of the MIMIC databases. 4. The fairness intervention in FEMALA assumes access to ground-truth sensitive attributes during fine-tuning, which may not always be available or legally permissible in practice due to privacy regulations or institutional policies. While the paper acknowledges this implicitly by using attributes like race and insurance status, it does not explore or propose alternatives for settings where such attributes are missing or must be inferred, nor does it address the ethical implications of requiring sensitive data to enforce fairness—a paradox that has been noted in prior fairness literature. Please refer to the weaknesses Fully AI-generated
Fairness-Aware EHR Analysis via Structured Missing Pattern Modeling and Adversarial Low-Rank Adaptation Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 0: Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper aims to address the issues of fairness and data missingness in EHR data simultaneously. Overall, the research topic is not novel, and its significance is limited for a top machine learning conference. Both fairness and missingness have been extensively studied in existing works on AI in healthcare, while this paper mainly attempts to combine these two aspects. The novelty of the proposed “ensemble” method is limited. In addition, the baselines used for comparison are not adequate, and the evaluation focuses only on naive empirical studies, without any simulated settings that include different missingness mechanisms or synthetic data where the ground truth is known. I suggest that the authors revise the article to address the following weaknesses and consider submitting it to a clinical informatics journal as a general benchmark or empirical study. Please find my detailed comments below. The overall writing is clear and easy to follow, and the figures are well presented. 1. Lack of novelty and significance. As mentioned in the overall summary, the work lacks novelty. I suggest that the authors reconsider the target audience of this study and revise the manuscript for submission to a clinical informatics journal, where it may be of greater interest. 2. Experimental design. The experimental design requires substantial improvement to meet the standards of a rigorous study. (1) Baseline models The baseline models are not sufficient. More deep learning–based models should be included, following prior works. For example: (a) Liu, Mingxuan, et al. “Handling missing values in healthcare data: A systematic review of deep learning-based imputation techniques.” Artificial Intelligence in Medicine 142 (2023): 102587. (b) Penny, Kay I., and Ian Atkinson. “Approaches for dealing with missing data in health care studies.” Journal of Clinical Nursing 21(19–20) (2012): 2722–2729. State-of-the-art fairness methods for structured EHR data should also be considered: (a) Berk, Richard, et al. "A convex framework for fair regression." arXiv preprint arXiv:1706.02409 (2017). (b) Williamson, Robert, and Aditya Menon. "Fairness risk measures." International conference on machine learning. PMLR, 2019. In addition, naive MVI (missing value imputation) strategies such as mean/median imputation and MICE should be compared. The proposed method must be clearly justified as offering unique advantages over these naive baselines. It is possible that standard MVI strategies could yield comparable or even better results for downstream tasks; if so, the proposed method would appear computationally expensive and overly black-box for clinical decision-making requirements. (2) Simulations Simulation studies are needed. The authors should first generate three different missingness mechanisms using real data, and ideally include experiments on purely simulated data as well. (3) Reporting uncertainty The paper should report standard errors or confidence intervals, since metrics such as EOR and DPR can fluctuate significantly when data are imbalanced. 
Could the authors provide more details to justify the unique challenges faced by temporal EHR data? This aspect is ambiguous in the current writing. The evaluation currently focuses only on prediction performance, where the outcome is treated as binary. For instance, if the goal is survival analysis, the current binary outcome prediction is insufficient. More experimental designs are needed to demonstrate that the proposed solution has unique advantages for time-series data. The current evaluation appears somewhat irrelevant to time-series data and may be misleading. Lightly AI-edited