ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 15899 (21%) | 4.43 | 3.58 | 3687 |
| Heavily AI-edited | 3233 (4%) | 4.22 | 3.59 | 2990 |
| Moderately AI-edited | 7082 (9%) | 4.20 | 3.61 | 2722 |
| Lightly AI-edited | 16648 (22%) | 4.15 | 3.68 | 2746 |
| Fully human-written | 32938 (43%) | 4.13 | 3.62 | 2917 |
| Total | 75800 (100%) | 4.21 | 3.62 | 3026 |
Title Ratings Review Text EditLens Prediction
Intra-Trajectory Consistency for Reward Modeling Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses a key limitation in traditional reward modeling: the reliance on coarse, response-level preference labels. This reliance hinders the model's ability to identify specific high-quality segments within a response, often leading to poor generalization. To mitigate this, the authors introduce Intra-Trajectory Consistency Regularization (ICRM), a novel method designed to propagate response-level supervision to a more fine-grained, process level. The core mechanism enforces consistency between the reward values of adjacent generation steps, weighted by the next-token generation probability from a separate, frozen generator model. This encourages the reward model to learn smoother and more meaningful reward landscapes without incurring the high cost of manual, process-level annotations. The empirical validation is comprehensive, demonstrating that ICRM achieves statistically significant improvements on the RewardBench benchmark and that these gains translate directly into superior performance in downstream applications, including guiding DPO policy optimization and enhancing selection accuracy in Best-of-N inference-time verification. However, discussions of reward modeling in this context typically center on PPO-like or GRPO-like RL methods; therefore, the paper would be strengthened by presenting more extensive results in this area. Furthermore, a key motivation for DPO is its relative simplicity and convenience compared to traditional RL-based methods. This paper, however, trains a separate reward model to generate improved preference data, arguably re-introducing complexity and additional overhead. To improve the paper's soundness, the authors should provide more direct comparison experiments with established RL-based methods such as PPO, GRPO, and RLVR, covering both downstream results and the associated time and resource costs. In summary, while the proposed method is interesting, the paper requires significant revision to address these limitations before it can be considered for acceptance at the ICLR 2026 conference. The primary strength of this paper is its novel, intuitive, and highly practical regularization method. By linking reward consistency to generator probabilities, it offers a good approach to injecting fine-grained learning signals from coarse-grained data, presenting a significant practical advantage over methods that rely on labor-intensive, step-wise human annotations. This core contribution is supported by rigorous experimental evaluation. The authors convincingly demonstrate that improvements on a standard benchmark like RewardBench are not merely superficial but yield tangible benefits in critical downstream tasks, such as RLHF and inference-time verification, with results further corroborated by human evaluations. What's more, the authors provide in-depth analysis, including extensive ablation studies that substantiate the design choices of the proposed loss function, an investigation of length bias, and robustness checks against generator mismatch. The experimental results clearly demonstrate the effectiveness of the method. 
Despite its strengths, the paper possesses several weaknesses that should be addressed: 1) The evaluation lacks sufficient comparison to RL-based methods. Discussions of reward modeling often center on PPO-like or GRPO-like algorithms; thus, the paper would be strengthened by presenting extensive results comparing ICRM-enhanced DPO against these methods on more benchmark datasets. Furthermore, a key motivation for DPO is its simplicity relative to complex RL-based pipelines. The proposed method re-introduces a separate, trained reward model, which adds complexity and overhead. To establish the paper's soundness, a direct comparison with methods like PPO, GRPO, and RLVR-based methods is essential, and this comparison should evaluate not only downstream performance but also the associated time and resource costs. 2) The theoretical motivation in Section 3.1 is presented as a formal derivation from a Bayesian framework, yet it appears to be overstated. The step equating a scalar reward value with a conditional probability is a significant conceptual leap rather than a mathematically rigorous step. This framing undermines the credibility of the stated theoretical foundation. The work would be more defensible if this section were reframed as providing the intuition and motivation for the approach, rather than presenting it as a formal proof. 3) The methodology introduces several complex mechanisms, such as the mean-centered calibration technique and the mutually weighted binary cross-entropy loss, without adequate justification in the main text. The authors do not sufficiently explain why these specific formulations were chosen over simpler alternatives, such as a standard L1 or L2 loss. While these justifications are provided in the appendix, their absence from the core methodology section makes the design feel arbitrary and less compelling. Integrating these critical design rationales into the main paper is recommended. 4) The introduction and related work section are not sufficient. The authors should more explicitly differentiate their work from process-supervised models like PRM, emphasizing the primary advantage of achieving fine-grained supervision without requiring fine-grained labels. Furthermore, a sharper distinction should be drawn between ICRM, which regularizes the final reward values based on generation dynamics, and other methods (e.g., GRM) that regularize the model's hidden states. See weakness. Lightly AI-edited
Intra-Trajectory Consistency for Reward Modeling Soundness: 1: poor Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces an intra-trajectory consistency regularization method to refine reward models by propagating coarse-grained, response-level supervision into fine-grained learning signals using a Bayesian-inspired principle. The approach improves performance on RewardBench and enhances DPO-aligned policies. 1. The theoretical analysis effectively supports the arguments. 2. Introducing token-wise information during the reward model training phase demonstrates innovation. 1. The additional computational overhead introduced during training is non-negligible. Training a 2B reward model along with a 2B generator should be compared with an ablation study involving training a standalone 3B–4B reward model for a more appropriate evaluation. 2. Generator mismatch is a common issue. On one hand, during RLHF, the reward model size may be significantly smaller than the actor model. On the other hand, the distribution of the actor model can shift considerably as training progresses. Although the authors conducted related experiments in the Appendix, I believe these are insufficient and require more ablation studies involving different model sizes, model series, and training steps. 3. The baselines are overly simplistic. Many methods exist for enhancing reward models, and GRM is already over a year old. I suggest including more baselines for comparison. 1. Why do larger training sample sizes lead to worse performance in Table 1 and Table 2? I also checked the original GRM paper, and the data presented there differs from what is shown in your paper. Is there an explanation for this discrepancy? 2. Why use DPO? The main purpose of using a BT model is to address reward modeling issues in PPO. For algorithms like DPO, which only require preference information, directly training on the original preference dataset would be more straightforward. Introducing an intermediate reward model seems redundant and may add noise, which I find confusing. I recommend adding experiments with PPO. Moreover, your baseline GRM has been evaluated on PPO and BoN, not DPO. Lightly AI-edited
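Both reviews above refer to Best-of-N (BoN) inference-time verification with a reward model. As background for that protocol, here is a minimal, generic sketch; the pair-encoding tokenizer call and the single-logit reward head are assumptions made for illustration, not details taken from the paper.

```python
import torch

def best_of_n(prompt, candidates, reward_model, tokenizer, device="cpu"):
    """Generic Best-of-N selection: score each candidate response with a
    (prompt, response) reward model and return the highest-scoring one."""
    scores = []
    with torch.no_grad():
        for response in candidates:
            # Assumption: the reward model is a sequence classifier whose single
            # logit serves as the scalar reward for the (prompt, response) pair.
            inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True).to(device)
            scores.append(reward_model(**inputs).logits.squeeze().item())
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_idx], scores
```

Under this protocol the reward model only re-ranks candidates produced elsewhere, so any gain has to come from the ranking it induces.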
Never Saddle: Reparameterized Steepest Descent as Mirror Flow Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper studied and compared the dynamical behaviors of different types of steepest descent algorithms. The authors focused on deep linear diagonal models and showed that a steepest flow in the weight matrices with respect to $L_p$ norm induces a "steepest mirror flow" in the end-to-end matrix. They analyzed how the choice of $p$ influences the dynamics, convergence rate and the effects of weight decay. Experiments on both linear and practical models are provided for the comparison between different algorithms. The paper extends the study of reparametrization in gradient flow to steepest flow. This provides a useful platform for studying how reparametrization and optimization geometry interact to shape the training dynamics. I find this extension meaningful and valuable. Several established results are quite interesting: * Lemma 4.4: The balance equation for steepest flow in training deep diagonal linear network is new and interesting. * Theorem 4.6: The equivalence between steepest flow with reparametrization and "steepest mirror flow" is quite interesting. To my knowledge, previous work only showed the equivalence between gradient flow with reparametrization and mirror flow. * Corollary 4.8: It sheds light on how different choices of optimization geometry and reparametrization (network depth) could affect the convergence rate. The main weakness is that the main claims are not well supported by the presented theory. The title begins with "Never saddle" and the authors emphasized both in the abstract and in Contributions that "we prove that steeper descent (lower $q$) simultaneously escapes saddles faster and supports feature learning", and that "decoupled weight decay ... yielding more stable feature learning". While these statements are strong and intriguing, I do not find them adequately supported by the theoretical results. **Regarding saddle escaping** After Lemma 4.4 and in Figure 3, the authors claimed that "the (curved) path away from zero is shorter for smaller $q$, indicating faster saddle escape". This reasoning is not logically sound: First, the notion of a "shorter path" is ambiguous (e.g., between which endpoints?). Second, the geometry of the invariant manifold alone does not determine how fast the algorithm moves along the manifold. Besides the above, the only theoretical evidence for the claim that "lower $q$ escapes saddles faster" is based on two results: (i) Corollary 4.8: under initialization "$w_1 = 0$ and $w_i = \lambda > 0$", a smaller $q$ yields a larger coercivity constant $\mu$; and (ii) Theorem 4.2: when $R$ is a separable Bregman function, a larger $\mu$ yields a faster linear convergence rate. However, I do not find these two results sufficient for the claim, for the following reasons: * It is not established that the $R$ used in Corollary 4.8 (or Theorem 4.6) is a separable Bregman function. 
Thus it is unclear whether Theorem 4.2 applies; * The considered initialization "$w_1 = 0$ and $w_i = \lambda > 0$" lies in a measure-zero set; * Only a subset of saddles (those near the considered initializations) are analyzed, whereas, as noted by the authors, other saddles exist elsewhere in the parameter space. Therefore, even if Theorem 4.2 is applicable, the results only indicate that: for certain specific saddles and for a measure-zero set of initializations in their neighborhoods, a smaller $q$ leads to a faster escape. Based on these results alone, I thus think it is overstated to use the title "never saddle" or to claim, as a general statement, that smaller $q$ results in faster saddle escape. **Regarding feature learning** The authors claimed that "smaller $q$ supports feature learning". However, in the theory section (Section 4), I could not find a clear discussion on how the choice of $q$ affects feature learning. The following two aspects might be relevant, but it is unclear how they support the claim: * In Corollaries 4.11 and 4.12, the authors discussed whether the function $R_{L_p, L}$ is a Legendre function under different $q$, network depth $L$, and metric exponent $m$. Then in the left panel of Figure 2, the authors indicated that a metric exponent slightly smaller than $1$ corresponds to a feature learning regime (the green band in the figure). However, it is not explained why this is true or how the metric exponent relates to feature learning. For example, why does $m=0.95$ lead to feature learning, while $m=0.5$ or $m=1.5$ does not? * In Theorem 4.14, the authors discussed the on-manifold regularization induced by weight decay under different $q$. However, as shown in Table 1, when $q=2, L=2$, the decoupled weight decay induces an $L_1$ bias; whereas when $q=1, L=2$, it induces an $L_{3/2}$ bias, which is less sparse than $L_1$. As sparsity biases have been linked to feature learning in some settings, this observation actually suggests that $q=2$ may facilitate feature learning instead of a smaller $q$. The authors also claimed that "decoupled weight decay, as in AdamW, stabilizes feature learning by enforcing novel balance equation". The balance equation in Lemma 4.4 indicates that the weight decay encourages the weights to become more balanced during training. But it is not clear to me how this balancing translates into more "stable" feature learning. I suggest that the authors clarify in the paper what they mean by "feature learning" and "stable feature learning", and then compare the algorithms with different values of $q$ as well as with and without weight decay. **Initial recommendation** Overall, while the study of how optimization geometry and reparametrization affect the dynamics and the proposed framework are very interesting, I find the main claims of this paper, in particular those on saddle escaping and feature learning, insufficiently supported. Therefore, my initial recommendation is rejection. I find some parts of the presentation unclear. Please see the questions below. **Notations** * In Example 3.2, is the expression "a deep diagonal linear network $g(m,w)=m \odot w$" actually referring to a shallow network, as there are only two weight matrices $\operatorname{Diag}(w), \operatorname{Diag}(m)$? * In Definitions 3.3 and 4.5, does $\mathbb{I}_n$ denote the all-one vector? This symbol is commonly used for the identity matrix, so clarification would be helpful. * In Corollary 4.8, the notation $w_i=\lambda>0$ is ambiguous. 
Does it mean the vector $w_i$ has all entries equal to $\lambda$? **Regarding Theorem 4.6** * $\lambda$-$L_p$-balancedness in Definition 4.5 is defined for a shallow network with two weight matrices. Then what does "$\lambda$-$L_p$-balanced with respect to the first parameter $w_1$" mean in Theorem 4.6, which is stated for deep networks? * In Lemma 4.4 and Theorem 4.6, in what sense is "almost everywhere" meant (e.g., "steepest descent satisfies a separable $L_p$-mirror flow almost everywhere")? Does it mean the result may fail only for a measure-zero set of initializations? * Corollary 4.12 discussed when the function $R_{L_p, L}$ is a Legendre function. However, in Theorem 4.6, there seems to be no restriction on $p$ or $L$. Does Theorem 4.6 indicate that such an $R_{L_p, L}$ always exists, even in cases where it is not a Legendre function? Fully human-written
Never Saddle: Reparameterized Steepest Descent as Mirror Flow Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper studies continuous-time steepest descent methods, specifically algorithms taking the form: $$\begin{align} d x_t = - \mathrm{sign} \left( \nabla_x f(x_t) \right) \odot \left| \nabla_x f(x_t) \right|^{q-1} dt, \end{align}$$ where $q \in [1,2]$. Different values of $q$ result in different trajectories. For example: For $q=2$, the algorithm becomes gradient flow, which is the continuous-time approximation of gradient descent. For $q=1$, it becomes SignGF (Sign Gradient Flow), which serves as a good proxy for studying optimizers like Adam. The architecture on which the paper focuses is deep diagonal reparameterization, defined as $x = g(w) = \prod_{i=1}^{L} w_i$, where $x$ is represented as the product of $L$ scalars. The authors demonstrate that under $\lambda$-balanced initializations, these steepest flows can be re-parameterized as steepest mirror flows. Using this reparameterization, the paper analyzes how quickly the flow escapes saddle points and investigates the effect of both coupled and decoupled weight decay. The paper provides a separation result for signGD with coupled and decoupled weight decay and shows that they have different regularization properties, which is interesting. a) The paper focuses on deep diagonal reparameterizations, i.e., products of one-dimensional variables with a particular initialization shape. The setting is too restrictive to generalize the results of the paper. 1) How is the manifold regularizer derived? Is it due to the fact that steepest descent with decoupled weight decay can be written as $$ d( \nabla_x R(x_t) ) = - \mathrm{sign} \left( \nabla_x f(x_t) \right) \odot \left| \nabla_x f(x_t) \right|^{q-1} dt - \nabla M_{reg}(x) dt $$ ? It would be nice to detail this, as the manifold regularizer appears rather abruptly. 2) The saddle points defined in Theorem 4.3 are no longer saddles once coupled or decoupled weight decay is added, so the feature-learning results only hold without weight decay. This point needs to be clearly mentioned in the manuscript. 3) From the abstract, how are deep diagonal reparameterizations a proxy for attention? Fully human-written
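To make the dependence on $q$ concrete, below is a minimal Euler discretization of the flow quoted in this review, applied to a toy deep diagonal reparameterization $x = \prod_{i=1}^{L} w_i$; the step size, depth, target, and initialization are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def steepest_step(w, grad, q=1.0, lr=1e-2):
    """One Euler step of dw = -sign(grad) * |grad|**(q-1) dt.
    q=2 recovers gradient descent; q=1 gives sign gradient descent."""
    return w - lr * np.sign(grad) * np.abs(grad) ** (q - 1)

def toy_run(L=3, q=1.0, steps=2000, lr=1e-2, target=1.5, init=0.1):
    """Fit x = w_1 * ... * w_L to a scalar target from a small (near-saddle) init."""
    w = np.full(L, init)
    for _ in range(steps):
        x = np.prod(w)
        # d/dw_i of 0.5 * (x - target)^2 = (x - target) * prod_{j != i} w_j
        grad = (x - target) * np.array([np.prod(np.delete(w, i)) for i in range(L)])
        w = steepest_step(w, grad, q=q, lr=lr)
    return np.prod(w)
```

Running toy_run with q=1 versus q=2 from the same small initialization is a quick way to see the kind of saddle-escape difference the reviews debate.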
Never Saddle: Reparameterized Steepest Descent as Mirror Flow Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This work introduces steepest mirror flows as a unifying geometric framework to study how optimization algorithms influence reparametrization. By analyzing diagonal linear networks and deep diagonal reparameterizations, the authors show that steeper descent methods, such as sign-based variants, escape saddle points more efficiently than gradient descent. Empirical results from linear regression, classification, and fine-tuning experiments confirm the theoretical predictions of faster saddle escape, stable learning, and distinct sparsity dynamics between coupled and decoupled weight decay. This is an interesting study with insightful findings. Combining mirror descent with reparameterization is a great idea. The statement of the first contribution is misleading. This is not the first work studying GF and reparametrization. It is probably meant with respect to this family of methods. For this type of work, assumptions are always problematic. They are very simplistic (diagonal networks), but it is very hard to show more general results. In the experiments, rather than studying the networks considered in the analysis, the authors should check whether the results hold when the assumptions are not fulfilled (for example, other networks). What is meant by "the iterates $x_t$ converge"? I assume it means that $\lim_{t \to \infty} x_t$ exists. Fully human-written
Never Saddle: Reparameterized Steepest Descent as Mirror Flow Soundness: 1: poor Presentation: 3: good Contribution: 1: poor Rating: 0: Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper proposes "steepest mirror flows" to explain why Adam/AdamW beat SGD in fine-tuning via faster saddle escape and different implicit regularization. The claims are that steeper (sign-like) geometry helps saddle escape and that decoupled weight decay (AdamW) stabilizes feature learning, with theory for deep diagonal reparameterizations and small fine-tuning case studies. Most formal results hold only in separable/diagonal settings; the transformer link is indirect; and empirical support for the fine-tuning claims is narrow. The message that Adam escapes saddles better than SGD is believable and may be cool. However, it is not well established. The rest of the claims are not substantiated. #### How is this about fine-tuning transformers? There is not even linear diagonal attention there; no one ever claimed that a diagonal network is a good model for a transformer, because it is not. How do you argue this? #### Mirror flow study is incremental and does not adequately support the thesis. While the diagonal-network analysis is neat, it is very similar to existing ones and does not bring any real novelty to the community. *I believe it is extremely incremental.* Showing that your sign mirror descent escapes saddles is far too limited to claim that *Adam* escapes saddles *in transformers*. There is too big of a jump from the mathematical argument to the goal. Moreover, I think such a result is a perturbation of existing ones. #### Order-2 saddles So what? We know they are present in neural networks, arguably also of higher order. A long line of works that you do not cite addresses this issue much more generally. That same line of work empirically, and sometimes theoretically, notes that they are not a problem in practice. You should discuss this line of research. On top of that, saddles are generally not an issue in linear networks, because the standard initialization is, with high probability, outside the region with saddles. #### Title mismatch Even the title is an oversell. It mismatches the paper: this is not a paper about reparameterizing steepest descent as mirror flow in general. This is a paper about Adam escaping saddles, which lacks novelty and does not support its claims. #### Experiments For CIFAR-10, SGD actually generalizes better than Adam with CNNs, and this is a classical result. I really do not understand the whole point of the paper and how these are supporting experiments. #### Conclusions Even though experiments are present, I believe the central claims are not adequately supported and the research methodology is not sound. - Fully human-written
Differentially Private Two-Stage Gradient Descent for Instrumental Variable Regression Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper studies differentially private instrumental variables (IV) regression and proposes **DP-2S-GD**, a two-stage gradient-descent algorithm that enforces Rényi differential privacy through clipping and Gaussian noise injection (in a usual DP-GD style but with two different sets of parameters). The method provides end-to-end privacy guarantees for the classical two-stage least squares (2SLS) setting for IVaR. The authors prove a non-asymptotic convergence bound that decomposes optimization error, sampling error, and noise due to privacy. They also characterize a privacy-iteration trade-off and show how privacy budgets affect convergence behavior. Experiments on synthetic data and the Angrist–Evans dataset support the theoretical predictions, demonstrating that the estimator remains stable and competitive under useful privacy levels. The paper fills a clear gap by proposing a differentially private version of instrumental variable regression. The formulation is simple and technically consistent, combining a two-stage gradient procedure with zCDP-based privacy guarantees. The analysis provides non-asymptotic convergence bounds for $\beta$ (the regression parameter of interest) that decompose statistical, optimization, and privacy errors, making the results partially interpretable. The privacy proof is correct and applies zCDP effectively to handle iterative updates. The experiments, though limited, support the theoretical claims and clearly show the predicted privacy–accuracy and overfitting behavior (from Figure 2). The writing is clear, notation consistent, and the argument proceeds logically from setup to results. **Untuned step sizes and incomplete convergence analysis (major weakness)** The paper provides only upper bounds for the step sizes $\eta$ and $\alpha$. These conditions ensure convergence in principle but leave the question of the optimal step sizes open. As a result, the convergence theorem lacks interpretability: there are too many definitions, and not all of them depend on just problem-dependent parameters. This limitation gives the impression that the analysis has not been fully worked out. A sharper result would express convergence explicitly in terms of well-defined problem constants and privacy noise scales. Since this is the main contribution of the paper, and the algorithm itself is not very surprising, I am inclined to reject the paper in its current form. I would like to see a fully worked out, non-asymptotic convergence guarantee before I can recommend acceptance. If it is hard to tune all the parameters for a high-probability guarantee, can the authors at least provide one for an in-expectation result? **Limited baselines** The experiments lack comparison with simple private 2SLS variants, such as perturbing sufficient statistics or applying output perturbation (which were indeed discussed in the paper while providing a survey of related methods). Including such baselines would help quantify the benefit of the proposed algorithm beyond intuitive reasoning. 
**Non-private performance gap** Even without privacy, the estimator remains slower than classical 2SLS, introducing an additional factor in the rate. The paper acknowledges this but does not explain what causes it. 1. Can the authors report the privacy levels used in all experiments in terms of the standard $(\epsilon, \delta)$-DP, in addition to $\rho$-zCDP, so that readers can directly compare privacy strengths with prior work? 2. Can the step sizes be tuned or derived in a way that yields an explicit non-asymptotic convergence guarantee depending only on problem parameters (e.g., Lipschitz constants or curvature), rather than through broad stability bounds? 3. Would it be possible to consider a stochastic variant of the algorithm, where each optimization step uses a new sample so that $T$ and $n$ are coupled? This might lead to a simpler and more interpretable result, since the trade-offs between optimization and generalization would coincide. It could also remove the "overfitting" behavior seen in Figure 2, where the error decreases and then rises as $T$ grows. This is not a major issue, but it could bring other aspects, such as the utility-privacy trade-off, to center stage. 4. The motivation for the IVaR setup remains somewhat unclear. Can the authors give concrete examples of real situations where one can actually observe or construct the full dataset $(z, x, y)$ required for the two-stage regression? In settings where the true causal graph is unknown, how can one justify the validity of Assumptions 1 and 2? Some discussion or examples of practical data sources would help clarify this premise. Lightly AI-edited
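For readers less familiar with the privacy machinery behind questions 1 and 2, here is a generic sketch of one noisy, clipped gradient step and its zCDP accounting. It follows standard DP-GD practice rather than the paper's exact DP-2S-GD, and the replace-one sensitivity bound of 2C/n is an assumption about the neighboring-dataset convention.

```python
import numpy as np

def dp_gd_step(w, per_sample_grads, clip_C, sigma, lr, rng):
    """Clip each per-sample gradient to L2 norm clip_C, average, and add
    Gaussian noise scaled to the sensitivity of that average."""
    n = per_sample_grads.shape[0]
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads * np.minimum(1.0, clip_C / np.maximum(norms, 1e-12))
    avg = clipped.mean(axis=0)
    sensitivity = 2.0 * clip_C / n          # replace-one neighbors (assumption)
    noisy = avg + rng.normal(0.0, sigma * sensitivity, size=avg.shape)
    return w - lr * noisy

def zcdp_after_T_steps(sigma, T):
    """Each Gaussian release with std sigma * sensitivity is (1 / (2 sigma^2))-zCDP,
    and zCDP composes additively over the T iterations."""
    return T / (2.0 * sigma ** 2)
```

The conversion to the $(\epsilon, \delta)$ statement that question 1 asks about then follows from the standard zCDP-to-DP bound $\epsilon \le \rho + 2\sqrt{\rho \log(1/\delta)}$.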
Differentially Private Two-Stage Gradient Descent for Instrumental Variable Regression Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a differentially private estimator for (a specific formulation of) instrumental-variable regression. The paper suggests noisy gradient descent, in which the key parameters $\Theta$ (a matrix) and $\beta$ (a vector) are updated at each round using a noisy gradient measurement. The paper gives natural conditions under which this algorithm provably converges (and also provides bounds on its error in estimating the vector $\beta$). The algorithm's convergence is also evaluated empirically on synthetic and real data. * The paper addresses a widely-used statistical method * The paper appears technically sound. Though I did not check the proofs carefully, the authors present and discuss their results clearly, and the results and proofs seem plausible. * The final bounds were hard to interpret, partly because they necessarily involve quite a few different parameters. It would be helpful to provide some baselines, both in the form of nonprivate methods, and in the form of alternate approaches to designing DP algorithms. For example, a few of the questions I couldn't answer as a reader were: * What level of noise is acceptable? That is, how are the results of DP IVaR supposed to be used, and when does the noisy version meet the needs of the application? Even if the answer is context dependent, providing one context in which the overall usefulness of the answer can be evaluated would be helpful. * How does the proposed method compare to a more straightforward perturbation of the sufficient statistics for the model (presumably the covariance matrix of the concatenation $(z,x,y)$). * When is DP IVaR more useful than DP OLS on its own? As far as I know, IVaR and OLS are normally used in tandem. However, presumably the noise levels have to be low enough for them to be meaningfully compared. When is that possible? In what parameter ranges do the results of the paper imply the validity of such a comparison? * It would be good to compare the bounds in this paper to those implied by other work on DP nonconvex optimization. * Privacy notions: Remark 3.8 states that, when $\Theta^{(t)}$ is not released, "setting $\rho=\infty$ implies that no noise $\Xi^{(t)}$ needs to be injected in the first state, and we can simply return $\{\beta^{(t)}\}$ under privacy budget $\rho_2$." I don't see why that's true---the hidden $\Theta^{(t)}$ values encode information about the sensitive data and propagate it throughout the computation. At the very least the statement deserves a more careful formulation and explanation. (Also, why would $\beta$ be useful without $\Theta$? The latter seems necessary to interpret the former as a predictor of $y$.) * What are the relevant facts about the standard 2S-LS estimator for this paper? It is mentioned and compared to the 2S-GD estimator in Remark 3.7, but the algorithm and its convergence rate are not precisely described, which makes the comparison hard to follow. Fully human-written
Differentially Private Two-Stage Gradient Descent for Instrumental Variable Regression Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This submission presents a differentially private algorithm for instrumental variable regression, a central tool in causal inference. The standard non-private algorithm for this task involves solving two least-squares problems. The algorithm presented here is a two-stage version of DP gradient descent. The authors give a theoretical analysis of the algorithm's convergence (under the assumption that the data arise from a well-specified model with subgaussian noise). For appropriate hyperparameter settings, as we get more samples the error from privacy becomes lower than the error from sampling. They also present experiments on synthetic and real data, and the method seems quite practical. Instrumental variable regression is an important task and extremely worthy of study under privacy. While there is a lot of work on private ordinary least squares, I'm surprised there's none on this topic. The method presented here is a clear contribution, both theoretically and practically. I was unable to verify the proofs, but the results are very plausible. The main convergence proof is quite long and, to my taste, comes with insufficient exposition. It was hard for me to understand which parts of the analysis are new. There are no approaches that explicitly tackle DP IVaR, but there are well-known techniques that directly yield results. In particular, the "subsample and aggregate" framework can usually make use of any non-private estimator with meaningful concentration guarantees. For the task here, combining Algorithm 5.1 of "FriendlyCore" [1] with the submission's Lemma D.7 yields a zCDP estimator with $\ell_2$ error roughly $\sqrt{pq/n}$, I think. This matches the non-private error and, in some regimes, might be better than Theorem 3.1's $\sqrt{p}q^{3/2}/n$ term. [1] https://arxiv.org/abs/2110.10132 Can you formally combine Lemma D.7 with the FriendlyCore approach and compare the resulting theorem with your Theorem 3.1? Can you briefly describe what (if any) new technical challenges your main convergence proof must overcome? Can you fix the following minor issues with your background discussion? - $(\varepsilon,\delta)$-DP algorithms satisfy basic composition, where the privacy terms accumulate linearly, but also *advanced* composition [see 2]. Here the asymptotics match those of zCDP, but the latter is usually more practical. - You mention that gradient-perturbation methods can avoid spectrum-based blowups, but (i) private first-order methods still suffer on ill-conditioned data [3,4] and (ii) there are techniques for privately estimating $X^T X$ that avoid these blow-ups [5,6], although when used within a sufficient-statistics approach for regression they may incur higher sample complexity than necessary [see 7 for discussion]. [2] https://dpcourse.github.io/2025-spring/lecnotes-web/DP-S25-notes-lec-10-adv-composition.pdf [3] https://arxiv.org/abs/2207.04686 [4] https://arxiv.org/abs/2301.13273 [5] https://arxiv.org/abs/1805.00216 [6] https://arxiv.org/abs/2301.12250 [7] https://arxiv.org/abs/2404.15409 Fully human-written
Differentially Private Two-Stage Gradient Descent for Instrumental Variable Regression Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses the problem of instrumental variable (IV) regression under differential privacy constraints. The authors propose a noisy two-stage gradient-descent algorithm that injects calibrated Gaussian noise and uses per-sample gradient clipping to obtain $\rho$-zero-concentrated differential privacy (zCDP). They also provide a formal privacy accountant and derive finite-sample convergence bounds that explicitly characterize the trade-offs among privacy budgets $(\rho_1,\rho_2)$, sample size $n$, problem dimensions $p,q$, and the number of iterations $T$. The authors validate the theory with synthetic simulations and a real-data application. 1. This paper addresses a compelling problem that is likely to interest the statistics and econometrics community, particularly those focused on IV regression and privacy. 2. This paper gives a clean privacy accountant and a detailed non-asymptotic utility bound. 3. This paper is well written and clearly structured. 1. The method and proofs are limited to the linear IV model. It may be more interesting to consider nonlinear IV or machine learning-based IV methods. 2. Even under the non-private setting, the error rate has an additional $\sqrt{p}$ factor compared to that of the 2SLS estimator. This worsened dimension dependence could be limiting in moderate-to-high-dimensional problems. 3. The necessity of privacy protection in the IV setting should be clarified further; in its present version the paper reads more like a combination of two methods. 1. Do you see a path to extend your framework and privacy analysis to nonlinear IV methods? What are the main technical obstacles? 2. Can you provide more intuition about the source of the extra $\sqrt{p}$ factor compared to 2SLS, and whether an improved analysis or algorithmic modification could remove or reduce this gap? 3. In the utility analysis, it seems that the authors do not explicitly incorporate gradient clipping into the iterative update equations. The clipping thresholds only appear in the gradient sensitivity, which determines the noise scale. Could the authors explicitly include the gradient clipping step in the iteration and provide a corresponding convergence analysis? Furthermore, how should practitioners choose the clipping thresholds in practice, and how sensitive is the empirical performance to these choices? 4. How should the step sizes be selected in practice? Have the authors conducted any sensitivity analysis to evaluate how the choice of step sizes affects the empirical performance and convergence? 5. The utility analysis requires a condition on $T$ to keep noise scales manageable. In practice, does early stopping based on a private validation objective suffice? Could you provide an algorithmic prescription to choose $T$ that balances convergence and privacy noise? 6. In the real data analysis, both the endogenous dimension $p$ and the instrumental dimension $q$ are equal to 1. Are there any real datasets with higher-dimensional instruments or regressors where the proposed method could be demonstrated? 
It would be interesting to see how the algorithm performs in such multi-dimensional settings. Fully human-written
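Since several of the reviews above benchmark against the classical (non-private) 2SLS estimator, here is a minimal reference sketch of it; the variable shapes follow the reviewers' notation, and everything else is illustrative.

```python
import numpy as np

def two_stage_least_squares(Z, X, y):
    """Classical non-private 2SLS: stage 1 regresses X on the instruments Z,
    stage 2 regresses y on the fitted values X_hat.
    Shapes: Z is (n, q), X is (n, p), y is (n,)."""
    Theta_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)     # stage 1: X ~ Z @ Theta
    X_hat = Z @ Theta_hat
    beta_hat, *_ = np.linalg.lstsq(X_hat, y, rcond=None)  # stage 2: y ~ X_hat @ beta
    return Theta_hat, beta_hat
```

A differentially private variant has to account for both stages, which is where the per-stage budgets $(\rho_1, \rho_2)$ discussed above come in.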
From Compression to Specialization: An Information-Preserving Approach for Dense to Mixture-of-Experts Construction Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses the challenge of converting pre-trained dense LLMs into sparse MoE architectures. The authors identify a trade-off between inheriting knowledge from the base model and the diversity of the expert modules. They propose an approach that uses low-rank factorization (SVD) with distinct calibration datasets to construct specialized experts, demonstrating that the approach exhibits high sensitivity to calibration data, enabling diversity, while preserving knowledge better in comparison to methods such as structured pruning. Experiments seem to show competitive performance, data efficiency, and improved load balancing. - The framing of the problem is intuitive, and a preliminary analysis justifies the choice of SVD and low-rank decomposition in this manner - Experimental analysis covers 12 benchmark datasets - Section 4.5 shows useful analysis of expert specialization (heatmaps) - Load balancing insights reveal stability issues in prior works, demonstrating further the advantage of the proposed approach - The baseline comparisons are limited. The paper does not compare against a wider range of recent upcycled-MoE baselines such as Sparse Upcycling (Komatsuzaki et al., 2023), Drop-Upcycling (Nakamura et al., 2025), and Auxiliary-Loss-Free Load Balancing (Wang et al., 2024). - All experiments use only 4 experts, with no ablation on the number of experts. No ablation studies on key design choices (e.g., LoRA rank) - The compression ratio is set to 25%, but this is not a well-explained choice - It is claimed that Sharing-Inter will degrade with continued training due to load imbalance. Can experiments be provided that validate this? - There is no indication about the proper choice of datasets and how this choice induces specialization equivalent to training MoEs from scratch. What if the test examples do not clearly match calibration datasets? - It appears that the method does not allow for any overlap between experts (no shared expert). Could this be a downside in some cases? - There is no clear quantitative comparison of total computation (construction, training, inference) with other MoE upcycling methods. - How sensitive is performance to the number and selection of fine-tuning datasets used to form experts? Would including additional baselines such as DeepSeek Balancing or BTX change the conclusions? What is the trade-off between expert diversity and computational cost when scaling to more fine-tuning datasets? Please see weaknesses above! Fully human-written
From Compression to Specialization: An Information-Preserving Approach for Dense to Mixture-of-Experts Construction Soundness: 2: fair Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces MIDAS, a method that transforms dense LLMs into sparse Mixture-of-Experts models via low-rank decomposition and parameter-efficient fine-tuning. Using Llama-2-7B as the base, each expert is derived from calibration data, followed by 1.3 B-token CPT and 0.4 B-token SFT. The authors claim improved data efficiency (DES) and specialization with minimal training cost. 1. Framing: interprets low-rank compression as a route to expert specialization. 2. Analyses on expert load distribution and calibration sensitivity. 3. Lightweight tuning scheme using LoRA is practical in principle. 1. DES metric: Since all MIDAS experiments are conducted using Llama-2 as a backbone, it is inappropriate to claim superiority over Llama-2 in terms of DES. 2. Accuracy degradation ignored: MIDAS (CPT + SFT) consistently underperforms the Llama-2 baseline on several downstream tasks. 3. Lack of compute transparency: The paper fails to report fundamental cost statistics such as FLOPs or GPU hours for training. 4. Outdated setup: All experiments are limited to Llama-2-7B. Stronger modern dense models, such as Llama-3 or Qwen-3, are not tested, leaving it unclear whether the claimed benefits of MIDAS would hold with more capable backbones. 5. Lack of task coverage: The evaluation omits critical domains such as mathematical and coding reasoning (e.g., HumanEval+, LiveCodeBench, MATH-500, BBH). 6. Missing relevant baselines: Contemporary dense-to-sparse conversion methods such as Sparse Upcycling and Drop-Upcycling are not included as baselines under the same computational budget, making it difficult to contextualize MIDAS’s effectiveness. Please clarify the points raised in the Weaknesses section. Moderately AI-edited
From Compression to Specialization: An Information-Preserving Approach for Dense to Mixture-of-Experts Construction Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes an expert-initialization method for converting dense models to MoE models. Specifically, the paper proposes to use different calibration datasets to initialize different experts via low-rank factorization. The specially initialized model is trained to close the gap between the MoE model and its parent dense model. 1. Interesting observation about the sensitivity of data-dependent compression of LLMs on the selection of calibration data, specifically for SVD-based compression 2. The paper is easy to follow 1. The main goal of the paper is to convert a dense model into an MoE model. The motivation is that training an MoE model from scratch is challenging. From this perspective, the paper didn't provide any comparison with MoE models trained from scratch. 2. It has already been established in the literature that training MoE is computationally efficient. Therefore, to achieve similar performance, a dense model needs far more training compute. However, the proposed method loses performance significantly compared to its parent dense model, even after training the initialized MoE model. 3. The proposed method can't outperform other dense-to-MoE baselines, despite having a significant load imbalance for the baseline. 4. The proposed expert-initialization method heavily depends on the diversity of calibration data. Therefore, the unavailability of diverse calibration data may undermine the effectiveness of the proposed method. 5. No formal theoretical justification has been provided for the proposed initialization of the experts. 1. What is the Sharing-Inter method? I can't find any citation of Sharing-Inter in the paper. 2. Can the authors provide a clear justification of why one should convert a dense model into MoE, rather than training MoE from scratch? Fully human-written
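The two reviews above hinge on data-dependent low-rank factorization being sensitive to the calibration set. As a concrete illustration of that general idea (an activation-weighted SVD in the spirit of existing data-aware compression methods, not necessarily the MIDAS procedure; the ridge term 1e-5 is an arbitrary choice), consider the following sketch.

```python
import numpy as np

def calibration_aware_low_rank(W, calib_acts, rank):
    """Truncate W so that the approximation error is small on directions the
    calibration activations actually exercise.
    W: (d_out, d_in), calib_acts: (n_tokens, d_in)."""
    M = calib_acts.T @ calib_acts / calib_acts.shape[0]      # activation second moment
    S = np.linalg.cholesky(M + 1e-5 * np.eye(M.shape[0]))    # M ~ S @ S.T
    U, sing, Vt = np.linalg.svd(W @ S, full_matrices=False)  # SVD in the data-weighted metric
    A = U[:, :rank] * sing[:rank]                            # left factor, (d_out, rank)
    B = np.linalg.solve(S.T, Vt[:rank].T).T                  # undo the weighting, (rank, d_in)
    return A, B                                              # W is approximated by A @ B
```

Because M depends on the calibration set, different calibration corpora keep different directions of W, which is the sensitivity the first review highlights and the mechanism by which distinct experts can be carved out of a single dense model.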
Real-Aware Residual Model Merging for Deepfake Detection Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a training-free model merging approach that enhances the generalization ability of deepfake detectors by combining specialist parameters trained on different forgery types. Specifically, the authors argue that the “real” class direction can be well preserved during merging, while class-specific forgery artifacts are suppressed. To this end, they employ SVD decomposition to extract dominant directions and obtain a generalizable merged model. Extensive experiments on the DF40 dataset demonstrate that the proposed method largely retains each expert’s detection capability. The authors further claim that their method exhibits high scalability. 1. This paper introduces a valuable setting: merging detectors via a training-free approach to obtain generalized deepfake detectors at low cost. 2. The authors provide rich theoretical justifications supporting the effectiveness of their proposed method. 3. The proposed approach is extensively evaluated under multiple protocols, showing competitive or superior AUC retention on seen tasks and improved generalization to unseen forgeries. 1. Although the authors argue that a single averaged linear head after merging is enough, empirical validation is missing. The authors are encouraged to compare the following heads: i) Averaged linear head; ii) Specialist-specific linear heads; iii) Re-tuned linear head (after merging). 2. Theoretical analysis relies on several assumptions: i) Local linearity; ii) Bounded remainder terms; and iii) Mild spectral gap. These assumptions may fail when specialist models differ substantially or when the networks exhibit high nonlinearity. Consequently, the validity of the core theoretical result (Proposition 1) hinges on whether such local properties still hold in large-scale models. 3. The proposed approach shows scalability with six models. However, as the diversity of deepfake generation methods continues to grow, the current scale remains modest. It is unclear how the method performs when the number of experts increases dramatically (dozens or hundreds), especially when specialists have uneven capabilities or have been trained on partially overlapping domains. Since the proposed method depends on low-rank SVD, the dominance of a shared “real” direction may diminish as task vectors diversify. 4. This paper argues for the asymmetry between shared “Real” and generator-specific “Fake” features. Nevertheless, it does not explore scenarios where the real data distribution changes or diversifies. For instance, real samples may come from new domains (e.g., different camera sensors or capture conditions). In such cases, the assumption of a single, stable “Real” direction may no longer hold. 5. Merging parameters across different specialists could introduce vulnerabilities such as trojan signatures or model poisoning. The authors are encouraged to analyze the robustness of R2M in this regard. Please see Weaknesses. Lightly AI-edited
Real-Aware Residual Model Merging for Deepfake Detection Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces **R²M**, a novel method for merging expert models in the domain of Deepfake detection. The core idea is to leverage SVD to decompose task vectors, treating shared "Real" features as a principal component and generator-specific "Fake" artifacts as residuals. This allows for a training-free and efficient merging process without a drop in performance. However, I have some concerns regarding the method's practical utility in real-world scenarios and its overall impact on advancing the field. 1. **Clear Motivation:** The paper's motivation is clear and reasonable. Figure 2 provides a compelling and intuitive illustration of the core hypothesis: different specialist models share a common understanding of "Real" data while diverging in their representation of "Fake" data. 2. **Efficient and Effective Method:** The proposed R²M method is elegant in its simplicity. Being training-free, it offers a highly efficient way to combine specialists, and the experiments show that it does so with negligible performance degradation on seen tasks. 1. **Comparison with Similar SVD Techniques Used for Detection is Missing:** Previous work like Effort (ICML'25) also proposed a similar SVD-based approach for improved detection performance. More discussion of, and comparison with, similar methods is needed. 2. **Clarity of Visualizations:** The readability of several experimental figures is a concern. In Figure 3 (heatmaps) and Figure 5 (dumbbell plots), it is difficult to discern the precise numerical gains or losses. 3. **Marginal Performance Gains:** The improvement of R²M over prior model merging techniques, like CART, appears to be marginal. 4. **Concerns about Practical Utility and Scalability:** My primary concern lies with the practical application of the proposed merging strategy. The experiments partition domains based on broad forgery categories (e.g., FS, FR, EFS) rather than specific generator models (e.g., FSGAN, FaceSwapV2). This raises a crucial question: how should the framework handle the incremental addition of a new forgery method that belongs to an existing broad category? If a new FaceSwap variant emerges, would one retrain the entire FS specialist, or would a new, more granular merging strategy be required? The current setup does not seem to address this realistic scenario. 5. **Strange Performance of Comparison Methods:** For instance, the performance of *Specialist-FR* on EFS samples is extremely low (AUC=0.099). Can the authors explain this? 1. **Fine-tuning Details:** For reproducibility and clarity, it would be beneficial to specify which parameters of the backbone were fine-tuned for the specialist models. For instance, was it only the final linear layer, or were other parts of the network also updated? 2. **Lack of Pipeline Diagram:** It would be better to provide a high-level pipeline diagram. A clear visual representation of the R²M process—from task vectors to the final merged model—would significantly aid reader comprehension. 3. **Ambiguity of "All-in-one" Baseline:** The training setup for the "All-in-one" baseline is unclear. Was it trained as a single binary (Real vs. Fake) classifier on all forgery data? 
To further strengthen the experimental results, I would suggest including a baseline where the "All-in-one" model is trained on a multi-class Fake detection task (i.e., classifying each specific forgery type), with the logits for all fake classes then aggregated for binary evaluation. This would provide a more robust comparison. 4. **Limited Impact on the Deepfake Detection Field:** While the authors position the work as a contribution to model merging, its tangible impact on advancing the core challenges in Deepfake detection itself seems limited. The problem of generalizing to unseen forgery families, a critical issue in the field, is not substantially improved by this merging approach. Moderately AI-edited
Real-Aware Residual Model Merging for Deepfake Detection Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This work analyzes the similarity between each specialist detector and the weight-averaged model in deepfake detection, finding that real features are shared across detectors while fake features differ and can be complementary. Based on this observation, the paper proposes a Real-aware Residual Model Merging strategy that enables rapid incorporation of new forgery families by preserving the shared real component and merging the residuals from different forgery specialists. Experiments across multiple datasets and protocols demonstrate the effectiveness of the proposed method. 1. The proposed method is straightforward and compelling; the analysis of real and fake feature similarities between each specialist and the weight-averaged model is particularly insightful. 2. The R2M method is novel and effective: it updates models by retaining a shared real component while composing denoised, norm-matched fake residuals to enable rapid adaptation to new forgeries. 1. Figure 2's analysis relies on features from older forgery datasets (e.g., DF40), where real and fake images are relatively easy to separate; how would the analysis and the proposed method perform when forgeries are highly realistic and real and fake feature distributions are not well separated? See weaknesses. Lightly AI-edited
Real-Aware Residual Model Merging for Deepfake Detection Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes the first method that introduces model merging to the deepfake detection task. It analyzes the real-feature similarity and fake-feature distinction on the DF40 dataset, and then proposes R2M with a theoretical explanation. The results show its effectiveness in balancing different types of forgery methods in comparison with the specialists. 1. The paper introduces model merging into deepfake detection for the first time. 2. It provides a reasonable analysis of real–fake similarity on the DF40 dataset. 3. The proposed R2M method shows balanced performance with theoretical support. 1. Since DF40 is used, all real samples come from the same FF++ real, so it is unsurprising that they share the same distribution. However, this also limits the applicability of the conclusion. What if the training data contain reals from different distributions, e.g., FF++ and CDF? 2. Regarding generalization: while it is understandable that model merging can improve in-domain performance, it remains unclear why merging would enhance cross-domain generalization. This point requires more explanation. 3. In Table 1: The experimental results appear unusual — the All-in-one model fails completely on EFS detection, and even reverses predictions. Why would changing the real distribution cause previously learned forgery types to invert their detection behavior? This phenomenon warrants deeper analysis, particularly to clarify how model merging could lead to label inversion. 4. More experiments on generalization are needed, for example, by evaluating on additional datasets (e.g., DFDCP, DFD) and comparing with more generalization-oriented methods (e.g., [1] and [2]). 5. The writing requires further proofreading. For instance, DF40 is incorrectly cited — it is not (Qian et al., 2024) but rather (Yan et al., 2024). 6. Within the same forgery category, the fake cases also require further analysis to validate the similarity findings in Fig. 2 — for example, within EFS. It would be helpful to train specialist models separately using DiT, SiT, and StyleGAN to support this analysis. [1] Effort: Efficient orthogonal modeling for generalizable ai-generated image detection //ICML'25 [2] Can we leave deepfake data behind in training deepfake detector? //NIPS'24 Please refer to the weaknesses. Lightly AI-edited
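As a reading aid for the "shared real component plus per-specialist residuals" decomposition that all four reviews discuss, here is one plausible SVD-based instantiation. It is a hedged sketch of the general recipe, not R²M's exact procedure; in particular, the denoising and norm matching the reviews mention are omitted.

```python
import numpy as np

def merge_with_shared_component(W_base, specialist_weights, k=1):
    """Split task vectors (specialist minus base) into a rank-k shared part and
    per-specialist residuals, then recombine them into one merged model."""
    taus = np.stack([(W - W_base).ravel() for W in specialist_weights])  # (m, d)
    U, s, Vt = np.linalg.svd(taus, full_matrices=False)
    shared = (U[:, :k] * s[:k]) @ Vt[:k]          # rank-k component common to specialists
    residuals = taus - shared                     # specialist-specific directions
    merged_tau = shared.mean(axis=0) + residuals.sum(axis=0)
    return W_base + merged_tau.reshape(W_base.shape)
```

The reviewers' questions about diversified real distributions then amount to asking what happens when the dominant shared direction in this split is no longer a clean "real" direction.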
Aligning News and Prices: A Cross-Modal LLM-Enhanced Transformer DRL Framework for Volatility-Adaptive Stock Trading Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces a multimodal deep reinforcement learning system that integrates large language models, Transformers, and the Soft Actor-Critic algorithm to improve trading robustness under market volatility. The model first extracts sentiment and event representations from financial news using a pre-trained LLM (BERT or GPT-2), then aligns price data to this semantic space through a reprogramming layer, and finally fuses both modalities using cross-attention. A Transformer encoder captures multi-scale temporal dynamics and inter-stock correlations, and SAC’s critic gradient feedback jointly optimizes feature learning and trading policy. 1. LLM-driven semantic alignment of news and prices. The reprogramming layer projects numerical price data into the LLM semantic space using multi-head attention, enabling consistent multimodal fusion. This design avoids retraining large language models while ensuring semantic compatibility. The combination of prompt engineering for financial contexts and dynamic feature extraction demonstrates careful adaptation of general LLMs to finance-specific tasks. 2. Contextualized volatility awareness and interpretability. The model’s design explicitly addresses volatility through multi-scale fusion (Eq. 7–8) and sentiment integration, helping explain its superior performance during unstable periods such as the 2021–2022 NASDAQ downturn (Fig. 4). 1. Insufficient ablation and parameter sensitivity analysis. Although ablations are mentioned (Abstract; Sec. 3.4), details are sparse. It remains unclear how much each module—LLM feature extraction, reprogramming layer, or multi-scale fusion—contributes independently to the final gains. The effect of hyperparameters such as attention head count, SAC learning rate, or prompt length is not examined, limiting interpretability of results. 2. Inadequate computational efficiency discussion. While hardware configuration is reported (Sec. 3.2), there is no runtime, memory, or inference-latency comparison. Training involves LLM encoding and multi-head attention fusion (Sec. 2.1–2.3), which are computationally heavy. Without quantitative cost analysis, practical deployability in real-time trading remains uncertain. 3. Restricted dataset scope and generalization evidence. Experiments are limited to ten NASDAQ-100 components and five stocks for prediction (Sec. 3.1–3.5). The paper does not test across other markets or periods beyond 2019–2022, leaving the model’s adaptability to different economic regimes unproven. The reliance on English-language news may also bias performance toward U.S. markets. 4. Limited theoretical grounding of critic-Transformer gradient feedback. The mechanism where SAC critic gradients enhance Transformer feature learning (Sec. 2.4) is described conceptually but lacks a mathematical formulation or ablation isolating its contribution. No explicit derivation links Eq. 9 to gradient propagation into the encoder. This omission reduces the clarity of how end-to-end optimization improves stability or volatility adaptation. 1. 
What is the computational cost relative to baseline DRL methods? Can the authors report average training time per epoch, GPU memory usage, and inference latency for real-time trading? Such data would clarify whether the proposed framework is feasible in practical financial environments. 2. Could broader datasets or markets be included to test generalization? Would expanding experiments to other stock indices (e.g., S&P 500, Hong Kong HSI) or different time spans strengthen evidence that the model generalizes across regimes and news distributions? 3. How exactly are SAC critic gradients propagated into the Transformer? Could the authors provide explicit mathematical expressions or algorithmic pseudocode detailing the gradient flow from the critic network into Transformer layers? Fully AI-generated
Aligning News and Prices: A Cross-Modal LLM-Enhanced Transformer DRL Framework for Volatility-Adaptive Stock Trading Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a volatility-adaptive, multimodal DRL framework to improve stock trading performance during turbulent markets, where traditional models often fail by ignoring news, failing to capture multi-scale trends, and lacking resilience. The framework integrates LLMs, Transformers, and the Soft Actor-Critic (SAC) algorithm: 1. The Multimodal LLM module extracts news sentiment and uses a multi-head attention reprogramming layer to align structured price data into the LLM’s semantic space. Price and news embeddings are then fused via cross-attention. 2. A Transformer is used to model multi-scale temporal patterns and inter-stock correlations, generating a unified state. 3. The SAC agent uses this state for decisions, with gradient feedback propagating back to the Transformer, ensuring end-to-end optimization that enhances the agent's volatility sensitivity. Experiments on NASDAQ-100 stocks demonstrated that the framework outperformed baselines, yielding positive returns and high Sharpe Ratios during a turbulent test period. [S1] Cross-Modality: The paper introduces a cross-attention mechanism that fuses and aligns news and price embeddings to capture how news sentiment relates to price features, rather than simple concatenation. [S2] Volatility resilience: The combination of multi-scale price modeling and news context allows the agent to adapt to different market volatility regimes. [W1] Limited stock set size: The model was evaluated using only ten stocks with sufficient news coverage drawn from the NASDAQ-100. This pre-filtered selection might not capture the full complexity of broader markets. The reported performance may not generalize well across diverse asset sets. [W2] Insufficient comparison with news-driven models: The experimental evaluation would benefit from stronger comparisons with models that also leverage financial news. In Table 1, all baselines are traditional DRL or time-series methods that do not incorporate textual data, making it difficult to isolate the value of the proposed multimodal design. Similarly, Table 2 should include Time-LLM (Jin et al., 2023) or other news-driven approaches to better demonstrate how the proposed reprogramming layer differs from existing methods in stock price prediction. [W3] Short testing period: Backtesting was conducted in a single year (December 2021 to December 2022). A one-year window offers a partial view of how the framework performs under different market regimes. To demonstrate the model’s long-term robustness, testing across different cycles (bull, bear, and sideways markets) would be essential. [W4] Lack of transparency in strategy design and trading costs: The paper provides limited insight into the practical details of the trading strategy. It’s unclear how model outputs translate into actual portfolio allocations. Moreover, the study does not mention transaction costs, which are critical in the profitability of any trading system. [W5] News data source and validation: The paper relies on a Hugging Face dataset, but the source is not a verified commercial feed. 
The paper should identify the underlying news sources contributing to the dataset and explain how the data was collected and verified. This transparency would enable readers to assess the reliability of the textual inputs that drive the model's decisions. [Q1] Novelty of the reprogramming layer: What distinguishes your reprogramming layer from Time-LLM's, and where does its novelty lie in the Table 2 comparison? [Q2] Investment strategy details: The paper does not provide sufficient detail on how trading actions are translated into portfolio allocations. It remains unclear how the portfolio weights are distributed across assets. [Q3] Generalizability across market depth: How does performance hold up when applied to broader stock sets without sufficient news coverage? Many assets have sparse news coverage, which could disrupt the multimodal alignment process. The authors should clarify how the model handles such data gaps. [Q4] Transaction cost impact on realized returns: Since real-world trading always incurs transaction costs, it would be useful to know whether transaction fees or slippage were included in the performance results reported in Table 1. [Q5] Model complexity: The complexity of the multi-module architecture (LLM, Reprogramming, Transformer, DRL) makes the model harder to interpret, and the time complexity of the overall framework is not mentioned in the paper. Moderately AI-edited
Aligning News and Prices: A Cross-Modal LLM-Enhanced Transformer DRL Framework for Volatility-Adaptive Stock Trading Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes a volatility-adaptive multimodal DRL framework for stock trading that integrates LLMs, Transformers, and the SAC algorithm. By fusing textual financial news with price dynamics through attention-based reprogramming and cross-modal fusion, the model captures sentiment–price interactions and adapts to market volatility. Experiments on NASDAQ-100 data demonstrate superior performance over existing methods. The paper explores the integration of textual financial news and quantitative price data within a multimodal framework. By leveraging a pre-trained LLM for news encoding and Transformer-based modules for price representation, it provides a reasonable step toward combining sentiment and numerical information for trading decision-making. 1. Many notations are not clearly defined. For instance, some symbols that represent vector data should be written in boldface using `\mathbf{}`. For example, in the expression $P = \{ p^{\text{open}}, p^{\text{close}}, \dots, p^{\text{volume}} \}$, terms such as $\mathbf{p}^{\text{open}}$ should be in bold to indicate vector representations. 2. The model assumes a perfectly aligned one-to-one correspondence between daily news and price data, based on a curated open-source dataset. However, in real-world markets, news arrivals are irregular. Some days contain multiple news items, while others have none. The current framework does not explicitly handle such temporal misalignment or modality sparsity, which may limit its applicability to more realistic, unbalanced data distributions. 3. The prompt design includes two key parameters, sequence length (seq len) and prediction length (pred len). However, the paper lacks a sensitivity analysis to examine how model performance varies under different context window sizes or forecasting horizons, even though these parameters directly affect the model's temporal reasoning capacity and generalization ability. 1. In Figure 4, many methods show a noticeable jump in CW around 2022-10. What caused this sudden change? 2. How does the method realize the stated volatility-adaptive capability? There seems to be no explicit risk control, and in Appendix E Algorithm 1, market volatility is included as an input but never utilized anywhere in the algorithm. Moderately AI-edited
Aligning News and Prices: A Cross-Modal LLM-Enhanced Transformer DRL Framework for Volatility-Adaptive Stock Trading Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces a multimodal DRL framework involving LLMs, Transformers, and SAC for stock trading. Experiments on NASDAQ-100 show its SOTA performance. 1. This paper introduces an attention-based reprogramming layer to project time-series data into an LLM's semantic space, bridging structured and unstructured modalities. 2. The proposed framework addresses volatility resilience to some extent, which is a critical limitation in real-world DRL trading. 1. In general, the novelty of this work is not enough for top conferences such as ICLR; it is a combination of LLMs, Transformers, and RL with limited contribution from the algorithmic perspective. 2. Experiments on NASDAQ-100 stocks are not enough; I recommend the authors conduct experiments on more diversified and large-scale datasets to further evaluate the performance. 3. The proposed framework is quite complex, which raises concerns about latency in real-world settings. More discussion on this is required. 4. As the motivation of this work is performance under extreme market conditions, more ablation studies with quantified results on volatility-specific effects would help. 1. For data alignment, how are timestamps between news articles and stock prices synchronized to avoid look-ahead bias? Fully human-written
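For readers less familiar with the cross-modal design the four reviews above critique, the sketch below shows one plausible shape of a reprogramming-plus-cross-attention fusion of price and news features in PyTorch. The dimensions, module names (`reprogram`, `fuse`), and the choice of price tokens as queries are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative two-stage attention fusion of price and news features.
    All dimensions and module names are assumptions for this sketch."""
    def __init__(self, price_dim=16, llm_dim=768, n_heads=8):
        super().__init__()
        self.price_proj = nn.Linear(price_dim, llm_dim)  # lift price tokens into the LLM space
        self.reprogram = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)
        self.fuse = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)

    def forward(self, price_seq, news_emb):
        # price_seq: (B, T, price_dim); news_emb: (B, N, llm_dim) from a frozen LLM
        q = self.price_proj(price_seq)                      # queries from price tokens
        aligned, _ = self.reprogram(q, news_emb, news_emb)  # align prices to the text space
        fused, _ = self.fuse(aligned, news_emb, news_emb)   # condition on news context
        return fused                                        # (B, T, llm_dim) state features

# usage sketch with random tensors
fusion = CrossModalFusion()
state = fusion(torch.randn(2, 30, 16), torch.randn(2, 5, 768))
print(state.shape)  # torch.Size([2, 30, 768])
```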
The Blind Spot of LLM Security: Time-Sensitive Backdoors Activated by Inherent Features Soundness: 4: excellent Presentation: 4: excellent Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces a backdoor attack framework against LLMs. The approach is based on training the LLM to be triggered by timestamp features in the system prompt. This allows the attack to be triggered without changing the end-user inputs. The attack is implemented using an automated data-poisoning method applied during the supervised fine-tuning step. The system is trained on poisoned data in which question-answer pairs are replaced with the same question set at a future time and a poisoned version of the answer. The system is tested on Qwen 2.5 7B Instruct. The system is then activated by the timestamp, which means that it can give different answers to testers than to users who access the model later. The backdoor threat would target users who download open-source finetuned models or commission custom models from third-party developers. * Noticing that the timestamp in the system prompt might be a trigger for different responses is an interesting insight. It is unclear whether this insight is novel to this paper; clearly, timestamp-based attacks and security techniques have long existed in the literature. * The authors propose a functional technique to finetune an LLM to provide different answers based on different system prompts. * What the approach creates is basically a way to use any feature in the system prompt to trigger an attack. If the system prompt had a sentence saying "this is just a test" versus "this is production use", it would be exactly the same thing - and it is also the same thing if it relied on an explicit "Trigger backdoor" string. * The fact that the authors managed to train the model to return different results on specific topics based on different system prompts shows competent training skill, but it is not a major contribution. * The "performance numbers" of more than 95% are rather meaningless, given that the "attacker" has complete control over the finetuning. * What the "Defense" section results really show is that the evaluated defense datasets do not test with different timestamps (or at least with timestamps that span the range the authors trained on). * It appears that it should be very easy to detect the attack by testing with various timestamps, or to defend against it by not using timestamps. * Given that this particular attack appears easy to defend against, are there other features in the system prompt that might similarly act as a trigger? Fully human-written
The Blind Spot of LLM Security: Time-Sensitive Backdoors Activated by Inherent Features Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a backdoor attack framework for LLMs that exploits timestamp features in system prompts as triggers. The backdoor is activated by a future date and then exhibits malicious behavior in specific domains without requiring control over user inputs. It develops an automated pipeline using "Homo-Poison" with a training strategy combining supervised finetuning and n-token reinforcement learning. S1. The use of system timestamps as backdoor triggers is novel, and the threat model is realistic in that the attacker cannot control user inputs. S2. Comprehensive experimental evaluation against seven mainstream methods. S3. Clear writing; the paper is well organized and easy to follow. W1. The attacker assumes that victims will deploy models with timestamp-containing system prompts, and this is the basis of a successful attack. The authors should explain why and how the attacker can know this information. Moreover, the attack window limitation between model release and trigger date is mentioned but not thoroughly analyzed. W2. Although some defense mechanisms (ONION, CUBE) have been evaluated, they were mainly designed for simpler NLP tasks rather than LLMs. Defense methods designed for LLMs should also be evaluated, for example, randomized smoothing. W3. Lack of analysis of the impact of model updates or further fine-tuning on backdoor persistence. Q1. How long do backdoors remain effective after the trigger date? Q2. Have you considered the attack's effectiveness when timestamps are formatted differently across training and deployment? For other questions, please refer to the weaknesses. Fully human-written
The Blind Spot of LLM Security: Time-Sensitive Backdoors Activated by Inherent Features Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes TempBackdoor, a time-triggered backdoor strategy that uses timestamps in system prompts as a dormant activation signal. The authors build an automated poisoning pipeline (Homo-Poison) and a two-stage training recipe (SFT followed by a focused “n-token” RL) to implant backdoors that only fire when both a future timestamp and a domain condition are present. Experiments on Qwen-2.5 family models report very high attack success, low false positives, and apparent robustness to several existing defenses. - The core idea — using an endogenous system signal (time) as a trigger — is simple but insightful; it exposes a plausible blind spot in certain deployment practices. - Results are compelling on the controlled Qwen experiments. - The paper is well written. - The attack hinges on the assumption that deployed systems include raw timestamps in model context exactly as trained. Many production stacks sanitize or reconstruct system context server-side (or keep such metadata separate), so the attacker’s assumed access to an unfiltered timestamp field is not convincingly demonstrated. The paper treats “timestamp present” as a binary reality rather than a deployment-dependent variable. This weakens claims about real-world feasibility. - All evaluations use Qwen check-points and synthetic prompts generated in a tightly controlled pipeline. No experiments on closed APIs, hosted inference stacks, or even a simulation of common sanitization/preprocessing layers are reported. That makes it hard to judge whether TempBackdoor is a lab trick or a practical threat. - Title and framing promise a broad blind-spot discovery, but the manuscript only operationalizes time. Other supposed “inherent features” (locale, device, region, user-id) are only discussed at a conceptual level. Without experiments showing generalization, the claim that system-level variables broadly form an untested surface is speculative. - Another limitation is that the paper does not include any comparison with other existing backdoor or trigger designs. Without such context, it’s difficult to gauge how much improvement actually comes from the proposed mechanism rather than from the training pipeline itself. - The current Figure 1 is visually useful but the caption and/or markup should explicitly show where the dual triggers are and how they jointly activate the backdoor. Fully AI-generated
The Blind Spot of LLM Security: Time-Sensitive Backdoors Activated by Inherent Features Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a poisoning/backdoor attack method based on the system prompt time. By using the time specified in the system prompt as the trigger condition, the method behaves normally for legitimate users before a specific time and produces targeted responses for specific tasks after that time. Experimental results show that the proposed method achieves a high attack success rate across different datasets and models. 1. The problem addressed is both novel and important. With the advancement of large language models (LLMs), poisoning attacks that do not require explicit triggers to activate pose new threats to intelligent applications. 2. The evaluation is comprehensive, demonstrating the effectiveness of the proposed method in terms of ASR and robustness. 3. The paper is well organized and easy to follow. 1. **Lack of theoretical and experimental justification for claimed limitations of existing methods.** In the introduction, the paper points out that knowledge-based poisoning attacks lack stealth, but this conclusion is not supported by references, theoretical analysis, or experimental results. In Section 6.1, the paper also fails to evaluate existing knowledge-based poisoning attacks against the adopted defenses. If knowledge-based poisoning attacks are also robust to these backdoor defense strategies, then how can it be proven that they lack stealth? Regarding triggerable attacks, the paper claims that attackers cannot control users' input to activate backdoor attacks. This is reasonable, but I think the proposed method in this paper is more like a knowledge-based poisoning attack under specific conditions, and therefore has different application scenarios compared to triggerable backdoor attacks. It is recommended that the authors focus on knowledge-based poisoning attacks and corresponding defenses in the comparison and evaluation. 2. **The literature review is not comprehensive.** As mentioned above, I believe this paper is more aligned with knowledge-based poisoning attacks, yet only one such work (Shu et al., 2023) is discussed without evaluation. It is recommended to introduce and compare the proposed method with more poisoning attacks (e.g., [1–3]). - [1] *POISONBENCH: Assessing Language Model Vulnerability to Poisoned Preference Data* - [2] *Run-Off Election: Improved Provable Defense against Data Poisoning Attacks* - [3] *PoisonedEye: Knowledge Poisoning Attack on Retrieval-Augmented Generation based Large Vision-Language Models* 3. **The proposed attack is easy to defend against (according to the stated threat model).** The threat model assumes that defenders can detect poisoning attacks without time triggers, which makes it easy to defend against the proposed method—for instance, by simply adding a future timestamp during training. Overall, I find this to be an interesting topic with a simple yet effective approach. It is recommended that the authors further clarify the threat model and compare the proposed method with more poisoning attack methods. Lightly AI-edited
Variational Model Merging for Pareto Front Estimation in Multitask Finetuning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a Bayesian method for model merging. The goal is to approach the Pareto front of multitask finetuning by efficiently and approximately computing the posterior with a mixture of Gaussians. As a consequence, this method balances the efficiency lacking in a full Gaussian posterior and the utility lacking in an isotropic Gaussian. Experiments on transformers show some improvement, with model merging getting closer to multitask finetuning. This paper is clearly written, with a good introduction and motivation. The Bayesian approach makes sense to me and is original as far as I can tell. Figure 1 really does a good job highlighting the idea. Section 3.2 positions this method appropriately in the literature. Overall the quality is good, with the derivations and reasoning being sound. 1. Methodology: Section 3.4 lists three versions, and different experiments seem to use different ones. It would be beneficial to converge to one method if possible for practitioners. If not, can the authors summarize the applicability of each version? 2. Weaker performance than multitask finetuning: The message from this work is two-fold: variational model merging is better than previous model merging, but it is still worse than multitask finetuning (see Table 1 and Figure 5b). While the second part is not a positive result, I think it is very valuable. However, to take the second conclusion seriously, the overall method in lines 292-294 may need a 4th step: launch multitask finetuning for the Pareto estimates. 3. Computational cost: This method is still computationally heavy, e.g., finetuning T models for T tasks. While many model merging methods are costly, this pain point is not yet alleviated by this method, so I think the significance is not great. See weaknesses. Fully human-written
Variational Model Merging for Pareto Front Estimation in Multitask Finetuning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes variational model merging, a Bayesian approach to estimate Pareto fronts in multitask finetuning by merging task-specific posterior approximations. The key insight is that more flexible posterior families (e.g., full Gaussians, mixture of Gaussians) yield better Pareto front estimates than simpler ones. 1. Novel theoretical framework: Connecting model merging to Bayesian posterior fusion is novel and provides a principled way to derive new merging strategies. The variational perspective naturally explains why different merging methods exist and how to improve them. 2. Clear theoretical contribution: The theorem showing that more flexible posteriors necessarily yield better estimates is valuable, with the error reduction property being particularly insightful. 3. Comprehensive experiments: Testing on diverse architectures and tasks (vision, NLP, translation) demonstrates broad applicability. 1. Missing bounds on approximation quality relative to the true Pareto front. Also, can the authors provide a formal connection between posterior quality and Pareto front accuracy? In other words, a bound on how the approximation quality translates to Pareto solution quality. Currently, the paper only shows empirically that better posteriors help, but doesn't prove how much they help or when they're guaranteed to help. 2. Computational costs. 1. Mixture methods require K times more models, which is expensive for large models. 2. As the authors already state, Hessian approximation is a bottleneck for large-model merging; even diagonal approximations require O(P) storage. An analysis of computational costs for the various settings would be helpful. Can this framework handle constraints or preferences on the Pareto front? Lightly AI-edited
Variational Model Merging for Pareto Front Estimation in Multitask Finetuning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors employed a Bayesian model-merging approach that efficiently explores various weighting configurations without requiring full retraining for each one. Their method relies on two key components: model merging, which combines the parameters of models individually trained on separate tasks instead of retraining for every configuration, and a Bayesian framework, which enhances the merging process by developing improved surrogate functions for the multitask learning objective. This allows practitioners to effectively explore task-weighting options and find high-performing models at a fraction of the computational cost of traditional retraining. - The paper's primary strength is its novel conceptualization of model merging as a variational Bayesian inference problem. This original framework is significant because it replaces ad-hoc merging heuristics with a foundation that both explains the relative performance of existing methods and provides a clear recipe for systematically designing new, more accurate ones. - Extensive empirical validation on modern, large-scale architectures, including Vision Transformers and the GEMMA-2B LLM. - The paper's primary goal is to provide "fast and cheap methods" to estimate the Pareto set. However, its best-performing and most novel method, Mixture-Weighted Merging (MultiIVON-Hess), has a training cost that scales linearly with the number of mixture components ($K$). This requires $K$ full training runs for each task, which creates a significant tension with the "cheap" objective. - The number of components $K$ seems to be a critical hyperparameter. The paper uses $K=30$ for ResNet, $K=10$ for ViT, and $K=3$ for RoBERTa and GEMMA. How was $K$ chosen for each experiment? Is there a principled way to select $K$? Lightly AI-edited
Variational Model Merging for Pareto Front Estimation in Multitask Finetuning Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper frames *model merging* as approximate Bayesian inference to cheaply preview the Pareto set for multitask finetuning. Starting from task-specific posteriors, it proposes estimating Pareto solutions by maximizing a merged posterior. This unifies prior weight-averaging (simple averaging / task arithmetic) with Hessian-weighted schemes and introduces a mixture-of-Gaussians variant solved via a lightweight EM procedure. Empirically, across CIFAR-10 (ResNet-20), CLIP ViT-B/32 transfers, RoBERTa sentiment, and GEMMA-2B LoRA MT, more expressive posteriors (diagonal Fisher or Hessian to mixtures) produce Pareto fronts closer to multitask finetuning while being much cheaper than retraining across many $\alpha$ values. The paper clearly derives how scalarized multi-objective training corresponds to MAP under a merged posterior; it shows that common merging tricks are special cases of exponential-family surrogates (e.g., simple averaging from isotropic Gaussian; Hessian-weighted from full Gaussian). This gives a principled recipe rather than ad-hoc formulas. Consistent empirical trend: Across tasks and model families, a more flexible posterior means a better front. - Approximation stack is heavy and sometimes crude. Many experiments rely on diagonal Hessians/Fishers or squared-gradient proxies, which can mischaracterize curvature and interactions (acknowledged by the authors). - Cost shifts rather than disappears. MoG requires K runs per task and a few EM steps. While still cheaper than dense alpha sweeps, for large T and K this becomes significant; the paper reports K=3–30, which is nontrivial in big models. - Hessian quality vs. downstream accuracy. The argument that IVON supplies a “free” diagonal Hessian is practical, but no controlled study links Hessian quality to Pareto-front error beyond rough accuracy differences. A calibration plot (front error vs. curvature error) would strengthen the causal story. - The method assumes task-specific posteriors are compatible under a common prior. In settings with strong parameter non-identifiability or sharp mode shifts (e.g., safety vs. creativity in LLMs), merging may land off-manifold; the paper hints at such gaps (e.g., shape mismatches) but doesn’t delineate failure modes or detection heuristics. - Reported times exclude the up-front cost of training each task model (and K variants for mixtures). For large T, the one-time cost may approach or exceed a modest grid of multitask runs; a more apples-to-apples wall-clock accounting would help. Fully AI-generated
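Since all four reviews above lean on the claim that simple averaging and Hessian-weighted merging are special cases of merging Gaussian posteriors, the short sketch below spells out the standard precision-weighted closed form for diagonal Gaussians. The variable names are illustrative, and the paper's exact handling of the shared prior and task weights may differ.

```python
import numpy as np

def merge_gaussian_posteriors(mus, precisions, alphas):
    """Precision-weighted merging of per-task Gaussian posteriors
    N(mu_t, diag(1/precisions_t)) under task weights alpha_t.
    Standard closed form consistent with the reviews' description; the
    paper's treatment of the shared prior may differ (an assumption here)."""
    mus = np.stack(mus)                # (T, P) per-task solutions
    precisions = np.stack(precisions)  # (T, P) diagonal precisions (e.g., Hessian/Fisher diag)
    alphas = np.asarray(alphas)[:, None]
    merged_prec = (alphas * precisions).sum(axis=0)
    merged_mu = (alphas * precisions * mus).sum(axis=0) / merged_prec
    return merged_mu

# with isotropic (equal) precisions this reduces to simple weighted averaging
mus = [np.array([1.0, 2.0]), np.array([3.0, 0.0])]
iso = [np.ones(2), np.ones(2)]
print(merge_gaussian_posteriors(mus, iso, [0.5, 0.5]))  # -> [2.0, 1.0]
```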
ORCaS: Unsupervised Depth Completion via Occluded Region Completion as Supervision Soundness: 4: excellent Presentation: 4: excellent Contribution: 4: excellent Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces a novel unsupervised framework that learns dense depth estimation from an RGB image and sparse point cloud by explicitly reasoning about occluded 3D regions. Rather than relying solely on photometric reconstruction of co-visible areas, the paper proposes to learn an inductive 3D bias through the auxiliary task of occluded region completion. ORCaS, the proposed method, first encodes RGB and sparse depth into 2D features, broadcasts them into a discretized 3D volume, and rigidly warps this volume to an adjacent view using relative pose. The ConteXt block then fills in the empty voxels corresponding to occluded regions using nearby 3D context and positional embeddings, while a new ORCaS loss enforces consistency between predicted and real adjacent-view features. This occlusion-aware training significantly improves the performance of depth predictions in an unsupervised setting. Extensive experiments on VOID1500, NYUv2, and ScanNet show that ORCaS achieves state-of-the-art performance, outperforming previous unsupervised methods by up to 8.9% on average, while maintaining real-time inference speed and demonstrating strong robustness to domain shifts, calibration noise, and extremely sparse depth inputs. - ORCaS introduces a simple yet novel idea, using occluded region completion as an auxiliary supervision signal for unsupervised depth completion.
This reframes depth completion from a purely visible-surface interpolation problem into a 3D reasoning task that requires understanding unseen geometry.
By leveraging occlusion as supervision, the method naturally learns a strong inductive bias that encourages consistent 3D representations.
This conceptual clarity and originality make the paper both theoretically appealing and practically impactful. - Across multiple benchmarks (VOID1500, NYUv2, ScanNet), the method consistently achieves state-of-the-art performance, outperforming previous unsupervised methods by up to 8.9% on average.
- The proposed method learns latent features that encode the 3D shape regularities of indoor scenes, independent of texture or lighting. Even though the model is not directly trained for domain transfer, this implicit shape prior helps it perform well in zero-shot transfer and sparse-input settings. - The authors demonstrate strong robustness to variations in calibration, scene dynamics, and input sparsity.
It maintains stable performance even with ±30% synthetic calibration noise and when trained under static-scene assumptions in dynamic environments. Major weaknesses are as below: - Most experiments focus on indoor or small-scale environments (VOID1500, NYUv2, ScanNet).
The KITTI Depth Completion results are included only in the appendix, where the improvement over prior work is relatively small (≈3%). This suggests that the learned occlusion-based bias may generalize less effectively to outdoor, long-range, or high-depth-variance settings.
A broader evaluation would be necessary to confirm the scalability of the approach beyond indoor domains. - Although the method is built around the idea of learning from occluded-region completion, the qualitative results do not visually emphasize or analyze regions where occlusion is likely to occur. Figures 2 and 3 mainly show overall depth predictions for relatively frontal or fully visible areas, rather than viewpoints where depth discontinuities, inter-object occlusions, or self-occlusions are pronounced. Without explicit highlighting or comparison, it is difficult to tell whether the proposed occlusion reasoning truly contributes to the improved depth quality. Minor comments are as below: - In the ablation section, the text description around Table 2 incorrectly describes the relative performance between Row 3 and Row 5. The numbers in the table show that Row 5 performs better, but the text argues the opposite. - The paper mentions that training is performed "in an alternating fashion" in L91-92, but provides no further explanation or details about what this process entails. There is no description of how the alternation is implemented, what modules are updated in each phase, or why this strategy is necessary. - A comparison with the baseline model, KBNet, is not presented in Table 6. Furthermore, the KITTI benchmark performance gap between the proposed method and KBNet is very marginal. - Could you elaborate on why the proposed occlusion-completion supervision may generalize less effectively to outdoor environments?
Have you tested the method on any additional large-scale or high-depth-variance datasets to evaluate scalability beyond indoor domains? - Please explain how the alternating training process is scheduled (per batch, per epoch, or per iteration), which parameters are frozen in each phase, and why this two-step optimization was preferred over joint training. - Could you provide visualizations or case studies focusing specifically on occluded or partially visible areas? How can we confirm that the learned ConteXt block completes unseen regions rather than merely smoothing co-visible surfaces? - The current setup uses only two adjacent frames for occlusion-aware supervision. Have you explored extending ORCaS to longer temporal windows or multiple adjacent views? Incorporating multi-frame context might improve occlusion stability and reduce dependence on single-pose accuracy. Do you expect the current ConteXt block or ORCaS loss to generalize naturally to that setting? - It would be interesting to know whether ORCaS could serve as a pretraining stage for other 3D perception tasks such as monocular depth estimation or scene flow. Do you believe the learned occlusion-aware features transfer effectively to other geometry-related tasks? Fully AI-generated
ORCaS: Unsupervised Depth Completion via Occluded Region Completion as Supervision Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses the self-supervised depth completion task by introducing an auxiliary objective: completing occluded regions of the scene. This auxiliary task serves as a strong inductive bias to guide the learning process for depth completion. Experimental results on the VOID1500 and NYUv2 datasets demonstrate that the proposed approach achieves superior performance compared to previous methods. - The paper is clearly written and well organized. - The proposed method is sound. - The proposed method demonstrates superior performance compared to existing approaches on indoor datasets. 1. The main text includes comparisons only on two indoor datasets. Although KITTI Depth Completion results are reported in the supplementary material, the comparison involves only a limited number of competing methods. Moreover, the performance on the KITTI DC dataset appears inferior to several previous approaches, such as DesNet. It is recommended to provide a more detailed analysis of the results on outdoor datasets to better demonstrate the effectiveness and robustness of the proposed method. 1. How is the relative camera pose obtained? Is it predicted by a network or derived from ground-truth camera poses? 2. Besides occluded regions, there are areas that do not overlap between two frames. Would these non-overlapping regions affect the depth completion learning process? 3. The difficulty of scene completion is related to the time interval between frames, as a larger interval typically results in more occluded regions. How do you determine an appropriate frame interval to best assist depth completion learning? 4. Could you provide an analysis of the impact of the number of planes used for MPI on the overall performance? Lightly AI-edited
ORCaS: Unsupervised Depth Completion via Occluded Region Completion as Supervision Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper presents ORCaS, a new unsupervised depth completion method. The core idea is to treat occluded regions as a source of self-supervision to learn a stronger, 3D-aware inductive bias for reconstructing dense depth maps from sparse depth inputs and RGB images. Concretely, ORCaS broadcasts 2D features into a 3D voxel grid, performs rigid 3D warping using relative poses, predicts “empty” regions in adjacent views, employs a ConteXt block to extrapolate local contextual features, and introduces a new ORCaS loss that learns inductive priors from these occluded regions. Experiments on VOID1500, NYUv2, and KITTI demonstrate state-of-the-art performance. 1. Clear motivation. ORCaS introduces the novel concept of occluded region completion as a supervision signal for unsupervised depth learning. By explicitly predicting unseen regions, the method enforces the model to learn a 3D-structure-aware inductive bias that goes beyond traditional visible-region reconstruction. 2. Well-structured design. The architecture is modular, interpretable, and easily integrable with existing unsupervised depth completion frameworks. It can serve as a plug-and-play component for similar tasks. 3. Comprehensive validation. Extensive experiments across VOID1500, NYUv2, and KITTI datasets demonstrate consistent and significant performance gains. Ablation studies and transfer experiments further support the effectiveness of each design choice. 4. Strong generalization. By learning to predict occluded regions, ORCaS acquires a geometry-aware prior that improves cross-dataset transfer and remains robust even with extremely sparse depth. 1. Limited theoretical explanation of ORCaS loss. While the paper empirically demonstrates the effectiveness of occlusion-based supervision, it lacks a deeper theoretical analysis explaining why predicting unobserved regions improves the learned representation for visible depth estimation. 2. Dependency on accurate camera calibration. The method relies on precise camera intrinsics and relative poses. Although this limitation is acknowledged, the paper does not include ablation or robustness studies to quantify sensitivity to calibration noise. 3. Outdated related work. The literature review mainly covers works up to 2023. It is recommended to expand this section to include up‐to‐date publications, such as: [1] Distilling Monocular Foundation Model for Fine-grained Depth Completion. CVPR 2025. [2] Completion as Enhancement: A Degradation-Aware Selective Image Guided Network for Depth Completion. CVPR 2025. [3] OMNI-DC: Highly Robust Depth Completion with Multiresolution Depth Integration. ICCV 2025. [4] PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency. ICCV 2025. [5] Tri-Perspective View Decomposition for Geometry-Aware Depth Completion. CVPR 2024. I am willing to increase the rating if those weaknesses can be addressed in the rebuttal stage, thanks. Fully AI-generated
ORCaS: Unsupervised Depth Completion via Occluded Region Completion as Supervision Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes an unsupervised depth estimation method that augments depth estimation from the target view by introducing additional supervision from the source view. The main assumption is that by using features from the target view to estimate the depth of the source view, the target features can learn to model unseen structures, thereby regularizing the shape of visible structures. Experiments demonstrate notable improvements over previous methods on the VOID1500 and NYUv2 datasets. 1. The paper is mostly clearly written and includes proper illustrations. 2. Although reconstructing occluded 3D geometry is not a new concept, applying this idea to unsupervised depth completion is novel and interesting. 3. The proposed model achieves favorable improvements over existing methods. 1. Rationale of unseen geometry learning: The rationale for using improved occluded geometry to enhance visible geometry is not clearly validated. While the authors claim effectiveness in Lines 60–64, the argument remains conceptual without concrete evidence. Since the model itself does not explicitly learn a “3D shape” of objects (L61), it is unclear whether it truly reduces reliance on input point density. Moreover, although the method claims that learning unseen geometry helps improve visible geometry, there are no quantitative or qualitative comparisons in the unseen regions. 2. Lack of ablation experiments: - (a) Depth vs. feature supervision: It is unclear why the authors use feature-based supervision instead of depth-based supervision for adjacent views. Depth supervision would be a more direct and intuitive approach and would enable quantitative comparisons in occluded regions to justify the design choice. If depth supervision performs poorly, an explanation should be provided. - (b) Computational analysis: The authors mention in L403 that the base model is KBNet with a transformer block. The added transformer head appears to contribute a significant performance gain (MAE: 39.8 → 35.3). This raises the question of whether the improvement is partly due to increased model capacity. A comprehensive comparison of #parameters, GFLOPs, and GPU memory usage among the proposed method, KBNet, and AugUndo is necessary. - (c) ConteXt module ablation: Although the authors ablate 2D and 3D representations, they do not ablate the ConteXt module under the 3D representation, nor do they analyze the effect of its hyperparameters $(k_u, k_v, k_w)$. Since this module essentially performs feature pooling, it is important to evaluate how much it contributes to the final performance. 3. Unclear writing and typos: - (a) In L241, the authors introduce $\bar{d}$ but do not explain how $\bar{X}$ is derived from $\bar{d}$. - (b) In L409–L411, the sentence “(Row 5) This is worse than the proposed 3D warping without ORCaS loss (Row 3)” is inconsistent with the reported results, as Row 5 actually performs better than Row 3. 4. 
Missing related work on occluded scene reconstruction: The following works should be cited and discussed for completeness: [1] *Peeking Behind Objects: Layered Depth Prediction from a Single Image* [2] *Layer-Structured 3D Scene Inference via View Synthesis* [3] *Behind the Scenes: Density Fields for Single-View Reconstruction* [4] *Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning* [5] *Directed Ray Distance Functions (DRDF) for 3D Scene Reconstruction* [6] *X-Ray: A Sequential 3D Representation for Generation* [7] *LaRI: Layered Ray Intersections for Single-View 3D Geometric Reasoning* [8] *RaySt3R: Predicting Novel Depth Maps for Zero-Shot Object Completion* The following experiments and analyses are recommended for the revised version: 1. Replace feature-based supervision with depth-based supervision for adjacent views in the loss function. Analyze whether the predicted depths beyond the visible regions of the target view improve in the source view. 2. Under the 3D representation, ablate the ConteXt module and its hyperparameters $(k_u, k_v, k_w)$. 3. Provide a computational comparison (including #parameters, GFLOPs, and GPU memory) among the proposed method, KBNet, and AugUndo. Lightly AI-edited
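All four ORCaS reviews refer to a loss that enforces feature consistency only on occluded (empty-after-warping) regions. The snippet below is an illustrative stand-in for that kind of objective, assuming a cosine-similarity form and a binary empty-voxel mask; it is not the paper's exact ORCaS loss, and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def occlusion_masked_consistency(pred, target, empty_mask, eps=1e-8):
    """Feature consistency computed only on voxels that were empty after rigid
    warping, i.e., the occluded regions the model must complete. Illustrative
    stand-in, not the paper's exact formulation."""
    # pred, target: (B, C, D, H, W) feature volumes; empty_mask: (B, 1, D, H, W) in {0, 1}
    cos = F.cosine_similarity(pred, target, dim=1, eps=eps).unsqueeze(1)  # (B, 1, D, H, W)
    return ((1.0 - cos) * empty_mask).sum() / (empty_mask.sum() + eps)

# usage sketch with random tensors
B, C, D, H, W = 2, 16, 8, 24, 32
pred = torch.randn(B, C, D, H, W)
target = torch.randn(B, C, D, H, W)
mask = (torch.rand(B, 1, D, H, W) > 0.7).float()
print(occlusion_masked_consistency(pred, target, mask).item())
```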
When Can You Get Away with Low Memory Adam? Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors present SlimAdam, a memory-efficient version of the Adam optimizer that achieves up to 99% memory savings by compressing the large second-moment statistics used in Adam's adaptive learning rate computations. Rather than storing full per-parameter second moments, SlimAdam takes the mean of these moments along the fan-in or fan-out dimensions 'when' appropriate. The 'when' is determined by a Signal-to-Noise Ratio (SNR) metric. SNR measures the concentration of second-moment values (square of mean/variance). Higher SNR indicates tighter clustering, which justifies compression. Compression is applied only in layers where SNR is high, while per-parameter state granularity is retained where SNR is low. The authors also state that since different layers show compression viability across different dimensions (fan-in/fan-out), compression rules must be derived for each model. To determine compression rules for each layer, the authors propose training a small proxy model at a reduced learning rate. SNR statistics from the proxy reliably generalize across larger models of the same architecture and task, informing safe compression dimensions for the full target model. The authors conduct a comprehensive empirical analysis across a wide range of large models and training tasks, revealing nuanced differences in compressibility across various layer types. Their findings highlight that attention components (such as keys and queries), value and projection layers, MLP layers, and token embedding/vocabulary layers each exhibit distinct compression characteristics. This detailed analysis reveals important, insightful architectural patterns that govern how adaptive moment compressibility varies. Overall, the paper makes a relevant contribution to efficient optimization for large-scale deep learning, addressing a critical bottleneck in resource consumption. It balances rigorous analysis with practical effectiveness, although clearer exposition, especially regarding the proxy model methodology, would improve accessibility. SlimAdam is hence a valuable tool for researchers, saving memory without sacrificing Adam's effectiveness. 1) SlimAdam achieves up to 99% memory savings compared to the original Adam optimizer, while fully preserving Adam's effectiveness. It can be seamlessly swapped in place of Adam without requiring any code modifications or additional overhead. 2) The paper presents clear and well-motivated research questions supported by extensive experiments across diverse model architectures and tasks, demonstrating robust generality. 3) The authors provide a detailed algorithmic description alongside publicly available code, ensuring reproducibility. 4) Ablation studies are thoughtfully designed and thoroughly explained, offering valuable insights into the contributions of individual components and hyperparameters. 1) The main method (the SlimAdam algorithm) is explained only in the appendix, and critical implementation insights (proxy model construction, SNR statistics collection) are not clearly presented in the main text. This prevents immediate accessibility and understanding.
2) The concept and practicalities of the proxy model for collecting SNR statistics are not deeply explained. Details about how proxy model size affects SNR relevance and how well proxy-derived rules scale to actual large models could be clearer. The compute overhead added by such proxy runs should also be mentioned. 3) How SNR statistics are adopted over the course of training in the actual model could be explained as well. Minor Weaknesses: Appendix C.1 is not completely written. 1) How does proxy model size affect the SNR statistics for different tasks and architectures? 2) The paper states that the proxy model ignores early SNR statistics and averages SNR values over the next few steps rather than all steps; is the same applied to the full model as well, meaning, is compression not applied for the first few steps, and how is it adapted over training? I am amenable to changing the score if the questions and weaknesses are addressed. Fully human-written
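Both SlimAdam reviews hinge on an SNR test that decides whether a layer's second moments can be collapsed along the fan-in or fan-out dimension. Below is a minimal sketch of how such a check and compression step might look; the SNR formula used here (squared mean over variance), the threshold of 1.0, and the function names are assumptions, since the two reviews state the definition slightly differently.

```python
import numpy as np

def snr(v, axis):
    """SNR of a second-moment matrix v along one axis (mean^2 / variance here;
    the paper's exact definition and threshold are assumptions in this sketch)."""
    mu = v.mean(axis=axis)
    var = v.var(axis=axis) + 1e-12
    return (mu ** 2 / var).mean()

def compress_second_moment(v, threshold=1.0):
    """If either the fan-in (axis=1) or fan-out (axis=0) direction is concentrated
    enough, replace v by its mean along that direction (broadcast back at use time)."""
    snr_in, snr_out = snr(v, axis=1), snr(v, axis=0)
    if max(snr_in, snr_out) < threshold:
        return v, None                                   # keep full per-parameter state
    axis = 1 if snr_in >= snr_out else 0
    return v.mean(axis=axis, keepdims=True), axis        # one value per row or column

# usage sketch: a (fan_out, fan_in) second-moment matrix
v = np.abs(np.random.randn(256, 512)) ** 2
v_slim, axis = compress_second_moment(v)
print(v.shape, "->", v_slim.shape, "compressed along axis:", axis)
```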
When Can You Get Away with Low Memory Adam? Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper studies the signal-to-noise ratio of Adam’s second-moment tensors for every layer along the input-channel (column) and output-channel (row) directions, finding instances where entries exhibit low variance relative to their mean and can safely share statistics during training; using these SNR profiles, it introduces SlimAdam, which collapses second moments only on the high-SNR direction of each layer, cutting memory while preserving Adam-level convergence and accuracy. The paper tackles a well-motivated problem—the large memory footprint of Adam’s second-moment matrices—and clearly pinpoints instances where collapsing a layer’s second-moment entries along one dimension to a single scalar per row or column shrinks an $m \times n$ matrix to just $m$ or $n$ values, cutting memory use while retaining Adam-level accuracy and stability. Since there are almost no changes from my last review of the paper, I'll keep the core of my argument. I’ll split my review into two parts — one on the empirical analysis and one on the proposed optimizer. ### Empirical analysis and design rationale The paper’s primary findings are almost entirely empirical, and this lack of theory leaves several key decisions unclear—especially because the empirical signals themselves are not particularly strong. Axis sharing is constrained to whole fan-in or fan-out dimensions purely for implementation convenience, with no exploration of alternative groupings or proof that these axes are optimal (e.g., would results change if one considered randomly partitioning a layer’s parameters into two equal-size groups?). Similarly, the paper adopts (the interesting metric) SNR with a heuristic threshold as the only compression criterion, although simple variance (which answers “How much $\ell_2$ loss do we pay if we collapse this vector to a scalar?”) would align more directly with the intuition the authors cite (“If entries along a dimension exhibit low variance relative to their mean, they can be effectively represented by a single value”). Learning rate is the only hyperparameter the paper systematically analyzes. It is presented as the dominant knob that shifts SNR and thus determines which layers can be compressed, yet the text provides no a-priori reason why learning rate—rather than, say, Adam’s momentum coefficients or the batch size—should hold that position. ### Optimizer details The optimizer requires selecting a compression axis for each layer or layer type. Choosing a compression axis means either relying on proxies or heuristics, or collecting fresh SNR statistics. My concern is that the latter defeats the purpose in some cases, and the former is not reliable. Alternatively, we can train a small proxy model or reuse generic rules. Yet the authors themselves show that preferred compression axes shift with dataset, width, and vocabulary size. Even within the same dataset and width, layers of the same type show different preferences. Depth-averaging does not fully solve the problem for users operating at the tightest memory margins or in domains whose depth-specific SNR patterns have not been studied yet.
Even the stronger patterns they find—for example, compressing along the embedding dimension versus the token dimension—may not yield an SNR above the cutoff needed to justify compression. Full-size SNR collection defeats the purpose. To decide the sharing axis, you must first run the uncompressed model under standard Adam long enough to gather per-layer SNR statistics. During this warm-up, you still store the full second-moment tensors, so the memory spike SlimAdam tries to avoid is paid up front. For practitioners who want to fit a slightly larger model into fixed hardware, this spike means the maximum model size is still bounded by Adam’s footprint during the warm-up, undermining the value of a lighter optimizer. Please address my concerns above, especially around the empirical nature of the evidence, axis selection, and practicality at tight memory budgets. In addition, a high-level clarification would help: how should we interpret “compressibility” in this work beyond plots of SNR? In other words, is SNR a sufficient observable for when per-parameter adaptivity is redundant, and how does its dependence on learning rate versus other hyperparameters shape the generality of your claims? Lightly AI-edited
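To make the warm-up concern above concrete, here is a minimal sketch (my own illustration, not the authors' code) of what axis-wise mean compression of a second-moment matrix looks like in PyTorch; the shapes are arbitrary and the helper is hypothetical. The point is that the reduced state only exists after an axis has been chosen, while the SNR warm-up still requires the full matrix.

```python
import torch

def collapse_second_moment(v: torch.Tensor, axis: int) -> torch.Tensor:
    """Replace every entry of the second-moment matrix along `axis` with the
    mean over that axis; only the reduced tensor needs to be stored, it is
    expanded here just for clarity."""
    return v.mean(dim=axis, keepdim=True).expand_as(v)

fan_out, fan_in = 1024, 4096                   # a single linear layer
v_full = torch.rand(fan_out, fan_in)           # full Adam second-moment state

v_share_in = v_full.mean(dim=1)                # collapse along fan-in: fan_out values kept
v_share_out = v_full.mean(dim=0)               # collapse along fan-out: fan_in values kept

print(v_full.numel(), v_share_in.numel(), v_share_out.numel())
# 4194304 1024 4096 -> the saving materialises only after the axis is chosen;
# during the SNR warm-up the full v_full is still resident in memory.
```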
When Can You Get Away with Low Memory Adam? Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a layer-wise Signal-to-Noise Ratio (SNR) analysis to determine when the second-moment tensors in optimization algorithms (e.g., Adam) can be compressed by replacing them with their dimensional means. Given that SNR is computed as $\text{mean} / \text{variance}$, it serves as a natural metric for this purpose: a high SNR indicates that a tensor can be effectively approximated by its mean without significant performance loss. This approach provides a practical, quantitative guide for compressing optimizer states and offers evidence that Adam may not always require full second-moment information. This work proposes to apply SNR as a metric to guide the compression of second-moment tensors with their means in LLM training. It offers a threshold-based criterion to determine when and how such mean compression can be applied across different architectural components of LLMs (e.g., query, key, value, and MLP layers). Furthermore, this work empirically investigates several factors influencing compressibility, including learning rate, data distribution, and initialization. 1. The main motivation of this work is to establish a metric for guiding dimension-wise mean compression of second-moment tensors and provide SlimAdam. However, this goal appears to overlap with Adam-mini [1], which not only implements a compression method based on block-wise mean values but also provides insights based on Hessian structure to explain why the full second moment may be unnecessary and to guide how to compress. The authors should more clearly delineate their contributions and explicitly contrast their approach with the insights and methods provided by Adam-mini. 2. SNR is a natural choice for quantifying the viability of mean compression, given its formula of $\text{mean} / \text{variance}$. The paper does not sufficiently justify why it is superior to other plausible metrics. For instance, measures based on the L2-norm or KL-divergence of the error introduced by compression, or just variance, could be more direct and computationally efficient. The authors should demonstrate the unique advantages of SNR over other alternatives through theoretical analysis or empirical comparison. 3. For mean compression, it is clear that higher SNR correlates with better compressibility. However, the method remains dependent on an empirically set threshold (e.g., $\alpha=1$) to make compression decisions. This dependency not only limits the generality of the method by introducing a potentially sensitive hyperparameter across different scenarios but also raises the question of whether other metrics (e.g., L2-norm, KL-divergence, or variance) could perform just as effectively with a similarly tuned threshold. 4. The utility of SNR seems limited to mean compression and may not extend to or guide other compression paradigms (e.g., low-rank factorization, quantization). [1] Zhang, Yushun, et al. "Adam-mini: Use fewer learning rates to gain more." arXiv preprint arXiv:2406.16793 (2024). 1.
The choice of SNR is intuitive for mean replacement, but why is it superior to other direct measures of compression error, such as the L2-norm or KL-divergence between the original and compressed tensor? Could the authors provide either (a) an empirical ablation study comparing the compression guidance performance of SNR against these other metrics, or (b) a theoretical argument for why SNR is an optimal or more robust criterion? 2. If the performance of the method is similar when using a simple threshold on other metrics (e.g., compress if $\text{variance} < X$, or compress if $\text{L2-norm of error introduced by compression} / \text{L2-norm of target tensor}$ falls below a threshold), does this suggest the core insight is about identifying low-variance parameters rather than the unique information provided by SNR? What is the specific advantage of the SNR ratio over just using the variance or standard deviation? 3. The threshold $\alpha=1$ is presented as a critical value for making compression decisions. How was this value determined? Is it robust across different model architectures, layers, and tasks? Could the authors show sensitivity analyses for this threshold to demonstrate its generality? 4. The presentation could be improved for better clarity and reproducibility. For example, using pseudocode rather than plain text would help readers better understand the algorithm's workflow. Lightly AI-edited
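To make questions 1 and 2 above concrete, here is a small sketch of the three candidate criteria computed on the same toy second-moment slice (my own illustration, not code from the paper; the toy tensor and my exact reading of the SNR formula are assumptions). If the three criteria rank layers similarly, the specific choice of SNR would need a stronger justification than the threshold alone.

```python
import torch

def snr(v: torch.Tensor, dim: int) -> torch.Tensor:
    """Mean^2 / variance along `dim`, averaged over the other dimension
    (my reading of the paper's SNR; treat the exact form as an assumption)."""
    mean = v.mean(dim=dim)
    var = v.var(dim=dim, unbiased=False)
    return (mean.pow(2) / (var + 1e-12)).mean()

def variance_criterion(v: torch.Tensor, dim: int) -> torch.Tensor:
    """Plain variance along `dim` (lower means more compressible)."""
    return v.var(dim=dim, unbiased=False).mean()

def relative_l2_error(v: torch.Tensor, dim: int) -> torch.Tensor:
    """Relative L2 error introduced by collapsing `dim` to its mean."""
    v_hat = v.mean(dim=dim, keepdim=True).expand_as(v)
    return (v - v_hat).norm() / v.norm()

v = torch.rand(512, 2048) ** 2   # toy stand-in for a layer's second-moment matrix
for name, fn in [("SNR", snr), ("variance", variance_criterion), ("rel. L2 error", relative_l2_error)]:
    print(f"{name}: {fn(v, dim=1).item():.4f}")
```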
When Can You Get Away with Low Memory Adam? Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a metric called SNR (gradient version) to address the high memory usage of Adam’s second moment. The paper compresses the second-moment tensor along high-SNR dimensions to a single mean value (or a small number of values). The authors claim this approach (SlimAdam) can save up to 99% of the second-moment memory while maintaining Adam’s properties and stability. • The idea of quantifying the compressibility of Adam's second moments on a per-layer, per-dimension basis is interesting. • The authors conducted extensive experiments across various architectures (GPT, ViT, ResNet) and tasks. 1. Lack of Theoretical Foundation: As an optimizer paper, it lacks a convergence proof and relies almost entirely on experimental observations (e.g., using a 10x lower LR). 2. Missing Essential Data: The paper does not include 'loss vs. step' curves, a critical metric for evaluating optimizers. 3. Methodological Ambiguity: There is no logical basis for the proxy model design or the SNR threshold of $\alpha=1$. 4. Questionable SNR Justification: The underlying assumption of mapping the mean of $V_t$ to 'signal' and its variance to 'noise' is not justified (the usual notion of SNR contrasts a true signal with noise). 5. Exaggerated Contribution & Poor Comparison: The 99% claim is misleading (it's 50% of the total), and comparisons to SOTA optimizers that compress both moments (e.g., SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization, which compresses both the first and second moments for a total memory reduction of up to 96%) are missing. • What is the theoretical justification for treating the mean as 'signal' and the variance as 'noise' in your SNR definition $SNR_K = \mathbb{E}[(\mathbb{E}_K[V_t])^2 / Var_K[V_t]]$ from an optimization perspective? Is there a theoretical basis to claim that a low-variance (high SNR) tensor is inherently 'compressible'? • What is the specific theoretical justification for choosing the SNR threshold $\alpha=1$? Can you guarantee this value is universally optimal across different tasks and model architectures? • Can you provide loss-vs-step curves for your key results (e.g., Figure 8) to demonstrate that SlimAdam achieves the same 'final' performance with the same convergence speed as Adam? • Compared to optimizers like SMMF (2025), which compress both first and second moments for a 96% total memory saving, what is the practical advantage of SlimAdam, which only compresses the second moment for a ~50% total saving? Lightly AI-edited
When Can You Get Away with Low Memory Adam? Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes to compress the second-moment tensor of Adam by replacing each per-coordinate value with the average across specific dimensions. The method defines the signal-to-noise ratio of the second moment during training and compresses when this ratio is large. As the memory of the optimizer accounts for a significant fraction of the memory requirement for neural network training, compressing it is an important problem. The paper studies a simple method for this task and gives a thorough evaluation on different tasks and different modules of the network. The method seems to have a large overlap with existing literature. The paper mentions Adam-mini, which already has a large overlap in terms of both the algorithm and the intuition behind the approach. Similar techniques also appear in several other papers, such as "Lean and Mean Adaptive Optimization via Subset-Norm and Subspace-Momentum with Convergence Guarantees" (ICML 2025) and "APOLLO: SGD-like Memory, AdamW-level Performance" (MLSys 2025). These papers additionally save memory for the momentum, resulting in lower memory usage than the method proposed here. In light of these works in addition to Adam-mini, the contribution of the new paper seems limited. A minor point: please cite the published versions of the references. For example, the Adam-mini paper is in ICLR 2025. The pre-conditioner changes over time as Adam changes V in every step of training. The SNR analysis is fixed up front and the state is compressed in exactly the same way throughout, but the condition number of V could change over time. Do you see any change in the condition number of V over time? If not, is this a property of the training data, and are there cases where the condition number of V changes? In some work, it is mentioned that gradient descent operates at the "edge of stability"; do you see any changes if the SNR analysis is done at different step sizes? Fully human-written
MFCL: A Multi-modal Function Calling Evaluation for Large Language Models Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces MFCL, the first large-scale benchmark for evaluating multi-modal function calling (tool use) in large language models (LLMs). MFCL comprises 8.2K expert-verified tasks across three suites: True Audio, Text Audio, and Vision. Each example pairs a multi-modal user query (text, speech, or image) with a ground-truth tool-call trace, and includes controlled perturbations (accents, noise, occlusions, etc.) to stress the perception-to-action pipeline. MFCL provides an automatic grader for exact-match scores on function names and arguments, enabling robust, reproducible evaluation without reliance on LLM judges. The authors benchmark leading models (e.g., GPT-4o, Gemini, Claude, GLM, xLAM), analyze failure modes (named-entity ASR errors, conversational drift, tool avoidance), and present a taxonomy to guide future research. The dataset, taxonomy, and diagnostics are released to accelerate progress on reliable multi-modal agents. The strengths of the paper are: 1. Originality: - First benchmark to systematically evaluate multi-modal function calling under real-world perturbations. - Introduces controlled perturbations and a taxonomy of failure modes. - Unifies text, audio, and vision evaluation in a single framework. 2. Quality: - Expert-verified tasks, realistic data augmentation, and comprehensive error analysis. - Automatic grading at function and argument level, enabling reproducible and robust evaluation. - Strong experimental design, with ablations and comparisons across many models. 3. Clarity: - Clear motivation, methodology, and results presentation. - Figures and tables directly support claims; taxonomy is actionable. 4. Significance: - MFCL will become a standard for evaluating multi-modal tool-augmented agents. - The insights into failure modes and robustness are valuable for both research and deployment. No major weaknesses. A multi-modal function calling benchmark is very useful for developing agentic LLMs in real-world scenarios. Minor questions: 1. The failure mode analysis is very interesting. Did the authors obtain quantitative results in addition to the qualitative examples? 2. Any plan for the release of the benchmark? 3. Have you evaluated, or do you plan to evaluate, smaller models on the benchmark, such as Qwen-Omni and other multi-modal LLMs of similar size? Fully AI-generated
MFCL: A Multi-modal Function Calling Evaluation for Large Language Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. 1. This paper introduces MFCL, the first unified benchmark to evaluate structured function calling from multi-modal inputs (speech and vision). 2. It systematically injects realistic perception perturbations and uses exact-match automated scoring to diagnose failures. 3. Results reveal significant degradation under noise and visual distortions, exposing critical weaknesses such as tool avoidance, keyword selection errors, and conversational drift in modern models. 1. Realistic perturbation design that simulates real-world failure conditions in multimodal function calling and thoroughly analyzes their impact. 2. Extensive evaluation across multiple leading commercial models, demonstrating substantial experimental effort and providing meaningful comparative evidence for the community. 1. Despite arguing for “real-world audio robustness,” the dataset relies on synthetic TTS rather than human-recorded speech. 2. The benchmark’s core metric (exact-match JSON output) misaligns with real agent objectives, overlooking task-level success, semantic equivalence, and cost-aware behaviors. 3. The turn and clarification rules constrain reasonable uncertainty-handling strategies, potentially biasing models toward brittle “just emit JSON” behaviors instead of safe, real-world interaction patterns. While the benchmark is valuable, I feel there are several questions: 1. How does strict exact-match scoring avoid misaligning the benchmark with real-world multi-turn agent behavior (uncertainty handling, clarification, self-correction)? 2. The turn semantics and clarification rules only allow spelling/value confirmations, while ignoring broader ambiguity resolution. How do these constraints avoid discouraging realistic uncertainty-handling strategies that agents must perform in practical deployments? 3. Given the recent emergence of audio-based function-calling benchmarks, is it appropriate for MFCL to claim to be the “first” in this space, and could the authors clarify the concrete differences that distinguish MFCL from prior speech-focused evaluations? Heavily AI-edited
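To make question 1 concrete, my understanding of strict exact-match grading is roughly the sketch below (my own illustration with a hypothetical weather tool, not the authors' grader): a semantically equivalent call, e.g. "NYC" instead of "New York City", is scored as a failure.

```python
import json

def exact_match(predicted_json: str, gold: dict) -> bool:
    """Strict grading: the prediction must name the same function and supply
    exactly identical argument values; no semantic equivalence is allowed."""
    try:
        pred = json.loads(predicted_json)
    except json.JSONDecodeError:
        return False  # malformed output counts as a miss
    return pred.get("name") == gold["name"] and pred.get("arguments") == gold["arguments"]

gold = {"name": "get_weather", "arguments": {"city": "New York City", "unit": "celsius"}}

ok = '{"name": "get_weather", "arguments": {"city": "New York City", "unit": "celsius"}}'
near_miss = '{"name": "get_weather", "arguments": {"city": "NYC", "unit": "celsius"}}'

print(exact_match(ok, gold))         # True
print(exact_match(near_miss, gold))  # False, although the call would likely succeed in practice
```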
MFCL: A Multi-modal Function Calling Evaluation for Large Language Models Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces MFCL, a new benchmark for function calling in multimodal scenarios. The paper examines several cutting-edge models and reveals common failure patterns of these models, providing insights into developing multimodal agents. - The topic (function-calling and multimodality) is timely. - Comprehensive dataset design - Clear description of the dataset construction - Limited insight beyond enumeration of “failure modes.” Most FM categories merely restate known LLM limitations (ASR errors, over-reliance on text, conversational drift). - Missing implementation details, such as the decoding hyperparameters. This may reduce the reproducibility of the paper. - Statistical shallowness. Reported numbers are raw accuracies with no confidence intervals or significance testing. - How did you verify that the TTS-generated and noise-augmented audio realistically represents spontaneous human speech or real-world acoustic conditions? Was any human evaluation conducted to confirm perceptual naturalness? - What procedures ensured the correctness and consistency of the expert-verified tasks? How many annotators were involved? What inter-annotator agreement was achieved? - Given that the Vision set contains only 250 examples, why do you consider its coverage sufficient for robust evaluation? Lightly AI-edited
MFCL: A Multi-modal Function Calling Evaluation for Large Language Models Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper addresses two limitations of current multimodal benchmarks: 1) reliance solely on text-based tools, and 2) lack of feedback across different modalities and fine-grained details. To tackle these issues, the authors propose MFCL (Multi-modal Function Calling Evaluation), which consists of three components: True Audio, Text Audio, and Vision. The authors establish specific guidelines for generating each type of data. They evaluate several mainstream models on the proposed benchmark and analyze the results from both audio and visual perspectives. From the audio perspective, they find that current multimodal large models are highly sensitive to speech noise and often fail to confirm critical entities, leading to task failure. From the visual perspective, they observe that these models still lack sufficient attention to details, along with limited tool-calling capabilities and self-correction abilities. There is currently a scarcity of in-depth evaluation benchmarks for multimodal large models, and this paper contributes meaningfully to this field. 1. The introduction is somewhat disorganized. The authors mention two research gaps, but starting from the third paragraph, they delve into the construction of the benchmark without directly linking it to how these gaps are addressed. I suspect the authors intended to highlight the lack of a benchmark combining API and multimodal evaluation, but the current version is hard to follow, making the motivation unclear. Additionally, Figure 1 is not referenced in the text. 2. There is a lack of data validation, particularly human evaluation, making it difficult to assess the benchmark’s quality and potential biases. Furthermore, certain details remain unclear. For instance, the authors state that vision data requires "one clear visual clue," but there is no in-depth analysis of how "clear visual clue" is defined or identified. 3. The benchmark does not effectively integrate multiple modalities. Although the authors claim that the three components are mutually supportive, the paper does not demonstrate how these components interact. 4. Experimental settings and evaluation metrics are crucial, yet placing them entirely in the appendix makes the paper hard to follow. The analysis section summarizes numerous issues. Among these, which problem is the most critical and has the greatest impact on the performance of current multimodal LLMs? Could resolving this issue potentially lead to the resolution of other problems? Lightly AI-edited
CaseGen: A Benchmark for Multi-Stage Legal Case Documents Generation Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. CaseGen presents a benchmark for producing Chinese legal case documents using LLMs. As the title of the paper suggests, they produce these legal case documents in stages, progressively giving the previous stage's output in the prompt for the next. They propose using an LLM as a judge to evaluate the documents produced by different LLMs. They then compare LLM-as-a-judge evaluation with human evaluation and conclude that LLM-as-a-judge is sufficient. S1. While benchmarks for a few legal tasks like judgement prediction, case similarity prediction, etc. exist, this paper attempts to produce a benchmark for long text generation, where the long texts are the specific case documents used in the Chinese court system. S2. The authors have taken the trouble to engage legal experts to annotate the case documents produced by different models. S3. The multi-stage case document generation and using prompt styles like chain of thought are reasonable. W1. The case document discussed in the paper seems specific to the court system in China and it is not obvious how this dataset helps with structured long text generation using LLMs. The examples provided and the paper itself do not explain why this is a complex task beyond talking about the accuracy requirements in this domain, hallucinations, etc. Given that the most recent models have become quite good at generating long text, the particular gap that this benchmark is filling is not very apparent. W2. The multi-stage document generation seems a reasonable solution. However, it is not clear how this helps with correcting factual errors. One could assume getting precise facts of the case - say a number, date, or amount of money - using a retrieval system is more foolproof. Such a solution could be implemented using agents and tools. There isn't much discussion on how this work gets the facts correct and not just the format of the document correct. W3. The experimental observation that specialized legal models don't compare well with bigger models is understandable. But it is not clear why Claude Sonnet or GPT-4o-mini struggle against Qwen-72B and Llama-70B models, especially on the reasoning tasks. This is a bit counterintuitive. Claude's superior performance on facts might highlight the need for retrieval-based facts mentioned in W2 above. 1. Did you try implementing your CaseGen solution using agents and tools? 2. Did you consider any adversarial ideas, where you use the generated text to produce a summary and then compare it with a summary from the human-written document? Just curious about this. It's fine if this is not a valid evaluation method in your opinion. Fully human-written
CaseGen: A Benchmark for Multi-Stage Legal Case Documents Generation Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces "CaseGen," a new benchmark for legal case document generation, which splits the complex process into four subtasks: Drafting Defense Statements, Writing Trial Facts, Composing Legal Reasoning, and Generating Judgment Results. The authors contribute a dataset of 500 cases annotated by legal experts and propose an LLM-as-judge evaluation method as an alternative to metrics like BLEU and ROUGE. 1. The paper addresses an important and practical challenge in the legal domain. 2. The creation of an expert-annotated dataset is a valuable contribution. My major concerns are about clarity and evaluation. The descriptions of the dataset construction and task definitions are ambiguous, making the work difficult to understand. More critically, the reliability of the evaluation is questionable; the reported human inter-annotator agreement (Kappa < 0.5) is too low to establish a reliable "ground truth", which in turn undermines the validity of the proposed LLM-as-judge. Detailed Comments are shown below: 1. Ambiguity in Case Selection: The procedure for selecting the 500 representative cases is underspecified. The authors state they used K-means clustering but omit crucial details: How many clusters were generated? What was the distribution of cases across clusters (e.g., were sizes balanced)? What sampling strategy (e.g., random) was used to select cases from these groups? The paper should also justify why clustering was used over a more interpretable method for ensuring diversity, such as sampling cases based on their legal charges. 2. Missing Annotation Consistency: The paper does not report the inter-annotator agreement (IAA) scores from the initial dataset annotation phase (Section 4.2.2). This metric is essential for assessing the quality and reliability of the ground-truth data itself. 3. Lack of Data Filtering: The decision to keep all 500 initially selected cases is questionable. The authors note that some cases involved uncertain evidence or non-textual evidence (audio, images). Such cases are poor candidates for a text-generation benchmark and should have been filtered out to ensure data quality and task focus. 4. Ambiguous Task Definitions: The definitions of the four subtasks are unclear, particularly regarding their specific inputs. For example, in Task 3 (Composing Legal Reasoning), it is not specified whether the input includes only the trial facts or also the original prosecution, evidence, and defense statements. Besides, Figure 2 appears to be misleading. It depicts the input for Task 2 as solely the "Defense," whereas the text in Section 4.1.2 states the input is the "evidence list, prosecution, and defense statement." This inconsistency should be resolved. 5. Unsupported Motivation for Task Decomposition: The paper claims that splitting the generation process into four subtasks is necessary because the end-to-end task is too complex for current LLMs. This central claim is not substantiated with any empirical evidence. 
To justify this design choice, the authors should provide results from a baseline experiment showing a state-of-the-art LLM failing at the complete, end-to-end generation task. 6. Incomplete Dataset Statistics: Table 1, which presents the statistics for the CaseGen dataset, only provides average lengths. To give readers a comprehensive understanding of the dataset, this table should be expanded to include other key descriptive statistics, such as the minimum, maximum, and standard deviation for document lengths. 7. Low Human Agreement in Evaluation: The paper reports a human annotator's Kappa score of "almost below 0.5" for the evaluation task. A Kappa in this range (often considered "slight" or "fair" agreement) is insufficient to establish a reliable ground truth. If human experts cannot consistently agree on the quality of the generated text, it is difficult to trust the validity of an LLM-as-judge method that is calibrated against this inconsistent standard. Please see the weaknesses. Lightly AI-edited
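As background for point 7 above, Cohen's Kappa discounts chance agreement, which is why moderately high raw agreement can still yield a score below 0.5; the sketch below uses toy labels of my own, not the paper's annotations.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = Counter(a), Counter(b)
    expected = sum((pa[c] / n) * (pb[c] / n) for c in set(a) | set(b))
    return (observed - expected) / (1 - expected)

# Toy quality judgments from two hypothetical legal experts.
ann1 = ["good", "good", "bad", "good", "bad", "good", "good", "bad", "good", "good"]
ann2 = ["good", "bad", "bad", "good", "good", "good", "bad", "bad", "good", "good"]

print(round(cohens_kappa(ann1, ann2), 3))  # ~0.35: 70% raw agreement, yet kappa well below 0.5
```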
CaseGen: A Benchmark for Multi-Stage Legal Case Documents Generation Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces CaseGen, a benchmark for multi-stage legal case document generation in the Chinese legal domain, built on 500 cases sampled and annotated by legal experts. The paper uses an LLM-as-a-judge evaluation framework, further validated by human experts. The paper tests many generic and domain-specific LLMs on the proposed benchmark. The dataset, CaseGen, is claimed by the authors to be the first Chinese benchmark designed to evaluate LLMs in legal case document generation. The number of annotated documents is 500, which is a reasonable number given that legal annotators are not easily available. The annotation spans the following sections -- Prosecution, Defense, Evidence, Events, Trial Fact, Reasoning, and Judgment. The authors extensively discuss evaluation dimensions and how each generated document must be evaluated on several grounds, depending on the structure of a legal document. The dataset enables the evaluation of four tasks: (1) drafting defense statements, (2) writing trial facts, (3) composing legal reasoning, and (4) generating judgment results. The human annotation process is discussed in detail. The need for LLM-as-a-judge is not clear if a qualified pool of human experts is already available. Human inter-annotator agreement is not reported. More ablations could have been tried, e.g., with temperature and other parameters. Truncation of the input may have led to serious loss of context. Instead, the authors could have fed the documents in chunks. Length is a major challenge for legal documents, and this issue needed to be handled instead of being bypassed. The claim that “legal-specific LLMs exhibit suboptimal performance” requires additional experiments to provide a detailed understanding of the points of failure. It is understandable that a legal-specific LLM trained on a task space that bears little to no resemblance to the task of case document generation (where the demands on LLMs are significantly higher) falters. An elaborate set of observations established through experiments can enhance the value of the paper and potentially help future research. Directly reporting observations on aspects such as how model size impacts reasoning and other capabilities on the task (even better if done per section) would strengthen the paper. Can a sub-10B model (LLaMA 3.1 8B or Qwen 2.5 7B) perform competitively? How were annotation disagreements resolved? How did performance on the truncated documents compare to that on the shorter ones? Fully human-written
CaseGen: A Benchmark for Multi-Stage Legal Case Documents Generation Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper aims at benchmarking legal case document generation across an entire lifecycle of the legal process: from drafting defense statements and writing trial facts to composing legal reasoning and generating judgment results. The paper also proposes an evaluation pipeline using an LLM as a judge, backed up by human annotations. 1. I really appreciate the great effort in setting up and collecting human annotations. I love your attempts to add rigor to your evaluations and how you also filter out lower-quality annotations to make sure your data is more reliable. 2. I also really love your problem formulation, because it studies a life cycle of legal document generation, which is something novel that the existing 'comprehensive' benchmarks might be missing - I think they study different stages of legal case document generation separately and then aggregate the analysis, but not with a focus on an entire lifecycle. So I think this work adds a nice complement and an original angle to the legal NLP research community. 1. There are some presentation issues that I think are detrimental to the quality of the paper. First and foremost, it's the citation format that needs to be fixed. I don't think ICLR cites by numbers? I think the convention is to go with \citet or \cite in LaTeX. Second, Figure 1 offers a decent qualitative example, but I fail to connect the illustration to Figure 2. Also, you have four tasks for your set-up, and I am finding it hard to understand what those four tasks are based on Figure 1 alone. I think there should be significant modifications. 2. I appreciate you releasing your inter-annotator agreement level - it is low (~50%), which I think reduces the rigor of your eval, because it shows that the tasks are not that verifiable, even by 'gold' standards. So I hope you can analyze why the agreement level is low - just a few in-depth examples would give us much information and help establish the verifiability of this task. 3. As I said, I think the problem in this paper is original and interesting. However, I think what's missing is that you can also study how different components relate to each other in this longitudinal process, like tracking the long context of the language model and seeing whether it can improve the overall generation. Maybe some ablation study can help, but it feels more like future work to me. I would really appreciate it if you could clarify the low inter-annotator agreement level and provide a few failure-mode analyses, with some comments on how fundamentally verifiable this task is. Fully human-written
Chain of Time: In-Context Physical Simulation with Image Generation Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a way for Image Generation Models to generate k steps in the future for a given input video. The challenge is getting the physics correct when simulating the next k steps. The proposed method is inspired by human mental simulation (where a trajectory is simulated in mind to predict the future state) and Chain of Thought prompting in LLMs (where the model is asked to answer step-by-step). The method works by giving the model input frames and asking the model to simulate small time steps. Implicitly, the model de-renders, simulates the next step using transition dynamics, and re-renders to give the next frame. They find the physics of the trajectory stays consistent when the time step $\Delta t$ is small. - Shows strong results on the 2D motion and gravity scenes - There is partial success in more complex simulations, like fluids and a bouncing ball - The method works at inference time and works with existing models like GPT-4o - The mechanism is implicit (de-render, transition based on a world transition matrix, render) and difficult to test - In 3D scenes, the early error seems to compound, making it difficult to simulate longer time steps - Not much comparison with other existing methods (video or world models) for physically plausible image generation - The generalization seems limited to very simple scenes and breaks when applied to more complex physics problems (fluid, bouncing) - A number of intuitive physics studies compare machine learning models' and humans' surprise ratings for the plausibility of a scene. Are the authors planning on exploring this avenue to see if the mental simulation hypothesis still holds? - Have you considered testing whether Chain-of-Time generalizes beyond intuitive physics to intuitive psychology? (model vs. human plausibility ratings for psychology scenes) - Have the authors explored combining their method with an external physics engine? Fully human-written
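For reference, my reading of the chain-of-time procedure is roughly the loop below; this is a hedged sketch of my own, with generate_next_frame standing in for one call to the image generation model (a hypothetical stub, not a real API), and direct prediction corresponding to a single large time step.

```python
def generate_next_frame(model, frames, times, dt):
    """Hypothetical stand-in for one image-generation call: prompt the model
    with the observed frames and their timestamps and ask for the scene at
    times[-1] + dt. The provider-specific prompting is omitted."""
    raise NotImplementedError

def chain_of_time(model, frames, times, target_time, dt=0.1):
    """Roll the scene forward in small steps until target_time is reached."""
    frames, times = list(frames), list(times)
    while times[-1] + 1e-9 < target_time:
        step = min(dt, target_time - times[-1])  # small steps keep the implicit physics consistent
        frames.append(generate_next_frame(model, frames, times, step))
        times.append(times[-1] + step)
    return frames[-1]  # predicted frame at target_time

# Direct prediction is the degenerate case: a single call with dt = target_time - times[-1].
```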
Chain of Time: In-Context Physical Simulation with Image Generation Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents a way to use language models with image generation abilities to simulate physical systems. The model is presented with a few frames of a physical system (say, a moving ball) with corresponding time steps, and is asked to produce a prediction of the system in the future (up to a specific time) in the form of an image. The paper suggests that doing so "step by step" like chain of thought instead of in a single go produces better results. The method is demonstrated on a few simple physical scenarios in 2D and 3D. The paper is interesting in its approach and the general context of the problem is important. It's a well-written paper and is easy to follow. I also enjoyed the clarity of the method description and the ample detail given. I thought using computer vision algorithms to extract the state for better analysis was a nice idea, but see below. I think the main issue of the paper is its scope and especially the experimental setup. I understand why using such simple physical systems was necessary if exact state estimates are needed, but this is a major hindrance for the paper. The experiments only cover a very simple set of physical systems under ideal observation conditions - I feel that to conclude anything about a model's ability to reason about physics, much more detailed and elaborate systems should be examined and analyzed. At this experimental scope, the work feels more like an initial study than a fully formed paper. Another weakness of the experiments is that it's not clear if other models would exhibit similar results - using only a single model makes drawing any kind of conclusion about the proposed method quite speculative. In summary - the paper requires more scope to prove to be a significant contribution to the community. See above. Fully human-written
Chain of Time: In-Context Physical Simulation with Image Generation Models Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes chain-of-time simulation, or generating intermediate images during a simulation, to evaluate the capabilities of image generation models at predicting simulated and natural scenes. They find that this significantly improves performance over the baseline and suggest that image generation models are able to simulate some properties over time. - Studies a unique topic - physics understanding in image generation models - Motivates the study from an interdisciplinary perspective - Limited evaluation. Image generation models span a variety of designs, and evaluating only one is insufficient. - Given that GPT's image generation is a closed model with little public detail, it may be difficult to act on these findings to improve image generation models - Limited context from related work - Experimental results are unclear. For example, the paper mentions that Figure 6 shows that the IGM is able to simulate the projectile's motion because it is close to ground truth, but the pattern does not seem to behave in a way consistent with a physics equation. How would you differentiate your method from the visual chain of thought line of work (which also produces intermediate images)? The related work mentions the final goal being that of producing an image, but there is existing literature applying CoT to image generation as well. Could you provide in the appendix additional images from the evaluation and across time steps? Image generation models often make other mistakes that are not related to physics, which may affect evaluation. Fully human-written
Chain of Time: In-Context Physical Simulation with Image Generation Models Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes Chain-of-Time, a method for improving physical simulation in Image Generation Models by generating intermediate frames step-by-step, inspired by mental simulation in humans and chain-of-thought reasoning in LLMs. The authors test on four domains (2D Motion, 2D Gravity, Fluids, Bouncing) and show improvements over direct prediction in most cases. Moderately novel approach: The connection between human mental simulation theory and in-context reasoning for IGMs is creative and well-motivated. Interpretability: The method provides interpretable intermediate steps that reveal the model's physical reasoning process. Clear methodology: The paper clearly describes the de-rendering, simulation, and rendering components. W1: Insufficient analysis of the fluid domain's fundamental challenges. The authors observe performance degradation in the fluids domain but fail to analyze why fluids are fundamentally different from the other tested scenarios. The paper should discuss whether Chain-of-Time is inherently unsuitable for the fluid domain due to specific properties such as continuous deformation or partial transparency, or whether the issue is specific to their experimental setup. The superficial explanation of "flow rate estimation error" doesn't address why the step-by-step approach that helps with projectile motion actually harms fluid simulation. W2: Limited temporal horizon evaluation. All experiments are constrained to 0.8 seconds of simulation. For a method claiming to improve physical simulation, testing longer time horizons (e.g., 2-5 seconds) would better demonstrate scalability and compound error effects. The mental simulation literature the authors cite often involves longer-term predictions. W3: Insufficient experimental scope. Model diversity: Only GPT-4o is tested. While the authors mention DALL-E 3's limitations, they should test other recent VLM+IGM combinations (e.g., Gemini Pro Vision, LLaVA variants with diffusion models, or Claude with image generation capabilities). Domain limitations: The four domains use overly simplistic objects and backgrounds (solid white, uniform blue). The authors should evaluate on some of the established video prediction datasets (Moving MNIST, KTH Actions, BAIR Robot Pushing) or physical reasoning benchmarks (IntPhys, CATER, Physion) to test generalization beyond synthetic scenes, whichever the authors think is suitable for validating their method. W4: Incomplete related work. The related work section misses important connections: video prediction literature (e.g., stochastic video generation, physics-informed neural networks) and world models that perform similar step-by-step physical prediction. W5: Sample sizes vary across domains (N=5 to N=20) without justification. 1. What fundamental properties of fluid simulation make Chain-of-Time perform worse than direct prediction? 2. Can you provide results for simulations beyond 0.8 seconds to assess error accumulation? 3. Have you tested on any established video prediction or physical reasoning benchmarks? 4.
Could you test additional VLM+IGM combinations to verify generalizability? 5. Why do sample sizes differ across domains? Fully AI-generated