ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 15899 (21%) 4.43 3.58 3687
Heavily AI-edited 3233 (4%) 4.22 3.59 2990
Moderately AI-edited 7082 (9%) 4.20 3.61 2722
Lightly AI-edited 16648 (22%) 4.15 3.68 2746
Fully human-written 32938 (43%) 4.13 3.62 2917
Total 75800 (100%) 4.21 3.62 3026
Title Ratings Review Text EditLens Prediction
Multi-Sample Preference Optimization for Generative Model Alignment Soundness: 3: good Presentation: 4: excellent Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. RLHF and DPO are popular post-training methodologies for LLM alignment. The standard method uses preference pairs consisting of single samples to align LLMs. This increases the probability of generating preferred samples when the reward is computed at a per-sample level. However, in several cases, such as increasing the diversity of responses, the reward cannot be computed at a per-sample level. This paper extends DPO and IPO to such cases. In this work, the authors generate a set of responses for each prompt. Analogous to preferred and dis-preferred samples, the multi-sample formulation has a preferred and a dis-preferred set. The authors provide an unbiased estimator for the multi-sample formulation of IPO through theoretical analysis. Through five empirical studies, they show that multi-sample DPO and IPO perform better than DPO and IPO when the reward is computed at a distribution level. 1. Computing rewards at a sample level is an important limitation of DPO and IPO. The extension of the standard preference optimization framework to distribution-level rewards addresses this limitation. 2. The theoretical analysis, which shows that the variance of the multi-sample estimator decreases with group size, is novel. 3. The range of empirical studies shows that multi-sample DPO/IPO improves upon the performance of standard DPO/IPO when the rewards are formulated at a distribution level, across a range of modalities. 1. The contribution of this work seems limited. The primary difference between the standard DPO setting and the multi-sample DPO setting is in the way that the samples are separated into preferred and dis-preferred groups. Instead of separating them at a sample level, they are first grouped into sets and separated at a set level. Given that the reward is computed at a distributional level, this seems like the natural application of DPO to such a problem. 2. There seems to be a strong overlap between this paper and Li et al. [2024]. The authors have not clearly stated the original contributions that differ from Li et al. [lines 118 - 124]. 3. This work could benefit from stronger baselines - extending the RLHF framework to distributional reward problems. Furthermore, I would encourage the authors to compare these results with Zhong et al. [2023] and Melnyk et al. [2024]. Please see the weaknesses Fully human-written
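To make the set-level comparison discussed in the review above concrete, here is a minimal PyTorch-style sketch of an IPO-like objective in which per-sample implicit rewards are averaged over the preferred and dis-preferred sets before the squared loss is applied. The function name, aggregation, and target value are illustrative assumptions about the general multi-sample idea, not the paper's verbatim objective.

```python
import torch

def group_ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Set-level IPO-style loss (illustrative sketch).

    logp_w, logp_l: (batch, k) log-probs of the preferred / dis-preferred
    response sets under the policy; ref_* are the same under the reference
    model. Per-sample implicit rewards are averaged within each set before
    the IPO-style squared loss is applied.
    """
    rw = logp_w - ref_logp_w            # (batch, k) implicit rewards, preferred set
    rl = logp_l - ref_logp_l            # (batch, k) implicit rewards, dis-preferred set
    margin = rw.mean(dim=-1) - rl.mean(dim=-1)   # one scalar "set margin" per prompt
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()

# usage with dummy tensors: a batch of 4 prompts, sets of k = 5 responses
lp_w, lp_l = torch.randn(4, 5), torch.randn(4, 5)
rf_w, rf_l = torch.randn(4, 5), torch.randn(4, 5)
print(group_ipo_loss(lp_w, lp_l, rf_w, rf_l))
```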
Multi-Sample Preference Optimization for Generative Model Alignment Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper extends the direct preference alignment frameworks DPO and IPO to handle preferences over groups of items rather than binary preferences over pairs of items. The authors provide intuitive modifications to the DPO/IPO objectives to incorporate group-wise comparisons, derive mini-batch estimators for these objective functions, and show the validity of the method with various experiments. The problem of needing to compare groups of items instead of individual items in some cases is natural and of practical relevance. The experimental case studies (random number generation, image debiasing, improving the quality of creative fiction generation, and training with label noise) are all quite interesting and both validate the approach and give real-world examples where one might want to compare distributions instead of pairs of items. The paper is also well written and easy to follow. 1. The proposed methodology is quite straightforward and the novelty of the proposed solution is moderate. 2. The objective estimates can have bias/variance, and this seems to depend on the batch size. However, from the experiments I don't see this angle being explored sufficiently. For someone trying to deploy this method, how would they deal with the bias/variance issue, and how does that change with the batch size? See weakness Fully human-written
Multi-Sample Preference Optimization for Generative Model Alignment Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces **multi-sample extensions of preference optimization methods (mDPO and mIPO)** for alignment. Whereas standard approaches such as RLHF and DPO/IPO rely on **single-sample pairwise comparisons**, this work proposes to instead optimize over distributions of responses. This allows the methods to capture distributional characteristics such as **diversity and bias**, and to better handle cases where preferences may not exist between individual samples but emerge clearly when comparing groups. The authors highlight challenges in unbiased estimation and provide empirical evidence that the proposed approaches can improve robustness against label noise and enhance diversity in generated outputs. - Tackles an **important and underexplored problem**: extending preference optimization from single-sample to multi-sample comparisons. - Provides a **novel formulation** (mDPO, mIPO) that enables aligning distributions rather than instances. - Shows **promising empirical results**, especially in improving **diversity** and reducing **bias** in outputs. - Highlights the robustness of mDPO against label noise, which is valuable in real-world preference datasets. - Overall **presentation** is good, with several intuitive illustrations of the advantages of multi-sample formulations. While I liked the paper overall and I believe it tackles an important problem, some key weaknesses should be addressed: 1. **Insufficient experimental comparisons**: - The paper does not compare against **other multi-sample methods** such as DFT (Guo et al., 2025). - No experiments with **naive multi-sample baselines**, e.g., running DPO/IPO over all pairwise comparisons between positive and negative sets in the mini-batch, i.e. $$ \frac{1}{k^2}\sum_{i=1}^k\sum_{j=1}^k l(x, y_{w,i}, y_{l,j}) $$ 2. **Overstatements in claims**: page 9 line 449, “both mDPO and mIPO significantly outperform the baselines” is too strong looking at the figure for mIPO $k=5$. This should be further argued or weakened. 3. **Experimental detail gaps**: - The paper acknowledges that obtaining an unbiased estimator of mDPO is challenging, but still reports experiments with mDPO. It is unclear whether an estimator or a biased version is used. - Figure 5 and Table 3: why does mIPO with $k=3$ outperform $k=5$? One would expect larger $k$ to monotonically improve performance (even if with diminishing returns). - Figure 6: it is not clear whether $k=5$ is an outlier or performance saturates? $k=6$ would help clarify this. - Iterative improvement experiments (page 9): baseline not clearly stated, should be iterative DPO/IPO for fairness. Guo et. al. "Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data." 2025. arXiv:2502.18679. **Minor issues that have not affected the rating** - Page 3 line 121: “foci” → “focus”. - Page 7 line 341: Figure 5 caption should have more space from text. - See the weakness section for suggested additional experiments; more baseline comparison would greatly benefit the paper. 
- See the weakness section for some discussion and experimental details that the paper would benefit from answering. Lightly AI-edited
Multi-Sample Preference Optimization for Generative Model Alignment Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes multi-sample variants of direct preference optimization methods (called mDPO and mIPO) that replace single response comparisons with group-wise comparisons aimed at aligning distributional properties (e.g., diversity, bias). In particular, the authors lift DPO/IPO from singletons to sets by using the (geometric-mean) product likelihood of a response group and optimizing the same surrogate with expectations over group samples. Then, they derived an unbiased mini-batch estimator for mIPO (a squared-loss objective over mean implicit rewards) and discussed a biased but lower-variance estimator for mDPO. Finally, they added NLL as an auxiliary term to stabilize finetuning. Experiments cover LLM random number generation, creative fiction, and diffusion debiasing, plus a synthetic-label robustness study where multi-sample wins more often under label noise. - The paper argues that many properties, such as diversity, are distributional and not captured by single-sample comparisons, and the group-wise formulation is intuitive. - Extending DPO/IPO by averaging implicit rewards over sets keeps training compatible with existing implementations. - Technically, mDPO/mIPO mostly reuse the same surrogates with a group-average of implicit rewards and a straightforward mini-batch estimator. The constrained-optimization view for adding NLL is already common in practice. Relative to DPO/IPO, the step from singletons to sets reads as expected algebra rather than a new learning principle. - There is prior work on distributional difference/alignment that directly targets set-level objectives. The paper cites some of these, but the differences are mainly about different experiments and applications, which are vague to me. - Can you provide a theoretical justification that mDPO/mIPO are proper surrogates for a target distributional objective? Can we provide any consistency under a Bradley–Terry-style group model? - How does mIPO compare to work in distributional preference alignment in principle (what objective is optimized)? - Is there any specific reason behind considering the geometric mean for the aggregation over a group? Fully human-written
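On the last question in the review above (why a geometric-mean aggregation over a group might be natural): averaging per-sample implicit rewards over a set $\{y_1,\dots,y_k\}$ is algebraically identical to scoring the set by the geometric mean of its likelihood ratios, which is presumably why the product-likelihood (geometric-mean) view appears in the paper:

$$\frac{1}{k}\sum_{i=1}^{k}\beta\log\frac{\pi_\theta(y_i\mid x)}{\pi_{\mathrm{ref}}(y_i\mid x)}
\;=\;\beta\log\left(\prod_{i=1}^{k}\frac{\pi_\theta(y_i\mid x)}{\pi_{\mathrm{ref}}(y_i\mid x)}\right)^{1/k}.$$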
SafeFlowMatcher: Safe and Fast Planning using Flow Matching with Control Barrier Functions Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. SafeFlowMatcher meaningfully advances SafeDiffuser’s core idea—using CBFs to guarantee safety in generative planning—by swapping the stochastic, per-step–constrained diffusion backbone for a deterministic flow-matching model with a prediction–correction integrator and vanishing time-scaled CBF correction. This shift strengthens the theory (deterministic forward invariance with finite-time convergence vs. SafeDiffuser’s finite-time probabilistic invariance), slashes computational load (one/few ODE passes and a single lightweight QP, rather than a QP at every denoising step), and improves practical behavior (no boundary trapping, higher task scores, and real-time feasibility). The trade-offs are modest—FM is less naturally exploratory than diffusion and both approaches still require known, differentiable safety functions—but for robotics/control where safety and latency dominate, SafeFlowMatcher is a clear, useful step beyond the original SafeDiffuser. - Deterministic safety guarantees with forward invariance and finite-time convergence, rather than SafeDiffuser’s almost-sure (probabilistic) invariance across stochastic reverse steps. This yields cleaner, stronger theory for deployment. - Much lower computational load: one/few ODE integrations plus a single lightweight CBF-QP, instead of solving a QP at every denoising step. This makes real-time online planning feasible where SafeDiffuser typically isn’t. - Prediction–correction decoupling minimizes distributional distortion. The model first predicts with flow matching, then applies a vanishing time-scaled CBF correction that avoids boundary sticking/local traps that SafeDiffuser mitigates with relaxed or time-varying specs. - Better empirical planning quality at equal or stronger safety: smoother rollouts, higher task scores, and zero safety violations. Deterministic dynamics reduce stochastic artifacts present in diffusion-based sampling. - Simpler and more portable pipeline: no long reverse diffusion chain, fewer sensitivity knobs, and a safety layer that operates on executed trajectories. This makes it easier to plug into different environments/backbones than SafeDiffuser’s stepwise embedded constraints. - Requires known, differentiable, and correctly calibrated safety sets b(x). In real systems with perception noise, contacts, or nonconvex geometry, specifying (or learning) smooth, faithful barriers is hard and errors can yield either over-conservatism or false safety. - CBF-QP feasibility is not guaranteed, especially under tight actuation limits or when the predicted state is far inside the unsafe set. Slack-based fallbacks weaken guarantees and the paper lacks a systematic analysis of infeasibility rates and recovery. - Limited robustness treatment to uncertainty and mismatch (estimation noise, delays, unmodeled dynamics, moving constraints). The deterministic guarantees don’t provide ISS/chance-constrained bounds, so small errors could accumulate into violations. 
- Potential distributional drift from repeated corrections: even with vanishing scaling, the safety projection can bias trajectories away from the learned flow over long horizons, reducing diversity and pushing states out of the model’s training support. - Empirical scope and baselines: comparisons focus on diffusion backbones; fewer head-to-head results versus strong control baselines (e.g., MPC + CBF-CLF-QP, Neural-ODE/CNF + CBF filters) and no on-hardware validation to substantiate real-time claims. - Can you quantify the practical bottlenecks of SafeDiffuser that specifically motivated your design choices (e.g., per-step QP latency, end-to-end wall clock, trap rates), and state the a priori performance targets you aimed to hit so readers can judge whether the reported gains meet those targets? - What conditions ensure feasibility of the CBF-QP in the correction step under tight actuation limits or when the predictor proposes states deep inside the unsafe set, and what is the formal recovery strategy when the QP is infeasible (e.g., backtracking the rollout, horizon extension, or schedule adjustment)? - How sensitive is performance and safety to the vanishing time-scaling schedule and the choice of the class-K function α(·), and can you provide ablations and practical tuning guidance that demonstrate stable behavior near boundaries without excessive conservatism or reward loss? - To what extent do repeated corrections induce distributional drift away from the nominal flow, and can you report quantitative divergence metrics (e.g., energy distance or trajectory FID) alongside impacts on trajectory diversity and long-horizon returns? - Could you include baselines beyond diffusion backbones—such as MPC with CBF-CLF-QP, Neural-ODE/CNF planners with a CBF filter, and diffusion with a single-shot CBF projection—to isolate how much of the gain arises from the prediction–correction plus vanishing mechanism rather than from the backbone swap alone? - How robust are the proposed guarantees in the presence of state-estimation errors, dynamics mismatch, delays, or moving constraints, and can you provide an ISS or chance-constrained analysis (or targeted experiments) that bounds violation probability under realistic sensing and model errors? Fully AI-generated
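For readers unfamiliar with why a single-constraint correction is cheap, as highlighted in the review above: a QP with one affine constraint admits a closed-form projection. The sketch below (NumPy, names illustrative) shows the standard minimum-norm CBF filter $\min_u \tfrac{1}{2}\|u-u_{\mathrm{nom}}\|^2$ s.t. $\nabla b(x)^\top u \ge -\alpha\, b(x)$; it is a generic CBF-QP, not SafeFlowMatcher's exact correction step.

```python
import numpy as np

def cbf_qp_closed_form(u_nom, grad_b, b_val, alpha=1.0):
    """Minimum-norm safety filter with a single affine CBF constraint.

    Solves  min_u 0.5*||u - u_nom||^2  s.t.  grad_b @ u >= -alpha * b_val.
    When the nominal input violates the constraint, the solution is u_nom
    plus a scaled correction along grad_b (the active-constraint projection).
    """
    u_nom = np.asarray(u_nom, dtype=float)
    a = np.asarray(grad_b, dtype=float)
    slack = a @ u_nom + alpha * b_val   # constraint value at the nominal input
    if slack >= 0.0:                    # nominal input already satisfies the CBF condition
        return u_nom
    return u_nom - (slack / (a @ a)) * a

# example: nudge a nominal step back toward the safe side of the barrier
u = cbf_qp_closed_form(u_nom=np.array([1.0, 0.0]),
                       grad_b=np.array([0.0, 1.0]),
                       b_val=-0.2)      # slightly inside the unsafe set
print(u)   # [1.0, 0.2]: second component is lifted until the constraint is active
```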
SafeFlowMatcher: Safe and Fast Planning using Flow Matching with Control Barrier Functions Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces SafeFlowMatcher, a novel planning framework that integrates Flow Matching (FM) with Control Barrier Functions (CBFs) to achieve both real-time efficiency and certified safety in robotic path planning. The core of SafeFlowMatcher is a two-phase Prediction-Correction (PC) integrator. The prediction phase efficiently generates a candidate path using learned FM dynamics without safety interventions. The subsequent correction phase refines this path by (i) reducing integration errors through vanishing time-scaled flow dynamics (VTFD) and (ii) enforcing hard safety constraints using a CBF-based quadratic program (QP), minimally perturbing the vector field. This decoupling strategy prevents distributional drift and mitigates "local trap" problems often encountered by other certification-based generative planners. Extensive experiments on Maze2D navigation and high-dimensional locomotion tasks demonstrate that SafeFlowMatcher outperforms existing diffusion- and FM-based baselines in terms of safety (zero trap rate), planning performance (higher scores), path quality (smoother paths), and computational efficiency. Ablation studies confirm the necessity of both prediction and correction phases and the effectiveness of VTFD. 1. The paper proposes a unique combination of Flow Matching for efficient path generation and Control Barrier Functions for formal safety guarantees. This addresses a critical need in robotic planning for both speed and reliability. The two-phase PC integrator is a significant contribution. By separating the initial path generation (prediction) from safety enforcement (correction), SafeFlowMatcher avoids issues like distributional drift and local traps that plague other methods that attempt to enforce safety on intermediate latent states. 2. The paper provides rigorous mathematical proofs for the forward invariance of a robust safe set and finite-time convergence to this set. This theoretical backing strengthens the credibility of the proposed safety guarantees. 3. By leveraging Flow Matching, SafeFlowMatcher demonstrates efficiency comparable to or better than diffusion-based methods, which often require many more sampling steps. The ability to achieve high performance with a small number of function evaluations (e.g., $T^p=1$) is a notable advantage. 1. The paper highlights the reliance on well-defined CBFs. Could the authors elaborate on the practical challenges and potential solutions for designing or learning CBFs for more complex, dynamic, or unstructured real-world robotic environments beyond the relatively simple constraints presented? Are there any ongoing efforts to integrate CBF learning within SafeFlowMatcher? 2. Given that hyperparameter tuning is identified as future work, can the authors provide more insight into the sensitivity of SafeFlowMatcher's performance to the choice of $\epsilon$, $\rho$, $\delta$, $\alpha$, and $t_w$? Are there any heuristics or adaptive strategies that could be employed to simplify this tuning process for new tasks or environments? 3. 
While closed-form QP solutions are mentioned for simple cases, how does the computational overhead of the QP solver scale with a larger number of waypoints ($H$) or a more extensive set of complex safety constraints? Have the authors encountered scenarios where the QP solving time becomes a bottleneck, and what strategies could mitigate this? 4. The paper mentions developing a guidance-free version as future work. Could the authors expand on the specific challenges in achieving this (e.g., in terms of exploration, multimodal goal reaching) and their initial thoughts on how to approach such a development? 5. Lemma 1 assumes a symmetric, zero-mean distribution for the prediction error $\epsilon$. How robust is SafeFlowMatcher to deviations from this assumption, particularly if the prediction errors are biased or exhibit heavy tails? What are the practical implications of such deviations on safety guarantees and performance? 6. While Naive Safe FM serves as a baseline, could the authors provide a more in-depth discussion on why directly applying safety constraints from the beginning leads to a high trap rate? A clearer theoretical or empirical breakdown of the failure modes of Naive Safe FM would further strengthen the argument for the PC integrator. 7. For the vanishing time-scaled flow dynamics, the optimal $\alpha=2$ was found empirically. Is there a deeper theoretical justification for this specific value or for the general range of $\alpha$ that provides a good trade-off between accelerated convergence and stability? 8. The current experimental validation focuses on maze navigation and locomotion. How would SafeFlowMatcher perform on more complex robotic manipulation tasks (e.g., pick-and-place with moving obstacles) that typically involve higher-dimensional state and action spaces, more intricate safety constraints, and potentially longer planning horizons? See Weaknesses^ Fully AI-generated
SafeFlowMatcher: Safe and Fast Planning using Flow Matching with Control Barrier Functions Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces SafeFlowMatcher, a planning framework that integrates conditional flow matching with control barrier functions to generate trajectories that are efficient to sample and provably converge to a robust safe set. The method addresses a key limitation of flow-matching and diffusion planners, which produce high-quality paths but lack safety control and can become trapped when constraints are enforced at each latent step. SafeFlowMatcher relies on a two-phase prediction-correction scheme. The prediction phase runs standard FM dynamics to propose an unconstrained path, while the correction phase refines it using a vanishing time-scaled FM field and a CBF-based quadratic program that projects waypoints back into the safe set, ensuring finite-time convergence under the derived barrier conditions. On the theoretical side, the authors restate finite-time CBF results and provide a flow-invariance certificate guaranteeing safety for the corrected flow. Empirically, the framework is validated on Maze2D and MuJoCo locomotion. It consistently outperforms FM, DDIM, and SafeDiffuser-based baselines on the Maze2D environment, achieving higher scores, while ablations confirm the role of each phase and the vanishing-scale mechanism. - The paper identifies a genuine mismatch in existing “safe” generative planners: safety is typically enforced at intermediate latent steps of the denoising or flow-matching process, although only the final trajectory is executed. This “semantic misalignment” is addressed cleanly by decoupling generation and certification: safety is enforced only in the correction phase of the flow. This clarification is timely and relevant for the diffusion/FM-for-planning community. - The prediction-correction (PC) scheme is well motivated. The prediction phase runs unconstrained FM dynamics to avoid distributional drift, while the correction phase refines the path using a vanishing time-scaled FM vector field and a one-constraint CBF-QP projecting waypoints into the safe set. The finite-time safety guarantee applies to this corrected flow. The ablation in section 4.2 (prediction-only / correction-only / full PC) clearly supports this design. - The QP correction is computationally light: the authors define a convex problem with a single CBF constraint and closed-form projection, countering concerns about runtime overhead. The running time reported in the experiments shows that this approach is quite fast compared to other SOTA methods. - The authors performed a thorough experimental analysis, comparing their approach to a wide variety of baselines (for which they reproduced the results themselves). The experiments are consistent with the claims, and SafeFlowMatcher always achieves a positive barrier function with a zero trap rate, outperforming all baselines while remaining efficient. - On p.3, the authors introduce CBFs in their general control-affine form: $\dot x = f(x)+g(x)u$, which normally presupposes known dynamics; in experiments, they seem not to use a physical model, but instead plug in analytic obstacle functions.
This distinction is not clearly stated. - Definition 1 and Lemma 1 are written for general systems with Lipschitz $f,g$, but in practice, the “dynamics” is the FM vector field with vanishing scaling (cf. equation 10). The paper does state this in section 3.2, but the transition is quick. I think the authors should give a short explanation with intuitions; the CBF is not certified in the real environment, but on the generated flow. Furthermore, Definitions 1 and Lemma 1 should be more intuitively justified; currently, they are just "dropped" there without a clear explanation, which makes it difficult for the reader to understand their consequences and intuitive meanings. - I noticed some clarity issues: on page 3, line 108, some distribution over paths $q$ appears, but has not been introduced anywhere. I also found that, in the main text, the environments and the assumptions made on those environments were not well-specified. For instance, what is the state space and what are the dynamics of the $2$D maze? Does the method assume access to the exact dynamics or just to a dataset containing random trajectories? If this is the case, from where do they come? A short description of the environments and assumptions would help. - Could you make explicit what you actually assume in the experiments? Am I correct that you do not know the environment transition dynamics and that the only environment-specific information you use is offline trajectories for FM training and analytic, differentiable obstacle/safety functions $b(x)$ with known centers and scales? Is there any case where you used the true simulator for $\nabla b$? - In the maze experiments, are you conditioning FM on start-goal or only on start and letting it flow to the high-density goal region? Could you clarify what is actually fed to the network at inference time? Fully human-written
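On the reviewer's point about analytic, differentiable obstacle functions with known centers and scales: in planar settings, a circular obstacle yields the usual smooth barrier $b(x)=\|x-c\|^2-r^2$ with gradient $2(x-c)$. A tiny sketch of this assumed form (the paper's exact $b$ is not quoted in the review, so treat this as illustrative):

```python
import numpy as np

def circle_barrier(x, center, radius):
    """b(x) >= 0 outside the circular obstacle, < 0 inside; gradient is 2*(x - c)."""
    d = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    b = d @ d - radius ** 2
    grad = 2.0 * d
    return b, grad

b, g = circle_barrier(x=[1.5, 0.0], center=[0.0, 0.0], radius=1.0)
print(b, g)   # 1.25 [3. 0.]: safely outside, gradient points away from the obstacle
```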
SafeFlowMatcher: Safe and Fast Planning using Flow Matching with Control Barrier Functions Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes SafeFlowMatcher, a planning framework that aims to combine Flow Matching (FM) generative models with the certified safety of Control Barrier Functions (CBFs). The core contribution is a two-phase "Prediction-Correction" (PC) integrator. First, a "Prediction" phase uses the learned FM model to generate a candidate path in one or a few steps, without safety constraints. Second, a "Correction" phase refines this path using a proposed "vanishing time-scaled flow dynamics" (VTFD) and a CBF-based quadratic program (QP) to enforce safety and finite-time convergence to the safe set. The authors claim this two-phase approach decouples generation from certification, thereby avoiding the "local trap" problem. The method is evaluated on 2D maze navigation and locomotion tasks, showing improved safety and task performance and a zero "trap rate" compared to baselines like SafeDiffuser. 1. The paper identifies a clear and important problem in safe generative planning. Applying hard safety constraints (like CBFs) directly within the sampling loop of a generative model (like diffusion or flow matching) can distort the learned dynamics, leading to "distributional drift" and "local trap" failures. The goal of decoupling generation from certification is well-motivated. 2. To further enhance sampling stability in the correction phase, this paper proposes a time-scaled flow dynamics method to contract the prediction errors and mitigate drifting. 3. The paper compares the proposed method with multiple baselines and conducts various ablation studies to demonstrate the efficiency and effectiveness of SafeFlowMatcher. 1. The theoretical guarantees for the correction phase (Lemma 2 and 3) rest on strong assumptions about the prediction error $\epsilon$ (e.g., symmetric, zero-mean, and a locally strongly convex negative log-density). There is no empirical validation to show that the actual integration error from a deep FM model satisfies these convenient properties. 2. In the experiments (Table 1), SafeFlowMatcher achieves zero trap rate. How does the proposed method guarantee the low trap rate? Can the method consistently achieve a zero trap rate in some higher-dimensional or more complex navigation tasks? Are there any theoretical guarantees? 3. The current evaluation only includes three simple environments: one maze and two locomotion environments. Can the author provide experimental results on robot manipulation tasks as well? For example, the environment used in SafeDiffuser might be a good starting point. 4. In Algorithm 1, SafeFlowMatcher first samples a trajectory and solves QP in a double for-loop. The authors report a very fast sampling time of ~4.7ms. To better understand the efficiency of the method, it would be more informative to report the computation time of both phases. 5. Does the QP always have a solution? What will Algorithm 1 do if the QP can not be solved? Will the safety guarantee still be satisfied in that case? 6. Does the method always require a specific function $b$ as in equation (6)? 
How can $b$ be obtained for environments with irregular geometry, agents with high-dimensional state spaces, or robots with image-based observations? 7. In line 214, "that places $\tau_0^p$ sufficiently close to ...", should the term here be $\tau_1^p$? See the weakness above. Fully human-written
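Regarding the request above to report the cost of the two phases separately, a minimal profiling sketch, assuming only the generic structure described in these reviews (an unconstrained integration pass followed by per-waypoint corrections); `flow_step` and `correct_waypoint` are placeholders, not the paper's actual API or Algorithm 1.

```python
import time
import numpy as np

def flow_step(tau, t, dt):
    # placeholder learned vector-field update; a real FM model would be called here
    return tau + dt * np.zeros_like(tau)

def correct_waypoint(x):
    # placeholder safety correction (e.g., the closed-form CBF-QP sketched earlier)
    return x

def timed_prediction_correction(tau0, n_pred=1, n_corr=5, dt=1.0):
    t0 = time.perf_counter()
    tau = tau0
    for i in range(n_pred):                      # prediction phase: unconstrained integration
        tau = flow_step(tau, i / max(n_pred, 1), dt / max(n_pred, 1))
    t1 = time.perf_counter()
    for _ in range(n_corr):                      # correction phase: per-waypoint projection
        tau = np.stack([correct_waypoint(x) for x in tau])
    t2 = time.perf_counter()
    return tau, {"prediction_s": t1 - t0, "correction_s": t2 - t1}

_, timings = timed_prediction_correction(np.zeros((48, 2)))   # 48 waypoints in 2D
print(timings)
```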
SafeFlowMatcher: Safe and Fast Planning using Flow Matching with Control Barrier Functions Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents SafeFlowMatcher, a novel framework for safe robotic path planning that effectively couples generative Flow Matching (FM) with Control Barrier Functions (CBFs). The authors identify a key problem with existing generative planners: they either lack formal safety guarantees or, if certification-based, suffer from local trap problems where interventions distort the generative process and cause the plan to fail. ● Paper is very well written and easy to understand ● The key idea of a two-phase Prediction-Correction (PC) integrator is both simple and highly effective. Decoupling the initial path generation from the subsequent safety correction elegantly solves the local trap problem ● The paper provides a formal barrier certificate and a proof of finite-time convergence - strong mathematical guarantees. ● The method is validated on maze navigation and high-dimensional locomotion tasks, demonstrating superior performance across key metrics: safety, path quality, and efficiency ● The paper's baselines are exclusively generative models (diffusion- and FM-based) . While this is the direct field of contribution, classical sampling- or optimization-based planners are still the gold standard in many robotics applications. A comparison against a strong classical baseline (e.g., an optimization-based planner) would have provided a better picture of the practical performance. ● How does the framework handle dynamically changing environments? The CBF-QP is solved at each control step, which seems well-suited for this, but the underlying FM model is trained to generate paths based on a static map. How would the model perform if an obstacle changed position mid-rollout? Lightly AI-edited
Principled Latent Diffusion for Graphs via Laplacian Autoencoders Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. Graph generation suffers from quadratic complexity in the number of nodes stemming from the adjacency matrix, which, however, contains mostly zeros for most datasets. This hinders the generation of large graphs. The paper proposes a permutation-equivariant autoencoder that maps each node into a fixed-size latent space, leveraging the graph Laplacian spectrum to reduce complexity. The latents are then used to train a generic diffusion model. - Strong generation speedup compared to state-of-the-art baselines. - Quality-wise, the method achieves the best or second-best performance against the baselines. - Experiments on a large variety of different types of graphs, both synthetic and real-world, e.g., DAGs, molecules, planar graphs, trees. - Good reconstruction results. - Motivation stems from limitations affecting the generation of large graphs, but the evaluation only covers standard graph benchmark datasets of the same size as those tackled by related work. Memory and time might not be the only limits when generating graphs of larger sizes. It is unclear if quality can be maintained. - It is unclear how comparable the baseline results are on inference time: e.g., whether the batch sizes used were comparable and whether the same hardware was used. In particular, DAG runtimes are taken from the original authors' paper, which might not have used the same setup. - No ablation studies to better highlight the impact of the different components. Minor: - Table 2, Sample acc. column: the wrong row is highlighted in bold - How is the generation quality on larger graphs? - A sensitivity analysis on the number of eigenvectors used - What batch size is used, and what hardware is used for inference (appendix mentions 2x L40 for training)? - How similar is the hardware to that used for Directo? Can the experiments be run on the same hardware? Fully human-written
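For context on the spectral encoding that the cost argument above relies on, here is a generic sketch of computing the $k$ smallest Laplacian eigenpairs to use as per-node features; the paper's exact Laplacian normalization and choice of $k$ are not specified in the review, so these are assumptions for illustration.

```python
import numpy as np

def laplacian_pe(adj, k=8):
    """First k eigenpairs (ascending) of the combinatorial Laplacian L = D - A.

    Dense eigh is used here for clarity; for large sparse graphs one would
    switch to scipy.sparse.linalg.eigsh to compute only the k smallest pairs.
    """
    adj = np.asarray(adj, dtype=float)
    lap = np.diag(adj.sum(axis=1)) - adj
    vals, vecs = np.linalg.eigh(lap)        # eigh returns eigenvalues in ascending order
    return vals[:k], vecs[:, :k]            # (k,), (n, k) per-node positional encoding

# toy example: a 6-cycle
n = 6
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
vals, vecs = laplacian_pe(adj, k=3)
print(np.round(vals, 3))                    # approximately [0., 1., 1.] for the 6-cycle
```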
Principled Latent Diffusion for Graphs via Laplacian Autoencoders Soundness: 3: good Presentation: 4: excellent Contribution: 4: excellent Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This article proposes a nearly lossless latent graph diffusion method to address the high complexity of existing graph diffusion models caused by the discrete paradigm. Specifically, considering the strict reconstruction requirements that graph data places on the encoder-decoder, the authors propose a Laplacian autoencoder, which is proven to compress graph data into low-dimensional vectors nearly losslessly. They then place it within a flow matching framework and train it using a DiT. Notably, the method is also extended to directed graph generation. 1. I recognize the contribution of this paper to the field of graph generation. Unlike the mature autoencoders in computer vision, the field of graph learning still lacks such an effective lossless encoding method. 2. This paper extends the graph generation framework to directed graphs, which is of great significance. Previous work mainly focused on molecular data or synthetic undirected graph-structured data; generating directed graphs is equally important. 3. The paper provides theoretical guarantees for the method. 4. The paper is well written and the introduction of the motivation is very convincing. 1. Compared with the design of the generation component, the authors' contribution mainly lies in a powerful autoencoder. I think this is very important, but there is a lack of sufficient research and innovation on the generation method. 2. Regarding the autoencoder, the authors achieve nearly lossless results in the experiments. However, as they mention, their assumption is that the node order of the graph remains unchanged, which can cause problems in some specific scenarios (e.g., tree-like structures). Their workaround is to use the WL test to color the nodes. I think this point is worth further exploration and analysis. For instance, could appropriate positional encodings assigned to the graph offer a solution? 3. One concern about the autoencoder is that although the method can be extended to large graphs thanks to its complexity advantage, large-graph samples are equally scarce. When only a limited number of large graphs are used for training, can lossless compression still be achieved? I suggest the authors train on SBM datasets (which can be generated synthetically) using only 10 graphs of 10,000 nodes each and verify the reconstruction performance of the autoencoder on them. 4. Another concern is that the method achieves nearly lossless reconstruction, but does not show a significant advantage in generation quality, and the authors lack a deeper analysis and discussion. Is this due to errors of the generative model or to an inherent distribution difference between the training set and the test set? 5. Furthermore, it would be great to see a comparison of different generation strategies,
for instance the differences between DDPM and flow matching, as well as the influence of the choice of compression dimension on reconstruction and generation quality. I hope the authors can add experiments on different diffusion paradigms and, most importantly, compare the impact of different dimension choices. All my questions are shown in the Weaknesses. I will consider improving my score if the authors can answer my questions. Fully human-written
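For the stress test suggested in the review above (training on only a handful of very large graphs), one quick way to generate such data is NetworkX's stochastic block model; the block sizes and edge probabilities below are illustrative only.

```python
import networkx as nx

def make_large_sbm(n_blocks=5, block_size=2000, p_in=0.01, p_out=0.0005, seed=0):
    """One SBM graph with n_blocks * block_size nodes (10,000 by default)."""
    sizes = [block_size] * n_blocks
    probs = [[p_in if i == j else p_out for j in range(n_blocks)]
             for i in range(n_blocks)]
    return nx.stochastic_block_model(sizes, probs, seed=seed)

# ten graphs of 10,000 nodes each, as proposed in the review
graphs = [make_large_sbm(seed=s) for s in range(10)]
print(graphs[0].number_of_nodes(), graphs[0].number_of_edges())
```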
Principled Latent Diffusion for Graphs via Laplacian Autoencoders Soundness: 1: poor Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. ML graph generation models suffer from poor scalability due to the quadratic cost of the graph representation and the difficulty of compressing such graphs into a revertible latent representation. LG-VAE is presented as a variational autoencoder that uses the graph Laplacian spectrum to provide provable reversibility and enable fast graph generation. LG-VAE is shown to generate graphs up to 1000x faster than the state-of-the-art while retaining generative quality. - The proposed method brings outstanding improvements in inference cost compared to the state-of-the-art - The proposed method brings generative performance comparable to the state-of-the-art - The proposed method covers an extensive set of graphs, including attributed, non-attributed, and directed acyclic graphs - The main contribution of the paper is scalable graph generation. However, the experiments mostly cover the generation of small graphs. - How the evaluation is conducted is partly unclear, making the validity of the paper’s results debatable. For instance, it is not clear whether LG-VAE has been tested with the same batch size as the baselines, a detail that could invalidate a significant part of the paper’s contribution. - Experiments on directed graphs seem insufficient. First, LG-VAE is tested on one dataset only. Secondly, the inference time of the SOTA baseline is taken from their paper, meaning that it was also run on different hardware. - The experiment section could cover more ablation studies to properly assess the impact of each of the proposed components. - Minor: the methodology figure could be improved for clarity and to make LG-VAE’s components easier to understand. Altogether, I think the paper makes a reasonable amount of contributions and that the experiments showcase that LG-VAE is a superior choice for graph generation with diffusion models. However, I have concerns about how the evaluation of LG-VAE has been performed. Here are my questions for the authors: - The main contribution of LG-VAE is scalable generation. However, the experiments mostly cover datasets of small graphs. Could the authors perform experiments on datasets with larger graphs? Proteins [1], which is also used in several papers in this field, could be a good choice. Any other dataset with large graphs that the authors prefer over Proteins would also be fine. - Have the baselines and LG-VAE been tested with the same batch size? If not, what batch sizes have been used in the experiments? This is not clearly explained in the paper. From what I understand, the reason why LG-VAE is much faster than the SOTA in Table 1 is that inference has been run with one batch only for LG-VAE but not for the other models. If this is the case, could the authors measure the memory cost and time cost of LG-VAE and the baselines with the same batch sizes? - The paper would benefit from more ablation studies to assess the impact of each of the components. For instance, experiments testing the impact of the choice of k, the number of eigenvectors used for encoding, on computational cost and generation quality. Could the authors provide such experiments?
- Could the authors run the experiments of Table 5 on the same hardware to verify the computational efficiency improvement wrt Directo? - Could the authors find at least one more DAG dataset to perform experiments on? - Fu et al. [2] already provided a method for graph diffusion with cost linear in the number of nodes. The paper cites this work but does not compare LG-VAE's performance with this model. Could the authors compare LG-VAE's computation time and generation quality with those of HypDiff? [1] Dobson, P.D. and Doig, A.J., 2003. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4), pp.771-783. [2] Fu, X., Gao, Y., Wei, Y., Sun, Q., Peng, H., Li, J. and Li, X., 2024. Hyperbolic geometric latent diffusion model for graph generation. arXiv preprint arXiv:2405.03188. Fully human-written
Principled Latent Diffusion for Graphs via Laplacian Autoencoders Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a latent diffusion framework for graph generation that uses an autoencoder with Laplacian-based, adjacency-identifying positional encodings. The decoder applies bilinear attention-style scoring and a row-wise DeepSet, and the approach extends to DAGs via a magnetic Laplacian. The authors claim that after freezing the autoencoder, a Diffusion Transformer is trained in latent space with conditional flow matching, which yields mostly linear per-node complexity, with only the final adjacency decoding being quadratic. Experiments report near-lossless reconstruction on several datasets and competitive generation quality. - The experiments show competitive reconstruction performance while substantially reducing the runtime. - Efficient yet near-lossless graph generation is an important topic. - Since the model uses only the lowest Laplacian eigenvalues/eigenvectors (rather than the full spectrum), it leans toward global/low-frequency structure and may miss high-frequency details, which might hurt the exact reconstruction of fine structures. - The approach depends on computing Laplacian eigenvectors for positional encodings, which is costly and can dominate preprocessing on large graphs. - The promise of near-lossless reconstruction appears to rely on having sufficiently large training corpora of graphs, and performance can degrade on smaller or more complex datasets. - Recent graph representation learning works [2] suggest that emphasizing low-frequency components can improve downstream representations (e.g., node classification), so selecting the k smallest eigenpairs is often reasonable in that context. However, for generation, where accurately reconstructing fine structural details could be crucial, restricting to the lowest eigenpairs may suppress high-frequency information. If I am wrong, please correct me. - In (e.g., line 174), writing $\phi:\mathbb{R}^2\to\mathbb{R}^d$ can suggest the input itself is 2-dimensional, whereas the actual input per node is an $n\times 2$ table formed by concatenating two vectors row-wise (each row is a 2-vector $(U_{ij},\lambda_j+\epsilon_j)$). Could you explicitly state that $\phi$ is applied row-wise and annotate the shapes (input $n\times 2$ → output $n\times d$) to avoid confusion? - In the notation section, node features $X$ are defined as discrete labels. Does the current framework support graphs with dense real-valued node attributes (e.g., Cora dataset $n\times d$ features)? - Several recent works also build graph learning/generation models from spectral perspectives (e.g., using the smallest $k$ eigenpairs). Could the authors discuss the key differences between your design and [1,2]? [1] "Generating Graphs via Spectral Diffusion." The Thirteenth International Conference on Learning Representations, 2025. [2] "SDMG: Smoothing Your Diffusion Models for Powerful Graph Representation Learning." The Forty-second International Conference on Machine Learning, 2025. Lightly AI-edited
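To make the shape question in the review above concrete, a minimal PyTorch sketch of a row-wise map $\phi:\mathbb{R}^2\to\mathbb{R}^d$ applied to an $n\times 2$ input; this is purely illustrative of the shape convention, not the paper's actual $\phi$ or aggregation.

```python
import torch
import torch.nn as nn

n, d = 100, 16
phi = nn.Sequential(nn.Linear(2, d), nn.ReLU(), nn.Linear(d, d))

# one row per node: (U_ij, lambda_j + eps_j) stacked into an (n, 2) table
rows = torch.randn(n, 2)
out = phi(rows)                # nn.Linear acts on the last dimension, i.e. row-wise
print(rows.shape, out.shape)   # torch.Size([100, 2]) torch.Size([100, 16])
```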
CGTFra: General Graph Transformer Framework for Consistent Inter-series Dependency Modeling in Multivariate Time Series Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The author defines the problem of inter-variate dependencies (IVD), referring to the loss of dependency information among variables in deep self-attention layers. To address this issue, the authors propose CGTFra, a framework designed to promote consistent modeling of inter-variate dependencies. S1. The author makes a noteworthy observation that time-based positional encodings do not necessarily improve performance in multivariate time series forecasting. S2. The author enhances forecasting accuracy by addressing the inconsistency between shallow and deep attention scores. W1. It remains unclear whether directly integrating the adaptive adjacency matrix $A$ into the MCM would offer a more structurally concise design, thereby eliminating the need to optimize two separate objectives, i.e., $L_{align}$ and $L_{mae}$. W2. The improvement in forecasting accuracy attributed to maintaining consistency appears to be based solely on empirical results. The author should discuss whether there is any theoretical foundation supporting this effect. W3. The coordinates mentioned in the caption of Figure 4 do not align with the values shown in the figure, and the overall presentation appears inconsistent. The author should verify or clarify this issue, and it may be more effective to use multiple figures to illustrate the inconsistency across attention layers. See W1-W3. Moderately AI-edited
CGTFra: General Graph Transformer Framework for Consistent Inter-series Dependency Modeling in Multivariate Time Series Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a Graph Transformer framework, CGTFra, with an adaptive frequency masking and resampling method and a dynamic graph learning framework, which diminish the importance of timestamps and promote consistent inter-variate dependency (IVD) modeling. Experiments demonstrate the effectiveness of CGTFra. 1. Integrating dynamic graph learning and a consistency alignment loss to promote the modeling of consistent IVD is interesting. 2. The experiments are presented in considerable detail. 1. The dynamic graph learning lacks innovation and appears to be a standard adaptive graph construction method, which is widely explored in prior work. The authors should clarify the relationship between CGTFra and these works to further emphasize their own contribution. 2. The claim that IVD are modeled exclusively in shallow layers (on line 104) is unconvincing, as stacking multiple attention layers is a straightforward way to model IVD in deep layers. This oversight weakens the motivation for promoting consistent IVD modeling. The authors should address this point further to avoid misunderstanding. 3. Some visual comparisons should be quantified, as the claimed improvements are often subtle and difficult to assess from the plots alone, e.g., the claims on lines 143, 170, and 375. These claims should be backed by quantitative data, e.g., percentage gains, to make the comparisons clear. 4. The paper has some weaknesses in the experiments, which are not convincing enough: (1) Considering that CGTFra is a graph Transformer framework, more GNN-based models and even hypergraph-based models should be compared to further validate the effectiveness of CGTFra, e.g., Ada-MSHyper [1] and MTSF-DG [2]. (2) There are some overstatements and factual errors in the experimental analysis. For example, the claim that CGTFra consistently exhibits enhanced performance on the ETT and Solar datasets (on line 360) seems to be an overstatement. According to Table 1, CGTFra is actually outperformed by FilterNet [3] on both the ETTm1 and ETTm2 datasets in terms of MSE. The claim that introducing DGL results in performance degradation on ECL (on line 438) seems to conflict with the results of Table 4. The authors should thoroughly review the analysis. (3) There seem to be several inconsistencies in the bolding of results in Tables 2 and 3. For example, for “iInformer + Solar” in Table 3, the best MAE results are not bolded. The authors should carefully check all tables to avoid these mistakes. [1] Shang Z, Chen L, Wu B, et al. Ada-MSHyper: Adaptive multi-scale hypergraph Transformer for time series forecasting. NIPS 2024. [2] Zhao K, Guo C, Cheng Y, et al. Multiple time series forecasting with dynamic graph modeling. VLDB 2024. [3] Yi K, Fei J, Zhang Q, et al. FilterNet: Harnessing frequency filters for time series forecasting. NIPS 2024. 1. Some notations and formulas are confusing. For example, in the definition of MTS on line 263, f(t) often represents values at time t; why define f(t) as a 2D matrix? In Formula 1, the $l$ in the summation is missing.
In Formulas 5 and 6, are the trainable parameters the same? If not, why use the same notation? Formula 7 seems incorrect for producing the described MCM on line 319, as the concat operation is missing. In Formula 8, the mathematical definition of KL divergence should be used instead of the engineering-style form. Why calculate the KL divergence between P and Q instead of between Q and P? Fully human-written
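To ground the KL-direction question at the end of the review above, a generic sketch of a consistency alignment term between a shallow attention distribution P and a deep, graph-based dependency distribution Q, written with the explicit KL definition; which matrix plays the role of the target is exactly the modelling choice being questioned. Shapes and names are assumptions for illustration, not CGTFra's code.

```python
import torch

def kl(p, q, eps=1e-8):
    """KL(p || q) along the last dimension for row-stochastic matrices."""
    p = p.clamp_min(eps)
    q = q.clamp_min(eps)
    return (p * (p.log() - q.log())).sum(dim=-1)

# toy variate-by-variate dependency matrices (N variates), rows sum to 1
N = 7
P = torch.softmax(torch.randn(N, N), dim=-1)   # shallow attention scores
Q = torch.softmax(torch.randn(N, N), dim=-1)   # deep adaptive-graph scores

l_align_pq = kl(P, Q).mean()   # KL(P || Q): large where P puts mass that Q does not
l_align_qp = kl(Q, P).mean()   # KL(Q || P): the reverse direction, generally different
print(float(l_align_pq), float(l_align_qp))
```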
CGTFra: General Graph Transformer Framework for Consistent Inter-series Dependency Modeling in Multivariate Time Series Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper addresses two limitations of Transformers in multivariate time series forecasting by proposing the CGTFra framework: (1) it introduces frequency-domain masking and resampling methods to replace positional encoding, thereby reducing dependence on timestamp information; (2) it incorporates a dynamic graph learning framework to explicitly model inter-variable dependencies in deeper network layers, addressing the limitation that existing methods only capture dependencies in shallow self-attention layers. Additionally, this paper is the first to propose a consistency alignment loss to constrain the dependency structures learned in both shallow and deep layers. The authors validate the effectiveness of their approach across 13 datasets, though some theoretical justifications and technical details require further elaboration. 1. The problem motivation is clear and well-supported by empirical evidence. 2. The authors are the first to propose modeling the dependency relationship between shallow and deep representations from a "consistency modeling" perspective, explicitly constraining their alignment using KL divergence. 3. The authors conduct comprehensive comparisons with 13 state-of-the-art methods across 13 datasets and validate the contribution of each module through ablation studies. The experimental design is thorough. 1. The theoretical justification for the equivalence between self-attention and GNNs (Appendix A.6) is intuitive. For the proposed consistency alignment loss, there is a lack of theoretical or mathematical guarantees—no formal bounds or convergence guarantees are provided to justify the rationality of this constraint. 2. The paper explains that the traffic dataset exhibits fixed periodic patterns, which accounts for why FMR underperforms the original iTransformer in Table 6. This suggests that FMR may excessively suppress periodicity, a characteristic that is very common in time series data. Have the authors considered adaptively adjusting the masking intensity based on the periodicity characteristics of the dataset to mitigate this issue? 3. Regarding performance degradation on large-variable datasets such as solar and traffic: the discussion of when DGL or CAL might lead to performance decline is somewhat superficial (limitations in Appendix A.16). The impact of these modules on large-variable datasets is mixed, sometimes degrading performance, but the paper lacks deeper analysis beyond speculation about "alignment challenges." 4. The authors need to clarify some issues in the methodology section; see the questions below. 1. In Equations 5-6, how are the node embeddings $\Theta$ initialized? What is the purpose of $Concat(X^sa, \Theta)$? 2. Section 3.1 proposes learning independent frequency masks for each variable, but the paper lacks analysis of the learned masks: How much do the masks differ across variables? Do the masks tend to preserve low-frequency or high-frequency components? Is there a relationship with the periodicity of the data? Such analysis would enhance the interpretability of FRM. 3. 
Regarding the consistency alignment design, Figure 4 shows that the two mechanisms exhibit similarities but also some differences. Could the forced alignment of these two mechanisms through Equation 8 lead to information loss? Lightly AI-edited
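To ground the question above about what the learned masks keep, here is a generic per-variate learnable frequency mask in the rFFT domain; it is a sketch of the general frequency-masking idea, not necessarily CGTFra's FMR module. After training, plotting `torch.sigmoid(self.logits)` per variate directly shows whether low or high frequencies are preserved.

```python
import torch
import torch.nn as nn

class FrequencyMask(nn.Module):
    """Per-variate learnable soft mask over rFFT bins of a length-L window."""
    def __init__(self, n_vars, seq_len):
        super().__init__()
        self.seq_len = seq_len
        self.logits = nn.Parameter(torch.zeros(n_vars, seq_len // 2 + 1))

    def forward(self, x):                      # x: (batch, n_vars, seq_len)
        spec = torch.fft.rfft(x, dim=-1)       # (batch, n_vars, seq_len//2 + 1)
        mask = torch.sigmoid(self.logits)      # values in (0, 1) per frequency bin
        return torch.fft.irfft(spec * mask, n=self.seq_len, dim=-1)

fm = FrequencyMask(n_vars=7, seq_len=96)
y = fm(torch.randn(8, 7, 96))
print(y.shape)    # torch.Size([8, 7, 96])
```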
CGTFra: General Graph Transformer Framework for Consistent Inter-series Dependency Modeling in Multivariate Time Series Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes CGTFra, a General Graph Transformer Framework for multivariate time series forecasting. The core idea is to model both the temporal dependencies and inter-variable correlations using a unified graph-transformer paradigm. Experiments on several real-world datasets demonstrate that CGTFra achieves consistent improvements over state-of-the-art baselines. 1. This paper offers a new perspective on modeling inter-variable dependencies (IVD), providing fresh insights that could inspire future research in multivariate time series forecasting. 2. The experimental section compares CGTFra with a wide range of strong baselines, and the results consistently show performance gains on multiple benchmark datasets. 1. The paper incorrectly states that iTransformer employs positional encoding. In fact, iTransformer does not use positional encodings, as it models temporal order implicitly within feature embeddings rather than through token positions. 2. The paper also misinterprets the role of Feed-Forward Networks (FFNs) in the Transformer architecture. FFNs do not explicitly model intra-series dependencies; instead, they act as nonlinear mappings that refine representations after the attention operation, capturing temporal relationships between the input and predicted future values. 1. In some datasets (e.g., ETT), the variables appear to be largely independent. Why is the Inter-Variable Dependency (IVD) module necessary in such cases? 2. What are the specific experimental settings used to generate the results in Figure 2? 3. Can the FMR be combined with architectures other than Transformers? In other words, is FMR a model-agnostic module or specifically tailored to Transformer-based frameworks? Moderately AI-edited
MorphGen: Controllable and Morphologically Plausible Generative Cell-Imaging Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper presents MorphGen, a diffusion-based generative model for Cell Painting microscopy images that achieves controllable generation across multiple cell types and perturbations. The key innovation is an alignment loss that guides MorphGen’s internal representations to match those of OpenPhenom (Kraus et al., 2024), a biological foundation model, encouraging the generative model to capture biologically meaningful features. Unlike prior works such as MorphoDiff (Navidi et al., 2025), which compressed six fluorescence channels into RGB and handled only one cell type, MorphGen generates all six channels jointly at higher resolution, thus preserving organelle-specific details essential for downstream morphological analysis. The model uses a latent diffusion architecture (leveraging a pretrained VAE) and incorporates conditioning on both cell type and perturbation. Experiments demonstrate that MorphGen produces morphologically plausible cell images that maintain known subcellular structures. Quantitatively, it significantly outperforms previous state-of-the-art: for example, on a multi-gene test set, its Fréchet Inception Distance (FID) is 35–60% lower than MorphoDiff. Qualitative results show that generated images closely mirror real cell images in texture and morphology. The paper also introduces evaluation metrics like Relative FID (normalised by dataset variability) and uses CellProfiler features to demonstrate that synthetic images capture phenotypic variation. Overall, the contributions of MorphGen are a substantial step toward “virtual cell” models for in silico biological experiments, enabling high-content image generation with controllable conditions and improved biological fidelity. - Originality: The paper introduces a new combination of ideas focused on microscopy image analysis – diffusion models with a transformer backbone, multi-channel image generation, and alignment to a domain-specific foundation model. This is a creative extension of diffusion models into the biological imaging domain, addressing limitations of previous approaches. The representation alignment loss (adapted from REPA by Yu et al., 2025) is used in a novel way here (with OpenPhenom features) to inject biological priors into the generative process. - Quality: The technical quality is high. The method is described in sufficient detail, and the experiments are decent, though they could be stronger. The authors compare MorphGen against appropriate baselines (MorphoDiff and even Stable Diffusion repurposed) on multiple datasets. The quantitative gains are impressive. For instance, MorphGen achieves substantially lower FID/KID scores than MorphoDiff across datasets. The ablation studies (in the appendix) lend support to the claim that each component (alignment loss, full-channel generation, etc.) has a positive impact. The model outputs are of high resolution and fidelity; Figure 2 and others show that synthetic images reproduce fine subcellular details, which is non-trivial.
Additionally, the paper reports not only generative quality metrics but also uses CellProfiler features and a CATE (conditional treatment effect) analysis to ensure that known phenotypic differences under perturbations are being captured – this indicates a quality focus on biological accuracy, not just visual fidelity. - Clarity: Aside from minor issues noted, the paper is clearly written. - Significance: This work has practical significance for biomedical imaging communities. By enabling controllable simulation of cell images, MorphGen can be used to generate in silico experiments – for example, creating hypothetical outcomes for perturbations or augmenting datasets for training. The ability to model multiple cell types and stains is particularly significant, as it broadens the applicability (previous models were limited in scope). While the paper is strong, there are some weaknesses and areas for improvement: - Evaluation could be more biologically insightful: The current evaluation leans on aggregate metrics (FID, KID) and visual inspection, with some PCA and correlation analyses in the appendix. However, these don’t fully demonstrate that the generated images recapitulate known biological relationships. For instance, a more direct test would be to see if specific CellProfiler features correlate between real and generated cells from the same perturbation. The paper shows side-by-side PCA of real vs fake and a global correlation matrix, but this is only a coarse validation. It would strengthen the work to quantify, for example, that for each known perturbation, the change in particular CellProfiler features (nuclear size, cell count, etc.) in generated images correlates with that in real images. Moreover, the authors could calculate the recall of known biological relationships between genes based on databases like StringDB, and compare this score between real and generated images. See Celik et al. 2024 (https://doi.org/10.1371/journal.pcbi.1012463). In short, demonstrating downstream task fidelity (such as predicting drug mechanism or gene function from synthetic images and comparing to real) would make the biological validity more convincing. - Limited discussion of foundation model choice: The authors use OpenPhenom embeddings to guide the generator. OpenPhenom is a reasonable choice (a well-known cell image foundation model), but the paper doesn’t explore this decision deeply. One concern is that recent analyses suggest such foundation models may be dominated by easy-to-learn signals like cell count (how many cells in the image) rather than subtler phenotypes. If OpenPhenom’s embedding primarily captures cell count or other simple variations, aligning to it might inadvertently make MorphGen focus on those and neglect finer morphological details. There are other biological feature embedding models they could consider – for example, CellCLIP (Lu et al., 2025) aligns Cell Painting images with text descriptions of perturbations via contrastive learning, MolPhenix (Fradkin et al., 2024) aligns images with molecular structures, CLOOME (Sanchez-Fernandez et al., 2023) is a confounder-aware multimodal model linking cell images and chemicals, CWA-MSN learns representations via siamese networks. All of the above provide pre-trained image embedders for cell painting images that have outperformed OpenPhenom in recalling known biological relationships from images. 
An ablation or comparison using some of these different embeddings (or simply turning off the alignment loss) would reveal how crucial the choice is. It’s possible that OpenPhenom is not uniquely optimal and that other representations might improve or alter the results. Currently, the paper assumes OpenPhenom as a given; examining this would improve the work’s robustness and novelty. - Clarity and definition issues: There are a few spots where the paper could be clearer. Terms like IMPA should be defined when first used. Not defining it might confuse readers unfamiliar with that prior work. Similarly, “clean images” is used in line 200. Does this mean images without noise? The authors should specify this to avoid ambiguity. The notation $z_0$ appears without definition (likely the initial noise latent for diffusion sampling), as does F(x) (I assume OpenPhenom). Explicitly stating this would help readers follow the generation process description. Furthermore, Scalable Interpolant Transformer (SiT) is defined twice in lines 163 and 185. These are relatively small weaknesses, but improving them would polish the paper. - Use of a pretrained VAE not specific to microscopy: The model relies on a pretrained VAE to encode and decode images. This VAE was originally trained on RGB natural images. The authors adapt it for 6-channel input by stacking channels into pseudo-RGB triplets, which is clever. (A minimal sketch of what I understand this channel grouping to look like is appended after this review.) However, the paper does not mention any fine-tuning of this VAE on cell images. Using a VAE not trained on fluorescent microscopy data could introduce a domain gap – e.g., color/intensity distributions in natural images differ from microscopy, and the VAE might not optimally compress/reconstruct cell structures (especially if cell images violate assumptions it learned). It’s a testament to the method that it still works well, but this choice could be a limitation. Perhaps training a custom VAE on Cell Painting (even a smaller one) might further improve quality. At minimum, the authors should clarify what data the VAE was pretrained on and discuss any limitations or justify why this doesn’t harm results. Right now, it’s a bit implicit. - Miscellaneous: I have a few other minor critiques. (1) The paper uses “interpretability” in describing the benefits of full-channel generation. While preserving organelle channels does aid human interpretability of results, the model itself isn’t inherently interpretable in a model-explainability sense. It’s more about facilitating post-hoc analysis. The wording could be tempered to avoid overstating interpretability. (2) The comparison to CellFlux (another recent generative model, possibly via flow matching) is only mentioned briefly in the appendix. If CellFlux is contemporary work, a clearer comparison in the main text would be helpful for completeness. These issues do not fundamentally weaken the work but addressing them would improve the overall presentation and rigour. - Foundation model alignment: Can the authors provide more insight into the decision to use OpenPhenom embeddings and how sensitive the results are to this choice? An ablation on at least one other cell painting image embedder would be appreciated. For example, if one trains MorphGen without the alignment loss (or with a different embedding space, like CellCLIP), how does the image quality or biological fidelity change? - Scope of “biologically meaningful features”: The paper claims that the alignment enforces capturing meaningful patterns.
Could the authors elaborate on which phenotypic patterns MorphGen is actually learning? In short, how do we know the model isn’t just learning to generate generic-looking cells plus the correct number of cells, rather than truly phenotype-specific morphologies? Any additional evidence here would strengthen confidence in biological relevance. A comparison between using real and generated images to recall known biological relationships from rxrx3-core would make this paper much stronger. Fully AI-generated
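For reference on the VAE point raised above: a minimal sketch of the pseudo-RGB channel grouping as I understand it from the paper's description. The `rgb_vae` object and its `encode()` method are assumed placeholders for the pretrained RGB VAE, and the 2×3 channel split is my reading of the "pseudo-RGB triplets" phrasing, not the authors' code.

```python
import torch

def encode_six_channel(x: torch.Tensor, rgb_vae) -> torch.Tensor:
    """Encode a 6-channel Cell Painting image with a frozen VAE pretrained on RGB images.

    x: (batch, 6, H, W) fluorescence stack.
    rgb_vae: pretrained VAE exposing encode() for 3-channel inputs (assumed interface).
    Returns latents of shape (batch, 2 * C_latent, H', W').
    """
    triplets = torch.split(x, 3, dim=1)           # two pseudo-RGB groups of three channels each
    with torch.no_grad():                         # the VAE is kept frozen
        latents = [rgb_vae.encode(t) for t in triplets]
    return torch.cat(latents, dim=1)              # channel-stack the latents for the diffusion model
```

Whatever the exact grouping, the domain-gap concern stands: the encoder never saw fluorescence intensity statistics during pretraining, so reporting per-channel reconstruction quality would help.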
MorphGen: Controllable and Morphologically Plausible Generative Cell-Imaging Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. In this work, the authors propose MorphGen, a generative model to predict the morphological cellular responses to perturbations. They introduce strategies to make this framework compatible with the multi-channel nature of HCS imaging. The authors performed a series of experiments to validate the performance of the method. 1. This work tackles an interesting problem, namely the application of generative models to HCA images. Indeed, these types of images generally exceed the three channels found in standard RGB images, which requires adapting general-purpose generative models to this kind of data. 2. The paper is well written and easy to follow. 3. The authors present interesting ideas for adapting latent diffusion models to HCA images. 1. While I find the method interesting, the novelty appears limited, as it mainly consists of adapting Morphodiff to biological images with more than three channels. 2. The related works section lacks several important methods that address the prediction of cellular responses to perturbations [1,2]. 3. The proposed baseline is somewhat weak, as the authors only compare their model to Morphodiff and Stable Diffusion, reporting FID and KID scores. Moreover, these metrics are related: FID is typically suited for large datasets, whereas KID is more appropriate when working with fewer images. 4. The evaluation is based on only two datasets, which may limit the robustness of the conclusions. 5. The authors do not provide any schematic to describe the proposed architecture, and such a schematic would greatly facilitate understanding. References: [1] PhenDiff: Revealing Subtle Phenotypes with Diffusion Models in Real Images, Bourou et al. [2] Revealing invisible cell phenotypes with conditional generative modeling, Lamiable et al. 1. It is unclear what the authors mean when they state: “Our model combines a pretrained VAE with a latent diffusion model.” A latent diffusion model already includes a VAE that encodes the image into a lower-dimensional latent representation, where the diffusion process is performed. Do the authors refer to this built-in VAE, or are they introducing an additional one? Furthermore, on which images was the VAE pretrained, and how many channels were used during pretraining? 2. Previous methods [1,2] were already able to generate biologically meaningful images. What improvement does REPA provide in this case? Did the authors perform an ablation study to evaluate the importance of each component? 3. How is the conditioning performed? Which encoders are used to encode the different perturbations? 4. The U-Net used in modern diffusion models already includes self-attention blocks to capture spatial relationships. Does SiT provide any improvement over this? Did the authors compare the two backbones? 5. FID and KID were originally proposed to evaluate RGB image generation. How do the authors apply these metrics to images with more than three channels? 6. I do not understand how Stable Diffusion was used as a baseline, since it does not include any encoder to handle perturbation conditioning. How is this achieved in the proposed setup?
Furthermore, Stable Diffusion is trained on natural RGB images, so it seems unreasonable to apply it directly to biological images without retraining. Did the authors fine-tune or adapt the model in any way? What about the text encoder? 7. In Table 1, was the number of channels the same for all methods? Although MorphGen provides the best FID, the values remain very high. Furthermore, why did the authors not provide standard deviations for the other models? References: [1] PhenDiff: Revealing Subtle Phenotypes with Diffusion Models in Real Images, Bourou et al. [2] Revealing invisible cell phenotypes with conditional generative modeling, Lamiable et al. Fully human-written
MorphGen: Controllable and Morphologically Plausible Generative Cell-Imaging Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper develops MorphGen, a diffusion model for generating cell painting images conditioned on a description of a perturbation and cell type. There are previous approaches for this task, but two key advantages of MorphGen are the ability to natively model all 6 channels of cell painting data and the ability to incorporate cell type conditions. • The work addresses an important problem—predicting perturbation response using cell painting data • Extending previous diffusion approaches to natively model the 6 cell painting channels is an important step toward more effective cell perturbation prediction models. • FID results are strong, and uncurated images look qualitatively realistic. The way this evaluation is conducted seems really solid. • Evaluations using CellProfiler features are important and show that the method captures biologically meaningful aspects of the images. • From an ML perspective, the work is more like an incremental step than a paradigm shift. This is a relatively standard diffusion model with a few tweaks to make it work on more channels than the natural images commonly used for training in computer vision applications. • From an applications perspective, it seems that out-of-sample prediction is not really evaluated (and maybe not possible with this approach, see next question). This is kind of the main goal of developing such a generative model in the first place. • Cell type and perturbation conditioning are not clearly described. How do you represent a chemical or genetic perturbation? Is it a one-hot encoding or the latent space of a chemical encoder? Using a chemical structure-based encoder of some kind seems like a better choice because it allows potential generalization to unseen perturbations. • Evaluations don’t really test whether the generated images respect the cell type or perturbation condition. Something like a conditional FID or classification accuracy on generated images would get at this more directly. • Important previous work not discussed: LUMIC, Hung et al. 2024. LUMIC uses a related latent diffusion approach, is designed to predict across cell types and can predict held-out perturbations and held-out cell types (though it does not predict all 6 channels like the current work). 1. How do you represent a chemical or genetic perturbation? Is it a one-hot encoding or the latent space of a chemical/gene encoder? 2. How do you represent the cell type when conditioning the diffusion model? 3. Can the model in principle generalize to unseen perturbations or unseen cell types? Fully human-written
OpenAVS: Training-Free Open-Set Audio Visual Segmentation with Foundational Models Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose OpenAVS, a novel training-free language-based approach that, for the first time, effectively aligns the audio and visual modalities via a text proxy for open-vocabulary AVS. But there is still work to do. 1. The method seems feasible. 2. The writing is clear and easy to read. 3. Enough visualization is provided. 1. The citation style appears inconsistent with ICLR guidelines. Mixing \citet{} and \citep{} hinders readability and should be standardized. 2. Integrating multiple mature systems introduces clear drawbacks. Please include a runtime comparison for each model in Table 3, and analyze potential error accumulation across the pipeline. 3. The paper evaluates on S4, MS3, AVSS, and V3 from AVSBench, which are typically used in closed-set settings. Even though the method is training-free, comparisons against closed-set trained baselines are still important. Moreover, evaluation on an open-vocabulary AVS dataset would be more appropriate, given that LLMs operate in an open-vocabulary regime. 4. Evaluation details are missing. Please report the inference resolution and the average input/output token counts for the LLM. 5. You state “GroundingDINO 1.0 for detection (box threshold = 0.25).” How critical is the box threshold? Provide an ablation or sensitivity analysis to justify the chosen value. 6. In Figure 3, AVSS is a classification task where every mask should be assigned a color. Why do the GT and other model outputs lack color assignments? Please clarify the visualization protocol or correct the figure. Extra: I think this training-free method based on LLMs is naturally suitable for Ref-AVS [2] tasks. Why not test on Ref-AVS tasks? [1] Guo, R., Qu, L., Niu, D., Qi, Y., Yue, W., Shi, J., ... & Ying, X. (2024, October). Open-vocabulary audio-visual semantic segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia (pp. 7533-7541). [2] Wang, Y., Sun, P., Zhou, D., Li, G., Zhang, H., & Hu, D. (2024, September). Ref-avs: Refer and segment objects in audio-visual scenes. In European Conference on Computer Vision (pp. 196-213). Cham: Springer Nature Switzerland. My main concern is the evaluation and analysis of the overall system. I will finalize my score after the rebuttal. Lightly AI-edited
OpenAVS: Training-Free Open-Set Audio Visual Segmentation with Foundational Models Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces a novel training-free language-based approach for open-set audio visual segmentation. The language-based method achieves more robust audio-visual alignment than existing methods. This is a flexible and cost-efficient framework, and it achieves state-of-the-art performance on four benchmarks across training-free, few-shot, and zero-shot AVS tasks. * This is a flexible, model-agnostic and cost-efficient framework. * Experiments on 4 benchmarks show the good performance of OpenAVS. * It surpasses existing unsupervised, zero-shot, and few-shot AVS methods. * I believe this method is likely to work effectively, as none of the steps appear problematic. However, it seems more focused on the engineering aspect rather than offering new insights to the community. * Are there any specific challenges or difficulties in applying this pipeline to solve the task? * Given the large number of models used, is it reasonable to combine so many for this task, especially in comparison to other approaches? * I am unclear about how few-shot AVS is applied in the proposed method. The pipeline appears primarily focused on generating masks from audio signals. Where are the few-shot examples integrated into this process? * More details on the self-training process would be helpful. For example, which specific parts of the model are being trained during this phase? * Could you clarify the role of the V2T component? It might also be beneficial to conduct ablation studies to better understand the contributions of both the A2T and V2T components. I would be happy to revise my score if the author addresses these points. Please refer to the weaknesses. Lightly AI-edited
OpenAVS: Training-Free Open-Set Audio Visual Segmentation with Foundational Models Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper focuses on audio-visual segmentation, i.e., separating sounding objects from videos by predicting pixel-level masks for audio signals. Instead of direct audio-text alignment, this paper uses a training-free approach, namely OpenAVS, to align audio and visual signals via a text proxy. The pipeline is divided into audio-to-text description generation, visual-to-text description generation, LLM-guided prompt translation, and text-to-visual sounding object segmentation. Experiments are carried out on four benchmarks, across unsupervised, zero-shot, and few-shot AVS settings. [+] The manuscript is well written, with clear logic and sufficient formulations. [+] Exploring the omni-modal alignment of image-text-audio is one promising direction. [+] Experiments are conducted across unsupervised, zero-shot, and few-shot AVS settings. [-] The bottleneck of information compression. Text is a highly compressed form of audio/images, which means a lot of information is lost when converting audio/image to text. Many textual descriptions are coarse-grained, for example, audio-to-text usually cannot capture differences between dog breeds. If a video contains two different breeds of dogs, OpenAVS may fail after converting audio/image to text. [-] The noise is gradually amplified. This paper’s idea is severely limited by the performance of the audio/image to text pre-trained models. What should be done if there is no commonality in the text converted from images/audio? Usually, the pre-trained models can only convert simple objects/noiseless timbres, which means that the performance of this paper is limited in practical scenarios. [-] How to solve complex scenarios. Across all datasets in this paper, the scenes are simple and quite detached from reality. How does OpenAVS perform, especially when dealing with mixed-sound, multi-entity, and off-screen scenes? What’s Making That Sound Right Now? Video-centric Audio-Visual Localization. ICCV2025 [-] In OpenAVS, multiple pre-trained models are chained, which results in low inference efficiency. What is the runtime of the overall process? Fully human-written
OpenAVS: Training-Free Open-Set Audio Visual Segmentation with Foundational Models Soundness: 2: fair Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The manuscript proposes a training-free, open-vocabulary audio-visual segmentation (AVS) framework, OpenAVS, which uses language as a mediator to alleviate cross-modal misalignment in multi-source and temporally drifting scenarios. The method first converts audio and visual frames into semantic text via audio language models (ALMs) and vision-language models (VLMs), then employs a large language model (LLM) to translate and consolidate prompts with model/prompt/frame consistency constraints. The refined noun-centric prompts guide visual foundation models (VFMs) (e.g., Grounded-SAM/SAM2) to produce pixel-level masks frame-by-frame. (A minimal sketch of this pipeline, as I understand it, is appended after this review.) The method attains competitive results on several benchmarks. 1). The paper is generally well written and easy to follow, with a clear problem setup and pipeline description. 2). The method shows reasonable performance without task-specific training in the reported settings. 3). The pipeline is modular and can be instantiated with alternative ALMs/VLMs or VFMs with minimal changes. 4). By operating in text space, the approach can accommodate unseen categories to a limited extent, offering broader semantic coverage than fixed-class models. 1). Incremental contribution. The framework largely assembles off-the-shelf components, and similar workflow-style pipelines have appeared in prior work [1–3], including AVSS/AVVS systems that convert audio into textual/context cues to guide frame selection [1]. 2). Limited novelty of ALM usage. The headline contribution—introducing an ALM to obtain text from audio—seems straightforward. Related LLM/agent literature [4, 5] already invokes specialized modules to coordinate tasks, which makes the methodological advance here appear modest. 3). Multi-source disambiguation is insufficiently analyzed, particularly for overlapping or co-occurring sound sources and attribution when two similar sources are active. 4). The approach may be sensitive to prompt templates and LLM choices, and the paper provides limited ablations on prompt variants and threshold settings. 5). Latency concerns. The reported inference time (≈5.13–6.71 s for the best settings) suggests the method is far from practical deployment. For typical videos at 25–35 FPS, real-time operation would require ~28–40 ms per frame; the current latency is orders of magnitude higher unless substantial optimization is shown. 6). Questionable cost–effectiveness. The best configuration (e.g., OpenAVS-Large + GPT-4o-mini + GDINO+SAM2) attains mIoU 0.684 / F 0.769 at ~$0.00163 per sample, yet still trails task-specific trained baselines. It is unclear why one should pay per-inference costs for comparatively lower accuracy. Conversely, lighter variants (e.g., OpenAVS-Lite + GPT-2 XL + GDINO+SAM at mIoU 0.431 / F 0.561) fall below thresholds that would be usable in practice. 7). Metric concerns: the reviewer highlights that Jaccard and IoU differ in segmentation evaluation. I refer to the SAM2 paper to emphasize this distinction and note that previous methods' results were computed using Jaccard. [1] Unleashing the temporal-spatial reasoning capacity of gpt for training-free audio and language referenced video object segmentation.
[2] Open-Vocabulary Audio-Visual Semantic Segmentation [3] Retrieval-Augmented Generation for AI-Generated Content: A Survey [4] Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools [5] Mind2web: Towards a generalist agent for the web nil. Lightly AI-edited
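To summarize the pipeline structure I am evaluating above, here is a minimal sketch; every callable below is a hypothetical placeholder for the corresponding off-the-shelf component (ALM, VLM, LLM, GroundingDINO + SAM/SAM2), not the authors' API.

```python
from typing import Callable, List, Sequence

def openavs_pipeline(
    audio_clip,
    frames: Sequence,
    audio_to_text: Callable,            # ALM wrapper (hypothetical)
    frame_to_text: Callable,            # VLM captioner wrapper (hypothetical)
    llm_translate_to_nouns: Callable,   # LLM prompt translation/consolidation (hypothetical)
    ground_and_segment: Callable,       # GroundingDINO + SAM/SAM2 wrapper (hypothetical)
    box_threshold: float = 0.25,
) -> List:
    """Training-free text-proxy AVS: audio/vision -> text -> noun prompts -> per-frame masks."""
    audio_caption = audio_to_text(audio_clip)
    frame_captions = [frame_to_text(f) for f in frames]
    # The LLM consolidates audio and visual descriptions into noun-centric prompts;
    # this is where the model/prompt/frame consistency constraints would be enforced.
    prompts = llm_translate_to_nouns(audio_caption, frame_captions)
    return [ground_and_segment(f, prompts, box_threshold=box_threshold) for f in frames]
```

Each stage adds its own inference cost, which is where the latency concern in weakness 5 originates; it also makes clear how much of the framework is an assembly of existing components (weakness 1), since an early captioning mistake cannot be recovered by the downstream segmenter.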
Teach to Reason Safely: Policy-Guided Safety Tuning for MLRMs Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors present evidence that multi-modal reasoning models significantly outperform multi-modal base models on complex tasks; however, this improvement also leads to an increase in the generation of harmful content. Through experimental analysis, they identify two primary causes: 1) visual attention drift and 2) unsafe reasoning patterns. They note that existing methods primarily focus on teaching models how to reject harmful outputs without providing guidance on safe reasoning. To address this issue, the authors propose a two-stage alignment framework called Policy-guided Safety Tuning. Testing on various multi-modal safety benchmarks demonstrates that this method significantly reduces the rate of harmful content generation while also performing well on Visual Question Answering tasks, without exhibiting issues of over-sensitivity. 1. The paper is well-written, demonstrating clarity and coherence throughout. 2. The authors provide a thorough comparison of multiple baseline methods, testing their proposed approach against a diverse array of benchmarks. 3. The paper explores the relationship between multimodal attention mechanisms and safety considerations, contributing valuable insights to the field. 4. The motivation is really good. The proposed method heavily relies on the quality of the training data, which is entirely generated and evaluated by AI models. This might introduce a significant bias. 1. It might be better if the authors clarify their contribution as a dataset. 2. Could you explain how the paper addresses potential biases introduced by using AI-generated data, such as manual annotation or other approaches for data quality control? 3. JailbreakV [1] shows that even randomly generated images and harmful questions could attack MLLMs easily. Could the authors explain the effectiveness of PST under these conditions? Additionally, how do unrelated image descriptions impact the model's responses? [1] JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks Lightly AI-edited
Teach to Reason Safely: Policy-Guided Safety Tuning for MLRMs Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper studies a safety–reasoning trade-off in Multimodal Large Reasoning Models (MLRMs): reasoning-tuned models produce more harmful outputs than their base counterparts. It analyzes two mechanisms—(i) visual attention drift that weakens visual grounding and (ii) unsafe reasoning patterns such as flawed reasoning initiation and chain-of-thought safety attenuation. The authors also propose Policy-guided Safety Tuning (PST), a two-stage framework: Policy-Guided Supervised Fine-Tuning (PST-SFT) that embeds explicit safety policies into reasoning trajectories, and Safety Reasoning Preference Optimization (SRPO) that aligns toward safe yet informative responses via a preference loss. Experiments on BeaverTails-V, MM-SafetyBench, SPA-VL, SIUO, and general VL benchmarks show reduced harmful rate (HR) and refusal rate (RR) while maintaining competitive VQA/MathVista performance. 1. The paper addresses a timely problem: how to teach reasoning models to reason safely. 2. Experiments indicate that PST attains reasonable safety gains without severely harming general capabilities. 3. The paper is well written. 1. **Limited depth of insight.** One of the paper’s central mechanisms, **Visual Attention Drift**, appears to be directly taken from [1] rather than newly discovered or substantially extended in this work. 2. **Motivation and experiments are not well aligned.** The Introduction (Section 1) and the Analysis (Section 3) devote substantial space to Visual Attention Drift and Unsafe Reasoning Patterns, yet the experiments do not demonstrate how PST resolves or alleviates these two phenomena. I did not find corresponding analyses in the main text or the appendix. 3. **Method novelty is incremental.** The two key components, **PST-SFT** and **SRPO**, are essentially direct transfers of widely used techniques (**SFT** and **DPO**). The overall **SFT + DPO** training pipeline is now mainstream; thus, much of the observed improvement likely stems from established methods rather than a fundamentally new algorithmic contribution. 4. **Data and reproducibility.** Following point 3, the main innovation appears to lie in data processing and dataset construction. Building the dataset depends on additional models (e.g., Qwen-vl, DeepSeek-r1, GPT-4o), yet the costs in time, storage, and token usage are unclear, and it is not stated whether the data or code will be released. [1] Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, and Sheng Liu. More thinking, less seeing? assessing amplified hallucination in multimodal reasoning models. arXiv preprint arXiv:2505.21523, 2025a. 1. How, quantitatively, does PST reduce Visual Attention Drift, and how do before–after attention changes correlate with harmful-rate reductions? 2. Do Unsafe Reasoning Patterns decrease after PST, and how are these patterns operationalized and measured across reasoning steps? 3. Beyond standard SFT and DPO, what are the specific algorithmic novelties of PST-SFT and SRPO, and how sensitive are gains to these components? 4. What are the token, time, compute, and storage costs of constructing DSFT and DSRPO? Lightly AI-edited
Teach to Reason Safely: Policy-Guided Safety Tuning for MLRMs Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper - Illustrates that reasoning VLMs generate harmful outputs at a higher rate than their non-reasoning counterparts. - Introduces a multi-step data generation pipeline that generates policy-specific reasoning traces on the basis of text-image samples from the Beavertails-V dataset. - The data generation pipeline produces both SFT as well as preference data. - Fine-tunes existing reasoning VLMs on the generated data and evaluates the results on a range of safety and general vision-language reasoning benchmarks. - Extensive experiments across multiple benchmarks show improved safety performance for PST fine-tuned models while maintaining high general Vision Language reasoning capabilities. - The method seems to outperform recent comparable methods such as MSR-align and Think-in-Safety. - Paper claims that reduced attention to visual tokens is a cause of safety degradation in reasoning models, but it doesn’t show whether the PST fine-tuning actually changes that behavior. - The conceptual contribution is fairly limited as the paper follows a well-established pattern of using a multistep pipeline with powerful third-party models to curate a dataset that distills the in-context capabilities of these powerful models into smaller models and improves their performance on the respective benchmarks. - The data generation processes of MSR-align, Think-in-Safety, and the proposed method are all very similar. It is very difficult to understand what the main reason is for PST to outperform these prior works. Is it a different choice of prompts in the pipeline? A different choice of models? Specific steps in the pipeline? Fully human-written
Teach to Reason Safely: Policy-Guided Safety Tuning for MLRMs Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper investigates the safety-reasoning trade-off in Multimodal Large Reasoning Models (MLRMs), highlighting that improved reasoning often coincides with increased harmful output rates. The authors identify two key mechanisms behind safety degradation—visual attention drift and unsafe reasoning patterns. To address these, they propose Policy-Guided Safety Tuning (PST), a two-stage alignment framework consisting of Policy-Guided Supervised Fine-Tuning (PST-SFT) and Safety Reasoning Preference Optimization (SRPO). Experiments on multiple benchmarks show that PST significantly reduces harmful outputs without major sacrifices in general reasoning capabilities. 1. The paper offers a clear and organized examination of safety degradation in MLRMs, identifying two representative mechanisms—visual attention drift and unsafe reasoning patterns. While parts of the analysis are qualitative, the work provides a useful foundation for understanding how reasoning-oriented tuning may influence safety performance across benchmarks. 2. The proposed two-stage PST framework effectively integrates explicit safety policies into reasoning, moving beyond refusal-based alignment toward interpretable, reasoning-centered safety control. 3. The dataset construction is detailed and replicable, with Figure 6 making the process and logic transparent. 1. The motivation underlying the claimed "reasoning–safety trade-off" is not entirely sound. In Section 3.1, the authors compare reasoning-tuned models with their corresponding base models and attribute the observed safety degradation to reasoning itself. However, this conclusion lacks causal rigor. The performance gap may simply stem from the absence of safety-aware data during reasoning-tuning rather than an inherent conflict between reasoning and safety. Moreover, the analysis does not control for data composition or fine-tuning objectives, which weakens the validity of the stated correlation. 2. The failure analysis in Section 3.2 appears rather subjective. The authors summarize two types of failure modes but do not describe the process or key data used to derive them. For example, it remains unclear what proportion of the failed cases fall under *Visual Attention Drift* or *Unsafe Reasoning Patterns*. Moreover, no empirical evidence is provided to show that *Visual Attention Drift* directly causes unsafe behavior, the logical connection between the two is not clearly established, even though they are observed to co-occur statistically. 3. The proposed Policy-Guided Safety Tuning (PST) framework is conceptually similar to existing approaches (e.g., Sun et al., 2023; Guan et al., 2024; Wang et al., 2025). The paper does not clearly articulate how PST substantively differs from or advances beyond these prior frameworks beyond its multimodal setting. [1] Sun Z, Shen Y, Zhou Q, et al. Principle-driven self-alignment of language models from scratch with minimal human supervision[J]. Advances in Neural Information Processing Systems, 2023, 36: 2511-2565. [2] Guan M Y, Joglekar M, Wallace E, et al. Deliberative alignment: Reasoning enables safer language models[J]. 
arXiv preprint arXiv:2412.16339, 2024. [3] Wang H, Qin Z, Shen L, et al. Safety Reasoning with Guidelines[C]//Forty-second International Conference on Machine Learning. 1. The SRPO objective (Eq. 7) appears mathematically identical to standard DPO, with the same log-ratio formulation and sigmoid preference loss (the standard DPO objective is written out after this review for reference). Since both $y_w$ and $y_l$ already include reasoning traces, could the authors clarify what differentiates SRPO conceptually or algorithmically from conventional DPO? Moderately AI-edited
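For reference, the standard DPO objective I am comparing Eq. 7 against (Rafailov et al., 2023), written in the usual preference-pair notation; this is the textbook form, not necessarily the paper's exact parameterization:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

If SRPO reduces to this form with $y_w$ and $y_l$ simply containing policy-guided reasoning traces, then the contribution lies in the data construction rather than in the objective itself.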
AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces AQuA, a dataset targeting ambiguity issues in VQA. There are four fine-grained levels. This work fine-tunes VLMs with SFT followed by GRPO using an LLM-as-a-judge reward. Fine-tuning small models substantially improves the accuracy across ambiguity levels, outperforming larger open- and closed-source baselines. 1. The research problem is clearly framed, with 4 levels of categorization. 2. The dataset construction pipeline has human validation. 1. The importance of the problem in real-world settings is unclear. In Figure 1, the other models' answers still seem reasonable to me. So I wonder about the significance of the problem in the VQA setting. 2. The rationale for, and completeness of, the 4 different levels is not established. How can you tell whether there aren't other ambiguous questions? 3. It seems that the difference between the levels is simply the number of salient objects, which can be quite subjective or prone to errors. You need to pre-define a size threshold, which seems arbitrary. 1. Since LLMs can be used as a judge for reward assignment, one question is, why can this ambiguity problem not be solved via prompting techniques? If GPT-5-mini can serve as the judge, can gpt-5-mini, with the proper prompts, just directly solve the ambiguous VQA problem? Is there really a need for fine-tuning? E.g., make this iterative -- let one LLM first give an answer, then use gpt-5-mini to judge it, then iteratively let the LLM refine its answer. What will be the results of this approach? What's the trade-off here? Fully human-written
AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper tackles an interesting problem: teaching VLMs to handle ambiguous questions through strategic responses across four ambiguity levels. The motivation is solid and the experiments are thorough, but I have concerns about the heavy reliance on GPT-5 throughout the entire pipeline, the small dataset size, and some questionable design choices. The core idea has merit, but the execution has limitations that affect how much we can trust the results. - The core idea is well-motivated - Getting 3B models to outperform 72B+ models shows this training approach works. - Generation, filtering, and evaluation all use GPT-5 variants. This creates circular logic—you're essentially teaching models to mimic GPT-5's behavior and then using GPT-5 to judge success - 3.6K training samples from COCO only. Will this generalize to other domains? - Why 20% bounding box area for Level 1? Why not 15% or 25%? No ablation studies to justify these choices. - How do humans perform on strategic selection? - Performance drops from 92.22% to 77.0% (Fig. 5). The "redistribution" explanation feels unsatisfying—this is a big drop. Please answer the questions in the weaknesses section Moderately AI-edited
AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions Soundness: 2: fair Presentation: 4: excellent Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces a new fine-grained dataset AQuA with ambiguous questions that requires vision-language models to recognize when they cannot answer a question. Most models answer ambiguous questions confidently instead of abstaining or seeking clarifications. The paper defines a 4-level categorization of question ambiguity, and an answering strategy for each level: L0 questions are easy and unambiguous; L1 questions are also unambiguous but require the model to resolve the salient referent; L2 questions have 2–3 possible answers; and L3 contains questions where enumeration is inefficient and the model must request clarification. AQuA is built using COCO images and uses provided bbox annotations to control ambiguity levels; for example, L1 images have a single salient object (exactly one bounding box covering at least 20% of the image). GPT-5 is used to generate question-answer pairs for each level. The dataset is filtered to verify ambiguity level and answer correctness using GPT-5-mini (7.2K samples in the final dataset). The eval split is further filtered with human annotators who verify if each sample belongs to the assigned ambiguity level. Experiments are performed on 4 open-source models (qwen2.5 & internvl3 family) and on GPT-5, Gemini-2.5-Flash. Pretrained models are evaluated using zero-shot prompts, CoT prompts, and a strategy prompt. Furthermore, qwen2.5vl-3b & internvl3-2B models are finetuned on AQuA to handle ambiguous questions, first in a supervised manner followed by RFT using GRPO. In GRPO, a generation gets a reward (from GPT-5-mini as a judge) of 1 for grounded answer & correct strategy, and partial reward for correct strategy. Results are shown on the eval set of AQuA, where the trained models show better ability to choose the correct answer strategy. - The paper identifies and attempts to tackle the issue of overconfident predictions by vision-language models for questions that are ambiguous. - The data generation pipeline of AQuA is described in detail. Human filtering on the eval split is performed to ensure clean samples. - The paper is well written and easy to follow. - The dataset is not meaningfully 'fine-grained'. There are only 4 categories of ambiguity, with real-world objects (also not from fine-grained categories). - AQuA has a single fixed "correct" answer strategy for each level, which is unrealistic. In real interactions, multiple strategies (or combinations of them) are also appropriate. For instance, AQuA says the only acceptable strategy for answering L3 questions is to ask for clarifications, whereas realistic answers could involve making a best-guess based on stated assumptions, followed by alternative coarse answers, followed by user clarifications, etc. - Samples are strictly classified into 4 categories, which is not reflective of real scenarios that can fall in between categories. Many cases lie between levels; for example, an image with 5–10 apples falls between L2 and L3, for which acceptable solutions can involve enumerating a few options and then asking for clarifications.
- Issue with metrics: The *strategic accuracy* metric measures the ability of the model to conform to the categorizations made by AQuA and does not measure the true ability of the model to handle ambiguity. As discussed above, having a fixed strategy as ground-truth is unrealistic. The factual consistency prompt does not check for correctness of the answer and only measures groundedness. Better evaluation metrics are needed to measure a model's effectiveness for ambiguous questions (including checking correctness of answers). - In RFT, rewards are provided by GPT-5-mini. What is the computational overhead of this? How does training time compare to simpler alternatives like a locally hosted judge, or simpler format-based rewards (for example looking for words similar to "clarify" in answers to L3 questions)? (A minimal sketch of such a rule-based reward is appended after this review.) - RFT is performed using just 60 training samples. Is there any merit to choosing more data? - Performance on Out-of-Domain data. Are AQuA-trained models generalizable? Evaluation on OOD ambiguity datasets such as VQ-FocusAmbiguity[1], ClearVQA[2], and on hallucination benchmarks like POPE[3], AMBER[4], HaloQuest[5] would strengthen claims. - Strategy Prompting seems effective in generating grounded responses. It would be interesting to see qualitative outputs of the same. I believe the formulation of AQuA with a strict 4-level taxonomy and a single "correct" strategy limits its practical utility (as mentioned in points 2 and 3 in the weaknesses). Furthermore, the strategic accuracy metric measures conformity to the proposed levels rather than measuring the model's ability to answer ambiguous queries. The factual accuracy metric only checks for groundedness and not for correctness of the answer. In light of these issues, I believe AQuA does not provide a practical, reliable way to quantify performance under ambiguity. [1] Chen, C., Tseng, Y., Li, Z., Venkatesh, A., & Gurari, D. (2025). Acknowledging Focus Ambiguity in Visual Questions. [2] Jian, Pu et al. “Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions.” ArXiv abs/2507.13773 (2025): n. pag. [3] Li, Yifan et al. “Evaluating Object Hallucination in Large Vision-Language Models.” Conference on Empirical Methods in Natural Language Processing (2023). [4] Wang, Junyang et al. “An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation.” ArXiv abs/2311.07397 (2023): n. pag. [5] Wang, Zhecan et al. “HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning.” ArXiv abs/2407.15680 (2024): n. pag. Fully human-written
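To make the "simpler format-based rewards" suggestion concrete, here is a minimal sketch of a rule-based strategy reward. The keyword lists, the 0.5 partial credit, and the level-to-strategy mapping are illustrative assumptions on my part, not AQuA's actual judging setup.

```python
def rule_based_strategy_reward(response: str, level: str, reference_answer: str) -> float:
    """Cheap surrogate for an LLM judge: check the response's strategy for a given ambiguity level.

    Returns 1.0 for correct strategy with the reference answer mentioned, 0.5 for
    correct strategy only, and 0.0 otherwise. All keywords/values are illustrative.
    """
    text = response.lower()
    clarify_markers = ("could you clarify", "which one do you mean", "can you specify")
    enumerate_markers = ("either", "it could be", "possible answers include")

    if level in ("L0", "L1"):          # a direct or context-resolved answer is expected
        strategy_ok = not any(m in text for m in clarify_markers)
    elif level == "L2":                # enumerate a few plausible alternatives
        strategy_ok = any(m in text for m in enumerate_markers)
    else:                              # L3: the model should request clarification
        strategy_ok = any(m in text for m in clarify_markers)

    if not strategy_ok:
        return 0.0
    return 1.0 if reference_answer.lower() in text else 0.5
```

Comparing GRPO trained with this kind of reward against the GPT-5-mini judge would show how much of the reported gain actually requires an LLM in the loop.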
AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a new benchmark dataset for training and evaluating Vision–Language Models (VLMs), specifically on how to handle different types and degrees of ambiguity in visual questions. The differentiating idea is to extend the binary abstention decision to a four-level taxonomy of ambiguity. Also, the dataset is labelled with four optimal response strategies: direct answer, context-based inference, enumerating plausible alternatives, or explicit clarification. - The four-level taxonomy of ambiguity is very reasonable and novel. - The dataset is well structured. The label validation is thorough. Overall, solid methodology-wise. - The dataset also aligns well with two-stage training, with SFT and GRPO subsets. - The failure mode analysis is informative. - Like other datasets derived from existing labels and GPT models, there would be potential biases. More discussion on bias mitigation would be helpful. - The four-level taxonomy is better than binary decisions. But it can still be too rigid. More experiments and analysis on ambiguous cases would be helpful. How does the trained model perform on non-COCO datasets? Fully human-written
LaVCa: LLM-assisted Visual Cortex Captioning Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces LaVCa, a novel framework that leverages large language models (LLMs) to generate natural-language captions describing the selectivity of individual fMRI voxels in the human brain cortex. Although the proposed framework appears complex, the only trainable component is a standard voxel-wise ridge regression model, while all other modules remain frozen. Consequently, the methodological novelty primarily lies in the system design rather than in model learning or representation innovation. * The paper proposes a unique and well-motivated pipeline that applies LLMs to the problem of voxel-wise captioning. Overall, the paper is very well written, clearly organized, and easy to follow. * The captions generated by LaVCa quantitatively capture more detailed properties than the existing method. Both inter-voxel and intra-voxel analyses are thorough and supported by quantitative evidence. * The approach relies heavily on the output quality of LLMs. While the pipeline is appealing, the lack of systematic ablation across different LLMs, prompts, or hyper-parameter settings limits the generalizability of the results. * The paper does not investigate how the choice of the external image dataset affects performance. All experiments rely on the OpenImages dataset. * In BrainSCUBA, caption generation relies on the nearest neighbor search operation. LaVCa also uses a similar operation, but with an external dataset. * LaVCa uses LLMs to explain voxel responses in natural language. Several recent works have explored generating natural-language descriptions directly from brain activity—although focusing on decoding rather than encoding, they share similar techniques, namely the alignment between model and brain representations. Despite this fundamental distinction, the current manuscript does not explicitly discuss how LaVCa complements or differs from existing brain-to-text work such as [1–2]. Without such discussion, readers may conflate LaVCa with decoding frameworks. [1] Bridging the Gap between Brain and Machine in Interpreting Visual Semantics: Towards Self-adaptive Brain-to-Text Decoding, ICCV 2025. [2] Mindgpt: Interpreting what you see with non-invasive brain recordings, IEEE TIP 2025. * Could the authors clarify whether the optimal image selection step could induce semantic biases that might affect voxel interpretations? * The manuscript reports comparison results with only one method; it is unclear whether other more recent or advanced methods could be included for comparison. * In Figure A3, how does a single voxel generate multiple captions? Is it based on different activation levels? The horizontal axis title in the upper right should be "caption." Lightly AI-edited
LaVCa: LLM-assisted Visual Cortex Captioning Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper investigates the selectivity of voxels in the human visual cortex by generating a descriptive caption for each voxel. To briefly summarize what the authors do: 1. First, they build voxel-wise encoding models using embeddings from CLIP. 2. They identify the "optimal images" that most strongly activate a specific voxel. 3. Descriptive captions for these images are generated with MiniCPM-V. 4. They use GPT-4o to extract keywords from the captions and compose them into a final, concise summary. The authors quantify the accuracy of their method in two ways: 1. They use Sentence-BERT to create text embeddings for each generated voxel caption and for the captions of all images in the test set (NSD). For a given voxel, they calculate the cosine similarity between its caption's embedding and every test image caption's embedding. This similarity score is treated as the predicted brain activity. They compare this predicted activity vector to the voxel's actual measured activity using Spearman's rank correlation. A higher correlation indicates a more accurate caption. 2. A similar evaluation in which the voxel captions are turned into images using FLUX. The authors find that their method is more accurate than BrainSCUBA. The problem is a very interesting one, and understanding the selectivity distribution across the visual cortex can be of medical and scientific interest. The approach is simple and consists of modular components that could be swapped out with future improvements in model quality. 1. It doesn't really make sense to me why the authors utilize an encoder to rank images in the first step. Each voxel already has ground-truth most-activating images (from the fMRI beta weights). So this step seems unnecessary and would degrade the pipeline by replacing the real-data ranking with a predicted one. 2. The keyword extraction stage (Figure 3d) seems unnecessary. It is not clear that voxels would respond only to entities in an image, as opposed to actions or adjectives. 3. The in-text citation format across the paper is very problematic. For example, in Line 135 the citations should not be in-text; they should be parenthetical. 1. Line 195: do the authors normalize the image embeddings to unit norm prior to constructing the encoder model? Fully human-written
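A minimal sketch of the text-based evaluation summarized in the review above, assuming the caption embeddings (e.g., from Sentence-BERT) and the voxel's measured responses are already available as arrays; the function and variable names are illustrative, not the authors' code.

```python
import numpy as np
from scipy.stats import spearmanr

def caption_accuracy(voxel_caption_emb, test_caption_embs, measured_activity):
    """voxel_caption_emb: (d,) embedding of the generated voxel caption.
    test_caption_embs: (n_images, d) embeddings of the test-image captions.
    measured_activity: (n_images,) measured fMRI responses of the voxel."""
    # Cosine similarity between the voxel caption and every test caption
    # is treated as the predicted activity for each test image.
    a = voxel_caption_emb / np.linalg.norm(voxel_caption_emb)
    b = test_caption_embs / np.linalg.norm(test_caption_embs, axis=1, keepdims=True)
    predicted_activity = b @ a
    # Rank correlation between predicted and measured activity scores the caption.
    rho, _ = spearmanr(predicted_activity, measured_activity)
    return rho
```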
LaVCa: LLM-assisted Visual Cortex Captioning Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. LaVCa introduces a data-driven approach for describing the selectivity of individual human brain voxels in the visual cortex using captions generated by large language models (LLMs). The proposed method trains encoding models for fMRI activity in response to images, isolates the most selective images for each voxel, uses a multimodal LLM for captioning, and finally composes concise keyword-driven voxel captions with an LLM and a keywords-to-sentence model. Compared to prior methods, the proposed method provides richer, more interpretable, and accurate natural-language descriptions of what each voxel encodes, revealing greater diversity and fine-grained functional specialization in the visual cortex. The strengths of this paper lie in its creative use of LLMs to generate natural-language captions that accurately describe voxel-level visual selectivity in the human cortex, surpassing prior methods in both interpretability and diversity of representations. The approach is robust across benchmarks, clearly demonstrates broader and finer-grained conceptual tuning in visual areas, and is built with modularity and reproducibility in mind, thereby enhancing its impact on both neuroscience and neuroAI. - While the captions are more diverse, the method often omits very local, fine-grained details in face-selective or highly specialized voxels, likely a result of the summarization steps and current limitations of captioning models. - Some hyperparameters (e.g., the number of keywords and the image-set size) influence accuracy, and there is limited exploration of more structured or hierarchical compositional strategies for capturing multi-concept or multimodal selectivity. - Benchmark comparisons focus primarily on accuracy and diversity, but lack direct behavioral validation. How these captions relate to actual perceptual or cognitive phenomena in human subjects could be better clarified. The following are some questions/suggestions for the authors: - Can the pipeline be expanded to provide hierarchical or compositional captions, reflecting not just multiple keywords but structured relationships (object-action or scene-context)? - How does LaVCa generalize to multimodal responses, including voxels sensitive to auditory or language stimuli? Are the methods readily adaptable, or are critical modifications needed? - How reproducible are the identified semantic clusters across large populations? Do the same diversity patterns emerge in different datasets or subject groups? - Could direct behavioral validation (e.g., relating captioned selectivity to subject perception, imagery, or recall tasks) link voxel captions to cognition and perceptual experience more strongly? Fully AI-generated
LaVCa: LLM-assisted Visual Cortex Captioning Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a novel LLM-based pipeline for fMRI interpretation at the voxel level. The authors leverage the Natural Scenes Dataset (NSD) and use the learned encoding weights from a VLM on NSD to predict fMRI responses to a novel set of images. They select the image set that produces the highest predicted response for each voxel and use LLM keyword extractors / sentence composers to provide rich and interpretable descriptions for each voxel. The method leverages state-of-the-art AI methods to improve not only prediction but also interpretability of fMRI data. The model generalizes fMRI responses to held-out images to generate rich descriptions of each voxel. The main figures replicate prior results of known neural tuning and may be a source of hypothesis generation for future studies. The paper is well written, with clear visuals. The main weakness is that it is unclear what the advantage of LaVCa is relative to using the direct fMRI responses from NSD. Given the large set of images shown in that fMRI dataset, which were drawn from MS-COCO, it seems possible to find the NSD image (or set of images) that drives the highest response in each voxel and do the same LLM extraction / sentence composition on the captions for those images. (This could be done in a cross-validated / encoding framework if desired.) In most of the paper, these captions are treated as the "ground truth" and it's unclear what the advantage is for the first part of the voxel-preference pipeline. This major limitation significantly tempers claims about prediction accuracy (see below as well) and interpretability. While the paper is well written overall, the notion of prediction accuracy as cosine similarity between the generated sentence or image and the original caption/image is not typical and not always clearly explained. For example, at a quick look Figs. 2 and 4 could be interpreted as the more traditional encoding accuracy score (correlation between predicted and actual fMRI). More generally, this notion of prediction accuracy is extremely complex as it involves many steps of generative AI, removing it from the original data. As a small point, the order of Fig. 2 and Fig. 3 should be swapped. How do the voxel interpretations generated by the keyword extractor / sentence composer directly on the captions (or an even simpler summary statistic of the captions) compare to the interpretations generated by LaVCa? What are the scores of the encoding model? How does the choice of VLM encoding model affect the captions / how can we assess that this backbone of LaVCa is accurate? How can this method be extended to other datasets? What scale / diversity of stimuli is needed? Fully human-written
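The reviewer's first weakness, i.e. ranking a voxel's preferred images directly from its measured responses instead of from an encoder's predictions, amounts to swapping the ranking source. A hypothetical minimal sketch of the two alternatives (a ridge encoding model versus direct beta ranking); the function names and data layout are assumptions, not the paper's code.

```python
import numpy as np
from sklearn.linear_model import Ridge

def top_images_predicted(image_feats, voxel_betas, candidate_feats, k=10):
    """Rank held-out candidate images with a voxel-wise ridge encoding model."""
    encoder = Ridge(alpha=1.0).fit(image_feats, voxel_betas)  # image features -> voxel response
    predicted = encoder.predict(candidate_feats)
    return np.argsort(predicted)[::-1][:k]

def top_images_measured(voxel_betas, k=10):
    """Rank the images the voxel actually saw by its measured beta weights (the reviewer's suggestion)."""
    return np.argsort(voxel_betas)[::-1][:k]
```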
Salient Object Ranking via Cyclical Perception-Viewing Interaction Modeling Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a method for salient object ranking. A story prediction module predicts the caption of the image and a guided ranking module predicts the saliency rankings. The cyclical interaction module aligns and refines the caption and the ranking iteratively. The experimental results seem to show that the proposed method outperforms the previous SOTA. - The cyclical interaction uses the caption to guide salient object ranking. - Ablations show the effectiveness of SITA and CMQC in the proposed method. - The segmentation head is unclear. The performance increase could potentially be due to using a strong pretrained segmentation model. - The retrained QAGNet has lower scores across metrics compared to the ones reported in the original paper. This is critical since the results of the proposed method do not outperform the reported results of QAGNet. - Is the segmentation head a pretrained segmentation model or something else? A strong segmentation model could favor the MAE. - What is the impact of the number of object queries on the results? An ablation study would be beneficial to see the impact. - Would a stronger text decoder lead to better performance? - What is the reason for the decreased performance of the retrained QAGNet? Did the authors use different training details, different evaluation settings, or something else? - Intuitively, the proposed method could also improve performance on the image captioning task. I am wondering if salient object ranking could help with image captioning. It would be interesting to see results compared with SOTA image captioning methods. Fully human-written
Salient Object Ranking via Cyclical Perception-Viewing Interaction Modeling Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose a Salient Object Ranking (SOR) approach that consists of two modules: the Guided Ranking (GR) module and the Story Prediction (SP) module, whose interaction enhances the overall performance of SOR. The design of this model aligns well with theories of the human cognitive system, such as predictive coding, and the English writing is good and clear. Some experiments and details are not clearly explained. For example, in the experimental section, how was the choice of 24,000 epochs determined, and why such a large number? Could this lead to overfitting? In addition, it would be helpful to qualitatively present the interaction between object queries and text features, as well as the results under different values of K. What training data are used for the segmentation head? Was it pre-trained on the COCO segmentation dataset? If it was trained only on the SOR dataset, would its segmentation generalization ability be affected? Does the random selection of captions influence the results? It is recommended to include a discussion—for example, are the salient objects in the image always located in the main subject position described in the caption? Fully human-written
Salient Object Ranking via Cyclical Perception-Viewing Interaction Modeling Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a novel framework that models the cyclical interaction between perception and viewing for the Salient Object Ranking (SOR) task. The method introduces two key components: a Story Prediction (SP) module that simulates the human perceptual process through image caption generation, and a Guided Ranking (GR) module that predicts saliency rankings under the guidance of the SP module. (1) Novel Cognitive-Inspired Framework. The paper introduces a cyclical perception–viewing model inspired by human visual cognition, which is strongly supported by established cognitive and psychological theories. The introduction is also easy to follow. (2) Extensive experiments. The paper conducts both qualitative and quantitative experiments, and also provides an analysis of inference time. Moreover, the visualized experimental results clearly and intuitively demonstrate the improvements achieved by the proposed method. (1) The paper lacks a clear comparison with the recent top-down method, Language-Guided Salient Object Ranking (CVPR 2025), and its performance remains inferior to the results reported in that study. (2) In the Method Overview section, the symbols used in the equations do not correspond to those shown in Figure 2, which makes it confusing to understand the inputs and outputs of each module. (3) The experimental section mainly provides data and setup details but offers limited analysis or discussion to explain the observed results. (1) Explain the differences between the proposed method and Language-Guided Salient Object Ranking (CVPR 2025). Moreover, the performance of this method is still inferior to that of the existing work. (2) In Eq. (1), when $l=1$, what does $Q_{l-1}$ refer to? (3) In Table 2, for Setting II ("independent caption generation"), there is a performance improvement even without interaction between caption and visual features, which is confusing. Could the authors clarify this behavior? (4) How is the ground-truth (GT) caption obtained? Fully human-written
RECTOR: Masked Region-Channel-Temporal Modeling for Cognitive Representation Learning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces RECTOR, a self-supervised framework for EEG/sEEG data that integrates region, channel, and temporal representation learning through a novel hierarchical self-attention mechanism (RECTOR-SA). The model incorporates anatomical priors and functional attention to capture complex spatio-temporal interactions in neural data. Evaluated across several EEG datasets (SEED, SEED-IV, DEAP, MSIT, ECR), RECTOR demonstrates state-of-the-art performance in emotion recognition and task engagement classification. The paper claims that the model not only improves computational efficiency but also provides strong interpretability through attention visualizations, paving the way for its application in neurocognitive diagnostics and personalized interventions. The proposed RECTOR framework introduces a novel approach by combining self-supervised learning with anatomical priors and dynamic functional attention for EEG/sEEG data, a method previously explored in fMRI but applied to EEG data for the first time. The manuscript is well-written and clearly structured. The experiments are thorough and effectively address the research question, providing solid evidence for the physiological plausibility of the learned representations. The results are presented clearly, with supporting analyses that validate the model's performance. The paper's evaluation is limited to the SEED and DEAP datasets, and it would benefit from validation on a broader range of downstream tasks to better assess RECTOR's generalizability. The use of only F1-score as the evaluation metric is restrictive; incorporating other standard metrics such as Cohen’s Kappa, weighted F1, and additional classification metrics would provide a more comprehensive performance analysis. The ablation studies, while useful, lack a detailed examination of the self-attention mechanism, and the gating mechanism within RECTOR-SA is not addressed. Furthermore, the paper provides visualizations of learned representations but lacks a deeper interpretability analysis, especially in terms of attention maps, spatial EEG features, and feature attribution. Lastly, the pretraining details, including hyperparameters, pretraining dataset, training time, are unclear, which raises concerns about reproducibility and scalability. 1. Downstream Evaluation: Given the model’s promising performance on SEED and DEAP, can RECTOR be extended to other EEG/sEEG tasks with different cognitive states or sensor modalities? Evaluating RECTOR on a wider variety of tasks would clarify how well the model generalizes across different applications, such as emotion recognition, cognitive task engagement, or even clinical diagnostics for neurological disorders. 2. Evaluation Metrics: The paper primarily uses F1-score, which is valuable but limited. Would the authors consider evaluating RECTOR using other metrics like Cohen’s Kappa (which accounts for agreement between class predictions) and weighted F1 (to account for class imbalance)? Comparing RECTOR with task-specific models rather than foundation models (such as those designed specifically for EEG emotion recognition) would provide more meaningful insights into its performance. 
3. Ablation Study on Attention and Gating: While ablation studies are provided, could the authors conduct more detailed experiments focusing specifically on the RECTOR-SA attention mechanism? How does each component (e.g., region-based vs. global attention) contribute to the model's overall performance? Additionally, the gating mechanism within RECTOR-SA is mentioned but not ablated—what role does it play in the model, and how does it impact performance across tasks? 4. Interpretability and Visualization: The paper includes some visualizations of learned representations, but a deeper analysis of the model's internal behavior is lacking. Could the authors include attention maps, coherence heatmaps, or feature attribution to show how RECTOR attends to relevant spatial and temporal patterns in EEG? This would help validate whether the spatial awareness captured by the model is truly driving the observed performance improvements. 5. Pretraining Process: The paper does not provide sufficient details on the pretraining process, such as hyperparameters, optimization strategy, or batch size. Understanding these details would help in replicating the results and assessing the model's scalability to other datasets. Could the authors clarify the pretraining procedure and explain how these choices impact the model's final performance? 6. The LLM usage disclosure is missing. 7. No code is released. Fully AI-generated
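A minimal sketch of the additional evaluation metrics suggested in the review above (Cohen's kappa and weighted F1), assuming scikit-learn and arrays of true and predicted labels; this is a generic illustration, not code from the paper.

```python
from sklearn.metrics import cohen_kappa_score, f1_score

def extra_metrics(y_true, y_pred):
    return {
        "cohen_kappa": cohen_kappa_score(y_true, y_pred),             # chance-corrected agreement
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),  # accounts for class imbalance
        "macro_f1": f1_score(y_true, y_pred, average="macro"),        # treats all classes equally
    }
```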
RECTOR: Masked Region-Channel-Temporal Modeling for Cognitive Representation Learning Soundness: 3: good Presentation: 1: poor Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes RECTOR, a self-supervised framework for EEG and sEEG cognitive representation learning that explicitly models region–channel–temporal interactions. The key contributions include: (1) RECTOR-SA, a hierarchical sparse attention mechanism incorporating anatomical priors and dynamic gating; (2) RECTOR-Mask, a structured multi-view masking strategy that creates region- and time-aware masked modeling targets; (3) NC²-MM, a unified learning objective that combines masked modeling and contrastive learning within one architecture; and (4) RCReg, a specialized regularization for improving region–channel token representations. The model achieves state-of-the-art results across EEG emotion recognition and sEEG task-engagement classification benchmarks, with supporting ablations and interpretability analyses. 1. Ambitious attempt to integrate spatial priors and self-supervision in neural signal modeling. 2. Structured masking and hierarchical attention are intuitively motivated. 3. Experimental results are comprehensive, covering multiple datasets, protocols, and baselines, including ablations that validate each core component of the architecture. 4. The method provides neuroscientifically interpretable results at both region and channel levels, demonstrating alignment with known physiological patterns. 1.The figures in the manuscript need improvement, especially Figures 2 and 3. Figure 4 is significantly clearer in comparison. 2.The method’s novelty appears incremental rather than fundamental. Most components (structured masking, region tokens, gated attention, variance/covariance regularization) are adaptations of well-known techniques with domain-specific adjustments rather than a distinctly new contribution. 3.The writing is dense, and the paper tends to overstate its contributions relative to the demonstrated novelty. 4.The anatomical prior design is under-justified. Region partitioning is treated as fixed and universally correct, but inter-subject anatomical variability is substantial in EEG/sEEG. The paper does not assess the robustness or validity of this assumption. 5.Pretraining only on each target dataset weakens the claim of general-purpose self-supervised learning. 6.Critical methodological details are placed in the appendices and should be included in the main paper to ensure clarity and reproducibility. 7.Considering the complexity of the proposed method, the absence of released code makes reproducibility difficult. 1.Could Figures 2 and 3 be redesigned with improved layout and clearer color schemes to enhance readability? 2.What is the key novel contribution beyond combining existing components such as structured masking and hierarchical attention? 3.How do you justify the strength of your claims relative to the demonstrated novelty? 4.How robust is the anatomical prior (fixed region partitioning) to inter-subject variability in EEG/sEEG? 5.How does pretraining only on each target dataset support claims of general SSL generalization? 
6.Can you move critical methodological details from the appendix into the main text to improve clarity and reproducibility? 7.How do you conduct the leave-one-subject-out (LOSO) evaluation? Is there a hold-out validation set used to determine the number of training epochs and hyperparameters? 8.Will the code and pretrained models be released to ensure reproducibility given the complexity of the method? Heavily AI-edited
RECTOR: Masked Region-Channel-Temporal Modeling for Cognitive Representation Learning Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces a complete and exhaustive self-supervised deep learning framework for EEG and sEEG data, accounting for the various aspects and dynamics of these brain activity modalities (regions, channels, temporal). It introduces multiple modules, in particular RECTOR-SA and RECTOR-Mask, for feature extraction at different scales and combines masked modelling, contrastive learning, and variance–covariance regularisation as training objectives. The methodology is benchmarked against multiple models and training schemes (supervised-only, self-supervised) and outperforms many models on two main tasks: EEG emotion recognition and sEEG cognitive state classification. The paper is generally well-written. The figures are complete and very descriptive (even though some of the legends could benefit from more thorough descriptions; see below for more details). The methodology, although quite exhaustive, addresses one of the blind spots of many EEG (and, in some sense, even fMRI) studies, namely taking into account the spatial (and regional) and temporal dynamics of the EEG signal. In particular, accounting for region-specific features rather than aggregating spatial information is a nice contribution. The evaluation framework is comprehensive and well thought out, with comparisons against many training frameworks (supervised, self-supervised models) and multiple datasets. The ablation studies are also welcome considering the many modules introduced by the paper. The colour coding in the tables makes the results very easily readable. One important issue with the current state of the submission is the general arrangement of information within the paper. 1. It is (very) difficult to understand at first what the training objective of the model is (what is going to be predicted), what problem the model aims to solve, and how it aims to do it. All this could be made clearer from the beginning of Section 2 (Methodology). 2. The paper tends to over-complicate some of the notation (Figure 2) and uses wordy terminology, e.g. "sparse region-channel-temporal self-attention embedded with anatomical priors and dynamic functional attention", which can somewhat obscure the interesting concepts introduced by the method. 3. Most importantly, it seems that most of the main modules are not fully described within the main text of the paper, but are detailed in the appendix. Many points in the methods refer to the appendix, which makes it impossible to understand the method from the main text alone. Another remark would be that the paper is extremely dense, some might say too dense, to a point where it is difficult to apprehend the entirety of the method with only the main text of the submission. This kind of paper would probably benefit from simpler iterations, to appreciate the real value of every added module. This is also considering that some concepts are only introduced in the appendix. This amount of content can be seen as detrimental to the overall appreciation of the paper. Therefore, I would recommend streamlining the paper and removing unnecessary content for this publication.
In particular, I would recommend switching some of the paragraphs and reorganising/rewriting the methodology section so that it provides: full descriptions of the RECTOR modules (in particular RECTOR-SA) in the main text rather than the appendix (Section 2.2 is too high-level and Figure 2 is not explained enough to be stand-alone); a clear description of the pipeline (it is difficult to understand how the modules interact with each other); and an explanation of concepts such as the brain partitioning, which is in Appendix E but is one of the key points of the paper. Instead, the "Complexity" paragraph could be moved to the appendix. In its current shape, the methodology is difficult to understand. Figure 3 would also benefit from a more descriptive legend. - It is not clear from the main text how RECTOR is fine-tuned on the downstream tasks. Other remarks were listed in the previous section. Fully human-written
RECTOR: Masked Region-Channel-Temporal Modeling for Cognitive Representation Learning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper *RECTOR: Masked Region–Channel–Temporal Modeling for Cognitive Representation Learning* proposes RECTOR, a self-supervised learning framework for EEG and sEEG representation learning that jointly models region-, channel-, and temporal-level dependencies. Its core contributions include a novel **RECTOR-SA** hierarchical self-attention mechanism integrating anatomical priors for efficient region-channel-temporal modeling, a **RECTOR-Mask** structured multi-view masking strategy for more challenging pretext tasks, and **NC2-MM**, a combined non-contrastive × contrastive learning objective. Additionally, **RCReg** regularizes region-channel tokens to enhance feature disentanglement. The model achieves state-of-the-art results on EEG emotion recognition and sEEG task-engagement classification while claiming higher computational efficiency and interpretability compared to prior works. However, despite strong empirical results, the paper largely repackages existing ideas—masked modeling, contrastive loss fusion, and anatomical priors—into a composite architecture. The novelty is incremental and primarily architectural, lacking theoretical rigor or strong neuroscientific grounding. The extensive ablations and comparisons suggest solid engineering, but the work leans more toward technical aggregation than conceptual innovation. 1. Comprehensive ablations: Evaluates the effect of masking ratios, loss weights, and feature hierarchies. 2. Multi-dataset evaluation: Demonstrates generalization across EEG (emotion) and sEEG (task) domains. 3. Integrated pipeline: Combines anatomical priors with deep self-supervised frameworks, improving biological plausibility relative to generic transformers. 4. Engineering soundness: Implementation is well-optimized and includes clear reproducibility details and comparison tables. 1. Limited novelty: The key ideas—masked modeling, hierarchical attention, hybrid contrastive objectives—are incremental reuses of prior designs (MAE, BYOL, DINO, MoCo-v3, etc.) rather than a fundamentally new direction. 2. Weak theoretical grounding: No analysis of why combining non-contrastive and contrastive terms yields better cognitive representations. 3. Poor interpretability: Despite claiming cognitive alignment, there is little neuroscientific analysis (e.g., brain-region relevance or neurobiological validation). 4. Superficial discussion: Results are over-interpreted as “state-of-the-art” without effect size reporting or significance testing. 5. Unclear scalability: It is uncertain whether RECTOR can scale to large, multi-site EEG datasets or handle real-world noise. 6. Dataset limitations: The training and evaluation datasets are small (dozens to hundreds of subjects), which limits generalization claims. 7. Overclaiming contributions: The claim of being “the first unified region–channel–temporal framework” ignores earlier hierarchical EEG models (e.g., EEG-GraphMAE, ST-MAE, Brain-MAE). 1. How does RECTOR-SA differ fundamentally from existing spatio-temporal attention modules used in EEG-GraphMAE or ST-MAE? 2. 
What motivates the NC²-MM hybrid loss—can you show a theoretical analysis of how it prevents representation collapse? 3. How are anatomical priors encoded? Are these static adjacency matrices or learned embeddings, and how sensitive is performance to parcellation choice? 4. Can you report per-subject and per-session variance or confidence intervals for downstream metrics? 5. How does RECTOR perform under noisy or low-density EEG setups—does the hierarchical attention degrade gracefully? 6. Have you compared against self-distillation approaches (e.g., DINO-style EEG pretraining) to isolate the benefit of your masking strategy? 7. How interpretable are the learned representations—do any attention maps align with known cortical functional networks? Fully AI-generated
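For context on the collapse question above: variance–covariance regularization in the VICReg style keeps per-dimension variance above a margin and penalizes cross-dimension covariance, which is one standard way such hybrid objectives avoid representational collapse. A generic PyTorch sketch of that idea (not RECTOR's RCReg or NC²-MM), assuming a batch of embeddings:

```python
import torch

def variance_covariance_penalty(z, gamma=1.0, eps=1e-4):
    """z: (batch, dim) embeddings; returns a VICReg-style anti-collapse penalty."""
    z = z - z.mean(dim=0)
    # Variance term: hinge pushing each embedding dimension's std above gamma.
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = torch.relu(gamma - std).mean()
    # Covariance term: penalize off-diagonal entries of the covariance matrix.
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d
    return var_loss + cov_loss
```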
Constructing a 3D Scene from a Single Image Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes an image-to-3D method that leverages a 3D generative model, Trellis, as a backbone to generate 3D scenes. It starts with depth estimation to obtain the sparse structure of the scene; a structured latent is then generated region by region using masked rectified flow and fed into a decoder to produce the 3D scene. The overall pipeline is training-free. This paper provides a sound pipeline that leverages the 3D generation model Trellis to generate 3D scenes. 1. It adapts a 2D diffusion method to 3D generation and develops a region-by-region structured latent generation method. 2. It presents a masked rectified flow method to retain the latent features at known voxels. 3. The experimental results verify the advantage of the proposed method. 1. The proposed method relies on the top-down view of a 3D scene as a condition for the 3D scene generation, so the generated ground plane is generally flat. It might be difficult to handle uneven terrain. 2. The experimental results contain 4 scenes, which is not enough to verify the stability of the proposed pipeline. In addition, how is the method influenced by different depth estimation methods? I would like to see how this pipeline works with the SOTA depth estimation methods. In the step of structured latent generation, how are the latent features of active voxels obtained? Since the proposed method depends on Trellis, after obtaining the sparse structure, why not directly leverage Trellis to generate the structured latent? Fully human-written
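A rough sketch of the RePaint-style idea behind the masked rectified-flow completion described above: at each integration step the known voxels' latents are re-imposed at the current noise level, so only the unknown region is actually generated. This is a generic illustration with a placeholder velocity model and the x_t = (1 - t) * data + t * noise convention; it is not the paper's implementation.

```python
import torch

def masked_rectified_flow_completion(velocity_model, x_known, known_mask, steps=50):
    """x_known: latent values at known voxels; known_mask: 1 where known, 0 where to generate."""
    x = torch.randn_like(x_known)            # start from pure noise at t = 1
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        # Re-impose the known content at the current noise level of the interpolation path.
        noised_known = (1.0 - t) * x_known + t * torch.randn_like(x_known)
        x = known_mask * noised_known + (1.0 - known_mask) * x
        # Euler step toward t = 0 using the learned velocity field (v = noise - data here).
        v = velocity_model(x, t)
        x = x - dt * v
    # Keep the original content at known voxels in the final latent.
    return known_mask * x_known + (1.0 - known_mask) * x
```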
Constructing a 3D Scene from a Single Image Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents a training-free method for creating 3D scenes from single images by fusing region-based generations. The key problem identified is maintaining structural consistency across partial generations during generation and completion. This is resolved by applying a sequence of techniques derived from various fields, including landmark bounding boxes (Florence2), instance masks (SAM2), point cloud-based mesh alignment (ICP), and training-free inpainting via masked rectified flow (RePaint, adapted to 3D diffusion). The method is mainly compared with training-based and training-free 3D generation methods using human/GPT-4o preference scores and rendered-view metrics. - The proposed method effectively modifies and aggregates existing solutions to different subproblems into a single pipeline to solve the general problem of generating a 3D scene. - The proposed method shows higher qualitative and quantitative performance than previous training-free and model-based generation results, as elaborated in multiple tables in the manuscript. - The ablation study shows that the paper's two original components, i.e., region-based generation and landmark conditioning, do help the generation process. Overall, I believe the paper is nicely written, with a sufficient amount of evaluation for the target task, and therefore is above the acceptance threshold. Although I believe the current version of the manuscript is above the acceptance threshold, there are some limitations that prevent me from recommending it for a higher honor (e.g., Highlight/Oral). 1. First of all, since the paper utilizes various methods that have been proposed beforehand, some modified and some unmodified, it would be much better to have a summary table (somewhere in the appendices) that shows which part of the pipeline is handled by which method, thereby suggesting modular upgrades for an overall performance boost. This would also help the reader clarify which part of the algorithm is novel in this paper. 2. As far as I read, the main problem at hand is to gain consistency while allowing partial generations, and the immediate desired consequence is the infilling of holes within the final generated output. This geometric quality could be compared either quantitatively (perhaps by calculating the size/number of holes created around the reference shot) or qualitatively (if quantitative analysis is hard, by visually highlighting the holes created by the different types of methods within the same camera view). 3. The generation time may vary with the size of the region being generated. It would be better to report the generation time of the different methods, as well as for the ablation studies. 4. Since the method is promoted as training-free, we may also consider 3D generator-agnostic behavior. The paper would be more complete if the authors performed an ablation study on the core generation model of the method. Here are some minor points that I did not count in my scoring. 1.
In Table 2, the quantitative comparison can be improved by adding reference consistency scores measured by LPIPS (PSNR and SSIM can also be considered), i.e., rendering the scene with the exact same camera parameters as the reference image and then comparing the two. 2. Likewise, it would be better to have a figure that compares renderings from the reference camera view across the various methods for a consistent comparison. Fully human-written
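A minimal sketch of the reference-view consistency check suggested in point 1 above, assuming the scene is re-rendered with the reference camera parameters and compared against the input image; scikit-image is used here for PSNR/SSIM, and the lpips package would be used analogously for LPIPS.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def reference_consistency(reference_img, rendered_img):
    """Both inputs: (H, W, 3) uint8 images captured/rendered from the same camera pose."""
    psnr = peak_signal_noise_ratio(reference_img, rendered_img)
    ssim = structural_similarity(reference_img, rendered_img, channel_axis=-1)
    return {"psnr": psnr, "ssim": ssim}
```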
Constructing a 3D Scene from a Single Image Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes SceneFuse-3D, a training-free, modular pipeline for turning a single top-down image into a coherent, editable 3D scene. The core contributions are: (1) region-based generation that splits a scene-level structured latent into overlapping patches; (2) a spatial prior built from monocular depth and landmark instances; (3) masked rectified-flow completion that regenerates only unknown latent parts while preserving already-consistent content. The final scene is decoded with pretrained object decoders. The authors constructed a test dataset comprising 100 synthesized top-down images spanning diverse styles. The experimental results demonstrate that SceneFuse-3D outperforms existing object generation models. * SceneFuse-3D employs a training-free approach, which utilizes existing models to accomplish the scene generation task without requiring fine-tuning of the base models. * Using existing foundation models (e.g., depth estimation, Florence2, and SAM2) to provide spatial priors and effectively stabilize global layout and cross-region consistency. * The paper is well structured in general. * The method appears to rely heavily on external priors (e.g., monocular depth, Florence2, SAM2, ICP), which may propagate errors throughout the pipeline. * Some of the generated scenes appear to contain holes (e.g., in Figure 1 and the supplementary materials). * The proposed method seems to support only top-down views from specific angles as input images. 1. How robust is the proposed pipeline to errors in depth estimation and landmark detection? For instance, during the initialization phase, the base model might fail to accurately detect all landmarks. 2. The authors attribute the holes in the generated scenes to occlusion; however, in the examples shown, the hole regions do not appear to be occluded. I remain curious whether these holes may instead result from the object generation backbone itself, as the regions near object voxel boundaries are often empty. 3. In the Masked Rectified Flow for Completion stage, what is the conditioning input? Does it make use of image patches from other regions? Although I am now slightly negative, I am open to being persuaded based on the feedback from the authors. Lightly AI-edited
Constructing a 3D Scene from a Single Image Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper tackles the problem of scene-level 3D generation from (bird's-eye-view) images. The motivation and contribution are clear. The proposed solution is intuitive and effective. Using the point cloud from monocular depth estimation as guidance helps improve scene-level consistency. Regional alignment and extrapolation/inpainting are simple and effective. 1) Since the top-down images are generated, the whole data pipeline is 'text-to-image -> image-to-3D', which looks similar to SynCity's 'text-to-3D' pipeline. The contribution of this paper could be further strengthened by providing real-world experiments and assessments. 2) A runtime analysis should be provided. This pipeline takes longer than directly generating the scene-scale mesh in one run, so it is critical to report runtime (though it depends on scene complexity, which makes the analysis even more critical). 1. The monocular depth estimation results are expressed in the camera coordinate system. However, the camera often has a pitch angle, i.e., the camera's optical axis is not parallel to the gravity direction. This means that the projected point clouds are tilted. When we generate scene-level 3D meshes, we would like the up-axis to be aligned with gravity; how is this issue resolved? 2. Recommended citations: VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space (inpainting with Trellis) Frankenstein: Generating semantic-compositional 3d scenes in one tri-plane (scene generation) NuiScene: Exploring efficient generation of unbounded outdoor scenes (scene generation) Fully human-written
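On question 1 above: if the camera pitch relative to gravity is known or estimated, the back-projected point cloud can simply be rotated so that the scene's up-axis aligns with the gravity direction. A small illustrative sketch, assuming points in camera coordinates and a pure pitch rotation about the camera's x-axis (the axis convention and function name are assumptions, not part of the paper).

```python
import numpy as np

def align_up_axis(points_cam, pitch_rad):
    """points_cam: (N, 3) points in camera coordinates; pitch_rad: estimated camera pitch.

    Rotates the cloud about the camera x-axis by -pitch so that the gravity direction
    maps onto the mesh generator's up-axis convention."""
    c, s = np.cos(-pitch_rad), np.sin(-pitch_rad)
    rot_x = np.array([[1.0, 0.0, 0.0],
                      [0.0,   c,  -s],
                      [0.0,   s,   c]])
    return points_cam @ rot_x.T
```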
LeGIT: LLM Guided Intervention Targeting for Online Causal Discovery Soundness: 1: poor Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The work explores the usage of LLMs for experimental design. The authors propose a framework and prompts for acquiring information about root causes in the network from LLMs. This information guides further experimentation. The approach is compared to several baselines on four real-world-inspired problems using the ENCO algorithm to obtain structure posteriors. 1. The paper touches on the important problem of efficient experimental design. 2. The paper is easy to read. 1. The method description is unclear. The "Re-sampling stage" paragraph states that each LLM suggestion undergoes filtering (must "survive" two independent votes). Meanwhile, in Algorithm 2: * In line 4, we intervene on all variables from the warmup list. * In line 8, we intervene on all variables from the bootstrap-warmup list. * In line 12, we intervene on all variables from both datasets once again. Please clarify this discrepancy between the descriptions. Also, in the bootstrap stage, how is the intermediate causal discovery result leveraged? 2. The proposed method relies on LLMs identifying root causes. What is the theoretical motivation for including such a suggestion in the prompts? In general, selecting optimal experimental designs is a more complex task [Eberhardt]. 3. The observed effectiveness of the method may relate to how the ENCO method works and may fail to generalize. Note that ENCO has no convergence guarantees when the number of interventions is less than d-1; the ENCO framework only guarantees recovery of the correct graph when interventions are conducted on at least d-1 nodes. It would be beneficial for the applicability (and possibly performance) to use a method with stronger theoretical guarantees under partial intervention sets, for example, DCDI or DIBS. 4. The approach relies on the ability of LLMs to reason about the variables. This ability can be impaired in at least two cases: insufficient information about the variables (when the descriptions are missing or non-informative) and a lack of background knowledge. A discussion of these two limitations is missing from the paper. When applying this method to a new setting, how can I make sure both requirements are fulfilled so that I can trust the effectiveness of the method? The work also lacks discussion of the effectiveness of the proposed prompt and the consistency of LLMs' responses. 5. There are no experiments or analyses that evaluate the scalability and cost efficiency of the proposed method to support the claim "LLMs offer a scalable and cost-efficient approach to enhance experimental design" (line 485). Moreover, since LeGIT requires using a numerical intervention targeting method, the cost efficiency and scalability seem to be similar to existing methods. **References** [Eberhardt] Eberhardt, Almost Optimal Intervention Sets for Causal Discovery. [DCDI] Brouillard et al., Differentiable Causal Discovery from Interventional Data. [DIBS] Lorch et al., DiBS: Differentiable Bayesian Structure Learning. 1. What is the prompt for the bootstrapped-warmup intervention set acquisition? 2. Why are the results on statistical significance in Section 5.2 partial?
How do the test results look for the other considered graphs? 3. Line 468: "rapid interventions are required" - Can the authors provide an example of a real-world problem where rapid experimentation is needed and is costly? 4. Line 93 typo: intervening -> intervention Fully human-written
LeGIT: LLM Guided Intervention Targeting for Online Causal Discovery Soundness: 1: poor Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces LeGIT, an intervention-targeting method for experimental design in online causal discovery. LeGIT leverages the meta-knowledge embedded in large language models (LLMs) to identify which nodes should be selected for intervention. The LLM-enhanced framework can be integrated with any causal discovery algorithm that utilizes interventional data. In this setup, the LLM receives a prompt summarizing the metadata of the graph’s vertices (e.g., variable descriptions) and responds with a set of proposed interventional nodes based on its internal and contextual knowledge. The framework further incorporates bootstrap and resampling stages to enhance performance and is compared against alternative intervention-selection algorithms that do not exploit such meta-knowledge. The topic addressed in the paper is highly engaging, as it pushes the boundaries of intervention targeting in causal discovery by integrating external meta-knowledge from large language models (LLMs). Exploring the applicability of LLMs to this problem offers an insightful avenue for assessing their reasoning abilities and the extent of their embedded common and expert knowledge. The idea of treating an LLM as an artificial expert in the absence of a human counterpart is particularly compelling, making the comparison between LLM-selected and human-selected intervention targets especially valuable. I particularly appreciate the experiments presented in Figure 2, which provide intuition on how the set of intervention vertices may vary across methods, and Table 2, which—especially when compared to Table 1—illustrates how LeGIT’s performance is influenced by changes in the data budget. However, I would strongly encourage the inclusion of additional experiments analyzing how the proposed enhancements of LeGIT affect overall performance (see the discussion on soundness in the weaknesses section). **Notes on Soundness:** My most serious concern is that the comparison between the intervention selection algorithms and the LLM-enhanced algorithm is not entirely fair. This is because the LLM has access to additional metadata related to the studied graph, leveraging knowledge that the other algorithms were never designed to use (except for the input provided by human experts). In other words, the assumptions regarding the input data for the causal discovery problem have been extended. It is therefore quite expected that incorporating this extra knowledge—on top of the standard framework—would lead to better results. This is not to say that the impact of such an LLM-based enhancement is uninteresting to study; however, it is important to recognize that LeGIT addresses a different problem than, for example, GIT, AIT, or CBES. This distinction, however, is not reflected in the narrative presented by the authors. I would suggest that the authors focus on comparing their updated Online Causal Discovery framework with other LLM-guided causal discovery algorithms. For instance, the works of Jiralerspong et al. and Khatibi et al. 
rely on similar access to metadata from LLMs to uncover causal dependencies, and they use the same or similar graphs and metrics (other comparable algorithms are also summarized in Wan et al.). The experiments with GIT, AIT, and CBES serve more as a teaser or example of LeGIT's potential, rather than as a fair comparison between methods that have access to the same data. The authors could therefore also adjust their narrative to focus on understanding the impact of incorporating this additional knowledge. For example, which nodes are most affected? (Figure 2 is a nice example, although it might be more informative to compare against a simpler case where the ideal distribution can be directly derived. In the graphs from Figure 2, it is difficult to assess which intervention set is correct, or whether the interventions occur on edges that can be reversed within a MEC class—and hence whether the intervention is actually required to achieve identifiability.) Similarly, how do the different enhancements (bootstrapping, resampling) affect convergence or final performance? Does including information about the current graph in the prompt improve or deteriorate performance? How does performance change with the number of nodes proposed by the LLM—and can the LLM propose a ranking of nodes? These are just some examples of how the authors could build a stronger narrative around the ablation studies and deepen the understanding of their framework while staying within the same online causal discovery framework. **Notes on Experimental Setup:** Throughout the paper, the authors use only four graphs of more or less similar size. This is, of course, a reasonable starting point, but I wonder why these specific graphs were selected. More importantly, I see no clear restriction preventing the extension of the evaluation to other networks (e.g., additional graphs from https://www.bnlearn.com/bnrepository/). Moreover, based on the description of LeGIT, it appears that the method can be used with any algorithm $A$ that satisfies the conditions for online causal discovery (i.e., potentially not even gradient-based). Did the authors attempt to use different base causal discovery algorithms besides ENCO? Even if not as a main experiment, such a demonstration could nicely illustrate the universality of the proposed approach. **Notes on Contribution:** I find the topic quite interesting, as it explores the integration of LLMs' common meta-knowledge—akin to the bias present in human experts—into causal discovery. However, this perspective has also been explored in prior work (e.g., Jiralerspong et al., Khatibi et al., Wan et al.). While focusing on scoring the intervention potential of each node appears novel, the experimental design (see "Soundness" above) does not give me enough confidence to fully assess the strength of the contribution. **Notes on Presentation (Minor):** The paper is generally well written, but I do think that some of the sections could use more attention. For instance, the Preliminaries do not explain all the notation in Algorithm 1 (e.g., what is $P(G)$?), and there is a discrepancy between $\phi$ in Algorithm 1 (in "Output") and in the text (line 133). It is not explained what those parameters are and how $A$ updates them (in practice we also have functional parameters that are updated by the algorithm $A$ and can thus influence the structural ones - see Lippe et al.). In Algorithm 2, line 8, should it not be $D_{bootstrapped}[i-T_{warmup}]$? **References:** Jiralerspong, Thomas, et al.
"Efficient causal graph discovery using large language models." arXiv preprint arXiv:2402.01207 (2024). Khatibi, Elahe, et al. "Alcm: Autonomous llm-augmented causal discovery framework." arXiv preprint arXiv:2405.01744 (2024). Wan, Guangya, et al. "Large Language Models for Causal Discovery: Current Landscape and Future Directions." arXiv preprint arXiv:2402.11068 (2024). Lippe, Phillip, Taco Cohen, and Efstratios Gavves. "Efficient neural causal discovery without acyclicity constraints." arXiv preprint arXiv:2107.10483 (2021). The current algorithm primarily relies on meta-knowledge derived from the descriptions of the variables. I was wondering whether the authors also explored incorporating contextual information from the graph itself (e.g., details about the MEC class) or integrating principles from do-calculus or v-structure verification. Could such approaches work even in cases where no node descriptions are provided? In other words, could the LLM base its reasoning solely on causal discovery knowledge and the graph structure to refine the current solution and subsequently select the intervention nodes? In that way LeGiT could be used even on fully artificial graphs, or models when no previous bias/knowledge is available about the nodes. In Figure 4, it appears—especially for the Child and Fisram graphs—that LeGIT does not improve during the early acquisition steps and then sharply drops. Could this breaking point correspond to when the numerical scoring algorithms start to take effect? In Algorithm 2, the LeGIT intervention targets are selected at the beginning of training. Did the authors examine whether incorporating them as the final interventions (or interleaving them with the numerically scored ones) provides any benefit? I do generally believe it could be an interesting paper, but currently it suffers from its positioning (i.e. the algorithm works well, but it is either not fairly comparable to other approaches - beacuse of the additional knowledge - or other LLM-guided causal discovery approaches could be adapt to work within the online causal discovery framework to allow for a comparision with meta-data). Fully human-written
LeGIT: LLM Guided Intervention Targeting for Online Causal Discovery Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose a novel hybrid framework for causal structure learning that combines the vast prior world knowledge of LLMs with existing methods (referred to as numerical methods). The method consists of four stages: (i) a Warmup Stage and (ii) a Bootstrapping Stage, where LLMs are employed, followed by (iii) a Re-Sampling Stage and (iv) a Numerical-Method Stage, where the search is continued with "traditional" causal discovery methods. Eventually, the authors assess the proposed method on four datasets of the BN benchmark suite, and compare the performance to three online causal discovery methods, as well as 3 heuristics and a human reference score. - The idea of leveraging the vast world knowledge of LLMs as a prior for causal discovery methods is definitely interesting and deserves more exploration. - Clean visualizations. - **Clarity:** The paper is somewhat hard to parse. Despite reading Section 3.2 multiple times, I still find it hard to understand the proposed method in detail and cannot confidently say that I am 100% understanding what's going on. - **Variance:** The variances in Table 1 seem to be huge. When considering the standard deviation over 5 seeds, GIT and LeGIT seem to be overlapping in most cases, making it hard to draw a conclusion about the effectiveness of the proposed approach. - **Contamination:** While the authors address the issue of contamination (lines 100 to 108), they refer to Long et al. (2023) as evidence that prominent LLMs struggle to reconstruct causal graphs. However, the cited work tested GPT-3, while the present work builds upon OpenAI's GPT-4o. LLMs have come a long way since GPT-3, and it is very likely that the results would look different with today's state-of-the-art models, especially reasoning models. Hence, the effect of contamination cannot really be judged given the presented evidence. Instead of asking for interventional targets, it would be worthwhile to check which edges the LLMs are able to identify. - **Theoretical Argument:** The theoretical argument is somewhat superfluous -- could you please provide more depth on what the theoretical argument provides? The entire theoretical novelty hinges on empirically verifying Assumption D.2. If the authors cannot show that the LLM targets consistently satisfy this property, the whole theoretical argument becomes a tautology with a proof inherited from other work. - Could you please provide a clearer description of the algorithm? - Have the authors also experimented with initializing the graph with LLM-proposed edges instead of using the LLM to suggest informative intervention nodes? - Have the authors experimented with different prompts in the warmup stage? I suspect that the LLM's performance on discovering useful variables may vary largely between prompts, and the given prompt seems to be slightly ambiguous -- so I would be curious how well the model performs with a more precise prompt. - Could the authors please run LLM baselines where the model directly predicts causal edges among the given variables? - Line 86: "Relatively clearer" --> what do you mean by this? Fully human-written
Page 1 of 1516 (75800 total rows)