ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 15899 (21%) | 4.43 | 3.58 | 3687 |
| Heavily AI-edited | 3233 (4%) | 4.22 | 3.59 | 2990 |
| Moderately AI-edited | 7082 (9%) | 4.20 | 3.61 | 2722 |
| Lightly AI-edited | 16648 (22%) | 4.15 | 3.68 | 2746 |
| Fully human-written | 32938 (43%) | 4.13 | 3.62 | 2917 |
| Total | 75800 (100%) | 4.21 | 3.62 | 3026 |
Title Ratings Review Text EditLens Prediction
ParaFlow: Parallel Sampling for Flow Matching Models Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 0: Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper proposes ParaFlow, a training-free, step-parallel sampler for Flow Matching (FM). It rewrites the discrete ODE sampling path as a triangular system of nonlinear equations and solves it via fixed-point iterations (optionally in sliding windows). The claim is that this “solves the whole trajectory simultaneously,” reducing sequential dependencies and enabling parallel evaluation across timesteps. Experiments on large text-to-image FM models report latency gains with small quality changes. Clarity of construction: The triangular formulation and fixed-point solver are simple and easy to implement; drop-in for existing FM pipelines. Practicality: Training-free; compatible with common samplers and amenable to batching across time. Empirical promise: Shows meaningful wall-clock reductions under certain hardware settings; includes window/tolerance ablations. 1. Overstated claim: “Eliminates sequential dependencies” is misleading—sequential rounds remain as fixed-point iterations $K$. Without a proof that $K \ll N$, gains are empirical only. 2. Theory is shallow: Convergence “in at most $N$” follows structurally from the triangular map; no FM-specific contraction/rate or tolerance-to-trajectory error bound is given. 3. Novelty vs. prior parallel sampling: Close to existing fixed-point/Picard-style parallelization in diffusion; differentiation is incremental. 4. Fairness of comparisons: Primarily Euler; lacks comparisons to stronger sequential baselines (Heun/DPM-Solver etc.) at matched quality/latency. 1. Do you have *a priori* conditions (e.g., Lipschitz/step-size) guaranteeing $K=O(1)$ or $K \ll N$? If not, please soften the “eliminates dependencies” claim. 2. How does ParaFlow perform with non-Euler solvers for FM, and does triangularity still hold without extra predictors? 3. Please add fair baselines: stronger sequential solvers and schedule-optimized or distilled alternatives at matched compute/latency. Fully AI-generated
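As a reading aid for the mechanism this review critiques, here is a minimal sketch of Picard-style fixed-point iteration over an Euler-discretized flow-matching trajectory, written in PyTorch. The batched `velocity(x, t)` interface, the sweep count, and the stopping tolerance are assumptions for illustration, not ParaFlow's actual implementation (the paper's sliding-window variant is omitted); the triangular structure is what guarantees convergence to the sequential Euler trajectory in at most N sweeps.

```python
import torch

def parallel_picard_sample(velocity, x0, ts, num_sweeps=8, tol=1e-3):
    """Fixed-point (Picard) solve of the Euler-discretized flow-matching ODE.

    velocity(x, t): batched velocity field, x: (N, *shape), t: (N,)
    x0: initial sample, shape (*shape)
    ts: 1-D tensor of timesteps t_0 < ... < t_N
    Returns the estimate of the final state at t_N.
    """
    N = len(ts) - 1
    h = ts[1:] - ts[:-1]                          # step sizes, shape (N,)
    # Initialize the whole trajectory with copies of the initial point.
    traj = x0.unsqueeze(0).repeat(N + 1, *[1] * x0.dim())
    for _ in range(num_sweeps):
        # One batched call evaluates the velocity at every timestep in parallel.
        v = velocity(traj[:-1], ts[:-1])          # (N, *shape)
        increments = v * h.view(-1, *[1] * x0.dim())
        # Triangular update: x_n = x_0 + sum_{m<n} h_m * v(x_m, t_m).
        new_traj = torch.cat([x0.unsqueeze(0),
                              x0.unsqueeze(0) + torch.cumsum(increments, dim=0)])
        if torch.max(torch.abs(new_traj - traj)) < tol:
            traj = new_traj
            break
        traj = new_traj
    return traj[-1]
```

The wall-clock benefit hinges exactly on the point the review raises: the number of sweeps actually needed must be much smaller than N for this to beat sequential Euler sampling.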
Buckingham $\pi$-Invariant Test-Time Projection for Robust PDE Surrogate Modeling Soundness: 2: fair Presentation: 1: poor Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes the use of a test-time projection method in which samples are rescaled to match the scaling of the nearest point in the training data without changing the nondimensional parameters of the system, through use of the Buckingham $\pi$ theorem. The approach is made more efficient by the use of centroids rather than comparisons to the full training data. 1. It's a very sensible approach which tackles an important problem. 2. The experiments are informative and show really strong performance. 3. The authors were clearly conscious of the impact of cost and found an approximation that works well. Overall, it seems the method is a natural and very promising approach for mitigating OOD performance degradation, at least in cases where the OOD shift is due to different dimensional choices, which could easily happen when applying a pretrained model to new data. While there are some unanswered questions and aspects that could be tightened up which I'll list further down, the main reason for my current recommendation is that the presentation could use significant improvement, particularly on writing to a machine learning audience where many readers will have very little experience with dimensional analysis. I think this can be easily rectified with some restructuring and building out examples better. Here are some concrete issues and suggestions: 1. Many readers who work on neural surrogates are not going to be familiar with dimensional analysis (not arguing that this should be the case, but it is currently true), so when section 3 doesn't explain the basic concept well, subsequent sections will be harder to understand. Expanding the worked example and carrying it through each stage of the section, describing what the fields are and what the units are, and then going through the current example to show how to extract $\pi_{th}$ from them, would make the statements significantly more concrete and easier to follow. 2. It feels like concepts are often not explained in the place where they are presented. One would expect all of the methods to be explained in section 4 either mathematically or algorithmically, but section 4.4 just explains the uniform strategy as "tunes the dominant scale while others fixed". If this is supposed to be a method that doesn't require users to perform dimensional analysis on their own, how should they determine the dominant scale? How does this make the distribution uniform? 3. Experiments are just equations right now. What types of common applications do these equations represent and why are they interesting test beds? How did you generate the data sets? How are initial conditions generated? Other issues: 1. Experiments are currently fairly weak. These are two linear problems. It would be more interesting to see if the advantages from a linear projection method also hold for nonlinear problems. 2. What are the test and train distributions of the invariants? For instance, it seems like q and k are shifted in the same pattern which wouldn't necessarily result in disjoint equations. It would be good to highlight where the method would be expected to fail as well.
Minor: 1. B not defined in Theorem 1. 2. Random selection is mentioned, but not included in Table 1. 1. Often, these surrogate methods you're describing are trained on simulation data which is already non-dimensionalized. How would you expect the performance of this method to be affected in this setting where new data is likely using the same characteristic scales? Given the data is already dimensionally equivalent, will performance be affected? 2. What's the justification for using the mean in place of characteristic scales? Often in fluids, these scales are relevant to the geometry of some feature in the system. Is this something that can be ignored in the current setting? 3. Could you provide more detail on how the datasets were generated? Fully human-written
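To make the projection idea discussed in the reviews of this paper concrete, here is a minimal NumPy sketch of a $\pi$-preserving rescaling in log space. The specific dimension matrix, the use of a single training point rather than centroids or field means, and the Reynolds-number toy check are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def pi_preserving_projection(q_test, q_train, D):
    """Rescale q_test toward q_train without changing its dimensionless (pi) groups.

    q_test, q_train : positive parameter vectors, shape (p,)
    D               : dimension matrix, shape (n_units, p); vectors phi with D @ phi = 0
                      define the pi groups pi = prod_i q_i**phi_i
    """
    z_test, z_train = np.log(q_test), np.log(q_train)
    # Orthogonal projector onto the row space of D: moving along these log-space
    # directions corresponds to rescaling base units, which leaves all pi groups fixed.
    P = D.T @ np.linalg.pinv(D @ D.T) @ D
    z_star = z_test + P @ (z_train - z_test)   # closest pi-equivalent point to q_train
    return np.exp(z_star)

# Toy check with Reynolds-number-like variables (rho, U, L, mu) in units (kg, m, s).
D = np.array([[1, 0, 0, 1],     # kg exponents
              [-3, 1, 1, -1],   # m exponents
              [0, -1, 0, -1]])  # s exponents
q_test = np.array([1.2, 10.0, 0.5, 1.8e-5])      # air-like scales
q_train = np.array([1000.0, 1.0, 1.0, 1.0e-3])   # water-like scales
q_star = pi_preserving_projection(q_test, q_train, D)
re = lambda q: q[0] * q[1] * q[2] / q[3]          # Re = rho*U*L/mu
assert np.isclose(re(q_star), re(q_test), rtol=1e-6)  # pi group is unchanged
```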
Buckingham $\pi$-Invariant Test-Time Projection for Robust PDE Surrogate Modeling Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper focuses on mitigating out-of-distribution (OoD) inference of neural operators. The method is based on the Buckingham $\pi$ theorem, which decomposes the entire space into two parts: the null space $\ker(\Phi^T)$ and the component perpendicular to $\ker(\Phi^T)$, where $\Phi=[\phi^{(1)}\cdots\phi^{(p-r)}]$ stacks the null-space basis vectors (see Theorem 1). Thus, data can be transformed or projected to the point closest to the training data while preserving $\pi$. Experiments demonstrate effectiveness on OoD data superior to baselines. **Originality** is high. The method leverages the Buckingham $\pi$ theorem and addresses the OoD problem innovatively from a structured perspective, i.e., the data space can be decomposed into equivalence classes generated by training data $(X_i, Y_i)$. **Clarity** is good in general except for some minor issues (see weaknesses). Diagrams and figures are illustrative and helpful. Regarding **Clarity**: 1. An algorithm that summarizes the whole procedure would be helpful to readers, especially corresponding to predict & inverse in Fig. 3. How is the inverse done in general? 2. Above equation (11), should $\tilde{X}^{\star}$ be $\tilde{X}^{*}$ as in eq. (11)? 3. In Fig. 1, what do purple circles and yellow circles stand for? Reference: 1. How is your work related to Lie point symmetry [1]? Limitation: 1. Can your method be applied to irregular mesh grids? All baseline models in your paper, CNN, U-Net and FNO, can only be applied on uniform grids. Is there such a limitation for your method? [1] Brandstetter, J., Welling, M., & Worrall, D. E. (2022, June). Lie point symmetry data augmentation for neural PDE solvers. In International Conference on Machine Learning (pp. 2241-2256). PMLR. See weaknesses. Fully human-written
Buckingham $\pi$-Invariant Test-Time Projection for Robust PDE Surrogate Modeling Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a Buckingham π-Invariant Test-Time Projection method to improve the out-of-distribution (OOD) robustness of PDE surrogate models such as FNOs, U-Nets, and CNNs. The key idea is that many OOD inputs differ only by unit or scale changes that should be physically equivalent under dimensional analysis. The authors therefore apply the Buckingham π theorem to define a log-space pi-group-preserving projection while aligning each test sample with its nearest training sample in parameter space. They propose a minimization procedure for this alignment. Experiments on steady 2-D thermal conduction and linear elasticity show up to 90% reduction in OOD error with minimal computational overhead. - The paper adapts a century-old but fundamental physical principle (Buckingham π) into a practical, algorithmic test-time procedure for neural surrogates. - The combination of π-compliant projection and nearest-sample search is a creative way to transfer dimensional invariance into modern ML practice. - Training-free and model-agnostic: it can wrap around any pretrained surrogate without re-training or altering the loss, which makes it attractive for applied modeling. - Smooth minimization: the log-space formulation converts multiplicative scaling into a linear subspace problem, solved neatly by an orthogonal projection. - The idea may stimulate a broader discussion on how physical similarity and scale invariance can be enforced at inference rather than training time. - Limited scope of experiments. Only simple, steady, linear PDEs (conduction and elasticity) are tested. It remains unclear whether the method holds for nonlinear, transient, or multi-physics systems. - Procedure feels overly elaborate for a scaling correction: The projection, clustering, and least-squares steps may appear heavy compared to straightforward normalization or nondimensionalization. There is a lack of exploration of when the procedure is worth the effort, and when it is simply more favorable to collect more training data points. - Writing and exposition are often opaque: Key transitions between the physical reasoning, log-space math, and algorithmic steps are difficult to follow without prior familiarity. - No guarantee of a true physical neighbor: The nearest-sample search may project the test case toward an unrelated training sample if the π-space distribution is sparse or multimodal. - No discussion on mis-specified π-groups: The procedure assumes the chosen invariant is the correct one; the effect of using incomplete or incorrect π-groups is not analyzed. - Use of mean field values may fail for heterogeneous inputs: Collapsing distributed fields into global means ignores spatial structure, which can distort π values for systems with strong local variability. - How sensitive is the method to the choice of π-group? Could the projection degrade performance if irrelevant or redundant groups are used? In many problems the pi groups are actually ratios between problem geometries, and it is unclear how they are to be chosen.
- For heterogeneous domains, can local or hierarchical π values be used instead of global means? - How would the method behave on nonlinear or transient PDEs (e.g., Navier–Stokes, Burgers’, advection–diffusion)? Since pi scaling is used extensively in computational fluid mechanics, there should be examples from CFD for well-known pi groups - What is the computational cost for large training sets before centroid reduction, and how stable are results with different cluster seeds? Fully AI-generated
Buckingham $\pi$-Invariant Test-Time Projection for Robust PDE Surrogate Modeling Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a projection method, called $\pi$-invariant test-time projection, which can cluster training and test data points into dimensionless groups. Such groups are invariant to unit scales, and can help separate data points with different physical behaviors. According to the paper, this can help reduce the prediction error for OOD samples. 1. The introduction of the Buckingham $\pi$ invariant is interesting and new to the CS/ML community --- though this might be an old concept in computational physics/applied math. 2. The idea of clustering data points according to different behaviors without being influenced by units/scale changes is interesting and has potential to enhance the generalization of current ML surrogate models. 1. Clarity. This might be the biggest issue --- the paper does not explain clearly how the proposed projection method is integrated into the training/testing pipeline of neural operators to enhance OOD prediction. Throughout, the paper focuses on how to do projection and clustering, but what to do with NO training/testing? Will you train a different NO for each cluster, and then use the test-time projection for each test example to dynamically determine which NO should be used to predict? Or will you compute a soft cluster membership of each cluster, and then perform a mixture of predictions? Intuitively, there can be many ways of integrating the proposed method with NO training/testing or even data preparation/acquisition. However, this part is significantly lacking and it is hard to understand how the improvement in the experimental part is obtained. 2. The OOD problem mentioned in this paper is a bit different from commonly used settings. Regarding OOD, one typically changes the distribution of the input to neural operators, rather than assuming a fundamental change of physical behaviors. In fact, an ML surrogate (e.g., NO) is typically used to capture one type of physical behavior under various scenarios. I am not sure if expanding the scope to make a surrogate model account for several different physics is appropriate or feasible. At least, the paper should clarify its own meaning of OOD and its difference/connection with settings in prior works. 3. The experimental results are limited. Evaluation on only two systems is not sufficient in this community. Also, there is no standard deviation in Table 1, making it hard to assess the significance of the improvement. See above. Fully human-written
Query-Efficient Zeroth-Order Algorithms for Nonconvex Optimization Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. In this paper, the authors studied query-efficient zeroth-order optimization for constrained nonconvex problems. They first transformed the original formulation into a minimax problem, and proposed two algorithms, ZOB-GDA and ZOB-SGDA, which combine random block updates with gradient descent–ascent (GDA) to estimate gradients using only a subset of coordinates at each iteration instead of all dimensions. By integrating this block-based estimator into primal–dual updates, the methods achieve overall query complexity of $\mathcal{O}(d\epsilon^{-4})$ for finding an $\epsilon$-stationary point, which matches the best-known overall query complexity. The authors provide numerical evidence on a practical problem where the proposed methods require fewer function queries than prior ZO algorithms when using a block size between 1 and the model dimension. - The idea of leveraging block coordinate updates in nonconvex-concave minimax optimization is novel. - The empirical results are promising when varying the block size. - The setting in this paper is limited. They consider the minimax problem (2) that is linear in $y$. Also, the methods only consider the deterministic case. - The theoretical analysis, though technically sound, is incremental. The overall query complexity matches the best-known result, but does not improve upon it. - Some experimental results do not align well with theory (see questions below). - Questions - In Definition 2.1, why is the stationarity measure defined based on the two metrics, instead of one? It seems the convergence is only established based on individual metrics. - In Table 2, the result of SZO-ConEx and that of ZOB-GDA (b = 1) have a large gap. What is the explanation for this? In theory, both methods have the same complexity in terms of queries per step and overall queries. - In Table 2, why is ZOB-SGDA worse than ZOB-GDA? In theory, ZOB-SGDA should have much better overall query complexity. - Typos - Line 217: k-th iterate -> k-th iteration Fully human-written
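For context on the per-step query cost discussed in this review, below is a minimal sketch of a block coordinate-wise finite-difference gradient estimate using b + 1 function queries. The forward-difference scheme and the d/b rescaling are assumptions for illustration, not necessarily the exact estimator used in the paper.

```python
import numpy as np

def block_zo_gradient(f, x, block, mu=1e-4):
    """Finite-difference gradient estimate restricted to a block of coordinates.

    f     : black-box objective, f(x) -> float
    x     : current point, shape (d,)
    block : indices of the b coordinates probed this iteration
    mu    : smoothing radius
    Uses b + 1 queries and returns a d-dimensional vector that is zero off-block.
    """
    d = x.shape[0]
    g = np.zeros(d)
    fx = f(x)
    for i in block:
        e = np.zeros(d)
        e[i] = mu
        g[i] = (f(x + e) - fx) / mu
    # Optional rescaling: with blocks of size b sampled uniformly from d coordinates,
    # (d / b) * g is, up to smoothing bias, an unbiased estimate of the full gradient.
    return (d / len(block)) * g

# Usage on a toy quadratic with b = 3 out of d = 10 coordinates.
rng = np.random.default_rng(0)
d, b = 10, 3
x = rng.standard_normal(d)
f = lambda z: 0.5 * np.dot(z, z)
block = rng.choice(d, size=b, replace=False)
g_hat = block_zo_gradient(f, x, block)
```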
Query-Efficient Zeroth-Order Algorithms for Nonconvex Optimization Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes two zeroth-order optimization algorithms (i.e., ZOB-GDA and ZOB-SGDA) by integrating block updates with gradient descent ascent (GDA) and smoothed GDA (SGDA). The proposed algorithms can simultaneously achieve efficiency in single-step gradient estimation and overall query complexity. Strengths: 1. The paper provides comprehensive finite-sample convergence analyses for both algorithms. 2. For ZOB-SGDA, the authors establish an overall query complexity bound, which matches the state-of-the-art result for nonconvex-concave minimax problems, while ZOB-GDA achieves a reasonable trade-off for its simplicity. Weaknesses: 1. The paper notes in Section 5 that equality constraints can be converted to two inequalities but provides no theoretical or empirical validation of this approach. 2. The block size is shown to be a practical tuning parameter. How should the block size be chosen for a given problem? For example, is there a heuristic to balance single-step queries and the number of blocks, which influences convergence rate via Theorem 3.1? 3. This paper provides no details on the sampling distribution of the blocks. 4. For non-uniform sampling (e.g., weighting blocks by gradient magnitude), could the algorithm achieve faster convergence? 5. How sensitive are the algorithms to the smoothing radius and step sizes? 6. The experimental results are not convincing. For instance, the authors should compare the performance of the two proposed algorithms, ZOB-GDA and ZOB-SGDA. 1. The paper notes in Section 5 that equality constraints can be converted to two inequalities but provides no theoretical or empirical validation of this approach. 2. The block size is shown to be a practical tuning parameter. How should the block size be chosen for a given problem? For example, is there a heuristic to balance single-step queries and the number of blocks, which influences convergence rate via Theorem 3.1? 3. This paper provides no details on the sampling distribution of the blocks. 4. For non-uniform sampling (e.g., weighting blocks by gradient magnitude), could the algorithm achieve faster convergence? 5. How sensitive are the algorithms to the smoothing radius and step sizes? 6. The experimental results are not convincing. For instance, the authors should compare the performance of the two proposed algorithms, ZOB-GDA and ZOB-SGDA. Lightly AI-edited
Query-Efficient Zeroth-Order Algorithms for Nonconvex Optimization Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper studies the problem of zeroth-order constrained optimization (1) in the nonconvex–concave setting (2). The authors propose two new algorithms, ZOB-GDA and ZOB-SGDA, by integrating block coordinate updates into the zeroth-order gradient descent-ascent framework. The proposed methods aim to improve single-step query complexity without worsening the overall query complexity. Theoretical convergence guarantees and numerical experiments are provided. 1. The paper is overall well-organized and clearly written. Assumptions are explicitly stated, theorems are formally presented. 2. Theoretical results are technically sound. Although I didn't check the proof carefully, the theorems seem to be right. 1. My main concern is the motivation for introducing a block-coordinate algorithm in this problem and query setting. The paper does not clearly explain why block-coordinate updates are necessary or beneficial here. Theoretical results show that the overall query complexity of the proposed methods matches, but does not improve upon, existing results. Moreover, the proposed algorithms have higher per-iteration computational costs than randomized gradient estimation (RGE) methods and only beat coordinate-wise gradient estimation (CGE) in this aspect. On the other hand, the benefits of lower per-iteration query complexity are unclear. With sufficient computational resources, queries within each iteration can be parallelized, in which case methods such as ZOAGP could achieve lower wall-clock runtime. In fact, higher per-iteration query complexity is not necessarily a drawback in practice. While I acknowledge that ZOB-GDA generalizes ZOAGP as a special case when $b=1$ and provides a compromise option for larger $b$, this extension alone does not seem substantial enough for publication at ICLR without a stronger theoretical or practical justification. 2. The empirical results show some performance gains for the proposed algorithms on specific examples, though these advantages are not reflected in the theoretical analysis. I find the experiments unconvincing for two reasons: First, the 141-bus distribution network example appears to be a hand-picked application that is unlikely to arise naturally in a zeroth-order optimization context. It gives the impression of being chosen to favor the proposed method. Second, the experimental evaluation is too narrow. More standard constrained optimization benchmarks should be included to draw reliable empirical conclusions. 3. The discussion in Section 5, at least the first half, reiterates well-known results from basic optimization theory. These points are standard material in even undergraduate optimization courses and do not add meaningful insights to the paper. I recommend substantially condensing or removing this part. NA Lightly AI-edited
Query-Efficient Zeroth-Order Algorithms for Nonconvex Optimization Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper tackles the problem of reducing per-step query complexity while maintaining overall query efficiency in zeroth-order (gradient-free) optimization for nonconvex, constrained black-box problems. The authors introduce two algorithms, ZOB-GDA and ZOB-SGDA, which embed block-coordinate gradient estimators within smoothed gradient descent–ascent (GDA) frameworks. By updating only a randomly selected block of variables at each iteration, the methods achieve flexible per-step query costs O(b) for block size b and demonstrate state-of-the-art overall query complexity for finding stationary points. Theoretical convergence guarantees are provided, and experiments on energy management tasks validate the query efficiency against recent baselines. 1. The work tackles a clear limitation in zeroth-order optimization, balancing per-step efficiency and overall convergence, particularly for constrained, nonconvex problems where standard RGE or CGE methods are either too slow per iteration or have high total query cost (Sections 1, 2.2). 2. The paper rigorously proves finite-sample convergence for both ZOB-GDA and ZOB-SGDA, including explicit complexity bounds. In particular, Theorem 4.1 and Corollary 4.2 for ZOB-SGDA yield a query complexity matching the best-known results, and Section 3.2 provides a detailed derivation for ZOB-GDA. 1. Experimental Validation. The numerical experiments (Section 6, Figures 1 and 2) are limited to a single energy management application where $d_x$ is only 168. Broader empirical evaluation, spanning more domains and higher-dimensional and/or more complex constraint sets, would underscore generalizability. The per-iteration query complexity is a significant challenge only, or especially, when $d_x$ is large enough. 2. Claims such as "over 10 times fewer queries" (Abstract, Section 6, Table 2) merit more nuanced discussion: performance benefits are shown on a single problem and are dependent on hyperparameters (e.g., choice of block size $b$), so the result may not transfer universally. 3. Some theoretical choices and analysis need more intuition to be understood better. For example, in Section 3.2, what is the intuition for why the bias can be bounded, and how is it related to the variance? 4. (Minor) Notation $K$ is overloaded, e.g., in Algorithm 2 it denotes both the number of iterations and the smoothed function. See above. Moderately AI-edited
Query-Efficient Zeroth-Order Algorithms for Nonconvex Optimization Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper addresses zeroth-order (ZO) optimization for constrained black-box problems where both objective and constraint functions lack analytical expressions. The authors propose two novel algorithms: ZOB-GDA (Zeroth-Order Block Gradient Descent Ascent) and ZOB-SGDA (Zeroth-Order Block Smoothed Gradient Descent Ascent). The key innovation is integrating block coordinate updates with gradient descent-ascent methods, which allows for adjustable single-step query efficiency while maintaining optimal overall query complexity. Specifically, ZOB-SGDA achieves optimal $O(d/\epsilon^4)$ overall query complexity while requiring only $O(b)$ queries per step, compared to $d$ for traditional coordinate-wise gradient estimation methods. The authors provide finite-sample convergence guarantees for nonconvex-concave minimax problems and demonstrate superior empirical performance on a power system load curtailment problem. 1. The paper makes an important observation about the trade-off between single-step efficiency and overall query complexity in zeroth-order optimization. 2. The convergence analysis is rigorous and comprehensive, establishing finite-sample guarantees for nonconvex-concave problems. ZOB-SGDA achieves the best-known query complexity $O(d/\epsilon^4)$, matching optimal results while offering flexible single-step efficiency. 3. The paper is generally well-written with clear motivation. The comparison table (Table 1) effectively highlights the contributions relative to existing work. 1. Experiments focus on a single power-systems task at $d=168$. Given the paper's relevance to ML-style black-box tasks, additional benchmarks (e.g., adversarial attacks, policy optimization, or LLM-related ZO tasks) would strengthen external validity and showcase scalability to high-dimensional problems. 2. No comparison of wall-clock time, which is important for assessing practical efficiency. 3. The block size b is a key feature, allowing for single-step query costs from 1 to d. The empirical results in Table 2 are very interesting, as they show that $b=10$ is the most query-efficient, outperforming both $b=1$ and the full-batch $b=168$. This suggests a non-trivial trade-off. However, the paper provides no discussion or heuristics on how to set this crucial hyperparameter in practice. 4. The current title, "Query-Efficient Zeroth-Order Algorithms for Nonconvex Optimization," is broad and doesn't signal that the paper tackles constrained black-box problems. 1. Can you provide principled guidelines or an adaptive strategy for choosing block size b? How does the optimal b depend on problem dimension or computational budget? Is it possible to develop an adaptive block size b across iterations? 2. The paper emphasizes query complexity; could you also report wall-clock time with/without parallelizing coordinate/block queries, to show real-time benefits in practical deployments? Moderately AI-edited
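As a rough illustration of how a block estimator like the one sketched earlier plugs into the primal-dual scheme these reviews describe, for the Lagrangian L(x, y) = f(x) + y^T g(x) of a constrained problem min f(x) s.t. g(x) <= 0, here is a minimal sketch. The step sizes, uniform block sampling, and projected dual ascent are assumptions, and the smoothed (SGDA) variant's extra proximal term is not shown.

```python
import numpy as np

def zo_block_gda(f, g, x0, y0, steps=500, b=5, eta_x=1e-2, eta_y=1e-2, mu=1e-4):
    """Zeroth-order block gradient descent-ascent on L(x, y) = f(x) + y^T g(x).

    f : black-box objective, g : black-box constraints returning shape (m,);
    only function values of f and g are used (no analytical gradients).
    """
    rng = np.random.default_rng(0)
    x, y = x0.copy(), y0.copy()
    d = x.shape[0]
    for _ in range(steps):
        block = rng.choice(d, size=b, replace=False)
        L = lambda z: f(z) + y @ g(z)              # Lagrangian at the current dual y
        # Block finite-difference estimate of grad_x L on the sampled coordinates.
        grad_est = np.zeros(d)
        Lx = L(x)
        for i in block:
            e = np.zeros(d)
            e[i] = mu
            grad_est[i] = (L(x + e) - Lx) / mu
        x = x - eta_x * grad_est                   # primal descent on the block
        y = np.maximum(y + eta_y * g(x), 0.0)      # dual ascent, projected to y >= 0
    return x, y
```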
Bridging Temporal and Semantic Gaps: Prompt Learning on Temporal Interaction Graphs Soundness: 3: good Presentation: 4: excellent Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces Temporal Interaction Graph Prompting (TIGPrompt), a framework addressing two critical gaps in existing TIG models: the temporal gap (performance degradation on future data) and semantic gap (misalignment between pretext and downstream tasks). The authors propose a "pre-train, prompt" paradigm with a Temporal Prompt Generator (TProG) that creates personalized, time-aware prompts for nodes. Three TProG variants are presented: Vanilla (learnable vectors), Transformer (encoding recent neighbors), and Projection (combining personalized vectors with time encoding). Experiments on four datasets with seven TIG models demonstrate performance improvements. 1. This work is based on a well-motivated problem formulation. Figure 1(a) clearly shows the temporal gap through performance degradation on temporally distant data, and Figure 1(b) provides quantitative evidence of the semantic gap between link prediction and node classification. These gaps are effectively demonstrated. 2. The proposed TProG has a flexible framework design. The paper introduces three complementary TProG variants for different use cases, and the approach can be extended to a "pre-train, prompt-based fine-tune" paradigm in resource-rich settings. 3. The results are supported by strong experimental validation. Extensive experiments are conducted across 4 datasets, 7 models, and 2 downstream tasks, and the performance improvements are evident in several settings. 1. The paper claims to be the "first attempt" at prompting for TIGs, but without direct empirical comparison to the concurrent DyGPrompt (whose code is unavailable), the claim of "first" is hard to verify, and the method's positioning relative to related work remains somewhat unclear. In addition, there are existing studies on snapshot-based dynamic graphs [1], which should be discussed more thoroughly. 2. The main novelty of this paper is applying prompting to TIGs to address temporal gaps in distant inference data and semantic gaps in multi-task learning. However, in real-world applications, it is worth questioning whether the extra effort of using prompting is truly more efficient than retraining or incrementally updating the model with new data. Prompting offers an incremental gain, while regular and necessary model updates may lead to more substantial improvements. 3. The proposed TProG framework relies on standard components such as learnable vectors, Transformers, and MLPs, which limits its algorithmic novelty and makes the contribution mostly incremental. Moreover, beyond empirical results, there is no theoretical explanation for why prompt fusion helps reduce the gaps, which somewhat weakens the depth of the analysis. 4. Although Section 3.2 explains the experimental setup, the difference in total training data used (50%+20% in this work vs 70% in prior related work) still raises concerns about potential unfairness in the comparison. 5. As noted in Appendix J, "we may need to conduct additional experiments to determine which TProG variant works better." 
It would be helpful if the authors could provide clearer guidance on how to choose among TProG variants for new datasets or models, along with more analysis of the tradeoffs between computational cost and performance across the variants. Reference [1] ProST: Prompt Future Snapshot on Dynamic Graphs for Spatio-Temporal Prediction https://dl.acm.org/doi/pdf/10.1145/3690624.3709273 Please see weaknesses. Lightly AI-edited
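For readers unfamiliar with what a "temporal prompt generator" might look like, here is a minimal PyTorch sketch in the spirit of the Projection TProG variant described in the reviews of this paper. The layer sizes, the cosine time encoding, and additive fusion with the backbone embedding are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ProjectionTProG(nn.Module):
    """Sketch of a Projection-style temporal prompt generator: a learnable
    per-node vector is combined with a time encoding and projected to a prompt
    that is fused with the (frozen) backbone embedding."""

    def __init__(self, num_nodes, dim, time_dim=32):
        super().__init__()
        self.node_vec = nn.Embedding(num_nodes, dim)      # personalized vectors
        self.freq = nn.Parameter(torch.randn(time_dim))   # learnable frequencies
        self.proj = nn.Sequential(nn.Linear(dim + time_dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))

    def forward(self, node_ids, timestamps, backbone_emb):
        # TGAT-style functional time encoding: cos(t * freq).
        t_enc = torch.cos(timestamps.unsqueeze(-1) * self.freq)
        prompt = self.proj(torch.cat([self.node_vec(node_ids), t_enc], dim=-1))
        return backbone_emb + prompt                       # prompt fusion
```

In the prompt-tuning regime the reviews discuss, only a small module like this would receive gradient updates while the pre-trained TIG backbone stays frozen, which is where the claimed data and compute savings come from.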
Bridging Temporal and Semantic Gaps: Prompt Learning on Temporal Interaction Graphs Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces a novel paradigm for temporal graph learning, namely the "pre-train, prompt" framework, to systematically address two challenges of existing temporal graph models: the temporal gap, where model performance degrades on temporally distant inference data; and the semantic gap, where the model performs worse on a downstream task that is different from the task used in training. The authors propose three model-agnostic designs for the Temporal Prompt Generator (TProG): vanilla TProG, Transformer TProG and Projection TProG. Empirical experiments demonstrate substantial performance improvements from the new paradigm and TProGs on transductive link prediction, inductive link prediction and node classification. - The paper formally defines and empirically quantifies the "temporal gap" and "semantic gap"; the authors provide clear diagnostic tools and theoretically motivated objectives - The paper proposes a novel "pre-train, prompt" paradigm and three approaches for the Temporal Prompt Generator (TProG) - The paper conducts experiments under different settings, including transductive/inductive link prediction, node classification, limited training and prompt data, and performance w.r.t. the proportion of prompting data, to rigorously showcase the substantial improvements provided by the "pre-train, prompt" paradigm and TProGs - The model-agnostic applicability of both the "pre-train, prompt" paradigm and TProG is robustly demonstrated, with clear performance gains reported on both memory-based and non-memory-based temporal graph models - The new paradigm and TProGs are highly efficient: only the prompt generator is updated during adaptation, which significantly reduces computational cost and data requirements, enabling weakly supervised learning in resource-constrained scenarios **W1: Reproducibility.** The Anonymous Repository link is currently inaccessible. I recommend that the authors make sure to double-check the link and repository accessibility for the camera-ready version. **W2: Notation and problem definition.** There are notations used without definition (e.g. line 160: $G$, $\mathcal{V}$ and $\mathcal{E}$). There is no mathematical definition of the temporal interaction graph (especially clarity on whether the paper focuses on the discrete-time (DTDG) or continuous-time dynamic graph (CTDG) setting), nor of the link and node prediction task formulation. I recommend that the authors add a dedicated section before the Proposed Methods to define key concepts, notation, and task settings to enhance clarity. **W3: Metric.** The results in the paper are mainly reported using Average Precision (AP). I recommend that the authors consider including additional evaluation metrics such as Area Under the Curve (AUC) and, in particular, Mean Reciprocal Rank (MRR) [1]. This would provide a more comprehensive and comparative assessment of model performance on temporal graph learning tasks. **W4: Task diversity.** The paper currently focuses on link property prediction and node property prediction tasks, without addressing graph property prediction.
However, this limitation is acknowledged by the authors in Appendix Section J. **Minor** **W5. Paper presentation.** I recommend that the authors use \citep to introduce parentheses between method names and corresponding author names --- [1] Huang, Shenyang, et al. "Temporal graph benchmark for machine learning on temporal graphs." Advances in Neural Information Processing Systems 36 (2023): 2056-2073. [2] Shamsi, Kiarash, et al. "GraphPulse: Topological representations for temporal graph property prediction." The Twelfth International Conference on Learning Representations. 2024. - Transformer TProG uses a transformer to generate temporal prompts $p_v$. How computationally expensive is fine-tuning this compared to baseline and baseline + other variants of TProG? Could the authors provide a computational analysis showing the increase in inference time for baseline without TProG and baseline with TProG? - In Table 1, could the authors explain why Vanilla TProG sometimes outperforms the Transformer and Projection variants? - In the setting where baseline+TProG is trained on link prediction, and then only the prompt (TProG) is fine-tuned for node classification, how does the performance of baseline+TProG compare with models trained directly on node classification tasks? - In Section D.2.1, why did the authors choose to evaluate only the first strategy for experiments? Studying all strategies and reporting their performance differences would enhance understanding of TProG's applicability. - Regarding the domain gap, can any of the three proposed TProGs address transferability when a TIG model trained on one temporal graph domain is evaluated on a different domain? Can the authors suggest potential directions for designing a new TProG variant that improves transferability across different temporal graph domains? - In Figures 7 and 8, could the authors explain why increasing the amount of prompt tuning data sometimes leads to a decrease in performance? Fully human-written
Bridging Temporal and Semantic Gaps: Prompt Learning on Temporal Interaction Graphs Soundness: 3: good Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper works on prompt learning on temporal interaction graphs (TIGs), i.e., a setting where one has a pre-trained graph model that should be adapted to another task. Instead of changing the entire model, one keeps the pre-trained GNN frozen and learns only small prompt modules that adapt it to the new task. The work addresses this task by introducing Temporal Interaction Graph Prompting (TIGPrompt), a framework that is supposed to bridge temporal and semantic gaps by integrating with existing TIG models. They conduct multiple experiments, and the code is publicly available. * clear and good writing * good motivation, with good examples in the introduction, and a figure highlighting the temporal and semantic gaps * methodology (TProG) is conceptually simple and integrates easily with existing TIG backbones * the paper is well written and structured, with clear problem framing around the temporal and semantic gaps. * many experiments across several temporal graph models and tasks demonstrate that prompt-based adaptation can be both effective and parameter-efficient. ## 1. strong overlap with 1.5 year old arXiv preprint (March 2024) This work was initially released as an arXiv preprint in early 2024 ("Prompt Learning on Temporal Interaction Graphs") and has not been substantially updated since. Given that the community has already built upon and compared against this method (e.g., DygPrompt, ICLR 2025), in my opinion the contribution is no longer timely for ICLR 2026, despite being well executed. This would not be so much of an issue if the experiments and related work section were updated and compared to the works that have been introduced since then. ## 2. missing discussion and comparison to related work * Since the arXiv release of Prompt Learning on Temporal Interaction Graphs (March 2024), other research has already extended this work: The most important one is Node-Time Conditional Prompt Learning in Dynamic Graphs (DygPrompt, ICLR 2025). * DygPrompt explicitly positions itself as an improvement over TIGPrompt, arguing that >While [TIGPrompt] employs time-aware prompts, it lacks fine-grained node-time characterization and is thus unable to capture complex node-time patterns, where nodes and time mutually influence each other. DygPrompt explicitly conditions prompts on both node identity and temporal context. * In their paper and review discussions, the DygPrompt authors benchmarked their model against TIGPrompt. Because TIGPrompt's code was not publicly available at the time, they reimplemented it and evaluated both methods on the same datasets. Their results show consistent improvements for DygPrompt, and they also introduced a more challenging low-resource evaluation protocol (see below). * Given that DygPrompt explicitly positions itself as an improvement over TIGPrompt and has been publicly peer-reviewed at ICLR 2025, it now represents the de facto state of the art in prompt learning for temporal graphs.
* The present submission does not mention DygPrompt, reproduce its evaluation setup, or provide any updated comparison, which substantially weakens its novelty and relevance for ICLR 2026. ## 3. potentially outdated evaluation * The DygPrompt authors state in their ICLR 2025 rebuttal the following: >TIGPrompt [4] uses "50% of the data for pre-training and 20% for prompt tuning or fine-tuning, with the remaining 30% equally divided for validation and testing." (see Section 4.2 of TIGPrompt). Note that pre-training data do not require any labeled examples, while prompt-tuning/fine-tuning data require labels for node classification. Hence, TIGPrompt requires 20% labeled data for node classification. In our experiments, we use 80% of the data for pre-training (which does not contain any labels for node classification), but only 1% of the data serves as the training pool for prompt tuning, with each task leveraging only 30 events (about 0.01% of the entire dataset) for prompt tuning (where only the starting nodes in these events are labeled for node classification). Therefore, our setting focuses on the more challenging low-resource scenario with very few labeled data, as labeled data are generally difficult or costly to obtain in real-world applications [1,3,5]. Hence, our setting is more practical and challenging than TIGPrompt's. * I agree with this critique. Using only 1% of labeled data for prompt tuning is indeed a more realistic and demanding setting than TIGPrompt's 20%. * Therefore, I believe TIGPrompt's current evaluation protocol is outdated. * It would be valuable to hear the authors' thoughts on whether they have tested TIGPrompt under such low-data conditions, and what their opinion on this setup is. ## Overall Overall, the paper is well structured and clearly written, but it is mostly identical to its 2024 arXiv version without incorporating developments that have occurred since then. Because the authors have not updated or engaged with newer work, especially DygPrompt, the contribution feels dated and this leads to limited relevance for ICLR 2026. Thus I recommend rejection. 1. Could you please clarify why DygPrompt (ICLR 2025) was not cited, discussed, or compared against in your submission? 2. Have you reproduced DygPrompt's evaluation setup or considered running TIGPrompt under the same conditions? 3. The original TIGPrompt uses 20% of labeled data for prompt tuning, whereas DygPrompt uses only 1%, arguing it's more realistic. Do you have any thoughts or experiments on whether your method still performs well under these stricter conditions? Could you update your evaluation or provide additional experiments? 4. Since TIGPrompt was released in March 2024 and DygPrompt builds on it, how would you position TIGPrompt's contribution today relative to the current state of the art? Are there aspects of TIGPrompt that remain novel or useful even after DygPrompt's improvements? Fully human-written
Bridging Temporal and Semantic Gaps: Prompt Learning on Temporal Interaction Graphs Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. In this work, the authors introduce TIGPrompt, a novel framework that applies prompt learning to TIGs to bridge temporal and semantic gaps in current models. TIGPrompt focuses on the "pre-train, prompt" paradigm, different from standard training. The Temporal Prompt Generator generates personalised temporal prompts for each node to adapt to the downstream task. The authors evaluated the proposed methods on link prediction and node classification tasks and show improvements over the base model with TIGPrompt. - **originality: idea of temporal prompt generator** the idea of using a prompt generator to adapt to the downstream task is interesting. The authors introduced three prompt variants: Vanilla TProG, Transformer TProG, Projection TProG - **clarity: easy to follow** the paper is easy to follow; the authors present the ideas well. - **extensive evaluation** the authors evaluated across four benchmark datasets (Wikipedia, Reddit, MOOC, LastFM) and seven TIG backbones (e.g., TGN, DyRep, TIGER). - **task improvements**. The authors show that TIGPrompt can improve baseline performance across a variety of TGNNs on both link and node tasks. I believe the current work has the following weaknesses: - **limited evaluation metrics**: the link prediction experiments rely primarily on Average Precision (AP), whereas more robust ranking-based metrics such as Mean Reciprocal Rank (MRR) have been extensively adopted in prior work such as TGB [1] and ROLAND [2] and would provide a fairer and more direct reflection of the improvement of TIGPrompt, which leads me to the next weakness. - **performance saturation**: the main problem of the AP / binary classification evaluation lies in its over-inflated performances. This saturation makes it difficult to assess whether TIGPrompt provides substantial practical improvements or merely marginal gains within an already near-perfect range. For example, on Wikipedia, the DyGformer baseline is 99.03 and the improvement with Projection TProG is 99.80; this is hardly convincing, as the evaluation is simply too easy and the task under this evaluation is already solved. This is even worse when considering Table 2, where only 20% of data is used for training and most models can solve the tasks with > 95% AP on two out of the four datasets. - **unclear dataset transferability**: the main appeal of prompt learning is its ability to adapt to new datasets. From the provided results, it seems TIGPrompt mainly focuses on improving task transferability on the same dataset, yet it is still required that the model is trained and tested on the same dataset. It is unclear how TIGPrompt might be used to facilitate transfer to new datasets, for example, pre-train on Wikipedia and then transfer to Reddit. Suggestion: my main suggestion for the authors is to provide new results with MRR or other ranking metrics that at least require the model to rank across many potential negative destinations. This would enable the demonstration of a potentially more significant empirical gain for TIGPrompt and strengthen its significance.
The near-perfect AP is not a good indicator of performance, as the task is too easy. [1] Huang S, Poursafaei F, Danovitch J, Fey M, Hu W, Rossi E, Leskovec J, Bronstein M, Rabusseau G, Rabbany R. Temporal graph benchmark for machine learning on temporal graphs. Advances in Neural Information Processing Systems. 2023 Dec 15;36:2056-73. [2] You J, Du T, Leskovec J. ROLAND: graph learning framework for dynamic graphs. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining 2022 Aug 14 (pp. 2358-2366). - it seems that TIGPrompt achieves good training efficiency using only a small amount of data. The four datasets benchmarked are all on the smaller side with only a few million edges; would it be possible to run TIGPrompt on large temporal graph datasets such as those in [TGB](https://tgb.complexdatalab.com/), i.e., with tens of millions of edges? Maybe with TIGPrompt it is possible to only use 10% of data in training, thus enabling existing models to scale to datasets where they would normally not be able to work. - the Transformer TProG considers only the one-hop neighborhood; would more hops help? - what if we don't even need the base TGNN and just use a prompt model like TProG for prediction? Fully human-written
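For reference, the ranking-based evaluation this review asks for can be computed as below; this is a generic sampled-negative MRR sketch in PyTorch, not the exact TGB evaluation protocol.

```python
import torch

def mrr(pos_scores, neg_scores):
    """Mean reciprocal rank for link prediction with sampled negatives.

    pos_scores : (B,) score of the true destination for each query edge
    neg_scores : (B, K) scores of K sampled negative destinations per query
    """
    # Rank of the positive among its negatives (1 = best); ties are counted
    # against the positive (pessimistic tie handling).
    rank = 1 + (neg_scores >= pos_scores.unsqueeze(1)).sum(dim=1)
    return (1.0 / rank.float()).mean().item()
```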
ReasonEdit: Towards Reasoning-Enhanced Image Editing Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces ReasonEdit, a reasoning-enhanced image editing framework that integrates explicit “thinking” and “reflection” modules within an MLLM-Diffusion model architecture. The thinking module transforms abstract, ambiguous, or colloquial editing instructions into clear, actionable commands, while the reflection module iteratively reviews and revises intermediate edits to improve accuracy. The system is trained in a multi-stage pipeline on curated “Thinking Pairs” (abstract-to-concrete instructions) and “Reflection Triples” (input, intermediate, and target images with reflections/diagnostics). Experiments on three benchmarks (GEdit-Bench, ImgEdit-Bench, and KRIS-Bench) show that ReasonEdit achieves significant gains over baseline and related methods. - The introduction of explicit, modular “thinking” (instruction grounding/decomposition) and “reflection” (iterative self-correction) mechanisms directly within an image editing pipeline is well-motivated and appropriately positioned with respect to recent advances in reasoning for multimodal models. - The work details a robust data pipeline (including the construction of both “Thinking Pairs” and “Reflection Triples”) to support supervised training for both the reasoning and editing aspects. The scale and systematic curation add significant credibility. - Quantitative results demonstrate competitive performance on open-source benchmarks, with strong gains on the challenging KRIS-Bench reasoning tasks. The improvement is backed by ablation studies which isolate the effects of reasoning modules and training stages. - While the paper is methodologically sound, it does not provide deeper theoretical or formal justification for why incorporating reasoning delivers better generalization or robustness beyond anecdotal/empirical evidence. There is no formal analysis of potential failure modes, e.g., when thinking/reflection might increase hallucination or overfit to annotation artifacts. - The work focuses solely on the image editing scenario and largely benchmarks against datasets that were partially constructed or curated by the authors' proposed pipeline. There is little evidence that the discovered benefits in “reasoning” generalize to other vision or multimodal tasks, or hold when adversarial or out-of-distribution manipulations are presented. Experiments on more diverse, external, or “in-the-wild” benchmarks would increase confidence. - Although the dataset construction process is outlined in Section 3.1, there is insufficient detail about the annotation guidelines, inter-annotator agreement, or concrete measures to ensure that the paired data do not contain distributional or conceptual shortcuts. It is unclear how robust the datasets are to annotation artifacts or if “reflection triples” can introduce biases through automated or manual curation. - As highlighted in the ablation studies and Figure 4, the multi-round reflection pipeline is more complex and computationally intensive than simpler dual- or single-image reflection baselines. 
However, the paper does not provide computational cost, inference speed, or training efficiency trade-offs, nor does it discuss scalability limits for practical real-world deployments. - While the presentation of loss functions is generally correct, the flow matching formulation could be clearer regarding how the conditioning $c$ (for “thinking”/“reflection” outputs) is incorporated at each stage of the generator pipeline, and in how gradients flow in the combined joint loss $\mathcal{L}_{\text{joint}}$ during optimization. The paper also glosses over whether the LoRA adaptation (for reasoning module) causes optimization instabilities or representation collapse when coupled with diffusion-based training. - The qualitative figures mainly highlight positive examples. There is limited discussion about concrete cases where the reasoning modules produced spurious interpretations, incorrect refinements, or exacerbated model failures that would be helpful for practitioners and for future research. - While some mention is made of model size (e.g., ReasonEdit 12B vs. Qwen-Image-Edit 20B), detailed analysis on the effect of reasoning on memory/compute overhead, scalability advantages, or resource constraints (such as using smaller/frozen models) is missing. This limits insight into deployment contexts. - Can the authors provide empirical numbers on the additional computational and wall-clock costs for the multi-round reflection pipeline compared to single/double-pass alternatives? Is the reflection pipeline amortizable at inference in practical settings? - What are unambiguous scenarios or instruction types where reasoning/reflection actually induces overfitting, hallucination, or systematic errors? Are there examples where “thinking” decompositions or “reflection” cycles degrade performance? - Could the authors share more about annotation protocols and provide quantitative measures of instruction clarity, reflection reliability, or inter-annotator consistency for their curated datasets? - How does the framework fare when presented with entirely new image domains, unseen instruction styles, or non-editing vision-language reasoning tasks? Would a similar paradigm generalize? - Is the impressive efficiency of ReasonEdit at 60% model size sustained across smaller or larger variants? Is performance stable for lower-resource/frozen MLLMs or smaller DiTs? Fully AI-generated
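To make the workflow under review concrete, here is a schematic Python sketch of a thinking-editing-reflection loop. The prompt strings, the accept/revise protocol, and the maximum round count are assumptions for illustration; ReasonEdit trains dedicated thinking and reflection behaviours rather than relying on free-form prompting like this.

```python
def reasoned_edit(mllm, editor, image, instruction, max_rounds=3):
    """Sketch of a thinking-editing-reflection loop.

    mllm(image, text) -> text    : multimodal LLM used for thinking and reflection
    editor(image, text) -> image : diffusion-based editing model
    """
    # Thinking: ground an abstract or colloquial request into a concrete edit command.
    concrete = mllm(image, f"Rewrite as an explicit edit instruction: {instruction}")
    result = editor(image, concrete)
    for _ in range(max_rounds):
        # Reflection: inspect the intermediate result against the original intent.
        verdict = mllm(result, f"Does this image satisfy '{instruction}'? "
                               "Answer 'accept' or give a corrective instruction.")
        if verdict.strip().lower().startswith("accept"):
            break
        result = editor(result, verdict)          # self-correcting re-edit
    return result
```

The latency concern raised in the reviews follows directly from this structure: each extra reflection round adds at least one MLLM call and one full diffusion edit to inference.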
ReasonEdit: Towards Reasoning-Enhanced Image Editing Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose ReasonEdit, a framework that integrates the reasoning capabilities of multimodal large language models (MLLMs) into diffusion models to enable complex instruction-based image editing. The authors introduce a thinking–editing–reflection workflow and construct a corresponding dataset to fine-tune the unified model for iterative, reasoning-driven refinement of editing results. Extensive experiments show strong editing performance on ImgEdit, GEdit-Bench and Kris-Bench. 1. Clear motivation. ReasonEdit effectively leverages the reasoning capabilities of the MLLM to enhance image editing performance. The authors explore both thinking and reflection modes for the instruction-based editing task. Instead of treating the MLLM as a frozen feature extractor, the authors jointly optimize the MLLM with the diffusion decoder on their reasoning-enhanced dataset, improving performance under abstract instructions. 2. Well-designed data curation and training strategy. The proposed dataset is thoughtfully constructed, comprising 1) thinking pairs, which consist of abstract-to-concrete instruction pairs, and 2) reflection triples, which consist of an input image, a generated image, and a target image. The training strategy is well-motivated and consists of three stages: 1) fine-tune the MLLM while freezing the diffusion decoder to learn reasoning, 2) fine-tune the DiT to learn editing, and 3) perform joint fine-tuning for unified optimization. 3. Comprehensive evaluation. Comprehensive experiments and ablation studies on ImgEdit-Bench, GEdit-Bench and Kris-Bench demonstrate the model’s strong capability for abstract, instruction-based image editing. 1. The "reflection" mechanism is designed as an "iterative self-correction and optimization" process. This iterative process inevitably increases inference time (latency) and computational overhead, making it slower than single-pass editing models. However, the paper does not report evaluations of inference speed, computational cost, or the average number of reflection rounds required for success anywhere. This is a significant limitation for the practical application of the model. 2. Although the model performs excellently on KRIS-Bench, which is designed for high-difficulty reasoning tasks, its improvement on standard basic editing benchmarks is not significant. On GEdit-Bench, the score of Ours-reasoning is lower than that of Qwen-Image-Edit, which is also an open-source model. 1. How should the upper-bound (or lower-bound) editing time of this method be defined? 2. Should the paper report the model size and runtime efficiency of each method? 3. The paper only reports the overall quality on each whole benchmark; it would be better to show editing performance broken down by aspect, since image editing comprises many different sub-tasks. Fully human-written
ReasonEdit: Towards Reasoning-Enhanced Image Editing Models Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper targets reasoning-enhanced image editing models. Building on the most commonly used framework (a multimodal large language model plus a diffusion model, where the former is frozen), it introduces a new thinking–editing–reflection loop. Thinking uses the MLLM to understand the instruction, while reflection performs iterative self-correction. The whole model can be trained in an end-to-end manner. Extensive experimental results demonstrate the effectiveness of the proposed method. 1. Overall, this paper is well-organized and easy to read. The figures illustrate the idea, especially the thinking–editing–reflection loop, very clearly. 2. The experiment section is extensive. It conducts experiments on mainstream benchmarks and provides comparisons with enough baseline methods, so I think the experimental results are convincing. 3. The idea is simple yet reasonable; it makes the image-editing system behave more consistently with how humans edit. I can see why this RL-like framework is effective. 1. The idea is reasonable, but the novelty is only moderate. I do not mean to insist on novelty for its own sake, but I would be happy to see more discussion of the novelty or of any other interesting aspects of the model design. 2. The authors could add some discussion of the limitations of the proposed method, including some failure cases. I think it would be beneficial to the community. 3. I would also like to see some discussion of the training cost and the stability of the model. 4. I think the introduction could be further polished by adding more description of the model design and its performance. See weaknesses. Fully human-written
ReasonEdit: Towards Reasoning-Enhanced Image Editing Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces ReasonEdit, a reasoning-augmented framework for image editing that integrates a multimodal large language model (MLLM) as a Reasoner and a diffusion transformer (DiT) as a Generator. The framework builds a closed loop of thinking–editing–reflection to improve reasoning depth, instruction following, and iterative refinement. - Well-motivated and clearly structured framework: The motivation—to address the limited reasoning capability of existing MLLM-frozen editing pipelines—is clearly articulated. The decomposition into a Reasoner (MLLM) and Generator (DiT), coupled through thinking and reflection cycles, provides a coherent design that is easy to follow and conceptually elegant. - Innovative data construction tailored for reasoning-aware editing: The paper goes beyond standard instruction-image datasets by introducing Thinking Pairs and Reflection Triples. These datasets explicitly encode step-by-step reasoning and self-evaluation, offering a structured way to teach “how to think and correct.” - Systematic comparison of reflection pipelines: The paper compares three reflection strategies and reports consistent superiority of the last one, providing useful insight for future research. - Dataset generation and reproducibility insufficiently detailed: The Thinking and Reflection datasets rely heavily on automated labeling with advanced MLLMs, but the paper does not disclose which annotators or models were used, nor any quality control metrics (agreement rate, filtering thresholds, or bias mitigation). Without these details, reproducibility and data reliability remain unclear. - Evaluation over-relies on GPT-4/4o automatic scoring: Most benchmarks use GPT-based metrics (VIEScore, GPT-4o evaluation). While common, this raises the concern of co-training bias. The absence of human A/B testing or inter-rater agreement weakens claims of perceptual superiority. Cross-evaluator consistency would strengthen the argument. - Fairness and statistical significance not established: Comparisons in Table 1 include closed-source models under heterogeneous inference budgets (temperature, reasoning steps, reflection rounds). No standard deviation or significance test is reported. Multiple runs with fixed budgets are needed to support SOTA claims. - Marginal improvement on basic editing tasks lacks deeper analysis: The paper notes limited gain for low-level edits but does not provide per-category error breakdowns (e.g., color, geometry, style). An analysis of failure cases would clarify when reasoning adds value versus when it is redundant. - Computation and latency overhead not quantified: Reasoning and reflection introduce extra forward passes. Although training GPU counts are reported, inference-time cost, average reflection rounds, and latency–quality trade-offs are not provided. This information is critical for practical deployment. - Data scale and domain sensitivity analysis missing: The ratio of T2I vs. editing samples is listed but no study of how dataset composition affects performance is included. 
A sensitivity or ablation study would make the dataset design more instructive for replication. - How many reasoning/reflection steps are used during inference? What is the latency–accuracy trade-off curve? - Have human or cross-model (non-GPT) evaluations been performed to confirm robustness on KRIS-Bench? Fully AI-generated
Arboreal Neural Network Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a new architecture for tabular data that is based on the idea of converting decision trees to a particular variation of two matrix multiplications with a non-linearity. It proposes to initialize such trees with XGBoost and then finetune the thresholds and values. The paper also proposes a new credit scoring dataset, TabCredit. The method is tested on the new dataset and a simple benchmark constructed from pytorch-frame, claiming state-of-the-art performance. I think that looking into tree-structured models and combining their inner workings with DL models is an interesting pursuit. I had a great time digging through related work on the topic and think that there is something in this line of work that could lead to strong and interpretable models; this direction is currently underexplored. The dataset contribution also seems very timely and important, as there are not a lot of realistic testbeds for tabular machine learning methods readily available in academia. When done right this is a major contribution, so I encourage the authors to go through with it regardless of this review period's decision. At times the writing is very hard to make sense of. In the related work section, for example, I still can't make sense of how challenging instances in datasets are related to the pre-tuned default hyperparameter configurations (lines 91-93). The overall algorithm for constructing an "ArborCell" could also be improved, I believe (see the next point for examples). I believe the paper does not fully cover the relevant related work. It packages the idea of decision tree inference in matrix form into an "ArborCell", but this idea does not seem novel, and there are very similar existing approaches indeed: - https://arxiv.org/abs/1604.07143 (Neural Random Forests), which seems to do exactly what the authors propose here - https://blog.dailydoseofds.com/p/transform-decision-tree-into-matrix, a blog post that does a better job of explaining the same procedure used in the paper. Finally, I do not believe the results are solid, as there are some indications of poorly tuned baselines: for example, TabM (a recent SoTA model) performing on par with or sometimes worse than an MLP, or some large performance gains over XGBoost just from tuning the thresholds and leaf values (which may indicate a poorly tuned XGBoost in the first place). I also had trouble understanding some of the results, such as which datasets exactly were used (e.g., what dataset is CH? why is JA (Jannis) seemingly binary classification and not multiclass, as it is in the pytorch-frame benchmark?). Without code being available this is impossible to check further. I suggest the authors compare to an established and well-tuned set of baselines; for example, the TabArena benchmark publishes reference model scores in a CSV on GitHub:
```python
import pandas as pd

# Reference leaderboard scores published by the TabArena benchmark
df = pd.read_parquet("https://tabarena.s3.us-west-2.amazonaws.com/results/df_results_leaderboard.parquet")
```
Comparing the method to a correct set of baselines would increase the reliability of the results very much. See suggestions in weaknesses. Regarding the newly introduced dataset: does it have a dedicated train/val/test split which is time-based? Or is it different?
Can you provide more details regarding the evaluation and tuning setup on the new dataset? Fully human-written
Arboreal Neural Network Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes Arboreal Neural Networks (ArbNNs), a framework that converts trained decision trees into differentiable neural operators called ArborCells. Each ArborCell encodes a tree’s split features, thresholds, structure, and leaf values into four matrices/vectors, enabling end-to-end optimization while preserving the original tree semantics. 1. A differentiable “tree-as-layer” formulation (ArborCell) with explicit feature–node selection matrix $W$, split-threshold vector $f$, tree-structure / routing matrix $P$, and leaf-value vector $v$ that avoids path-probability products via one-shot matrix aggregation 2. An algorithm to parse trees into ArborCells and the ability to decompile trained ArborCells back to refined trees, maintaining symbolic interpretability 3. Competitive performance on public tabular tasks and consistent vintage-curve improvements over XGBoost on TabCredit under temporal drift 4. Introduction of TabCredit, an industrial credit-risk dataset with temporal splits to benchmark robustness and interpretability in realistic settings 1. The experimental section does not include comparisons with strong, modern baselines, especially tabular foundation models. 2. Limited gains over XGBoost in Table 2 relative to method complexity. On the reported datasets, the improvement over a well-tuned XGBoost baseline is small. 3. The paper evaluates on a relatively small set of benchmarks 4. Dependence on pretrained tree models for initialization. The core recipe assumes the availability of a strong GBDT (XGBoost/LightGBM) to parse into ArborCells. This limits applicability in settings where (i) trees are hard to train well, or (ii) one would like to learn the structure jointly with the downstream objective. The paper does not show a convincing “from-scratch ArbNN” alternative. 1. Can the authors add comparisons with recent tabular foundation models (e.g., TabPFNv2 [1], TabICL [2])? 2. Can the authors clarify the necessity of GBDT-based initialization? The current version treats “compiling from a strong GBDT” as a given prerequisite, but there is no experiment demonstrating whether ArbNN can still achieve comparable performance. 3. Can the authors provide more detail on scalability and serving? Since each ArborCell does a one-shot aggregation over all leaves, how does inference time and memory compare to the original XGBoost model. A brief complexity analysis or inference time comparison would make the method more practical. [1] Hollmann, Noah, et al. "Accurate predictions on small data with a tabular foundation model." Nature 637.8045 (2025): 319-326. [2] Qu, Jingang, et al. "Tabicl: A tabular foundation model for in-context learning on large data." ICML 2025 Fully AI-generated
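To make the ArborCell construction above concrete, here is a minimal sketch of how I read the tree-as-layer forward pass (my own PyTorch reconstruction, not the authors' code; the tanh/softmax soft-routing and the exact sign convention in $P$ are assumptions on my part):
```python
import torch

def arborcell_forward(x, W, f, P, v, tau=1.0):
    """Differentiable forward pass of one tree encoded as matrices.

    x: (batch, d) input features
    W: (d, n_internal) feature-selection matrix (column j is one-hot for the
       feature tested at internal node j)
    f: (n_internal,) split thresholds
    P: (n_internal, n_leaves) routing matrix: +1 if the leaf is in the right
       subtree of node j, -1 if in the left subtree, 0 if node j is off-path
    v: (n_leaves,) leaf values
    """
    s = x @ W                       # feature value routed to each internal node
    h = torch.tanh((s - f) / tau)   # soft left/right decision per node (close to -1 or +1)
    scores = h @ P                  # one-shot aggregation: the reached leaf scores highest
    weights = torch.softmax(scores / tau, dim=-1)  # soft leaf assignment
    return weights @ v              # (batch,) tree output

# Toy depth-2 tree over 3 features, as if parsed from a fitted GBDT.
x = torch.randn(4, 3)
W = torch.eye(3)                                     # node j tests feature j
f = torch.zeros(3)
P = torch.tensor([[-1., -1.,  1.,  1.],              # root: leaves 0,1 left / leaves 2,3 right
                  [-1.,  1.,  0.,  0.],              # left child separates leaves 0/1
                  [ 0.,  0., -1.,  1.]])             # right child separates leaves 2/3
v = torch.tensor([0.1, -0.4, 0.7, 1.2])
print(arborcell_forward(x, W, f, P, v))
```
Whether ArbNN uses a relaxation like this or hard routing with straight-through gradients is exactly the kind of detail that explicit pseudo-code in the paper would settle.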
Arboreal Neural Network Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper introduces Arboreal Neural Networks (ArbNN), a differentiable architecture that bridges gradient-boosted decision trees and neural networks. The key idea is to encode a pretrained XGBoost model into a neural form by translating its structure—feature splits, thresholds, and leaf values—into matrix operations that can be optimized end-to-end. This allows the model to retain the interpretability and inductive bias of trees while gaining the flexibility of gradient-based learning. Experiments on eight public tabular datasets and one large industrial credit dataset (TabCredit) show that ArbNN consistently matches or outperforms strong baselines. The paper proposes a novel and well-structured idea that combines the structural bias of decision trees with the flexibility of neural networks. The concept is intuitive yet original, and the formulation is clearly presented. The writing is clean and logically organized, making the technical details easy to follow. The experiments are thorough within the chosen scope and demonstrate consistent improvements over strong baselines such as XGBoost. - **Limited Benchmark Coverage** The evaluation includes only eight public datasets, which is considerably below the current standard in the tabular learning community. This narrow benchmark scope limits the credibility of the claimed generalization. Given the model's conceptual promise, it would be valuable to test ArbNN on a broader set of heterogeneous tabular tasks. - **Unclear Motivation and Overemphasis on Industrial Data** The paper's motivation is not fully convincing. Although the central idea—learning the structural bias of trees—is conceptually interesting, the claimed interpretability advantage remains unsubstantiated, as XGBoost provides only limited transparency. It appears that the work may be driven by a specific industrial objective, possibly related to the proprietary dataset used. If so, this motivation should be stated explicitly and the framing adjusted accordingly. Clarifying how the industrial requirements connect to the model's broader scientific contribution, and analyzing why previous DL models perform worse, would significantly strengthen the paper's coherence and impact. See Above. Heavily AI-edited
Arboreal Neural Network Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 8: accept, good paper Confidence: 1: You are unable to assess this paper and have alerted the ACs to seek an opinion from different reviewers. This paper addresses the lack of tree-structured inductive bias in deep neural networks for tabular data. To this end, the authors propose ArbNN, a novel architecture that reformulates decision trees into differentiable neural modules, enabling end-to-end gradient optimization while preserving interpretability. Extensive experimental results on multiple public benchmarks and a large-scale industrial credit risk dataset demonstrate that ArbNN consistently outperforms both traditional tree-based models and neural baselines, achieving superior accuracy and interpretability in tabular learning tasks. * This paper proposes the ArborCell structure to introduce the inductive bias of decision trees, and I am happy to see that the authors also provide visual comparisons to demonstrate the interpretability of the proposed method. * The authors discuss the related literature in considerable detail. * The paper is well written and easy to follow. 1. I am not an expert in tabular data, but I am curious about the convergence behavior of the proposed ArbNN. Could the authors provide training curves and compare them with other networks to illustrate convergence stability? 2. How does the training cost of the proposed method compare to other baselines? In addition, please evaluate the computational efficiency during inference, e.g., in terms of FLOPs, memory usage, and inference time. 3. The figures contain text that is too small to read clearly. It is recommended to increase the font size, use vector graphics for better clarity, and include a complete schematic diagram of the model architecture. 4. The authors do not provide code for reproducibility checks. My questions are in the Weaknesses section. Lightly AI-edited
VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models Soundness: 4: excellent Presentation: 3: good Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper aims to disrupt the downstream performance of LVLMs. Through a theoretical analysis of feasibility, VEAttack generates adversarial examples that significantly degrade multiple tasks while achieving notable computational efficiency over other attack approaches. The motivation is clear, and the introduction effectively conveys the idea. VEAttack provides a solid and effective paradigm for gray-box adversarial attacks on LVLMs, offering a detailed analysis and feasibility assessment for this approach. The effectiveness and efficiency are well demonstrated across several datasets and models. (1) Table 9 shows the attack performance on the Image-Text Retrieval task, which complements the evaluated tasks. However, another focus of these works [1, 2] is on transfer attacks between vision encoders, like ALBEF and CLIP-CNN, and it is recommended to include more demonstrations of this performance. (2) Eq. (5) gives two baselines, but the comparison with the second (L2) attack seems to be missing. (3) Based on observation 4, you perform a time comparison. However, I notice that 50 steps are used instead of the 100 specified in the setting. I suggest adding a time and effectiveness comparison under fully aligned settings. (4) There is a typo: “SRA” should be “SGA” in Table 9 following [1]. [1] Lu D, Wang Z, Wang T, et al. Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 102-111. [2] Gao S, Jia X, Ren X, et al. Boosting transferability in vision-language attacks via diversification along the intersection region of adversarial trajectory[C]//European Conference on Computer Vision. 2024: 442-460. (1) The results and trends in Figure 2 (b) are inconsistent with those in the Experiments. How are they obtained? (2) Figure 7 shows that VEAttack is effective even when the attack step is 30 or even 10. Is this different from other attacks, or do other attacks have the same characteristics? Lightly AI-edited
VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper focuses on addressing the vulnerability of Large Vision-Language Models (LVLMs) to adversarial attacks and proposes a novel gray-box attack method called VEAttack. Unlike existing white-box attacks that require full-model gradients and task-specific labels (resulting in high costs scaling with tasks) and black-box attacks that depend on proxy models (needing large perturbation sizes), VEAttack targets only the vision encoder of LVLMs. 1. Innovative attack setting: the attack focuses on the vision encoder in a gray-box setting. 1. Lack of citations and comparisons with papers highly similar to this paper. - An Image Is Worth 1000 Lies: Adversarial Transferability across Prompts on Vision-Language Models, ICLR 2024. - QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models, NAACL 2025. - InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models, arXiv. 2. Without any new insights: attacking the vision encoder to achieve attacks on the entire LVLM is not novel; it is quite intuitive. 3. Sensitivity to Vision Encoder Type: VEAttack's effectiveness heavily relies on the vision encoder of LVLMs. For LVLMs using non-CLIP vision encoders, the paper's experiments show relatively limited attack effects. Can this be called a gray-box attack? 4. Limited Transferability from Large-Scale Vision Encoders: Experimental results show that when using the vision encoders of large models (e.g., mPLUG-Owl2, Qwen-VL) as source models for transfer attacks, it is difficult to achieve successful attacks on other models. The paper only provides a preliminary empirical explanation but lacks in-depth analysis of the underlying reasons (e.g., differences in feature representation mechanisms between large and small models). 5. Narrow Scope of Evaluation Tasks: While the paper evaluates VEAttack on image captioning, VQA, and hallucination benchmarks, it does not test its performance on other important LVLM tasks such as image-text retrieval (only a brief ASR evaluation is provided) or visual grounding, which limits the demonstration of its generalizability. 6. Overstatement on Imperceptibility: The paper claims that VEAttack has high imperceptibility through visual inspection of perturbation images; in fact, all adversarial examples satisfy this. This makes absolutely no sense. 7. Lack of Defense Mechanism Research: The paper only proposes the VEAttack method but does not explore corresponding defense strategies to counter it. See weaknesses. Lightly AI-edited
VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models Soundness: 3: good Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes VEAttack, a simple yet effective gray-box attack on LVLMs. VEAttack generates adversarial examples by perturbing solely the image token features of the vision encoder. The paper conducts evaluations on multiple LVLMs across Visual Question Answering (VQA) and image captioning. 1. The paper is clearly structured and articulately written, ensuring ease of understanding. 2. The study demonstrates detailed analysis and keen observation. 3. Experiments were conducted across a varied set of datasets and models. **Clarity:** 1. The Introduction mainly introduces the figures in other chapters, but there is no place in the paper that introduces Figure 1. **Experiment:** 2. The authors clearly demonstrate their motivation by comparing with white-box and black-box attacks in Figure 2, but I have some confusion about their performance: since the white-box attack performs a full gradient backpropagation, why is the black-box attack so much more time-consuming than the others? Furthermore, the black-box attack performs poorly in the untargeted attack setting. Could this be related to the black-box attack targeting a specific sample? Can the authors verify this difference with other black-box attacks, such as M-Attack [1], and with a higher perturbation budget in the black-box setting? 3. How is the “performance after attack” in Figure 2 (b) evaluated? Why is its performance trend opposite to that in Table 1? If it’s a performance decrease, please describe it clearly and align it with Table 1. 4. The authors only provide subjective comparisons in Figure 8 to compare imperceptibility with the black-box attack. A quantitative evaluation, such as the L2 norm or CLIP score, would be more convincing. [1] Li Z, Zhao X, Wu D D, et al. A frustratingly simple yet highly effective attack baseline: Over 90% success rate against the strong black-box models of gpt-4.5/4o/o1 All concerns and questions are listed in the Weaknesses section. Lightly AI-edited
VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. VEAttack presents a gray-box adversarial attack targeting only the vision encoder of large vision-language models (LVLMs). It minimizes cosine similarity between clean and perturbed patch token features, bypassing the need for task labels or prompt access. This results in downstream-agnostic perturbations that break captioning, VQA, and hallucination benchmarks simultaneously—while remaining efficient and imperceptible. The paper provides theoretical grounding showing that patch-level perturbations propagate more strongly to the LLM via alignment layers than class-token perturbations. Extensive experiments show massive performance drops (up to ~95% in captioning, ~75% in VQA) and strong transferability across models and tasks, far exceeding existing white-box and gray-box baselines. 1. Realistic threat model: Attacks only the shared vision encoder—a genuinely deployable setting for LVLM vulnerabilities. 2. Theoretically principled: Clear justification that perturbing patch tokens yields stronger downstream disruption than class tokens. 3. Highly transferable: Single perturbation damages multiple tasks (captioning, VQA, hallucination). 4. Efficiency: 8–13× faster than prior multi-step attacks, with small ε (2–8/255). 5. Insightful analysis: Reveals internal LLM distortions, attention asymmetries (image vs. instruction), and the “Möbius band” paradox—robust encoders yield more transferable attacks. 1. Defense gap: No practical mitigation or robust-training strategy is explored beyond noting cost trade-offs. 2. Limited architecture diversity: Focuses mainly on CLIP-based encoders; broader evaluation would strengthen claims. 3. Transfer paradox underexplained: The Möbius effect is intriguing but remains a descriptive observation, not a mechanistic analysis. 4. Ethical discussion minimal: Needs clearer guidance on responsible release and safety implications. 5. The paper closely overlaps with the recently released work arXiv:2412.08108, which also investigates adversarial attacks on vision encoders of LVLMs and demonstrates downstream task-agnostic degradation across captioning and VQA. While the two studies share a very similar motivation and methodological framing, the current submission does not cite or discuss this concurrent work. Please refer to the Weaknesses section. Fully AI-generated
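For readers less familiar with the setup, here is a minimal sketch of the kind of encoder-only attack described in these reviews (an illustrative PGD loop of my own; the `vision_encoder(images) -> patch tokens` interface, the [0, 1] pixel range, and the step sizes are assumptions, and the actual VEAttack objective and schedule may differ):
```python
import torch
import torch.nn.functional as F

def ve_attack(vision_encoder, images, eps=8/255, alpha=1/255, steps=50):
    """Untargeted L_inf PGD that pushes perturbed patch-token features away
    from the clean ones by minimizing their cosine similarity."""
    vision_encoder.eval()
    with torch.no_grad():
        clean_tokens = vision_encoder(images)          # (B, N_patches, D), assumed interface

    delta = torch.empty_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        adv_tokens = vision_encoder((images + delta).clamp(0, 1))
        # mean cosine similarity over patch tokens; lower means more disrupted features
        loss = F.cosine_similarity(adv_tokens, clean_tokens, dim=-1).mean()
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= alpha * grad.sign()               # descend: minimize similarity
            delta.clamp_(-eps, eps)                    # project back into the L_inf ball
    return (images + delta.detach()).clamp(0, 1)
```
Because the loop only touches the vision encoder, the same perturbation carries over to every downstream task the LVLM performs, which is the downstream-agnostic property the reviews discuss.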
EvA: Evolutionary Attacks on Graphs Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes EvA (Evolutionary Attack), a black-box, gradient-free attack on graph neural networks (GNNs) that directly solves the discrete edge-perturbation problem using a tailored genetic algorithm. Unlike dominant gradient-based attacks (PRBCD, LRBCD), EvA does not relax the adjacency matrix to a continuous space and can therefore optimize non-differentiable objectives such as accuracy, certified ratio, and conformal coverage. The paper convincingly argues that gradient-based structure attacks are fundamentally misaligned with the discrete problem, e.g. gradients are local, ignore edge interactions, require relaxations, and can be obfuscated. The paper's empirical evaluation is comprehensive and the results are significant, not just marginal. EvA drastically outperforms all baselines, including the SOTA PRBCD, across multiple datasets. The most significant weakness, which the authors acknowledge in the limitations, is the high query complexity. Genetic algorithms are inherently query-intensive. Could the authors provide a direct comparison of the total number of forward passes (queries) used by EvA versus the number of forward/backward passes used by PRBCD to achieve the results in Table 1? Fully human-written
EvA: Evolutionary Attacks on Graphs Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces EvA, a black-box adversarial attack method for graph neural networks that uses a genetic algorithm to perturb the graph structure. Unlike gradient-based attacks, EvA directly optimizes discrete, non-differentiable objectives (such as classification accuracy) without relying on gradient approximations or domain relaxation. EvA highlights the limitations of gradient-based methods and establishes evolutionary search as a powerful, underexplored paradigm for adversarial attacks on graphs. > 1. The paper's primary strength is its successful revival of evolutionary search, a paradigm previously dismissed as inferior. It demonstrates that with careful design, this approach can decisively outperform state-of-the-art gradient-based methods, challenging a core assumption in the field and opening a new direction for research. > 2. The work is supported by comprehensive experiments and thorough ablation studies that validate every design choice. > 3. The significance of the work is greatly amplified by applying EvA to novel objectives beyond accuracy, such as breaking robustness certificates and conformal predictions. > 1. The paper rightly notes the high query complexity as a limitation, but does not conduct a rigorous quantitative trade-off analysis between computational cost and performance improvement. In my opinion, this is essential for the practical evaluation of the method. > 2. The paper presents compelling empirical evidence for EvA's superiority over gradient-based methods. However, the explanatory depth for this success seems limited, primarily resting on the well-established notion of gradient unreliability. A more profound analysis examining whether EvA's advantage stems from superior navigation of non-convex loss landscapes or effective exploitation of higher-order edge interaction effects would significantly strengthen the work and provide foundational insights for future research. For more details, please refer to the Questions section. - While EvA demonstrates that genetic algorithms can achieve considerable effectiveness in conducting adversarial attacks, I still feel I don't fully grasp its fundamental mechanisms. - In Section 3, the paper states: "We hypothesise that EvA, leveraging the exploratory capabilities of GA, can explore the search space more effectively and avoid local optima, while PRBCD gets stuck." "Exploratory capabilities" and "avoid local optima" are essentially standard claims for all GAs, bordering on being tautological. The paper seems to fail to specify how exactly the exploration capability of EvA manifests in the specific context of discrete graph perturbation spaces. However, the analysis of perturbation patterns might provide the most relevant clues. As shown in Appendix D.1 "Label diversity", this section compares the statistical characteristics of perturbation solutions found by EvA and PRBCD. The analysis reveals that EvA's perturbation connections are more uniformly distributed across nodes with different labels, and demonstrate a greater tendency to connect to high-degree nodes and high-margin nodes. 
While this analysis is highly valuable, it primarily describes "what the solution looks like" rather than "how this solution was progressively discovered." It would be enlightening if the authors could demonstrate how the genetic algorithm guides the search process, which could potentially enhance the paper's readability and conceptual clarity. Lightly AI-edited
EvA: Evolutionary Attacks on Graphs Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces EvA (Evolutionary Attack), a new framework for edge-based adversarial attacks on GNNs through a discrete evolutionary search rather than gradient-based optimization. EvA formulates the attack as a genetic algorithm that evolves a population of perturbation candidates, where each candidate encodes a small set of edge flips. The algorithm iteratively applies Selection, Crossover, and Mutation. They also design a divide-and-conquer strategy to handle large graphs. Because it only requires model evaluations, not gradients, EvA is model-agnostic and applicable to black-box settings. EvA consistently achieves larger drops in classification accuracy than gradient-based attacks. This paper uses a discrete evolutionary search method for graph edge-based adversarial attacks on GNNs, without the need for gradients, making it model-agnostic and applicable to black-box settings. The evolutionary framework can also attack non-differentiable objectives. They show strong empirical performance compared with gradient-based methods. 1. The proposed evolutionary search is highly heuristic and not guaranteed to find globally optimal perturbations. Many of its design choices (mutation rate, crossover scheme, and selection strategy) lack principled justification or ablation analysis. It remains unclear which components are critical for performance and how sensitive the attack is to hyperparameter choices. 2. The algorithm is difficult to follow from the current text presentation. Including clear pseudo-code or an algorithm box would greatly improve readability and reproducibility. 3. The scalability and efficiency analysis is underdeveloped (e.g., runtime, total queries, memory usage). Currently there are no statistics showing the time and memory consumption of PRBCD and EvA on the various datasets. 4. Using evolutionary search is not entirely new. The main novelty here lies in engineering and scaling rather than in conceptual advances. The paper would benefit from a clearer discussion of how its evolutionary design specifically differs from prior heuristic or search-based attacks. I wonder how EvA performs on larger graphs like arXiv, Products, and Papers 100M [1]. [1] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec. Open Graph Benchmark: Datasets for Machine Learning on Graphs. 2020. Moderately AI-edited
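Since the algorithm is evidently hard to reconstruct from prose alone, here is a toy sketch of the kind of evolutionary edge-flip search being described (entirely my own reconstruction; the black-box `fitness` oracle is assumed, and EvA's actual selection/crossover/mutation operators and divide-and-conquer partitioning are more elaborate):
```python
import random

def evolve_edge_attack(candidate_edges, budget, fitness, pop_size=64,
                       generations=200, mutation_rate=0.1, n_elite=16):
    """Toy genetic search over sets of edge flips (black-box, gradient-free).

    candidate_edges: list of (u, v) pairs eligible for flipping
    budget:          number of simultaneous edge flips per candidate solution
    fitness:         callable scoring a list of flips, to MAXIMIZE
                     (e.g., accuracy drop of the victim GNN on target nodes)
    """
    population = [random.sample(candidate_edges, budget) for _ in range(pop_size)]
    for _ in range(generations):
        # selection: keep the highest-fitness candidates as parents
        population.sort(key=fitness, reverse=True)
        elites = population[:n_elite]
        children = []
        while len(elites) + len(children) < pop_size:
            p1, p2 = random.sample(elites, 2)
            # crossover: a child draws its flips from the union of both parents
            pool = list(set(p1) | set(p2))
            child = random.sample(pool, min(budget, len(pool)))
            # mutation: occasionally resample individual flips from the full candidate set
            child = [random.choice(candidate_edges) if random.random() < mutation_rate else e
                     for e in child]
            children.append(child)
        population = elites + children
    return max(population, key=fitness)
```
Even a schematic at this level, together with the precise operator definitions and the per-generation query count, would address both the readability and the query-complexity concerns raised across the reviews.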
EvA: Evolutionary Attacks on Graphs Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a genetic adversarial attack. Experiments comparing it with a few gradient-based attacks are conducted, along with many ablation studies. **1.** Experiments covering various aspects are conducted to validate the method's performance, with abundant figures for illustration. **1.** The presentation of the paper is poor. There is neither a formulation nor an algorithm written out, nor any figure that fully shows the pipeline of the proposed attack. The description of the method is scattered across Sections 3 and 4 without a clear introductory logic; it reads as a concatenation of descriptive sentences and paragraphs. The methodology description relies heavily on comparison with a previous baseline, "PRBCD", which is likewise never formally introduced, making this part harder to follow. There are also too many verbal definitions which lack clear expression and are used only once. In all, the writing makes it hard for me to get a full view of the proposed method in detail, and I recommend the authors rewrite and reformulate the paper entirely. **2.** The experiments lack a full comparison with other works. The paper only compares against a few gradient-based attack baselines, excluding experimental comparison with all other kinds of attacks by simply stating that gradient methods are "SOTA" and "outperform others". Considering that the gradient-based baselines include only one from 2023 while all others predate 2020, and that the proposed method itself belongs to the supposedly "beaten" genetic family, this lack of newer baselines and of baselines from other attack types is unacceptable. Please see weaknesses. Fully human-written
Learning from Few Samples with Language-Model Guidance Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper explores how large language models (LLMs) can inject domain knowledge into machine learning models trained on extremely small datasets (n = 5–50) in high-dimensional settings (e.g., omics data). The authors propose four types of LLM-elicited constraints: feature selection (Z), group constraints (G), sign constraints (S), and inequality constraints (I), which together define a convex feasible region for linear model parameters. These constraints are derived via LLM prompting or human experts and integrated into logistic regression training via convex optimization. Experiments on three benchmarks... Across four clinical datasets (SSI, COVID, cfRNA, Labor), models using LLM-derived constraints significantly outperform unconstrained or correlation-based baselines, often achieving comparable or superior performance using only 5–10 samples. 1. The paper introduces a systematic and interpretable framework to incorporate LLM guidance as mathematical constraints on model parameters, rather than as data augmentation or feature embedding. Encoding domain priors as convex constraints provides transparent interpretability, robust optimization, and low variance across train/test splits. 2. On small-sample (n = 5–10) omic datasets, the LLM-constrained models improve AUC by up to 20% over standard baselines and even outperform human experts. Each constraint type (Z, G, S, I) is evaluated individually and jointly, showing complementary benefits and robustness. 3. The inclusion of human experts and correlation-based baselines provides strong evidence for the unique added value of LLMs. 1. The quality of constraints heavily depends on prompt phrasing and model choice (GPT-4o here). While the paper claims robustness, no systematic prompt sensitivity or LLM ablation analysis is provided. 2. While the authors argue that constraints improve sample efficiency, there is no formal generalization bound or bias–variance decomposition to quantify when or why the constraints help (or harm) performance. 3. The assumption that an LLM can meaningfully rank thousands of genes/proteins or infer sign relationships from names alone may be unrealistic in more complex domains. From the experiments, we can also see this on cfRNA and SSI, where the method with all four constraints is not optimal. This highlights the heterogeneity across these different domains and datasets. More in-depth analysis, e.g., of the data distributions or the relevant literature, should be given of when and why the constraints do not help, beyond just speculation. 4. Although justified by small n, the method’s generalization to nonlinear models (e.g., kernel, deep, or Bayesian) is unexplored. The approach may not scale to broader ML tasks. 1. How sensitive are the results to incorrect or contradictory LLM-specified signs or groupings? Can the method detect and relax inconsistent constraints? Is there no possibility of conflicting constraints? 2. How does the optimization scale with the number of features (d > 30,000)? Any empirical runtime comparison? The convex optimization with multiple constraints may become expensive, though this is not analyzed. 3. 
Could combining human and LLM priors iteratively improve constraint quality and interpretability? How good is the LLM at giving correct constraints and at interpreting the context? Fully AI-generated
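To illustrate what the constrained estimator discussed above looks like in practice, a minimal CVXPY sketch (my own toy reconstruction; the constraint-index arguments, hyperparameters, and helper signature are made up, class-balancing weights are omitted, and note that the importance-ordering constraints are only convex once signs are pinned down):
```python
import cvxpy as cp
import numpy as np

def fit_constrained_logreg(X, y, zero_idx, pos_idx, neg_idx, groups, order_pairs, lam=1.0):
    """L2-regularized logistic regression with LLM-elicited convex constraints.

    zero_idx:        feature indices forced out of the model (Z)
    pos_idx/neg_idx: sign constraints on coefficients (S)
    groups:          lists of feature indices constrained to share one weight (G)
    order_pairs:     (i, j) pairs with w_i >= w_j, assuming both coefficients are nonnegative (I)
    """
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    z = X @ w + b
    # logistic negative log-likelihood for labels y in {0, 1}
    loss = cp.sum(cp.logistic(z) - cp.multiply(y, z))
    constraints = []
    if zero_idx:
        constraints.append(w[zero_idx] == 0)        # Z: drop unselected features
    if pos_idx:
        constraints.append(w[pos_idx] >= 0)         # S: expected positive association
    if neg_idx:
        constraints.append(w[neg_idx] <= 0)         # S: expected negative association
    for g in groups:                                # G: shared weight within each group
        constraints += [w[g[k]] == w[g[0]] for k in range(1, len(g))]
    for i, j in order_pairs:                        # I: importance ordering (given known signs)
        constraints.append(w[i] >= w[j])
    prob = cp.Problem(cp.Minimize(loss / n + lam * cp.sum_squares(w)), constraints)
    prob.solve()
    return w.value, b.value
```
A sketch like this also makes the scaling question concrete: with d > 30,000 the number of variables and constraints grows linearly in d, so reporting solver wall-clock times would directly answer the runtime question above.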
Learning from Few Samples with Language-Model Guidance Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose a new framework for learning classifiers from extremely small, high-dimensional datasets (a common situation in clinical omics), by eliciting domain knowledge from LLMs (or experts) and encoding it as constraints in linear models. It studies four complementary constraint types: (i) feature selection, (ii) grouping related features to share weights, (iii) coefficient sign constraints, and (iv) inequality constraints that prioritize “important” features. All constraints are composed into a convex, class-balanced, L2-regularized logistic formulation solvable in CVXPY, enabling efficient optimization even with n=5. Across four clinical tasks (SSI, COVID-19 severity, cfRNA for preeclampsia, and preterm labor), LLM-constrained models substantially improve generalization in the low-n regime. With only five training patients, the full constraint set (Z+G+S+I) often beats baselines trained on 10X more data, yielding gains exceeding 20% AUC in some settings. Ablation studies indicate each constraint contributes complementary signals; Z+G+S+I is consistently strongest at n=5–10, though benefits attenuate as n grows. The approach appears to be robust to regularization strength and largely stable across different LLMs, suggesting practicality without heavy hyperparameter tuning. This paper introduces a neat and easy solution for modeling complicated tasks when there are very few samples available. For example, for making a diagnosis for a very rare disease with few training samples. - The authors extract domain knowledge from LLMs as explicit hypothesis-class constraints (Z, G, S, I) for linear models, making priors concrete and testable without any need for manually supplied domain knowledge. - Constraints are cleanly combined, resulting in a convex optimization problem that can be reliably solved. - Across all four datasets, with n=5–10, Z+G+S+I wins, sometimes matching/exceeding unconstrained models trained on 10X more samples and showing >20% AUC improvements in some cases. - The optimization does not seem to be very sensitive to hyperparameters. - The proposed framework has less performance variability vs. the baselines and notably lifts the lower tail of test AUC at n=5, improving the reliability of any single train/test split. - It outperforms correlation-derived constraints (even when computed with all data) and random selections, showing the value beyond naive statistics. - The results are competitive with (and often beyond) a human expert: LLM-elicited constraints consistently surpass an unaided/online-aided domain expert across sizes and constraint sets. - The method is formulated only for binary, linear classifiers; no nonlinear kernels, trees, or deep models are explored beyond brief low-n notes, so it’s unclear how to port these constraints to richer hypothesis classes. - Benefits shrink at n=20–50, and on cfRNA the unconstrained Ridge overtakes constrained models at larger n, suggesting the approach is most useful only in the extreme low-n regime. - Only four omics tasks (three with same-study holdouts; an independent external cohort only for COVID). 
Generalization to other domains, assay types, or multi-site settings isn’t established. This is a major issue since omics datasets have significant confounder (batch) effects (lab, assay, operator, time of processing, etc.), and without mitigating such factors, generalization to new batches will be poor. - In small-n experiments, $\lambda$ is fixed to 1e5, selected on COVID and reused elsewhere without CV; this is pragmatic, but raises concerns about hidden tuning bias and robustness beyond the reported sweep. - The paper shows LLM constraints can disagree with empirical correlations; while performance often holds up, there’s no formal safeguard if priors are systematically wrong or adversarial. - Results emphasize AUC; there is little about calibration or PPV/NPV at clinically relevant thresholds. - The proposed setup focuses on situations where we have very few labelled examples, e.g., for rare diseases. What about unlabelled samples? In the cfRNA case, for example, we could have hundreds of thousands of cfRNA profiles without clinical labels. The proposed framework would be significantly more useful and practical if it could use unlabelled data plus a few labelled samples. - The “batch-iness” of omics data is a real challenge and problem for developing any diagnostics model. There are many features (large d) and few samples (small n), so models can easily overfit, even on held-out data from the same set. How would the authors prevent that? I think it would be helpful to see the PCA plots of the data used in training and testing. Moreover, training linear models on the top 10 PCs can be a good baseline. - Considering what was discussed in the previous point, I find the lack of meaningful performance improvement from n=5 to 50 in three out of four datasets (all except “Labor”) concerning. What is the reason? - What are the top 10 features (by weight magnitude) for the n=5 case across the four datasets? I believe the authors could unpack the features a bit more. How can a model trained with only 5 points predict COVID severity with an AUC of 0.78? How do these top features compare to baseline models? - Why are the baseline models outperforming the proposed framework on the cfRNA dataset for larger training sizes? I think, apart from the need for batch effect correction, another step of feature conditioning/pre-processing needs to be performed, to remove low-variance, low-expression, low-frequency features. - Why is correlation used in Figure 3? If $x_i$ is continuous and $y$ is binary, why not report AUC or a KS-test p-value? Also, why is `ylim = [0, 5]` for correlation? - Why is the “Random Selection” performance (Table 2) so high (76.1 vs 78.1 for LLM) for COVID? Fully human-written
Learning from Few Samples with Language-Model Guidance Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors investigate the effectiveness of several LLM-informed regularization strategies for linear models on small-sample, high-dimensional datasets from clinical settings. In particular, they consider four different strategies: (i) selecting features based on LLMs; (ii) grouping features with LLMs and imposing an equal-weight constraint; (iii) imposing sign constraints on weights based on LLM predictions of expected correlation; and (iv) imposing weight magnitude ordering constraints based on LLM-based feature importance scores. In other words, they combine simple rules for constraining the weights with LLM-generated notions of feature importance and relevance. They evaluate their approach on four clinical datasets, finding that it leads to significant performance improvements over other data-driven and LLM-driven baselines (mostly feature selection methods) and in comparison to settings where the same kinds of constraints are designed by human experts. - The proposed approach shows good empirical performance over several data-driven and LLM-driven baselines in extremely small-sample settings (e.g., n=5) on the datasets considered. - The experimental protocol is generally valid and comprehensive, accounting for various sources of randomness and variability (especially important in small-data settings) and covering several baselines. - The writing is generally clear and easy to follow. - The technical novelty of the proposed approach is limited. It combines existing LLM-based feature selection approaches (e.g., LLM-Select) with relatively simple convex constraints (e.g., sign constraints, group-wise regularization) on the model parameters. In that sense, the proposed work reads as a straightforward and incremental extension of prior works in this space. - Evaluations are limited to a relatively small number of datasets, with all of them being from the clinical domain. While the small-sample, high-dimensional setting is well-motivated in clinical settings, evaluations on a larger collection of datasets with coverage over other domains would improve the generalizability of findings. - In comparisons to human-provided constraints, I think the results should be taken with a grain of salt. I would expect these performances to vary across different experts, due to factors including but not limited to differences in levels of experience, possible subjectivity in determining what feature should be deemed "important", etc. While it is an interesting result that LLM-driven constraints can perform on par with constraints designed by a delegated human expert, without additional validation with a greater number of human subjects, I would not go as far as saying "an LLM can surpass the ability of a domain expert in designing constrained model". - The paper lacks any discussion of possible failure cases for the proposed methods. While I would expect the most capable (often proprietary) LLMs to generally output reasonable constraints, they can always hallucinate and lead to suboptimal performance, especially in domains where the models do not have sufficient parametric knowledge. 
It would be good to explicitly discuss some of these issues as limitations. - The authors mention that the proposed method is robust to prompt variation (in the last paragraph of Section 3), but there is no explicit demonstration and/or discussion of how this was tested and how variable the downstream performance was in response to different choices of the prompt. As LLMs are generally highly sensitive to prompting decisions, this also seems to be an important aspect to touch upon. - All of the tables in the main text are quite dense and hard to visually parse; the presentation of results can be improved for better readability. - In Table 1, in the "zero constraint (Z)" setting, why is it that this approach often performs much better than other LLM-based methods like LLM-Select? Modulo the differences in prompting, shouldn't applying only Z reduce the proposed approach to LLM-Select and lead to similar numbers across the board? Fully human-written
VUGEN: Visual Understanding priors for GENeration Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The authors propose leveraging visual understanding priors for both visual perception and generation tasks. They transform a high-dimensional semantic latent space into a low-dimensional, tractable distribution that preserves essential visual information. A pixel diffusion model is then trained to generate images from these latent representations. Experimental results demonstrate that, by utilizing a unified semantic visual representation, the method achieves superior image generation performance. 1. The proposed method is intriguing and demonstrates that generating semantic visual latent features can lead to improved image generation performance. 2. To make the generation process feasible, the authors introduce a dimension reduction module, which is a simple yet effective design. 3. The paper is well-written and easy to follow. 1. There is no quantitative comparison between the proposed pixel decoder and other mainstream tokenizers, such as VAE and latent diffusion decoders. The authors should include relevant metrics (e.g., rFID) to assess the performance. It remains unclear how well the proposed pixel decoder performs. If its results are significantly worse, it would suggest substantial information loss when generating images from the semantic latent space. 2. The image generation module in the mixture-of-transformers contains only 0.2B parameters. This relatively small capacity raises concerns about the reliability and persuasiveness of the results. I encourage the authors to increase the model size for the image generation component to validate the scalability and robustness of the approach. 3. The dataset S320M is not widely adopted in the community, particularly for unified understanding and generation tasks. The authors should clarify their motivation for using this dataset. Furthermore, the use of such datasets may contribute to the relatively weaker performance on more modern benchmarks, such as GenEval and DPGBench, which diminishes the overall persuasiveness of the work. Please refer to the Weakness section. Lightly AI-edited
VUGEN: Visual Understanding priors for GENeration Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes VUGEN, a framework that leverages pretrained visual understanding embeddings (from a frozen VLM) as priors for image generation. A learnable dimension reducer is introduced to map the high-dimensional understanding space to a lower-dimensional latent space, making it easier to model with a generative flow-matching network. The model then decodes the latent space into images using either a lightweight pixel diffusion decoder (PDD) or a latent diffusion decoder (LDM). Experiments on StockMix and ImageNet show that VUGEN outperforms VLM-based generation baselines. The idea of reusing pretrained visual understanding embeddings as generative priors is both intuitive and meaningful, offering potential to bridge understanding and generation tasks in multimodal modeling. The use of a learnable dimension reducer to create a smoother and more compact latent space is technically sound and empirically validated. Additionally, the lightweight and fast PDD provides a practical alternative to standard latent diffusion decoders. - This work primarily focuses on generation, with an emphasis on training and evaluating on generation tasks. Therefore, it should be compared to state-of-the-art generative models, rather than unified multimodal models (UMMs). Additionally, even when compared to UMMs, the generation performance of VUGEN is not particularly outstanding. - If the paper aims to argue for VUGEN as a unified multimodal model (UMM), it falls short in terms of unification. Also, evaluations on visual understanding and reasoning tasks are necessary to fully justify its claim as a UMM. - The core idea of the framework is to leverage understanding priors for generation, so further clarification and more analyses are necessary regarding the contribution of these priors, theoretically, qualitatively and quantitatively. - The motivation is not clearly articulated enough in the abstract and introduction. The claim that generating in the understanding latent space is challenging requires more theoretical and empirical analyses. - Clarifying the computational cost and scalability of the two-stage training would facilitate the cost-benefit comparisons. Please see weaknesses. Lightly AI-edited
VUGEN: Visual Understanding priors for GENeration Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes VUGEN, a two-stage approach to equip a unified VLM with image generation by directly leveraging its native visual understanding features. Stage 1 learns a dimension reducer g that projects the high-dimensional understanding embeddings z from the VLM’s vision encoder into a reduced, tractable space ˜Z optimized jointly with a pixel decoder d (either a finetuned LDM or a lightweight pixel-space diffusion decoder, PDD). Stage 2 freezes g and trains a generative head (Mixture-of-Transformers tower) via rectified flow matching to sample ˜z ∼ P(˜z|c), then decodes to pixels x via d(˜z). The key idea is to align generation with the model’s understanding priors, avoiding representation mismatch from separate (VQ-)VAE tokenizers and the complexity of bridging to external diffusion models. Empirically, VUGEN improves prompt-following and fidelity on COCO (models trained on StockMix): DPG-Bench 71.17→74.32 and FID 11.86→9.06 vs. a REPA-aligned decoupled baseline; and outperforms baselines trained on ImageNet (FID 5.40→4.15, Density/Coverage up). Ablations show: (i) directly generating in Z is hard; a jointly learned reducer outperforms PCA; (ii) pixel-space diffusion decoder (PDD) achieves comparable reconstructions to LDM but with far fewer params (48M vs. 794M) and higher throughput; (iii) a reduction ratio r≈16 balances generative tractability and decoding difficulty. Understanding performance is preserved at base VLM levels (Table 3). - Clear, well-motivated design: samples in a reduced version of the VLM’s understanding space, preserving alignment between understanding and generation; strong rationale and ablations (PCA vs. learned reducer; r trade-off). - Competitive results with careful baselines sharing architecture/data/training: improves FID/DPG/GenEval across StockMix→COCO and ImageNet settings; analysis over guidance scale clarifies realism–consistency trade-offs. - Practical decoder findings: pixel-space diffusion decoder rivals LDM while being far smaller and faster; avoids dependence on VAE latents, reducing complexity. - Preserves understanding: retains PLM-1B’s comprehension performance; shows a path to unified MLLMs without decoupled vision tokenizers. - Data provenance and comparability: the main training uses a mixed StockMix (YFCC100M, CC12M, and a proprietary S320M recaptioned with Florence-2). While baselines share this setup, cross-paper comparability is limited; clearer licensing/availability statements for S320M would help. - Limited scope of understanding preservation: while Table 3 suggests parity on standard benchmarks, it would be useful to test for regressions in more fine-grained or long-context visual reasoning after generative training, especially under higher r. - Generative scaling and distribution shift: results are at 256×256; how do trends hold at higher resolution and for out-of-domain prompts? Also, how stable is training when swapping in different base VLM encoders (e.g., DINOv2 or SigLIP variants)? - Theoretical underpinnings: the choice of rectified flow matching is reasonable; adding a brief justification vs. 
diffusion loss and showing a small apples-to-apples comparison would strengthen claims of sample efficiency. - Decoder choice vs. alignment: PDD and LDM are “similar” in reconstructions; however, prompt alignment (DPG/GenEval) contributions per component (reducer vs. generator vs. decoder) could be clarified via controlled ablations. - Does training the reducer jointly with PDD ever reduce understanding robustness? Can you report pre/post shifts on more challenging understanding tasks (e.g., MMMU categories requiring fine localization)? - How sensitive are results to r and g’s architecture across datasets? Is there a principled way (e.g., information bottleneck or spectral metrics) to set r per vision encoder? - Could you show a small table isolating “generate-in-Z” vs. “generate-in-˜Z” under the same compute, to quantify the tractability gap, beyond anecdotal FID>200? - For external comparability: do you have COCO metrics when training only on public data (e.g., YFCC100M+CC12M without S320M), to contextualize gains relative to models trained purely on public datasets? - PDD details: you mention distillation to a single-step decoder; can you quantify speedups at sample time for end-to-end T2I, not just reconstruction throughput? Fully AI-generated
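To ground the two-stage design discussed in the VUGEN reviews above, here is a minimal sketch (an assumption-laden illustration, not the authors' code) of how a rectified flow-matching head could be trained on reduced understanding latents produced by a dimension reducer; the module names and dimensions are hypothetical.

```python
# Minimal sketch (not the authors' implementation) of Stage-2 training as described
# in the reviews: a velocity-prediction head trained with rectified flow matching on
# reduced understanding latents z_tilde = g(z). Names and dimensions are hypothetical.
import torch
import torch.nn as nn

class VelocityHead(nn.Module):
    def __init__(self, latent_dim=64, cond_dim=1024, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, x_t, cond, t):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def rectified_flow_loss(head, z_tilde, cond):
    # Linear interpolation path between noise and the target latent;
    # the head regresses the constant velocity (z_tilde - noise).
    noise = torch.randn_like(z_tilde)
    t = torch.rand(z_tilde.size(0), 1)
    x_t = (1 - t) * noise + t * z_tilde
    target_v = z_tilde - noise
    pred_v = head(x_t, cond, t)
    return ((pred_v - target_v) ** 2).mean()

# Hypothetical usage: z_tilde would come from the frozen dimension reducer g(z),
# cond from the VLM's text/context embedding.
head = VelocityHead()
loss = rectified_flow_loss(head, torch.randn(8, 64), torch.randn(8, 1024))
loss.backward()
```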
VUGEN: Visual Understanding priors for GENeration Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces VUGEN, a unified vision language model for multimodal understanding and generation. To equip a pre-trained VLM with image generation capability, VUGEN transforms the high-dimensional latent space of the VLM's native vision encoder into a lower-dimensional one to simplify the VLM's training for generation while preserving visual information. - The paper is well-written and easy to follow. - The generation experiments are done on diverse datasets and evaluated with comprehensive metrics. - The ablation experiments on the dimension reduction ratio provide valuable insights into the trade-off between the generation task and the reconstruction tasks. - Reconstruction metrics like PSNR, SSIM, LPIPS, and rFID are not reported to compare the proposed pretrained vision encoder + dimension reducer + pixel decoder with existing autoencoders (e.g., SD-VAE and Flux-VAE) used in other diffusion- or autoregressive-based image generation models. - The paper only considers two pixel decoder designs, LDM and PDD, both of which are diffusion-based models. Another simple but important baseline would be a standard convolutional decoder commonly used in autoencoders like SD-VAE and Flux-VAE. - The image understanding task uses the original vision encoder features, while the image generation task uses the compressed representations from the dimension reducer, resulting in a gap between the two spaces. What is the advantage of this unified design compared to decoupled vision encoder baselines if semantically aligned VAEs like VA-VAE [1] are used? - The baselines reported in Table 1 for ImageNet generation appear relatively weak. The current state-of-the-art FID score on ImageNet 256x256 is below 2. Also, only the SD3 VAE is considered in the decoupled vision encoder baseline. Exploring autoencoders like VA-VAE [1], which incorporate semantic information, would provide a more complete comparison. - As VUGEN introduces a separate module for image generation, it would be helpful to clarify its advantages over previous methods like LaVIT [2] and Emu [3], which introduce another diffusion model for image generation based on VLM output features. [1] Taming Optimization Dilemma in Latent Diffusion Models [2] Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [3] Emu: Generative Pretraining in Multimodality Please refer to the Weaknesses section. Fully human-written
PRISON: Unmasking the Criminal Potential of Large Language Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces PRISON, a novel evaluation framework designed to assess the "criminal potential" of Large Language Models (LLMs) in complex, multi-turn social interactions. The authors define criminal potential as the risk of an LLM generating harmful behaviors like deception, manipulation, or blame-shifting in adversarial contexts that could facilitate unlawful activities. The paper's main contributions are (1) the PRISON framework itself, as a new benchmark for a critical and understudied safety dimension, (2) a quantification of LLMs' "criminal potential", and (3) the identification of the significant gap between an LLM's ability to generate and its ability to detect such behaviors. These contributions are timely, novel, and significant for the AI safety community. 1. This work moves beyond traditional, static safety evaluations (e.g., simple harmful Q&A, abstract moral dilemmas) to tackle the much more complex and realistic threat of LLMs participating in deceptive, multi-turn social interactions. The "criminal potential" concept is a valuable and well-defined framing of a risk that is highly relevant as LLMs are integrated into agentic systems. This paper addresses a clear and important gap in the current safety literature. 2. The PRISON framework is thoughtfully constructed. Grounding the five-trait taxonomy in established psychometric instruments from criminal psychology provides a strong theoretical foundation that is often lacking in other safety benchmarks. Furthermore, the tri-perspective (Criminal, Detective, God) evaluation design is an intelligent and effective method for simultaneously measuring the expression of harmful traits and the detection of them within a unified system. 1. The 44% "Objective Trait Detection Accuracy" (OTDA) is a headline-grabbing result. However, its significance is difficult to interpret without more details on the "Detective" agent's task. 2. Regarding the "God" perspective validation: A Cohen's Kappa of 0.65 is "substantial" but not "near perfect." Could you provide a qualitative breakdown of the disagreements between your human annotators and the GPT-4o judge? Are there specific traits (e.g., "Psychological Manipulation" vs. "False Statements") that are more ambiguously defined or harder for the LLM to judge correctly? 3. The scenario generation from films is a clever way to source complex social dynamics. However, film narratives are inherently dramatic and conflict-driven. How do you account for the potential domain mismatch between these "dramatized" scenarios and more mundane, real-world criminal interactions? Is it possible this choice of data source biases the "Criminal Traits Activation Rate" (CTAR) upwards? See above Fully human-written
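To make the CTAR/OTDA figures discussed in the PRISON reviews concrete, here is a small sketch of one plausible way such rates could be computed from per-scenario trait annotations; the exact definitions and record fields are assumptions, not taken from the paper.

```python
# Sketch of plausible definitions (assumed, not from the paper) for the two
# PRISON-style metrics discussed above: a trait-activation rate over generated
# responses and a detection accuracy against reference ("God"-perspective) labels.
def criminal_traits_activation_rate(records):
    # records: list of dicts like {"expressed_traits": {"deception", ...}, ...}
    activated = sum(1 for r in records if r["expressed_traits"])
    return activated / len(records)

def objective_trait_detection_accuracy(records):
    # Each record pairs the detective's predicted traits with the reference labels.
    correct = sum(
        1 for r in records if r["detected_traits"] == r["ground_truth_traits"]
    )
    return correct / len(records)

# Hypothetical example with two judged scenarios.
records = [
    {"expressed_traits": {"deception"}, "detected_traits": set(),
     "ground_truth_traits": {"deception"}},
    {"expressed_traits": set(), "detected_traits": set(),
     "ground_truth_traits": set()},
]
print(criminal_traits_activation_rate(records),
      objective_trait_detection_accuracy(records))
```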
PRISON: Unmasking the Criminal Potential of Large Language Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces PRISON, a tri-perspective evaluation framework designed to assess both the criminal potential and detection capability of LLMs in adversarial social scenarios. It models how LLMs behave under roles such as criminals, detectors, and gods, simulating information flow and perspective differences to capture both understanding and detection of illegal behaviors. The study yields several interesting observations. For example, popular LLMs often exhibit criminal traits: generating deceptive or harmful advice, even without explicit malicious prompts. However, they perform poorly when detecting similar behavior. The evaluation is extensive, leveraging context-rich, film-inspired scenarios to ensure realism while maintaining ethical control. - The proposed framework is novel and studies an important aspect of LLM safety. - The tri-perspective approach (criminal, detector, god) is innovative and captures the complexity of adversarial scenarios effectively. - Comprehensively quantifies the criminal tendencies of various LLMs, providing valuable insights into their capabilities and limitations. - The performance gap between criminal generation and detection may reflect the nature of the task itself, rather than a specific shortcoming of LLMs. - The scenarios are primarily adapted from classic crime films, which may limit their representativeness of real-world criminal contexts. - Lack of technical discussion about why certain behaviors emerge in LLMs. Overall, this is a well-structured and insightful study that contributes meaningfully to our understanding of LLM safety in adversarial contexts. The PRISON framework is a valuable addition, offering a creative way to stress-test models' tendencies toward criminal expression and their ability to detect manipulative behavior. However, I have a few concerns: --- (1) Nature of the Performance Gap: The observed gap between "criminal expression" and "detection" might reflect the nature of the task itself rather than a true model deficiency. For humans as well, generating deception is often easier than detecting it, since detection requires background knowledge and reasoning about intent. It would strengthen the paper if the authors could further analyze whether this gap truly indicates a model limitation or simply the inherent difficulty of the task. --- (2) Use of Film-Based Scenarios: Lines 220–221 mention that the scenarios are adapted from films. However, film plots are not necessarily realistic representations of real-world criminal behavior. Why not use more authentic materials such as court transcripts, online forums, or real-world investigative documents to improve realism and ecological validity? --- (3) Lack of Technical Analysis: The performance gap essentially reflects two underlying technical issues: - insufficient safety alignment, since the model still tends to follow harmful or deceptive instructions; and - limited long-context understanding, as detecting criminal or deceptive behavior often requires reasoning over extended context.
It would be helpful if the authors could analyze these aspects more deeply to clarify the technical reasons behind the observed gap. Fully AI-generated
PRISON: Unmasking the Criminal Potential of Large Language Models Soundness: 3: good Presentation: 4: excellent Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces PRISON, a novel framework for assessing the capabilities of LLMs to a) develop criminal strategies and b) detect criminal acts. Building upon plots and situations from mainstream movies, the framework creates settings in which an LLM is queried to solve the situation with malicious intent, e.g., covering up an accident. The first aspect of the study explores whether LLMs can come up with valid, illegal/harmful strategies given the individual setting. The second aspect takes the opposite view and tests whether the same LLMs can detect illegal actions in the generated strategies. Evaluating multiple recent LLMs, the paper demonstrates that language models show high criminal capabilities while their detection capabilities for such actions are limited. - The paper is very well-written and easy to follow. All figures and findings are clean and straightforward to understand. Overall, the paper formatting quality is above average. - Investigating the criminal potential of LLMs is an interesting avenue, and leveraging the scripts of movies to create an evaluation framework is a smart idea. The findings that there exists a mismatch between criminal actions and criminal detection are intriguing. I also like that the paper not only distinguishes between criminal/non-criminal but also analyzes the different kinds of criminal traits. - The experimental evaluation covers multiple LLMs (8 different models) and settings. Whereas some recent models, e.g., GPT-5, Qwen-3, DeepSeek-R1, are missing, the provided models offer a good mix of non-reasoning LLMs. - The framework setting feels somewhat artificial. While I understand the intention behind the dataset, I am not fully convinced that the evaluations genuinely assess a model’s criminal capabilities. When looking at examples in the Appendix (e.g., Table 5), it often feels as if the model is writing a novel. On one hand, such narrative-style outputs could indeed be misused for criminal purposes. However, I am not sure whether these outputs are actually harmful, since it remains unclear to what extent the proposed strategies go beyond straightforward ideas or common scenes from movies. While, in another context, detailed instructions for building a bomb could clearly cause harm, suggesting to push a car into a lake (which requires no expert knowledge) seems more like repeating a movie or book scenario. I understand that we want LLMs to avoid producing such suggestions, but given that the model appears to be engaging in creative writing, this behavior might be acceptable in some cases. - The crime detection capabilities of LLMs are not compared against a human baseline. Since LLMs have less information than the “God-setting,” a human baseline would help quantify the actual performance gap. Given that some contextual information is missing from the detection model’s input, it might be that certain actions cannot be reliably classified as criminal. - As the framework is based on only ten movies, the diversity of scenarios may be limited. - No large reasoning models, such as Qwen3 or DeepSeek-R1, are evaluated. 
It would be interesting to see whether stronger reasoning capabilities improve a model’s ability to either generate or detect criminal content. Small remark: - There is a missing space in L046. - How does a human baseline perform on the detection task compared to the LLMs? Is there sufficient information contained in the inputs actually to solve the task (could be answered by a user study)? Fully human-written
Towards One-step Causal Video Generation via Adversarial Self-Distillation Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose a method for the distillation of bidirectional video generation models into autoregressive causal few-step generators capable of real-time inference, basing their work on the popular Self Forcing framework. The paper introduces two advancements with respect to Self Forcing: 1) it enables inference with a variable number of sampling steps (1, 2, 4); 2) it improves performance for the case of 2- and 1-step inference with respect to Self Forcing, producing 2-step generation results on par with the original 4-step inference according to VBench scores. The technical contributions enabling these advancements consist of: 1) an adversarial loss term matching the (N+1)-step with the N-step generator distributions; 2) forcing the first frame to be inferred using 4 steps. Both contributions appear sound and easy to implement (the authors provide code in the supplement). The quality of the results is convincing as seen qualitatively and in VBench and user studies. Overall, the work is likely to be adopted by the community due to its simplicity, the convincing quality of the results, and the relevance of the problem. - The work tackles the task of turning slow bidirectional video models into fast real-time autoregressive generators, which is of high practical importance - The quality of results is convincing as seen in the provided qualitative samples - Quantitative evaluation confirms the qualitative assessment that the method matches or surpasses Self Forcing under the same number of sampling steps - The method is simple to implement, and the authors provide the source code for review - Ablation studies show convincingly that ASD and FFE both have a positive impact on the method, and ablations on the optimal value for adversarial loss weighting are shown - Tables are missing an analysis of first-frame latency and throughput (see Self Forcing). I suggest the authors report these numbers. FFE will cause first-frame latency to match that of the original 4-step Self Forcing - The paper considers only the setting where chunks of 3 latent frames are predicted simultaneously and does not show results for frame-by-frame autoregressive generation. Frame-by-frame prediction is a setting of high practical importance as it minimizes latency. Evaluation would be strengthened by showing qualitative and quantitative results for this setting too. - Evaluation is performed on the 5 second setting and no results are shown beyond it. Self Forcing can generalize beyond 5s generation, an important capability for an autoregressive causal generator. Such capability should be demonstrated and evaluated. Without this capability, the practical significance of the method would be reduced. - The paper is unclear in some key parts, such as Algorithm 1 and the discriminator design. See questions. - There are some typos and missing spaces before citations. - An adversarial term is introduced to match the distributions of the (N+1)-step and N-step predictions, showing performance improvement.
Did the authors consider extending the use of the same adversarial term to its canonical usage for matching the real data distribution with the 4-step generator distribution, similarly to DMDv2? - Adversarial losses are proposed as the tool for matching the N+1 and N step generator distributions. Why is an adversarial term the ideal choice in this context? Could we have used a DMD formulation instead by introducing additional fake score prediction networks, either one for each value of n, or sharing the same fake score prediction network with conditioning on n? This could result in a more elegant framework relying only on DMD. - FFE relies on the assumption that generation of the first frame is the hardest, because successive frames can be generated by copying content from the first frame. Thus allocating more sampling steps to the first frame and fewer to the subsequent ones makes sense and is shown to improve performance. The assumption, however, holds less strongly if videos with high camera motion or complex object movements are considered. In this setting each frame will need to generate a more significant amount of content without the possibility of copying it from previous frames. Can the authors show that in this setting, their method with 2-step inference still matches the performance of the original 4-step Self Forcing? - Algorithm 1: LL217 shows that a schedule with a fixed number of steps N is instantiated. LL222 suggests that the actual number of sampling steps for the current iteration n is sampled. I believe LL217 should instead instantiate a different schedule for each possible value of n. - Algorithm 1: LL224 suggests a rectified flow setting. I suggest making this explicit in the paper. - Algorithm 1: LL224 and LL226 suggest two different time steps are sampled for x^1 and x^2 for use in the adversarial loss term. I'd like to confirm this understanding is correct. Could the authors discuss why this is preferable to having a shared timestep t to ease the role of the discriminator? - LL352-355 are unclear. How is the discriminator implemented? Does the discriminator receive as input the current timestep t in addition to backbone features? D_n and D^n seem to be used interchangeably. - Eq. 2 has incorrect parentheses. - Could the authors report all VBench evaluation metrics in the supplement? - How is Fig 1 produced? Did the authors perform experimentation on a Gaussian mixture? Fully human-written
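Since several of the questions above concern how the ASD objective and discriminator might be wired up, here is a minimal sketch of the general idea of adversarially matching n-step and (n+1)-step generations; it is an assumption-laden illustration with toy stand-in modules, not the authors' Algorithm 1.

```python
# Minimal sketch (assumptions, not the authors' Algorithm 1) of the core ASD idea
# discussed above: a discriminator is trained to separate the student's n-step
# generations from its (n+1)-step generations, and the generator is updated so the
# n-step output distribution moves toward the (n+1)-step one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyStudent(nn.Module):
    """Stand-in for the causal few-step generator (hypothetical interface)."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Linear(dim * 2, dim)

    def denoise_step(self, x, cond, step, total):
        return x + self.net(torch.cat([x, cond], dim=-1)) / total

def generate(student, noise, cond, num_steps):
    x = noise
    for t in reversed(range(num_steps)):
        x = student.denoise_step(x, cond, step=t, total=num_steps)
    return x

def asd_losses(student, discriminator, noise, cond, n):
    x_n = generate(student, noise, cond, num_steps=n)            # student, n steps ("fake")
    with torch.no_grad():
        x_np1 = generate(student, noise, cond, num_steps=n + 1)  # same student, n+1 steps ("real")
    # Non-saturating GAN losses on the two branches.
    d_loss = F.softplus(discriminator(x_n.detach())).mean() + \
             F.softplus(-discriminator(x_np1)).mean()
    g_loss = F.softplus(-discriminator(x_n)).mean()
    return d_loss, g_loss  # in practice g_loss would be weighted (alpha) and combined with DMD

# Hypothetical usage
student = ToyStudent()
disc = nn.Sequential(nn.Linear(16, 32), nn.SiLU(), nn.Linear(32, 1))
noise, cond = torch.randn(4, 16), torch.randn(4, 16)
d_loss, g_loss = asd_losses(student, disc, noise, cond, n=2)
```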
Towards One-step Causal Video Generation via Adversarial Self-Distillation Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper works towards limiting the diffusion denoising steps of hybrid video generation frameworks, which leverage autoregressive models to model temporal dynamics and diffusion-based spatial denoising, to as few as one step. The proposed method falls under the category of model distillation and strongly builds on an existing method called Distribution Matching Distillation with a substantial addition: a novel form of Adversarial Self-Distillation is proposed, aligning the student model’s n-step denoising process with its (n+1)-step version on a distribution level. Evaluations on VBench and a custom user study show state-of-the-art results on 1- and 2-step video generation. The model further removes the limitation of fixed inference steps after training, allowing flexibility for multi-step settings. - The paper is generally well written and presented. - The provided 1- and 2-step video results, both in the paper and the supplementary material, are better than those of current competitors. - The method supports both single-step and few/multi-step inference, which is a major advantage over methods trained for a fixed number of steps. - The First-Frame Enhancement strategy appears to be an important observation, by itself already boosting state-of-the-art results. - The influence of ASD and FFE is cleanly ablated, showing the superior results of the combination. ## Incremental contribution: - The used components are not fundamentally new in nature. DMD is very well established for model distillation and remains a core component also in this work. - Similar adversarial diffusion distillation has been proposed before and is well established in the community. - As such, there are no fundamentally new concepts presented, but their combination provides a nice contribution to the current state of research. ## Limited information on experiments: - Several pieces of information are missing for some of the shown experiments. - The user study is missing the number of participants and further statistical values. - For the main comparison for 1- or 2-step distillation, the authors had to retrain Self Forcing. The retraining parameters are missing, i.e., nothing indicates that the model has been trained long enough to reach convergence. ## Figure 1: - Figure 1 seems not really on point, i.e., the adversarial self-distillation on the right seems oversimplified and not aligned with Algorithm 1. ## Minor: - Algorithm 1, 1: Typo: "origianl" - Eq. 2: isn’t there a bracket missing? - How is Fig. 4 created? Is this just an example or averaged results? Are the results corresponding to results from Self Forcing or from the provided method? - Regarding the user preference study: please provide more details, e.g., how many users were selected and whether measures were taken to ensure independence. - In the ablation Table 2: Why are the total scores for the first row (ASD, FFE) so much worse than the corresponding pure Self Forcing values from Table 1? - The distilled diffusion process is optimized with a 4-step denoising process. Was this number of steps determined to be optimal? - Influence of $\alpha$: Figure 8 already shows the influence of alpha on the Total Score.
However, it is interesting to see that ordering the alpha values by increasing score (for 2- or 1-step inference) gives alpha=0 < alpha=20 < alpha=10 < alpha=30, i.e., the total score does not increase monotonically with alpha. How do you explain this behavior? Fully human-written
Towards One-step Causal Video Generation via Adversarial Self-Distillation Soundness: 4: excellent Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes a framework for accelerating causal video diffusion models via Adversarial Self-Distillation (ASD) and First-Frame Enhancement (FFE). The method extends Distribution Matching Distillation (DMD) by introducing a discriminator that aligns the student model’s n-step and (n+1)-step denoising distributions, instead of aligning directly with a multi-step teacher. This step-wise self-alignment aims to stabilize training under extreme few-step (1–2 step) scenarios. Additionally, FFE allocates more denoising steps to the first frame to mitigate error propagation in causal video generation. Experiments on VBench show that the proposed model surpasses Self-Forcing and CausVid under 1-step and 2-step configurations, while achieving comparable performance to multi-step baselines such as Wan2.1 and SkyReels with far fewer steps. 1. The paper is clearly written and easy to follow. The proposed methods (ASD and FFE) are well-motivated and clearly elaborated. 2. The results look very impressive. Compared to the Self-Forcing baseline, the 1-step video generation exhibits a great boost in quality. The speed of 1-step causal generation will enable a wider deployment of streaming video generation. One minor concern might be conceptual novelty. The main method of this work, ASD, is not fundamentally new [1,2]. Considering the value and impact of 1-step causal video generation, the engineering effort to tune an end-to-end pipeline is a significant contribution, especially since the authors provide the code for replication in the supplementary material. [1] Zhang et al., SF-V: Single Forward Video Generation Model. [2] Lin et al., Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation. 1. The proposed FFE seems to benefit early generation more, right? Can you provide more results and comparisons for longer generations, such as at the 10s–20s level? 2. In the user study, the proposed method is on par with Self-Forcing at the 4-step setting (exactly 50%). Can you elaborate more on why ASD is beneficial at 1-2 steps (as in the ablation) but not helpful in the 4-step setting? Fully human-written
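A tiny sketch of the first-frame-enhancement idea the reviews above keep returning to: allocate more denoising steps to the first chunk of the causal rollout and fewer to later ones. The specific step counts below are illustrative assumptions, not the paper's exact settings.

```python
# Illustrative sketch (step counts are assumptions, not the paper's exact settings)
# of an FFE-style schedule: spend more denoising steps on the first chunk of the
# causal rollout and fewer on every subsequent chunk.
def ffe_step_schedule(num_chunks, first_chunk_steps=4, later_chunk_steps=1):
    return [first_chunk_steps] + [later_chunk_steps] * (num_chunks - 1)

# e.g. a 7-chunk rollout: [4, 1, 1, 1, 1, 1, 1]
print(ffe_step_schedule(7))
```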
Towards One-step Causal Video Generation via Adversarial Self-Distillation Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses the challenges of slow inference speed and error accumulation in causal video generation models. The primary goal is to enable high-quality video synthesis in extremely few denoising steps, the authors propose two main contributions: 1. Adversarial Self-Distillation (ASD): A novel strategy that moves away from traditional distillation. Instead of matching a few-step student model to a multi-step teacher model, ASD adversarially aligns the output distribution of the model's own n-step generation with its (n+1)-step generation. 2. First-Frame Enhancement (FFE): An inference strategy designed to mitigate error propagation. Based on the observation that the initial frame is most critical in causal generation, FFE allocates more denoising steps to the first frame and significantly fewer steps to subsequent ones, enhancing overall video quality with minimal computational overhead. 1. Well-Motivated and Significant Problem: The paper addresses a highly relevant and challenging problem in generative AI: achieving high-fidelity video synthesis under extreme computational constraints (one or two inference steps). The motivation is clear, and a successful solution to this problem would have a significant practical impact, making the research direction valuable. 2. Empirically Effective and Intuitive Core Ideas: a) ASD's Practical Efficacy: The core idea of Adversarial Self-Distillation (ASD)—aligning the model's n-step output with its (n+1)-step counterpart—is an intuitive approach to breaking down a large distillation gap. While its theoretical underpinnings could be further explored, its practical effectiveness is undeniably demonstrated by the experiments. The concept of progressively refining the model based on its own slightly improved outputs proves to be a powerful empirical strategy in the few-step regime. b) Pragmatic and Data-Driven FFE: The First-Frame Enhancement (FFE) strategy is a pragmatic and effective solution grounded in a clear empirical observation (Figure 4). Although a simple heuristic, it demonstrates a thoughtful consideration of the error propagation dynamics in causal models. This data-driven approach to allocating computational resources where they are most needed is a clever and impactful inference-time optimization. 3. Rigorous and Comprehensive Experimental Validation: a) Strong and Fair Baseline Construction: The authors' decision to train their own few-step versions of a powerful SOTA model (Self-Forcing) for comparison is a sign of rigorous scientific practice. This "apples-to-apples" comparison effectively isolates the contribution of their proposed methods (ASD and FFE) from confounding variables like model architecture, which makes the reported gains highly credible. b) State-of-the-Art Empirical Performance: The paper presents compelling quantitative and qualitative results that convincingly demonstrate state-of-the-art performance in the challenging one- and two-step video generation tasks. 
The significant lead over a fairly-trained baseline, supported by extensive ablation studies (Table 2) and a strong user preference study (Figure 6), provides undeniable proof of the method's empirical superiority. 4. Clarity and High-Quality Presentation: The paper is well-written, clearly structured, and easy to follow. The figures and tables are informative and effectively communicate the core concepts and results, contributing to a high-quality presentation of the work. 1. Methodological Foundation of ASD Lacks Rigor: The central contribution, Adversarial Self-Distillation (ASD), is built on a foundation that is more intuitive than it is rigorous. The paper's core claims—that the "intra-student gap" is smaller and that adversarial alignment provides "smoother supervision"—are presented as assertions rather than demonstrated principles. a) There is no formal analysis or empirical measurement to quantify this "gap" (e.g., in terms of a specific distribution divergence metric). b) The claim of "smoother supervision" from a GAN objective is counter-intuitive, given the well-documented instability of adversarial training. The paper fails to provide evidence to substantiate why this would be the case, especially compared to simpler, more stable alignment objectives. 2. Unaddressed Risk of Error Reinforcement in Self-Distillation: The ASD mechanism, where the model learns from a slightly better version of itself, introduces a significant and unexamined risk of "model drift" or "error reinforcement." If the (n+1)-step generation is flawed (e.g., contains artifacts or mode collapse), ASD could perversely train the n-step model to replicate these very flaws. The paper relies on the DMD loss to anchor the model to the true data distribution but provides no analysis of the training dynamics or the delicate balance required to prevent the self-distillation objective from amplifying its own mistakes. 3. The FFE Strategy is Heuristic and Its Generalizability is Questionable: The First-Frame Enhancement (FFE) strategy is presented as a key contribution, but it is fundamentally an empirically-driven heuristic rather than a principled method. a) Its justification rests entirely on a single observation on a specific dataset (Figure 4), and its generalizability to different video content (e.g., videos with major mid-sequence scene changes) is not explored. b) The paper completely ignores the potential negative side-effects of this strategy. Creating a sharp drop in denoising steps between the first and second frames could introduce significant temporal discontinuity and artifacts, undermining the very quality it aims to enhance. This critical aspect is neither analyzed nor discussed. 4. Insufficient Discussion on Training Complexity and Stability: The proposed training framework is remarkably complex, involving a generator, a discriminator, and a "teaching assistant" (TA) score model trained in an alternating fashion. The paper largely overlooks the significant practical challenges this entails. There is no discussion of the training stability, the sensitivity to the delicate balance of multiple loss terms and optimizers, or the total computational overhead of this complex setup compared to simpler distillation baselines. For a paper focused on efficiency, the lack of transparency regarding its own training costs is a notable omission. 1. The central premise of ASD is that the "intra-student gap" is smaller and easier to bridge than the "teacher-student gap." 
Could you provide a more formal or empirical justification for this claim? For instance, have you measured this "gap" using any distribution divergence metrics (e.g., KL, Wasserstein) to validate this core assumption? 2. Given the known training instabilities of GANs, the claim that ASD provides "smoother supervision" is counter-intuitive. Could you elaborate on this and provide evidence (e.g., loss curves, gradient norm analysis) to support that the adversarial self-distillation process is indeed more stable than direct distillation from a fixed teacher? 3. The self-distillation mechanism seems to carry an inherent risk of error reinforcement, where the model could amplify its own artifacts over time. How does the framework explicitly guard against this "model drift"? What is the role of the DMD loss in anchoring the training? 4. The training procedure appears significantly more complex than the baseline. Could you provide a more transparent comparison of the training time, computational resources, and overall stability of your method compared to the standard distillation approach used to train the Self-Forcing† baseline? Fully AI-generated
LEMUR: Leveraging Vision-Language Models for Fine-Grained Multimodal Retrieval Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents LEMUR, a VLM-based multimodal retrieval framework to improve fine-grained and region-level multimodal retrieval. Specifically, the paper proposes a region-aware encoder (RAE) that mirrors the original vision encoder (CE) in MLLMs. The RAE receives context information from the CE through cross-attention layers. The decoupled nature of the RAE from the CE allows the RAE to be more focused on region-level visual cues. The paper also presents a new benchmark, FGMB, to evaluate fine-grained retrieval tasks. - The problem of fine-grained multimodal retrieval is a common issue tackled in industry. The paper identifies the issue that most systems (often based on CLIP-like models) fail to perform retrieval in fine-grained tasks. - The RAE architecture is an intuitive design with clear motivations: use the original vision encoder (or Context Encoder/CE) to extract more global information, while the cropped region is fed into the RAE. The CE feeds context information to the RAE through cross-attention layers. - Various experiments are conducted and LEMUR is compared with other well-known models, such as CLIP, BLIP, SigLIP, and Qwen-VL. - There are many statements in the paper that aren't well supported or grounded. This raises some concerns. For example, in L207, the authors claim that "Yu et al. (2025b) introduces excessive background information and interferes with fine-grained features". However, I am familiar with the cited paper, and I do not understand why this statement made by the authors is true. What the paper proposes is a region-selection or re-encoding token, two tokens which serve distinct purposes, and in the case of the region-selection token (which is what is used for further fine-grained analysis) the background is cropped and fed into the model, so I don't understand why there is criticism for "introducing excessive background". Another example is that across Section 3.2, the authors claim the effectiveness of the RAE module; however, I feel that there is not enough analysis surrounding what the RAE focuses on (in the image) and how effectively it complements the CE (other than the quantitative experimental results). Thus, it is difficult to determine whether improved performance is truly due to the architectural/fine-grained nature of the RAE module, or simply because it is specifically trained on extracted regions of the image. - The assumption that the target region (i.e., bounding box) is always present may be a bit prohibitive. - There is no proper consideration for compute efficiency or latency. In real-world systems, especially for retrieval and search, this is a crucial aspect that needs to be considered. Does the increased latency/computation justify the performance improvement? - Section 3.3 can be elaborated, as there is some ambiguity in how specifically retrieval is conducted. For example, to someone first reading the paper or new to the field in general, it may not be entirely clear how single feature embeddings are generated from a decoder-only language model. Furthermore, there is also ambiguity about what the prompt would be for the candidates.
- Last but not least, the idea of decoupling the encoder for more fine-grained features is not entirely novel. While this itself is fine, I would expect more insight or a thorough analysis on why such a design is much more effective, rather than intuitive but not-well-grounded statements. Thus, overall, I do not feel as though I have learned something new. I would appreciate it if the authors addressed the concerns raised in Weaknesses. Fully human-written
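To ground the architectural discussion in the LEMUR reviews, here is a minimal sketch of the general pattern described above: a region branch whose tokens cross-attend to global context-encoder features so that regional representations keep access to image-level context. Module names, sizes, and the block layout are hypothetical; this is not the authors' implementation.

```python
# Minimal sketch (hypothetical, not the authors' implementation) of the pattern the
# reviews describe: a region-aware branch whose tokens cross-attend to global
# context-encoder tokens so regional features keep access to image-level context.
import torch
import torch.nn as nn

class RegionAwareBlock(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, region_tokens, context_tokens):
        x = region_tokens
        x = x + self.self_attn(self.n1(x), self.n1(x), self.n1(x))[0]
        # Region tokens query the (frozen) context encoder's global tokens.
        x = x + self.cross_attn(self.n2(x), context_tokens, context_tokens)[0]
        x = x + self.mlp(self.n3(x))
        return x

# Hypothetical usage: 64 region-crop tokens attending over 256 global-image tokens.
block = RegionAwareBlock()
out = block(torch.randn(2, 64, 768), torch.randn(2, 256, 768))
```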
LEMUR: Leveraging Vision-Language Models for Fine-Grained Multimodal Retrieval Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes LEMUR, a VLM for fine-grained multimodal retrieval that enhances region-level understanding via a Region-Aware Encoder (RAE), localized captioning, and regional contrastive learning. The main model contribution, the RAE, processes the global image and the region crop separately with cross-attention layers to keep both context and fine-grained features before feeding into the LLM. The paper also introduces the FGMB benchmark with 225K diverse contrastive pairs, including region-aligned pairs. LEMUR improves on previous baselines, specifically in fine-grained retrieval. - The paper tackles the important problem of fine-grained image retrieval with region-level alignment between text and images. - The proposed benchmark can serve for the future development and evaluation of fine-grained retrieval tasks. - LEMUR is simple and technically sound. The RAE, although not the most efficient solution, is reasonable. - The proposed tasks make sense for evaluation purposes, but their real-world applicability is questionable. * The region is not only given for the query, but also for every candidate. This assumption seems unrealistic. In a practical open-set retrieval setting, one likely does not have bounding boxes for each image. Even if one were to automatically generate candidate bounding boxes for each image, the error rate of the object detector becomes a bottleneck. Worse, the proposed method is highly inefficient if multiple bounding boxes for the same image were to be considered. * In the dataset, only one bounding box exists per image (to my understanding). One could argue such a setting actually makes the task easier since the bounding box supervises which part of the image is relevant for the retrieval task (the rest can be ignored as a relevant caption does not exist). So why do competing methods show weak performance? Likely because they are not trained to make use of this information. * A more realistic setting does not have bounding boxes for the candidate set and would instead require identifying the relevant region of the image. - Some choices of the LEMUR method could be better justified and explained. * The alternative architectures presented in Fig. 3 are reasonable candidates. It is not clear if the ablations from Tab. 4 directly correspond to some of these versions. Most importantly, Fig. 3b does not seem to be ablated, although it is likely the most promising alternative due to its architectural simplicity and smaller number of parameters. * Training details of the zero-shot variant are not clear. What is the exact difference to the full model? Are only the 200k FGMB samples omitted from training? How is RAE trained without any region (bounding box) data? * Compared to most other baselines, the proposed method is likely much more computationally costly due to processing two images instead of one. A runtime analysis would be useful to put this into perspective. - Novelty is limited. * As mentioned above, the model is specifically tuned for the use-case of having bounding boxes in the candidate set. Hence, it is not surprising that it performs well on the proposed benchmark.
* The architectural changes in LEMUR are not particularly innovative; they mostly introduce more parameters and additional tokens (compute) to solve the problem. * The proposed dataset is mostly a collection of existing datasets. The novel additions XGoods and XNote are barely described. - Experimental evaluation could be improved. * The evaluations mix too many settings without transparently disclosing fair comparisons. Tab. 2 should clearly indicate which models are trained on M-BEIR. * For better comparison, the tables should also specify the size of each model. * Both Tab. 1 and 2 compare only to LamRA-ret and not the better LamRA. LEMUR performs worse on average than LamRA on M-BEIR. * It is not clear why Tab. 2 contains significantly fewer models than Tab. 1. Moreover, there are relevant baselines that are mentioned in the related works and have publicly available models, but are not evaluated. For instance, E5-V, MoCa, FLAIR. * In the ablation in Tab. 3, it is not clear how xattn can be trained without separate projectors. - How do the results change when we do not assume bounding box information is available for the candidate set? - Could you clarify the ablation study with respect to Fig. 3? How would the comparison to 3b look, everything else being equal? - Could you please clarify how the zero-shot variant differs from the full model? - What is the runtime of LEMUR compared to other baselines? - How do other recent methods (e.g., MoCa, FLAIR) compare to LEMUR? I am willing to increase my score if my concerns are well addressed. Fully human-written
LEMUR: Leveraging Vision-Language Models for Fine-Grained Multimodal Retrieval Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper is about LEMUR, a VLM-based fine-grained retrieval framework that enhances regional representations without compromising image-level retrieval performance. The core contribution is the addition of a Region-Aware Encoder that extracts detailed features from query regions to complement the global image representation. Additionally, the manuscript introduces a new benchmark named FGMB of 225k region-level contrastive pairs covering two metatasks and four multimodal retrieval scenarios. + The region-aware encoder (RAE) appears to be innovative as it addresses the fundamental tension between local and global representation learning through decoupled projectors. The results reported in Table 1 show significant improvements over the baselines, both coarse- and fine-grained approaches. + The motivation of the RAE component, to encourage detailed regional representations, is clear, and the following pipeline is similarly well motivated. The language-only contrastive learning converts language generation capability into embedding capability, and the regional contrastive learning addresses fine-grained retrieval performance. + The paper is relatively easy to read and well presented. + The FGMB dataset is valuable and interesting. - The paper mentions using LLMs to regenerate captions for increased difficulty, but doesn't discuss potential biases introduced by this synthetic data generation. I could not find which LLM was used, nor details about how this was done. - While the paper introduces the FGMB dataset, there is only brief mention of why existing benchmarks were insufficient. There is concurrent development of similar benchmarks involving region retrieval that, despite differences, are still related (e.g., with reasoning, the visual commonsense reasoning dataset). The paper is also missing a comparison with existing evaluation protocols (such as those from GPT4RoI (ECCV 2024), KOSMOS-2 (arXiv:2306.14824), and others). - Some related lines of work, and comparisons with the proposed approach, are missing. These are similar in terms of tasks; for instance, VisionLLM (NeurIPS 2023), GPT4RoI (ECCV 2024), and ASM (ICLR 2023) utilize spatial boxes with ROI-aligned features to align region-level features into the LLM word embedding space. - What's the performance impact of using different vision backbone architectures? What about the LLMs used to generate the synthetic data? - How does LEMUR perform on tasks requiring multiple regions of interest simultaneously? - How sensitive is the model to bounding box quality/precision? Fully human-written
LEMUR: Leveraging Vision-Language Models for Fine-Grained Multimodal Retrieval Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors introduce LEMUR, a V-L framework for fine-grained multi-modal retrieval. It features a Region-Aware Encoder (RAE) that integrates multi-layer cross-attention with a global Context Encoder (CE), to preserve both local and global contexts. The authors propose a three-stage training pipeline. They also introduce FGMB, a large fine-grained benchmark. - Clear motivation: the authors describe their architectural motivation for decoupling regional vs global signals. - The paper present comprehensive comparisons with many previous methods for retrieval. ## Related Work ## While the paper’s core focus is retrieval, its discussion of related work is incomplete and somewhat misleading. The authors frame their approach mainly in the context of CLIP-based retrieval, overlooking the long line of pre-CLIP and concurrent retrieval research that used unimodal or other multimodal encoders [1, 2]. Text–image retrieval is a well-established field, and the paper should not imply that CLIP introduced the task itself. Similarly, the task referred to as “MetaTask2” corresponds to Composed Image Retrieval in prior literature [3]; this and other sub-tasks deserve proper acknowledgment and discussion. A more comprehensive review and citation of prior retrieval studies would better situate LEMUR’s contributions and clarify how it advances the existing body of work. ## Evaluation ## Although the paper presents comparisons with numerous baselines, the evaluation remains insufficient to convincingly demonstrate the method’s superiority: - A) The main comparison (Table 1) relies entirely on the authors’ own benchmark (FGMB). While LEMUR achieves strong results there, performance on a self-constructed benchmark is inherently less persuasive. It is important to include results on established public benchmarks for each retrieval sub-task. For instance, using existing datasets for Composed Image Retrieval [2]. - B) The comparison in Table 2 appears unfair: prior models were not necessarily trained or fine-tuned on the authors’ chosen benchmarks and are therefore evaluated in a zero-shot setting, whereas LEMUR is trained on them. This discrepancy skews the results. For example, BLIP-2’s reported Recall@5 on COCO T2I (63.8%) is much lower than its original performance reported in their paper [4] (86.5-87.7%, depending on backbone size). - C) The paper also omits a key baseline: fine-tuning or evaluating the underlying pre-trained backbone (Qwen2-VL) without the proposed architectural modifications (as described in lines 281-284). Without this comparison, it remains unclear how much of the reported improvement stems from LEMUR’s design versus the backbone’s inherent strength. Table 3 may relate to this, but its setup is not explained clearly in the text. ## Paper Writing ## I appreciate the authors’ contributions and the technical depth of the work; however, the paper does not yet feel ready for publication in its current form. Several writing and organization issues significantly affect readability and completeness: - Certain sections should be reorganized for logical flow. 
For instance, Figure 2 is referenced before Figure 1, which disrupts the narrative order. - The Introduction and Related Work sections need to more clearly position the addressed retrieval tasks within the terminology and context of the main literature (see the first comment on related work). - Tables 3, 4, and 5 are insufficiently discussed and omit essential details. For example, the meaning of the “xattn” symbols in Table 3 is unclear: does it correspond to the cross-attention component in Eq. 2, and do “X”/“V” denote $\alpha=0$ and $\alpha=1$ respectively, or something else? Similarly, Table 5 is not referenced anywhere in the text, and the definitions of T1/T2/T3/T4 are missing. - Finally, the paper lacks a limitations or discussion section. No weaknesses or failure cases of LEMUR are analyzed, which would be important for a balanced and transparent presentation. [1] Li, K., Zhang, Y., Li, K., et al., 2019. Visual semantic reasoning for image-text matching. In ICCV, pp. 4653–4661. [2] Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C. and Hoi, S.C.H., 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34, pp. 9694–9705. [3] Song, X., Lin, H., Wen, H., Hou, B., Xu, M. and Nie, L., 2025. A comprehensive survey on composed image retrieval. ACM Transactions on Information Systems. [4] Li, J., Li, D., Savarese, S. and Hoi, S., 2023, July. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pp. 19730–19742. PMLR. See "Weaknesses" above Lightly AI-edited
Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper studies biases of text-to-image diffusion models when generating images of historical eras. The evaluation is threefold: 1) stylistic bias, 2) historical consistency, and 3) demographic representation. The authors conduct in-depth evaluation and analysis of each of these aspects. They contribute the HistVis dataset, which contains 30K synthetic images generated by three T2I models -- SDXL, SD3, and Flux.1Schnell. The paper is a great read, easy to follow, with interesting findings and extensive evaluations. The authors have designed careful and sound evaluation schemes for each of the three aspects studied in the paper. I especially liked that they used multiple VLMs for evaluation rather than relying on only one as a judge. All details of the study have been laid out transparently. The multiple qualitative examples were very helpful in getting the point across for each aspect of the study. The authors also discuss and analyze their findings in detail, which gives readers valuable insights. A comparison with related studies in this direction, in terms of the number of samples and evaluation strategies, would be helpful to better position the paper. Are the HistVis dataset prompts all manually designed, or were LLMs involved in aiding ideation or prompt design? It would be interesting to know how the authors designed these prompts. Fully human-written
Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents a framework for evaluating text-to-image models in terms of their ability to accurately represent historical context. It introduces HistVis, a dataset of 30,000 images generated by three open-source diffusion models that were prompted to depict people performing generic activities across different centuries and decades. The authors then propose an evaluation protocol that examines stylistic associations, historical consistency, and demographic distribution in the generated images, comparing the results with historical data. The study reveals interesting patterns in how models capture or distort aspects of history, offering insights into biases related to style, anachronism, and representation. Overall, this work highlights the lack of historical accuracy in generative models and provides a concrete methodology and dataset to study it. - The paper highlights an important problem of evaluating historical representation in text-to-image models when depicting generic, everyday activities and provides a clear motivation for addressing it. - The paper provides and evaluates an interesting dataset consisting of images that depict a comprehensive set of timeless activities spanning approximately five and a half centuries, offering strong coverage across diverse historical periods. - The findings, particularly the observation of anachronistic objects in images depicting historical time periods, are very interesting and effectively highlight an important issue in the temporal consistency of these models. - Although the VSG score is supported by a robust methodology, the motivation for evaluating biases in style associations and the choice of the distinct style classes are not sufficiently explained. - The anachronism detection uses an LLM to obtain a list of objects that could be anachronistic for a given activity. How can we ensure that this list is exhaustive? Were other methods, such as object detection, considered for this task? - Evaluating demographic representation in the generated images is an interesting direction. However, the use of LLMs to predict demographic distributions raises concerns about reliability. The paper uses the public OWID dataset to demonstrate the robustness of GPT-4o as an estimator for a small set of activities but does not account for potential data contamination, which could explain the model’s high accuracy in these limited cases. - Figures 4 and 6 could be significantly improved to enhance clarity and effectively communicate the key takeaways. In their current form, they present a large amount of information in a single view, which makes it difficult for readers to interpret the results and understand the main insights. For the style association results, it might be interesting to link them to the training data and examine correlations/reasons for such biases (although this might be out of scope for this paper). Minor typos (not really a weakness, sharing so that the authors can fix them): - Line 275 refers to Appendix 5 instead of Table 5 - Line 346 uses the word “currentumptions”. 
- Line 439 says Section Q instead of Appendix Q - Is the stylistic predictor (which predicts the style class) also trained to handle images that do not fall into any of the 5 categories in the training dataset? Since a mitigation prompt was used, what was its goal? - The paper notes that clothing appears in 2–5% of the anachronistic images (line 322). It is unclear why clothing is considered anachronistic. Does this refer to attire that does not match the depicted time period? If so, it would be helpful to clarify how such inconsistencies were detected. - Was the prompt used to obtain the racial breakdown for a particular time period specified by continent (as the OWID comparison was based on continent)? If so, against which continent were the proportions in the generated images compared? If not, was it assumed that GPT-4o would estimate global demographic proportions for the corresponding historical period? Lightly AI-edited