ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 15899 (21%) | 4.43 | 3.58 | 3687 |
| Heavily AI-edited | 3233 (4%) | 4.22 | 3.59 | 2990 |
| Moderately AI-edited | 7082 (9%) | 4.20 | 3.61 | 2722 |
| Lightly AI-edited | 16648 (22%) | 4.15 | 3.68 | 2746 |
| Fully human-written | 32938 (43%) | 4.13 | 3.62 | 2917 |
| Total | 75800 (100%) | 4.21 | 3.62 | 3026 |
Title Ratings Review Text EditLens Prediction
DriveE2E: Closed-Loop Benchmark for End-to-End Autonomous Driving through Real-to-Simulation Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper presents DriveE2E, a Real2Sim-based closed-loop evaluation framework built upon the CARLA simulator. It features high-fidelity digital twins of 15 diverse urban intersections and 800 traffic scenarios generated from infrastructure sensor data. The paper is well-written and clearly structured, making it easy to follow. It constructs high-fidelity digital twins of 15 urban intersections and selects 800 real-world traffic scenarios from over 100 hours of infrastructure sensor data. Meanwhile, it establishes a comprehensive closed-loop benchmark for end-to-end autonomous driving (E2EAD) by evaluating several classical baselines, demonstrating both technical completeness and practical relevance. There are several points that require further clarification. The proposed Real2Sim framework is intended to address the unrealistic rendering issues in CARLA, yet the paper still relies on CARLA’s built-in assets for traffic participants, which seems inconsistent with that motivation. In addition, one of CARLA’s main advantages lies in its true closed-loop capability, where surrounding agents can dynamically respond to the ego vehicle’s behavior, unlike the log-replay mode adopted in this work. While this paper introduces new scenarios, it still relies on a log-replay method. Therefore, what are its advantages compared to frameworks like nuPlan or NAVSIM? This will determine the value of this work. Same as the Weaknesses. Lightly AI-edited
DriveE2E: Closed-Loop Benchmark for End-to-End Autonomous Driving through Real-to-Simulation Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes a novel closed-loop evaluation benchmark of end-to-end autonomous driving agents by constructing digital twins of real-world intersections and replaying logs of the traffic participants. It uses infrastructure sensors for reconstruction and the CARLA simulator for simulation. The benchmark is comprehensive and consists of various driving behaviors and traffic participants. 1. Closed-loop simulation is important for benchmarking end-to-end driving models, and constructing digital twins of real-world driving scenes can help with closed-loop testing and bridge the sim2real gap. 2. Using infrastructure-mounted sensors can help improve the accuracy and complexity of the reconstructed traffic scenarios, improving the realism of the digital twins. 3. The proposed DriveE2E benchmark is comprehensive in the number of scenarios, agent behaviors, and traffic participant categories. 1. Although acknowledged in the limitations section, the log-replay simulation is a major weakness of this work for more realistic closed-loop evaluation, as it should be straightforward to incorporate behavior models for other agents. This could have been an advantage of using a driving simulator like CARLA, which offers easier agent management. 2. The reconstructed digital twins of real-world intersections lack diversity and realism. All of the intersection layouts are four-way intersections, and the assets, such as bushes, trees, and road textures, all appear the same. 3. The collected real-world data all come from a small region with only 15 intersections, which lacks diversity. Therefore, the results from such a benchmark may not be general and comprehensive for the driving agent's performance. 4. With the lack of photorealism of the image rendering of CARLA, the proposed DriveE2E benchmark is more suitable for modular testing of AD systems, like downstream planning and decision making, rather than evaluating end-to-end driving agents. End-to-end driving policies trained from such data may also exhibit a large sim2real gap. 1. Can the authors provide some video results of the driving scenario visualizations? 2. Can the authors provide more statistics on the assets used for creating the digital twins, or do they simply reuse the CARLA assets? For example, how many different types of vehicles, pedestrians, etc.? Are they fixed or randomly initialized in each scenario? 3. Can the authors elaborate more on the real-world data collection? For example, what types and how many infrastructure sensors are used? What is their perception range? 4. In Tab. 7, can the authors explain more about the element-wise similarity metrics used to evaluate the reconstruction fidelity? Fully human-written
CausalPlan: Empowering Efficient LLM Multi-Agent Collaboration Through Causality-Driven Planning Soundness: 2: fair Presentation: 4: excellent Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes CausalPlan, a framework that purports to integrate explicit structural causal reasoning into LLM-based multi-agent planning. The core contribution is a Structural Causal Action (SCA) model that learns relationships between prior actions, current states, and future actions from agent trajectories. These learned relationships are encoded in a ``Causal Action Matrix'' $M$, which is then used to reweight LLM-generated action probabilities during planning. The authors evaluate their approach on the Overcooked-AI benchmark across multiple LLM backbones and show empirical improvements in task success rates and reductions in invalid actions. 1. Consistent gains across very different LLM backbones and layouts, not just one setup. 2. Human-partner results are stronger than baselines and include statistical testing; several settings reach p<0.05 and none show degradation when the method is enabled. 3. The causal backup plan is an effective recovery mechanism when the planner proposes no valid actions; ablation shows it adds measurable benefit beyond the two-prompt tweak. 4. The framework exposes a causal action matrix and publishes heatmaps, giving a degree of interpretability about which state/action factors influence next actions. 5. Robustness to who collects the data for the buffer; using a stronger behavior policy helps, but even weaker LLM-collected data still benefits from the causal integration. 6. Sensitivity analysis of the γ weighting shows a broad sweet spot. 7. Explicit DAG enforcement by zeroing the weaker direction in any bidirectional pair prevents cycles in the learned structure. 8. Modular drop-in over ProAgent with open-source LLMs, keeping the rest of the stack intact and making replication or extension straightforward. 9. Extends beyond Overcooked to a long-horizon single-agent benchmark (Crafter) and outperforms both a causal-prompting baseline and Dreamer-V2 at 1M steps. 10. Prompting design separates analysis from action selection, making the action extraction unambiguous; ablation indicates the components introduced to capture causal relationships drive most of the lift. 1. The framework learns from trajectories generated by a fixed behavior policy in Overcooked-AI, which means each action is conditioned on the policy’s internal decision process. Since actions aren’t randomized or independently manipulated, the data are observational, not interventional. 2. The Structural Causal Action model optimizes a likelihood loss $-\log P(a_t \mid s_t, a_{t-1})$, which captures conditional correlations rather than causal effects $P(a_t \mid s_t, \text{do}(a_{t-1}))$. Without interventions or counterfactual adjustments, the learned structure reflects co-occurrence patterns, not causal mechanisms (a toy illustration of this gap follows after this review). 3. Although Overcooked-AI’s environment is deterministic, the data collection process is not interventionally controlled. The simulator ensures that actions deterministically affect states, but the trajectories used for learning are policy-dependent rollouts, not samples from systematically applied interventions. 4. Because the same policy governs both state visitation and action choice, correlations between $s_t$ and $a_t$ can arise from shared dependencies on unobserved latent factors such as internal LLM reasoning or high-level strategy. The model treats these as causal links. 5. The binary feature encoding used for states and actions is a coarse abstraction of the full simulator state. Hidden variables like spatial positioning or timing can confound state–action dependencies, violating causal sufficiency. 6. The framework’s only verification is improved prediction accuracy and task performance, which measure behavioral alignment, not causal correctness. A model can be highly predictive while causally wrong. 7. The paper’s theoretical identifiability proof relies on assumptions such as additive noise, faithfulness, full observability, and acyclicity, none of which are verified in Overcooked-AI. There is no empirical evidence that these assumptions hold in practice. 8. Each entry of the causal action matrix represents a learned dependency weight, not an intervention-derived causal coefficient. The matrix is effectively a correlation matrix with sparsity regularization. 9. The observed reduction in invalid actions and improved cooperation may result from regularized prediction smoothing or bias correction, not genuine causal reasoning. The gains demonstrate utility, not causal validity. 10. Because the learned structure reflects policy-specific correlations, the matrix may not transfer to different partners, environments, or task variations, contradicting the stated goal of causal generalization. 1. How do you distinguish causal effects from correlations when all data come from a fixed behavior policy $\pi_\beta$? What is your formal definition of causation in this context? 2. Can you provide empirical evidence that the faithfulness, causal sufficiency, and additive noise assumptions hold in Overcooked-AI? For instance, conditional independence tests, checks for unobserved confounders, or validation of the additive noise model? 3. What would an intervention experiment look like to validate your learned causal structure? For example, could you force an agent to take actions inconsistent with $M$ and measure the deviation in outcomes? 4. Why not compare against a model that learns $P(a_t \mid s_t, a_{t-1})$ with standard neural networks (e.g., feedforward or recurrent) without causal constraints? Do the DAG structure and sparsity actually matter, or are the gains from additional learned features? 5. How does performance degrade when the partner policy changes? Does your “causal” matrix $M$ transfer to new partners, or is it partner-specific? 6. Can you show that the learned dependencies correspond to true causal mechanisms rather than artifacts of $\pi_\beta$? For instance, by comparing $M$ learned from different behavior policies? 7. Have you tested whether $M$ changes systematically under distributional shift? This would be evidence of instability inconsistent with causal invariance. 8. Why is binary feature encoding sufficient when it discards causally relevant information such as spatial distances, timing, and interaction history? 9. What is the causal graph $G$ you claim to identify? Can you draw it explicitly (not just heatmaps of $M$) and verify it against ground truth or domain knowledge? 10. In Proposition 1, you assume $a_t$ is “unobservable” during training, but clearly you observe $a_t$ in the trajectory data $\mathcal{B}$. Can you clarify this apparent contradiction? Fully AI-generated
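To make the observational-vs-interventional distinction raised in points 1–2 above concrete, here is a toy simulation (entirely hypothetical; it does not use the paper's data or model) in which a hidden confounder, standing in for the behavior policy's latent strategy, makes $P(a_t \mid a_{t-1})$ differ from $P(a_t \mid \mathrm{do}(a_{t-1}))$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy SCM: a latent "strategy" u (standing in for the behavior policy's internal
# state) drives both the previous action a_prev and the next action a_next.
# a_prev has NO direct effect on a_next, yet the two are strongly correlated.
u = rng.binomial(1, 0.5, n)                  # hidden confounder
a_prev = rng.binomial(1, 0.2 + 0.6 * u)      # P(a_prev = 1) depends on u
a_next = rng.binomial(1, 0.2 + 0.6 * u)      # P(a_next = 1) depends only on u

# Observational estimate P(a_next = 1 | a_prev = 1): inflated by the confounder.
p_obs = a_next[a_prev == 1].mean()

# Interventional estimate P(a_next = 1 | do(a_prev = 1)): a_prev is set by fiat,
# u is untouched, and a_next is unaffected, so it stays at its marginal rate.
p_do = a_next.mean()

print(f"P(a_next=1 | a_prev=1)     ~ {p_obs:.2f}")   # about 0.68
print(f"P(a_next=1 | do(a_prev=1)) ~ {p_do:.2f}")    # about 0.50
```

A likelihood-trained predictor fit to such data would assign weight to the a_prev → a_next dependency even though no causal link exists; this is the gap the questions above probe.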
CausalPlan: Empowering Efficient LLM Multi-Agent Collaboration Through Causality-Driven Planning Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces CausalPlan, a planner for two-agent collaborative tasks that reduces invalid actions by learning and exploiting a sparse policy structure over decisions. Given a set of trajectories, each timestep is factorized into binary features (agent states and environment state) and a one-hot previous action of the agent being controlled. Then a per-action head is trained with a sparsity mask using NLL (alternating masks/weights), indicating how much each input feature is used when choosing that action. The masks form a matrix whose entries reveal the propensity of using an input feature when choosing a decision (action). At test time, the agent a) prompts an LLM to propose candidate high-level actions, b) prunes the actions based on feasibility using external rules/grounding, and c) scores each remaining candidate by summing the mask weights of the currently active features. The score is blended with the LLM scores via a convex combination. If there are no valid actions, a fallback regime is used where the chosen action is the top-scoring action from the learned matrix. Empirically, this cuts invalid actions and improves cooperative rewards across layouts and LLM backbones. - Tackles a real failure mode of LLM agents in collaborative tasks. - Fills a real gap between pure prompting and heavy world-modelling. - Simple training pipeline: learns from trajectories using a standard NLL+sparsity objective. - The policy-structure matrix gives insight into which inputs support which actions. - The method shows consistent empirical gains over strong baselines. - The system is partner-aware (by including the partner’s state in the feature set) without requiring access to, or having to learn, the partner’s policy. - Inference is simply a masked feature sum plus a convex blend with the LLM action probabilities, far cheaper than replanning or multiple model rollouts. - Deployability is practical, given that the same assumptions hold. There is minimal friction when integrating into existing agent stacks given the learned matrix: one matrix lookup, one re-weighting step. - The appendix is comprehensive and code was provided, facilitating reproducibility. I will combine weaknesses, remarks, and questions in one section for readability. The comments below reflect my current reading of the paper and appendix; if I’ve misread any definitions or misinterpreted any claims, I welcome pointers and will happily revise. To my understanding, the paper does not build a causal model of the world, though the writing sometimes suggests it does. The learned object is a policy structure over observed features that predicts the next action, not a dynamics model one could query with do-operators or counterfactuals. The SCA takes parents $(a_{t-1}, s_t)$. This models cause-effect relationships within the agent’s decision process, not the environment’s physics or task dynamics. According to the definition at L229-L232, each row of $M$ corresponds to a possible next action and each column to a state or past-action feature.
Each entry is the learned probability that feature $j$ influences action $i$, given the learned structure. Querying $M$ sums the active parent entries to produce a "causal score". In effect, from my understanding, a higher sum implies that the model has learned that the currently active features are predictive parents of that next action. This is not a causal effect estimate in the Pearl sense. The proof’s conclusion (L812-L814) that "the causal action matrix … faithfully reflect the true cause-effect relations among states and actions" reads too strongly. What is captured is a sparse dependence over $(s_t, a_{t-1}) \to a_t$, which is a property of the actor’s policy, not of the world’s causal structure. A (hard) intervention (if attempted) would possibly be toggling columns $j$ (making a feature active/inactive) and seeing how the score changes. The paper does not define or use interventional queries over an SCM over the environment’s variables. Concretely, the method learns a decision structure, not environment causality. Can you clarify the above in the paper? Minor: “intervention” is used informally (L277, re-prompting), which can be confusing in a section that discusses causal effects and structural causal action models. Furthermore, there are two distinct issues with the proof: A) Proof-method mismatches - Estimator mismatch: The appendix analyses ridge regression on fixed basis features and reads parents from the support of $W$. The method itself trains neural Bernoulli heads via NLL and learns $\eta$ jointly. These are not equivalent and, as far as I can tell, one does not imply the other. - ANM vs classifier: The proof invokes ANM-style identifiability, but the trained model is a binary classifier. - Acyclicity gap: The proof assumes a DAG. The method’s heuristic "zero the smaller of each bidirectional pair" only removes 2-cycles, leaving (3+)-cycles in the action-action portion of the graph. Take for example the following relations: $W(a \to b) = 0.5 > 0.3 = W(b \to a)$, $W(b \to c) = 0.5 > 0.3 = W(c \to b)$, $W(c \to a) = 0.5 > 0.3 = W(a \to c)$. The heuristic would remove the three arrows $b \to a$, $c \to b$, $a \to c$, but $a \to b \to c \to a$ remains (a minimal check of this example is sketched after this review). This is inconsequential when the learned matrix is used in a feedforward manner as done in the paper, but the claim (L236-238) "…ensure DAG property of a standard SCM…" is not correct. - Observability: The proposition (L220) states $a_t$ is unobservable, but the proof in the appendix uses an observed $a_t$ to train the SCA model. Clarifying "unobserved at test time, observed during training" would help. Any of the above, in my view, renders the statement proven in the appendix inapplicable to the method in the main paper (apart from the observability point, which is a wording inconsistency). Questions: Either restate and prove a proposition for the actual model class and estimators, align the method with the ridge/basis estimator in the appendix, or clearly state that this is a motivating surrogate that does not apply to the method. B) Standalone proof issues When considering the proof itself in isolation: - The specific regularizing conditions/assumptions are not clearly stated (i.e., L728: "identifiability of causal direction relies on the function class having sufficient expressiveness and satisfying certain regularity conditions (e.g., nonlinearity, invertibility)"). Invertibility, as far as I can tell, is not relevant to the proof. Please state the exact assumptions used.
- Ridge regression is used and the parents are identified by the support of $W_i$ (L808), with the claim (L809): "one can recover the graph structure by examining which entries of $W_i$ are significantly nonzero, using thresholding or statistical tests." L2 regularisation has no sparsity guarantees and typically yields dense solutions. How exactly is this step justified and implemented? - The proof assumes causal faithfulness, but the collaborator’s actions are not modelled (L899-901). If $u$ denotes the collaborator’s action, an implicit assumption is made: $a_t \perp u_{t-1} \mid (s_t, a_{t-1})$. If $s_t$ is intended to be sufficient to mediate all effects of the partner’s last action, clearly state so; otherwise sufficiency is violated. - Eq (8) instantiates the dataset as sequences of the form $(s_t, a_{t-1}, a_t)$. Clearly state what is observable and what is not. Furthermore: - L244-246: Can you clarify how the actions are sampled by the LLM and how they are scored? - In the action pruning step, the method assumes access to an external verifier of feasibility. What does this verifier look like? What happens when such a verifier is unavailable (i.e., real-world robotics)? If it is required, can you add this as an explicit assumption? - Why does CausalPlan underperform in some settings (Table 1)? A brief discussion of failure modes would help. - L318: A short description of the baselines in the main text (with details in the appendix) would make the experiments easier to follow. - What is the impact of removing the previous-action feature, removing partner-state features, or adding the partner’s previous action? - How well does an SCA trained on trajectories from a behaviour policy paired with a partner transfer when deployed with different partners? See weaknesses Fully human-written
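A quick, self-contained check of the 3-cycle example given in the review above, using only the edge weights stated there (an illustration of the reviewer's point, not code from the paper):

```python
# Edge weights from the review's counterexample.
W = {("a", "b"): 0.5, ("b", "a"): 0.3,
     ("b", "c"): 0.5, ("c", "b"): 0.3,
     ("c", "a"): 0.5, ("a", "c"): 0.3}

# The described heuristic: for every bidirectional pair, keep only the stronger edge.
edges = {(i, j) for (i, j), w in W.items() if w > W.get((j, i), 0.0)}
print(sorted(edges))   # [('a', 'b'), ('b', 'c'), ('c', 'a')]

def has_cycle(edges, nodes=("a", "b", "c")):
    """Simple DFS cycle check on a directed graph."""
    adj = {n: [j for (i, j) in edges if i == n] for n in nodes}
    state = {n: 0 for n in nodes}   # 0 = unvisited, 1 = on stack, 2 = done
    def dfs(n):
        state[n] = 1
        for m in adj[n]:
            if state[m] == 1 or (state[m] == 0 and dfs(m)):
                return True
        state[n] = 2
        return False
    return any(state[n] == 0 and dfs(n) for n in nodes)

print(has_cycle(edges))   # True: removing 2-cycles alone does not guarantee a DAG
```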
CausalPlan: Empowering Efficient LLM Multi-Agent Collaboration Through Causality-Driven Planning Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes CausalPlan, a framework designed to improve LLM-based multi-agent collaboration by incorporating explicit causal reasoning into the planning process. The method introduces a Structural Causal Action (SCA) model that learns a causal graph from offline trajectories, modeling dependencies between state factors, prior actions, and next action choices. During inference, the causal graph is used to reweight sampled candidate actions from the LLM, promoting causally consistent planning and filtering out invalid or incoherent actions. Experiments are conducted on the Overcooked benchmark, showing consistent improvements. 1. The paper clearly identifies a real pain point in the LLM-planning space (LLM agents often rely on correlations and fail under causal inconsistencies) and proposes a targeted solution. The motivation aligns well with current challenges in multi-agent LLM systems. 2. The method section is well-structured and easy to follow. The paper clearly explains the proposed causal model and the way it integrates with LLM action sampling. The theoretical argument that, under standard identifiability assumptions, the causal graph and functional relationships can be uniquely recovered adds credibility and supports why the approach should work. 3. Implementation details are provided in good depth, including model architecture and prompting strategies. 1. It does not compare against the most recent LLM-agent + causal reasoning methods, for example, CausalMACE [1] and Causal-aware LLMs [2]. 2. All evaluations are done in the Overcooked kitchen environment. While this benchmark is standard, it is still a fairly constrained action/state space in a stylized cooperative setting. It would be helpful to see results in a more diverse or general multi-agent domain (e.g., social games, robotics simulators). Otherwise, it's unclear how easily the method generalizes to richer or more realistic scenarios. 3. The method depends on manual factorization of state/action features, and lower-level actions are ignored. This raises concerns about domain specificity and manual engineering effort. In complex environments, designing semantic factors may be non-trivial, and it’s unclear how the method scales without strong prior knowledge. [1] https://aclanthology.org/2025.findings-emnlp.777/ [2] https://www.ijcai.org/proceedings/2025/0478.pdf 1. The paper claims not to rely on the LLM’s causal reasoning ability, yet the pipeline still depends heavily on the LLM for analysis and candidate-action generation via a two-prompt design and knowledge library. Could the authors clarify whether the method truly disentangles causal reasoning from linguistic reasoning? To what extent could the observed gains stem from an improved prompting workflow rather than from the causal modeling itself? 2. Each decision step requires extracting factorized features and querying the causal matrix; is the runtime overhead significant? 3. Does the training buffer come from the same task distribution as evaluation? Are its trajectories fully disjoint from the test episodes?
Could the causal structure overfit to the demonstration policy rather than reflect true task dynamics? Fully AI-generated
CausalPlan: Empowering Efficient LLM Multi-Agent Collaboration Through Causality-Driven Planning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper identifies a key failure mode in LLM-based agents, particularly smaller models, which often generate causally invalid actions in multi-agent collaborative tasks. To address this, the authors propose **CausalPlan**, a two-phase framework. In Phase 1, "Causal Action Structure Learning," a Structural Causal Action (SCA) model is learned from a dataset of agent trajectories to capture the influence of previous actions ($a_{t-1}$) and current states ($s_t$) on the next action ($a_t$). This is stored in a Causal Action Matrix ($M$). In Phase 2, "Agent Planning with Causal Knowledge," this matrix $M$ is used to guide the LLM's action selection. This is done via two modules: 1) "Causal-Aware Planning," which re-weights the LLM's output probabilities with the causal scores from $M$, and 2) "Causal Backup Plan," a fallback mechanism that greedily selects the highest-scoring causal action if the LLM fails to produce a valid one (a schematic sketch of this scoring-and-fallback logic is given after this review). Experiments on the Overcooked-AI benchmark and Crafter demonstrate that CausalPlan reduces invalid actions. 1. The proposed two-phase framework is intuitive and modular. 2. The paper is easy to follow. 3. The empirical results are extensive and show consistent performance gains across multiple LLM backbones (Gemma, Llama, Qwen) and evaluation settings (AI-AI collaboration and Human-AI collaboration), outperforming baselines on the Overcooked benchmark. 1. The paper's primary claim rests on "causality-driven planning". However, the SCA model learns a supervised mapping from $(s_t, a_{t-1})$ to $a_t$ based on data collected from a single behavior policy (MEP). It is highly questionable whether this process discovers true "causal" structure as defined by Pearl or simply learns the strong correlations and biases within that specific policy's data. The proof of identifiability (Proposition 1) relies on strong, standard assumptions (e.g., causal sufficiency, additive noise) that are difficult to justify in a complex, dynamic environment like Overcooked. 2. A major limitation, which is not adequately discussed, is that the Causal Action Matrix $M$ appears to be learned **per layout**. The heatmaps in Fig. 10 and 11 are specific to the "CR layout", and the offline training takes 3 hours per environment. This severely limits the method's scalability and flexibility, which is one of the primary advantages of using LLM-based agents. The authors provide no evidence or discussion on whether $M$ learned on one layout can generalize to another. 3. The central idea of learning an external model from trajectory data to score and refine LLM-generated plans is not novel. The paper's related work section is missing key work [1] on this specific problem. - ReAd [1] directly tackles the same problem of inefficient LLM grounding in multi-agent environments like Overcooked. - The proposed "Structural Causal Action (SCA) model" is conceptually very similar to the local advantage function used in [1]. Both frameworks learn a function from agent trajectory data (collected from a behavior policy $\pi_\beta$ here) to score the utility of the proposed plan.
While this paper formulates the scorer as a causal model $P(a_t \mid s_t, a_{t-1})$ and ReAd [1] formulates it as an RL-based advantage function, the high-level approach of using a learned, data-driven scorer to refine LLM plans is highly overlapping. The authors must discuss this and other related works to properly situate their contribution. 1. Could the authors please clarify the novelty of the SCA model compared to other data-driven refinement models, such as ReAd [1]? A thorough comparison in the related work section is necessary. 2. Can the authors provide more evidence that the SCA model is learning true causal relationships rather than just the strong policy-specific correlations from the MEP dataset? What happens if a sub-optimal or random policy is used to generate the dataset $B$? 3. Does the Causal Action Matrix $M$ learned for one layout (e.g., Cramped Room) have any utility when transferred to another layout (e.g., Asymmetric Advantages)? If not, doesn't this per-layout offline training requirement undermine the zero-shot generalization promise of using LLMs? [1] Zhang, Y., Yang, S., Bai, C., Wu, F., Li, X., Wang, Z., & Li, X. Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration. ACL 2025. Fully AI-generated
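For readers who want the scoring mechanism described across the CausalPlan reviews above in one place, here is a minimal sketch of the reweight-and-fallback logic as the reviews describe it. The function names, array shapes, and the normalization of the causal scores are assumptions made for illustration, not the authors' code:

```python
import numpy as np

def causal_score(M, active_features, action_idx):
    """Sum the learned dependency weights of the currently active parent features
    for a candidate action (one row of the causal action matrix M)."""
    return M[action_idx, active_features].sum()

def select_action(llm_probs, M, active_features, valid_actions, gamma=0.5):
    """Blend LLM action probabilities with causal-matrix scores via a convex
    combination with weight gamma; fall back to the top causal score if the LLM
    proposes no valid action ("causal backup plan")."""
    if not valid_actions:
        return int(np.argmax(M[:, active_features].sum(axis=1)))
    scores = np.array([causal_score(M, active_features, a) for a in valid_actions])
    scores = scores / (scores.sum() + 1e-8)     # normalize to compare with probabilities
    blended = (1 - gamma) * llm_probs[valid_actions] + gamma * scores
    return valid_actions[int(np.argmax(blended))]

# Tiny usage example with made-up numbers.
M = np.array([[0.1, 0.7, 0.0],
              [0.6, 0.1, 0.2],
              [0.0, 0.2, 0.5]])
print(select_action(np.array([0.2, 0.5, 0.3]), M, active_features=[1], valid_actions=[0, 2]))
```

The convex weight `gamma` corresponds to the γ weighting whose sensitivity analysis is mentioned in the first of the reviews above.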
Modeling the Density of Pixel-level Self-supervised Embeddings for Unsupervised Pathology Segmentation in Medical CT Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose Screener, an unsupervised visual anomaly segmentation (UVAS) method, which leverages dense self-supervised pre-training and masking-invariant dense feature conditioning variables as a replacement for positional encodings. The authors train Screener on 30k unlabeled CT volumes and outperform existing UVAS methods and self-supervised pretraining methods when fine-tuning their method. The method the authors propose is really nice, as it leverages dense self-supervised pretraining, which is a good choice given their goal of dense downstream applications; their addition of masking-invariant dense feature conditioning variables as a replacement for positional encodings is also well motivated and creative. Additionally, the clarity of the paper is very high, with the authors explaining the different aspects very well. The majority of my criticisms hinge on two key points of the paper. 1) The authors claim their method exceeds current self-supervised learning methods. However, the authors don't compare against MAE pre-training, which was shown to be the strongest SSL pre-training for 3D medical image computing in a recent benchmark [1]. In particular, the chosen VoCo and SwinUNETR pre-training baselines were shown to be bad for segmentation in general, making this claim unsubstantiated. 2) The evaluation protocol of using just 25 training cases is limited. If the authors claim their SSL method to be overall useful for segmentation, I would like them to additionally evaluate their pre-training against other pre-training methods in a full-data regime. This is largely because pre-training methods in general appear to yield performance benefits in small-data regimes; however, when more data is available, they may not do so anymore. Having this information is crucial for practitioners to know which SSL method to choose given their data availability. If the authors 1) include MAE pre-training as an additional SSL baseline (when finetuning), 2) include a finetuning experiment in a full-data regime, and 3) include a full-length nnU-Net training as reference, I will raise my score further. [1] Wald, Tassilo, et al. "An OpenMind for 3D medical vision self-supervised learning." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025. Q1: I am not sure if I missed it, but how do you get from the anomaly maps to hard predictions as needed for DSC measurement? Is there some thresholding and, if so, how is the thresholding done in the unsupervised and supervised settings? Fully human-written
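Since Q1 above asks how continuous anomaly maps become hard predictions for DSC, here is a generic illustration of one common convention in the anomaly-segmentation literature (sweeping thresholds and reporting the best Dice). Whether Screener actually does this is exactly what the question asks, so treat the snippet as context, not as the paper's protocol:

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def dice_from_anomaly_map(scores, gt, thresholds=None):
    """Turn a continuous anomaly map into a hard mask by sweeping thresholds
    and reporting the best achievable DSC (one common, assumed convention)."""
    if thresholds is None:
        thresholds = np.quantile(scores, np.linspace(0.5, 0.99, 50))
    return max(dice(scores >= t, gt) for t in thresholds)

# Dummy per-voxel anomaly scores and a toy ground-truth lesion mask.
scores = np.random.rand(64, 64, 64)
gt = np.zeros((64, 64, 64), dtype=bool)
gt[20:30, 20:30, 20:30] = True
print(dice_from_anomaly_map(scores, gt))
```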
Modeling the Density of Pixel-level Self-supervised Embeddings for Unsupervised Pathology Segmentation in Medical CT Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents Screener, a fully self-supervised framework for unsupervised pathology segmentation in 3D CT. It reframes pathology detection as unsupervised visual anomaly segmentation (UVAS), under the assumption that pathological regions are statistically rare. The method enhances density-based UVAS by (i) learning dense voxel-level self-supervised descriptors tailored to CT data, and (ii) introducing masking-invariant contextual embeddings as conditioning variables within a conditional density model. The models are trained on 30k unlabeled CT volumes. Screener achieves state-of-the-art results on four public CT datasets (LIDC, MIDRC, KiTS, LiTS). When distilled into a single UNet and fine-tuned with limited labeled data, it matches or surpasses medical SSL baselines. The reviewer acknowledges the following contributions: **Addresses an important medical challenge**: The paper tackles the critical and realistic problem of detecting all pathological findings in 3D CT without requiring pixel-level annotation, a requirement that hinders clinical deployment of current supervised methods. **Interesting and well-motivated technical idea**: - The proposed use of self-supervised learning for both the descriptor and conditioning models is conceptually appealing. It enables the system to model normal anatomy and contextual expectations directly from large unlabeled data, thereby detecting deviations that correspond to abnormalities. - Furthermore, the integration of dense SSL with masking-invariant conditioning is simple but effective. It allows a simple Gaussian density model to perform on par with more complex normalizing flows, indicating that meaningful contextual embeddings can simplify anomaly modeling. **Broad evaluation**: Experiments across four diverse datasets demonstrate that Screener generalizes well across organs and pathologies. The unsupervised anomaly segmentation results are significant, with clear ablation studies supporting each component. **Clear presentation**: The paper is clearly written, logically organized, and easy to follow, making complex ideas accessible to both machine learning and medical imaging audiences. I found the following weaknesses in the current manuscript: - **Core conceptual clarity**: The motivation for why optimizing the **conditional density model** $q_{\theta}(y|c)$ helps detect abnormal regions is not fully articulated (a toy sketch of such a density-based anomaly score is given after this review for reference). Since this is a central concept, further clarification and discussion would strengthen the theoretical grounding. For instance, the authors need to provide a visualization of normal pixels near the abnormal regions and show how the heatmap behaves, to provide insight. - **Experimental organization**: The unsupervised experiments could be better structured to highlight the benefits of dense SSL.
For instance, baselines should be grouped into (i) image-level SSL models (e.g., autoencoders), also adding SOTA models like LVM-Med [1] (developed for ResNet-50 and suitable for a U-Net), and (ii) dense SSL models (the ones the authors currently compare against), to clearly demonstrate that dense SSL drives the improvements. - **Supervised performance gap** – While competitive, the fine-tuned Screener in Table 2 lags behind certain SOTA supervised or SSL-pretrained models. The authors could explore initializing the distilled UNet with a pretrained medical backbone (e.g., RadImageNet [2] or LVM-Med [1]) to enhance feature representations. - **Presentation detail** – Some result table formats (e.g., Tables 1–3) could be reformatted and polished for a more professional appearance and readability. [1] LVM-Med: Learning large-scale self-supervised vision models for medical imaging via second-order graph matching. NeurIPS 2023. [2] RadImageNet: an open radiologic deep learning research dataset for effective transfer learning. Radiology: Artificial Intelligence. - In the self-supervised learning settings, besides the medical models, can the authors examine the performance of generalist models, e.g., the Grounding DINO technique [3], which can retrieve objects given a prompt input, and apply some thresholding to filter noise? Incorporating such experiments would be interesting and would highlight the benefits or potential weaknesses of the current strategy. - It would be great to see the performance attainable by applying a U-Net with pre-trained medical models during the distillation process, aiming to bridge the gap between the current performance and other SOTA methods. [3] Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection. Fully human-written
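For context on the "core conceptual clarity" point above, a toy sketch of how a conditional density over voxel descriptors yields a per-voxel anomaly score. The diagonal Gaussian with an MLP head is an assumption made for illustration, not the paper's exact architecture; the only point is that rare descriptor-context pairs receive a high negative log-likelihood:

```python
import torch
import torch.nn as nn

class ConditionalGaussian(nn.Module):
    """Toy conditional density q(y|c): a diagonal Gaussian over descriptor vectors y,
    with mean and log-variance predicted from conditioning embeddings c."""
    def __init__(self, c_dim, y_dim, hidden=128):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(c_dim, hidden), nn.GELU(),
                                  nn.Linear(hidden, 2 * y_dim))

    def anomaly_score(self, y, c):
        mu, log_var = self.head(c).chunk(2, dim=-1)
        # Per-voxel negative log-likelihood (constant terms dropped):
        # descriptors that are rare given their context get high scores.
        nll = 0.5 * (((y - mu) ** 2) / log_var.exp() + log_var).sum(dim=-1)
        return nll

# Usage on dummy per-voxel embeddings of shape (num_voxels, dim).
model = ConditionalGaussian(c_dim=32, y_dim=16)
y, c = torch.randn(1000, 16), torch.randn(1000, 32)
scores = model.anomaly_score(y, c)          # higher = more anomalous
print(scores.shape)                         # torch.Size([1000])
```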
Modeling the Density of Pixel-level Self-supervised Embeddings for Unsupervised Pathology Segmentation in Medical CT Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes Screener, a self-supervised anomaly segmentation framework for 3D CT. The method combines dense self-supervised descriptor learning, masking-invariant conditioning embeddings, and density modeling to detect abnormal patterns. The model trains on over 30,000 unlabeled CT volumes. Screener is tested on four public CT benchmarks under both unsupervised and fine-tuning settings. The problem is clearly motivated, as a supervised method cannot capture all distributions of pathologies. The integration of dense SSL for descriptors and SSL-learned conditioning embeddings looks novel. The proposed model is tested on four public CT benchmarks under both unsupervised and fine-tuning settings. Strong and consistent performance is observed. While the method is well executed, the contribution may appear incremental, as the dense SSL for voxel embeddings and conditioning is a straightforward extension of VICReg, and replacing sin-cos encodings with conditioning embeddings is conceptually natural. The paper would benefit from a clearer articulation of why conditioning embeddings are fundamentally more informative than APE or sin-cos encodings, beyond empirical gains. The proposed method looks closely related to an existing method [1]. It would be beneficial if the authors acknowledged and contrasted it with this prior work. The method produces per-voxel anomaly scores. A threshold is necessary to perform the final segmentation. It is unclear how this threshold is determined. References: [1] Seince, Maxime, Loïc Le Folgoc, Luiz Facury De Souza, and Elsa Angelini. "Dense Self-Supervised Learning for Medical Image Segmentation." In Medical Imaging with Deep Learning, pp. 1371-1386. PMLR, 2024. 1. Are there reasons why conditioning embeddings are fundamentally more informative than APE or sin-cos encodings? 2. How is the threshold determined to obtain the segmentation? Lightly AI-edited
MedFuse: Multiplicative Embedding Fusion for Irregular Clinical Time Series Soundness: 2: fair Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes MedFuse, a framework for irregular clinical time series centered on the MuFuse (Multiplicative Embedding Fusion) module. MuFuse fuses value and feature embeddings through multiplicative modulation, preserving feature-specific information while modeling higher-order dependencies across features. Experiments on two datasets demonstrate the superior performance of MedFuse. 1. The studied problem is interesting and important; in particular, irregular time series analysis is a challenging domain for clinical data analysis. 2. The proposed approach achieves good results on three different tasks. 1. The paper’s contribution appears incremental. Representing each feature at each timestamp as an embedding follows prior work in SUMMIT [1], as the authors acknowledge. The main novelty is the multiplicative fusion of the feature-identifier and value embeddings. As presented, this reads more as a heuristic than a method: the paper does not explain why multiplication is preferable to alternatives such as addition, concatenation, gating, attention, or bilinear pooling. 2. Empirically, the gains are unclear. In Table 1, MedFuse and SUMMIT achieve overlapping performance once the reported standard deviations are considered, suggesting no statistically meaningful improvement. 3. In addition, several figures and tables (e.g., Figure 1, Figure 3, Table 2) closely mirror SUMMIT’s presentation, which further blurs what is substantively new beyond [1]. [1] Huang, Chun-Kai, Yi-Hsien Hsieh, Ta-Jung Chien, Li-Cheng Chien, Shao-Hua Sun, Tung-Hung Su, Jia-Horng Kao, and Che Lin. "Scalable Numerical Embeddings for Multivariate Time Series: Enhancing Healthcare Data Representation Learning." arXiv preprint arXiv:2405.16557 (2024). 1. What is the rationale for choosing a multiplicative operation to fuse feature-identity and value embeddings? Please provide theoretical motivation or comparative experiments against common alternatives (addition, concatenation, gating/MLP, attention, bilinear). 2. Table 2 reports ablations only on PhysioNet 2012 (P12). Could you extend these ablations to other datasets or tasks to demonstrate robustness and generality? 3. What is the intended takeaway of Table 3? As written, the transfer results do not show clear gains, and performance when transferring from P12 to MIMIC-III appears to worsen. Lightly AI-edited
MedFuse: Multiplicative Embedding Fusion for Irregular Clinical Time Series Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. To effectively model numerical laboratory measurements, this paper proposes MedFuse, which disentangles the numerical values of irregular clinical time series into value and feature embeddings. - The motivation is clear. - The problem is interesting. - The paper is easy to read and understand. - Insufficient experiments. - Novelty is limited. It is a technique that disentangles numerical values into two parts, but this has been done by others, such as FT-Transformer and TabTransformer-like models. - Novelty clarification. Please re-clarify the novelty after surveying Transformers for tabular data. - More details about the standard lookup table are needed. - The method is claimed to model the feature-identity embeddings and value embeddings. Validation experiments or a detailed interpretation are needed to support this claim. - Why repeat each entry of $e_v$ $k$ times? There are other choices, like using an MLP (or other layers) to map the two embeddings into one shared embedding. Repeating $k$ times does not make sense. - The TabTransformer-like models should also be included as baselines. - The step size in "Comparison of different partitioning factors k on P12" is too large. A fine-grained setting is needed. Lightly AI-edited
MedFuse: Multiplicative Embedding Fusion for Irregular Clinical Time Series Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 0: Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes MedFuse, a model incorporating irregular clinical time series. The core of the paper is a novel 'MuFuse embedding module' that performs value-conditioned multiplicative fusion. The multiplicative factors themselves are obtained via a sigmoid function to suppress outliers (a minimal sketch of this kind of fusion is given after this review). The authors claim this approach better captures nonlinear feature-value interactions compared to existing fusion methods, which are generally additive. They evaluate MedFuse on three clinical datasets (two public and one private), showing performance improvements over several baselines. The paper also includes cross-dataset transfer experiments to support the idea of learning cohort-agnostic features. Fusion methods for irregular EHR data are an interesting and underexplored area. The paper generalizes the SUMMIT model, and the derivation is sound. The formulation and mathematical notation are generally clearly presented. The overall paper is easy to read. The experiments are not just on ICU mortality prediction datasets, but also on carcinoma, suggesting some generalizability. If the claims hold, the approach proposed in the paper can influence how numerical values are embedded in clinical models. The paper heavily relies on the claim that multiplicative fusion enables "richer feature-value interactions" and "nonlinear modulation," but provides limited evidence of its benefit for clinical data. The justification for multiplicative gating is also hand-wavy, and relevant only at the last layer. Since we have a transformer architecture with multiple layers, there is no reason why the relevant interactions cannot be learned under additive terms. Furthermore, there are multiple papers which have proposed incorporating multiplicative interactions. How is this work different? The paper talks about irregularly sampled time series, but I do not see anything specific to that use case here. Any argument in the paper can be applied to regularly sampled, fully observed series as well. The improvements are minor, and are often not significant. This might not be as important if the paper established that these differences are clinically meaningful, but I see no such evidence. Additionally, all the results are based on only two datasets, PhysioNet and MIMIC. The third, HCC, is entirely private with no reference. Furthermore, on PhysioNet, accuracy increases only marginally while the model's deviation increases fourfold. This undermines the entire argument about the model capturing specific multiplicative interactions which others do not. What is the value of Sec. 5.3? The alternative parameterization was neither experimented on nor theoretically analyzed, so this seems like an unnecessary addendum to the paper. As the paper also says, the transfer results are influenced more by dataset size than by embedding quality. Moreover, the description says that after initial freezing, the embeddings are fine-tuned as well. Doesn't this undermine the claim of learning "reusable, cohort-agnostic embeddings"?
Additionally, baseline models with multiplicative interactions could be added; see [1,2,3] and the references therein. Have you conducted any error analysis to identify specific patient subgroups or clinical scenarios where the improvement is most pronounced? IMPORTANT: I was a last-minute/emergency reviewer, so I have not had the chance to read this in detail. If I have misunderstood or missed anything, please bring it to my notice, and I will correct myself and revise my ratings. Minor: There are many duplicated references (Song et al., Shukla and Marlin, Tipirneni and Reddy, etc.). [1] Multiplicative Interactions and Where to Find Them, Jayakumar et al. [2] AdaDHP: Fine-Grained Fine-Tuning via Dual Hadamard Product and Adaptive Parameter Selection, Liu et al. [3] SCINet: Time Series Modeling and Forecasting with Sample Convolution and Interaction, Liu et al. Fully human-written
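For context on the fusion mechanism debated across the three MedFuse reviews, a minimal sketch contrasting multiplicative modulation (with a sigmoid-bounded factor, as the summary above describes) against a plain additive alternative. The value encoder, the dimensions, and the omission of the k-fold repetition are simplifications and assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class MultiplicativeFusion(nn.Module):
    """Illustrative sketch: a per-feature identity embedding is modulated elementwise
    by a sigmoid-bounded factor computed from the observed value."""
    def __init__(self, n_features, d_model):
        super().__init__()
        self.feature_emb = nn.Embedding(n_features, d_model)   # feature identity
        self.value_proj = nn.Linear(1, d_model)                 # scalar value -> d_model

    def forward(self, feature_ids, values):
        e_f = self.feature_emb(feature_ids)                           # (B, d_model)
        gate = torch.sigmoid(self.value_proj(values.unsqueeze(-1)))   # bounded in (0, 1)
        return e_f * gate        # multiplicative fusion; an additive variant would use e_f + gate

# One observed measurement: feature id 3 with a (normalized) value of 1.7.
fusion = MultiplicativeFusion(n_features=10, d_model=8)
out = fusion(torch.tensor([3]), torch.tensor([1.7]))
print(out.shape)   # torch.Size([1, 8])
```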
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces a new tokeniser-free architecture, H-Net, that learns segmentation strategies and can be applied more than once, thereby creating learnable tokenisation. It evaluates its performance with respect to other architectures when data- and compute-matched. - Interesting architecture that contributes to investigating the long-standing issue of tokenisation in language models (aka tokenisation-free architectures) - Many carefully conducted experiments, with positive results - Paper clear and well written - I think it is not ideal that the state of the art is only briefly described in the main part of the paper and discussed at greater length in an appendix. This makes the main part of the paper not really self-contained. - Moreover, the discussion of what is different and novel in H-Net with respect to previously published works is insufficient, especially in the main part of the paper. The authors write that "H-Nets […] unlock the ability to remove another layer of pre-processing, such as tokenizers, and instead learn them end-to-end". To me, it is not the case that H-Nets "unlock" (i.e. make possible for the first time ever) such an "ability". Previous works have done this before, and the authors should acknowledge this better, and clearly explain how and why H-Nets are novel and better. - The authors rightly explain that such tokeniser-free architectures remove the need for heuristic tokenisation strategies and create "optimal" (in some sense) segments that can be visualised and analysed. The authors only show a handful of examples in English, although they mention that improvements are larger in Chinese. This is certainly due to the fact that their algorithm re-discovers whitespace-separated tokens in English, but cannot do that in Chinese, where there are no whitespaces. But a whole discussion of the boundaries endogenously learned by the model is missing. In languages using a script that includes whitespaces, how often are boundaries placed on whitespaces? When are whitespaces not used as boundaries, and when does the model place boundaries at other places? In languages using a script that does not include whitespaces, do the endogenously learned segments "look like" what linguistic traditions (and more recently, treebank developers) have defined as "words" or "word-forms"? When using 2 levels of hierarchy, does the first level generate morph-like units (it seems that it does not, at least in the examples shown for English)? To me, adding such a discussion is absolutely necessary to understand what the model actually learns and how (and when) it (should) perform(s) better than models relying on heuristic and/or statistics-based tokenisation strategies (whitespace, BPE, etc.) - Will the codebase be released? Fully human-written
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors propose a hierarchical model (H-Net) that can operate on raw byte sequences. The model is learned end-to-end. First, the sequence is processed by a shallow Mamba-based encoder, then it is chunked and downsampled to reduce its length. The main network is then applied, which can have any architecture (typically a Transformer or, recursively, an H-Net for multiple hierarchy levels). Then, a smoothing module is applied to provide gradients for the decision points, the sequence is upsampled by repeating the main network's output the correct number of times, and a Mamba-based decoder network outputs the full-length sequence. Additional regularization is used to achieve the desired target compression ratio. The model consistently outperforms its competitors, including Transformers, SpaceByte, and MambaByte. - Interesting, fully differentiable method - Improved scaling - Improved robustness - Complexity - Clarity: Some baselines are not described in enough detail. For example, what exactly is H-Net (Space) or H-Net (pool)? Some details are described in the appendix, but they are very vague. Line 353 claims that the main network also uses Mamba layers, but this is never described. Are all layers Mamba layers? Only some? - The only BPE model is a Transformer; however, the H-Net uses Mamba layers. It would be nice to have a Mamba-based BPE model to see whether the dynamic chunking or the Mamba is the bigger win. - In line 206, the authors say "where chunking layer ... ", but the chunking layer is defined afterward, which is a bit confusing. - In line 355, the authors say "As discussed Section, ... comprise mainly Mamba-2 layers." However, Mamba was never mentioned before in the paper. - It would be nice to describe in the text what the scales are. As it stands, one has to look at the figure descriptions to figure it out. - In line 1455, the authors say "Multiplying upsampled vectors by their confidence scores incentivizes the routing module to make confident, accurate decisions." However, they are multiplied by one, and the difference is only in the backward pass. Why does this trick help? Fully human-written
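Regarding the last question above ("they are multiplied by one, and the difference is only in the backward pass"): a common way to implement exactly that behavior is shown below. Whether this matches H-Net's actual implementation is an assumption; the snippet only illustrates the pattern the reviewer is describing.

```python
import torch

# Forward pass multiplies by exactly 1, but gradients still flow into the
# confidence scores c, so the routing module gets a learning signal.
x = torch.randn(4, 8)                       # upsampled vectors
c = torch.rand(4, 1, requires_grad=True)    # boundary confidence scores in (0, 1)

scale = c / c.detach()                      # forward value is exactly 1
out = x * scale                             # forward: out == x

out.sum().backward()
print(torch.allclose(out, x))                         # True: no change in the forward pass
print(c.grad is not None and c.grad.abs().sum() > 0)  # True: gradient reaches c
```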
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces H-Net, a novel and compelling hierarchical architecture for end-to-end sequence modeling directly from raw bytes, aiming to replace the conventional tokenization pipeline. The core contribution is a "Dynamic Chunking" (DC) mechanism that learns content- and context-dependent segmentation strategies jointly with the main model. This is achieved through a clever combination of a similarity-based routing module to predict boundaries and a smoothing module to ensure stable, differentiable end-to-end training. The authors demonstrate empirically that H-Net not only trains stably at scale (up to 1.3B parameters) but also outperforms strong, compute-matched BPE-tokenized Transformer baselines. Furthermore, they show that the architecture can be recursively stacked (2-stage H-Net), leading to even better performance and scaling, particularly on languages and modalities where traditional tokenization heuristics are less effective, such as Chinese, code, and DNA sequences. - **Novelty and Significance:** The paper's primary strength lies in presenting a robust and scalable framework for fully end-to-end, learnable segmentation. The Dynamic Chunking mechanism, especially the smoothing module that turns a discrete selection problem into a differentiable one, is an elegant solution to a notoriously difficult problem that has hampered previous efforts. This work represents a significant step towards realizing the "bitter lesson" by replacing a major handcrafted heuristic (tokenization) with a learned component. - **Empirical Rigor:** The experimental evaluation is thorough and convincing. The authors conduct carefully controlled comparisons against strong baselines (BPE Transformer, MambaByte, SpaceByte) by matching both data and computational (FLOPs) budgets. The consistent outperformance of the 2-stage H-Net across different model scales is a powerful result. - **Generality and Robustness:** The paper convincingly demonstrates the model's advantages beyond standard English text. The superior performance on Chinese, code, and DNA sequences validates the claim that a learned chunking strategy is more generalizable than fixed heuristics. The improved robustness to textual perturbations is another key benefit of operating directly on bytes.   - **Strong Ablation Studies:** The authors provide detailed ablation studies that validate their key architectural choices. These studies effectively demonstrate the importance of the smoothing module, the similarity-based routing, and the use of SSMs (Mamba-2) in the outer encoder/decoder layers, strengthening the credibility of the proposed design. - **Practical Efficiency Concerns:** The paper candidly acknowledges that the current implementation can be up to 2x slower during training and has dynamic memory usage, which can be unpredictable. This is a significant practical hurdle for widespread adoption and large-scale training, as it complicates hardware optimization and resource allocation. 
- **Uncertainty and Potential Fragility at Extreme Scale:** While stability is demonstrated up to 1.3B parameters, the fact that larger scales are left as future work raises questions about the fundamental robustness of the mechanism. The complex interplay between the main prediction loss and the auxiliary ratio loss could introduce unforeseen instabilities at much larger scales (e.g., 70B+). This limitation implies that the current mechanism, while effective, might not yet be a "fundamental" solution but rather one that is proven to work within a specific regime. - **Need for Stronger Evidence on Principled Operation:** This is the most critical weakness. The paper's core claim is that it introduces a principled, end-to-end chunking mechanism. However, the primary evidence comes from ablation studies showing that removing a component (e.g., the smoothing module) degrades the final performance metric (BPB). This demonstrates that the components are necessary for good performance, but it does not sufficiently prove that they are working as theorized. For instance, it is unclear whether the performance gain is due to the smoothing module correctly interpolating uncertain boundaries, or whether it is acting as a complex, yet effective, form of regularization. The visualizations of learned boundaries are a good first step, but more direct, quantitative evidence is needed to establish that this is a truly fundamental innovation rather than a highly effective, scale-specific heuristic. - **Hyperparameter Sensitivity:** The ratio loss weight, $\alpha$, is fixed at 0.03 for all experiments without an accompanying ablation study. The model's performance and learned compression ratio could be highly sensitive to this value, potentially requiring extensive tuning for new domains or modalities. This replaces one form of tuning (tokenizer design) with another, potentially more opaque one. To further strengthen my assessment and address the concerns about the mechanism's fundamental nature, I would appreciate the authors' response to the following questions: - **Evidence for Principled Operation:** Beyond the final performance metrics in ablation studies, can you provide more direct evidence that the core modules are functioning as hypothesized? For example: - **Routing Module:** Can you analyze the relationship between the cosine similarity scores and human-annotated semantic/syntactic boundaries? Does the "w/o cosine routing" variant learn a different, perhaps less interpretable, chunking strategy, or does it fail to learn any consistent strategy at all? - **Smoothing Module:** Does the smoothing module primarily act on low-confidence boundaries ($P_t \approx 0.5$), as intended? Could you provide statistics on the distribution of $P_t$ values and show how the EMA application correlates with them? This would help differentiate its role from being a general regularizer. - **Scaling and Stability:** Could you elaborate on the potential stability challenges at scales significantly larger than 1.3B parameters? Have you observed any trends in the training dynamics (e.g., the interplay between the ratio loss and the main loss) that might suggest future issues, and do you have any hypotheses on how to mitigate them? Proving the mechanism's fundamental nature requires confidence in its scalability. - **Hyperparameter $\alpha$:** Could you provide more intuition on the choice of $\alpha=0.03$? 
How sensitive is the model's final performance and, more importantly, the stability of the learned compression ratio to this hyperparameter? An ablation, even a small one, would be very valuable to understand the robustness of the training process. Fully AI-generated
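For readers trying to follow the questions above about the smoothing module and the ratio loss weight $\alpha$, here is a toy sketch of how such components could operate. The exact formulas below are my assumptions for illustration only, not the paper's definitions: an EMA-style interpolation keyed to boundary confidence, and a simplified penalty that pushes the average boundary rate toward a target compression ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, target_ratio = 12, 4, 4.0           # aim for roughly one chunk per 4 positions

z = rng.standard_normal((T, d))            # upsampled (dechunked) vectors
p = rng.uniform(0.05, 0.95, size=T)        # boundary confidences in (0, 1)

# EMA smoothing: confident positions (p near 1) keep their own vector,
# uncertain positions (p near 0.5) are blended with the previous smoothed state.
z_bar = np.zeros_like(z)
z_bar[0] = z[0]
for t in range(1, T):
    z_bar[t] = p[t] * z[t] + (1.0 - p[t]) * z_bar[t - 1]

# Simplified ratio penalty: mean boundary probability should match 1 / target_ratio.
ratio_loss = (p.mean() - 1.0 / target_ratio) ** 2
print("mean boundary prob:", round(float(p.mean()), 3), "ratio penalty:", round(float(ratio_loss), 4))
```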
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling Soundness: 3: good Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors propose a tokenizer-free architecture that dynamically segments sequences by jointly learning context- and content-dependent boundaries alongside the language modeling objective. Their model is hierarchical, enabling it to capture multiple levels of abstraction, from fine-grained details to higher-order structure. In experiments on English, the architecture demonstrates increased robustness at the character level compared to BPE-based tokenizers, and the authors further report improvements for Chinese, code, and DNA sequences. - The paper addresses an important limitation of many tokenizer-free architectures: training instability when boundary predictors must make discrete decisions (with or without supervision). Their proposed architecture is elegant in how it handles segmentation via the novel routing and smoothing mechanism. - The paper is well written, with extensive ablations and detailed discussions of different architectural and experimental choices that aid reproducibility. - Their experimental results are great to see: they demonstrate improvements over traditional BPE on downstream tasks, with robustness on character-level tasks, code, etc. This speaks directly to the benefits of their dynamic tokenization strategies. - Of course, this paper is not framed as a multilingual one, but the authors do claim improvements in other languages and only evaluate on Chinese. While the improvements are notable, many recent frontier LMs are trained on web data across several languages. Do you have insights on how your architecture scales in a multilingual setting, where it is very common to have very distinct tokens mixed in individual sequences? - In addition to what I mentioned above, I am particularly curious about the potential challenges of scaling H-Net experiments to a truly multilingual setting involving languages with distinct scripts, varying data sizes, and diverse linguistic structures. When data is highly imbalanced across languages or domains, how robust is the quality of the learned segmentation? Even if prior ratios are fixed to guide boundary prediction convergence, could the quality of learned segmentation degrade as language-specific data becomes more limited, given that boundary learning is inherently context-dependent? I will be interested to hear your thoughts on this. - Why are there no comparisons to BLT (https://arxiv.org/pdf/2412.09871) and the dynamic token pooling paper (https://arxiv.org/abs/2211.09761)? At the very least, there is some similarity in architectural designs with the dynamic token pooling paper. - There's also no discussion about how well the model handles out-of-domain sequences, and I don't mean big shifts going from natural language to DNA or code, but more subtle, realistic shifts like moving to scientific text or just long-tail words in the pretraining data. What kind of segmentations are observed? Are they optimal? - Any thoughts on how your model performs at really small scales (<700M parameters)? I know there seems to be more value these days in training models at larger scales; I am just curious if you have any thoughts. Fully human-written
Identifiability Challenges in Sparse Linear Ordinary Differential Equations Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors prove an identifiability result for a class of linear dynamics governed by a sparse-continuous random matrix. I went through the proofs in detail, and although this is outside my personal area of research, I can say that the techniques and arguments seem reasonable. My only feedback is that the authors should better motivate this class of sparse-continuous matrices and explain in what way it is relevant to machine learning. Of course, identifying dynamics from time series is of interest, but a bit more discussion about why one should care about this is necessary. Seems like a correct theoretical paper. See my response to the summary: the relevance of this class of matrices needs to be better motivated. I couldn't tell whether this is an impactful result and the authors chose this problem because it is important, or whether they simply chose a class of matrices for which they knew how to prove something. If they can help me understand that better, I will raise the score. No further questions. Fully human-written
Identifiability Challenges in Sparse Linear Ordinary Differential Equations Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper revisits the identifiability of linear ODE systems $dx(t)/dt = Ax(t)$ when the system matrix $A$ is sparse. The authors show that, unlike the dense case where almost all systems are identifiable from a single trajectory, sparsity introduces a positive probability of unidentifiability. They define a sparse–continuous ensemble and prove a sharp phase transition: when sparsity exceeds roughly $1 - \ln n / n$, systems become globally unidentifiable with high probability. The paper also introduces a trajectory-level metric to quantify how close a trajectory is to being unidentifiable. Simulations with SINDy and Neural ODE estimators confirm that identifiability deteriorates with higher sparsity, in line with the theory. 1. Provides a clear theoretical characterization of when sparse linear ODEs lose identifiability. 2. Decomposing failure probability into global vs. trajectory unidentifiability (Eq. 2) is conceptually clarifying, and the distance $d_A(x_0)$ is a helpful practical proxy. 3. The sharp threshold result is elegant and connects to known results in random matrix theory. 4. Writing is well structured, and assumptions are transparently stated. 1. The strong assumptions (noise-free, single continuous trajectory, full observability) limit real-world applicability. 2. For readers less familiar with random matrix theory, it would be helpful to include a brief intuition box or paragraph below Lemma 3 explaining why the identifiability transition occurs precisely at $1 - \ln n / n$. A short, high-level explanation would make this elegant result more accessible and highlight its connection to classic random graph thresholds. 3. In the main text, the normalized Hamming distance is described as divided by $n$, but Appendix C.2 defines it as normalized by $n^2$. Please clarify which version was used, and make it consistent. 1. In Fig. 4, for small dimensions (e.g., $n=3,5$) SINDy sometimes achieves low normalized Hamming distance even at very high sparsity. Could you comment on this regime? 2. In Section 2 ("Discussion of assumptions and limitations"), you might consider citing a related work on identifiability under hidden confounders: Wang et al. (2024), Identifiability analysis of linear ODE systems with hidden confounders. Fully AI-generated
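To make the cited threshold concrete, the following small Monte Carlo sketch (my own, under the sparse-continuous model as described in these reviews: an i.i.d. Bernoulli zero mask applied to continuous entries) estimates how often a sampled matrix contains at least two all-zero columns, the event the reviews describe Lemma 2 as using to lower-bound the probability of global unidentifiability.

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_two_zero_columns(n, sparsity, trials=2000):
    """Fraction of sparse-continuous matrices with at least two all-zero columns."""
    count = 0
    for _ in range(trials):
        mask = rng.random((n, n)) > sparsity           # each entry is nonzero w.p. 1 - sparsity
        A = mask * rng.standard_normal((n, n))
        if np.sum(~A.any(axis=0)) >= 2:                # columns that are entirely zero
            count += 1
    return count / trials

for n in (5, 10, 20):
    s_crit = 1.0 - np.log(n) / n                       # threshold cited in the review
    for s in (s_crit - 0.1, s_crit, s_crit + 0.1):
        print(f"n={n:2d}  sparsity={s:.2f}  P(>=2 zero cols) ~ {frac_two_zero_columns(n, s):.3f}")
```

The estimated fraction rises sharply once the sparsity passes the stated threshold, which matches the qualitative picture given in these reviews; it is only a lower-bound event, not the full unidentifiability probability.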
Identifiability Challenges in Sparse Linear Ordinary Differential Equations Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper presents a theoretical study of the identifiability of autonomous, linear and noise-free ODEs from a single trajectory, focusing on systems where the drift matrix is *sparse*. The authors distinguish between two notions of unidentifiability: (i) system-level (or global) unidentifiability, where the system is unidentifiable regardless of the initial condition, and (ii) trajectory-level unidentifiability, which occurs only for initial states lying in invariant subspaces of the system. *For system-level unidentifiability*, they show that in sparse systems it is implied by rank deficiency (i.e., zero-eigenvalue degeneracy, Lemma 1), derive a lower bound on its probability via the occurrence of system matrices with two zero columns (Lemma 2), and extend the analysis (Lemma 3) using random graph theory to identify a dimensionality-dependent sparsity threshold governing unidentifiability. *For trajectory-level unidentifiability*, they prove that the probability of “unlucky” initial conditions lying in invariant subspaces is zero, so identifiability depends solely on the system-level property. They further analyze near-invariant initial conditions, introducing a notion of distance to invariant subspaces, and showing that the indistinguishability time between trajectories of two systems agreeing on such a subspace increases inversely with this distance (Lemma 4). Theoretical findings are finally supported by numerical experiments, which rely on neural network (Neural ODE) and symbolic regression (SINDy) methods. 1. The paper provides strong context through a clear discussion of related work, assumptions, and limitations. This situates the contribution well within the system identification and machine learning communities and makes its relevance easy to grasp. 2. The work extends classical results on the unidentifiability of linear systems to the practically important case of sparse systems, filling a notable gap in the existing theory. 3. The proofs are well structured and technically interesting. 4. The experimental section is well designed and directly supports the theoretical claims. The authors also apply two widely used system-identification methods (Neural ODE and SINDy) to test the practical implications of their results. The exposition of these empirical findings is clear and coherent (albeit limited, see Weekness 2 below). 5. The appendix is comprehensive, providing detailed proofs, additional empirical results, and clear information for reproducibility. Overall, the paper is well written, technically sound, and contributes meaningful theoretical insight into the identifiability of sparse linear systems. 1. The most immediate limitation lies in the restriction to linear and noise-free systems. While the authors are transparent about this assumption, it considerably narrows the scope of applicability to real-world or data-driven settings. 2. The results on empirical unidentifiability (Section 5.3) are interesting but somewhat limited. 
The authors do not explicitly connect their theoretical findings on trajectory-level unidentifiability with their empirical unidentifiability results. For example, it would be interesting to examine how the two system-identification methods considered behave for initial conditions close to invariant subspaces, especially since the trajectory length should influence distinguishability in that regime (as suggested by Lemma 4). 3. While the use of a Bernoulli mask to model sparsity is theoretically convenient, the authors neither discuss nor acknowledge (or at least I don't find such a discussion/acknowledgement) that real-world sparse networks often exhibit "structured sparsity" (e.g., hub nodes or community structure). The Bernoulli mask (i.e. Erdős–Rényi) model does not capture such degree statistics, which may limit the generality of the conclusions. 1. Why did you title Section 3 “Global Unidentifiability” instead of “System-level Unidentifiability”? Put differently, why introduce the term system-level unidentifiability later in Section 3 rather than consistently using global unidentifiability throughout? This shift in terminology is somewhat confusing. 2. In Appendix B, you discuss the sparsity of real-world gene regulatory networks. What are the node degree statistics of these networks, and can they be reproduced by your sparse–continuous random matrix model? If not, how might structured or heavy-tailed degree distributions affect your theoretical results? 3. Within your setup, is there a way to study how the closeness of an initial condition to an invariant subspace influences the performance of Neural ODE and SINDy? In particular, does the length of the observed trajectory or the inter-observation interval play a significant role in this regime, as suggested by Lemma 4? 4. In Figure 4, SINDy appears to match the true sparsity pattern almost perfectly for the 3-dimensional case across all sparsity levels. How should these empirical findings be interpreted in light of your sharp sparsity threshold for global unidentifiability, and the corresponding empirical verification in Fig. 2? In other words, why does SINDy succeed in this low-dimensional setting even in regimes predicted to be unidentifiable? Fully human-written
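As a companion to the trajectory-level discussion in these reviews, here is a small sketch of the classical single-trajectory check (full rank of the Krylov matrix, equivalently the initial state lying in no proper invariant subspace). I am assuming this standard criterion is the one the paper builds on, and the smallest singular value below is only a crude proxy for closeness to unidentifiability, not the paper's actual distance $d_A(x_0)$.

```python
import numpy as np

rng = np.random.default_rng(2)

def krylov(A, x0):
    """Columns x0, A x0, ..., A^{n-1} x0."""
    n = len(x0)
    cols = [x0]
    for _ in range(n - 1):
        cols.append(A @ cols[-1])
    return np.stack(cols, axis=1)

n, sparsity = 6, 0.7
mask = rng.random((n, n)) > sparsity
A = mask * rng.standard_normal((n, n))      # sparse-continuous system matrix
x0 = rng.standard_normal(n)                 # random initial condition

K = krylov(A, x0)
rank = np.linalg.matrix_rank(K)
sigma_min = np.linalg.svd(K, compute_uv=False).min()
print(f"Krylov rank {rank}/{n} (full rank <=> x0 lies in no proper A-invariant subspace)")
print(f"smallest singular value {sigma_min:.3e} (crude proxy for near-unidentifiability)")
```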
Can Graph Quantization Tokenizer Capture Transferrable Patterns? Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper investigates whether graph quantized tokenizers (using VQ/RVQ methods) capture transferable structural patterns across different graph domains. The authors introduce the Graph Token Information Discrepancy Score (GTID), a metric based on normalized Maximum Mean Discrepancy that quantifies the alignment of structural and feature information for nodes assigned to the same token across source and target graphs. Novelty: Investigating whether graph tokenizers learn reusable structural patterns addresses a fundamental gap as the community moves toward graph foundation models. The motivation is well-articulated and grounded in practical applications. Motivation and Analysis: GTID provides an intuitive decomposition of token consistency into feature-based and structure-based components. The separation enables targeted analysis of tokenizer failures and could inform future design choices. Theorems 1 and 2 formalize the intuition that code-conditional discrepancies in feature/structural spaces directly contribute to transfer error. The bounds connect IPM-based discrepancies (W₁ distance for features, MMD for structures) to generalization gaps. Empirical Results: SHE shows measurable reductions in structural GTID and performance gaps (Figure 5), which validates the hypothesis that explicit structural inductive biases matter. Dataset: - Only citation networks and e-commerce graphs are tested. More critical domains like molecular graphs (where structural motifs determine properties), biological networks, and social networks are absent. The largest graph has 173K nodes, which is small. GTID's behavior and computational feasibility at scale (millions of nodes/edges) are unknown. From my understanding, GTID requires computing centrality for every node, which may be prohibitively expensive for large graphs. Baselines: Many baselines are missing from this paper. - No non-quantized transfer baselines: The paper does not compare against established continuous GNN transfer learning approaches such as GraphCL [1], SimGRACE [2], or GROVER [3] for molecular graphs, nor domain adaptation methods like AdaGCN [4] or UDAGCN [5]. Without these baselines, it is impossible to determine whether the observed GTID-related issues are fundamental limitations of discretization or simply artifacts of suboptimal VQ/RVQ implementation. The paper does not justify why quantization is necessary when continuous fine-tuning remains a viable alternative. - Alternative tokenizer comparisons: The paper focuses exclusively on VQ/RVQ but does not compare to other graph tokenization paradigms mentioned in related work, such as GFT's transferable tree vocabulary [6], OneForAll's task-level tokenization [7], or subgraph-mining approaches [8]. This makes it unclear whether high structural GTID is specific to vector quantization methods or a general challenge across all discrete graph representations. - Single encoder architecture: Only MPNN encoders are tested. 
Recent work has shown that Graph Transformers with structural encodings [9, 10, 12] and attention-based architectures like GPS [11] exhibit fundamentally different inductive biases. These architectures may interact with quantization differently. For instance, global attention mechanisms might better preserve structural information during discretization, or conversely, might suffer more from quantization artifacts. [1] You et al., Graph Contrastive Learning with Augmentations, NeurIPS 2020. [2] Xia et al., SimGRACE: A Simple Framework for Graph Contrastive Learning without Data Augmentation, WWW 2022. [3] Rong et al., Self-Supervised Graph Transformer on Large-Scale Molecular Data, NeurIPS 2020. [4] Dai et al., Graph Transfer Learning via Adversarial Domain Adaptation with Graph Convolution, IEEE TKDE 2022. [5] Wu et al., Unsupervised Domain Adaptive Graph Convolutional Networks, WWW 2020. [6] Wang et al., GFT: Graph Foundation Model with Transferable Tree Vocabulary, NeurIPS 2024. [7] Liu et al., One for All: Towards Training One Graph Model for All Classification Tasks, ICLR 2024. [8] Jin et al., Towards Graph Foundation Models: A Survey and Beyond, arXiv 2024. [9] Rampášek et al., Recipe for a General, Powerful, Scalable Graph Transformer, NeurIPS 2022. [10] Shirzad et al., Exphormer: Sparse Transformers for Graphs, ICML 2023. [11] Rampášek et al., GraphGPS: General Powerful Scalable Graph Transformers, NeurIPS 2022. [12] Ling et al., UNIFIEDGT: Towards a Universal Framework of Transformers in Large-Scale Graph Learning, IEEE BigData 2024. Presentation: - TIDS (abstract) vs GTID (body)? Are they referring to the same thing? They are quite confusing to me. - K is used for both the codebook size and the function K(g). - Eq. 3's L×M tuple vs. VQ/RVQ distinction is not explained. - "co_pu" to "ar_db" in Figures 2, 3, 6, and 7: what do they mean? - Only figures for quantitative comparisons. - Typos: "tokenizer" (title), "langugae", "the are no". - RKHS kernel in Theorem 2 not stated. Reproducibility: - No code provided. See weaknesses. Lightly AI-edited
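Since GTID is only described verbally in these reviews, the following is a rough reconstruction of the kind of token-conditional MMD it appears to involve (feature side only; the RBF kernel, the averaging over tokens, and all data below are my own hypothetical choices, and the paper's normalization and structural counterpart may differ).

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Biased squared MMD between two samples with an RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
d, n_src, n_tgt, n_tokens = 8, 200, 150, 16

# Hypothetical per-node features and token assignments on source / target graphs.
feat_src, feat_tgt = rng.standard_normal((n_src, d)), rng.standard_normal((n_tgt, d))
tok_src, tok_tgt = rng.integers(0, n_tokens, n_src), rng.integers(0, n_tokens, n_tgt)

# Token-conditional discrepancy: average MMD over tokens that appear in both graphs.
scores = []
for c in range(n_tokens):
    Xs, Xt = feat_src[tok_src == c], feat_tgt[tok_tgt == c]
    if len(Xs) > 1 and len(Xt) > 1:
        scores.append(rbf_mmd2(Xs, Xt))
print("feature-side token discrepancy (GTID-like, unnormalized):", round(float(np.mean(scores)), 4))
```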
Can Graph Quantization Tokenizer Capture Transferrable Patterns? Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper investigates whether graph quantization tokenizers—specifically those based on vector quantization (VQ) and residual vector quantization (RVQ)—are able to capture transferable structural patterns across heterogeneous graph datasets. The authors introduce a metric called Token Information Discrepancy Score (TIDS) to quantify how consistently the same discrete token corresponds to similar structural and feature patterns in different graphs. They find that existing graph quantizers often map structurally dissimilar node contexts to the same token, leading to reduced transferability in downstream tasks. To mitigate this issue, the paper proposes a Structural Hard Encoding (SHE) strategy intended to inject structural information into the tokenizer. Experiments indicate that SHE reduces TIDS and improves cross-dataset performance to some extent. 1. With recent movement toward graph foundation models and discrete graph tokenization, examining transferability is important and underexplored. 2. The paper clearly states the gap between current quantization practices and cross-domain robustness. 3. The proposed TIDS score provides a simple and interpretable measure for assessing token consistency across datasets. 1. The main contributions are empirical and diagnostic; the proposed SHE method seems incremental and not conceptually strong enough to be considered a substantial methodological advance. 2. The evaluation appears limited in scale—datasets used for analysis and transfer are not clearly representative of the breadth of graph domains where tokenization matters (e.g., molecular vs. social vs. knowledge graphs). 3. The paper focuses on a few quantization models but does not compare against more recent or structurally richer tokenization schemes (e.g., subgraph vocabulary learning, motif-based tokenizers) [1, 2]. [1] Beyond Message Passing: Neural Graph Pattern Machine, ICML 25. [2] Scalable Graph Generative Modeling via Substructure Sequences, NeurIPS 25. 1. Can the authors elaborate on whether TIDS correlates with transfer performance within the same domain (e.g., differing graphs of similar type)? Or is the effect only present in cross-domain settings? 2. How sensitive is SHE to hyperparameters and model architecture? Could the improvements stem from implicit regularization rather than structural encoding? Fully AI-generated
Can Graph Quantization Tokenizer Capture Transferrable Patterns? Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors investigate whether graph quantized tokenizers can capture transferable graph patterns across datasets. To evaluate this, they propose a new metric, i.e., graph token information discrepancy, to measure the consistency of token-level feature and structure information between source and target graphs. While theoretical analysis proves that lower discrepancy indicates tighter transfer generalization bounds, their empirical studies conducted on two domains reveal that structural information is poorly aligned across datasets (high structural GTID), while feature information transfers better. Therefore, they introduce a structural hard encoding technique to effectively reduce GTID. - The task is novel: while some works use quantization techniques to tokenize graphs, the authors are the first to investigate the transferability problem. - Theoretical analysis and empirical studies further demonstrate the transferable pattern problem. - A simple yet efficient solution is proposed to boost transferability. - The downstream tasks are limited to classification; more could be explored. - Case studies are encouraged to further demonstrate the effectiveness of the proposed technique. Please refer to the weaknesses. Fully human-written
Can Graph Quantization Tokenizer Capture Transferrable Patterns? Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper focuses on graph tokenization. Specifically, the authors argue that existing methods cannot capture complex graph structural information. The authors further develop a new metric called TIDS to measure the effectiveness of existing token generators. Empirical results based on TIDS reveal an inconsistency issue in previous methods. Finally, the authors provide a solution named SHE for addressing the above issue. 1. This paper is clearly motivated and easy to follow. 2. The authors provide a theoretical analysis of the proposed metric. 3. The proposed TIDS provides new insights for graph quantization tokenizers. 1. The introduction of the background is somewhat too long. 2. The empirical results are not extensive. 3. There are several grammar issues. 1. The authors report the results of TIDS under various experimental settings. I just wonder how each model performs on the downstream tasks. 2. Does the observed issue have a serious impact on model performance for specific downstream tasks (node classification or link prediction)? 3. Minor issues, like "TOKENZIER" in the title. Please proofread the manuscript. Fully human-written
Beyond Score: A Multi-Agent System to Discover Capability and Behavioral Weaknesses in LLMs Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes AGENT4WEAKNESS, a multi-agent system designed to uncover capability and behavioral weaknesses of large language models beyond numeric evaluation scores. The framework coordinates three agents, Planner, Analyzer, and Reporter, equipped with 22 analytical tools to generate structured weakness reports from existing benchmark results. Tested on 8 models and 27 benchmarks, the system improves report quality and aligns closely with human judgments, demonstrating its effectiveness in revealing interpretable model weaknesses. 1. Well-motivated framework for deeper LLM evaluation. The paper clearly identifies the limitations of traditional benchmark-based evaluation, which focuses only on numeric scores, and introduces a multi-agent framework (AGENT4WEAKNESS) that systematically analyzes capability and behavioral weaknesses of LLMs beyond accuracy metrics. This motivation is timely and addresses a meaningful gap in current evaluation practices. 2. Comprehensive and well-structured system design. The proposed Planner–Analyzer–Reporter architecture is conceptually clear and technically coherent. Each agent has distinct responsibilities, and the framework integrates 22 analytical tools for statistical testing, capability gap detection, and behavioral pattern mining. The design effectively demonstrates how multi-agent coordination can enable flexible, user-driven evaluation. 3. Empirical validation. The system is tested on 8 representative LLMs and 27 public benchmarks, covering reasoning, factuality, and safety. Alignment between human evaluation and model-based scores confirms the validity of the automatically generated reports. 1. Limited technical innovation. The framework mainly assembles existing components into an agentic working pipeline, including multi-agent role decomposition and tool invocation, without introducing novel algorithms or coordination mechanisms. 2. Insufficient diversity of user queries. The evaluation covers only three representative user requests (Q1–Q3), which cannot fully demonstrate the framework's flexibility or generalization across different analysis needs. Broader testing on varied and real-world queries would be needed to justify the claimed adaptability and scalability of AGENT4WEAKNESS. 3. Limited ablation and component analysis. The paper provides little insight into the contribution of each component (22 analytical tools and three agent roles). The ablation results shown in Table 4 remain coarse-grained and do not reveal which tools or interactions most affect report quality. A more detailed analysis would strengthen the causal understanding of the framework's performance. 4. Lack of formal formulation. While the overall system architecture (Figure 2) conveys the high-level workflow, the paper lacks a clear and formalized description of how the multi-agent coordination operates in practice. Key elements such as the data flow between the Planner, Analyzer, and Reporter agents, the intermediate representations of benchmark results, and the decision rules for tool invocation are only described narratively. 1. 
In Section 3.1, the authors first mention "a detailed list of 104 models" but later state that "the evaluation results of 8 representative models". Could the authors clarify whether and how the 104 models are used in your experiments? 2. Only three user requests (Q1–Q3) are defined to test the framework. How do you make sure these requests are representative and cover users' needs? Fully AI-generated
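Several of these reviews mention the framework's statistical-testing tools without a concrete example. Purely as an illustration of the kind of comparison such a tool could wrap (this is my own sketch, not the paper's actual tool or API), here is a paired bootstrap over per-benchmark score differences between two hypothetical models.

```python
import numpy as np

rng = np.random.default_rng(0)

def paired_bootstrap(scores_a, scores_b, n_boot=10000):
    """Bootstrap CI for the mean per-benchmark score gap, plus the chance model B is ahead."""
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    p_b_ahead = float((boot_means < 0).mean())
    return diff.mean(), (lo, hi), p_b_ahead

# Hypothetical accuracies of two models on the same 10 benchmarks.
model_a = [0.71, 0.64, 0.80, 0.55, 0.62, 0.77, 0.69, 0.58, 0.73, 0.66]
model_b = [0.69, 0.66, 0.76, 0.57, 0.60, 0.75, 0.70, 0.55, 0.71, 0.65]

gap, ci, p = paired_bootstrap(model_a, model_b)
print(f"mean gap {gap:+.3f}, 95% CI [{ci[0]:+.3f}, {ci[1]:+.3f}], bootstrap P(B ahead) ~ {p:.3f}")
```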
Beyond Score: A Multi-Agent System to Discover Capability and Behavioral Weaknesses in LLMs Soundness: 4: excellent Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces AGENT4WEAKNESS, a multi-agent system designed to overcome the limitations of current LLM evaluation methods, specifically their "insufficient comparison" and "inflexible evaluation". The framework leverages multi-agent collaboration (a Planner, Analyzer, and Reporter) and a suite of specialized statistical and analytical tools to generate in-depth evaluation reports based on flexible user queries. Rather than just reporting scores, the system performs both capability analysis (assessing statistical significance of performance gaps) and behavioral analysis (mining reasoning patterns from Chain-of-Thought data). The authors demonstrate through extensive experiments that reports from AGENT4WEAKNESS are significantly higher in quality than those from a strong baseline. - The paper's primary innovation is reframing LLM evaluation from a static, fixed-pipeline task to a flexible, systematic process. An agentic workflow with multiple agents is used. - The technical quality of the work is high. The system is thoughtfully designed, decomposing the complex task of "weakness discovery" into distinct, manageable agent roles and tool families. In general, the design of the system makes sense. ```Lack of evidence of the benefit from the report ``` I am concerned whether the report is helpful given that each model is unique. Without knowing the internals or the properties of the model, it would be hard to assess whether the report is helpful. One example given in the paper is that AGENT4WEAKNESS identifies that the reasoning process of DeepSeek-V3.1 on AIME2025 questions is disorganized and suggests using markers such as "### Step 1" to structure the reasoning and adding verification of intermediate results after each step. This particular example can be problematic because for some models, adding additional structure or verification may force the model to reason out of distribution. This may lead to decreased performance. ```Nature of Model "Improvement" is in context only``` The 3.7-point performance improvement (Section 4.3) is a strong result, but it appears to be achieved by feeding the system's analysis and suggestions back into the model as part of the prompt. This demonstrates an improvement in in-context learning or prompt adherence based on the analysis, which is different from a permanent improvement to the base model (e.g., via fine-tuning). While still valuable, this distinction should be made clearer, as the current phrasing could imply a more fundamental model enhancement. - Did you experiment with other SOTA models (like GPT-5 or Gemini-2.5-pro) to run the agents themselves? How sensitive is the final report quality ("Content Value" and "Factuality") to the choice of the underlying model powering the Planner, Analyzer, and Reporter? Fully human-written
Beyond Score: A Multi-Agent System to Discover Capability and Behavioral Weaknesses in LLMs Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes AGENT4WEAKNESS, a multi-agent framework using three agents (Planner, Analyzer, Reporter) with 22 specialized tools to generate comprehensive evaluation reports identifying LLM weaknesses. The authors claim their approach addresses two key limitations in existing methods: insufficient statistical comparison and inflexible evaluation perspectives. The work is generally interesting and provides a novel perspective on LLM evaluation. The challenge of systematically discovering LLM weaknesses beyond simple accuracy scores is crucial for the field. LLM evaluation requires systematic yet novel benchmarking. 1. Circularity is induced when using an LLM both to generate AGENT4WEAKNESS reports and to evaluate their quality. This may create an inherent bias, where the evaluator model may favor outputs from its own framework over others. 2. The framework still relies on human-curated benchmarks. Does the model under evaluation still need to be run on many benchmarks, or can the evaluation be made on partial results? Is inference computation saved by this system? 3. Can the system provide novel evaluations beyond available benchmarks? For example, if I would like to assess an agent's ability in creative writing, would the system still be applicable? 1. Does the system exhibit consistent performance when analyzing the same model multiple times? 2. Running benchmarks and evaluations may incur significant computational overhead. Can this system run stably on research computational infrastructure? Fully human-written
Beyond Score: A Multi-Agent System to Discover Capability and Behavioral Weaknesses in LLMs Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces AGENT4WEAKNESS, a multi-agent system designed to address the limitations of existing methods for discovering capability and behavioral weaknesses in large language models (LLMs). Current weakness discovery approaches are characterized by insufficient comparison, often failing to analyze statistical significance and confidence of performance differences, and inflexible evaluation, being restricted to fixed perspectives rather than adapting to user-specific requirements. AGENT4WEAKNESS aims to provide richer statistical insights and generate customized evaluation reports through collaborative agents and specialized tools. The core methodology of AGENT4WEAKNESS is structured around a multi-agent workflow consisting of a Planner, an Analyzer, and a Reporter, leveraging comprehensive evaluation data. Experimental results demonstrate that reports generated by AGENT4WEAKNESS show a significant improvement of 2.6 out of 10 points across four evaluation dimensions (Requirement Fulfillment, Content Value, Factuality, and Readability) compared to a direct answering baseline, with high consistency with human evaluations (Pearson r=0.801, Spearman ρ=0.944). Notably, it achieves a 3.4-point improvement in Content Value, signifying richer analyses, and a 3.4-point improvement in Requirement Fulfillment, indicating its flexibility. Furthermore, models guided by weaknesses discovered through AGENT4WEAKNESS show an average performance improvement of 3.7 points, underscoring its practical utility. 1. The paper squarely tackles the challenge of LLM weakness discovery, a critical need as models grow more complex. The authors clearly motivate that current eval methods lack substantive comparisons (no statistical rigor) and flexibility to meet different analysis needs. 2. The proposed Agent4Weakness framework is innovative in using a multi-agent collaboration for evaluation. It combines a Planner, Analyzer, and Reporter agent with specialized roles. This design allows the system to break down the user's query, fetch and compute detailed statistics, and compile findings into a coherent report. The use of external tools (for data analysis and statistical testing) within the agent workflow is a strong point, as it grounds the LLM's analysis in solid quantitative evidence rather than just its internal knowledge. This represents a creative extension of chain-of-thought prompting into a tool-augmented, multi-step evaluation process. 3. Agent4Weakness produces much more informative and accurate analyses than the baseline. The strong performance across multiple dimensions underscores the efficacy of the approach. Furthermore, the authors show that LLM-based scoring of the reports correlates strongly with human evaluation (Pearson r ≈ 0.80, Spearman ρ ≈ 0.94), which suggests the evaluation criteria were meaningful and actually reflect human-perceived quality. 4. The system doesn't just produce higher-level summaries – it can pinpoint specific weaknesses with verifiable accuracy. 
Such detailed weakness analysis is a clear strength over traditional evaluations that would simply report an average score. 1. High Complexity and Resource Requirements: The proposed system is quite complex, involving three coordinated agent roles and multiple tool integrations. Running Agent4Weakness requires a powerful backbone LLM and a large set of evaluation data in memory. This complexity could pose practical challenges. For example, orchestrating multi-agent prompts and tool calls may be slow or expensive compared to a single-pass evaluation. 2. The paper primarily compares Agent4Weakness to a direct answering baseline. While this highlights the advantage of the structured approach, the baseline is relatively rudimentary. It does not, for instance, use any chain-of-thought or tools at all. In other words, an intermediate baseline (say, a single-agent chain-of-thought using the same tools, but without specialized roles) could help attribute the improvements more precisely. As it stands, the evaluation convincingly shows superiority to an “uninformed” baseline, but not necessarily to any sophisticated alternative. 3. The paper demonstrates that Agent4Weakness’s outputs correlate well with human judgments of quality, but it doesn’t compare against human-written analysis of model weaknesses. Of course, expecting the authors to produce human-written reports for all cases is impractical, so this is more of an observation than a strict criticism. 1. You chose a direct-answer baseline for comparison, arguing that prior specialized pipelines aren’t as flexible. Nonetheless, have you considered comparing Agent4Weakness to a simpler single-agent chain-of-thought approach using tools? For example, one could prompt a single LLM with something like: “Here is all the data, think step by step to analyze weaknesses…” and allow it to call the same tools (sort of an ablation of the multi-agent structure). This might isolate the benefit of having distinct Planner/Analyzer/Reporter roles. 2. The multi-agent pipeline with external tools sounds computationally heavy (multiple prompt exchanges, tool calls, etc.). Do you have a sense of the runtime or cost overhead of Agent4Weakness compared to a direct evaluation? For instance, how long does it take to generate a full report for one model’s weaknesses, and could this be a bottleneck if evaluating many models continuously? 3. You showed a compelling result that providing the model with the identified weaknesses and suggestions can improve its performance via prompt adjustments. Do you envision a more automated integration of Agent4Weakness into the model development loop? For example, could the reports be used to inform fine-tuning data generation or to create adversarial test cases for continuous improvement? Fully AI-generated
EGEA-DM: Eigenvalue-Guided Explainable and Accelerated Diffusion Model Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper aims to enhance the training efficiency, interpretability, and controllability of diffusion models. Drawing on ergodic theory, the authors propose an Eigenvalue-Guided Explainable and Accelerated Diffusion Model (EGEA-DM). This approach leverages the principal eigenvalue of the L-generator to precisely regulate the forward diffusion speed, thereby enabling adaptive adjustment of reverse steps during both training and sampling phases. 1. Reinterpreting diffusion models through the lens of ergodic theory, while establishing a connection between the convergence rate of the forward process and the principal eigenvalue of the L-generator, is an interesting idea. 2. By tuning the coefficients of the L-generator according to its principal eigenvalue, the proposed method introduces a flexible mechanism that enables control over both the speed and stability of the diffusion process. 1. It appears that as the eigenvalue increases, the FID of the model's generated results may deteriorate. Given that the appropriate eigenvalue range is critical for balancing efficiency and generative performance, how should this range be determined? Is it necessary to conduct careful parameter tuning for different models individually? 2. Both qualitatively and quantitatively, the paper lacks comparisons between the proposed method, a baseline (i.e., the model without using the proposed method), and other SOTA approaches. Consequently, the true performance gains brought by the proposed method cannot be accurately evaluated. Furthermore, experiments are only conducted on two small-scale or low-resolution datasets (CIFAR-10 and CelebA-HQ-64), with no experimental validation of generalization on larger-scale or higher-resolution datasets. 3. The latest methods discussed in the "Related Work" section date only up to 2023, and the section lacks discussion of connections to and distinctions from more recent advances. Please see weaknesses above. Lightly AI-edited
EGEA-DM: Eigenvalue-Guided Explainable and Accelerated Diffusion Model Soundness: 3: good Presentation: 4: excellent Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces EGEA-DM, a framework intended to address the high computational costs, slow convergence, and perceived lack of theoretical interpretability in existing diffusion models. The central thesis is the application of ergodic theory to the generative diffusion process. Specifically, the authors propose a formal connection between the convergence rate of the forward diffusion process and the principal eigenvalue ($\lambda_1$) of its corresponding L-generator. The framework's core mechanism involves modulating this principal eigenvalue by adjusting the coefficients $a(x)$ and $b(x)$ of the L-generator. The authors provide a theoretical argument (Theorem 2) and empirical results (on CIFAR-10 and CelebA-HQ-64) to support their main claim: a larger principal eigenvalue $\lambda_1$ leads to an exponentially faster convergence to the stationary distribution. This, in turn, allows the model to reach convergence in fewer forward steps, significantly reducing the computational overhead for both training and sampling. The method is also presented as a plug-and-play module compatible with existing samplers like DPM-Solver. - The primary strength of this work is its effort to ground the problem of diffusion model acceleration in established mathematical theory. While the underlying concepts of spectral gap convergence for 1D diffusions (e.g., from Chen 2005/2012, Bobrowski 2008) are not new, their application as a design principle for accelerating generative models is a valuable contribution. - The paper does not just posit a theoretical relationship. It provides an operational method for implementing its core idea by adapting an iterative numerical estimation algorithm (from Chen 2012) to estimate the principal eigenvalue $\lambda_1$. This turns a purely theoretical quantity into an actionable component of the algorithm. - The paper is generally well-written and logically structured. The narrative guides the reader from the ergodic theory, through the specifics of the L-generator and its eigenvalue, to the eventual experimental validation. - The entire theoretical derivation (Preliminaries, Section 3.1) is explicitly simplified to a one-dimensional (1D) case for tractability. However, diffusion models for image generation operate in extremely high-dimensional spaces. The paper makes no serious attempt to bridge this enormous theoretical gap. Spectral gap analysis in high dimensions is notoriously more complex than in 1D and often requires additional structural assumptions. The authors do not discuss how, or even if, their 1D-derived insights generalize to high-dimensional, non-reversible, or anisotropic processes common in diffusion modeling. - The proofs for the main theorems, such as Theorem 2, are relegated to the appendix and are overly brief. They lack the rigor and detail necessary for a reader to verify the claims or, more importantly, to understand the precise preconditions and scope under which these convergence bounds hold. This ambiguity undermines the paper's theoretical foundation. - The paper claims to provide an "explainable" framework. 
However, "Guiding Principle II" (which states that higher polynomial orders for $a(x)$ and $b(x)$ lead to a larger $\lambda_1$) is presented as a purely empirical observation from numerical experiments. This is a heuristic, not an explanation. The paper fails to provide any theoretical justification for why this relationship holds, which runs counter to its own stated objective of improving interpretability. Collectively, these issues suggest that the theoretical contribution of the paper is largely limited to restating classical diffusion results in a simplified 1D setting, without sufficient rigor or justification to claim novel theoretical insight. The proposed method introduces an extra, non-trivial computational step: the iterative numerical estimation of $\lambda_1$. The appendix (C.2) claims this overhead is "entirely acceptable." However, the framework's performance seems to hinge critically on this value. How sensitive is the model's performance (e.g., final FID, optimal $T_{conv}$) to the accuracy of this numerical estimation? If the estimation error is, for example, 10% or 20%, does the principled acceleration break down or become suboptimal? Fully AI-generated
EGEA-DM: Eigenvalue-Guided Explainable and Accelerated Diffusion Model Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper provides a theoretical understanding of the sampling rate of diffusion models via ergodic theory and proposes an efficient method called Eigenvalue-Guided Explainable and Accelerated Diffusion Model (EGEA-DM). EGEA-DM leverages the L-generator's principal eigenvalue to explicitly control the sampling speed and accuracy of the diffusion model. Theoretical analysis shows that the convergence speed of the diffusion model is determined by the spectral gap of the L-generator, which is the first non-zero eigenvalue. Experimental results corroborate the theoretical findings and demonstrate the effectiveness of the proposed method. 1. This paper applies Chen's estimation theory to diffusion models in a novel way and provides a rigorous characterization of the convergence speed using the spectral gap of the L-generator. 2. The theoretical analysis also motivates a novel algorithm based on the L-generator. By choosing appropriate $a(x)$ and $b(x)$, one can balance sampling speed and generation quality in diffusion models. 3. Numerical results verify the effectiveness of the proposed L-generator and provide several insights into the learning behavior under different sets of $a(x)$ and $b(x)$. 1. Although the paper proposes an interesting, principled mechanism for balancing sampling speed and sample quality in diffusion models, it does not provide clear tuning guidelines for $a(x)$ and $b(x)$. The guiding principles from Line 249 to 255 are still too general and lack concrete "go-to" defaults that reliably improve performance. Since practitioners need to choose both the functional forms of $a(x)$ and $b(x)$ and the internal parameters in them, the resulting hyperparameter search can be extensive and may offset the performance gains. Providing a small set of recommended forms and default settings would make the method far more accessible. 2. Another limitation is the evaluation scope. The method has been tested with DDPM and ODE samplers, but not with more efficient samplers such as DDIM. It would be beneficial to see whether the technique also improves fast samplers, which would further clarify its impact on sampling efficiency. 1. Finding the point that balances quality and efficiency might be time-consuming. Is there any guidance for finding such a point? 2. I did not understand the relation between Equation 5 and $\lambda_1$. Could you explain further why Equation 5 motivates the study of the principal eigenvalue? 3. To compute $\lambda_n$, what discretization procedure do you use, and how does the error scale with the discretization accuracy? And how large should $n$ be chosen in practice? Fully human-written
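Question 3 above concerns the discretization used to estimate eigenvalues. One standard approach, sketched below for the Ornstein-Uhlenbeck generator $Lf = f'' - x f'$ whose spectral gap is exactly 1, is a finite-difference discretization on a truncated grid; this is my own illustration, not necessarily the estimation procedure adopted in the paper.

```python
import numpy as np

a = lambda x: 1.0          # diffusion coefficient a(x)
b = lambda x: -x           # drift b(x): Ornstein-Uhlenbeck case

xmax, N = 6.0, 400                              # truncated domain [-xmax, xmax]
xs = np.linspace(-xmax, xmax, N)
h = xs[1] - xs[0]

# Central differences for L f = a f'' + b f', with reflecting (no-flux) boundaries.
L = np.zeros((N, N))
for i, xi in enumerate(xs):
    lo, hi = max(i - 1, 0), min(i + 1, N - 1)
    L[i, lo] += a(xi) / h**2 - b(xi) / (2 * h)
    L[i, i]  -= 2 * a(xi) / h**2
    L[i, hi] += a(xi) / h**2 + b(xi) / (2 * h)

eig = np.sort(np.linalg.eigvals(L).real)[::-1]  # spectrum is real for this reversible chain
print("leading eigenvalues:", np.round(eig[:4], 3))   # expected to be close to 0, -1, -2, -3
print("estimated spectral gap:", round(-eig[1], 3), "(exact OU value: 1)")
```

The second-order scheme converges at rate $O(h^2)$ in the grid spacing, plus a (typically tiny) domain-truncation error, which is one concrete way the error-scaling part of the question could be answered.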
EGEA-DM: Eigenvalue-Guided Explainable and Accelerated Diffusion Model Soundness: 2: fair Presentation: 1: poor Contribution: 1: poor Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes EGEA-DM, which applies ergodic theory to diffusion models by controlling the principal eigenvalue of the L-generator to regulate forward diffusion speed. The authors claim this provides theoretical interpretability and enables faster training while maintaining generation quality. Experiments on CIFAR-10 and CelebA-HQ-64 show reduced training steps, though with mixed results on quality. 1. **Interesting theoretical angle.** Connecting the spectral gap of the L-generator to convergence rates is a neat idea, and framing diffusion acceleration through ergodic theory is relatively unexplored in the generative modeling literature. 2. **Rigorous mathematical development (in 1D).** Theorems 1-2 properly characterize uniqueness, ergodicity, and convergence rates for one-dimensional diffusion processes. The eigenvalue estimation procedure based on Chen (2012) is technically sound. 3. **Integration with existing samplers.** The framework works with DPM-Solver and DPM-Solver++, suggesting it could be modular. ### 1. The theory lives in 1D but the experiments are in high dimensions This is the paper's fundamental problem. All theoretical guarantees (Theorems 1-2, convergence rates, eigenvalue analysis) assume one-dimensional processes where components evolve independently. But image generation requires modeling high-dimensional joint distributions with complex spatial dependencies. The paper claims "without loss of generality" we can focus on 1D (line 59), but this is incorrect. For a 256-dimensional vector (e.g., 16×16 grayscale image), the spectral gap of 256 independent 1D processes tells you nothing about the mixing time of the joint distribution. The covariance structure, spatial correlations, and coupling between dimensions fundamentally change the convergence behavior. There's no analysis of how the 1D eigenvalue relates to the high-dimensional process actually being run. This makes the entire theoretical framework disconnected from the experimental validation. ### 2. The practical recipe doesn't follow from the theory Section 3.3's "guiding principles" for choosing L-generators feel ad-hoc: - Principle I just says "satisfy the ergodicity conditions" (obvious) - Principle II says "higher degree polynomials give bigger eigenvalues" (empirical observation, not a principle) The paper never explains *why* polynomial forms are the right parameterization, or how to choose between the infinitely many (a,b) pairs that satisfy the convergence conditions. Table 4 shows that nonlinear generators behave unpredictably. Even with similar eigenvalues, you get wildly different Ddisc values (Appendix C.3, Table 7). The guidance provided amounts to "try different polynomials and see what happens," which undermines the claim of interpretability and principled control. ### 3. Computational cost story is incomplete The paper emphasizes training acceleration but glosses over the eigenvalue estimation overhead. In Appendix C.2, the authors mention "about two hours" for computing eigenvalues, but: - This is for 1D. What about estimating properties of the high-dimensional process? 
- How does this scale with dataset complexity or resolution? - What's the total wall-clock time (estimation + training + sampling) compared to just training a baseline DDPM longer? Table 1 shows training drops from 52h to 26h, but if I need 2+ hours of eigenvalue computation for every new (a,b) configuration I try, plus the overhead of determining the "optimal" range, the net savings become unclear. There's no end-to-end timing comparison. ### 4. Experiments are too limited **Baselines:** The paper primarily compares against vanilla DDPM from 2020. For a paper submitted in 2025, missing comparisons include: - DDIM (deterministic sampling, 2021) - EDM, EDM2 (2022,2024) - Modern flow matching methods (2023-2025) **Datasets:** Only 32×32 CIFAR-10 and 64×64 CelebA. No experiments at 256×256 or higher resolutions where diffusion models are most impactful. No text-to-image, video, or other modalities. ### 5. Changing (a,b) changes the target distribution This is subtle but important. When you modify a(x) and b(x), you don't just change the convergence speed—you change the stationary distribution π (Theorem 1). So comparing FID scores across different eigenvalue configurations isn't purely measuring the speed/quality tradeoff; you're potentially converging to different distributions. The paper doesn't discuss this. Are the quality changes due to insufficient sampling steps, or because you've fundamentally altered what you're sampling from? ## Minor Issues - **Writing quality:** Several grammatical errors ("egodic theory" in abstract, "beolw Theorem 1" line 168). Notation switches between $X_t$ and $Y_t$ confusingly. - **$D_{disc}$ metric:** The convergence discrepancy metric behaves irregularly for nonlinear cases (Table 7) and seems unreliable as a stopping criterion. Its relationship to actual generation quality is unclear. - **Figure 5:** Referenced as justification for polynomial choices but relegated to the appendix. The 3D surface plot is hard to interpret and doesn't provide clear design guidance. - **Table 5:** Shows that very different (a,b) with similar eigenvalues give similar results. This contradicts the claim that eigenvalue is the dominant factor. if other properties of (a,b) matter equally, what's the advantage of the eigenvalue-centric view? 1. Can you provide a rigorous treatment of how 1D eigenvalues relate to high-dimensional convergence, or alternatively, show how to compute/estimate the spectral gap of the actual high-dimensional process? 2. What happens when you do compute-normalized comparisons (same total compute budget including eigenvalue estimation) against strong baselines? 3. How do you explain Table 5, where eigenvalue seems less important than other properties of the generator? 4. Can you clarify whether the quality differences come from speed/sampling tradeoffs or from changing the target stationary distribution? Fully AI-generated
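A small numerical aside on the 1D-versus-high-dimensional point raised in the review above: for fully independent components, the joint generator is the Kronecker sum of the 1D generators, so the joint spectral gap is simply the minimum of the component gaps; anything involving coupling between dimensions falls outside this identity, which is the reviewer's real concern. The check below is generic and does not use the paper's specific generators.

```python
import numpy as np

def gap(L):
    ev = np.sort(np.real(np.linalg.eigvals(-L)))
    return ev[1]

def birth_death_generator(up, down):
    # up[i]: rate i -> i+1, down[i]: rate i -> i-1 (a generic 1D generator)
    n = len(up)
    L = np.zeros((n, n))
    idx = np.arange(n)
    L[idx[:-1], idx[:-1] + 1] = up[:-1]
    L[idx[1:], idx[1:] - 1] = down[1:]
    L[idx, idx] = -L.sum(axis=1)
    return L

rng = np.random.default_rng(0)
L1 = birth_death_generator(rng.uniform(1, 2, 20), rng.uniform(1, 2, 20))
L2 = birth_death_generator(rng.uniform(3, 4, 20), rng.uniform(3, 4, 20))

# Joint generator of two *independent* components = Kronecker sum.
L_joint = np.kron(L1, np.eye(20)) + np.kron(np.eye(20), L2)

print(gap(L1), gap(L2), gap(L_joint))  # joint gap == min of the two 1D gaps
```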
Exploring Instruction Data Quality for Explainable Image Quality Assessment Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses a critical and under-explored issue in the field of Explainable Image Quality Assessment (Explainable IQA): the impact of instruction data quality and scale on the performance of Multimodal Large Language Models (MLLMs). The work challenges the prevailing "more data is better" paradigm by systematically investigating the feasibility of constructing a smaller yet more effective data subset (a coreset) through principled data selection. The authors propose a three-stage data selection framework named IQA-Select and demonstrate through extensive experiments that fine-tuning with just 10% of the selected data can surpass the performance of using the full dataset on benchmarks like Q-Bench and AesBench. 1. Clear and Pragmatic Motivation: The paper rightly points out the potential drawbacks of the field's heavy reliance on "scaling laws," namely the immense computational overhead and data redundancy. The direction of exploring the importance of data quality is forward-looking and holds significant practical value. 2. Systematic and Comprehensive Ablation Studies: To construct the optimal data selection pipeline, the authors conduct exhaustive experiments and ablation studies for each stage of the framework (feature extraction, quota allocation, and sampling strategy). For instance, they compare nine different feature types in the feature extraction stage and explore eleven strategy combinations for quota allocation. 1. Limited Novelty: The core contribution of this paper lies more in "problem exploration" than in "methodological innovation." The proposed IQA-Select framework is fundamentally a clustering-based data selection pipeline, and its constituent components (e.g., clustering, quota allocation based on density/transferability/IRS, and sampling via SVD/PCA) are largely existing or slightly modified techniques from the data selection literature. The authors themselves candidly define it as a "pipeline." 2. Superficial Analysis of "Performance Degradation": The paper observes that fine-tuning on the full dataset leads to inferior performance compared to using a subset, attributing this to "model forgetting" and "data redundancy." The paper fails to provide a deeper analysis. For example, what are the specific characteristics of the redundant data (e.g., low-quality, repetitive, overly simplistic)? A more in-depth investigation would make the claims more convincing. 3. Dependence on a Strong Pre-trained Model: The entire framework and its findings are heavily dependent on a powerful pre-trained MLLM (InternVL3-Instruct). It is unclear whether the conclusion "less data is better" would still hold if a weaker base model were used. Arguably, large-scale data might still be essential for less capable models. The paper offers limited discussion on this aspect of generalizability. 4. Trade-off Between Complexity and Practical Gain: The IQA-Select framework is relatively complex, involving multiple steps such as feature extraction, clustering, multi-metric calculation, and sampling. 
Given that random sampling of 20% of the data already achieves strong performance, the paper needs to more clearly justify whether the additional benefit (an approximate 1-2% performance gain) warrants the introduction of such a complex framework. 1. In the discussion of the final method combination in Section 4.2.5, the paper states that the final method uses "transferability and instruction relevance score." However, this appears to contradict the results in Table 2, where the best-performing combination is "Transferability & Density" (II-(8)), and also the final reported combination (I-(6) + II-(8) + III-(2)). Could you please clarify this discrepancy? 2. The framework introduces several hyperparameters, such as the number of clusters (N) for clustering and the similarity threshold (τ) for calculating transferability. How were these hyperparameters selected? How sensitive is the final performance of the framework to variations in these hyperparameters? 3. Do you believe the IQA-Select method can be generalized for instruction data selection in other vision-language tasks, such as VQA or Image Captioning? If so, which parts of the framework would require the most significant adjustments when applied to different tasks? Fully AI-generated
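For concreteness, a minimal sketch of the generic cluster-quota-sample shape of such a selection pipeline, with placeholder choices (k-means, size-proportional quotas, SVD leverage scores) standing in for IQA-Select's actual feature extraction, transferability/density weighting, and sampling stages; it is not the authors' recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_coreset(features, budget, n_clusters=50, seed=0):
    """Generic cluster -> quota -> sample selection; a stand-in for the paper's
    three-stage pipeline, not its exact recipe."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(features)
    labels = km.labels_
    sizes = np.bincount(labels, minlength=n_clusters)
    # Quota: proportional to cluster size here (a placeholder for the
    # transferability / density weights that IQA-Select actually uses).
    quotas = np.maximum(1, np.round(budget * sizes / sizes.sum())).astype(int)
    chosen = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        Xc = features[idx] - features[idx].mean(axis=0)
        # Within-cluster ranking by SVD leverage scores (row norms of top U columns).
        U, _, _ = np.linalg.svd(Xc, full_matrices=False)
        k = min(10, U.shape[1])
        leverage = (U[:, :k] ** 2).sum(axis=1)
        chosen.extend(idx[np.argsort(-leverage)[: quotas[c]]])
    return np.array(chosen[:budget])

# Toy usage with random embeddings standing in for instruction features.
feats = np.random.default_rng(0).normal(size=(20000, 256)).astype(np.float32)
print(select_coreset(feats, budget=2000).shape)
```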
Exploring Instruction Data Quality for Explainable Image Quality Assessment Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces an efficient data selection method from redundant full-scale data for explainable image quality assessment (IQA). The proposed method, namely IQA-Select, consists of three stages: clustering feature extraction, cluster quota allocation, and cluster sampling strategy. The cluster features are divided into model-related features and model-independent features, and a total of 9 combinations of features have been investigated. Similarly, for cluster quota allocation and cluster sampling strategy, 11 allocation strategies and 3 sampling methods have been investigated respectively. Using the baseline model InternVL3-Instruct-8B, the proposed IQA-Select demonstrated superior IQA performance using only 10% of the subset on both the Q-Bench and AesBench benchmarks, compared to full dataset fine-tuning. The authors explore multiple configurations per stage to select the optimal combination across three stages: - Clustering feature extraction: 6 model-related combinations from 4 distinct features; 3 model-independent combinations from 3 distinct features. - Cluster quota allocation: 11 strategies from 4 features. - Cluster sampling: greedy MMD, SVD, and PCA. Experiments on two benchmarks (Q-Bench and AesBench) improve reliability. As the performance of general-purpose baseline models continues to improve, I agree that optimizing the fine-tuning dataset to adapt a model for a downstream task is an effective strategy. However, the manuscript appears to lack sufficient concrete evidence to substantiate this claim. Generalization - Only InternVL3-Instruct-8B is tested, so generalization to other models is uncertain. - Since the work concerns data curation, multi-model runs are needed. - Additional suggestion: cross-validation would strengthen the generalization capability. Incomplete metrics - Table 3 omits AesI while AesBench includes four categories (AesA, AesE, AesP, and AesI). - The omission is not explained. Unsubstantiated claim - The assertion that “fine-tuning the MLLM with all dataset may inversely cause the MLLM to forget the knowledge it learned before” is repeated without theoretical or empirical support, especially with respect to the IQA task. Minor issues - L450: table hyperlink error. No questions Lightly AI-edited
Exploring Instruction Data Quality for Explainable Image Quality Assessment Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper questions the prevailing “scale-first” mentality in instruction-tuning for explainable image quality assessment (IQA). Starting from the observation that InternVL3-8B fine-tuned on the full 200K Q-Instruct set barely outperforms the pretrained checkpoint, the authors systematically reduce the training pool and discover an inverted-U curve: performance plateaus at roughly 20 % of the data and even the extreme 5 % random subset matches the full-set accuracy, revealing massive redundancy. Building on this insight, they design IQA-Select, a three-stage data-curation pipeline that (i) represents every instruction by fusing multi-level MLLM hidden states with vision–text embeddings, (ii) allocates cluster-level quotas via a combination of transferability and density, and (iii) harvests the most representative samples within each cluster through SVD leverage scores. With only 10 % of the original data, the method attains 79.4 % overall accuracy on Q-Bench (102.1 % of full-data tuning) and 61.1 % on AesBench (103.7 % of full-data tuning) while cutting GPU hours by an order of magnitude, thereby providing the first systematic evidence that careful curation can outperform brute-force scaling in explainable IQA. The contribution is original in that it is the first work to interrogate—and empirically refute—the scaling law inside the IQA instruction-tuning niche, and it delivers a principled, reproducible pipeline to exploit this observation; the experimental design is thorough, encompassing roughly 300 ablations across nine feature families, eleven quota strategies and three intra-cluster samplers, all trained under identical LoRA hyper-parameters and evaluated on two public benchmarks with consistent trends; the paper is clearly written, with precise mathematical definitions of density, IRS and transferability, intuitive figures that visualise the selected feature-space coverage, and ample discussion of design choices, making the approach immediately actionable for practitioners; finally, the work is significant because it transforms a costly data-collection problem into a data-curation opportunity, offering the community a ten-fold reduction in training cost without sacrificing—and often slightly improving—accuracy, and it opens a new research direction that shifts the focus from “how to generate more” to “how to select better” instruction data for visual-quality tasks. **Model-scale scaling law is unexplored:** All conclusions are derived from a single 8B-parameter model. The redundancy hypothesis may not hold for smaller (≤ 4 B) or larger (≥ 30 B) models whose capacity, memorisation behaviour and forgetting dynamics differ. I suggest authors to run identical selection pipelines on at least two more scales (e.g., InternVL2-2B and InternVL2-40B). Report whether the “5 % = 100 %” trend persists and whether IQA-Select’s relative gain grows or shrinks. **Task-scope is narrow:** The method is only evaluated on IQA and aesthetics. It is unclear whether the designed features (IRS, distortion-related IQA scores, etc.) generalise to other tasks. 
The authors could validate the conclusions on a reasoning task using the training and test data of M3CoT. **Missing baselines from recent data-selection literature:** No comparison with other data-selection pipelines. I suggest the authors include at least one more data-selection baseline; otherwise it is hard to validate the effectiveness of the proposed method. **Minor LaTeX style issue:** Citations use \cite instead of \citep, producing “Q-Bench Wu et al. 2023” rather than “Q-Bench (Wu et al. 2023)”. Please conform to standard ICLR format. Please see weaknesses. Fully AI-generated
Exploring Instruction Data Quality for Explainable Image Quality Assessment Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper studies instruction data quality for explainable image quality assessment (IQA) with multimodal LLMs. Using InternVL3-Instruct as the base model, the authors first show that less can be more: randomly fine-tuning on a small fraction of Q-Instruct can match or even exceed full-data fine-tuning. Building on this, they propose IQA-Select, a clustering-based selection pipeline with three stages. With only 10% of the data, IQA-Select reportedly attains 102.1% of full-data performance on Q-Bench and 103.7% on AesBench, and the best variant yields the top overall Q-Bench score among 10% subsets. 1. The proposed data selection method is effective: using just 10% of the SFT data, it outperforms the pretrained baseline and other methods. 2. The diversity, IRS, and Trans metrics that gauge SFT data quality provide compelling evidence of its effectiveness. 1. The comparison set is dated. Please include recent VLM baselines, e.g., VisualQuality-R1, to ensure a fair, up-to-date evaluation. 2. Expand experiments across diverse SFT datasets. Q-Bench may contain redundant samples, so cross-dataset validation would strengthen the conclusions. 3. Add qualitative case studies illustrating which data types most effectively boost VLM performance, with before/after outputs where possible. See weakness. Lightly AI-edited
A Geometric Unification of Generative AI with Manifold-Probabilistic Projection Models Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a new model, the Manifold Probabilistic Projection Model (MPPM) and its latent version, which interprets diffusion models as geometric projections that iteratively move corrupted inputs toward the clean image manifold. Based on the manifold assumption that image data resides on a low-dimensional smooth manifold, the paper integrates a learned distance function into the probability vector fields to guide image reconstruction and generation. The method shows superior performance compared to the latent diffusion model on image restoration and generation tasks. - The paper introduces a new view of the diffusion model as a projection onto the manifold. - Using a distance-based geometric approach and a kernel-based probabilistic model, the paper tries to establish an interpretable link between them and to build a unified framework. - Detailed definitions of the losses, architectures, and training settings are provided in the Appendix. - While the idea of viewing the diffusion model as a projection onto the manifold is interesting, I cannot find a theoretical explanation or demonstration that the iterative process approximates a projection. Also, Equations 11 and 12 are heuristic updates without any guarantee of convergence. - The formulation is overly complex without clear benefit. The distance function, kernels, and autoencoders introduce considerable complexity, but I do not see why it should be clearly better than existing diffusion models. Empirical results cannot be the justification, as the datasets are too small and the baselines are too weak. - The experiments are limited to simple datasets: MNIST and SCUT-FBP5500. This does not show general applicability or scalability. Experiments on datasets at the scale of CIFAR-10 or LSUN would be recommended. - As the model introduces additional networks and iterative updates, a comparison of computational complexity with diffusion models is needed. Does learning and using the distance function incur a significant cost? - An ablation analysis of the main components, like the distance function or kernels, would make the claims of the paper stronger. Please address the questions raised in the weaknesses section. Fully human-written
A Geometric Unification of Generative AI with Manifold-Probabilistic Projection Models Soundness: 2: fair Presentation: 3: good Contribution: 1: poor Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces a geometric picture framing VAEs, GANs, and diffusion with respect to the data manifold, and interprets diffusion models as iterative manifold projection. It then derives a model / objective (LMPPM) based on this idea and shows some improved performance on removing image degradations on some datasets. ### Strength - The conceptual discussion about the manifold geometry in diffusion, VAE and GAN is interesting, and thinking about diffusion models in this geometric way is laudable. ### Weakness - The new method in Sec. 4 is not very convincing and/or not well framed in the literature. Specifically, I feel there are many connections to existing diffusion models, energy-based models, etc., obtained just by re-interpreting the entities. - e.g. the first term of the loss in eq. 13 learns the distance or energy instead of the score vector itself. - the 5th term is very similar to the denoising score matching objective if we parametrize the score by denoisers [^3,^4], since $z_i^{shift}$ is basically the denoiser; this enforces the gradient of the distance to be nicely aligned with the score. - In this regard, it seems the main innovation is that we have a distance function without explicit conditioning on the noise scale. But it seems [^5,^6] also discussed / discovered that the time / noise scale conditioning is not necessary. [^5] Sun, Q., Jiang, Z., Zhao, H., & He, K. (2025). Is Noise Conditioning Necessary for Denoising Generative Models? [^6] Kadkhodaie, Z., Guth, F., Simoncelli, E. P., & Mallat, S. (2024). Generalization in diffusion models arises from geometry-adaptive harmonic representations. ICLR - From the algorithm or the method itself, I cannot see a clear reason why the proposed MPPM or LMPPM method is better than LDM. Is it the case that LDM needs a certain noise / time conditioning, so that if you input the wrong noise / time it will not correctly denoise the image, whereas LMPPM has no time conditioning and is therefore more robust in that regard? - Currently the FID in Table 1 is very high for LDM and DAE, which is a bit concerning. I feel something is wrong in the implementation of these baselines… - Elucidating why LMPPM is better via ablations / control experiments would largely improve the paper and increase my evaluation of it. - In the abstract, why do the authors say “*The foundational premise of generative AI for images is the assumption that images are inherently low-dimensional objects embedded within a high dimensional space*”? It seems generative AI can still work if images are not low-dimensional objects… I agree with the assumption, but do not think it is a foundational premise of generative AI. - I feel the geometric view of diffusion models (Sec 3, Fig. 2) is definitely correct and worth noting, but it is also not entirely new. The authors could mention very similar figures, e.g. Fig. 1 of [^1] and Fig. 4 of [^2]. E.g. the quantity noted in eq. 8 has a name in many papers, i.e. the ideal denoiser [^3,^2], and the relation between score and denoiser has been known as Tweedie’s formula [^4].
$\hat{x}_{\text{MMSE}} = \mathbb{E}[u \mid x] = x + \sigma^2 \nabla_x \log P(x)$ [^1] Chen, D., Zhou, Z., Wang, C., Shen, C., & Lyu, S. (2024). On the trajectory regularity of ODE-based diffusion sampling. ICML https://arxiv.org/abs/2405.11326 [^2] Wang, B., & Vastola, J. (2024). The unreasonable effectiveness of Gaussian score approximation for diffusion models and its applications. TMLR https://arxiv.org/abs/2412.09726 [^3] Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). Elucidating the design space of diffusion-based generative models. NeurIPS [^4] Efron, B. (2011). Tweedie’s formula and selection bias. Journal of the American Statistical Association - Eq. 14 is also known as the ideal denoiser under a delta-mixture / empirical distribution [^2,^3]. - As the authors point out, the Riemannian geometry of the data manifold induced by the generator of GANs and VAEs has been studied for a while; some references could be added for this tradition [^7,^8,^9]. [^7] Shao, H., Kumar, A., & Fletcher, P. T. (2017). The Riemannian geometry of deep generative models. CVPR Workshops [^8] Wang, B., & Ponce, C. R. (2021). The geometry of deep generative image models and its applications. ICLR [^9] Chadebec, C., & Allassonnière, S. (2022). A geometric perspective on variational autoencoders. NeurIPS Fully human-written
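For reference, the identity this review cites (the ideal denoiser under an empirical prior and Tweedie's formula) can be checked numerically in a few lines; the toy 2D setup below is illustrative and does not reproduce the paper's eq. 14 notation.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))          # "training set": an empirical prior
sigma = 0.7
y = rng.normal(size=2) * 2.0           # a noisy query point

# Ideal (MMSE) denoiser under the empirical prior: a softmax-weighted
# average of the training points -- the form the review refers to as eq. 14.
logw = -((X - y) ** 2).sum(axis=1) / (2 * sigma**2)
w = np.exp(logw - logsumexp(logw))
x_mmse = (w[:, None] * X).sum(axis=0)

# Tweedie's formula: x_mmse = y + sigma^2 * grad log p_sigma(y), checked here
# with a finite-difference gradient of the Gaussian-smoothed log-density.
def log_p(z):
    return logsumexp(-((X - z) ** 2).sum(axis=1) / (2 * sigma**2)) - np.log(len(X))

eps = 1e-5
grad = np.array([(log_p(y + eps * e) - log_p(y - eps * e)) / (2 * eps)
                 for e in np.eye(2)])
print(x_mmse, y + sigma**2 * grad)     # the two estimates coincide
```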
A Geometric Unification of Generative AI with Manifold-Probabilistic Projection Models Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces the Manifold-Probabilistic Projection Model (MPPM) and its latent variant (LMPPM), which unify geometric and probabilistic interpretations of generative modeling. The method interprets diffusion models as iterative projections onto the manifold of “good” images, defined through a distance function and an associated kernel-based probability density. The authors derive this formulation rigorously from geometric principles, introduce both ambient-space and latent-space implementations, and connect the model to classical autoencoder architectures. The paper is well-written, mathematically detailed, and offers a clear conceptual framework that bridges geometry and probability in generative modeling. * Excellent clarity and presentation of the theoretical derivation. * Well-organized narrative: the geometric intuition, probabilistic extension, and algorithmic details are all coherent and rigorous. * The proposed framework provides an elegant deterministic alternative to diffusion sampling, supported by sound intuition. * Clear and readable mathematical notation throughout. **Experimental scope:** * Despite the theoretical strength, experiments are limited to MNIST and SCUT-FBP5500, which are small and relatively trivial datasets. After such a solid theoretical development, this weak experimental section feels like a missed opportunity. * Evaluations rely mainly on the Latent MPPM variant, and mostly in a reconstruction setting rather than true generation. * While reconstruction is a valid demonstration, it is not the most relevant metric for generative models. The paper would be much stronger if generation quality were assessed on more challenging datasets such as ImageNet 64×64, CelebA-HQ, or CIFAR-10. Even a small-scale generation study (e.g., 32×32), if focused, would make the contribution more complete. **Focus dilution:** The inclusion of reconstruction experiments makes the paper feel slightly misaligned with its main message. A more focused evaluation of generation performance would better highlight the model’s strengths. **Minor technical comments:** * Missing citations for the Eikonal equation (line 145) and kernel density estimation (line 185). * Line 216: the term “normalized gradient” seems redundant since $\| \nabla D_M (x) \| = 1$ by the Eikonal equation. * Equation (10): unclear why $G(z)$ appears outside the exponential. * Line 55: the authors do not **propose** the manifold assumption but rather **assume** it. I have no questions other than asking the authors to perform more focused experiments, as indicated in the "Weakness" section. Heavily AI-edited
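A toy illustration of the distance-function projection picture discussed in these reviews, using the unit circle as the "manifold": because the exact Euclidean distance satisfies the Eikonal equation, a single step along its gradient is exactly the nearest-point projection. This is a generic sketch of the geometric idea, not the paper's learned update rule.

```python
import numpy as np

# For the unit circle, D(x) = | ||x|| - 1 | satisfies ||grad D|| = 1 away from
# the origin, and one step x - D(x) * grad D(x) lands exactly on the manifold.

def distance_and_grad(x):
    r = np.linalg.norm(x, axis=-1, keepdims=True)
    d = np.abs(r - 1.0)
    grad = np.sign(r - 1.0) * x / r
    return d, grad

rng = np.random.default_rng(0)
x = rng.normal(scale=2.0, size=(5, 2))          # corrupted points off-manifold
d, g = distance_and_grad(x)
x_proj = x - d * g
print(np.linalg.norm(x_proj, axis=-1))          # all ~1.0, i.e. on the circle
```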
A Geometric Unification of Generative AI with Manifold-Probabilistic Projection Models Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The article presents an integrated perspective that unifies geometric and probabilistic views by introducing a geometric framework and a kernel-based probabilistic method. Within this framework, the diffusion model is interpreted as a projection mechanism on the manifold of “high-quality images,” providing new insight into its underlying nature. Building on this interpretation, the authors propose a deterministic model, the Manifold Probability Projection Model (MPPM), which operates coherently in both the representation (pixel) and latent spaces. Experimental results indicate that the Latent Space MPPM (LMPPM) surpasses the latent diffusion model (LDM) across multiple datasets, demonstrating superior performance on the image restoration task. 1. The perspective of this article is very interesting. Unifying the geometric and probabilistic understanding is highly significant, particularly for modeling more complex data manifold distributions. 2. The theory in this article is very solid. As a work of great theoretical significance, it deserves attention. 3. The paper is well-written and the motivation is very convincing. 1. I have some concerns about the theoretical assumptions. The article assumes the existence of Gaussian noise perturbations between points on the clean image manifold and the real images. However, if the task is not image restoration, or if the data are already sufficiently clean, this assumption and consequently the proposed theory may not hold effectively. 2. The experimental section lacks comparisons with several relevant baselines in image restoration and inverse problem research [1–3]. In addition, manifold-preserving approaches [4–6] should also be considered for a more comprehensive evaluation. It seems insufficient that the authors only compare with DAE and LDM. [1] A Unified Conditional Framework for Diffusion-based Image Restoration [2] DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior [3] Refusion: Enabling Large-Size Realistic Image Restoration [4] Manifold Preserving Guided Diffusion [5] CFG++: Manifold-Constrained Classifier-Free Guidance for Diffusion Models [6] Improving Diffusion Models for Inverse Problems using Manifold Constraints 3. Why do other methods work well on SSIM but worse on FID? Could this be because they only learned the distribution with noise added instead of the clean data distribution? The paper lacks a deeper analysis. 1. Would the proposed method still be effective when the data are already clean? Can it be used directly for image generation rather than for image restoration tasks? 2. Regarding Weakness 2, how do other related models perform? 3. In Figure 5, I noticed that the image generated by LMPPM does not contain white teeth. Could this be caused by the limitations of some manifold probability distributions? Or do some probabilistic assumptions cause the model to ignore the particular manifold of the teeth region? Lightly AI-edited
Quantum-Inspired Image Encodings for Financial Time-Series Forecasting Soundness: 1: poor Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces a novel "quantum-inspired" framework for financial time-series forecasting. The core idea is to encode one-dimensional time-series data into two-dimensional, complex-valued image representations, which are then fed into a Convolutional Neural Network (CNN) for prediction. Specifically, the authors extend classical image-encoding methods like Gramian Angular Fields (GAF), Recurrence Plots (RP), and Markov Transition Fields (MTF) into their "quantum analogues" (Q-GAF, Q-RP, and Q-MTF) by incorporating probabilistic amplitudes and dynamic phase information borrowed from the mathematics of quantum mechanics. The primary contribution is methodological: creating a richer representation space that can capture complex dynamics missed by traditional methods, supported by empirical evidence of its effectiveness. Applying the mathematical formalism of quantum mechanics to feature engineering for financial time series is a highly creative and interesting approach. It moves beyond conventional real-valued representations, offering new possibilities for capturing latent data structures, such as volatility and phase dynamics, by leveraging complex-valued spaces. The paper presents an extensive empirical study on two major equity indices. The proposed quantum-inspired models consistently outperform their classical counterparts in terms of both predictive accuracy and average win rate, providing strong initial evidence for the practical utility of the framework. This paper fails to provide the necessary background for non-specialists and fails to offer a clear, reproducible algorithm for specialists. This makes the paper's contribution difficult to evaluate and seriously compromises its quality as a rigorous scientific publication. The paper successfully maps time-series data points to probability amplitudes and phases but lacks a deep theoretical justification for why this specific analogy is appropriate and superior. The experiments effectively demonstrate the superiority of the proposed methods over their direct classical counterparts. However, to establish the true value and novelty of this encoding framework, it must be benchmarked against a broader set of state-of-the-art time-series forecasting models. See weakness Moderately AI-edited
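For readers unfamiliar with the classical encodings the paper extends, reference implementations of two of them (a Gramian Angular Summation Field and a recurrence plot) are sketched below; the paper's complex-valued "quantum" variants add amplitude and phase terms whose exact form is specific to the paper and is not reproduced here.

```python
import numpy as np

def gasf(x):
    """Gramian Angular Summation Field of a 1D series (standard definition)."""
    x = np.asarray(x, dtype=float)
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1     # rescale to [-1, 1]
    phi = np.arccos(np.clip(x, -1, 1))
    return np.cos(phi[:, None] + phi[None, :])

def recurrence_plot(x, eps=0.5):
    """Binary recurrence plot: 1 where two time points are eps-close."""
    x = np.asarray(x, dtype=float)
    return (np.abs(x[:, None] - x[None, :]) < eps).astype(float)

prices = np.cumsum(np.random.default_rng(0).normal(size=64))  # toy price series
img = np.stack([gasf(prices), recurrence_plot(prices)])
print(img.shape)   # (2, 64, 64): channels a CNN forecaster could consume
```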
Quantum-Inspired Image Encodings for Financial Time-Series Forecasting Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a quantum-inspired representation learning framework for financial time-series forecasting. It maps 1D normalized observations into probabilistic amplitudes and injects local temporal structure through phase-function encoding. The proposed approach outperforms the other baselines on two financial datasets. - The quantum-inspired representation of time series is novel, introducing some unexplored features that seem useful in time series forecasting. - The proposed approach achieves solid improvements compared to classical image encodings in experiments. - The approach works with standard CNN-based models, suggesting that advanced CNN approaches could be explored to further enhance time series forecasting. - This paper does not compare the image encoding-based methods to those without image encodings. - Despite its effectiveness, the results rely on a single CNN backbone. There is no investigation across diverse forecasting models. - An ablation study is needed to confirm the effectiveness of each component in the proposed approach. - Can you compare the image encoding-based approaches to non-image baselines? - Do the performance gains of quantum encodings hold for other forecasters, such as Transformer-based models? - How do you interpret the features captured by the two components in the model? - How do you determine the hyperparameter $k$? Does model performance degrade substantially under other settings? - What is the complexity (or runtime) compared to the baselines? A small typo: Citation formatting at line 045 is incorrect. Lightly AI-edited
Quantum-Inspired Image Encodings for Financial Time-Series Forecasting Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a quantum-inspired method for financial time series that transforms one-dimensional signals into two-dimensional images for forecasting. This transformation also adapts classical encodings to a CNN-based forecaster. Using five-minute U.S. equity data, such as the S&P 500 and Russell 3000, over 2009–2025, the authors report higher classification accuracy and win rates relative to classical encodings. - The amplitude–phase formalism extends classical image encodings and may capture salient features of financial time series. - The paper addresses the limitations of classical encodings by lifting them into the complex domain. - On the reported tasks and CNN architecture, quantum-inspired encodings deliver consistent accuracy and win-rate improvements over classical counterparts. - The paper claims that the phase-function encoding reflects cumulative imbalance and volatility, but does not empirically verify this. - Classification accuracy and “win rate” reported in the paper are not directly connected to monetizable alpha. Without this evaluation, it is unclear whether the accuracy uplift converts into practical use. - Key hyperparameter choices lack sensitivity analyses. - Can you report critical finance metrics (e.g., IC, Rank IC, and Sharpe ratio) from portfolio backtests? - How sensitive are results to hyperparameters, such as k? - Do the phase functions measurably capture cumulative imbalance and volatility? - How does the method perform in tests outside 2009–2025? - Can the quantum-inspired method generalize to other assets (e.g., individual equities, commodity futures, FX)? Lightly AI-edited
Image Embeddings from Social Media: Computer Vision and Human in the Loop Applications for Social Movement Messaging Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 0: Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. 16,567 image posts from Instagram related to the anti-feminicide movement in Mexico were collected and analyzed to see whether vision models can group the pictures into meaningful topics (e.g., protest signs, solidarity posts, info posters). Human oversight is also used to make sense of the resulting groups or clusters. - The social problem is very relevant, as we need to study what is going on in society and how anti-feminicide discourse spreads online. - Over 16,000 image posts from Instagram are collected and studied, which is a large corpus. - The use of multiple vision models (ResNet50, CLIP, BLIP-2) and multiple clustering methods gives both depth and breadth to the study. - There is a good attempt to connect automated clustering with human oversight, which helps make sense of the clusters. - There is a detailed discussion, including acknowledgement of the challenges posed by dense clusters and text-rich images. - The methodological contribution is very minimal. The paper applies a few existing algorithms with standard clustering techniques. There is no new method or any novel grounding on the modeling side. The evaluation is descriptive rather than quantitative. - The clustering gives negative silhouette scores, yet the authors claim highly useful clustering performance. - There is no comparison to alternatives, such as other clustering algorithms (e.g., k-means). There is no use of multimodal pretrained models tuned predominantly on social media data. - The work is very exploratory. There is no hypothesis-driven research. There is not much scientific takeaway from the paper for the ICLR audience. - The paper has overemphasized the social sciences. It may be valuable socially as it tackles important social problems, but for a machine-learning venue the contribution is too limited. - Why did you conclude "best separation" when all silhouette scores remain negative? - Did you try any OCR + text embeddings? The images have very dense text. While CLIP/BLIP could handle text + image, did you try posing the problem differently to aim for a better representation? - Why did you use only a few clustering approaches? Fully human-written
Image Embeddings from Social Media: Computer Vision and Human in the Loop Applications for Social Movement Messaging Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper analyzes 16,567 Instagram images from the anti-feminicide movement in Mexico using unsupervised and self-supervised embedding models combined with HDBSCAN clustering. The authors employ human-in-the-loop content analysis to evaluate cluster quality and understand visual messaging structures. The results show dense, overlapping clusters across all models, with CLIP achieving the best separation metrics. - The paper provides a thorough comparison of three embedding approaches and combines quantitative clustering metrics with qualitative human annotation of 185 sample images. - The application to anti-feminicide social movement messaging represents a meaningful use case for computer vision methods. The dataset of 16,567 Instagram posts provides a substantial corpus, and the human-in-the-loop analysis reveals findings that the vision-language models alone cannot surface. - The paper applies existing, well-established methods in an off-the-shelf manner. There are no new proposals for model development. - How do you justify claiming CLIP is "best" when CLIP's DBI is worse than ResNet50's and all methods show negative Silhouette scores? Lightly AI-edited
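For context, the embed-cluster-score loop these reviews describe can be reproduced in outline as follows; the image folder, CLIP checkpoint, and HDBSCAN settings below are placeholders rather than the authors' configuration, and the metrics require at least two non-noise clusters.

```python
import glob
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sklearn.cluster import HDBSCAN  # scikit-learn >= 1.3 (the standalone `hdbscan` package also works)
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Placeholder checkpoint and folder; not the authors' configuration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = sorted(glob.glob("instagram_posts/*.jpg"))  # hypothetical local image folder
feats = []
with torch.no_grad():
    for p in paths:
        inputs = processor(images=Image.open(p).convert("RGB"), return_tensors="pt")
        feats.append(model.get_image_features(**inputs).squeeze(0).numpy())
X = np.stack(feats)
X /= np.linalg.norm(X, axis=1, keepdims=True)  # CLIP embeddings live in a cosine geometry

labels = HDBSCAN(min_cluster_size=15).fit_predict(X)
mask = labels != -1  # drop HDBSCAN noise points before scoring
print("silhouette:", silhouette_score(X[mask], labels[mask]),
      "CH:", calinski_harabasz_score(X[mask], labels[mask]),
      "DBI:", davies_bouldin_score(X[mask], labels[mask]))
```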
Image Embeddings from Social Media: Computer Vision and Human in the Loop Applications for Social Movement Messaging Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper investigates the use of computer vision-based image embeddings and clustering for analyzing social movement messaging within a large set (16,567 posts) of Instagram images related to the anti-feminicide movement in Mexico. The study extracts feature embeddings using ResNet50, CLIP, and BLIP-2, applies HDBSCAN for clustering, and evaluates clusters with several quantitative metrics alongside human-in-the-loop inductive content analysis. The results compare representational properties across models and discuss overlap and nuance in image messaging structure, highlighting the strengths and limitations of current representation models for domain-specific, topic-coherent social images. 1. The paper addresses a relevant, underexplored problem at the intersection of machine learning, social movements, and computational social science, providing quantitative insight into the structure of visual messaging in a humanitarian context. 2. It adopts a comparative framework, using both popular (ResNet50, CLIP) and more advanced (BLIP-2) image embedding models, allowing a nuanced analysis of their clustering behavior on real-world activist imagery. 3. Employing multiple established clustering evaluation metrics (Silhouette Score, Calinski-Harabasz Index, Davies-Bouldin Index), in tandem with human-in-the-loop content analysis, demonstrates methodological rigor and brings valuable qualitative depth to quantitative findings. 1. While ResNet50, CLIP, and BLIP-2 are established models, the rationale for selecting these, particularly why not use more recent or task-specialized models (e.g., multimodal sentiment/abusive meme detectors), is underspecified. There is also little reflection on how text-in-image (e.g., hashtags, slogans) is handled beyond embedding, although text is central to social movement images. 2. No non-deep-learning clustering baselines (e.g., classical SIFT/ORB features, PCA+GMM, or manual curation) are reported for context. Similarly, the study provides limited statistical testing to gauge whether any performance differences (or cluster count/size differences in Tables/Appendices) are meaningful, or merely artifacts of parameter tuning. 3. While the analysis (Section 3 and Figures 2–4) is a valuable complement, the use of coding to identify cluster validity serves more as a post hoc rationalization than a systematic validation, potentially overfitting the interpretation to noisy clusters. For instance, the claimed “nuance” within dense groups could mask model or clustering failures. More rigorous validation (possibly co-clustering, cross-validation with held-out hand-labels, or even crowd-sourced validation as secondary annotation) is missing. 4. Given the consistently negative Silhouette Scores (Table 1), what measures (quantitative or qualitative) can you provide to justify that your clusters are not artifacts of parameter tuning, but capture semantically meaningful differences? Would alternative clustering/objective functions mitigate the issue of dense overlap? 5. 
Can you provide any ablations or error analysis comparing embeddings and clusters for images dominated by text versus those that are mostly visual? How do ResNet50, CLIP, and BLIP-2 differ in treating such cases? As shown in Weakness. Heavily AI-edited
Image Embeddings from Social Media: Computer Vision and Human in the Loop Applications for Social Movement Messaging Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper deals with the problem of analyzing images shared on social media platforms. More specifically, images related to a specific topic are collected from a single social media platform, and image features extracted from standard image encoders are grouped using HDBSCAN, a standard clustering algorithm in data mining and database management. Extracted clusters are checked and investigated by human inspection, which reveals that different image encoders yield different clusters and thus provide different semantic groups. S1. The research topic dealt with in this paper is significant. Social media has become one of the most influential media platforms, and its significance continues to grow day by day. In this sense, analyzing the characteristics and dynamics of media content distributed on social media platforms is one of the most significant research topics for understanding shifts in social conditions and public opinion. W1. If my understanding is correct, the main topic of this paper belongs to social science, not computer science. In this sense, I strongly recommend that this paper be submitted to other conferences related to social science, such as ICWSM and CHI. ICLR focuses on fundamental theories and innovative technologies for machine learning, placing a high priority on theoretical and/or technical novelty. W2. On the other hand, discovering and demonstrating novel findings with already known techniques is also valuable for healthy development in computer science. However, papers focusing on this aspect should provide extensive investigations from various viewpoints and attempt to address nearly all questions derived from the original research question and the experimental results. See e.g. [Teney+ CVPR2024 https://openaccess.thecvf.com/content/CVPR2024/html/Teney_Neural_Redshift_Random_Networks_are_not_Random_Functions_CVPR_2024_paper.html]. From this viewpoint, the current paper seriously lacks deep investigations into the research question. W3. The organization also needs major revision. For example, this paper devotes excessive space to explaining well-known techniques. Meanwhile, almost all the experimental results required for describing the main story are placed in the supplementary material. Q1. I could not understand the reasons why the authors chose the anti-feminicide movement as their research material. This explanation is required for understanding the philosophy and research question of the paper, and for checking for potential ethical issues. Fully human-written
CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes CamPilot, a technique to improve camera control in video diffusion models. Concretely, this work uses Reward Feedback Learning (ReFL) to improve the camera control. For this, a 3DGS decoder on top of the video latents is used to directly generate 3D Gaussians instead of RGB videos. Then, the renderings of the 3D Gaussians are used for reward feedback. Depth is used to obtain a visibility mask, and the generated output is masked so that supervision applies only to content visible in the input. - Novelty of using reward feedback for camera control: I have not seen a work that uses 3DGS to improve the underlying video model and its 3D consistency/camera control precision. The direction seems promising. - Camera-aware 3DGS decoder already proposed: The first point of the contribution list claims the camera-aware 3DGS decoder to be a contribution. However, the approach just follows Wonderland [1] without changes. Claiming the whole pipeline is a significant overclaim, and the authors should have acknowledged Wonderland much more throughout the paper. Wonderland is mentioned and referenced in the paper, but there are no differences in the approaches. The main point of the paper is the reward-based feedback learning, not the 3DGS decoder. - Supplementary video presentation: The supplementary video presentation is one of the key components of a video generation submission. However, there is no website, only separate videos, and it is not clear what is happening. Moreover, there are no side-by-side comparisons with previous works. You should also show multiple generated scenes for one camera trajectory to show that they are consistent and the camera control works across scenes. - Unclear handling of dynamic objects: Currently, there is no guarantee that the scene remains static. While most results are designed for scenes without objects that could move, the approach does not seem to handle dynamic objects, which the video model would normally animate. Those animated objects would then lead to blurry 3DGS outputs and bad reward feedback. Hence, the model is currently restricted to static scenes. [1] Liang et al., Wonderland: Navigating 3D Scenes from a Single Image, CVPR 2025 I generally like the idea of using 3DGS output to improve 3D consistency/camera control precision of video models. However, the majority of the paper overclaims the architectural contributions since they just follow Wonderland. The main formulation of the reward feedback is simple, which is good, but it is not analyzed that deeply. Moreover, the supplementary material is not organized well and is difficult to digest. I highly recommend preparing a supplementary website with side-by-side comparisons. I would like the authors to address the following question: - What are the architectural contributions described in Sec. 3.1-3.3? All the parts mentioned just reuse other works. I am currently negative but happy to see what the authors say about the actual contributions of the work. Fully human-written
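For clarity on the visibility-masked supervision discussed above, a generic sketch of a masked photometric reward is given below; the paper's exact reward terms, weights, and mask construction may differ.

```python
import torch

# Generic form of a visibility-masked photometric reward: renderings from the
# predicted 3D Gaussians are compared to reference frames only where a
# visibility mask (e.g. derived from depth reprojection of the first frame)
# is valid. A sketch of the idea only, not the paper's implementation.

def masked_photometric_reward(render, target, mask, eps=1e-6):
    """render, target: (B, T, 3, H, W); mask: (B, T, 1, H, W) in {0, 1}."""
    err = ((render - target) ** 2) * mask
    per_frame = err.sum(dim=(2, 3, 4)) / (mask.sum(dim=(2, 3, 4)) * 3 + eps)
    return -per_frame.mean()          # higher reward = better masked alignment

B, T, H, W = 1, 16, 64, 64
render = torch.rand(B, T, 3, H, W, requires_grad=True)
target = torch.rand(B, T, 3, H, W)
mask = (torch.rand(B, T, 1, H, W) > 0.3).float()
reward = masked_photometric_reward(render, target, mask)
reward.backward()                     # gradients flow only through visible pixels
print(reward.item())
```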
CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback Soundness: 2: fair Presentation: 1: poor Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces CamPilot, a video diffusion framework designed to improve camera controllability through a reward feedback learning strategy. The authors propose a camera-aware 3D decoder that decodes video latents into 3D Gaussian representations to evaluate camera-video alignment efficiently. The model achieves better camera control and 3D consistency compared to existing methods on the RealEstate10K and WorldScore benchmarks. 1. Using 3DGS to enhance camera-guided video generation is a good starting point. 1. The writing quality of this paper needs improvement. Many expressions are verbose and not concise enough. For example, Sections 2.2 and 2.3 contain multiple repetitions of earlier content, resulting in unnecessary length. Moreover, the core comparison experiment showing how much computational cost is reduced compared to VAE decoding is placed only in Appendix A.4. From the main text alone, the experimental details are unclear. 2. The baseline methods used for comparison are somewhat outdated. The paper lacks comparisons with several recent and highly relevant approaches, such as CamI2V [1], OmniCam [2], RealCam-I2V [3], and ReCamMaster [4]. 3. The proposed method employs a multi-stage reward scoring strategy for evaluating camera trajectories. Although the authors claim that this method is efficient, I am curious whether it could lead to error accumulation across stages. Compared with approaches that estimate camera trajectories from decoded videos (e.g., via VAE decoding) and directly compare them to ground-truth trajectories, how large is the accuracy gap between these two strategies? 4. Minor revision suggestions: Please correctly use the `\citep` command for references, e.g., Line 208 for *UniFL*. [1] CamI2V: Camera-Controlled Image-to-Video Diffusion Model [2] OmniCam: Unified Multimodal Video Generation via Camera Control [3] RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control [4] ReCamMaster: Camera-Controlled Generative Rendering from A Single Video See Weaknesses. Lightly AI-edited
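Regarding the question above about the accuracy gap between reward strategies: standard pose-error metrics such as the rotation geodesic error and a scale-free translation direction error could be used to quantify such a comparison; the snippet below shows these generic metrics and is not taken from the paper.

```python
import numpy as np

# Standard pose-error metrics one could use to compare an estimated camera
# trajectory (e.g. recovered from decoded videos with COLMAP) against the
# conditioning trajectory; shown for context only.

def rotation_error_deg(R_est, R_gt):
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_angle_deg(t_est, t_gt, eps=1e-8):
    # Direction-only error, since monocular trajectories are only up to scale.
    cos = np.dot(t_est, t_gt) / (np.linalg.norm(t_est) * np.linalg.norm(t_gt) + eps)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Toy check: a 5-degree rotation about the z-axis.
theta = np.radians(5.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
print(rotation_error_deg(Rz, np.eye(3)))                 # ~5.0
print(translation_angle_deg(np.array([1, 0, 0.1]), np.array([1.0, 0, 0])))
```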
CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces CamPilot, a novel framework aimed at enhancing camera controllability in video diffusion models through reward feedback learning. The authors identify a persistent challenge in aligning generated video content with specified camera trajectories, which undermines 3D consistency in downstream tasks such as scene reconstruction. To address this, they propose a camera-aware 3D decoder that projects video latents and camera poses into 3D Gaussians (3DGS), enabling efficient rendering and reward computation. The framework is evaluated on RealEstate10K and WorldScore, demonstrating improved camera alignment and visual fidelity. 1. Introduces a feed-forward 3D Gaussian-based decoder that efficiently evaluates camera-video alignment without reliance on computationally intensive post-processing tools like COLMAP. 2. Applies reward feedback learning (ReFL) to optimize camera adherence, which represents a previously underexplored direction in video diffusion. 3. Enables high-quality 3D scene reconstruction directly from video latents and camera poses, bypassing computationally expensive per-scene optimization. 1. The evaluation is limited to static-scene datasets, which potentially limits its applicability to real-world video generation tasks. 2. The method assumes precise extrinsic and intrinsic camera parameters are available, which may not be a valid assumption in real-world applications. 1. How does the method handle dynamic scenes involving object motion or non-rigid transformations? It is recommended that the authors include evaluation results on the open-sourced dynamic RealCam-VID [1] dataset. 2. It is recommended that more comparisons with other recent camera-control methods, such as RealCam-I2V [2], be added. We recommend including quantitative and qualitative results on the open-sourced dynamic RealCam-VID [1] dataset, alongside an analysis detailing the advantages of the proposed CamPilot framework in the final paper. 3. How well does the model perform under conditions of rapid camera motion or large trajectory shifts (e.g., simulated drone flights or rapid rotations)? The authors are encouraged to provide quantitative or qualitative results in such scenarios. 4. How is camera scale handled during training and inference? The authors are encouraged to clarify whether the model relies on absolute or relative scale inputs, and how it ensures scale consistency across scenes. [1] RealCam-Vid: High-resolution Video Dataset with Dynamic Scenes and Metric-scale Camera Movements [2] RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control Lightly AI-edited
CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper improves camera control in video generation using a camera-aware 3D decoder and Reward Feedback Learning (ReFL). The decoder maps video latents to 3D representations to assess alignment. ReFL then optimizes this alignment using a novel visibility-aware reward that supervises only visible pixels. Experiments on RealEstate10K and WorldScore show significant improvements in camera control and visual quality. 1. Originality This paper addresses the under-explored problem of enforcing camera conditioning in video diffusion models using ReFL, thereby improving alignment between generated footage and prescribed camera parameters. 2. Quality The proposed approach incorporates a camera-aware 3D decoder that efficiently evaluates video–camera consistency while reducing computational overhead. Experimental results demonstrate clear gains in both camera control accuracy and overall visual quality. 3. Clarity The presentation is well organized and easy to follow. Detailed diagrams—particularly the architectural overview in Figure 2—effectively convey the components and workflow of the framework. 4. Significance By enhancing camera control in video generation, this work has immediate relevance to applications such as virtual reality and robotics. Moreover, it delivers an efficient feed-forward 3D reconstruction module and addresses the efficiency bottlenecks of reward-based learning in diffusion models. 1. The paper provides limited insight into how the method scales to larger datasets or real-world deployments; a more detailed analysis of computational requirements and potential bottlenecks would strengthen its practical relevance. 2. By relying on 3DGS, the approach is inherently restricted to static scenes, as the authors acknowledge, limiting its applicability to dynamic or non-rigid environments. 3. The pixel-level reward signal may be too low-level to capture high-order semantics, potentially leading to overly smoothed outputs that sacrifice semantic fidelity. 1. How general is the proposed framework? How well does it perform when applying the camera-aware 3D decoder and CRO to other baseline models, such as MotionCtrl or ViewCrafter? 2. The claim that latent-pose misalignment causes blurry renders is central to the reward design , supported qualitatively by Fig. 12. Could the authors provide quantitative validation, such as a plot showing rendering quality (PSNR/LPIPS) degrading as camera pose noise increases? 3. Since the visibility mask only supervises regions from the first frame, how does the model ensure 3D consistency in large, newly revealed areas that receive no reward gradient? Have any artifacts been observed in these un-supervised regions? 4. Can the authors quantify the sensitivity of the final model's performance to the 3D decoder's quality? For example, what is the expected performance gain for every 1-point PSNR improvement in the decoder? 5. What is the justification for using 7 denoising steps for reward computation, and how does this hyperparameter affect the trade-off between performance, stability, and cost? 6. 
How does the initial training's timestep sampling bias interact with the subsequent reward optimization stage? Is the reward gradient also biased towards certain timesteps? 7. Given the model's scene-centric training on RealEstate 10K and its static scene design, how does it perform on object-centric generation tasks? Lightly AI-edited
Page 7 of 1516 (75800 total rows)