ICLR 2026 - Reviews

SubmissionsReviews

Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 15899 (21%) 4.43 3.58 3687
Heavily AI-edited 3233 (4%) 4.22 3.59 2990
Moderately AI-edited 7082 (9%) 4.20 3.61 2722
Lightly AI-edited 16648 (22%) 4.15 3.68 2746
Fully human-written 32938 (43%) 4.13 3.62 2917
Total 75800 (100%) 4.21 3.62 3026
Title Ratings Review Text EditLens Prediction
Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper evaluates the historical knowledge embedded in text-to-image (T2I) models. The authors generate images using multiple T2I models based on predefined actions and time periods, ranging from the 17th century to the late 20th century. The analysis focuses on three key aspects: implicit stylistic associations, anachronism detection, and demographic representation. Using carefully defined metrics for each aspect, the study reveals several noteworthy findings: a) different TTI models exhibit distinct stylistic associations across time periods, and these associations persist even under explicit prompting; b) the models frequently produce anachronistic elements, reflecting a lack of temporal awareness; and c) the generations display gender and racial over- or under-representation, highlighting underlying demographic biases. While the intent behind each such direction is appreciated, i have some issues with the methods used to evaluate these, which i have summarized in the weaknesses. 1. The paper explores the world knowledge embedded in text-to-image (TTI) models—knowledge that parallels that of modern Large Language Models (LLMs). By examining how these models represent historical contexts, the authors go beyond the typical focus on creativity or imagination to probe their practical understanding of reality. This represents an important and relatively underexplored research direction. 2. The findings reveal previously unexamined layers of bias within TTI models. For instance, the recurring depiction of modern artifacts such as headphones in images set in the 17th or 18th century underscores the models’ limited grasp of historical realism and their tendency to fill knowledge gaps with contemporary concepts. Addressing such issues is crucial for improving the reliability and historical awareness of future model releases. 3. The paper is clearly structured and well written, making complex ideas accessible and easy to follow. 1. The paper’s positioning could be clearer. The provided dataset, being composed of the outputs from T2I models applied to a set of prompts, offers limited standalone value to the community—apart from ensuring reproducibility. The true contribution appears to lie in the methodological framework for analyzing the historical biases in generative models. It would therefore strengthen the paper to explicitly present the work as proposing a benchmark for estimating VLM biases in representing historical contexts, with the accompanying dataset serving as an illustrative application of this benchmark to three specific models. 2. In the anachronism detection evaluation, the reported 72% alignment with human judgment is substantially low, casting some doubt on the robustness of the anachronism detection component. The concern here is less about the existence of anachronisms and more about the quantitative reliability of the reported metrics. 3. The use of Large Language Models (LLMs) to measure gender and racial representation raises validity concerns. While the paper attempts to verify these measures through domains such as education and agriculture, the scope of the evaluation appears insufficient to justify LLMs as reliable tools for assessing historical demographic biases. A more cautious approach would be to limit this analysis to tasks with concrete supporting data and to avoid the use of LLM-based estimations where empirical grounding is weak. The paper uncovers stylistic biases across time periods, however, these would be expected on some levels due to the data available on those time periods (portraits and illustrations in older days vs black and white in 1900s vs more varieties nowadays). May be the issue can be avoided just by prompting the models to generate only real-life images. What's the authors' take on this? Lightly AI-edited
Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces a benchmark for evaluating how text-to-image diffusion models represent historical contexts, addressing a gap in current research which has focused primarily on contemporary demographic and cultural biases. The authors created HistVis, a dataset of 30,000 synthetic images generated by three state-of-the-art models (SDXL, SD3, and FLUX.1) using neutral prompts describing universal human activities across five centuries and five decades of the 20th century. They evaluate these images along three dimensions: (1) Implicit Stylistic Associations, finding that models strongly default to specific visual styles for certain eras (e.g., engravings for the 17th-18th centuries, monochrome photography for early 20th century decades) even without explicit stylistic instructions; (2) Historical Consistency, using an automated LLM+VLM pipeline to detect anachronisms; and (3) Demographic Representation, comparing generated gender and racial distributions against LLM-derived historical baselines. The findings demonstrate that T2I models systematically struggle with historically accurate representations, relying on learned stylistic conventions and failing to properly condition on temporal context. * First systematic evaluation of historical representation in T2I models, articulating why this matters beyond factual accuracy—historical imagery shapes cultural memory, collective identity, and public understanding of the past, with real consequences as these systems increasingly generate educational and cultural content. * Dataset contribution: HistVis dataset with 30,000 images across 3 state-of-the-art models (SDXL, SD3, FLUX.1), using 100 universal, temporally-agnostic activities paired with 10 time periods—this design cleverly isolates models' internal historical representations by avoiding historically-specific prompts that could encode external assumptions. - Authors check that prompt engineering fails to mitigate biases. Mitigation experiments demonstrate that explicit instructions (adding "photorealistic" to prompts, using negative prompts to discourage monochrome) largely fail to override models' stylistic defaults * Authors compare multiple state-of-the-art models and find systematic differences. Comparative analysis reveals model-specific failure modes (SD3 exhibits highest anachronism rates at 20-25%, SDXL most historically accurate, FLUX.1 intermediate) - I think this is a good paper, but I am worried about the fact that demographic baseline uses LLM as "ground truth". The third metric relies entirely on GPT-4o to estimate historically plausible demographics, meaning any biases the LLM has will be encoded into the benchmark and treated as "correct" historical representation. This is particularly dangerous because: (1) LLMs are trained on internet data that reflects contemporary biases and incomplete historical records, not peer-reviewed historical scholarship; (2) the validation against Our World in Data only covers 3 out of 20 activity categories, leaving 85% of the benchmark unvalidated against any formal historical source; (3) even the validated categories use continent-level distributions while the actual evaluation uses race categories, introducing an additional unsupported mapping; (4) future work citing this benchmark may treat these LLM estimates as authoritative baselines, perpetuating and legitimizing whatever biases GPT-4o encoded. The authors acknowledge this is a "coarse approximation" and cannot replace expert historians, yet they publish quantitative over/under-representation scores without consulting actual historical demographers or using primary historical sources. This creates a circular validation problem where one AI system (GPT-4o) judges another (SDXL/SD3/FLUX), with no external ground truth. A benchmark critiquing historical accuracy should itself be grounded in rigorous historical methodology, not LLM outputs. The stylistic and anachronism metrics are well-validated, but the demographic analysis risks doing more harm than good by establishing flawed baselines as reference standards. - Why not use actual historical sources for demographic baselines? You validate 3 categories against Our World in Data with reasonable results (MAE=4.64 for GPT-4o). Why not extend this approach to the remaining 17 categories by consulting historical demographers, census data, labor statistics, or peer-reviewed historical scholarship? Even if comprehensive data doesn't exist for all activity-period pairs, wouldn't partial coverage with real historical data be more valuable than complete coverage with LLM estimates? What were the practical constraints (time, cost, expertise access) that prevented this? - Did you consult any historians or historical demographers during this work? If so, what was their feedback on using LLM-generated baselines? If not, would you consider adding expert validation in a revision, at least for a subset of high-impact categories (e.g., education, work, agriculture) that future users might cite most frequently? - How do you recommend future work should use the demographic metric? Given that you acknowledge LLM baselines are "coarse approximations" and cannot replace expert knowledge, should future papers cite the demographic over/under-representation scores as evidence of bias? Or should this metric be treated as exploratory/preliminary pending validation with real historical data? Would you consider adding stronger cautionary language in the camera-ready version? Fully AI-generated
SFedPO: Streaming Federated Learning with a Prediction Oracle under Temporal Shifts Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper studies the problem of federated learning when clients receive stream of data. Due to having dynamic environment, the data distribution among clients may vary over time. The paper proposed adjusting data sampling for model training and obtained the convergence rate. Then to optimize the convergence, the paper proposes a client data sampling distribution and a server aggregation strategy. The proposed solution is based on the presence of reliable oracle that can predict states of clients. Experimental results show that the proposed algorithm achieves better results than other baselines. - The problem of federated learning facing with stream of data is an interesting problem. The paper focuses on the cases where there is a possibility to predict the next state of clients. - The paper provides convergence analysis for the proposed algorithm. Furthermore, the proposed data sampling and aggregation strategy technically sound which optimizes the convergence rate. - The paper provides comprehensive the experimental results. - The performance of the proposed algorithm depends on the quality of the predictive oracle model. However, an accurate oracle model may not always be available. - The paper could benefit from including “Federated Learning for Data Streams” (Marfoq et al., 2023) as one of the baselines to strengthen the experimental study. - A discussion on how the use of the oracle model can improve the convergence rate could also be added to enhance the paper. Based on my understanding, the paper focuses on scenarios where clients operate in dynamically changing environments, and the model training is adjusted accordingly to adapt to these changes. However, the paper does not appear to consider client heterogeneity. Could you elaborate on this aspect? Fully human-written
SFedPO: Streaming Federated Learning with a Prediction Oracle under Temporal Shifts Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes SFedPO, a framework designed for FL under streaming data distributions by leveraging partial predictions about clients' data distribution shifts to guide both local data sampling and global aggregation. Based on a convergence upper bound, the authors develop two modules: Distribution-guided Data Sampling (DDS) and Shift-aware Aggregation Weights (SAW), which are claimed to jointly minimize the optimization error bound. Theoretical analysis provides convergence guarantees and robustness under prediction errors. Extensive experiments on multiple benchmarks show that SFedPO consistently improves test accuracy over existing FL baselines and can be plugged into various FL frameworks to further enhance performance. 1. This paper focuses on an important practical problem - federated learning with streaming and non-stationary data. 2. The theoretical convergence upper bound drives the development of DDS and SAW modules. 3. The experiment setup is reasonable, and the experiment results show the advantage of their proposed SFedPO. W1. In lines 52-53 about the research question, the sudden transition to focusing on client sampling and server aggregation feels abrupt. From my personal perspective, the related background or explanation is not enough before coming up with the research question, which results in the research question not being convincing enough. W2. Theorem 1 has a lack of readability because it lacks explicit interpretation and further discussion, such as convergence rate analysis, the meaning of each term, the impact of some key factors (e.g., $\pi$), difference between your theoretical results and the result with static data distribution, and so on. W3. As for Equation (10), there are several issues: (W3.1) In lines 268–270, the authors state that “In realistic streaming environments, it is neither desirable nor feasible for a client to completely discard previously stored data or to entirely ignore new samples from a given state.” This viewpoint is not completely convincing because there are realistic situations where it is necessary to discard newly received samples with extremely high noise. Moreover, this assumption implicitly supports the validity of Equation (10) because it ensures that $\alpha_n$ cannot be zero. However, in practice, $\alpha_n$ can indeed be zero, in which case Equation (10) becomes undefined. (W3.2) The bounded gradient $G$ can be very large. As a very large $G$, $-a_1 d_m + b_1$ will be almost zero. Then, the score will not decrease with the state-specific heterogeneity bound $d_m$, which does not work as what they claimed. **Typo:** W4. Line 89: "a *date* evaluation metric" -> "a *data* evaluation metric" Q1. In the modularity experiment, does it use the same hyperparameter settings for baselines with and without SFedPO? Q2. Can the authors provide more explanation and discussion about the Theorems 1 and 2? Moreover, what is your theoretical contribution compared to the existing works? Q3: There are a lot of hyperparameters that need to be set in the experiment. Hence, I would like to know: how did the authors make sure that the ranges of $a_1$, $b_1$, $a_2$ and $b_2$ are reasonable? Moreover, when the heterogeneity score $s_n$ is large due to a large $G$, $a_2$ and $b_2$ may may not have a significant effect because they are dominated by $s_n$. Did the authors observe any related phenomena in your experiments? See weaknesses above. I would adjusting my rating if the authors can address my concerns properly. Fully human-written
SFedPO: Streaming Federated Learning with a Prediction Oracle under Temporal Shifts Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper investigates federated learning under dynamic distribution shift scenarios. SFedPO avoids two extremes: traditional FL (assuming static distribution) and online FL (not using previous information at all)j by employing sampling strategy and client weighting mechanism. Both mechanisms are theoretically supported and experimental results demonstrate the superiority of SFedPO. 1. Research topic is timely and meaningful: federated learning using streaming data (dynamic data distribution). 2. Both ideas (sampling and client weighting) are theoretically supported. 3. Experimentally demonstrated performance improvement 1. Model architectures are too old: AlexNet and LeNet-5, meaning that updating only architectures can exceed the gain of the proposal. Utilizing at least ResNet is recommended. 2. Datasets are small and less challenging. Evaluating on CIFAR-100 is more common for federated learning. Dynamic environments can incur class changes in CIFAR-100, which makes challenges. Please address the weaknesses above. Fully human-written
SFedPO: Streaming Federated Learning with a Prediction Oracle under Temporal Shifts Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces SFedPO, a streaming federated learning (FL) framework designed to bridge the gap between conventional static FL and fully adversarial online FL settings. The authors assume that temporally evolving client data can be modeled as transitions among a finite set of latent states, each corresponding to a stationary distribution of data. By incorporating a prediction oracle that estimates the transition probabilities among these states, SFedPO dynamically adjusts local data sampling through a distribution-guided strategy (DDS) and adapts global aggregation via shift-aware weights (SAW). The authors provide convergence guarantees under this setup, demonstrate robustness to oracle prediction errors, and show that SFedPO consistently improves accuracy over baseline FL methods in simulated streaming settings. The paper is logically structured, demonstrating a clear flow from problem formulation to theoretical analysis and practical implementation. The authors present a comprehensive convergence analysis and include robustness guarantees with respect to prediction errors. The framework is modular, meaning it can be integrated with multiple FL algorithms, and the experimental results demonstrate measurable improvements in accuracy across several baseline methods. The work also addresses an unfilled niche in FL literature by navigating between static and adversarial formulations with partial future knowledge. The primary weakness lies in the practicality and realism of the assumptions. The use of a finite latent space with a known transition model and access to a prediction oracle may not reflect real-world data characteristics, and the experiments do not validate this setup beyond simulations. The partial-access experiment design may unfairly benefit SFedPO by clustering latent states in ways that other methods are not designed to exploit. Moreover, the feasibility of estimating parameters, such as the number of states or heterogeneity bounds, remains unclear, and no computational overhead analysis is presented for the proposed sampling and weighting schemes. What is the computational overhead of computing the sampling ratios $\alpha_{n,m}$ and aggregation weights pₙ during federated rounds, and how does this compare to the cost of local model training and communication? Are there settings, either in terms of client availability, data dynamics, or oracle error, where SFedPO may underperform relative to classical FL methods such as FedAvg? How should practitioners estimate or determine the number of latent states M if no prior structure is available in a real dataset? In Equation (1), the authors propose updating the local distribution through a convex combination. Are there alternative techniques (e.g., kernel-based blending or Bayesian updating) that could better capture uncertainty or non-convex transitions across states? Fully AI-generated
Memory Makes The Poison: Over Memorization Drives Visual Poisoning in LVLMs Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper argues that the true cause of data poisoning in LVLMs is the memorization of fine-tuning concepts, not the pixel-level perturbations themselves. It also introduces RejectShield, a simple rejection-based defense that filters out likely poisoned samples using an adversarial detector. The problem setting is interesting, backdoor attacks are becoming an increasingly pertinent threat as more and more SFT data is mined from uncurated web sources. ## Weaknesses - **Overstated Claims** The observation that backdoor attacks are successful due to models memorizing patterns is obvious and this paper is certainly not the first one to observe this. The authors repeatedly state that "backdoors are not successful because of the pixels" but what does this even mean? - **Unsurprising Result in Figure 1** This result is unsurprising, as the pixel perturbations in backdoor attacks are meant to (1) enforce imperceptibility and (2) to prevent interference with other concepts that may exist in the finetuning dataset. In the setting you have selected, the effect is likely observed due to their choice of concepts which are rare, low-overlap categories like engine lights naturally face little interference from clean data, making memorization-driven attacks appear stronger than they would for more common concepts. The broader implication of this plot is not at all clear to me. - **Method not clearly described** Most of the paper is spent explaining a fairly obvious finding, and very little time is spent on describing RejectShield in the main paper though this is the main method of this paper. As far as I understood, it is a classifier that is able to detect adversarial examples but almost no details are provided on the datasets used to train this classifier or its error patterns, nor is there any sensitivity analysis. See weaknesses Fully human-written
Memory Makes The Poison: Over Memorization Drives Visual Poisoning in LVLMs Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper claims to find that memorization is the root cause of the success of data poisoning attacks in multimodal LLMs and conducts extensive experiments to verify their claim. And then propose RejectShield which filters out any data with poison and finetunes the model without such data, ultimately finding that such a process results in much lower ASR therefore a successful cleaning approach. The study of data poisoning attacks in LLMs and multimodal LLMs is a very pertinent subject of study. In the current state, the paper has several and serious flaws: 1. First of all it is commonly understood that memorization is the cause of data poisoning, otherwise such low percentages of poisoning will never lead to model being poisoned so effectively as demonstrated in this and several previous works. Therefore it is not the contribution of this paper to claim that memorization leads to data poisoning, it is kind of obvious. 2. The claims about multimodal being more subject to data poisoning is also something previous papers have found, see https://arxiv.org/abs/2106.09667. 3. The approach RejectShield which should have been the main contribution of this paper if it is such an effective technique has been treated as second class citizen in the work, there is hardly any mention of how it works. The authors just say that they have built a technique that can detect if an image has adversarial perturbation. As far as my understanding of adversarial attacks in images goes, this is an unsolved problem in the image community since 2013 when adversarial attacks in images were first discovered and there exist no reliable classifiers that given an image can predict if the image has an adversarial attack or not. So I am very curious of the details of how the authors of this work were able to achieve that. Please provide more details of the training data, training procedure, and the robustness of this classifier. 4. There are several possible errors in the paper. For example in line 180 (x_d, y_d) is not the benign image, it is the target pair, right? 5. Another mistake is in line 211 where the authors state the model was trained on the benign counterparts x_d (and not x_p), if that is the case then how did the model learn any poison (like where did the 1% poison in Figure 2 come from)? Can you explain this in much more clarity in the paper? Please refer to the weakness section. Fully human-written
Memory Makes The Poison: Over Memorization Drives Visual Poisoning in LVLMs Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper studies how visual perturbation–based data poisoning attack, specifically ShadowCast, poses threats to Large Vision–Language Models, and the reasons existing input purification defenses appear to not fully mitigate the attack. The paper identifies the issue as the LVLM over memorizing concepts contained by the images added to the clean dataset, instead of the adversarial perturbations, by fine tuning with samples with only benign images without edits. Motivated by the observation, the paper proposes to filter the fine tuning dataset with an adversarial sample detector, complemented by an LLM that checks if there is imbalanced data in the dataset. 1. The paper conducts detailed analysis through controlled experiment to fine tune with only benign target samples, to motivate the proposed defense. 2. The paper conducts similar controlled experiments on single modality model to identify the problem is more serious for LVLM. 1. In Figure 1, labeling in the rate where the model mistakenly produces results with respect to x_d instead of x_o for the controlled experiment is a bit misleading, since the misidentification is not the intended outcome. 2. The experiment results suggest the reason the ShadowCast attack appears to be not mitigated by just input purification defenses is that the target model cannot distinguish samples containing the destination concept and samples containing the origin concept, due to that the model is fine-tuned with an unbalanced amount of target samples. I think there are some open questions the paper does not address that make it unclear about how to interpret the results of the paper. For example, if under the controlled experiment or after input purification, the target model does not only respond the prompts about the origin concept with information about the destination, but also respond to prompts about some other similar concepts with information about the destination concept, then it suggests what is observed is hallucination as a side effect of the targeted attack, and the targeting itself has been mitigated. Then the proposed defenses are closer to methods that mitigate hallucination when there is an imbalance of fine tuning data, instead of defenses for targeted attacked. In particular, using an adversarial perturbation detector works because the attacker has label the imbalanced data with adversarial perturbation. Q1. > This vulnerability is particularly concerning, as it shows that standard LVLM fine-tuning can be destabilized with only a few injected samples. Can the authors quantify "a few"? What is the typical fine tuning dataset size? For example, if the size if 10K, then 100 poison samples is not a trivial number. Q2. If the origin concept and the target concept are more distinct under certain perception, like Harry Styles vs Biden, does the conclusion from Section 3.2 still hold? Q3. If the there are benign samples containing origin concept and the target concept (some Trump example, some Biden example), does the conclusion from Section 3.2 still hold? Q4. In the case study with engine light and fuel light, and only normal fuel lights examples are provided in fine tuning, if the fine tuned model is tested with an origin concept is some other warning light on a car that looks similar, would the model also output fuel light related answers? Q5. Does the defense only improve on attack cases where origin concept images look similar to target concept image? Q6. How does “ShadowCast" compare with the "controlled experiment" in regime where origin concept images are distinct from target concept image? Can “ShadowCast still generalize to other models under that regime. Fully human-written
Memory Makes The Poison: Over Memorization Drives Visual Poisoning in LVLMs Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper finds that visual data poisoning in LVLMs is mainly caused by over-memorization during fine-tuning, not visual perturbations. The authors show that LVLMs hallucinate even when trained on benign but repetitive samples and that multimodal inputs worsen this effect. They reinterpret the ShadowCast attack as a memorization issue and propose RejectShield, a rejection-based defense that cuts attack success by up to 99% while preserving model performance. 1. The structure of the paper is clear, and the problem is well motivated. 2. The authors demonstrate the problems of the Shadowncast, which makes sense to me. Main Concerns 1. This paper shows that the effectiveness of Shadowcast does not come from the poisoned attack, but the hallucination of the model. I think this makes sense to me, as the injected images contain only a single class, which likely introduces strong data bias and induces hallucination. However, it remains unclear whether this limitation is unique to Shadowcast or shared across other poisoning setups. To support the claim that over-memorization, rather than poisoning, drives the observed effect, the authors should conduct additional experiments under varied training configurations and data distributions. Without such evidence, the generality of the conclusion remains uncertain.
 2. The paper only evaluates the Shadowcast attack. Including more advanced attack methods (e.g, [1]) could strengthen the robustness of the conclusion. Other Concerns 1. The authors only evaluate the LLaVA v1.5 and MiniGPT-4 which are a little bit outdated right now. It is suggested that the authors introduced one more advanced VLM (e.g., Qwen2.5-VL-7B or Qwen3-VL-7B). 1. How did the authors train the classifier? If the training data only consists of injected pairs (w/o perturbation) and clean data, how would this classifier perform? Since underlyingly, it is a hallucination problem, using a classifier to detect the adversarial examples might not be ideal solution. Fully human-written
LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces **UI-SIMULATOR**, a framework that treats large language models (LLMs) as **digital world simulators** for generating large-scale UI interaction trajectories to train general-purpose digital agents. The framework includes a world simulator for state transitions, a guided rollout process for multi-step task exploration, and a trajectory wrapper for refining reasoning and task instructions. A key extension, **Retrieval-Augmented Simulation (UI-SIMULATOR-R)**, allows the model to reference a few real-environment samples during generation, making simulated UI states more realistic and domain-consistent. The authors further propose **UI-SIMULATOR-GROW**, a targeted scaling strategy that selects mid-difficulty tasks and synthesizes diverse variants for efficient model improvement. Experiments on **WebArena** and **AndroidWorld** show that simulated data—especially with retrieval augmentation—matches or surpasses real-environment training, demonstrating the potential of LLM-based simulation as a scalable, data-efficient paradigm for digital agent training. The paper proposes a **well-structured and carefully engineered pipeline** for scalable digital agent training based on LLM-driven UI simulation. The design integrates multiple complementary components—guided rollouts, retrieval-augmented simulation, trajectory refinement, and targeted scaling—each with clear motivation and technical soundness. The **implementation quality and modular design** are impressive, and the **ablation studies are thorough and informative**, effectively illustrating the contribution of each component. While the overall performance gains are moderate, the system demonstrates solid empirical grounding and provides a valuable, reproducible framework for future research on LLM-based digital world simulation. 1. The pipeline relies on **a relatively weak base model (GPT-4o-mini)** as the core simulator. Many auxiliary modules (guided rollouts, trajectory rewriting, targeted scaling, etc.) appear necessary to compensate for its limited capability. However, if stronger LLMs (e.g., GPT-5-High, Claude-4.5-Sonnet) were used, much of this complexity might become unnecessary. The overall contribution would benefit from a clearer discussion of how the framework scales with model strength and whether the modular design remains essential under more capable simulators. 2. The **performance improvements are rather limited**, especially considering that current baselines on WebArena (≈50) and AndroidWorld (≈60+) are already relatively high. Models trained purely on the synthesized data achieve only around 10-point success rates, which questions the practical impact and realism of the generated data. The paper would be stronger with a deeper analysis of why synthetic trajectories fail to translate into higher downstream gains, or with qualitative evaluations showing where the simulated data still diverges from real interaction patterns. 1. The authors may consider **re-evaluating the pipeline with stronger base models** (e.g., GPT-5-High, Claude-4.5-Sonnet, or Llama-3-70B) to examine whether the proposed augmentation modules—such as guided rollouts and trajectory rewriting—remain necessary or still contribute meaningfully when the simulator itself is more capable. This would help clarify whether the framework’s complexity is inherently valuable or primarily compensatory for weaker models. 2. It would also be interesting to **test the generated data within more advanced agent frameworks**, such as **AgentLab**, or with stronger open-source agents (e.g., **Qwen2.5-32B**, which can reach ≈20 success rate on WebArena). Demonstrating improvements under these more competitive setups could better validate the real-world utility and transferability of the synthesized data. Heavily AI-edited
LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes UI-SIMULATOR, a scalable method that uses LLMs to simulate digital UI environments and generate diverse, high-quality training trajectories for agents—eliminating the need for costly real-world data collection. It also introduces UI-SIMULATOR-GROW, a targeted data scaling strategy that prioritizes the most useful tasks to improve data efficiency. Experiments on WebArena and AndroidWorld show that agents trained with this approach match or outperform those trained on real environments, even when using smaller models, demonstrating that LLMs can act as effective, scalable simulators for digital agent training. 1. **Novel Application of LLMs for Digital UI Simulation** The work creatively positions LLMs as world models capable of generating realistic and structured UI transitions, removing the need for traditional real-environment interaction. 2. **High Data Efficiency with Competitive Performance** The proposed methods deliver strong task success rates using significantly fewer training samples and smaller model sizes, showcasing impressive efficiency in both data and compute. 3. **Robust and Domain-Agnostic Generalization** The approach demonstrates strong robustness to perturbed environments and generalizes effectively across both web and mobile UI tasks, indicating broad applicability. 1. The paper writing is not sufficiently clear on some figures. 2. Experiments need to be further clarified. 3. Method claiming on diversity remains quentionable. 1. The method's main contribution is to treat LLM as similarotrs to collect data. How good are current LLMs treated as world models? Do authors treat this point for granted or have some prior investigation? 2. Figure 2 is too vague to read. I would suggest to highlight only the core part of the code (Instead of using tiny font to fit all codes into the fiture) 3. From my side, it is not appealing to say your trained 8B model is better than 70B model because your model is specially in-domain trained. Authos are proposing a data collection pipeline, will this pipeline fits different kinds of base models? The justification of this pipeline on more base models as well as teacher models are much more important yet lacking in current paper. 4. What is the performance of your teacher model (gpt-4o-mini) ? Will you forsee your trained model surpass the teacher model? 5. Can your collected data further boost the performance of your teacher model ? You can SFT your OpenAI teacher with your constructed data (FYI: https://platform.openai.com/docs/guides/supervised-fine-tuning). Fully human-written
LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper propose UI-SIMULATOR, a method for producing synthetic models of UI states and transitions. Additionally, the paper proposes UI-SIMULATOR-GROW, a method that uses UI-SIMULATOR combined with an instruction generator to collect synthetic trajectories to improve UI agents. The paper shows that these techniques improve agent performance on WebArena and AndroidWorld. * Using LLMs to generate synthetic environments for digital agents is an interesting area of exploration. This paper proposes an interesting perspective on using LLMs to generate world models of this form. Rather than generating the source code to simulate web or mobile apps, the LLM generates states and transitions. I think analyzing the effectiveness of this approach is quite interesting for the community. * The empirical results show improvements on WebArena and AndroidWorld relative to baselines. The core novel method that this paper explores is using LLMs to generate synthetic UI states and transitions -- a nice topic! The extension to use a given environment to synthesize instructions and trajectories has been explored by prior work, and the contributions there were less clear. I think the paper could be improved by focusing the presentation and analysis on the core contribution related to generating synthetic environments. * Presentation: Section 3 seems to describe the core contribution, but defers most of the detail to the appendix. The details of Figure 2 are not legible. I did not follow the summary of the combination of rule-based and LLM-generated trajectories without following the reference to Appendix C, and similarly did not follow how retrieval-augmentation is used (deferred to Appendix D). As this is the core contribution, I think it would be useful to present this aspect of the method more clearly. The other aspects of the overall UI-SIMULATOR-GROW recipe could have potentially adopted methods from prior work, e.g. https://arxiv.org/abs/2410.02907. In short, I thought section 3 should be expanded and sections 4 and 5 could be reduced if relying more on methods from prior work, to better focus the contribution. * Analysis: It was difficult to evaluate the core contribution (generating synthetic environments) given the complexity of the overall recipe. There is some discussion in section 7.1, but it seems the quantitive results are deferred to the appendix. How do synthetic environments of this form compare to simply asking the LLM to generate frontend environments with transitions described in the source code? I liked the brief mention of a comparison with exploring the test environments directly and the anecdote that synthetic databases can lack coverage over all search queries, but wanted more of this quantitative and qualitative analysis in the main paper. * Benchmarks: The overall results are so low (success rates <15%) on WebArena and AndroidWorld (much below SOTA) that I wonder if benchmarks with more dynamic range would be more useful. (It's fine to not be SOTA but the small effect sizes make the results harder to interpret) Summary: I thought the presentation and analysis in the main paper did not sufficiently focus on the core contribution of developing synthetic environments following the proposed method. I think it would be a more useful paper to focus on 1-2 key research questions rather than introduce a complex pipeline with many novel components that are not individually well justified. See weaknesses above. Fully human-written
LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces UI-SIMULATOR, a scalable trajectory synthesis paradigm that uses LLM-based digital world simulators to synthesize diverse UI trajectories. The authors further propose UI-SIMULATOR-GROW, which enables the continuous improvement of synthesized data. Experimental results on WebArena and AndroidWorld demonstrate the promise and robustness of this trajectory synthesis method. 1. The proposed scalable UI trajectory synthesis paradigm, UI-SIMULATOR, offers an interesting new direction for synthesizing data for GUI agents. 2. The UI-SIMULATOR-GROW extension effectively improves the model's utilization of synthetic data, ensuring stable performance gains. 1. Figure 2 is rendered poorly and has several clear issues. First, the screenshot in the middle is too small, even when I zoom the image to fill the entire screen, the text within it remains difficult to read. Second, the figure fails to clearly distinguish between the Retrieval-Free and Retrieval-Augmented Simulator methods. The latter should logically collect higher-quality trajectory data, yet the figure depicts identical actions and states for both simulators, failing to illustrate the benefit of using prior experience. Third, the figure lacks sufficient informational content. It is a missed opportunity to present the complete architecture of UI-SIMULATOR, leaving the paper without a central, overarching diagram to visually anchor the methodology described in the text. 2. Sections 3 and 4 could be merged. Section 3 provides a conceptual and design overview of the entire process of building a digital world simulator within UI-SIMULATOR, while Section 4 describes the specific data collection methodology. Both sections are fundamentally about explaining UI-SIMULATOR. Presenting them separately creates a distinct sense of fragmentation. Merging them would allow the two perspectives to complement each other, giving readers a more profound and cohesive understanding of the system. Furthermore, Section 3 does not seem to warrant its current length, as it does not involve extensive methodological design. 3. Some experimental results are puzzling. First, the reported result for Qwen-2.5-7B-Instruct on AndroidWorld is worse than Qwen-2-VL-7B, and is reported as zero. The official Qwen-2.5-7B-Instruct result on AW is 25.5, which I have also personally verified. It is unclear if this discrepancy stems from an issue with the authors' action space or coordinate conversion. Second, the scores reported for baselines like NNetNav, OS-Genesis, and GUIMid appear incorrect. On AndroidWorld, OS-Genesis scores around 17 and GUIMid around 21. On WebArena, NNetNav has a SR of 16.3, and the other two score around 10. I suspect the authors may have reproduced these results under their own setting; if so, the comparison would be unfair. Additionally, including GUIMid as a baseline is inappropriate, as it is not a data synthesis method. 4. Although the proposed method is presented as SOTA in the provided tables, its performance still lags significantly behind the latest work on GUI agents. For instance, the top scores on AndroidWorld have exceeded 70[1], yet UI-SIMULATOR's result is nearly 50 points lower than a model of a similar scale like GUI-Owl-7B (66.4)[1]. While I understand the limitations of synthetic data, such a large performance gap makes the proposed method less competitive. [1] Ye J, Zhang X, Xu H, et al. Mobile-agent-v3: Fundamental agents for gui automation[J]. arXiv preprint arXiv:2508.15144, 2025. 1. Line 214 mentions a "first stage" of the data collection process, but I could not find any mention of a "second stage" or "next stage" in the text of Section 4.1. Does it refer to the "step-wise guided rollout process and a final trajectory wrapper" mentioned in the last line of the section? The phrasing here is not very clear. 2. The "Step-wise guided rollout process" seems very similar to the Explorer[1] method: both propose a high-level task from an initial screen, have the agent interact with the environment to iteratively update goals, use a verifier, and summarize the task at the end. Could you please provide a detailed explanation of the differences between your method and Explorer, as well as the advantages of your approach? 3. LLM-based trajectory generation relies heavily on A11y Tree, which are often noisy. Furthermore, the A11y Trees for some websites or applications can be incomplete. I would like to ask how the authors address these two common problems. [1] Pahuja V, Lu Y, Rosset C, et al. Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents[J]. arXiv preprint arXiv:2502.11357, 2025. Fully human-written
Rethinking Transformer Inputs for Time-Series via Neural Temporal Embedding Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper revisits a fundamental but often overlooked design aspect of time-series Transformers — the input embedding stage. The authors propose **Neural Temporal Embedding (NTE)**, a simple yet effective alternative to conventional **value embedding + positional encoding** pipelines. NTE employs lightweight neural modules such as **Conv1D** and **LSTM** to process each variable’s time series individually and encode temporal dependencies directly, without relying on positional encodings. The key claim is that much of the Transformer’s inefficiency in time-series forecasting stems not from the attention mechanism itself, but from **suboptimal input representation**. Experiments on standard benchmarks (ETT, ECL, Weather) demonstrate that NTE-based Transformers achieve comparable or better performance than specialized architectures, such as Autoformer, PatchTST, and iTransformer, particularly on long-horizon forecasting tasks. The contribution is conceptually simple yet cleanly executed, offering an interesting perspective that suggests **input design** improvements can yield non-trivial gains without requiring architectural overhauls. 1. **Clear motivation and conceptual simplicity.** The paper makes a strong case that input embedding deserves more attention. The removal of positional encoding is a bold but well-motivated design choice. 2. **Empirical clarity.** The experimental setup is well organized, with fair comparisons to established baselines. The results convincingly show that input modifications alone can lead to performance gains. 3. **Strong writing and accessibility.** The narrative is concise and approachable — the authors explain their ideas clearly without unnecessary jargon. 4. **Relevance to the ICLR community.** The study fits the current trend of revisiting Transformer assumptions for efficiency and simplicity. It may inspire further work on lightweight input layers. 5. **Practical insights.** The findings suggest that some of the architectural “complexity arms race” in time-series forecasting might be avoidable, which is refreshing. 1. **Limited novelty at the algorithmic level.** NTE combines well-known neural components (Conv1D and LSTM) in a new configuration. While the empirical insight is valuable, the conceptual innovation is modest. 2. **Insufficient theoretical explanation.** The paper would benefit from a deeper discussion of *why* NTE works — e.g., whether the learned temporal encoding approximates sinusoidal patterns or adapts to variable frequencies. 3. **Lack of broader baselines.** The study compares mainly against mainstream Transformer variants. Including recent input-focused or embedding-free models (e.g., TSMixer, FreTS) would help strengthen the claim of generality. 4. **Ablation analysis could go further.** It would be useful to isolate the impact of each NTE component (Conv1D vs LSTM) and analyze whether NTE benefits small-data or irregularly sampled settings differently. 5. **Unclear scalability implications.** Since NTE introduces extra pre-processing per variable, a brief discussion of runtime or memory overhead would make the work more complete. 1. How sensitive is the model to the choice of neural encoder (e.g., Conv1D vs GRU)? 2. Does NTE preserve translation invariance in temporal shifts, or does the neural encoder introduce biases? 3. Could the authors test whether NTE generalizes to irregular or non-uniform sampling rates? 4. Are there cases where positional encodings outperform NTE (e.g., highly periodic signals)? 5. How does the per-variable processing scale when the number of dimensions exceeds 100? Fully AI-generated
Rethinking Transformer Inputs for Time-Series via Neural Temporal Embedding Soundness: 2: fair Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper proposes Neural Temporal Embedding (NTE), an embedding mechanism that effectively internalizes temporal dependencies without relying on either value embedding or positional encoding. The authors claim that a learnable NTE layer (using FC, Conv1D, LSTM, etc.) can process each variable’s time series and directly learn temporal patterns. Experimental results show that NTE-based models match or outperform state-of-the-art Transformer variants, particularly maintaining stable accuracy in long-horizon forecasting. 1. The motivation of the paper is valuable, which rethinks the input stack of time-series Transformers and presents NTE as a unified, learnable temporal layer that can replace value embedding and explicit positional information. 2. The paper includes ablations over PE variants, various NTE module types (FC, LSTM, Conv1D, Dilated, Bi-DilatedConv1D), bidirectional dilated Conv1D structures, and analyses of representation similarity (CKA) and entropy. 1. While the motivation for the proposed method is well-founded, the experimental results reveal notable shortcomings. Specifically, Table 1 shows that introducing NTE leads to significant performance degradation for the Vanilla Transformer on certain datasets, such as ETTh1 and ECL. This raises concerns about the robustness of NTE when combined with standard Transformer architectures and suggests that its benefits may be limited to specific backbone designs. 2. The paper does not provide sufficient theoretical grounding to explain why using modules like Conv1D or LSTM within NTE leads to better results compared to the original value embedding. While the empirical results support the effectiveness of these modules, a theoretical analysis of how these architectures capture temporal dependencies more effectively would strengthen the contribution and improve the general interpretability of NTE design. 3. The ablation studies, while extensive, could be further expanded to explore the impact of kernel size in Conv1D-based NTE modules. The paper primarily reports results with fixed kernel sizes (e.g., 3 or 5). However, it is unclear whether these choices are optimal for capturing temporal dependencies in time series data, which often vary significantly in terms of patterns, seasonality, and granularity. 1. Could the authors clarify what "Future-Dilated" means in Figure 3 and how the future embedding is constructed? 2. RoPE is a commonly used positional embedding method in Transformer-based models, particularly for tasks involving sequential data. However, it is not included in the experiments for comparison. Could the authors provide insights into how NTE compares to RoPE in terms of performance and effectiveness for time series forecasting? Heavily AI-edited
Rethinking Transformer Inputs for Time-Series via Neural Temporal Embedding Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose a novel technique for the input embedding / transformation for time series foundation models (TSFM). In particular, they propose to combine the position embedding with the input embedding using different neural networks, which is a timely and interesting research area. The problem that the authors work on is timely and a critical problem for any TSFM. So far, the default mode has been to simply embed the inputs either directly with a linear layer, or after applying a patching technique. The authors unify these two aspects with their proposed Neural Temporal Embedding, which is a simple neural network. In my view, this would be a novel aspect. Despite the novelty, the idea and the paper has several critical flaws: First, it is unclear until almost the very end of the paper on page 7, what the NTE is really doing. Up until this point the authors only mention that the NTE can be a 1D convolution, a LSTM, a fully connected network and several others, but they don't provide any concrete examples. Then, even though the authors provide this simple description of the two Conv1D layers, it is still unclear what the precise architecture of the NTE in Table 1 is. Is it the two conv layers, or is it something else? Only from the text, the reader can infer that the results in Table 1 stems from the two conv 1D layers. However, the authors state that "the sequential bias introduced by NTE is insufficient to compensate for the order-agnostic nature of the standard Transformer". However, they do not elaborate whether any other structure of the NTE would improve that. There is some ablation study in Table 2 that, which the reader can appreciate, but then this table is also confusing. Firstly, because the standard in the literature is to use the learnable PE (which should also be the reference point in Table 1), and then with this more realistic comparison point, none of the NTE architecture really seem to make a significant difference. Finally, since the paper puts so much emphasis on the NTE architecture as a novelty, detailed investigations of it are absent. For example what is the associated computational cost with the different variants, what are the features that those architectures provide? Are there specific cases in which to use one architecture of the NTE over the other? Overall, the study spends a lot of time explaining and reiterating on the basics for the NTE, but doesn't dive in to the essence of it. See the text above, there are several open questions which should be addressed:\ Why to choose the NTE over the Learnable PE if performance is not better?\ What is the associated computational cost with the different variants?\ What about patching techniques paired with the NTE?\ Wouldn't a recurrent network break the parallelism and thus limit the efficiency of an attention backbone?\ What are the features that those architectures provide?\ Are there specific cases in which to use one architecture of the NTE over the other? Etc.\ Fully human-written
Rethinking Transformer Inputs for Time-Series via Neural Temporal Embedding Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposed a mechanism, namely Neural Temporal Embedding (NTE), to replace the Positional Encoding (PE) in transformer-based models. Neural networks like FC, LSTM, and Conv1D are used to build the NTE module. - The proposed mechanism achieves improvement when applied to backbones like iTransformer and PatchTST, especially on ETTh1 and ECL. - The proposed NTE, as a plug-and-play module, can easily work with different backbones without changing the downstream structure. - Multiple experiments are conducted to verify the effectiveness and efficiency of the proposed mechanism. - Structure issues: - Although many experiments are conducted, only a few of the results are displayed in the main body of the paper. - In Sec.4, the paragraph `Bi-directional Dilated Convolutional Embedding' seems to have little relevance to the experiment. (Should it be in Sec.3 or Appendix?) - Motivation issues: - Although the NTE is claims to simplify the input, the results in Table 6, 7, 13, and 14, indicate that the NTE may add to the computation burden. - Experiment issues: - The NTE shows poor performance on Exchange, raising concerns that NTE may not be competent for long-horizon forecasting. - It is recommended to conduct multiple experiments and calculate the standard deviation to verify the stability. - Others: - As mentioned in the paper, the NTE can only be conducted on time-series forecasting tasks. Please see weaknesses. Fully human-written
On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents Reward Filtered Sequential best of n (RF-SBON) for sequential test time compute. RF-SBON sequentially appends n answers within its context window, answers with rewards exceeding a threshold are only appended in this procedure. Under the assumption that the pre-training data comes from a mixture of reference policies and that a single trajectory $(x_t, a_1,..,a_N)$ comes from the same mixture component $\pi^{\tau}_{ref}$, and that given an $x$ that reference component is distinguishable from other components, authors shows that for BoN there exits a reward such that BON is suboptimal in terms of regret to an optimal policy $\pi^*$, while for sequential Best of n there are regimes where sequential BoN improves on BoN. Authors show empirically that sequential BoN improves on BoN in terms of performance on Math500, GPQA diamond, AIME 24 using models in the Qwen family and a PRM. The paper studies an interesting question on sequential test time compute generalizing Huang et al results for Best of N. The paper makes a lot of assumptions and derive sensible results under these assumptions. * I would like to challenge the main assumptions in the paper relating to the mixture assumption. Reading through the appendix, models are instructed as follows: "The previous solution(s) may contain errors. Before solving, briefly critique the previous attempt(s) in 2 to 3 bullet points.Then provide a COMPLETE and CONCISE corrected solution from scratch that addresses those issues. End with exactly one line containing the final answer" is it really that the trajectory is singling out a single component of the model and giving it some flexibility to win on Best of n, or the self correction / reflection aspect that makes sequential BoN work. As an ablation running sequential BoN without this instruction would it lead to same improvements? * Are any of your assumption reflecting this ability of the model to judge its own result? along the trajectory is the reward empirically monotonically increasing ? * If the model is a base model and not an instruct or thinking model , would sequential BoN still work ? imagine the model is a bad judge, I believe that this will not allow the model to judge its own generation and improve upon. I believe an inherent assumption maybe implied by yours, should be on the ability of the model to judge and recover. * As the context length is increasing in n, the paper does not baseline as it advertises in the introduction : Is Best-of-N the best one can hope for under a fixed test-time budget? are the comparisons to BON fair in terms of test time compute? can you give wall clock for SeqBON, BoN, filtered sequential bon and rewind and repeat ? * What n (maximum of appended answers) is used in the experiments ? * When you evaluate with Pass@N do you mean N for sequential BON and filtered sequential BON the maximum number of appended answers , or you mean N samples from RF-SBON for a given n ? * For different gamma used , how many filtered answers end up in the context of LLM? Minor suggestion: * Avoid using SBON in RF-SBON as SBON is known as Soft Best of N in many works, you can perhaps use RF-SeqBoN Fully human-written
On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. In this work, the authors analyze existing Best-of-N approaches to inference time scaling and note that under certain data/LLM settings, these methods fail to achieve low regret. They then propose a modification to a sequential variant of BoN known as Sequential Best-of-N, and show under certain conditions it can achieve better regret than BoN. They then implement an algorithm via reward-filtering, where they only feed back good (high reward) completions to the LLM for the next sample in the classical sequential BoN scheme, to encourage exploration of near-optimal solutions. * The argument that in sequential settings we can leverage additional information is interesting and compelling. * The authors propose a realistic mixture of topics setting where BoN can fail to achieve low regret, and show that their proposed algorithm can achieve low regret. * The empirical results demonstrate the efficacy of reward-filtered sequential BoN. * SBoN is already terminology used to refer to Soft Best-of-N [1], and is already referenced as such in the community [2,3]. I would recommend using a different abbreviation and rename Algorithm 1 such as SeqBoN . * Also, Algorithm 1 is very vague. What is "Update h based on $x,a_1,...a_{i-1}$"? Can you describe a specific function or algorithm? For instance, what if I update $h$ to be $x$? Then this is just BoN, but falls under your description of SBoN. I'm not trying to make you say explicitly "update h based on x and not just x" but I think a little more clarity would be helpful here. Theoretical Results: * I'm a little confused by Thm 4.3: It is unclear to me what about the sequential nature of the algorithm leads to a different result. In fact, the proof for this and Prop 4.2 look the same, and the only difference is now looking at this mixture of topics model. Then, the result reads to me as that under certain models, the sample complexity bounds are in fact better: In 4.2, $n \leq M_{LLM}$ means we have at least $\epsilon$ regret, and here by 4.3 we have that $M_{\tau} \leq M_{LLM}$ and therefore if $n \leq M_{\tau} \leq M_{LLM}$ we have $\epsilon$ regret which is strictly easier to avoid. * Following on this, isn't the point to say that BoN fails under a mixture of policies thus motivating a modified Sequential BoN strategy? Why do you want to prove these converse results for sequential policies? * For Thm 5.1, if you let your SBoN algorithm be $h \leftarrow x$ (so just the BoN algorithm as I noted above) then you just get a tighter bound. Can you comment on this? Your framing of the result and your remark on it makes it seem like something about the sequential nature is unlocking this better result, yet this example implies otherwise. If you can build some intuition or explain it in the sketch I would appreciate that. Overall, reading this paper gave me the feeling (at times) that there was a lot of theory just to say something intuitive: we can incorporate additional information in a smart way during sequential BoN to get better results. The deep theoretical treatment came at the cost of empirical results validating the method and exploring its potential due to page restrictions. Furthermore, many of the results especially in the beginning were just restatements of Huang et al., or minor modifications to incorporate the mixture of policies approach. I think these results could be distilled which would make the reading process better. I would love to hear more from you and am willing to improve my score. Notation * On line 791, what is A.1 and A.2? If they are section headers, how are you referencing A.2 before getting there? I think maybe you swapped these because you also in A.1 reference the introduced reward function as being introduced during the proof of theorem 4.3 (which is A.2) * Line 170, you have a trailing bullet point you should try to get on the previous page if possible. [1] Verdun, C. M., Oesterling, A., Lakkaraju, H., & Calmon, F. P. (2025). Soft Best-of-n Sampling for Model Alignment. ISIT 2025. [2] Geuter, J., Mroueh, Y., & Alvarez-Melis, D. (2025). Guided Speculative Inference for Efficient Test-Time Alignment of LLMs. [3] Aminian, G., Shenfeld, I., Asadi, A. R., Beirami, A., & Mroueh, Y. (2025). Best-of-n through the smoothing lens: Kl divergence and regret analysis. * You frame Huang et al.'s work as a sequential BoN algorithm on line 136 but I was under the impression that the authors proposed a parallel BoN algorithm. Can you clarify this? Fully human-written
On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a new algorithm, RF-SBoN (see Algorithms 2 and 3), that adapts sequential best-of-n (BoN) sampling by only including responses that fall above a certain reward threshold. Furthermore, the authors develop theory for evaluating test-time compute (TTC) methods under a certain modeling assumption on the pretraining corpus of an LLM. **I lean towards rejection of the paper.** As outlined below, the theoretical foundation hinges on unrealistic assumptions, and the algorithmic novelty seems to consist of only including responses into the context whose reward lies above a certain threshold, which is a marginal contribution. Under certain assumptions on the LLM training data, the authors evaluate different TTC methods and e.g. show that vanilla BoN is suboptimal. The experimental results show that the proposed methods outperforms vanilla BoN and a simple sequential baseline in terms of accuracy over $N$. ### Presentation I find the paper somewhat difficult to follow. I'm not sure what's the best way to present things, but I feel like the paper could be improved by a more clear presentation, including a more precise problem statement, more clear definitions, and goals throughout the paper/sections. (E.g., a more clear distinction between "showing that BoN under these assumptions on the pretraining data is suboptimal", and "derive a new TTC method for sequential TTC".) ### Theory The assumptions made on the pretraining data seem extremely unrealistic: it seems the authors are assuming all tokens in a pretraining sample (except a token prefix) are sampled independently instead of autoregressively (line 161). They briefly mention this is in line with previous work, such as [1]. After skimming [1], it seems to me like this work actually assumes autoregressive dependency though (compare the last equation on page 5 therein). If there is a reasonable explanation for this assumption, the authors should elaborate more. The main theoretical claim here -- Theorem 4.3 -- depends on this assumption on the pretraining dataset, so a more careful evaluation of the assumption would be helpful. Much of the remaining theory -- such as the assumption that for each prompt at test time, there exists a corresponding "optimal reference policy", which seems unrealistic in practice -- are also derived from these assumptions. ### Algorithm The algorithmic novelty seems limited. Essentially, the authors are suggesting to do sequential TTC, but only append responses to the current context whose reward exceeds a certain threshold (possibly including a "burn-in" stage, where the context is not changed until the hidden state reaches a certain length $m$). ### Experiments It would improve the experimental results to include error bars. Results on more than just three datasets would be helpful to better evaluate RF-SBoN. A major difference between parallel BoN and sequential TTC methods is that parallel BoN can compute all responses in parallel, hence is much more efficient in terms of latency. However, in the experiments, the authors only compare to parallel BoN in terms of $N$. I believe a comparison in terms of latency is crucial for a fair comparison of the methods. [1] Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, and Zhaoran Wang. What and How Does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization. arXiv preprint arXiv:2305.19420, 2023. - the presentation of responses in Appendix C.3 could be improved: the orange is not very readable; would be nice to put each prompt-response pair in a box; I assume the "Error: ..." in the solution of the last problem is part of the generated solution, but it would help to clarify this; since each algorithm solves a different prompt, this doesn't really provide any information on the direct comparison between the methods. Instead, solving the same prompt with different methods might be more insightful. - in the notation, the difference between $\mathcal{X}$, $\mathcal{A}$, and $\mathcal{V}^*$ is not made explicit (in particular, I would assume they're all mathematically identical? Is the action space supposed to consist of *complete* responses, or only partial responses/tokens?) Furthermore, shouldn't the policy $\pi$ be conditioned on the prompt, i.e. $\pi(\cdot|x)$ instead of $\pi(\cdot)$? - Assumption 3.1 seems unrealistic in practice. Could the authors elaborate on this? Is this assumption commonly made, or do they have any justification for it? - the definition of the family of reference distributions (line 156) is not very clean. What is $\mathcal{T}_{\text{ref}}$, in particular, is it countable/finite? What's the prior distribution? - the axis labels in the plots could be improved: I assume on the y axis (e.g. Figure 1), the authors plot pass@N accuracy, and on the x axis, they plot N (however, the x axis label says "pass@N") - can the authors provide experimental results that compare the proposed algorithm to parallel BoN in terms of *latency* instead of $N$? This seems to be a much more fair comparison. Fully human-written
On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper investigates the fundamental limits of parallel Test-Time Compute (TTC) methods like Best-of-N. By assuming that pretrained LLMs are a "mixture-of-reference-policies", the authors provide a clear proof that parallel TTC is inherently suboptimal and establish a tighter theoretical performance limit that can only be reached by sequential methods. This holds only when there is a unique response that attains perfect reward (such as a final numerical answer in math). The paper proposes Reward-Filtered Sequential Best-of-N, a simple sequential algorithm. The method generates responses iteratively and selectively appends only high-reward outputs to the context for subsequent generations. Both the theoretical analysis and empirical results show that RF-SBoN outperforms parallel BoN and non-filtered sequential baselines, demonstrating a more effective use of the test-time compute budget. The authors make a good contribution to our understanding of inference-time algorithms within their idealized theoretical setup. It helps to motivate a practical algorithm. RF-SBoN is compelling in its simplicity. It is intuitive, easy to implement, and directly motivated by the paper's theoretical goals. I foresee this potentially being a popular method if it gains traction and end-users validate its performance. The experiments are well-structured, comparing RF-SBoN against the appropriate baselines (BoN and a non-filtered sequential approach). The use of multiple models, diverse benchmarks, and difficulty-stratified analysis provides solid evidence for the robustness and effectiveness of the proposed method. - Improved "budget efficiency" is framed in terms of the total number of model calls. However, parallel, batched decoding is very efficient on our hardware. This makes vanilla BoN very appealing due to its *fixed* number of N samples generated in parallel. In contrast, RF-SBoN requires unknown N sequential steps, making it potentially slower and less GPU-efficient. Moreover, you are adding more input tokens because of appending and maintaining the "good" responses within the context. - The RF-SBoN procedure actively conditions the LLM's future generations on outputs that the reward model scores highly. This creates a feedback loop that is highly vulnerable to reward model hacking. If the reward model has systematic flaws (e.g., a preference for verbosity over correctness), RF-SBoN will amplify this bias by repeatedly showing the model examples of it. This conditioning also reduces the diversity of the generation history, which could be detrimental, especially if the base LLM or reward model is not already very strong. I would encourages authors to discuss reward hacking in inference-time settings (qualitatively with respect to related work or quantitatively in their paper). - This is less of a concern. However, the theoretical results rely on a "mixture-of-reference-policies" model of the pretrained LLM. While this is a useful analytical tool, it is a strong structural assumption. The paper's theoretical claims are a direct consequence of this modeling choice. They also assume the existence of a unique best response. For example, I don't see this applying for any real setup (outside math or Q&A) where you may have many equally good responses. - Given that RF-SBoN is inherently sequential, how do you see its practical applicability in latency-sensitive applications? - Your method refines the generation history based on reward scores. Did you notice cases where the generation process collapsed into a deterministic answer or into a "reward hacking" echo chamber? I feel that may be the case with PureSeq because the empirical performance of PureSeq is usually as good or worse than BoN. This is surprising unless we account for the two mentioned factors. Perhaps, to make the method more robust to reward model exploitation, you could filter on a lower confidence bound of the reward. However, that would add complexity. - Rewards are generally not calibrated across contexts, making it difficult to compare reward scores. Even within a prompt, Best of n only cares about the rank induced by the responses, rather than the scores themselves. So I feel extra care should be taken to make sure you can compare the evolving rewards to a fixed threshold. This may be valid in your setup since the reward for a new answer, r(a_i, x), is always evaluated against the static initial prompt x. Why not keep the top-k responses? - The presence of plateaus and dips with AIME makes me question how you estimated the accuracy. Is there a way we can add bootstrap CIs to the plots? Is this behavior observed in practice with AIME? Fully human-written
Automated Architecture Synthesis for Arbitrarily Structured Neural Networks Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a novel framework for automatically constructing neural network architectures with arbitrary graph structures. Instead of relying on predefined Directed Acyclic Graph (DAG) or tree-like topologies, the model begins with a complete graph and adaptively learns and optimizes the connectivity among nodes during training. The authors introduce a synchronous computation mechanism for both forward and backward propagation, as well as a Neural Module (NM) Regularization technique that organizes nodes into balanced subgraphs to enhance efficiency and generalization. 1. The paper challenges the traditional DAG-based view of neural architectures by proposing a general graph-based structure that can theoretically encompass existing designs as special cases. This idea has clear conceptual novelty and could inspire new directions in architectural design. 2. The inclusion of algorithmic pseudocode, complexity analysis, and proof sketches contributes to the overall completeness of the work. The NM regularization method is interesting, as it allows parallel processing. 3. Experiments across four datasets show performance improvements over traditional NN baselines and comparable methods. The framework also appears to enable more efficient computation via modular parallelization. 1. It remains unclear whether Neural Modules (NMs) are individual components within a larger structure or whether they represent the entire architecture. A clear, high-level topology diagram of the complete system is missing. Furthermore, it is unclear how the model handles different data modalities (e.g., images, graphs, sequences). 2. The experimental setup lacks sufficient detail. Baselines such as NN, DEQ, DAG, and OPTNET are referenced but not fully specified; key implementation parameters, dataset preprocessing steps, and hyperparameter settings are omitted. The evaluation protocol and metric definitions need clarification to ensure reproducibility. 3. Table 1 contains inconsistent naming (e.g., "NMs", "NMsL2", "NMsNM", "NMs&L1") without clear explanations. Figures (especially Figure 2) are visually cluttered, making it difficult to interpret results. These issues reduce the paper’s readability and empirical credibility. 5. While the authors claim efficiency improvements via NM regularization and parallelization, concrete runtime profiling and scalability analyses (e.g., GPU utilization, training time comparisons) are not provided. 6. The writing style sometimes overstates claims (e.g., "unlock the full potential of neural networks"), which can reduce perceived objectivity. Mathematical equations, while thorough, are dense and not always supported by intuitive explanations or visualization. 1. Please clarify the structural role of Neural Modules. Are they subcomponents of a larger graph or equivalent to the entire network topology? How does this framework adapt to various input modalities (e.g., graphs, sequences)? 2. Please provide full implementation details for the baseline methods, dataset splits, and evaluation metrics. 3. Explain the meaning of result notations such as "NMsL2", "NMsNM", "NMs&L1", etc. 4. Justify the choice of NN, DEQ, DAG, and OPTNET as baselines. Are these the state-of-the-art for the target tasks? 5. Include visual diagrams of the overall network topology and module formation during training for clarity. Fully AI-generated
Automated Architecture Synthesis for Arbitrarily Structured Neural Networks Soundness: 3: good Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a novel framework that departs from traditional tree-like or DAG-based artificial neural network architectures, introducing a dynamic graph-based system inspired by biological neural networks. The authors enable neural units to form flexible, arbitrary connections during training, organizing them into "Neural Modules" that allow synchronous communication and parallel computation **Novel Architecture Design Beyond DAG Constraints** The paper introduces a biologically inspired framework that allows neural networks to autonomously learn arbitrary graph structures during training, overcoming the inherent limitations of traditional DAG-based architectures. This enables more flexible communication between neural units and enhances the model’s representational capacity. See questions. 1. The writing of this paper needs further improved before consideration of acceptance. For example, the Punctuation Mistake in abstract, un-unified reference format. 2. The paper lacks theoretical justification and rigorous comparative analysis. Why your method is more impressive than current DAG baselines? A theoretical justification is appreciated. 3. Authors study the method with toy datasets. How it will scale to larger ones such as ImageNet? Moderately AI-edited
Automated Architecture Synthesis for Arbitrarily Structured Neural Networks Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a new approach which is different from layered, DAG-based networks. It starts from a complete directed graph whose edges are pruned on-the-fly, and solves the resulting cyclic graph as a system of non-linear equations during both forward and backward passes. The authors proposed the “Neural Modules” (strongly-connected components) together with a repulsion-based regulariser that keeps the modules small and balanced. Inference is therefore the numerical solution of a fixed-point system; training updates the coefficients of that system. The authors claim that this removes the structural bias of trees/DAGs, allows arbitrary connectivity, and can be parallelised module-wise. Experiments on four medium-scale tabular/graph tasks show lower error rates than DEQ, OptNet, DARTS or standard MLPs of comparable node budget, and a ~10× speed-up when modules are processed in parallel on GPU. (1) The authors answer with a fully-differentiable, cyclic-graph neural model whose forward pass is literally a Newton solver and whose topology is pruned on-line by a repulsion prior that encourages strongly-connected components to fragment into GPU-friendly micro-clusters. (2) The mathematical framing is good and clean: forward propagation = root-finding on a nonlinear system, backward propagation = solution of the dual linear system, training = gradient updates on the coefficients of that system. 1) I have a concern about the memory usage: the model stores the dense adjacency matrix and the dense Newton correction explicitly—an O(p²) memory footprint that already exhausts 40 GB of VRAM at p≈70K, i.e. three orders of magnitude smaller than a single transformer layer. Is it an efficient model compared to existing ones. 2) The proposed regulariser is a repulsion term on the raw edge weights; there is no spectral penalty, no curvature constraint, and no mechanism to prevent the graph from becoming a single strongly-connected component, at which point the promised parallel decomposition collapses and you are back to solving a monolithic 70K×70K linear system every mini-batch. 3) I don't deny the fact that the idea is novel, but given the scale of experiments, the submission remains an elegant thought experiment (with some initial attempts on limited datasets) rather than a credible path to the next generation of foundation models. 1) DEQ was designed for weight-tied infinite-depth models, whereas the proposed new work here uses unique edge weights; a fair comparison would be a finite-depth DEQ with learned layer-wise weights, which the authors never run. Do the authors agree? 2) The experimental canvas is restricted to four small tabular datasets whose input dimensionality is three orders of magnitude below the visual or linguistic entropy that modern architectures are expected to model; on such low-entropy data any sufficiently expressive inductive bias (including a cyclic graph) can look superior to a vanilla MLP, but the results may not be that good with curse-of-dimensionality that accompanies pixels or sub-words. Fully AI-generated
Automated Architecture Synthesis for Arbitrarily Structured Neural Networks Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces a novel framework for designing directionless neural network architectures, inspired by the complex connectivity of biological neural systems. The authors propose a method to automatically learn arbitrary graph structures during training, organized into "Neural Modules" (NMs) to facilitate unrestricted communication between nodes. While the approach showcases out-of-the-box thinking and a refreshing departure from traditional tree-like or DAG structures, I have significant concerns about its theoretical grounding, computational feasibility, and experimental rigor. The paper’s theoretical claims (e.g., Theorems 3.1 and 3.2) feel oversold, and the backward gradient computation in non-DAG structures is unclear and potentially problematic. The computational overhead of the proposed approach is prohibitive, scaling poorly with large graphs due to its dependency on the number of edges. Additionally, the experimental evaluation is limited, ignoring state-of-the-art baselines like Graph Neural Networks (GNNs) and Transformers, which are standard for graph-structured data. The performance gains over traditional approaches appear negligible, and the paper is poorly written, lacking clarity in critical components. While the idea is innovative, the execution and validation fall short of expectations for a rigorous contribution. - The authors propose a truly novel perspective on neural network architectures, moving beyond traditional DAG structures to embrace directionless, biologically inspired graphs. This represents a bold and creative departure from conventional designs. - The source code is available, which is commendable for reproducibility and further research. - The theoretical foundation is shaky. Theorems 3.1 and 3.2 are oversold: one is a trivial application of the universal approximation theorem, and the other is a basic SGD formula. The authors need to rephrase or remove these claims to avoid misleading readers. - The backward gradient computation is unclear and problematic. The role of the operator $H_j$ is not well-defined and seems arbitrarily introduced. Traditional backpropagation relies on a notion of order (e.g., layer-wise gradients), which is lost in arbitrary or cyclic graphs. The authors do not address how gradients are computed in such structures, raising doubts about the soundness of the approach. - The proposed approach has a prohibitive computational complexity $O(N+E+s\cdot NM^2)$, which scales poorly with large graphs due to the dependency on the number of edges E. This makes it impractical for modern, large-scale NNs with billions of parameters and edges. - The experimental evaluation is limited and incomplete. The authors ignore state-of-the-art baselines like GNNs and Transformers, which are standard for graph-structured tasks (e.g., Facebook datasets). The performance gains over traditional approaches are negligible, undermining the claimed advantages. The evaluation lacks diversity in datasets and tasks, making it difficult to assess the method’s generalizability. - The paper is poorly written, with missing details and confusing explanations. Key components, such as the synchronization method and Neural Module formation, are difficult to follow. - The authors use Tarjan’s algorithm to identify strongly connected components, but they do not justify this choice or explore alternatives. It is unclear how different algorithms might affect performance or scalability. - Theorems 3.1 and 3.2 seem oversimplified and misleading. Could the authors rephrase or remove these claims to better reflect their actual contributions? - How does the backward gradient computation work in a non-DAG structure? Without a clear notion of order, how are gradients propagated, and how is convergence guaranteed? - In cyclic or fully connected graphs, establishing an order for backpropagation is non-trivial. How do the authors handle this, and what guarantees can they provide for gradient stability and correctness? - The complexity $O(N+E+s\cdot NM^2)$ suggests the approach does not scale to large graphs. How do the authors envision deploying this method in real-world, large-scale NNs? - Why did the authors choose Tarjan’s algorithm for identifying strongly connected components? Were alternative algorithms considered, and how might they impact performance or scalability? - The authors ignored state-of-the-art baselines like GNNs and Transformers. Why were these omitted, and how does the proposed method compare to them on graph-structured tasks (e.g., Facebook datasets)? - Could the proposed framework be integrated with existing architectures (e.g., GNNs, Transformers) to leverage their strengths while addressing the limitations of traditional DAG structures? - Are there specific applications or domains where this approach might be particularly effective, despite its current limitations? Fully AI-generated
WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces WaterSearch, a search-based LLM watermarking framework that generates multiple parallel candidates instead of a single one and then selects the one that best preserves coherence with the original text to enhance quality and detectability. 1. The idea of using multiple candidates is clear and effective. 2. The paper also includes a solid theoretical analysis of the proposed method. 3. Evaluations are comprehensive, on various models and tasks. 1. The overhead of this method seems to be significant. 2. A main weakness of the paper is the limited comparison against recent works. The paper mainly compared the original KGW method. More recent and stronger baselines, including both token-level and semantic-level watermarking methods, should be compared for a more comprehensive evaluation. 3. Robustness evaluation is also limited. Stronger modification and paraphrasing attackers, beyond deletion, insertion, and synonym substitution, need to be presented to show the superiority of the proposed method against SOTA. See weaknesses. Fully human-written
WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models Soundness: 3: good Presentation: 4: excellent Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces WaterSearch, a search-based watermarking framework that aims to improve the trade-off between watermark detectability and text quality in large language models (LLMs). Instead of modifying logits during generation as in the standard KGW framework, WaterSearch performs chunk-level parallel generation: it generates multiple candidate continuations (some watermarked, one unwatermarked) and selects the one that maximizes a joint score balancing semantic similarity to the unwatermarked text and detectability based on green-list token frequency. A chi-square–based detection procedure is proposed to test statistical significance across chunks. Experiments across 3 LLMs (Llama-2, Qwen-2.5, InternLM) and 10 datasets show consistent improvements in both generation quality and detection robustness, particularly under low-entropy and short-text conditions. * Simple idea and framework: WaterSearch can be applied on top of existing KGW-style watermarking schemes with minimal modification. * The method improves performance across all evaluated datasets, including difficult cases such as short-text or low-entropy settings, where KGW tends to fail. * The paper uses WaterBench and additional benchmarks (e.g., RepoBench-P) and shows gains across multiple model families. * Figures and algorithm descriptions make the approach easy to follow; the writing is concise and readable. * Incremental conceptual novelty: The idea to generate several watermarked candidates and pick the best is intuitive, but very closely resembles beam search or rejection sampling. The contribution feels more engineering-oriented than conceptual, especially given that most of the theoretical development restates expected properties of the existing KGW trade-off. * Computational inefficiency: WaterSearch performs parallel or beam-style generation of multiple watermarked candidates per chunk and selects the best one, which intuitively incurs substantial wall-time cost. While Table 4 discusses space complexity, runtime overhead or throughput (tokens/s) is not reported. Without this, it is hard to assess practical efficiency, but based on the runtime complexity reported in the paper, a ~5x slowdown in generation is fairly substantial and reduces the practical utility of the method. * What is the actual computational overhead relative to vanilla KGW? Reporting wall-clock time or tokens/s for each configuration would clarify practical feasibility. * How sensitive is the approach to the number of parallel candidates k? Does increasing k yield linear improvement in detectability, or diminishing returns? * Could the same results be achieved by post-hoc reranking or constrained decoding (e.g., using logits rather than full re-generation)? * Since quality is evaluated only relative to the unwatermarked model’s continuation, could semantic drift still occur if that baseline itself is low-quality or inconsistent? * The claim that KGW “maintains text quality well from the perspective of perplexity or LLM-as-judge” is contradicted by results in WaterBench (Tu et al., 2024) and New Evaluation Metrics Capture Quality Degradation due to LLM Watermarking (Singh et al., 2024), which show measurable degradations in both perplexity and subjective fluency for KGW. The discussion should acknowledge these findings. Fully AI-generated
WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. By constraining semantic consistency with the original sentence as much as possible, generating multiple outputs with different random seeds, and combining these outputs, this paper identified the optimal balance between semantic integrity and watermarking. 1. The design of alpha is rigorous overall, with theoretical proof of its effectiveness and ablation studies demonstrating optimal alpha values, reasonably extending the KGW method. 2. Effectively designed time complexity to ensure computational resources increase only moderately. 3. Experiments demonstrate that sufficiently large differences between random seeds enable multiple outputs of the watermark to combine into text with semantics closer to the original meaning, including the validity of other hyperparameters such as K. 1. Missing Visualization examples of all results, just an NBA example 2. Scoring q is a linear add-up of semantic similarity towards the original output, and watermarking quality, which is very straight forward, but can be questioned that if the linear add-up is effective or not, more theoritical supports are needed 3. Strategy of picking different random seed is still not clear enough for me How will the method perform on bigger and SOTA language models? Fully human-written
WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes WaterSearch, a novel search-based framework for watermarking Large Language Model (LLM) outputs. The core idea is to move beyond token-level watermark embedding by generating multiple candidate text chunks in parallel and selecting the one that best balances text quality (fidelity to the original, unwatermarked model distribution) and watermark detectability (statistical strength of the watermark signal). The method is presented as a solution to the fundamental trade-off between these two objectives in existing watermarking schemes like KGW. The authors also introduce a new detection algorithm based on hypothesis testing and provide extensive experimental results showing significant improvements over strong baselines. 1. The shift from a purely token-level manipulation to a chunk-level search-and-select paradigm is a significant and compelling contribution. It elegantly reframes the watermarking problem as a multi-criteria optimization, which directly addresses a well-known limitation of existing methods. 2. The paper provides a theoretical analysis (Theorem 1) linking the macroscopic (sentence-level) selection objective with the microscopic (token-level) watermarking trade-off. This strengthens the methodological foundation and justifies the proposed approach. 3. The experiments are thorough and well-designed. * Comprehensive Benchmarking: Evaluation across 10 diverse tasks and 3 major LLMs (Llama-2, InternLM, Qwen) demonstrates generalizability. * Significant Performance Gains: The reported average improvement of 51.01% in downstream task performance under fixed detectability is impressive and clearly highlights the method's value. * Robustness in Challenging Scenarios: The strong results on short-text (+47.78%) and low-entropy (e.g., code generation, +36.47%) scenarios are particularly noteworthy, as these are known pain points for current watermarks. * Exceptional Attack Resilience: Maintaining high detectability under 80% word-level perturbations (deletion, insertion, substitution) is a remarkable result that significantly outperforms baselines. 1. Algorithm 1 requires generating $k$ candidate chunks in parallel at each step. This implies the generation time (latency) will be roughly $k$ times that of a baseline method. The paper's claim of "low computational cost" is misleading as it primarily focuses on memory (KV cache). 2. Algorithm 2 (Detection) appears to require the detector to "Recover the seeds from generation". This suggests the detector must know the exact context $c$ and the random seed generator used during generation. This is a much stronger assumption than KGW (which only needs a secret key) and may be fragile in black-box detection or if the context is slightly modified. 3. The experiments fix $k$ (beam size) to 5 (1 vanilla + 4 watermarked). $k$ is a critical hyperparameter balancing quality, detectability, and latency, but the paper lacks a sensitivity analysis or ablation study on $k$. 1.How exactly are the k−1 watermark seeds generated from context and recovered at detection time? Is the seed generator deterministic and keyed? What are attack consequences if this procedure is partially known? 2.Can you provide wall-clock runtime and peak GPU memory numbers for representative settings (e.g., k=5, chunk m=32) on a standard GPU? The asymptotic KV discussion is useful but practitioners will want absolute numbers. 3.Have you tried stronger/adaptive attackers (e.g., paraphrase-based sentence rewriting engineered to minimize green-token counts) or defenses that specifically target chunk-final tokens? If so, what happens to detectability? 4.For Theorem 1, can you relax the token-independence assumption or empirically show how well the mapping (f) holds across tasks? Fully AI-generated
Bridge Policy: Visuomotor Policy Learning via Stochastic Optimal Control Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces BridgePolicy, a new generative model for visuomotor imitation learning. The core idea is to frame policy learning as a diffusion bridge problem, which directly learns a stochastic mapping from the observation distribution to the action distribution. This is in contrast to conventional diffusion policies that generate actions from random noise, conditioned on observations. The authors argue this condition-free approach, grounded in stochastic optimal control, avoids issues like manifold deviation and fitting errors inherent in conditional models. To handle the challenges of bridging heterogeneous, multi-modal distributions, the paper proposes a semantic aligner and a modality fusion module. The aligner uses a combination of contrastive loss and KL regularization to map observations to a latent space with the same shape as the action space. The method is evaluated on 52 simulation tasks across three benchmarks (Adroit, DexArt, MetaWorld) and four real-world tasks, reportedly outperforming state-of-the-art baselines like DP3 and FlowPolicy. 1. The central idea of using a diffusion bridge to directly map observations to actions is novel and theoretically elegant. It offers a principled, condition-free alternative to existing generative policies, which could potentially resolve known issues with conditioning schemes. 2. The method achieves state-of-the-art performance across a wide range of simulated tasks. BridgePolicy achieves a 0.90 average success rate compared to 0.78 for the next best baseline in the real-world tasks. 3. Well written, Figure 1 provides a helpful overview of the proposed architecture. 1. **Central Motivation is Not Directly Validated:** The primary motivation is to avoid "manifold deviation" and "fitting error" from conditioning. However, the paper provides no direct evidence to support this claim. The only evidence is improved task success, which is an indirect and inconclusive metric for this specific claim. 2. **Poor Justification for Key Design Choices in the Aligner:** The proposed alignment loss (Eq. 9) is a core contribution, yet its components are poorly justified. The KL regularization term, $D_{KL}(z_{obs}, \epsilon)$, forces the encoded observation latent distribution towards a standard normal distribution $\mathcal{N}(0, I)$. The paper claims this "ensures the observation maintains its sampling ability," a vague and unsubstantiated statement. This regularization is highly counter-intuitive; it actively penalizes the model for learning the true, complex, multi-modal structure of the observation space, potentially leading to mode collapse. This directly contradicts the goal of leveraging "information-rich observations" and seems more like a VAE-esque regularization hack than a principled choice for a diffusion bridge endpoint. A strong justification for why one would want to destroy the natural structure of the observation manifold is absent. 3. **Insufficient and Unconvincing Experimental Validation:** The empirical evidence, while showing positive trends, is not rigorous enough to support the strong claims of superiority. The simulation results are averaged over only 3 random seeds. 4. **Crucial Ablations are Missing:** The paper fails to ablate the most important new hyperparameters introduced. The alignment loss introduces weights $\alpha$ and $\beta$. The paper's performance is likely highly sensitive to these parameters, yet no ablation study is provided to analyze their impact or justify their chosen values. The parameter sensitivity analysis in Table 3 only investigates $\lambda$ and $\gamma$ from the base UniDB framework, which is insufficient. This is a critical omission that questions the robustness and generality of the method. 1. Can the authors provide a rigorous theoretical or empirical justification for the KL regularization term $D_{KL}(z_{obs}, \epsilon)$ in Equation 9? Forcing the information-rich observation endpoint $x_T = z_{obs}$ of the diffusion bridge to be distributed as a standard Gaussian seems fundamentally at odds with the goal of the model. Why should the learned latent observation distribution match unstructured noise? Have you investigated the impact of removing this term (i.e., setting $\beta=0$)? 2. The simulation results in Table 1 are based on only 3 seeds and exhibit very high variance, making the performance differences between BridgePolicy and the top baseline (FlowPolicy) statistically questionable. Could the authors please run experiments with more seeds (e.g., 5 or 10) and provide statistical significance tests to properly validate the claim of superior performance? 3. The alignment loss $L_{align}$ introduces two crucial hyperparameters, $\alpha$ and $\beta$. Their values could heavily influence the model's performance. Why was an ablation study on these hyperparameters omitted? Can you provide this analysis and explain how their values were determined for the final experiments? 4. The paper's motivation rests on avoiding "manifold deviation" and "fitting error." Beyond the final task success rates, can you provide any direct evidence that BridgePolicy actually mitigates these specific issues? 5. For the contrastive loss component of $L_{align}$, are negative samples drawn only from within the current batch? If so, have you analyzed the potential impact of false negatives and considered more sophisticated negative sampling strategies to ensure a more robust alignment is learned? Fully AI-generated
Bridge Policy: Visuomotor Policy Learning via Stochastic Optimal Control Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a visuomotor policy that uses a diffusion-bridge formulation with a stochastic optimal control view to connect observations (point clouds, proprioception) and action trajectories. The method replaces standard condition-guided diffusion with a bridge process and adds a multimodal aligner and cross-attention fusion. The authors report results on several simulation suites and a small set of real-robot tasks. Overall, the idea is interesting and the robotics application is well motivated. 1.The formulation is interesting and innovative. Casting policy learning as a diffusion bridge with an SOC perspective is a fresh angle and is clearly explained. 2.The paper presents a clean module stack (semantic/shape alignment + multimodal fusion) and provides training/inference details. Additionally, the paper includes some ablation analyses (e.g., demo count sensitivity and hyperparameter sweeps), which help readers understand stability. 3.The method is tested on many simulated tasks and several real-robot tasks; results are generally competitive compared with diffusion-based policy learning methods. 1.Limited practical gain over strong baselines. On several benchmarks the improvements over diffusion/flow policies are small or mixed. 2.Evidence gap for the main claim. The statement that conditional diffusion inevitably causes manifold deviation/estimation error is only supported by the experiment success rates. I would like to see more results to support your claim, such as distributional distances to expert trajectories, score estimation error, and trajectory-manifold deviation—beyond just success rates. 3.Generalization and robustness. Beyond the success rates, I would like to see more results on the generalization results, such as object, viewpoint and visual generalization. 4.Throughput/latency. The method appears slower than some diffusion/flow counterparts, but there is no speed-accuracy trade-off analysis. The slower inference time may hurt the performance of robots. I would raise my score if my concerns is solved. 1.For DP3 and simple DP3, are the number of points exactly the same as in your method (both in sim and real)? 2.Manifold-deviation claim. Can you run a controlled study with identical encoders/U-Nets: (a) conditional diffusion/flow, and (b) your bridge? Report distributional distances to expert trajectories, score estimation error, and trajectory-manifold deviation—beyond just success rates. I would like to see the direct comparison between a conditional flow/diffusion and your bridge to support your claim when controlling most of the network architecture to be similar. 3.Generalization. Do your real-robot tasks include unseen objects/poses/scenes or different cameras/lighting? Fully human-written
Bridge Policy: Visuomotor Policy Learning via Stochastic Optimal Control Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces a new generative visuomotor policy that is condition-free and maps observations directly to actions through a diffusion bridge grounded in stochastic optimal control. The authors argue that traditional diffusion-based or flow-based imitation learning methods suffer from manifold deviation and the policies learned are less controllable and stable. To address the limitations, the paper reframes the problem and extends diffusion bridge theory from image generation to robot control, which connects observation and action distributions natively, avoiding conditioning. The policy forward starts from observation rather than noise and the backward process generates actions through SOC principles. The experiments show that BridgePolicy improves the success rate by a large margin across different simulation and real-world tasks, demonstrating the effectiveness of the proposed method. - The condition-free idea is reasonable and promising, especially if you want a smoother, deterministic imitation learning policy that can work in a well-observed setting. - The diffusion bridge formulation is pretty clear, which models the transformation from observation to action using a stochastic differential equation from SOC. It also introduces a semantic distribution aligner that combines contrastive loss and KL regularization to shape mismatch. - The experimental results are strong enough to show that Bridge Policy improves the task success rate compared to traditional diffusion-based and flow-based methods. With fewer demonstrations, Bridge Policy tends to show better stability and when demonstrations increase, the success also increases. - Through the Bridge Policy reduces extrinsic stochasticity and avoids score-fitting instability, it is less flexible in some sense. A condition-free bridge bakes the observation distribution into the diffusion dynamics. Once it is learned, you cannot easily intervene or modulate the policy externally. It is a trade-off between stability and flexibility. For more general policy or complex tasks, some degree of conditioning is crucial. - The SOC-based formulation leads to higher computational cost, which could constrain long-horizon or high-frequency control. - In the experiment section, the results demonstrate that Bridge Policy can improve the task success rate, but lacks other metrics to support the improvement of stability, sample efficiency, etc.. - More ablations or analyses on policy failure modes can help to determine when the bridge misaligns distributions or overfits to certain visual scenes, to better demonstrate the effectiveness of the proposed method. It was not clear in the paper. - If the diffusion bridge only ever connects distributions seen during training, it might fail catastrophically on unseen combinations of states and visuals, as mentioned in the weaknesses. Can you discuss more? - It’s unclear how much improvement comes from better representation learning or from the new diffusion formulation. Ablations on the aligner, bridge formulation could be better. - Typos: --- Is the Norm missing a square symbol in Eq (6)? It should be squared to match the diffusion ELBO. --- Should the KL term in Eq (9) be asymmetric? If so, it should be reformulated. Moderately AI-edited
Bridge Policy: Visuomotor Policy Learning via Stochastic Optimal Control Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces Bridge Policy, replacing the conditioning in vanilla diffusion policy with diffusion bridge. Briefly, diffusion bridge modifies the forward process such that the terminal distribution is not a random Gaussian, but a point or a distribution related to the conditioning input (observation embedding in this case). And similarly, the reverse process inference also starts from a latent related to the observation rather than random Gaussian. Bridge Policy encodes robot state and point cloud to a latent of the same shape as the output action chunk, learns a policy, and infers with a fast solver. Sim and real environment experiments show that Bridge Policy outperforms prior state of the art in various settings. - The idea of using diffusion bridge, which has shown promise in visual domains, for policy learning, is well-motivated. The policy architecture design makes sense. - Strong suite of sim and real environments show that the proposed policy works well in various setups. - Good set of baselines show that the proposed policy outperforms prior state of the art. - Lack of ablations. There are some ablations studying the number of demos and hyperparameter sensitivity, but it would be nice to see an ablation on the alignment loss (no contrastive learning, no KL, original method), and observation encoder (cross-attn vs. simple concat). - It's unclear from the paper how much of the gain is from the observation module vs. the policy head. If the core claim is that diffusion bridge improves over prior policy heads, then it would be important to keep the observation input treatment constant and compare various policy heads. - The main concern that I have is whether the baselines receive the same observation input as BridgePolicy. How are the observations handled in the baseline cases? - It would be good to see additional ablation experiments: alignment: no contrastive loss, no KL; observation encoder: cross-attention / simple concat. Fully human-written
RegionReasoner: Region-Grounded Multi-Round Visual Reasoning Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper extends a previous work VisionReasoner by adapting to the multi-round setting. They present RegionReasoner, which is a reinforcement learning framework that uses SegLLM to bring multi-round interactions. To validate RegionReasoner, the authors also introduce a new benchmark RegionDial-Bench, which is designed to test the multi-round reasoning ability. The main tasks focus on detection and segmentation. In each round of reasoning, the model can refer to information such as box coordinates of previous rounds, and thus provide more grounded reasoning. In each round, RegionReasoner generates a structured text action includes scene, focus, think and answer, and the memory is updated accordingly. RegionReasoner forces the reasoning to cite evidence to reduce the hallucination by adding the reference citation reward. RegionReasoner-7B outperforms VisionReasoner-7B and other VLMs such as QwenVL-7B in multi-round detection and segmentation tasks on RegionDial-Bench. (1) RegionReasoner extends a strong previous single-round model VisionReasoner and adapts to the challenging multi-round setting. Results on the proposed benchmark show the validity of RegionReasoner. (2) The benchmark itself can be used later form multi-round vision reasoning studies. The motivation of referring to object locations is direct and clear. (1) The paper claims "RegionReasoner consistently outperforms strong Vision-Language Models and task-specific baselines on both referring segmentation and detection.". Previous benchmarks focus on single-round detection/segmentation, but in the main table 1 and table 2, the results are shown on the proposed multi-round benchmark. I think it would be reasonable to add the table to show some "task-specific baselines" for the previous single-round benchmarks. (2) Also, the proposed benchmark uses RefCOCO+ and RefCOCOg, but there are also other benchmarks such as MSCOCO and Visual Genome, which are diverse and have boxes and segmentations masks. Have the authors tried to use other datasets to construct the benchmark? Why RefCOCO+ and RefCOCOg are selected here? Some of the figures should be polished. For example, the text in Figure 2 is not clear enough when zooming in. Fully human-written
RegionReasoner: Region-Grounded Multi-Round Visual Reasoning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper investigates the problem of grounding visual referents in multi-turn dialogues for vision-language models (VLMs). They introduce RegionDial-Bench, a benchmark for evaluating multi-round question-answering where each response must be grounded in a specific object instance within the image, annotated via bounding boxes. Alongside the benchmark, they propose RegionReasoner-7B, a model trained using a GRPO-based reinforcement learning approach. The reward function incorporates three key objectives: correctness of the object grounding, global-local semantic consistency, and answer accuracy. Experimental results on RegionDial-Bench demonstrate the effectiveness of the proposed method. 1. This paper introduces RegionDial-Bench, a new benchmark designed to study multi-round conversational reasoning in VLMs, with a specific focus on the groundedness of evidential objects in each dialogue turn. 2. The authors propose a GRPO-based training framework that rewards models for accurate object grounding, global-local semantic consistency, and answer correctness. Experimental results demonstrate the effectiveness of their resulting model, RegionReasoner, on the proposed benchmark. 1. The creation process of RegionDial-Benchmark, which constitutes a major contribution of this work, is not sufficiently detailed in the paper. The authors should include a clear description of the benchmark construction methodology, such as data sources, annotation protocols, and key statistics (e.g., number of dialogues, turns, and object categories),to facilitate wider adoption. 2. The evaluation of RegionReasoner is currently limited to the proposed RegionDial-Bench. To better assess the generalizability of the method, it is important to also report performance on established benchmarks such as V*. Without such results, it remains unclear whether the improvements are specific to the proposed benchmark or reflect broader applicability. 3. The multi-round conversation results in Table 1 are notably lower than those in the single-round setting, which appears strange. Furthermore, the result for RefCOCOg Multi-turn (R6) stands out as an outlier, being significantly higher than those of R5 and R7. These inconsistencies warrant further analysis and explanation. 4. As shown in Table 3, the model consistently performs better in single-round settings compared to multi-round scenarios across multiple metrics. This recurring pattern suggests a systematic challenge in handling multi-turn grounded dialogues, which should be explicitly discussed in the paper. 1. How was RegionDial-Bench constructed? should detail the data sources, annotation protocols, and key statistics. 2. Does RegionReasoner generalize to other VQA benchmarks beyond RegionDial-Bench? Evaluation on established benchmarks (e.g., V*) is needed to verify its broader applicability. 3. Why does multi-round conversation performance consistently lag behind single-round? Furthermore, what explains the outlier for RefCOCOg Multi-turn (R6) in Table 1? Lightly AI-edited
RegionReasoner: Region-Grounded Multi-Round Visual Reasoning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents a multi-round visual reasoning benchmark for detection and segmentation. They propose a grounded reasoning method, RegionReasoner, which incorporates reinforcement learning and a global-local consistency reward to enhance semantic coherence. On RegionDial-Bench, the proposed method achieves improvement compared to other VLMs, especially in the later turns. - This paper presents an interesting reasoning task that integrates QA, referring expression in a multi-turn manner. - They propose new reward functions for the new task. They propose a global-local consistency reward to align keywords from the global and local context. - The way they expand the referring expression to multiple turns is confusing and may not be natural. In Appendix B, they illustrate how to simply use a preposition + bbox coordinates in the later turns. A natural referring expression considers the composition between objects. However, in the qualitative examples, they have more complicated and natural questions, such as "Which slice of pizza is R1 about to eat"? "Who is the person next to R1"? They mention that those GPT-style questions used in the previous paper may hallucinate, but it is unclear how they convert the question to this. - The task setting may not be challenging enough or fair. 1) If the latter turn is based on the ground-truth previous turn (as Appendix B), then the task is essentially a regular single-turn QA, which is not novel. 2) If the latter turn is based on the previous turn, then the reason other models can not achieve good performance is that they are not trained on these templates. If we feed the ground-truth in the question, RegionReasoner may not perform much better than previous methods. It would be nice to see the comparison with the provided ground-truth of the previous step object. - It is unclear if the new training data affects the performance on standard REC and RES benchmarks. - How did you generate the questions, or did you use the templates in Table 5? - In the later turns, do you provide the ground truth box of the previous object? - Could you compare your methods on standard REC and RES benchmarks? Fully human-written
RegionReasoner: Region-Grounded Multi-Round Visual Reasoning Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces RegionReasoner, an RL-trained vision–language policy that emits structured per-turn trajectories for multi-round, region-grounded visual reasoning. Two novel reward components are proposed: (1) an explicit reference-citation reward that forces <think> to verbatim-cite bbox coordinates and penalizes hallucinated citations, and (2) a global–local semantic consistency reward that aligns keywords across <scene>, <focus>, and <think>. The authors also present RegionDial-Bench, a multi-turn benchmark built from RefCOCO+/RefCOCOg, and show that RegionReasoner-7B improves multi-turn detection and segmentation metrics, especially at later turns. 1. Reward design — The reference-citation and global–local consistency rewards are intuitive, easily implementable, and well tied to the structured output format. They provide fine-grained shaping for intermediate (reasoning trace) tokens rather than only final outputs. 2. Results in Tables 1 and 2 show consistent improvements, and the claim that improvements compound at later turns (robustness to error accumulation) is supported by both quantitative and qualitative examples. 1. The authors state that test dialogues reformulate later queries to explicitly cite bounding boxes predicted in earlier rounds. If the test references are model-predicted boxes (rather than strictly ground truth), the evaluation can be sensitive to the upstream model used to generate them. This raises two issues: inconsistent comparison if different baselines consume different predicted references, and potential leakage effects. Please clarify exactly how test references are created and ensure all methods receive the same predicted references (or show oracle vs. predicted-reference performance). 2. The global–local consistency reward depends on a handcrafted lightweight keyword extractor + lemmatizer + noun filter. This may be brittle: paraphrases, synonyms, pronouns, coreference, or longer expressions are likely missed. More importantly, if baselines do not produce structured <scene>/<focus> text, how is the comparison fair? Forcing <think> to repeat the same noun form may advantage RL-trained models. 1. How about the ablations with different reward weight hyperparameters $\alpha$, $\beta$? Heavily AI-edited
Generative Model via Quantile Assignment Soundness: 2: fair Presentation: 4: excellent Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a novel deep generative model (DGM) called NeuroSQL, which departs from the common framework of encoder+decoder (as in VAEs) or adversarial training (as in GANs). Instead it directly learns latent codes for each training datum via a quantile-assignment (linear assignment / Hungarian) to a pre-specified lattice of latent quantiles, and trains a generator network to map those codes to data. The authors formulate a minimisation over both generator parameters θ and a permutation π that maps quantile codes to data For multivariate (latent > 1D) codes they leverage optimal-transport/multivariate quantile theory and solve assignment between training data and a fixed uniform grid in latent space. OpenReview They propose an alternating algorithm: (i) keep π fixed, update θ via generator training; (ii) fix θ, update π via Hungarian algorithm assignment; optionally use momentum smoothing of assignments. OpenReview Empirically they evaluate on 4 domains (MNIST, CelebA, AFHQ animal faces, and OASIS brain images) under a “small‐budget / low-data” regime: e.g., training data capped at ~2 k images, resolution up to 128×128, single Google Colab budget. OpenReview The main claims: (1) NeuroSQL is more stable (no adversarial or encoder collapse issues), (2) it yields better or competitive image quality (measured via proxy FID, LPIPS, SSIM) under matched generator/backbone conditions, (3) it is more resource-friendly in low-data/high-dimension settings. Interesting idea / novel paradigm — Replacing the encoder or discriminator with an explicit assignment of latent codes (quantile grid) is novel, and links generative modelling with statistical quantile/transport theory. Pragmatic focus on low-data regimes — The paper addresses an important setting: generating synthetic data when the training dataset is small relative to high ambient dimension (e.g., neuroimaging) which is under-studied. Theoretical underpinning — The use of quantile assignment and the convergence argument in the univariate / multivariate case gives formal support to the latent-code approximation strategy. 1. Resolution / dataset scale limited — Their experiments are constrained to small image resolutions (64×64-128×128) and relatively small sample sizes (~2 k images) under a limited compute budget. While this is aligned with their motivation (low-budget), it raises the question of how the method performs at larger, contemporary scales (e.g., 256×256, ImageNet scale). The authors acknowledge this in future work. 2. Interpretability of latent codes — Since the quantile grid is fixed and codes are assigned via permutation, the learned latent space may lack the structure/meaningfulness of e.g., disentangled VAE latents or hierarchical GAN latents. The paper doesn’t deeply analyze the semantics of the latent codes—are they smooth, do they support interpolation, manipulation, etc. The paper assumes a fixed quantile lattice in latent space. How sensitive is the method to the choice of quantile grid (e.g., Sobol vs uniform vs Gaussian quantiles)? The assignment problem is discrete, yet the generator is trained with continuous gradients. How do you ensure smooth convergence given the alternating discrete–continuous optimization? Fully AI-generated
Generative Model via Quantile Assignment Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper build a new structure for generative models, with less dimension. 1. this is a new structure 2. test with many different dataset and genearive framework 1. The expression of Figure 1 is unclear. From the image, it appears that the input data are fed into the decoder. The paper should clarify why this component is referred to as the decoder rather than the encoder, and explicitly describe what the input data are. Moreover, the roles of Momentum Update and Embedding in the framework are not clearly explained. What does “Cost” represent in this figure? Is it equivalent to the loss function? Additionally, regarding the left-hand side of the figure, I speculate that it corresponds to the grey-shaded part on the right-hand side. However, it is not clear how the output on the left is transmitted to the generator. This connection should be explained more explicitly. 2. Section 3 mainly discusses the quantile assignment, but it should also explain how this mechanism is made trainable and why it is considered optimal. These claims should be supported by theoretical justification or experimental evidence. 3. Dataset and Metrics: The introduction of the dataset and evaluation metrics is not the core contribution of the paper and could be moved to the appendix or combined with the related work section to improve focus. 4. Diffusion Model Performance: The diffusion process seems to fail under the proposed method, which may be influenced by the linear assignment mechanism. Diffusion models often struggle with simple linear interpolation in latent space, resulting in abrupt transitions, artifacts, or degenerate (e.g., grey) images. This appears to be a limitation of the current approach. However, it might be mitigated by adopting smoothed diffusion models [1] or related approaches that enforce smoother, more linear latent mappings. [1] Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models check with weakness Lightly AI-edited
Generative Model via Quantile Assignment Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes NeuroSQL, a latent-variable generative model that removes the encoder/discriminator and instead assigns latent codes by solving a linear assignment problem between data and a fixed quantile lattice of the latent prior. Training alternates between fitting a single generator to currently assigned codes and re-solving the assignment with a cost based on a perceptual/structural image loss; a momentum update smooths the assigned codes across iterations. For d>1, the quantiles are built via center-outward multivariate ranks from optimal transport; practically, a low-discrepancy grid on the unit ball is used. Experiments under a small-compute regime compare NeuroSQL against VAE, GAN, and a budget-matched diffusion baseline on MNIST, CelebA, AFHQ, and OASIS, showing improved visual quality and quantitative scores. The paper makes a meaningful attempt to resolve the disadvantages of mainstream generative models VAEs and GANs, by removing the encoder and discriminator modules and integrating statistical quantile learning for stable training. The approach may be of interest in certain domains of generative tasks. 1. The main issue of the proposed method is scalability. The optimization algorithm runs in O(n^3) time with n being the number of samples. While the paper mentioned approximation via mini-batches, no concrete evidence is provided to show if it still works with large datasets (and models). The paper compares the method with VAEs, GANs and diffusion models in the seemingly fair budgeted setting. However, the comparison is not sound in that other models scale easier and perform much better with more budget. The budget, 200 Google Colab compute units and 2000 training images, is too limited for practical generative tasks. 2. Experimental results are not convincing to show the advantage of the proposed method. Images in Figure 2 are in low resolution making it hard to compare the visual quality. Quantitative results in Appendix show high instability across latent dimensions. In particular, in many cases the FIDs for NeuroSQL, VAE and CAN change significantly and non-monotonically as the latent dimension increases. 1. How does the method perform in mini-batches settings? 2. For quantitative results, how many runs were executed? Is unstable and insufficient training the cause for the varying evaluation scores? Fully human-written
Generative Model via Quantile Assignment Soundness: 3: good Presentation: 3: good Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces NeuroSQL, a generative modeling framework that learns latent variables through a quantile-assignment process derived from optimal transport, eliminating the need for an encoder or discriminator. The model alternates between a generator update and an assignment step solved via the Hungarian algorithm. This approach aims to combine stable, deterministic optimization with the expressiveness of deep decoders. Experiments span MNIST, CelebA, AFHQ, and OASIS, across multiple generator backbones (ConvNet, ResNet, U-Net), showing competitive visual quality and efficient convergence under low-data conditions. The replacement of encoder–decoder mappings with an assignment-based quantile mechanism is conceptually fresh and theoretically grounded. It bridges optimal transport with generative modeling in a unique and elegant way. The method is particularly well-suited for data domains like neuroimaging, where dimensionality exceeds sample size, and the assignment cost is independent of feature dimensionality. Avoiding adversarial losses makes the model stable and lightweight to train. The simplicity of using an L2-based reconstruction objective allows reproducibility even in constrained computing environments. The experiments show meaningful improvements in visual quality and diversity under limited data, highlighting NeuroSQL’s advantage. While the paper provides an overall complexity estimate, quantitative comparisons to VAEs, GANs, or diffusion models in terms of runtime, memory, and scalability would provide stronger evidence of its efficiency. The Hungarian step’s cubic cost could be limiting for very large batch sizes, although mini-batching is suggested as a practical solution. The performance advantage over GANs and VAEs is not uniform—some settings show weaker results, suggesting NeuroSQL's strengths may be inconsistent. Why the quantitative diffusion comparisons under similar compute budgets are missing? Do you see challenges extending this approach to transformer-based or high-resolution settings? Fully AI-generated
In Agents We Trust, but Who Do Agents Trust? Latent Preferences Steer LLM Generations Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper investigates whether Large Language Models (LLMs) possess "latent source preferences," meaning they systematically favor information based on the perceived reputation or brand identity of its source (e.g., news outlets, academic journals). Through controlled experiments on twelve LLMs, the study finds these preferences are strong, predictable, context-sensitive, and can outweigh the influence of the content itself, persisting even when models are prompted to be unbiased. 1. The study validates its hypothesis with an extensive empirical evaluation across twelve distinct LLMs from six different providers , spanning synthetic and real-world tasks including news, research, and e-commerce. 2. The paper effectively isolates the phenomenon by complementing direct preference queries with a rigorous "indirect evaluation" methodology, which uses semantically identical content to disentangle latent source bias from content-driven effects. 3. The work addresses a novel and critical gap by focusing on how LLMs select and present information rather than just what they generate , demonstrating in real-world case studies that these preferences can dominate content and explain observed political skews. 1. To better situate the paper's contribution, the "Related Work" section should explicitly differentiate its findings from key studies on LLM cognitive biases, such as [1-3]. A clearer discussion is needed on how 'latent source preference' (a bias towards external entities) differs from biases originating in pretraining vs. finetuning [1], emergent cognitive biases induced by instruction tuning [2], and existing cognitive debiasing techniques focused on reasoning [3]. This would more effectively highlight the novelty of the current work. 2. A significant concern arises regarding the paper's strong conclusion from the AllSides case study—namely, that source preference "can completely override the effect of the content itself" and that the observed "left-leaning skew" is "largely attributable" to source trust. This claim appears to be undermined by the study's own control data. In the critical "Source Hidden" condition (Fig. 6), the models already exhibit a clear preference for articles originating from left-leaning and centrist sources, even when no source information is provided. This strongly suggests that the content itself (e.g., writing style, topic selection, or alignment with the models' RLHF training) is a significant confounding variable that introduces a substantial skew before source attribution is considered. Therefore, a more rigorous and defensible interpretation is that latent source preferences amplify or reinforce a pre-existing content-driven bias, rather than "overriding" it or being its primary cause [1] Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs [2] Instructed to bias: instruction-tuned language models exhibit emergent cognitive bias [3] Cognitive debiasing large language models for decision-making None Moderately AI-edited
In Agents We Trust, but Who Do Agents Trust? Latent Preferences Steer LLM Generations Soundness: 2: fair Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper studies latent source preferences in LLM-based agents: the authors hypothesize that models encode brand-level signals (e.g., reputation, follower counts) in their parametric knowledge and that those signals systematically bias which retrieved items the agent surfaces. They evaluate this via complementary direct (ask models which source they prefer) and indirect (show semantically identical content with different source labels) tests across 12 models and three domains (news, research-paper selection, and seller choice), as well as realistic case studies. The authors uncover multiple interesting findings about the nature and implications of LLM source preferences. 1. The core research question “whether LLM-based agents carry latent source preferences that systematically influence which items they trust and retrieve” is largely novel. This is a specific type of model bias that has not been systematically studied by prior works, but also appears timely and highly relevant to realistic LLM applications. 2. The paper is well-structured and easy to follow. Each research question is stated up front and directly answered with matched experiments and analyses, making the paper easy to follow and the claims easy to verify. 3. Experiments are comprehensive. The authors combine direct and indirect tests, synthetic and realistic case studies, broad model coverage (12 LLMs), and diverse domains (news, research papers, e-commerce), which together give the results both depth and external validity. 1. The evaluation may be vulnerable to prompt-induced shortcutting: if the same phrasing (for instance, “select the article based on journalistic standards”) is used across direct and indirect tests, models might be reacting to that cue rather than expressing a stable, content-independent source prior. Concretely, a model could learn that the phrase “journalistic standards” often co-occurs with examples from mainstream outlets during pretraining or instruction tuning and therefore surface those sources whenever the phrase appears. This would look like a latent source preference but is actually a response to prompt wording. 2. During synthetic dataset construction the authors use GPT-4o to generate/refine article variants; quantitative diversity metrics and/or human validation are needed to confirm that generated items are sufficiently distinct. 3. The evaluated models also include two smaller GPT-4.1 variants, which might undermine the validity of the findings, as it’s been discovered that models generally prefer outputs from the same model family. 1. Comparing the two case studies presented in section 5, the authors find that prompting cannot reduce source bias in the news aggregator setting, while it turns out to be effective when selecting Amazon sellers. Are there any insights for the cause and implication of such difference? 2. Since you also ask for a brief explanation from the models during evaluation, did you observe any interesting patterns in their reasoning when they select the sources? 3. Did you consider finding mechanistic explanations (within representations) for such latent source preference with open-source models to cross-validate your findings? 4. From line 285: “a model may seem to favor sources with fewer followers when asked directly, yet in practice it may assign more weight to higher follower counts”. This seems rather counterintuitive. Do you have any plausible explanations? Lightly AI-edited
In Agents We Trust, but Who Do Agents Trust? Latent Preferences Steer LLM Generations Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper investigates latent source preferences in large language model (LLM) based agents systematic biases that lead models to favor certain sources (e.g., NYTimes, Nature, Amazon) over others when generating or recommending information. The authors conduct controlled and real-world experiments on 12 LLMs from six providers across domains such as news recommendation, research paper selection, and e-commerce decisions. They find that (1) source preferences are strong, predictable, and persist even when content is identical, (2) these preferences are context-sensitive, varying with domain and framing, (3) LLMs inconsistently associate different brand identities (e.g., “@nytimes” vs “nytimes.com”), creating vulnerabilities for impersonation, and (4) prompting strategies like “avoid bias” fail to eliminate these tendencies. The study reveals that LLM agents encode trust hierarchies toward real-world entities, emphasizing the need for auditing, transparency, and controllable bias mechanisms in future agent design. 1. Introduces and formalizes the idea of “latent source preferences.” 2. 12 models, 6 providers, multiple domains, and both synthetic and real world data. 3. Consistent results with rank correlation and contextual sensitivity analyses. 4. Ties directly to alignment, fairness, and trustworthiness of LLM based agents. 5. Appendices include detailed prompt templates, datasets, and code release commitment. The paper stops short of causal analysis, it does not probe which stages of training (pretraining vs instruction-tuning) most contribute to preference formation. While the phenomenon is well-characterized, the mitigation aspect is limited to showing that prompting fails. A deeper exploration of possible control mechanisms (e.g., debiasing or preference regularization) would strengthen the work. Some statistical results (e.g., rationality correlations in Fig. 5) could be better explained with accompanying confidence intervals or ablation-based sensitivity checks. The work primarily focuses on English-language and Western-domain sources; future multilingual and cross-cultural extensions would enhance generalizability. 1. Can the authors disentangle the contribution of pretraining data versus instruction-tuning datasets to these latent preferences? 2. Would fine-tuning on balanced or anonymized source data reduce these biases? 3. How would the results change for non-English or low-resource languages where brand representation is limited? Fully AI-generated
In Agents We Trust, but Who Do Agents Trust? Latent Preferences Steer LLM Generations Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. I dont think the paper fits with the ICLR main ML community. The paper is on findings after findings after findings, with no clear insights, no clear "so what" answers. I think the paper is more fit to Scientific Reports than a ICLR paper. The authors did a bunch of experiments. The paper has no clear take-away insights. It is more fit for a Scientific Reports kind of paper, than an ICLR paper. Would suggest the authors to consider the target conference or journals. Also would advice the authors read their paper carefully, and think about what are the main contribution, and take away from the paper. The writing is kind of missed throughout the paper. Fully human-written
Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents a systematic study of multimodal jailbreaks and introduces a quantitative framework (on-topicness, OOD-intensity, harmfulness, refusal) along with the Balanced Structural Decomposition (BSD) attack. BSD adaptively decomposes malicious prompts to balance semantic relevance and distributional novelty, achieving improvements in attack success. - This paper defines and separates on-topicness and OOD-intensity, enabling structured, interpretable analysis of jailbreak mechanisms in multimodal LLMs. - This paper proposes BSD, a recursive decomposition framework that operationalizes this balance and achieves improved attack success rates. - The contribution is primarily heuristic and engineering-focused. BSD refines existing decomposition-based attacks rather than introducing new theoretical insights into model vulnerability. - The BSD framework mainly extends prior decomposition-based attacks with additional heuristics (WBS, SDU) but lacks causal or mechanistic justification for why balancing on-topicness and OOD-intensity fundamentally increases vulnerability. - The hierarchical search and threshold settings introduce many heuristic hyperparameters without sensitivity or stability analysis, making the method’s reproducibility and robustness across models and datasets uncertain. - How sensitive are the results to the embedding model choice and the specific OT/OI computation? - Can the authors demonstrate that BSD captures intrinsic safety vulnerabilities rather than benchmark-specific weaknesses? - Would the observed OT–OI balance remain effective under adaptive or adversarially retrained guard models? Fully AI-generated
Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses the challenge of effectively jailbreaking Multimodal Large Language Models (MLLMs). It argues that current evaluation methods are flawed, as they often misclassify benign or off-topic responses as successful attacks. To rectify this, the authors propose a new four-axis evaluation framework that assesses prompts based on On-Topicness (OT), Out-of-Distribution Intensity (OI), Harmfulness, and Refusal Rate. Through empirical analysis, the paper identifies a critical structural trade-off: highly on-topic prompts are more harmful but also more likely to be rejected, while extreme out-of-distribution (OOD) prompts evade filters but produce less harmful content. To exploit the optimal balance between relevance and novelty, the authors introduce Balanced Structural Decomposition (BSD), a recursive strategy that decomposes malicious instructions into semantically coherent sub-tasks paired with descriptive images. Evaluated on 13 MLLMs, BSD demonstrates superior performance by generating more harmful outputs with fewer refusals compared to baseline methods, highlighting a vulnerability in existing safety mechanisms that rely on surface-level filtering. + The introduction of the four-axis framework (On-Topicness, OOD-Intensity, Harmfulness, Refusal Rate) provides a more nuanced and comprehensive standard for evaluating MLLM jailbreaks, addressing the overestimation problem in prior work. + The paper makes a good contribution by empirically identifying and formalizing the fundamental trade-off between prompt relevance and novelty, offering a clear explanation for the limitations of existing attack strategies. + The proposed BSD strategy is a well-motivated and systematic approach that directly targets the identified OT/OI trade-off, demonstrating state-of-the-art attack performance across a wide range of models. - The authors' findings do not significantly differ from previous work. For example, one of the main findings of this paper, that existing jailbreaks are generally ineffective, is consistent with the conclusions of Nikolić et al. (ICML’25). However, this paper fails to provide sufficient new insights or highlight the differences from previous work. - While BSD integrates descriptive and distracting images, it does not analyze specific visual features (such as content relevance) or perform ablation experiments to determine how this affects the OT/OI balance or model behavior, making the role of visual effects difficult to understand. - More advanced defenses need to be added for further evaluation. This paper only evaluates GuardReasoner-VL-3B/7B, which does not seem convincing. Q1: Regarding the overlap between your findings and previous work (e.g., Nikolić et al., ICML'25) on the ineffectiveness of existing jailbreaks, could you elaborate on why the paper does not include a detailed comparative analysis of the evaluation frameworks, attack mechanisms, or core conclusions between your work and Nikolić et al.'s study? [1] Nikolić, Kristina, et al. "The Jailbreak Tax: How Useful are Your Jailbreak Outputs?." In Proc. of ICML, 2025. Q2: For the visual components integrated into BSD (descriptive and distraction images), since the paper does not analyze specific visual features (e.g., semantic relevance between descriptive images and the original malicious prompt), do you have plans to supplement experiments that quantify how variations in visual content relevance affect the On-Topicness (OT) and OOD Intensity (OI) balance of the input? Q3: The paper lacks ablation experiments to isolate the impact of visual cues on BSD’s performance. For example, why did you not design an ablation group that removes descriptive images or distraction images entirely, and compare its ASR, HS, and RR with the full BSD model? This would help clarify whether visual components are necessary for achieving the OT/OI balance. Q4: When evaluating BSD against guard models, the paper only tests GuardReasoner-VL-3B and GuardReasoner-VL-7B. Why did you not include other mainstream MLLM guard models (e.g., LLaVA-Guard, VILA-Guard, or commercial guard systems like OpenAI’s Content Policy Enforcement) in the evaluation to verify the generalizability of BSD’s ability to bypass defenses? Q5: The paper states that BSD’s sub-task decomposition relies exclusively on Qwen2.5-7B, which fails when malicious prompts are overly explicit or complex (resulting in overt intent and model rejection). Given this limitation, why did the study not evaluate alternative decomposition LLMs (e.g., smaller open-source models like LLaVA-7B or safety-aligned models) to test whether they could improve the robustness of the decomposition module? Q6: For overly complex malicious prompts (e.g., multi-step illicit operations) that Qwen2.5-7B fails to decompose effectively, the paper only notes "reduced jailbreak success" but does not define clear criteria for "complexity" of prompts. Could you clarify how the paper quantifies prompt complexity, and whether this quantification was used to systematically test the decomposition module’s limits? Fully AI-generated
Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper investigates a practical limitation of existing OOD jailbreak attacks where the responses classified as "successfully jailbroken" are frequently unrelated to the malicious intent while the on-topic attack prompts only elicit response rejection. To effectively jailbreak MLLMs, the authors proposed balancing this trade-off between OOD-ness and On-topicness by iterative decomposition of attack prompts and depth refinement within a tree structure. - The paper is tackling an unexplored but practical limitation of existing OOD attacks with qualitative analysis of relevance-novelty trade-off. - The paper proposed a novel tree-based decomposition attack strategy to balance the trade-off. - The paper is well-written and easy to follow. - The comprehensive experimental results over closed and open-source models demonstrate the jailbreak effectiveness of proposed method with detailed qualitative analysis. - One of my concerns is in the paper’s reliance on SBERT embeddings to quantify On-Topicness (OT) and Out-of-Distribution Intensity (OI). SBERT has known limitations in capturing subtle semantic variations—such as word order perturbations or nuanced paraphrases [R1] —which may undermine the reliability of these metrics. For instance, in Eq (2), when the short caption from MLLM is subtly but semantically altered where only minor lexical changes lead to a substantially different meaning—it is unclear whether the OI metric based on SBERT embeddings can reliably capture such distinctions. - Also, when measuring OI in Eq (2), a safety-aligned MLLM may generate a safe summary rather than one semantically consistent with the harmful prompt $P_0$. This naturally increases the OI score even though the divergence stems from safety alignment rather than genuine distributional novelty. Consequently, the metric may conflate the model's safety-alignment with true out-of-distribution characteristics, limiting its interpretability as an indicator of OOD intensity. - In Eq. (3), the formulation of the Harmfulness Score (HS) may produce misleading results. If the reference vector $h_{ref}$ ​contains uniformly high category scores while the response vector $h_r$ ​ is uniformly low, the ℓ1 ​distance term​ still increases—thereby raising the overall HS despite the response being less harmful. - In Eq. (4), the notation $N$ is used without a clear definition. While it appears to represent the total number of evaluated responses, this seems not explicitly stated in the manuscript. - In Eq. (5), the assumption that greater semantic dissimilarity among decomposed sub-tasks directly corresponds to higher OOD characteristics lacks clear justification. It is not evident that mutual dissimilarity between sub-tasks translates to genuine OOD behavior from the model’s perspective. - The experimental comparison is limited to two baselines (FigStep, CS-DJ), which is too narrow to support broad claims. Including a broader set of recent baselines (also with recent victim models such as GPT-5) would make the results more convincing. - The iterative decomposition (width) and refinement (depth) and image generation likely add significant computational overhead compared to the non-optimization methods such as FigStep. [R1] https://arxiv.org/pdf/2309.03747 See above weaknesses. Fully human-written
Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper focuses on addressing the vulnerability of Multimodal Large Language Models (MLLMs) to adversarial prompts and the inaccuracies in existing jailbreak evaluation standards. It points out that current jailbreak strategies often overestimate success rates, as many "successful" responses are benign, vague, or unrelated to malicious goals. To solve this, the paper proposes a four-axis evaluation framework covering input on-topicness, input out-of-distribution (OOD) intensity, output harmfulness, and output refusal rate. Through empirical research, it reveals a structural trade-off: highly on-topic prompts are easily blocked by safety filters, while overly OOD prompts evade detection but fail to generate harmful content. Based on this insight, the authors design a recursive rewriting strategy called Balanced Structural Decomposition (BSD). BSD decomposes malicious prompts into semantically consistent sub-tasks, adds subtle OOD signals and visual cues, and uses a neutral tone to present inputs. Experiments on 13 commercial and open-source MLLMs show that BSD outperforms existing methods, improving attack success rates by 67% and harmfulness by 21%, and also performs well against guard models. - The proposed four-axis evaluation framework comprehensively captures both input and output characteristics of MLLM jailbreaks, addressing the defect of overestimating attack effectiveness in traditional binary evaluation and providing a more accurate and reliable benchmark tool for subsequent research. - The BSD strategy innovatively balances the trade-off between on-topicness and OOD intensity. By recursively decomposing prompts and integrating visual cues, it effectively evades MLLM safety filters while ensuring the generation of harmful content, achieving breakthroughs in attack performance. - The study conducts extensive experiments across 13 MLLMs (including commercial closed-source and open-source models) and two guard models, using three datasets (HADES, MMSafetyBench, and AdvBench-M). The large-scale and multi-dimensional experimental design enhances the generalizability and persuasiveness of the research results. - The BSD strategy relies heavily on the quality of sub-task decomposition by the Qwen2.5-7B model. When facing overly obvious or complex malicious objectives, the decomposition fails to produce semantically diverse sub-tasks, leading to reduced jailbreak success rates and limiting the strategy's adaptability. - The generation of descriptive images in BSD depends on the FLUX.1-schnell text-to-image model. The paper lacks an in-depth analysis of how image quality, style consistency, and semantic alignment with sub-tasks specifically affect jailbreak results, and there is insufficient verification of the necessity of visual cues. - The study only evaluates the short-term effectiveness of jailbreak attacks but ignores the long-term impact of repeated use of BSD on MLLMs. It does not explore whether MLLMs can learn to identify and defend against such structured decomposition attacks, resulting in incomplete research on attack durability. - This paper needs to add more advanced baselines for further comparison to highlight the superiority of the proposed attack. Comments: 1. Over-reliance on specific decomposition models without alternative mechanisms: The BSD strategy is highly dependent on Qwen2.5-7B for sub-task decomposition. When this model fails to decompose complex or obviously malicious prompts effectively, the entire jailbreak process breaks down. The paper does not propose alternative decomposition models or adaptive adjustment mechanisms to address this single-point failure risk, which affects the robustness of the strategy. 2. Insufficient validation of the role of visual cues: Although BSD integrates descriptive and distraction images, the paper only verifies the performance differences between FLUX-generated images, colored boxes, and random noise in ablation experiments. It does not quantitatively analyze how factors such as image semantic relevance to sub-tasks, visual complexity, and number of distraction images affect the jailbreak effect, making the role of visual cues in the strategy unclear. 3. Limited analysis of cross-dataset generalization differences: The paper tests BSD on three datasets (HADES, MMSafetyBench, AdvBench-M) but does not deeply analyze why the strategy has significant performance differences across datasets (e.g., lower ASR on AdvBench-M due to fewer samples). It also fails to explore whether the decomposition logic of BSD needs to be adjusted for different types of malicious prompts in different datasets, limiting the understanding of the strategy's cross-scenario adaptability. 4. Lack of in-depth comparison with state-of-the-art baselines: Although this paper has conducted in-depth comparisons with CS-DJ and FigStep, it is still not convincing enough. I recommend that the authors compare with the methods in the following references: [1] Ma, Teng, et al. "Heuristic-induced multimodal risk distribution jailbreak attack for multimodal large language models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025. [2] Liu, Yi, et al. "Arondight: Red teaming large vision language models with auto-generated multi-modal jailbreak prompts." Proceedings of the 32nd ACM International Conference on Multimedia. 2024. [3] Jeong, Joonhyun, et al. "Playing the fool: Jailbreaking llms and multimodal llms with out-of-distribution strategy." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025. 5. Minor issues - The font size of the text in Figures 1, 8, 9 is too small to be read. - "Stage 1: Width-first balancing via Width Balance Score." should be "Stage 1: Width-first balancing via width balance score." Heavily AI-edited
PreviousPage 21 of 1516 (75800 total rows)Next