ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 15899 (21%) | 4.43 | 3.58 | 3687 |
| Heavily AI-edited | 3233 (4%) | 4.22 | 3.59 | 2990 |
| Moderately AI-edited | 7082 (9%) | 4.20 | 3.61 | 2722 |
| Lightly AI-edited | 16648 (22%) | 4.15 | 3.68 | 2746 |
| Fully human-written | 32938 (43%) | 4.13 | 3.62 | 2917 |
| Total | 75800 (100%) | 4.21 | 3.62 | 3026 |
Title Ratings Review Text EditLens Prediction
All Patches Matter, More Patches Better: Enhance AI-Generated Image Detection via Panoptic Patch Learning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper tackles the challenge of generalizing AI-generated image (AIGI) detectors across different generation models. Through systematic analysis, the authors identify a key issue — “Few-Patch Bias” — where existing detectors over-rely on a small number of image patches despite artifacts being uniformly distributed across all regions of synthetic images. They propose two guiding principles, *All Patches Matter* and *More Patches Better*, and introduce the **Panoptic Patch Learning (PPL)** framework to operationalize them. PPL combines **Randomized Patch Reconstruction (RPR)**, which injects synthetic artifacts into randomly chosen patches to diversify learning, and **Patch-wise Contrastive Learning (PCL)**, which enforces consistent discriminative capability across patches. Extensive experiments on major benchmarks (GenImage, DRCT-2M, AIGCDetectionBenchmark, and Chameleon) demonstrate that PPL achieves state-of-the-art accuracy and robustness, significantly improving generalization to unseen generators and real-world data. 1. The paper introduces clear and insightful principles (“All Patches Matter” and “More Patches Better”) that reveal a fundamental property of AI-generated images and motivate the proposed framework. 2. The proposed Panoptic Patch Learning method is conceptually simple yet effective, combining randomized patch reconstruction and patch-wise contrastive learning to mitigate few-patch bias. 3. Extensive experiments across diverse benchmarks demonstrate strong generalization and robustness, supported by thorough analyses and clear visual evidence. 1. In the second paragraph of the introduction, the term *“patch”* appears for the first time but lacks a clear definition or motivation. Since AIGI detection includes many CNN-based detectors that do not explicitly rely on patch-level representations, introducing the patch concept without clarification may confuse readers about its relevance to this task. It is recommended that the authors explain why they adopt patch as the basic analytical unit and cite related works that have previously used patch-based approaches, which would make the motivation more convincing. 2. In line 364, the reference of SAFE is wrong. 3. In Tables 3 and 4, I notice that all baseline methods are trained on GAN-based datasets, while the proposed PPL is consistently trained on SDv1.4 (a diffusion-based model). Although it is reasonable to fix one generator for evaluating generalization, different training datasets may have biases toward different test distributions (e.g., a model trained on diffusion data may generalize better to diffusion-based test sets). Therefore, I suggest that the authors also report results trained on a GAN-based dataset for these two benchmarks. This would ensure a fair comparison and further demonstrate that the proposed method is also effective when trained on GAN-generated data. I have no further questions. Lightly AI-edited
All Patches Matter, More Patches Better: Enhance AI-Generated Image Detection via Panoptic Patch Learning Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposed a novel patch-based AI-generated image detection method. The main motivation is the few-patch bias observed in existing detectors, i.e., they overly rely on a limited proportion of patches and neglect the diversity of artifacts across patches. To encourage the utilization of information from all patches, Randomized Patch Reconstruction (RPR) is proposed, which applies diffusion reconstruction to real images and replaces a random set of the original image patches with the corresponding reconstructed patches. Patch-wise Contrastive Learning (PCL) further encourages the learning and utilization of all patch features. Results on several benchmarks suggest the state-of-the-art generalization performance of the proposed method. 1. The analysis of the few-patch bias of AIGI detectors reveals a significant limitation of existing methods. 2. The proposed RPR and PCL effectively encourage the model to utilize the information across all patches. 3. The generalization of the proposed method is comprehensively evaluated, including testing on challenging datasets like Chameleon and robustness studies. 1. The motivation of the *All Patches Matter* principle requires further clarification. - The two lines of evidence stated in Lines 45-49 only support that "some patches matter" (i.e., some of the patches contain discriminative patterns) rather than "all patches matter". - The "Theory" in Line 146 may be an inappropriate title for the first key finding, as no theoretical results are provided. In addition, the assumptions behind the statement "Because every patch of a synthetic image is itself generated, each inherently contains artifacts" need further clarification (e.g., what "artifacts" are and why generative models produce them across every pixel). - The details for the "Experiments" in Lines 154-157 are not specified. - Given that "a single patch contains sufficient information for reliable discrimination" (Line 157), it seems unnecessary to emphasize the utilization of all patches. This point may not support the *All Patches Matter* principle. 2. It seems that the attention maps in Figure 3(a) can be explained by the observations in [1] that vision transformers tend to utilize patch tokens in low-informative background areas as registers for aggregating global information. Repeating the visualization experiments with vision transformers with dedicated registers proposed in [1] may eliminate this possibility. Besides, the acquisition of the attention maps needs explanation. 4. The experimental details for Figure 3(b) are not specified, including what detectors are tested and how the patch masking is implemented. This is important for supporting the generalization of the conclusion. 5. It seems that the Total Direct Effect (TDE) described in Lines 210-213 should be the Controlled Direct Effect (CDE). 6. Previous reconstruction-based methods such as [2] and [3] should be discussed and compared. [1] Vision Transformers Need Registers. ICLR 2024. [2] Aligned Datasets Improve Detection of Latent Diffusion-Generated Images. ICLR 2025. [3] A Bias-Free Training Paradigm for More General AI-generated Image Detection. 
CVPR 2025. 1. How are the reconstructed images (in Figure 2 and Section 4) produced? Is there any image-processing step, such as upsampling or downsampling, that could affect the low-level details of the image or introduce artifacts? 2. Why does the proposed method generalize effectively to GAN-generated images, despite training being based solely on diffusion models? 3. Is it possible to set $p_{rpr}$ to 1, i.e., using only the real images and the RPR images for training? 4. In Figure 8, what contributes to the difference between the blue and red bars at +LoRA, given that RPR is not used? Fully human-written
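For readers unfamiliar with the patch-mixing operation these two reviews discuss, the sketch below illustrates the core idea of Randomized Patch Reconstruction as the reviews describe it: a random subset of patches in a real image is replaced with the corresponding patches from its diffusion-reconstructed counterpart, and the mixed image is treated as fake during training. The patch size, swap ratio, and tensor shapes are illustrative assumptions, and the diffusion reconstruction itself is taken as a given input; this is not the authors' implementation.

```python
import torch

def randomized_patch_reconstruction(real, recon, patch=16, ratio=0.3):
    """Swap a random subset of patches in `real` with the corresponding
    patches from its diffusion reconstruction `recon`.

    real, recon: (C, H, W) tensors with H and W divisible by `patch`.
    ratio: fraction of patches to swap (hypothetical default).
    """
    _, H, W = real.shape
    gh, gw = H // patch, W // patch
    k = int(ratio * gh * gw)
    idx = torch.randperm(gh * gw)[:k]            # indices of patches to swap
    mixed = real.clone()
    for i in idx.tolist():
        r, c = divmod(i, gw)
        ys, xs = r * patch, c * patch
        mixed[:, ys:ys + patch, xs:xs + patch] = recon[:, ys:ys + patch, xs:xs + patch]
    return mixed

# Toy usage: the mixed image would be labeled as fake, so synthetic artifacts
# can appear at arbitrary spatial locations rather than in a few fixed patches.
real = torch.rand(3, 224, 224)
recon = torch.rand(3, 224, 224)                  # stands in for the reconstruction
mixed = randomized_patch_reconstruction(real, recon)
```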
IQA-Octopus: Unified Multi-Granularity Image Quality Assessment with Reasoning, Grounding and Referring Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces IQA-Octopus, a novel framework for IQA built upon LMMs. The primary contribution is unifying multi-granularity perception tasks: reasoning, grounding, and referring. To address the lack of suitable training data, the authors create and introduce a new dataset, IQA-Octopus-33K. The proposed method utilizes a two-stage optimization strategy. The first stage trains the model for text-based reasoning and referring tasks. The second stage enables pixel-wise perception using a novel, training-free "text-to-point" strategy. This strategy implicitly converts text logits into point coordinates for a segmentation model (i.e., SAM) , avoiding special tokens that can degrade the LMM's reasoning abilities. The authors demonstrate that their model achieves comparable or state-of-the-art performance across multiple IQA benchmarks - The paper is clearly structured, well written and presented in a clear manner. - The paper proposes a unified framework that combines multi-granularity reasoning, grounding, and referring for IQA. This is a promising direction, as it aligns with the growing demand for more explainable and comprehensive IQA methods. - The model demonstrates strong, comparable performance across different tasks. - The architecture is straightforward, consisting of a standard LMM backbone and a frozen segmentation head. Such simplicity is often beneficial for both interpretability and ease of implementation. - Limitation of dataset. - Scale: The newly proposed IQA-Octopus-33K dataset seems limited in scale. With only ~33K total samples (and fewer per specific task, as shown in Table 7 ), it may be insufficient for robustly training an LMM to handle four distinct and complex task paradigms. This concern is amplified by the fact that a large portion of the data is generated automatically (synthetic data pipeline ) or semi-automatically (using InternVL-2.5 for Q&A generation ), which can lead to a lack of diversity and potential propagation of errors from the generator model. - Annotation Quality: The reliance on a single open-source model (InternVL-2.5) for generating all Q&A pairs is a significant concern. The paper's justification for avoiding closed-source models like GPT-4V on the grounds of cost and transparency is weak; generating 33K annotations is not prohibitively expensive, and using an open-source model does not inherently make the data generation process more transparent or guarantee higher quality. Relying on a single model risks baking its specific biases and stylistic into the dataset, potentially limiting the trained model's generalizability. - Sufficiency: The paper does not adequately demonstrate that the 33K dataset is sufficient for achieving the reported performance. The ablation study in Table 4 actually suggests the contrary: the model trained only on the IQA-Octopus-33K dataset performs worse than the model trained with additional datasets. A more convincing study is needed to justify that this dataset size is sufficient. - Method design. - Naive Segmentation Prompting: The "text-to-point" strategy, while simple, appears overly naive. 
It generates only a single point coordinate based on a weighted average of positional term logits. It is known that SAM's performance with a single point prompt can be inaccurate, especially for non-salient objects (or in this case, distortion regions). The paper provides insufficient evidence that this single-point prompting is robust for accurately segmenting diverse and complex quality degradations. - *[Suggestion] A better practice might be splitting the image into smaller grids and predicting multiple points to prompt SAM.* - Results comparison. - Unclear Baseline Training (Table 1): The experimental setup for Table 1 is unclear and potentially unfair. The paper does not state whether the baseline models were fine-tuned on the new IQA-Octopus-33K training set or evaluated in a zero-shot manner. Given the specialized task formats, a zero-shot evaluation would likely fail, making the comparison weak. If they were retrained, the drastically lower performance of other methods, especially on global description, is surprising and requires explanation. - Lack of Backbone Ablation (Table 2): The comparison in Table 2 lacks a critical baseline for fair assessment. As shown in Q-Instruct, the choice of LMM backbone can significantly impact performance. The paper should include results using a similar LMM backbone (e.g., InternVL-2.5) without the proposed method to isolate the contribution of the architecture itself. In conclusion, this paper is well written and proposes a novel unified framework for multi-granularity image quality assessment, which is potentially useful for several applications. However, there are still several concerns about the dataset, the method design, and the results comparison. Therefore, I would currently recommend marginally below acceptance, and would like to see the authors address the above concerns in the rebuttal for the final decision. Please see weakness points above. Fully human-written
IQA-Octopus: Unified Multi-Granularity Image Quality Assessment with Reasoning, Grounding and Referring Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces IQA-Octopus, a unified image quality assessment (IQA) framework that integrates reasoning, grounding, and referring within a single large multimodal model (LMM) architecture. Unlike prior works that focus on isolated dimensions of visual quality (e.g., global description, pixel-wise grounding), IQA-Octopus emphasizes multi-granularity perception by jointly modeling global, local, and pixel-level understanding. 1. First LMM-based IQA model combining reasoning, grounding, and referring, addressing multi-granularity perception in a coherent architecture. 2. The text-to-point method avoids retraining segmentation heads while maintaining reasoning ability. 3. High-quality dataset: IQA-Octopus-33K integrates both synthetic and real distortions, enabling comprehensive evaluation. 1. The proposed IQA-Octopus-33K dataset, while diverse, contains only 33K samples. This scale may be insufficient for robust instruction-tuning at the LMM level, and expanding the dataset could further strengthen the empirical claims. 2. The evaluation is restricted to images. Assessing the model’s generalization to video or 3D quality tasks would better demonstrate the versatility of the proposed multi-granularity reasoning framework. 3. Although the hybrid dataset ablation is insightful, the contribution of each sub-task (reasoning, grounding, referring) is not explicitly disentangled, leaving uncertainty about their respective roles in overall performance gains. 4. The combination of SAM-based grounding and LoRA tuning increases architectural complexity, yet runtime, computational cost, and efficiency trade-offs are not discussed. 5. The dataset generation partially relies on InternVL outputs, which may introduce variability. Full release of generation templates and code would be necessary to ensure reproducibility and transparency. 1. Do the authors plan to expand IQA-Octopus-33K or incorporate additional data sources to improve instruction-tuning robustness and cross-domain generalization? 2. Could the proposed framework be extended or tested on video or 3D quality assessment tasks to validate its applicability beyond static images? 3. Can the authors provide a finer-grained ablation that isolates the contributions of reasoning, grounding, and referring sub-tasks to better understand their individual impacts? 4. What is the inference-time overhead of integrating SAM and LoRA compared to a standard LMM baseline, and how does this affect scalability in real-world scenarios? 5. How do the authors ensure consistent data generation from InternVL, and will they release the corresponding scripts and templates to enable full reproducibility? Heavily AI-edited
IQA-Octopus: Unified Multi-Granularity Image Quality Assessment with Reasoning, Grounding and Referring Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes the IQA-Octopus framework to unify multi-granularity image quality assessment tasks, including reasoning, localization (grounding), and referring. The authors construct the IQA-Octopus-33K dataset, which covers four task paradigms: global/local quality description, pixel-level quality grounding, and region-level quality referring (Section 3, Fig. 1). The core methodological contributions are: (1) an LMM based on Phi-3.5-Vision is instruction-tuned on a mixed dataset (the proposed 33K data, Q-Instruct, and DQ-495K) to learn multi-granularity reasoning and referring at the text level (Section 4.3); and (2) a “text-to-point” strategy that, at inference time, extracts the probability distribution of location words (top/bottom/left/right) from the LMM’s logits, then produces a single coordinate via weighted averaging (Eqs. 2-4), which serves as a zero-shot prompt to drive a frozen SAM for pixel-level segmentation (Section 4.2). Experimental results show competitive performance on both the authors’ in-house benchmark (Table 1) and external benchmarks (Tables 2 and 5). Notably, on Q-Ground-Test (Table 5), the model achieves 0.293 mIoU in a zero-shot setting, surpassing Q-Ground (0.271), which requires fine-tuning. The paper’s main strengths lie in its clear motivation and the novelty of the “text-to-point” strategy. The authors explicitly argue that existing methods (e.g., Q-Ground, LISA), which introduce special tokens (such as `\<seg\>`) for explicit localization, “damage instruction-following behavior and reasoning processes” (Section 1, lines 107–109), and they propose to avoid this by implicitly mapping text logits to coordinate points (Section 4.2). This design is technically elegant in that it preserves the integrity of the LMM’s text output (“keeping the text output stainless,” Section 4.2, lines 325–327), avoids adding special tokens to the vocabulary, and does not require an additional segmentation head to be fine-tuned. The ablation study (Fig. 5) strongly supports the effectiveness of this strategy: its zero-shot performance (0.364 mIoU) slightly exceeds EVF-SAM (≈0.35 mIoU), which requires joint training of the multimodal encoder and SAM. In addition, the IQA-Octopus-33K dataset (Section 3) unifies four task paradigms and combines synthetic and real distortions (Fig. 2), providing a more comprehensive benchmark for multi-granularity IQA; its automated annotation pipeline (Section 3.2) also enhances the scalability of data construction. The core weakness is the lack of direct empirical validation for key assumptions. First, the paper repeatedly asserts that avoiding special tokens prevents “damaging instruction-following and reasoning ability” (Section 1, lines 107-109; Section 2, lines 123-125), yet provides no ablation study directly substantiating this claim, as the authors do not compare their “text-to-point” method against a baseline trained with explicit localization using special tokens (on the same data and backbone) on pure text reasoning tasks (e.g., Q-Bench-A1 in Table 2). 
As a result, the central “conflict-free” claim lacks empirical support. Second, a fundamental assumption is that a single point derived via weighted averaging of logits over four location words (Eq. 4) suffices to prompt SAM to segment arbitrary distortion regions (Section 4.2). However, the paper does not discuss or validate this assumption in more complex scenarios. For instance, when distortion comprises multiple disjoint regions (e.g., noise at all four corners) or complex shapes (e.g., annular distortions), the DAO-G task’s mIoU of only 0.375 in Table 3 may hint at this limitation, but no analysis is provided. Finally, in data construction (Section 3.2, Eq. 1), the authors use bounding-box centers to derive location-word labels, effectively establishing a strong association between center points and location words at the training-data level. This introduces conceptual ambiguity with the “zero-shot” grounding claimed in Section 4.2. Is the model merely reproducing this training-time mapping, rather than exhibiting genuine zero-shot generalization? On validating the core assumption: A central motivation is the claim that using special tokens “damages instruction-following behavior” (Section 1, lines 107-109). Could the authors provide a direct ablation study comparing IQA-Octopus with a variant trained for explicit localization using special tokens (e.g., the `\<seg\>` mechanism in Q-Ground or LISA), and then evaluate the two models on pure text reasoning benchmarks (e.g., Q-Bench-A1 in Table 2) and dialog-based quality assessment? This is critical to substantiate the “conflict-free” core claim. On the applicability limits of the single-point prompting strategy: The method in Section 4.2 generates a coordinate via weighted averaging (Eq. 4) as a prompt for SAM. How does this single-point mechanism handle more complex localization scenarios? For example, when distortion consists of multiple non-adjacent regions (e.g., noise in all four corners), or when distortion has complex shapes (e.g., hollow ring-shaped distortions), does a single-point prompt remain effective? Do the mIoU differences across subtasks in Table 3 (DAO-G: 0.375 vs. HyD-G: 0.354) reflect this limitation? On the nature of “zero-shot” grounding: During data construction (Section 3.2, Eq. 1), bounding-box centers are used to generate location-word (top/bottom/left/right) labels, implying that the training data already encodes an explicit mapping between center coordinates and location words. Does the “zero-shot” grounding in Section 4.2 merely reproduce this training-time mapping rather than demonstrate true zero-shot generalization? How do the authors distinguish between these two cases? On generalization to real-world distortions: Table 6 shows SOTA performance on the synthetic distortion dataset KADID-10K (0.815/0.783 SRCC/PLCC), but performance on the real-world dataset FLIVE (0.439/0.541) is nearly on par with Q-Instruct (0.432/0.545). Given that Fig. 6 indicates the training data are primarily sourced from the synthetic-distortion KADIS-700K, does this suggest that the model’s fine-grained perception abilities (e.g., grounding and referring as in Tables 1 and 3) rely heavily on patterns specific to synthetic data, thereby struggling to generalize to diverse, atypical real-world distortions? How do the authors explain this performance gap? Lightly AI-edited
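The IQA-Octopus reviews above describe the "text-to-point" strategy only at a high level (Eqs. 2-4 of the paper), so the following is an illustrative reconstruction rather than the authors' implementation: logits over the four location words are softmax-normalized and used to weight canonical anchor coordinates, giving a single (x, y) point that prompts a frozen SAM. The anchor positions and the exact word vocabulary are assumptions.

```python
import numpy as np

# Hypothetical anchor coordinates (normalized to [0, 1]) for the four
# location words the reviews mention; the paper's exact mapping may differ.
ANCHORS = {
    "top":    np.array([0.5, 0.25]),
    "bottom": np.array([0.5, 0.75]),
    "left":   np.array([0.25, 0.5]),
    "right":  np.array([0.75, 0.5]),
}

def text_to_point(location_logits, image_size):
    """Convert LMM logits over location words into one pixel coordinate.

    location_logits: dict word -> logit extracted from the LMM output.
    image_size: (width, height) of the input image.
    """
    words = list(ANCHORS)
    logits = np.array([location_logits[w] for w in words])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                      # softmax over the four words
    point = sum(w * ANCHORS[name] for w, name in zip(weights, words))
    return point * np.array(image_size)           # (x, y) prompt for SAM

# Example: logits leaning toward "bottom" and "right" place the point in the
# lower-right quadrant, which is then passed to SAM as a single point prompt.
print(text_to_point({"top": 0.1, "bottom": 2.0, "left": 0.3, "right": 1.5},
                    image_size=(1024, 768)))
```

The single point produced here is exactly what the grid-based multi-point suggestion in the earlier review aims to improve upon.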
IQA-Octopus: Unified Multi-Granularity Image Quality Assessment with Reasoning, Grounding and Referring Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper presents a unified four-task IQA framework that covers reasoning, grounding, and referring. The authors construct a dataset for multi-granularity perception and design a conflict-free two-stage optimization strategy. Additionally, extensive experiments on self-built and public benchmarks demonstrate the framework's effectiveness in text-based answering, pixel-wise grounding, and visual quality scoring tasks. 1. The authors propose a unified IQA framework that integrates reasoning, grounding, and referring. The conflict-free two-stage optimization and text-to-point strategy effectively balance multi-granularity perception without compromising reasoning ability. 2. The IQA-Octopus-33K dataset is comprehensive and scalable. The dataset covers four task paradigms, and its construction pipeline introduces SAM tools and InternVL-2.5 to ensure scalability and transparency. 3. The experiments cover various benchmarks spanning text-based answering, pixel-wise grounding, and visual quality scoring. Ablation studies on hybrid-dataset training and the text-to-point strategy also validate the effectiveness of key design choices. 1. Figure 3, which shows the model framework, does not detail how the LMM backbone interacts with the SAM segmentation head, especially in the SAM part. Key components like the "text-to-point conversion module" are missing. 2. In the Image Collection part of Section 3.2, the authors mention that they employed human annotators to provide reliable ground-truth annotations. However, the details of this annotation process are not provided. 3. A minor suggestion is to use color in the figure examples that exhibit text outputs for different methods (Fig. 1, 3, 4) and to distinguish these methods more clearly (Fig. 5). 1. Please clarify the dataset construction details. For authentic distortion annotation in KonIQ, how the human annotators ensure reliable quality labels, as well as the inter-annotator agreement for distortion type/region labeling, should be demonstrated. 2. Failure cases are needed for deeper analysis. Lightly AI-edited
Hallucination Mitigation in Large Vision-Language Models via Adaptive Multi-Subspace Projection Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a training-free framework to mitigate hallucinations in LVLMs. The method first constructs a set of disentangled hallucination subspaces via SVD and K-means. Then, at inference time, the model adaptively creates specific weights from the subspaces to alleviate the hallucination, via two forward queries of the original image and masked image. 1. The novelty is sound. The authors find an adaptive method to calculate specific weights for different queries, which is often neglected in other hallucination papers, as the hallucination type is different for different inputs. 2. The general method builds up with the training-free methods, while adaptively creating specific weights from a clustered subspace, which seems to be superior to some other training-free methods. 3. The method is well evaluated across different base models. 1. The paper's Table 3 shows that as hallucination suppression increases (using more basis vectors), the BLEU score drops significantly. While BLEU is not a comprehensive metric for modern LVLMs, this still raises concerns about the degradation in general model performance. The evaluation is narrow on hallucination benchmarks and lacks testing on broader, general-purpose benchmarks (MMMU/VQAv2/...) to confirm that the model's abilities are not so compromised. 2. One of the core claims is that it identifies distinct hallucination modes. However, the authors do not provide any qualitative analysis or evidence, or visualization to validate that these disentangled subspaces actually correspond to semantically different types of hallucinations. 3. The method requires two forward passes at test time (one for the original image and one for the masked one) to get a diff. This will introduce extra computation. Moreover, it remains unclear why the specific difference signal serves as a proxy for the input-specific hallucination signal. Lastly, the authors seem to miss the exact strategy for masking. Is it random black masking or other strategies? N/A Fully human-written
Hallucination Mitigation in Large Vision-Language Models via Adaptive Multi-Subspace Projection Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a training-free method to mitigate hallucinations in Large Vision-Language Models (LVLMs) through **adaptive multi-subspace projection**. The authors argue that existing model-editing approaches like Nullu use a single global subspace that fails to capture diverse hallucination patterns across different inputs. They therefore construct multiple disentangled hallucination subspaces via K-means clustering and SVD, then adaptively weight these subspaces at test time based on input-specific hallucination signals derived from masked image perturbations. The method is evaluated on CHAIR and POPE benchmarks across three LVLM families (LLaVA-1.5, MiniGPT-4, mPLUG-Owl2), showing improvements over existing baselines including the recent Nullu method. 1. The paper identifies a limitation of existing fixed model-editing methods: a single global subspace cannot adapt to the heterogeneous hallucination patterns that vary across different inputs. In my view, this observation is insightful and the proposed solution of using multiple subspaces with adaptive weighting represents a natural and promising direction for improvement. 2. The inclusion of ablation studies on the number of subspaces, basis dimensions, and perturbation strategies demonstrates investigation of the method's behavior. The consistency of improvements across different settings is encouraging. 3. The method maintains the training-free property, which is valuable for practical deployment. Unlike fine-tuning approaches that require curated datasets and substantial computational resources, the proposed approach offers a reasonable middle ground by preprocessing subspaces offline and applying lightweight adaptive weighting at test time. 4. The two-stage framework is well-designed and the mathematical formulation is generally clear. - In my opinion, the writing of the paper could be improved. The reported improvements over Nullu are relatively modest and come with limited statistical validation; given the standard deviations shown, some gains may not be significant. The paper would be much stronger with paired t-tests or similar statistical validation to confirm these improvements are reliable rather than random variation. - While the paper claims efficiency advantages, no actual inference times, memory usage, or FLOPs are reported to validate this. Additionally, several key technical details lack clarity: the notion of **semantically salient regions** for masking (see Equation 13) is never defined. - Table 3 reveals that increasing basis vectors improves hallucination metrics but causes substantial BLEU degradation, suggesting over-suppression of legitimate content. While the authors select **balanced** hyperparameters empirically, there is no principled guidance for navigating this trade-off in new settings, and no theoretical understanding of why it occurs. 1. Could you add statistical significance tests to validate that the improvements over Nullu are reliable rather than within-noise variation? This would substantially strengthen the empirical claims. 2.
How are **semantically salient regions** identified for the masking operation? Please provide implementation details or point to the specific saliency method used. Fully AI-generated
Hallucination Mitigation in Large Vision-Language Models via Adaptive Multi-Subspace Projection Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a training-free way to reduce hallucinations in large vision-language models (LVLMs) by editing their internal activations at test time instead of fine-tuning them. The key idea is to build multiple low-rank “hallucination subspaces,” each representing a different type of hallucination, by comparing model states from truthful vs. hallucinated captions. At inference, the model estimates which hallucination modes are most likely for the current input image, then dynamically projects its hidden representations away from those directions, suppressing ungrounded content while keeping image-relevant semantics. Experiments on benchmarks like CHAIR and POPE across models such as LLaVA-1.5, MiniGPT-4, and mPLUG-Owl2 show that this adaptive multi-subspace projection reduces hallucinations more consistently than prior decoding-based or single-subspace editing methods. 1. The paper models hallucination not as one global direction but as multiple disentangled subspaces, each tied to a different hallucination mode. At test time it adaptively weighs these subspaces for the current input and projects away the most risky directions, which leads to stronger hallucination suppression than fixed one-subspace editing. 2. The ablation shows that different LVLM backbones prefer different numbers of subspaces (e.g., 7 for LLaVA-1.5, 11 for MiniGPT-4, 5 for mPLUG-Owl2). This suggests each model has its own “hallucination landscape,” rather than a universal structure. Making these subspaces interpretable in semantic terms (e.g., “spurious object insertion,” “wrong spatial relation”) would be a valuable next step. 1. The method needs two forward passes at inference: it runs the LVLM on both the original image and a perturbed/masked version to estimate which hallucination modes are likely, then applies the adaptive projection. Prior fixed-edit approaches only require a single edited forward. Authors need to report a compute/runtime comparison against those baselines. 2. The “contrastive dataset” used to build the hallucination subspaces is under-specified. The paper does not state where the images/captions come from, how large this dataset is. Without dataset source/scale, it is hard to judge fairness and reproducibility. 3. The experiments are only on older/open LVLMs (LLaVA-1.5, MiniGPT-4, mPLUG-Owl2). There is no evidence that the approach still works on newer high-capability MLLMs (e.g., recent Qwen2.5-VL–series) . Please refer to the weaknesses part. Fully AI-generated
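As context for these reviews, here is a minimal sketch of the two-stage recipe they describe: offline, hidden-state differences between hallucinated and truthful generations are clustered with K-means and each cluster gets a low-rank SVD basis; at test time, each subspace is weighted by how strongly the input's original-vs-masked difference aligns with it, and the weighted components are projected out of the hidden state. Dimensions, the alignment measure, and the scaling factor are assumptions, not the paper's exact formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_subspaces(diffs, n_clusters=7, rank=8):
    """diffs: (N, d) hidden-state differences between hallucinated and
    truthful generations. Returns one low-rank basis per cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(diffs)
    bases = []
    for c in range(n_clusters):
        X = diffs[labels == c]
        # Left singular vectors of the cluster span its hallucination subspace.
        U, _, _ = np.linalg.svd(X.T, full_matrices=False)
        bases.append(U[:, :rank])                      # (d, rank)
    return bases

def adaptive_project(h, signal, bases, alpha=1.0):
    """h: (d,) hidden state to edit; signal: (d,) original-vs-masked
    difference used as the input-specific hallucination proxy."""
    sims = np.array([np.linalg.norm(B.T @ signal) for B in bases])
    w = sims / (sims.sum() + 1e-8)                     # adaptive subspace weights
    for wi, B in zip(w, bases):
        h = h - alpha * wi * (B @ (B.T @ h))           # remove that subspace's component
    return h

# Toy usage with random data, just to show the shapes involved.
rng = np.random.default_rng(0)
bases = build_subspaces(rng.normal(size=(500, 64)), n_clusters=3, rank=4)
edited = adaptive_project(rng.normal(size=64), rng.normal(size=64), bases)
```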
Deformable Contact-Aware 3D Object Placement Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes a method for arranging deformable objects using large language models (LLMs). Previous similar studies only considered rigid bounding boxes without accounting for contact-induced deformation. This paper addresses this issue by proposing a collaborative solution involving multiple components. However, the overall writing, formatting, diagramming, and font sizes in the figures severely hinder readability. The provided qualitative experimental demos are also too sparse to reliably assess their generalization capabilities. Furthermore, there is a lack of evidence demonstrating whether the employed LLM can effectively meet the requirements. The research question is novel and meaningful. 1. The paper writing is too poor. Both the writing and diagrams include significant issues. For example, the font size in the images is too small. These problems make the paper hard to read. 2. The provided qualitative experimental demos are also too sparse to reliably assess their generalization capabilities. The authors need to provide more evidence to prove the robustness of such a complex system. 3. There is a lack of evidence demonstrating whether the employed LLM can effectively meet the requirements. The authors need to provide a more complete evaluation of the LLM to prove that the selected LLM has a strong capability to understand the setting. See weakness. Moderately AI-edited
Deformable Contact-Aware 3D Object Placement Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors introduce an approach to place objects into simulated 3D scenes using language prompts specifying the desired position of the object. From the prompt, the approach makes use of LLMs and VLMs to visually locate the intended support of the object in a rendered image of the scene, as well as segment the object and its support into parts and infer their physical parameters (e.g. density, Young's modulus) to condition the simulator. The approach further uses LLMs to determine the initial position and rotation of the object for a drop into the scene, then rolls out the simulation until a set of convergence criteria are satisfied. The results are presented to a VLM and a set of human study participants for evaluation, outperforming all of the 3 considered baselines on the "right placement" and "physics & naturalness" metrics. The formulation of placement as "drop-to-equilibrium" is a useful idea for applications such as animation or interior design. The considered baselines are deemed to be outperformed by human judges. It is commendable that the authors reimplemented the FirePlace baseline for an additional comparison with a method whose code is not publicly accessible. The work includes an honest discussion of its limitations. The presentation of the work can be improved in several points: improving the citation style, spending more effort on visually appealing figures. Several citations have been formatted incorrectly. Insufficient examples of the output are provided to allow a sensible assessment of the method's output and performance against baselines. The work could have greatly benefited from the inclusion of supplementary materials in form of videos with qualitative examples of baselines' vs. the proposed method's results, to allow reviewers to form a more informed opinion about its efficacy. The work cites its appendix in several places, yet no appendix has been uploaded. The abstract is too detailed. The textual abstract still contains LaTeX code. What is the size of the human study in Table 2? In line 226, you mention that "More views look thorough but often introduce near–duplicate evidence and confusion". It seems counterintuitive to me that providing more evidence to the VLM will produce worse results given a reasonable aggregation scheme such as a majority vote. Do you have more evidence to support this claim? What is the quality of the LLM semantic label-to-material mapping in the "Giving the simulator honest materials" step? Which segmentation model is used to segment the image into labeled object instances? How often does this step fail, e.g. due to failure of detection of the object by the segmentation method? Fully human-written
Deformable Contact-Aware 3D Object Placement Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes the DCAP pipeline, which combines vision/language priors with simulation. The main contributions of the work are as follows: - A novel problem formulation that casts placement as drop-to-equilibrium, adhering to the physical properties of the materials. - The DCAP pipeline, which couples LLMs/VLMs with physical simulators. - A new benchmark that converts 186 high-fidelity InteriorGS scenes into watertight, simulation-ready meshes using SuGaR. - [S1] **Novel Problem Formulation:** The primary strength of the proposed work is the novel formulation of object placement as a physics-based equilibrium problem. This formulation addresses a clear gap in existing frameworks, which ignore physics awareness during generation. - [S2] **New Benchmark:** The proposed method makes a valuable contribution by proposing a high-fidelity benchmark for this task. The authors reconstruct the scenes from InteriorGS and extract meshes using SuGaR. - [S3] **Novel Pipeline:** In the proposed method, high-level semantic reasoning (intent, size, and location) is performed by LLMs/VLMs, while a physics simulator handles the complex contacts and deformations. The proposed method is highly scalable. - [W1] Total inference time is not reported. - [W2] "Filling Soft Parts so They Behave Like Solids": The authors do not provide any supporting figures for this. Further, there is no ablation study for this. - [W3] The choice of 1 cm is not backed by empirical evidence. What was the reasoning behind choosing this value? This design choice should be thoroughly investigated. - [Q1] Please provide more details on the curated material library (Section 3.3). How many materials are included, and what are the ranges and median values for the key parameters? - [Q2] Can the following experiment be done with the TanksandTemples dataset? Train 3DGS on the dataset, obtain the mesh, and use the proposed pipeline. If it cannot be done, please explain why. - [Q3] Can the proposed pipeline handle out-of-distribution materials? Fully human-written
Deformable Contact-Aware 3D Object Placement Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper uses segmentation and VLM queries to tackle object placement for 2D image editing, simulating the properties of the different objects to produce good contacts and realistic deformations. The extensive description of what is being done helps readers through a difficult method and improves clarity, but this heavy use of page space limits your ability to demonstrate the significance of the method. The lack of qualitative evaluations for a visual task makes it hard to judge this work and understand the improvement over prior works. Even some in the supmat would be sufficient. The results shown so far look interesting and I would like to see more. Any more diagrams that can help more quickly convey the method, so that you can spend more time demonstrating your method's results, would also be beneficial. Are there more evaluations that can be run aside from human preference and VLM evals? This is a task I'm not as familiar with, but I'd love to see more quantitative evals if any more benchmarks exist. Fully human-written
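Since the reviews above refer to DCAP's "drop-to-equilibrium" formulation without spelling it out, the sketch below shows one plausible convergence check for such a rollout: the simulation is stepped until the dropped object's velocities stay below thresholds for a number of consecutive steps. The simulator interface, thresholds, and patience value are hypothetical stand-ins, not the paper's actual criteria.

```python
from dataclasses import dataclass

@dataclass
class BodyState:              # what a hypothetical simulator step would report
    linear_velocity: float    # magnitude, m/s
    angular_velocity: float   # magnitude, rad/s

def settle_until_equilibrium(step_fn, max_steps=2000, lin_tol=1e-3,
                             ang_tol=1e-2, patience=30):
    """Roll the simulation until the dropped object is (approximately) at rest.

    step_fn: callable advancing the simulation one step and returning the
             object's BodyState; a stand-in for whatever simulator DCAP uses.
    patience: number of consecutive below-threshold steps required.
    """
    quiet = 0
    for t in range(max_steps):
        state = step_fn()
        if state.linear_velocity < lin_tol and state.angular_velocity < ang_tol:
            quiet += 1
            if quiet >= patience:
                return t + 1          # converged: stable contact reached
        else:
            quiet = 0                 # object is still moving or bouncing
    return None                       # did not settle within the step budget

# Toy stand-in simulator: velocities decay geometrically toward rest.
def make_toy_sim():
    v = [1.0, 1.0]
    def step():
        v[0] *= 0.95
        v[1] *= 0.95
        return BodyState(linear_velocity=v[0], angular_velocity=v[1])
    return step

print(settle_until_equilibrium(make_toy_sim()))
```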
Reframing Dense Action Detection (RefDense): A New Perspective on Problem Solving and a Novel Optimization Strategy Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses the task of temporal dense action detection, a fundamental and challenging problem in video understanding. The authors propose to decompose dense action labels into two components: action-entity and action-motion, to alleviate the difficulty of modeling overlapping and co-occurring actions. In addition, they employ a noisy contrastive learning objective to provide explicit supervision for co-occurring concepts. Experiments on three benchmark datasets show moderate performance gains compared to prior methods. While the paper is clear and well organized, the conceptual novelty is somewhat limited, and some claims are overstated. The decomposition into entity and motion branches aligns closely with well-established two-stream and relational modeling paradigms in video understanding. Furthermore, the method introduces additional network capacity, making it difficult to disentangle gains due to the proposed design from those due to the larger network. - This paper is well organized and easy to follow. - This paper targets a general and important task for video understanding, temporal dense action detection. - The decomposition of actions into entity and motion components is conceptually intuitive and may help address overlapping action scenarios. - The paper claims novelty in addressing simultaneous temporal and class overlaps, but this challenge has been widely recognized in earlier dense detection and multi-label video models. - The statement (L156–L157) suggesting that prior two-stream networks focus only on low-level spatiotemporal features because they are trained end-to-end lacks conceptual clarity and justification. - The two-stream design introduces increased model capacity, making performance comparisons with single-stream baselines potentially unfair. The ablation studies do not convincingly separate the effects of decomposition from additional parameters. The main technical concerns are outlined in the weaknesses section. A minor question relates to the results. I noticed this manuscript was made public earlier this year, but the results in that version differ from those in the current one. Given that the overall methodology remains largely unchanged, what are the key differences? Lightly AI-edited
Reframing Dense Action Detection (RefDense): A New Perspective on Problem Solving and a Novel Optimization Strategy Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces RefDense, a framework designed for dense action detection that addresses the challenges of temporal and class overlaps through problem decomposition. The approach consists of two key components: first, actions are decomposed into entity and motion components, with dedicated sub-networks tasked to detect each, thereby simplifying individual learning objectives. Second, a contrastive co-occurrence loss leveraging language guidance is proposed to explicitly capture relationships among frequently co-occurring actions, overcoming the limitation of standard binary cross-entropy loss that treats classes independently. The method is evaluated on the TSU, Charades, and MultiTHUMOS datasets. 1. The paper is clearly written and easy to follow. 2. The proposed method demonstrates performance gains ranging from 0.4% to 2.1% across the benchmark datasets. 1. The idea of decomposing actions into entity and motion components bears resemblance to established paradigms like two-stream networks and several recent works. [1] Dual detrs for multi-label temporal action detection, CVPR 2024. [2] Decomposed cross-modal distillation for rgb-based temporal action detection, CVPR 2023. 2. The construction of labels for the sub-networks, specifically the "action-entity" labels, may be problematic. In untrimmed videos, certain entities (e.g., "hammer" in the provided example) might be present throughout the entire video duration, even when the corresponding action is not being performed. This could lead to ambiguous and noisy supervision for the Action-Entity sub-network. The authors should address this potential issue and justify the robustness of their labeling process. 3. The experimental comparisons appear to be limited to other dense action detection methods. To better position the work, it would be valuable to include comparisons with recent state-of-the-art methods in temporal action localization on the same datasets, which would provide a broader perspective on its performance. 4. The ablation studies could be more comprehensive. Key questions remain unanswered: What is the performance of each sub-network (Action-Entity and Action-Motion) when trained and evaluated independently? Is the observed performance gain primarily due to the increased network capacity (using two sub-networks) or the core idea of decomposition? A controlled experiment, for instance, comparing against a single network of comparable parameters, would help isolate the true source of improvement. 5. The figures could be improved for clarity: Figure 1 would benefit from concrete examples of actions (e.g., "pour water") to more directly illustrate the concepts of entity and motion decomposition. There is a typo in Figure 2; the second sub-network is currently labeled "Action-Entity" but should presumably be "Action-Motion." 6. There is a confusing use of the symbol tau in the manuscript. It is used to represent the temperature coefficient in Equation 8 but denotes a window size in Table 2. 
To avoid confusion for the reader, it is strongly recommended to use distinct symbols for these different parameters. 1. What is the fundamental conceptual or technical advancement of your decomposition framework compared to these existing approaches? 2. How does the Action-Entity sub-network distinguish between an entity being merely present versus being actively involved in an action? Could you provide an analysis or examples from the validation set showing that the entity labels are temporally precise and not overly noisy? 3. How would your method, RefDense, perform against these recent temporal localization models in terms of average precision? 4. What is the standalone performance (e.g., on the decomposed task) of the Action-Entity and Action-Motion sub-networks? 5. Is the performance improvement primarily due to the increased model capacity from having two sub-networks? Have you conducted a controlled experiment comparing RefDense against a single, larger network with a comparable number of parameters? Moderately AI-edited
Reframing Dense Action Detection (RefDense): A New Perspective on Problem Solving and a Novel Optimization Strategy Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The manuscript suffers from critical flaws in methodological transparency, experimental rigor, and practical relevance that cannot be addressed through minor revisions. The lack of reproducible details for label decomposition and L_CoLV, confounded generalization experiments, and incomplete engagement with related work undermine the validity of the claimed contributions. To be reconsidered, the authors would need to: (1) fully specify all methodological details (e.g., L_CoLV formulation, GPT-4 prompts), (2) conduct ablation studies to isolate the impact of key components (e.g., parameter count vs. decomposition), (3) validate performance across more diverse qualitative examples, and (4) address practical constraints like computational efficiency and LLM accessibility. This paper proposed a strategy of decomposing the task into detecting temporally dense but unambiguous components underlying the action classes, and assigning these sub-problems to distinct sub-networks 1. The core contributions of RefDense—action label decomposition via GPT-4 and Contrastive Co-occurrence Language-Video Loss (L_CoLV)—lack sufficient detail to support reproducibility and validity 2. The experimental evaluations, while extensive, suffer from biases, unaddressed confounders, and incomplete reporting that undermine the credibility of the claimed performance gains. For example, Confounded Generalization Experiments: When embedding PAT and MS-TCT into the RefDense framework (Table 5), the paper reduces the parameter count of each sub-network (e.g., PAT from 270M to 144M) while claiming "total embedding dimensionality is the same." 3. Inconsistent SOTA Benchmarking: Many SOTA comparisons rely on re-run results (marked †) using the authors’ own code, but fail to validate that experimental conditions (e.g., optimizer hyperparameters, training epochs, data augmentation) match the original papers. 4. Insufficient Discussion of Limitations and Practicality. The paper does not evaluate decomposition performance with open-source LLMs (e.g., LLaMA-3, Mistral) to assess accessibility. Besides, this paper does not discuss the scalability of LLM-based label generation for larger datasets (e.g., beyond 10k videos in Charades). 5. The related work section fails to engage with recent or relevant literature, leading to an inaccurate positioning of RefDense’s novelty. For example, it oversimplifies Two-Stream Networks and neglects Vision-Language action detection precedents: 1. The core contributions of RefDense—action label decomposition via GPT-4 and Contrastive Co-occurrence Language-Video Loss (L_CoLV)—lack sufficient detail to support reproducibility and validity 2. The experimental evaluations, while extensive, suffer from biases, unaddressed confounders, and incomplete reporting that undermine the credibility of the claimed performance gains. 
For example, Confounded Generalization Experiments: When embedding PAT and MS-TCT into the RefDense framework (Table 5), the paper reduces the parameter count of each sub-network (e.g., PAT from 270M to 144M) while claiming "total embedding dimensionality is the same." 3. Inconsistent SOTA Benchmarking: Many SOTA comparisons rely on re-run results (marked †) using the authors’ own code, but fail to validate that experimental conditions (e.g., optimizer hyperparameters, training epochs, data augmentation) match the original papers. 4. Insufficient Discussion of Limitations and Practicality. The paper does not evaluate decomposition performance with open-source LLMs (e.g., LLaMA-3, Mistral) to assess accessibility. Besides, this paper does not discuss the scalability of LLM-based label generation for larger datasets (e.g., beyond 10k videos in Charades). 5. The related work section fails to engage with recent or relevant literature, leading to an inaccurate positioning of RefDense’s novelty. For example, it oversimplifies Two-Stream Networks and neglects Vision-Language action detection precedents: Fully AI-generated
Reframing Dense Action Detection (RefDense): A New Perspective on Problem Solving and a Novel Optimization Strategy Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper tackles the challenges of dense action detection, specifically temporal and class ambiguity, by proposing a decomposed approach. The method breaks down ambiguous actions into unambiguous temporal components, assigning them to specialized sub-networks to simplify temporal overlap resolution. Furthermore, it introduces a language-guided contrastive loss to explicitly model the relationships between co-occurring actions, overcoming the limitations of independent class treatment in standard binary cross-entropy. The approach demonstrates superior performance, achieving substantial gains on TSU, Charades, and MultiTHUMOS benchmarks. + This paper decomposes the complex problem of dense action detection into simpler sub-tasks of detecting unambiguous temporal components, allowing specialized sub-networks to handle temporal overlaps more effectively. + The method demonstrates superior and substantial performance improvements over state-of-the-art methods across multiple challenging benchmark datasets. - The performance gain might be better explained by the sub-networks specializing in foreground entities and actions. This specialization reduces the impact of the background after feature concatenation, which is a perspective that diverges from the authors' stated motivation. - Visualizations and quantitative results for the two sub-networks are missing. A qualitative comparison among the predicted action-entity, action-motion, and final detection results would help readers understand why the method is effective. - The performance improvements shown in Table 3 and Table 5 are incorrect. Please recheck these tables. None Moderately AI-edited
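The RefDense reviews repeatedly contrast the paper's language-guided co-occurrence loss with plain per-class binary cross-entropy. The sketch below is a generic illustration of that contrast — per-class BCE plus an InfoNCE-style term that pulls each frame feature toward the text embeddings of all classes active in that frame — and is not the paper's actual L_CoLV; the loss form, temperature, and weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def bce_plus_cooccurrence_contrast(frame_feats, class_text_embs, labels,
                                   logits, tau=0.07, lam=0.5):
    """Generic illustration (not the paper's L_CoLV): per-class BCE plus a
    contrastive term that aligns each frame feature with the text embeddings
    of all classes active in that frame, so co-occurring classes are no
    longer treated independently.

    frame_feats:     (T, d) frame features.
    class_text_embs: (C, d) text embeddings of the class names.
    labels:          (T, C) multi-hot ground truth.
    logits:          (T, C) classifier outputs.
    """
    bce = F.binary_cross_entropy_with_logits(logits, labels)

    f = F.normalize(frame_feats, dim=-1)
    t = F.normalize(class_text_embs, dim=-1)
    sim = f @ t.T / tau                                # (T, C) similarities
    log_prob = F.log_softmax(sim, dim=-1)
    pos = labels.sum(dim=-1).clamp(min=1)              # active classes per frame
    contrast = -(log_prob * labels).sum(dim=-1) / pos  # average over co-occurring classes
    return bce + lam * contrast.mean()

# Toy shapes: 4 frames, 5 classes, 32-d features.
loss = bce_plus_cooccurrence_contrast(torch.randn(4, 32), torch.randn(5, 32),
                                      (torch.rand(4, 5) > 0.6).float(),
                                      torch.randn(4, 5))
```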
Grid-Based Evolutionary Algorithm for Multi-Objective Molecule Generation Enhanced by Reinforcement Learning Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. In this paper, the authors propose the Reinforcement Learning-Driven Grid-based Fragment-Masked Multi-objective Evolutionary Algorithm (RL-GFM). The authors claim that RL-GFM can overcome the limitations of existing FBDD methods, namely the need to construct and maintain static fragment libraries. - The concept figure aids in understanding the work. - The writing is easy to follow. My concerns are as follows: - The main weakness of this paper is its weak novelty. As the authors mentioned in lines 137~140, the proposed RL-GFM framework comprises three components: (1) a grid-based fragment-masked crossover operator, (2) RL-driven crossover and mutation operators, and (3) a Pareto-optimal solution selector. - However, the grid-based fragment-masked crossover (Section 4.1) is a heuristic and cannot be considered a significant contribution from an ML perspective. Moreover, the idea is very similar to the fragment remasking of GenMol [1], but there is no detailed comparison with this work. - The idea of RL-based genetic operations (Section 4.2) is very similar to those of [2] and [3]. - The idea of using the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to perform multi-objective molecular optimization is from [4] and is not an invention of this work. Overall, I am not convinced that this work provides a novel approach compared to previous methods in the domain. - The authors claim that the limitations of existing FBDD methods are (1) reliance on static fragment libraries, (2) inefficient construction and maintenance of fragment libraries, and (3) lack of interpretability. However, several existing methods have already overcome these limitations. For example, GEAM [5], one of the baselines in this paper, addresses the limited exploration problem by introducing a dynamic fragment library, enables very fast fragment library generation, and provides an interpretable fragment library. - In the PMO experiment (Section 4.1, Figure 3), the SOTA molecular optimization baseline GenMol [1] is missing. --- **References:** [1] Lee et al., GenMol: A Drug Discovery Generalist with Discrete Diffusion, ICML 2025. [2] Fu et al., Reinforced Genetic Algorithm for Structure-based Drug Design, NeurIPS, 2022. [3] Ahn et al., Guiding Deep Molecular Optimization with Genetic Exploration, NeurIPS, 2020. [4] Verhellen, Graph-based molecular Pareto optimisation. Chemical Science, 2022. [5] Lee et al., Drug discovery with dynamic goal-aware fragments, ICML, 2024. Please see the *Weaknesses* section for my main concerns. - Why are the results for f-RAG, the best-performing baseline in Figure 3 and Table 3, missing from Table 2 and Table 4? Fully human-written
Grid-Based Evolutionary Algorithm for Multi-Objective Molecule Generation Enhanced by Reinforcement Learning Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. RL-GFM, a framework that combines reinforcement learning with a multi-objective evolutionary algorithm for molecular generation and optimization, is proposed in this paper. The method integrates a grid-based fragment-masked crossover mechanism with RL-driven crossover and mutation policies. RL-GFM aims to balance multiple objectives during chemical space exploration. The authors extensively evaluate RL-GFM on PMO benchmark tasks and molecular docking tasks, and results demonstrate that RL-GFM achieves superior performance in most tasks. 1. Extensive experiments provide a robust validation of the method. RL-GFM outperforms most baselines in similarity-based and multi-property optimization tasks. 2. The manuscript is well-written and clearly structured. 1. Although the manuscript achieves strong results, it feels technically oriented, integrating grid-based methods, RL, and multi-objective optimization, but lacks a clearly highlighted novel technical contribution. 2. Please refer to the Questions part. 1. Based on the results in Table 3, RL-GFM shows a clear advantage in the mestranol_similarity and valsartan_smarts tasks, while in other tasks its advantage is relatively limited, and in some cases it even performs worse than other algorithms. Can you explain why this phenomenon occurs and why the algorithm performs better specifically in these two tasks? 2. As stated in Line 239, "The crossover policy determines parent pair molecules, which are then split into fragments and randomly recombined to form new molecules". Could you compare how this randomness affects the algorithm? Is its performance robust across multiple runs, and what is the variance? 3. Ablation experiments on the hyperparameters need to be conducted to assess the robustness of the algorithm, for example, for $\epsilon$ and $k$. 4. Compared to other algorithms, what are the advantages of this algorithm in terms of efficiency and real-time computational cost? Lightly AI-edited
Grid-Based Evolutionary Algorithm for Multi-Objective Molecule Generation Enhanced by Reinforcement Learning Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces **RL-GFM**, a reinforcement learning–driven grid-based evolutionary algorithm for molecular generation. The framework combines an NSGA-II multi-objective optimizer with two RL-guided operators for crossover and mutation, alongside a grid-based fragment-masked crossover strategy designed to preserve scaffolds while diversifying side chains. The approach is motivated by the limitations of fragment-based drug discovery and black-box generative models, aiming for more interpretable and efficient exploration of chemical space. The authors evaluate RL-GFM on 23 PMO benchmark tasks and on docking experiments against five protein targets (PARP1, FA7, 5HT1B, BRAF, JAK2). Results show that the method achieves the highest cumulative optimization score across PMO tasks, with notable improvements in similarity and multi-property optimization challenges, and demonstrates superior balance among diversity, novelty, and synthesizability. In docking experiments, RL-GFM delivers the best mean binding affinities across all targets and a higher ratio of novel hits compared to state-of-the-art baselines. Ablation studies further confirm that both the grid-based method and RL-guided operators contribute substantially to performance. Overall, RL-GFM demonstrates strong empirical results and highlights the potential of combining RL and evolutionary search for multi-objective drug design. The paper's main strengths lie in its well-motivated methodological contributions and strong empirical validation. First, it introduces a novel integration of RL-guided crossover and mutation operators within an NSGA-II optimization framework, which enables more adaptive and targeted exploration of chemical space compared to purely stochastic evolutionary search. Second, the grid-based fragment-masked crossover is a creative mechanism that preserves core scaffolds while systematically diversifying side chains, effectively balancing convergence and diversity in multi-objective optimization. Third, the extensive experiments on both PMO benchmarks and docking tasks highlight clear improvements over state-of-the-art baselines, demonstrating that the proposed approach not only achieves higher optimization scores but also discovers more novel and synthesizable compounds. These contributions collectively show that RL-GFM is a promising framework for advancing multi-objective molecular generation. - **PMO benchmark setup**: Some PMO objectives are inherently contradictory with diversity (e.g., rediscovery). Reporting diversity alongside such objectives can be misleading, since the optimal solutions by definition are structurally constrained. - **Interpretability claim**: The contribution of interpretability is not well supported. Although mentioned in the Introduction, Related Work, and Conclusion, the Methods and Experiments sections do not provide any explanations, metrics, or case studies to illustrate how the approach enhances interpretability in practice. 
- **Relation to prior work (RGA)**: The combination of reinforcement learning with genetic algorithms for molecular design has already been introduced in RGA (*Reinforced Genetic Algorithm for Structure-based Drug Design*, NeurIPS 2022). This work is not cited or discussed, which risks overclaiming novelty. - **Typos and formatting issues**: - Line 98: "modifing" should be "modifying" - Figure 1: "offspring moleculse" should be "offspring molecules" - Table 3: The standard deviations of RL-GFM results are not consistently reported to three decimal places. - Could the authors provide evaluation curves showing performance over iterations of the RL process? This would help illustrate convergence behavior and stability during training. - The grid-based strategy partitions the objective space by properties, yet the underlying search operates on molecular structures. Given that structure–property relationships can be highly non-convex and complex, how do the authors justify the effectiveness of applying a grid-based approach, which is more common in convex optimization contexts? - Why is the Synthetic Accessibility (SA) score incorporated as an additional oracle in the PMO experiments? How does SA relate to the original PMO objectives, and could this addition potentially bias the optimization outcomes? Fully AI-generated
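To make the grid-based question above concrete, here is a minimal sketch of how a grid over a two-objective space is commonly built in grid-based evolutionary algorithms: each candidate is mapped to a cell index from its normalized objective values, and crossover parents can then be drawn per cell. This is a generic illustration under assumed bin counts and toy scores, not the paper's actual operator.

```python
import numpy as np

def assign_grid_cells(objectives, bins=5):
    """Map each candidate's objective vector to per-objective grid bins.

    objectives: (N, M) array of M objective values per candidate (higher = better).
    Returns an (N, M) integer array of bin indices; a row identifies the cell.
    """
    lo, hi = objectives.min(axis=0), objectives.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)              # avoid division by zero
    normalized = (objectives - lo) / span               # rescale to [0, 1]
    return np.minimum((normalized * bins).astype(int), bins - 1)

# Toy usage: 6 candidates scored on 2 objectives (e.g., QED and an SA-like score).
scores = np.array([[0.90, 0.20], [0.40, 0.80], [0.10, 0.10],
                   [0.70, 0.70], [0.30, 0.50], [0.95, 0.90]])
print(assign_grid_cells(scores, bins=3))
# Candidates sharing a cell are close in objective space but may still be very
# different structurally, which is exactly the non-convexity concern raised above.
```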
Grid-Based Evolutionary Algorithm for Multi-Objective Molecule Generation Enhanced by Reinforcement Learning Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes **RL-GFM**, a reinforcement learning–driven grid-based evolutionary algorithm for multi-objective molecular generation. The method integrates two RL agents that guide crossover and mutation operators within a grid-based fragment-masked crossover framework, aiming to balance scaffold preservation with side-chain diversity while improving exploration efficiency. The optimization loop is built on NSGA-II, enabling multi-objective trade-offs across property optimization, synthesizability, and docking scores. Experiments are conducted on the 23 PMO benchmark tasks and several protein docking targets (PARP1, FA7, 5HT1B, BRAF, JAK2), showing that RL-GFM achieves state-of-the-art performance in both property optimization and docking, with higher novelty and diversity than baselines. An ablation study indicates that both the grid-based mechanism and the RL-guided operators contribute to the observed improvements. The authors position RL-GFM as a more interpretable, efficient, and robust alternative to prior fragment-based or black-box generative methods. 1. **Essential problem**: The paper tackles an important and timely challenge in molecular generation: balancing quality, diversity, and synthesizability under multi-objective constraints. By focusing on interpretable fragment-level operators and efficient oracle usage, it addresses practical limitations of existing fragment-based and black-box generative approaches. 2. **Fluent writing and logical flow**: The manuscript is generally well-written, with a clear logical progression from motivation to methodology, experimental setup, results, and conclusions. Figures and tables are well-structured and support the narrative effectively, making the contributions easy to follow. 3. **Abundant experiments and comparisons**: The experimental evaluation is extensive, covering both the PMO benchmark (23 tasks) and five protein docking targets. The paper compares against a wide range of strong baselines, and also includes ablation studies to validate the contributions of individual components. ### **Major** 1. **Method novelty**: The idea of **RL-based crossover and mutation** has already been explored in the Reinforced Genetic Algorithm (RGA) [1] for molecular generation, yet this prior work is not cited or discussed. 2. **Limitations of the grid-based method**: While the proposed **grid-based fragment-masked crossover** is novel, it has several drawbacks: - It intrinsically assumes a multi-objective setting. For single-objective problems, an artificial second objective (e.g., SA in the PMO benchmark) must be introduced, which may be unreasonable. For example, in PMO rediscovery tasks, the main goal is to find molecules similar to a target. If the target has poor SA, adding SA as an explicit objective may actively hinder optimization. In general, introducing arbitrary auxiliary objectives is not a principled solution. - The assumption of “intra-grid convergence” may not hold in practice, especially when optimal points in chemical space are sparse [2]. 
For many objectives, such as docking scores, crossover offspring often perform worse than both parents. Thus, the theoretical premise behind the grid strategy is questionable for real-world property landscapes. 3. **Interpretability claim**: The paper repeatedly emphasizes interpretability, but it is unclear how the generated structures can be interpreted beyond standard fragment recombination. No concrete explanation or case study is provided to support this claim. 4. **PMO experimental setup**: - Several PMO objectives inherently conflict with diversity, such as rediscovery and similarity tasks, making the evaluation of diversity unreasonable in those settings. - Ten PMO objectives are themselves multi-property objectives (MPOs) originating from GuacaMol [3]. Since this paper emphasizes multi-objective molecular generation, it is unclear why those built-in multiple properties were not directly used as objectives, and instead a new SA score was introduced. 5. **Docking experimental setup**: - Comparisons are reported based on the top-5% molecules, but this subset may vary greatly in size across methods. A fixed set size (e.g., top 100) would make the comparison more consistent and fair. - No diversity metrics are reported for docking tasks, leaving it unclear whether RL-GFM achieves good diversity in addition to docking performance. ### **Minor** 1. No analysis of runtime or computational efficiency is provided. 2. Since no code or data is released, a reproducibility statement is recommended. 3. In Table 3, the standard deviations of RL-GFM results should be reported with the same precision (3 decimal places) as the baselines. [1] Tianfan Fu, et al. "Reinforced Genetic Algorithm for Structure-based Drug Design." NeurIPS, 2022. [2] Austin Tripp, et al. "An evaluation framework for the objective functions of de novo drug design benchmarks." ICLR workshop, 2022. [3] Nathan Brown, et al. "GuacaMol: benchmarking models for de novo molecular design." JCIM, 2019. 1. Could the authors clarify whether their grid-based crossover achieves unique benefits beyond efficiency compared to exhaustive parent pairing? In other words, are the gains due to pruning unpromising combinations, or does the grid introduce qualitatively new recombinations that would not appear in a full GA? 2. The initial population of molecules significantly determines the quality of the whole generation process, but this is not studied. For example, if the initial population does not contain a certain substructure, that substructure can never be generated throughout the fragment-based process. How can this be overcome? In addition, the standard deviations in Table 3 are very low. Are the runs initialized with different randomly selected molecular sets? 3. For the PMO tasks, why do you report the top-10 AUC score and the top-100 diversity instead of a consistent set size? In addition, why is RL-GFM evaluated under a budget of 3,000 oracle calls instead of 10,000? Fully AI-generated
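The summary and questions above refer to the NSGA-II loop and exhaustive parent pairing; for reference, here is a minimal sketch of the non-dominated (Pareto) filtering step that NSGA-II-style selectors rely on. It is a generic illustration with made-up toy objective values, not the paper's implementation.

```python
import numpy as np

def pareto_front(objectives):
    """Return indices of non-dominated candidates (all objectives maximized).

    Candidate j dominates candidate i if it is >= on every objective and
    strictly > on at least one.
    """
    n = objectives.shape[0]
    dominated = np.zeros(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i == j or dominated[i]:
                continue
            if np.all(objectives[j] >= objectives[i]) and np.any(objectives[j] > objectives[i]):
                dominated[i] = True
    return np.flatnonzero(~dominated)

# Toy usage: negated docking score, QED, and an SA-like score (higher = better).
pop = np.array([[8.1, 0.6, 0.7], [7.5, 0.9, 0.8], [8.3, 0.4, 0.5], [6.0, 0.5, 0.6]])
print(pareto_front(pop))   # [0 1 2]; the last candidate is dominated and dropped
```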
ChartAlignBench: A Benchmark for Chart Grounding & Dense Alignment Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes ChartAlignBench, a benchmark on chart dense grounding and alignment. The authors define dense grounding and alignment as identifying data variations and chart element differences such as colors, styles, and legends. The dataset contains 9k pairs of chart images. The visualizations are generated by perturbing certain code content with an LLM. The authors also contribute a two-stage evaluation pipeline and show weaknesses in some VLMs. + The two-stage evaluation pipeline makes the evaluation more rigorous, with ablations supporting this. + The dataset size, 9k, is sufficiently large. + The presentation of the paper is overall easy to follow. - My central concern is the utility of the benchmark. I don't agree with the authors that "real-world use cases often require comparing similar charts to detect subtle differences among the charts". While it might be true that sometimes one needs to compare a series of visualizations to check differences in data, I cannot imagine tasks such as comparing colors of chart elements or fonts of text being at all common, but these questions take up a significant portion of the benchmark. - The models evaluated are quite old. The best model, GPT-4o, is more than one year old at this point. No reasoning models are evaluated. Reasoning models have made tremendous progress in chart grounding, so I suspect the numbers will be a lot higher. - Most questions assess some form of perception, and much of it is meaningless or unanswerable. For things like identifying the font weight, you simply cannot do this reliably: if you fix the font weight and render the whole chart image at a larger size, the text is going to look bigger. Overall I just don't find tasks like "comparing the hex color of chart elements" to be an interesting exercise. - The visualizations showing model performance are pretty difficult to parse. Presenting them in a table format would be better. - Given that the tasks cover a fairly fixed portion of the visualization design space (e.g., color, legend placement, fonts), I would expect finetuning to be really helpful, which the authors did not perform. - How does the two-stage pipeline work for non-data-alignment questions? - What does Figure 9a mean? What does the size encoding channel encode? Fully human-written
ChartAlignBench: A Benchmark for Chart Grounding & Dense Alignment Soundness: 3: good Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces the ChartAlign Benchmark (ChartAB), a benchmark designed to evaluate the fine-grained perceptual abilities of Vision-Language Models (VLMs) in chart understanding. The authors argue that existing benchmarks focus on simple QA, failing to assess detailed comprehension. ChartAB addresses this gap by proposing two core tasks: (1) chart grounding, which requires VLMs to extract structured information, such as tabular data and visual attributes (e.g., color, text style, legend position), into a JSON template, and (2) dense alignment, which uses a two-stage inference workflow to test a model's ability to compare and identify subtle differences between two similar charts. By evaluating several recent VLMs, the authors reveal significant weaknesses, biases, and hallucinations in current models' abilities to perceive complex chart structures, demonstrating that these foundational skills are critical for robust downstream reasoning. The authors have conducted abundant experiments. - The paper's core terms, "grounding" and "alignment," are used in a non-standard and confusing way. In V&L, "grounding" typically implies localizing language to specific spatial regions (e.g., bounding boxes). This paper redefines it as extracting information into a structured textual (JSON/CSV) format. This is more accurately described as data extraction or structured representation. Similarly, the term “alignment” usually refers to mapping representations between modalities (e.g., vision and language), or making models conform to certain preferences. The paper uses it to mean a simple comparison between two already extracted textual representations to find differences. This terminological choice overstates the novelty and creates confusion. - The practical utility of the "Attribute Grounding & Alignment" task is not well-motivated. The paper claims comparing attributes is an "essential skill", but it's unclear why a user would need a VLM to identify that one chart uses a "bold" font and another uses a "normal" font, or that one legend is on the left and another is on the right. The other two tasks also do not seem very practical but slightly more useful than "Attribute Grounding & Alignment". - ChartAB builds heavily on the existing ChartX dataset, with perturbations applied to create pairs. While this adds pairs for comparison, it doesn't sufficiently differentiate from prior chart benchmarks (e.g., ChartQA, CharXiv, MultiChartQA), which already cover QA, multi-hop reasoning, and multi-chart tasks. The focus on "dense" extraction feels incremental rather than groundbreaking, especially since similar grounding approaches (e.g., DePlot for image-to-CSV) are cited but not substantially advanced. - While the paper identifies VLM weaknesses like hallucinations, biases, and perceptual inaccuracies, it stops at analysis without proposing concrete enhancements (e.g., fine-tuning strategies or architectural changes). This makes the benchmark feel more diagnostic than constructive, limiting its impact on advancing VLMs for chart tasks. What is the practical, real-world scenario for the "Attribute Alignment" task? Fully AI-generated
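To make concrete what "grounding then alignment" amounts to under the benchmark's definitions discussed above (extraction into a structured record, followed by a textual comparison), here is a minimal sketch; the field names and values are illustrative, not the benchmark's actual JSON schema.

```python
def align_chart_records(rec_a, rec_b):
    """Compare two extracted chart records and report fields whose values differ."""
    diffs = {}
    for key in rec_a.keys() | rec_b.keys():
        if rec_a.get(key) != rec_b.get(key):
            diffs[key] = (rec_a.get(key), rec_b.get(key))
    return diffs

# Toy usage with made-up fields (not the benchmark's real schema).
chart_a = {"legend_position": "left",  "bar_color": "#1f77b4", "data": {"Q1": 10, "Q2": 12}}
chart_b = {"legend_position": "right", "bar_color": "#1f77b4", "data": {"Q1": 10, "Q2": 15}}
print(align_chart_records(chart_a, chart_b))
# e.g. {'legend_position': ('left', 'right'), 'data': ({'Q1': 10, 'Q2': 12}, {'Q1': 10, 'Q2': 15})}
```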
ChartAlignBench: A Benchmark for Chart Grounding & Dense Alignment Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces ChartAB, a novel framework for evaluating/probing VLMs (VLMs) on chart grounding (extracting structured data and attributes from individual charts) and "dense alignment" (identifying fine-grained differences between chart pairs) tasks. ChartAB derives from ChartX, retaining good diversity of domains and chart types over 9k examples, but is structured around spreadsheet extraction; attribute extraction for visual encoding colors, legend location, and text style; and contrasting pairs or charts partially differing in data or attribute. Authors argue that VLMs must be evaluated by sequentially performing grounding and then dense alignment, with a JSON-like interface between the two, and proceed to analyze 4 open-source VLMs + GPT-4o through the lens of this benchmark. 1. Drawing relationships between charts ("alignment") on a panel or dashboard is a relevant motivation, and it is indeed a relative blind spot in the data resources space compared to individual charts. 2. The number of examples ("9k pairs"---although an exact number would be good) is good, as well as the diversity of chart types. The separation between data grounding and visual encoding understanding is also meaningful, with the "alignment" part being the main novelty and contribution. The paper presentation is confusing and excessively repetitive in some parts. Importantly: 1. Being a benchmark, it is crucial to report the exact performance measurements of the baselines tested. Without them, the current choice of charts through Section 4 feels unclear and inappropriate. For example, on Figure 5, why do we only see a radar chart with nearly overlapping results exclusively for the one cell configuration (whereas the benchmark included two and three cells)? A well organized table is crucial for communicating exactly how far the current set of tested models go on the benchmark's task(s), along with a clear definition of the performance measure(s) close to the table. This feedback is applicable throughout Section 4. 2. Lines 222-223 refer to Appendix A.7.2 implying that the 2-stage pipeline is unequivocally better, but on Table 1 the 1-stage multi-chart configuration is superior on 3 out of 9 chart types (3D Bar, Radar, and Box). Table 1 should be better organized (see #1 above---specifically, performance definition should be clearly linked, and whether higher/lower is better) and these mixed results should be more transparently discussed. 3. VisText [1] and InsightBench [2] should be cited, as much of the value from drawing relationships between charts is to be able to draw insights and understand trends, which are the topic of these other data resources. 4. Examples of writing that can be improved include: i. Lines 168-172 are repetitive w.r.t. lines 162-167. ii. Lines 312-319 feel particularly confusing. iii. Sentences in lines 91-92 and 338-339 could skip the break and be merged. [1] Tang, Benny, Angie Boggust, and Arvind Satyanarayan. "VisText: A Benchmark for Semantically Rich Chart Captioning." In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7268-7298. 2023. 
[2] Sahu, Gaurav, Abhay Puri, Juan A. Rodriguez, Amirhossein Abaskohi, Mohammad Chegini, Alexandre Drouin, Perouz Taslakian et al. "InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation." In The Thirteenth International Conference on Learning Representations. As described in weakness #1: For each task in the benchmark, could the authors include a table with the exact performance measurements of the baselines tested? As described in weakness #2: Could the authors provide more details on the mixed results reported on Table 1? Fully human-written
CoLaP: Contrastive Learning with Adaptive Prompts for Continual Learning Soundness: 1: poor Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes CoLaP, a multimodal framework for continual learning that introduces language-guided prompt selection. It uses textual descriptions from an instruction-tuned vision-language model to form semantically clustered prompt pools and to train a language-aligned visual selector via contrastive learning and distillation. During inference, the model operates purely in the visual domain. Experiments on multiple in-domain and OOD benchmarks show that CoLaP surpasses L2P and DualPrompt. 1. The paper addresses the problem of out-of-distribution generalization in continual learning from a novel and interesting perspective. 2. The method of leveraging language to guide a visual prompt selector is intuitive and presents a promising research direction for improving robustness. 1. The method is only compared against two baselines from 2022 (L2P, DualPrompt), while omitting more recent and relevant prompt-based methods such as CODA-Prompt and ProgPrompt (which are also mentioned in the related work), as well as DIKI [1]. Comparisons against other families of CL methods, such as regularization-based or LoRA-based approaches like SD-LoRA [2], are also missing. 2. While hyperparameters are ablated, there is no ablation analysis of the impact of the KL-divergence loss or the teacher-student distillation framework. 3. The term "Adaptive" in the title is not well-justified or explained in the paper. The abstract and methodology instead focus on "language-guided prompt selection," creating an inconsistency in the paper's framing. 4. The proposed pipeline introduces computational overhead and external dependencies. It requires a large VLM to generate descriptions, another model for text embeddings, and a clustering step. This complexity and the unanalyzed robustness of the generated descriptions may limit the method's practical application. 5. The lack of source code and the generated textual description files raises concerns about the work's reproducibility. [1] Tang L, Tian Z, Li K, et al. Mind the interference: Retaining pre-trained knowledge in parameter efficient continual learning of vision-language models[C]//European conference on computer vision. [2] Wu Y, Piao H, Huang L K, et al. SD-LoRA: Scalable Decoupled Low-Rank Adaptation for Class Incremental Learning[C]//The Thirteenth International Conference on Learning Representations. 1. How do the inference-time FLOPS or latency of CoLaP compare to the baseline methods? 2. Is there a risk of semantic leakage if the generated textual descriptions are overly correlated with the class names, and how is this possibility addressed? Fully human-written
CoLaP: Contrastive Learning with Adaptive Prompts for Continual Learning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents CoLaP, a language-guided prompt selection framework for continual learning that enhances robustness against distribution shifts. Unlike prior methods that rely solely on visual encoders, CoLaP incorporates textual descriptions during training to provide semantic guidance for prompt selection, enabling more reliable and adaptive representations. A concept-clustered prompt pool captures dataset-specific distributions, while inference remains purely visual to ensure efficiency. By leveraging the knowledge space of language models, CoLaP facilitates better alignment between current and future tasks, improving knowledge transfer and reducing forgetting. Extensive experiments on in-distribution and out-of-distribution benchmarks demonstrate that CoLaP significantly outperforms several purely visual prompt methods in both generalization and scalability. 1. The paper introduces a language-guided prompt selection framework that integrates multimodal semantic guidance to address distribution shifts, improving robustness and generalization in continual learning. 2. It provides extensive experimental validation on both in-distribution and out-of-distribution benchmarks to support its claims of improved performance and scalability. 1. The sentence “CoLaP is the first approach to integrate textual representations into the prompt selection stage for CL method that operates over the visual domain.” appears to overstate the novelty of the contribution, as prior work such as LGCL (Khan et al., 2023) also leverages textual embeddings for prompt selection. Although the authors already discuss LGCL in the related work, this claim in the contribution section should be rephrased to more accurately reflect the distinction between CoLaP and existing methods. 2. The review of traditional continual learning methods is not comprehensive. In addition to regularization- and memory-based methods, architecture-based approaches [a-f] should also be discussed to better contextualize the contribution. [a] Lifelong learning with dynamically expandable networks, ICLR18 [b] Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. ICML19 [c] Beef: Bi-compatible class-incremental learning via energy-based expansion and fusion. ICLR23 [d] Overcoming catastrophic forgetting with hard attention to the task. ICML18 [e] Compacting, picking and growing for unforgetting continual learning. NeurIPS19 [f] Parameter-level soft-masking for continual learning. ICML23 3. The notations in Figure 2 should be clarified with a legend or a more detailed caption. In addition, the prompt pool appears to be missing and should be explicitly annotated in the figure. 4. The definition of "key" is confusing. The authors state that keys are integer labels but also mention clustering centroids as keys. If the intent is that the prompt index is an integer while the key embedding is a fixed (non-learnable) vector, this should be clearly and consistently described. 5. The current method section only introduces the loss for the selector, but does not describe the overall training objective, including how prompts are trained.
A more complete description would improve clarity and reproducibility. 6. The statement "The prompt selector, which is composed of the projection head $f_\beta$ and student network $f_\delta^s$, …" is inconsistent with the earlier definition $f_s = f_\delta^s \odot f_\beta \odot f_\omega$. This discrepancy should be resolved for consistency. 7. The experimental comparison is limited to L2P and DualPrompt. It should also include recent prompt-based methods such as HiDe-Prompt, S-Prompt, CODA-Prompt, ProgPrompt, LGCL, VQ-Prompt, Cprompt, etc. Moreover, the BWT metric underperforms in most cases, which weakens the claim of reduced forgetting and warrants further analysis or discussion. 8. The number of cluster centroids Q is fixed across datasets and tasks, which may limit adaptability. An adaptive strategy could potentially improve performance. 9. Minor issues: 1) The font size in Figure 2 is too small and could be increased for readability. 2) Equations are suggested to be numbered for easier reference. 1. Could the authors rephrase the statement of being the first approach to integrate textual representations into the prompt selection stage to more accurately reflect the nature of their contribution? 2. Could the authors include a more comprehensive discussion of architecture-based methods and other relevant prompt-based approaches to better position CoLaP in the broader CL landscape? 3. Could the authors provide a more complete methodological description that includes all loss terms and their interactions to enhance clarity and reproducibility? 4. Could the authors further clarify the keys, specifically whether they are integer indices, fixed embeddings, or learnable representations? 5. Could they reconcile the notation of the selector components for internal consistency across the text? 6. Could the authors report or discuss the performance of CoLaP compared with more recent prompt-based methods such as HiDe-Prompt, S-Prompt, CODA-Prompt, ProgPrompt, and LGCL, which are not currently included as baselines? 7. Since the reported BWT metric underperforms in several cases, could the authors provide additional analysis or justification to support their claim of reduced forgetting, or explain why this metric may not fully capture the benefits of CoLaP? 8. Did the authors explore or consider adaptive or dataset-dependent strategies for determining the number of cluster centroids Q, and if not, could they discuss why a fixed value was chosen? Fully AI-generated
CoLaP: Contrastive Learning with Adaptive Prompts for Continual Learning Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper introduces CoLaP (Contrastive Learning with Adaptive Prompts), a multimodal continual learning framework that integrates language-guided prompt selection to mitigate catastrophic forgetting. Unlike prior visual-only prompt methods, CoLaP leverages auto-generated textual descriptions to cluster semantically aligned prompts, training a visual selector via contrastive and distillation losses. At inference, it operates purely on visual data, maintaining efficiency. Extensive experiments on in-domain and out-of-domain benchmarks (e.g., TinyImageNet, ImageNet-O) show CoLaP achieves superior generalization and stability, outperforming L2P and DualPrompt. The method effectively balances plasticity and stability, highlighting language-informed prompting as a promising direction for robust continual learning. 1. The paper uses language-guided prompt selection that is trained once and used visually at test time. CoLaP aligns a visual selector to language embeddings via contrastive + distillation losses, then drops text at inference, preserving efficiency while improving selection robustness. 2. It clusters auto-generated captions to form prompt keys, encouraging concept sharing across related classes and reducing interference. 3. OOD robustness: consistent gains on ImageNet-O, Oxford-IIIT-Pet, etc., showing >5% improvements over L2P in key OOD settings. 4. Maintains stability as tasks increase (5→20) with sensible top-K prompt retrieval. 1. While the specific combination of contrastive alignment + discrete prompt keys is new, its conceptual ingredients (language guidance, multimodal contrastive learning, and prompt tuning) are well explored (e.g., LGCL ICCV 2023, Roy CVPR 2024, Progressive Prompt ICLR 2023). 2. Limited baselines: The paper claims SOTA but omits contemporary multimodal CL baselines (Roy 2024; LGCL 2023; PromptAlign NeurIPS 2024). Without these, the improvement claims can't be trusted across modalities. 3. The model is highly dependent on the text captions, yet there is no sensitivity or noise ablation. If captions are incorrect or generic, how robust is the contrastive alignment? This is a critical missing analysis since the approach's strength hinges on textual fidelity. 4. The alignment network is trained on task-specific data; as it is trained on one task after another, this network itself will suffer from forgetting. How does the paper handle this? 5. While discrete prompt indices reduce memory, they remove continuous similarity structure. This may harm fine-grained transfer or incremental compositional reasoning; please provide an ablation on this. Compare with recent SOTA models and address the points discussed in the weaknesses section. Heavily AI-edited
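For context on what "clustering auto-generated captions to form prompt keys" typically involves, here is a minimal sketch using k-means over caption embeddings; the shapes, the frozen text encoder, and the value of Q are assumptions for illustration, not CoLaP's actual pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed setup: embeddings of VLM-generated captions for the current task,
# e.g. from a frozen text encoder (200 captions, 512-dim; shapes are illustrative).
caption_emb = np.random.randn(200, 512).astype(np.float32)

Q = 8  # assumed number of cluster centroids (prompt keys) for this task
kmeans = KMeans(n_clusters=Q, n_init=10, random_state=0).fit(caption_emb)

# Each centroid becomes the key of one prompt in the pool; each caption (and the
# image it describes) gets the integer index of its nearest centroid as a target
# for training the visual selector.
prompt_keys = kmeans.cluster_centers_   # (Q, 512) fixed, non-learnable keys
key_targets = kmeans.labels_            # (200,) target key index per sample
```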
CoLaP: Contrastive Learning with Adaptive Prompts for Continual Learning Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. CoLaP argues that multimodal LLMs, having been trained on a large number of concepts, are better suited to encode task information compared to visual-only models that rely solely on visual features for prompt selection. The limited knowledge of visual models can significantly harm accuracy, especially on out-of-domain datasets. CoLaP introduces a novel contrastive learning based method that allows the use of language models by feeding images into a vision–language model to generate text embeddings, then finding cluster centers for these embeddings, which serve as the keys for each task. A classifier generates a categorical distribution over these keys to select the appropriate prompt. The framework further includes a teacher–student network, where the student distribution is trained to match the teacher's. At inference time, the student network selects the top-K prompts. - CoLaP introduces a novel perspective by using multimodal LLM embeddings to train a prompt selector. Unlike ViTs, which are trained on a much smaller set of visual concepts, multimodal LLMs are exposed to a vast range of concepts, reducing semantic misalignments. - CoLaP proposes a novel framework that leverages multimodal LLM embeddings for prompt selection in downstream tasks. - CoLaP outperforms vision-based prompt selection methods such as L2P and DualPrompt, particularly on out-of-distribution tasks. - Comparison is incomplete. Many later works that can potentially outperform are not compared. 1. RanPAC: Random Projections and Pre-trained Models for Continual Learning 2. Dynamic Integration of Task-Specific Adapters for Class Incremental Learning 3. Adapter Merging with Centroid Prototype Mapping for Scalable Class-Incremental Learning, and many more. Even within prompt-based methods, the authors have not compared with later works such as CODA-Prompt (CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning). - It is also unclear how exactly the prompts are trained. The paper talks about 'The global prompt pool is updated by simply adding these centroids and their associated set of learnable prompt values,' but does not mention how these 'learnable prompt values' are actually learned. Do we train them separately after training the prompt selector? Do we train them in parallel? Do you generate t' using the learnable prompts? Do you simply select the correct task prompts during training and train them using cross-entropy loss? It is unclear. - The main figure is also vague. A key explaining the meaning of the elements (arrows, colors) in the figure would be a nice addition. - What is the teacher network? Is this an MLP? Is this also trained? - Table 2 only compares CoLaP results and does not include other methods. - Reporting average performance would be nice. - No analysis of generated captions is shown. They could contain errors as well. - Per task, backward transfer curves and accuracy curves would be nice to look at. - What is in Table 5? Is this the projector that generates t'? - The paper needs a lot more polishing. Important details are missing.
The figure needs to be improved, and more comparisons are needed as well. - How are you training the prompt values? Do you also train the old prompts when a new task arrives, since the prompt selector outputs a distribution over the whole pool at task i? Or do you keep old prompts frozen when a new task arrives? Please be detailed. - What is in Table 5? - Why aren't you comparing with the latest methods? How does your method compare with the latest methods? - Some detailed ablations on design choices of the networks used (MLPs, LLaVA, and the text embedder) would be helpful. Add some analysis of captions and their effect on performance. - Accuracy curves are also required to show the average accuracy per task during training. Fully human-written
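Several of the questions above concern the teacher–student selector and top-K retrieval; the following is a minimal sketch of how such a distillation objective and inference-time selection are usually wired up, under assumed shapes and temperature, and is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def selector_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence pushing the visual (student) distribution over prompt keys
    toward the language-informed (teacher) distribution."""
    t = F.softmax(teacher_logits / temperature, dim=-1)
    s = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean")

def select_prompts(student_logits, k=3):
    """At inference, pick the top-k prompt indices from the student alone."""
    return torch.topk(student_logits, k, dim=-1).indices

# Toy usage: a batch of 4 images scored against a pool of 16 prompt keys.
student, teacher = torch.randn(4, 16), torch.randn(4, 16)
loss = selector_distillation_loss(student, teacher)
chosen = select_prompts(student, k=3)   # (4, 3) prompt indices per image
```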
Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose an instruction-following tuning framework, PDDL-Instruct, to improve the symbolic planning capabilities of LLMs using chain-of-thought reasoning. The solution decomposes planning verification into atomic reasoning steps and incorporates the resulting structure into instruction tuning. PDDL-Instruct integrates external verification feedback from a plan validator (i.e. VAL by Howey et al. (2004)) during training to guide and refine the model's planning and reasoning outputs. Experimental results on three different planning domains from PlanBench show that the tuning paradigm introduced by PDDL-Instruct results in more capable planning models, substantially outperforming naive instruction tuning baselines. - The proposed framework results in significant improvements in the underlying LLMs' planning capabilities across different symbolic planning domains. - The use of an external plan validator, VAL, is well-aligned with the challenges of the domain, avoiding over-reliance on LLMs' imperfect self-reflection. - The paper responsibly acknowledges the limitations of the proposed framework, such as restricted PDDL feature coverage, and outlines promising avenues for improving self-verification capabilities and broader domain coverage. - The paper primarily compares the proposed PDDL-Instruct framework against naive instruction tuning baselines, where the LLMs are fine-tuned without the logical CoT and verification feedback mechanisms. It would be useful to compare or at least relate the performance of PDDL-Instruct to other symbolic planners tested on the selected domains from PlanBench. - While splitting training into initial and CoT instruction phases boosts the end performance, the paper would benefit from clearer explanation and explicit details regarding test set construction. - How are test problems selected and controlled for overlap or similarity with training domains and instances to ensure cross-domain generalisation evaluation? - The paper would benefit from a more detailed analysis of employing the specialised loss functions used in the experiments against more conventional alternatives in this space (e.g., against the negative log-likelihood instead of the $\mathcal{L}_{\text{plan}}$, as described in Appendix B.2.2). Such a comparison could provide insight into the impact and necessity of the designed loss components. - As far as I understand, a separate model was trained for each benchmark, including all relevant training phases. It would be interesting to explore how a model jointly trained across all three domains would perform. Fully human-written
Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes PDDL-INSTRUCT, an instruction-tuning framework to improve LLMs’ classical planning in PDDL domains. The central move is to turn plan generation into explicit state–action–state chains and supervise them with a formal validator (VAL) that checks preconditions, effects, and goal satisfaction. Training is staged: first general instruction-tuning with explanations (including negative examples), then a chain-of-thought phase with two objectives: a step-level “reasoning-chain” loss over ⟨s_{i−1}, a_i, s_i⟩ triplets and a final plan-validity loss. On PlanBench-style tasks, reported accuracy gains over base models and a Phase-1-only ablation are large, with the strongest results when using detailed validator feedback and more feedback iterations. The contribution is best viewed as an integrative advance rather than a new paradigm. Each major component has precedent, namely instruction tuning, chain-of-thought prompting, and external verification, but the paper’s value lies in packaging them into a coherent training loop that treats symbolic planning as verifiable, stepwise reasoning and shows strong empirical gains. The two-stage loss that first optimizes step-level reasoning chains and then end-task validity is a nice, concrete design choice; using detailed validator feedback to supervise CoT at the triplet level is what most clearly differentiates this from prior self-critique/reflection approaches. 1. Clear problem framing: LLMs often miss formal action applicability and state updates. The paper targets that gap directly. 2. Explicit verifiability: grounding CoT in VAL feedback forces faithfulness to domain dynamics and mitigates ungrounded story-like CoT. 3. Detailed documentation: The appendix provides extensive implementation details, hyperparameters, prompts, and mathematical formulations, facilitating reproducibility. 4. Empirical signal: consistent improvements across Blocksworld, Mystery Blocksworld, and Logistics; detailed vs binary feedback ablation is informative. 5. Clear idea & engineering pipeline. Combining structured CoT outputs (explicit ⟨s, a, s′⟩ steps) with an external symbolic verifier (VAL) is a natural and well-motivated approach to close the gap between natural-language reasoning and formal symbolic planning. The proposed two-stage optimization (reasoning loss then final task loss) is intuitively sensible. 1. Scope: restricted to classical PDDL without conditional effects, temporal/durative actions, or costs; focuses on satisficing rather than optimal planning. 2. Generality: results are on three domains; transfer to richer planning languages and real-world robotics pipelines is untested. 3. Data and compute loop: accuracy improves with more VAL-guided iterations (n), but the cost/benefit curve and stopping criteria are not fully characterized. 4. Comparison set: baselines focus on untuned or instruction-tuned models; a head-to-head against strong LLM-modulo planners or LLMs-as-modelers plus classical solvers (on identical tasks) would sharpen the contribution. 5.
Ablate the effect of modified VAL outputs in training (show effect when you never tamper with VAL). 6. Generalization claims need clearer definition. The paper uses “generalization” loosely; clarify whether this means cross-domain transfer (different domain file), larger problem instances, or semantic obfuscation (Mystery Blocksworld). Provide explicit transfer experiments (train on a set of domains, test on held-out domains). 7. Test set contamination risk: The paper states "We remove the solution plans from datasets D₂ and D_test" but doesn't clarify if D₂ and D_test contain problems from the same domains with different configurations. If test problems are structurally similar to training problems within the same three domains, this raises concerns about memorization vs. generalization. 1. You state you sometimes alter VAL outputs to create incorrect explanations for a few plans (Sec. 5.1). Why was this done? How many examples were altered and what safeguards ensure the model does not learn incorrect inference from corrupted feedback? 2. Can you show results for (a) more expressive PDDL features (conditional effects, derived predicates) and/or (b) an additional diverse suite of PlanBench domains? If not possible, explain limitations and expected failure modes. 3. How robust are the CoT traces? Provide examples of unfaithful CoT traces (CoT that looks plausible but the plan fails) and quantify how often your external verifier catches such unfaithful traces. Compare this to literature on CoT unfaithfulness. 4. In the results section, I see that the SD values are quite high; why does the paper have such high SD values? I assume SD means standard deviation; correct me if I am wrong. Is it due to a high temperature setting of the LLM? 5. How does your approach differ fundamentally from existing work on iterative refinement with external verification (LEPA, STaR, Code-as-Symbolic-Planner [1], CoT-TL [2])? What is the specific technical contribution beyond applying these ideas to PDDL? Please clarify. [1] Code-as-Symbolic-Planner: Foundation Model-Based Robot Planning via Symbolic Code Generation [2] CoT-TL: Low-Resource Temporal Knowledge Representation of Planning Instructions Using Chain-of-Thought Reasoning Fully AI-generated
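To ground what the step-level ⟨s, a, s′⟩ supervision discussed above has to get right, here is a minimal sketch of the check a STRIPS-style validator performs for a single step; it is a simplified stand-in for VAL with an illustrative Blocksworld-like action, not the paper's actual tooling.

```python
def check_step(state, action):
    """Validate one <state, action, next_state> step for a STRIPS-style action.

    state:  set of ground atoms, e.g. {"clear(a)", "ontable(a)", "handempty"}
    action: dict with "pre", "add", "del" sets of ground atoms.
    Returns (is_applicable, next_state).
    """
    if not action["pre"] <= state:                         # every precondition must hold
        return False, state
    next_state = (state - action["del"]) | action["add"]   # apply delete then add effects
    return True, next_state

# Toy usage: a pick-up style action in a Blocksworld-like domain (illustrative).
state = {"clear(a)", "ontable(a)", "handempty"}
pickup_a = {
    "pre": {"clear(a)", "ontable(a)", "handempty"},
    "add": {"holding(a)"},
    "del": {"clear(a)", "ontable(a)", "handempty"},
}
ok, new_state = check_step(state, pickup_a)   # ok == True, new_state == {"holding(a)"}
```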
Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces PDDL-INSTRUCT, a new method for training LLMs to be much better at symbolic planning in PDDL. The main idea is to teach the model a logical chain-of-thought where it has to explicitly reason about why an action is valid (checking preconditions, applying effects) at every step of a plan. The training has two phases: a basic fine-tuning on PDDL problems and plans, followed by a more advanced phase where the LLM's chain-of-thought plan is checked by an external VAL verifier, and this feedback is used to tune the model even more. PDDL-INSTRUCT shows 94% accuracy on Blocksworld, a 66% jump over the baseline. - The PDDL-INSTRUCT framework is a novel and well-structured approach to instruction tuning that directly targets an LLM's weakness in formal logical verification. - The paper includes a detailed analysis comparing binary vs. detailed feedback and tests across multiple frontier/public models, which validates the design. - It is a good attempt, but the results essentially show that LLMs can still only solve TC0 problems, aligning with the LLM state-tracking literature. The selected tasks, e.g., Blocksworld and Logistics, are pretty simple in the planning domain, and some neuro-symbolic methods can achieve 100% on them. - The paper's claim that this could be combined with frameworks like LLM-Modulo to reduce feedback loops is interesting, but where are the results? How is that claim supported? - The paper is missing many literature references and discussion points from the neuro-symbolic and LLM-Modulo research areas, which are highly relevant. - The framework's assumptions limit it to simple PDDL features, avoiding common but complex features like conditional effects or durative actions. - Could the authors provide qualitative examples of failures after tuning? For example, how does planning fail on the simple Blocksworld domain? Does it fail on long-horizon planning tasks? Fully human-written
Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors propose the PDDL-Instruct framework, a model finetuning approach, to improve Large Language Models' abilities on sequential planning tasks. The finetuning is split into two phases. The first phase is an instruction tuning phase, where the model is trained over both correct and incorrect plans. The model output for each plan is an analysis of whether the given plan is valid and why it is or is not valid. The second phase is a chain-of-thought tuning phase. The model is prompted to generate a chain of planning tuples, which is checked for correctness by an external verification algorithm. The input to a model tuned by this framework is then a set of natural language instructions, a domain description in PDDL, and a goal to be reached described in PDDL. The LLM output is a set of (State, Action, Next State) tuples that can be used to reach the goal. 1) The finetuning framework combines both existing strategies and new finetuning approaches 2) The writing is clear and the provided tables / figures are easy to understand 3) Inclusion of error analysis and failure modes is useful for understanding how future frameworks can improve this framework and also for helping interpret the results in the main paper 4) Training on negatives is not a new concept, but it is not one I've seen applied in the context of LLM planning, and is an idea I feel is typically neglected. Clever of the authors to include such an approach. 5) Results show strong improvements over baseline models 6) Moderate novelty and significance 1) The authors do not compare against any other planning frameworks. Even if the other frameworks were not made for this task and the authors were sure that the other frameworks would fail, it is still useful to see the performance of other frameworks to put the results in context. Can these problems be easily solved by a general solver for AI planning problems? 2) This work has some similarity with the approach in this paper that also uses feedback: https://arxiv.org/abs/2309.16436 However, that paper does not use any fine-tuning. 3) I guess the overall excitement about the paper is rather limited as all the applied techniques are well established. However, they may not have been applied together to this type of problem. 4) The authors include some figures and tables in the appendices. It would be beneficial if descriptions of the figures/tables and analysis of the results were included. 5) Minimal results were provided. See questions. 1) Have the authors tried training on one dataset (Blocksworld) and then testing on another (Logistics)? How well does the model generalize to unseen data? 2) Have the authors trained on two of the datasets (Blocksworld, Logistics) and tested on the held-out dataset (Blocksworld Mystery)? Fully human-written
HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Disclaimer: I did not use any LLM for this review. The paper works on event sequence classification. The authors introduce history token(s) in the pretraining of event sequences by strategically inserting history tokens into a transformer structure for the next-event prediction task. Then, in fine-tuning (classification of event sequences), they use the history token again as a global feature for this downstream task. The authors demonstrate good empirical results on 5 real datasets compared to a variety of RNN and Transformer variants. Originality: the authors observe the effect of the history embedding/token and its success in NLU from the Recurrent Memory Transformer paper, and apply it to event sequences and (irregular) time series. This is a new adaptation. Quality: the authors conduct a lot of experiments and variants of HT-Transformer on 5 datasets, and these support their 4 main claims. Clarity: overall decent – I can follow the structure of the paper fairly well. Significance: I think the history token, or in general the Recurrent Memory Transformer, can be interesting to researchers in time series/event sequences. Originality: it can be improved by considering which aspects of time series/event sequences are unique from the adaptation perspective. It can also be improved by providing some theoretical account of why bringing in a history token helps learn global characteristics that predict well on classification. Quality: it can be improved by extending experiments to broader settings, including time series, regression, etc. Currently it is a little weak, given that the authors do mention time series/event sequences and classification/regression. Clarity: I am a little confused about PRELIMINARIES ON EVENT SEQUENCES. The authors should separate their contributions and related background information. For example, I am not sure whether they use eqn (3) as their training objective or not. I know eqn 2 is used in pretraining. Significance: I think the current scope is a little narrow. Classification of event sequences is less studied – there are a few studies on clustering event sequences (A Dirichlet Mixture Model of Hawkes Processes for Event Sequence Clustering, NeurIPS 17). Researchers in this domain tend to focus on (long-horizon / next-event) future prediction/generation and structure learning, not so much on classification. So the proposed problems are less motivated and the datasets are not standard benchmarks (esp. with labels). I think the approach is better suited for time series classification/regression. 1. Do the authors use eqn 3 in training or pretraining? 2. What does it mean by "The Uniform strategy inserts history tokens at uniformly sampled positions. However, this approach can lead to a discrepancy between training and inference, as history tokens are typically positioned near the end of the sequence during evaluation"? Does inference here mean the fine-tuning step? What is the discrepancy between training and inference? 3. The ablation experiment is based on Markovian generation of type-only events, so there are no timestamps or numerical values, correct?
So how do training and fine-tuning differ for event sequences with timestamps? 4. The current version of the history token is a little strange to me. In one way, it is kind of like a mask token in BERT-style pretraining; in another way, it also acts a bit like an RNN, except that it does not change its state. In HT, the last history token acts like such a state, so that future tokens depend only on this state and the tokens after it. I am not sure why it works. It would be great if the authors could provide some explanation. Fully human-written
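To make the mechanism the reviewer is asking about concrete, here is an illustrative sketch of a sparse causal attention mask with a single history token; the exact masking rule is an assumption based on the reviews' description, not the paper's code.

```python
import torch

def history_token_mask(seq_len: int, hist_pos: int) -> torch.Tensor:
    """mask[i, j] == True means position i may attend to position j (assumed rule)."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal baseline
    after = torch.arange(seq_len) > hist_pos    # tokens following the history token
    before = torch.arange(seq_len) < hist_pos   # the prefix it is meant to summarize
    # Later tokens cannot see the raw prefix directly...
    mask[after.unsqueeze(1) & before.unsqueeze(0)] = False
    # ...but they can see the history token, which itself attends causally to the prefix.
    mask[after, hist_pos] = True
    return mask

print(history_token_mask(seq_len=6, hist_pos=2).int())
```

Under such a mask the history token behaves like a frozen RNN state: it is written once (by attending to its prefix) and only read afterwards, which may be why future tokens can depend on the prefix only through it.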
HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes HT-Transformer, a novel Transformer architecture for event sequence classification that introduces learnable history tokens to accumulate contextual information through sparse attention patterns during next-token prediction pretraining. The method eliminates the need for auxiliary objectives like contrastive learning and demonstrates strong performance across financial, healthcare, and e-commerce datasets. (1). The concept of history tokens with specialized attention patterns thoughtfully integrates recurrent principles into Transformer architectures. (2). The writing is generally clear, with the method explained step-by-step. (1). The paper positions HT-Transformer as a significant departure from prior work, but the core idea of using special tokens to aggregate sequence information has been explored in existing works. The authors should more clearly differentiate their contribution. (2). The paper lacks an explanation of why the Random strategy and Bias-End placement work better. It would be better to provide an analysis of why these strategies improve performance. (3). The paper uses gradient boosting on frozen embeddings for classification. It is unclear why this choice was made and whether it is standard. (1). Why does the Random history token selection strategy consistently outperform the Last strategy? (2). The paper mentions that the Longformer was adapted for causal attention, but the details are sparse. How exactly was the attention mask modified? (3). The paper uses the average of hidden activations of the history token as the sequence embedding. Was any other aggregation method (e.g., max pooling, concatenation) considered, and why? (4). Are there particular event sequence characteristics (temporal density, categorical diversity) where HT-Transformer shows weak performance? Moderately AI-edited
HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces HT-Transformer, a variant of the transformer architecture designed for event sequence classification. The core contribution is the use of history tokens, which learn to extract cumulative prefix information during pretraining. The authors show how this approach enables transformers to better represent sequential context for downstream tasks, especially for predicting future events. Extensive empirical evaluations are reported across datasets in different domains, with comparisons against multiple methods, including RNNs, NTP transformers, and adapted techniques such as the recurrent memory transformer and Longformer. **Clear motivation and conceptual description:** The paper clearly explains the limitations of existing transformers and presents the differences between the proposed HT-Transformer and previous variants, especially on the masking strategy. Detailed descriptions of the history token insertion strategies are provided. **Comprehensive experiment:** The paper evaluates across multiple domains and covers different methods (contrastive learning, NTP, etc.) on both network architectures (RNN, Transformer). **Study on Limitations:** Experimental results show that the proposed HT-Transformer performs worse on the global task. 1. The pretraining part should have more details. Currently, Sec. 2 includes part of the training objectives and Sec. 3.1 introduces the masking and inserting strategies. However, it is still not clear what the optimization target is for this pretraining stage. This is important as the pretrained embeddings are directly used for classification. 2. The performance gain is marginal, considering that LongFormer also aims to reduce memory. The implementation details and ablation study actually weaken the contribution: a step increase is observed in the low-probability zone, which is counterintuitive if the history token actually has a significant effect on the performance. Weakness 1: How are the inserted history tokens supervised during the "next event prediction" modelling framework? Weakness 1&2: In the insertion strategy part, when a sample is selected as "no application", do you mean there are no history tokens in the middle of the sequence (excluding the final one), or there are no history tokens throughout the whole sequence? Weakness 2: Could you further explain the phenomenon of the step increase from $p=0$ to $p=0.25$? Fully human-written
HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes HT-Transformer, a transformer-based architecture designed for irregular sequence classification. The key idea is to introduce history tokens to accumulate prefix (historical) information during pretraining. The accumulated history token serves as a meaningful representation for downstream tasks. Moreover, it proposes a contrastive pretraining method for next token prediction and cont. The authors evaluate HT-Transformer on diverse datasets spanning finance, e-commerce, and healthcare domains, indicating strong performance compared with autoregressive Transformers trained with the next token prediction task. 1. It proposes "history tokens" as accumulators of prefix information that mimic the RNN hidden-state mechanism. 2. Strong empirical evidence: The method is evaluated across multiple domains. 1. Lack of methodological contribution: This work proposes the "history token", which is very similar to the [SEP] and [EOS] tokens in BERT that can be used to summarize past information. 2. Token aggregation: For classification, this work does not compare or make it clear why the history token is better than a traditional [EOS] token or simple average pooling. It seems that, if you use a history token at the end of the sequence (Fig 1.b), you actually use the [EOS] token, which has already been used by GPT before. 3. Method comparison: it does not compare with recurrent (linear) transformers, which can be considered deep RNNs. 1. How does the proposed "history token" mechanism differ functionally from a state memory variable? A linear transformer computes the state variable from the K and V matrices, with size d x d. This state variable is larger than the history token you propose, and it is dynamic, so each token's state variable can summarize all the past information. It seems more flexible and representative than introducing an extra special token. 2. This work actually has two losses, L_MAE and L_CONT; it is not clear how they are combined. 3. Did you evaluate how many history tokens are actually attended to during pretraining? This would help understand if the model genuinely learns to use them. Fully human-written
HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents a novel approach to improving temporal sequence prediction by introducing history tokens, dedicated tokens that accumulate and represent historical context. These tokens function as dynamic memory units, allowing the model to construct richer and more informative representations of past events during both pretraining and inference. Furthermore, the authors focus their investigation on two primary design dimensions: 1. Placement: They evaluate two strategies for positioning history tokens: uniform distribution across the sequence and end-biased placement closer to the sequence's conclusion. 2. Attention: They examine how sequence tokens should attend to history tokens, comparing last-token attention with random attention mechanisms. Extensive experiments on five datasets spanning finance, e-commerce, and healthcare demonstrate that the proposed HT-Transformer consistently outperforms strong baselines, including RNNs, standard Transformers, Longformer, and Recurrent Memory Transformer (RMT), in different settings. These results underscore the value of integrating history tokens into transformer-based architectures for temporal and event sequence modeling. 1. Novelty and Practicality: This paper introduces a simple yet effective idea, history tokens, to flexibly incorporate historical context, addressing a key limitation in temporal sequence modeling. The concept of a history token is very interesting. 2. Well-Explored Design Choices: The authors provide a systematic study of token placement and attention strategies with solid ablation experiments that enhance both interpretability and adaptability. 3. Strong Empirical Results: Experiments demonstrate consistent performance gains over RNNs, Transformers, Longformer, and RMT across five datasets in finance, healthcare, e-commerce, and so on. 4. Reproducibility: Public code is available, supporting transparency and facilitating future research. While this paper presents valuable contributions, addressing the following points could further strengthen its impact and clarity: 1. Clarify the Relationship with CLS Tokens: It would be more comprehensive to include a detailed comparison between history tokens and the CLS token mechanism used in models like BERT, beyond the fact that history tokens appear multiple times in a sequence. Elaborating on how they differ functionally and architecturally will help clarify the history token's unique role. 2. Compare with Compressed Vector Methods in Long-Context Transformers: Providing a methodological comparison between history tokens and compressed or pooled vector approaches used in long-context models such as Unlimiformer, Linformer, Reformer, sparse/blockwise attention mechanisms, and so on, would contextualize the contribution and highlight the advantages of history tokens. 3. Broaden Experimental Baselines: Incorporating additional baselines like BigBird, FlashAttention, ModernBERT, Mamba, and Structured State Space Models (SSMs) would strengthen the empirical evaluation and better position the work within the current state of the art. 4. Include Efficiency and Computational Cost Analysis: Adding an analysis of computational efficiency and resource usage compared to other long-sequence models is important for assessing the practical applicability of the approach. Including such details in the main text will provide readers with a clearer understanding of its scalability. Addressing these points would provide greater clarity, strengthen the motivation, and improve the overall impact of the work. 1. How does the HT-Transformer relate to traditional time series forecasting methods and recent models like PatchTST? Could it be applied to classic forecasting tasks such as next-step prediction or multivariate forecasting, and how well does it capture inductive patterns like seasonality, trends, or locality? 2. What factors influence the optimal number or ratio of history tokens relative to the input length? Are there cases where using too many tokens might introduce redundancy or reduce attention efficiency? 3. Could the placement of history tokens be adapted to the structure of the data or task, for example, aligning with seasonal events, or paragraph or document boundaries in text? Would such task-aware placement improve efficiency or performance? Fully AI-generated
3DPhysVideo: 3D Scene Reconstruction and Physical Animation Leveraging a Video Generation Model via Consistency-Guided Flow SDE Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces 3DPhysVideo, a training-free pipeline for generating physically realistic and photorealistic videos from a single input image. It addresses the fundamental limitation of traditional video generative models, which often fail to adhere to real-world physical dynamics. The pipeline operates in two main stages: Novel View Synthesis and Simulation-to-Video Generation. The core contribution is the training-free pipeline that repurposes an off-the-shelf image-to-video diffusion model for two entirely different tasks: 3D scene reconstruction and physics-guided video synthesis. The authors conducted extensive experiments to validate the effectiveness of their proposed method. The results demonstrate that 3DPhysVideo outperforms state-of-the-art methods in terms of physical realism and semantic consistency while maintaining competitive photorealism. The Material Point Method (MPM) is computationally expensive, especially for high-resolution simulations and complex scenes with numerous interaction points. The overall pipeline's speed is likely bottlenecked by the MPM step. The authors should clearly address the runtime breakdown for the three main stages: 3D reconstruction, MPM simulation, and I2V synthesis, to highlight the practical efficiency of the "training-free" claim. The demonstration mostly focuses on relatively contained scenes with specific, localized physical events (e.g., ball drops, liquid pouring). It is unclear how well the pipeline scales to large-scale, non-local physical phenomena like wind effects, cloth dynamics, or complex collisions involving many small particles. Could you address the problems in the weaknesses? Fully AI-generated
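As a point of reference for the Consistency-Guided Flow SDE discussed in these reviews, guided sampling of a flow model is often written as the predicted velocity plus a guidance correction; the form below is a generic illustration, with $c$ and $\lambda_t$ being assumed symbols rather than the paper's actual equation:

$$
\tilde v(x_t, t) \;=\; v_\theta(x_t, t) \;+\; \lambda_t \,\nabla_{x_t} \log p\big(c \mid x_t\big),
$$

where $v_\theta$ is the pre-trained image-to-video flow velocity, $c$ is the conditioning signal (rendered point-cloud views in Stage 1, simulated MPM trajectories in Stage 2), and $\lambda_t$ is a guidance weight.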
3DPhysVideo: 3D Scene Reconstruction and Physical Animation Leveraging a Video Generation Model via Consistency-Guided Flow SDE Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a training-free pipeline that generates physically realistic videos from a single image. It repurposes an off-the-shelf image-to-video flow model for two stages: reconstructing full 3D scene geometry using rendered point clouds, and synthesizing final videos guided by Material Point Method physics simulations. The authors also propose Consistency-Guided Flow SDE that decomposes predicted flow velocities to enable effective 3D reconstruction and simulation-guided video generation. - Research on physically realistic video generation is both practical and meaningful. - The proposed pipeline is well-designed, feasible, and reasonable. - Experimental results demonstrate performance improvements across multiple scenarios. - From the appendix video examples, some cases appear worse than other methods. For instance, in the Apple sample, the back video shows no water splashing when the apple falls. What could be the possible reason for this? - What is the speed of generating a video sequence, and how does it compare to other methods? - The paper lacks a discussion of limitations and corresponding analysis. Could the authors provide intermediate visual results showing MPM-simulated outputs under different types of interactions, such as solid–fluid collisions, fluid–fluid interactions, and so on? Lightly AI-edited
3DPhysVideo: 3D Scene Reconstruction and Physical Animation Leveraging a Video Generation Model via Consistency-Guided Flow SDE Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents 3DPhysVideo, a training-free pipeline designed to generate physically realistic videos from a single image input. It reuses a pre-trained Image-to-Video (I2V) flow model across two stages. In Stage 1: Single Image to 3D, the I2V model functions as a view synthesizer to reconstruct 360-degree 3D scene geometry. In Stage 2: Simulation to Video, Material Point Method (MPM) physics simulation is applied to the geometry. The resulting simulated point trajectories, which support complex dynamics like fluids and viscous substances, then guide the same I2V model to synthesize the final photorealistic video. The core mechanism, Consistency-Guided Flow SDE, adapts the I2V model for both 3D reconstruction and simulation-guided rendering. The 3DPhysVideo pipeline generates physically realistic videos from a single image using a training-free approach. It repurposes an off-the-shelf Image-to-Video (I2V) model in two stages. 1. 3D Reconstruction: The I2V model first acts as a novel view synthesizer to reconstruct 360-degree 3D scene geometry. 2. Physics Generation: The geometry undergoes Material Point Method (MPM) physics simulation. The resulting simulated dynamics then guide the same I2V model to synthesize the final photorealistic video. This dual functionality is enabled by the Consistency-Guided Flow SDE, which adapts the pre-trained model for both geometry and dynamics synthesis. The method achieves good physical realism compared to baselines, especially in multi-object and fluid interaction scenarios, while offering user control over physical properties. 1. The proposed method appears incremental, with limited distinction from prior work. 2. Experiments are limited in scope; key baselines and datasets are missing. 3. Core assumptions lack rigorous justification or mathematical support. 4. Result interpretation is shallow; no discussion of failure cases or parameter sensitivity. 5. Figures and explanations are sometimes unclear, reducing readability and impact. 1. Could the authors elaborate on the empirical or theoretical rationale for entirely eliminating the denoising bias? 2. What is the measured reliability or accuracy of these automatically inferred physical parameters compared to manually specified inputs? 3. Since the current SDE is heavily reliant on visual consistency, how would the core consistency metric and the model’s latent inputs need to be adapted or redefined to effectively enforce a non-visual inductive bias, such as alignment with a detailed text prompt, without requiring additional model training? Lightly AI-edited
3DPhysVideo: 3D Scene Reconstruction and Physical Animation Leveraging a Video Generation Model via Consistency-Guided Flow SDE Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces 3DPhysVideo, a novel, training-free pipeline that generates physically realistic videos from a single input image. Instead of training a new model, it cleverly repurposes a single, pre-trained image-to-video (I2V) model for two distinct stages: 3D Scene Reconstruction and Physics-Guided Video Generation. 1. The entire pipeline requires no additional training. It runs on a single consumer GPU, making it highly accessible and efficient compared to methods that require training large, specialized models. 2. By grounding the animation in an explicit physics engine (MPM), the final video exhibits a high degree of physical plausibility, especially in complex scenarios like fluid dynamics and multi-object interactions, where purely data-driven models (e.g., Sora, Gen-3) often fail. 3. The paper is well written and organized. 1. As a multi-stage pipeline, errors from any stage can cause the final result to fail. In particular, the 3D reconstruction and physical property estimation (using an LLM) parts are prone to errors. For example, the apple in the demo appears elastic (it should actually be similar to a rigid body). It would be better if the accuracy of these two parts could be assessed, and the potential limitations could be analyzed. 2. While it can run on a consumer GPU, this method predictably increases inference time significantly due to the introduction of 3D reconstruction, physical property estimation, and MPM simulation. It would be better to report a comparison of inference time. 3. Were the liquids in the scene also reconstructed in 3D? How is the physical realism of the fluid dynamics ensured? 4. The article states that PhysGen3D cannot maintain the relative position of objects. However, PhysGen3D does perform pose estimation, so is this statement somewhat unreasonable? Please see Weaknesses. Lightly AI-edited
Latency-Aware Contextual Bandit: Application to Cryo-EM Data Collection Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper examines a version of the contextual combinatorial bandit problem in which each action (a subset of arms) incurs a latency, a variable time cost. The goal is to maximize expected reward per unit of elapsed time rather than per round. The authors model this as an average-reward semi-Markov decision process (SMDP) and derive a Bellman optimality equation of the form $\mathbb{E}_{(X,A,l)}\left[\min_{A\in\mathcal{A}}\{l(A)\Gamma - \sum_{i\in A}\mu_i\}\right] = 0$, where $\Gamma$ represents the long-run average reward rate. They then propose an algorithm, COAF (Contextual Online Arm Filtering), that combines a Robbins–Monro-type stochastic approximation for estimating $\Gamma^*$ with UCB-style exploration for learning the arm rewards $\mu_i(x)$. Regret bounds of order $O(T^{3/4})$ are proved under both linear and general function classes, and experiments on synthetic data (MovieLens) and a cryo-electron microscopy (cryo-EM) simulation are presented. 1) The latency-aware formulation is conceptually relevant to real scientific workflows. 2) The mathematical derivations are careful and correct. 3) The paper is generally well written and easy to follow. 4) The cryo-EM example adds color and a nice application context. 1. The regret bound is very likely suboptimal. 2. There is no lower bound or discussion of optimality. 3. The "latency" feature mostly amounts to a time-rescaling, I think; it is not clear why this warrants a fundamentally new theory. 4. The experiments lack statistical rigor: no error bars or serious baselines. 5. Overall novelty is modest: the algorithm is a straightforward hybrid of known tools (UCB + stochastic approximation). 1. Do you believe the $T^{3/4}$ rate is unavoidable, or is it simply an artifact of your analysis? 2. When latencies are known and bounded, why can’t the setting be reduced to a contextual bandit with a random time clock? 3. Could one obtain a sharper $\sqrt{T}$-type result using a ratio or Dinkelbach-style formulation? 4. What exactly does “throughput” measure in the cryo-EM experiment, and how does it relate to $\Gamma^*$? 5. Please clarify whether the cryo-EM data come from real microscope logs or a synthetic simulator. Moderately AI-edited
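Since the review summarizes COAF as UCB-style exploration plus a Robbins–Monro update for $\Gamma^*$, a schematic sketch of that kind of loop is given below; it is simplified to single-arm, context-free actions with latencies revealed before selection, and every name and update form here is an assumption rather than the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 5000
mu_true = rng.uniform(0.2, 1.0, K)      # unknown mean rewards
lat_true = rng.uniform(0.5, 2.0, K)     # mean latencies (time cost per pull)

counts, mu_hat, gamma = np.ones(K), np.zeros(K), 0.0
for t in range(1, T + 1):
    bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
    latency = lat_true + 0.1 * rng.standard_normal(K)   # latency function shown with the context
    # Optimistic version of the Bellman-style criterion: reward minus (time cost x rate).
    arm = int(np.argmax(mu_hat + bonus - latency * gamma))
    reward = mu_true[arm] + 0.1 * rng.standard_normal()
    mu_hat[arm] += (reward - mu_hat[arm]) / counts[arm]
    counts[arm] += 1
    # Robbins-Monro step driving the empirical optimality equation toward zero.
    gamma += (reward - latency[arm] * gamma) / t

print(f"estimated rate {gamma:.3f} vs best true rate {max(mu_true / lat_true):.3f}")
```

The fixed point of the last update is the reward-per-unit-time of the arms actually played, which is the quantity the review's $\Gamma$ denotes.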
Latency-Aware Contextual Bandit: Application to Cryo-EM Data Collection Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors study the contextual MAB problem where each action incurs a context-dependent time cost and the goal is to maximize the reward per unit time. They develop a new algorithm called COAF that jointly learns the reward model with UCB-style confidence bands and also learns the optimal average reward. They also establish theoretical guarantees for the algorithm, with regret bounds in several regimes. The theoretical work is supported by experiments conducted on two real-world datasets from different domains, showcasing the adaptability of the proposed setting. The problem setting is clearly motivated by a proper use case of cryo-EM data collection and is designed to tackle similar use cases. The problem formulation generalizes contextual bandits and combinatorial semi-bandits, which makes it solid. Also, COAF is supported by the optimality equation, and its design builds on it. The experiments use real-world data to validate the motivating example. Along with it, the authors also show performance on another domain with MovieLens data to showcase the wide adaptability of the setting. Having per-arm feedback within the combinatorial choice helps reduce variance and seems realistic for their application. The setting allows switching to a new decision set but does not indicate the regime in which switching is optimal as opposed to exploiting. The experiments lack proper baselines to compare the effectiveness of the proposed algorithm COAF. The problem setting has an IID assumption on ($X_j$, $A_j$, $l_j$); however, this may not suit applications where nonstationarity has to be dealt with and taken into account. Since latency-aware contextual bandits seem to be a special case of contextual bandits and contextual semi-bandits, if the action space and arm contexts are reduced to resemble stochastic bandits, do latency-aware contextual bandits reduce to budgeted bandits? If so, how does the regret bound compare in this scenario? If a learner is allowed to request a new action set, how does this switch take latency into account? Is it already part of the latency of the originally selected action set? Since the work is motivated by cryo-EM data collection, can COAF in the latency-aware contextual bandit setting exploit the structure in the observed latency rather than treating it as arbitrary? Also, for cryo-EM, does COAF outperform strong domain-specific heuristics, and is there any advantage of using a learned policy for the cryo-EM data collection application? Also, the numerical experiments only involve a baseline comparison with humans; can any of the contextual bandit settings be adapted with mild relaxations to serve as baselines for evaluation? Often, since cryo-EM data collection involves human microscopists, instrumentation drift or changes in the user's actions can occur mid-run. In that case, under the IID assumption on ($X_j$, $A_j$, $l_j$), how does the algorithm behave? Fully human-written
Latency-Aware Contextual Bandit: Application to Cryo-EM Data Collection Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. In this paper, the authors studied a latency-aware contextual bandit framework that extends standard contextual bandits by incorporating action delays and formulating the problem as a special case of a semi-Markov decision process. The authors proposed the Contextual Online Arm Filtering (COAF) algorithm, which combines stochastic approximation and UCB exploration to balance reward and latency. The authors provided theoretical analysis of their algorithm, proving sublinear regret bounds. Finally, they conducted numerical experiments on MovieLens and cryo-EM data to demonstrate that COAF outperforms baselines and improves data collection efficiency. - The proposed latency-aware model generalizes contextual and combinatorial bandits by explicitly accounting for temporal costs. This is a novel problem in the bandits literature. - The theoretical analysis appears sound and comprehensive (though I have not checked every proof in detail). - I also appreciate the discussion of the application to cryo-EM data collection, which highlights the real-world relevance of the framework. Modeling microscope exposure and movement as latency is a strong and realistic motivation that grounds the theoretical development. - While the latency-aware formulation is novel, COAF primarily builds on existing tools such as UCB and stochastic approximation. The conceptual combination is interesting but may feel incremental without deeper theoretical or algorithmic insights. Could you clarify whether the current results provide any new algorithmic intuition or theoretical implications for the broader bandit literature? - The numerical experiments, though illustrative, are relatively small-scale, so the insights they provide are somewhat limited. The cryo-EM evaluation appears to use simulated data, with experiments mainly comparing COAF to human microscopists. Could you offer some more comprehensive analysis here? E.g., an ablation study examining how COAF’s performance changes under different latency distributions. Similarly, the MovieLens experiments feel limited in scope, particularly in the choice of baselines. It would be informative to also compare against a number of standard contextual bandit algorithms. - Finally, it would be valuable for the authors to discuss additional application domains where the proposed latency-aware bandit framework could be beneficial, beyond the cryo-EM setting. See weaknesses. Lightly AI-edited
Latency-Aware Contextual Bandit: Application to Cryo-EM Data Collection Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper investigates a latency-aware contextual bandit problem, where each action (arm) incurs a random latency drawn from an unknown distribution. To capture the impact of latency on decision-making, the authors model the problem as a Markov Decision Process (MDP) and derive the corresponding Bellman optimality equation. Building upon this formulation, they propose an arm filtering algorithm that balances exploration and exploitation by accounting for both reward and latency. The proposed approach is demonstrated through experiments on the MovieLens 1M dataset and a Cryo-EM experimental setting. The paper has the following strengths: - The paper provides a theoretical formulation by modeling the latency-aware contextual bandit problem as an SMDP and deriving the corresponding Bellman optimality condition. - The paper introduces a contextual online arm filtering (COAF) algorithm based on the derived Bellman condition and establishes regret bounds for both linear and general reward function settings. - The problem is well-motivated by a real-world Cryo-EM application, and the proposed method is empirically validated on both MovieLens 1M and Cryo-EM datasets. The weaknesses are described below. - Although the paper formulates the latency-aware contextual bandit problem as an MDP, it does not clearly justify why the proposed method is preferable to existing MDP-based solutions. - The arm filtering design and regret analysis follow relatively standard techniques, and the paper does not clearly articulate new analytical challenges introduced by latency or contextual dependencies. - The study focuses solely on the stochastic setting, which can already be addressed by conventional MDP algorithms. Extending the formulation to adversarial or non-stationary environments would make the contribution more compelling. - The impact of action latency on the learning rate or convergence behavior is not explicitly analyzed or reflected in the algorithmic design, despite being central to the problem motivation. - The experimental evaluation is limited to the proposed method without comparisons against existing delayed-feedback bandits [1,2] or MDP-based algorithms, which weakens the empirical evidence supporting the algorithm’s effectiveness. [1] Masoudian, S., Zimmert, J. and Seldin, Y., 2022. A best-of-both-worlds algorithm for bandits with delayed feedback. Advances in Neural Information Processing Systems, 35, pp.11752-11762. [2] Lancewicki, T., Rosenberg, A. and Mansour, Y., 2022, June. Learning adversarial markov decision processes with delayed feedback. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 7, pp. 7281-7289). Please see the weaknesses. Moderately AI-edited
TheMCPCompany: Creating General-purpose Agents with Task-specific Tools Soundness: 2: fair Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper provides a new benchmark to evaluate the tool use and tool calling of LLM agents with MCP tools. This benchmark is notable for its scale, including over 18,000 MCP tools. 1. The MCPAgent incorporates 18,000 tools and introduces a gateway MCP server to retrieve the tools relevant to each user query, thereby improving performance and reducing operational costs. 2. This paper evaluates the MCPAgent on challenging tasks that reflect the complexity of real scenarios. 1. Although constructing a standardized set of MCP tools requires substantial engineering effort, the novelty of this paper appears to be limited. 2. Some experimental setups are confusing. For example, in Table 2, the comparison between the **MCPAgent** and the **Oracle Tool Set** supports the claimed advantages of introducing a gateway MCP server. However, it is unclear why the **MCPAgent** is also compared with the **browser-based agent**, given that their functionalities and supported capabilities differ significantly. 1. Could the authors further clarify the novelty of this work? 2. What is the reasoning behind comparing the **MCPAgent** with the **browser-based agent**? Their functionalities and supported capabilities differ substantially. 3. Is the MCP tool set fixed at 18,000 tools in the experiments? Does the benchmark support a dynamic tool set? A dynamic setting might better capture real-world scenarios where available tools evolve over time. Lightly AI-edited
TheMCPCompany: Creating General-purpose Agents with Task-specific Tools Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents TheMCPCompany, a benchmark and evaluation framework for testing LLM-based agents equipped with Model Context Protocol (MCP) tools. The work extends TheAgentCompany by integrating various enterprise services (Azure, GitLab, RocketChat, Plane, ownCloud) through the MCP interface, resulting in over 18,000 tool endpoints. The authors also propose MCPAgent, a tool-retrieval agent capable of discovering and invoking MCP tools automatically. Experiments are conducted on several proprietary models (GPT-4.1, GPT-5, Claude Sonnet, Opus), showing that MCP-based agents outperform browser-based baselines in cost and accuracy. The paper aims to highlight the potential of large-scale MCP environments for real-world agent evaluation. - Strong engineering contribution: Implements a large, fully functional MCP benchmark with 18,000+ tools across enterprise services. - Systematic evaluation pipeline: Builds upon TheAgentCompany with added realism (Azure integration). - Empirical comparison: Includes quantitative cost and accuracy analysis between MCP and browser-based setups. - Reproducibility commitment: The authors intend to release code, MCP servers, and Terraform scripts. - Limited model coverage: Only closed-source models from OpenAI and Anthropic are evaluated; Gemini, as well as open-source models (e.g., DeepSeek-V3, Qwen3, Llama), is excluded. This limits generalizability. - Lack of retrieval comparison: The paper does not directly compare MCPAgent with traditional retrieval-based methods, making it unclear whether MCPAgent offers genuine advantages. - Narrow task scope: The actual benchmark tasks are mainly Azure tasks, and other major components (e.g., TheAgentCompany) are reused without meaningful extension. As a benchmark paper, this is insufficient. - Weak analysis of MCPAgent: The paper provides little insight into how MCPAgent performs tool discovery or why it succeeds/fails in specific cases. - 18,000-tool claim not substantiated: Although the paper highlights a huge MCP tool set, it never reports how many tools are actually useful or invoked during evaluation. - Lack of concrete examples: The two main contributions (Azure tasks and MCPAgent) are not illustrated with examples or reasoning traces, making the work difficult to interpret and assess. - Lack of comparison with existing MCP benchmarks: The paper does not include direct comparisons with other MCP-based benchmarks (e.g., MCPVerse or LiveMCPBench). - How many of the 18,000 MCP tools are actually used in the benchmark? Can you provide tool invocation statistics? - Why were open-source models (e.g., DeepSeek-V3, Qwen3, and Llama) excluded from the evaluation? - Can you show a concrete example of an Azure composite task and how MCPAgent solves (or fails to solve) it? - How does MCPAgent compare to traditional retrieval-based systems or other MCP-agent implementations? - What are the main factors that limit agent performance on Azure composite tasks? Lightly AI-edited
TheMCPCompany: Creating General-purpose Agents with Task-specific Tools Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces TheMCPCompany, a benchmark for evaluating general-purpose AI agents that primarily interact with their environment through a large set of task-specific tools (over 18,000) rather than a few general-purpose tools like web browsers. The core contributions include: (1) the creation of this large-scale, realistic benchmark based on the Model Context Protocol (MCP), which includes complex tasks adapted from a software company simulator and new challenges for the Azure cloud platform; (2) an extensive evaluation comparing browser-based agents to agents with access to either a pre-selected 'oracle' tool set or a tool-retrieval mechanism; and (3) key findings that demonstrate the potential of task-specific tools to improve performance and reduce cost, while also exposing the significant difficulties agents face in navigating and combining thousands of tools in complex enterprise environments. 1. This paper tackles a very relevant and interesting angle: understanding the capabilities of general-purpose agents when they are equipped with large, heterogeneous tool collections. Studying how LLMs perform as the number of available tools scales is highly realistic and timely, given the fast-evolving ecosystem of MCP tools. 2. The writing and motivation are clear and well-developed, making the paper's contributions easy to grasp. The design of the benchmark is intuitive and well-justified; it builds sensibly on prior work by replacing a few general-purpose tools with a massive set of task-specific ones, thereby creating a novel and challenging testbed. 3. The experiments yield meaningful and interesting observations. For instance, the note that GPT-5's excellent performance is partly due to its perseverance provides genuine insight that maps model behavior directly to success and failure patterns. The clear performance gap between models with oracle tools versus tool retrieval effectively pinpoints the current challenges in tool discovery and usage. The main weakness lies in the insufficient details provided for the MCPAgent's tool-finding function. This module is central to the paper's investigation of agents in large-scale tool environments, yet its implementation is only briefly described. Specifically, the choice of the embedding model is a critical design decision that could significantly impact retrieval quality and, consequently, the overall agent performance. The authors state, "We use OpenAI’s text-embedding-3-large model," but there is no discussion or ablation study on how this choice affects the results. Would a different embedding model change the performance gap between models, especially for smaller ones like GPT-5-mini? Without this analysis, it's difficult to fully assess the robustness of the retrieval approach and the conclusions drawn from it. 1. Could you provide more details on the tool-finding function? For example, what was the value of k (the number of tools returned per query), how was it determined, and were there any strategies for handling the diversity of tool schemas (e.g., name vs. description weighting) during embedding? 2. How sensitive are your key results, especially the poor performance of smaller models with retrieval, to the choice of the embedding model? Did you experiment with any other models, and if so, were the conclusions consistent? 3. The error analysis is insightful but brief. For the complex Azure tasks where all models failed, could you provide more detail on the specific types of reasoning failures? For example, were the issues more related to flawed problem decomposition, an inability to understand tool dependencies, or something else? A more detailed breakdown here would be very valuable for the community. Lightly AI-edited
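Since the reviews above ask how the gateway's tool retrieval works, here is a minimal sketch of the embedding-based top-k retrieval such a component could perform; the toy embedder, tool names, and value of k are illustrative, and the paper reportedly uses OpenAI's text-embedding-3-large rather than a stand-in like this.

```python
import numpy as np

def embed(texts: list[str], dim: int = 256) -> np.ndarray:
    """Hashed bag-of-words stand-in for a real embedding model (e.g. a neural encoder)."""
    vecs = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for tok in text.lower().replace(".", " ").replace(":", " ").split():
            vecs[i, hash(tok) % dim] += 1.0
    return vecs / np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-9)

tools = {  # hypothetical tool names and descriptions
    "azure_sql_set_version": "Update the engine version of an Azure SQL database.",
    "gitlab_create_issue": "Open a new issue in a GitLab project.",
    "rocketchat_send_message": "Send a message to a RocketChat channel.",
}
names = list(tools)
tool_vecs = embed([f"{name} {desc}" for name, desc in tools.items()])

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = tool_vecs @ embed([query])[0]      # cosine similarity on unit-norm vectors
    return [names[i] for i in np.argsort(-scores)[:k]]

print(retrieve("upgrade the database version on Azure"))   # the Azure tool should rank first
```

The open questions in these reviews (choice of k, name vs. description weighting, sensitivity to the embedding model) all live inside `embed` and `retrieve` in a sketch like this.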
TheMCPCompany: Creating General-purpose Agents with Task-specific Tools Soundness: 3: good Presentation: 2: fair Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. ``` I used LLM to fix the grammar of the Official Review, but all opinions are my own ``` Many current AI agents rely heavily on general-purpose tools such as web browsers to perform tasks—whether retrieving data or operating software. This approach is inefficient and costly, and it fails to capture how most real-world professional work is done. For example, in enterprise environments that use Azure for cloud services or GitLab for code management, there exist specialized APIs or SDKs that are far more suitable for task automation, but current agents struggle to leverage them effectively. To address this, the authors built a “tool library” that converts common enterprise services—like Azure, GitLab, and internal communication tools—into over 18,000 “AI-native tools” (called MCP tools). For instance, instead of manually updating database settings via the Azure UI, an AI agent can now directly call a dedicated “update Azure database version” tool. They also constructed a simulated company environment (“TheMCPCompany”) to benchmark AI agents in realistic work settings. The environment includes both simple tasks (e.g., labeling files) and complex ones (e.g., fixing a faulty cloud service). A baseline agent (“MCPAgent”) is introduced, which must first discover relevant tools for a given task (e.g., finding “check Azure database version” or “restart service”) and then use them to solve the problem. Although the paper doesn’t introduce a novel method, I find the problem setting very meaningful and the work potentially impactful. I recommend accepting this paper. 1. The idea of turning real-world software APIs into standardized, callable tools for AI agents is highly interesting. 2. The dataset and environment are both valuable community resources that can enable further research. 1. Some sections, especially the data construction process, are not clearly written. 2. It remains unclear how one might systematically improve the agent’s ability to use such tools efficiently. 1. The paper seems to lack a clear explanation of how the data were constructed. The current narrative is somewhat scattered, and it’s hard to follow the full pipeline from tool creation to task setup. 2. This direction is quite open-ended, and I’m curious how one might improve generalization in this setting. Since your tools appear highly domain-specific, training on them might not help the model handle out-of-domain (OOD) scenarios. Have you considered splitting the tools and environments into disjoint train/test sets—for example, using part of the tools to fine-tune/RL a model like Qwen, and then testing on unseen tools or environments? Such an experiment would be very informative, and if you could include results along these lines, I’d be even more inclined to advocate for acceptance. Lightly AI-edited
WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces WAVE, an LLM-based embedding model specifically designed to create a unified representation space for text, audio, silent video, and synchronized audio-visual inputs. To achieve this versatility, WAVE employs a hierarchical feature fusion strategy that aggregates representations from multiple LLM layers, alongside a dual-encoder architecture for audio inputs. The model is optimized through a joint multi-modal, multi-task training approach to enable any-to-any cross-modal retrieval and the generation of prompt-aware embeddings that condition on user instructions for tasks like multimodal QA. Experimentally, WAVE outperforms other baselines on the MMEB-v2 video benchmark and yields decent results in audio and video-to-audio retrieval, with ablation studies confirming the performance benefits derived from both joint training and the learned cross-layer fusion technique. * The writing is easy to follow. * The architecture features an effective hierarchical feature fusion strategy, aggregating representations across multiple LLM layers, and a dual-encoder design for audio. Ablation studies confirm that the proposed joint multi-modal, multi-task training strategy enables positive cross-modal knowledge transfer and superior results compared to specialist models. * WAVE achieves new state-of-the-art results on the MMEB-v2 video benchmark and shows superior performance in audio and video-to-audio retrieval compared to strong baselines. 1. A major limitation is the model's reliance on prompt-aware embeddings for high performance in complex tasks like multimodal QA. Using a single general prompt to extract the embedding results in a drastic performance degradation across all QA benchmarks, highlighting the critical limitation of a single, static representation in handling complex query semantics. Although prompt-aware embeddings boost performance, generating diverse embeddings can be very expensive in real-world scenarios; consider a model that needs to generate different embeddings based on various user inputs. 2. Achieving optimal performance requires a complex, learned MLP fusion across the last-token outputs of all LLM layers. Ablation studies showed that simpler aggregation methods, such as a direct weighted sum across layers, underperform, suggesting the necessary cross-layer interactions are highly complex. This leaves readers wondering whether all the layers are necessary, or whether there are other strategies for selecting a sufficient subset of meaningful layers. 3. As an LLM-based model built on a 7B-parameter backbone, Qwen2.5-Omni, the training demands are significant, requiring large-scale infrastructure (192 H20 GPUs for approximately 36 hours), which is hard for many labs to reproduce. ### 1. Analysis of Prompt-Aware Embeddings and Generalization * Could the authors elaborate on the fundamental limitations preventing a single, static embedding from adequately capturing complex query semantics for QA? Does this performance gap imply that for high-level reasoning tasks, WAVE’s LLM backbone is using the prompt to select relevant features internally rather than deriving a truly universal, task-agnostic representation? * For users who need a generalized embedding for downstream applications that lack explicit questions, what is the optimal and computationally lightest prompt recommended by the authors that can minimize performance loss while maintaining acceptable semantic coverage? ### 2. Feature Fusion and Interpretability * Given that a direct weighted sum underperforms, suggesting that cross-layer interactions are complex and non-linear, can the authors provide deeper insights into what the two-layer MLP fusion module learns? Are there any visualization techniques or analyses that can illustrate the relative importance assigned to low-level (early-layer) perceptual cues versus high-level (late-layer) semantic reasoning during the fusion process? ### 3. Dual-Encoder Necessity and Redundancy * Since BEATs is designed for comprehensive audio event understanding, did the authors explore replacing the existing speech encoder entirely with a second, possibly smaller, instance of the BEATs encoder, or fine-tuning a single, unified audio encoder? If the dual approach is mandated by specialized roles, how do the embeddings from the two encoders contribute uniquely to the final unified representation, beyond the observed performance boost in audio retrieval? ### 4. Inference and Computational Cost * While the training resources are impressive (192 H20 GPUs for 36 hours), could the authors provide a comparison of the average latency and total inference computation (FLOPs or time per sample) required to generate a WAVE embedding versus competing MLLM-based embedding models (e.g., LamRA or CAFe)? Specifically, how much overhead is introduced by processing and fusing features from all layers compared to standard last-token pooling from only the final layer? Fully AI-generated
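As a concrete reference for the fusion mechanism this review discusses, the sketch below shows one way to fuse last-token hidden states from every LLM layer with a small two-layer MLP; the dimensions, normalization, and module design are assumptions for illustration, not WAVE's actual implementation.

```python
import torch
import torch.nn as nn

class LayerFusion(nn.Module):
    """Fuse per-layer last-token states into one embedding (illustrative design)."""
    def __init__(self, num_layers: int, hidden: int, out_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_layers * hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (batch, num_layers, hidden) last-token state from each layer
        fused = self.mlp(layer_states.flatten(start_dim=1))
        return nn.functional.normalize(fused, dim=-1)   # unit norm for retrieval

# Illustrative sizes; a 7B backbone has more layers and a larger hidden width.
batch, num_layers, hidden = 2, 8, 512
fusion = LayerFusion(num_layers, hidden, out_dim=256)
print(fusion(torch.randn(batch, num_layers, hidden)).shape)   # torch.Size([2, 256])
```

A learned map over the concatenation of all layers can capture non-linear cross-layer interactions that a per-layer weighted sum cannot, which is consistent with the ablation the review cites.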