|
DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for Autonomous Travel Planning Agents |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces DeepTravel, an RL framework for autonomous travel planning. DeepTravel enables the agent to plan, execute external tools, and iteratively refine itineraries. The framework includes a sandbox environment built from cached transportation, accommodation, and POI data, a reward system with trajectory and turn-level verifiers, and a replay-augmentation method for learning from failed experiences. The proposed approach outperforms larger state-of-the-art reasoning models (used in zero-shot mode) on the target benchmark.
- The paper is easy to read and understand. The graphical illustrations are clear and informative.
- The presented approach significantly outperforms current state-of-the-art models in solution quality on travel planning tasks.
W1: The proposed approach is primarily engineering work rather than a conceptual innovation. The simulated API cache closely resembles existing agentic environments used for tool-based or web-interaction training, such as ReTool and WebSailor. The hierarchical verifier component functions as a rule-based reward layer, merely providing additional supervision signals instead of introducing a novel learning principle. Similarly, the replay augmentation strategy represents a simplified variant of well-known curriculum or prioritized replay mechanisms in reinforcement learning [1, 2]. Consequently, the overall novelty of the paper is limited.
W2: Although the paper claims an end-to-end agentic RL framework, many components are manually supervised and tied to proprietary systems. The reward model uses human-engineered scoring rules, and the sandbox relies on internal APIs and unavailable data, making the framework neither fully end-to-end nor externally reproducible. Even though the appendices are extensive, external researchers cannot replicate the environment or results.
W3: Evaluation metrics are internal and mostly rule-based. Only 50 online human evaluations are too few for robust validation. Moreover, the online evaluation uses queries collected from the production environment rather than deploying the system to real users.
W4: The paper doesn’t transparently discuss how the system was adapted to the nine reasoning LLMs it was compared to. Did the authors attempt to improve them, for example by providing in-context data from the dataset used to train DeepTravel?
[1] Jiang, M., Grefenstette, E., & Rocktäschel, T. (2021). Prioritized Level Replay. In Proceedings of the International Conference on Machine Learning (ICML), pp. 4940–4950. PMLR.
[2] Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., & Zaremba, W. (2017). Hindsight Experience Replay. In Advances in Neural Information Processing Systems (NeurIPS), 30.
[3] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., & Guo, D. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
**Minor suggestions**:
- Make the descriptions of tables and plots more self-contained to improve clarity and readability.
Q1: Does the proposed simulator/environment support multi-turn dialogue? For example, can the user see a response and add further requests, or does the system need to ask clarifying questions?
Q2: What is the token budget and inference cost for the baseline LLMs? |
Lightly AI-edited |
|
Transferring Jailbreak Attacks from Public to Private LLMs via Local Prompt Optimization |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper aims to jailbreak black-box LLMs that were originally finetuned from some open-source LLMs. The authors design a transfer attack that synthesizes jailbreak prompts based on the original open-source LLMs to attack the black-box finetuned ones. The proposed attack was compared with some very old jailbreak baselines, but unfortunately could not beat the black-box PAIR attack in either jailbreak effectiveness or efficiency.
**None.**
This paper contains two major flaws that make it impossible for me to vote for acceptance:
- The proposed attack implicitly requires access to the full next-token sampling distribution of the targeted LLM, which, however, is basically impossible in real-world LLM services.
- Both the query efficiency and the attack effectiveness of the proposed black-box attack are significantly weaker than those of other existing black-box attacks with fewer assumptions.
Please see **Weaknesses** for details.
1. **(Major flaw 1)** In Line 213, the authors state that "they can implement the distance function $Dist(\cdot)$ as the cross entropy loss". Since the authors do not discuss this distance function further, I have to assume that they directly implement $Dist(\cdot)$ as the cross-entropy loss in all experiments in their paper. If this is the case, it means the authors implicitly assume that the attacker can access **the full next-token sampling distribution of the targeted LLM**, since the cross-entropy loss needs to be calculated on the output logits/distributions of the targeted model (the cross-entropy expression is written out after this list for concreteness). However, due to privacy issues [r1], real-world LLM black-box services typically do not provide access to the full next-token sampling distributions to end users. For example, OpenAI only allows access to the log probabilities for up to the $20$ most likely tokens [r2]. This issue significantly limits the practicality of the proposed attack.
2. **(Major flaw 2)** The proposed black-box attack is significantly weaker than the PAIR attack, which is a black-box attack that this paper uses as a baseline, in both query efficiency and attack effectiveness. Furthermore, the PAIR attack does not even need to assume that the targeted private model is fine-tuned from an open-source model. As a result, I cannot see any advantages of the proposed attack. Specifically, for query efficiency, the proposed attack needs to query the targeted model $O(1000)$ times (see Line 320; I am not sure if it is $1000$ or $2000$), while the PAIR attack only needs to query the targeted LLM up to $20$ times to find an effective jailbreak prompt (see the original paper [r3] of the PAIR attack). For attack effectiveness, Table 2 clearly shows that the proposed attack is significantly weaker than the PAIR attack.
3. The threat model the authors consider is very narrow and impractical. I do not think it is reasonable or practical to assume that an arbitrary black-box targeted LLM is fine-tuned from an open-source LLM. The authors need to provide more real-world evidence to justify this assumption.
4. In Eq. (3), the authors claim that "$p(x^*_{[n+1:n+H]})$ is the target output logit values". Is this the logit of the original model or the fine-tuned one? Besides, it seems that this is not a logit conditioned on an input prompt, is it? So why would you want to align a conditional distribution to this unconditional one under the $Dist(\cdot)$ function? From my perspective, such an operation is meaningless.
5. The GBDA baseline considered is too old and weak. [r4] has already shown that the GCG attack is much stronger than the GBDA attack. The authors should consider more advanced jailbreak attack baselines such as [r5, r6, r7].
6. The authors aim to attack models fine-tuned from open-source models. However, their experiments relevant to open-source LLMs are conducted on only two models: Vicuna-7B and Llama-2-7B. Both of these open-source models are too old and too weakly aligned, so I do not think the empirical conclusions drawn from these models can be generalized.
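To make the concern in flaw 1 concrete, here is the quantity written out in my own notation (assuming $Dist(\cdot)$ is indeed the standard token-level cross-entropy between the targeted model's and the local model's next-token distributions):

$$Dist\big(p_{\text{tgt}},\, p_{\text{loc}}\big) \;=\; -\sum_{v \in \mathcal{V}} p_{\text{tgt}}(v \mid x)\,\log p_{\text{loc}}(v \mid x).$$

Evaluating this sum requires the targeted model's probability $p_{\text{tgt}}(v \mid x)$ for every token $v$ in the vocabulary $\mathcal{V}$, whereas black-box services expose at most the log probabilities of a handful of top-ranked tokens (e.g., $20$ for OpenAI [r2]).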
**References**
[r1] Carlini et al. Stealing Part of a Production Language Model. ICML 2024.
[r2] https://platform.openai.com/docs/api-reference/responses/create#responses_create-top_logprobs
[r3] Chao et al. Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv 2023.
[r4] Zou et al. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv 2023.
[r5] Sadasivan et al. Fast Adversarial Attacks on Language Models In One GPU Minute. ICML 2024.
[r6] Hayase et al. Query-Based Adversarial Prompt Generation. NeurIPS 2024.
[r7] Andriushchenko et al. Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. ICLR 2025.
See **Weaknesses**. |
Fully human-written |
|
Transferring Jailbreak Attacks from Public to Private LLMs via Local Prompt Optimization |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a two-stage local prompt optimization attack to transfer jailbreaks from a public LLM to a private fine-tuned LLM that is accessible only through queries. First, an auxiliary suffix is learned to align the public model’s output distribution with that of the target private model. In the second stage, an adversarial suffix is optimized on the public model, using that alignment as a surrogate for target gradients. Experiments report higher ASR on AdvBench and some transfer to closed-source and target models.
* The motivation is clear and grounded in a public-to-private transfer setting that is relevant for practical security concerns.
* The idea of using an auxiliary alignment suffix to approximate gradients via a public model is promising.
* The method achieves notable improvements over naive transfer baselines and shows competitive performance against some black-box attacks in certain cases, though gains are not consistent across all settings.
1. The claims about generality across fine-tuning settings are overstated, as the evidence is limited to a single LoRA-based example in the appendix. Although the introduction mentions broader fine-tuning scenarios (e.g., domain-specific tuning), these are not experimentally validated.
2. The main experiments are limited to a single model pair (LLaMA–Vicuna). This is insufficient to support the broader claims about transferability. Additional pairs such as GPT-NeoX → Dolly, Mistral → Mistral-Instruct, or base → instruct/code models would make the conclusions more robust.
3. The ASR computation method (“judge”) is under-specified. The list of refusal phrases and the generation length are not reported, though these factors can strongly affect results, as different models use different refusal phrases or can decide not to answer at later stages [1,2]. Given evidence that rule-based judges are unreliable, an LLM-based judge (e.g., GPT-4 or LLaMA-Guard) would make the evaluation more credible.
4. The approach might be adaptable to other optimization-based attack formulations [3,4], but this potential is unexplored. A brief experiment or discussion would strengthen the paper’s generality claim.
5. The cost comparison is not well controlled. It is unclear whether baselines in Table 1 were reimplemented under the same conditions or taken from prior work. Query budgets and per-example query counts are not reported, making the “comparable efficiency” claim hard to evaluate.
6. The efficiency claims in “Computational and Query Efficiency” part are unclear. Algorithm 1 appears to roughly double the query cost compared to GCG, yet the paper suggests similar or lower cost. Additionally, more detail is needed on how the suffix candidate sets are generated, filtered, and whether this was actually done in experiments.
7. Table 3 lacks baseline comparisons, making it difficult to interpret the reported transfer results in context.
***Minor remarks:***
1. Clarify the meaning of “baseline,” “ours w/o a,” and “ours” in Table 1 in more detail; this distinction is currently unclear and hard to follow.
2. The new model results in the appendix are central to the main argument and should be moved into the main text.
3. Define the term “black-box” consistently, as in the literature it can sometimes mean query-only access and other times query + logits.
4. The asterisk (*) in Table 2 is not explained anywhere.
5. The results on the right half of Table 1 are somewhat inconsistent with the paper’s narrative, since Vicuna is a fine-tuned derivative of LLaMA rather than the other way around. The interpretation of transfer direction and its implications for fine-tuning robustness should therefore be clarified.
6. The discussion of Table 2 notes that the method underperforms PAIR but omits that it also underperforms TAP; this should be acknowledged for completeness.
7. Section 3.3 reads like the beginning of a separate chapter rather than a continuation of the current discussion. You might consider reframing its opening or adding a short transition to better connect it with the previous sections.
[1] Zhou, Y., & Wang, W. (2024). Don’t Say No: Jailbreaking LLM by Suppressing Refusal
[2] Ran, D., Liu, J., Gong, Y., Zheng, J., He, X., & Cong, T. et al. (2025). JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
[3] Jia, Xiaojun; Pang, Tianyu; Du, Chao; Huang, Yihao; Gu, Jindong; Liu, Yang; Cao, Xiaochun; Lin, Min (2024). Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
[4] Andriushchenko, M., Croce, F., Flammarion, N. (2025). Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
1. Have you tested transferability between different versions within the same model family (e.g., LLaMA 2 → LLaMA 3)? Although it is not technically finetuning, these models share many components and it would be interesting to see their transfer.
2. What happens when transferring between different model sizes (e.g., 3B → 7B)?
3. How do you interpret these results in light of findings that fine-tuning often reduces robustness [5]?
4. Could you provide the attack results without the auxiliary weight (Table 2) to isolate its contribution?
5. Have you considered using ensemble GCG (as described in section 3.2 of [6]) in this setting? It may improve transfer, especially when the target family is unknown.
6. Have you tested the attacks against any defense mechanisms? Are they more or less likely to be detected than baseline attacks?
7. Did you keep the models’ default system prompts (e.g., “You are a helpful AI assistant”) during all experiments, or were they overridden? This choice can meaningfully affect jailbreaking success rates.
[5] Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P., & Henderson, P. (2023). Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
[6] Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. |
Lightly AI-edited |
|
Transferring Jailbreak Attacks from Public to Private LLMs via Local Prompt Optimization |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a two-stage local prompt optimization framework for transferring jailbreak attacks from public to private LLMs in black-box settings. The key idea is to introduce an auxiliary adversarial suffix $a$ to align the output distributions between a public base model and a target private model, then optimize the attack suffix $s$ on the aligned public model. The method is evaluated on open-source models (Vicuna, LLaMA2) and proprietary models (GPT-4, Claude), showing high attack success rates comparable to white-box attacks.
The problem is practically relevant given the widespread practice of fine-tuning public LLMs for private deployment. However, the technical contribution is incremental over existing gradient-based methods like GCG, and the theoretical justification for the proposed approximation is insufficient.
The paper addresses a realistic threat model where private LLMs are fine-tuned from public models and accessible only via query APIs. This scenario is increasingly common in practice.
The two-stage optimization strategy is intuitive. By first aligning distributions via an auxiliary suffix and then optimizing the adversarial suffix on the aligned model, the method bridges white-box optimization with black-box transfer.
The experimental coverage is broad, including both open-source and proprietary models. The transferability experiments in Table 2 demonstrate that attacks can generalize to various downstream models.
The paper shows that even without knowing the exact base model, using a surrogate model can still achieve reasonable attack success rates (Table 3), which strengthens the practical applicability.
The core technical contribution is limited. The method essentially combines distribution alignment with GCG-based optimization. The key approximation in Equation 5, $\nabla_{e_s}L_{llm}(s; \theta_{loc}) \approx \nabla_{e_s}L_{llm}(s + a; \theta_0)$, lacks rigorous theoretical justification. The two bullet points provided (PEFT freezes parameters and gradient-based candidate selection has error tolerance) are insufficient to establish when and why this approximation holds. No error bounds or convergence analysis is provided.
The experimental setup has several issues. First, the baseline comparisons primarily use older methods (GBDA, Autoprompt, GCG from 2021-2023). More recent methods like PAIR, TAP, and AutoDAN only appear in the transferability experiments without fair comparison in the main results. Second, the main experiments use Vicuna-7B and LLaMA2-7B, which are relatively old models from 2023. While the appendix includes newer models, the evaluation on state-of-the-art aligned models (e.g., GPT-4o, Claude 3.5 Sonnet) is missing. Third, the computational and query efficiency claims lack detailed analysis. The paper mentions "up to 70% reduction" in queries but provides no breakdown of costs across different stages.
The term "local fine-tuning" is misleading throughout the paper (appearing in Section 3.2 title, Figure 1, Introduction, and Experimental sections). The method does not fine-tune any model parameters but only optimizes prompt suffixes locally. This terminology creates confusion about the actual technical approach. The authors should use more accurate terminology such as "local prompt optimization" consistently throughout the paper.
The evaluation metrics are limited. The paper only reports Attack Success Rate (ASR) based on whether responses contain negative phrases. There is no assessment of attack quality, harmfulness severity, or human evaluation of the generated outputs.
The limitations section acknowledges that the method requires prior knowledge about the base model and is less effective when the private model diverges significantly from the public base. However, the paper does not explore how much divergence can be tolerated or provide guidelines for practitioners.
The defense discussion is superficial. Table 6 in the appendix shows that adversarial re-finetuning reduces ASR from 79.1% to 48.3%, but this is only tested with 500 adversarial examples and lightweight LoRA updates. No other defense mechanisms are explored.
Can you provide a formal analysis of the approximation error in Equation 5? Under what conditions (e.g., amount of fine-tuning, model architecture) does this approximation hold, and what are the theoretical error bounds?
How does the method perform when the private model undergoes substantial fine-tuning (e.g., full fine-tuning instead of LoRA) or uses a significantly different fine-tuning dataset? Can you provide experimental results on models with varying degrees of divergence from the base?
The query efficiency claim needs more details. How many queries to the target model are required in each iteration? How does this compare to methods like PAIR that directly query the target model? Can you provide a detailed breakdown of query costs?
Why are the newer baseline methods (PAIR, TAP, AutoDAN) not compared in the main experiments (Table 1)? These methods should be included for a fair comparison.
Can you evaluate on more recent and strongly aligned models such as GPT-4o, Claude 3.5 Sonnet, or LLaMA 3.1? The current evaluation on 2023-era models may not reflect the method's effectiveness on state-of-the-art systems.
Have you considered evaluating the quality of generated attacks beyond binary success rates? For example, measuring the actual harmfulness of responses or conducting human studies?
The paper claims this framework can also be used as "an effective tool for safeguarding open resources from misuse" (Equation 8). Can you elaborate on this use case with concrete examples and experiments? |
Fully AI-generated |
|
Transferring Jailbreak Attacks from Public to Private LLMs via Local Prompt Optimization |
Soundness: 3: good
Presentation: 1: poor
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
In this paper, the authors propose a two-stage optimization-based attack to threaten the black-box private LLMs. Experiments have demonstrated their transferability and effectiveness between open-source and closed-source models.
1 This paper is easy to follow.
2 The attack is effective for both closed-source and open-source models.
3 The soundness of the proposed method is good.
1 The author appears to have confused the "\citep" command with the "\cite" command. Based on my observation, it seems the incorrect command has been used throughout the entire paper.
2 Some typos exist in this paper. For example, in Line 280, there should be a space before 'In each iteration'.
3 No regularization is applied to ensure the readability of the crafted suffix. This makes the attack highly vulnerable to the PPL-based detection method [1].
4 The baselines for comparison are old. For example, GBDA, Autoprompt, and GCG are attacks that were proposed before 2023. I suggest the authors compare their attack with more SOTA attacks, such as AutoDAN-turbo [2].
5 In Table 2, the authors compared their proposed attacks with query-based and optimization-based attacks. However, as far as I know, context-based attacks like ICA [3] and I-FSJ [4] are another kind of black-box attack; I suggest the authors add them to Table 2 as baseline methods.
6 The robustness of the attack against defense is not evaluated, like smoothLLM [5], PAT [6] and RPO [7].
[1] Baseline Defenses for Adversarial Attacks Against Aligned Language Models
[2] Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms
[3] Jailbreak and guard aligned language models with only few in-context demonstrations
[4] Improved few-shot jailbreaking can circumvent aligned language models and their defenses
[5] SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
[6] Fight Back Against Jailbreaking via Prompt Adversarial Tuning
[7] Robust prompt optimization for defending language models against jailbreaking attacks
1 I think Section 3.3 is confusing. Why would the owners of the public model want to forbid fine-tuning in certain cases?
2 In Equation 8, the $s^{(t)}$ minimizes the formulation $Dist[p(\tilde{x}|x_{in}+s;\theta_0),p(\tilde{x}|x_{in};\theta_0)]$. Does this formulation indicate that $s$ has no impact on the output of the target LLMs? |
Fully human-written |
|
MANGO: Natural Multi-speaker 3D Talking Head Generation via 2D-Lifted Enhancement |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes MANGO, a two-stage framework for natural, multi-speaker 3D talking head generation. Stage-1 predicts FLAME parameters (facial expression and articulated head/jaw pose) directly from dual-speaker audio, explicitly modeling speaking/listening interaction. Stage-2 renders the predicted 3D motion with a 3D-Gaussian splatting–based image generator and uses 2D photometric/perceptual losses to lift supervision back to the 3D motion, alternating training between the motion and the renderer. On a new multi-speaker conversational dataset, the method reports improved 3D motion accuracy and 2D visual/lip-sync scores versus recent baselines.
- Modeling both speakers’ audio and the role switch (speaking vs. listening) is well motivated and aligns with emerging conversation-aware talking-head research. This is a non-trivial step beyond speaker-only driving.
- Alternating 2D-lifted supervision is elegant and plausible. Using a fast differentiable renderer (3D Gaussians) to refine motion predicted in Stage-1 is technically sound and explains the observed improvement in mouth articulation/expressiveness.
- Evaluation with community-recognized metrics. Reporting LSE-C/LSE-D alongside image-space metrics aligns with established practice in audio-visual lip-sync evaluation.
- Limited analysis of listening behaviors. The qualitative figures suggest better non-verbal feedback (nodding, smiles), but there is little targeted measurement of listener realism beyond global metrics. Consider role-conditioned metrics that cover more diverse non-verbal signals, or human studies that separately score speaking vs. listening segments.
- Ablations could isolate Stage-2’s contribution more sharply. It would help to report identical Stage-1 models trained (a) without any 2D-lifted refinement, (b) with only photometric vs. only perceptual losses, and (c) with/without Gaussian renderer fine-tuning.
1. How sensitive is Stage-1 to errors in active-speaker detection and speech overlaps? Any quantitative robustness test (e.g., synthetic noise or mis-attribution)?
2. Can the model generalize to unseen speakers and to diverse emotions (e.g., laughter, surprise)? A small cross-emotion test would be informative. Please refer to relevant work [1].
[1] LaughTalk: Expressive 3D Talking Head Generation with Laughter, https://arxiv.org/pdf/2311.00994 |
Fully AI-generated |
|
MANGO: Natural Multi-speaker 3D Talking Head Generation via 2D-Lifted Enhancement |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents MANGO, a two-stage framework for multi-speaker 3D talking head generation. The first stage uses a diffusion-based dual-audio motion generator to produce 3D motion parameters conditioned on both speakers’ audio. The second stage employs a 3D Gaussian Renderer (MG-Renderer) to synthesize high-fidelity images using 2D photometric supervision, which the authors refer to as a 2D-lifted enhancement. A new dataset, MANGO-Dialog, is also introduced, consisting of over 50 hours of synchronized 2D-3D conversational data from 500+ speakers. Experimental results demonstrate quantitative and qualitative improvements over prior 3D talking head methods such as DualTalk and DiffPoseTalk, particularly in lip-sync precision and overall mesh–image alignment.
- Novel Dual-Audio Diffusion Framework : The combination of a dual-audio fusion module and a diffusion-based motion generator allows modeling of bidirectional conversational dynamics, distinguishing speaking and listening states more effectively than single-speaker systems.
- Two-Stage 2D-Lifted Enhancement Strategy : The introduction of 2D image-level supervision through the Gaussian Renderer effectively refines 3D mesh predictions, mitigating the noise of pseudo-3D labels obtained from tracking.
- New Dataset (MANGO-Dialog) : A large-scale, 2D–3D aligned multi-speaker dialogue dataset is a valuable contribution that could benefit future research in multi-person conversational synthesis.
- Comprehensive Evaluation : Extensive comparisons with single- and multi-speaker baselines (FaceFormer, DualTalk, SadTalker, etc.) show improved 3D accuracy (LVE, MVE) and 2D fidelity (PSNR, SSIM, LPIPS).
- Limited Conceptual Novelty : The framework is primarily a combination of existing paradigms (3D talking head generation and speech-based motion diffusion) with modest architectural novelty. The overall system resembles a combination of talking head generation and speech separation rather than a fundamentally new paradigm.
- Scalability Concerns : The method generates videos at a head level, which limits scalability to full-scene multi-speaker synthesis. Extending the approach to simultaneous multi-agent scenes or long-turn interactions would be challenging given the per-head rendering design.
- Problem Scope Overlap : The targeted issue of over-smoothed mouth motion is not unique to multi-speaker setups; numerous single-speaker 3D talking head works (e.g., DiffPoseTalk, FaceFormer) already address similar issues with comparable diffusion or transformer-based solutions.
- Relatively Lower Visual Quality : Compared to high-quality 3D generative systems such as HALLO3 or LivePortrait, the generated outputs in Fig. 6 appear less photorealistic and expressive. Leveraging stronger generative priors might substantially improve realism and lip-sync fidelity.
- Terminological Ambiguity : The term “2D-lifted enhancement” is not clearly justified. It appears to describe the process of applying 2D photometric loss to refine 3D motion, but the phrasing could mislead readers into thinking it’s a new geometric transformation rather than a training strategy.
- Advantage over Task Composition : Since the proposed setup essentially combines speech-driven motion synthesis with conversational context modeling, what specific benefits does MANGO achieve beyond simply combining existing talking head and speech separation modules?
- Scene-Level Generation : Could the method be extended to generate an entire two-speaker video scene simultaneously, rather than per-head? If so, what architectural or computational challenges arise due to the current 3D representation?
- Relation to Single-Speaker 3D Methods : How does the proposed system differ from prior works that already tackled over-smoothing using diffusion or Gaussian renderers (e.g., DiffPoseTalk, ARTalk)? Are the improvements mainly empirical or conceptual?
- Quality Gap vs. MANGO-Dialog Baselines : The paper shows improved quantitative metrics, but the generated samples from MANGO-Dialog still look coarse compared to prior generative 3D methods. Have you considered integrating more advanced generative models like LivePortrait or HALLO3 to enhance appearance realism?
- Clarification on “2D-Lifted Enhancement” : Could you clearly define this term? Is it equivalent to the two-stage alternated supervision process (3D→2D refinement), or does it imply a structural connection between 2D features and 3D geometry? |
Fully AI-generated |
|
MANGO: Natural Multi-speaker 3D Talking Head Generation via 2D-Lifted Enhancement |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces MANGO, a 3D conversational multi-speaker talking-head generation framework that unifies the synthesis of both speaking and listening behaviors.
The framework consists of two stages:
1. A diffusion-based multi-audio fusion model that models motion distributions across speakers;
2. A 3D Gaussian Splatting (3DGS) renderer that converts predicted motion sequences into videos, with additional 2D image supervision to mitigate inaccuracies from 3D tracking.
The authors also provide a new 3D conversational dataset for training and evaluation.
1. Presents a unified framework for generating both speaking and listening 3D talking heads. It is a novel and ambitious direction.
2. Incorporating 2D image-level loss after 3DGS rendering helps partially alleviate the errors caused by 3DMM estimation, providing additional supervision for the 3D talking head generation.
1. (Major) Limited effectiveness in both speaking and listening states: From the demo videos, while the model can roughly switch between speaking and listening modes, neither mode performs convincingly.
- Speaking: The lip motions are not accurate and clearly worse than single-speaker baselines such as CodeTalker or DiffPoseTalk. Even though MANGO separates speaking and listening audio inputs and introduces an indicator for speaking status, the generated lips remain unsatisfactory. This raises the question of whether the dual-audio module introduces interference between the two states.
- Listening: The listening behaviors appear almost random or static, lacking clear correlation with the interlocutor’s speech (e.g., at 00:38 in the demo, when hearing “luckily”, DualTalk shows a smile but MANGO does not). The dataset examples contain rich listening behaviors — nodding, smiling, eyebrow raises, or thoughtful blinking — yet these are not reflected in the results. Quantitative or qualitative evidence showing the correlation between listening behavior and input audio would strengthen the claim.
- In conclusion, the framework currently fails to convincingly capture both accurate lip articulation and expressive listening dynamics.
2. (Major) Questionable benefit of 2D image loss after 3D reconstruction: While introducing a 2D image loss after rendering is presented as a core contribution, such image-space supervision has long been standard in 3D face reconstruction pipelines.
Here, applying the 2D loss after generation introduces compounded errors — both from inaccurate expression estimation and imperfect 3D rendering.
The actual effectiveness of this design is unclear and requires visual ablation evidence.
Moreover, the rendered frames in the demo show strong 3DGS artifacts, raising doubts about gradient stability and potential negative impacts on 3DMM coefficient learning.
As an alternative, would it be more stable and effective to optimize the predicted 3DMM (pGT) directly through differentiable rendering, computing the image loss on the rendered image of the pGT?
3. (Major) Relation to INFP remains underspecified: Although the authors claim their task differs from INFP (which generates 2D talking heads), MANGO ultimately renders to 2D and mainly differs in that it uses 3DMM as the intermediate representation instead of INFP’s motion features.
- The two tasks and formulations are thus highly similar, and a visual comparison with INFP would be essential to demonstrate advantages in motion controllability.
- Both methods employ a dual-audio module to link speech and motion features. What are the differences and strengths of MANGO's audio2motion model compared with INFP's?
4. (Minor) Presentation and clarity issues:
- The two claimed contributions, conversational talking-head generation and 2D image loss, appear weakly connected and seem like two independent ideas. And the statement “in conversational scenarios, lip movements become more complex due to the dynamics of interaction” (L80) lacks empirical justification; single-speaker data can exhibit similar complexity.
- The naming of Stage 1/2 and Training Phase 1/2 is confusing. For example, does Training Phase 1 (Stage 2 training) refer to only training the MG-Renderer on pGT meshes with 2D image loss?
- In Equation (3), both $H_{self}$ and $H_{other}$ pass through the Transformer jointly, which seems inconsistent with the schematic in Figure 4.
1. In our understanding, a listener’s expressions during conversation should depend not only on the other party’s speech content but also on the speaker’s facial expressions. Will the speaking state (or visual features of the speaker) be considered as an additional input when modeling the listening behavior? |
Lightly AI-edited |
|
MANGO: Natural Multi-speaker 3D Talking Head Generation via 2D-Lifted Enhancement |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper proposes MANGO, a two-stage framework for generating natural, bidirectional 3D talking heads in multi-speaker conversational settings. Unlike prior work that focuses on either speaking or listening, MANGO aims to model fluid transitions between these states using dual-audio inputs and 2D photometric supervision to refine 3D motion. The authors also introduce MANGO-Dialog, a new dataset of 50+ hours of aligned 2D–3D conversational videos across 500+ identities. The core idea is to bypass the inaccuracies of pseudo-3D labels (from 3D face trackers) by using 2D image-level losses to guide 3D motion learning through a 3D Gaussian renderer.
1. Dual-audio fusion module enables speaker–listener disentanglement
The paper introduces a Dual-audio Interaction Module (DIM) that explicitly models conversational dynamics by fusing self and other speaker audio through a Transformer, followed by a residual connection with the self-audio. This design helps preserve speaker-specific lip-sync fidelity while allowing listener behaviors (e.g., subtle smiles, head nods) to be conditioned on the interlocutor’s speech. As shown in Fig. 7(a) and Table 2, removing this module leads to cross-contamination—e.g., the listener exhibits speaking-like mouth movements. This is a non-trivial contribution, as prior multi-speaker methods (e.g., DualTalk) do not explicitly model such asymmetric audio roles.
2. MANGO-Dialog: A large-scale, temporally aligned 2D–3D conversational dataset
The authors release MANGO-Dialog, comprising 50+ hours of dual-speaker videos across 500+ identities, with synchronized audio, pseudo-3D FLAME parameters (via Spectre), and refined camera poses. Crucially, clips are 30–120 seconds long, ensuring natural speaking–listening transitions—a rarity in existing datasets (e.g., VoxCeleb, HDTF are mostly single-speaker). The dataset also includes speaker diarization labels (via TackNet), enabling training of the speaking indicator. While the 3D labels are pseudo-ground truth (see Cons), the 2D–3D alignment and scale make this a valuable resource for future research in conversational avatars.
1. Inadequate 2D SOTA comparison
The paper compares its 2D output against SadTalker (2023), AniTalker (2024), and ARTalk (2025)—all of which are 3D-parameter-driven 2D renderers, not end-to-end 2D diffusion or neural rendering pipelines. It omits recent high-fidelity 2D talking head methods that achieve near-photorealism and strong lip-sync, such as:
* VASA-1 (Microsoft, 2024): generates real-time, high-resolution, emotionally expressive talking faces from audio + single image.
* OmniHuman-1: supports full-body, multi-view, and expressive control.
* IF-MDM (2024): uses masked diffusion for coherent long-term 2D animation.
* GaussianTalker / FlashAvatar: pure 3D Gaussian-based pipelines that may share architectural similarities with MANGO’s renderer but are not discussed or compared.
Without these comparisons, the claim of “superior 2D realism” is not convincingly supported—especially since MANGO’s 2D results (Fig. 6) show limited texture fidelity (e.g., blurry teeth, flat skin shading) compared to VASA-style outputs.
2. Missing comparison with industry-grade 3D pipelines
The 3D evaluation (Table 1) only includes academic methods (FaceFormer, CodeTalker, DualTalk, etc.). It excludes NVIDIA Audio2Face, which is:
* Widely used in production,
* Trained on high-quality 3D scans,
* Capable of real-time inference,
* Supports expression and viseme controls.
Given that MANGO claims “exceptional accuracy,” a comparison with Audio2Face on the same test set (even via qualitative side-by-side) would be essential to validate industrial relevance.
3. No explicit modeling of head pose dynamics or eye blinking
While the FLAME model includes head pose, the paper does not evaluate or visualize head motion quality. In Fig. 5–6, heads appear mostly static, suggesting the model may underutilize head pose variation—a key aspect of natural listening (e.g., nodding, tilting). Similarly, eye blinking is absent: FLAME does not model eyelids, and the renderer does not synthesize blinking. This leads to unnaturally fixed gazes, reducing perceived realism—especially in listening mode, where blink rate and gaze shifts are critical social signals. Previous methods such as DiffPoseTalk and Media2Face already include head-pose prediction, and some also deliver natural eye blinking.
4. Limited expression control and variation
The method uses FLAME’s expression parameters (ψ ), but the paper provides no analysis of non-mouth expressions (e.g., brow raises, smiles, frowns). While Fig. 6 shows some smiling, it’s unclear whether this is audio-driven or coincidental. There is no user control over expression intensity or type, and no disentanglement between speech-driven and emotion-driven motion. This limits applications requiring emotional or stylistic control.
5. 3D labels derived from noisy 2D-to-3D lifting
The dataset’s 3D labels come from Spectre, which the paper itself critiques (Fig. 2) for over-smoothing or exaggerating lip motion. This creates a fundamental supervision bottleneck: even with 2D-lifted refinement, the initial motion prior is biased. The authors claim their output sometimes exceeds the pseudo-GT mesh (Fig. 5, 9), but this is not quantified (e.g., via human preference or 2D re-projection error vs. GT image). Without ground-truth 3D scans (e.g., from multi-view capture), the true 3D accuracy remains unverifiable.
6. Ablation studies are missing from the demo video
The paper includes strong ablation results (Table 2, Fig. 7), but the supplementary demo video (presumably linked in submission) does not visualize these variants (e.g., w/o DIM, w/o two-stage). This makes it hard for reviewers/users to perceptually validate the claimed improvements. For a method relying on subtle conversational cues, visual ablation is essential.
1. How is Fig. 2 generated?
Fig. 2 shows “over-smoothed” (orange) and “exaggerated/noisy” (blue) 3D lip motion curves compared to a “real” red curve, with visual insets of misaligned meshes. However, the paper does not specify:
* What is the ground-truth reference for the red curve? Is it manually annotated lip keypoints, or derived from high-fidelity 3D scans?
* Which 3D reconstruction methods produced the orange and blue curves? Are they Spectre, DECA, 3DDFA-v3, etc.?
* Are these curves from real conversational data (like MANGO-Dialog) or from single-speaker datasets?
Without this, the figure risks being illustrative rather than empirical, weakening the motivation for 2D-lifted supervision.
2. The paper claims MANGO sometimes outperforms pseudo-GT meshes (e.g., Fig. 5, 9). But how is this quantified?
The visual examples in Fig. 5 and Fig. 9 suggest that MANGO’s mesh aligns better with the 2D ground-truth image than the pseudo-GT mesh from Spectre. However:
* Is there a 2D re-projection error (e.g., L1 distance between rendered mesh and GT image) comparing MANGO vs. Spectre?
* Have you conducted a user study where humans judge which mesh (Spectre vs. MANGO) better matches the GT video?
* If MANGO is “better than pseudo-GT,” does that imply the pseudo-GT is a poor training target—and if so, why not use 2D-only supervision from the start?
This is central to the paper’s core claim but remains anecdotal.
3. The speaking indicator I_self is assumed to be perfectly known. How does performance degrade under realistic diarization errors?
The method uses a binary speaking indicator derived from TackNet (Sec 3.4), which is likely near-perfect on curated clips. But in real-world deployment:
* What happens if I_self flips state 10% or 20% of the time (common in overlapping speech)?
* Is the model robust to missing or delayed indicators?
* Could the model infer the speaking state from audio alone, removing reliance on external diarization?
This affects practical applicability, yet no ablation on indicator noise is provided.
4. The ablation in Table 2 shows “Ours (+two stage)” has higher LVE/MVE than the “+jaw pose” variant. Why does adding 2D supervision increase 3D vertex error?
In Table 2:
* the “+jaw pose” row reports LVE = 0.235,
* the full “Ours (+two stage)” row reports LVE = 0.122.
But in the MANGO-Dialog column of Table 1, the full model reports LVE = 1.741, which is much higher than the ablation’s 0.122. This suggests a unit or normalization inconsistency.
Are the ablation metrics computed on mouth vertices only (as in LVE definition), while Table 1 uses full mesh?
Or is there a scaling difference (e.g., mm vs. normalized units)?
Please clarify the metric definitions and scales across tables to ensure comparability. |
Fully AI-generated |
|
Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes an asynchronous diffusion framework for text-to-image generation. The core idea is to create a pixel-level timestep scheduler and let prompt-related regions be decoded more slowly. Extensive experiments demonstrate that AsynDM achieves better performance than other training-free text-to-image alignment approaches.
1. The idea is novel and well-motivated. The proposed approach, using cross-attention as a mask indicator, is quite intuitive and easy to follow.
2. This paper is clearly written and well-organized.
3. The comparison results are very promising, showing clear advantages over relevant baselines.
4. The authors conduct comprehensive experiments across multiple model baselines, sampler choices, and ablation settings, which provide strong and convincing evidence for the proposed approach.
1. My main concern is the inconsistency between the training and inference stages of AsynDM. From my understanding, during training, noise is added synchronously to all pixels, and the diffusion model predicts $f_\theta(x_t)$ where $x_t$ is a noised latent with a uniform noise level. However, during inference, the input to the diffusion model has asynchronous, spatially varying noise levels across pixels (see the sketch after this list). How can the model reliably decode latents that contain uneven noise distributions, given that it was never explicitly trained under such conditions? Some marginal artifacts may occur in these scenarios.
2. The distracted attention mask relies heavily on cross-attention maps. However, in the early denoising steps, cross-attention maps are often noisy and unstable, which may lead to unreliable or ambiguous guidance when determining which regions should be denoised faster or slower.
3. Similarly, for more advanced T2I models such as SD3.5 or Flux that adopt the MMDiT framework instead of conventional cross-attention, deriving reliable spatial masks becomes more challenging, since text and image latents are concatenated within a self-attention module.
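To illustrate the training–inference mismatch raised in point 1, a minimal sketch in standard DDPM notation (the per-pixel timestep map $t_i$ is my paraphrase of the paper's pixel-level scheduler, not its exact formulation):

$$\text{training: } x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon \quad \text{(one scalar } t \text{ shared by all pixels)},$$

$$\text{inference: } x_{\mathbf{t}}[i] = \sqrt{\bar{\alpha}_{t_i}}\, x_0[i] + \sqrt{1-\bar{\alpha}_{t_i}}\, \epsilon[i] \quad \text{(pixel-specific timesteps } t_i\text{)},$$

so at sampling time the denoiser is evaluated on spatial noise-level configurations it never saw during training. Theoretical or empirical evidence that this gap is benign would address the concern.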
1. How do the authors address the training–inference gap in AsynDM? Is there any theoretical or empirical evidence showing that this discrepancy can be safely ignored, or that it does not occur during sampling?
2. How does AsynDM apply or adapt the distracted cross-attention mechanism to SD3.5-Medium or other architectures based on MMDiT? The process seems non-trivial and could benefit from further clarification.
I would be happy to raise my score if these concerns are properly addressed. |
Lightly AI-edited |
|
Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a novel method that utilizes masks (obtained from network attention or given as fixed masks) to assign separate diffusion timesteps to individual pixels of an image, resulting in clearer inter-pixel context and significant improvements.
**S1: Very good innovation.** I believe asynchronous diffusion can bring a lot of inspiration to subsequent work.
**S2: Significant improvement.** Figure 4 (I was looking forward to discovering more in the supplementary materials, but I couldn't find them) and human survey show extremely good improvements.
**W1: Claim issue**. In the introduction, the authors claim that the text-to-image misalignment is caused primarily by synchronous denoising. However, in many other single-step generative models, including GANs and VAEs, misalignment is also a key problem. Thus, I think this statement lacks strong support. In other words, I believe that the proposed asynchronous denoising method can alleviate this misalignment problem to some extent, but the misalignment may not necessarily be caused by synchronous denoising. Therefore, I believe this statement needs to be revised, and the authors should provide a broader discussion of other approaches to this issue (including existing methods in other generative models [1][2][3] and other potential solutions).
[1] Liao W, Hu K, Yang M Y, et al. Text to image generation with semantic-spatial aware gan[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 18187-18196.
[2] Zhang C, Peng Y. Stacking VAE and GAN for context-aware text-to-image generation[C]//2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM). IEEE, 2018: 1-5.
[3] Wang H, Lin G, Hoi S C H, et al. Cycle-consistent inverse GAN for text-to-image synthesis[C]//Proceedings of the 29th ACM international conference on multimedia. 2021: 630-638.
Q1:
The proposed method uses different timesteps for different pixels. However, current research on diffusion focuses on reducing the number of diffusion timesteps. Therefore, when this method is applied to models with very few denoising steps, e.g., T=4 in DDGAN [4], can the effect still be significantly improved?
Q2:
Can the proposed method be further combined with Patch Diffusion [5] to improve the generation quality?
Q3:
From Fig. 4, we can see clear improvements of the proposed method over other methods. However, the quantitative improvements in Table 1 are not obvious. I believe this is a point worth explaining.
[4] Xiao Z, Kreis K, Vahdat A. Tackling the Generative Learning Trilemma with Denoising Diffusion GANs[C]//International Conference on Learning Representations.
[5] Wang Z, Jiang Y, Zheng H, et al. Patch diffusion: Faster and more data-efficient training of diffusion models[J]. Advances in neural information processing systems, 2023, 36: 72137-72154. |
Fully human-written |
|
Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The work deals with text-to-image (T2I) diffusion models. The authors argue that the misalignment between the generated image and the textual prompt results from a spatially and temporally uniform (synchronous) change of the denoised image pixels. Hence they propose to denoise some regions differently, at different time steps during the reverse denoising process, the timestep of every pixel being dynamically determined. For this, a timestep is assigned to each pixel. The regions of the image that are significantly associated with some tokens of the prompt are then scheduled according to a specific (decreasing concave) function. The approach is compared to recent baseline models on four prompt datasets, reporting four alignment metrics.
* the initial motivation -- the fact that it might be worthwhile to denoise various regions of an image differently -- is intuitive and makes sense. It is clearly explained both in the introduction (section 1) and the method itself (section 3).
* the quantitative and qualitative results are reported with three different backbones, including the recent SD 3.5 (in the appendix). The proposed approach is compared to four recent baselines, published in 2024 or 2025 in top-tier venues (CVPR, NeurIPS, ICLR).
* the quantitative results are computed with a significant number of 1280 images per prompt set, with the same random seed for all models.
* the paper ignores an important part of the literature relating to generative semantic nursing. Following Attend-and-Excite [d], several papers investigated optimizing alignment (cross-attention) between the prompt and the noise during the backward process, e.g., Divide and Bind [g] or SynGen [h]. Similarly to the proposed work, these works showed that the "regions" of the image are indeed decoded at various timesteps, but the conclusion was rather that diffusion models first reconstruct the high-power, low-frequency components at early denoising stages before adding low-power, high-frequency details at later stages [i,j]. Positioning with regard to these works would thus have been relevant.
* The quantitative results are not convincing
- the authors do not adopt previous protocols. In particular, for GenEval, they use the same 553 prompts but do not use the metrics of the GenEval benchmark itself, making it hard to compare to previously published results. For DrawBench, the metrics are not the same as, e.g., (Bai LiChen et al., 2025), making it hard to compare directly with previously published results.
- the BERTScore used relies on image descriptions obtained with an ad-hoc model, Qwen2.5-VL-7B-Instruct in this paper. The resulting scores thus evaluate both the models tested and the model used to generate the "ground truth". Similar remarks apply to the QwenScore. However, the two other metrics also reflect alignment (see below).
- all the metrics deal with text-to-image alignment, ignoring other aspects of image generation. One can understand that this aspect is important to evaluate for this paper, but it should have been asserted that, for example, the image quality is -- at least -- maintained.
- the quantitative results are reported without any standard deviation, while the quality of generated images (both for aesthetics and alignment) is known to be variable w.r.t. the seed. Given the limited difference in performance compared to baselines (in particular for the most recent backbones in Table 4 and Table 5), one can have doubts regarding the significance.
- by the way, it is hard to understand why the results in the main paper are based on an old model (SD 2.1) while results on the more recent SDXL and SD 3.5 are relegated to the appendix.
* the human study is poorly described
- there is no detail on the 22 participants: are they diversified in gender and age? Are a majority of them students at the same university as the authors? Or even colleagues? Or the authors themselves? Is there some diversity in terms of native language? If so, how were the prompts presented (in English or in their native language)?
- it is not clear how many triplets were shown to the participants, nor how these triplets were chosen. For the automatic evaluation of Table 1, it is said that $4\times 1280$ images are considered for each of the prompt sets. Just after, on line 376, it is reported that the participants evaluate "for each group of three candidate", suggesting that they evaluate 5120 triplets. One can doubt that the participants actually made that many evaluations.
- there is no inter-annotator agreement, making it hard to estimate the relevance of the study
- for good practice on human studies, one can refer to [f]
* the qualitative results in Figure 4 are poor for the baselines, but this seems to be mainly due to the old SD 2.1 backbone used. If one uses the [online inference available for SDXL on huggingface](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) -- while it is itself quite old, having been released in July 2023 -- it is quite easy to get much better results for the base DM than those reported in Figure 4 (with the same prompts).
- in the Appendix, in Figures 10 and 11, the qualitative results for SDXL and SD 3.5 often look better for the baselines than for the proposed model.
* minor
- the definition of $x_i\in\mathbb{R}^{n_c\times h\times w}$ on line 194 should be introduced earlier, before equation (4) around line 184. It is indeed a crucial change for the proposed method since it reflects the *pixel-level* aspect.
- the references for "text-to-image misalignment" on lines 057-058 are recent but inappropriate, since this phenomenon was identified well before for diffusion models, e.g. in DALL-E [a] released as a preprint, Imagen (Saharia et al., 2022), DAAM [b], Structured Guidance [c] and Attend&Excite [d]. Not to mention previous works with GANs, e.g. SOA [e]. The reference (Liu et al., 2025) may be relevant since it is a review, although it seems to be only a preprint and not (yet?) published. However, (Hu et al., 2025a) is just a recent paper dealing with a problem already known.
- which "SD 2.1 base" (line 300) is used ? SD 2.1-512 or SD 2.1-768?
[a] A. Ramesh et al. "Hierarchical Text-Conditional Image Generation with CLIP Latents". arXiv:2204.06125 (2022).
[b] R. Tang et al. "What the DAAM: Interpreting Stable Diffusion Using Cross Attention". ACL 2023.
[c] W. Feng et al. "Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis". ICLR 2023.
[d] H. Chefer et al. "Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models". In: ACM Trans. Graph. 42.4 (July 2023)
[e] Hinz et al "Semantic Object Accuracy for Generative Text-to-Image Synthesis", TPAMI 2022 (and arXiv:1910.13321 in 2019)
[f] M. Otani et al. "Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation". CVPR, 2023
[g] Li et al "Divide & Bind Your Attention for Improved Generative Semantic Nursing", BMVC 2023
[h] Rassin et al "Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment" ,NeurIPS 2023
[i] S. Rissanen, M. Heinonen, and A. Solin. “Generative modelling with inverse heat dissipation”. ICLR 2023
[j] Y.-H. Park et al. “Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry”. NeurIPS 2023
- which "SD 2.1 base" (line 300) is used ? SD 2.1-512 or SD 2.1-768?
- why have the results on SDXL and SD 3.5 not been reported in the main paper (and those with SD 2.1 in the appendix)?
- Could we have more details about the study conducted with humans, in particular on the cohort of 22 participants (see above)? |
Fully human-written |
|
ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes ConciseHint, a framework that improves reasoning efficiency in large reasoning models by injecting concise hints during generation rather than before reasoning or through fine-tuning. The method continuously introduces either manually designed or learned hints to encourage concise thinking while maintaining accuracy. It adaptively controls the hint intensity and injection position based on query complexity through parameters α and β. The extended version, ConciseHint-T, learns hint embeddings from concise reasoning data and introduces a controllable interpolation parameter γ. Experiments on GSM8K, AIME24, and GPQA-Diamond using Qwen and DeepSeek-R1 show token reduction with minimal accuracy loss.
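For readers who want the adaptive-intensity idea in concrete form, below is a minimal, purely hypothetical Python sketch of the kind of schedule described above: a base interval α between hint injections that grows with the length already generated (so longer, presumably harder reasoning is interrupted less often), with each injection placed away from the tail of the trace. The formula, function name, and the 0.8 tail cap are an illustrative reading of the description, not the authors' code.

```python
def injection_schedule(total_len: int, alpha: int = 128, beta: float = 0.2,
                       tail_cap: float = 0.8) -> list[tuple[int, int]]:
    """Return (tokens_generated_so_far, injection_position) pairs for a reasoning
    trace of total_len tokens. Assumption: the gap between injections grows with
    the current length (gap = alpha + beta * produced), and each hint is inserted
    no deeper than tail_cap * produced tokens into the trace."""
    points, produced = [], 0
    while True:
        produced += int(alpha + beta * produced)  # assumed adaptive interval
        if produced >= total_len:
            return points
        points.append((produced, int(tail_cap * produced)))

# Example: injection_schedule(1500) injects after roughly 128, 281, 465, ... tokens,
# i.e. progressively less often as the reasoning grows.
```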
- The proposed training-free ConciseHint framework introduces an interesting perspective by applying in-reasoning intervention rather than pre-reasoning prompting or fine-tuning.
- The paper addresses an important and timely question about improving reasoning efficiency in large reasoning models.
- The experimental results are promising, showing substantial token reduction while maintaining accuracy across the Qwen family of models.
- The idea of adaptive hint injection and its potential for plug-and-play integration with existing efficiency techniques is conceptually appealing.
1. Hyperparameter Selection and Clarity
- The strategy for determining hint injection intensity (parameters α and β) appears ad hoc.
- Although the paper claims that hints are “learnable,” the positions and frequencies of injection are determined through manually tuned hyperparameters rather than learned mechanisms.
- The description of α and β is confusing and inconsistent. For example, the paper suggests α should be "small," yet sets α = 128 without clarifying what range is considered small or providing sufficient justification through a sensitivity analysis.
2. Overreliance on Hyperparameters
- The effectiveness of ConciseHint heavily depends on α, β, and γ, which undermines its claim of adaptivity.
- The framework’s practicality is limited without a clear or automated method for selecting these hyperparameters. This raises concerns about reproducibility and robustness across unseen models. Notably, the current hyperparameter settings work well for Qwen-family models but are not as effective for DeepSeek models.
3. Incomplete Experimental Coverage
- The comparison with baselines is uneven across models. For instance, Qwen3-4B includes BeConcise and other baselines, but DeepSeek-R1-14B results omit some of these.
- It is unclear why BeConcise or similar prompting-based baselines cannot be combined with ConciseHint for stronger comparisons. These inconsistencies suggest the experiments are not yet fully comprehensive or standardized.
4. Overclaim on Reasoning
- The evaluation focuses primarily on math and physics QA datasets (GSM8K, AIME24, GPQA-Diamond). Such domains do not represent general reasoning; broader evaluations on coding, commonsense, or multimodal reasoning datasets are needed.
- As a result, the paper’s claim of improving “reasoning efficiency” in general is somewhat overstated.
5. Trade-off Between Accuracy and Efficiency
- In Table 2, performance noticeably degrades when γ increases (e.g., γ = 1), indicating potential instability or limited generalization.
- The results reveal a clear trade-off between conciseness and accuracy, which should be analyzed more thoroughly rather than only reported. A discussion of this trade-off would help readers better understand practical deployment choices.
6. Writing and Presentation Issues
- The paper is difficult to read, with dense notation and numerous hyperparameters that could be summarized more clearly.
- Several equations (e.g., Eq. (1)–(3)) could benefit from intuitive explanations or visual aids describing the adaptive behavior.
Please refer to the weaknesses. |
Moderately AI-edited |
|
ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
To address the inefficiency of Large Reasoning Models (LRMs) caused by verbose Chain-of-Thought (CoT) generation (e.g., redundant tokens and self-checks), this paper proposes ConciseHint, an in-reasoning intervention framework. Unlike existing methods that only intervene before reasoning (e.g., input prompting, SFT/RL), ConciseHint continuously injects learnable hints (manually designed text or data-learned embeddings) during the reasoning generation process. It adaptively adjusts hint intensity based on reasoning length to avoid over-intervention on complex queries, and dynamically shifts injection positions (from head to tail, capped at \(\tau_{k} \cdot 0.8\)) to balance accuracy and computational cost. An enhanced variant, ConciseHint-T, further optimizes hints via supervised fine-tuning on concise data, enabling controllable reasoning length through embedding interpolation. Experiments on GSM8K, AIME24, and GPQA-Diamond with models like Qwen3 series and DeepSeek-R1 show that ConciseHint reduces tokens by 27%-65% while maintaining accuracy, and seamlessly integrates with existing methods (e.g., BeConcise, Deer) to further boost efficiency.
Novel In-Reasoning Intervention Paradigm: Breaks the limitation of "pre-reasoning intervention" in existing works, directly guiding conciseness during token generation—opening a new direction for efficient LRMs.
Adaptive and Dynamic Mechanisms: Designs complexity-aware hint intensity (adapting to query difficulty via reasoning length) and dynamic injection positions, ensuring accuracy while maximizing efficiency.
Flexible and Controllable Hint Design: Supports both training-free manual hints and data-learned hints (ConciseHint-T), with interpolation-based controllability to balance token reduction and performance.
Strong Compatibility: Serves as a plug-and-play plugin that integrates seamlessly with existing efficient methods, pushing the upper bound of reasoning efficiency without modifying the base model.
The largest model tested is 14B (DeepSeek-R1-14B)—no validation on ultra-large LRMs (70B+, e.g., Qwen3-72B, GPT-4o) where CoT verbosity and computational costs are more severe. Larger models often have more stable reasoning chains; it is unclear if ConciseHint’s intervention is redundant or still effective here.
Lack of Redundancy Targeting and Parameter Sensitivity:
- Unquantified Redundancy Suppression: The paper claims ConciseHint reduces "redundant tokens and self-checks" but provides no breakdown of which specific redundancies are eliminated (e.g., transition words like "wait", repetitive premise restatements, logically superfluous steps). Without this analysis, readers cannot confirm if the framework targets meaningful redundancy (vs. accidental suppression of critical logic).
- Unvalidated Adaptive Parameters: The default values for \(\alpha=128\) (base interval) and \(\beta=0.2\) (adaptive coefficient) are provided without parameter sensitivity analysis. It is unknown how these values perform across tasks (e.g., AIME24’s long reasoning vs. GSM8K’s short steps) or if optimal parameters exist for different scenarios, hindering practical adoption.
- Hint Content Impact Ignored: The paper tests only one manual hint ("make answer concise!") and a single training dataset (MixChain-Z-GSM8K) for ConciseHint-T.
Lack of Error Analysis and Edge Case Robustness:
- Unanalyzed Accuracy Drops: The paper notes minor accuracy drops (e.g., Qwen3-1.7B on GSM8K: 90.87% → 88.01% for \(\gamma=1.0\)) but provides no analysis of why these drops occur.
see Weaknesses |
Heavily AI-edited |
|
ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors propose ConciseHint to tackle the inefficiency of reasoning models. This is a framework that injects "concise hints" (like "make answer concise!") during the reasoning process, rather than only prompting before it. The method's key features are (1) complexity-adaptive intensity, which automatically adjusts how often hints are injected, and (2) dynamic injection position, which adjusts where the hint is injected in the text. A trained version, ConciseHint-T, learns hint embeddings from data to further improve efficiency. Experiments show ConciseHint significantly reduces token usage (e.g., ~49% on GSM8K) while maintaining accuracy.
- The "in-reasoning intervention" paradigm is new and interesting; it is intelligently designed to avoid hurting performance.
- The method is flexible and can be integrated with other existing efficiency methods, and it can also be controlled either in a training-free or a trained manner.
- Experimental results show that the method works effectively across multiple state-of-the-art models (Qwen3 series, DeepSeek-R1) and challenging benchmarks.
- The core assumption relies on the idea that the current reasoning length is a good proxy for query complexity. This largely depends on specific models, as a model can be verbose on an easy problem or concise on a hard one.
- The evaluation methodology is weak:
- The paper is missing comparisons to other efficient reasoning methods like AlphaOne, AdaptThink, O1-pruner and Autol2s.
- Missing multiple runs and pass@1: For small, complex benchmarks like AIME24 (only 30 problems), reporting "accuracy" from a single run or small average is not statistically sound, especially when using sampling (temp 0.6).
- The trained version, ConciseHint-T, is trained only on the GSM8K (math) dataset. The paper's claim that it "generalizes well" to completely different domains like GPQA has limited evidence.
- The hyperparameters are sensitive and need careful choice. The paper claims $\alpha$ and $\beta$ work well when fixed, but the appendix shows that poor choices can "significantly undermine accuracy."
[1] AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
[2] Adaptthink: Reasoning models can learn when to think
[3] O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning
[4] Autol2s: Auto long-short reasoning for efficient large language models
- Could fine-tuning the model (even with a lightweight method like LoRA) on the same concise dataset also work?
- What is the end-to-end latency impact of the periodic KV-cache invalidation?
- Are there any other proxies for complexity besides current length? |
Fully human-written |
|
ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper focuses on the reasoning efficiency problem of large reasoning models (LRMs) with CoT, which tend to generate verbose intermediate reasoning steps. To improve efficiency, the paper proposes ConciseHint, a method that intervenes during the generation of reasoning to make the reasoning process concise. ConciseHint continuously injects hints (either a designed text or continuous embeddings) to control subsequent token generation. ConciseHint also adaptively adjusts the injection intensity according to the complexity of the query, balancing efficiency and accuracy by applying a lower hint intensity for complex queries and a higher intensity for easy ones.
- The approach of inserting a short, instructive hint (e.g., “make answer concise”) into the model’s reasoning process is simple and straightforward.
- The strategy for adjusting the hint injection intervals and positions is intuitive and well-motivated.
- The paper is clearly written and logically structured.
- Limited evaluation. The experiments are run on only three datasets, and two of them (AIME24 and GPQA-Diamond) are quite small. It would be helpful to test the method on more datasets from different domains.
- Performance drop. While ConciseHint successfully reduces the number of generated tokens, it also causes a clear drop in accuracy, especially for ConciseHint-T.
- Narrow analysis. The evaluation mainly looks at accuracy and token count. It would be valuable to also assess the quality of the reasoning steps. It’s unclear how the injected hints change the reasoning path, whether they improve clarity, oversimplify the reasoning, or disrupt its flow. A more detailed quality or behavioral analysis would make the work stronger.
- Limited novelty. Although the method is simple and practical, its scientific novelty is modest. The hint scheduling and injection mechanisms are relatively intuitive and heuristic, and the paper does not analyze why or how the hints affect the model’s reasoning process. Exploring how sensitive the model is to different hint designs or injection positions would add important insights.
The paper does not discuss the generalizability of the designed hint. Would the same hint work effectively across other reasoning domains, or would it require task-specific tuning? |
Moderately AI-edited |
|
Einstein Fields: A Neural Perspective To Computational General Relativity |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents a novel neural representation method for compressing general relativity simulations.
- Outstanding efficiency and accuracy are shown when representing the symmetry of the simulations.
- This paper is especially well written and well presented.
- The problems addressed by the new tool are of interest to a wide community.
- I am not entirely sure how interesting this paper will be for the readership of ICLR, of whom so few are well versed in this area.
See above. |
Fully human-written |
|
Einstein Fields: A Neural Perspective To Computational General Relativity |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Einstein Fields (EinFields), a neural field approach to compress and represent 4D spacetime metrics from general relativity simulations. The method achieves up to 4,000× compression of metric tensor fields with 7-9 decimal digit accuracy (1E-7 to 1E-9 relative precision) while providing discretization-free continuous representations that can be trained on arbitrary point samples and queried at arbitrary resolutions. A key contribution is improved differentiation accuracy through automatic differentiation (AD), computing Christoffel symbols, Riemann tensors, and other derived quantities with up to 10^5 better accuracy than finite difference methods in FLOAT32. The approach parametrizes the metric distortion (deviation from flat space) using MLPs and employs Sobolev supervision. The validation focuses on three analytical solutions to Einstein's field equations: Schwarzschild, Kerr, and linearized gravitational waves, successfully reconstructing key relativistic phenomena.
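Because the claimed advantage rests on differentiating the learned metric with automatic differentiation rather than finite differences, a minimal JAX sketch of that step may help readers outside numerical relativity. The metric function below is a stand-in for the trained EinFields MLP (here simply flat Minkowski space, so all symbols vanish); the helper name and index conventions are assumptions, not the paper's code.

```python
import jax
import jax.numpy as jnp

def christoffel(metric_fn, x):
    """Christoffel symbols Gamma^l_{mn} of a metric g_{ab}(x) via forward-mode AD.
    metric_fn maps 4 spacetime coordinates to a (4, 4) metric tensor."""
    g = metric_fn(x)
    g_inv = jnp.linalg.inv(g)
    dg = jax.jacfwd(metric_fn)(x)  # dg[a, b, c] = d g_{ab} / d x^c
    # Gamma^l_{mn} = 1/2 g^{ls} (d_m g_{sn} + d_n g_{sm} - d_s g_{mn})
    return 0.5 * (jnp.einsum('ls,snm->lmn', g_inv, dg)
                  + jnp.einsum('ls,smn->lmn', g_inv, dg)
                  - jnp.einsum('ls,mns->lmn', g_inv, dg))

# Stand-in for a trained neural metric: flat Minkowski space, so all symbols are zero.
minkowski = lambda x: jnp.diag(jnp.array([-1.0, 1.0, 1.0, 1.0]))
assert jnp.allclose(christoffel(minkowski, jnp.zeros(4)), 0.0)
```

Riemann and Ricci tensors would follow by differentiating through this function once more, which is presumably where the reported FLOAT32 accuracy gap over finite differences originates.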
## Strengths
**Originality**: This paper presents a novel application of neural fields to general relativity, introducing the first implicit neural representation for tensor-valued spacetime geometries. The approach creatively adapts neural field techniques from computer vision to computational physics, with several original contributions.
**Quality**: The paper demonstrates strong technical rigor with comprehensive validation across multiple canonical GR test cases (Schwarzschild, Kerr, gravitational waves). The evaluation methodology is sound. Ablation studies (Table 3) properly isolate the contributions of different design choices. The authors are transparent about limitations.
**Clarity**: The paper is well-written and accessible. The background section (Section 2) effectively introduces both GR concepts and neural fields. Figure 1 provides an excellent conceptual overview of the pipeline. The mathematical notation is consistent and properly defined (though dense in places).
**Significance**: This work addresses genuine computational bottlenecks in numerical relativity—storage (petabytes per simulation) and accurate tensor differentiation. The 4,000× compression factor and 10^5 improvement in derivative accuracy (FLOAT32) represent substantial practical gains.
## Weaknesses
**Limited experimental scope**: The validation is restricted to three analytical solutions to Einstein's field equations (Schwarzschild, Kerr, linearized gravitational waves). While these are canonical test cases, they represent idealized scenarios far simpler than realistic numerical relativity (NR) simulations.
**Limited contextualization within scientific computing:** While the introduction mentions neural fields and ML for scientific computing, it lacks: (1) discussion of prior ML work specifically targeting numerical relativity or gravitational physics, (2) comparison with traditional compression methods used in scientific computing, and (3) detailed positioning relative to neural operators and PINNs. A dedicated related work section would help readers better understand the landscape and the paper's specific contributions.
**Missing Error Quantification:** Tables 1-3 report single-valued metrics without error bars or confidence intervals. Table 1 mentions selecting 'the model with the lowest MAE,' suggesting multiple runs were performed but statistics are not reported.
### Minor issues:
Page 10, line 490: "supplimentary" → "supplementary"
Figure 4 a caption: "Perihilion precession" → "Perihelion precession"
## Questions
* Actual NR simulation data: Your validation strategy using analytical solutions (Schwarzschild, Kerr, linearized GW) with known ground truth is appropriate for demonstrating the method's capabilities. As a natural next step, have you tested EinFields on any actual numerical relativity simulation outputs, even at small scale? What additional challenges arise with real NR data?
* Table 1 mentions selecting "the model with the lowest MAE" - how many training runs were performed? Can you report mean ± standard deviation over multiple random seeds for the key results in Tables 1-3? This is important for assessing reproducibility and typical vs. best-case performance.
* Parameter generalization: Do you train a separate network for each physical configuration (M, a, etc.), or can one network generalize across parameter ranges? |
Fully AI-generated |
|
Einstein Fields: A Neural Perspective To Computational General Relativity |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents a neural tensor‑field (tensors in the General Relativity sense) parametrization of GR metrics with a JAX implementation for differential geometry operators wired through automatic differentiation. The demos (Schwarzschild/Kerr orbits, Weyl scalars, ring deformation under a linearized GW) look correct and generate polished images, and the ablations around Jacobian/Hessian supervision are good.
That said, I have major concerns with the claims and evaluations. I believe these are not aligned with numerical relativity (NR) practice, and so this is not yet at the point of being useful for actual science. Thus, the current narrative is a bit misleading. The headline is "compressing 4D NR simulations by 1000-4000x with better derivatives than FD in FLOAT32," but most experiments are static analytic 3D cases (t = 0) plus one simple time‑varying GW; storage comparisons are made against a dense "explicit grid" strawman (modern NR code uses AMR or pseudo-spectral methods); baselines omit spectral/ROM methods; coordinate chart sensitivity is large; and long‑horizon dynamics need FLOAT64 to avoid divergence. I think it's a promising direction for ML to help, but I think this paper needs to be honest about the current status of such an approach. Doing so would not only be better for the paper, but I think also benefit the authors in that it would point out to the ML community where more work is needed. Basically I would like to see the narrative of this paper modified to be honest about the practicality of this, and about the toy baselines, before I consider acceptance, especially as general ML audiences will not know how to evaluate this.
- I like the clear pipeline and library, and a JAX code for this seems useful for the NR community. The graph from metric to derived quantities is explicit and leverages forward‑mode Jacobians/Hessians with einsum operations which is nice. This is a useful contribution of reusable tooling for GR in ML, and I think it is a good contribution by itself.
- I liked the ablation accounting done in 4.2; it is nice and, I think, quite useful to see the effect of every modification to the training process. It is interesting to see that Jacobian/Hessian supervision indeed helps.
- Neat canonical tests: precession, circular/eccentric Kerr orbits, ring response under a linearized GW, and Weyl/Kretschmann diagnostics are shown and largely match analytics on short horizons.
- Multi‑chart training attempt: training/evaluating in multiple coordinate charts acknowledges a real pain point in GR ML (though this is near the end of the appendix, I think it should come earlier)
First, my main concerns:
1. First, I think the scope and storage comparisons are misaligned with NR. The paper advertises 4D compression of NR simulations, but the primary experiments are analytic snapshots at t = 0 (Schwarzschild/Kerr); only the linearized‑gravity toy has time evolution. The compression factors compare MLP weights to explicit dense grids counted as "#points x 4 bytes" in FLOAT32, which is not how NR codes actually store data (they would use adaptive mesh refinement stored with an octree). So the 1000–4000x headline is basically comparing against a strawman and misleading to the ML community about the state of this domain.
2. The paper itself notes that modern NR "increasingly opts for (pseudo‑)spectral methods ... up to 1000–5000x faster on CPUs than FD on GPUs at comparable accuracy." But all quantitative baselines in this paper are finite difference stencils (on a uniform grid - but the only actual finite difference codes used in NR are based on adaptive meshes) and an analytic AD. There's no head‑to‑head vs actual NR codes used in GW modeling. This makes it hard to situate their method, even for someone who knows NR, let alone the ML community.
3. The paper claims up to five orders of magnitude derivative gains over FD in FLOAT32, but geodesic integration requires FLOAT64 and long‑time rollouts still diverge (the authors show this in their own figures and explicitly state that they only surpass FD in single precision). This puts the method far from NR‑readiness, where double precision (or even higher) is standard.
4. There is large coordinate‑chart sensitivity which is a bit worrisome. Table 8 (deep in the appendix) reports up to three orders of magnitude variation in "Rel‑L2" error across charts for the same spacetime. That undermines generality claims unless the representation or training explicitly handles diffeomorphisms or evaluates with chart‑invariant metrics.
5. The physics is not enforced or audited. The pipeline mentions Bianchi identities, but experiments focus on pointwise tensor errors, scalar invariants, and geodesic tests. While the pointwise tensor errors are no doubt useful in clarifying (to a NR person) that these methods aren't yet ready, there's no reporting of physics checks, like conservation laws, etc., which are exactly the diagnostics one needs to trust a compressed metric in downstream NR workflows.
6. I am a bit confused about the "discretization‑free" claims, since it seems the method is ultimately trained on a grid. Several places describe training on 4D spacetime grids or "4D training and validation grid data," which undercuts the claim of being discretization‑free. Even an INR is ultimately a finite parametrization/basis.
7. The throughput trade‑off is not discussed. Even if the file is tiny, post‑processing requires many MLP queries to reconstruct fields, whereas decompressing spectral grids is reading coefficients + evaluating polynomials. You still likely win on storage, but the compute/runtime trade‑off for analysis & viz should be stated.
Second, some other suggestions/comments:
- I make the following comment purely to help the authors improve their work, and do not include this comment as part of my score for the paper, so feel free to not address this in your rebuttal. It is simply a suggestion/idea. So I think the branding of the method as "Einstein Fields" is not an optimal choice, because it would likely conflict with "Einstein Field Equations" in any search. Also, to the physicist, who I assume you would like to include in the audience of the paper, it does not give them any idea that this is related to machine learning.
- Edit: What about calling it "Neural Einstein Fields"?
- Obviously GR is quite complicated to someone with no background. I am not sure it is possible to give it much of an introduction here, and I worry the current introduction might give the wrong ideas. I think it is better to simply direct the reader to an online resource, rather than give an inevitably incomplete description of the mathematics in the appendix. Try to think about what purpose it serves: (1) for those who don't know GR, this is not going to be nearly enough to introduce them even to the basics; (2) for those who do know GR, they will not need this anyways. So, why include it at all? Consider, for audience (1) I think you need to simply target the intuition for each variable you model. That is the fundamentally useful thing to write about. And for audience (2) (curious physicists, maybe) I think you simply need to _translate_ GR concepts to machine learning for them. So, when you consider these two audiences, the appendix seems to serve little purpose. I recommend trying the split approach above: focus on intuition of the key target variables for the non-physicists (and point them to other resources), and focus on translation of the machine learning concepts for the physicists. This would be much more effective in my view.
- Also, I did not check through all the math in the appendix. Thus, there could be errors.
- It might be worth calling a GR tensor exactly that: a "GR tensor," to differentiate from the ML meaning.
- Use proper scientific notation instead of "5.37E‑6" in tables.
- Even if the file is small, you still must run tens of thousands of MLP queries for post‑analysis/viz; be transparent about that runtime trade‑off versus reading spectral coefficients.
- Figure 1 is noisy. Consider simplifying or splitting. (The caption itself also states training on a 4D spacetime grid, which conflicts with "discretization‑free" messaging.)
- Please clarify the precise data format used for training ("4D spacetime grid" vs "arbitrary samples"), since the paper simultaneously describes the approach as discretization-free yet refers to training on regular grids.
- Could you provide any comparison, even small-scale, against a spectral or AMR baseline (ideally an actual code used by the GR community) to contextualize the claimed compression?
- Please explain how derivative accuracy scales in FLOAT64 and whether the observed long-time geodesic divergence persists under higher precision. |
Lightly AI-edited |
|
PRISM: Controllable Diffusion for Compound Image Restoration with Scientific Fidelity |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces PRISM, a controllable diffusion framework for compound image restoration, primarily targeting scientific applications. The method involves a two-stage process: first, fine-tuning a CLIP image encoder using a novel weighted contrastive loss to create a compositional embedding space for degradations, and second, training a latent diffusion model conditioned on these embeddings and user prompts to perform selective restoration.
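As a point of reference for the weighted contrastive loss mentioned above, the sketch below shows one plausible way a Jaccard-distance weighting could enter an InfoNCE-style image-text objective: negatives whose degradation label sets overlap with the anchor's are repelled less strongly. This is a hypothetical reading for illustration only (function names, weighting scheme, and temperature are assumptions), not the loss defined in the paper; the unweighted variant discussed in the weaknesses below would correspond to setting all weights to 1.

```python
import torch
import torch.nn.functional as F

def jaccard_distance(a: set, b: set) -> float:
    # 1 - |A ∩ B| / |A ∪ B|: 0 for identical degradation sets, 1 for disjoint ones.
    return 1.0 - len(a & b) / max(len(a | b), 1)

def jaccard_weighted_contrastive(img_emb, txt_emb, degradation_sets, tau=0.07):
    """InfoNCE over matched image/text-anchor pairs in which each negative term of
    the denominator is scaled by the Jaccard distance between the two samples'
    degradation sets. img_emb, txt_emb: (B, D), L2-normalised; degradation_sets:
    list of B sets of degradation labels."""
    logits = img_emb @ txt_emb.t() / tau
    B = logits.size(0)
    w = torch.ones(B, B, device=logits.device)
    for i in range(B):
        for j in range(B):
            if i != j:
                w[i, j] = jaccard_distance(degradation_sets[i], degradation_sets[j])
    # Scaling exp(logits) by w is equivalent to adding log(w) to the logits.
    weighted = logits + torch.log(w.clamp(min=1e-6))
    return F.cross_entropy(weighted, torch.arange(B, device=logits.device))
```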
The work compellingly argues for the necessity of controllable, selective restoration over automated 'full' restoration for scientific applications, demonstrating significant gains in downstream task utility.
The primary methodological concern is the limited novelty. The core idea of fine-tuning a CLIP encoder to be degradation-aware heavily relies on prior work (e.g., DA-CLIP). The main novelty appears to rest on the Jaccard distance weighting in the contrastive loss, but the paper lacks a direct ablation comparing this to an unweighted compound contrastive loss, making it difficult to isolate its true impact.
Second, the two-stage training pipeline is computationally complex, and the choice of a diffusion backbone introduces significant inference latency. This efficiency trade-off is not sufficiently justified, especially as the performance gains over other recent diffusion methods are notable but not transformative.
Finally, the framework's generalization to real-world, unseen degradations is questionable. The model is trained exclusively on synthetic composites, and it is unclear if the model is truly learning compositional physics or rather a powerful interpolation across its massive synthetic training domain when faced with the complex, non-linear physics of real-world degradations.
What is the inference time of PRISM compared to the baselines (e.g., MPerceiver, AutoDIR, and the non-diffusion NAFNet), and how do the authors justify this computational cost for the observed performance gain?
Given the reliance on synthetic data, how can we be sure the model is learning robust compositional reasoning for real-world physics rather than a complex interpolation? |
Moderately AI-edited |
|
PRISM: Controllable Diffusion for Compound Image Restoration with Scientific Fidelity |
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors argue that the current unified processing pipeline adopted for image restoration in the scientific field may introduce redundant restoration types, which could counterproductively lead to the loss of authentic information. Therefore, the authors aim to provide a controllable mechanism, Controllable Diffusion, to enable experts to select specific restoration types to avoid the aforementioned dilemma.
A. The authors present a novel perspective in addressing the issue.
B. The amount of work undertaken is solid.
>- They constructed a brand-new dataset and employed the latest relevant methods (mostly works from 2024) to demonstrate the superiority of the proposed approach.
I think subjective assumptions and the unconvincing problem-solving approach are the most prominent issues.
A. Subjective assumptions
>- A.1 Take underwater image restoration, as mentioned in the article, as an example. The authors identify three types of distortions: low light, scattering, and wavelength dependency. Since these distortions act on the same image, any single restoration method may inadvertently damage useful information. Therefore, the core of fidelity preservation should lie in the precision of each restoration. However, the authors' equating of precision with controllability over the quantity and types of restoration is an unconvincing subjective assumption.
>- A.2 It is possible that the most appropriate combination of restoration types has already been implicitly learned by the model through training. However, the authors have neither experimentally nor theoretically refuted the aforementioned viewpoint nor provided substantive support for their own hypothesis.
>- A.3 The authors arbitrarily set the number of distortion types to 3, which is overly simplistic and lacks justification. This approach may fail to flexibly handle complex scenarios.
B. Unconvincing problem-solving approach
>- B.1 The core idea of artificial intelligence is to employ machine intelligence to replace human thinking. When encountering problems, the focus should be on solving them directly, rather than reintroducing human labor to empower machines.
>- B.2 Although the experimental results show some improvement, these enhancements are not significant. Moreover, there is a lack of analysis regarding the extent to which such improvements stem from the investment in human labor and whether they are cost-effective. Additionally, the ablation experiments fail to fully elucidate the impact of the distortion-type settings and their dependence on accurate expert guidance.
>- B.3 Assuming your hypothesis is correct, why not just solve the problem in a software-engineering way? Let a single model specialize in removing a single type of noise, and design control codes that allow experts to manually execute several actions to process the images. What, then, is the point of incorporating NLP models to process the experts' opinions and coupling the problem together to train a model?
The same as the above analysis in Weaknesses. |
Lightly AI-edited |
|
PRISM: Controllable Diffusion for Compound Image Restoration with Scientific Fidelity |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces PRISM (Precision Restoration with Interpretable Separation of Mixtures), a prompted conditional diffusion framework for expert-in-the-loop controllable restoration of images with compound degradations. PRISM integrates two key components: (1) compound-aware supervision trained on mixtures of distortions, and (2) a weighted contrastive disentanglement objective that aligns compound distortions with their constituent primitives. This approach enables high-fidelity joint restoration.
1. The paper is well-structured and the motivation is clear.
2. The framework leverages compound-aware supervision and a contrastive disentanglement objective across a diverse set of primitive tasks. This produces separable embeddings of distortions, enabling robust restoration, even for unseen real-world mixtures.
3. This work proposes a novel benchmark for scientific utility, spanning remote sensing, ecology, biomedical, and urban domains.
1. The method is devised on the basis of CLIP and a latent diffusion model. Moreover, the Semantic Content Preservation Module is also relatively simple.
2. No physical information is incorporated into the training process. In other words, it is a general method that is simply applied to scientific domains.
3. The dataset is constructed by integrating existing datasets, which is not very solid.
1. Collecting a dataset from existing datasets is not very convincing. Have the authors collected any data?
2. It is a general method that is applied as a unified model for scientific and environmental images. More domain-specific priors, such as physical information, should be considered.
3. The text prompts are also too short to benefit these domains. |
Fully human-written |
|
PRISM: Controllable Diffusion for Compound Image Restoration with Scientific Fidelity |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a framework for controllable restoration of images that underwent multiple primitive distortions. All degradations are restored at once, rather than through consecutive restorations that may introduce artifacts, while still allowing the user to define which degradations to restore in order to preserve necessary signals.
Training uses a restoration benchmark, which is made public, consisting of tuples of a clean image, a degraded image, and a natural-language prompt describing the degradation. Given a prompt, the framework uses a frozen CLIP text encoder for multi-label classification over a set of primitive degradations, which are then formatted into a predefined prompt form. This prompt is then used to restore the image with a fine-tuned version of SD 1.5, whose CLIP image encoder was fine-tuned to cluster embeddings by degradation, followed by an additional model trained to preserve semantic content.
The overall framework is evaluated on a benchmark with images that underwent distortions like those in the training process (up to 3 primitive distortions), as well as on unseen distortions. Furthermore, its usage for four downstream tasks is evaluated, showing scientific utility.
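Since the pipeline description above is dense, a compressed pseudocode view of the inference path may help; every callable here is an illustrative stand-in rather than the authors' API, and the prompt template is an assumption.

```python
def prism_restore(image, user_prompt, clip_text_encoder, label_classifier,
                  sd15_restorer, scpm):
    """Hypothetical inference path: free-form prompt -> multi-label set of primitive
    degradations -> templated prompt -> conditional diffusion restoration with a
    fine-tuned SD 1.5 -> semantic-content-preserving refinement."""
    text_features = clip_text_encoder(user_prompt)
    labels = label_classifier(text_features)          # e.g. {"low light", "scattering"}
    prompt = "remove " + ", ".join(sorted(labels))    # predefined prompt form (assumed)
    restored = sd15_restorer(image, prompt)           # diffusion backbone, degradation-aware
    return scpm(restored, image)                      # additional model preserving content
```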
* Image restoration is a critical task, particularly for scientific applications. This paper demonstrates the method's effectiveness through general purpose image restoration, evaluated using fidelity and perceptual metrics, and its application for downstream scientific tasks.
* The motivation is written clearly, and the figures (although 1 and 2 are not referenced) support the understanding of the general approach.
* Although the number of consecutive distortions in the training set is limited to 3, section 4.2 shows that the method also achieves good results for unseen degradations that are not necessarily a combination of the degradations in the training set, or that combine more than 3 primitive distortions.
* While the fine-tuning of the CLIP image encoder is explained thoroughly, the subsequent steps of how SD 1.5 is used as the backbone and the proposed SCPM module are explained only briefly. This impairs the understanding of the entire framework, and while the code is submitted, the text itself is not sufficient to reproduce it.
* The concept of automatic restoration needs clarification. While the paragraph on prompting (line 207) describes the automatic transformation of natural prompts to fit the required format, the method for generating these automatic prompts remains unclear. Possibly related, it is unclear which prompts were used in the experiments in Sec. 4.1.
* The motivation of controlled restoration for expert-in-the-loop scenarios could be further evaluated by comparing images that underwent multiple distortions but where PRISM is prompted to restore only a partial set (as done synthetically in Sec. 4.3.1), in settings where the control over the restorations calls for different restorations for specific images / use cases rather than a predefined subset for all images in the same domain.
**Minor:**
* Indices are not explained and are somewhat confusing. It seems $d_{i_j}$ in Eq. 1 denotes a specific distortion and $d^i$ in row 180 denotes a set of distortions, yet the notations are explained only after being used ($d_j$ is explained in line 200).
* Using the Jaccard distance between degradation sets neglects the fact that some distortions are more similar than others.
* The paper should mention how the prompts in the dataset are auto-generated (line 157).
* In addition to PSNR reported in Fig 3, what was the effect on other discussed metrics?
* What value is used for the number of variants $m$? And what is the minibatch size? If these values are not similar, there might be added value in weighting the two terms in the denominator of the per-variant contrastive loss to control the effect of repelling from other degradations and that of repelling from other images.
* Is there a difference between the dataset described in Sec.3.1 and the benchmark described in 3.2 (MDB)? If so, what is included in MDB?
* In Sec. 4.2, did PRISM identify the same set of primitive distortions for different images from each domain where images probably went through similar degradation?
* Which images were used to create the visualization of the scatter plot in Fig. 5? |
Fully human-written |
|
PRISM: Controllable Diffusion for Compound Image Restoration with Scientific Fidelity |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper presents PRISM, a prompted, controllable diffusion framework for restoring images suffering from compound degradations. The training setup uses mixtures of up to three distortions and introduces a weighted contrastive disentanglement stage to make embeddings separable and compositional. Inference accepts free-form prompts that are mapped to a fixed set of restoration labels; a latent-diffusion backbone performs restoration and a Semantic Content Preservation Module (SCPM) refines fine detail. Experiments cover a new Mixed Degradations Benchmark (MDB), zero-shot evaluation on real datasets (UIEB, under-display camera, and fluid lensing), and downstream tasks across remote sensing, ecology, microscopy, and urban scenes.
1) Clear objective and method design. The paper argues for simultaneous rather than sequential restoration, emphasizes expert control, and focuses on scientific fidelity rather than aesthetics. The architecture coherently combines contrastive disentanglement, prompt-conditioned latent diffusion, and SCPM for detail recovery.
2) Good reported performance and breadth. On MDB, PRISM outperforms representative all-in-one and diffusion/composite baselines (e.g., AirNet, Restormer, NAFNet, PromptIR, OneRestore, DiffPlugin, MPerceiver, AutoDIR) on PSNR/SSIM and perceptual metrics.
3) Generalization beyond the synthetic training setup. The paper reports zero-shot gains on real distortions (underwater, under-display camera, and fluid distortions) and shows that performance scales well as the number of simultaneous degradations increases.
4) Practical value for scientific analysis. Selective, prompt-guided restoration improves downstream tasks in several domains, supporting the utility of controllability.
1) Control granularity and evaluation scope:
The evaluation largely uses manual prompting with a pre-defined set of distortion types, not open-ended language or fine-grained controls. The paper itself notes that extending controllability beyond “which distortions to remove” to specifying intensity and spatial extent is left for future work. This leaves unanswered how robust the system is to realistic prompt variations or local/severity-aware edits.
2) Synthetic-to-real gap and capped composition complexity:
Training relies on synthetic mixtures and explicitly caps each sample at up to three distortions for efficiency and interpretability. The authors acknowledge that these synthetic augmentations cannot fully capture real distortions. While results on real datasets are encouraging, this cap and reliance on synthetic compositing may underrepresent harder real-world compound effects.
3) Efficiency and deployability are not quantified in the main text:
The paper does not provide main-text wall-clock, throughput, or memory comparisons versus strong diffusion baselines. Without standardized timing/FLOP/peak-memory profiles at a fixed resolution, it is difficult to assess practical deployability or the overhead of the added control and SCPM modules.
. |
Fully AI-generated |
|
Rethinking Scale: How Multi-Agent Collaboration Enables Smaller Models to Rival GPT-4 in Video Understanding |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces RIVAL, a *training-free* agentic framework for long-video question answering that aims to “rethink scale”: instead of relying on very large proprietary models, it orchestrates smaller open LLMs via two modules:
-Multi-Stage ReAct Planner (MSRP): enforces explicit OBSERVE → THINK → ACT stages, produces a structured tool-usage plan (Stop Searching; Delete/Add by Frame ID; Add by Text via CLIP), and stops when a score threshold or max steps is reached. This reduces reasoning/action drift and keeps context within ~15k tokens (a rough control-flow sketch is given after this summary).
-Multi-Agent Debate Refinement (MADR): after an initial answer, *affirmative* and *opposition* agents debate with limited tool calls; a judge either declares agreement, a winner, or halts at max rounds.
On EgoSchema, RIVAL with Qwen-2.5-72B/3-32B reports 66.8/65.0 on the subset and 56.4/57.2 on the full set, surpassing GPT-4–based baselines in their table. On NExT-QA, it reaches 74.4/73.2 on val and 66.5/63.7 on ATP-Hard.
Under an ≈28-hour concatenated-video stress test, RIVAL degrades less than VideoAgent (e.g., on a 1.5B model: 33.8 vs 23.4).
Compared to VideoAgent (an LLM-tool agent with proprietary backends) and LLoVi (dense captioning + LLM reasoning), RIVAL argues for better *privacy* (local open models) and *resource* efficiency; against streaming VLMs it offers a systems alternative grounded in retrieval + agent debate.
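For concreteness, here is a paraphrased, heavily simplified control-flow sketch of the MSRP loop as read from the description above; `observe`, `think`, `act`, and `score` are stand-ins for the LLM and tool calls (captioning the currently selected frames, drafting a plan and answer, executing add/delete-by-ID or CLIP text retrieval, and self-scoring), and the threshold name `alpha` mirrors the α=5 gate mentioned later. None of these identifiers come from the authors' code.

```python
from dataclasses import dataclass, field

@dataclass
class PlannerState:
    frames: set = field(default_factory=set)   # currently selected frame IDs
    notes: list = field(default_factory=list)  # textual observations gathered so far

def msrp_loop(question, observe, think, act, score, max_steps=6, alpha=5):
    """Staged OBSERVE -> THINK -> ACT loop with a confidence-gated stop."""
    state, answer = PlannerState(), None
    for _ in range(max_steps):
        state.notes.append(observe(state.frames))      # OBSERVE: caption current evidence
        plan, answer = think(question, state.notes)    # THINK: draft a plan and an answer
        if plan == "stop" or score(question, answer) >= alpha:
            break                                      # Stop Searching: confident enough
        state.frames = act(plan, state.frames)         # ACT: add/delete by ID, add by CLIP text
    return answer
```

MADR would then take `answer` as the starting position for the affirmative/opposition agents and the judge.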
+Competitive long-video QA with open models under a 15k token budget and modest GPUs.
+Enforced OBSERVE/THINK/ACT stages + fixed tools and stop criteria reduce “free-form” LLM drift and make loops auditable.
+MADR’s affirmative/opposition/judge structure with Frame-ID/Text queries is a neat twist on multi-agent debate for video evidence-seeking.
+Consistent gains over VideoAgent from 0.6B–72B (incl. large margins on EgoSchema subset) and solid NExT-QA results.
+Smaller degradation on the 28-hour test relative to VideoAgent supports the scalability story.
-Results would be more conclusive with head-to-head against *arbitrary-length* streaming VLMs (e.g., VideoStreaming, StreamingVLM) at comparable compute, not only GPT-4–centric agents.
-The end-to-end quality hinges on EVA-CLIP-8B+ retrieval and LaViLa/CogAgent captioners; failure modes (missed key frames, caption hallucination) are not deeply dissected.
-The evaluator’s 60/40 criteria and α=5 gate are plausible, but more human-agreement/calibration plots (e.g., ROC/AUC vs. correctness) and sensitivity to α would strengthen soundness.
-Hardware is stated, but per-query metrics (#tool calls, frames retrieved, tokens, wall-clock) and *compute-normalized* comparisons vs. VideoAgent/LLoVi are sparse.
-The method is validated on EgoSchema/NExT-QA; extensions to open-ended grounding, temporal localization, or instruction following would clarify generality.
-The paper reads reproducibly at the concept level, but full code/prompts would be needed for wider adoption; timelines aren’t specified.
1. Could you report accuracy vs. *seconds/tokens/tool-calls* per query (and per MADR round), and compare against VideoAgent/LLoVi at matched budgets?
2. How stable are results under different α thresholds or 60/40 weight splits? Any human-agreement stats (e.g., Kendall τ between evaluator scores and correctness)?
3. What is the impact of swapping EVA-CLIP-8B+ for a lighter/heavier CLIP, or changing Top-k and similarity thresholds? Fail-case taxonomy?
4. Cross-captioner results (LaViLa ↔ CogAgent) on both datasets, plus leakage checks for LaViLa (you mention overlap removal).
5. Can you add direct comparisons to VideoStreaming/StreamingVLM (or other 2025 long-context VLMs) to contextualize where RIVAL wins/loses?
6. Any early results on open-ended QA or temporal localization tasks to probe beyond multi-choice settings? |
Fully AI-generated |
|
Rethinking Scale: How Multi-Agent Collaboration Enables Smaller Models to Rival GPT-4 in Video Understanding |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes RIVAL, a video understanding framework built on small open-source LLMs (≤72B), aiming to rival GPT-4-level proprietary methods. RIVAL consists of (1) MSRP (Multi-stage ReAct Planner) for structured reasoning with explicit sub-states (OBSERVE → THINK → ACT) and tool-calling, and (2) MADR (Multi-Agent Debate Refinement) for adversarial multi-role answer refinement. The system retrieves key frames via CLIP and performs iterative information augmentation plus debate-based correction. Experiments on EgoSchema and Next-QA show strong results, surpassing GPT-4 baselines on subsets, and showing robustness on extremely long (28h) concatenated video.
1. Strong empirical results. RIVAL achieves substantial gains over prior GPT-4-based VideoAgent/LLoVi on EgoSchema subset (+6.6%) and competitive Next-QA performance.
2. Multi-agent debate refinement is effective and well-motivated. MADR empirically corrects initial errors and is demonstrated clearly with case study.
3. Very long video case study is interesting. Handling 28h concatenated input with minimal degradation is a good stress test.
1. Clarity & ablations missing. The current writing does not sufficiently quantify how much performance comes from CLIP retrieval, MSRP decomposition, and MADR debate individually. Ablations would greatly strengthen causal attribution.
2. Significant engineering heuristics. Many parts of MSRP are manually structured and rely on prompt templates / tool definitions — unclear robustness to domain shift or tasks not fitting stepwise logic.
3. Scalability beyond QA not validated. RIVAL is only evaluated on video QA benchmarks; unclear if this paradigm generalizes to open-ended summarization / event boundary detection / reasoning beyond MCQ.
4. Some baselines may not be strictly comparable. For Next-QA, several older entries are pre-CLIP/2024-era; more recent strong open models could be added for fairness.
1. Can the authors include ablations isolating (a) no MSRP, (b) no MADR, (c) no CLIP key-frame retrieval, to quantify contribution of each component?
2. Could the authors report results on free-form open-ended summarization tasks to illustrate generality beyond MCQ style QA?
3. Given that the video is often reduced to textual descriptions, does RIVAL degrade on videos with non-linguistically describable cues (e.g., spatial geometry, implicit physics)? |
Fully AI-generated |
|
Rethinking Scale: How Multi-Agent Collaboration Enables Smaller Models to Rival GPT-4 in Video Understanding |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes U-CSA—an unsupervised cross-modal semantic anchoring framework to match aerial imagery with vector maps. Instead of aligning image↔image, U-CSA first asks a multimodal LLM (Qwen2.5-VL) to produce structured “semantic anchors” (JSON over 11 attributes + a ≤40-word summary) for each image–map pair; then (i) a dual-branch visual encoder is trained with anchored contrastive learning against the text anchors; and (ii) an adversarial matching head with a prototype library refines the decision boundary. The authors also introduce MSTcons, a 18,907-pair benchmark built from WHU (Christchurch) and Inria (Austin, Chicago, Kitsap, Vienna, West Tyrol), with 256×256 tiles and explicit splits. On MSTcons, U-CSA beats unsupervised SAM-MCD and several adapted change-detection baselines in ROC-AUC/F1, with ablations supporting the contribution of anchors and prototypes.
Originality. A clean combination of staged ReAct planning with adversarial multi-agent debate targeted at long-video QA; explicit tool APIs (stop, add/delete by frame ID, CLIP text query) make the control flow concrete.  
Quality. Strong headline numbers on EgoSchema and Next-QA (incl. per-subset breakdowns) and a long-video stress test (28h). Comparisons include GPT-4/VideoAgent/LLoVi families and same-model re-implementations.   
Clarity. The pipeline and roles are well illustrated; termination conditions and planner/debate prompts are spelled out; implementation details (captioner/CLIP/serving) are given.   
Significance. If claims hold under controlled settings, the result that smaller open LMs with orchestration can match/beat prior GPT-4-based agents is practically meaningful for privacy/cost-sensitive deployments.
Potential option-conditioning bias. Frame retrieval uses both the question and the answer options (Ia = Top-k Sim(I, A)), which risks label-peeking and unfairly advantaging multiple-choice setups. Please add ablations that retrieve only from Q (no options), or retrieve before reading options, and report accuracy deltas.
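To make the requested comparison concrete, here is a minimal sketch of the two retrieval conditions (illustrative only: the array shapes, variable names, and random stand-ins for CLIP embeddings are my own assumptions, not the paper's code):

```python
import numpy as np

def topk_frames(frame_feats, text_feat, k=8):
    # Cosine similarity between L2-normalized frame and text embeddings, top-k indices.
    sims = frame_feats @ text_feat
    return np.argsort(-sims)[:k]

# Hypothetical pre-computed embeddings (random stand-ins, already L2-normalized).
rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(512, 256))
frame_feats /= np.linalg.norm(frame_feats, axis=1, keepdims=True)
q_feat = rng.normal(size=256); q_feat /= np.linalg.norm(q_feat)          # question only
qa_feat = rng.normal(size=256); qa_feat /= np.linalg.norm(qa_feat)       # question + answer options

frames_q_only = topk_frames(frame_feats, q_feat)    # retrieval conditioned on Q alone
frames_q_opts = topk_frames(frame_feats, qa_feat)   # retrieval conditioned on Q + options
overlap = len(set(frames_q_only.tolist()) & set(frames_q_opts.tolist())) / len(frames_q_only)
print(overlap)  # how much the evidence set changes when options are withheld
```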
Fair-comparison controls. Several baselines differ in LLM scale, context, and tools. While you re-implement VideoAgent with Qwen-2.5 (Appendix C), the tables still mix methods with non-comparable compute/budgets. Please provide same-LLM apples-to-apples runs (VideoAgent/LLoVi/RIVAL all on Qwen-3-32B with matched token budgets, retrieval limits, and tool calls) and report token & wall-clock costs.
Ablation depth on MSRP/MADR. The contribution attribution is under-specified. Add: MSRP→single-stage ReAct; MSRP without enforced state transitions; MADR→self-consistency / majority-vote; debate with no tools; and variance across #rounds/threshold α. Report accuracy, calls/round, tokens, and failures.
Reliance on captioner/CLIP & leakage audit. Results hinge on LaViLa/CogAgent (captioner) and EVA-CLIP-8B+ (retriever). You remove EgoSchema overlaps in LaViLa, which is good, but please add captioner/CLIP swaps and a leakage audit across both benchmarks (e.g., retrieve-then-blind the captioner to options; test different CLIP checkpoints).
No-options retrieval? What happens if frame retrieval is conditioned only on Q (not options), or performed before seeing the options? Please quantify.
Budget & efficiency. Can you report average #tool calls / debate rounds / prompt tokens / latency per question, and compare to VideoAgent/LLoVi under matched settings?
Ablation breadth. Could you add MSRP/MADR ablations described above and release prompt templates and judge criteria used for win/consensus decisions?
Captioner/CLIP sensitivity. How sensitive are results to swapping LaViLa↔CogAgent (cross-benchmark) and EVA-CLIP↔other CLIPs? Any drop-in if the captioner context is limited?
28-hour scenario. How do you partition the 28h stream internally (sliding windows? chunked debates)? Please report failure modes and variance for the long-video setting. |
Fully AI-generated |
|
Rethinking Scale: How Multi-Agent Collaboration Enables Smaller Models to Rival GPT-4 in Video Understanding |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes RIVAL, a training-free framework showing that *multi-agent collaboration* can enable smaller open-source LLMs (≤72B) to approach or surpass GPT-4–based systems on long-video understanding. RIVAL has two core components:
- Multi-Stage ReAct Planner (MSRP): decomposes reasoning into *OBSERVE → THINK → ACT* stages, with explicit state transitions and a fixed toolset (stop search, delete/add by frame ID, add by text via CLIP), iterating until a quality threshold is met.
- Multi-Agent Debate Refinement (MADR): after MSRP forms an initial answer, affirmative and opposing agents debate once per turn (with one tool call each), and a judge selects or revises the final answer; debate stops on agreement, a win, or max rounds.
On EgoSchema, RIVAL with Qwen-2.5-72B reaches 66.8% on the subset (SOTA in their comparison; +6.6 over GPT-4 baselines) and 56.4% on the full set. With Qwen-3-32B, it reaches 65.0/57.2 (subset/full). On NExT-QA, RIVAL attains 74.4% (72B) and 73.2% (32B) on validation and 66.5% / 63.7% on ATP-Hard, surpassing prior GPT-4–based agent methods in the reported comparisons. The system also processes a 28-hour concatenated long video under limited compute (≤15k token context; dual A100s for 72B), arguing for privacy-preserving, resource-constrained deployment.
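For reference, here is a minimal, self-contained control-flow sketch of how I read the MSRP → MADR loop; every function below is a stub standing in for an LLM or tool call, and the names, the evaluator threshold, and the round cap are my assumptions rather than the authors' implementation:

```python
from dataclasses import dataclass
import random

@dataclass
class Plan:
    action: str                              # "stop", "add", "delete", or "clip_query"

def observe(evidence):                       # OBSERVE: caption the current frame set (stub)
    return f"{len(evidence)} captioned frames"

def think(question, obs):                    # THINK: decide whether evidence suffices (stub)
    return Plan("stop" if random.random() < 0.5 else "clip_query")

def act(plan, evidence):                     # ACT: apply one tool call, e.g. add a frame (stub)
    return evidence + [f"frame_{len(evidence)}"]

def initial_answer(question, evidence):      # planner's answer plus self-evaluated score (stub)
    return "B", random.random()

def debate_round(answer, question, evidence):
    return f"support {answer}", "challenge"  # affirmative / opposing agent outputs (stub)

def judge(pro, con, answer):
    return answer, random.random() < 0.5     # (possibly revised answer, consensus reached?)

def rival_answer(question, frames, alpha=0.7, max_rounds=3):
    evidence = list(frames)
    for _ in range(10):                      # bounded MSRP iterations
        plan = think(question, observe(evidence))
        if plan.action == "stop":
            break
        evidence = act(plan, evidence)
    answer, score = initial_answer(question, evidence)
    if score >= alpha:                       # evaluator score high enough: skip debate
        return answer
    for _ in range(max_rounds):              # MADR: one debate round per turn, then judge
        pro, con = debate_round(answer, question, evidence)
        answer, agreed = judge(pro, con, answer)
        if agreed:
            break
    return answer

print(rival_answer("What is the person cooking?", ["frame_0", "frame_1"]))
```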
- Shows that *careful orchestration* (MSRP) plus *adversarial verification* (MADR) can reduce dependence on very large proprietary LLMs for long-video QA.
- Explicit stage transitions, fixed tool APIs, and stopping rules make the agent loop auditable and easier to reproduce conceptually.
- SOTA subset performance on EgoSchema (66.8% with 72B), and NExT-QA gains over VideoAgent/LLoVi in the reported tables.
- Maintains accuracy on a 28-hour concatenated video where a single-agent baseline degrades substantially.
- Operates within a 15k token window and on commodity accelerators (72B split over 2×A100), aligning with realistic deployment; privacy angle is well-motivated.
- Improves upon GPT-4–centric VideoAgent and text-only aggregation like LLoVi, while aligning with the trend toward streaming arbitrary-length video.
- Tables focus on GPT-4 baselines circa 2024; it would help to benchmark against the most recent proprietary/open VLMs that handle arbitrary-length streams (e.g., streaming VLLMs) to solidify the “rivals GPT-4” claim.
- The pipeline’s quality hinges on CLIP retrieval and the image/video captioners; retrieval bias or caption hallucinations could mislead the debate, and ablations on retrieval quality (e.g., different CLIP backbones, top-k) are limited in the main text.
- MSRP/MADR rely on an internal evaluator score (60/40 criteria) and a threshold to trigger debate; while there is some analysis, deeper calibration/robustness checks (e.g., agreement with human judgments, sensitivity to α) would strengthen soundness.
- Claims of efficiency would benefit from a cost breakdown: number of tool calls, frames read, average tokens per step, wall-clock latency vs. baselines. Current hardware details are provided, but *end-to-end* throughput comparisons are sparse.
- Source code is not yet released (pending security review); although pseudocode and prompts are promised, this limits verification and adoption pre-camera-ready.
- Results are strong on EgoSchema/NExT-QA; adding diverse long-video tasks (e.g., instruction following, temporal localization) would clarify generality.
1. How sensitive are results to the accuracy/completeness weights (60/40) and the debate threshold α? Can you report Kendall/Spearman correlation of evaluator scores with correctness, and success rates per score bin?
2. You set 3 rounds based on a peak at 66.8%. What is the marginal accuracy gain vs. added latency per round on both datasets? Provide a Pareto curve (accuracy vs. seconds/$$).
3. How do different CLIP variants and top-k selections affect accuracy and runtime? Can you quantify failure modes where retrieval misses key evidence?
4. For EgoSchema you use LaViLa with overlap removal; for NExT-QA, CogAgent. Could you provide cross-captioner results and any leakage checks?
5. Can you add a head-to-head vs. recent streaming long-video VLLMs or updated GPT-4-class systems to contextualize “rival GPT-4” beyond 2024-era baselines?
6. Please report average #tool calls, frames retrieved, tokens consumed, and end-to-end latency per query, and contrast with VideoAgent and LLoVi at similar compute. |
Fully AI-generated |
|
Achieving Noise Robustness by additive normalization of labels |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes “additive normalization of labels” to construct noise-robust losses. The authors instantiate the framework as two specific losses: Noise Robust Focal Loss (NRFL) and Weighted Robust Log Loss (WRLL). Experiments on image classification (MNIST, Fashion-MNIST, CIFAR-10) and several NLP tasks (News20 classification, synthetic JSON information extraction, MALLS, GSM8k reasoning, OpenBookQA) claim improved robustness versus CE, MAE, RLL, and NFL.
1. The noisy label learning problem studied here has practical significance, and developing robust loss functions is a promising direction for further exploration.
2. Simple, clear construction; directly yields the classical symmetric-noise robustness criterion.
3. The paper has a complete structure, with experiments spanning both computer vision and natural language processing.
1. No new contributions: This paper cites [1]. After comparing the two, I find that the method proposed in this paper is exactly the same as that in [1]. Therefore, this paper may not have any new contributions. The authors might not have read [1] carefully. I have serious concerns about this matter.
2. Unfair experiments: The authors claim that their method converges more quickly. However, NRFL and WRLL used higher learning rates than the baselines, without maintaining consistency, which may make the comparison unfair. In addition, for NRFL, the authors added an extra parameter $\delta$ to adjust the gradient, but did not keep it consistent with the baselines, which could also lead to unfair comparisons.
3. Mediocre performance: Although the authors claim that their method is effective, the experiments show that, even with the potentially favorable learning-rate and gradient-scaling settings, their method is inferior to standard MAE in many cases. This substantially undermines the significance of their approach.
Please refer to "Weaknesses" |
Lightly AI-edited |
|
Achieving Noise Robustness by additive normalization of labels |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a new way to handle noise in data by changing labels, not the loss function. The main idea is to adjust labels using an additive method that centers target vectors. This makes sure the loss can handle noise, as shown in earlier studies. The authors prove that with symmetric label noise, minimizing risk with noisy labels is the same as minimizing clean risk if the noise rate $q < \frac{k-1}{k}$. From this idea, the paper creates two loss functions: the noise-robust focal loss and the weighted robust log loss. Both keep a steady relationship with prediction confidence and are strong against noise due to the label adjustment. Tests on image classification (MNIST, CIFAR-10, Fashion-MNIST) and large language model tasks (like text classification, information extraction, translation, reasoning, and short answer grading) show these losses work better than cross-entropy, mean absolute error, and other methods when labels are noisy. The method is simple, based on theory, and works well in practice. It combines previous methods into one general approach that can be used in other areas beyond supervised learning.
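As a quick numerical illustration of this bound, the following sketch plugs centered labels into the symmetric-noise expectation and checks the sign of the noisy-to-clean loss ratio just below and above $(k-1)/k$. The centering formula and inner-product loss form are my reading of the paper's construction; the script itself is not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10
f_p = rng.normal(size=k)                      # arbitrary strictly-increasing transform of scores

def loss(label_idx, f_p):
    y = np.eye(k)[label_idx]
    y_bar = (k * y - 1.0) / (k - 1.0)         # additive normalization: centered target
    return -y_bar @ f_p

clean = loss(3, f_p)
for q in (0.5, 0.89, 0.91):                   # below and above (k-1)/k = 0.9
    # Expected loss when the label is kept w.p. 1-q and flipped uniformly otherwise.
    noisy = (1 - q) * clean + q / (k - 1) * sum(loss(c, f_p) for c in range(k) if c != 3)
    print(q, noisy / clean)                   # equals 1 - q*k/(k-1): positive iff q < 0.9
```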
The theory is solid and nicely links additive label normalization with known symmetry rules for handling noise.
The results are strong in both vision and language areas, showing steady improvements with different noise levels.
The method is simple, fast, and can be used widely without needing changes to the model or loss function.
The paper clearly explains its purpose and how it relates to past methods using normalization and symmetry.
All experiments use the same type of uniform label noise. Theorem 1 says it can handle different types of noise, but no tests prove this.
In real life, noise is often uneven and depends on the instance.
NRFL: It is unclear if improvements come from normalization or the focal loss part. The focal loss seems separate from the noise handling method. WRLL: It uses frequency-based weighting to deal with class imbalance, but its link to noise handling is not proven by theory or tests. This mixes two different issues.
The claim of enabling “application-specific” loss design is not supported by a concrete procedure; the proposed losses appear ad hoc rather than derived from the framework.
Insufficient baseline comparisons with recent noise-robust methods.
The paper uses the 2025 formatting template instead of the 2026 one.
How is additive normalization of labels different from Ma et al. (2020b) loss normalization?
(1) Why does your method avoid the underfitting reported in Ma et al. (2020b)?
(2) Is there proof or empirical evidence that additive label normalization outperforms loss normalization?
(3) How does the training speed compare to standard losses under noisy labels?
(4) Please report results for normalized cross-entropy (without focal) versus the proposed noise-robust focal loss, to isolate the effect of additive normalization from the focal term.
(5) For the weighted robust log loss, compare:
(a) standard weighted loss without normalization,
(b) normalized unweighted loss,
(c) the proposed weighted robust log loss (WRLL).
On the noise bound in Theorem 1:
(6) The bound (k-1)/k suggests degradation at very high noise rates. How does performance behave beyond this threshold?
(7) Please explain why the noise limit is (k-1)/k and discuss how accurate or conservative this bound is in practice.
(8) What happens on datasets with non-uniform noise, such as CIFAR-N or Clothing-1M, or under more instance-dependent (IDN) or class-dependent noise? |
Lightly AI-edited |
|
Achieving Noise Robustness by additive normalization of labels |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors propose a new foundational study on robust loss design for the LNL problem.
While numerous LNL methods exist, the authors’ goal appears to be a more fundamental exploration that could inspire future research on noise-robust loss functions.
Specifically, they define the probability of label noise (assumed to occur randomly) and formulate a loss function that minimizes the expected risk under this noise distribution.
1. The proposed method contains almost no hyperparameters, which makes it elegant and easy to reproduce. This also implies that relaxed variants of robust loss (those requiring tuning parameters) are not the primary focus of this work.
2. The authors conduct experiments on a wide range of benchmarks. Notably, the inclusion of NLP datasets in their experiments is quite novel in the LNL literature and demonstrates the potential generality of their approach.
1. The theoretical derivation is rather straightforward.
Intuitively, in a k-class classification problem, it is not difficult to reason about the level of random label noise that can be tolerated before performance degrades.
Although the authors explain this process clearly, the derivation itself offers limited new insight.
Moreover, the analysis assumes purely random (uniform) noise and does not consider more realistic or ambiguous cases, such as class-dependent or instance-dependent label noise.
2. The empirical results show limited improvement.
While the proposed method achieves small gains, the baselines compared against are not among the strongest or most recent methods in the LNL field.
Although the restricted comparison setup is understandable given the paper’s theoretical orientation, additional information or analyses would be required for the paper to serve as a solid foundation for future work.
I suggest introducing a relaxed normalization variant (e.g., one controllable by hyperparameters) and demonstrating superiority over established robust losses such as GCE or APL.
I understand the authors’ proposed method and their intended objective; however, the experimental results and theoretical justification do not seem sufficient for this paper to be considered a new milestone in the field.
As mentioned in the Weaknesses, the authors should at least demonstrate the potential to extend their proposed framework or provide stronger evidence of theoretical noise tolerance under more challenging conditions. Such additions would make the paper significantly more convincing and impactful. |
Moderately AI-edited |
|
Achieving Noise Robustness by additive normalization of labels |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Additive Normalization of Labels, a framework for constructing noise-robust loss functions by normalizing labels instead of losses. The authors theoretically prove that this additive normalization preserves the same optimal solution under label noise, satisfying the symmetry condition for noise tolerance. Based on this principle, they propose two new losses—Noise-Robust Focal Loss (NRFL) and Weighted Robust Log Loss (WRLL)—which consistently outperform existing methods across computer vision and NLP tasks. The approach is simple, theoretically grounded, and broadly applicable to real-world noisy learning scenarios.
1. The paper provides an interesting and reasonable insight into maintaining collinearity between clean and noisy labels.
2. The theoretical analysis clearly and rigorously validates the feasibility of the proposed additive normalization of labels.
3. Extensive experiments empirically demonstrate the effectiveness of the proposed approach.
1. In line 222, there is a condition requiring the noisy label $\tilde{y}$ and the clean label $y$ to be collinear, i.e., $q<\frac{k-1}{k}$. This implies that in a $k$-class classification task (e.g., $k=10$), the model learns the correct label direction only when the noise ratio is below 0.9. It would be interesting to empirically explore this noise-ratio boundary to gain additional insight for theory verification; for example, does performance drop sharply when the noise ratio increases from 0.89 to 0.91?
2. Figure 1 only presents a PCA analysis under a 50% noise ratio. It would be helpful to also include PCA results under 0% noise or use an alternative visualization to further demonstrate the collinearity property.
3. The decimal places in Table 2 are inconsistent and should be standardized for clarity.
Please refer to weaknesses. |
Lightly AI-edited |
|
Achieving Noise Robustness by additive normalization of labels |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes “additive normalization of labels” for noise-robust learning. Given a $k$-class problem with one-hot target $y\in\{e_1,\dots,e_k\}$ and model scores passed through any strictly increasing component-wise map $f(p)=[f_1(p_1),\dots,f_k(p_k)]$, the authors replace $y$ by $\bar y=\frac{1}{k-1}(k y - \mathbf{1})$ and optimize $\ell(y,p)=-\langle \bar y, f(p)\rangle$. They prove that under symmetric label noise with flip probability $q$, the noisy risk equals the clean risk up to a positive scalar when $q<\frac{k-1}{k}$ (binary case reduces to $q<\tfrac12$). They argue that the loss is “symmetric” because $\sum_{y}\ell(y,p)=0$, and claim this entails robustness beyond symmetric noise. Two instantiations are presented: a noise-robust focal loss (NRFL) obtained by plugging the focal form into $f$, and a class-imbalance variant “weighted robust log loss” (WRLL) using $f_i(p_i)=\log(\alpha_i+p_i)$ with $\alpha_i$ inversely proportional to class frequency. Experiments on MNIST, Fashion-MNIST, CIFAR-10, several NLP tasks (News20, a synthetic JSON IE task, MALLS, GSM8k, OpenBookQA), and small LLMs reportedly show improved robustness, faster convergence, and clearer decision boundaries.
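For readers checking the claim, here is my reconstruction of the symmetric-noise argument using the notation above; the intermediate risk decomposition is the standard symmetric-loss argument, not text copied from the paper:

```latex
% Centering makes the loss sum to zero over all labels:
\[
  \bar{y} = \tfrac{1}{k-1}\bigl(k\,y - \mathbf{1}\bigr),
  \qquad
  \ell(y,p) = -\langle \bar{y}, f(p)\rangle,
  \qquad
  \sum_{c=1}^{k} \ell(e_c, p) = 0 .
\]
% Under symmetric noise with flip probability q (uniform over the k-1 wrong classes),
% the zero-sum property gives \sum_{c \ne y} \ell(e_c, p) = -\ell(y, p), hence
\[
  \tilde{R}(h)
  = (1-q)\,R(h) + \tfrac{q}{k-1}\,\mathbb{E}\Bigl[\textstyle\sum_{c \ne y}\ell(e_c, h(x))\Bigr]
  = (1-q)\,R(h) - \tfrac{q}{k-1}\,R(h)
  = \Bigl(1 - \tfrac{qk}{k-1}\Bigr) R(h),
\]
% so noisy and clean risks share minimizers exactly when 1 - qk/(k-1) > 0, i.e. q < (k-1)/k.
```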
1. This article is well-written and easy to read.
2. Flexible template: any strictly increasing componentwise $f$ can produce a loss; easy to implement (NRFL, WRLL).
3. Some empirical gains reported over CE/MAE/NFL/RLL in several noisy setups; qualitative feature visualizations align with the claimed decision-boundary effect.
4. This article draws on both an image dataset and an NLP dataset, which is commendable.
1. This article evaluates the method only on very small datasets and does not include experiments on benchmarks with more categories or larger scale, such as CIFAR-100 or WebVision. It is well known that symmetric losses [1, 2, 3] are challenging to optimize. Results on small datasets like CIFAR-10 and MNIST are insufficient to assess optimization capability: even symmetric losses such as MAE and NCE—despite their optimization difficulty—can perform well in these settings. In contrast, on datasets with more categories or larger scale, such as CIFAR-100 and WebVision, symmetric losses are often harder to optimize and typically underperform. Consequently, using a symmetric loss in isolation has limited practical value. As a result, the experiments presented by the authors do not establish the practical significance of the proposed method.
2. The author did not compare with advanced methods such as GCE [1], APL [2], and AGCE [3]. The experimental results show that the method proposed by the author does not significantly outperform the most basic MAE.
3. Lack of detail regarding the NLP experiments: there are multiple procedures for training large language models, such as supervised training, alignment, or adding an MLP layer for classification training. The author did not clearly explain how they conducted the training.
4. Why only train for 15 epochs or 30 epochs on the CV dataset? This is unreasonable. For instance, on the CIFAR-10 dataset, one usually needs to train for at least 100 epochs.
5. The theoretical contribution is limited. The author did not make any new theoretical contributions.
6. A minor issue: a broken reference (“Table ??”) on page 12.
[1] Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy labels.
[2] Normalized Loss Functions for Deep Learning with Noisy Labels.
[3] Asymmetric Loss Functions for Learning with Noisy Labels.
1. Why not conduct the experiments on CIFAR-100 and WebVision? Evaluate on real-world noisy datasets and on non-uniform/instance-dependent noise to support the broader robustness claims.
2. Why not compare with the advanced robust loss function? Add strong, modern baselines under the same budgets and report multiple seeds with confidence intervals.
3. Why only conduct training for 15 or 30 epochs? Please re-run image classification with standard training pipelines (reasonable learning rates, epochs) and equalize hyperparameters across methods. Current CIFAR-10 baselines (e.g., 34% at 0% noise with ResNet-18) are not credible. |
Lightly AI-edited |
|
Graph-Based Operator Learning from Limited Data on Irregular Domains |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper proposes a novel neural-operator architecture, GOLA, based on graph neural networks and attention, for data-scarce scenarios. On four 2D benchmarks, GOLA shows superior performance compared with three baselines.
The paper is presented in high clarity.
Regarding the experiments:
1. The baseline models are not SOTA, where the latest is AFNO, a work in 2021. The baseline should include some SOTA neural operators, e.g., Transolver [1].
2. The author claimed generalization across sample densities and resolutions. However, the experiments are conducted only for GOLA. How about the baseline models? Is GOLA better at generalization across sample densities and resolutions, or worse?
3. On data efficiency, again, comparison should be made among baseline models. At what amount of data would baseline models catch up with GOLA?
4. Are the models compared under settings with a similar number of parameters?
[1] Wu, H., Luo, H., Wang, H., Wang, J., & Long, M. (2024). Transolver: A fast transformer solver for pdes on general geometries. arXiv preprint arXiv:2402.02366.
N.A. |
Fully human-written |
|
Graph-Based Operator Learning from Limited Data on Irregular Domains |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper presents a new architecture for operator learning over irregular point clouds. It encodes sampled inputs with trainable Fourier features, embeds these features on a spatial graph according to the sampling scheme, and uses a message-passing algorithm and an attention mechanism to produce an output over these spatial nodes. A proof of universality of this architecture is presented, and experiments on Darcy flow, nonlinear diffusion, Eikonal, and advection PDE operators are conducted to compare against DeepONet, AFNO, and GKN architectures. Compared to these three other architectures, the proposed method shows improved results.
Exploring additional architectures for neural operators, especially on irregular domains or with irregularly sampled discretizations, is an important area of research to further improve our ability to make surrogate models for PDE solution operators.
The paper claims that the method can be used for learning operators on irregular domains; however, all experimental results are for rectangular domains. The sampling within these domains is not necessarily on a grid, but throughout the paper there is no evidence supporting any claims about irregular domains themselves.
Not addressing irregular domains properly is also the cause of some incorrect theoretical arguments. In particular, the paper references the existence of the complex exponentials as a basis for $L^2(\Omega)$, with $\Omega$ a compact domain. This is false:
Iosevich, Alex. "The Fuglede spectral conjecture holds for convex planar domains." Mathematical Research Letters 10.5 (2003): 559-570.
The proof of universality of the architecture is incomplete. The proof proceeds by showing first that input functions can be represented arbitrarily well by a finite number of complex exponentials (not true for general compact $\Omega$). It then uses the universality of GNNs to claim arbitrarily close approximation (implicitly in $L^2$ but not specified) of the mapping on the sampled points. The GNN decoder $D_\theta$ is invoked without being defined and its error is not accounted for.
The only motivation provided for the architecture choices is that a graph is used to handle non-uniformly spaced sampling points. All other components are presented largely without motivation.
Previous work has investigated methods for operator learning on irregular point clouds which has not been discussed or compared against in this paper, such as
Lingsch, Levi E., et al. "Beyond Regular Grids: Fourier-Based Neural Operators on Arbitrary Domains." International Conference on Machine Learning. PMLR, 2024..
Li, Zongyi, et al. "Fourier neural operator with learned deformations for pdes on general geometries." Journal of Machine Learning Research 24.388 (2023): 1-26.
Liu, Ning, Siavash Jafarzadeh, and Yue Yu. "Domain agnostic fourier neural operators." Advances in Neural Information Processing Systems 36 (2023): 47438-47450.
Due to lack of relative baselines for other neural operator methods on irregular domains or sampling schemes the benefits of this particular construction are not clear.
1. How does this method compare to other methods for operator learning with irregularly sampled points?
2. What is the motivation for the message aggregation as presented in (8)?
3. Is this method competitive with other operator learning methods specifically designed for irregular sampling on the presented benchmarks?
4. Is this architecture universal if we cannot appeal to the density of spans of complex exponentials in $L^2(\Omega)$? |
Fully human-written |
|
Graph-Based Operator Learning from Limited Data on Irregular Domains |
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper proposes GOLA (Graph-based Operator Learning with Attention), an operator-learning framework for PDEs that works on irregular domains and sparse, nonuniform samples. GOLA first embeds inputs with a learnable Fourier encoder that projects function values at arbitrary coordinates into a spectral basis with complex-valued modes, then performs attention-enhanced message passing on a proximity graph built from spatial points to capture both local and global dependencies. Experiments on four 2D PDE families—Darcy Flow, Advection, Eikonal, and Nonlinear Diffusion—show consistent gains over baselines (DeepONet, AFNO, GKN).
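To ground the terminology, a minimal sketch of the generic pattern described here (a learnable Fourier-feature encoding of scattered points followed by one attention-weighted k-NN message-passing step) is given below. This is my own generic reconstruction, not GOLA's actual architecture; all dimensions, the real-valued sin/cos basis, and the k-NN graph construction are assumptions:

```python
import torch

class FourierEncoder(torch.nn.Module):
    def __init__(self, in_dim=2, n_modes=32, out_dim=64):
        super().__init__()
        self.freqs = torch.nn.Parameter(torch.randn(in_dim, n_modes))  # learnable frequencies
        self.proj = torch.nn.Linear(2 * n_modes + 1, out_dim)

    def forward(self, coords, values):
        # coords: (N, 2) arbitrary sample locations; values: (N, 1) input function samples
        phase = coords @ self.freqs                                    # (N, n_modes)
        feats = torch.cat([torch.sin(phase), torch.cos(phase), values], dim=-1)
        return self.proj(feats)                                        # (N, out_dim)

def knn_attention_step(h, coords, k=8):
    # One attention-weighted aggregation over each node's k nearest neighbours.
    dist = torch.cdist(coords, coords)                                 # (N, N)
    idx = dist.topk(k + 1, largest=False).indices[:, 1:]               # drop self
    neigh = h[idx]                                                     # (N, k, d)
    scores = torch.softmax((neigh @ h.unsqueeze(-1)).squeeze(-1) / h.shape[-1] ** 0.5, dim=-1)
    return h + (scores.unsqueeze(-1) * neigh).sum(dim=1)               # residual update

coords = torch.rand(256, 2)              # scattered sample points in the unit square
values = torch.rand(256, 1)              # sampled input function
h = FourierEncoder()(coords, values)
h = knn_attention_step(h, coords)
```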
1. **Learnable Fourier encoder on spatial graphs.**
Turning inputs into a complex Fourier basis lets the model capture long-range/global interactions with few coefficients, while graph message passing refines local structure. Because of the learned frequencies, the model can generalize across resolutions (the spectral representation is not tied to a fixed grid). This also reduces aliasing artifacts compared with naïve coordinate MLPs and gives a compact, physics-plausible feature space for operator learning.
2. **Clear and sufficiently detailed method presentation.**
The paper lays out the pipeline cleanly—Fourier encoding → graph construction → attention-augmented message passing → decoding—with consistent notation and design choices explained (edge features, attention rationale, training objective). The ablations and component descriptions make it straightforward to reproduce and to understand which parts drive gains.
3. **Helpful visualizations.**
The figures (architecture block diagram, graph construction sketches, and qualitative field reconstructions) make the workflow and design choices easier to follow and provide intuitive evidence for how attention and spectral features influence predictions. These plots also highlight resolution/generalization behavior and error patterns, aiding interpretability.
1. **Outdated or incomplete comparisons on irregular-domain PDE learning.**
The related work and experimental baselines underrepresent recent approaches across **GNN** [1], **implicit neural representations (INR/SIREN/coordinate networks)** [2], and **Transformer-style neural operators** [3] tailored to unstructured meshes or scattered points. Without head-to-head evaluations against stronger and more recent models (e.g., graph/mesh operators with positional encodings, attention-based operators on point clouds), it’s hard to substantiate GOLA’s claimed advantage. This limits the paper’s positioning and makes the empirical novelty less compelling.
2. **“Irregular domains” are synthetically derived from uniform grids, weakening the motivation.**
The paper’s “irregular sampling” is obtained by subsampling a uniform lattice, which **does not reflect** the practical challenges that motivate irregular discretizations: complex or curved boundaries, anisotropic resolution to capture fine features, mesh adaptivity, or true unstructured meshes common in CFD and geometry-rich PDEs (e.g., airfoils, cylinder flows, ShapeNet-like CAD geometries). As a result, the experiments do not test boundary handling, mesh heterogeneity, or topology changes—key factors for demonstrating real-world utility. The focus on **only 2D** further narrows external validity; many operator-learning applications (fluid/solid mechanics, climate/ocean) require robust performance on 3D unstructured meshes.
[1] Li, Zhihao, et al. "Harnessing scale and physics: A multi-graph neural operator framework for pdes on arbitrary geometries." Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1. 2025.
[2] Wang, Honghui, Shiji Song, and Gao Huang. "GridMix: Exploring Spatial Modulation for Neural Fields in PDE Modeling." The Thirteenth International Conference on Learning Representations. 2025.
[3] Wu, Haixu, et al. "Transolver: A Fast Transformer Solver for PDEs on General Geometries." International Conference on Machine Learning. PMLR, 2024.
1. **Stronger, more representative experiments.**
Could you add evaluations on *truly* irregular domains (unstructured/anisotropic meshes and complex boundaries), e.g., airfoil/cylinder flows or CAD-like geometries, and compare against more recent GNN/INR/Transformer-based operators designed for such settings? This would directly address external validity and positioning (see Weaknesses).
2. **Fourier encoding on irregular meshes.**
Please clarify why the proposed Fourier encoder is applicable on nonuniform point sets:
* Is the encoding purely a function of coordinates (i.e., independent of grid regularity), or does its validity rely on the fact that your samples come from an underlying uniform lattice?
* How does aliasing/spectral leakage behave on truly unstructured meshes where Nyquist-style guarantees don’t hold?
* Can the approach extend to 3D unstructured meshes (e.g., tetrahedral/hexahedral) and curved boundaries, and what modifications (basis choice, graph construction, positional encodings) would be required? |
Fully AI-generated |
|
Graph-Based Operator Learning from Limited Data on Irregular Domains |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces GOLA (Graph-based Operator Learning with Attention) — a framework for operator learning on irregular spatial domains. The method combines (1) a learnable Fourier encoder for spectral embedding of irregular samples and (2) attention-enhanced graph neural networks (GNNs) for message passing. The authors claim GOLA generalizes across PDE types (Darcy Flow, Advection, Eikonal, Nonlinear Diffusion) and sampling densities, outperforming DeepONet, AFNO, and Graph Kernel Network (GKN) in low-data regimes.
1. The paper targets an important challenge: operator learning on irregular and sparse domains.
2. The writing is overall structured and readable.
1. Limited Novelty and Overlap with Existing Work
- Both Fourier encoding and attention mechanisms have been extensively explored in the context of graph neural networks, including Fourier Feature Networks, Graph Attention Networks (GATs), and Graph Transformer networks (GTN).
- The proposed combination in GOLA therefore appears incremental, as similar designs integrating spectral embeddings with attention-based message passing already exist in prior operator learning and geometric deep learning literature.
- The paper fails to clearly articulate what novel insight, mechanism, or theoretical contribution differentiates GOLA from existing graph-based operator learning frameworks.
2. Weak Baselines and Benchmarks
- The baseline comparisons are limited to older and weaker models (DeepONet, AFNO, GKN), which no longer represent the state of the art. The authors should include more recent and competitive baselines that address irregular or non-Euclidean domains, such as Transolver, Position-induced Transformer (PiT), Geo-FNO, PINO, etc.
- The chosen benchmarks are small-scale, synthetic PDEs (Darcy, Advection, Eikonal, Diffusion) simulated in-house. These toy problems do not adequately demonstrate the model’s generalization capability. Evaluation on standard community datasets—such as Airfoil, ShapeNet-Car, or AirFRANS—would significantly strengthen the empirical validation.
3. Lack of Depth in Theoretical and Conceptual Analysis
- The “theoretical analysis” section is superficial, restating standard neural operator approximation theorems without offering any model-specific insight or proof.
- The paper provides no theoretical reasoning or ablation-based evidence explaining why combining Fourier encoding with attention-enhanced GNNs should lead to improved expressivity, generalization, or sample efficiency.
1. What is the key novelty beyond existing graph-based operator networks with attention?
2. What if the authors remove the Fourier encoder or attention modules? Are there any ablation studies? |
Fully AI-generated |
|
GDEGAN: Gaussian Dynamic Equivariant Graph Attention Network for Ligand Binding Site Prediction |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors propose a variation of vanilla dot-product attention based on adaptive kernels, motivated by the need to capture the local geometric and chemical features of residues for predicting protein-ligand binding sites. The approach calculates the mean and variance of a residue's neighborhood on the fly. The authors show that this approach, combined with ESM embeddings as node features, significantly outperforms the previous SOTA GotenNet, which is based on dot-product attention.
The authors claim that the current best ligand binding site prediction methods based on message passing networks use context-agnostic attention mechanisms, and that the binding sites in a protein are often clustered based on their local geometric and chemical features. They construct a protein structure graph with residues as nodes and edges between the residues determined by a threshold on C-alpha distances. The third component of the graph is the C-alpha coordinates. They initialize the nodes using ESM-2 embeddings and then project them to a hidden-dimensional space using learned transformations. Nodes are labeled 1 or 0 based on closeness to ligand atoms. The task is: given a protein graph, predict the binding probability of each residue. The authors adopt GotenNet and modify the attention part and the representations for scalars and tensors. They design basis functions based on spherical harmonics to preserve equivariance. They then use these steerable features and the invariant scalars from the projected pLM embeddings to create the message passing networks for both nodes and edges. The node features go through a dot product with the RBF features, ensuring differentiability. Steerable features are initialized to 0 and then updated during training. The node features (from the pLM) are used to compute mean and variance on the fly.
Experiments are conducted on three benchmark datasets and an additional ablation study is performed.
The paper is well organized in laying out the limitations of current binding site prediction models. Adopting GotenNet and modifying the vanilla attention with the proposed approach, so that attention is more dynamic and aware of an atom's local neighborhood, is an interesting approach. The key contribution is the idea of computing a neighboring atom’s features from a Gaussian distribution defined by the target atom’s local neighborhood in the model. The results in Table 1 show that GDEGAN beats the baseline models on three benchmark datasets.
The results in Table 1 show that the proposed method beats GotenNet on all three datasets, except on the failure rate. However, the ablation study (Table 2) shows that the main boost comes from the ESM embeddings for both methods. The authors show that the proposed approach is beneficial as structural heterogeneity increases. Since protein-ligand binding sites are determined by their chemical fingerprint, it would be interesting to see whether the method would benefit from relying not only on the C-alpha atoms but on an all-atom graph model like GearBind.
If that is a significant stretch, the authors could also try using embeddings from structure-aware protein models such as ProstT5 or SaProt, or even surface-structure-aware protein models such as AtomSurf. If the main hypothesis is that the Gaussian kernels give the additional performance boost by capturing local chemical and geometric features, then one could test this by extracting residue embeddings from advanced protein models that are trained on structure and surface features constructed from local neighborhoods. Without that comparison it is hard to determine whether the proposed approach is the optimal method to capture the local geometric and chemical characteristics of the binding residues.
In any case, when referring to chemical and geometric features, the authors did not cite a few other relevant papers:
1. MaSIF: https://www.biorxiv.org/content/10.1101/606202v1.full.pdf
2. AtomSurf: Which combines structure and surface (https://arxiv.org/pdf/2309.16519)
3. GearBind: https://www.biorxiv.org/content/10.1101/2023.08.10.552845v1
Please add other SoTA methods on protein ligand binding tasks such as HoloProt (https://arxiv.org/pdf/2204.02337) |
Fully human-written |
|
GDEGAN: Gaussian Dynamic Equivariant Graph Attention Network for Ligand Binding Site Prediction |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces GDEGAN, a Gaussian Dynamic Equivariant Graph Attention Network for protein–ligand binding site prediction. Its central idea is to replace dot-product similarity with a Gaussian kernel whose weights are determined by local neighborhood statistics and a learnable temperature, implemented within an SE3-equivariant graph architecture. The approach targets the strong geometric and chemical heterogeneity of protein surfaces and includes a clear treatment of symmetry, noting that the use of ESM-2 scalar features yields SE3 rather than full E3 equivariance. The training objective addresses class imbalance and directional cues. Experiments on COACH420, HOLO4K, and PDBbind2020 report consistent gains in DCC and DCA, substantial reductions in failure rate, and faster inference versus strong baselines. Attention visualizations align with pocket regions, offering an interpretable account of model behavior.
Strength 1: Uses a local Gaussian kernel with adaptive bandwidth from neighborhood statistics and a learnable temperature, yielding context-aware attention suited to heterogeneous protein surfaces.
Strength 2: Provides formal analysis showing the proposed attention preserves SE(3) equivariance under the chosen feature representation, giving a clear geometric justification.
Strength 3: Demonstrates consistent improvements over strong baselines across standard pocket benchmarks, supported by ablations and qualitative visualizations, with competitive or better inference efficiency.
Weakness 1: The paper claims to capture geometric structure and handle variation among neighboring residues, but the evidence is mostly indirect. It should demonstrate which previously hard geometric challenges are now addressed, with targeted analyses rather than only aggregate metrics and visuals.
Weakness 2: Comparisons with prior graph-attention variants are incomplete, especially kernelized attention methods. A deeper analysis against these baselines is needed to substantiate the claimed contribution and clarify what is genuinely new.
Weakness 3: Key terms should be standardized for readability, including “Gaussian kernel,” “Gaussian attention,” and “Protein-aware Structural Embeddings.” A thorough pass is recommended. Also correct the misspelling of “temperature” in the figures.
Weakness 4: The proposed variant may introduce extra computational cost. The paper should provide complexity measurements and hyperparameter sensitivity analyses to quantify overhead and practical trade-offs.
Q1: Can you report how the learnable temperature evolves and distributes during training?
Q2: How does using a learnable temperature compare to a fixed bandwidth parameter?
Q3: Which specific geometric structures previously handled poorly by dot-product or standard graph attention are now captured better?
Additional questions and suggestions please refer to the Weaknesses. |
Moderately AI-edited |
|
GDEGAN: Gaussian Dynamic Equivariant Graph Attention Network for Ligand Binding Site Prediction |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces GDEGAN, a novel Gaussian Dynamic Equivariant Graph Attention Network designed for predicting protein-ligand binding sites. The core innovation is the replacement of standard dot-product attention with a Gaussian Dynamic Attention mechanism. This new mechanism adapts to the local chemical and geometric heterogeneity of protein surfaces by using the mean and variance of neighboring residue features to compute attention scores, leading to state-of-the-art performance on three established benchmark datasets.
The novelty lies in the successful adaptation and application of a probabilistic, variance-aware attention mechanism to the domain of 3D equivariant graph representations for a critical bioinformatics task. While building upon a strong backbone (GotenNet), the proposed attention module is a distinct and impactful innovation. It provides a more physically grounded inductive bias by assuming that variance in learned features is a meaningful signal, a departure from standard similarity-based dot-product attention.
The model achieves substantial and consistent improvements over strong baselines, including the current state-of-the-art EquiPocket, across three diverse datasets (COACH420, HOLO4k, PDBbind2020). The reported relative gains (e.g., 37-66% in DCC) are compelling.
**Key Flaw:** The most critical weakness is the ambiguity in the formulation of the core Gaussian attention mechanism. Specifically, the dimensionality of the neighborhood statistics μ_i and (σ_i)^2 in Equations 5 and 6 is unclear in the context of Equation 7. Since h_j is a high-dimensional feature vector, μ_i and (σ_i)^2 should also be vectors (element-wise mean and variance). However, Equation 7 uses (σ_i)^2 as if it were a scalar value for modulating the attention kernel's bandwidth. This lack of clarity is a major impediment to understanding and reproducing the method and casts doubt on its technical soundness. The authors must explicitly define how the vector variance is converted into the scalar used in the denominator.
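To illustrate why this matters, here is a toy sketch of the two most plausible readings; all shapes and both scalar reductions are hypothetical, and the point is that they give different attention logits, so the paper needs to state which, if either, is intended:

```python
import torch

# Neighborhood mean and variance of feature vectors are naturally vectors (Eqs. 5-6),
# but Eq. 7 appears to use sigma_i^2 as a scalar bandwidth. Two possible reductions:
N, d = 5, 8
h = torch.randn(N, d)                               # residue features
neigh = [torch.randperm(N)[:3] for _ in range(N)]   # toy neighborhoods (3 neighbors each)

i = 0
H_i = h[neigh[i]]                                   # (|N(i)|, d)
mu_i = H_i.mean(dim=0)                              # vector mean, shape (d,)
var_i = H_i.var(dim=0, unbiased=False)              # vector variance, shape (d,)

sigma2_a = var_i.mean()                             # option A: average over feature dims
sigma2_b = var_i.norm()                             # option B: norm of the variance vector

j = int(neigh[i][0])
for sigma2 in (sigma2_a, sigma2_b):
    # Gaussian-kernel attention logit between residues i and j under each reduction
    logit = -((h[j] - mu_i).pow(2).sum()) / (2 * sigma2 + 1e-6)
    print(float(logit))
```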
**Methodological Issues:** The central hypothesis of the paper, while intuitive, could be better substantiated. The authors assume that high local feature variance is a reliable signal for binding sites and that standard dot-product attention cannot capture this.
1. Justification of Hypothesis: This assumption is presented as a given but lacks direct empirical or theoretical support. Is there prior work suggesting a strong correlation between feature variance and functional sites?
2. Expressive Power of Baselines: The paper argues that multi-layer GNNs with standard attention are insufficient. However, it is plausible that a sufficiently deep model could learn to approximate similar context-aware behavior implicitly. The paper does not provide a compelling argument or experiment to rule out this possibility.
**Experimental Evaluation Issues:** The experimental section is strong but could be improved.
1. Training Dynamics: The introduction of a data-dependent variance term (σ_i)^2 in the denominator of an exponential function could potentially lead to training instability (e.g., vanishing or exploding gradients) if the variance becomes very small or large. The paper does not discuss the training dynamics or present loss curves to demonstrate that the model converges as robustly as the baseline.
2. Lack of Parameter Sensitivity Analysis: The model introduces a learnable temperature parameter ξ for each attention head. An analysis of the model's sensitivity to this hyperparameter would strengthen the results and provide insights into the mechanism's behavior.
Clarifying these issues will be crucial for a more thorough assessment of the paper's quality.
1. Clarification of Equations 5-7: This is the most critical point. Please provide a precise mathematical definition for how the neighborhood variance (σ_i)^2 is used in Equation 7. Given that h is a vector, (σ_i)^2 from Equation 6 should also be a vector. How is this vector transformed into the scalar value required in the denominator of Equation 7? Is it the mean of the vector elements, their L2 norm, or something else?
2. Justification of the Core Hypothesis: Could you provide further justification for the core assumption that local feature variance is a superior signal for identifying binding sites compared to what can be learned by standard attention mechanisms? Perhaps you could show a correlation analysis on a validation set between the learned (σ_i)^2 values and ground-truth binding site locations.
3. Training Stability: Did the use of the (σ_i)^2 term in the attention calculation lead to any training instability? Could you please present the training and validation loss curves for GDEGAN and the GotenNet(full) baseline to demonstrate that the proposed mechanism allows for stable convergence?
4. Computational Overhead: Remark 2 discusses the theoretical computational complexity. Could you provide the empirical wall-clock inference and training time overhead of the Gaussian Dynamic Attention layer compared to the standard dot-product attention layer in GotenNet? This would give a clearer picture of the practical trade-offs. |
Fully AI-generated |
|
A State-Transition Framework for Efficient LLM Reasoning |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper aims to alleviate the computational and memory burden of long-chain CoT reasoning: the authors explicitly model LLM reasoning as “state–transition,” using linear attention to maintain a cross-step “reasoning state” so that each token in the current step can directly retrieve historical essentials from this state instead of re-attending to all tokens from previous steps; meanwhile, the softmax-attention branch only attends to the prompt and the current-step prefix, reducing attention complexity from quadratic to linear, together with a “state-based reasoning strategy” to mitigate noisy steps and overthinking. The abstract claims that experiments across datasets and model scales show not only substantial efficiency gains but also improvements in reasoning performance.
1. The problem is well-targeted and the motivation is clear. The paper tackles the latency and memory blow-up of long CoT reasoning: by restricting the SA branch to “prompt + current step” and introducing an LA branch to maintain a “historical reasoning state matrix,” it reduces attention complexity from quadratic to linear and the KV cache from linear to near-constant. The exposition is clear and technically coherent.
2. The method is novel and modular in practice. The proposed Mixed Attention Module (MAM) replaces standard attention with parallel SA (local, current-step) and LA (global, cross-step state) branches; tokens in the current step fetch historical essentials directly from the “state blackboard,” without re-reading all past tokens. The idea is natural, and implementation is compatible with existing Transformer interfaces.
3. Experiments are broad and the gains are significant. (1) Accuracy: Outperforms a range of efficient-reasoning and KV-compression baselines across multiple benchmarks. (2) Efficiency: Advantages become clear when CoT > 4K; at 32K, inference is accelerated by over 40%, and memory usage remains approximately constant with length, in line with the theoretical claims.
The methodological description is not sufficiently clear. I recommend adding a schematic of the attention matrices to reduce the reader’s cognitive load. In addition, I recommend including pseudocode or a diagram in the main text for both the training and inference procedures of the MAM method.
N/A |
Lightly AI-edited |
|
A State-Transition Framework for Efficient LLM Reasoning |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes an efficient reasoning framework for Large Language Models (LLMs) that addresses the high computational and memory costs associated with long Chain-of-Thought (CoT) reasoning. Unlike prior work that compresses CoT sequences—potentially limiting reasoning capacity and conflicting with test-time scaling—the authors model the reasoning process as a state-transition system.
The key idea is to maintain a compact reasoning state using a linear attention mechanism, which summarizes historical reasoning information. At each step, the model generates the next reasoning segment based on the current query and this reasoning state, rather than attending to the full CoT history. This allows each token to efficiently access relevant past information without the quadratic complexity of standard attention, reducing computational cost to linear time. Experiments across multiple benchmarks and model sizes show that the proposed method improves both reasoning efficiency and performance compared to standard CoT and other efficient reasoning approaches.
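For intuition, here is a minimal sketch of the kind of linear-attention state update the summary describes: a constant-size state matrix absorbs each finished reasoning step, and tokens of the current step read from it instead of attending over all past tokens. This is generic (unnormalized) linear attention with an assumed decay factor, not the paper's exact MAM formulation:

```python
import torch

d_k, d_v = 16, 16
S = torch.zeros(d_k, d_v)              # reasoning state (constant memory w.r.t. CoT length)
decay = 0.99                           # optional forgetting factor (assumed)

def absorb_step(S, K_step, V_step):
    # Fold a finished reasoning step (its keys/values) into the state: S <- decay*S + K^T V
    return decay * S + K_step.T @ V_step

def read_state(S, Q_step):
    # Tokens of the current step query the state directly: O = Q S  (linear in step length)
    return Q_step @ S

for _ in range(3):                     # three past reasoning steps of 32 tokens each
    K = torch.randn(32, d_k)
    V = torch.randn(32, d_v)
    S = absorb_step(S, K, V)

Q_cur = torch.randn(8, d_k)            # 8 tokens of the current step
ctx = read_state(S, Q_cur)             # (8, d_v) historical context, no re-attention over past tokens
print(ctx.shape, S.shape)
```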
1. The paper proposes a conceptually innovative approach by modeling the LLM reasoning process as a state-transition system, where historical reasoning information is compressed into a compact reasoning state matrix via linear attention. This design effectively decouples reasoning efficiency from CoT length, preserving full reasoning trajectories while avoiding the quadratic attention cost.
2. The proposed state-based reasoning strategy leverages the gradient-descent interpretation of linear attention to compute a global gradient (via momentum) that guides the current reasoning step. This mechanism actively counters the accumulation of noisy or misleading reasoning steps, addressing the over-thinking problem in a principled and trainable manner, which contributes to both improved accuracy and stability in long reasoning chains.
1. The major drawback of this paper is the lack of comparison with many relevant baseline methods; the following works, for instance, are not included in the experimental comparisons.
[1] Incentivizing Dual Process Thinking for Efficient Large Language Model Reasoning
[2] Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging
[3] L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
[4] Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning
[5] Concise Reasoning via Reinforcement Learning
[6] SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning
[7] Plan and Budget: Effective and Efficient Test-Time Scaling on Large Language Model Reasoning
[8] Not All Tokens Are What You Need In Thinking
[9] Stable Reinforcement Learning for Efficient Reasoning
[10] Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning
[11] Optimizing Anytime Reasoning via Budget Relative Policy Optimization
[12] Making Small Language Models Efficient Reasoners: Intervention, Supervision, Reinforcement
2. The paper relies on a large amount of data (95K) for fine-tuning, which suggests a clear issue of data inefficiency compared to RL-based methods that incorporate length penalties. The authors should provide a detailed discussion of this limitation in the paper.
3. The performance improvement is marginal. In terms of token efficiency and compression, there is no significant gain. Moreover, the results on AIME24 and AIME25 are based on a single run, which introduces considerable randomness and undermines the reliability of the evaluation.
4. The experimental results appear to be sensitive to hyperparameters, yet the paper does not include a joint analysis or visualization (e.g., a heatmap or contour plot) of the effects of key hyperparameters such as $\alpha$ and $\beta$. Such an analysis would strengthen the validity and reproducibility of the findings.
During long chain-of-thought reasoning, the model sometimes exhibits repetitive generation on certain questions, causing it to hit the maximum length without producing a final answer. These cases significantly increase inference length while still failing to solve the problem. I would like to know to what extent the performance gain of the proposed method stems from mitigating such repetitive generation behavior. |
Lightly AI-edited |
|
A State-Transition Framework for Efficient LLM Reasoning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a framework that models the reasoning process of LLMs as a state-transition process, framing it as a sequence of state transitions. Specifically, the paper designs a Mixed Attention Module (MAM), which incorporates the LLM's original attention module alongside a linear attention module, to address the computational and memory efficiency issues associated with CoT reasoning. The proposed framework is evaluated on test sets spanning mathematical, scientific, and code reasoning tasks. The results demonstrate that this framework significantly enhances the efficiency of LLM reasoning.
1. In terms of research motivation, the focus of this paper is highly significant. While CoT enhances LLM performance on complex reasoning tasks, it also incurs substantial computational and memory costs. Current academic approaches to this efficiency problem often employ prompting, supervised fine-tuning (SFT), or reinforcement learning (RL) to compress CoT, which can lead to the loss of critical information. This paper innovatively addresses this issue, aiming to improve LLM reasoning efficiency while minimizing information loss.
2. In the design of the framework, to prevent performance degradation that might arise from directly replacing the LLM's original attention module, the MAM retains it. This design facilitates a division of labor between different attention mechanisms, allowing them to work collaboratively.
3. Regarding the experimental results, the model's performance is comparable to the baselines at shorter CoT lengths, but it significantly surpasses them once the CoT length exceeds 4K. When the CoT length reaches 32K, the model's reasoning speed is over 40% faster than the baselines. These results strongly validate the effectiveness of the proposed MAM framework.
4. Regarding the research outlook, the MAM proposed in this paper significantly enhances the reasoning efficiency of LLMs with CoT. Looking forward, as the complexity of tasks assigned to LLMs increases and longer CoT is required to maintain high performance on complex reasoning, the consumption of computational and memory resources becomes particularly critical. Therefore, the research focus of this paper is innovative and holds substantial practical significance.
1. The framework relies on segmenting long CoT sequences. The paper does not elaborate on the extent to which this segmentation method is applicable to different types of reasoning tasks or whether it generalizes effectively. Furthermore, it states that all reasoning steps in the training set are clustered, but the specific clustering method is not described. It is also unclear to what extent the different thinking patterns effectively correspond to distinct reasoning types. Concerns remain about whether scenarios analogous to "over-fitting" or "under-fitting" of thinking patterns could occur, rendering them ineffective for different CoTs and reasoning tasks and calling the framework's robustness into question.
2. The experiments are primarily concentrated in the mathematical domain, making the experimental scope relatively narrow. Only one dataset was used for testing in the scientific and code reasoning domains. Moreover, since the training data is sourced entirely from the mathematics-focused OpenR1-Math-220K dataset, the framework might perform well in mathematics, but its generalization capability to other domains remains questionable.
3. The explanation of the "state-transition process" lacks depth, making it difficult for the reader to gain a clear and comprehensive understanding of its specific working principles.
4. The experimental design lacks sufficient justification for the chosen parameter values; the rationale behind the specific settings is not adequately explained.
See the Weaknesses above. |
Moderately AI-edited |
|
A State-Transition Framework for Efficient LLM Reasoning |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper integrates full attention with linear attention to construct an efficient state-transition-based inference framework. By applying full attention to the current state and linear attention to historical states, the proposed framework effectively reduces the attention burden in long CoT scenarios.
1. The idea of employing a hybrid attention mechanism to achieve efficient reasoning is innovative.
2. Calibrating the current state based on the global state is convincing, and the experimental results demonstrate strong performance.
1. The method relies on step-level segmentation, which may limit its applicability to more general tasks.
2. The paper lacks certain implementation details, such as the diversity of thinking patterns and the specific configurations used in LoRA training.
1. Could you provide an example to illustrate how diverse thinking pattern samples were constructed?
2. What are the specific LoRA configurations and the size of the state space used in the LA component of the proposed framework? These factors could significantly influence the training cost of the method. |
Moderately AI-edited |
|
Learning without Global Backpropagation via Synergistic Information Distillation |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper presents Synergistic Information Distillation (SID), a backpropagation-free training framework that aims to eliminate update locking and reduce memory overhead while maintaining performance comparable to standard BP. SID reformulates deep network training as a cascade of local cooperative refinements, where each module incrementally refines a probabilistic “belief” over class labels. Each module is trained with a local objective combining a distillation term toward the ground-truth label and a consistency term that regularizes deviation from the previous module’s belief. A stop-gradient operation fully decouples backward dependencies across modules, enabling parallel training and constant memory scaling.
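Written out for concreteness, the per-module objective described above plausibly takes the following form (the mixing weight and the KL directions are my assumptions based on the summary, not a transcription of the paper):

$$
\mathcal{L}_m \;=\; \alpha\, \mathrm{KL}\!\big(p_y \,\big\|\, b_m\big) \;+\; (1-\alpha)\, \mathrm{KL}\!\big(\mathrm{sg}(b_{m-1}) \,\big\|\, b_m\big),
$$

where $b_m$ is module $m$'s belief over class labels, $p_y$ is the (possibly smoothed) label distribution, and $\mathrm{sg}(\cdot)$ is the stop-gradient that severs the backward dependency on earlier modules.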
1. The formulation is clear and mathematically elegant, bridging ideas from knowledge distillation, local learning, and probabilistic inference into a unified framework.
2. The theoretical analysis, while idealized, provides useful intuition for why local updates can lead to monotonic improvement and stable convergence.
3. The paper demonstrates good empirical clarity, with ablation studies, visualization of belief dynamics, and comparisons to recent backprop-free methods.
1. The experiments are restricted to small and medium-scale image classification tasks such as CIFAR and Tiny-ImageNet. There is no evidence that SID scales to large datasets or more complex architectures.
2. The study omits direct experimental comparison with related frameworks like PETRA, SLL, and NoProp under consistent setups. The claimed advantages over other methods remain qualitative rather than quantitative.
3. Although the paper proves monotonic descent under local improvement, it does not test whether this property holds empirically, for example through layer-wise KL measurements or ablation under imperfect optimization.
4. Gains over backpropagation are small on CIFAR-10 and moderate on more complex tasks, which may reflect favorable optimization settings or run-to-run variance rather than a systematic advantage.
1. Could SID be applied to architectures with residual or attention-based connections that violate the Markov-like layer independence assumption?
2. Have the authors verified empirically that the local descent guarantee (Proposition 2) approximately holds during training?
3. How can the belief-based formulation smoothly extend to non-classification tasks such as language modeling or reinforcement learning? |
Fully AI-generated |
|
Learning without Global Backpropagation via Synergistic Information Distillation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors introduce SID, a method for training deep models in a parallel fashion, without resorting to full backpropagation. The method decouples feature extraction, which is performed by a dedicated network, from prediction, which is done by a sequence of blocks, each refining the prediction. Each layer can then be trained locally, with only a local classification loss. Adding a regularizing term that encourages stability of the successive predictions is shown to be useful.
Empirical evaluation on a simple CNN model shows promising results on common image classification tasks, outperforming backpropagation while being more parallelizable and consuming less memory.
- The paper is well written and easy to follow.
- The method is well motivated, with SID appearing as a natural idea for parallelizing training of deep models.
- Theoretical insights motivate the choices for the local losses and the soundness of SID.
- SID shows promising results on simple image classification tasks. It outperforms standard backpropagation on all experiments, as well as other similar local training algorithms such as NoProp and HSIC.
- In Proposition 3, the stated time and memory complexities omit the terms for the feature extractor, which may be large.
I feel the authors are not fully transparent about the impact of the feature extractor, which is quite large and is trained with sequential backpropagation.
- It is not explained in the paper how SID is applied to the VGG-11, ResNet-18, and ViT-Tiny architectures. Are these models used as the feature extractor only? Otherwise I do not see how SID could be applied to a ResNet. If that is the case, these experiments would be neither very relevant nor convincing, considering that the feature extractor would cost much more time and memory, rendering SID's gains negligible.
- The scale and diversity of the experiments are limited, which does not convince enough that this method would scale well to larger models and/or other tasks. For instance additional results on language modeling with transformers would help, although I believe the method is not directly applicable as is since language modeling is not classification.
- Have you tried changing alpha at each layer? In theory the first layer can completely discard the previous prediction (alpha = 1), while the last layer is given a very good prediction already and should not modify it too much (alpha small). |
Fully human-written |
|
Learning without Global Backpropagation via Synergistic Information Distillation |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes Synergistic Information Distillation (SID), a layerwise training scheme using a local KL objective that interpolates between the previous layer’s belief and the label distribution. The theoretical analysis (closed-form minimizer, monotonic descent bound) is sound under mild additional assumptions, and results on small datasets (CIFAR-10/100, Tiny-ImageNet) are reported. Reported speedups are purely theoretical and omit communication costs. While the formulation is neat and clearly written, the contribution remains incremental and insufficiently supported by strong experiments or realistic scaling tests.
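To make the "closed-form minimizer" concrete: if the local objective is a convex combination of forward KL terms with the module's belief $q$ in the second argument (an assumption consistent with the summary), a one-line Lagrangian argument gives a mixture as the minimizer,

$$
\min_{q \in \Delta}\; \alpha\, \mathrm{KL}(p_y \,\|\, q) + (1-\alpha)\, \mathrm{KL}(b_{\mathrm{prev}} \,\|\, q)
\;\;\Longrightarrow\;\;
q^{\star} \;=\; \alpha\, p_y + (1-\alpha)\, b_{\mathrm{prev}},
$$

which is the sense in which each module's target interpolates between the label distribution and the previous layer's belief (see also $W_5$ below on full-support targets).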
The theoretical insights of this paper are its strengths:
- $S_1$: The theoretical analysis is internally consistent and transparent (closed-form minimizer, telescoping KL bound).
- $S_2$: The exposition is clear with accessible equations and thorough appendix with sound proofs.
- $S_3$: I found that the “belief cascade” interpretation offers a fresh view of layerwise decoupling and is quite original.
The main weaknesses of the paper lie in the experimental evaluation: not only are the baselines limited to shallow networks and small-scale datasets, but the reported results for backpropagation (BP) are also underperforming.
- $W_1$: Baseline BP accuracies are significantly lower than standard reproducible results (e.g., 91.12\% on CIFAR-10 with ResNet-18 in Table 3, where 94-96\% is expected). One would expect SGD+momentum as the optimizer for BP rather than Adam. In my opinion, this biases the reported comparison toward the proposed method and undermines the claimed gains.
- $W_2$: Only small datasets and shallow architectures; no scaling or real parallel experiments.
- $W_3$: All speedups are theoretical; no multi-GPU timing or memory profiling is provided. This is a minor weakness, as I would not expect a dedicated CUDA kernel to be developed for an exploratory paper, but such measurements would still be nice to have.
- $W_4$: The method’s advantages seem modest and context-dependent, but are presented as generally superior.
- $W_5$: The developed theory assumes full-support (smoothed) targets, while the main text implies one-hot; this should be aligned explicitly.
- $Q_1$: Can you re-train BP with standard strong recipes (SGD + momentum + WD, longer schedule) to provide realistic baselines for ResNet-18, VGG-11, and ViT-Tiny?
- $Q_2$: Do the claimed gains hold when BP is trained properly?
- $Q_3$: Can you please clarify explicitly in the main text that label smoothing (full-support $p_y$) is assumed in theory and used in practice?
- $Q_4$: Could you compare SID vs BP under identical strong settings to quantify the true gap? |
Moderately AI-edited |
|
Learning without Global Backpropagation via Synergistic Information Distillation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes Synergistic Information Distillation (SID), a two-phase training scheme that replaces a single global loss with per-module local objectives. In Phase-1, the model runs a gradient-free forward pass to cache “teacher beliefs” (module-wise class-probability vectors); in Phase-2, each module updates in parallel by minimizing a weighted sum of KL divergences to the one-hot label and to the (stop-gradient) belief from the previous module, while all modules also send gradients into a shared feature extractor. The authors argue this decoupling eliminates update locking and reduces memory, prove depth-wise monotonic descent under idealized conditions, and report competitive or better accuracy than backpropagation (BP) on CIFAR-10/100 and Tiny-ImageNet, plus improved robustness under symmetric label noise.
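A minimal sketch of the two-phase update as I understand it from the above (the module interface, the label smoothing, the loss weighting, and the single combined backward pass are my assumptions; the paper's actual implementation may differ):

```python
import torch
import torch.nn.functional as F

def sid_step(extractor, modules, optimizer, x, y, alpha=0.5, num_classes=10):
    feats = extractor(x)                              # shared feature extractor
    p_y = F.one_hot(y, num_classes).float()
    p_y = 0.9 * p_y + 0.1 / num_classes               # label smoothing (full support)

    # Phase 1: gradient-free forward pass caching per-module "teacher beliefs".
    with torch.no_grad():
        beliefs = [F.softmax(m(feats), dim=-1) for m in modules]

    # Phase 2: each module minimizes a local loss; the cached beliefs are
    # detached, so no gradient flows between modules (stop-gradient).
    loss = 0.0
    prev = torch.full_like(p_y, 1.0 / num_classes)    # belief fed to module 0
    for m, cached in zip(modules, beliefs):
        log_b = F.log_softmax(m(feats), dim=-1)
        loss = loss + alpha * F.kl_div(log_b, p_y, reduction="batchmean") \
                    + (1 - alpha) * F.kl_div(log_b, prev, reduction="batchmean")
        prev = cached                                 # teacher for the next module

    optimizer.zero_grad()
    loss.backward()       # every module and the shared extractor receive gradients
    optimizer.step()
    return loss.item()
```

In this sketch the per-module losses are independent given the cached Phase-1 beliefs, so they could in principle be optimized in parallel; summing them into a single backward call is only a simplification.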
1. SID matches BP on CIFAR-10 and outperforms BP on CIFAR-100 and Tiny-ImageNet. It strongly outperforms several backprop-free or local baselines (FA, FF, HSIC, NoProp) in the reported setting.
2. Framing training as a cascade of belief refinements is intuitive and connects to distillation and deep supervision while explicitly enforcing consistency across depth
1. All experiments are conducted on small datasets, not large-scale ones. The reviewer is curious about the performance on large-scale datasets such as ImageNet.
2. There are several BP-free learning methods that achieve high performance. The reviewer is curious about the comparison between SID and these methods[1-3].
3. SID’s design overlaps conceptually with decoupling via synthetic gradients/DNI (removing update locking) [4], forward-only learning [5] and the HSIC bottleneck [6], and deep supervision/self-distillation; the paper should more directly delineate what is new relative to these lines and run stronger head-to-head comparisons against their best configurations.
[1] Kappel, David, Khaleelulla Khan Nazeer, Cabrel Teguemne Fokam, Christian Mayr, and Anand Subramoney. "A variational framework for local learning with probabilistic latent representations." In 5th Workshop on practical ML for limited/low resource settings.
[2] Zhang, Aozhong, Zi Yang, Naigang Wang, Yingyong Qi, Jack Xin, Xin Li, and Penghang Yin. "Comq: A backpropagation-free algorithm for post-training quantization." IEEE Access (2025).
[3] Cheng, Anzhe, Heng Ping, Zhenkun Wang, Xiongye Xiao, Chenzhong Yin, Shahin Nazarian, Mingxi Cheng, and Paul Bogdan. "Unlocking deep learning: A bp-free approach for parallel block-wise training of neural networks." In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4235-4239. IEEE, 2024.
[4] Jaderberg, Max, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, and Koray Kavukcuoglu. "Decoupled neural interfaces using synthetic gradients." In International conference on machine learning, pp. 1627-1635. PMLR, 2017.
[5] Hinton, Geoffrey. "The forward-forward algorithm: Some preliminary investigations." arXiv preprint arXiv:2212.13345 2, no. 3 (2022): 5.
[6] Ma, Wan-Duo Kurt, J. P. Lewis, and W. Bastiaan Kleijn. "The HSIC bottleneck: Deep learning without back-propagation." In Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 04, pp. 5085-5092. 2020.
Please see weakness above. |
Lightly AI-edited |
|
SADA: Safe and Adaptive Inference with Multiple Black-Box Predictions |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes an augmented M-estimation framework that leverages multiple auxiliary signals ("black-box predictions", possibly by pretrained general-purpose language models) to reduce asymptotic variance relative to using labeled data alone, and shows adaptivity under idealized conditions.
Technically, the core idea of adding mean-zero augmentation terms built from auxiliary signals to an unbiased estimator and choosing weights to minimize asymptotic variance is classical, with prediction-powered inference (PPI) and its variants as the closest modern instantiation.
The main value appears to be packaging these tools for multi-surrogate, vector-parameter settings and clarifying efficiency guarantees.
The proposed method was evaluated on synthetic data and natural language data using LLMs.
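For the simplest case (mean estimation with $K$ black-box predictors $f_1,\dots,f_K$, $n$ labeled and $N$ unlabeled samples), the construction the summary refers to presumably looks like the generic PPI-style augmentation below; this is written from the classical recipe, not copied from the paper:

$$
\hat\theta(\omega) \;=\; \frac{1}{n}\sum_{i=1}^{n} Y_i \;+\; \sum_{k=1}^{K} \omega_k \Bigg( \frac{1}{N}\sum_{j=1}^{N} f_k\big(X_j^{\mathrm{u}}\big) \;-\; \frac{1}{n}\sum_{i=1}^{n} f_k\big(X_i\big) \Bigg),
$$

where each bracketed augmentation term has mean (approximately) zero for any $\omega$, so consistency holds regardless of prediction quality; $\omega$ is then chosen to minimize the asymptotic variance, and $\omega = 0$ recovers the labels-only estimator, which is what the efficiency-dominance ("safe") guarantee reduces to.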
- This paper proposes a general framework with a clear recipe for augmenting M-estimators with multiple auxiliary signals, subsuming PPI as a special case.
- The authors establish asymptotic variance/MSE dominance over the labeled-only estimator.
- It is proven that the proposed method achieves oracle efficiency if one auxiliary is ideal.
- The projection view makes the weighting understandable and implementable.
## Contribution
The authors state that
> First, we consider a semi-supervised setting where multiple sets of predicted labels are available, without making any assumptions about their quality or requiring prior knowledge of which predictions are more accurate. The predicted labels are also not needed to share the same scale or format, either with each other or with the true labels. This bridges the gap between advanced machine learning tools and principled methods for leveraging them to improve the inference results.
However, this statement is true for many other existing papers.
Conceptual novelty is unclear relative to classical augmented estimating equations or GMM weighting.
The authors should precisely articulate what is new (e.g., theory for multi-surrogate vector targets beyond existing results, sharper bounds?)
## Problem framing
> Given multiple predicted labels with unknown quality, how can we aggregate them in a safe and data-adaptive manner to improve estimation and inference?
The problem framing over-emphasizes "black-box predictions" and ignores the equivalence to multi-annotator or measurement-error settings; the difference from **crowdsourcing/ensembling** is not clarified.
Framing the auxiliaries as "black-box predictions" (and hinting they may be from LLMs) rather than "noisy annotators/surrogates" is a marketing choice, not a mathematical distinction.
Please clarify differences from crowdsourced label aggregation and model ensembling.
## "Safe"
> The proposed method is guaranteed to perform **no worse than** (?) the naive estimator (using the labeled data alone) in terms of mean squared error, regardless of the choice of machine learning models or their prediction accuracy.
The term "safe" is overloaded and used here to mean **asymptotic efficiency dominance** rather than robustness.
It seems several existing papers also use the term "safe" this way (for instance https://arxiv.org/abs/2011.14185, a PPI-related work cited by this paper, and https://arxiv.org/abs/2502.17741), but in my opinion it is very confusing.
The paper should explain it with more precise language (e.g., "asymptotically at least as efficient as the labels-only estimator," or "PSD variance dominance").
## Writing
- "GPT, Llama, DeepSeek" should have references.
- The introduction makes broad claims ("bridges the gap between advanced machine learning tools and principled methods") without precise, falsifiable statements.
## Future work
> Our method can be extended to the situations under distribution shift. In those settings, developing methods that are robust to distribution shift is essential for enhancing the reliability and practical effectiveness of semi-supervised learning.
This is very generic, and I don't see a connection between the current work (safety?) and the distribution shift literature.
Either drop the generic sentence from the conclusion or make it more concrete (choose a shift model, transport the estimating equations with density-ratio weights, suggest a variance-dominance result, etc.).
- The meaning of "gold-standard experiments" is unclear in the machine learning context.
- "Outputs from different models–such as GPT, Llama, or DeepSeek–often differ, sometimes substantial; and the quality of predictions from black-box models can be highly variable." Please provide empirical evidence or references.
- "In particular, low-quality or poorly calibrated predictions can introduce significant noise, increasing variance and leading to unreliable inference." How siginificant? How unreliable?
- What does "perfectly accurate" mean in this context? |
Lightly AI-edited |
|
SADA: Safe and Adaptive Inference with Multiple Black-Box Predictions |
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper looks to leverage the availability of unlabelled data in order to improve statistical inference. Previous works have demonstrated that producing quality synthetic labels for these unlabelled data has become even easier with the proliferation of Large Language Models (LLMs). However, previous works leveraging unlabelled data and synthetic labels make use of only a single predictor. This work uses the predictions from several different models simultaneously to achieve strong variance reduction while maintaining consistency. Asymptotic guarantees are provided, as well as experiments on real and synthetic data to demonstrate the method’s efficacy.
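To ground the comparison with PPI++ discussed below, here is a minimal numerical sketch of the multi-prediction mean estimator as I understand it, with the weights obtained by OLS of the labels on the centered predictions (ignoring the finite-unlabeled-sample correction); the function name and details are my assumptions rather than the paper's code:

```python
import numpy as np

def sada_mean(y_lab, preds_lab, preds_unlab):
    """y_lab: (n,) labels; preds_lab: (n, K) and preds_unlab: (N, K) predictions."""
    # Weights: OLS of Y on the centered predictions, using labeled data only.
    Z = preds_lab - preds_lab.mean(axis=0)
    omega, *_ = np.linalg.lstsq(Z, y_lab - y_lab.mean(), rcond=None)
    # Augmentation: each correction term has mean ~0, so the estimator stays
    # (asymptotically) unbiased for E[Y] no matter how poor the predictions are.
    correction = preds_unlab.mean(axis=0) - preds_lab.mean(axis=0)
    return y_lab.mean() + correction @ omega

# Toy example: one informative prediction, one pure-noise prediction.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = x + rng.normal(scale=0.3, size=5000)
good, junk = x, rng.normal(size=5000)            # two "black-box" label predictions
n = 200                                          # small labeled subset
est = sada_mean(y[:n], np.c_[good[:n], junk[:n]], np.c_[good[n:], junk[n:]])
print(est, y[:n].mean())                         # augmented vs. labels-only estimate
```

In this toy setup the OLS weights concentrate on the informative predictor and largely ignore the noise predictor, which matches the adaptivity behavior described in the strengths below.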
- The problem being addressed is interesting and highly relevant to researchers and practitioners alike, both of which would be interested in better inference methods.
- The mathematical set up is well explained, and the annotations on equations help with building intuition.
- Several relevant baselines are compared with both theoretically and empirically, with the theoretical benefits of the new method being especially well demonstrated.
- In the mean estimation case (as well as others), using the OLS estimator equivalent to equation (4) is an inspired way of ensuring that a perfectly predictive set of synthetic labels will receive a weight of 1, while other predictions will remain unweighted.
- The idea of leveraging multiple sets of predictions for the same inference problem is an interesting take on the typical PPI set up.
- [W1] The main weakness is that the method does not seem particularly different from PPI, PPI++, and stratified PPI. In the K=1 case, the SADA estimator uses the same formula for the optimal weighting of the prediction terms as PPI++. The idea of fitting multiple coefficients for multiple pools of data was explored in the stratified PPI paper, although there the coefficients are not fit simultaneously and the data pools are not of the same size. Likewise, the guarantee of never performing worse than the naive estimator is already provided by PPI++, as stated in this paper.
- [W2] Despite requiring several more sets of predictions than PPI++, SADA does not seem to be able to exceed the performance of PPI++ on the most correlated set of predictions. This is on display in both Figure 2 and Figure 3. See [Q1]. I would have anticipated that by leveraging multiple pools of predictions simultaneously, we could produce performance better than the best PPI++ estimator.
- [W3] (Minor) There are typos in the work (line 28, or requires -> or require) (line 36, sometimes substantial -> sometimes substantially)
- [Q1] It seems as if this method is equivalent to using PPI++ with the most correlated set of data. How would this compare with first testing which set of predictions is most correlated using the labelled dataset, and then running PPI++ with that most correlated set? Estimating this correlation is already part of the process of estimating the correlation coefficient, which is $\lambda$ in the PPI literature and $\omega$ in this paper. |
Fully human-written |