|
Property-Driven Protein Inverse Folding with Multi-Objective Preference Alignment |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper applies multi-objective preference optimization to protein inverse folding, using semi-online DPO with adaptive margins to balance structural accuracy against properties like solubility and thermostability. The resulting model, MoMPNN, beats existing baselines across several benchmarks. The approach is solid but not particularly novel, essentially transplanting techniques from LLM alignment into protein design. That said, the execution is strong: the comprehensive evaluation is thorough, the amino acid distribution analysis shows the model learns sensible patterns, and the framework appears general enough to extend to other properties.
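For readers unfamiliar with the alignment machinery being transplanted, a minimal sketch of a margin-augmented DPO objective is shown below. This is illustrative only: the constant margin and the toy inputs are my assumptions, not the paper's exact formulation (which uses adaptive, per-pair margins in a semi-online setting).

```python
import torch
import torch.nn.functional as F

def dpo_loss_with_margin(policy_chosen_logps, policy_rejected_logps,
                         ref_chosen_logps, ref_rejected_logps,
                         beta=0.1, margin=0.0):
    """Margin-augmented DPO loss. `margin` shifts the decision boundary;
    an adaptive variant could set it per pair, e.g. proportional to the
    gap in the property being optimized (solubility, thermostability)."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratios - ref_logratios) - margin
    return -F.logsigmoid(logits).mean()

# toy usage: random per-sequence log-probabilities for 4 preference pairs
lp = lambda: torch.randn(4)
loss = dpo_loss_with_margin(lp(), lp(), lp(), lp(), beta=0.1, margin=0.5)
```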
See summary
See summary
No questions. |
Moderately AI-edited |
|
Semantic Robustness of Deep Neural Networks in Ophthalmology: A Case Study with Colour Fundus Imaging |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 0:
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes a framework for evaluating the semantic robustness of deep neural networks in color fundus photography. The authors focus on three types of distortions, namely geometric transformations, illumination changes, and motion blur. The paper employs a gradient-free optimization method based on the DIRECT algorithm.
This paper should be rejected because there are significant shortcomings including the following points:
1- The proposed algorithm (DIRECT-LSR) is a simplistic modification to the original DIRECT algorithm.
2- The paper does not provide any justification for using DIRECT-LSR to evaluate robustness. In fact, the authors do not formally define what robustness means in the context of this paper.
3- The paper claims to be able to evaluate the robustness of neural networks to semantic perturbations and changes. However, the experiments simply use geometric transformations, illumination changes, and motion blur as examples of semantic perturbations. These do not in fact change the semantic content of the images, and are thus practically irrelevant. Again, this stems from the fact that the paper does not clearly and formally define robustness and semantic sensitivity of neural networks.
4- The paper does not provide any comparison with other state-of-the-art algorithms designed for assessing robustness.
5- The evaluation against human expert assessments, and the protocols used for this evaluation, are not clearly discussed.
6- This paper is an application paper, focused on a narrow domain (retina fundus images). Its applicability beyond this domain is questionable.
The paper attempts to evaluate the robustness of neural networks to semantic perturbations.
The paper suffers from many weaknesses with major shortcomings listed below:
1) Lack of novelty. The paper simply applies a least-squares regression to the DIRECT algorithm.
2) Lack of motivation behind the use of this optimization algorithm. Why not evaluate with a gradient-based approach?
3) Lack of comparative evaluations against the state-of-the-art.
4) Lack of domain generalizability. It is unclear how this method can be applied beyond fundus images.
5) Insufficient experimentation. The paper uses only three types of image manipulation. Would this method work for adversarial attacks, or for actual semantic perturbations such as content manipulation? The paper also evaluates only six neural networks for robustness. To support the claim that the proposed method can evaluate neural network robustness, the authors should design a more comprehensive framework for evaluating the robustness of major neural networks.
None. |
Fully human-written |
|
Semantic Robustness of Deep Neural Networks in Ophthalmology: A Case Study with Colour Fundus Imaging |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents a thorough and well-motivated framework for evaluating the semantic robustness of deep neural networks (DNNs) in retinal image diagnosis. Rather than relying solely on pixel-level adversarial perturbations, the authors propose a semantic perturbation framework that operates at the level of medically meaningful transformations (e.g., illumination, lesion intensity, vessel sharpness). The goal is to assess whether retinal disease classifiers remain stable under clinically relevant but semantically preserving changes. This is a timely and practically significant contribution to the intersection of adversarial robustness, explainable AI, and medical imaging, with strong potential implications for clinical reliability and model validation.
1. The work addresses a real and pressing gap: most medical AI robustness studies focus on pixel-level noise or domain shifts, ignoring semantic perturbations aligned with clinical reasoning. The proposed semantic perturbations are interpretable, controllable, and grounded in medical semantics, which enhances both transparency and clinical adoption potential.
2. The framework systematically defines semantic dimensions and corresponding transformation operators. The perturbations are implemented in a way that maintains plausible clinical realism, avoiding synthetic artifacts.
3. Evaluation across multiple datasets and models demonstrates the framework’s generality. The correlation analysis between semantic robustness and adversarial robustness is particularly insightful—it shows they are distinct but complementary.
1. The framework is empirical and descriptive; it lacks formal definitions or theoretical guarantees of semantic robustness (e.g., invariance under a semantic transformation group). A more formal link to robustness theory (e.g., Lipschitz continuity under semantic metrics) would strengthen its academic rigor.
2. The chosen perturbations, while clinically plausible, are manually curated and limited to a few dimensions. There is no discussion on how to generalize or learn semantic perturbations automatically (e.g., via generative models or disentangled representations).
3. Experiments focus on binary diabetic retinopathy grading. It remains unclear how well the framework generalizes to multi-class or multi-label medical tasks (e.g., glaucoma, AMD). The lack of external validation on unseen imaging modalities (e.g., OCT) limits generalizability.
1. How do you define the boundary between semantic and non-semantic perturbations, especially when pixel-level changes may indirectly alter semantic meaning?
2. Could your semantic perturbation framework be adapted for unsupervised discovery of semantic factors using disentangled or generative representations?
3. Do you have any insights into why ViT architectures (if included) appear more semantically stable than CNNs, or vice versa? |
Fully AI-generated |
|
Semantic Robustness of Deep Neural Networks in Ophthalmology: A Case Study with Colour Fundus Imaging |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper investigates the robustness of retinal imaging models to "semantic" corruptions such as geometric distortions, illumination changes, and motion blur (in contrast to adversarial corruptions). These corruptions are common in datasets of retinal images due to practical issues in data collection.
The authors parameterize the corruptions and present an algorithm (DIRECT-LSR) which can find distortion parameter values that are provably close to optimal, in the sense that the corruptions maximally affect classifier performance. Indeed, the found corruptions are shown to degrade performance more strongly than corruptions found by PGD-like gradient-based methods. The authors then show that augmenting the training data with such corruptions improves model robustness with respect to these semantic corruptions.
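To make the search problem concrete, here is a minimal sketch of the kind of parameterized corruption and black-box objective that DIRECT-LSR, Bayesian optimization, or random search would operate on. The parameter set and the use of Gaussian blur in place of motion blur are my simplifications, not the authors' implementation.

```python
import torch
import torchvision.transforms.functional as TF

def apply_corruption(img, angle_deg, brightness, blur_sigma):
    """Compose the three corruption families discussed in the paper:
    geometric (rotation), illumination, and blur. `img` is a [1, 3, H, W]
    tensor; Gaussian blur stands in for motion blur here."""
    out = TF.rotate(img, angle_deg)
    out = TF.adjust_brightness(out, brightness)
    if blur_sigma > 0:
        k = int(2 * round(3 * blur_sigma) + 1)  # odd kernel size ~ 3 sigma
        out = TF.gaussian_blur(out, kernel_size=k, sigma=blur_sigma)
    return out

@torch.no_grad()
def objective(model, img, label, params):
    """Black-box objective over corruption parameters: the model's
    confidence in the true class, which the optimizer tries to minimize."""
    angle, brightness, sigma = params
    logits = model(apply_corruption(img, angle, brightness, sigma))
    return torch.softmax(logits, dim=-1)[0, label].item()
```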
- The investigated problem is practically relevant, for the applied field of retinal imaging.
- The parametrization of the corruptions seems reasonable and is explained clearly.
- The proposed algorithm, DIRECT-LSR, could be useful in other contexts as well.
- The investigation covers multiple models and datasets.
- The scope of the paper is very narrow. I see the potential value of this analysis to the medical field, specifically the subfield working on retina imaging, but for ICLR, this somewhat lacks generality. One would have to pitch the DIRECT-LSR algorithm as a general method, and I could see myself accepting such a paper, but this is not what the paper does.
- The presentation of the paper should be improved:
- Figure 3 is a bit weird because the lines all overlap, so most of the colors never even show up - probably, a logarithmic y-axis would have been advisable.
- Figure 4 is poorly designed, I'd recommend to focus on one dataset, then put the rest in the appendix as separate figures.
- Table 2 is hard to read, it includes a few irrelevant models (as in, models that never perform best), and no confidence intervals or any notion of uncertainty. I would make a selection of models, then have the full table in the appendix. Likewise, I would remove the $L_{min} > 0$ rows and present them in an extra appendix table.
- The related work section is not very convincing, I would have liked to see works more closely connected to robustness of retinal imaging models rather than e.g. an arbitrary autonomous driving paper.
- It seems like train / test splits were not conducted properly. If this is indeed the case, it would be a weakness.
- The four datasets are not sufficiently characterized. How large are they? What is the state-of-the-art classification performance on the datasets?
- The evaluation of human experts was not done perfectly. Ideally, human experts should have been asked to perform a classification task "blindly", in a true nAFC fashion, i.e. without access to the true labels, rather than just stating whether they think crucial features for the true class are visible or not. This would put model performance in perspective much better.
- The limits of all corruption parameters are neither justified nor visualized, making it hard to assess whether they are reasonable (e.g. there are levels of motion blur that would be absurd to expect in retina images).
- The analysis is a bit thin, I would have liked to see e.g. whether the different optimization procedures (DIRECT-LSR, Bayes, random search) find very different optima, whether your attacks lead models to favor a certain class, etc.
- I realize this is a somewhat unfair criticism, but I find it hard to believe that random search outperforms PGD-style gradient-based optimization for the task of finding optimal corruptions. I must imagine that this is indicative of sub-par implementation or poor hyperparameters. Can the authors comment on this? (Maybe the random search effectively covers the entire parameter space?)
- Training the models in table 2 with exactly the same setup does not seem very principled to me. Different training setups might be optimal for different models, so I would find it more convincing if all models were trained in a way that yields maximum performance (up to the limit of the respective architecture and model size). Another reasonable decision would have been to compute-match models. I would have also liked to see reference performance values on these tasks from literature.
- Again a somewhat unfair point, but the paper lacks certain shibboleths that one would expect from groups that work in this field. For example, the phrase "out-of-distribution robustness" does not appear even once, although this is exactly what this paper is about at its core. I have tried to down-weight this point in my assessment, but it does not instill confidence.
Overall, I feel conflicted about this paper. In principle, I see the value of the idea of finding provably difficult points in bounded parameter spaces for parametric corruptions, and I can imagine that evaluating robustness to such corruptions is relevant in the retinal imaging field. The biggest argument in favor of the paper is that the proposed algorithm could be useful in other settings. But the paper itself suffers from poor presentation, and there are some "smells" (such as the train-test-split issue, model comparison on equal footing, bad PGD performance, etc). The paper also has a very narrow topic scope and many questions remain open. Based on the related work, I am also lacking perspective on e.g. what models are considered gold standard for these tasks, and how robustness has been evaluated in the literature. I'm submitting an initial rating of 2 for now, but I am curious about other reviewers' thoughts and generally willing to increase this score, provided that presentation is improved and the paper is championed by someone else.
- I would find the paper more valuable if the idea was to implement a benchmark, where people submit retinal imaging models, and maximally strong "semantic adversarials" are created specifically for these models, eventually leading to a public leaderboard of the most robust retinal imaging models. Do you plan on implementing this?
- What was the train- / test-split of the datasets? It reads in line 440 as if the dataset was not split, and validation was done on a random subset of the training set.
- What is the Schwefel function? This should probably be explained in at least one sentence. |
Fully human-written |
|
When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper presents MORALSIM, a new framework for evaluating the moral behavior of LLM agents in situations where ethical norms conflict with personal interests and benefits. The authors investigate the behavior of 9 modern language models (including Claude-3.7-Sonnet, GPT-4o, Deepseek-R1, o3-mini, and others) in two classic game scenarios: Prisoner's Dilemma and the Public Goods Game. Each game is embedded in three different moral contexts: Contractual Reporting, Privacy Protection, and Green Production. A full factorial experimental design is used, varying the type of game, the moral context, the opponent's behavior (always cooperating / always betraying), and the risk of survival. The key result: none of the tested models demonstrates consistent moral behavior in all scenarios. The proportion of morally oriented actions varies from 7.9% (Qwen-3) to 76.3% (GPT-4o-mini). The authors use causal analysis (Average Treatment Effects) to determine the factors influencing moral decisions, and show that the structure of the game, the specific moral context, and the opponent's behavior have the greatest impact.
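For readers less familiar with the causal analysis: under a full factorial design with randomized assignment, the ATE of a binary manipulation reduces to a difference in means between the two arms. A toy sketch (with invented data, not the paper's results) is below.

```python
import pandas as pd

# invented factorial results: one row per simulated game
df = pd.DataFrame({
    "survival_risk":    [0, 0, 1, 1, 0, 1, 0, 1],
    "opponent_defects": [0, 1, 0, 1, 0, 0, 1, 1],
    "moral_action":     [1, 1, 1, 0, 1, 1, 0, 0],
})

def ate(df, treatment, outcome="moral_action"):
    """Difference in mean outcome between treated and control arms."""
    return (df.loc[df[treatment] == 1, outcome].mean()
            - df.loc[df[treatment] == 0, outcome].mean())

print(ate(df, "survival_risk"), ate(df, "opponent_defects"))
```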
- A full-factor design with clear manipulations (type of game, moral context, survival risk, opponent behavior) allows you to isolate the effects of each factor
- Quality of analysis. Analysis of ~3,500 agent reflections reveals the decision-making mechanisms. Causal assessment through ATEs yields quantitative effects with confidence intervals. It is shown that payoff-maximizing models (Deepseek-R1, Qwen) rely primarily on profit maximization, while more cooperative ones (Claude, GPT-4o) more often take moral considerations into account.
- All models show a decrease in moral behavior precisely when the user is most vulnerable (at risk of bankruptcy), which raises important AI-safety concerns.
- The code is open, the prompts are documented in detail.
1. PD and PGG only. Other structures (for example, Trust Game, Stag Hunt) could reveal other patterns of moral behavior. An extension to asymmetric games would be especially valuable.
2. In the real world, agents can often negotiate, which significantly changes the dynamics of cooperation. The authors acknowledge this, but do not investigate it.
3. For some models (Claude-3.7-Sonnet, Gemini-2.5-Flash), versions without reasoning mode were used for cost reasons. Given that the analysis has shown the importance of reasoning, this limits the conclusions.
4. Multi-agent scenarios (N>2) could better reflect social dynamics and collective responsibility.
1. You have shown invariance to paraphrases, but how sensitive are the results to more fundamental changes in the presentation of the problem? For example, what if we present the same dilemmas through different metaphors or change the order in which options are presented?
2. Does the moral behavior of agents change as they gain experience in repetitive games? Are there signs of "moral learning" or adaptation of strategies over time?
3. Moral norms vary between cultures. Do you plan to investigate how different cultural contexts affect the moral behavior of LLM agents? |
Fully AI-generated |
|
When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas |
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This work studies how LLMs behave when moral obligations conflict with incentives in social dilemmas. The authors present a simulation framework called MoralSim that essentially is a collection of situations that reflect social dilemmas expressed in natural language. The authors evaluate several LLMs considering Prisoner's Dilemma and Public Good games. The findings show that models vary widely in their tendency to act morally and lack consistent ethical behavior across situations, suggesting that LLMs may be unreliable for decision-making roles requiring alignment with ethical norms.
**General note**: I reviewed this manuscript for a previous conference. I would like to note that I went through the paper carefully, noting the main differences. The core contributions are essentially the same apart from the causality analysis; hence, I find myself in a position to repeat my critiques. Please note that the list of weaknesses below takes into consideration the revised version of the paper with respect to my initial review. I would also like to note that my comment about the naming of the “morality score” was addressed, and in this version, the authors talk about “cooperation score”. I noted the following changes with respect to the version I reviewed below (these are the main ones I was able to clearly identify):
- The morality score has been renamed cooperation score.
- A section about causal effect estimation has been added (plus a discussion of the treatment effects later in the paper).
- The discussion of the results regarding Q3 and Q4 has been extended.
- There is a new section about reasoning trace analysis in the appendix.
These points/observations did not affect my review. I treated this paper completely *as new*. I believe that these are independent conferences and we should consider the manuscripts in their own right each time. I added this note explicitly in order to avoid potential criticisms about the fact I raised again some of the concerns of my previous review (since those parts are unchanged).
- This is indeed a very interesting topic.
- The paper is well written and very easy to follow. The evaluation of the work is carefully conducted.
- This is a well-written paper, but it is difficult to identify a substantial contribution with respect to the state of the art. In fact, the papers listed under “LLMs in Game Theory Settings” focus directly or indirectly on decision-making problems in social dilemmas. In fact, the authors do not really consider actual moral frameworks in their analysis as these works do. The authors claim that various prior research has already explored LLMs’ moral reasoning and strategic behaviour separately. The authors of [Tennant et al., 2024] (actually published and presented at ICLR 2025) essentially not only present a variety of games (including the Iterated Prisoner's Dilemma) but also study how to run a fine-tuning procedure to examine the effects of moral decision-making.
- The authors present very limited analysis in terms of sensitivity to variations of the prompts, including the agent setup that is discussed in Section 3.3 (e.g., descriptions of the setting, personal memory, and current task). These might have substantial effects and should be discussed by the authors in my opinion.
- It is quite surprising that the authors considered moral dilemmas, but the actual dynamics of the responses are only partially analysed/considered by the authors. In fact, the authors mainly focus on the choice of the single agents. The authors also consider the opponent alignment, but this is quite confusing since it appears to the reviewer quite orthogonal to the problem of acting morally. The authors consider the relative payoff, but considering the fact that these are classic (repeated) games, the analysis of the actual cumulative payoff would have probably been more informative from a game theory point of view.
- The survival rate score is interesting, but it appears rather disjointed considering the core topic of the paper. With respect to the statistical validity of the simulations, it is unclear how different repetitions of the games have been implemented.
- The paper essentially lacks a related work section. The authors moved it to the Appendix, but that is outside the 10 pages of the main body of the paper. This seems to me a way of going beyond the 10-page limit, which is not fair towards the other ICLR authors in my opinion.
I do not have specific questions for the authors. My core concern is about the very limited contribution of this work with respect to the state of the art. |
Fully human-written |
|
When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors introduce MoralSim, a benchmark that focuses on studying LLM behaviors in scenarios where they are explicitly forced to trade off between moral behavior and rewards. The authors encase these scenarios in realistic settings to provide greater fidelity to underlying LLM behavior and analyze several behavioral questions, finding, for example, that opponent behavior and the moral context meaningfully steer LLM actions.
- The paper is well-written and the experiments are well-designed.
- The settings developed by the paper more clearly expose trade-offs in moral behavior compared to prior work.
- There are clear behavioral takeaways from the paper, e.g., that opponent behavior meaningfully steers LLM actions or that moral context improves the morality of LLM behaviors.
While the games offer a step towards realism, they still do not capture the full nuances of reality. In particular:
- All the games are two-player. Many real-world settings involve multiple players with different levels of power and interlocking incentives.
- The text of the games themselves is still somewhat unrealistic. There is an explicit payoff structure described in the system prompt (e.g., Figure 2), whereas in the real world an LLM agent would need to uncover those trade-offs themselves.
However, I realize that these nuances are quite tricky to incorporate and would make some of the analysis less clean, so I don't think that should hold the paper back.
There is no discussion or analysis of whether or not the LLMs participating in the games recognize that they are in the game. Recent research into scheming (https://www.antischeming.ai/) suggests that frontier LLMs may recognize that they are being evaluated, which could question the validity of the research results. However, it is difficult to assess scheming without access to the full reasoning trace, so I also don't count this as a strong weakness.
- Would it be possible at all to analyze whether the LLMs recognize that they are playing a game and if that would alter their behavior in any sense? Would it be possible to attempt to induce "evaluation-awareness" into the LLMs (e.g., by being more suggestive with the wording that the environment is a game) and see how that affects the results? |
Fully human-written |
|
When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper investigates the behavior of large language models (LLMs) as agents in social dilemmas that include moral constraints. Specifically, it explores how these models make decisions when moral principles conflict with strategies that yield higher rewards. To address this, the authors propose a new evaluation framework called MoralSim (Moral Behavior Social Dilemma Simulation), which integrates classic game-theoretic environments (such as repeated Prisoner's Dilemma and Public Goods Game) with real-world moral scenarios.
- The MoralSim framework combines classic game-theoretic scenarios with real-world moral dilemmas, enabling a systematic and comprehensive evaluation. This framework is more structured than previous fragmented tests. For example, compared to the MACHIAVELLI benchmark, which primarily uses text adventure games to assess agent behavior, MoralSim employs formal game structures integrated with moral contexts, allowing for more controlled and easily quantifiable comparisons. This could be better than just simple QA.
- Scientific rigor of the evaluation of this assumption. Although a multi-agent framework could provide a more vivid setting for eliciting LLM decisions under certain scenarios, it remains questionable how realistic and how consistent these evaluations are. In my experience, such evaluations are easily changed by small parts of the prompts. Yet this paper does not provide convincing enough evidence to establish the scientific rigor of this testing.
- The novelty is somewhat incremental. Although the MoralSim framework’s integration of morality and game theory is commendable, its concepts overlap with some existing work [1,2,3]. Also, please discuss these papers more in the main paper to familiarize the audience with the context and existing research as well as the differences. In particular, [2,3] already reported the betrayal behavior of LLMs.
[1] Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
[2] Moral Alignment for LLM Agents
[3] Survival Games: Human-LLM Strategic Showdowns under Severe Resource Scarcity
- How do the authors verify that the evaluation is stable rather than affected by small changes in prompt design? How often do models change archetype between Goal-oriented vs. Neutral prompts? Please provide a confusion matrix of archetypes across prompts.
- Please include a comparison table vs MACHIAVELLI, GovSim, Survival Games, and moral-reward alignment detailing the unique “human-harm resource” dimension and survival horizon. |
Fully human-written |
|
MoRE: Batch-Robust Multi-Omics Representations from Frozen Language Models |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces MoRE, a framework for learning batch-robust multi-omics representations. The method repurposes a frozen, pre-trained transformer backbone—originally developed for biological sequence data—as a feature extractor for single-cell omics profiles. Lightweight, task-specific adapters are then trained to project different omics modalities into a shared embedding space, which is further refined using a combination of supervised contrastive learning and alignment losses to mitigate batch effects. The authors claim that this approach enables effective integration across modalities and improves generalization to unseen cell types and platforms. Evaluations against methods like scGPT and scVI are presented to support its performance in tasks such as cell type annotation and batch correction.
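For context, the frozen-backbone-plus-adapter pattern the paper follows can be sketched in a few lines; the layer sizes and the stand-in backbone below are placeholders I chose for illustration, not MoRE's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenBackboneWithAdapter(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, latent_dim: int = 128):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():      # freeze the pretrained extractor
            p.requires_grad = False
        self.adapter = nn.Sequential(             # lightweight, trainable projection
            nn.Linear(feat_dim, 256), nn.GELU(), nn.Linear(256, latent_dim)
        )

    def forward(self, x):
        with torch.no_grad():
            feats = self.backbone(x)
        return F.normalize(self.adapter(feats), dim=-1)

# stand-in backbone: 2000-gene expression profiles -> 512-d features
model = FrozenBackboneWithAdapter(nn.Linear(2000, 512), feat_dim=512)
emb = model(torch.randn(8, 2000))                 # (8, 128) unit-norm embeddings
```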
1. Motivating a Challenging Problem: The work tackles the relevant and non-trivial challenge of integrating heterogeneous multi-omics data while correcting for batch effects, a significant pain point in single-cell genomics.
2. Conceptual Proposition: The idea of leveraging a frozen, pre-trained model backbone to create a stable, general-purpose feature extractor is a conceptually interesting direction for improving model generalizability and computational efficiency.
3. Modular Design Intent: The proposed framework suggests a modular architecture, which, in principle, could offer flexibility for incorporating new data modalities and tasks.
1. Misleading Terminology and Presentation: The persistent use of the term "Frozen Language Models" is highly misleading, strongly implying the use of natural language text and models like GPT. In reality, the method merely employs a transformer architecture on biological sequences, constituting a significant misrepresentation of the work's actual technical basis.
2. Insufficient and Unconvincing Empirical Validation: The experimental evaluation fails to substantiate the paper's central claims. There is a notable absence of quantitative benchmarks on the key promised tasks, such as batch effect correction (e.g., using metrics like ASW or LISI) or multi-omics integration. The comparison with Celltypist is not a rigorous validation of representation quality but rather an agreement study between two classifiers.
3. Unsubstantiated Claims of Superiority: The paper repeatedly claims to outperform strong baselines (e.g., scGPT, scVI, Harmony) but provides minimal quantitative evidence to support these assertions. The results presented do not demonstrate a clear or significant advantage over existing methods.
4. Poorly Defined Novelty and Contribution: The core technical approach of using a frozen pre-trained backbone with lightweight adapters is a well-established paradigm in transfer learning, not a novel innovation. The paper fails to articulate a specific, meaningful advancement beyond the current state-of-the-art, making its contribution ambiguous.
5. Disconnect Between Claims and Evidence: A significant gap exists between the ambitious claims made in the abstract and introduction (e.g., "zero-shot generalization," "batch-robust," "practical scalability") and the limited, tangential evidence provided in the results section. The work remains largely aspirational.
6. Severe Structural and Organizational Deficiencies: The paper exhibits a critical failure in logical organization. Section 3 ("Proposed Method") is cluttered with implementation details that belong in an "Experimental Setup," while Section 4 ("Experiments") confusingly introduces core methodological components (e.g., the embedding extraction and fusion modules). This profound disjunction between the high-level framework narrative and the low-level technical description makes it impossible to cleanly separate the proposed model's conceptual innovation from its specific instantiation.
1. Your abstract claims superiority in "integration fidelity, rare population detection, and modality transfer." Could you provide quantitative results on standard benchmarks for these tasks, such as batch integration scores (e.g., ASW, LISI) or metrics for rare cell detection, to substantiate these claims?
2. The term "Frozen Language Models" strongly implies the use of models trained on natural language. Given that you are processing biological sequences, do you agree that this terminology is potentially misleading and should be revised to more accurately reflect the technical approach (e.g., "frozen transformer backbone")? If not, please provide more evidence to support your opinion.
3. The manuscript lacks sufficient implementation details to ensure reproducibility. Could you please provide a comprehensive description of the training configuration, including key hyperparameters (e.g., learning rate, batch size, optimizer), the specific dimensions of the fusion modules and adapters, and the precise stopping criteria for the iterative refinement process?
4. What is the specific purpose of Figure 3 in the context of validating the MoRE framework, as it is not referenced in the main text? Furthermore, the description of data preprocessing and quality control is cursory. Please elaborate on the exact filtering thresholds applied (e.g., for gene counts, mitochondrial percentage) and the methodology for Highly Variable Gene (HVG) selection.
5. Figure 4a seems counterintuitive. Taking T cells as an example, the rows of True labels are all zero, and the columns of Predicted labels are also all zero. So why do T cells appear in this heatmap? Furthermore, this heatmap shows that the dataset only contains 136 + 211 + 9 = 356 cells, which is too few. Finally, a more precise description of this figure is needed. What are the True labels? Are they the cell types described by experts, or are the cell types annotated by Celltypist considered the truth?
6. The process for cell type annotation is completely inconsistent with consensus. Celltypist is simply another annotation method and should not be considered ground truth. The appropriate process is to select a dataset with expert-annotated cell types and consider this expert annotation as ground truth. Then, use methods such as MoRE, scGPT, and Celltypist to annotate cell types. Finally, use the F1 score or accuracy to evaluate the consistency of the predicted results with the expert annotations.
7. The biological insight intended from Figure 4b (ACTB expression) is unclear, as it is not discussed in the main text. What specific conclusion regarding the model's performance or the data's biology should the reader draw from the expression profile of this ubiquitous housekeeping gene?
8. Section 5.2 claims that MoRE can discern finer-grained cell states than the initial Celltypist labels. What independent, quantitative evidence—such as marker gene enrichment or concordance with known cellular hierarchies—can be provided to validate the biological accuracy of these postulated subtypes, beyond visual separation in UMAP space? |
Fully AI-generated |
|
MoRE: Batch-Robust Multi-Omics Representations from Frozen Language Models |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes MoRE, a framework that repurposes a frozen transformer backbone plus lightweight, modality-specific adapters and a task-adaptive fusion layer to produce batch-robust multi-omics embeddings. The paper claims generalization across unseen batches, donors, platforms, and modalities on single-cell tasks.
1. Recycling a frozen transformer with modality-specific adapters is a clear high-level idea.
2. The freeze-the-backbone approach could reduce compute and mitigate catastrophic forgetting during onboarding of new modalities.
1. Despite strong claims, the paper does not report standard, widely accepted measures for batch effect removal and biological conservation, such as kBET, ASW, graph iLISI, kNN-mixing scores, or NMI for label structure. Without these, batch-robust is not substantiated.
2. Figure 2 is not evidence of batch correction or superior embeddings. A single UMAP is insufficient to claim batch removal and one must at least show color-by-batch overlays and quantify mixing. Moreover, Figure 2 appears to lose a distinct B-cell cluster, undermining the structure-preserving claim. Clustering quality should be reported with NMI, F1, ASW, and stability.
3. Although the method is presented as multi-modal, all downstream experiments are on scRNA-seq only. To support the central thesis, the paper must include bona fide multi-omics datasets and tasks (e.g., 10x Multiome RNA+ATAC, SHARE-seq, SNARE-seq, CITE-seq RNA+ADT), including modality transfer and missing-modality scenarios.
4. Comparisons exclude canonical multi-omics integration methods such as GLUE, MultiVI, Seurat, MOFA and more recent cross-modal models. Since the paper’s innovation overlaps with multi-omics representations, not comparing to these makes the empirical case unconvincing.
5. Ablations and diagnostics are absent (e.g., adapters vs. no adapters and each loss component).
Please see Weaknesses. |
Moderately AI-edited |
|
MoRE: Batch-Robust Multi-Omics Representations from Frozen Language Models |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper presents MoRE, an LLM-inspired framework for multi-omics representation embedding. MoRE repurposes frozen transformer backbones augmented with lightweight, modality-specific adapters and a task-adaptive fusion layer to map heterogeneous single-cell omics into a shared latent space. Experimental results show that MoRE significantly outperforms scGPT, scVI, Scrublet, and Harmony in integration fidelity, rare-population detection, and cross-modality transfer.
1. The combination of a frozen backbone, parameter-efficient fusion, and iterative batch-effect refinement provides a clear cross-modality alignment scheme.
2. Compared with prior generative-reconstruction approaches, it places greater emphasis on structure-preserving alignment, which is relatively novel.
1. This manuscript uses the wrong template. For example, it is clearly missing line numbers.
2. There is a GitHub repo link in the abstract, which might be a violation of the double-blind policy.
3. The paper lacks ablation studies to demonstrate the necessity of each module.
4. The paper has typos and a confusing overall presentation. For example, the Proposed Method and Experiments sections seem to be in the wrong order.
5. The figures need better quality. For instance, the description of Figure 1 is unclear. Figure 4 also has a layout problem: its caption should not be separated from the figure or split across pages.
6. The training setting needs further clarification, including which components are frozen and which are trained. The reproducibility details, such as data splits and random seeds, are incomplete.
1. In the Introduction, the paper describes the approach as “training-free” for multi-omics integration. However, the modality-specific adapters and the task-adaptive fusion layer need to be trained. The authors should reconcile this and improve the wording.
2. Please check Equation (5); the "+" appears to be redundant.
3. Please provide ablation results that clarify the incremental contributions of iterative refinement and dense alignment beyond simple fusion.
4. Please provide a detailed description of the domain splits and missing-modality configurations, along with the corresponding quantitative results. |
Lightly AI-edited |
|
MoRE: Batch-Robust Multi-Omics Representations from Frozen Language Models |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 2: fair
Rating: 0:
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work introduces MoRE (Multi-Omics Representation Embedding), a framework to extract biologically meaningful cell representations by integrating modality-specific embeddings for different omics which are projected into a shared semantic space. MoRE is benchmarked against established single-cell models and methods, including scVI, Harmony, Scrublet, scGPT, and CellTypist across multiple tasks and datasets, with promising initial results.
• This paper introduces a novel method in an underexplored and important area
• Promising results across various tasks.
• A critical flaw of this paper is that it makes the claim:
"Our benchmarking across multiple datasets demonstrates that MoRE significantly outperforms established models—including scVI (Lopez et al., 2018), Harmony (Korsunsky et al., 2019),Scrublet (Wolock et al., 2019), and scGPT (Cui et al., 2024)—on metrics such as integration fidelity, rare population detection, and annotation accuracy."
However, those results do not appear in the manuscript. While the models listed above are evaluated on clustering performance and compared against MoRE, and MoRE is also evaluated on other tasks such as annotation accuracy and compared against CellTypist, these results are not sufficient to support the paper's main claims or to provide a comprehensive view of MoRE’s performance.
• A minor flaw is that tables are often not referred to in the text, detracting from clarity (i.e., see Sections 5.1 and 3.1).
• The paper is poorly organized - the introduction is a mix of different sections that include results, discussion, background and significance, and the text traverses back and forth among these sections, creating a confusing narrative and leaving the reader unclear about any of the paper claims.
• The rationale for the selection of the baselines (claimed as strong baselines) is lacking. Why include Scrublet, which is a QC method?
• The MoRE method is insufficiently described (just a few sentences are provided). At least an architecture diagram would be needed.
This paper omits key results in supporting its main claims. I am therefore providing an initial recommendation of rejection.
• Can the authors compare MoRE’s performance on integration fidelity, rare population detection, and annotation accuracy against all the models it was claimed MoRE was compared to?
Some suggestions:
• Consider moving minor details such as figure generation to an appendix.
• Ensure that tables are referred to in the text where appropriate.
• There are a significant number of typos in the paper. Please further edit the manuscript (for example, the title for section 5.1 has a typo). |
Fully human-written |
|
Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents a two-stage training paradigm for VLMs like CLIP, arguing the benefits of perceptual initialization (PI) over random initialization. Further, it argues that incorporating PI in the initialization phase is more advantageous than post-hoc fine-tuning. The main contribution is demonstrating that this early-stage alignment provides a stronger foundation for general-purpose VLM intelligence. PI models show significant zero-shot performance improvements, without any task-specific fine-tuning, across a comprehensive suite of 29 classification and two retrieval benchmarks. The PI approach consistently outperforms a randomly initialized baseline, and a direct comparison shows that post-hoc perceptual fine-tuning is catastrophic to V-L alignment.
**Originality**: Leveraging supervised human behavioural data as a foundational inductive bias in the model initialization is a novel idea that opens a new research direction. The work provides a structured solution that converts the often-ignored variance of random initialization into a principled prior.
**Significance**: The PI paradigm is the core strength of the paper. It uses supervised human perceptual data to initialize the VLM parameters prior to large-scale pretraining, providing a potent, human-aligned inductive bias right from time t=0.
**Quality**: The provided results empirically validate the PI hypothesis, with consistent performance gains and the baseline outperformed on 23/29 classification tasks. Further, it shows how post-hoc fine-tuning leads to catastrophic forgetting.
**Writing**: The argument for PI is presented logically, starting from the known "path-dependency" of deep networks and the variance of random seeds, making the motivation for a principled initialization intuitive.
**Limited scope of the prior**: Only the vision encoder is initialized with PI; the text encoder is still randomly initialized and trained from scratch. What is the reason for this experimental choice?
A CLIP-like model operates on a shared latent space of the vision and text modalities. The paper could be strengthened by exploring a complementary initialization of the text encoder, to see whether such a fully PI-initialized model provides synergistic benefits.
**Perceptual Loss**: The core of the PI benefit lies in the perceptual loss function, which is derived from previous work. There is little evidence or interpretation (apart from the final results) of how this loss function works or fails in the assumed contexts: pretraining vs. post-hoc fine-tuning.
**Mechanistic Analysis**: While efficacy is demonstrated, the paper does not delve into why the inductive bias remains so effective after 32 epochs, whereas post-hoc fine-tuning fails. This insight is critical for assessing whether the idea can be transferred to different models or scenarios. Many of the questions in the **Questions** section could not be answered from the given content of the paper.
**Limited evaluation**: The current training uses 15M image-text pairs; while this is substantial, SOTA VLMs are often trained on hundreds of millions or billions of pairs. Would the proportional gains from PI persist, diminish, or grow continuously (though a limited scaling analysis is provided in the paper)? In failure cases, how should PI be addressed?
- Do the PI weights remain closer to the perceptiual optimum throughout training?
- How does the learned Image-Text alignment module interact differently with the PI-derived features versus the baseline features? How does the shared representation space differ?
- Have any preliminary experiments been conducted to determine the minimal amount of human perceptual data required in Stage 1 to achieve a statistically significant positive gain?
- Why/How does this loss function work?
- Can the authors analyze the evolution of the logit scaling parameter ($\tau$) in Stage 2?
- For failure cases, should the perceptual prior be "re-anchored" at intermediate stages, or perhaps weakened by introducing a temperature parameter to the perceptual loss? |
Fully human-written |
|
Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces Perceptual-Initialization (PI), which initializes the CLIP vision encoder using human perceptual similarity data (NIGHTS triplets) before standard large-scale image-text contrastive training on YFCC15M.
Compared with random initialization and with post-hoc perceptual fine-tuning, the proposed method yields consistent zero-shot gains across 29 classification and 2 retrieval benchmarks. The authors argue that embedding human perceptual priors at the start of training leads to faster convergence and more human-aligned representations.
1. Novel use of human perceptual priors as initialization rather than alignment fine-tuning.
2. Comprehensive evaluation over diverse datasets shows consistent positive gains.
3. Very low additional compute cost.
4. Clear comparison showing that late perceptual fine-tuning disrupts alignment, opening a new direction for human- or brain-aligned pretraining.
1. No experiments using random or pseudo perceptual triplets to isolate the contribution of human perceptual structure.
2. The approach is validated only on NIGHTS; applicability to richer datasets remains untested.
3. No probing or visualization is provided to show how perceptual initialization changes internal feature space or similarity structure compared to the baseline.
1. Could the authors analyze which visual attributes benefit most from perceptual initialization (e.g., texture vs. shape bias)?
2. Does PI primarily affect the early layers or propagate to higher-level semantics during contrastive training?
3. How much perceptual data is necessary—does performance saturate after a certain fraction of NIGHTS triplets?
4. Could PI be combined with supervised or robust-CLIP initializations, or would they interfere? |
Fully AI-generated |
|
Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a new visual representation learning scheme called Perceptual-Initialization (PI), which trains the visual encoder to match human perceptual preferences before contrastive learning. Specifically, the human preference alignment is achieved using a triplet contrastive loss on the NIGHTS dataset, and the resulting model weights are used as the initialization for the subsequent contrastive learning. PI achieves zero-shot performance improvements on a variety of image classification and retrieval benchmarks compared to the baseline CLIP.
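For concreteness, the Stage-1 perceptual alignment can be sketched as a hinge loss over two-alternative human similarity judgments; the margin, the stand-in encoder, and the exact loss form below are my assumptions, not necessarily the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def perceptual_triplet_loss(encoder, anchor, img_a, img_b, human_choice, margin=0.2):
    """Hinge-style triplet loss on NIGHTS-like judgments: `human_choice`
    is 0 if annotators judged img_a closer to the anchor, 1 if img_b."""
    za, z0, z1 = (F.normalize(encoder(x), dim=-1) for x in (anchor, img_a, img_b))
    sim0, sim1 = (za * z0).sum(-1), (za * z1).sum(-1)
    pos = torch.where(human_choice == 0, sim0, sim1)
    neg = torch.where(human_choice == 0, sim1, sim0)
    return F.relu(margin - (pos - neg)).mean()

# toy usage with a linear stand-in encoder over flattened 32x32 RGB images
enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
imgs = lambda: torch.randn(4, 3, 32, 32)
loss = perceptual_triplet_loss(enc, imgs(), imgs(), imgs(), torch.tensor([0, 1, 0, 1]))
```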
- The proposed method is novel, simple, yet effective. The promising results of the paper can encourage follow-up research exploring other initialization strategies.
- Results on zero-shot image classification and retrieval tasks demonstrate that PI scales as the data volume increases, indicating the method's potential in large-scale training.
- The paper is well organized and nicely presented. The ending section points out remaining challenges faithfully and offers valuable insights, strengthening its contribution to the field.
- The proposed method limits its scope to the initialization of CLIP-type models, even though the human preference alignment is independent of the text encoder. The authors could add experiments on other visual backbones, such as vanilla ViTs, to fully explore the potential of the method.
- As mentioned in the weaknesses part, I am wondering whether PI could also benefit other types of visual pretraining.
- Additionally, does the model trained using PI demonstrate stronger transferability compared to normal training? |
Fully human-written |
|
ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
In this work, the authors introduce a simulation framework and benchmark for developing AI assistants that are both proactive and personalized. The framework models a simulated home environment where a user agent performs daily routines, with multiple distinct user personas and realistic activity patterns generated using GPT-4o. The assistant’s objective is to determine when and what actions to recommend to the user. To address this, the authors propose a retrieval-augmented, preference-learning assistant that maintains a memory of past interactions, generates multiple recommendation candidates, and learns from user feedback through preference optimization. Extensive experiments demonstrate that this approach substantially improves user satisfaction across a wide range of simulated personas.
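As I understand the learning loop (my paraphrase, with hypothetical function names, not the authors' code), each turn's scored candidates are turned into preference pairs for DPO roughly as follows.

```python
from itertools import combinations

def build_preference_pairs(candidates, scores, min_gap=1.0):
    """Turn candidate recommendations and their user-satisfaction scores
    into (chosen, rejected) pairs; the min_gap filter is an assumed
    heuristic, not necessarily the paper's rule."""
    pairs = []
    for (ci, si), (cj, sj) in combinations(zip(candidates, scores), 2):
        if abs(si - sj) >= min_gap:
            chosen, rejected = (ci, cj) if si > sj else (cj, ci)
            pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs

# toy usage: three candidate actions rated by the simulated user
cands = ["suggest a short walk", "recommend a recipe", "stay silent"]
print(build_preference_pairs(cands, [4.0, 2.0, 3.5]))
```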
The main strength of this paper lies in its timely motivation to create assistants that are both proactive and personalized. The authors create an extremely controlled agent setup to study the design of such systems. Under this setup, they consider multiple constraints that the assistant needs to adhere to, creating an extremely rich setting for ablating different scenarios. The authors also validate their proposed setup with a human evaluation study, which increases the validity of their claims about naturalness and persona alignment.
In addition, the authors propose an intuitive system that builds on two key components: (a) a structured memory storing past experiences, and (b) DPO training based on user feedback. Through extensive evaluation, their proposed system shows consistent improvement across all their personas. Overall, this is an extremely strong paper that will be of great interest to the scientific community.
I don't see many weaknesses with the paper. In fact, the authors have gone to extreme details in designing their system.
I have a few questions about the experimental setup.
(a) The definition of proactivity is narrow. Under the current definition, proactivity is reduced to timing control, not anticipation of the latent goals or needs of the user. This is because the authors define it as whether an action should be taken at a particular time. However, I would also define proactivity in terms of whether an action can return a reward faster or change the trajectory of the user's decisions. How can that be modeled under the current framework?
(b) Noisy or delayed feedback: One key missing component is that the current setup assumes nearly perfect feedback from the user at all times. User preferences can be noisy or even delayed. How would such noise be incorporated into the setup? And how would the proposed system change in the presence of such noise?
(c) LLM-as-a-judge evaluation is performed with Gemini 2.0 Flash. The rubric used in Table 6 is extremely broad. How would the results change if the evaluation were performed with a different LLM?
Please check my questions above. |
Fully human-written |
|
ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper creates a benchmark for simultaneously evaluating the proactivity and personalization of an agent. The benchmark is a simulated environment in which a user, powered by an advanced LLM, conducts daily activities, simulating human behavior. At the same time, an agent observes the activities of the user and decides whether or not to interact with the user by providing recommendations. The agent receives a score for its recommendation based on the user's personalized rubric, as evaluated by an LLM. The users are created to cover a broad set of personas. Besides the benchmark, the authors also provide an agent that performs well in this simulated environment by utilizing preference alignment and retrieval-augmented generation.
1. This paper considers the problem of benchmarking both proactivity and personalization of an agent. It can augment the existing evaluations which mainly focus on either proactivity or personalization.
2. The presentation of this paper is clear and easy to follow.
1. The definition of proactivity is too narrow. In the proposed benchmark, proactivity is mainly described as the ability to initiate interaction with the user by providing recommendations. In contrast, previous works on benchmarking proactivity cover a broad range of tasks, including coding, writing, and daily-life scenarios. Also, the quantification/evaluation of proactivity seems to be conflated with personalization. For example, how can one infer from the evaluation results that an agent is proactive but bad at personalization?
2. The proposed ProPerAssistant can be regarded as a baseline for this benchmark rather than a contribution, since there is no new innovation in the agent, which is mainly based on memory retrieval and preference optimization. For example, consider the statement at line 406: "Notably, ProPerAssistant achieves this without relying on computationally expensive reinforcement learning methods, instead leveraging an efficient DPO-based approach to learn user preferences". This is the contribution of DPO. I also find the statement somewhat misleading: since the agent is trained with online DPO, which necessitates the collection of online trajectories, it is still expensive.
3. The benchmark is not practical to use. The simulation and evaluation are both powered by advanced LLMs, which incurs substantial API costs. And since the evaluation is based on LLM-as-a-judge, there can be bias or hallucination. Although the authors conducted manual checking of the actions and evaluations in Section 4.3, the check is limited, and I still have concerns about the reliability of the benchmark.
4. An agent capable of actively initiating interaction and inferring a user's personality from the interaction history should be able to generalize to users with other personas without re-training. However, there is limited discussion of the generalization ability of ProPerAssistant (I am not sure whether the first paragraph in Section 6.4 is under this setting).
I also raised these questions in the weaknesses part. Please refer to that part for more details.
1. How can one infer the separate performance on proactivity and personalization from the evaluation results? It seems that either low proactivity or poor personalization will decrease the score. I understand that the goal of this benchmark is to evaluate the joint performance, but it is also crucial to know what the agent is currently bad at for future improvement.
2. What is the setting of Adaptation Across Diverse Personas in Section 6.4? What is the training set of users and what is the evaluation set? |
Fully human-written |
|
ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper investigates the proactive personalization problem in the conversational AI scenario. Because this setting has rarely been studied, the work proposes ProPerSim to simulate it and further proposes ProPerAssistant, which couples a memory system with a preference-aligned training strategy to proactively generate personalized interventions for the user.
* The research scenario is very interesting and central to the community's focus.
* The proposed ProPerSim simulation is concrete and considers personality, environment, and user memory.
* The way the proposed ProPerAssistant learns from user feedback via intermediate states and user preferences is novel.
Some minor concerns from a practical perspective:
* The baseline selections are not convincing: ProPerSim would of course outperform the no-memory, AR-memory, and ARS-memory variants, since it is trained on these configurations and contains them all. The comparisons here are more like an ablation. The authors may consider comparisons against different preference alignment methods, etc.
* The experiments lack an evaluation of the interventions themselves. Since the task is proactive personalization, it would be natural to report the rate of successful, well-timed interventions.
* Latency/efficiency analysis on the daily data generation should be included.
* How does adopting a purely time-based framing of proactivity potentially limit the assistant's ability to respond to specific user-triggered events (which prior work used) that might require instantaneous intervention, rather than waiting for the next predetermined timestep?
* Is the challenge of improving performance for these complex personas fundamentally attributable to the memory system's inability to retain the necessary temporal and content granularity required to meet these highly specific demands, given that the memory relies heavily on time block summarization? |
Fully human-written |
|
ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces ProPerSim, a simulation framework for developing AI assistants that integrate proactivity and personalization in home environments. It also presents ProPerAssistant, a retrieval-augmented AI assistant trained using ProPerSim to adapt its strategies based on user feedback. ProPerAssistant demonstrates improved user satisfaction across 32 diverse personas, highlighting its ability to adapt to individual preferences and contexts. This work bridges the gap between personalized and proactive AI systems, advancing the development of context-aware, user-centric recommendations.
1. The paper addresses an important gap by combining personalization and proactivity, offering a new perspective on AI assistant design.
2. The simulation framework is well-structured and supports realistic evaluation through diverse user personas and rich feedback loops.
3. ProPerAssistant demonstrates measurable improvements in user satisfaction, showcasing its practical applicability and potential for real-world integration.
1. The dataset mentioned in the paper is neither provided in the supplementary materials nor shared with an anonymous link. This raises concerns about whether the dataset will be made publicly available in the future.
2. While the paper claims to integrate proactivity and personalization as its motivation, the focus of both the dataset and the proposed method appears to be predominantly on personalization. The treatment of proactivity seems rigid, as it only involves providing suggestions at fixed time intervals rather than adapting to users' behaviors. Such an approach lacks flexibility and does not align well with real-world user scenarios.
3. The paper includes comparisons with only a limited set of baselines, specifically a few variations of ProPerAssistant. However, it does not compare the proposed method against other existing personalization baselines. This omission raises concerns about the effectiveness of the proposed approach.
4. The authors report performance results based on only one type of large language model (LLM) and do not conduct ablation studies on other LLMs. This raises doubts about the generalizability of their method and dataset across different models.
See weakness. |
Moderately AI-edited |
|
Variance Reduced Distributed Non-Convex Optimization Using Matrix Stepsizes |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces two algorithms, det-MARINA and det-DASHA, aimed at federated non-convex optimization with communication compression. These methods extend the matrix-stepsize algorithm det-CGD by incorporating variance reduction techniques (MARINA and DASHA, respectively). The primary contribution is the theoretical demonstration that these variance-reduced versions overcome the main limitation of det-CGD, which is convergence only to a neighborhood of a stationary point, and instead achieve convergence to a true stationary point with an O(1/K) rate, measured in a determinant-normalized gradient norm (Theorems 4.1 and 5.1). The paper derives conditions for the matrix stepsize D and proposes a practical relaxation by setting D = γW, providing explicit formulas for the scalar γ. Complexity analyses suggest improvements over scalar MARINA/DASHA and det-CGD. Empirical results on logistic regression problems confirm faster convergence in terms of communication bytes compared to baselines.
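To make the mechanics concrete: below is a minimal numpy sketch of the generic MARINA loop with the scalar stepsize replaced by a matrix D. This is only the skeleton I have in mind, with my own choice of compressor and function names; it is not the authors' exact det-MARINA.
```python
import numpy as np

def rand_k(v, k, rng):
    """Rand-k compressor: keep k random coordinates, rescaled so the estimator stays unbiased."""
    idx = rng.choice(v.size, size=k, replace=False)
    out = np.zeros_like(v)
    out[idx] = v[idx] * (v.size / k)
    return out

def marina_matrix_round(x, g, D, local_grads, p, k, rng):
    """One MARINA-style round with a matrix stepsize D (d x d) in place of the scalar gamma.

    local_grads[i] is a callable returning client i's gradient at a point;
    g is the aggregated gradient estimator shared by the server and all clients.
    """
    x_new = x - D @ g
    if rng.random() < p:
        # rare synchronisation step: every client sends its full local gradient
        estimates = [grad(x_new) for grad in local_grads]
    else:
        # usual step: clients send compressed gradient differences only
        estimates = [g + rand_k(grad(x_new) - grad(x), k, rng) for grad in local_grads]
    return x_new, np.mean(estimates, axis=0)
```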
1. The paper removes the residual error present in det-CGD convergence analyses and proves clear O(1/K) convergence to a stationary point under the matrix Lipschitz assumption.
2. The theoretical results appear correct and consistent, with checks showing that the framework recovers known results for scalar MARINA/DASHA and matrix gradient descent.
3. The relaxation D = γW and the explicit formulas for γ make the method easier to use in practice.
1. The paper’s novelty is a theoretical synthesis of matrix stepsizes and variance reduction, which successfully removes the convergence neighborhood of det-CGD. While the analysis is careful, many steps of the proof are direct translations of variance-reduction proofs to the matrix-norm setting. The manuscript does not isolate what specific technical challenges (if any) required new analytical techniques.
2. The method’s reliance on knowing or accurately estimating the matrix L (or the local Li) remains a major practical limitation. The paper acknowledges this (“Availability of L”, Sec 5.1) but lacks a convincing practical strategy or robustness analysis regarding estimation errors. This significantly undermines the applicability of the results.
3. The experiments do not sufficiently justify the method’s added O(d^2) complexity. The evaluated logistic regression tasks on LibSVM datasets neither reflect moderate-scale non-IID federated benchmarks nor include ill-conditioned settings where matrix stepsizes should provide clear benefits. The observed gains over scalar methods are modest, offering limited empirical justification for the added complexity.
1. What parts of your analysis are genuinely new or rely on arguments that differ from scalar variance reduction proofs? Clarifying this would help readers understand the technical novelty.
2. How sensitive are det-MARINA and det-DASHA to errors in estimating L or Li? Showing results with controlled under- and over-estimation would make the practical stability clearer.
3. For W = diag^{-1}(L), what efficient methods can estimate or update diag(L) in large-scale federated settings?
4. The experiments use small logistic regression problems. Can you include larger benchmarks such as CIFAR-10 or CIFAR-100 with non-IID clients and moderate neural networks to show that the approach scales and remains effective in realistic scenarios?
5. Could you test a problem with strong ill-conditioning where matrix stepsizes give much larger speedups over scalar methods? This would better justify the added complexity of your approach. |
Fully AI-generated |
|
Variance Reduced Distributed Non-Convex Optimization Using Matrix Stepsizes |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper studies federated learning optimization methods with matrix stepsizes. Two new FL algorithms are proposed, det-MARINA and det-DASHA, which extend the existing det-CGD algorithm and mainly address the variance-reduction aspect. As a result, the proposed algorithms exhibit a superior convergence bound, overcoming the neighborhood limitation present in det-CGD and other SGD-style methods. Experiments show that the two proposed algorithms have lower iteration complexity while being as communication-efficient as existing algorithms such as det-CGD.
The paper introduces variance-reduced variants of det-CGD. Although MARINA and DASHA have been relatively well studied, the reviewer believes this is still a nontrivial and meaningful extension. The proposed det-MARINA and det-DASHA algorithms display noticeable improvements compared to previous algorithms.
The theoretical analysis seems solid and rigorous to the best of my knowledge. The analysis is well explained and mostly intuitive. The statements regarding iteration and communication complexity are of interest to potential readers.
The experiments utilize synthetic objective functions that satisfy the function assumptions, and the numerical experiments verify the effectiveness of the proposed algorithms. The authors also provided a comparison between algorithms in terms of communication bytes.
While the experiments effectively recover the technical assumptions on the objective function and serve as verification of the analysis, they are, for the most part, still synthetic toy examples. Since the paper addresses federated learning problems, which are largely motivated by practical applications, it would be helpful if the authors also provided further practical results with real-life FL tasks.
The writing of this paper has much room for improvement.
- The authors provided a mathematically sound introduction and related works; however, they did not explain the motivation or the necessity of this work. What are the properties of CGD? Why is the previous method named det-CGD? More explanations should be added for this paper to be accepted as a conference paper.
- The authors also use terms such as det-CGD and det-CGD1/2 interchangeably, assuming that readers have read the "original paper", which the reviewer takes to be Li et al. (2024).
- In terms of notation, the authors switch from D to W and introduce the matrices L, S, and D in Section 1, while the det-CGD algorithm itself is only introduced two pages later.
- Many mathematical definitions, such as matrix smoothness, should be provided in the manuscript.
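For instance, the matrix smoothness condition that this line of work builds on is, to my understanding (the paper's exact assumption may differ in its details), that there exists a symmetric positive semidefinite matrix $\mathbf{L}$ such that
$$f(y) \le f(x) + \langle \nabla f(x),\, y - x \rangle + \tfrac{1}{2}\,(y - x)^{\top} \mathbf{L}\,(y - x) \qquad \text{for all } x, y,$$
which recovers the usual scalar $L$-smoothness when $\mathbf{L} = L I$. Stating the definition explicitly in the manuscript, as suggested above, would make the paper self-contained.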
I wonder how the heterogeneity across agents affects the convergence of distributed det-CGD and the two proposed algorithms in this setting? Does the variance from CGD and the variance from FedAvg compound in practice and theory?
How does the computational complexity of Compressed Gradient Descent affect the overall CPU wall time of the proposed algorithms?
In practice, are the gradients first calculated in full and then compressed, or are they directly estimated as compressed signals? |
Fully human-written |
|
Variance Reduced Distributed Non-Convex Optimization Using Matrix Stepsizes |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes two variance-reduced distributed algorithms, det-MARINA and det-DASHA, for nonconvex finite-sum optimization in federated learning settings. Building on the det-CGD method, the authors incorporate variance reduction techniques from MARINA and DASHA to eliminate the convergence neighborhood issue caused by stochastic compressors in det-CGD. Under matrix smoothness assumptions, they prove convergence with improved iteration and communication complexities compared to scalar counterparts and det-CGD. Experiments on LIBSVM datasets show improved iteration/communication metrics over baselines.
1. The integration of matrix stepsizes with variance reduction addresses a key limitation of det-CGD by removing the non-vanishing error term, leading to better theoretical bounds.
2. The authors provide clear convergence guarantees under matrix Lipschitz assumptions, highlighting the advantages of the proposed methods.
3. Experiments validate the effectiveness of the proposed method, showing det-MARINA and det-DASHA outperform baselines in communication and iteration efficiency.
1. Since matrix smoothness is not very standard in general stochastic optimization problems, the authors should discuss this assumption in more detail, including how it can be satisfied in practical problems and how it compares with the standard smoothness assumption.
2. Although the authors provide many experimental results, more complicated and real-world scenarios or tasks would largely strengthen the experimental part.
3. How do the authors recommend choosing or approximating the matrix stepsize D in practice, especially when the full Hessian or smoothness matrices are expensive to compute? Are there heuristics or approximations beyond $D = \gamma W$? More discussion here may be helpful.
4. It is well-known that the optimal complexity for finite-sum optimization is $\mathcal{O}(\sqrt{n} \epsilon^{-2})$. How can we understand the order of $n$ in the convergence rate for matrix stepsize methods?
See the Weakness part. |
Lightly AI-edited |
|
Riemannian Stochastic Interpolants for Amorphous Particle Systems |
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper presents a generative model for amorphous materials based on stochastic interpolants, taking symmetries and periodic boundary conditions into account. Mathematical details regarding the model design are included in detail. The method is evaluated on a small 2-d toy example simulating a single system at two scales (10 and 44 atoms respectively).
The paper is well written, and while the mathematical details at times make the paper a bit dense, it is quite readable.
The topic is timely and of great interest.
The method is described in sufficient detail, which makes it likely that results can be reproduced.
The focus on distributional/ensemble metrics is relevant and interesting.
The experimental validation is very limited. There is no real data example, e.g. with a 3-dimensional amorphous material.
While the method could be expected to generalize between different systems, the experiments only show a model trained for a single system.
The ability of the model to generalize should be clearly demonstrated.
Could you write M = R^3 / (LZ)^3 ?
Are you sure that Eq. (1) is not a metric on the quotient space M? Perhaps you meant to say it is not a true metric on R^d.
The potential energy as defined in Eq. (2) assumes no self-interaction across the periodic boundaries, right?
What do you mean by "For instance, the modulo operator defines a fundamental invariance for probability densities on M, as any density that is not modulo-invariant would assign infinite mass (...)"? Do you mean p(X) = p(X+kL) for all k in Z^d? I am not sure I understand this, since if X is in M, then X (in R^d) and X+kL (in R^d) is literally the same point in M, so this is already built into the domain, p: M->R.
Line 131. Is the ⨂ notation necessary here? Could you not just write (X + u) % L? I think that is fairly standard and perhaps more readable. Same goes for line 139. Perhaps you prefer the chosen notation because it makes it explicit that operations are on all N atoms?
What is the practical significance of Eq. (4)? Also, if b should be in R^{N|S|+Nd} that seems to not align with the definition of b in appendix A, lemma 10: b=1_N ⨂ c where c is in R^d.
Could the logarithmic map (Definition 2) be written more compactly as (A-B+L/2) % L - L/2 ?
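A quick numerical check of this identity (assuming the same box length L in every dimension; the formula above is only my conjectured match to Definition 2) would be:
```python
import numpy as np

L = 1.0                                   # box length, assumed equal in every dimension
rng = np.random.default_rng(0)
A = rng.uniform(0, L, size=(5, 3))        # 5 points in a 3-d periodic box
B = rng.uniform(0, L, size=(5, 3))

# proposed compact form: wrapped displacement in [-L/2, L/2)
log_compact = (A - B + L / 2) % L - L / 2

# brute force: shortest displacement over the periodic images k in {-1, 0, 1}
candidates = (A - B)[..., None] + np.array([-L, 0.0, L])
idx = np.abs(candidates).argmin(axis=-1)
brute = np.take_along_axis(candidates, idx[..., None], axis=-1).squeeze(-1)

assert np.allclose(log_compact, brute)    # the two expressions agree componentwise
```
If this indeed matches Definition 2, stating the compact form in the paper might help readers.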
On the flat torus, where the differential of the exponential map is the identity, doesn’t Eq. (9) recover the exact trajectories rather than only the time-marginals?
I am confused about the notation in Proposition 8. The velocity field is defined as \hat v: [0,1] x C -> TC. So are the coordinates not already defined on M, and thereby, by definition, \hat v(t, (s,X)) = \hat v(t, (s,X+kL))?
What is the relation to generative models based on kinetic Langevin diffusion? Could you compare to this line of work?
Can you demonstrate the capability of the model to generalize beyond the training data? |
Fully human-written |
|
Riemannian Stochastic Interpolants for Amorphous Particle Systems |
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors tackle the problem of sampling equilibrium configurations of glass-forming materials. They do so by combining Riemannian stochastic interpolants with equivariant flow matching for the groups of interest (permutations, translations, and symmetries). They substantiate their claims empirically and improve on baselines where available.
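For context, my understanding of the underlying construction (the paper's Riemannian version will differ in its details, so treat the schedules and notation below as my assumptions) is the Euclidean stochastic interpolant
$$x_t = \alpha(t)\, x_0 + \beta(t)\, x_1 + \gamma(t)\, z, \qquad \alpha(0) = \beta(1) = 1, \quad \alpha(1) = \beta(0) = 0, \quad z \sim \mathcal{N}(0, I),$$
with the linear interpolation presumably replaced on the flat torus by a geodesic one, e.g. $x_t = \exp_{x_0}\!\big(\kappa(t)\, \log_{x_0}(x_1)\big)$ for some schedule $\kappa$, which is where the exponential and logarithmic maps discussed in my questions below enter.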
- The paper is well-presented.
- The developed theoretical framework is arguably simple, but rigorous and thorough. The proofs seem sound and are well written.
- The paper formalises and proves some intuitive claims (e.g., Prop. 9) – something often overlooked.
- The empirical evidence clearly shows improvement over existing baselines on the provided experiment. Not being an expert in the application domain, I cannot judge its quality exactly; however, the method seems to produce configurations that are more stable by about an order of magnitude. Moreover, it looks sound.
- The empirical evidence seems very thorough on the provided dataset.
- The novelty is arguably very low. This paper mostly applies equivariant architectures to Riemannian Flow Matching. In particular, the considered GNN is made Lipschitz-bounded and equivariant, which, as mentioned in the paper, has already been done numerous times. (Perhaps not all at once?)
- Similarly, the theoretical framework does not seem particularly insightful; it mostly formalises results that intuitively seem self-evident. While it is good to prove these, it does not add anything new. At least, this is in my current understanding of the theorems.
- The experiments are convincing, but arguably most of the improvement comes from the equivariance of the architecture.
Overall, this seems to be a good paper, but I doubt that ICLR is the right venue: the part of interest for a machine learning conference is really the experiments section. Moreover, and because of that, I am not an expert in this particular sub-field, so I apologise in advance for not being knowledgeable enough on this. So:
- Have I missed out on some particular novelty in the paper?
- Are the experiments more related to ML than I have understood? Are there perhaps any conclusions to be made about Riemannian Flow Matching/Stochastic Interpolants?
- Have you tried out your method on different data/larger scales, and do you see similar improvements?
- Could you point out the main differences between your method and Equivariant Flow Matching?
I am happy to engage in the discussion period to hopefully deepen my understanding and improve my assessment of this work. |
Fully human-written |
|
Riemannian Stochastic Interpolants for Amorphous Particle Systems |
Soundness: 3: good
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces an equivariant form of stochastic interpolants and applies it to amorphous materials. The main results are given on a toy problem with either 11 or 44 atoms with two different species.
- Figure 2 makes the equivariances quite convincing
- Competing methods have drawbacks (Riemannian DDPM, no likelihood; maximum-likelihood training of ODE, expensive)
- The reweighing is possible via the framework and it is shown to make a big improvement in performance.
- Figure 1, right side. The word "symmetrized" is written next to an image in which the symmetry is far from clear. What's going on there?
- There is a lot of time spent on what I would consider background information. Many of these symmetries are closely discussed in other works. I agree that a specific treatment of stochastic interpolants is technically new, but the details discussed here are somewhat minor.
- While the performance is obviously better with the authors' treatment, the results are rather simple. Of course, everything gets more complex in the amorphous state with periodic boundary conditions, but 44 atoms with two species is not very many or very complex. Is there any experimental data you can attempt to fit?
- What is the citation issue on line 177?
- Can you consider any bigger systems or ones with more motivated datasets?
- Can you explain your main contribution more clearly? It seems like many of these modeling aspects existed already in extremely related frameworks. |
Fully human-written |
|
Riemannian Stochastic Interpolants for Amorphous Particle Systems |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes to use the Riemannian stochastic interpolants framework and a group-equivariant network for amorphous system generation. The authors also adapt the graph neural network architecture to leverage the full symmetry of amorphous materials. Experiments on a classical glass model show the empirical performance of the proposed methods.
1. The paper is well-structured and easy to follow.
2. The paper leverages the symmetry and geometric structure of amorphous materials, which is reasonable.
1. Limited experimental scope. The experiments appear to be restricted to two-dimensional and relatively small-scale datasets. Additional experiments on larger-scale and real-world datasets would strengthen the paper’s contributions. Please refer to Question 1 for further details.
2. Lack of efficiency analysis. I noticed that the authors use Eq. (8) to compute the expectation of physical quantities. To the best of my knowledge, such likelihood computations can be inefficient and inaccurate. Additional experimental results are needed to demonstrate that the proposed estimation process indeed converges reliably.
1. Since I am not an expert in amorphous materials, could you clarify whether there are any real-world datasets or tasks in this domain that are suitable for diffusion models? As mentioned in Weakness 1, experiments on large-scale datasets would help demonstrate the scalability of the proposed method and further strengthen the contribution of this work.
2. Could you elaborate on how you evaluate the average potential energy and heat? As mentioned in Weakness 2, does the simulation of Eq. (8) converge in practice?
3. How does the choice of numerical ODE solver affect the generation performance? |
Fully human-written |
|
Continuous Online Action Detection from Egocentric Videos |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes a new problem setup called Continuous Online Action Detection (COAD) for egocentric video.
Unlike standard Online Action Detection (OAD), where models are trained offline and only infer online, COAD aims to both learn and adapt continuously from streaming input, in a single-pass, without replay or multiple epochs.
The authors introduce a benchmark called Ego-OAD (a modified subset of Ego4D’s Moment Queries split) and combine three ideas to deal with the new configuration:
1. state continuity in RNNs,
2. orthogonal gradient updates (to decorrelate consecutive gradients), and
3. non-uniform loss computed only on the last frame of a window.
They claim this setting improves in-stream adaptation and out-of-stream generalization.
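To make ingredients (2) and (3) above concrete, here is a minimal PyTorch-style sketch of how I read them; the cited prior work pins down the actual formulations, and the function names and the single-label loss are my simplifications rather than the paper's exact recipe.
```python
import torch
import torch.nn.functional as F

def orthogonalize(grad, prev_grad, eps=1e-12):
    """Remove from the current gradient its component along the previous one,
    decorrelating consecutive updates on highly correlated streaming frames.
    Both arguments are flattened 1-d tensors."""
    coef = torch.dot(grad, prev_grad) / (prev_grad.norm() ** 2 + eps)
    return grad - coef * prev_grad

def last_frame_loss(logits, targets):
    """Non-uniform loss: supervise only the final frame of each window.
    logits: (batch, time, classes); targets: (batch, time) class indices.
    Single-label cross-entropy here for brevity; the benchmark itself is multi-label."""
    return F.cross_entropy(logits[:, -1], targets[:, -1])
```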
The proposed setup is relatively simple to implement.
1. **Almost no technical novelty**:
The three main components (orthogonal gradient, non-uniform loss, RNN state continuity) are all borrowed from prior work.
The proposed orthogonal gradient is directly taken from prior work on streaming learning (e.g., CVPR 2025), the non-uniform loss has already appeared in OAD literature (e.g., MiniROAD), and state continuity is essentially an inherent property of RNN-based models.
Overall, the claimed “new task and strategy” appears to be more of a reapplication or repackaging of existing ideas rather than a genuinely novel methodological contribution.
2. **Motivation is not persuasive**:
COAD assumes frame-level or interval-level action labels—extremely expensive to obtain in egocentric settings.
Labeling continuous egocentric streams for action intervals is one of the most labor-intensive tasks in video research.
If the method truly aims at continuous online learning, it should address label sparsity, delayed labels, or weak/self-supervised alternatives.
As written, it still assumes dense supervision, which fundamentally contradicts the “continuous deployment” scenario it claims to model.
In addition, the proposed formulation gives the impression of directly extending the streaming-learning setup of Carreira et al. (CVPR 2024a) into the OAD domain, without sufficiently reconciling the differing supervision assumptions and data requirements between the two tasks.
3. **Missing comparison across diverse OAD architectures**:
The paper evaluates COAD primarily on a minimal RNN-based configuration, without comparing across existing architectural paradigms.
However, most prior OAD works—such as LSTR, OADTR, MA-Transformer, GateHub, MiniROAD, and TeSTra—are fundamentally architectural contributions, largely orthogonal to the continuous-learning setup proposed here.
For the study to be experimentally complete, it is therefore important to demonstrate how COAD behaves when integrated with or compared against these architectures under a consistent backbone and feature extraction setting.
Please see weaknesses. |
Moderately AI-edited |
|
Continuous Online Action Detection from Egocentric Videos |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
Traditional Online Action Detection (OAD) models are trained offline and assume static environments. This limits their adaptability to dynamic, personalized contexts, especially in wearable devices like smart glasses. These devices capture egocentric (first-person) video streams in real time, where users, environments, and tasks vary continuously. The paper introduces Continuous Online Action Detection (COAD) to address this gap. COAD enables models to not only detect actions in real time but also learn and adapt continuously from streaming video without storing data or retraining offline. It curates a large-scale benchmark from Ego4D’s Moment Queries split for egocentric OAD with multi-label, temporally grounded annotations. It proposes training strategies of orthogonal gradient projection to reduce update redundancy, state continuity via RNNs to maintain long-term memory, and a non-uniform loss to align training with inference dynamics. In the experimental evaluation, COAD improves top-5 accuracy by 20% for in-stream adaptation and generalization performance by 7% on unseen data.
1. The paper introduces a well-motivated problem of moving beyond static, offline-trained OAD models to methods that can continuously learn and adapt in dynamic, egocentric environments.
2. The curation of the Ego-OAD benchmark provides a new, large-scale dataset for this specific task, derived from the existing Ego4D dataset.
1. The method/architecture novelty of the paper seems minimal. The proposed COAD method is a combination of pre-existing components: the orthogonal gradient technique is adopted from [1], and the non-uniform loss from [2]. The third component, state continuity, seems to be the default behavior of an RNN during inference. Can the authors emphasize their contributions in terms of architecture, if any?
2. The main results in Table 1 show that the COAD method improves out-of-stream generalization (26.0 mAP) but worsens in-stream adaptation (36.8 mAP) compared to the "w/o COAD" baseline (25.5 mAP and 39.0 mAP, respectively, with Ego pretraining). Can the authors discuss why generalization performance improves while in-stream performance drops?
[1]. Han, Tengda, et al. "Learning from Streaming Video with Orthogonal Gradients." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
[2]. An, Joungbin, et al. "Miniroad: Minimal rnn framework for online action detection." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
Suggestions:
The text in Section 5.3 refers to Table 1 for Epic-kitchens-100 results but these results are located in Table 2. |
Fully human-written |
|
Continuous Online Action Detection from Egocentric Videos |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper tackles continuous egocentric action detection and introduces Continuous Online Action Detection (COAD), emphasizing the need for models that operate on long, unsegmented video streams rather than isolated clips. To address gradient noise and distribution shifts across continuous sequences, the authors leveraged gradient orthogonalization, a Non-Uniform loss, and persistent RNN hidden states, achieving up to 6% mAP improvement on Ego4D and comparable gains on EPIC-KITCHENS. Through detailed ablations, they show that gradient orthogonalization and Non-Uniform loss are the most critical components, underscoring the difficulty of adapting exocentric models to egocentric settings.
The authors carefully designed ablation studies that show each of their modifications to a continuous (non-IID) training pipeline is effective (perhaps except for state-continuity)
The authors demonstrate a significant improvement over naive IID-pretraining across two egocentric datasets specifically for the continuous Online Action Detection task.
In Figures 7 and 8 from the Appendix, it appears that some categories (ex., Cutting/trimming grass, Using the phone, etc.) are both prevalent and long-lasting. If this is the case, wouldn’t reporting the per-frame mAP and Top-5-Recall, computed over all action classes, lead to an unbalanced view (by label) of the model’s performance? Wouldn’t it be fairer to weight each action class proportionally to the total number of seconds present in the out-of-stream subset? Please provide some clarifications.
Beyond merging temporal annotations from multiple reviewers and manually grouping semantically similar action classes, it is unclear what modifications the authors have made to the EGO4D MQ subset to state that their EGO-OAD dataset is a “curated” version.
What are the dimensions of the embedding layer used in the RNN? Are they identical to the output embedding dimensions of the visual backbone? If so, could this alignment explain any performance differences between TSN and TimeSFormer beyond the temporal modeling effects reported in Table 4?
Regarding the ablation in Table 3, what would happen if the hidden states were not continuous and the training procedure did not involve continuous adaptation? In other words, if we maintained an IID training setup for both the visual backbone and the RNN, but retained gradient orthogonalization and the Non-Uniform loss, would we still observe comparable gains in mAP and Top-5 Recall?
It remains unclear to the reviewer why the out-of-stream setting outperforms the in-stream setup by such a large margin (over 25% mAP) for the noun classification task in EPIC-KITCHENS. If this gap is attributed to the fine-grained nature of actions and annotations in EPIC-KITCHENS, why do we not observe a similar disparity for actions and verbs in Table 2?
Additionally, 1,177 videos (approximately 62%) were reserved for the in-stream subset in EGO-OAD, while only 202 videos (around 31%) were used for the in-stream subset in EPIC-KITCHENS. Is this discrepancy related to video length or another dataset-specific factor that should be clarified for the reader?
Finally, there are some other relevant methods not reported in Tables 1 and 2 for baseline comparison. For example, including some of the top-performing approaches from the Ego4D leaderboard (https://eval.ai/web/challenges/challenge-page/1626/leaderboard/3913) could provide readers with a broader context, highlighting that continuous online action detection remains a challenging problem even for methods specifically designed for egocentric OAD. |
Lightly AI-edited |
|
Continuous Online Action Detection from Egocentric Videos |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces continuous online action detection, a new task that combines traditional online action detection with learning from continuous video streams. The paper targets this setting in egocentric video, proposing a new Ego online action detection benchmark based on the Ego4D moment queries annotations. To tackle this task, the paper combines orthogonal gradients with state continuity and a non-uniform loss.
- The paper addresses a timely problem: enabling models to learn continuously from streaming egocentric video. This is well motivated and relevant for wearable AI systems.
***
- The proposed method does seem to provide benefits in both the in-stream and out-of-stream settings, at least on Top-5 recall (with little improvement on mAP)
***
- The Ego-4D based online action detection benchmark could be useful for future works
- Novelty and Positioning
- The proposed Continuous Online Action Detection (COAD) task does not appear fundamentally new. It largely combines elements of continual (or continuous) learning and online action detection, without providing a clear distinction from existing formulations.
- The paper does not clarify how much adaptation is actually required to extend existing continuous video learning methods (e.g., Carreira et al., 2024b; Han et al., 2025) to this setting, or whether these methods could serve as direct baselines.
***
- Comparison to Prior Work
- The evaluation lacks state-of-the-art baselines from both online action detection (e.g. An et al. 2023) and continuous video learning (e.g. Carreira et al. 2024b and Han et al. 2025). Without these comparisons, it is not possible to assess whether the proposed approach meaningfully advances the field.
- The main results (Table 1) instead focus more on the benefit of using egocentric data in training rather than the benefit of the proposed method itself.
- The comparison to prior work is particularly important because the method appears conceptually similar to earlier recurrent or memory-based OAD models, but these connections are not made explicit.
- The ablation study shows that state continuity has minimal impact, and that improvements come primarily from the non-uniform loss and orthogonal gradient components, both borrowed from prior work: the orthogonal gradient comes from the continuous video work of Han et al. 2025 and the non-uniform loss comes from the online action detection work of An et al. 2023.
***
- Dataset Contribution (Ego-OAD)
- While the dataset could be useful for future work, the contribution is relatively small since it is derived from the Ego4D moment queries task and annotations
***
- Evaluation Considerations
- There is no analysis of the computation–accuracy trade-off, which is critical for the proposed on-device learning setting. The cost of continuous adaptation compared to standard offline or inference-only OAD is not quantified.
- The concept of “out-of-stream” data is poorly defined. It is introduced briefly at the end of the method section and later described in the results as “held-out data reserved for evaluation only,” without a clear explanation of its role or relevance.
Task Definition
- In what precise way does Continuous Online Action Detection (COAD) differ from simply combining continual learning and online action detection?
- What assumptions make COAD a distinct and necessary formulation rather than an application of existing paradigms?
***
Relation to Prior Work
- How much modification of existing continuous video learning methods (e.g., Carreira et al., 2024b; Han et al., 2025) is required to adapt them to the OAD setting?
- Why does the paper not include these prior works as baselines or points of comparison?
- How does the proposed approach differ from earlier recurrent or memory-based OAD models (e.g. An et al. 2023) and why are these works not compared to?
***
Method and Analysis Details
- Why does state continuity have such a limited impact compared to the non-uniform loss and orthogonal gradient components?
- Are these components directly adopted from prior work, or have they been substantially modified for this setting?
***
Evaluation Considerations
- What is the computational overhead of continuous adaptation during inference compared to standard OAD models?
- How significant is the accuracy–efficiency trade-off for on-device deployment?
- How is "out-of-stream" data defined and used in the experiments, and how does it differ from standard held-out test data? |
Fully human-written |
|
Task-Aware Data Selection via Proxy-Label Enhanced Distribution Matching for LLM Finetuning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper tackles the problem of selecting high-quality, task-relevant instruction data for fine-tuning LLMs. The authors argue that existing data-selection methods focus only on aligning the input distribution X (i.e., instructions) with a target task, but neglect the joint distribution of (X,Y), where Y are task-specific labels, which are often unavailable in practice. They propose a pipeline that uses an LLM to infer proxy labels for a large unlabeled source corpus, then applies a proxy-label enhanced distribution matching method: first filtering out noisy out-of-distribution samples, then aligning the remaining data to the target joint distribution (X,Y), and finally selecting a subset. Experiments show that fine-tuning on the selected subset can achieve performance competitive with or superior to using the full dataset, thereby demonstrating that task-aware data selection is effective.
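As one concrete (hypothetical) reading of the final selection step, the incremental matching could look roughly like the greedy routine below; the paper's actual algorithm, scoring thresholds, and label granularity may well differ, so this is only meant to fix ideas.
```python
from collections import Counter, defaultdict

def match_label_distribution(pool, target_labels, budget):
    """Greedily pick `budget` samples from `pool` (a list of (sample, proxy_label)
    pairs, assumed already OOD-filtered) so that the selected label histogram
    tracks the target task's label histogram."""
    target_counts = Counter(target_labels)
    total = sum(target_counts.values())
    target_prop = {lab: c / total for lab, c in target_counts.items()}

    buckets = defaultdict(list)
    for sample, lab in pool:
        if lab in target_prop:                 # discard labels unseen in the target task
            buckets[lab].append(sample)

    selected, counts = [], Counter()
    while len(selected) < budget:
        # pick the label whose selected share falls furthest below its target share
        deficits = {lab: target_prop[lab] - counts[lab] / max(len(selected), 1)
                    for lab, items in buckets.items() if items}
        if not deficits:
            break                              # pool exhausted before reaching the budget
        lab = max(deficits, key=deficits.get)
        selected.append(buckets[lab].pop())
        counts[lab] += 1
    return selected
```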
Novel viewpoint: Using proxy labels and distribution matching for task‐aware rather than input‐only data selection is an interesting insight.
Practical relevance: Demonstrating that a small subset of data can yield competitive fine‐tuning results addresses the real challenge of data efficiency in LLM tuning.
Clear presentation of the pipeline and motivation, making the method relatively easy to understand and adopt.
Proxy labels may introduce noise, and the paper gives limited analysis of how label quality affects downstream performance.
Transparency of cost/efficiency: While the claim of “smaller subset yields full‐data performance” is compelling, more detailed breakdowns (hardware, runtime, selection cost) would improve trust.
Risk of selection bias: Since the method selects based on proxy‐label generated metrics and distribution matching, it may inadvertently favour certain types of samples (e.g., easier ones, more model‐familiar) and perhaps neglect rare or hard tasks; the paper does not deeply analyse this risk.
Engineering complexity & scalability: Generating proxy labels, filtering, and distribution matching add overhead; discussion of how this scales or works in resource‐limited environments is limited.
Can you report detailed metrics on proxy label quality: e.g., accuracy, noise rate, and how selection performance degrades (or improves) with differing label quality?
How robust is the method to different model architectures or sizes? If the fine-tune target model is quite different (size, family) from the one used to infer proxy labels, how does performance change?
Could you provide full cost breakdowns (selection cost + fine-tune cost + hardware) for your method and key baselines (input‐only selection, random sampling) under identical hardware?
Have you analysed the selected subset in terms of diversity: task types, difficulty levels, rare vs common categories, language styles? Is there any systematic bias in what gets selected vs discarded?
In truly low-data regimes (e.g., 1 K or 5 K samples) or for very niche tasks (domain‐specific), how does your method perform relative to full‐data or random sampling? |
Fully AI-generated |
|
Task-Aware Data Selection via Proxy-Label Enhanced Distribution Matching for LLM Finetuning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces a task-aware data selection pipeline for fine-tuning LLMs. The approach focuses on aligning both input features and task-specific labels to improve the relevance and quality of the selected instruction data. Since task-specific labels are often unavailable, the method uses LLMs to generate proxy labels for the target dataset, which are clustered and propagated to the source dataset. A two-stage selection process then filters out low-quality examples using LLM-based scoring and matches the label distribution through incremental selection.
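My reading of the clustering-and-propagation step, sketched below with assumed details (k = 100, majority-vote propagation, scikit-learn's KMeans, and embeddings already computed); the authors' exact procedure may differ:
```python
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

def propagate_proxy_labels(target_emb, target_labels, source_emb, k=100, seed=0):
    """Cluster the proxy-labelled target embeddings, give each cluster the majority
    proxy label of its members, then label every source example by its nearest centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(target_emb)
    cluster_label = {}
    for c in range(k):
        members = np.asarray(target_labels)[km.labels_ == c]
        cluster_label[c] = Counter(members).most_common(1)[0][0] if len(members) else None
    return [cluster_label[c] for c in km.predict(source_emb)]
```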
- The paper extends prior work by considering additional alignment dimensions during task-specific data selection, including task, topic, style, and audience.
- It offers an information-theoretic explanation that provides a principled understanding of the proposed data selection method and prior works.
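For reference, the standard chain-rule decomposition of the KL divergence already makes the gap between input-only and joint matching explicit (the paper's own information-theoretic argument may be phrased differently):
$$\mathrm{KL}\big(P_T(X,Y)\,\|\,P_S(X,Y)\big) = \mathrm{KL}\big(P_T(X)\,\|\,P_S(X)\big) + \mathbb{E}_{x \sim P_T(X)}\Big[\mathrm{KL}\big(P_T(Y \mid X = x)\,\|\,P_S(Y \mid X = x)\big)\Big],$$
so matching the input marginal alone leaves the conditional term, i.e. the task-semantic part captured by the proxy labels, uncontrolled.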
- The method relies heavily on LLM-based judgment, but does not evaluate the robustness or reliability. It remains unclear how accurate the generated labels are and how consistent or calibrated the LLM-assigned quality scores are.
- The approach introduces several hyper-parameters and control knobs (e.g., k in k-means clustering, minimum score thresholds, label alignment choices) without providing clear guidance on how to tune them. According to the experiments, the results are sensitive to them.
- The paper does not provide any theoretical guarantees for the proposed distribution alignment algorithm. It is unclear whether this algorithm will converge or output a better match.
- The experimental results lack error bars, making it difficult to assess statistical significance or robustness of the reported improvements.
- Minor presentation issues: in Figure 2, the text and numbers are too small and overlap, which affects readability.
- Why did you choose 100 as the number of centroids in the k-means clustering?
- In label propagation, how exactly are the source examples embedded? Are they also labelled by the LLM using the same process as the target examples?
- In "Prompt Template for Scoring Source Samples", the connection between the scoring instructions and the provided labels is unclear. What purpose do the labels serve in this context?
- Can we jointly match multiple labels using this method? |
Lightly AI-edited |
|
Task-Aware Data Selection via Proxy-Label Enhanced Distribution Matching for LLM Finetuning |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper reformulates task-specific data selection for LLM finetuning, arguing that prevailing methods, which only align the distribution of inputs X, are insufficient. The authors' central claim is that selection must instead align the joint distribution of inputs and labels $(X, Y)$ to capture true task relevance.
To achieve this, the paper proposes a novel four-stage pipeline that uses LLM-generated "proxy labels" since true labels are unavailable.
Experiments show that finetuning a LLaMA-3.1-8B model on a 10K subset selected with this method achieves performance competitive with or superior to state-of-the-art baselines and a model trained on the full 300K-sample pool.
1. This paper argues that task-specific data selection should not be based on aligning inputs ($X$) alone, which is the common practice, but on aligning the joint distribution of inputs and labels ($X, Y$). This is a more accurate and semantically meaningful way to define task relevance.
2. The paper introduces a novel four-stage pipeline that operationalizes its new formulation. Since target labels ($Y$) are typically unavailable, it uses an LLM to generate structured "proxy labels" (Task, Topic, Style, Audience). This provides a concrete and practical solution to the challenge of joint distribution matching.
1. The proposed 4-step pipeline is highly complex. It requires two distinct LLM-based steps (proxy-label generation and OOD scoring), an embedding model, k-means clustering, and an incremental sampling algorithm. This complexity introduces numerous hyperparameters that are not thoroughly justified, such as the number of clusters ($k=100$), the OOD score threshold, and the choice of which label field to align (Task, Topic, Style, or Audience), suggesting the method requires extensive, task-specific tuning to work well.
2. The ablation study in Table 5 does not adequately isolate the core contribution of the paper. The paper's main claim is that aligning the joint distribution $P(X, Y)$ is superior to aligning the marginal distribution $P(X)$. However, the ablation study only compares the full pipeline against removing its own components (OOD filtering or incremental sampling). A crucial missing baseline would be to apply the exact same clustering and incremental sampling algorithm (Steps 2 and 4) directly to the input embeddings ($X$) instead of the proxy-label embeddings ($Y$). Without this direct comparison, it is unclear if the performance gains come from the novel $P(X, Y)$ alignment or simply from the clustering/sampling algorithm itself.
1. Your results in Table 3 demonstrate that the choice of which proxy label to align (e.g., "Align_task", "Align_topic", "Align_style") is a critical hyperparameter, as the best-performing field changes for each benchmark. For a practitioner applying your method to a new task, how would you recommend they determine the optimal field to align? Does this not require them to run multiple full finetuning experiments for each field, which would undermine the method's goal of data efficiency?
2. Your core claim is that aligning the joint distribution $P(X, Y)$ is superior to aligning the marginal input distribution $P(X)$. However, your ablation study in Table 5 only compares your full pipeline against versions with its own components (OOD filtering or incremental sampling) removed. To truly isolate the benefit of using proxy labels ($Y$), could you provide results for a baseline that applies your exact same pipeline (clustering, OOD filtering, and incremental sampling) but operates directly on the input instruction embeddings ($X$) instead of the proxy-label embeddings ($Y$)? |
Fully AI-generated |
|
Task-Aware Data Selection via Proxy-Label Enhanced Distribution Matching for LLM Finetuning |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a proxy-label-based data selection method for instruction-tuning LLMs, aiming to select source data that best matches the target task by jointly considering instruction text and task-semantic proxy labels. The method targets limitations of prior work that only align input distributions without task semantics. Experiments across multiple benchmarks show improvements, though the gains vary by semantic field and threshold settings.
1. An interesting idea of jointly aligning instructional text and task-semantic proxy labels.
2. Comprehensive experimental coverage across multiple benchmarks and semantic dimensions, demonstrating systematic evaluation of the proposed approach.
1. The performance gains are inconsistent and not uniformly strong across benchmarks (Table 3). There is no single configuration that consistently outperforms others: for example, min-score ≥7 achieves two SOTA results, and min-score ≥6 also yields two SOTA results. Additionally, different semantic fields produce varying best configurations (e.g., TruthfulQA prefers “audience” under min-score ≥6 but “style” under ≥7), making it unclear how to select the semantic field and threshold in a principled manner. The authors also acknowledge inconsistent alignment effectiveness across fields (row 421), reinforcing this concern.
2. Important retrieval-style baselines such as representation-based RDS [1,2] and BM25 are missing, making it difficult to assess how much benefit comes from semantic distribution matching versus standard retrieval approaches.
3. The approach is modular and largely post-hoc rather than jointly optimized, which may limit conceptual novelty. The contribution appears to lie more in the combination of existing components.
[1] Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In *Proceedings of the IEEE conference on computer vision and pattern recognition*, pp. 586–595, 2018.
[2] Ivison, H., Zhang, M., Brahman, F., Koh, P. W., & Dasigi, P. (2025). *Large-Scale Data Selection for Instruction Tuning*. arXiv preprint arXiv:2503.01807.
1. In Table 5, why does removing OOD filtering produce such a large drop on TruthfulQA?
2. Why is a separate OOD-filtering step required? Since Step 2 already computes similarity for anchor propagation, could OOD samples be filtered via a similarity threshold rather than a second LLM-based scoring step?
3. As a baseline or further exploration, what would happen if semantic-field information were integrated into existing data-selection approaches (e.g., adding semantic attributes to gradient-based LESS or representation-based RDS)? Would this mitigate the issue raised in rows 49-53 and unify the benefits without the need for proxy labeling and field-wise tuning? |
Lightly AI-edited |
|
Task-Aware Data Selection via Proxy-Label Enhanced Distribution Matching for LLM Finetuning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 1: You are unable to assess this paper and have alerted the ACs to seek an opinion from different reviewers. |
This paper introduces a proxy-label enhanced joint distribution matching approach for task-specific data selection in large language model fine-tuning. The key idea is to let the model generate task-related proxy labels so that both inputs and outputs are considered jointly when aligning distributions, rather than focusing only on input similarity.
As a researcher specializing in human-computer interaction, this study clearly falls outside my area of expertise. Following the Area Chair’s instructions, I have selected “1: You are unable to assess this paper and have alerted the ACs to seek an opinion from different reviewers” and submitted my review accordingly. Therefore, I will not be participating in the rebuttal stage for this manuscript. Thank you for your understanding.
1. The paper reformulates task-specific data selection as a joint distribution alignment problem, moving beyond traditional input-only approaches. The introduction of proxy labels adds a fresh perspective to modeling task relevance.
2. It proposes a complete and coherent pipeline, from proxy-label generation and clustering to noise filtering and incremental sampling, with clear logical flow and information-theoretic grounding.
3. Experimental results on multiple mainstream benchmarks, such as MMLU, TruthfulQA, and GSM8K, show stable or superior performance compared with SOTA methods like LESS and TSDS, especially under low-data conditions.
1. Although the paper uses LLMs to generate proxy labels, the analysis of their consistency, bias, and noise propagation is rather superficial and lacks quantitative evaluation or comparison with human annotations.
2. The multi-stage pipeline, involving annotation, clustering, filtering, and sampling, lacks detailed efficiency analysis on large-scale corpora. Its scalability and real-world deployment cost remain unclear.
3. The experiments are limited to English and general-purpose LLMs like LLaMA and Mistral. There is little discussion on adaptation to multilingual or multimodal tasks, and the explanation for performance drop on TyDiQA is vague.
1. Can the authors provide an evaluation of the consistency or confidence of LLM-generated proxy labels compared with human annotations to verify label quality?
2. Can the proposed method generalize to cross-domain or cross-lingual scenarios, such as transferring from legal to medical tasks? Would new proxy labels be required in such cases?
3. Are the hyperparameters and target set sizes for LESS and TSDS exactly matched to those used in this paper? If not, this should be clearly stated to ensure fair comparison.
4. Have the authors analyzed the sensitivity of key parameters, such as the minimum score threshold or the number of clusters k? Without this, reproducibility and transferability could be limited.
5. The ablation only examines the combined effect of filtering and sampling. It would be helpful to further analyze how each stage contributes to different task types, such as reasoning, factual, or comprehension tasks. |
Moderately AI-edited |
|
Partner-Aware Hierarchical Skill Discovery for Robust Human-AI Collaboration |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces PASD, a hierarchical RL method for skill discovery/learning in human-AI collaboration. The authors argue that previous skill learning methods are agent-centric and fail to capture information about the partner-conditioned dynamics in the multi-agent cooperative setting. They then propose adding a contrastive-loss-like shaped reward term to the objectives of both the high- and low-level policies to maximize mutual information between trajectory segments sampled from the same skill. They evaluate PASD on Overcooked-AI and compare it with FCP, HiPT and DIAYN.
- The method presented is well motivated and presents a principled approach to skill learning in the Multi-Agent Cooperative/ZSC setting.
- The paper is well written and structured.
- Unfortunately, the experimental section of the paper is fairly weak at the moment
- Lack of evaluations against real human partners. This is a major omission in the experimental section considering that the paper is proposing a method for Human-AI Collaboration. Furthermore, previous works (Carroll et al., Strouse et al. and Loo et al.) all conducted experiments with real human partners.
- Limited evaluation partners. The authors evaluate against only one type of self-play (SP) partner population (TrajeDi plus past checkpoints), when several other diverse partner-generation methods exist (MEP, CoMeDi and HSP).
- Lack of qualitative analysis of skills. Though the authors provide some analysis of the learned skills in terms of skill-switching frequency and overall entropy, it is unclear whether the skills learnt by PASD show any significant behavioural differences. It would be interesting to see some visualisations of the different skills learnt by PASD in Overcooked.
- Minor typo Line 227 “collectoing” → “collecting”
- In Table 2 what does “CoSkill” refer to? Is that supposed to be PASD?
- What do the skills learned by PASD look like in Overcooked? |
Fully human-written |
|
Partner-Aware Hierarchical Skill Discovery for Robust Human-AI Collaboration |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work presents an HRL algorithm that is aware of other agents in cooperative MARL settings by introducing an intrinsic reward based on a contrastive metric to prevent skill collapse. The algorithm is evaluated on Overcooked to highlight its strength over prior HRL work.
- The intrinsic reward is well motivated and sound.
- Overcooked (v1) is a bad evaluation environment for this paper (see "OvercookedV2: Rethinking Overcooked for Zero-Shot Coordination").
- The results reported in this paper underperform the naive IPPO baselines (and state-augmented IPPO baseline) reported in the OvercookedV2 paper (for Overcooked-v1)
- OvercookedV2 already demonstrates that there is no zero-shot coordination challenge in Overcooked aside from state coverage
- Since Overcooked-v1 is fully observable, an LSTM is unnecessary
- Hierarchical RL in general is unnecessary for Overcooked, since it can be quickly solved with standard IPPO
- OvercookedV2 would be a better environment to validate your results on, but I still have the concern that HRL unnecessarily complicates the learning process.
Why is the intrinsic reward also applied for training low-level policies? |
Fully human-written |
|
Partner-Aware Hierarchical Skill Discovery for Robust Human-AI Collaboration |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper is motivated by the challenge in Human-AI collaboration where traditional Hierarchical Reinforcement Learning (HRL) agents fail to adapt to diverse partners due to agent-centric skill discovery, which often leads to "shortcut learning." To address this, the authors introduce Partner-Aware Skill Discovery, a DHRL framework that learns skills conditioned on partner behavior. They achieve this by proposing a novel contrastive intrinsic reward to align skill representations for similar partners while maintaining discriminability for diverse strategies. Evaluating PASD in the Overcooked-AI environment with diverse self-play and human proxy partners, the authors found that their method consistently outperforms existing population-based and hierarchical baselines, demonstrating superior generalization and robustness across a wide range of collaborator behaviors.
* PASD introduces a novel contrastive intrinsic reward that conditions skill learning on partner behavior that is quite interesting
* Generalization is validated using a diverse partner population across various skills
* Analyses of mean skill switches and policy entropy provided nice qualitative insight into learned adaptive behavior
* Comparisons against established cooperative baselines, specifically Cross Environment Cooperation (CEC) and E3T, are absent and necessary for full validation.
* The paper lacks in-depth analysis of error modes and failure cases for baselines (FCP, HiPT) versus PASD, which is needed to fully justify the claims of robust coordination and mitigation of shortcut learning.
* Section 4.2 is unclear and could benefit from an explanation more grounded in the context of partner-adaptive dynamics
* The approach relies on sampling from a predefined partner population; the paper should briefly discuss the implications for zero-shot generalization to truly novel human partners in scaled up settings beyond Overcooked
- Can the authors offer a more detailed, qualitative analysis of failure cases? Specifically, demonstrate instances where FCP or HiPT fall victim to shortcut learning or coordination failure, and contrast these with how PASD's partner-aware skills resolve the issue.
- What are the practical implications for zero-shot generalization? Can the authors speculate or provide preliminary results on performance when paired with a truly novel, unmodeled human partner policy in realistic settings beyond Overcooked?
- How does the assumption guarantee that the InfoNCE objective captures meaningful partner-conditioned information rather than merely maximizing skill-to-state diversity? |
Lightly AI-edited |
|
Partner-Aware Hierarchical Skill Discovery for Robust Human-AI Collaboration |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a novel method for learning diverse, partner-consistent skills for a hierarchical RL algorithm set within a MARL problem setting. Their method for learning diverse skills extends the mutual-information family of diverse skill learning strategies to the MARL setting, where they maximize a lower bound on the mutual information between skills and sub-trajectories. This encourages consistent representations (and presumably behaviors?) across partner interactions, i.e. skills that are discriminative across partners but consistent for partners with similar behaviors.
They evaluate their method in the standard Overcooked-AI environment and show superior performance against Fictitious Co-Play (a method that transfers to humans without human data), HiPT (a hierarchical MARL method), and DIAYN. They show improved performance when adapting to a policy cloned from human data, and show that their method switches skills more than HiPT, the other hierarchical MARL algorithm.
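For readers less familiar with this family of objectives, a sketch of the generic InfoNCE construction the summary alludes to may help (my notation and framing, not necessarily the paper's). With $g$ an embedding of a sub-trajectory $\tau$, $\tau^{+}$ a sub-trajectory drawn from the same skill, $N$ candidates including negatives from other skills, and temperature $T$:

$$
\mathcal{L}_{\text{NCE}} = -\,\mathbb{E}\left[\log \frac{\exp\big(\mathrm{sim}(g(\tau),\, g(\tau^{+}))/T\big)}{\sum_{j=1}^{N}\exp\big(\mathrm{sim}(g(\tau),\, g(\tau_{j}))/T\big)}\right],
\qquad
I(\tau;\tau^{+}) \;\ge\; \log N - \mathcal{L}_{\text{NCE}}.
$$

Since the paired views share only the skill (and the partner context that produced them), minimizing $\mathcal{L}_{\text{NCE}}$ tightens a lower bound on the mutual information the authors describe, which is presumably how the shaped reward encourages skill-consistent, partner-aware representations.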
- The writing is relatively clear
- The idea of applying mutual-information-based skill discovery to the multi-agent setting is really interesting. It is notable that their method, in principle, allows for more diverse skills to be learned in a hierarchical MARL setting.
- They show good results in the overcooked-AI domain
- They show good results when transferring to a behavior clone from human data
- Right now, I think "ultimately supporting adaptive human-AI coordination" is over-claiming, since you don't actually test transfer/adaptation to humans
- Why not compare against human players? The original Overcooked codebase [1] has code for running human experiments; why don't you use that? [2] also studies transfer to human players, so why not use that? Behaviorally cloned policies are rarely as adaptive as ones learned online (without very large, diverse training sets). Thus, it's hard to imagine that Table 2 is representative of transfer to human partners.
- "We assume that distinct sub-trajectory views of the same skill encode a consistent partner-adaptive strategy" - can you motivate this? One agent is adaptive to another agent, so based on what part of the task the other agent is doing, I could see that a sub-trajectory for the same ostensible skill would encode a different partner-adaptive strategy, since its adapting to the partner. Do you demonstrate this somehow? Figure 4 shows more skill switches by PASD tha HIPT. Is that evidence for this? If so, why? If not, what is your evidence for this?
- Regardless, by construction, I can see why different sub-trajectories of the same skill will encode the same behavior (regardless of the partner) because of how mutual-information-based RL methods work. Maybe this is what your method is exploiting for skill learning?
- The method is not that easy to read and understand given all of the indexing. This is a challenge in both HRL and MARL settings generally, which probably compounds for your method. A summary figure would be really helpful.
- your related work should discuss [2]
- Table 1 and your standard deviations are a bit deceptive. Bolding $101.3 \pm 8.5$ over $96.0 \pm 1.3$ is misleading, since those intervals clearly overlap. Your first sentence of this paper is "Developing intelligent agents that can coordinate effectively with humans and other novel partners has long been a central challenge in multi-agent reinforcement learning". Given this motivation, shouldn't you care about generalization to competent partners? Your evaluation doesn't seem like the best one given your motivation.
- You say "Analysis of learned skill representations shows that PASD adapts effectively to diverse partner behaviors, highlighting its robustness in human-AI collaboration." What analysis shows this? Figure 3? This shows that your method switches skills more (which doesn't indicate being more adaptive) and that your method maintains a higher entropy of switching.
- The size of the plots (e.g. Figure 3) makes them really hard to read.
[1] https://github.com/HumanCompatibleAI/overcooked_ai/tree/master/src/overcooked_demo
[2] Cross-environment Cooperation Enables Zero-shot Multi-agent Coordination
- Not sure that the Overcooked domain is sufficiently rich for an HRL method. What kinds of skills are you learning? You show no demonstration or visualization of the kinds of skills your method is learning. [1] suggests that the space of skills for coordination is quite small in Overcooked environments.
- Why should we care about how well the method performs against partners of different skill levels or diversity? Even if we do, why take the mean across all of them?
[1] Cross-environment Cooperation Enables Zero-shot Multi-agent Coordination |
Fully human-written |
|
WAFER-QA: Evaluating Vulnerabilities of Agentic Workflows with Agent-as-Judge |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper investigates the vulnerabilities of agentic LLM workflows, specifically focusing on systems that use a "judge" agent to provide feedback to a "generator" agent. The authors propose a two-dimensional framework for categorizing judge behavior based on intent (constructive to malicious) and knowledge (parametric-only to retrieval-augmented). The core contribution is a new benchmark, WAFER-QA, which evaluates an agent's robustness against deceptive feedback that is grounded in retrieved web evidence supporting plausible but incorrect answers. Experiments with several SOTA LLMs (e.g., GPT-4o, o3-mini, o4-mini) demonstrate that these models are highly susceptible to deceptive feedback, especially when it is backed by factual-sounding (even if fabricated) or genuinely retrieved web evidence.
The paper tackles a critical and timely issue. As feedback-based agentic workflows become more common, understanding their vulnerabilities to unreliable or malicious feedback is essential. The two-dimensional taxonomy of intent and knowledge is a key strength, providing a clear and extensible framework for analyzing and generating diverse judge behaviors. The paper provides some insights, such as the distinction in robustness between reasoning-focused models (o4-mini) and other models (GPT-4o, Qwen), and the finding that models struggle to acknowledge ambiguity even when prompted.
I think the WAFER-QA construction method is clever, but its dependence on finding questions that already have some plausible web evidence for an alternative answer might make it less general. The paper even mentions that this approach is “infeasible for factually well-settled queries.” That makes me wonder how representative this benchmark really is of the kinds of problems agents might face in the real world—especially those that have one clear, unambiguous truth.
The main experiments use the same model as both the generator and the judge. That’s a common setup, but it doesn’t quite match real-world situations where agents built by different teams or using different base models interact. The appendix briefly looks at an asymmetric setup (a stronger judge and a weaker generator), but I wish there were a deeper exploration of how heterogeneous agents—ones with different knowledge bases—would behave.
The paper mainly focuses on showing the vulnerability. In Section 6.2, the authors try a “moderator” agent as a possible fix, but it only works partly, and that part of the analysis feels underdeveloped. Overall, the paper does a good job of highlighting the problem, but it doesn’t go very far in offering solid solutions, other than noting that reasoning-trained models tend to be more resilient.
see weakness. |
Lightly AI-edited |
|
WAFER-QA: Evaluating Vulnerabilities of Agentic Workflows with Agent-as-Judge |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 3: good
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The authors present a dataset for evaluating the robustness of a system to a deceptive or hallucinating Agent-as-a-Judge. They find that existing systems are very vulnerable to such problematic judges.
- This is a very timely topic. With the growth in these kinds of judges and their unreliability, understanding how much a system can resist an unreliable judge is growing increasingly important.
- In line with that, a new benchmark is always appreciated. We're seeing models gaming benchmarks all the time now, so any new metrics we can use to help correct for this are very useful right now.
- This paper is all over the place. It seems like the two axes they are looking at are quite unrelated. So then, why these two axes? Is there some pattern in the literature? I would have expected a strong reason why whether a judge has database access and whether a judge is deceptive or helpful are the two axes used here. There seem to be many other characteristics a judge might have. I feel like the database access of a judge seems completely unrelated to the title of the work. The authors should focus on one and include the latter as a secondary consideration (this could be accomplished quite easily, actually).
- On that note, the title is bad. WAFER-QA is a dataset, so why does the title make it sound like a method? Also, Agent-as-Judge seems grammatically incorrect. I believe it should be Agent-as-a-Judge, like with LLM-as-a-Judge.
- They have Agent-as-Judge in their title and then cite a paper that includes Agent-as-a-Judge in the title. But they only mention it offhandedly. From my understanding, this paper is proposing an agentic judge---a direct instantiation of what the Agent-as-a-Judge work proposes, but then only mentions it in passing as a "constructive judge and/or adversarial judge without web access"? If this is something I noticed, I worry what this says about the other papers they cite that I haven't read recently.
- There are a few statements that seem quite odd. For example: `For example, in response to the question “What is the capital of France in 2025?”, no credible web evidence exists to support any answer other than Paris, making web-based retrieval infeasible for factually well-settled queries.` But there are most definitely many things on the web that say otherwise (even though it may be incorrect). Do the authors mean to say that some questions will have an overwhelming quantity of web evidence leading to a particular conclusion and much less evidence proposing an alternative, as opposed to other facts where the evidence is more ambiguous (such as relating to a certain cooking technique being superior to another)? If so, they should be clearer about this and defend it. This also means "plausible supporting evidence" needs a more rigorous definition. Otherwise, it is very ambiguous what was or was not included.
- I think the above makes this not particularly useful to the field without quite a bit of cleanup. It seems it needs to focus on the deceptive axis and include the web usage as a side quality being evaluated (or vice versa). Otherwise, it's trying to do two things at once.
- I understand that the above is quite challenging to meet in the timeline ICLR gives. However, if the above could all be addressed to a reasonable degree, I'm not opposed to changing my recommendation as I see the potential here.
See Weaknesses. |
Fully human-written |
|
WAFER-QA: Evaluating Vulnerabilities of Agentic Workflows with Agent-as-Judge |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper focuses on evaluating LLMs ability to provide feedback to other models in a debate-style setting. In particular, the paper proposes a new benchmark, WAFER-QA, that allows evaluating judges with web-search tool-use access in adversarial settings. Their benchmark builds on a framework introduced by the authors that aims to disentangle judge intent (constructive/hypercritical/malicious) from judge knowledge (parametric/grounded). The authors report the results of models on multiple question-answering benchmarks with the various judge setups. The authors analyse a well-selected set of models.
1. **Well-written.** This paper is well and clearly written, overall pleasant to read through. The tables and figures are well constructed and easy to read.
2. **Well-selected experimental data.** Given the QA setting, the authors select a number of well-known and widely used datasets as the basis for their new benchmark.
3. **Extensive discussion of experimental results.** The paper includes an extensive discussion of their diverse experimental results. It's a nice read.
1. **Lack of mean/variance statistics across re-runs.** Currently the experiments appear to use a single set of observations for each metric. It would be interesting to see how high the variance of each of these metrics is; even just 3 seeds would provide quite a bit of additional context here. In particular, since most metrics are based on multi-turn interactions, reporting variance alongside a point estimate would be very helpful. Also, sampling/inference parameters appear not to be discussed (e.g. temperature).
2. **Focus on simple Q&A tasks.** The paper currently focuses on simple question answering tasks. Whilst I see the value of keeping the scenarios simple for practical reasons, it means that the results may not extend to settings closer to realistic real-world use (complex agentic tasks, such as coding or web interactions).
3. **There could be stronger/clearer motivation for the threat scenario.** I understand that judges can play an important role in debate setups, but it remains somewhat unclear how exactly a malicious judge might arise or enter the picture. This part of the motivating scenario could be further explored/discussed. Currently, it is simply assumed that such judges exist, and the reader is left to motivate this for themselves.
4. **Judge knowledge dimension not considered in benchmark.** As far as I read it, the later part of the paper (introducing the WaferQA benchmark) appears to focus far more on the judge intent perspective rather than the judge knowledge part. None of the tables or figures in the main body actually vary the judge knowledge dimension. Nevertheless, this knowledge dimension is one of the two introduced earlier - this makes the experiments and earlier sections feel disconnected.
Minor (no impact on score, no need to respond):
1. \citet citations are sometimes used where \citep would be appropriate (e.g. L34, L40). This makes some sections more difficult to read. Notably, the related work section is not affected by this issue.
2. L108: Table 1's font feels unnecessarily small
3. L249: citing "Team", though this should be "Gemma Team". "Team" is not a last name here, confusing and needs to be adjusted in bib file, e.g. by adding curly brackets {}.
4. L232: terms contextual vs non-contextual QA should be clarified/defined
1. L343: How exactly is the acknowledgement detected and the corresponding acknowledgement rate computed? How do you detect whether a model "explicitly signals the presence of ambiguity"? Is this LLM-as-a-Judge? And, are the tasks up-front formulated in such a way that ambiguity is allowed?
2. Do you have an intuition how robust the benchmarks scores are under different random seeds (related to weakness above)?
3. Since you use such well-known benchmarks, do you think that the results might change if the underlying dataset was more "fresh", less likely to have (indirectly) leaked into the models' training data?
4. Would you be able to clarify how the knowledge dimension relates to the experiments? If it does not, would you be able to clarify why it is necessary to discuss in the taxonomy section (3). |
Fully human-written |
|
WAFER-QA: Evaluating Vulnerabilities of Agentic Workflows with Agent-as-Judge |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper investigates vulnerabilities in feedback-based agentic workflows, where one or more LLMs act as judges providing critiques or evaluations to other models. The authors introduce a two-dimensional framework that characterizes judge behavior along intent (constructive → malicious) and knowledge access (parametric → grounded). Building on this, they propose WAFER-QA, a benchmark that augments QA datasets with web-retrieved “alternative” evidence supporting incorrect but plausible answers. Through experiments across multiple LLMs and workflows (generator–evaluator, round-table discussion, moderator setups), the authors show that even the strongest models (e.g., GPT-4o, o4-mini) are highly susceptible to deceptive or adversarial feedback, with large accuracy drops. They further analyze multi-round feedback and multi-agent discussions, highlighting oscillatory behaviors and partial robustness gains from reasoning models or moderator agents.
The experimental design is comprehensive and systematic, with evaluations across different models, judge types, and task settings. The results are consistent and well-presented, revealing interesting behavioral distinctions between reasoning and non-reasoning models. The multi-round and multi-agent analyses are particularly strong, showing that reasoning models exhibit greater stability across iterations, and that additional normal agents in a discussion can partially mitigate the influence of deceptive participants. Some good points -
- Timely and relevant topic addressing reliability in multi-agent workflows.
- Comprehensive experimental coverage: parametric vs. grounded judges, reasoning vs. non-reasoning models, and single- vs. multi-agent setups.
- Sections 5.3 and 6 are particularly valuable: they provide agent-specific insights showing (a) multi-agent interactions can dampen deception, and (b) reasoning models are more stable across multi-round feedback.
- Clear, reproducible presentation of results with consistent quantitative reporting.
- Limited novelty : the main findings extend well-known LLM vulnerabilities (knowledge conflict, sycophancy, adversarial context) into an agentic framing, without introducing new underlying mechanisms.
- Benchmark reliability : the WAFER-QA construction lacks clear human validation that “alternative” answers are actually incorrect. Some questions may have multiple valid interpretations, making the measured vulnerability ambiguous.
- No confidence or calibration analysis : in agentic settings, an agent’s susceptibility to external critique should strongly depend on its internal confidence. If confident generators resist incorrect feedback while uncertain ones flip, that provides causal understanding and an actionable defense. However, the paper never quantifies this relationship or reports how calibration correlates with robustness (a minimal sketch of such an analysis follows below).
- Shallow multi-agent analysis : Section 5.3 is promising but largely descriptive. There is no deeper causal study of who influences whom, how consensus evolves, or whether the majority of normal agents consistently stabilize decisions. Understanding these dynamics is central to robustness in collaborative agents.
- No mechanistic explanation of vulnerability : the paper convincingly shows that models ‘flip’ under persuasive judges but doesn’t probe why. It is unclear whether failures stem from surface-level wording overlap, over-trust in cited evidence, or genuine reasoning errors.
Overall, while empirically strong, the paper stops short of deeper agentic-level insights that could make the results more explanatory or predictive.
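To illustrate the kind of calibration analysis mentioned above, here is a minimal sketch under my own assumptions about what could be logged per question (the record fields below are hypothetical, not something the paper reports): bin questions by the generator's pre-feedback confidence and measure how often an initially correct answer flips after deceptive feedback.

```python
from collections import defaultdict

# Hypothetical per-question records: the generator's pre-feedback confidence,
# whether its initial answer was correct, and whether it flipped after the
# (deceptive) judge feedback. None of these field names come from the paper.
records = [
    {"confidence": 0.92, "initially_correct": True, "flipped": False},
    {"confidence": 0.41, "initially_correct": True, "flipped": True},
    {"confidence": 0.67, "initially_correct": False, "flipped": True},
    # ... one entry per benchmark question
]

def flip_rate_by_confidence(records, n_bins=5):
    """Flip rate among initially correct answers, binned by stated confidence."""
    bins = defaultdict(lambda: [0, 0])  # bin index -> [num_flips, num_total]
    for r in records:
        if not r["initially_correct"]:
            continue  # only flips away from a correct answer count as harmful
        b = min(int(r["confidence"] * n_bins), n_bins - 1)
        bins[b][1] += 1
        if r["flipped"]:
            bins[b][0] += 1
    return {b: flips / total for b, (flips, total) in sorted(bins.items())}

print(flip_rate_by_confidence(records))  # e.g. {2: 1.0, 4: 0.0} on the toy data
```

A downward-sloping curve (high confidence, low flip rate) would support the calibration-as-defense argument, while a flat curve would suggest the flips are driven by surface features of the feedback instead.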
1. Benchmark reliability:
How do you make sure that the “alternative answers” in WAFER-QA are actually wrong? Some questions might have more than one valid answer. Did you check this with human annotation or any validation process?
2. Multi-agent dynamics:
The multi-agent study (Section 5.3) is interesting but mostly descriptive. Could you show which agents tend to influence others, how agreement is reached, or whether having more normal agents always helps stabilize the outcome?
3. Why models flip:
It would be helpful to understand why the models change their answers after receiving feedback. Is it because of wording overlap, trust in citations, or reasoning failures? Some controlled ablations could make this clearer. |
Fully AI-generated |
|
Codified Finite-state Machines for Role-playing |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes CFSM and CPFSM for role-playing tasks and games with LLMs to ensure behavioral coherence. The paper contributes a framework that defines character transitions and extends it to a probabilistic version offering multi-modal transitions. The authors present the limitations of the compared LLMs in Section 4, while also showing that CFSM is superior to the compared baselines. They also discuss the setup and case studies considered for the real-plot experiment. The number of models used for the experiments is good, considering the innovation is a technique built on top of LLMs rather than the performance of the LLMs themselves.
1) The described methods work on the various artifacts mentioned in the results, while demonstrating strong performance against the baselines.
2) The paper reports the computational complexity of both methods and shows that the proposed methods offer faster and more efficient codification.
3) The paper includes a very detailed analysis section covering synthetic and real-plot experiments, is tested with multiple LLM models and techniques, and spans various kinds of plots and scenes from various genres.
1) The “preliminary and denotation” section introduces the necessary terminology but lacks examples and a lucid explanation, which would be really helpful for readers and a general audience unaware of such methods.
2) The multi-modality and reactions of CPFSM lack depth and could be explained more clearly.
3) The real-plot experiment could briefly walk through one of the artifacts used in the work as a running example; not having this makes it less intuitive for new readers.
Detailed suggestions:
1) The full form of NLI should be given when it is first referenced.
2) Best@K can be explained.
3) Figure 4 Caption: There is no space between CFSM&CPFSM.
4) Line 109 - evolve should be “evolves”.
5) Line 223 - w_i,j “is” then normalized.
6) Line 225 - in binary_questions, it is unclear how the logits w_i,j are derived from the “Yes/no/unknown” question; this could be explained in more detail.
7) In Tables 2, 3, 4, and 6, the units of measurement are missing; it would be reader-friendly to add them.
8) In the baselines, #Character is listed for each show but is never explained or referenced. For example, Haruhi lists 5 characters and AGOT 11; a brief description of at least one of the artifacts, such as JOJO, covering its characters, the context of the scene, and the profiles, would help readers analyze the results better.
9) The multi-modality and transitions of CPFSM could be explained in more detail.
Fully human-written |
|
Codified Finite-state Machines for Role-playing |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes Codified Finite-State Machines (CFSM), a framework that enhances character consistency in LLM role-playing by automatically extracting and codifying character states from textual profiles. CFSMs transform character descriptions into explicit FSM structures using LLM-generated logic, grounding behavior in interpretable state transitions. A probabilistic extension, CPFSM, further models transition uncertainty by maintaining distributions over states. The paper evaluates the approach in both synthetic state modeling tasks and large-scale narrative role-play (via the Fandom Benchmark) and demonstrates improved consistency, interpretability, and transition traceability compared to prompt-based methods. Ablation and cost analyses show that CFSM/CPFSM are scalable and effective, offering a hybrid symbolic–neural approach to stateful role-play generation.
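To make the “codified” idea concrete, below is a minimal illustrative sketch of what executable state-transition logic of this kind might look like (all state names, rules, and probabilities are my own invention for illustration, not taken from the paper; in the paper such logic would be generated automatically by an LLM from a character profile):

```python
STATES = ["calm", "suspicious", "hostile"]

def transition(state: str, event: str) -> str:
    """CFSM-style deterministic rule: current state + scene event -> next state."""
    if state == "calm" and "insult" in event:
        return "suspicious"
    if state == "suspicious" and "threat" in event:
        return "hostile"
    if "apology" in event:
        return "calm"
    return state  # no rule fires: remain in the current state

def update_belief(belief: dict, event: str, stickiness: float = 0.2) -> dict:
    """CPFSM-style update: keep a distribution over states rather than one state."""
    new_belief = {s: 0.0 for s in STATES}
    for state, p in belief.items():
        nxt = transition(state, event)
        new_belief[nxt] += (1.0 - stickiness) * p  # mass follows the rule
        new_belief[state] += stickiness * p        # some mass stays put (uncertainty)
    total = sum(new_belief.values())
    return {s: p / total for s, p in new_belief.items()}

belief = {"calm": 1.0, "suspicious": 0.0, "hostile": 0.0}
belief = update_belief(belief, "a stranger shouts an insult")
print(belief)  # mass shifts toward "suspicious": {'calm': 0.2, 'suspicious': 0.8, 'hostile': 0.0}
```

The deterministic function corresponds to my reading of CFSM, while the belief update mimics how CPFSM could keep probability mass over several plausible states rather than committing to a single one.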
- The codification of character logic via FSMs, driven by LLMs, presents a novel mechanism to preserve behavioral coherence in long-form role-playing.
- Experimental results show a clear improvement in behavioral consistency after introducing CFSM. Whether in synthetic tasks (e.g., Mario state transitions) or real narrative scenarios, characters’ state transitions become more coherent and believable. CFSM and CPFSM effectively reduce the confusion and inconsistency commonly observed in prompt-based methods. Notably, CPFSM enhances the subtlety and realism of character responses by modeling weighted reactions across multiple plausible actions through probabilistic transitions.
- Unlike prompt-only state modeling, CFSMs generate explicit transition rules, enabling better control and debuggability in interactive settings.
The proposed framework heavily depends on the LLM to extract states and generate transition rules. If the LLM-produced code contains errors or omissions, it may compromise the correctness of the resulting finite-state machine. The paper provides limited discussion on how to validate or correct the logic generated by the LLM, leaving the reliability of the approach partially contingent on the quality of the LLM’s rule extraction process.
Another concern lies in the current evaluation, which primarily focuses on the Synthetic Validation setup and the Fandom Benchmark — both emphasizing narrative-driven scenarios and character-centric tasks. While these datasets are structurally sound and semantically rich, it would strengthen the work to include more conventional evaluation settings, such as open-domain human–AI dialogue, task-oriented dialogue systems, or social simulation environments. Extending the experiments to broader multi-turn dialogue contexts would better demonstrate the generality and transferability of the proposed CFSM/CPFSM framework.
In addition, incorporating more objective and independent evaluation metrics would provide a more comprehensive assessment of model performance. The selection of baselines also appears somewhat limited: although Codified Profile and PromptTrans offer partial validation of the proposed design, the absence of stronger or more up-to-date baselines weakens the comparative significance. Including results against more advanced or representative methods could substantially enhance the paper’s empirical rigor and impact.
See in weakness. |
Fully AI-generated |
|
Codified Finite-state Machines for Role-playing |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces Codified Finite-State Machines (CFSMs), a framework that leverages large language models (LLMs) to automatically extract character states and generate executable state transition logic for role-playing (RP) agents, aiming to improve consistency and interpretability relative to prompt-based approaches. An extension, Codified Probabilistic FSMs (CPFSMs), models character states probabilistically, supporting nuanced transitions. Empirical validation includes synthetic (game-based) and real-world (Fandom Benchmark) RP tasks, demonstrating improvements in behavioral consistency, efficiency, and interpretability over established baselines.
1. Interpretability: The framework brings interpretability to state modeling in RP with executable, codified transitions derived directly from character profiles.
2. Probabilistic Extension: The CPFSM mechanism elegantly integrates stochasticity into state transitions, explicitly modeling uncertainty in RP.
3. Efficiency: CFSM delivers both accuracy and efficiency, as highlighted in Table 5.
1. Evaluation Scope (Generality): Empirical testing relies primarily on the Fandom Benchmark and three synthetic state machines. The real-world scenarios are derived from highly narrativized, structured data (Fandom plots) with limited diversity of state-space complexity and ambiguity. GPT-4.1 is both judge and model in several settings, and open-ended role-play evaluations rely heavily on LLM judgment. There is insufficient third-party or human evaluation of RP quality, which may limit claims of generality.
2. Limited Handling of Dynamic or Emergent States: The model assumes a fixed state set per episode. The limitations of this assumption are acknowledged in Appendix B but not addressed experimentally. Open-world RP often demands dynamic state growth or on-the-fly trait acquisition, which is not modeled or empirically probed in the present study.
1. What is the meaning of "multimodal" in Line 053, and "multi-modal" in Lines 070, 080, and 092?
2. How would CFSM/CPFSM scale to open-world/large-scale RP where thousands of (possibly compositional) states, or dynamically constructed state sets, are needed? Any memory, efficiency, or codification tests on "harder" synthetic FSMs or real-world systems? |
Lightly AI-edited |
|
SciPro Arena: a Case Study of AI Agent Capabilities in Scientific Analysis Tasks |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a new benchmark called SciProArena that measures how good recent LLMs are for the task of scientific data analysis. In particular they focused on 27 categories of analysis tasks that require the models to extract patterns from noisy experimental data. Authors present extensive experimental results covering recent reasoning models’ performance on the SciProArena benchmark by varying the noise level and dataset resolution.
* Unlike other benchmarks that focus on information extraction and inductive reasoning, SciPro Arena focuses on real, empirical data as input and requires deep analytical reasoning about complex, noisy data. The proposed task mirrors natural scientific experiments where relevant information (‘y’) is latent and must be inferred from proxy measurements (‘x’).
* Results demonstrate a large gap: human performance is 38% (55% on noiseless data), while most recent reasoning models achieve around 13% (21% on noiseless data) on these tasks.
* The dataset covers 27 task categories grouped into five tasks: A) Fermi level extraction, B) dispersion tracing, C) linewidth tracing, D) phonon energy determination, and E) doping determination. While these five tasks are complex and represent core analytical work within their domain, they are a small fraction of the total landscape of scientific data analysis. Additional discussion about the scope of these tasks would help situate the claims better.
* As expected, the performance of reasoning models degrades as the noise level or dataset resolution increases. A more detailed discussion of what kind of agentic systems could be developed on top of these LLMs to alleviate these limitations and solve the individual data points would strengthen the paper. The current future-work discussion focuses more on generalizing tasks or developing agents for meta-analysis.
1. Placing result figures near their descriptions would make the paper easier to read.
2. The authors have uploaded supplementary material; however, most of the prompt and result files are just placeholders. I would suggest releasing the prompts and data so that the research community can reproduce these results.
3. Section 3.4 explains how the noisy version of the dataset was generated by inserting randomly distributed 2D Gaussians in each spectrum. |
Fully human-written |
|
SciPro Arena: a Case Study of AI Agent Capabilities in Scientific Analysis Tasks |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors propose a new benchmark, SciPro Arena, to test how well AI systems can analyze scientific data - specifically angle-resolved photoemission spectroscopy (ARPES) data. They test several frontier models, and find that in general models perform very poorly on the dataset, highlighting the continuing challenges of using AI for scientific discovery.
- Evaluation on a real scientific task in condensed matter physics
- Reasonably rigorous scoring methodology
- While the abstract sets out to test "analysis of scientific data", the actual grounding on ARPES data seems very specific and idiosyncratic. ARPES has numerous complexities and specialities, making it particularly difficult. If the authors' goal is to evaluate analysis of scientific data more generally, they would perhaps do better by first characterizing different types/dimensions of scientific data, collecting samples of each, and doing a more systematic evaluation. In other words, the generalization from ARPES to all scientific data seems somewhat of a leap here.
- Related to this, I'd really like to learn not just how the models perform overall on this benchmark, but how they perform on different aspects of scientific data analysis. Is it possible to identify different types of scientific reasoning required for this task? In other words, rather than (or as well as) having 27 domain-specific question types, identify N data analysis types (e.g., interpolation, prediction, pattern recognition, noise tolerance, data cleaning, visual interpretation, etc.). I'd love to see the paper's framing and conclusions mapped from the physics domain to the AI research domain more.
- Frontier models appear to be applied in a vanilla/naive way, i.e., a single-call direct query to the model. However, there are numerous "AI Scientist" systems out there that do coding, iterative reasoning, and reflection loops (e.g., AIScientist, CodeScientist, ReAct, Reflexion, CodeAct) that might do better at this task. This should be clarified; in particular, the conclusions may only apply to "direct query" uses of frontier models.
- The results seem somewhat dependent on the prompting strategy used, e.g., choices of few-shot prompting
- For several of the plots, I'm not sure what to take away from them - highlighting the takeaway in the caption, rather than just summarizing the visual data (e.g., "accuracy decrease with noise") would be very helpful.
See weaknesses |
Fully human-written |
|
SciPro Arena: a Case Study of AI Agent Capabilities in Scientific Analysis Tasks |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a physics-based benchmark for scientific data analysis. They focus on Angle-Resolved Photoemission Spectroscopy (ARPES) data, which stands in for many condensed matter experiments and serves as a realistic benchmark for finding patterns in noisy, multidimensional datasets.
- I appreciate the authors' meticulous presentation of a domain that's complex to understand.
- Furthermore, it reaffirms that LLMs still struggle to perform scientific data analysis.
- The experiments are thorough and thoughtfully executed.
- I recommend that the authors consider a physical sciences x AI workshop for publishing this work. It's a great paper, but unfortunately too narrow for ICLR.
While I like authors' effort to carefully construct the benchmark, it still suffers from some issues:
- I like the focus on scientific data analysis, but this focus has been explored quite a bit over the last 1-2 years. For example, ScienceAgentBench, DiscoveryBench, and AutoDST are some prominent examples that focus on real scientific data analysis, in fact, expanding across multiple domains. All of them find similar results that LLM struggles to perform scientific data analysis requiring long-tail methods. The paper also missed these very relevant and important citations, while claiming "SciPro Arena fills a gap that has not been addressed before — analysis of real scientific data."
- To my earlier point, what additional insights this paper brings remain unclear. In other words, why AI model builders would test their models on this benchmark compared to earlier comprehensive ones, or why solving this benchmark dictates fundamental capabilities of an LLM, is unclear.
- "datasets within a question are contained in a single text string" -- does this mean to solve this benchmark, all you need is language-based reasoning? What if the data is presented in a tabular file, and the system can interact with the file using code (e.g., Python)? Why is the proposed setup more important than the latter?
Please see the questions mentioned in the weakness. |
Fully human-written |
|
SciPro Arena: a Case Study of AI Agent Capabilities in Scientific Analysis Tasks |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a new benchmark, called SciPro Arena, to evaluate language models on real-world scientific analysis tasks. Given a stream of numerical data (e.g., a 2D intensity map with real energy and momentum axes) and some question, models extract patterns from examples and provide numerical answers which are scored by deviation from ground truth obtained from a realistic simulator. The authors evaluated 14 recent reasoning models both open- and closed-source on 135 questions (27 question types, each tested at 5 noise level times). Recent models tend to perform better (e.g., o3 and Gemini 2.5 Pro), but they only reach <15% on average while a human baseline program (~400 Python lines) scores 37%. Even when removing the noise, SOTA models score 20% while human program gets 55%. Since the questions are composed of different difficulty tiers, the authors could highlight that "[SOTA] models can extract simple features but fail at tracing continuous patterns or computing derived quantities, the latter constituting core reasoning skills needed for real scientific analysis".
- SciPro looks very challenging for current language models and according to the leaderboard it is far from saturation. Improving on that benchmark (esp. dealing with noise) will require breakthrough in reasoning/test-time compute.
- The benchmark offers different difficulty tiers, which are great for tracking progress. The questions are grouped under 5 domains and seem to be easily extendable.
- To me it is unclear whether directly evaluating models on numerical tasks entirely makes sense. Specifically, what is the impact of the tokenization process on the numerical values provided in the context? There has been some work in the literature shedding light on typical error patterns of models on tasks involving numerical reasoning (e.g., https://arxiv.org/abs/2410.11781).
- For each question, three in-context examples are included as part of the prompt. According to the paragraph Form of questions in Section 3.2, each example contains a matrix of numbers with whitespace delimiters. While this provides some structure to the prompt, it is unclear to me what the impact of presenting the information that way is. To me, in a realistic scenario the data would be ingested as a CSV file or some other structured format instead of free-form text strings.
- The title of the paper contains "AI agent capabilities" but without tools or scaffolding that enable LLMs to interact with some environment, I wouldn't call the models being tested AI agents. Also, to be more realistic and comparable with the human baseline, I would argue the models should have been able to use tools, e.g. generating code, to deal with numerical values.
### Minor
Line 146 mentions that ARPES analysis is an inverse problem, but line 151 says "the aim of ARPES data analysis is then to work out x→ y". While I think I understand where that comes from, I find it more intuitive to think that the underlying dispersion and linewidth functions are the 'x' (of the forward process) and the noisy spectrum is the 'y', thus ARPES is really about solving y → x.
- Did the authors try to equip the LLMs with tools (e.g., Python interpreter)? What about allowing them to write code? I suspect the performance will increase. If not, then that's a good motivation for SciPro. Also allowing the use of coding tool is more realistic for scientific analysis.
- Is providing 3 in-context examples optimal? Is it enough to capture the nature of the tasks or does the model still need prior knowledge about the scientific domain the question is coming from?
- I might have missed it, but how are the answers extracted from the LLM's response?
- Will the benchmark be open-source? |
Fully human-written |