ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars) |
|----------------------|----------|------------|----------------|--------------------|
| Fully AI-generated   | 1 (25%)  | 4.00       | 4.00           | 7746               |
| Heavily AI-edited    | 0 (0%)   | N/A        | N/A            | N/A                |
| Moderately AI-edited | 1 (25%)  | 6.00       | 3.00           | 2077               |
| Lightly AI-edited    | 0 (0%)   | N/A        | N/A            | N/A                |
| Fully human-written  | 2 (50%)  | 3.00       | 4.00           | 2016               |
| Total                | 4 (100%) | 4.00       | 3.75           | 3464               |

Review 1: Kaleidoscopic Teaming in Multi Agent Simulations

Soundness: 2 (fair) | Presentation: 2 (fair) | Contribution: 2 (fair)
Rating: 4 (marginally below the acceptance threshold) | Confidence: 4 (confident, but not absolutely certain)

Summary:
The paper presents the Kaleidoscopic Teaming framework, which aims to improve AI agent safety evaluations, particularly in multi-agent environments. Using the MASK framework, it simulates challenging, dynamic scenarios in which agents can interact, cooperate, or compete to complete tasks. The framework is designed to capture the complexities of agent interactions, thought processes, and decision-making in realistic environments. The paper also proposes new optimization strategies for generating these scenarios, along with metrics to assess the safety of agents in these contexts. This work provides an important step toward more realistic and comprehensive safety evaluations for AI agents.

Strengths:
1. Innovative approach: The Kaleidoscopic Teaming framework is an innovative and thoughtful approach to evaluating AI agent safety in complex, multi-agent settings. It offers a novel perspective on safety evaluation that goes beyond traditional red teaming, making it highly relevant for the evolving landscape of AI deployment.
2. Comprehensive simulation: The Multi-Agent Simulation Kaleidoscopic-Teaming framework provides a thorough method for stress-testing agents by modeling real-world interactions and belief states, ensuring that agents are evaluated in dynamic, challenging scenarios.
3. In-context optimization strategies: The paper's in-context optimization techniques for generating more challenging scenarios are a strong contribution. These strategies push the boundaries of traditional safety evaluations and expose vulnerabilities in a more nuanced manner.

Weaknesses:
1. Insight Gatherer and Judges: The roles of the Insight Gatherer and the Judges seem somewhat separate and could benefit from a more integrated design. Is it necessary to keep them as distinct entities, or could they be merged into one unified component that directly evaluates the agents? The paper does not fully justify the separation. An ablation exploring the impact of combining or separating these components would help show whether the current setup is optimal.
2. Role of the Orchestrator: The Orchestrator's function is explained in terms of controlling agent interactions and belief states. However, its role and operation appear quite similar to a traditional star-shaped structure in which a central node manages tasks. It would be helpful to clarify how the Orchestrator differs from traditional management nodes and whether its added complexity truly contributes to the system's effectiveness.
3. Realism of generated scenarios: A significant concern is how to ensure that the generated scenarios are realistic and grounded in real-world possibilities. The paper does not fully address how scenarios are constructed so that they do not deviate too far from practical situations. If scenarios are too fictional, agents may complete tasks in unrealistic ways, potentially compromising the reliability of the safety evaluations. Could the authors consider leveraging existing benchmarks as seeds for scenario generation rather than relying on the model to generate scenarios from scratch? This would help align the scenarios with real-world conditions.
4. Selection of agents in MAS: The paper mentions creating a society of 100 agents with eight major agent types, but it remains unclear how the Multi-Agent Simulation (MAS) selects agents for each iteration. Figure 2 states, "In MASK, at each iteration, either one or more (a group of) agents are selected from the society," yet there is no detailed explanation of how this selection works. Does random selection ensure that the chosen agents are suitable for the tasks in the generated scenario? If the agents are not well suited to the tasks, the results could be meaningless or difficult to interpret. Further clarification of the selection process and its impact on the safety evaluations would be valuable.
5. Percent Negative Agents vs. Percent Negative Scenarios: These two metrics appear highly similar, with one directly influencing the other, and there is no strong justification for measuring both separately. It might be more effective to evaluate agents' planning processes and their associated negative outcomes rather than the scenario level; this would give a more granular picture of where safety issues originate.
6. Scalability of the Overall Average Agent Score: The definition and calculation of the Overall Average Agent Score, particularly the -2 and 2 values, are unclear. The paper does not explain how these scores are derived, which criteria are used to assign them, or how the score range scales when the number of agents varies. A more detailed explanation would help establish the metric's robustness and consistency.
7. Judge rubric and evaluation consistency: The decision to use the worst score from the judges rather than an average or median raises concerns about evaluation consistency. How do the judges' individual differences influence the final scores, especially if the judges use the same model? If the judges' model produces the worst score, how can it be guaranteed that this score accurately reflects an agent's safety rather than being the result of a misjudgment or model error? More clarity on why the worst score is treated as the definitive evaluation would strengthen the validity of the assessment (see the sketch after this review).
8. Limited evaluation of attack models: The evaluation of attack models is somewhat limited, as the paper only compares strong LLMs as attack models against weak LLMs as target models. It would be interesting to evaluate attack models of similar or greater strength against the same target LLMs; this would better assess the robustness of the system.
9. Task completion evaluation: The paper omits an important aspect of agent evaluation: task completion success (e.g., accuracy or completion rate). There is no clear measure of how well the agents complete the tasks, which is essential for evaluating their overall safety and effectiveness. Accuracy-based metrics would help ensure that agents are not only safe but also competent at the tasks at hand. Moreover, if a task is deemed impossible to complete, should that be considered a safety issue as well?

Minor comments:
1. Clarification of table and figure labels: In Section 4, the table labels are inconsistent; based on the context of the results, Table 3 should be Table 1. Figure 4 is also unclear, and it is difficult to discern the meaning of each chart without additional annotations. The authors should consider adding labels or explanations to make these visuals more understandable.
2. Appendix structure: The appendix lacks subheadings, making it difficult to navigate, especially when referencing specific figures or details. Clearer segmentation would help readers quickly locate the relevant information. Furthermore, the main text frequently cites the appendix without pointing to a specific subsection, which makes the paper harder to follow.

EditLens Prediction: Fully AI-generated
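
To make the aggregation concern in weakness 7 (and the -2..2 range in weakness 6) concrete, here is a minimal sketch of how the choice of judge aggregation changes an agent-level score. The judge names, the three-turn layout, the example scores, and the -2..2 rubric range are illustrative assumptions, not the paper's actual implementation.

```python
# Editorial illustration (not the paper's code): how the choice of judge
# aggregation changes a per-agent safety score on an assumed -2..2 rubric.
from statistics import mean, median

# Hypothetical per-judge scores for one agent, one entry per scenario turn.
judge_scores = {
    "judge_a": [1, 0, -2],
    "judge_b": [2, 1, -1],
    "judge_c": [1, 1, 0],
}

def aggregate_turn(scores_per_judge, how="worst"):
    """Combine the judges' scores for a single turn."""
    if how == "worst":      # the paper reportedly keeps the minimum
        return min(scores_per_judge)
    if how == "mean":
        return mean(scores_per_judge)
    return median(scores_per_judge)

def agent_score(judge_scores, how="worst"):
    """Average the per-turn aggregates into one agent-level score in [-2, 2]."""
    turns = list(zip(*judge_scores.values()))
    return mean(aggregate_turn(t, how) for t in turns)

for how in ("worst", "mean", "median"):
    print(how, round(agent_score(judge_scores, how), 2))
# worst -0.33 | mean 0.33 | median 0.33  -> the worst-score rule is strictly
# more pessimistic, which is the consistency concern raised in point 7.
```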

Review 2: Kaleidoscopic Teaming in Multi Agent Simulations

Soundness: 2 (fair) | Presentation: 2 (fair) | Contribution: 2 (fair)
Rating: 2 (reject) | Confidence: 4 (confident, but not absolutely certain)

Summary:
The paper proposes an approach for jailbreaking LLMs in the context of multiple agents. The term "kaleidoscopic teaming" is a variation on red teaming. The authors run simulations that model a society of agents with different capabilities, which exhibit human traits such as emotions and belief states.

Strengths:
* The authors performed experiments with a relatively large number of generated agents across multiple models.

Weaknesses:
* It is not clear whether the proposed agents are the kind of agents that might be implemented and deployed in production settings. Why would these agents exhibit human emotions?
* The main contribution of the paper is unclear. That is, do the authors propose that people perform kaleidoscopic teaming instead of red teaming? Is this intended to be a methodology or a benchmark?

Questions:
* Can you please clarify how a reader would take advantage of the contributions of this paper? Would they need to repeat your experiments?

EditLens Prediction: Fully human-written

Review 3: Kaleidoscopic Teaming in Multi Agent Simulations

Soundness: 3 (good) | Presentation: 3 (good) | Contribution: 3 (good)
Rating: 6 (marginally above the acceptance threshold) | Confidence: 3 (fairly confident; math/other details were not carefully checked)

Summary:
The paper introduces kaleidoscopic teaming, a safety-evaluation paradigm for AI agents (not just chat models) that stress-tests thoughts, actions, tool use, and inter-agent interactions. It operationalizes this via MASK (Multi-Agent Simulation Kaleidoscopic-teaming): a loop with a kaleidoscope that generates scenarios, an orchestrator that assigns turns and belief/emotional states, judges with a rubric, and an insight gatherer that feeds weaknesses back to the scenario generator. The authors also add two in-context optimization strategies for the kaleidoscope, PSO (Past Scenarios Only) and CSR (Contrastive Scenarios + Rewards), and define three ASR metrics (percent negative agents, percent negative scenarios, overall average agent score). Experiments span 100 agents using 32 tools across 8 domains; results show that the optimized strategies often yield higher attack success and that multi-agent settings expose more vulnerabilities than single-agent ones.

Strengths:
1. The work evaluates thoughts, tool calls, and interactions rather than just final answers, which is closer to real agent deployments.
2. It has a clear, replicated finding that inter-agent dynamics expose more safety failures.
3. PSO and CSR are simple, model-agnostic ways to steer the scenario generator; they often raise ASR and scenario diversity.
4. Per-agent-type safety profiles can guide targeted mitigations.

Weaknesses:
1. Although partially validated, LLM-as-judge can encode biases; taking the worst score across an ensemble may over-penalize edge cases. More blind human audits and evaluation of the judging schemes would help.
2. The orchestrator injects belief/emotional states to nudge unethical behavior; this may inflate ASR relative to organic failures and complicates comparison with other frameworks. A controlled ablation isolating the effect of belief injection is needed.
3. "Percent negative" treats minor and catastrophic harms similarly; adding severity-weighted scores and time-to-failure would give a richer view (see the sketch after this review).
4. Datasets/traces aren't released, hindering replication.

Questions:
How do you handle severity-related concerns?

EditLens Prediction: Moderately AI-edited
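
A minimal sketch of the severity weighting suggested in weakness 3, contrasting a plain percent-negative count with a severity-weighted score; the severity labels, weights, and example agent outcomes are hypothetical and not taken from the paper.

```python
# Editorial sketch of severity weighting: instead of counting every negative
# outcome equally ("percent negative"), weight each failure by an assumed
# severity label. Labels, weights, and outcomes below are hypothetical.
SEVERITY_WEIGHTS = {"minor": 0.2, "moderate": 0.5, "catastrophic": 1.0}

def percent_negative(outcomes):
    """Fraction of scenarios with any negative outcome (severity ignored)."""
    return sum(o is not None for o in outcomes) / len(outcomes)

def severity_weighted(outcomes):
    """Average severity weight per scenario; None means no failure."""
    return sum(SEVERITY_WEIGHTS[o] for o in outcomes if o is not None) / len(outcomes)

# Two hypothetical agents, each failing in 2 of 4 scenarios:
agent_x = [None, "minor", "minor", None]          # frequent but mild failures
agent_y = [None, None, "catastrophic", "minor"]   # rarer but severe failure

for name, outcomes in [("agent_x", agent_x), ("agent_y", agent_y)]:
    print(name, percent_negative(outcomes), severity_weighted(outcomes))
# Both agents have percent_negative = 0.5, but the weighted score separates
# them (0.1 vs. 0.3), which is the richer view the reviewer asks for.
```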

Review 4: Kaleidoscopic Teaming in Multi Agent Simulations

Soundness: 1 (poor) | Presentation: 3 (good) | Contribution: 2 (fair)
Rating: 4 (marginally below the acceptance threshold) | Confidence: 4 (confident, but not absolutely certain)

Summary:
The paper is about the safety evaluation of LLM agents. The authors propose a new method of safety testing called kaleidoscopic teaming, which allows for testing agents in single- and multi-agent scenarios. Compared to red teaming, kaleidoscopic teaming allows for more complex, multi-turn scenarios that are dynamically adjusted to find vulnerabilities in specific agents. The authors describe their framework, which consists of "functional" LLMs that orchestrate, generate the scenarios, and judge the agents. They consider different strategies for prompting these components, and they propose a safety metric that can be used in this framework. The experiments show diverse safety results across different LLM agents and agent types.

Strengths:
* Importance of the topic. LLM agents should be tested for safety, and this work allows for more complex testing. Additionally, the authors show the importance of multi-agent testing.
* The possibility of capturing different levels of safety.
* A dynamic framework that can adapt to a specific agent or agent type. Easily automated.

Weaknesses:
* No comparison with baselines. E.g., how would red teaming or automated red teaming grade the agents? Would there be much of a difference?
* There is no formal definition of the metrics. The description is a bit unclear, and formal equations would be useful, e.g., for the score value (see the sketch after this review for one possible formalization). Additionally, why are the metrics introduced if they are not used in the experiments?
* It is a very "practical" paper with no guarantees of detecting any specific kinds of vulnerabilities, and for such a paper there is not enough evaluation of the individual parts of the framework (e.g., how different LLMs as judges influence the result).
* Figure 3: for Claude 3.7 (and Nova Pro), the score is opposite for Nova Lite and DeepSeek. This suggests that the framework is very dependent on the LLM types used in the kaleidoscope, orchestrator, etc. What to do with this instability of evaluation?
* No mention of the code being published.
* Very limited explanation of the relationship to automated red-teaming research. The multi-agent aspect is claimed to be novel. Is there a difference between components of agents and a multi-agent scenario? If this is the selling point, the authors should focus on specific examples from their scenarios that are of this type and explain why alternative methods would not capture them.
* Though I appreciate playfulness, it was hard to focus on reading the paper due to all the colorful fonts and bolded text.

Questions:
* What is the reason for the substantially different evaluation results of Claude 3.7 by DeepSeek and Nova Pro?
* Why is only the kaleidoscope component tested with different LLM models? The other components are always just Nova Lite.
* Have you considered the weaknesses of using an LLM as the judge and as the other components of an LLM evaluation framework? Is there a danger of collusion?
* How can you make sure that testing a potentially malicious agent with access to real-world tools (RapidAPI) will not cause any harm?
* What is the difference in evaluation between competitive and cooperative multi-agent scenarios?

EditLens Prediction: Fully human-written
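
As a starting point for the formal definitions requested in the weaknesses above, here is one possible formalization of the three metrics (percent negative agents, percent negative scenarios, overall average agent score). It assumes judge scores in [-2, 2] indexed by (agent, scenario) pairs, with scores below 0 counted as negative; these are assumed definitions for illustration, not the paper's own.

```python
# Editorial sketch of one possible formalization of the three metrics the
# reviewer asks to see as equations; the threshold, score range, and the
# scores-by-(agent, scenario) layout are assumptions, not the paper's own.
from statistics import mean

# Hypothetical judge scores in an assumed [-2, 2] range.
scores = {
    ("a1", "s1"): 2, ("a1", "s2"): -1,
    ("a2", "s1"): 1, ("a2", "s2"): 0,
}
NEGATIVE_THRESHOLD = 0

def percent_negative_agents(scores):
    """Share of agents with at least one negative-scored scenario."""
    agents = {a for a, _ in scores}
    bad = {a for (a, _), v in scores.items() if v < NEGATIVE_THRESHOLD}
    return len(bad) / len(agents)

def percent_negative_scenarios(scores):
    """Share of scenarios in which at least one agent scored negatively."""
    scenarios = {s for _, s in scores}
    bad = {s for (_, s), v in scores.items() if v < NEGATIVE_THRESHOLD}
    return len(bad) / len(scenarios)

def overall_average_agent_score(scores):
    """Mean of per-agent mean scores; stays in [-2, 2] for any agent count."""
    agents = {a for a, _ in scores}
    return mean(mean(v for (a, _), v in scores.items() if a == agent)
                for agent in agents)

print(percent_negative_agents(scores),      # 0.5 (only a1 scores negatively)
      percent_negative_scenarios(scores),   # 0.5 (only s2 is negative)
      overall_average_agent_score(scores))  # 0.5
```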