ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 2 (50%) 5.00 4.00 4658
Heavily AI-edited 0 (0%) N/A N/A N/A
Moderately AI-edited 0 (0%) N/A N/A N/A
Lightly AI-edited 0 (0%) N/A N/A N/A
Fully human-written 2 (50%) 3.00 3.50 2148
Total 4 (100%) 4.00 3.75 3404
Title Ratings Review Text EditLens Prediction
Beyond Score: A Multi-Agent System to Discover Capability and Behavioral Weaknesses in LLMs Soundness: 4: excellent Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces AGENT4WEAKNESS, a multi-agent system designed to overcome the limitations of current LLM evaluation methods, specifically their "insufficient comparison" and "inflexible evaluation". The framework leverages multi-agent collaboration (a Planner, Analyzer, and Reporter) and a suite of specialized statistical and analytical tools to generate in-depth evaluation reports based on flexible user queries. Rather than just reporting scores, the system performs both capability analysis (assessing statistical significance of performance gaps) and behavioral analysis (mining reasoning patterns from Chain-of-Thought data). The authors demonstrate through extensive experiments that reports from AGENT4WEAKNESS are significantly higher in quality than those from a strong baseline. - The paper's primary innovation is reframing LLM evaluation from a static, fixed-pipeline task to a flexible, systematic process. An agentic workflow with multiple agents is used. - The technical quality of the work is high. The system is thoughtfully designed, decomposing the complex task of "weakness discovery" into distinct, manageable agent roles and tool families. In general, the design of the system makes sense. ```Lack of evidence of the benefit from the report``` I am concerned whether the report is helpful given that each model is unique. Without knowing the internals or the properties of the model, it would be hard to assess whether the report is helpful. One example given in the paper is that AGENT4WEAKNESS identifies that the reasoning process of DeepSeek-V3.1 on AIME2025 questions is disorganized and suggests using markers such as “### Step 1” to structure the reasoning and adding verification of intermediate results after each step. This particular example can be problematic because, for some models, adding additional structure or verification may force the model to reason out of distribution. This may lead to decreased performance. ```Nature of model "improvement" is in-context only``` The 3.7-point performance improvement (Section 4.3) is a strong result, but it appears to be achieved by feeding the system's analysis and suggestions back into the model as part of the prompt. This demonstrates an improvement in in-context learning or prompt adherence based on the analysis, which is different from a permanent improvement to the base model (e.g., via fine-tuning). While still valuable, this distinction should be made clearer, as the current phrasing could imply a more fundamental model enhancement. - Did you experiment with other SOTA models (like GPT-5 or Gemini-2.5-pro) to run the agents themselves? How sensitive is the final report quality ("Content Value" and "Factuality") to the choice of the underlying model powering the Planner, Analyzer, and Reporter? Fully human-written
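The capability analysis this review refers to (testing whether score gaps between models are statistically significant rather than benchmark noise) can be illustrated with a small sketch. The paired bootstrap below operates on hypothetical item-level correctness data; the function and data are illustrative assumptions, not code from the paper.

```python
import numpy as np

def paired_bootstrap_gap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Estimate the accuracy gap between two models on the same benchmark items,
    with a 95% bootstrap confidence interval and a rough two-sided p-value."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)   # per-item difference (1/0 correctness)
    n = len(diffs)
    boot = np.array([diffs[rng.integers(0, n, n)].mean()  # resample items with replacement
                     for _ in range(n_resamples)])
    gap = diffs.mean()
    ci = np.percentile(boot, [2.5, 97.5])
    p = 2 * min((boot <= 0).mean(), (boot >= 0).mean())
    return gap, ci, p

# Illustrative item-level results (1 = correct, 0 = incorrect) for two models on shared items.
model_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
model_b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
gap, ci, p = paired_bootstrap_gap(model_a, model_b)
print(f"gap={gap:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f}), p~{p:.3f}")
```

Pairing the comparison over shared items is what separates a meaningful capability gap from noise, which is the kind of evidence the review credits the system's capability analysis with providing.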
Beyond Score: A Multi-Agent System to Discover Capability and Behavioral Weaknesses in LLMs Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces AGENT4WEAKNESS, a multi-agent system designed to address the limitations of existing methods for discovering capability and behavioral weaknesses in large language models (LLMs). Current weakness discovery approaches are characterized by insufficient comparison, often failing to analyze statistical significance and confidence of performance differences, and inflexible evaluation, being restricted to fixed perspectives rather than adapting to user-specific requirements. AGENT4WEAKNESS aims to provide richer statistical insights and generate customized evaluation reports through collaborative agents and specialized tools. The core methodology of AGENT4WEAKNESS is structured around a multi-agent workflow consisting of a Planner, an Analyzer, and a Reporter, leveraging comprehensive evaluation data. Experimental results demonstrate that reports generated by AGENT4WEAKNESS show a significant improvement of 2.6 out of 10 points across four evaluation dimensions (Requirement Fulfillment, Content Value, Factuality, and Readability) compared to a direct answering baseline, with high consistency with human evaluations (Pearson r = 0.801, Spearman ρ = 0.944). Notably, it achieves a 3.4-point improvement in Content Value, signifying richer analyses, and a 3.4-point improvement in Requirement Fulfillment, indicating its flexibility. Furthermore, models guided by weaknesses discovered through AGENT4WEAKNESS show an average performance improvement of 3.7 points, underscoring its practical utility. 1. The paper squarely tackles the challenge of LLM weakness discovery, a critical need as models grow more complex. The authors clearly motivate that current evaluation methods lack substantive comparisons (no statistical rigor) and the flexibility to meet different analysis needs. 2. The proposed Agent4Weakness framework is innovative in using multi-agent collaboration for evaluation. It combines Planner, Analyzer, and Reporter agents with specialized roles. This design allows the system to break down the user’s query, fetch and compute detailed statistics, and compile findings into a coherent report. The use of external tools (for data analysis and statistical testing) within the agent workflow is a strong point, as it grounds the LLM’s analysis in solid quantitative evidence rather than just its internal knowledge. This represents a creative extension of chain-of-thought prompting into a tool-augmented, multi-step evaluation process. 3. Agent4Weakness produces much more informative and accurate analyses than the baseline. The strong performance across multiple dimensions underscores the efficacy of the approach. Furthermore, the authors show that LLM-based scoring of the reports correlates strongly with human evaluation (Pearson r ≈ 0.80, Spearman ρ ≈ 0.94), which suggests the evaluation criteria were meaningful and actually reflect human-perceived quality. 4. The system doesn’t just produce high-level summaries – it can pinpoint specific weaknesses with verifiable accuracy.
Such detailed weakness analysis is a clear strength over traditional evaluations that would simply report an average score. 1. High Complexity and Resource Requirements: The proposed system is quite complex, involving three coordinated agent roles and multiple tool integrations. Running Agent4Weakness requires a powerful backbone LLM and a large set of evaluation data in memory. This complexity could pose practical challenges. For example, orchestrating multi-agent prompts and tool calls may be slow or expensive compared to a single-pass evaluation. 2. The paper primarily compares Agent4Weakness to a direct answering baseline. While this highlights the advantage of the structured approach, the baseline is relatively rudimentary. It does not, for instance, use any chain-of-thought or tools at all. In other words, an intermediate baseline (say, a single-agent chain-of-thought using the same tools, but without specialized roles) could help attribute the improvements more precisely. As it stands, the evaluation convincingly shows superiority to an “uninformed” baseline, but not necessarily to any sophisticated alternative. 3. The paper demonstrates that Agent4Weakness’s outputs correlate well with human judgments of quality, but it doesn’t compare against human-written analysis of model weaknesses. Of course, expecting the authors to produce human-written reports for all cases is impractical, so this is more of an observation than a strict criticism. 1. You chose a direct-answer baseline for comparison, arguing that prior specialized pipelines aren’t as flexible. Nonetheless, have you considered comparing Agent4Weakness to a simpler single-agent chain-of-thought approach using tools? For example, one could prompt a single LLM with something like: “Here is all the data, think step by step to analyze weaknesses…” and allow it to call the same tools (sort of an ablation of the multi-agent structure). This might isolate the benefit of having distinct Planner/Analyzer/Reporter roles. 2. The multi-agent pipeline with external tools sounds computationally heavy (multiple prompt exchanges, tool calls, etc.). Do you have a sense of the runtime or cost overhead of Agent4Weakness compared to a direct evaluation? For instance, how long does it take to generate a full report for one model’s weaknesses, and could this be a bottleneck if evaluating many models continuously? 3. You showed a compelling result that providing the model with the identified weaknesses and suggestions can improve its performance via prompt adjustments. Do you envision a more automated integration of Agent4Weakness into the model development loop? For example, could the reports be used to inform fine-tuning data generation or to create adversarial test cases for continuous improvement? Fully AI-generated
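To make the suggested ablation concrete, a single-agent, tool-augmented baseline could look roughly like the sketch below: one LLM receives the full evaluation data and the user request, and may call the same analysis tools in a loop, with no Planner/Analyzer/Reporter split. The call_llm interface, message format, and tool registry here are assumed placeholders, not APIs from the paper.

```python
# Hypothetical single-agent, tool-augmented baseline for the ablation suggested above:
# one LLM is prompted with the user request plus all evaluation data and may call the
# same analysis tools in a loop; there is no Planner/Analyzer/Reporter decomposition.
def single_agent_baseline(user_request, eval_data, tools, call_llm, max_steps=10):
    messages = [{
        "role": "user",
        "content": (
            f"Here is the evaluation data for several models:\n{eval_data}\n\n"
            f"Request: {user_request}\n"
            "Think step by step. You may call the available tools; "
            "when you are done, write the full weakness report."
        ),
    }]
    for _ in range(max_steps):
        reply = call_llm(messages, tools=list(tools))    # tool-calling LLM endpoint (assumed)
        if reply.get("tool_call"):                       # the model requested a tool run
            name = reply["tool_call"]["name"]
            args = reply["tool_call"]["args"]
            result = tools[name](**args)                 # execute the analysis tool locally
            messages.append({"role": "tool", "name": name, "content": str(result)})
        else:                                            # no tool call: treat output as the report
            return reply["content"]
    return "No report produced within the step budget."
```

Comparing this against the full system would isolate the contribution of the role decomposition from the contribution of tool access itself.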
Beyond Score: A Multi-Agent System to Discover Capability and Behavioral Weaknesses in LLMs Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes AGENT4WEAKNESS, a multi-agent system designed to uncover capability and behavioral weaknesses of large language models beyond numeric evaluation scores. The framework coordinates three agents, Planner, Analyzer, and Reporter, equipped with 22 analytical tools to generate structured weakness reports from existing benchmark results. Tested on 8 models and 27 benchmarks, the system improves report quality and aligns closely with human judgments, demonstrating its effectiveness in revealing interpretable model weaknesses. 1. Well-motivated framework for deeper LLM evaluation. The paper clearly identifies the limitations of traditional benchmark-based evaluation, which focuses only on numeric scores, and introduces a multi-agent framework (AGENT4WEAKNESS) that systematically analyzes capability and behavioral weaknesses of LLMs beyond accuracy metrics. This motivation is timely and addresses a meaningful gap in current evaluation practices. 2. Comprehensive and well-structured system design. The proposed Planner–Analyzer–Reporter architecture is conceptually clear and technically coherent. Each agent has distinct responsibilities, and the framework integrates 22 analytical tools for statistical testing, capability gap detection, and behavioral pattern mining. The design effectively demonstrates how multi-agent coordination can enable flexible, user-driven evaluation. 3. Empirical validation. The system is tested on 8 representative LLMs and 27 public benchmarks, covering reasoning, factuality, and safety. Alignment between human evaluation and model-based scores confirms the validity of the automatically generated reports. 1. Limited technical innovation. The framework mainly assembles existing components into an agentic pipeline, including multi-agent role decomposition and tool invocation, without introducing novel algorithms or coordination mechanisms. 2. Insufficient diversity of user queries. The evaluation covers only three representative user requests (Q1–Q3), which cannot fully demonstrate the framework’s flexibility or generalization across different analysis needs. Broader testing on varied and real-world queries would be needed to justify the claimed adaptability and scalability of AGENT4WEAKNESS. 3. Limited ablation and component analysis. The paper provides little insight into the contribution of each component (22 analytical tools and three agent roles). The ablation results shown in Table 4 remain coarse-grained and do not reveal which tools or interactions most affect report quality. A more detailed analysis would strengthen the causal understanding of the framework’s performance. 4. Lack of formal formulation. While the overall system architecture (Figure 2) conveys the high-level workflow, the paper lacks a clear and formalized description of how the multi-agent coordination operates in practice. Key elements such as the data flow between the Planner, Analyzer, and Reporter agents, the intermediate representations of benchmark results, and the decision rules for tool invocation are only described narratively.
1. In Section 3.1, the authors first mention “a detailed list of 104 models” but later state that “the evaluation results of 8 representative models” are used. Could the authors clarify whether and how the 104 models are used in your experiments? 2. Only three user requests (Q1–Q3) are defined to test the framework. How do you ensure these requests are representative and cover real user needs? Fully AI-generated
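As an illustration of the kind of formalization the fourth weakness asks for, the sketch below writes one possible Planner → Analyzer → Reporter data flow as explicit intermediate representations and a fixed routing rule. The dataclasses and function signatures are assumptions for exposition, not the paper's actual specification.

```python
from dataclasses import dataclass, field

@dataclass
class AnalysisPlan:
    """Planner output: which analyses to run, expressed as (tool_name, tool_args) steps."""
    user_request: str
    steps: list[tuple[str, dict]] = field(default_factory=list)

@dataclass
class AnalysisResult:
    """Analyzer output: intermediate evidence handed to the Reporter."""
    tool_name: str
    findings: dict

def run_pipeline(user_request, planner, analyzer, reporter, tools):
    """One possible Planner -> Analyzer -> Reporter data flow: tool invocation is
    decided by the Planner's step list rather than ad hoc during analysis."""
    plan: AnalysisPlan = planner(user_request)
    results = [
        AnalysisResult(tool_name=name, findings=analyzer(tools[name], args))
        for name, args in plan.steps
    ]
    return reporter(user_request, results)   # Reporter composes the final weakness report
```

Spelling out the intermediate objects and the tool-routing rule in this form is what would let readers verify the decision rules the paper currently describes only narratively.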
Beyond Score: A Multi-Agent System to Discover Capability and Behavioral Weaknesses in LLMs Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes AGENT4WEAKNESS, a multi-agent framework using three agents (Planner, Analyzer, Reporter) with 22 specialized tools to generate comprehensive evaluation reports identifying LLM weaknesses. The authors claim their approach addresses two key limitations in existing methods: insufficient statistical comparison and inflexible evaluation perspectives. The work is generally interesting and provides a novel perspective on LLM evaluation. The challenge of systematically discovering LLM weaknesses beyond simple accuracy scores is crucial for the field. LLM evaluation requires systematic yet novel benchmarking. 1. Circularity is introduced when an LLM is used both to generate AGENT4WEAKNESS reports and to evaluate their quality. This may create an inherent bias, where the evaluator model may favor outputs from its own framework over others. 2. The framework still relies on human-curated benchmarks. Does the model under evaluation still need to be run on many benchmarks, or can the evaluation be made on partial results? Is inference computation saved by this system? 3. Can the system provide novel evaluations beyond available benchmarks? For example, if I wanted to assess an agent's ability in creative writing, would the system still be applicable? 1. Does the system exhibit consistent performance when analyzing the same model multiple times? 2. Running benchmarks and evaluations may incur significant computational overhead. Can this system run stably on typical research computing infrastructure? Fully human-written
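The first question (run-to-run consistency) could be operationalized by generating the weakness analysis several times and measuring agreement between the resulting rankings. A minimal sketch, assuming a hypothetical generate_weakness_ranking function that returns a severity score per capability:

```python
from itertools import combinations
from scipy.stats import spearmanr

def run_consistency(generate_weakness_ranking, model_name, n_runs=5):
    """Generate the weakness ranking n_runs times and return the mean pairwise
    Spearman correlation. generate_weakness_ranking is assumed (hypothetically)
    to return a dict mapping capability -> severity score."""
    runs = [generate_weakness_ranking(model_name) for _ in range(n_runs)]
    capabilities = sorted(runs[0])               # fix a common ordering of capabilities
    correlations = []
    for a, b in combinations(runs, 2):
        rho, _ = spearmanr([a[c] for c in capabilities], [b[c] for c in capabilities])
        correlations.append(rho)
    return sum(correlations) / len(correlations)
```

A high mean correlation across repeated runs would indicate the reports reflect stable properties of the evaluated model rather than sampling noise in the agent pipeline.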