ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated   | 1 (25%)  | 6.00       | 4.00           | 2386
Heavily AI-edited    | 1 (25%)  | 6.00       | 3.00           | 2071
Moderately AI-edited | 0 (0%)   | N/A        | N/A            | N/A
Lightly AI-edited    | 2 (50%)  | 3.00       | 3.50           | 1906
Fully human-written  | 0 (0%)   | N/A        | N/A            | N/A
Total                | 4 (100%) | 4.50       | 3.50           | 2067
Title | Ratings | Review Text | EditLens Prediction
MFCL: A Multi-modal Function Calling Evaluation for Large Language Models

Soundness: 1: poor | Presentation: 2: fair | Contribution: 2: fair | Rating: 2: reject | Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: The paper addresses two limitations of current multimodal benchmarks: 1) reliance solely on text-based tools, and 2) lack of feedback across different modalities and fine-grained details. To tackle these issues, the authors propose MFCL (Multi-modal Function Calling Evaluation), which consists of three components: True Audio, Text Audio, and Vision. The authors establish specific guidelines for generating each type of data. They evaluate several mainstream models on the proposed benchmark and analyze the results from both audio and visual perspectives. From the audio perspective, they find that current multimodal large models are highly sensitive to speech noise and often fail to confirm critical entities, leading to task failure. From the visual perspective, they observe that these models still lack sufficient attention to detail, along with limited tool-calling and self-correction abilities.

Strengths: There is currently a scarcity of in-depth evaluation benchmarks for multimodal large models, and this paper contributes meaningfully to the field.

Weaknesses:
1. The introduction is somewhat disorganized. The authors mention two research gaps, but from the third paragraph onward they delve into the construction of the benchmark without directly linking it to how these gaps are addressed. I suspect the authors intended to highlight the lack of a benchmark combining API and multimodal evaluation, but the current version is hard to follow, making the motivation unclear. Additionally, Figure 1 is not referenced in the text.
2. There is a lack of data validation, particularly human evaluation, making it difficult to assess the benchmark's quality and potential biases. Furthermore, certain details remain unclear. For instance, the authors state that vision data requires "one clear visual clue," but there is no in-depth analysis of how a "clear visual clue" is defined or identified.
3. The benchmark does not effectively integrate multiple modalities. Although the authors claim that the three components are mutually supportive, the paper does not demonstrate how these components interact.
4. Experimental settings and evaluation metrics are crucial, yet placing them entirely in the appendix makes the paper hard to follow.

Questions: The analysis section summarizes numerous issues. Among these, which problem is the most critical and has the greatest impact on the performance of current multimodal LLMs? Could resolving this issue potentially lead to the resolution of other problems?

EditLens Prediction: Lightly AI-edited
MFCL: A Multi-modal Function Calling Evaluation for Large Language Models

Soundness: 3: good | Presentation: 3: good | Contribution: 3: good | Rating: 6: marginally above the acceptance threshold | Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
1. This paper introduces MFCL, the first unified benchmark to evaluate structured function calling from multi-modal inputs (speech and vision).
2. It systematically injects realistic perception perturbations and uses exact-match automated scoring to diagnose failures.
3. Results reveal significant degradation under noise and visual distortions, exposing critical weaknesses such as tool avoidance, keyword selection errors, and conversational drift in modern models.

Strengths:
1. Realistic perturbation design that simulates real-world failure conditions in multimodal function calling and thoroughly analyzes their impact.
2. Extensive evaluation across multiple leading commercial models, demonstrating substantial experimental effort and providing meaningful comparative evidence for the community.

Weaknesses:
1. Despite arguing for "real-world audio robustness," the dataset relies on synthetic TTS rather than human-recorded speech.
2. The benchmark's core metric (exact-match JSON output) misaligns with real agent objectives, overlooking task-level success, semantic equivalence, and cost-aware behaviors.
3. The turn and clarification rules constrain reasonable uncertainty-handling strategies, potentially biasing models toward brittle "just emit JSON" behaviors instead of safe, real-world interaction patterns.

Questions: While the benchmark is valuable, I have several questions:
1. How does strict exact-match scoring avoid misaligning the benchmark with real-world multi-turn agent behavior (uncertainty handling, clarification, self-correction)?
2. The turn semantics and clarification rules only allow spelling/value confirmations, while ignoring broader ambiguity resolution. How do these constraints avoid discouraging realistic uncertainty-handling strategies that agents must perform in practical deployments?
3. Given the recent emergence of audio-based function-calling benchmarks, is it appropriate for MFCL to claim to be the "first" in this space, and could the authors clarify the concrete differences that distinguish MFCL from prior speech-focused evaluations?

EditLens Prediction: Heavily AI-edited
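To make the exact-match concern above concrete, here is a minimal sketch of how a strict grader over a function name and its JSON arguments typically behaves; the call structure and field names are assumptions for illustration, not MFCL's actual schema.

```python
# Minimal sketch of exact-match grading for a single function call.
# The {"name": ..., "arguments": {...}} structure is an assumption for
# illustration, not MFCL's actual schema.

def exact_match(pred: dict, gold: dict) -> bool:
    """Return True only if the function name and every argument value
    match the ground truth exactly (no semantic equivalence allowed)."""
    if pred.get("name") != gold.get("name"):
        return False
    return pred.get("arguments", {}) == gold.get("arguments", {})

# Under this metric, a semantically equivalent call still scores 0:
gold = {"name": "book_flight", "arguments": {"dest": "New York City"}}
pred = {"name": "book_flight", "arguments": {"dest": "NYC"}}
assert exact_match(pred, gold) is False  # penalized despite identical intent
```

This is the behavior the reviewer flags: the metric is reproducible and judge-free, but it conflates formatting or surface-form differences with genuine task failure.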
MFCL: A Multi-modal Function Calling Evaluation for Large Language Models

Soundness: 3: good | Presentation: 4: excellent | Contribution: 3: good | Rating: 6: marginally above the acceptance threshold | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: The paper introduces MFCL, the first large-scale benchmark for evaluating multi-modal function calling (tool use) in large language models (LLMs). MFCL comprises 8.2K expert-verified tasks across three suites: True Audio, Text Audio, and Vision. Each example pairs a multi-modal user query (text, speech, or image) with a ground-truth tool-call trace, and includes controlled perturbations (accents, noise, occlusions, etc.) to stress the perception-to-action pipeline. MFCL provides an automatic grader for exact-match scores on function names and arguments, enabling robust, reproducible evaluation without reliance on LLM judges. The authors benchmark leading models (e.g., GPT-4o, Gemini, Claude, GLM, xLAM), analyze failure modes (named-entity ASR errors, conversational drift, tool avoidance), and present a taxonomy to guide future research. The dataset, taxonomy, and diagnostics are released to accelerate progress on reliable multi-modal agents.

Strengths:
1. Originality:
   - First benchmark to systematically evaluate multi-modal function calling under real-world perturbations.
   - Introduces controlled perturbations and a taxonomy of failure modes.
   - Unifies text, audio, and vision evaluation in a single framework.
2. Quality:
   - Expert-verified tasks, realistic data augmentation, and comprehensive error analysis.
   - Automatic grading at the function and argument level, enabling reproducible and robust evaluation.
   - Strong experimental design, with ablations and comparisons across many models.
3. Clarity:
   - Clear motivation, methodology, and results presentation.
   - Figures and tables directly support the claims; the taxonomy is actionable.
4. Significance:
   - MFCL will become a standard for evaluating multi-modal tool-augmented agents.
   - The insights into failure modes and robustness are valuable for both research and deployment.

Weaknesses: No major weaknesses. The study of a multi-modal function calling benchmark is very useful for developing agentic LLMs in real-world scenarios.

Questions (minor):
1. The failure mode analysis is very interesting. Do the authors have quantitative results in addition to the qualitative examples?
2. Are there plans to release the benchmark?
3. Do you have plans to evaluate smaller models on the benchmark, such as Qwen-Omni and other multi-modal LLMs of similar size?

EditLens Prediction: Fully AI-generated
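The controlled audio perturbations mentioned above (and questioned in the next review) usually amount to mixing background noise into clean speech at a target signal-to-noise ratio. The sketch below shows the general technique under that assumption; it is not MFCL's actual pipeline, and the signals are synthetic placeholders.

```python
# Sketch of a standard audio perturbation: mix noise into speech at a target SNR.
# Illustrative only; not MFCL's data-generation code.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that the mixture has the requested signal-to-noise ratio."""
    noise = np.resize(noise, speech.shape)           # loop/trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example with synthetic signals; real use would load TTS output and noise clips.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16_000))  # 1 s tone at 16 kHz
noise = rng.standard_normal(16_000)
noisy = mix_at_snr(speech, noise, snr_db=5.0)  # 5 dB SNR: clearly audible noise
```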
MFCL: A Multi-modal Function Calling Evaluation for Large Language Models

Soundness: 2: fair | Presentation: 2: fair | Contribution: 2: fair | Rating: 4: marginally below the acceptance threshold | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper introduces MFCL, a new benchmark for function calling in multimodal scenarios. The paper examines several cutting-edge models and reveals their common failure patterns, providing insights for developing multimodal agents.

Strengths:
- The topic (function calling and multimodality) is timely.
- Comprehensive dataset design.
- Clear description of the dataset construction.

Weaknesses:
- Limited insight beyond an enumeration of "failure modes." Most failure-mode categories merely restate known LLM limitations (ASR errors, over-reliance on text, conversational drift).
- Missing implementation details, such as the decoding hyperparameters, which may reduce the reproducibility of the paper.
- Statistical shallowness: reported numbers are raw accuracies with no confidence intervals or significance testing.

Questions:
- How did you verify that the TTS-generated and noise-augmented audio realistically represents spontaneous human speech or real-world acoustic conditions? Was any human evaluation conducted to confirm perceptual naturalness?
- What procedures ensured the correctness and consistency of the expert-verified tasks? How many annotators were involved? What inter-annotator agreement was achieved?
- Given that the Vision set contains only 250 examples, why do you consider its coverage sufficient for robust evaluation?

EditLens Prediction: Lightly AI-edited
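The statistical point above is easy to act on: with per-task 0/1 outcomes, a percentile bootstrap gives the confidence interval the review asks for. The sketch below is a minimal illustration; the 250-example set size and the 60% accuracy are taken as assumptions for the example, not reported results.

```python
# Minimal sketch of a bootstrap confidence interval for an accuracy score,
# the kind of uncertainty estimate the review asks for. Numbers are illustrative.
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """95% percentile-bootstrap CI for the mean of 0/1 task outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Example: a hypothetical 250-task vision set with 60% accuracy.
outcomes = [1] * 150 + [0] * 100
print(bootstrap_ci(outcomes))  # roughly (0.54, 0.66): a wide interval at n=250
```

Even this simple interval makes the reviewer's concern concrete: at 250 examples, model rankings separated by a few accuracy points may not be statistically distinguishable.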