ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated | 1 (25%) | 4.00 | 3.00 | 2881
Heavily AI-edited | 0 (0%) | N/A | N/A | N/A
Moderately AI-edited | 2 (50%) | 5.00 | 3.50 | 2443
Lightly AI-edited | 0 (0%) | N/A | N/A | N/A
Fully human-written | 1 (25%) | 4.00 | 4.00 | 2736
Total | 4 (100%) | 4.50 | 3.50 | 2626
Individual Reviews
Review 1

Title: MARWA: Multi-agent retrieval-augmented framework for reliable bioinformatics workflow automation
Soundness: 2 (fair) | Presentation: 3 (good) | Contribution: 3 (good)
Rating: 4 (marginally below the acceptance threshold)
Confidence: 4 (You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.)

Summary:
The paper proposes a multi-agent retrieval-augmented framework called MARWA for reliable bioinformatics workflow automation. MARWA's architecture is composed of six specialized LLM-based agents (Analyzing, Planning, Selecting, Generating & Executing, Debugging, and Judging) that operate in a step-by-step, closed-loop fashion. The authors also design an embedding method called LAFT, based on contrastive-learning fine-tuning of a pretrained BERT model. Experiments show that MARWA consistently outperforms baselines such as AutoBA and BioMaster, particularly in generating correct installation commands and file paths, leading to higher workflow success rates.

Strengths:
S1. The six-agent, step-by-step framework is a well-reasoned and significant improvement over one-shot generation. By breaking the complex problem of workflow creation into discrete and verifiable stages, the system introduces robustness and error-handling capabilities that are critical for this domain.
S2. Experiments show that MARWA and LAFT outperform other baseline methods.

Weaknesses:
W1. The paper emphasizes that the proposed method is specifically designed for bioinformatics workflow automation. However, although the evaluation datasets are related to bioinformatics, the architectures of the proposed MARWA and LAFT methods do not appear to have domain-specific optimizations for bioinformatics. Could the authors leverage the characteristic structures of bioinformatics data to optimize the model framework itself (rather than only the prompts)?
W2. In the embedding section, fine-tuning with contrastive learning is already a well-established approach for training embedding models. This work merely uses LLM-generated synthetic data to fine-tune the embedding model (BERT), without introducing a novel method (a minimal sketch of this standard setup follows this review). In addition, state-of-the-art embedding models are often based on decoder-only architectures with larger parameter scales, such as BGE-EN-ICL and Qwen3 Embedding.
W3. The experiments show that MARWA achieves a higher success rate than the baseline methods. However, given its highly complex workflow structure (six agents plus a loop structure), it is expected to consume significantly more tokens and inference time than other methods. The experiments, however, do not evaluate MARWA's token costs, inference time, or similar metrics.

Questions:
1. Based on Table 1, can the authors compare retrieval performance when replacing the BERT model with more recent open-source base embedding models, such as BGE-EN-ICL or Qwen3 Embedding?
2. Can the authors design experiments to evaluate the cost of MARWA, such as the number of tokens consumed and the average inference time?
3. Can the authors compare MARWA with some general-purpose agent methods, such as ReAct?

EditLens Prediction: Fully human-written
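For readers unfamiliar with the setup W2 refers to, below is a minimal sketch of contrastive fine-tuning of a BERT-based embedding model on LLM-generated (query, tool description) pairs, using the sentence-transformers library. The base model, example pairs, and hyperparameters are illustrative assumptions, not the paper's actual LAFT configuration.

```python
# Minimal sketch (not the paper's LAFT): contrastive fine-tuning of a BERT-based
# encoder on synthetic (task query, tool description) pairs, as W2 characterizes it.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical LLM-generated positive pairs: a user task paired with the
# description of the tool that should be retrieved for it.
pairs = [
    ("align paired-end RNA-seq reads to a reference genome",
     "HISAT2: a fast and sensitive spliced aligner for RNA-seq reads"),
    ("call somatic variants from tumor/normal BAM files",
     "Mutect2: somatic SNV and indel caller in the GATK toolkit"),
]

model = SentenceTransformer("bert-base-uncased")  # wraps BERT with mean pooling
train_examples = [InputExample(texts=[q, d]) for q, d in pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other description in the batch serves as a negative.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=10)

# At retrieval time, queries and tool descriptions are embedded with the same
# encoder and matched by cosine similarity.
```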
Review 2

Title: MARWA: Multi-agent retrieval-augmented framework for reliable bioinformatics workflow automation
Soundness: 2 (fair) | Presentation: 2 (fair) | Contribution: 3 (good)
Rating: 4 (marginally below the acceptance threshold)
Confidence: 4 (You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.)

Summary:
This paper presents an innovative Multi-Agent Retrieval-Augmented framework aimed at enhancing bioinformatics workflow automation. The approach leverages multi-perspective LLM-enhanced tool descriptions combined with contrastive representation learning to achieve robust semantic representations of bioinformatics tools, ultimately improving tool retrieval accuracy. The evaluation dataset constructed for this purpose could serve as a valuable resource for the research community.

Strengths:
1. The integration of multi-perspective LLM-enhanced tool descriptions is a promising approach for tool selection in complex scientific domains, potentially benefiting agent systems.
2. The two proposed datasets could significantly aid in the evaluation of bioinformatics agents.

Weaknesses:
1. The claim of complete automation in the current workflow seems somewhat overstated. Is there any algorithmic illustration provided? Is the sequence of operations predefined?
2. The framework includes six cooperative LLM-based expert agents. Are the same models used across all six components, or are there distinct characteristics for different experts? Insights into model selection would be beneficial.
3. There is a sentence-structure issue on lines 263-264 that needs clarification.
4. Is the file system intended to be multimodal?
5. Which specific LLMs are used as evaluators? What distinguishes the evaluation process from the judging operation?
6. A more detailed presentation of the dataset's difficulty and characteristics would enhance clarity. Additionally, what are the cost differences between MARWA and test-time scaling with powerful LLM-only methods in solving the task?

Questions: Identical to the 'Weaknesses' noted above.

EditLens Prediction: Moderately AI-edited
Review 3

Title: MARWA: Multi-agent retrieval-augmented framework for reliable bioinformatics workflow automation
Soundness: 3 (good) | Presentation: 2 (fair) | Contribution: 3 (good)
Rating: 6 (marginally above the acceptance threshold)
Confidence: 3 (You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.)

Summary:
This paper proposes a multi-agent retrieval-augmented framework, MARWA, to address the robustness and scalability issues in bioinformatics workflow automation. With the rapid growth of multi-omics data, bioinformatics analysis workflows are becoming increasingly complex, and manually constructing workflows is both time-consuming and error-prone. Existing automation methods based on Large Language Models (LLMs) suffer from issues such as "one-off generation," inaccurate tool retrieval, and insufficient evaluation. MARWA significantly improves the accuracy of tool retrieval and command generation by employing six collaborative agents (analysis, planning, selection, generation and execution, debugging, and judgment), combined with retrieval augmentation (RAG), multi-perspective LLM tool descriptions, and contrastive learning. Furthermore, the paper proposes a two-stage evaluation system that combines expert execution and large-scale LLM evaluation. Experiments demonstrate that MARWA outperforms existing methods in pass rate, workflow quality, and scalability, laying the foundation for reliable automated bioinformatics workflows.

Strengths:
- This work employs a multi-agent collaborative architecture, involving a series of intricate processes including analysis, planning, selection, execution, debugging, and judgment, thereby significantly improving process robustness and flexibility.
- The method uses an LLM to generate multi-perspective tool descriptions and optimizes tool embeddings through BERT contrastive learning, achieving tool retrieval accuracy higher than mainstream baselines.
- Real-world execution and large-scale evaluation are combined: 40 expert-validated tasks and 2270 LLM-evaluated tasks provide a comprehensive evaluation system, making the results highly convincing.

Weaknesses:
- While the evaluation system is comprehensive, large-scale tasks primarily rely on automated evaluation with LLMs, resulting in a limited number of actually executed tasks. Furthermore, for biomedicine, does over-reliance on automated evaluation accurately reflect real-world usability?
- Expansion of the tool database primarily relies on manual verification and command logging, leaving room for improvement in automation and scaling.
- The description of contrastive learning is rather brief, lacking sufficient disclosure of hyperparameters, training-set size, and other details.

Questions:
- Will the multi-perspective generation of tool descriptions in MARWA's retrieval augmentation lead to semantic drift due to LLM hallucination? How can description consistency be guaranteed?
- Are there compatibility solutions for the file-system interface across different operating systems (e.g., Windows, macOS)? How general is its practical deployment?
- In large-scale evaluations, is there more detailed statistical analysis of the accuracy of LLM automatic scoring and its consistency with expert scoring? (See the agreement sketch following this review.)
- How adaptable is MARWA to new tools or parameter changes? Does it support automatic tool version identification and compatibility?
- Will multi-agent collaboration lead to significant computational resource consumption? What are the actual hardware requirements for deployment?

EditLens Prediction: Moderately AI-edited
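As a concrete illustration of the statistical analysis the third question asks for, here is a small sketch of measuring agreement between LLM-assigned and expert-assigned scores with rank correlation and weighted kappa. The scores shown are hypothetical placeholders, not data from the paper.

```python
# Illustrative sketch (not from the paper): quantifying agreement between
# LLM-assigned scores and expert scores on the same set of tasks.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired scores on a shared subset of tasks (e.g., a 1-5 quality scale).
expert_scores = [5, 4, 3, 5, 2, 4, 3, 5]
llm_scores = [5, 4, 4, 5, 2, 3, 3, 4]

# Rank correlation: do the two raters order the tasks similarly?
rho, p_value = spearmanr(expert_scores, llm_scores)

# Weighted kappa: chance-corrected agreement that penalizes large disagreements more.
kappa = cohen_kappa_score(expert_scores, llm_scores, weights="quadratic")

print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f}), quadratic-weighted kappa = {kappa:.2f}")
```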
Review 4

Title: MARWA: Multi-agent retrieval-augmented framework for reliable bioinformatics workflow automation
Soundness: 2 (fair) | Presentation: 2 (fair) | Contribution: 2 (fair)
Rating: 4 (marginally below the acceptance threshold)
Confidence: 3 (You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.)

Summary:
This paper tackles the challenge of **automating bioinformatics workflows**, where existing LLM-based systems often fail due to **ambiguous task definitions, heterogeneous tools, and unreliable one-shot generation**. To address these issues, the authors propose **MARWA**, a multi-agent retrieval-augmented framework that **decomposes workflow construction into modular stages with dedicated agents** for task clarification, tool retrieval, command synthesis, and error correction. MARWA enhances tool selection through contrastive-learning-based retrieval and validates reliability via a two-stage evaluation combining expert execution and large-scale LLM-based assessment. Experiments demonstrate that MARWA substantially improves workflow accuracy, robustness, and scalability over existing baselines.

Strengths:
The paper designs a comprehensive end-to-end execution pipeline that covers the entire process, including task analysis, tool selection, and workflow execution and validation. This holistic design ensures not only that each stage is logically grounded and traceable, but also that potential errors can be detected and corrected through contextual feedback, significantly enhancing overall reliability.

Weaknesses:
The paper mainly proposes solutions to surface-level problems without uncovering the deeper reasoning gaps between the task requirements and the chosen methods. For instance, it claims that replacing a one-shot generation process with a step-by-step approach improves workflow reliability, yet it never explains *why* one-shot generation fails in this task setting, *what specific reasoning capabilities* are lacking, or *why* step-by-step reasoning would inherently address them. Since step-by-step generation is already a standard configuration for large language models, this modification only strengthens a weak baseline rather than constituting a principled innovation. Moreover, the framework appears as an incremental extension or a more fine-grained reactive pipeline, without demonstrating true multi-agent cooperation or emergent division of labor. Consequently, the proposed contributions lack clear conceptual novelty and task-specific motivation, resulting in a framework that feels more like a layered adaptation of existing paradigms than a genuinely innovative approach.

Questions:
1. Have previous studies already introduced benchmark datasets or evaluation protocols for workflow automation, and how were these used to assess reliability or execution quality?
2. What specific improvements or novelties do the proposed dataset and evaluation metrics in this paper provide beyond existing ones, in terms of coverage, realism, or reproducibility?
3. Has the paper reported the computational or financial cost (e.g., model inference time, agent coordination overhead, or GPU usage) associated with the multi-agent setup, and how does this compare to the baselines? (A bookkeeping sketch follows this review.)

EditLens Prediction: Fully AI-generated
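To make Question 3 concrete, here is a hedged sketch of how per-agent token and latency accounting could be instrumented around an OpenAI-compatible chat client. The client, model name, and helper function are assumptions for illustration; they are not MARWA's implementation.

```python
# Illustrative sketch (not from the paper): per-call token and latency accounting
# for an OpenAI-compatible chat API, the kind of bookkeeping Question 3 asks about.
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; the model name below is an assumption
usage_log = []

def tracked_call(agent_name, messages, model="gpt-4o-mini"):
    """Run one agent call and record token counts and wall-clock latency."""
    start = time.perf_counter()
    response = client.chat.completions.create(model=model, messages=messages)
    elapsed = time.perf_counter() - start
    usage_log.append({
        "agent": agent_name,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "latency_s": elapsed,
    })
    return response.choices[0].message.content

# After a workflow run, aggregate cost per agent role:
def totals_by_agent(log):
    totals = {}
    for entry in log:
        t = totals.setdefault(entry["agent"], {"tokens": 0, "latency_s": 0.0})
        t["tokens"] += entry["prompt_tokens"] + entry["completion_tokens"]
        t["latency_s"] += entry["latency_s"]
    return totals
```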