ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated | 2 (50%) | 4.00 | 3.50 | 4360
Heavily AI-edited | 0 (0%) | N/A | N/A | N/A
Moderately AI-edited | 0 (0%) | N/A | N/A | N/A
Lightly AI-edited | 0 (0%) | N/A | N/A | N/A
Fully human-written | 2 (50%) | 4.00 | 3.00 | 2762
Total | 4 (100%) | 4.00 | 3.25 | 3560
DeepResearchGuard: Deep Research with Open Domain Evaluation and Multi-Stage Guardrails for Safety

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper proposes a framework (DEEPRESEARCHGUARD) to enhance the safety and quality of deep research agents. Deep research agents are defined as LLM-based agents engaged in completing complex, multi-step research tasks. The paper argues that existing evaluations focus only on final-output accuracy and ignore risks that arise and propagate through the intermediate stages of the process. The proposed method has four stages and monitors the input, plan, research, and output stages of the agentic workflow. Each stage has a dedicated guard agent that classifies content for quality and safety, revises or rejects problematic content, and can escalate to a human reviewer. A new evaluation protocol that assesses the final report on five qualitative dimensions, together with a safety benchmark, is proposed to test these capabilities.

Strengths:
+ The framing of risk as not just an input/output problem, but rather a process-level issue where errors can cascade, is interesting.
+ The inclusion of stage-specific taxonomies is useful.
+ Shifting from a Q&A metric to a multi-dimensional assessment of report quality provides a more comprehensive evaluation. The new metrics D@1 and D@all are well motivated.

Weaknesses:
- DRSAFEBENCH is partially constructed by using an LLM to synthesize adversarial examples. There is therefore a risk that the benchmark evaluates the blind spots specific to that LLM rather than reflecting human-devised adversarial attacks. The paper does not provide enough information on how it validates that these synthetic examples are representative of real-world threats, and there is no discussion of how the potential for systemic bias is evaluated.
- The substantial increase in runtime cost is mentioned but not fully discussed.
- The architecture is highly complex, containing multiple agents, taxonomies, etc., which could make it brittle. It would be good to see a discussion of why this much complexity is necessary and how debugging and maintaining such a system would work in practice.
- A formal definition of "deep research" would have been helpful. The paper defines it broadly through examples, which makes it difficult to understand the scope of the problem and therefore to judge how generalizable the approach is.

Questions:
1. Can you provide more information on how you ensure that the LLM-generated examples reflect real-world, human-devised adversarial samples?
2. Can you comment on the generalizability of your approach? Do you anticipate changes in your modules depending on the use-case domain? It is conceivable that an agentic system for a fintech application could be completely different from one in the healthcare domain.
3. Can you discuss the need for the current level of complexity of the system? Is there a way to ensure that the design contains only the minimum necessary components?

EditLens Prediction: Fully human-written
DeepResearchGuard: Deep Research with Open Domain Evaluation and Multi-Stage Guardrails for Safety

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper aims to address the lack of holistic evaluation and safety control in deep research agents, which typically rely only on QA accuracy and overlook report-level quality factors like credibility, coherence, and safety. It introduces DeepResearchGuard, a structured framework that divides the research workflow into four guarded stages (Input, Plan, Research, and Output), each equipped with memory retrieval, planning, evaluation, and human escalation mechanisms. The system jointly assesses both source and report quality under open-domain conditions. Experiments show that DeepResearchGuard significantly improves defense success while reducing over-refusal rates, with clear stage-wise contributions and an observable safety-depth trade-off across different guard models.

Strengths:
It is good to see how the paper introduces a clear taxonomy of harmful content in open-domain deep research, showing how different types of risks can propagate across stages from Input to Plan, Reference, and Output. It also reframes safety control from a single blanket refusal into a set of fine-grained, auditable interventions: stopping harmful input early, improving citation discipline and source reliability during planning and research, and reinforcing structure, attribution, and risk disclosure at the output stage. Together, these ideas offer a well-structured and practical approach to improving overall report quality and trustworthiness.

Weaknesses:
1. While DeepResearchGuard improves DSR and maintains or lowers ORR, the paper doesn't discuss how this impacts overall task completion or utility. Does stronger harmful-content blocking come at the cost of reduced usefulness or coverage?
2. The evaluation of source transparency and traceability seems limited to helpfulness, authority, and timeliness. It would be useful to know whether the authors assessed model-level selection bias, since judgments of evidence helpfulness or relevance inherently depend on model capability. The paper also doesn't clearly define how "helpfulness" is measured or what its scope is.
3. It's unclear what proportion of cases require human intervention under this framework. A quantitative analysis of escalation rates and their impact on efficiency or consistency would make the system's practicality much clearer.
4. The paper doesn't provide detailed statistics on sample difficulty, open-domain coverage, or category distribution, only mentioning targeted manual additions for jailbreak and injection cases. More transparency about dataset composition would help assess the generalizability of the evaluation.

Questions:
1. It's not clear how accurate the Output Guard agent's scoring is across the five evaluation dimensions, or how it compares with a human baseline. It would be valuable to know where the largest discrepancies occur and which dimensions are most challenging. Also, the paper doesn't specify how the weighted sum for the overall score is determined: are the weights equal or task-dependent?
2. The confidence-based threshold for escalating to human intervention seems to be chosen heuristically. More detailed experiments or ablations explaining how this threshold was set would help validate its reliability.
3. The role of the memory component is somewhat underexplained. If the model's capability is limited, continuously storing and referencing processed cases might risk amplifying past errors. It would be useful to analyze whether memory helps stability or introduces error accumulation over time.
4. The paper doesn't report the distribution of refusal types (for instance, which valuable sources were mistakenly blocked), nor does it break down the associated costs in terms of tokens or latency. These details would give a clearer picture of the system's trade-offs and efficiency.

EditLens Prediction: Fully AI-generated
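For concreteness, one plausible reading of the weighted sum raised in Question 1 above is a convex combination of the five per-dimension report scores; the equal weighting shown here is an assumption for illustration, not a detail confirmed by the paper:

S_{\mathrm{overall}} \;=\; \sum_{d=1}^{5} w_d \, s_d,
\qquad \sum_{d=1}^{5} w_d = 1,
\qquad s_d \in [1,5],
\qquad \text{e.g. } w_d = \tfrac{1}{5} \text{ if the dimensions are weighted equally.}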
DeepResearchGuard: Deep Research with Open Domain Evaluation and Multi-Stage Guardrails for Safety

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper introduces DeepResearchGuard, a safety framework for deep research agents that decompose complex queries, iteratively search the literature, and generate structured reports. The authors identify two gaps in the current deep research agent framework: (1) evaluation focuses on QA accuracy rather than report quality, and (2) a lack of stage-specific safeguards allows harmful content to propagate through the multi-stage pipeline. DeepResearchGuard addresses these by implementing guards at four stages (input, plan, research, output), each with tailored taxonomies, memory-augmented classification, and human-in-the-loop intervention. The authors also contribute DRSafeBench, an 828-query benchmark combining adversarial and benign examples. Experiments across gpt-4o, Gemini-2.5-flash, DeepSeek-v3, and o4-mini show average defense success rate improvements of 18.16% while reducing over-refusal by 6%.

Strengths:
1. The four-stage framework design is clear and practical. Each guard has a clear taxonomy, severity-based actions, and memory retrieval for consistency.
2. The authors conduct a comprehensive experimental evaluation, testing four baseline LLMs, with ablation studies showing that the input guard contributes the most and a guard-model sensitivity analysis revealing safety-performance trade-offs.
3. The authors further introduce DRSafeBench, which contains 828 queries and covers diverse failure modes, including synthesized cases for low-quality/format errors.
4. The confidence-based escalation mechanism with user override options balances automation with human judgment.

Weaknesses:
1. I have some concerns about the attack methodology. Reference poisoning is simulated by rewriting content, which may not reflect real attacks. There is also no evaluation against actual jailbreaks or adversarial attacks designed to evade the guards.
2. For report quality evaluation, the authors only take 1-5 scores for coherence, credibility, etc., and then use an LLM-as-judge to evaluate. There is no human evaluation. For safety research, human evaluation may be important.
3. The paper doesn't compare against LlamaGuard, WildGuard, or other moderation systems that could be adapted for this task.
4. The authors didn't evaluate the framework on recent LLMs. Typically, as models are updated, their safeguards are also upgraded, so evaluating newer models is also necessary for this work. For example, how does DeepResearchGuard compare to simply using a stronger base model (e.g., the gpt-5 family for research instead of o4-mini + guards)?

Questions:
See Weaknesses.

EditLens Prediction: Fully human-written
DeepResearchGuard: Deep Research with Open Domain Evaluation and Multi-Stage Guardrails for Safety

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper introduces DeepResearchGuard (DRG), a framework for safe and open-domain deep research using multi-stage guardrails, combining reasoning-enhanced agents with layered safety evaluation. The authors propose a pipeline that integrates:
1. Open-domain research tasks (web-enabled reasoning, synthesis, and reporting),
2. Multi-stage guardrails for safety assurance (input, intermediate, and output filters),
3. An open benchmark, DRSAFEBENCH, for evaluating model safety, factuality, and reasoning depth.
The system design is centered around modular "Guard" components:
(a) Input Guard screens unsafe or policy-violating queries.
(b) Process Guard monitors the agent's intermediate reasoning chain for unsafe actions or hallucinations.
(c) Output Guard validates final answers for factual correctness and safety alignment.
The evaluation compares DeepResearchGuard against strong baselines (GPT-4, Gemini, Claude, etc.) using both safety metrics (toxicity, factuality, alignment) and reasoning metrics (depth, coverage, correctness). DRG reportedly achieves substantial gains in factuality and reasoning safety while maintaining competitive research quality.

Strengths:
(A) The paper tackles a timely and high-impact problem: the safe deployment of autonomous research agents.
(B) The multi-stage guard concept provides a modular lens on safety assurance, enabling potential composability and interoperability.
(C) The authors make a commendable effort to release a benchmark (DRSAFEBENCH) and system design intended to encourage openness and replication.

Weaknesses:
1. "Reasoning depth" and "safety compliance" are reported as scalar improvements without statistical context. No details are given about sample size, variance, or inter-rater reliability. Hence, claims such as "+15% factuality" or "+20% safety compliance" are not statistically grounded.
2. The paper lacks a precise mathematical definition of its evaluation metrics, particularly "reasoning depth." While qualitative examples suggest multi-hop reasoning assessment, no formal operator D(f) or depth measure is defined, leaving interpretability gaps in quantitative claims. The safety scoring function ϕ_i(x) is also treated as a black box, often described as "a model-based judge," without calibration analysis or ROC curves to support threshold stability. As such, while empirical results are strong, their robustness under distributional shift remains uncertain.
3. Limited novelty in the guardrail mechanisms themselves. The modular "multi-guard" setup echoes existing layered safety frameworks (e.g., Constitutional AI + LLM oversight). The originality primarily lies in system integration, not in new algorithmic or theoretical insights.
4. No analysis of guard composition. The system assumes that multi-stage filtering improves safety monotonically, yet provides no proof or empirical evidence of risk subadditivity. The absence of such analysis limits the theoretical depth of the contribution.
5. Inadequate transparency of model configuration. Details on base model sizes, prompt templates, or RLHF usage are not presented, making it difficult to attribute performance improvements to the guardrails rather than model capacity.
6. The benchmark DRSAFEBENCH, while valuable, is authored and evaluated by the same group, raising potential biases in task design and model ranking. Without third-party replication or human inter-annotator statistics (e.g., Cohen's κ), claims of general safety superiority remain somewhat self-referential. Furthermore, the paper does not provide asymptotic complexity or runtime analysis, yet multi-stage filtering clearly incurs nontrivial computational cost. The paper would benefit from an empirical latency curve L(n) ∝ n × c_i quantifying guard overhead per stage.
7. Multi-stage guardrails increase inference cost superlinearly (approximately O(n · c_i)), yet no performance or throughput metrics are provided. This omission makes real-world feasibility unclear. Each guard G_i is implemented via an LLM-based evaluator. If these evaluators share the same biases as the core model, then the composite guard G = G_3 ∘ G_2 ∘ G_1 becomes a biased projection of a single safety prior, invalidating the notion of layered protection.
8. While safety is central, the ethical discussion (e.g., accountability of multi-agent decisions, risk of subtle biases) remains generic and lacks deeper engagement with responsible AI literature.

Questions:
1. How are guardrails coordinated when their judgments conflict (e.g., the process guard flags an unsafe inference but the output guard clears it)?
2. What quantitative evidence supports that intermediate guard checks prevent unsafe outcomes rather than just delay them?
3. Is DRSAFEBENCH publicly released with example prompts, or only aggregate statistics?
4. Could the modular guard setup degrade reasoning quality by over-regularization?
5. How is factuality measured: automatic retrieval verification, or human grading?

EditLens Prediction: Fully AI-generated
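A minimal sketch of the kind of per-stage latency accounting requested in Weaknesses 6-7, assuming four sequential guard stages; the guard functions below are hypothetical placeholders standing in for the paper's LLM-based evaluators, not its actual implementation:

import time

# Hypothetical stand-ins for the LLM-based guard evaluators; a real deployment would
# issue a model call here and return a verdict such as {"action": "pass"} or {"action": "reject"}.
def input_guard(payload):    return {"action": "pass"}
def plan_guard(payload):     return {"action": "pass"}
def research_guard(payload): return {"action": "pass"}
def output_guard(payload):   return {"action": "pass"}

STAGES = [("input", input_guard), ("plan", plan_guard),
          ("research", research_guard), ("output", output_guard)]

def run_with_latency_log(query):
    """Run each guard stage in order, recording per-stage wall-clock overhead."""
    log = []
    for name, guard in STAGES:
        start = time.perf_counter()
        verdict = guard(query)
        log.append({"stage": name,
                    "seconds": time.perf_counter() - start,
                    "action": verdict["action"]})
        if verdict["action"] == "reject":  # a rejection short-circuits the later stages
            break
    return log

# Aggregating these logs over a benchmark would yield the empirical curve
# L(n) ≈ Σ_i n_i · c_i that the review asks for.
print(run_with_latency_log("example query"))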