DRBench: A Realistic Benchmark for Enterprise Deep Research
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper introduces DRBench, a benchmark for answering deep research questions that require synthesizing information from both web pages and enterprise documents embedded in files or applications such as email. The 15 tasks are curated using LLMs with a human-in-the-loop approach that involves generating a company and persona, collecting relevant public URLs and extracting insights from them, generating questions based on the public insights, generating internal insights and distractors for those questions, and generating internal documents to embed those insights. The answers are evaluated on insight recall, factuality, distractor avoidance, and report quality, measured using LLM-as-a-Judge. A baseline DRBench Agent (DRBA) is also developed that consists of research planning, action planning, adaptive action planning, and report generation. Experimental results show that even SoTA models struggle with these tasks, particularly on insight recall.
1. Novel and challenging task that involves assimilating information from diverse sources and interacting with applications such as email
2. Systematic pipeline for benchmark creation
3. Extensive experiments and analyses to measure the performance of many models on four comprehensive criteria. Results demonstrate benchmark complexity
4. Human evaluation to validate both benchmark creation and evaluation metrics
1. Missing relevant work "Benchmarking Deep Search over Heterogeneous Enterprise Data" by Choubey et al.
2. Very small dataset consisting of only 15 tasks
3. No agent identified public insights. Some analysis should be done to understand whether this is due to a lack of web indexing, the retrieval approach, tool limitations, benchmark design, the agent's extraction step, etc. For example, are the public insights inaccessible to the search tool?
1. Although some tables (e.g., Table 11) show the number of questions answered successfully, why is such an accuracy not reported for all tasks and models? This seems to be the most important evaluation criterion. For example, an agent might retrieve all insights but still not synthesize them into the correct answer. LLM-as-a-Judge could be used for this criterion too. Does the benchmark include gold answers for all questions?
2. Which model is used for LLM-as-a-Judge? This could impact results, as LLM judges are known to be biased towards models from their own family.
Fully human-written

DRBench: A Realistic Benchmark for Enterprise Deep Research
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper proposes DRBench, a benchmark and reproducible enterprise environment for evaluating deep-research agents that must synthesize evidence from both public web sources and private organizational data spread across real applications. It provides 15 persona-grounded tasks spanning 10 domains with injected ground-truth insights and distractors, generated via a structured LLM+human-in-the-loop pipeline and anchored to dated, authoritative URLs. The evaluation framework scores reports along insight recall, factuality via per-claim citation verification using a fixed RAG pipeline, and multi-dimensional report quality via LLM-as-a-judge. A baseline DRBench Agent with planning variants is analyzed; results show adaptive planning boosts recall while lightweight planning better preserves factuality, and app-based environments are notably harder than local file access. The paper includes ablations across multiple backbone models, browser-only baselines, and a small human study validating metric alignment.
- The benchmark genuinely bridges public web retrieval with private enterprise data across heterogeneous formats and real apps, grounded in personas and company context; this goes beyond web-only research settings.
- The evaluation is thoughtfully designed—atomic insight extraction, strict per-claim citation checks with a controlled RAG pipeline, explicit distractor avoidance, and nuanced report quality scoring—plus an anti-gaming cap on evaluated insights.
- A clear, reproducible pipeline with human verification produces distractor-rich files and stable, dated public sources; the environment is containerized and well-documented, enhancing reproducibility.
- Ablations on planning strategies, backbone models, and local vs app-based settings expose concrete failure modes, offering actionable guidance for future agent design.
- DRBench fills a gap between deep research benchmarks and computer-use agent tests, with code/scripts promised and a setup that feels close to real enterprise workflows.
- The paper’s presentation could be slightly improved:
- Figure 8 could be better presented.
- The paper introduces “golden insight” (e.g., Sec. 6 and Prompt 15) without prior definition, seemingly as a synonym for “ground-truth insights,” which creates confusion in the evaluation description. Please unify the terminology and define the term at its first occurrence.
- There are two versions of labels in Figure 4 that overlap with each other.
- Only one evaluation backend (GPT-4o) is used; the stability of the evaluation across different backends is not assessed.
- What do the stars in L948–L949 mean?
- MinEval selects only the retail domain. Given that stratifying the same total number of tasks across industries (retail, healthcare, EV) need not increase computational/evaluation cost, why not adopt stratified sampling or include at least one representative task per industry? (A sketch of such stratified sampling appears after these questions.)
- Can the agent access public resources beyond the Task URL? If not, how do you limit the behavior of agents like OpenAI Deep Research?
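To make the stratified-sampling suggestion concrete, below is a minimal sketch under assumed data structures: a task list whose items carry an "industry" field (the field name, the `per_industry` parameter, and the sampling rule are hypothetical, not taken from the paper).

```python
# Hedged sketch: stratified MinEval subset selection across industries.
# Assumes each task is a dict with a hypothetical "industry" key.
import random
from collections import defaultdict

def stratified_mineval(tasks, per_industry=1, seed=0):
    rng = random.Random(seed)
    by_industry = defaultdict(list)
    for task in tasks:
        by_industry[task["industry"]].append(task)
    subset = []
    for industry, bucket in sorted(by_industry.items()):
        # Draw up to `per_industry` tasks from each industry bucket.
        subset.extend(rng.sample(bucket, min(per_industry, len(bucket))))
    return subset

# Hypothetical usage: one task per industry, at the same total evaluation cost
# as a retail-only subset of equal size.
tasks = [
    {"id": 1, "industry": "retail"},
    {"id": 2, "industry": "retail"},
    {"id": 3, "industry": "healthcare"},
    {"id": 4, "industry": "EV"},
]
print(stratified_mineval(tasks))
```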
Lightly AI-edited

DRBench: A Realistic Benchmark for Enterprise Deep Research
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces DRBench, a benchmark for evaluating deep-research agents that must integrate public web information with private, enterprise-like data (e.g., files, chats, emails) inside a realistic multi-app environment. It proposes three evaluation axes—Insight Recall & Distractor Avoidance, Factuality (via evidence-checked citations), and Report Quality—and presents a baseline DRBench Agent (DRBA) with variants (SRP/CRP/AAP). Experiments span 15 persona-grounded tasks across 10 domains, with analyses of planning strategies, backbone LLMs, and app-based vs. local file access.
The paper convincingly argues that prior deep-research benchmarks are predominantly web-only and do not measure whether agents surface the most salient enterprise insights or ground claims with citations.
Tasks require tool use across storage, chat, email, and documents, which distinguishes DRBench from web-only retro-search settings such as DeepResearchGym and Deep Research Bench, both of which rely on fixed corpora or “frozen web” for reproducibility rather than mixed private+public sources.
Multi-axis evaluation design. The insight recall vs. distractor avoidance split is well motivated; factuality uses RAG-style evidence checks; report quality is judged on structured dimensions. The methodology reflects current best practice in LLM-as-a-judge evaluations.
Limited external validity and task coverage. While the 15 tasks are persona-grounded, the coverage across industries and the depth of internal-knowledge heterogeneity remain modest. Benchmarks such as DeepResearchGym and BrowseComp-Plus now report hundreds to thousands of instances or large curated corpora; DRBench’s small task count risks overfitting and limited statistical power.
LLM-as-judge reliance & bias. All key metrics (recall alignment, factuality judgments, report quality) ultimately depend on LLM judges. The paper would benefit from more thorough human-vs-LLM agreement studies and inter-rater reliability analyses beyond the limited assessments reported.
Robustness of Insight Recall to paraphrase or partial matches is unclear; span-level alignment and evidence coverage per insight are not evaluated.
Can you quantify LLM-judge and human agreement for each metric (beyond small samples), and report Fleiss’ κ or Krippendorff’s α per dimension? (A minimal sketch of such an agreement computation follows these questions.)
What are the exact artifacts you will release (images, VM snapshots, container specs, synthetic email/chat generators, grading scripts)? Any non-redistributable components?
How robust is Insight Recall to paraphrase or partial matches? Do you evaluate span-level alignment and evidence coverage per insight?
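As a concrete illustration of the agreement analysis requested above, here is a minimal sketch. It assumes a subjects-by-raters matrix of categorical scores in which one column is the LLM judge and the others are human annotators; the rating scale, data, and column layout are hypothetical, not taken from the paper.

```python
# Hedged sketch: chance-corrected agreement between an LLM judge and human raters.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: 10 items scored by 1 LLM judge + 2 humans on a 3-point scale.
ratings = np.array([
    [2, 2, 1],
    [0, 0, 0],
    [1, 2, 1],
    [2, 2, 2],
    [0, 1, 0],
    [1, 1, 1],
    [2, 1, 2],
    [0, 0, 1],
    [2, 2, 2],
    [1, 1, 1],
])

table, _ = aggregate_raters(ratings)          # subjects x categories count table
kappa = fleiss_kappa(table, method="fleiss")  # Fleiss' kappa
print(f"Fleiss' kappa: {kappa:.3f}")
```

Reporting such a value per evaluation dimension (recall alignment, factuality, report quality) would substantiate the human-LLM agreement claims.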
Fully AI-generated

DRBench: A Realistic Benchmark for Enterprise Deep Research
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces DRBench, a benchmark for evaluating AI agents on multi-step, long-horizon enterprise deep research tasks. It consists of 15 persona-grounded tasks across 10 domains. The tasks involve retrieving information and generating insights from public web content as well as private enterprise data (emails, chats, documents, spreadsheets) to answer business-related queries. It proposes four evaluation metrics: Insight Recall, Factuality, Distractor Avoidance, and Report Quality. It also introduces a baseline DRBench agent and evaluates it on the benchmark across multiple planning strategies and backbone models.
1. The paper tackles enterprise deep research, an important and largely unexplored problem space.
2. The benchmark incorporates private data sources distributed across realistic enterprise applications (cloud storage, chat, file system, etc.) with diverse file formats, creating a more realistic evaluation environment.
3. The evaluation framework consists of multiple complementary metrics that assess agentic systems on both precision and recall as well as report quality.
4. The paper includes ablation studies across planning strategies, backbone LLMs, and environment settings (local vs. app-based).
1. A major limitation of the benchmark is its size: 15 tasks and 114 insights make the benchmark quite small, which raises questions about the statistical significance of the evaluation.
2. Extraction of atomic insights from the final report is a critical step in the evaluation method, since 3 of the 4 metrics depend on it. However, because it relies on LLMs, this step will be noisy, leading to less reliable metric scores.
3. Relying on LLM-as-a-judge as the only method of evaluation raises questions about the accuracy of the evaluation, since LLMs can hallucinate and exhibit bias. Although the authors report correlation with human preferences, this does not really indicate how accurate the LLM evaluations are.
4. Synthetic data generation, while making benchmark construction more scalable, raises questions about how realistic the internal enterprise data is. Combined with the use of an LLM for evaluation, this can introduce additional noise and bias into the evaluation results.
5. Although the paper includes several ablation studies, it does not provide an in-depth analysis of why the results turn out the way they do. It simply states that a certain model or approach is better than another without attempting to explain why. A prime example is the statement that no agent managed to successfully source external knowledge, offered without any explanation of why this happened.
1. Results show relatively poor performance on the insight recall metric across models and methods. Why is this? What fraction of the loss is due to incorrect or noisy insight extraction?
2. How good is the atomic insight decomposition? Is there any quantitative analysis measuring the quality of the decomposition, as well as of the different metric scoring steps?
3. The decision to cap evaluated insights at the number of ground-truth insights plus five when computing the insight recall score seems arbitrary. Were other methods for penalizing agents that copy all content into the generated report explored? (A sketch of one possible reading of this cap follows below.)
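For concreteness, here is a minimal sketch of how such a cap might work, assuming it limits how many report insights are considered when matching against the ground truth; the `matches` predicate, the function name, and the exact cap rule are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: insight recall with an anti-gaming cap on evaluated report insights.
def capped_insight_recall(report_insights, gold_insights, matches, slack=5):
    cap = len(gold_insights) + slack        # assumed cap: ground-truth count + 5
    considered = report_insights[:cap]      # report insights beyond the cap are ignored
    recalled = {
        i for i, gold in enumerate(gold_insights)
        if any(matches(candidate, gold) for candidate in considered)
    }
    return len(recalled) / len(gold_insights) if gold_insights else 0.0

# Hypothetical usage with a trivial exact-match predicate.
recall = capped_insight_recall(
    ["Q3 churn rose 4%", "unrelated claim"],
    ["Q3 churn rose 4%", "APAC revenue fell"],
    matches=lambda candidate, gold: candidate == gold,
)
print(recall)  # 0.5
```

Under this reading, a larger slack makes it easier to inflate recall by dumping content into the report, which may explain the fixed choice of five; an alternative would be a precision-style penalty on unmatched report insights.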
Fully human-written |