DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces DeepDive, a framework designed to improve the "deep search" capabilities of open-source Large Language Models (LLMs), turning them into more effective web-browsing agents for solving complex problems.
Key Contributions:
* Automatic Data Synthesis: It proposes a novel strategy to automatically generate complex, difficult, and hard-to-find question-answer pairs from open knowledge graphs. This creates a rich dataset for training without manual effort.
* Reinforcement Learning (RL) for Training: It employs an end-to-end multi-turn Reinforcement Learning process to train the LLM. This method enhances the model's long-horizon reasoning. It also introduces a "redundancy penalty" to discourage the agent from making repetitive or similar search queries, encouraging more efficient and diverse exploration.
* The knowledge graph-based QA synthesis ensures high-quality, reasoning-intensive training data—by blurring entity attributes and filtering via frontier models (e.g., GPT-4o), it creates "hard-to-find" questions that truly stimulate deep search capabilities.
* The multi-turn RL framework (GRPO with a redundancy penalty) encourages diverse, efficient search: the penalty discourages repeated or near-duplicate queries (measured by Jaccard similarity; see the sketch after this list), while the end-to-end design integrates reasoning and tool use, enhancing long-horizon search.
* Strong experimental validation: DeepDive-32B sets a competitive record among open-source models on BrowseComp (15.3% accuracy), outperforms other open agents, and demonstrates test-time scalability (performance improves with more tool calls and with parallel sampling).
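For concreteness, here is a minimal sketch of how a Jaccard-based redundancy penalty of this kind could be computed; the function and variable names are illustrative assumptions, and the paper's exact formulation may differ.

```python
def redundancy_penalty(new_query: str, past_queries: list[str]) -> float:
    """Penalty in [0, 1]: maximum token-level Jaccard similarity between the
    new search query and any earlier query in the same trajectory."""
    new_tokens = set(new_query.lower().split())
    if not new_tokens:
        return 0.0
    best = 0.0
    for q in past_queries:
        tokens = set(q.lower().split())
        if not tokens:
            continue
        best = max(best, len(new_tokens & tokens) / len(new_tokens | tokens))
    return best

# toy usage: a near-duplicate query draws a high penalty
print(redundancy_penalty("deepdive browsecomp results", ["DeepDive BrowseComp results 2024"]))  # 0.75
```

Presumably this term is subtracted from the trajectory reward, so that repetitive querying lowers the return even when the final answer is correct.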
Many of the contributions in this paper appear to lack originality and novelty; specifics are given in the questions below. For example:
* One of the core claimed contributions is the synthesis of QA pairs from KGs with an obfuscation step. However, this approach was already proposed in the prior work, WebSailor. What is the specific contribution of this paper in this regard?
* There is extensive prior work on applying reinforcement learning (RL) to search agents. What is the novelty and contribution of the RL framework presented here compared to existing methods?
Lightly AI-edited
DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper proposes DeepDive, a web-browsing deep search agent. DeepDive develops a method to generate synthesized, hard-to-find QA data from knowledge graphs via long random walks with attribute obfuscation, and combines it with multi-turn RL using GRPO plus a redundancy penalty. The method aims to improve long-horizon reasoning with tools in a ReAct-style agent. DeepDive-32B shows superior performance compared to open models, though a significant gap remains compared to GPT deep research.
1. The paper combines KG-based synthesis of harder, multi-hop queries with multi-turn RL, and demonstrates meaningful improvements on challenging deep web-search tasks, which shows that tougher supervision plus RL can better exploit tools and sustain longer reasoning chains.
2. The ablation studies isolate the effects of the reward design and the synthetic data generation, providing evidence that each component contributes measurably to performance and search efficiency.
1. Uncontrolled inference-time scaling. The paper does not standardize or report consistent test-time budgets across baselines (e.g., max tool-call limits). Several baseline scores appear to be taken from prior work rather than re-evaluated under a matched inference budget, making the headline comparisons confounded by inference-time scaling.
2. Incremental contribution in data generation. The proposed synthetic data pipeline produces multi-hop questions/targets by traversing the graph (random walks) to generate search paths and then obfuscating attributes. Many existing efforts already use KG/graph path extraction to script multi-step reasoning trajectories, and there are also works that use LLMs to paraphrase and harden queries.
3. Use of recent models. The experiments could also evaluate the Qwen3-32B model, which has stronger tool-use ability than QwQ-32B.
The authors could address the concerns raised in weaknesses 1 and 3 above.
Fully human-written
DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper presents DeepDive, a method to enhance the capabilities of LLMs on complex "deep search" tasks, where they are often limited by weak long-horizon reasoning and a lack of high-quality training data. The approach has two key components:
1) A strategy to automatically synthesize complex, obfuscated question-answer pairs from Knowledge Graphs by using random walks and LLM-based entity blurring (a sketch of the random-walk step is given after this summary); and 2) An end-to-end multi-turn RL framework based on GRPO, which introduces a "redundancy penalty" to encourage diverse search.
Experiments show that DeepDive-32B achieves competitive results among open-source models on the BrowseComp benchmark.
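As a concrete illustration of component 1), below is a minimal sketch of the random-walk step over a knowledge graph; the data structure, names, and toy entities are assumptions for illustration rather than the paper's actual pipeline, and the LLM-based blurring and frontier-model filtering are only indicated in comments.

```python
import random

def sample_multi_hop_path(kg: dict, start: str, hops: int = 4, rng=random):
    """Walk `hops` edges from `start`, recording (entity, relation, neighbor)
    triples. `kg` maps an entity to a list of (relation, neighbor) edges.
    The final entity is the answer; a question is then written over the path,
    with entity attributes obfuscated by an LLM and easy or ill-posed items
    filtered out by a frontier model."""
    path, current = [], start
    for _ in range(hops):
        edges = kg.get(current, [])
        if not edges:
            break
        relation, neighbor = rng.choice(edges)
        path.append((current, relation, neighbor))
        current = neighbor
    return path, current  # (reasoning path, answer entity)

# toy usage on a three-edge graph
kg = {
    "Ada Lovelace": [("collaborated_with", "Charles Babbage")],
    "Charles Babbage": [("designed", "Analytical Engine")],
    "Analytical Engine": [("exhibited_at", "Science Museum, London")],
}
print(sample_multi_hop_path(kg, "Ada Lovelace", hops=3))
```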
- The paper's primary contribution is an automated data synthesis method for deep search. By leveraging the multi-hop structure and attributes of Knowledge Graphs (KGs) and using an LLM for "obfuscation," this method addresses the critical lack of "hard-to-find" questions in existing datasets.
- The paper provides detailed analyses, including scalability with tool calls, parallel sampling strategies, and the evolution of the model's search behavior during RL training.
- A significant concern arises from Appendix C, which introduces a semi-automated, i.i.d. data synthesis strategy. This method, using only ~3,000 QA pairs, achieves far superior results (22.2% on BrowseComp) compared to the main paper's automated KG method (15.3%). This large performance discrepancy calls into question the true value and scalability of the main proposed KG method.
- The novelty of the RL framework is limited. The core training framework is based on the existing GRPO algorithm, with the primary addition being a "Redundancy Penalty." While effective, this appears to be more of an engineering trick than a substantial algorithmic contribution.
- The ablation study for the "Format Reward" (Figure 7a) is inconclusive. The results show that removing it causes learning to stagnate (around 8.0% accuracy). This only demonstrates its necessity within the current framework, not its superiority or design effectiveness. The authors do not compare it against alternative intermediate rewards.
1. Regarding the large performance gap between the main KG method (15.3%) and the appendix i.i.d. method (22.2%): Since 3k QA pairs is not that much, can the authors elaborate on the cost-benefit trade-off? Specifically, what was the human-hour cost for the ~3k i.i.d. samples compared to the compute cost (including LLM obfuscation and filtering) for the KG data?
2. Regarding the Format Reward: as the ablation shows, this reward is critical for learning. Can you compare it against simpler baselines (e.g., a sparse reward given only for a successfully parsed tool call, rather than the strict, full format check; see the sketch below)? This would help clarify whether the strictness of the reward is the key factor or just the presence of an intermediate signal.
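To make the suggested comparison concrete, here is a rough sketch of the two reward variants the question has in mind; the <think>/<tool_call> tags and the JSON tool-call format are assumptions for illustration, not necessarily the paper's actual interaction format.

```python
import json
import re

def strict_format_reward(turn: str) -> float:
    """Reward only turns that match the full assumed template:
    <think>...</think> followed by exactly one well-formed JSON tool call."""
    m = re.fullmatch(r"\s*<think>.+?</think>\s*<tool_call>(.+?)</tool_call>\s*",
                     turn, flags=re.DOTALL)
    if m is None:
        return 0.0
    try:
        json.loads(m.group(1))
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def lenient_parse_reward(turn: str) -> float:
    """Sparser baseline from the question: reward any turn from which a
    well-formed tool call can be extracted, ignoring the surrounding format."""
    m = re.search(r"<tool_call>(.+?)</tool_call>", turn, flags=re.DOTALL)
    if m is None:
        return 0.0
    try:
        json.loads(m.group(1))
        return 1.0
    except json.JSONDecodeError:
        return 0.0
```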
Lightly AI-edited
DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Augmenting large language models (LLMs) with browsing tools greatly enhances their potential as deep search agents capable of tackling complex, real-world tasks. However, open-source LLMs still struggle in such settings due to limited long-horizon reasoning with browsing tools and the lack of sufficiently challenging supervised data. To overcome these limitations, this paper introduces DeepDive, a framework designed to advance deep search agents. First, it proposes an automated strategy for synthesizing complex, difficult, and hard-to-find questions from open knowledge graphs. Second, it employs end-to-end multi-turn reinforcement learning (RL) to improve LLMs’ deep search and long-horizon reasoning capabilities. Built upon open models, DeepDive-32B achieves 15.3% accuracy on BrowseComp and demonstrates strong test-time scaling in both tool usage and parallel sampling.
Clarity:
1. The paper presents its methods and experimental analyses clearly.
2. It provides sufficient ablation experiments to support the effectiveness of the proposed approach.
Originality: How does your work differ from “WebSailor: Navigating Super-human Reasoning for Web Agent”? I believe your approach shows little innovation in both the knowledge graph–based data construction and the multi-turn RL training methods.
1. What obfuscation strategies are used in the paper?
2. Regarding the multi-turn RL training method: it seems that the agent performs multi-step tool calls to produce an answer, but isn't this the standard approach in existing work? There are no methods that produce an answer in a single tool call, right? Or do you compute rewards separately for each intermediate tool call? My understanding is that your method assigns a single reward to the entire multi-step trajectory, combining answer correctness and a redundancy penalty (a sketch of this reading is given after these questions).
3. Is your training and evaluation framework based on open-source code?
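For reference, a minimal sketch of the reading assumed in question 2: each sampled rollout of the same question gets one scalar reward (answer correctness combined with the redundancy penalty), and a GRPO-style update converts those into group-relative advantages instead of rewarding each tool call separately. Names and the epsilon term are illustrative.

```python
from statistics import mean, pstdev

def group_relative_advantages(trajectory_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """One scalar reward per sampled trajectory for the same question;
    advantages are normalized within the sampled group, GRPO-style."""
    mu = mean(trajectory_rewards)
    sigma = pstdev(trajectory_rewards)
    return [(r - mu) / (sigma + eps) for r in trajectory_rewards]

# toy usage: 4 rollouts of one question, only the first answered correctly
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
```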
Lightly AI-edited |