|
Discourse-Aware Retrieval-Augmented Generation via Rhetorical Structure Modeling |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes Discourse-RAG, a retrieval-augmented generation framework that explicitly models intra- and inter-chunk rhetorical structures using Rhetorical Structure Theory (RST) and rhetorical planning to improve coherence and factual consistency in long-context reasoning. It demonstrates strong empirical results on multiple benchmarks (Loong, ASQA, and SciNews) across varying document lengths, outperforming several state-of-the-art RAG baselines. While the approach is somewhat heavy and empirically oriented, its clear performance gains and conceptual novelty justify acceptance, provided reproducibility and efficiency details are strengthened.
1. The idea of introducing rhetorical trees and inter-chunk discourse graphs into RAG is original and conceptually well-motivated, bridging discourse analysis and generative reasoning.
2. The method is tested on diverse, long-context benchmarks with detailed ablations and robustness studies (chunk size, Top-k, noise, and structure perturbations), giving credibility to the empirical claims.
3. Discourse-RAG outperforms strong baselines (StructRAG, MAIN-RAG, RQ-RAG) in both accuracy (LLM Score, EM) and factuality (SummaC, SARI), particularly on large-context and noisy retrieval scenarios.
1. Both intra- and inter-chunk discourse structures rely on LLM prompting for RST parsing, raising concerns about reproducibility, cost, and stability.
2. While results show improvements, the paper doesn’t deeply explore why rhetorical modeling helps or how structural cues propagate through the generator beyond surface correlations.
3. The multi-agent setup (parsing, graphing, planning) introduces significant preprocessing latency and complexity, which may limit real-time or large-scale deployment; no efficiency analysis is reported.
1. Provide a quantitative evaluation of RST parsing accuracy and its impact on final performance (e.g., noise sensitivity to incorrect discourse trees).
2. Include a runtime and cost comparison versus other RAG systems (e.g., StructRAG, MAIN-RAG) to demonstrate practical feasibility. |
Fully AI-generated |
|
Discourse-Aware Retrieval-Augmented Generation via Rhetorical Structure Modeling |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Discourse-RAG, a novel training-free framework designed to address a key limitation in standard Retrieval-Augmented Generation (RAG): the tendency to treat retrieved documents as a flat, unstructured "bag of facts." This "flat structure" problem leads to intra-chunk structural blindness and inter-chunk coherence gaps, hindering the model's ability to synthesize evidence and reason.
S1. The paper identifies a clear and important limitation of standard RAG (its "flat structure") and proposes a novel, linguistically-grounded solution that directly addresses it.
S2. The method achieves state-of-the-art performance on multiple, diverse benchmarks (long-doc QA, ambiguous QA, summarization), demonstrating its effectiveness and generalization ability.
S3. The paper is clear, well-illustrated, and reproducible thanks to the detailed appendices.
W1. This method is computationally expensive. The proposed pipeline requires an enormous number of LLM inference calls per query: as described in the methodology, for a top-k retrieval the framework makes k calls for intra-chunk RST tree construction, k * (k - 1) calls for inter-chunk rhetorical graph construction, and 1 call for planning (see the call-count sketch after this list).
W2. The entire framework's success is predicated on the LLM's ability to function as a high-quality, zero-shot RST parser. This capability is assumed, not proven.
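To make the cost concern in W1 concrete, here is a back-of-the-envelope count of per-query LLM calls. The breakdown (k intra-chunk parses, k*(k-1) pairwise relation calls, 1 planning call) follows the review's reading of the methodology; the specific k values are illustrative only.

```python
def llm_calls_per_query(k: int) -> int:
    # k intra-chunk RST parses + k*(k-1) pairwise relation calls + 1 planning call
    return k + k * (k - 1) + 1  # = k**2 + 1, before the final generation call

for k in (5, 10, 20):  # illustrative top-k values, not taken from the paper
    print(k, llm_calls_per_query(k))  # -> 26, 101, 401
```

Even at a modest top-k of 10, this already implies over a hundred LLM calls before the answer itself is generated, which is why a latency and token-cost analysis seems necessary.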
Q1: Why did the authors not include an intrinsic evaluation of the RST parser against a gold-standard dataset? How can we be confident that the generated structures are faithful and not just plausible-sounding hallucinations that happen to guide the LLM?
Q2: Did the authors compare the full, complex RST parsing against simpler structural signals? For example, what is the performance if only explicit discourse markers (e.g., "however", "because", "in contrast") are used to build the inter-chunk graph, without any RST tree parsing? A minimal sketch of such a marker-only baseline follows.
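A minimal sketch of the marker-only baseline suggested in Q2, assuming a simple cue-word heuristic; the marker list and relation labels below are invented for illustration and are not the paper's relation inventory.

```python
import re

# Hypothetical cue-word-to-relation map; both markers and labels are illustrative.
MARKERS = {
    "however": "contrast", "in contrast": "contrast",
    "because": "cause", "therefore": "result", "for example": "elaboration",
}

def marker_graph(chunks: list[str]) -> dict[tuple[int, int], str]:
    # Link consecutive chunks whenever the later chunk contains an explicit
    # discourse marker; no RST parsing and no LLM calls are involved.
    edges = {}
    for j in range(1, len(chunks)):
        text = chunks[j].lower()
        for marker, relation in MARKERS.items():
            if re.search(rf"\b{marker}\b", text):
                edges[(j - 1, j)] = relation
                break
    return edges

print(marker_graph(["Model A works well on short inputs.",
                    "However, it fails on long documents."]))
# -> {(0, 1): 'contrast'}
```

Such a baseline would isolate how much of the reported gain comes from full RST parsing versus shallow cue-word signals. |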
Lightly AI-edited |
|
Discourse-Aware Retrieval-Augmented Generation via Rhetorical Structure Modeling |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper presents Discourse-RAG, a RAG framework that explicitly models discourse structure. It works via a three-stage pipeline: 1) constructing intra-chunk RST trees to identify core vs. supporting information, 2) building inter-chunk rhetorical graphs to model relationships between chunks, 3) using a discourse-aware planning module to generate a blueprint for the final answer. Experiments on the ASQA, Loong, and SciNews benchmarks show that Discourse-RAG achieves strong performance.
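To make the three-stage flow concrete, here is a minimal, hypothetical sketch of the pipeline as summarized above; all function names, return shapes, and relation labels are placeholders, not the paper's actual prompts or data structures.

```python
from itertools import permutations

def build_rst_tree(chunk: str) -> dict:
    # Stage 1 (hypothetical): one LLM call per chunk would return an RST-like
    # tree separating nucleus (core) spans from satellite (supporting) spans.
    return {"nucleus": chunk, "satellites": []}

def infer_relation(tree_a: dict, tree_b: dict) -> str:
    # Stage 2 (hypothetical): one LLM call per ordered chunk pair would label
    # the inter-chunk rhetorical relation (e.g. support, elaboration, conflict).
    return "elaboration"

def discourse_rag(query: str, chunks: list[str]) -> str:
    trees = [build_rst_tree(c) for c in chunks]                # k calls
    graph = {(i, j): infer_relation(trees[i], trees[j])
             for i, j in permutations(range(len(chunks)), 2)}  # k*(k-1) calls
    # Stage 3 (hypothetical): one planning call turns trees + graph into an
    # answer blueprint that conditions the final generation step.
    return f"answer plan for {query!r} using {len(trees)} trees, {len(graph)} edges"

print(discourse_rag("example question", ["chunk A", "chunk B", "chunk C"]))
```

Reading the method this way also makes the quadratic growth of the inter-chunk stage with top-k immediately visible, which motivates the scaling questions below.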
The paper is well written and achieves strong performance against good baselines.
The method can be plugged into any setup without fine-tuning.
The components are ablated.
There is no analysis of how the method scales, in terms of cost (tokens) and latency, with higher top-k settings.
What are the tradeoffs of offline indexing, i.e., pre-computing the RST trees for the whole dataset?
How does the method scale in latency and token usage with higher top-k values, different chunk sizes, and larger documents? |
Fully human-written |
|
Discourse-Aware Retrieval-Augmented Generation via Rhetorical Structure Modeling |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes Discourse-RAG, a retrieval-augmented generation pipeline that makes the model explicitly use discourse structure. It first parses each retrieved chunk into an RST-like tree and then links chunks with rhetorical relations to show support, elaboration, or conflict. A final planning stage guides generation using this structure. In evaluation, it outperforms standard RAG and other structure-aware baselines on long-context QA, ambiguous QA, and scientific summarization.
- It uses a single pipeline in which chunk-level discourse trees feed into a cross-chunk graph, and both are used to guide generation.
- The method is tested on three tasks, with two Llama models, in both open and closed settings, and it beats 2025 RAG baselines, including on ASQA.
- The ablations, noise tests, and chunk size tests show that the method improves across different settings.
- While training-free, the pipeline requires multiple LLM calls per query (RST parsing per chunk, O(k²) pairwise relation inference, planning, generation). Cost grows quickly with larger k and the paper should discuss the latency and token counts.
- I noticed that all LLM benchmarks use Llama 3.x models. Since Qwen was already used for embeddings, why not include Qwen models in the main comparisons as well?
- Discourse quality is not validated. All trees and relations come from an LLM prompt rather than a parser with known accuracy. If the LLM segments poorly, the whole pipeline can degrade.
- The relation set may be too large. The method uses many fine-grained discourse labels but does not show which ones actually help. A smaller set might work just as well.
- Evaluation scope is narrow. Results are mostly on English, long-context, clean inputs. It is unclear how well this works on multilingual or noisy data.
- Did you evaluate the accuracy of your LLM-generated RST trees against gold-standard annotations (e.g., on RST-DT)?
- Have you considered integrating neural discourse parsers or non-RST frameworks (e.g., entity grids, coherence models)?
- In cases where Discourse-RAG underperforms standard RAG (if any), what are the failure modes? Are they due to incorrect rhetorical parsing, poor planning, or something else?
- Beyond automatic metrics (ROUGE, LLM Score), was there any human assessment of coherence, faithfulness, or readability? LLM-based scoring can be biased. |
Heavily AI-edited |