ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated | 1 (25%) | 4.00 | 3.00 | 2978
Heavily AI-edited | 0 (0%) | N/A | N/A | N/A
Moderately AI-edited | 0 (0%) | N/A | N/A | N/A
Lightly AI-edited | 2 (50%) | 2.00 | 4.00 | 3283
Fully human-written | 1 (25%) | 2.00 | 4.00 | 2602
Total | 4 (100%) | 2.50 | 3.75 | 3036
Review 1

Title: XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes XGRAG, an explainability framework for Graph-based Retrieval-Augmented Generation (GraphRAG) systems. Unlike previous explainability methods that operate on unstructured text, XGRAG performs graph-native perturbations (removing nodes, removing edges, or injecting synonyms) to quantify the causal influence of each graph component on the final LLM answer. The framework integrates (1) entity deduplication to merge semantically equivalent nodes, (2) perturb-generate-compare evaluation to compute importance scores (see the sketch after this review), and (3) alignment with graph structural measures to assess validity. Experiments on NarrativeQA show that XGRAG improves explanation accuracy, robustness across story types and question complexities, and generalization across multiple open-source LLMs (LLaMA 3.1-8B, Mistral-7B, LLaVA-7B, etc.). Ablation studies confirm that entity deduplication and node-level perturbations are key to the performance gains.

Strengths:
1. Clear motivation. The paper identifies a genuine gap: existing XAI methods for RAG cannot interpret reasoning grounded in structured graph data. XGRAG directly addresses this with a graph-native perturbation approach.
2. Novel methods to perturb the graph. The "Perturb-Generate-Compare" paradigm is adapted elegantly to graphs, combining semantic and structural importance in a unified explanation measure.
3. Comprehensive experiments. The authors provide strong empirical validation spanning multiple LLMs, question types, and story structures. Ablation studies show the necessity of entity deduplication and compare three perturbation strategies.

Weaknesses:
1. Limited experimental domain. All experiments are conducted on NarrativeQA; evaluation on other domains (scientific, biomedical, or factual QA/KGs) would better demonstrate the ability to generalize.
2. Potentially biased evaluation. When building the ground truth, the authors rely on the assumption that "graph components semantically similar to the final answer are the most relevant pieces of evidence." This assumption can be ungrounded, especially when no exact information directly answers the query; in that case, relevant (semantically similar) information could instead cause hallucination [1].
3. Scalability issues. The framework requires multiple GraphRAG invocations per perturbation. While LightRAG mitigates some of the cost, scalability to very large KGs or multi-hop queries remains uncertain.

[1] GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation

Questions:
1. How does XGRAG scale to industrial-scale KGs (millions of entities)? Would LightRAG still be computationally feasible for large perturbation batches?
2. The evaluation relies on similarity to the final answer. Have the authors considered other evaluation metrics, such as task-specific annotation, to confirm faithfulness?
3. How sensitive is performance to the similarity threshold (θ_sim) used in entity deduplication? Can the authors include an ablation study for this threshold?

EditLens Prediction: Fully AI-generated
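To make the perturb-generate-compare loop summarized above concrete, here is a minimal illustrative sketch; the generate and similarity callables and the graph.without() helper are assumptions made for this example, not the paper's actual interfaces.

```python
from itertools import chain

def perturb_generate_compare(query, graph, generate, similarity):
    """Score each graph component by how much its removal changes the answer.

    Assumed interfaces (not from the paper):
      - generate(query, graph) runs the GraphRAG pipeline and returns a string
      - similarity(a, b) returns a semantic similarity score in [0, 1]
      - graph exposes .nodes, .edges, and .without(component), which returns
        a copy of the graph with that component removed
    """
    baseline_answer = generate(query, graph)
    importance = {}
    for component in chain(graph.nodes, graph.edges):
        perturbed = graph.without(component)      # graph-native perturbation
        new_answer = generate(query, perturbed)   # regenerate with the perturbed graph
        # The more the answer drifts once the component is removed,
        # the more that component mattered to the original response.
        importance[component] = 1.0 - similarity(baseline_answer, new_answer)
    return importance
```

Each scored component adds one full GraphRAG invocation, which is exactly the scalability concern raised in weakness 3 and question 1 above.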
Review 2

Title: XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper introduces an explainability method for GraphRAG that assigns importance scores to entities and relations by perturbing them.

Strengths:
S1. The paper is well-written and the figures are clear, making the proposed method easy to follow and understand.

Weaknesses:
W1. The connection of this work to "explainability" is tenuous. The output consists of importance scores for entities and relations, but these scores do not truly explain the inner workings of the LLM or why it generated a specific answer. For example, in the case study from Figure 1, knowing that the three entities (Gold Watch, Delta, Jim) all have non-zero importance scores does not explain how the model utilized these entities to formulate its response. In my view, this is more akin to "importance attribution" or "evidence identification" than genuine explainability.
W2. The experimental evaluation is insufficient. The entire study is conducted on a single dataset, NarrativeQA (2017), which is no longer considered a challenging benchmark for modern RAG systems and may not be representative of common GraphRAG application scenarios. The authors should supplement their experiments with more complex, practical use cases where GraphRAG is specifically needed and standard RAG would be insufficient. Otherwise, the practical significance of this work is questionable.
W3. The method's design appears to be more of an engineering effort and lacks principled innovation. The main contributions seem to be demonstrating the value of operations like deduplication and clustering. However, these are standard pre-processing techniques in graph manipulation and can hardly be considered novel contributions of this work.
W4. The practical utility of the proposed method is ambiguous. Once the importance scores for entities and relations are obtained, what is their explicit use case? Can they be used for debugging the knowledge graph, refining the retrieval process, or improving model factuality? The authors need to clearly articulate the downstream applications and benefits of their method.

Questions:
In addition to the points in the "Weaknesses" section, I have the following questions:
Q1. The knowledge graph (KG) is a relatively traditional method of information representation and has several inherent limitations. For instance, it is challenging to represent complex content (such as the equations in your paper) using a KG, whereas text can be seen as a more general-purpose information carrier. In fact, most existing KGs are constructed by extracting information from text. This raises a fundamental question: why do we need to revert from text back to the structure of a knowledge graph? Historically, due to limitations in text processing and representation capabilities, KGs were used to simplify textual information and remove redundancy, thereby facilitating better representation and retrieval. However, modern Large Language Models (LLMs) can now effortlessly handle the complexities of text representation. Therefore, I believe the remaining advantages of knowledge graphs in the current era of LLMs warrant a more in-depth analysis and discussion.
Q2. I observed in your experiments that the effectiveness of your proposed method is primarily benchmarked against RAG-Ex. However, your method utilizes a graph structure, while RAG-Ex operates directly on text. Does this comparison introduce a potential unfairness in terms of the data modality? After all, the conversion of text into a graph is a necessary prerequisite for any knowledge graph-based approach. Consequently, it is difficult to consider the mere utilization of a graph structure as a novel contribution of your method. Could you elaborate on this?

EditLens Prediction: Lightly AI-edited
Review 3

Title: XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes XGRAG, an explainability framework designed to identify the most influential nodes and edges that contribute to the outputs of GraphRAG models. The XGRAG framework consists of four components: a GraphRAG model for indexing and retrieving relevant subgraphs, an entity deduplication module for consolidating semantically similar entities, a perturber that systematically perturbs subgraphs, and an explainer that quantifies the importance of individual graph components to the model's final response.

Strengths:
1. The paper proposes XGRAG to generate fine-grained explanations that identify the most influential nodes and edges contributing to an LLM's response.
2. The paper conducts experiments on the NarrativeQA dataset, and the results indicate that the proposed XGRAG outperforms the baseline RAG-Ex.
3. The paper is well-written and easy to follow.

Weaknesses:
1. The paper is incremental, since the main idea of XGRAG, i.e., perturbing retrieved content to identify the most important components, has already been explored in previous works such as RAG-Ex.
2. The paper only compares against RAG-Ex, which is designed for text-based RAG models, and lacks comprehensive comparisons with more relevant baselines, such as KGRAG-Ex.
3. The paper lacks sufficient justification for the proposed approach. For example, a simple baseline could involve using GraphRAG to generate predictions and then computing the similarity between graph components and the predicted answer to identify important elements. Additionally, the paper notes that XGRAG's importance scores align with structural properties such as Degree and PageRank. This raises the question of why these existing graph metrics are not used directly for identifying influential components (see the sketch of such a structural baseline after this review).
4. The proposed method raises efficiency concerns, as it requires perturbing each graph component and generating outputs with the LLM. However, the paper does not provide statistics on the size of the retrieved subgraphs, making it difficult to assess the practical efficiency and scalability of the approach.
5. The paper only conducts experiments on the NarrativeQA dataset, raising concerns about whether the proposed approach can generalise to other datasets or domains.

Questions:
1. How does XGRAG differ from prior work like RAG-Ex and KGRAG-Ex, beyond being applied to GraphRAG?
2. Why are graph-based baselines such as KGRAG-Ex not included in the experimental comparison?
3. Can the authors provide an efficiency analysis to quantify the computational cost of XGRAG, particularly given the need to perturb and repeatedly invoke the LLM for each graph component?

EditLens Prediction: Fully human-written
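As a reference point for weakness 3 above, the kind of structural-only baseline the reviewer alludes to is cheap to compute with networkx; the helper below is a generic sketch, and the toy triples and relation labels at the bottom are invented for illustration rather than taken from the paper.

```python
import networkx as nx

def structural_importance(triples):
    """Rank entities by degree and PageRank, ignoring relation labels.

    `triples` is an iterable of (head, relation, tail) tuples. This is a
    generic structural baseline, not the paper's method; collapsing relations
    onto plain edges is the simplification the next review criticizes for
    multi-relational KGs.
    """
    graph = nx.DiGraph()
    for head, _relation, tail in triples:
        graph.add_edge(head, tail)
    degree = dict(graph.degree())      # incident-edge count per entity
    pagerank = nx.pagerank(graph)      # stationary visit probability per entity
    return degree, pagerank

# Invented toy triples, purely for illustration.
degree, pagerank = structural_importance([
    ("Jim", "owns", "Gold Watch"),
    ("Jim", "meets", "Delta"),
    ("Delta", "admires", "Gold Watch"),
])
```

Such scores depend only on the graph topology, which is why they need not coincide with importance for any particular question or answer.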
Review 4

Title: XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
Soundness: 1: poor
Presentation: 3: good
Contribution: 3: good
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper aims to explain the outputs of Graph-based Retrieval-Augmented Generation (GraphRAG) systems. The authors propose a perturbation-based explanation approach extended from the existing work RAG-Ex. The proposed method, XGRAG, uses LightRAG as the backbone and introduces graph-based perturbations such as node and edge removal. The experiments include only one baseline and one dataset.

Strengths:
1. The paper is well-written and clearly articulates its core idea.
2. The proposed approach reasonably integrates LightRAG with RAG-Ex, introducing two key modifications: improved entity deduplication and a graph perturbation mechanism.

Weaknesses:
1. The experimental evaluation lacks comprehensiveness, with only one dataset and one baseline included.
2. An important part of the evaluation relies on a hypothesis that is neither convincingly argued nor adequately justified in the paper.

Questions:
1. The evaluation includes only one baseline, RAG-Ex. The authors should justify why KGRAG-Ex (Balanos et al., 2025) was not included as a baseline comparison.
2. The use of a single dataset for evaluation undermines the generalizability of the experimental results.
3. Figures 3 and 4 employ an inappropriate visualization method: line plots are used even though the x-axis represents discrete variables, where bar plots would be the correct choice for non-continuous data.
4. In Table 3, the purpose of evaluating across different LLMs is to demonstrate that the conclusion, namely that XGRAG outperforms RAG-Ex, remains consistent regardless of the LLM used. However, to validate this claim, the results for RAG-Ex should also be reported alongside XGRAG for each LLM, rather than showing only XGRAG's performance.
5. The hypothesis stated in Line 339 is questionable. In KG retrieval, triples relevant to a query do not necessarily correspond to those with high structural importance. Furthermore, standard PageRank assumes homogeneous edge semantics and may not perform effectively on multi-relational KGs [A]. Given that this hypothesis is central to validating the quality of graph explanations, the authors should provide more rigorous justification and clarification.
6. In Line 450, the similarity threshold is not defined, nor is the selection methodology explained. Please provide this information (a toy sketch of threshold-based deduplication follows this review).
7. The paper lacks error analysis. Specifically, the authors should investigate which questions can be successfully explained by RAG-Ex, by XGRAG, by both methods, or by neither, and analyze the underlying reasons for these outcomes. Furthermore, the necessity of KGs for explanation should be empirically demonstrated through comparative analysis.

[A] Li, X., Ng, M. K., & Ye, Y. (2012, April). HAR: hub, authority and relevance scores in multi-relational data for query search. In Proceedings of the 2012 SIAM International Conference on Data Mining (pp. 141-152). Society for Industrial and Applied Mathematics.

EditLens Prediction: Lightly AI-edited
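Question 6 above (and question 3 of the first review) concerns the similarity threshold θ_sim used for entity deduplication. The sketch below shows one common way such a threshold can drive a greedy merging step; the embed callable, the greedy clustering scheme, and the 0.85 default are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

def deduplicate_entities(names, embed, theta_sim=0.85):
    """Greedily merge entity names whose embeddings are close enough.

    Assumptions for illustration (not from the paper):
      - embed(name) returns a 1-D numpy vector
      - a name joins the first cluster whose representative has cosine
        similarity >= theta_sim; otherwise it starts a new cluster
    """
    clusters = []  # list of (unit-norm representative vector, member names)
    for name in names:
        vec = embed(name)
        vec = vec / np.linalg.norm(vec)
        for rep, members in clusters:
            if float(np.dot(vec, rep)) >= theta_sim:
                members.append(name)
                break
        else:
            clusters.append((vec, [name]))
    return [members for _, members in clusters]
```

Because every merge decision hinges on theta_sim, small changes to it can re-partition the entity set, which is why both reviews ask for a sensitivity ablation on this threshold.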