Proof-Augmented Retrieval and Reasoning: Supervising Language models for Knowledge Graph Completion with Link Predictors
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper proposes Proof-Augmented Retrieval and Reasoning (PARR), a general framework for finetuning language models for knowledge graph completion. PARR consists of three components: a rewriter LLM, a fixed sparse or dense retriever, and a reasoner LLM. The rewriter augments the query into several triplets so that the retriever can better recall the paths needed to prove the target answer. The reasoner then reasons over the set of retrieved triplets and decides the final top-k answers. To finetune the rewriter, the authors dump path proofs from existing link predictors, namely NBFNet and A*Net, and use them to compute several minimal sets of rewritten queries as the training signal for the rewriter. The reasoner is then finetuned on the output of the rewriter and the retriever to predict the final answer. For top-k supervision, the authors distill the output of NBFNet or A*Net into the reasoner LLM. PARR achieves state-of-the-art performance on three transductive and two inductive knowledge graph datasets.
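To check my understanding of the three-stage flow, here is a minimal runnable sketch; all function names, the toy scoring rule, and the data are my own placeholders standing in for the two finetuned LLMs and the fixed retriever, not the authors' implementation.

```python
# Hypothetical sketch of the PARR pipeline; names and scoring are placeholders.

def rewrite_query(query):
    """Stands in for the rewriter LLM: expand (head, relation, ?) into sub-queries."""
    head, relation, _ = query
    return [(head, relation, "?"), (head, "some_related_relation", "?x")]

def retrieve(sub_queries, kg, k=5):
    """Stands in for the fixed sparse/dense retriever: recall up to k triplets per sub-query."""
    retrieved = []
    for h, r, _ in sub_queries:
        matches = [t for t in kg if t[0] == h or t[1] == r]
        retrieved.extend(matches[:k])
    return retrieved

def reason(query, context, k=10):
    """Stands in for the reasoner LLM: here, a toy frequency score over retrieved tails."""
    scores = {}
    for _, _, tail in context:
        scores[tail] = scores.get(tail, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy usage on a three-triplet KG
kg = [("a", "r1", "b"), ("b", "r2", "c"), ("a", "r2", "c")]
query = ("a", "r2", "?")
print(reason(query, retrieve(rewrite_query(query), kg)))  # ['c', 'b']
```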
1. Instead of using LLMs to encode text features or triplets in a KG, this paper proposes a novel way to use LLMs as a reasoner over a retrieved subgraph. This potentially leverages the rule knowledge in LLMs and their ability to perform logical deduction.
2. The framework aligns well with classical logical reasoning algorithms: the rewriter, the retriever, and the reasoner correspond to backward chaining, unification, and forward chaining, respectively. The rewriter is responsible for recall, while the reasoner is responsible for precision.
3. PARR achieves state-of-the-art performance on all popular KGC benchmarks, and its prediction steps are interpretable to humans.
1. PARR heavily relies on paths and predictions generated by NBFNet or its derivatives. To train PARR, one must first train an NBFNet and dump its path proofs for every triplet in the training set, which is very time-consuming. If proofs are used in retrieval, one also needs to run NBFNet inference on the inductive test sets before applying PARR. As shown in Table 9, it takes 835 GPU hours (4 days on 8 H100s) to finetune the reasoner LLM on YAGO3-10, which is far more costly than NBFNet or A*Net. For example, A*Net reports 20.8 min/epoch on YAGO3-10, which translates to roughly 14 GPU hours. It is not practical to spend 60× the training time for a 3% gain in HITS@1 (the arithmetic is spelled out after this list).
2. Since PARR requires finetuning two LLMs, it is not clear to me whether PARR benefits from generative LLMs themselves or merely from the curated training signals. For example, one could train a small Transformer from scratch for query rewriting (similar to the LSTM used in [1]) using the same training signal, and train a subgraph reasoner from scratch like NBFNet. Both models only require relation embeddings and do not need a language prior.
[1] Yang et al. Differentiable Learning of Logical Rules for Knowledge Base Reasoning. NIPS 2017.
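To spell out the cost comparison in weakness 1, using only the numbers quoted there (the roughly-40-epoch figure is inferred from those numbers, not taken from the A*Net paper):

$$
\underbrace{20.8\ \text{min/epoch} \times \sim\!40\ \text{epochs} \approx 14\ \text{GPU hours}}_{\text{A*Net on YAGO3-10}}
\quad\text{vs.}\quad
\underbrace{835\ \text{GPU hours}}_{\text{PARR reasoner, Table 9}},
\qquad \frac{835}{14} \approx 60\times .
$$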
1. I recommend that the authors change the term "proofs" to "paths", or more specifically "grounding paths", for consistency with the terminology in the KG literature [2][3].
2. Does the initial rewriter need to work reasonably well from the start? Otherwise, you cannot obtain enough minimal rewriting sets for training, right?
3. Line 261: If LLMs lack sufficient understanding of the patterns in KGs, why not train the reasoner from scratch? That would be much cheaper than an LLM.
4. Table 4: How is the average number of retrievals computed? Why is it larger than the average number of rewrites times k?
5. Line 398: How do you compute recall for queries with n ground truth answers?
6. Line 422: What do you mean by ground truth retrievals? Is it the path proof output by NBFNet?
7. Missing references: RAG for KGC [4], query rewriting [5].
8. The fonts for $Q$ in Lines 220 and 227 are inconsistent. You may use either ordinary and mathcal fonts, or mathcal and mathfrak fonts, to distinguish the set $Q$ from the power set $Q^*$, respectively.
9. Line 235: please state explicitly that you take the $\arg\min$ over $|Q|$ (a possible explicit form is sketched after this list).
10. Typos:
1. Line 88: absed → based
2. Line 222: Sec 4.1 → Sec 4.2
3. Line 248: missing {} for sets.
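Regarding point 9, one explicit formulation could look like the following (my own guess at the notation, reusing the set $Q$ and power set $Q^*$ from point 8, not the authors' exact symbols):

$$
Q_{\min} \;=\; \operatorname*{arg\,min}_{\substack{Q' \in Q^{*} \\ Q' \text{ suffices to retrieve a proof of } (h,\, r,\, t)}} |Q'| .
$$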
[2] Cohen. TensorLog: A Differentiable Deductive Database. NIPS 2016.
[3] Zhu et al. Neural Bellman-Ford Networks: A General Graph Neural Network Framework for Link Prediction. NeurIPS 2021.
[4] Das and Godbole et al. Knowledge Base Question Answering by Case-based Reasoning over Subgraphs. ICML 2022.
[5] Gao and Ma et al. Precise Zero-Shot Dense Retrieval without Relevance Labels. arXiv 2022.
Fully human-written |
Proof-Augmented Retrieval and Reasoning: Supervising Language models for Knowledge Graph Completion with Link Predictors
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces **PARR (Proof-Augmented Retrieval and Reasoning)**, a framework for knowledge graph completion (KGC) that supervises large language models (LLMs) using *proof paths* extracted from interpretable link predictors such as NBFNet and A*Net.
The framework consists of three modules:
1. **Rewriter** – decomposes queries into sub-queries to improve retrieval coverage.
2. **Retriever** – gathers triplets and associated proofs to form subgraph contexts.
3. **Reasoner** – performs chain-of-thought reasoning to predict missing links.
PARR fine-tunes Llama3 and Qwen3 models on these tasks and reports competitive performance on the FB15K-237, WN18RR, and YAGO3-10 datasets in both transductive and inductive settings.
- **Bridging neurosymbolic and LLM reasoning.** The integration of interpretable link predictors (proof-based reasoning) with LLMs is conceptually appealing and timely, aligning with the trend toward more structured and interpretable language models.
- **Methodological completeness.** The paper covers the entire pipeline—proof extraction, query rewriting, reasoning fine-tuning, and evaluation—with comprehensive ablation studies that clarify component contributions.
- **Empirical validation.** Results show consistent, if modest, improvements on HITS@1 and HITS@3 metrics over prior LLM-based baselines like MKGL and KICGPT.
- **Readable and organized.** The writing is clear, figures are helpful, and the method is easy to follow even for readers not deeply familiar with KGC literature.
- **Limited conceptual novelty.** The main ideas—query rewriting, proof-based retrieval, and CoT reasoning—are all well-established in related literature (e.g., RAG, symbolic reasoning, and interpretable KGE). The contribution lies more in *combining* these ideas than introducing a fundamentally new concept.
- **Marginal empirical gains.** Improvements over strong baselines (e.g., MKGL) are small, typically within 1–2% HITS@1. This raises questions about how much the proposed supervision actually improves structural reasoning versus serving as an additional form of data augmentation.
- **Interpretability not demonstrated.** Despite emphasizing “proof-guided interpretability,” the paper provides no qualitative analysis or human-interpretable reasoning traces. Without such evidence, the interpretability claim remains speculative.
- **Dependence on external models.** The reliance on pre-trained interpretable link predictors for proof extraction makes the framework less self-contained and raises scalability concerns for large KGs like Wikidata.
- **Narrow evaluation scope.** Experiments are limited to small, standard KGC datasets. There is no demonstration that PARR generalizes to large-scale or real-world knowledge graphs in scientific or biomedical domains.
1. How exactly does proof-guided CoT supervision differ from standard reasoning fine-tuning using retrieved triplets? (A sketch of the contrast I have in mind follows these questions.)
2. What is the runtime and resource cost of proof extraction and mixture-of-experts (MoE) sampling?
3. Can the framework generalize beyond link prediction—for instance, to multi-hop question answering?
4. Are there cases where noisy or contradictory proofs degrade reasoning performance?
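To make question 1 concrete, here is the kind of contrast I have in mind; the entities, relations, proof format, and target strings are all invented for illustration and are not taken from the paper.

```python
# Hypothetical contrast between the two supervision formats in question 1.
# All names and strings are invented; the proof path mimics what a link
# predictor such as NBFNet might surface, not the paper's actual format.

query = ("Stanley Kubrick", "nationality", "?")
retrieved = [
    ("Stanley Kubrick", "born_in", "New York City"),
    ("New York City", "located_in", "United States"),
    ("Stanley Kubrick", "director_of", "The Shining"),
]

# (a) Standard fine-tuning on retrieved triplets: context + query -> answer only.
standard_target = "United States"

# (b) Proof-guided CoT supervision: the target also spells out the proof path
#     before the answer, so the model is trained to reproduce the reasoning.
cot_target = (
    "Proof: (Stanley Kubrick, born_in, New York City) and "
    "(New York City, located_in, United States) imply the answer. "
    "Answer: United States"
)

print(standard_target)
print(cot_target)
```

If the only difference is whether the proof string is appended to the target, it would help to quantify how much of the gain comes from that string versus from the retrieved context itself.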
Fully AI-generated |
Proof-Augmented Retrieval and Reasoning: Supervising Language models for Knowledge Graph Completion with Link Predictors
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper proposes PARR (Proof-Augmented Retrieval and Reasoning), a native generative LLM framework for knowledge graph completion (KGC). Rather than treating an LLM as a reranker or discriminative scorer, PARR retrieves proofs from pre-trained link predictors and has LLMs generate answers based on the retrieved proofs. Experiments on a few benchmarks show performance comparable to previous LLM-based methods.
1. Novel Framework: The PARR framework presents a new retrieval-augmented reasoning approach for applying LLMs to knowledge graph completion, offering a fresh perspective compared to previous methods that mainly used LLMs as encoders, lacked structural reasoning, or suffered from hallucination.
2. Combination with Link Predictors: By leveraging pre-trained link predictors to retrieve relevant proofs, PARR effectively combines symbolic reasoning with the generative capabilities of LLMs, enhancing the overall reasoning process and reducing the LLM hallucination.
3. Empirical Validation: The authors conduct extensive experiments on multiple benchmarks, demonstrating that PARR achieves comparable performance to existing LLM-based methods, validating the effectiveness of their proposed framework.
1. Limited Novelty: While the retrieval-and-reasoning framework is new for KGC, similar frameworks have been widely explored in closely related tasks, especially knowledge graph question answering (KGQA). The KGC task itself is a special case of KGQA in which the question is given explicitly as a triple with a missing entity (e.g., the query (Einstein, born_in, ?) corresponds to the question "Where was Einstein born?").
The idea of retrieving relevant information (e.g., path-like proofs) and using LLMs to reason over them has been explored in prior works such as RoG [1]. A similar approach, utilizing GNNs to retrieve paths and then employing LLMs to generate answers, has also been explored in the KGQA literature [2].
Therefore, the novelty of the proposed framework in the context of KGC is somewhat limited.
The authors should better clarify the differences and connections between their work and prior works in KGQA, and highlight the unique contributions of PARR specifically for KGC.
2. Unsubstantiated Claims on Hallucination Reduction: The authors claim that PARR improves over previous LLM-based methods by reducing hallucination in prediction through the use of retrieved proofs. However, the proposed rewriter is still based purely on the LLM's internal knowledge learned during training, so it can still generate hallucinated facts that do not exist in the knowledge graph. The authors should provide more empirical evidence evaluating the rewriting results to support the claim of hallucination reduction.
3. Training Complexity: The training cost of PARR is relatively high, as it involves training LLMs on a large amount of generated data. The method needs to retrieve proofs for all triples in KGs and generate corresponding reasoning data for LLM training, which can be computationally expensive and time-consuming, making it less practical for large-scale KGs.
4. Limited Performance and Generalizability: While the proposed framework involves heavy LLMs and training, its performance and generalizability are still limited compared to traditional KGC methods like NBFNet. For example, on FB15k-237 Hits@10, NBFNet achieves 0.599 while PARR only achieves 0.593. Moreover, the experiments in inductive settings show that PARR's performance drops significantly, indicating limited generalizability to unseen KGs.
Since the proposed method is built on top of traditional KGC methods like NBFNet, I am not convinced that the introduction of LLMs and the complex training process is justified by the limited performance gains and generalizability.
5. Minor Issues: The multi-answer generation setting does not seem reasonable. Since the original KG already contains 1-to-many relations, the multiple valid tail entities are directly available as ground-truth answers, so there is no need to sample the top-k predictions from link predictors as ground truth.
6. Typos and Notation: The writing and structure of the paper could be improved. Many short sections could be merged for better readability and flow. There are also several notation inconsistencies, e.g., $R$ is used to denote both the relations and the retrieved triples.
[1] Luo, Linhao, et al. "Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning." The Twelfth International Conference on Learning Representations.
[2] Mavromatis, Costas, and George Karypis. "GNN-RAG: Graph Neural Retrieval for Efficient Large Language Model Reasoning on Knowledge Graphs." Findings of the Association for Computational Linguistics: ACL 2025.
1. How does the proposed method compare to existing RAG-based KGQA methods?
2. How can the hallucination issues introduced by the LLM rewriter be alleviated?
Heavily AI-edited |
Proof-Augmented Retrieval and Reasoning: Supervising Language models for Knowledge Graph Completion with Link Predictors
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes a large language model (LLM)-based knowledge graph completion (KGC) method. The core idea is to use existing KGC models as retrievers to identify reasoning paths in the graph, and then combine these paths to train an LLM as a reasoner. A mixture-of-experts mechanism is employed to integrate multiple retrievers effectively.
To further enhance the model’s performance, several optimization techniques are incorporated, such as query rewriting, fine-tuning, and sampling strategies. Experimental results demonstrate the effectiveness and superiority of the proposed approach.
In the experiments, the authors compare the proposed method with a wide range of baselines, demonstrating that it achieves the best overall performance.
The proposed pipeline is comprehensive, encompassing key components such as query rewriting, retrieval, reasoning, and fine-tuning.
The paper is easy to understand and provides sufficient details to facilitate understanding.
Although the paper is relatively easy to follow, the writing quality is quite rough and would benefit from substantial revision for clarity and readability.
The experiments are conducted on only three knowledge graphs for the transductive setting and two for the inductive setting. However, prior works typically evaluate on a wider range of datasets, such as Family and UMLS.
In the inductive setting, it is unclear why several common baselines are omitted.
Moreover, the ablation study is not sufficiently comprehensive—there are no experiments analyzing time complexity. The proposed method also appears to be computationally expensive.
Please refer to the Weaknesses section.
Moderately AI-edited |