ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---------------------|-------|------------|----------------|--------------------|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (25%) | 4.00 | 3.00 | 2088 |
| Fully human-written | 3 (75%) | 4.00 | 4.00 | 4484 |
| Total | 4 (100%) | 4.00 | 3.75 | 3885 |
Title: A Balanced Neuro-Symbolic Approach for Commonsense Abductive Logic
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper addresses the problem of reasoning with missing commonsense information. It proposes a method of iteratively eliciting the missing information from an LLM in the form of L1 ∧ L2 → L3, where L1, L2, and L3 are all literals, and L1 and L2 are already deducible from the current premises. The paper experiments with 3 logical datasets and 3 less logical datasets. Experimental results demonstrate that the proposed method outperforms existing neural and neural-symbolic methods.

Strengths:
1. The paper addresses the important problem of reasoning with missing commonsense information.
2. The paper proposes a simple but potentially effective method to abduce and reason with the missing information. In contrast to neural-symbolic methods based on auto-formalization, the method resorts to a more involved interaction of neural and symbolic components.
3. Experimental results demonstrate the viability of the proposed method.

Weaknesses:
1. The paper states that it is dealing with abductive propositional logic problems (Sec. 4, Problem Statement), but I believe the reasoning problem is first-order. In particular, the FOLIO dataset used here is a typical dataset for natural language reasoning with first-order logic. The paper also does not specify which SAT solver it is using.
2. Some of the logical notions in the paper are used improperly. For example, Line 141: "Propositional logic is a logical system that involves propositions about variables" is not a proper introduction of propositional logic. Line 91: "contain variables not previously mentioned in the input problem". I had difficulty understanding this at first, but later realized it actually means "contain propositions".
3. Many logical notations used in the paper are problematic; I only give some examples here. The formula in Line 187 is incorrectly written, the formula in Line 214 is confusing, and Proposition 1 in Appendix A is not well stated.
4. From Sec. 6.2, it seems that the work is founded on a perfect logical translation. On the one hand, auto-formalization is still a challenging topic. On the other hand, when the paper reports the performance of SAT-LM, which I assume is based on auto-formalization, the comparison might be unfair.

Questions:
1. Does the paper deal with propositional or first-order reasoning?
2. Which SAT solver is used?

EditLens Prediction: Fully human-written
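To make the loop described in this review's summary concrete, below is a minimal, purely illustrative sketch of that kind of abduction procedure. It is not the paper's implementation; all helper names (deducible_literals, llm_generate_rule, entails) are hypothetical stand-ins.

```python
from typing import Callable, FrozenSet, List, Set, Tuple

Clause = FrozenSet[str]  # a disjunction of literals, e.g. frozenset({"~L1", "~L2", "L3"})

def pairs(xs: List[str]) -> List[Tuple[str, str]]:
    """All unordered pairs of already-deducible literals."""
    return [(a, b) for i, a in enumerate(xs) for b in xs[i + 1:]]

def abductive_loop(
    premises: Set[Clause],
    query: str,
    deducible_literals: Callable[[Set[Clause]], List[str]],  # e.g. read off the SAT backbone
    llm_generate_rule: Callable[[str, str], str],            # (L1, L2) -> consequent L3, or ""
    entails: Callable[[Set[Clause], str], bool],             # SAT-based entailment check
    max_iters: int = 10,
) -> str:
    """Illustrative loop: ask the LLM for commonsense rules L1 ∧ L2 → L3 whose
    antecedents are already deducible, add them as clauses, and re-solve until
    the query (or its negation) is entailed or the iteration budget runs out."""
    for _ in range(max_iters):
        if entails(premises, query):
            return "True"
        if entails(premises, "~" + query):
            return "False"
        for l1, l2 in pairs(deducible_literals(premises)):
            l3 = llm_generate_rule(l1, l2)
            if l3:
                # encode L1 ∧ L2 → L3 as the clause (¬L1 ∨ ¬L2 ∨ L3)
                premises.add(frozenset({"~" + l1, "~" + l2, l3}))
    return "Unknown"
```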
Title: A Balanced Neuro-Symbolic Approach for Commonsense Abductive Logic
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: The paper presents ARGOS, a neuro-symbolic framework designed to solve logic problems that require abductive reasoning, i.e., the ability to infer missing commonsense information. It addresses a well-known gap in existing systems: while symbolic solvers are rigorous, they are brittle and require a complete set of premises, whereas Large Language Models (LLMs) possess vast commonsense knowledge but often fail at complex proof planning.

Strengths:
- Clarity: The paper is exceptionally well-written and easy to follow.
- Quality: The experiments are executed to a high standard, both methodologically and empirically.

Weaknesses:
- Durability of the Problem Statement Against Frontier Models: The paper's motivation hinges on the inability of LLMs to perform abductive reasoning. I find that SOTA thinking models like Gemini-2.5-pro can solve the paper's motivating "winter fox" example directly via chain-of-thought. This raises the question of whether the proposed method addresses a fundamental limitation or a capability gap in a specific class of models that may soon be obsolete.
- Worst-Case Complexity: The paper reports an average cost of 18.4 CoT calls (Table 3), which is reasonable. However, the worst-case cost is unbounded in theory and, in practice, is determined by the number of iterations allowed. For very hard problems requiring many abduction steps, the cost could become prohibitive, as each iteration involves multiple LLM calls (generation, commonsense scoring, relevance scoring) and solver calls. A discussion of the distribution of costs, not just the average, and of the performance/cost trade-off would be valuable.
- Inability to Express More Complex Rules: Real-world commonsense often takes more complex forms. The current llm_generate prompt structure seems hard-coded for the two-antecedent form. The paper would be more complete if it acknowledged this limitation and discussed potential extensions.

Minor suggestion on paper structure: Section 4 ("PROBLEM STATEMENT") is very concise and could be integrated into the end of Section 3 ("BACKGROUND") to improve the paper's narrative flow.

EditLens Prediction: Lightly AI-edited
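The worst-case cost concern raised in this review can be made concrete with a rough back-of-the-envelope count. The per-candidate call breakdown follows the review's description (generation, commonsense scoring, relevance scoring); the specific numbers are invented for illustration only.

```python
def worst_case_llm_calls(pairs_per_iter: int, max_iters: int, calls_per_candidate: int = 3) -> int:
    """Rough upper bound on LLM calls: every candidate literal pair triggers one
    generation call plus commonsense and relevance scoring calls, and this repeats
    for every allowed iteration. Illustrative only; not the paper's accounting."""
    return pairs_per_iter * calls_per_candidate * max_iters

# e.g. 20 backbone-literal pairs per iteration and a budget of 10 iterations
# gives up to 600 LLM calls, versus the 18.4-call average reported in Table 3.
print(worst_case_llm_calls(20, 10))
```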
Title: A Balanced Neuro-Symbolic Approach for Commonsense Abductive Logic
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: The paper proposes a neuro-symbolic framework called ARGOS to improve commonsense reasoning. The framework addresses the inability of logic solvers to handle missing commonsense facts by using an LLM to iteratively provide new commonsense propositions. An interesting contribution is the use of feedback from the symbolic SAT solver itself to guide the LLM's search for relevant facts. This allows ARGOS to search a larger space of potential facts, including new variables not present in the original problem. The framework also uses the LLM to score the generated facts for commonsense and relevance before adding them. The authors show that this approach improves performance on three abductive reasoning datasets.

Strengths:
- The proposed method provides an intuitive framework for combining the strengths of symbolic solvers and LLMs.
- The use of the SAT solver's backbone to guide the generation of new commonsense facts is novel.
- The empirical results are strong and show consistent improvements over existing neural and symbolic baselines on 3 datasets.
- The ablation studies show the value of the two main contributions, i.e., backbone-guided search and score-based thresholding.

Weaknesses:
- The tasks/datasets used are not practically relevant and lack real-world applicability. In addition, the paper relies on modified versions of existing datasets (ProntoQA, CLUTRR, FOLIO) to create an abductive setting, which means the evaluation is on a somewhat artificial task.
- The method requires logit-level access to score generated clauses for commonsense and relevance, which may not always be available for closed-source models.
- The main experiments assume a perfect logical translation from text, as failed translations were filtered out. This ignores the issue of imperfect translation, which could be a bottleneck for this method and for neuro-symbolic systems in general.
- The method relies on the LLM itself to score its own generated clauses for commonsense and relevance. The reliability of this LLM-as-a-judge component is not validated against human-annotated scores.
- There is a lack of examples accompanying the error analysis in the paper showing failure cases of ARGOS.
- The paper does not report the latency of the approaches compared in the main results. This is important because ARGOS likely requires much more computation time.

Questions:
- Since ARGOS depends on the LLM's ability to reliably score commonsense and relevance, did you do any human analysis to verify that the LLM-generated scores are calibrated and accurate?
- How often does ARGOS introduce an unseen variable that is important for solving the problem?

EditLens Prediction: Fully human-written
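The logit-level scoring and thresholding this review refers to can be pictured with a small sketch. The prompts, thresholds, and scorer interface below are invented for illustration and are not taken from the paper.

```python
import math
from typing import Callable, Dict

# Hypothetical interface: prompt -> {next token: log-probability}
NextTokenLogprobs = Callable[[str], Dict[str, float]]

def yes_probability(scorer: NextTokenLogprobs, prompt: str) -> float:
    """Turn a yes/no judgement into a score by reading the relative probability
    mass on 'Yes' vs 'No' from the model's next-token log-probabilities."""
    lp = scorer(prompt)
    p_yes = math.exp(lp.get("Yes", -30.0))
    p_no = math.exp(lp.get("No", -30.0))
    return p_yes / (p_yes + p_no)

def accept_clause(
    scorer: NextTokenLogprobs,
    clause_text: str,
    tau_commonsense: float = 0.7,  # illustrative thresholds, not the paper's values
    tau_relevance: float = 0.5,
) -> bool:
    """Keep a generated clause only if both LLM-derived scores clear their thresholds."""
    commonsense = yes_probability(
        scorer, f"Is the following statement commonsensically true?\n{clause_text}\nAnswer (Yes/No):"
    )
    relevance = yes_probability(
        scorer, f"Is the following statement relevant to the problem?\n{clause_text}\nAnswer (Yes/No):"
    )
    return commonsense >= tau_commonsense and relevance >= tau_relevance
```

Note that this style of scoring is exactly why the review flags the need for logit-level access: with API models that only return sampled text, the yes/no mass cannot be read off directly.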
Title: A Balanced Neuro-Symbolic Approach for Commonsense Abductive Logic
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary: This paper proposes a new neuro-symbolic approach for enhancing the logical commonsense reasoning capabilities of LLMs and suggests a new way of integrating LLMs with logic solvers. In the proposed approach, the LLM iteratively provides unstated commonsense clauses to a logic solver, guided by feedback from the solver in the form of the SAT problem backbone. This allows the system to perform abductive reasoning, filling in missing background facts while keeping the search tractable. Overall, this work aims to leverage the benefits of existing neural and symbolic methods to tackle commonsense logical reasoning problems.

Strengths:
1. The paper is easy to follow and is well-presented (modulo some issues that I point out in the weaknesses). The worked example presented in Section 5.2 and the methodology overview in Figure 3 facilitate understanding of the work.
2. The proposed methodology is novel and insightful. The general idea of providing a new paradigm of interaction between an LLM and a symbolic solver is interesting. Existing approaches either initiate reasoning from the LLM and delegate theorem proving to a solver, mimic inference rules using an LLM, or propose methodologies that leverage the LLM to undergo a rigorous reasoning process while exploiting its commonsense. This paper introduces interactions between the LLM and the solver, which I find novel and interesting.
3. The topic of focus, commonsense logical reasoning with LLMs, is quite an important topic with numerous practical applications. The idea of proposing novel frameworks for LLM interaction with formal reasoners can be impactful by reducing reasoning errors of LLMs, but the paper's evaluation needs to be strengthened to validate this effect more properly.

Weaknesses:
1. There are several statements in Section 3 that I think are vague, inaccurate, or wrong:
- Line 141: "Propositional logic is a logical system that involves propositions about variables." Propositional logic does *not* involve "propositions about variables." It is a logic of propositions themselves, and variables are symbols for propositions, not objects that propositions are "about."
- "A proposition, such as A → ¬B, is some statement about literals." Propositions aren't "about literals"; they are built from literals (or propositional variables) using logical connectives.
- "We assume that ¬(P ∧ C → ⊥), that is that the premises P are not contradictory with commonsense." The correct way to state consistency is P ∧ C ⊬ ⊥.
- "A predicate is a function, such as MotherOf(x, y)." A predicate represents a property or relation that can be true or false depending on its arguments, whereas FOL functions point to a particular object in the domain as their output. For example, MotherOf(x, y) is a predicate, which can be true or false, but MotherOf(x) is a function that returns a specific object y, i.e., MotherOf(x) = y.
- "∀(x, y) MotherOf(x, y) → ¬Male(x)" is a very unusual syntax. In standard FOL, you write either ∀x ∀y (MotherOf(x, y) → ¬Male(x)) or ∀x ∀y [MotherOf(x, y) → ¬Male(x)].
- Line 190: "First-order logic problems…" I encourage the authors to use the conventional terms "grounding" or "instantiation".
- Line 214: "we first try to solve the problem using the SAT Solver (sat_solve) to test whether (P ∧ C) = P ⊢ Q or ¬Q". What does (P ∧ C) = P ⊢ Q or ¬Q mean? I think you are just trying to state whether (P ∧ C) ⊢ Q or (P ∧ C) ⊢ ¬Q.
2. The use of self-consistency as one of the ways ARGOS can arrive at the final answer is questionable. At the end of the day, self-consistency relies on the LLM to do the reasoning, but the reason people combine symbolic theorem proving with LLM reasoning is that LLMs alone may make errors in their reasoning or generate hallucinated answers. The experiments section only reports the accuracy of final answers, whereas in LLM reasoning works such as [1], the correctness of the reasoning process is also critical. Specifically, this metric would be insightful to see whether the correctness of the reasoning process behind answers produced by the self-consistency mechanism of ARGOS also improves. This can be a useful complement to the results in Figure 5.
3. There are some typos in the text, such as line 81.

[1] Kazemi, Mehran, et al. "LAMBADA: Backward Chaining for Automated Reasoning in Natural Language." Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.

Questions:
1. In Line 228, the expression rank_B(L) = #{L′ ∈ B | L′ has an entity in common with L} is given. Is this intended to be score(L)? Why would the rank of each literal be the number of existing literals that share an entity with it? Regardless, the rationale for this approach is unclear to me; the only explanation provided is "which gives a measure of relevance of the literal to the problem", which is vague. I understand the space limitations in the main paper, but I strongly suggest you explain the rationale near Algorithm 2 in the appendix. (A small worked sketch of this count appears after this review.)
2. The methodology proposed in this work for leveraging the LLM's commonsense knowledge is restricted to generating literals that can be deduced from the existing literals in the backbone. While this approach is in agreement with the way existing logical reasoning datasets are formed, I do not think it is general enough for all practical applications. For example, a commonsense rule can be generated using only one literal from the backbone (e.g., ∀x Car(x) → Vehicle(x)), so why are pairs of literals necessary in the proposed approach?
3. Aside from being limited, I think the commonsense rule generation process in this work is also inefficient. Every pair of literals is presented to the LLM, and the scoring mechanism explained in Appendix D4 is used to filter irrelevant ones. Some works cited in the paper, such as LAMBADA and LLM TRes, take a goal-driven approach and only generate rules that can contribute to solving the problem. Why isn't a similar approach taken in ARGOS?
4. In Appendix D3, why are the approaches for scoring propositions different across datasets? I suggest a clarifying sentence to explain the rationale. I also appreciate the running example in Section 5.2, which facilitates understanding.
5. In line 202, it is stated that "Four annotated examples are provided, intended for few-shot prompting." How are these few-shot examples chosen? Do they differ per dataset? Are they chosen in a way that ensures there is no risk of revealing the answer to the LLM?
6. What is the rationale for reducing γ at each iteration? By reducing γ, you are making the method more lenient, accepting answers even when there is less consistency among LLM generations. Doesn't this reduce the rigor of reasoning as the algorithm proceeds?
7. Self-consistency is a key component of ARGOS and in fact one of the ways by which ARGOS generates its answers. As I mentioned in my earlier comments, using the LLM to generate the final answer might be sub-optimal and can produce hallucinated answers. Figure 5 nicely provides insight into how ARGOS improves the accuracy of self-consistency responses, but I think the paper's analysis also requires reporting the accuracy of the reasoning for all methods, at least for one dataset. Also, a study in which the self-consistency module of ARGOS is ablated should be provided, at least for one dataset.
8. Regarding RQ1, two mechanisms are used in ARGOS for scoring in the thresholding calculation. Are they both necessary? An additional ablation would be helpful. I also appreciate the honest discussion of limitations in RQ2.
9. I find the discussion in Section 6.2 questionable. Logical translation is a core part of ARGOS, not an orthogonal one, and assuming a correct propositional translation is available in real applications is a very big assumption to make. I think experiments on at least one dataset without filtering out translation failures are needed to shed light on how critical this step is to the framework. Using a more powerful LLM than the ones used for generating the commonsense rules is acceptable if translation is a major bottleneck, but an experiment using the same LLM that ARGOS uses for reasoning would also be quite beneficial. A proposed method isn't required to beat all baselines on all tasks, but the reader must know the strengths and limitations of the proposed method.

EditLens Prediction: Fully human-written
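For Question 1 of the review above, the quoted expression rank_B(L) = #{L′ ∈ B | L′ has an entity in common with L} can be read as a simple shared-entity count over the backbone. The sketch below only illustrates that reading, with invented literals; it is not the paper's code.

```python
from typing import FrozenSet, List, Tuple

# A literal as (predicate name, set of entity arguments), e.g. ("Cat", {"felix"})
Literal = Tuple[str, FrozenSet[str]]

def rank(literal: Literal, backbone: List[Literal]) -> int:
    """Count backbone literals that mention at least one entity of `literal`,
    per the review's reading of rank_B(L)."""
    _, entities = literal
    return sum(1 for _, other_entities in backbone if entities & other_entities)

backbone = [
    ("Cat", frozenset({"felix"})),
    ("Animal", frozenset({"felix"})),
    ("Mammal", frozenset({"rex"})),
]
candidate = ("Pet", frozenset({"felix"}))
print(rank(candidate, backbone))  # 2: Cat(felix) and Animal(felix) share the entity felix
```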