ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction   | Count    | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated    | 0 (0%)   | N/A        | N/A            | N/A
Heavily AI-edited     | 1 (25%)  | 6.00       | 3.00           | 1967
Moderately AI-edited  | 1 (25%)  | 6.00       | 3.00           | 4431
Lightly AI-edited     | 1 (25%)  | 8.00       | 4.00           | 1591
Fully human-written   | 1 (25%)  | 6.00       | 3.00           | 2997
Total                 | 4 (100%) | 6.50       | 3.25           | 2746
Individual Reviews

Paper: Training LLMs with LogicReward for Faithful and Rigorous Reasoning
Review 1 (EditLens Prediction: Lightly AI-edited)
Soundness: 3 (good) | Presentation: 3 (good) | Contribution: 4 (excellent) | Rating: 8 (accept, good paper) | Confidence: 4 (confident, but not absolutely certain)

Summary:
The paper proposes LogicReward: a step-level reward for reasoning that averages (i) Premise Validity—a cosine-similarity grounding check against the provided premises—and (ii) Logic Validity—a deterministic check with Isabelle/HOL. The combined LogicScore selects SFT targets and DPO pairs. On eight NLI/logic benchmarks, LogicReward yields notable gains and higher human-audited faithfulness; it also transfers to BBH/GSM8K/CSQA and remains competitive without outcome labels. (A minimal sketch of the grounding check appears after this review.)

Strengths:
* Substantial performance improvements across diverse NLI/logic datasets; the 8B backbones show consistent average gains (+11.0 for Llama3.1-8B; +3.2 for Qwen3-8B).
* Faithfulness improves and is human-verified: on a manual audit of 100 correct cases, the LogicReward-trained model achieves 68% faithful traces vs. 43%/57% for the baselines. I think reasoning faithfulness is critical for transparency and trust in the era of thinking LLMs.

Weaknesses:
* The training data are constructed automatically (LLM rollouts + theorem provers), and the paper does not report end-to-end human auditing of that data. This raises the risk that intermediate steps/premises or injected "soft-unification" assumptions contain errors, though the strong final performance gains from training with the proposed method partially offset this concern.

Questions:
* Your method generates training data with an automated pipeline. Could errors/noise slip in, and where in the pipeline are they most likely (e.g., premise grounding, auto-formalization/parsing, soft unification, prover-guided refinement, fallback scoring)? Did you perform any human auditing, and at what scale?
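[Editor's note] The summary above describes Premise Validity as a cosine-similarity grounding check of a step's cited premises against the originally provided premises. The following is a rough sketch of what such a check could look like; the placeholder `embed` function and the best-match-then-average aggregation are assumptions for illustration, not the paper's implementation:

```python
# Illustrative premise-grounding check via cosine similarity. The `embed`
# function is a random placeholder; a real system would use a sentence
# encoder. The best-match-then-average aggregation is an assumption.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder unit-norm 'embedding' derived from a hash of the text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def premise_validity(step_premises: list[str], given_premises: list[str]) -> float:
    """Score how well the premises cited in a reasoning step are grounded in
    the originally given premises: best cosine match per cited premise, averaged."""
    given_vecs = np.stack([embed(p) for p in given_premises])
    scores = [float((given_vecs @ embed(sp)).max()) for sp in step_premises]
    return float(np.mean(scores)) if scores else 0.0
```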
Review 2 (EditLens Prediction: Heavily AI-edited)
Soundness: 3 (good) | Presentation: 4 (excellent) | Contribution: 3 (good) | Rating: 6 (marginally above the acceptance threshold) | Confidence: 3 (fairly confident)

Summary:
The paper targets a key flaw of outcome-based training, namely that it can yield correct answers with incorrect reasoning, and proposes LogicReward, which enforces step-level logical correctness via a theorem prover, together with Autoformalization with Soft Unification to reduce ambiguity and improve formalization. An 8B model trained on LogicReward data surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning, improves reasoning faithfulness, generalizes to math and commonsense, and provides a reliable reward signal without ground-truth labels. The authors plan to release data and code upon acceptance.

Strengths:
1. The use of symbolic methods for process supervision is well motivated and directly addresses limitations of outcome rewards. By enforcing step-level logical validity, the approach reduces reward hacking and aligns training with faithful reasoning.
2. Incorporating theorem-prover feedback over multiple iterations expands the exploration space and yields a more informative training signal. Iterative prover-guided refinement improves coverage of candidate solutions, lowers variance, and encourages discovery of alternative valid derivations rather than overfitting to a single trace.
3. The manuscript is clearly written, easy to follow, and supported by transparent evidence.

Weaknesses:
1. The paper omits several studies that combine LLMs with provers, for example Quan et al., "Verification and Refinement of Natural Language Explanations through LLM and Symbolic Theorem Proving," EMNLP 2024. The task focus is closely related. Please situate LogicReward within this line of work and add direct comparisons where possible.
2. It is unclear how much the method's gains depend on the chosen prover and logic. If a different proof assistant or formal language is used, does LogicReward retain its advantages and stability? Experiments that swap the prover or vary the logic and encoding would strengthen the generality claim.

Questions:
Please refer to the weaknesses.
Review 3 (EditLens Prediction: Moderately AI-edited)
Soundness: 3 (good) | Presentation: 2 (fair) | Contribution: 2 (fair) | Rating: 6 (marginally above the acceptance threshold) | Confidence: 3 (fairly confident)

Summary:
This paper introduces LogicReward, a novel reward system designed to train LLMs to produce reasoning that is not only correct in its final outcome but also "faithful" and "logically rigorous" at each intermediate step. The authors argue that in existing training methods, rewarding only correct final answers can lead to models arriving at the right solution through "flawed reasoning", and that supervising intermediate steps with probabilistic signals still "lacks guarantees of logical soundness". To address this, the paper proposes LogicReward, which evaluates the logical validity of each reasoning step using a theorem prover. The method translates ambiguous natural-language reasoning steps into the formal symbolic logic required by the prover (specifically, Isabelle). The LogicScore is then computed as follows (a sketch of this aggregation appears after this review):
1. Premise Validity: check whether the reasoning step is "grounded in the given context" by measuring the cosine similarity between the premises used in the step and the original premises provided in the problem.
2. Logic Validity: use the theorem prover to check whether the inference made in the step is logically sound. If the step cannot be parsed (i.e., has invalid syntax), the system falls back to the "average token probability" as the score.
3. Reasoning Validity: the average of the Premise Validity and Logic Validity scores across all steps in the reasoning chain.
4. Outcome Validity: a binary score (1 or 0) indicating whether the final answer matches the ground truth.
5. LogicScore: an equal-weighted average of Reasoning Validity and Outcome Validity.
The resulting LogicReward signal is then used to construct datasets and to train models via SFT and DPO.

Strengths:
1. The primary strength is the integration of a symbolic theorem prover into the reward loop. This moves beyond the limitations of purely outcome-based or probabilistic process-based rewards, providing a more rigorous "formal and symbolic assessment of logical validity".
2. The method directly targets and improves reasoning faithfulness, a critical need for high-stakes domains like medicine and law. The paper's manual evaluation found that the LogicReward-trained model could produce more faithful reasoning processes.
3. The model shows strong generalization to out-of-distribution tasks, including BBH, GSM8K, and CommonsenseQA.
4. LogicReward excels in scenarios without ground-truth labels. By removing the Outcome Validity component, the reward signal relies only on Premise and Logic Validity. In this setting, it still outperformed other reward systems (LLM-as-judge, PRM, confidence), demonstrating its value in real-world situations where labels are scarce.

Weaknesses:
1. The authors explicitly acknowledge that the LogicReward labeling process, which involves running a theorem prover, is "relatively slower than approaches that only compare final outcomes".
2. This speed limitation presents a significant challenge for scalability, particularly for "on-policy reinforcement learning scenarios" (such as PPO or GRPO) that require fast, in-the-loop reward generation. The paper's use of SFT and DPO cleverly sidesteps this, since both are offline methods, but it leaves the system's viability for on-policy RL as an open question.
3. The auto-formalization process is "still imperfect". Even with soft unification and refinement, ambiguities in natural language remain difficult to formalize. The results show that even after refinement, the logic success rate was 50.8%, meaning nearly half of the steps were still not validated as logically sound by the prover.
4. When a reasoning step fails the syntactic check and cannot be evaluated by the theorem prover, the system "falls back to the confidence of the inference $I$, defined as the average token probability of the reasoning step". This pragmatic choice means that for a large portion of reasoning steps (those that fail to parse), LogicReward does not actually provide a symbolic logical guarantee.
5. The method was primarily trained and evaluated on NLI and logical reasoning datasets. While it showed some generalization, the paper notes that "gains on GSM8K are smaller". The authors suggest this is because mathematical reasoning relies "more on domain-specific reasoning" than on the commonsense assumptions that Soft Unification handles well. This suggests the current pipeline may be less effective for highly specialized domains without further adaptation.

Questions:
Refer to the weaknesses.
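[Editor's note] To make the aggregation in steps 1-5 of the summary above concrete, here is a minimal, self-contained Python sketch. It follows one plausible reading of the review's description; the dataclass fields, stub values, and per-step averaging are illustrative assumptions rather than the paper's actual pipeline, and the real system would call an embedding model and Isabelle rather than take precomputed scores.

```python
# Minimal sketch of the LogicScore aggregation described in steps 1-5 above.
# The Step fields stand in for results the real pipeline would obtain from an
# embedding model and from Isabelle; they are illustrative assumptions only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    premise_validity: float       # grounding score in [0, 1] (cosine check)
    formalization: Optional[str]  # Isabelle encoding, or None if unparseable
    prover_verified: bool         # prover verdict when a formalization exists
    avg_token_prob: float         # fallback confidence for unparseable steps

def logic_validity(step: Step) -> float:
    """Deterministic prover check, falling back to the average token
    probability when the step cannot be parsed into valid syntax."""
    if step.formalization is None:
        return step.avg_token_prob
    return 1.0 if step.prover_verified else 0.0

def logic_score(steps: list[Step], answer: str, gold: str) -> float:
    # Reasoning Validity: average of Premise Validity and Logic Validity
    # over all steps (one reading of step 3 in the review's summary).
    reasoning_validity = sum(
        0.5 * (s.premise_validity + logic_validity(s)) for s in steps
    ) / len(steps)
    outcome_validity = 1.0 if answer == gold else 0.0      # step 4: binary check
    return 0.5 * (reasoning_validity + outcome_validity)   # step 5: equal weights

# Example: one prover-verified step, one step falling back to token probability.
steps = [
    Step(premise_validity=0.92, formalization="lemma ...", prover_verified=True, avg_token_prob=0.0),
    Step(premise_validity=0.71, formalization=None, prover_verified=False, avg_token_prob=0.65),
]
print(logic_score(steps, answer="entailment", gold="entailment"))  # ≈ 0.91
```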
Review 4 (EditLens Prediction: Fully human-written)
Soundness: 3 (good) | Presentation: 2 (fair) | Contribution: 3 (good) | Rating: 6 (marginally above the acceptance threshold) | Confidence: 3 (fairly confident)

Summary:
This paper introduces LogicReward, a reward system that enforces logical correctness of the reasoning traces of LLMs with a theorem prover. It also comes with "autoformalization with soft unification", a technique that prompts the LLM to provide the implicit assumptions in its reasoning steps to reduce ambiguity in the statements. The resulting LogicScore is used to train LLMs under SFT and DPO settings. The overall method demonstrates a solid performance increase on a variety of logical reasoning benchmarks, allowing small 8B models to outperform larger reasoning models. Furthermore, LogicReward performs better than several alternative reward systems such as LLM-as-judge.

Strengths:
- This paper tackles a good problem in LLM research, and the contribution is timely and generally principled.
- LogicReward achieves strong empirical performance. I appreciate that the authors evaluate the models/methods on a pretty large set of logical reasoning benchmarks.
- The study includes fair ablation studies and comparisons (e.g., Table 2 and Figure 4), corroborating the advantage of LogicReward.
- The generalizability analysis is interesting and encouraging (Sec. 5.3): LogicReward does not only help with strictly logical reasoning tasks—it may help with other kinds of reasoning too (e.g., grade-school math).

Weaknesses:
- The related work section (Sec. 2) is incomplete. Much prior work has been done on LLM logical reasoning using a range of different techniques, from earlier work like [1] to neurosymbolic approaches like [2, 3]. I think a dedicated paragraph is needed (it could sit between "general reasoning" and "reward models") given the topic of this paper.
- While LogicReward seems a good fit for logical reasoning tasks, many kinds of reasoning are harder to (auto)formalize. Do the authors expect their method to work on harder mathematical problems like AIME or on code generation tasks? It is not necessary to evaluate on other benchmarks here, but some more thorough discussion would be ideal. In other words, what are the soft and hard limitations of this approach?

[1] Kazemi, M., Kim, N., Bhatia, D., Xu, X., & Ramachandran, D. (2022). LAMBADA: Backward chaining for automated reasoning in natural language. arXiv preprint arXiv:2212.13894.
[2] Olausson, T. X., Gu, A., Lipkin, B., Zhang, C. E., Solar-Lezama, A., Tenenbaum, J. B., & Levy, R. (2023). LINC: A neurosymbolic approach for logical reasoning by combining language models with first-order logic provers. arXiv preprint arXiv:2310.15164.
[3] Pan, L., Albalak, A., Wang, X., & Wang, W. Y. (2023). Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning. arXiv preprint arXiv:2305.12295.

Questions:
- Why specifically use the pipeline of SFT followed by DPO? A few more sentences of justification/intuition would be good.
- It would be good to clearly label or state how many test problems are used for each benchmark (w.r.t. Sec. 4.1).
- Minor: on Line 190 there is a typo; it should be "neo-davidsonia[n]".