Neurosymbolic Language Reasoning as Satisfiability Modulo Theory
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces LogiText, a neurosymbolic language which supports partial formalization, and a novel SMT (Satisfiability Modulo Theory) solving framework. This approach aims to bridge the "compositional" and, most notably, the "combinatorial" reasoning gaps that persist in LLMs by coupling an SMT solver's Boolean search with an iterative LLM-driven "generate-validate-refine" loop.
1. This paper proposes a neurosymbolic language, LogiText. LogiText does not require converting the entire document into strict logical formulas; instead, it allows for explicitly annotating only the key logical structures (e.g., Boolean relations) while retaining ambiguous textual clauses as natural language. This design bridges the gap between traditional symbolic solvers (which require fully formalizable domains) and real-world complex documents (which are essentially a mix of text and logic), greatly enhancing the practical value of neurosymbolic methods in domains like legal analysis and content moderation (CMOD).
2. The authors propose a neurosymbolic framework for reasoning with semi-structured language that positions the LLM as an SMT theory solver. Specifically, the SMT symbolic algorithm is responsible for efficient Boolean structure search, and the LLM-driven NLSolver then generates assignments that satisfy logical and semantic constraints by iteratively calling LLM sampling, validation, and refinement operations.
3. Experimental results demonstrate that on the text instance generation (TIG) and text coverage generation (TCG) tasks, the LogiText-based formalization combined with the LLM-driven SMT neurosymbolic solving algorithm significantly outperforms end-to-end LLMs.
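To confirm my reading of the architecture summarized in point 2, the generate-validate loop could be sketched roughly as follows. This is a minimal stdlib-only illustration, not the authors' code: the clause names, the keyword-based `mock_llm_validate`, and the candidate texts are all invented for this sketch.

```python
# Toy sketch (not the authors' implementation) of a generate-validate
# loop: a stand-in "LLM" judges whether a candidate text satisfies each
# natural-language clause, a Boolean formula plays the role of the
# SMT-side structure, and judgments are cached per (text, clause) pair
# to avoid repeated calls.

CLAUSES = ["mentions_refund", "within_30_days"]

def mock_llm_validate(text, clause):
    """Hypothetical keyword stand-in for an LLM clause judgment."""
    keywords = {"mentions_refund": "refund", "within_30_days": "30 days"}
    return keywords[clause] in text.lower()

def nl_solve(formula, candidate_texts):
    """Return the first text whose clause judgments satisfy `formula`;
    in the full framework, a failure here would feed the unsatisfied
    clauses back into an LLM refinement step."""
    cache = {}
    for text in candidate_texts:               # "generate" (LLM sampling)
        judgments = {}
        for clause in CLAUSES:                 # "validate" (LLM as oracle)
            if (text, clause) not in cache:
                cache[(text, clause)] = mock_llm_validate(text, clause)
            judgments[clause] = cache[(text, clause)]
        if formula(judgments):                 # Boolean check (solver side)
            return text, judgments
    return None, None                          # would trigger "refine"

# Goal: a text that mentions refunds but imposes no 30-day limit.
text, model = nl_solve(
    lambda a: a["mentions_refund"] and not a["within_30_days"],
    ["Refunds are only issued within 30 days.",
     "You may request a refund at any time."],
)
```

Even this toy version makes the cost structure visible: each new (text, clause) pair is one "LLM" call, which is why reporting call counts (as requested below) matters.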
1. The framework's reliance on precise clause-level annotation and evaluation is a critical vulnerability. LogiText relies on human experts to manually annotate natural language documents. This not only incurs high labor costs but also limits the method's scalability and application scope. Furthermore, the framework is fragile to "clause-level" errors. It still relies on the precise evaluation of each clause, and a failure in evaluating one clause can cause the entire logical chain to collapse. As results on LegalBench (Figure 8) show, a textual judgment error by the LLM on any single clause can lead to reasoning failure. In contrast, the "holistic reasoning" of end-to-end LLMs sometimes exhibits stronger reasoning capabilities. Especially in real, complex scenarios, the partitioning of clauses and the formalization of their logical relationships remain a significant challenge.
2. In the NLSolver algorithm, although the authors introduce a caching mechanism to reduce the number of LLM calls, the cost of LLM calls in the "generate-validate-refine" framework (an iterative refinement process) is still a non-negligible issue. Admittedly, we believe that proposing a low-cost and efficient solving method remains a challenge in this type of search-based neurosymbolic framework. To better assess the practical viability of this framework, we ask the authors to report the average (and maximum) number of LLM calls required by NLSolver to solve each task in the experimental evaluation.
3. Section 3 is hard to follow, partly because of the complex notation. Please improve the presentation quality of this part.
(Formatting Issue) Line 412 is incomplete.
1. The paper states that the benchmarks consist of 15 tasks with "10+ instances each". This seems like a very small scale for evaluation. Could the authors elaborate on the size of the test sets? How can you be confident in the generalizability of the results?
2. The paper mentions using the set of unsatisfied constraints to guide the refinement (Algorithm 1b, line 24). This is a key mechanism. Could the authors provide a concrete example of this refinement prompt? How is the LLM instructed to 'fix' the previously generated text based on which specific natural language constraints failed?
Fully human-written |
Neurosymbolic Language Reasoning as Satisfiability Modulo Theory
Soundness: 4: excellent
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
The paper describes an approach to the logical annotation of unstructured text documents that allows the definition of logical constraints in natural language in the context of a semi-structured prompt to a hybrid LLM/SMT solver. This allows the document to be translated into an SMT theory, formed by combining LLM valuations for atomic propositions within the constraints with auto-formalization of complex statements in the document, for which a solver then finds a satisfying assignment or determines the theory is unsatisfiable.
An effective and innovative approach to logical reasoning with LLMs that uses an SMT solver as cognitive scaffolding. This is an improvement over previous approaches, which depend on autoformalization into logical statements followed by independent execution of a solver over the generated theory. The deeper integration of the LLM into the core of a hybrid reasoning engine is an important direction to pursue, because it begins to address how best to exploit the complementary strengths of formal reasoners and the approximate retrieval of LLM parametric knowledge for reasoning. The presentation is clear and, for this reviewer, informative and thought-provoking; the descriptions of the formalization for natural language text constraints and of the NLSolver were thorough and easy to follow. The author(s) effectively make the case that using an LLM to ground atomic statements in the context of a solver is a viable, practical approach to reasoning in neurosymbolic systems that addresses some of the shortcomings of current approaches.
The paper extends SMT with a theory for textual constraints, but does not discuss formal soundness or completeness guarantees. What are the conditions under which the NLSolver is guaranteed to find a solution if one exists? The paper would benefit from a more rigorous theoretical treatment of these questions.
The paper doesn't adequately address how it ensures factuality of LLM evaluations when these serve as atomic propositions in the logical theory. What happens when the LLM misclassifies a natural language constraint? The LLMVerify function is treated as an oracle, but in practice, LLMs can be inconsistent or incorrect. The paper would benefit from error analysis on how LLM evaluation errors propagate through the logical reasoning process and what safeguards exist to detect or mitigate such errors.
Have the authors investigated the usability of the Logitext language from a human factors perspective? What is the learning curve for non-experts to write effective annotations? How error-prone is the annotation process? Understanding the practical challenges of getting users to correctly annotate documents is crucial for real-world deployment.
How does the system handle ambiguous natural language that could have multiple valid autoformalizations? Does it maintain multiple hypotheses or commit to a single interpretation? This is particularly important for legal and policy documents where ambiguity may be intentional or unavoidable.
Can the authors provide any formal guarantees about their system? For example, under what conditions is the NLSolver guaranteed to be sound, never returning an incorrect solution? Even partial guarantees would strengthen the theoretical contribution.
Fully AI-generated |
Neurosymbolic Language Reasoning as Satisfiability Modulo Theory
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The authors introduce Logitext, a neurosymbolic language that extends SMT solving with natural language text constraints (NLTCs). They identify "compositional" and "combinatorial" reasoning gaps in LLMs when dealing with certain types of documents. In these cases, Logitext lets you formalize part of the document (the "logic") and then uses an iterative solver, with Z3 for the logical constraints and an NLSolver for the textual ones.
However, this method feels quite contrived, and I don't see any applicability beyond some carefully curated examples that require a lot of manual annotation anyway. The complexity of the Logitext system does not appear proportionate to the demonstrated gains.
1. The paper does address a real problem wrt LLMs struggling with logical consistency in policy documents.
2. The paper is clearly motivated through empirics on reasoning gaps.
3. To my knowledge, this mixture of SMT solvers with NL constraints is novel.
1. The Logitext system seems quite contrived and requires significant manual effort to convert natural-language documents.
2. Manual annotation of the logical structure defeats the purpose of scalable automated reasoning. Therefore real usage of this is highly questionable.
3. The convergence of the NLSolver is not guaranteed and the caching employed seems quite ad-hoc.
4. Some of the proposed baselines appear weak, which raises concerns about the quality of the evaluation.
1. How does the cost of annotation compare to that of simply using better prompting strategies?
2. How sensitive is the performance to annotation quality and completeness?
3. What happens (as is likely) when logical structures in the document are incomplete or vague?
Fully human-written |
Neurosymbolic Language Reasoning as Satisfiability Modulo Theory
Soundness: 2: fair
Presentation: 1: poor
Contribution: 3: good
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes Logitext, a neuro-symbolic language that bridges the gap between natural language and formal language. Logitext is represented as natural language text constraints, which streamline the logical structure of the natural language text. Furthermore, the authors extend an SMT solver and propose an NLSolver that uses an LLM to solve natural language text constraints. The proposed method is evaluated on 15 tasks and yields impressive performance in three types of settings, supporting the effectiveness of the method.
1. The paper introduces a new language called Logitext to fully represent the logical structure behind natural language text. This is a very important problem in the legal, political, and societal domains.
2. The performance is impressive. Logitext performs much better than few-shot prompting and the previous neurosymbolic approach on various benchmarks.
The presentation should be further improved for clarity.
### Major points
- The word "solver" appears everywhere, but it is often unclear which solver is meant. A clearer notation would help: there are many related terms (e.g., solver, logical solver, SMT solver, LLM-based solver, symbolic solver), and for people reading this paper for the first time, it is really hard to tell which "solver" the paper refers to at any given point.
- For Logitext, what if an implicit premise is hidden in the text, so that you can't capture the entire underlying logical structure of the text by simply annotating it?
- The description of the algorithm (Section 3.3) could be further improved. Currently, too many variables are defined only in Figure 4, not in the main paper, which hinders a full understanding of the details.
- For the new benchmark CMOD, what is the source of these moderation policies? I cannot find it anywhere.
- Line 367: 10+ instances per task seems very small. Please elaborate on the dataset size for each task.
- Line 425-426: The authors should elaborate on what this neurosymbolic prompt looks like. If it is something from a previous work, then the authors should cite that work.
### Minor points
- In Figure 2 (b), the authors should mention that the definition of the combinatorial gap is in Appendix A.1; otherwise readers may be confused.
- Which solver did you use in Section 2.2? Did you use NLSolver or an SMT solver?
- In Figure 3, why is there a C8? For defining disruptive behavior and immediate threat, C8 is not even used, and judging from the definition of C6, C8 appears to be equivalent to C6.
- Line 213: if I understand correctly, the first `<var>` and the second `<var>` refer to two different variables. For clarity, they should be indicated as `<var1>` and `<var2>` instead. Also, what do `<var>`, `<clause>`, and `<phrase>` exactly mean? The authors should elaborate on this right after describing the new notation.
- Line 217, authors should cite related papers for `pyz3`.
- Line 253-254: The authors should describe how u_i is related to c_jh and what the p_i's mean here. I could understand from the context, but to formally define Logitext, the authors should consider this. Also, why does this formalization re-occur here? I think it was already briefly described in Section 3.1 (clause naming).
- Line 263: the notation l_jh suddenly appears. What does it mean?
- Line 268: in the previous paragraph, an NLTC was notated as v_jh, but here it is notated as v_i where 1<=i<=n. Could the authors clarify this point?
- Line 412: Text is overlapped by Figure 8.
See Weaknesses.
Fully human-written |