LogicSR: A Unified Benchmark for Logical Discovery from Data
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This work introduces **LogicSR**, a benchmark that evaluates 14 algorithms (including solvers, ML models, and LLMs) on logical symbolic regression, i.e., the task of discovering Boolean expressions from data. The benchmark consists of two real-world datasets (from (i) biology and (ii) circuit design) and a synthetic dataset, spanning a wide range of scales and granularities. The 14 algorithms were evaluated using a diverse set of metrics covering scale, accuracy, robustness to noise, and efficiency. The key findings are:
- traditional logic synthesis methods perform well at small scales but lack generalization
- ML methods generalize better but require higher computational cost
- the LLMs evaluated (GPT-o3 and GPT-4o) show limited capability
S1: The motivation is clearly identified: the gap between symbolic regression benchmarks and logic synthesis benchmarks. The proposed benchmark enables a more comprehensive evaluation of how a given method performs at discovering the underlying logical rules from data.
S2: The evaluated methods span traditional methods, ML methods, and LLMs, which demonstrates the broad compatibility of the proposed benchmark.
S3: The analysis of scale, accuracy, robustness to noise, and efficiency is presented in detail, providing insightful findings about the capability of each selected method.
W1: It is not clear how the noise level in Section 4.1 is defined. Is the noisy data generated synthetically, drawn from real-life examples, or produced as simple mutations of valid data? (One plausible reading is sketched after W4.)
W2: The complexity metric is confusing for ML methods and LLMs. Since they do not natively output a set of logical expressions, how was complexity computed? The paper defines complexity as the number of operators; are the LLMs prompted to output explicit logical expressions? (The sketch after W4 shows the operator count I have in mind.)
W3: There is inconsistency/confusion in the cost measurement. The paper states that for non-ML methods the efficiency metric is total processing time, while for ML methods it is model training time. However, Section 4.4 does not discuss training time for the ML methods, and no cost/efficiency discussion is included for the LLMs: were they fine-tuned, called via API, or evaluated by inference time?
W4: Some writing/presentation issues are listed in the questions below.
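For concreteness, here is a minimal sketch of the readings of W1 and W2 that I have in mind; this is my own assumption for illustration, not the authors' code: noise as independent Bernoulli flips of the output bits, and complexity as an operator count over an explicit expression string.

```python
import random
import re

def flip_output_bits(y, p, seed=0):
    """One plausible noise model (assumed): flip each Boolean output bit
    independently with probability p (Bernoulli label noise)."""
    rng = random.Random(seed)
    return [bit ^ (rng.random() < p) for bit in y]

def count_operators(expr):
    """Complexity as an operator count, assuming the method returns an
    explicit expression string over AND/OR/NOT."""
    return len(re.findall(r"\b(AND|OR|NOT)\b", expr))

print(flip_output_bits([1, 0, 1, 1], p=0.1))  # noisy copy of the labels
print(count_operators("(a AND NOT b) OR c"))  # -> 3
```

If the paper means something different (e.g., noise injected into the inputs, or complexity measured on the learned model rather than an extracted expression), this should be stated explicitly.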
Q1: Lines 078-085 seem to contain repetitive content and are poorly written.
Q2: Line 164: is $M_j$ a fixed value? How is this value chosen?
Q3: Why are the input and output sizes always equal for the synthetic data (Table 1)? Is there any justification for that choice? Could asymmetric input/output sizes significantly impact performance?
Q4: In Figure 5, what happens to GPT-4o and GPT-o3?
Q5: Was any consideration given to including SAT/SMT-based learning methods? They seem related, as their task is also to learn a set of underlying Boolean expressions from data, often in CNF/DNF form, e.g., [1].
[1] Wang, Po-Wei, et al. "SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver." International Conference on Machine Learning. PMLR, 2019.
Fully human-written
LogicSR: A Unified Benchmark for Logical Discovery from Data
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper introduces LogicSR, a unified benchmark for discovering logical expressions from data. The benchmark combines (i) curated real-world problems from digital circuit design and biological Boolean networks (BBM), and (ii) a two-stage synthetic data generator that first builds small logic networks and then merges them into larger ones. The authors evaluate 14 algorithms spanning traditional logic synthesis, symbolic regression, neural models, and LLMs, using metrics for in-/out-of-distribution accuracy, expression complexity, and efficiency. Key findings are that classical logic synthesis excels at small scales but struggles to generalize or scale, while neuro-symbolic and SR methods generalize better but incur higher computational cost.
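For readers unfamiliar with such generators, the following is a minimal sketch of the two-stage idea as I understand it; the function names, gate representation, and merging rule are my own assumptions for illustration, not the authors' implementation.

```python
import random

OPS = ["AND", "OR", "NOT"]

def small_network(inputs, n_gates, prefix, rng):
    """Stage 1: a small random logic network, kept as a list of gates
    (output_name, op, operand_names) over the given input names."""
    gates, pool = [], list(inputs)
    for i in range(n_gates):
        op = rng.choice(OPS)
        args = rng.sample(pool, 1 if op == "NOT" else 2)
        out = f"{prefix}{i}"
        gates.append((out, op, args))
        pool.append(out)  # later gates may reuse earlier outputs
    return gates

def merge(net_a, net_b, n_bridge, rng):
    """Stage 2: merge two small networks into a larger one by adding
    'bridge' gates that combine outputs drawn from both networks."""
    merged = net_a + net_b
    outs_a = [g[0] for g in net_a]
    outs_b = [g[0] for g in net_b]
    for i in range(n_bridge):
        args = [rng.choice(outs_a), rng.choice(outs_b)]
        merged.append((f"m{i}", rng.choice(["AND", "OR"]), args))
    return merged

rng = random.Random(0)
big = merge(small_network(["x0", "x1", "x2"], 4, "a", rng),
            small_network(["x3", "x4"], 3, "b", rng),
            n_bridge=2, rng=rng)
```

In such a setup, the merged gate list would be evaluated on sampled inputs to produce the benchmark's input/output tables; the paper's actual generator presumably adds further controls (e.g., over depth and operator mix) not modeled here.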
* The benchmark spans two practical domains (digital circuits from IWLS and biological Boolean networks from BBM) plus one synthetic domain, yielding a broad and well‑structured testbed across scales.
* The study evaluates methods comprehensively from rule‑based logic synthesis through symbolic regression and neural networks to state‑of‑the‑art LLMs, offering a rare cross‑paradigm view.
* While the paper motivates the task by its importance to EDA and biology, it remains unclear to non-specialists what level of performance on LogicSR would signify practical readiness in those domains. Mapping sample-wise/bit-wise accuracy to domain-level utility would help readers judge how "promising for practice" current AI methods really are (see the sketch after the question below).
* How do the IWLS (EDA) and BBM (biology) tasks compare to real-world difficulty, and what do your scores imply about AI's practical readiness?
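To make the last two points concrete, sample-wise and bit-wise accuracy can diverge sharply; a toy illustration under my own formulation (the paper's exact definitions may differ):

```python
def bitwise_accuracy(pred, true):
    """Fraction of individual output bits predicted correctly."""
    flat = [(p == t) for P, T in zip(pred, true) for p, t in zip(P, T)]
    return sum(flat) / len(flat)

def samplewise_accuracy(pred, true):
    """Fraction of samples whose entire output vector is correct."""
    return sum(P == T for P, T in zip(pred, true)) / len(true)

pred = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 1]]
true = [[1, 0, 1], [0, 1, 1], [1, 1, 1], [0, 1, 0]]
print(bitwise_accuracy(pred, true))     # 0.75 -- looks strong
print(samplewise_accuracy(pred, true))  # 0.25 -- only 1 of 4 exactly right
```

Stating which notion, and at what level, corresponds to a usable circuit block or a credible Boolean-network model would make the headline numbers interpretable.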
Heavily AI-edited
LogicSR: A Unified Benchmark for Logical Discovery from Data
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper introduces LogicSR, a unified benchmark for discovering logical expressions from data. It bridges the gap between symbolic regression, which targets continuous functions, and logic synthesis, which assumes complete, noiseless specifications. LogicSR combines real-world datasets from digital circuits and biological networks with a scalable synthetic generator that creates diverse Boolean formulas under noise and incompleteness. The authors evaluate 14 methods across four paradigms (logic synthesis, symbolic regression, neural, and LLM-based). Results show that traditional logic synthesis excels on small tasks but fails to generalize, symbolic and neural methods handle larger scales at higher cost, and current LLMs struggle with complex logical reasoning. LogicSR thus establishes a rigorous foundation for cross-domain evaluation in logical discovery.
- Logical discovery is becoming crucial for interpretable and neuro-symbolic AI, yet lacks a unified benchmark. LogicSR clearly fills this gap, making the paper's contribution broadly valuable.
- The inclusion of both real-world and synthetic datasets, with controlled levels of noise and data incompleteness, enables evaluation under realistic conditions, which was missing in prior symbolic regression or logic synthesis work.
- The study connects traditionally isolated communities (EDA, symbolic regression, neuro-symbolic AI, LLMs). The quantitative analysis across scales, operators, and noise levels gives an excellent overview of capability boundaries.
- The paper details generation algorithms, metrics, and reproducibility measures (open-source plan, parameter documentation), making the benchmark credible and replicable.
- The observation that LLMs (even GPT-o3) fail on complex Boolean reasoning is both surprising and informative for the broader research community.
- Although logical discovery is interesting in general, the paper could be strengthened by demonstrating concrete downstream use cases, such as how the discovered logical expressions could enhance interpretability in scientific modeling, improve circuit design efficiency, or support symbolic reasoning in neuro-symbolic systems. Without such examples, the broader practical impact remains abstract, and elaborated discussions are encouraged.
- The benchmark currently excludes XOR, NAND, implication, and higher-arity operators, which limits the expressiveness and generalization analysis (a worked example appears after the questions below).
- The LLMs are only tested with single-prompt fitting tasks. Multi-step prompting or program-of-thought reasoning (which could improve symbolic regression) is not considered, leaving the evaluation somewhat incomplete.
- The two-stage synthetic data generator is central to the paper, but the influence of its parameters (priority decay, layer weighting, merging strategy) is not systematically analyzed.
- While some limitations can be inferred, they are not explicitly stated. An elaborated discussion on limitations or failure cases of the proposed approach would strengthen the paper.
- How would LogicSR extend to non-Boolean (e.g., multi-valued or fuzzy logic) functions?
- How would LogicSR scale to more complex reasoning tasks with implications?
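To make the operator-coverage concern concrete (assuming the retained operator set is {AND, OR, NOT}, which is my inference rather than something stated above): the excluded operators are still expressible, but only at inflated operator counts, e.g.

$$a \oplus b = (a \land \lnot b) \lor (\lnot a \land b), \qquad a \rightarrow b = \lnot a \lor b, \qquad \mathrm{NAND}(a, b) = \lnot (a \land b),$$

so their exclusion both skews the operator-count complexity metric and narrows what generalization across operators can test.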
Fully AI-generated
LogicSR: A Unified Benchmark for Logical Discovery from Data
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Learning logical expressions from data is a critical task for interpretable AI and scientific discovery. However, existing research still lacks sufficiently comprehensive benchmarks for evaluating logical expression learning. In addition, current benchmarks are primarily designed under idealized conditions and do not account for noisy or incomplete data. In this work, the authors propose an automated method for generating benchmarks for logical expression learning across different scales and noise levels. They further evaluate 14 algorithms to assess their capabilities in inductively learning logical expressions.
The paper addresses a distinctive and important challenge in the machine learning and rule learning research communities. The proposed benchmarks enable rigorous evaluation of inductive reasoning capabilities.
The paper proposes a novel set of benchmarks. However, the organization of the manuscript makes it difficult to locate and assess the key information. For example, the complexity characteristics of the generated benchmarks are not clearly presented in the main content. In addition, more insightful comparisons with other similar benchmarks are missing. Please refer to the detailed questions below.
1. How is the benchmark complexity systematically determined? What indicators are used to characterize and quantify the complexity? Moreover, is there any analysis of the network structure beyond simply reporting the number of input and output nodes?
2. In addition to accuracy, are there other metrics considered for evaluation, such as average recall or precision? If not, please provide justification.
3. How is noise introduced into the data, and what is the formal or mathematical definition of noise used in this work? (One candidate formalization is sketched after these questions.)
4. Are there any comparisons with other existing benchmarks in terms of scalability and complexity? If not, please elaborate on this aspect.
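Regarding question 3: one candidate formalization the authors could adopt, or explicitly contrast against, is independent Bernoulli label noise on the output bits (this is my assumption for illustration, not a definition taken from the paper):

$$\tilde{y}_{i,j} = y_{i,j} \oplus \epsilon_{i,j}, \qquad \epsilon_{i,j} \sim \mathrm{Bernoulli}(p),$$

where $p$ is the reported noise level. A clear statement of whether noise affects inputs, outputs, or both, and whether flips are independent, would resolve this question.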
Moderately AI-edited |