ICLR 2026 - Reviews

Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 2 (40%) | 5.00 | 3.00 | 4088 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 1 (20%) | 6.00 | 5.00 | 2779 |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 2 (40%) | 2.00 | 3.50 | 3565 |
| Total | 5 (100%) | 4.00 | 3.60 | 3617 |
Individual Reviews
Title: IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper introduces IMProofBench, a private benchmark consisting of 39 peer-reviewed problems developed by expert mathematicians, designed to assess LLMs on research-level mathematical proof writing rather than just final-answer prediction. Each problem requires a detailed proof and is paired with subproblems that have final answers. The authors evaluate LLMs on IMProofBench in an agentic environment, giving models access to Python, SageMath, and web tools for reasoning and computation. Human mathematicians grade the main proof tasks, while subproblems are auto-scored. Experiments show that GPT-5 achieves the best overall performance, solving 22% of full proofs, while GROK-4 leads in final-answer accuracy. The analysis reveals that LLMs are prone to reasoning errors, ranging from simple logical mistakes to deep misconceptions, and that they frequently hallucinate existing results to obtain a flawed answer.

Strengths:
1. The scope of IMProofBench is novel, aiming to measure LLMs' capability in research-level proof generation rather than just problem-solving accuracy.
2. The problems, peer-reviewed by mathematicians, would be a valuable resource for the LLM research community.
3. The evaluation is done in an agentic setting rather than pure LLM reasoning, reflecting a more realistic research workflow.
4. The authors provide valuable insights into current frontier LLMs on research-level proof generation, revealing that LLMs are prone to reasoning errors, ranging from simple logical mistakes to deep misconceptions, and that they frequently hallucinate existing results to obtain a flawed answer.

Weaknesses:
1. The scale of IMProofBench is extremely small, containing only 39 problems.
2. The reliance on experts for proof evaluation limits its scalability and usefulness for the research community in model evaluation and development.
3. While the authors evaluate frontier LLMs in an agentic setting, it would be nice to compare their performance in a single-model reasoning mode without tools, to ablate the effect of tool use in the agentic setup.
4. Each problem in IMProofBench is paired with subproblems whose final answers can be auto-graded. What is the correlation between subproblem correctness and the correctness of the entire proof? Could this serve as a proxy for overall proof correctness in the absence of human expert graders?

Questions:
1. While the authors mention community outreach to recruit professional mathematicians in the paper, what are their qualifications?
2. Is there a plan to host a leaderboard for IMProofBench? If so, do you need professional mathematicians to grade every submission?

EditLens Prediction: Fully human-written

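The proxy question raised in Weakness 4 above is straightforward to check once per-problem grades exist. Below is a minimal sketch of such an analysis, assuming hypothetical records with an auto-graded subquestion accuracy (`sub_accuracy`) and the human 0-3 proof grade (`proof_score`); the field names and numbers are illustrative and not taken from the paper.

```python
# Hypothetical sketch: correlate auto-graded subquestion accuracy with the
# human 0-3 proof grade to test whether subquestions can proxy proof quality.
# All records below are invented for illustration.
from scipy.stats import spearmanr, pearsonr

records = [  # one record per (problem, model) attempt
    {"problem_id": "p01", "sub_accuracy": 1.0, "proof_score": 3},
    {"problem_id": "p02", "sub_accuracy": 0.5, "proof_score": 1},
    {"problem_id": "p03", "sub_accuracy": 0.0, "proof_score": 0},
    {"problem_id": "p04", "sub_accuracy": 1.0, "proof_score": 2},
    {"problem_id": "p05", "sub_accuracy": 0.5, "proof_score": 2},
    {"problem_id": "p06", "sub_accuracy": 0.0, "proof_score": 1},
]

sub = [r["sub_accuracy"] for r in records]
proof = [r["proof_score"] for r in records]

rho, rho_p = spearmanr(sub, proof)  # rank correlation, suited to the ordinal 0-3 grade
r, r_p = pearsonr(sub, proof)       # linear correlation, comparable to the reported 0.45

print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3f})")
print(f"Pearson  r   = {r:.2f} (p = {r_p:.3f})")
```

Since the 0-3 progress grade is ordinal, a rank correlation is arguably the more natural summary statistic for judging whether subquestion accuracy is a usable proxy.
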
Title: IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation

Soundness: 3: good
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
The paper introduces IMProofBench, a benchmark for evaluating LLMs on research-level mathematical proof generation rather than only final answers. The current release contains 39 peer-reviewed problems contributed and reviewed by professional mathematicians, each paired with auto-gradable follow-up subquestions. Models are evaluated in an agentic framework with access to tools (Python, Bash/GAP/Maxima, SageMath, and web search) and generous token budgets; main proofs are graded by the problem author on a 0–3 scale, while follow-ups are auto-graded. On this benchmark, GPT-5 achieves complete, fully correct proofs on 22% of tasks; Grok-4 attains the best final-answer accuracy (52%). The paper also offers qualitative and quantitative analyses showing, for example, that final-answer evaluation is informative but insufficient and that LLMs still struggle with open problems.

Strengths:
- **Clear problem framing & high-fidelity evaluation setup:** Directly targets proof writing at the research level, filling a documented gap left by final-answer benchmarks and high-school/undergrad datasets. The agentic environment provides real tools (SageMath, Python, web search), a submit tool to separate reasoning from final answers, and substantial token budgets appropriate for deep math. This successfully simulates a real mathematician's working scenario.
- **Quality control & community process:** Problems are authored and peer-reviewed by domain experts; the site/workflow formalizes submission, review, and grading instructions.
- **Forward-looking & contamination-aware:** The authors are considering dynamic retirement of problems to maintain fair evaluations. Human mathematicians in the loop could make the LLM evaluation promising and truly tailored to the needs of human users.

Weaknesses:
- **Tool-access parity and fairness:** Different models use different web-search providers (internal vs. external) and may leverage shell utilities to fetch PDFs, raising parity questions about retrieval power across vendors.
- **Potential grader bias & reliability issues:** Main proofs are graded by the problem's author; without inter-rater studies, there is a risk of variability or unconscious bias.

Questions:
- **Contamination controls:** For the dynamic problem management system, how will you detect that a problem has become easier due to new publications, and how often will items be rotated or retired?
- **Open-problem protocol:** Since no open problems were solved here, will future releases track partial advances on open questions differently (e.g., author judgment of novelty), and how will you guard against over-claiming?
- **Scaling human effort:** As the benchmark grows, how will you mitigate grader load while preserving rigor (e.g., rubric refinements, rubric-guided LLM pre-screening, partial formalization)?

EditLens Prediction: Moderately AI-edited

Title: IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper presents **IMProofBench**, a private, evolving benchmark aimed at **research-level, proof-centric** evaluation of LLMs in an agentic environment with tools (Python, SageMath, Bash, web search) and human expert grading on a 0–3 rubric (peer-reviewed problem pipeline; automated grading for follow-up subquestions). Key results: among 39 problems (with 79 subquestions), GPT-5 achieves 22% complete proofs, while GROK-4 leads on final-answer subquestions with 52% perfect accuracy (Figs. 5 & 4; §§3–4; App. E; pp. 4–9, 21–23). Overall, the work targets an under-served evaluation axis, proof quality rather than answer-only scoring (motivation and setup; §1–§3; Fig. 3), but it currently has limited scale, private data, and heterogeneous tool/search access that may confound cross-model comparisons (State & limits; §§3.3–3.4; App. A/E; pp. 5–9, 14–24).

Strengths:
1. **Proof-centric, research-level focus**
   - Articulates a gap in answer-only benchmarks and centers proof writing with human expert grading (Intro; §1; §3.3), clarifying capabilities beyond final answers (novel evaluation dimension).
   - Problems span pure/applied topics; 7 are flagged "open," signaling frontier intent (State; §3.4; p. 6) and useful for probing limits.
   - A worked sample (stable graphs) illustrates the target rigor and expected solutions (App. C; pp. 19–20), aiding clarity and reproducibility expectations.
2. **Documented pipeline and blind expert grading**
   - Two-stage review with admin + domain expert; acceptance only after "no further comments" (Fig. 2; §3.2; pp. 4–5) → quality control.
3. **Balanced quantitative and qualitative evaluation**
   - Reports proof vs. final-answer outcomes and a 0.45 correlation between author-weighted subquestion score and 0–3 progress (§4.1; Figs. 4–5; p. 7) → answer accuracy ≠ proof quality.

Weaknesses:
1. **Private dataset limits replication and audit**
   - The dataset is private; only code + sample problems are planned for release (Reproducibility; p. 10) → third-party re-grading is constrained.
   - No public leaderboard or downloadable test set is provided (no direct evidence found in the manuscript) → weakens community tracking.
   - Human-graded proofs cannot be independently re-assessed at scale without access to items/outputs (§3.3; pp. 5–6; Reproducibility; p. 10).
2. **Small scale and topical skew**
   - 39 problems from 23 contributors is modest for statistical comparisons (§3.4; p. 6) → low power.
   - The tag distribution is dominated by algebraic geometry (App. A, Fig. 12; pp. 14–15) → representativeness concerns.
   - Only 7 "open" problems (§3.4; p. 6), so "research-level" spans a broad difficulty band → interpretability issues.
3. **Relation to similar benchmarks is not fully disentangled**
   - FrontierMath also targets advanced/research-level problems but emphasizes final-answer evaluation; fairness concerns around access are documented (Related Work §2; arXiv; EpochAI clarification). The paper cites FrontierMath (§2) but lacks side-by-side comparisons or a controlled discussion beyond the final-answer-vs-proof framing (no direct evidence of a cross-evaluation in the manuscript).

Questions:
- Can you situate IMProofBench against FrontierMath/RealMath/HLE/UQ/USAMO/PutnamBench/MiniF2F with a small cross-evaluation or a calibrated narrative to clarify novelty and complementarity? (Related Work §2)

EditLens Prediction: Fully AI-generated

Title: IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation

Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 0
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper introduces a new benchmark that aims to evaluate LLMs performing research-level mathematics, with access to tools. The authors present a platform that they used to collect questions, as well as a question-collection protocol. The benchmark currently contains 39 questions, and the authors state that more questions will be added. Several state-of-the-art LLMs are evaluated, and the authors describe various particularities of how the models reason, as well as promising results w.r.t. solving rates (22% in the case of GPT-5).

Strengths:
The question-creation pipeline is the strongest part of the paper. It is significantly more rigorous than what typical papers in the math benchmarking space achieve. The associated web codebase seems to be a genuine improvement that makes problem collection easier.

Weaknesses:
- **tiny benchmark size**: My main issue that prevents me from recommending this paper for acceptance is the size of the dataset: 39 questions, even if very carefully authored and manually reviewed (and accounting for 79 follow-up questions), fall significantly short of the number of questions needed to rigorously establish a benchmark that "evaluates their [LLMs'] performance on research-level tasks at the frontier of mathematical knowledge". To achieve this goal, many more data points need to be sampled, as each mathematical subdomain comes with its own issues (algebraic geometry exhibits different typical ways of reasoning than differential topology), which probably require a few dozen data points each to rigorously test.
- **the growth targets will not solve the problems**: For comparison, papers typically used thousands of data points when autograding is possible (e.g., https://arxiv.org/html/2501.13766v1), and many hundreds when solely manual evaluation is employed (e.g., https://arxiv.org/abs/2301.13867). Growing the dataset by the numbers the authors mention will not bring them to this level.
- **bias**: The word cloud (Fig. 12) shows a clear bias towards certain areas of mathematics, which is not surprising given the small size of the benchmark.
- **open questions are not a plus**: "Of the 39 benchmark problems, authors characterize 7 as open research questions." I am unconvinced of the value of adding open research questions to a benchmark. An open research question is only worthwhile for a benchmark if an LLM solves it and thus highlights an awesome display of reasoning. Otherwise, the output of the LLM is hard to quantify (e.g., "meaningful but ultimately insufficient progress towards a solution").
- "Currently, IMProofBench consists of 39 problems developed in collaboration with over 23 mathematicians, with 30 more questions in the latest stages of the problem creation pipeline." I am not sure that the fact that "more is coming" (30 questions) is a good argument, as it is unclear when those questions will be devised.
- **benchmark is not repeatable:** The fact that the authors evaluated commercial LLMs makes the subsequent use of these questions unsuitable for those LLMs, as the questions have likely been ingested into the training data. Further, a lot of space is used for a verbal assessment of how the models perform. While this provides a snapshot, this information is already folklore, given the many posts on X/Twitter; at the same time, it will soon be outdated with the advancement of LLMs. The value of a benchmark lies mostly in being able to rank models across time (e.g., the success of ImageNet), and this benchmark fails at that task. The authors state that their benchmark will grow over time, but simply scaling it will not resolve these problems.
- **benchmark not public**: In line with current trends, the authors do not make their benchmark public. This is claimed to preserve the "intellectual property" of the creators (line 508). I am doubtful of this claim, as one could waive IP claims. This reduces reproducibility. I recommend the authors read up on the FrontierMath debacle (https://www.reddit.com/r/math/comments/1iadcqw/the_frontiermath_scandal), which highlights the pitfalls of private benchmarks.

Questions:
- Were some existing benchmarks, such as https://arxiv.org/html/2501.13766v1, deliberately omitted from the Related Work section?
- "high-school or undergraduate-level mathematics (Balunovic et al., 2025; Frieder et al., 2023)": Some problems from the latter seem to be rather at the graduate level (at least in some universities' master curricula), as graduate-level textbooks feature in those benchmarks?

EditLens Prediction: Fully human-written

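To make the low-power concern above concrete, one can compute how wide a confidence interval a 39-problem benchmark yields for a headline solve rate. The quick calculation below is illustrative only: it plugs in the 22% figure quoted in the reviews and uses a standard Wilson score interval; it is not an analysis from the paper.

```python
# Illustrative only: how wide is the 95% CI on a ~22% solve rate over 39 problems?
# Standard Wilson score interval, computed by hand so no extra dependencies are needed.
from math import sqrt

n = 39               # benchmark size
k = round(0.22 * n)  # ~9 problems with a fully correct proof
p_hat = k / n
z = 1.96             # 95% normal quantile

center = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)

print(f"{k}/{n} solved -> 95% Wilson CI: [{center - half_width:.2f}, {center + half_width:.2f}]")
# Prints roughly [0.13, 0.38]: with only 39 items, models whose solve rates differ
# by 10-15 percentage points may not be statistically distinguishable.
```
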
Title: IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
IMProofBench contains 39 problems authored and reviewed by expert mathematicians; each main question asks for a full proof and comes with final-answer (FA) subquestions for scalable scoring. Evaluation occurs in an agentic framework with Python, Bash, SageMath, and web search; models submit a final answer distinct from intermediate reasoning; token budgets are 300k for the main question and 100k per follow-up. Reported headline numbers: GPT-5 achieves 22% complete solutions on proof tasks; GROK-4 achieves 52% FA accuracy on the subset with follow-ups; the correlation between human proof scores and FA scores is 0.45. The paper positions itself against prior math benchmarks focused on high-school/undergrad or FA-only tasks, and argues for proof-based, research-level evaluation with human graders. The core contributions are explicitly listed as the benchmark itself, a cross-model proof study, and a qualitative analysis of errors and strengths.

Strengths:
- Clear gap and strong motivation. The need to evaluate research-level proof writing rather than only FA questions is well motivated in the abstract and introduction. The benchmark directly targets that gap and brings expert graders into the loop.
- Realistic evaluation environment. The agentic setup, with Python, SageMath, Bash, and web search, matches how a human would work. The required "submit" step separates scratch work from the final proof, which helps graders. The token budgets are large enough to explore tools without being trivially restrictive.
- Thoughtful human grading design. Authors grade main proofs on a 0–3 progress scale and also tag error types (logic, concept, hallucination, calculation) and achievement indicators (understanding, insight, usefulness). This captures more signal than a single pass/fail.
- Quantitative and qualitative results. The paper reports both proof-score distributions and FA accuracy, plus the 0.45 correlation between FA and human proof scores, making a concrete case that FA alone is an imperfect proxy. The qualitative section usefully documents typical failure modes (confident but incorrect claims, "well-known result" shortcuts) and behavior (rare abstention).
- Problem sourcing and review. Problems go through an authoring pipeline with reviewer feedback and acceptance only after concerns are resolved; guidelines stress PhD-level insight and proof-centric tasks, with FA subquestions for automated scoring.

Weaknesses:
- Private and evolving dataset. The benchmark is private, with plans to open-source the platform code and release open sample problems; for ICLR, private data can limit reproducibility and independent verification. Because the set is evolving, comparability across time may drift unless a frozen v1.0 test split is committed.
- Scale and composition. N=39 is small for a general benchmark, and the paper itself highlights this. The topic distribution currently leans toward areas favored by the organizers. Both factors raise questions about coverage and selection bias relative to "research-level mathematics" as a whole.
- Human-in-the-loop reliability. The protocol records rich grader signals, but the paper does not report inter-rater reliability or consistency checks across graders for the same solution. Without IRR, it is difficult to quantify grading variance and potential systematic bias.
- Agent and tool confounds. Allowing web search and shell access reflects reality, but it also introduces test-time retrieval confounds (e.g., locating key lemmas on the web). The appendix even notes models downloading papers via wget/curl. This is realistic but complicates claims about reasoning versus retrieval. A controlled ablation (no web; no Bash; SageMath only; different token budgets) would clarify how much performance comes from tool-mediated lookup versus algebraic reasoning.
- Evaluation fairness across model APIs. The paper mentions that some models (e.g., GROK-4) give short final answers and have hidden reasoning tokens, complicating assessment. Tiering and API differences can affect both behavior and resource use. More detail on API normalization and prompting parity would strengthen the fairness claims.
- Statistical characterization. The headline metric "% complete solutions" on 39 items is informative but noisy. Confidence intervals, per-area breakdowns, and problem-level difficulty calibration would help. The 0.45 correlation between FA and proof scores is useful; expanding this analysis (e.g., partial correlations by topic, difficulty, or tool usage) would improve interpretability.
- Reproducibility timing. The paper commits to releasing the platform code and open sample problems before Nov 30, 2025, which would ease review concerns if delivered, but program committees must judge the submission as it stands. A clear plan for a frozen public subset and exact grader guidelines would materially improve reproducibility.

Questions:
See Weaknesses.

EditLens Prediction: Fully AI-generated

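One concrete way to address the missing inter-rater reliability analysis would be a double-grading pass on a sample of solutions, summarized by a chance-corrected agreement statistic. The sketch below is hypothetical: both grade vectors are invented, and a quadratically weighted Cohen's kappa is chosen because the 0–3 progress scale is ordinal.

```python
# Hypothetical IRR sketch: two graders independently score the same solutions
# on the 0-3 progress scale; report a quadratically weighted Cohen's kappa.
# All scores below are invented for illustration.
from sklearn.metrics import cohen_kappa_score

grader_a = [3, 2, 0, 1, 3, 0, 2, 1]  # e.g., the problem author's grades
grader_b = [3, 1, 0, 2, 3, 0, 2, 0]  # e.g., an independent second grader

kappa = cohen_kappa_score(grader_a, grader_b, weights="quadratic")
print(f"Quadratic-weighted kappa = {kappa:.2f}")
# Values near 1 indicate strong agreement; the quadratic weighting penalizes
# large disagreements on the ordinal scale more heavily than off-by-one ones.
```
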