|
SciPro Arena: a Case Study of AI Agent Capabilities in Scientific Analysis Tasks |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a physics-based benchmark for scientific data analysis. The authors focus on Angle-Resolved Photoemission Spectroscopy (ARPES) data, which stands in for many condensed matter experiments and serves as a realistic testbed for finding patterns in noisy, multidimensional datasets.
- I appreciate the authors' meticulous presentation of a domain that's complex to understand.
- Furthermore, it reaffirms that LLMs still struggle to perform scientific data analysis.
- The experiments are thorough and thoughtfully executed.
- I recommend that the authors consider a physical-sciences-and-AI workshop for this work. It is a great paper but, unfortunately, too narrow in scope for ICLR.
While I appreciate the authors' effort to carefully construct the benchmark, it still suffers from some issues:
- I like the focus on scientific data analysis, but this focus has been explored quite a bit over the last 1-2 years. For example, ScienceAgentBench, DiscoveryBench, and AutoDST are prominent examples that focus on real scientific data analysis and, in fact, span multiple domains. All of them reach similar findings: LLMs struggle to perform scientific data analysis that requires long-tail methods. The paper also misses these very relevant and important citations while claiming that "SciPro Arena fills a gap that has not been addressed before — analysis of real scientific data."
- Related to my earlier point, it remains unclear what additional insights this paper brings. In other words, it is unclear why AI model builders would test their models on this benchmark rather than the earlier, more comprehensive ones, or why solving this benchmark reflects fundamental capabilities of an LLM.
- "datasets within a question are contained in a single text string" -- does this mean that all you need to solve this benchmark is language-based reasoning? What if the data were presented in a tabular file and the system could interact with the file using code (e.g., Python)? Why is the proposed setup more important than the latter? A minimal sketch of this alternative is included after this list.
Please see the questions mentioned in the weakness. |
Fully human-written |
|
SciPro Arena: a Case Study of AI Agent Capabilities in Scientific Analysis Tasks |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors propose a new benchmark, SciPro Arena, to test how well AI systems can analyze scientific data - specifically angle-resolved photoemission spectroscopy (ARPES) data. They test several frontier models, and find that in general models perform very poorly on the dataset, highlighting the continuing challenges of using AI for scientific discovery.
- Evaluation on a real scientific task in condensed matter physics
- Reasonably rigorous scoring methodology
- While the abstract sets out to test "analysis of scientific data", the actual grounding of this in ARPES data seems very specific and idiosyncratic. ARPES has numerous complexities and specialties, making it particularly difficult. If the authors' goal is to evaluate analysis of scientific data more generally, they would perhaps do better to first characterize different types/dimensions of scientific data, collect samples of each, and perform a more systematic evaluation. In other words, the generalization from ARPES to all scientific data seems somewhat of a leap here.
- Related to this, I'd really like to learn not just how the models perform overall on this benchmark, but how they perform on different aspects of scientific data analysis. Is it possible to identify the different types of scientific reasoning required for this task? In other words, rather than (or as well as) having 27 domain-specific question types, identify N data-analysis types (e.g., interpolation, prediction, pattern recognition, noise tolerance, data cleaning, visual interpretation, etc.). I'd love to see the paper's framing and conclusions mapped from the physics domain to the AI research domain more.
- The frontier models appear to be applied in a vanilla/naive way, i.e., a single direct query to the model. However, there are numerous "AI Scientist" systems that do coding, iterative reasoning, and reflection loops (e.g., AIScientist, CodeScientist, ReAct, Reflexion, CodeAct) and might do better at this task. This should be clarified; in particular, the conclusions may only apply to "direct query" uses of frontier models.
- The results seem somewhat dependent on the prompting strategy used, e.g., the choice of few-shot prompting.
- For several of the plots, I'm not sure what to take away from them. Highlighting the takeaway in the caption, rather than just summarizing the visual data (e.g., "accuracy decreases with noise"), would be very helpful.
See weaknesses |
Fully human-written |
|
SciPro Arena: a Case Study of AI Agent Capabilities in Scientific Analysis Tasks |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a new benchmark, called SciPro Arena, to evaluate language models on real-world scientific analysis tasks. Given a stream of numerical data (e.g., a 2D intensity map with real energy and momentum axes) and a question, models must extract patterns from the examples and provide numerical answers, which are scored by their deviation from ground truth obtained from a realistic simulator. The authors evaluate 14 recent open- and closed-source reasoning models on 135 questions (27 question types, each tested at 5 noise levels). More recent models tend to perform better (e.g., o3 and Gemini 2.5 Pro), but they reach only <15% on average, while a human baseline program (~400 lines of Python) scores 37%. Even with the noise removed, SOTA models score 20% while the human program gets 55%. Since the questions span different difficulty tiers, the authors can highlight that "[SOTA] models can extract simple features but fail at tracing continuous patterns or computing derived quantities, the latter constituting core reasoning skills needed for real scientific analysis".
- SciPro Arena looks very challenging for current language models and, according to the leaderboard, it is far from saturation. Improving on this benchmark (especially dealing with noise) will require breakthroughs in reasoning/test-time compute.
- The benchmark offers different difficulty tiers, which are great for tracking progress. The questions are grouped under 5 domains, and the benchmark seems to be easily extendable.
- To me it is unclear whether directly evaluating models on numerical tasks entirely makes sense. Specifically, what is the impact of the tokenization process on numerical values provided in the context? There has been some work in the literature shedding light on typical error patterns of models on tasks involving numerical reasoning (e.g., https://arxiv.org/abs/2410.11781).
- For each question, three in-context examples are included as part of the prompt. According to the paragraph "Form of questions" in Section 3.2, each example contains a matrix of numbers with whitespace delimiters. While this provides some structure to the prompt, it is unclear to me what the impact of presenting the information in that way is. In a realistic scenario, the data would be ingested as a CSV file or some other structured format instead of a free-form text string (see the sketch after this list).
- The title of the paper contains "AI agent capabilities", but without tools or scaffolding that enable the LLMs to interact with some environment, I wouldn't call the models being tested AI agents. Also, to be more realistic and comparable with the human baseline, I would argue the models should have been allowed to use tools, e.g., by generating code, to deal with numerical values.
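A minimal sketch of the formatting point above, assuming the whitespace-delimited layout described in Section 3.2 (the matrix values are made up): once code execution is allowed, such a block is trivially machine-readable, so the difference between a free-form text string and a CSV mostly disappears and the current setup primarily tests in-context numerical reasoning.

```python
import io
import numpy as np

# A made-up whitespace-delimited matrix of the kind described in Section 3.2.
example_block = """\
0.01 0.03 0.12 0.40
0.02 0.05 0.35 0.90
0.01 0.04 0.20 0.55
"""

# One line of code turns the in-prompt string into an array a tool-using
# agent could analyze numerically.
intensity = np.loadtxt(io.StringIO(example_block))
print(intensity.shape, intensity.max())  # (3, 4) 0.9
```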
### Minor
Line 146 mentions that ARPES analysis is an inverse problem, but line 151 says "the aim of ARPES data analysis is then to work out x → y". While I think I understand where that comes from, I find it more intuitive to think of the underlying dispersion and linewidth functions as the 'x' (of the forward process) and the noisy spectrum as the 'y', so that ARPES analysis is really about solving y → x.
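In my notation (not the paper's), the distinction would read:

```latex
% Notation is mine, not the paper's: x = latent dispersion/linewidth, y = measured spectrum.
\[
  x = \bigl(\varepsilon(k),\ \Gamma(k)\bigr)
  \;\xrightarrow{\ \text{forward model} + \text{noise}\ }\;
  y = I(k,\omega),
  \qquad
  \text{ARPES analysis (inverse problem): recover } x \text{ from } y,\ \text{i.e. } y \to x .
\]
```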
- Did the authors try to equip the LLMs with tools (e.g., a Python interpreter)? What about allowing them to write code? I suspect performance would increase; if not, that would be a good motivation for SciPro Arena. Allowing the use of coding tools is also more realistic for scientific analysis.
- Is providing 3 in-context examples optimal? Is it enough to capture the nature of the tasks or does the model still need prior knowledge about the scientific domain the question is coming from?
- I might have missed it, but how are the answers extracted from the LLM's response?
- Will the benchmark be open-source? |
Fully human-written |
|
SciPro Arena: a Case Study of AI Agent Capabilities in Scientific Analysis Tasks |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a new benchmark called SciPro Arena, which measures how well recent LLMs perform on the task of scientific data analysis. In particular, the authors focus on 27 categories of analysis tasks that require models to extract patterns from noisy experimental data. The authors present extensive experimental results covering recent reasoning models' performance on the SciPro Arena benchmark while varying the noise level and dataset resolution.
* Unlike other benchmarks that focus on information extraction and inductive reasoning, SciPro Arena focuses on real, empirical data as input and requires deep analytical reasoning about complex, noisy data. The proposed task mirrors natural scientific experiments where relevant information (‘y’) is latent and must be inferred from proxy measurements (‘x’).
* Results demonstrate a large gap between human performance of 38% (55% on noiseless data) and the most recent reasoning models, which achieve only around 13% (21% on noiseless data) on these tasks.
* The dataset covers 27 task categories grouped into A) Fermi level extraction, B) dispersion tracing, C) linewidth tracing, D) phonon energy determination, and E) doping determination. While these five tasks are complex and represent core analytical work within their domain, they are a small fraction of the total landscape of scientific data analysis. Additional discussion of the scope of these tasks would help situate the claims better.
* As expected, the performance of reasoning models degrades as the noise level or dataset resolution increases. A more detailed discussion of what kind of agentic systems could be developed on top of these LLMs to alleviate these limitations on the existing tasks would strengthen the paper. The current future-work discussion focuses more on generalizing the tasks or developing agents for meta-analysis.
1. Placing the result figures near their descriptions would make the paper easier to read.
2. The authors have uploaded supplementary material; however, most of the prompt and result files are just placeholders. I would suggest releasing the prompts and data so that the research community can reproduce these results.
3. Section 3.4 explains that the noisy versions of the dataset were generated by inserting randomly distributed 2D Gaussians into each spectrum (a sketch of my reading of this scheme follows below).
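A minimal sketch of the noise-injection scheme as I understand it from that section; the amplitude and width ranges below are my own assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_blobs(spectrum, n_blobs=8, rel_amplitude=0.2, width=3.0):
    """Return a copy of `spectrum` with `n_blobs` randomly placed 2D Gaussians added.

    Blob amplitudes and widths are illustrative assumptions, not the paper's values.
    """
    noisy = spectrum.copy()
    n_e, n_k = spectrum.shape
    yy, xx = np.mgrid[0:n_e, 0:n_k]
    for _ in range(n_blobs):
        y0 = rng.uniform(0, n_e)   # random blob center along the energy axis
        x0 = rng.uniform(0, n_k)   # random blob center along the momentum axis
        amp = rel_amplitude * spectrum.max() * rng.uniform(0.5, 1.5)
        noisy += amp * np.exp(-((yy - y0) ** 2 + (xx - x0) ** 2) / (2 * width ** 2))
    return noisy

# Toy spectrum: a single parabolic band, Gaussian-broadened in energy.
e = np.linspace(-0.5, 0.1, 60)[:, None]   # energy axis (eV)
k = np.linspace(-0.4, 0.4, 40)[None, :]   # momentum axis
clean = np.exp(-((e - (-0.3 + 0.8 * k ** 2)) ** 2) / (2 * 0.02 ** 2))
noisy = add_gaussian_blobs(clean)
```
If my reading is right, the key design choice is that the blobs mimic spurious spectral features rather than uniform pixel noise, which is what makes the noisy tiers hard. |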
Fully human-written |