HARPA: A Testability-Driven, Literature-Grounded Framework for Research Ideation
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes a new framework named HARPA, aimed at addressing the ideation-execution gap commonly found when current large language models (LLMs) generate scientific ideas. HARPA consists of a proposal generator and a scorer: the former simulates the workflow of human researchers to generate high-quality ideas, and the latter predicts the feasibility of the proposed ideas.
The scorer mechanism is interesting: it does not rely on heuristic rules or pure LLM judgments, but instead learns feasibility from actual experimental execution results. In addition, the paper pinpoints a core pain point in the current AI for Science field: the disconnect between the novelty and feasibility of ideas.
1. Over-optimization for "feasibility" may suppress breakthrough innovation: The core of HARPA is to enhance the feasibility of ideas. A potential risk is that the system may therefore prefer ideas that are safer, simpler, and more incremental (e.g., replacing a network layer or testing an old model on a new dataset), while filtering out high-risk ideas that could lead to breakthroughs. The experimental results partly confirm this point (HARPA's novelty score is slightly lower than the baseline's). The authors should discuss this feasibility-novelty trade-off in more depth and explore whether the HARPA framework can be adjusted to balance or encourage higher-risk innovation.
2. The organization and readability of the paper need to be improved:
- Frequent jumps to appendices: The narrative flow of the main text is frequently interrupted by "see Appendix X" references, making it difficult for readers to follow the core methods.
- It is strongly recommended that the authors reorganize the structure of the paper, optimize the writing, and enhance the reading experience.
3. The paper lacks comparison with other idea-generation methods, such as The AI Scientist [A] and NovelSeek [B].
- [A] Lu C, Lu C, Lange R T, et al. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv preprint arXiv:2408.06292, 2024.
- [B] NovelSeek Team, Zhang B, Feng S, et al. NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification. arXiv preprint arXiv:2505.16938, 2025.
4. The scope of related work can be broader: Although the paper has cited a large number of related works, some recent highly relevant research seems to have been omitted. For example:
- Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM-Generated Ideas.
- Large Language Models Are Zero-Shot Hypothesis Proposers.
- Large Language Models for Automated Open-domain Scientific Hypotheses Discovery.
- Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers.
- Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback.
If the authors can address my concerns, I will reconsider my rating.
The authors should revise the text and figures to make the paper more readable.
Fully human-written
HARPA: A Testability-Driven, Literature-Grounded Framework for Research Ideation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes the HARPA framework, which generates testable research proposals through literature mining, hypothesis-space exploration, and a scorer trained on execution feedback, achieving notable improvements in feasibility and literature grounding.
The paper highlights the disconnect between innovation and feasibility when generating hypotheses with LLMs, which is important for advancing the practical application of automated scientific discovery (ASD) systems.
1. The paper proposes a complex multi-stage pipeline, but the necessity of each component does not appear to be justified.
2. This paper needs to provide a detailed cost analysis and comparison.
3. Can the scorer work effectively on other ASD systems or on other scientific domains?
1. How is the data labeled as "Uncertain" handled?
2. How much does the reasoning trace contribute to performance?
3. How was inter-rater reliability determined, i.e., do different experts agree on the same proposal?
Fully human-written
HARPA: A Testability-Driven, Literature-Grounded Framework for Research Ideation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes a research hypothesis generation framework named HARPA, aiming to address two major challenges for large language models (LLMs) in research ideation: ensuring the testability and literature-groundedness of hypotheses. The HARPA framework consists of two core components: (1) a multi-stage Proposal Generator, which identifies research trends, constructs a hypothesis space through literature mining, and finally converges on specific hypotheses that fill research gaps; and (2) a Proposal Scorer, a reward model (RM) trained on prior experimental execution trajectories (success or failure) and used to predict the feasibility of new hypotheses.
The authors evaluated HARPA through two sets of experiments: (1) human expert evaluation (against the AI-Researcher baseline), which showed that HARPA is significantly better in feasibility and groundedness; and (2) automated scientific discovery (ASD) agent evaluation, which showed that hypotheses generated by HARPA have a higher execution success rate on the ASD agent CodeScientist (20 vs. 11). Additionally, an independent experiment showed that HARPA's scorer is 28% more accurate than an untrained LLM baseline at predicting execution success.
1. Important problem: The paper addresses a critical and timely issue: how to bridge the gap between LLMs' idea generation and actual research execution.
2. Execution-grounded rewards: Treating feasibility as a key criterion and using actual ASD agent execution trajectories (rather than only zero-shot LLM judgments) to train the reward model is a sound and valuable direction.
3. Breadth of evaluation: The evaluation design (although flawed) covers both human expert assessment and real agent execution, and this multi-dimensional approach is commendable.
1. Lack of end-to-end evaluation: The paper is missing critical end-to-end verification. It claims that its core advantages lie in being "testability-driven" and in "self-adaptation to prior experimental outcomes". This strongly implies a closed-loop system: feedback from the scorer should guide or improve the generator. However, the experimental design is completely disjointed:
* Experiment 5.1 evaluated the generator (compared to AI-Researcher).
* Experiment 5.2 evaluated the scorer (compared to the untrained LLM).
* Missing experiment: The authors never conduct an end-to-end evaluation showing that the combination of generator + scorer outperforms the generator alone. They only demonstrate that a scorer can be trained, but never use it to filter or re-rank the generator's output (e.g., taking the top-K) and verify whether doing so further improves human expert ratings or the execution success rate of ASD agents. This leaves the paper's core claims of "self-adaptation" and "learning from experience" without the most direct experimental verification.
2. Unfair baseline comparison: There are serious confounding variables in the core human and agent evaluations in Section 5.1. HARPA's input (the full paper) is far more informative than the baseline's (topics generated from abstracts). This makes it impossible to attribute the observed improvements in feasibility and groundedness to the HARPA framework; they may simply be due to the difference in input information. This fundamentally weakens the argument for the effectiveness of the HARPA generator. It manifests in at least two ways:
* (a) Input inconsistency: As described in Section 4.2, HARPA's input is the source paper, while the input of the baseline AI-Researcher is a topic generated from each source paper's abstract. A complete paper clearly provides far richer and more specific context than a single topic. HARPA's significant advantages in groundedness (+0.85) and feasibility (+0.78) may be due solely to this difference in input granularity, rather than to the superiority of its generation process.
* (b) Uncontrolled retriever: The paper acknowledges that HARPA and AI-Researcher each use their own internal literature retrieval pipelines. Literature retrieval is the lifeblood of groundedness. The authors did not control this variable (e.g., by having both systems use exactly the same retrieval results as input), making it impossible to determine whether the performance improvement comes from HARPA's generation process or simply from a (possibly superior) retrieval component.
3. Sacrificed novelty: This is a key issue. According to Table 6 (Appendix), in the human evaluation HARPA's novelty score (5.98) is actually lower than the baseline's (6.43). Although the authors state in the main text that "novelty is not sacrificed", for a top-tier venue like ICLR, a paper that (possibly for the confounded reasons above) excels in feasibility but is on par with or behind the baseline in novelty offers a limited contribution.
4. Generalization ability and domain mismatch: HARPA's scorer was trained on a dataset of ACL (NLP) papers (Section 4.3), yet the evaluation set (Section 4.2) covers a broader range of topics, including reinforcement learning (RL) and optimization. The authors do not show whether a feasibility predictor trained on NLP generalizes to such distinct domains. The framework's applicability to scientific domains beyond those tested has not been fully explored.
5. Limited handling of complex research questions: The methodology and evaluation appear to focus on relatively straightforward hypotheses that can be reduced to a few key variables. The paper does not explore how HARPA would address multi-faceted research questions that require multi-step experiments or involve complex interactions among multiple variables.
6. Lack of computational cost analysis: The paper does not analyze the computational resources required to train and run the HARPA framework (including the generator and the scorer). This makes it difficult for other researchers to assess the feasibility of reproducing or deploying the framework in their own labs.
1. (Regarding Weakness 1): Why did the authors not conduct an end-to-end evaluation? For example, using the trained HARPA scorer to re-rank the proposals produced by the HARPA generator, then submitting the top-K highest-scoring proposals to human experts and ASD agents. This seems to be a direct way to verify the actual utility of the scorer and is also crucial to support the "testability-driven" claim.
2. (Regarding Weakness 2): How can the authors show that the improvements in feasibility (+0.78) and groundedness (+0.85) observed in Section 5.1 stem from HARPA's generation process, rather than simply from HARPA receiving a much more informative input (full paper vs. abstract-derived topic)?
3. (Regarding Weakness 3): Table 6 shows that HARPA scores lower than the baseline on novelty. Does this mean the framework sacrifices novelty in exchange for (possibly confounded) gains in feasibility? That seems a significant compromise for a research ideation tool.
4. (Regarding Weakness 4): The HARPA scorer is trained on NLP (ACL) data. How accurate is it when evaluating the feasibility of proposals in non-NLP domains such as RL or optimization? Are there out-of-distribution (OOD) generalization issues?
5. (Regarding Weakness 5): The method focuses on extracting variables and values (A + B). How does it handle more complex research questions that require multi-step experiments or involve complex (non-direct causal) interactions between variables?
6. (Regarding Weakness 6): Could the authors provide a detailed analysis of the computational resources required for training and inference with HARPA (including the generator and the scorer)?
Fully human-written |