ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (25%) | 2.00 | 5.00 | 6751 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 2 (50%) | 3.00 | 3.50 | 7051 |
| Lightly AI-edited | 1 (25%) | 2.00 | 3.00 | 1490 |
| Fully human-written | 0 (0%) | N/A | N/A | N/A |
| Total | 4 (100%) | 2.50 | 3.75 | 5586 |
Title: A Comprehensive Benchmark for RNA 3D Structure-Function Modeling

Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper introduces a benchmarking suite of seven tasks for evaluating deep learning models on RNA 3D structure-function prediction, built on the rnaglib Python package with modular data processing, rigorous splitting strategies, and standardized evaluation protocols.

Strengths:
- Addresses the lack of standardized benchmarks for RNA 3D structure–function research.
- Provides seven benchmark tasks covering distinct biological challenges, along with modular tooling (filters, splitters) to facilitate research.
- Emphasizes strict data-splitting strategies (sequence- and structure-similarity-based) to prevent leakage, and empirically demonstrates performance inflation under random splits.

Weaknesses:
- The paper acknowledges that its simple baseline model (RGCN) does not reach SOTA performance compared with models in the literature.
- Some tasks (e.g., RNA-CM) have very small datasets, which may limit the effectiveness of training more complex models.
- The datasets mirror the overrepresentation of tRNAs and rRNAs in the PDB, potentially biasing models toward these families.

Questions:
- Why does 2.5D beat 3D? The finding that coarse-grained graphs outperform 3D atomic graphs raises the question: is this due to intrinsic properties of RNA data, or to limitations in the current 3D modeling approaches?
- What performance is "good enough"? For application-driven tasks like RNA-VS (virtual screening), is the current baseline performance (e.g., AUROC 0.759) sufficient to reliably guide real-world drug discovery workflows?

EditLens Prediction: Lightly AI-edited
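The similarity-based splitting highlighted in the review above can be pictured with a small sketch: greedily group each RNA with the remaining RNAs whose pairwise sequence or structure similarity exceeds a threshold, then assign whole groups to train or test so near-duplicates never straddle the split. The code below is illustrative only; it assumes a precomputed similarity matrix, and the function name, threshold, and fractions are hypothetical rather than the benchmark's actual API.

```python
# Illustrative leakage-aware split (not the paper's implementation).
# `sim` is an (n, n) matrix of pairwise sequence or structure similarity,
# e.g. identities or TM-score-like values in [0, 1]; all names are assumptions.
import random
import numpy as np

def similarity_split(sim: np.ndarray, threshold: float = 0.8,
                     test_fraction: float = 0.2, seed: int = 0):
    n = sim.shape[0]
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        i = unassigned.pop()
        members = {i} | {j for j in unassigned if sim[i, j] >= threshold}
        unassigned -= members
        clusters.append(sorted(members))

    random.Random(seed).shuffle(clusters)
    train, test = [], []
    for cluster in clusters:
        # Whole clusters land on one side, so near-identical RNAs cannot
        # appear in both train and test.
        (test if len(test) < test_fraction * n else train).extend(cluster)
    return train, test
```

A random split, by contrast, scatters members of the same cluster across both sets, which is the kind of leakage the review says inflates performance.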
Title: A Comprehensive Benchmark for RNA 3D Structure-Function Modeling

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper introduces a standardized and reproducible benchmark suite for RNA 3D structure–function modeling. It provides seven tasks, modular data-processing tools, and baseline results across multiple RNA representations using experimental RNA structures. The work aims to facilitate future model development and fair comparison in the emerging field of RNA structure-function prediction.

Strengths:
- The paper addresses an important and underexplored problem: the lack of standardized and reproducible RNA 3D structure–function benchmarks. While protein benchmarks have driven major progress in deep learning for structural biology, RNA has remained comparatively neglected. Establishing such a benchmark is useful for advancing model development in this area.
- The authors provide open-sourced code, data, and results (via an anonymous repository) and report results over multiple random seeds with error bars, which supports transparency and reproducibility.
- The paper is well written and easy to follow.

Weaknesses:
1. Limited technical novelty: The main distinction claimed over prior works (e.g., Beyond Sequence, Xu & Moskalev et al., ICLR 2025; rnaglib, Mallet et al., 2021) is the use of experimental RNA data rather than predicted 3D RNA structures. While this is useful, it represents a relatively modest incremental step rather than a substantial methodological or conceptual advance.
2. Substantial overlap with existing benchmarks and findings: The paper's key conclusions largely align with prior work and do not provide new insights:
   - The finding that atomic-level detail may be unnecessary for modeling RNA structure under limited data is already reported in Beyond Sequence (Xu & Moskalev et al., ICLR 2025).
   - The claim that 3D methods outperform 1D methods contradicts the authors' assertion of novelty, since Beyond Sequence already demonstrated that 3D models outperform 1D models when sufficiently parameterized, with a sufficient receptive field (see Table 1 in their paper, where nucleotide-pooled 3D models outperform 1D models almost always), and when reliable structures are available. This conclusion is also rather expected: with reliable 3D structures, it is not surprising that 3D models work better than 1D models, as they have all the information that 1D models have and more.
   - The observation that 2.5D representations perform best is consistent with previous findings as well (e.g., Beyond Sequence's Transformer 1D2D model also outperformed most of the 1D and even 2D and 3D models at times).
   Thus, overall, the empirical conclusions reinforce already well-reported findings rather than providing new insights.
3. Unclear distinction from rnaglib: The paper builds directly on rnaglib (Mallet et al., 2021), which already introduced a Pythonic benchmarking framework for RNA-related tasks with modular dataset, splitting, and evaluation components. The current work seems to primarily use rnaglib and introduce additional datasets within this framework rather than introducing fundamentally new functionality or benchmarking paradigms.
4. Limited scope of evaluated models: The benchmark includes only a small set of baseline models, namely LSTM and Transformer for 1D and RGCN and GVP(-2.5D) for 2D and 3D, which makes the comparative conclusions less robust. In contrast, recent works such as Beyond Sequence (Xu & Moskalev et al., ICLR 2025) evaluated a substantially broader set (multiple 1D, 2D, and 3D architectures, including both spectral and spatial GNNs for 2D, and classical as well as quite recent 3D models including GVP). Without testing diverse 2D and 3D models, it is difficult to conclude that 2.5D representations universally outperform 3D ones. Also, given that the curated data come from the RCSB PDB, which is known to contain structures of varying resolution, it is possible that structural inaccuracies remain in the annotated structures, and hence different 2D models may still work better than 3D models, as reported in prior works.
5. Practical relevance and generalizability: The paper acknowledges that high-quality RNA structures are scarce, yet focuses solely on curated experimental structures (which still number only a few thousand). It is unclear how conclusions drawn from such high-quality but limited data translate to real-world scenarios, where RNA data are often noisy and have incomplete labels. Prior works such as Beyond Sequence (Xu & Moskalev et al., ICLR 2025) explored this robustness dimension (e.g., noisy structures, sequencing errors), which this paper omits entirely.

Questions:
See the weaknesses above, where I have also described my rationale for the questions. Specifically, the following questions are raised by the weaknesses described above.
1. Clarify the novelty over rnaglib: What specific features or design principles differentiate this benchmark suite from rnaglib beyond the addition of datasets?
2. Positioning with respect to Beyond Sequence: The paper's results and claims of novelty should be clearly delineated from Beyond Sequence (ICLR 2025). How does this work contribute new insight or capability beyond confirming previously established trends, beyond just using experimental structures?
3. Broader model coverage: Comparing against more recent 2D and 3D architectures (e.g., E(n)-GNN, EGNN, SE(3)-Transformer, or equivariant message-passing baselines) would provide a more comprehensive benchmark comparison. Similarly, comparisons against recent 2D, 2.5D, and 1D models should also be made.
4. Real-world robustness: Testing model robustness to predicted or noisy RNA structures (e.g., from RoseTTAFoldNA, EternaFold, or low-resolution PDB entries) would make the benchmark more reflective of practical challenges.
Some additional questions that are necessary for the benchmark to be comprehensive and useful from a bioinformatics applications perspective are as follows; it would be good to clarify them as well.
5. Given the limited availability of high-resolution RNA structures, how do you ensure diversity across RNA families? How sensitive are results to incomplete or noisy annotations from Rfam?
6. Given the claim that the datasets used in the study are high-quality experimental structures, why might 3D models underperform compared to 2.5D ones?
7. How do the RNA–ligand interaction tasks (RNA-Ligand, RNA-Prot, RNA-Site) account for structural flexibility?
8. Unless I am mistaken, the dataset appears to be mostly dominated by tRNA and rRNA families. Will this limit the benchmark's ability to generalize to diverse RNA types?

Given the many weaknesses that need thorough addressing and the many additional experiments to be conducted, I feel this paper is not yet ready for publication but could benefit from iteration for future conferences.

EditLens Prediction: Fully AI-generated
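The robustness concern raised above (weakness 5 and question 4) could be probed without new data by perturbing experimental coordinates before evaluation. The snippet below is a minimal sketch of such a check, assuming numpy and a (num_atoms, 3) coordinate array in angstroms; the function name and noise levels are hypothetical and not part of the paper's pipeline.

```python
# Illustrative robustness probe (hypothetical; not from the benchmark).
import numpy as np

def perturb_coordinates(coords: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Add isotropic Gaussian noise (std `sigma`, in angstroms) to atom positions."""
    rng = np.random.default_rng(seed)
    return coords + rng.normal(scale=sigma, size=coords.shape)

# Evaluating a trained 3D baseline at, say, sigma = 0.25, 0.5, 1.0 and comparing
# the degradation against the 2.5D and 1D baselines would approximate the
# noisy-structure setting the reviewer points to.
```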
Title: A Comprehensive Benchmark for RNA 3D Structure-Function Modeling

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper introduces a modular and comprehensive benchmark for RNA 3D structure–function modeling, extending the rnaglib framework. It defines seven standardized tasks covering biological function prediction, molecular design, and RNA–ligand interaction modeling. Each task includes curated datasets, redundancy filtering, similarity-based data splitting, and consistent evaluation metrics. The authors provide a unified interface for data loading, preprocessing, and model benchmarking, together with baseline results. All datasets, code, and documentation are openly available.

Strengths:
- The work fills a clear gap in the current landscape of molecular machine learning benchmarks by focusing on RNA 3D structures.
- The design is comprehensive and reproducible, addressing crucial challenges such as data leakage, redundancy, and the lack of standardized evaluation protocols.
- The dataset construction is biologically meaningful and diverse, spanning multiple functional levels from sequence to small-molecule binding.
- The implementation is accessible and well-documented, offering an important resource for the community.
- The baseline experiments are thorough and demonstrate that the proposed tasks are feasible and sufficiently challenging.

Weaknesses:
- Some details are missing for the curated datasets. The authors should clearly present detailed statistics of the datasets for each task in the main text. For example, the size of the positive and negative samples in RNA-VS is currently unknown, which is important for understanding the data balance.
- The split strategies could be described in more detail. For instance, the RNA-LIGAND dataset is said to be split based on structural similarity, but the exact similarity threshold is not stated.
- The related work section also misses structure-based RNA models [1] [2].

[1] Stefaniak, Filip, and Janusz M. Bujnicki. AnnapuRNA: A scoring function for predicting RNA-small molecule binding poses. PLoS Computational Biology, 2021.
[2] Shuo Zhang, Yang Liu, and Lei Xie. Physics-aware graph neural network for accurate RNA 3D structure prediction. NeurIPS 2022 Workshop on Machine Learning for Structural Biology, 2022.

Questions:
Are the datasets used in LigandRNA and AnnapuRNA possibly included in, or overlapping with, the proposed benchmark? Both datasets are designed for evaluating or scoring RNA–small molecule binding poses.

EditLens Prediction: Moderately AI-edited
Title: A Comprehensive Benchmark for RNA 3D Structure-Function Modeling

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes a benchmarking framework for RNA 3D structure-function modeling. The framework currently includes seven tasks covering diverse biological challenges, each defined by standardized datasets, splitting strategies, and evaluation metrics. The framework is modular, providing reusable components such as annotators, filters, and splitting tools that simplify the creation of new tasks and ensure reproducibility. Finally, the authors establish initial leaderboards by training baseline neural models to evaluate different input representations and splitting schemes. The overall goal is to enable fair, reproducible comparisons and accelerate model development in RNA structural biology.

Strengths:
The paper presents a valuable contribution to RNA structural modeling by introducing the first comprehensive benchmark for RNA 3D structure-function prediction. Deep learning is rapidly expanding into bio-related domains, yet suitable datasets and well-defined evaluation pipelines remain fragmented, especially for RNA. By providing standardized datasets, data-splitting strategies, and evaluation metrics for seven biologically diverse tasks, the proposed framework can significantly lower the entry barrier for machine learning researchers entering this field. The modular design, with ready-to-use annotators and splitting tools, makes it straightforward to add new tasks and ensures reproducibility. The inclusion of simple baselines and initial leaderboards is also valuable, offering an immediate reference point for future model development. Overall, this benchmark has strong potential to accelerate progress by bridging the gap between the ML and RNA-structure communities and making RNA research more accessible to computational scientists.

Weaknesses:
Overall, the paper is clearly written, and the framework is implemented with attention to reproducibility (open-source code, retraining scripts, and detailed appendices). The main limitation of this work lies in its lack of novelty and limited significance for a machine learning audience. The paper does not introduce new modeling techniques, learning paradigms, or analytical insights into ML behavior. Its contribution is primarily infrastructural (data collection, filtering, and standardization), which, while useful, fits better within the scope of bioinformatics resources than an ML research venue. The framework may indeed help computer scientists enter the RNA field more easily, but this convenience comes with a trade-off: by abstracting away biological complexity, it risks encouraging the development of models that rely on poorly understood datasets rather than fostering a deeper understanding of the underlying biology.

**Methodological clarity and justification:**
- Some of the preprocessing and filtering choices were a bit unclear, and it would be beneficial to give more context on how these decisions were made. Lines 119-123 introduce resolution and size cutoffs, but only the upper limit for size is justified, and the rationale for excluding RNAs that depend on protein interactions is not discussed; this exclusion might not be equally important for all tasks. Similarly, the 8 Å cutoff introduced in line 278 is unexplained; if this value is standard in the field, it should be cited, and if not, the authors should justify it empirically. The same applies to the removal of binding sites that bind multiple ligands (line 280), which is presented as a data preprocessing step without any discussion of why such cases are problematic.
- Regarding Figure 5a, I was not entirely convinced by the claim that the size distribution of RNAs is bimodal. From visual inspection, it seems to reflect one dominant group of smaller RNAs and a long tail of very large ribosomal RNAs rather than two distinct peaks. If this interpretation is correct, a short clarification would help avoid confusion about the shape of the dataset distribution.

**Dataset design and labeling:**
- Some aspects of the dataset construction are not entirely clear to me. For example, in the RNA-Ligand and RNA-VS tasks, the PDB contains only a limited number of experimentally determined RNA-ligand complexes, so I was wondering how "negative" examples are defined. Does the absence of a complex in the PDB imply non-binding, or is there another source used to establish negatives?
- For the RNA-VS task, if I understood correctly, the ground-truth binding scores are generated by rDock. It would be helpful if the authors could comment on how reliable rDock is for RNA systems, and what the motivation is for training models on data derived from this scoring function rather than using the tool directly.
- Looking at Table 1, the train/validation/test split ratios also differ quite substantially across tasks. Could the authors explain why these percentages are not consistent and whether this affects the comparability of results? Finally, the gRNAde dataset appears to contain far more examples (over 11k RNAs) than the others; is there a reason why this larger dataset was not incorporated in the main dataset?
- For the RNA-Prot task, I wanted to confirm whether only the RNA sequence or structure is used as input, while the protein partner is not included. If that is the case, it would be interesting to understand the reasoning behind this setup: doesn't the identity or nature of the protein influence whether a nucleotide is likely to be in contact? Clarifying this could help readers interpret what kind of signal the model is expected to learn in this task.

**Representation and modeling clarity:**
- I found some parts of the description of input representations and model setups a bit difficult to follow, and additional clarification would be very helpful. In Section 5 and Appendix E, the term "1D representation space" is mentioned, but it was not entirely clear to me what this refers to; I assume it represents the RNA sequence, perhaps via a one-hot encoding, but a short explanation would make this explicit.
- In Figure 4a, I was unsure whether 1D-LSTM and 1D-Transformer use the same input features and differ only in architecture, or whether the representations themselves differ in some way. Similarly, in Figure 2, the GraphRepresentation (PyTorch Geometric) seems to correspond to the 3D representation, but this only became clear later in the paper.
- For Figure 4b and Appendix F, it would also help to include the number of parameters or blocks for each model. Since the datasets are relatively small, this context would make the performance comparisons easier to interpret. Finally, I could not find details on how training was terminated (fixed epochs, early stopping, validation criterion, etc.); adding this information would help readers understand how the models were optimized.

**Experimental transparency and reproducibility:**
- I found Section 6.4 and Table 1 informative, but I was not fully sure how to interpret some of the reported results. It would be useful to know which specific model configuration was used for each task, for instance the number of layers or which representation type was applied, so readers can better understand whether the performance differences come from architecture, data variation, or representation choice.
- The paper mentions reproducibility scripts, which is great, but I could not find details on whether tools such as CD-HIT and US-align were run with default or custom parameters. Clarifying this would make the setup easier to reproduce. Additionally, line 305 refers to a "recomputation" option; I was curious what exactly this entails. Does it mean that the datasets can be automatically updated with new PDB entries, or that the benchmark is rebuilt from a fixed snapshot? This would help readers understand how the benchmark can be maintained or extended over time.

**Conceptual limitations:**
- In the discussion, the paper makes relatively strong claims about the advantages of certain representations (for example, that 3D models outperform sequence-based ones), but I was wondering how these conclusions should be interpreted given the limited dataset sizes. There are far more RNA sequences available than experimentally resolved 3D structures, and large-scale sequence pretraining might ultimately prove more effective once transferred to structural tasks. It would be valuable if the authors could comment on this broader perspective: whether they see their results as evidence that 3D structure is intrinsically more informative, or simply as an observation constrained by the current scale of available data.

**Minor (did not influence the decision):**
- Line 27: Please use the full name "AlphaFold 2". The earlier AlphaFold (2019) model (https://www.nature.com/articles/s41586-019-1923-7) did not have the same impact or recognition, so it is important to distinguish between the two.
- Line 33: The phrasing suggests that CASP and CAPRI emerged after deep learning advances, while in reality they date back to 1994 and 2001, respectively, long before neural networks were used in structural biology. It would be good to correct this and cite representative references such as https://onlinelibrary.wiley.com/doi/10.1002/prot.26617 and https://onlinelibrary.wiley.com/doi/10.1002/prot.70018.
- Section 2.2: It would strengthen the context to mention that CASP now includes RNA 3D structure prediction (https://onlinelibrary.wiley.com/doi/10.1002/prot.26550) and that there is a dedicated RNA benchmark, RNA-Puzzles (https://www.nature.com/articles/s41592-024-02543-9).
- Section 2.3: It might also be useful to acknowledge recent deep-learning models for RNA 3D structure prediction such as DRfold (https://www.nature.com/articles/s41467-023-41303-9), RhoFold+ (https://www.nature.com/articles/s41592-024-02487-0), and trRosettaRNA (https://www.nature.com/articles/s41467-023-42528-4).
- Line 242: The heading currently looks like it belongs under Section 4.4, but it seems to introduce Sections 4.5 and 4.6 instead; adjusting this could help readability.
- Appendix B.1 (line 866): If I understood correctly, this should reference Figure 5b rather than 5c.
- Figure 6b: The high-similarity thresholds (e.g., 0.95–1.0) seem to overlap or are not visible, and the text does not clarify their position. Annotating or separating these points would make the figure easier to interpret.
- Table 2 (Appendix C): Adding a note that "Nodes = Nucleotides" would make the table easier to understand. I was also unsure what the "number of features" refers to; a short explanation would help.
- References: Several reference titles are not capitalized correctly; this can be fixed by wrapping words in braces ({}) in the BibTeX file.

Questions:
1. Some of the datasets included in the benchmark are quite small (for example, 138 or 157 data points in the training set). It would be interesting to hear the authors' perspective on what they expect can realistically be learned from such limited data.
2. Since the paper includes datasets like gRNAde and RNASite, did the authors consider training their baseline models on these existing datasets to compare performance or confirm consistency across sources?
3. I did not find information about whether it is possible to obtain a list of PDB IDs and corresponding target values for each dataset. Having access to such mappings would be very helpful for quick inspection, dataset validation, and external comparison.
4. Do the authors plan to include state-of-the-art baseline tools or models for each task? I could not find public leaderboards on the documentation website (only Table 1 in the paper). It would be very helpful to have task-specific leaderboards directly available in the online documentation to make performance comparisons and community contributions easier.

EditLens Prediction: Moderately AI-edited