MotifScreen: Generalizing Virtual Screening through Learning Protein-Ligand Interaction Principles
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper presents MotifScreen, a new model for protein-ligand binding affinity prediction, along with a new virtual screening benchmark named ChEMBL-RL. The authors claim their method and benchmark are more robust to data leakage; however, significant concerns remain regarding the fairness of the comparative study against baseline models. Furthermore, the comparison with the previous LIT-PCBA benchmark requires more thorough discussion.
---
LLM usage: I wrote the entire review myself and only used an LLM to correct grammar and improve readability.
- The motivation for preparing the new benchmark is clear.
- The proposed benchmark, ChEMBL-RL, exhibits less bias compared to the decoy-based test sets DUD-E and DEKOIS 2.0.
## ChEMBL-RL
**1. Lack of a robust strategy to avoid false negatives.**
The authors construct their decoy set by sampling actives from other targets in ChEMBL. However, I question why the authors rely only on Tanimoto similarity to filter these decoys (as sketched below), without employing other computational tools (e.g., docking or cofolding tools) to screen out false negatives. For DEKOIS 2.0 or DUD-E, decoys are drawn from large ligand libraries like ZINC, which contain many inactive molecules and thus pose a lower risk of including false negatives. In contrast, ChEMBL is a library of bioactive compounds and one of the most popular libraries for identifying initial hits. It is plausible that compounds from this library exhibit activity against a target even if they are structurally dissimilar to known actives.
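For concreteness, a minimal sketch of the kind of similarity-only filter described above, assuming RDKit Morgan fingerprints and an illustrative 0.3 cutoff (neither taken from the paper):
```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def is_acceptable_decoy(candidate_smiles, active_smiles, cutoff=0.3):
    """Accept a cross-decoy only if it is Tanimoto-dissimilar to every known active."""
    cand_fp = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(candidate_smiles), 2, nBits=2048)  # radius-2 Morgan FP
    for smi in active_smiles:
        ref_fp = AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(smi), 2, nBits=2048)
        if DataStructs.TanimotoSimilarity(cand_fp, ref_fp) >= cutoff:
            return False  # too similar to a known active; reject as decoy
    return True
```
Passing such a filter says nothing about whether the candidate actually binds the target; a structurally novel ChEMBL bioactive can still be a true binder, which is precisely the false-negative risk raised here.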
**2. Is the constructed decoy set truly better than LIT-PCBA's inactive set?**
In Table 1, the authors claim that ChEMBL-RL achieves better bias control than LIT-PCBA due to a lack of protein-side data leakage. However, this is not a fair comparison, given that LIT-PCBA uses **experimentally validated inactives**, while ChEMBL-RL uses **putative inactives** (i.e., cross-decoys). Removing bias from a limited set of _experimental_ data is arguably more difficult than drawing a bias-minimizing decoy set from a large pool of _assumed_ inactives.
Moreover, the AVE values in the original LIT-PCBA paper appear lower than the values reported here. The authors should justify why they report the AVE for only 4 targets from LIT-PCBA.
**3. Flawed EF1% comparison.**
The EF1% values in Table 2 cannot be directly compared across different benchmarks. This metric is highly dependent on the ratio of actives to decoys (i.e., the size of the decoy set), which differs between benchmarks.
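To make the dependence explicit, with the standard definition the attainable ceiling on EF is set directly by the active:decoy ratio (and the chosen fraction):

$$
\mathrm{EF}_{x\%} = \frac{n_{\text{actives in top }x\%}\,/\,n_{\text{top }x\%}}{N_{\text{actives}}\,/\,N_{\text{total}}},
\qquad
\mathrm{EF}_{x\%}^{\max} = \min\!\left(\frac{N_{\text{total}}}{N_{\text{actives}}},\ \frac{100}{x}\right).
$$

A set with 30 decoys per active thus caps EF1% at about 31, while one with roughly 50 decoys per active allows values up to about 51, so raw EF1% numbers from different benchmarks are not on a common scale.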
---
## MotifScreen
**4. Unfair comparative study.**
MotifScreen is trained on three datasets (PDBbind, BioLip, and ChEMBL), creating a training set that is significantly larger (reportedly ~6x) than the one used for the baseline models (PDBbind). For a fair comparison, the authors should either report MotifScreen's performance when trained only on PDBbind or retrain the baseline models on the same extended training set. For a model like KarmaDock, which has separate structure and affinity modules, it seems feasible to retrain its affinity module on the affinity-only data from ChEMBL.
**5. Poor performance on DUD-E.**
In Table D5, MotifScreen's EF1% on the DUD-E benchmark is only 5.94, which is substantially lower than the baseline models (9-16) and other state-of-the-art methods (e.g., Glide [1], RTMScore [2], GenScore [3], PIGNet2 [4]) that report EF1% > 20 (values collected in the GenScore paper [3]). Given that MotifScreen uses an extended training dataset, this performance is insufficient to support the claim of robustness. The authors state that they filtered the training set to avoid leakage; they should also report performance _without_ this filtering to clarify whether the filtering is the cause.
While target leakage is a valid concern, many drug development campaigns focus on known targets. Therefore, evaluating performance on targets similar to the training set is still a necessary and practical assessment.
**6. Missing evaluation on DEKOIS 2.0.**
In Table D5, results for MotifScreen on the DEKOIS 2.0 benchmark are absent. The authors should also evaluate the model on DEKOIS 2.0.
---
**References:**
1. Halgren, Thomas A., et al. "Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening." Journal of medicinal chemistry 47.7 (2004): 1750-1759.
2. Shen, Chao, et al. "Boosting protein–ligand binding pose prediction and virtual screening based on residue–atom distance likelihood potential and graph transformer." Journal of Medicinal Chemistry 65.15 (2022): 10691-10706.
3. Shen, Chao, et al. "A generalized protein–ligand scoring framework with balanced scoring, docking, ranking and screening powers." Chemical Science 14.30 (2023): 8129-8146.
4. Moon, Seokhyun, et al. "PIGNet2: a versatile deep learning-based protein–ligand interaction prediction model for binding affinity scoring and virtual screening." Digital Discovery 3.2 (2024): 287-299.
- Please report the number of similar complexes (by protein sequence and ligand similarity) in PDBbind for each benchmark set. Also, report the number of data points excluded from the training set for the DUD-E evaluation (Line 381).
- It is well known that AUROC is not an ideal metric for evaluating virtual screening performance. Please report **BEDROC** in all main benchmark tables (e.g., Table 2); a computation sketch is given after this list.
- The notation **EF1** is incorrect; this metric is typically denoted EF1\% or EF_{1\%}. Please correct this throughout the manuscript. Consequently, the '(\%)' in the header of Table 2 ('EF1 (%)') should be removed, since the \% belongs in the metric name rather than as a unit.
- Please report the specific PDB IDs used, the number of actives, and the number of decoys for each target in the ChEMBL-RL benchmark in the Appendix.
- Is the maximum EF1\% value in ChEMBL-RL equal to 31 (given 30 decoys per active)?
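A minimal sketch of how EF1% and BEDROC could both be reported, assuming RDKit's `rdkit.ML.Scoring` utilities; the toy data and $\alpha = 80.5$ (a commonly used value emphasizing roughly the top 2% of the ranking) are illustrative choices, not values from the paper:
```python
import random
from rdkit.ML.Scoring.Scoring import CalcBEDROC, CalcEnrichment

random.seed(0)
# Toy screen mimicking a 30:1 decoy:active ratio; each row is [model_score, is_active].
actives = [[random.gauss(1.0, 1.0), 1] for _ in range(50)]
decoys = [[random.gauss(0.0, 1.0), 0] for _ in range(1500)]

# RDKit's scoring helpers expect rows sorted best-first; column 1 holds the label.
ranked = sorted(actives + decoys, key=lambda r: r[0], reverse=True)

ef1 = CalcEnrichment(ranked, 1, [0.01])[0]   # enrichment factor in the top 1%
bedroc = CalcBEDROC(ranked, 1, 80.5)         # alpha = 80.5 weights the early part of the ranking
print(f"EF1% = {ef1:.2f}, BEDROC = {bedroc:.3f}")
```
Regarding the last question: with 30 decoys per active, the EF1% ceiling is indeed $N_{\text{total}}/N_{\text{actives}} = 31$.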
Fully human-written
---
MotifScreen: Generalizing Virtual Screening through Learning Protein-Ligand Interaction Principles
Soundness: 3: good
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper argues that widely used SBVS benchmarks suffer from target leakage and ligand bias, which inflate reported DL performance. It proposes ChEMBL-LR, a leakage-resistant benchmark (60 targets; near-zero mean AVE bias of 0.033), and introduces MotifScreen, a multi-task, structure-based screening model with three heads: (1) pocket motif prediction, (2) fragment/key-atom structure compatibility, and (3) binding score prediction. The method is trained on PDBbind+BioLip+ChEMBL with strict leakage removal and reports results on ChEMBL-LR and DUD-E.
- **Clear diagnosis of benchmark pitfalls** with concrete analyses of target leakage and ligand-only shortcuts (AVE). The paper emphasizes high overlap between common benchmarks and PDBbind and quantifies ligand bias.
- **New benchmark (ChEMBL-LR)** with principled curation: strict target-wise separation, cross-decoys, removal of non-drug-like molecules, and near-zero mean AVE bias (0.033).
- **Principle-guided multi-task design** (motif, structure/key-atom, affinity) that attempts to force learning of interaction physics rather than shortcut signals.
- **Efficiency**: forward pass timing ($\sim$ 0.03 s/compound) suggests practical scalability for large libraries.
- **Mismatched training regimes undermine the comparison.** MotifScreen is evaluated on DUD-E after removing all training entries similar to its targets (from PDBbind/BioLip/ChEMBL), while most baselines appear to use their original training with likely target overlap. This setup likely depresses MotifScreen's DUD-E EF1% (5.94) and weakens any one-to-one comparisons between the models. A fair test would retrain baselines under the proposed training dataset or evaluate all methods on a single leakage-controlled split.
- **Use of $\Delta$ (performance drop) without harmonizing metric ranges or training regimes.** The manuscript subtracts EF1%/AUROC values between external benchmarks and ChEMBL-LR to argue smaller degradation for MotifScreen. However, EF1% ranges and training conditions differ across benchmarks and methods, making raw subtraction potentially misleading. A consolidated table exists (Table D5), but $\Delta$ remains hard to interpret as "generalization" without a common training/eval protocol.
- **Early-enrichment evidence is mixed versus the strongest baseline.** EF1% gains over SurfDock are not statistically significant (p = 0.161), which matters because early enrichment drives SBVS utility. The manuscript foregrounds AUROC, potentially obscuring this point.
- **Ablation study design choices look ad hoc.** Ablations use a reduced dataset and report epoch-31 snapshots. Figure D2 indicates similar validation AUROC trajectories for the "aff+motif" and "aff+motif+str" configurations. Without multi-seed runs or later-epoch checks, conclusions about hierarchical synergy risk over-interpretation.
- **ChEMBL-LR vs. LIT-PCBA: incremental benefit unclear.** The paper argues that LIT-PCBA's low AVE is limited (reported on only 4 targets) and that leakage arises under external training data sources (e.g., PDBbind). However, LIT-PCBA already uses experimentally measured inactives and, by the authors' own RF results, remains difficult across all 15 targets (including the 11 with potential leakage), yielding low AUROC even when leakage could help. This can be treated as evidence that LIT-PCBA already probes generalization. Absent a uniform, leakage-controlled training/evaluation of all methods, it is unclear what ChEMBL-LR contributes beyond more targets with target-wise separation, rather than fundamentally stronger bias control.
1. **Normalization of $\Delta$ metrics.** Since EF1% ranges and dataset compositions differ across benchmarks, how do the authors justify interpreting raw $\Delta$EF1% as generalization? Would the authors consider relative EF1% retention or standardized effect sizes with a common training corpus (Table D5 suggests this is possible)? One possible normalization is sketched after these questions.
2. **Ablations: epoch choice and variance.** Why did the authors choose epoch 31 for reporting? How many seeds were run for Table 3/Figure D2? Could the authors report later-epoch or full-training ablations (or confidence intervals) to substantiate the hierarchical-synergy claim?
3. **Cross-docking criterion.** Sequence identity >95% is a strong criterion, but it may be an indirect one for SBVS, where the binding pocket is generally already known. Did the authors consider pocket-level similarity (e.g., local alignment, cavity overlap)?
4. **Inference score.** Please confirm that ŷ (final scalar binding prediction) is the ranking score used in all screening experiments, and note where this is specified in the main text.
5. **EF1% and BEDROC reporting.** Since early enrichment is crucial in VS, why is BEDROC reported only in the appendix rather than in the main comparison tables alongside EF1%/AUROC? Could the authors include per-target EF1%/BEDROC distributions with CIs? Also, what $\alpha$ value was used in the BEDROC computation?
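One concrete (purely illustrative) possibility for the normalization asked about in question 1:

$$
\text{retention} = \frac{\mathrm{EF}_{1\%}^{\text{ChEMBL-LR}}}{\mathrm{EF}_{1\%}^{\text{external}}},
\qquad
\Delta_{\text{norm}} = \frac{\mathrm{EF}_{1\%}^{\text{external}} - \mathrm{EF}_{1\%}^{\text{ChEMBL-LR}}}{\mathrm{EF}_{1\%,\max}^{\text{external}}},
$$

either of which would be less sensitive to the differing EF ceilings across benchmarks than the raw difference.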
## Typos
- Garbled AVE formula text (line 125)
- "MotifGen (Anonymous, 2025)" placeholder. (line 205)
- Incomplete line in the paragraph describing Table D1 (lines 1359-1360)
Fully AI-generated
---
MotifScreen: Generalizing Virtual Screening through Learning Protein-Ligand Interaction Principles
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
The paper presents MotifScreen, a multi-task deep learning framework for structure-based virtual screening that aims to improve generalization by modeling protein–ligand interaction principles. It integrates motif prediction, structure prediction, and affinity scoring modules trained jointly. The authors also introduce a new benchmark, ChEMBL-LR, designed to reduce ligand bias and target leakage compared to datasets such as DUD-E and DEKOIS 2.0. Experiments show that MotifScreen achieves 0.68 AUROC on ChEMBL-LR and exhibits improved robustness and smaller performance degradation across benchmarks. Ablation studies indicate that combining motif and structure learning contributes to generalization.
The paper addresses an important problem concerning data leakage and bias in existing datasets. The proposed multi-task learning framework, which incorporates various forms of external structural knowledge, represents an effective approach to better utilizing available data.
First, regarding bias, it is important to reconsider how it should be viewed. Similar binding pockets tend to bind similar molecules, and similar molecules tend to interact with similar pockets — this assumption underlies all machine learning–based models in this field. Based on this, the actives in any test set will naturally have higher similarity to the reference ligands. Therefore, this so-called ligand bias reflects an inherent relationship within the data itself, and its presence has a certain scientific justification.
Second, in terms of model design, the proposed architecture lacks novelty. Components such as the SE(3) Transformer, EGNN, and grid-based representations have already been widely used in related molecular representation and virtual screening models.
Third, regarding the dataset, the use of random decoys makes the task overly easy. The real challenge lies in distinguishing structurally or physicochemically similar compounds; performance on average or dissimilar decoys is less meaningful. BEDROC should also be reported in Table 1 to show top-ranking performance.
Finally, as a new benchmark, it should include a comprehensive evaluation across a wide range of models to ensure fairness and demonstrate general applicability.
1. The formula for AVE in lines 125-126 appears incorrect: it currently shows (IT IV − IT IV), which seems to be a typographical error (the standard form is given after these questions for reference).
2. The drop results in Table 2 are not directly comparable. First, the absolute value of EF1% is influenced by the ratio of actives to decoys in each dataset. Moreover, different datasets (such as DUD-E and DEKOIS 2.0) were used to calculate the drop for different models, making the comparisons inconsistent. In addition, the paper does not report results for more recent virtual screening models, only docking-based methods.
3. The key challenge of this task lies in the enrichment of active compounds at the top of the ranking list. Therefore, metrics such as BEDROC, which assign higher weight to top-ranked molecules, are more appropriate than AUROC. Similarly, Table 1 should also report BEDROC and EF values rather than only AUROC.
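For reference, the AVE bias (Wallach & Heifets, 2018) is usually summarized with four terms of the following form (notation mine, not the paper's):

$$
\mathrm{AVE} = \big[\overline{S}(V_A, T_A) - \overline{S}(V_A, T_I)\big] + \big[\overline{S}(V_I, T_I) - \overline{S}(V_I, T_A)\big],
$$

where $V_A, V_I$ ($T_A, T_I$) denote validation (training) actives and inactives, and $\overline{S}(X, Y)$ measures the mean nearest-neighbor similarity of compounds in $X$ to the set $Y$ (the original formulation aggregates nearest-neighbor hits over a range of similarity thresholds). The garbled "(IT IV − IT IV)" span in the manuscript presumably corresponds to one of these bracketed differences.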
Lightly AI-edited
---
MotifScreen: Generalizing Virtual Screening through Learning Protein-Ligand Interaction Principles
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper addresses the issue of overfitting performance reporting in deep learning-based structure-based virtual screening (SBVS), which the authors attribute to systemic biases and data leakage in commonly used benchmarks. The authors make a two-fold contribution: first, they introduce ChEMBL-LR, a new leakage-resistant benchmark designed to provide a more realistic evaluation of model generalization. Second, they propose MotifScreen, a novel end-to-end SBVS model. MotifScreen uses a principle-guided, multi-task learning framework that reasons about protein-ligand interactions by predicting binding pocket motifs, ligand-pocket compatibility, and final binding probability.
1. The paper's most significant contribution is its critical analysis of the systemic flaws in existing SBVS benchmarks. The development of the ChEMBL-LR dataset, which explicitly controls for target leakage and ligand bias, is a valuable service to the community and helps establish a more rigorous standard for future research.
2. The work is well-motivated, and the paper is clearly written and structured. The analysis in Section 4.1, which uses a Random Forest model to quantify the extent of leakage in benchmarks like DUD-E and LIT-PCBA, provides strong evidence for the authors' claims.
3. The multi-task learning architecture of MotifScreen is conceptually sound. Forcing the model to learn intermediate, physically-grounded tasks like motif identification and key atom positioning is a promising strategy to improve generalization and move beyond simple classification shortcuts.
1. Important baselines are missing, e.g., DrugCLIP [1] and EquiScore [2].
2. Deep learning methods tend to overfit to the benchmarks, whereas AutoDock Vina adopts only a simple linear scoring function. Yet in Table 2 and Table D5, a significant performance decrease is also observed for AutoDock Vina. More analysis of this should be performed.
3. Training data is important for deep learning methods. MotifScreen employs a different and larger training set than the previous baselines. More analysis and an ablation study on the effect of the training data are important for evaluating this paper.
[1] DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening, NeurIPS, 2023
[2] Generic protein–ligand interaction scoring by integrating physical prior knowledge and data augmentation modelling, Nature Machine Intelligence, 2024
1. The citation of MotifGen in line 205 is wrong.
Fully human-written |