AbBiBench: A Benchmark for Antibody Binding Affinity Maturation and Design
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The manuscript introduces AbBiBench (Antibody Binding Benchmarking), a comprehensive benchmarking framework tailored to antibody binding affinity maturation and design. A core principle of the framework is to treat the antibody–antigen (Ab–Ag) complex, rather than the antibody alone, as the fundamental functional unit for analysis. The authors curate an extensive dataset comprising over 186,580 experimental measurements of antibody mutants spanning 13 antibodies and 9 antigens, and systematically evaluate 15 protein models spanning several architectural classes, including masked language models, autoregressive models, inverse folding models, diffusion-based generative models, and geometric graph models.
1. **Creation of a Comprehensive Dataset**: The assembly and curation of the AbBiBench dataset constitute a significant and valuable contribution to the field of computational antibody engineering. By integrating a large volume of experimental binding affinity measurements across a diverse set of clinically relevant antibody–antigen complexes, the authors provide a crucial resource necessary for robustly training and rigorously evaluating next-generation generative and predictive models for affinity maturation.
1. **Omission of Relevant Baselines:** The empirical comparison omits recent, highly relevant structure prediction models, most notably the AlphaFold3 series. Several recent studies have demonstrated the utility of confidence metrics derived from these pipelines (e.g., ipTM) for estimating the impact of mutations on protein–protein interaction stability and binding affinity [1, 2], so their absence leaves the baseline set incomplete.
2. **Limited Predictive Significance of Observed Correlations**: The reported correlation coefficients ($\rho$) in Section 4.1 are consistently low, typically falling between 0.1 and 0.2. While the authors attempt to differentiate performance within this narrow band, the statistical and predictive difference between, say, $\rho=0.1$ and $\rho=0.2$ is arguably too small to be meaningful or indicative of strong predictive power for practical affinity maturation (see the sketch below for one way to test this).
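To make this concern concrete, the distinguishability of two correlations in this range can be checked directly with a paired bootstrap over mutants. The following is a minimal sketch using synthetic stand-in scores for two hypothetical models (none of these numbers come from the paper); if the resampled confidence interval for the difference covers zero, the ranking of the two models is not statistically meaningful on that dataset.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Synthetic stand-ins: experimental affinities and two models' scores
# for the same n mutants, tuned to the rho ~ 0.2 and rho ~ 0.1 regimes.
n = 2000
affinity = rng.normal(size=n)
model_a = 0.2 * affinity + rng.normal(size=n)
model_b = 0.1 * affinity + rng.normal(size=n)

# Paired bootstrap: resample mutants, recompute both correlations,
# and examine the distribution of their difference.
diffs = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)
    rho_a, _ = spearmanr(affinity[idx], model_a[idx])
    rho_b, _ = spearmanr(affinity[idx], model_b[idx])
    diffs.append(rho_a - rho_b)

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% CI for rho_A - rho_B: [{lo:.3f}, {hi:.3f}]")
# An interval containing 0 means the two models are not
# statistically distinguishable on this dataset.
```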
1. The citation style throughout the manuscript appears inconsistent with standard academic conventions. Could the authors please verify and uniformly revise the citation format to adhere to a recognized academic standard?
2. In Section 3.3.2, could the authors elaborate on the practical steps taken to determine $\Delta G$ without structure prediction?
3. To enhance the conviction and completeness of the benchmark results, would the authors consider incorporating the performance metrics (e.g., correlation with affinity) derived from AlphaFold2 and AlphaFold3?
[1] Wee, J. and Wei, G.W., 2024. Evaluation of AlphaFold 3's protein–protein complexes for predicting binding free energy changes upon mutation. Journal of Chemical Information and Modeling, 64(16), pp. 6676–6683.
[2] Lu, W., Zhang, J., Rao, J., Zhang, Z. and Zheng, S., 2024. AlphaFold3, a secret sauce for predicting mutational effects on protein–protein interactions. bioRxiv, 2024-05.
Moderately AI-edited

AbBiBench: A Benchmark for Antibody Binding Affinity Maturation and Design
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper introduces AbBiBench, a benchmark dataset and framework for antibody affinity maturation and design, focusing on the regions of the data landscape that have accompanying structural information available. Within this framework, the authors compare several publicly available methods and also perform experimental validation on a clinically relevant target.
The paper is clear and easy to follow, with only some minor unclear parts (especially Section 2.1). The authors' idea for the benchmarking framework, and the benchmark itself, is valuable and could be useful to the field. The paper also performs experimental validation, which is a big strength in my opinion.
The main weakness of the paper, in my opinion, is the scarcity and lack of novelty of the aggregated datasets that are subsequently used in the benchmarks. The authors present 15 individual datasets (coming from around 8 research articles) featuring roughly 200k measurements. Although this number may seem impressive, virtually every one of these datasets has already been used as a benchmark in several ML-related publications, and the majority are publicly available in easy-to-parse formats. Headline measurement counts can also be misleading, since the data come from a variety of assays (high- and low-throughput), and high-throughput results do not necessarily correspond to more accurate ones. Some publicly available datasets recognized in the field are also not included, for unclear reasons (e.g., the IgDesign 2025 dataset). While I endorse the idea of the paper, I believe the execution lacks depth, which prevents me from scoring this submission higher. I would welcome, and score much higher, a submission that added more value through, e.g., manual (or LLM-assisted?) inspection of a large number of manuscripts and extraction of the valuable data hiding there for many targets, similar to the effort behind the SKEMPI database.
- Is there any particular reason that some datasets with structural support were omitted by the authors, e.g., Dreyer et al., 2024, or Shanehsazzadeh et al., 2024?
Fully human-written

AbBiBench: A Benchmark for Antibody Binding Affinity Maturation and Design
Soundness: 3: good
Presentation: 3: good
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The authors introduce AbBiBench, a benchmarking framework for antibodies. It evaluates models by scoring the complete antibody-antigen complex rather than the antibody in isolation, arguing this is a more biologically sound approach. Using a curated dataset of over 186,580 experimental affinity measurements, this paper benchmarks 15 protein models and finds that structure-conditioned inverse folding models (like ESM-IF) generally outperform other architectures. This conclusion is supported by an in vitro case study where designed antibodies successfully gained a new binding function.
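For readers unfamiliar with this evaluation protocol, the core idea is to score a variant by a model's (pseudo-)log-likelihood of the full antibody–antigen complex rather than of the antibody sequence alone. Below is a minimal sketch of that distinction using the fair-esm package (an ESM-2 masked language model); the sequences are illustrative placeholders, and the naive chain concatenation is a simplification, not the paper's exact protocol.

```python
import torch
import esm  # pip install fair-esm

# A small ESM-2 model keeps the example fast to download and run.
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

def pseudo_log_likelihood(seq: str) -> float:
    """Sum of log p(x_i | rest) with each residue masked in turn."""
    _, _, tokens = batch_converter([("seq", seq)])
    total = 0.0
    with torch.no_grad():
        for i in range(1, len(seq) + 1):  # positions 0 and -1 are BOS/EOS
            masked = tokens.clone()
            masked[0, i] = alphabet.mask_idx
            logits = model(masked)["logits"]
            log_probs = torch.log_softmax(logits[0, i], dim=-1)
            total += log_probs[tokens[0, i]].item()
    return total

# Illustrative placeholder fragments, not sequences from the paper.
antibody_heavy = "EVQLVESGGGLVQPGGSLRLSCAAS"
antigen = "MKAILVVLLYTFATANA"

# Antibody-only score vs. complex-level score. Direct concatenation is
# a simplification; real pipelines handle chain breaks explicitly.
score_ab_only = pseudo_log_likelihood(antibody_heavy)
score_complex = pseudo_log_likelihood(antibody_heavy + antigen)
print(f"antibody-only: {score_ab_only:.2f}, complex: {score_complex:.2f}")
```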
- Assembling, curating, and standardizing over 186,580 experimental measurements is a much-appreciated contribution to the field. Making this dataset public will accelerate future model development.
- The authors performed in vitro ELISA assays on 21 designed variants and showed a clear gain-of-function (H1N1 binding) that the wild-type antibody lacked. This validates that the benchmark's top models can be used in a practical, successful design campaign.
- The benchmark, and its in vitro validation, focuses on binding affinity (Kd or ELISA OD signals). In a therapeutic context, the ultimate goal is function (e.g., neutralization, measured by IC50). But affinity is a common proxy used in most computational studies.
- While the benchmark's focus on binding affinity is important, it doesn't capture the full picture. Antibody design is a multi-parameter optimization problem, and the authors acknowledge their work would be more beneficial if it included other key data like stability, immunogenicity, and functional assays.
- The generative case study (against H1N1) is based on an AlphaFold3-predicted structure, as no experimental one exists. The paper's supplement reveals this predicted complex has an iPTM score of 0.39, which indicates very low confidence in the predicted interface.
- Two anti-influenza antibodies (CR9114 and CR6261) account for most of the dataset. This means the "Average" performance (Figure 3) is overwhelmingly dominated by how well models perform on influenza, not on general Ab-Ag interactions.
- It would be great to incorporate data beyond binding affinity, e.g. from Flab, in the benchmark to allow for multi-parameter optimization and assessment
- Some class balancing when assessing overall performance across the different, very unbalanced datasets would be useful (see the sketch below), as well as some discussion of epitope diversity in the dataset.
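On the balancing point, one simple remedy is macro-averaging: compute the metric per dataset and average those values, so the two large influenza datasets cannot dominate the headline number. A minimal sketch with synthetic, hypothetical per-dataset data (names and sizes are illustrative of the imbalance, not taken from the paper):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical dataset sizes mimicking the imbalance:
# two large influenza sets vs. several small ones.
datasets = {"CR9114": 80_000, "CR6261": 60_000,
            "other_1": 2_000, "other_2": 1_500, "other_3": 1_000}

per_dataset_rho = {}
pooled_pred, pooled_true = [], []
for name, n in datasets.items():
    true = rng.normal(size=n)
    pred = 0.15 * true + rng.normal(size=n)  # weak signal, as in Sec. 4.1
    per_dataset_rho[name], _ = spearmanr(pred, true)
    pooled_pred.append(pred)
    pooled_true.append(true)

# Micro (pooled) average: dominated by the two influenza sets.
micro, _ = spearmanr(np.concatenate(pooled_pred), np.concatenate(pooled_true))
# Macro average: each antigen/dataset contributes equally.
macro = np.mean(list(per_dataset_rho.values()))
print(f"micro rho = {micro:.3f}, macro rho = {macro:.3f}")
```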
Fully human-written

AbBiBench: A Benchmark for Antibody Binding Affinity Maturation and Design
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces AbBiBench, a large-scale benchmarking framework for antibody binding affinity maturation and design. Unlike prior metrics that evaluate antibodies in isolation, AbBiBench evaluates antibody–antigen complexes jointly. It comprises experimental data from over 186,000 mutants across 13 antibodies and 9 antigens and systematically compares 15 protein models. Results show that structure-conditioned inverse folding models outperform others in predicting and generating high-affinity variants.
1. The benchmark covers a wide range of antibodies, antigens, and model architectures, allowing comprehensive and biologically informed evaluation of antibody design models.
2. The study includes in vitro validation, providing strong experimental evidence that supports the findings of the computational benchmark.
1. While AbBiBench provides a biologically meaningful benchmarking pipeline, it primarily integrates existing protein and antibody machine learning models without introducing novel machine learning methodologies.
2. The antibody generation experiments, which sample from the models, focus on a single antigen, influenza H1N1, which limits the generalizability of the generation results across diverse antigen targets.
1. What is the correlation between affinity fold change and actual binding affinity? Why is this scoring approach better than other recent binding affinity prediction models?
2. Phase 2 of identifying final candidates in Section 3.3.2 relies on AlphaFold 3 for full-complex structure generation, which is a computationally intensive step. Could the authors provide details and discussion of the computational time/efficiency of the benchmark?
Lightly AI-edited