Testing whether de novo antibody design models learn therapeutic-like properties without explicit training, using the FLAb benchmark.
De novo models produce antibodies with reasonable developability—because their training data (SAbDab/PDB) is already biased toward well-behaved proteins.
De novo designs are better than natural antibodies but still don't match the optimization level of approved therapeutics. Traditional developability engineering remains valuable.
ESM-2 without fine-tuning shows weak or negative correlations with experimental properties. Don't use it for fitness prediction out of the box.
Humanness alone predicts immunogenicity worse than random (AUROC 0.28). Multi-feature models achieve 0.74 AUROC, but the dataset is tiny (n=217).
Developability refers to the likelihood that an antibody candidate will successfully progress through manufacturing and clinical development. It encompasses multiple properties: expression titer, thermostability, aggregation propensity, chemical liabilities such as deamidation and isomerization sites, and immunogenicity risk.
A substantial fraction of clinical failures is attributable to developability issues. This is why the claim that AI models achieve "free developability" is so intriguing, and why the mechanism behind it matters.
Our analysis of the FLAb benchmark reveals how training data bias shapes de novo antibody design.
Cohen's d effect size for liability motifs. Structural databases (SAbDab) carry significantly fewer liability motifs per sequence than natural antibodies (OAS). This is the largest effect we observed.
Spearman correlation for ESM-2 expression prediction. The correlation is negative: sequences ESM-2 scores as "unnatural" actually express better, the opposite of what you'd want.
AUROC for immunogenicity classification with multi-feature models (0.74). Humanness-only baselines achieve just 0.28 AUROC, worse than random guessing.
Why might de novo antibody design models produce developable sequences without explicit optimization?
1. Current in silico developability metrics may be insufficiently stringent, allowing most sequences to pass regardless of true developability.
2. Structural databases (PDB/SAbDab) contain an inherent bias toward developable antibodies: poorly behaved sequences rarely get crystallized and deposited.
3. Reported results may reflect post-hoc filtering or framework selection not fully described in publications.
Figure 1: The training data bias hypothesis. Structural databases have inherent selection for well-behaved proteins.
Comparing four antibody sources on key developability metrics.
*Note: De novo sequences are simulated to match reported distributions rather than taken from actual model outputs; this is a limitation of our analysis. Liability motifs include NG, NS, and DG deamidation/isomerization sites plus N-glycosylation sequons (N-X-S/T, where X is any residue except proline).
| Dataset | Hydrophobicity | Net Charge | Liability Motifs | Aromatic % |
|---|---|---|---|---|
| CST (Therapeutics) | -0.604 | 0.94 | 1.36 | 6.9% |
| SAbDab | -0.498 | 2.10 | 2.03 | 8.0% |
| De Novo (Simulated)* | -0.452 | 2.50 | 2.62 | 7.8% |
| OAS (Natural) | -0.327 | 3.14 | 3.96 | 8.8% |
Data: FLAb CST panel (Jain et al. 2017), SAbDab (Oxford), OAS paired sequences. Hydrophobicity: mean Kyte-Doolittle score (lower = better). Net charge computed at pH 7.4; liability motifs reported as mean count per sequence.
Group differences assessed with Mann-Whitney U tests (all comparisons p < 0.001); effect sizes reported as Cohen's d. Interpretation: d = 0.2 small, 0.5 medium, 0.8 large effect.
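As a concrete sketch of this comparison, the snippet below runs a Mann-Whitney U test and computes a pooled-standard-deviation Cohen's d. The Poisson-sampled arrays are placeholders for per-sequence liability-motif counts, not the actual FLAb data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                        / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled_sd

# Stand-in data: liability-motif counts per sequence for two datasets.
rng = np.random.default_rng(0)
sabdab = rng.poisson(2.0, size=500).astype(float)
oas = rng.poisson(4.0, size=500).astype(float)

u_stat, p_value = mannwhitneyu(sabdab, oas, alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.2e}, d = {cohens_d(sabdab, oas):.2f}")
```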
Can ESM-2 embeddings predict antibody fitness without fine-tuning? Short answer: No.
Method: ESM-2 (650M) pseudo-log-likelihood scoring. Data: FLAb benchmark. This confirms findings from the original FLAb paper (Chungyoun et al. 2024).
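A minimal sketch of pseudo-log-likelihood (PLL) scoring with ESM-2 650M via the fair-esm package, assuming the standard mask-one-position-at-a-time formulation; the pipeline's exact batching and scoring details may differ.

```python
import torch
import esm

# Load ESM-2 650M (esm2_t33_650M_UR50D) and its alphabet/tokenizer.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

def pseudo_log_likelihood(seq: str) -> float:
    """Sum of log-probabilities of each true residue with that position masked."""
    _, _, tokens = batch_converter([("ab", seq)])
    pll = 0.0
    for i in range(1, len(seq) + 1):  # position 0 is BOS, last token is EOS
        masked = tokens.clone()
        true_tok = masked[0, i].item()
        masked[0, i] = alphabet.mask_idx
        with torch.no_grad():
            logits = model(masked)["logits"]
        pll += torch.log_softmax(logits[0, i], dim=-1)[true_tok].item()
    return pll

print(pseudo_log_likelihood("EVQLVESGGGLVQPGGSLRLSCAAS"))  # toy VH fragment
```

The expression correlation then comes from comparing these scores against measured titers, e.g. with scipy.stats.spearmanr.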
Predicting which antibodies will trigger immune responses—and the critical data gap.
Data: Immunogenicity data aggregated in FLAb (n=217 antibodies with clinical ADA data). Method: Leave-one-out cross-validation. Features: Sequence properties + humanness scores + ESM-2 embeddings.
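A minimal sketch of the leave-one-out evaluation, assuming a logistic-regression classifier (the model family is an assumption, not stated above): out-of-fold probabilities are pooled into a single AUROC. X and y below are random placeholders for the feature matrix and clinical ADA labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(217, 32))      # placeholder features (n=217 antibodies)
y = rng.integers(0, 2, size=217)    # placeholder ADA labels

# Collect one out-of-fold probability per antibody, then score once.
probs = np.empty(len(y))
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X[train_idx], y[train_idx])
    probs[test_idx] = clf.predict_proba(X[test_idx])[:, 1]

print(f"LOO AUROC: {roc_auc_score(y, probs):.2f}")
```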
How we analyzed the FLAb benchmark data.
FLAb benchmark from Gray Lab (Johns Hopkins). 160 CSV files across thermostability, expression, binding, and aggregation properties. ~4M datapoints total.
ESM-2 (650M parameters) with mean pooling. GPU inference on Modal cloud infrastructure (~$15-20 total compute cost).
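A sketch of the mean-pooling step, reusing the model, alphabet, and batch_converter from the PLL example above. Layer 33 is the final layer of the 650M model; pooling that particular layer is an assumption about the pipeline.

```python
import torch

def embed(seq: str) -> torch.Tensor:
    """Mean-pooled per-residue ESM-2 representation (drops BOS/EOS tokens)."""
    _, _, tokens = batch_converter([("ab", seq)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    reps = out["representations"][33]            # shape (1, L+2, 1280)
    return reps[0, 1 : len(seq) + 1].mean(dim=0)

vec = embed("EVQLVESGGGLVQPGGSLRLSCAAS")
print(vec.shape)  # torch.Size([1280])
```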
Sequence-based metrics: liability motif detection, Kyte-Doolittle hydrophobicity, net charge at pH 7.4, aromatic content (F, W, Y fraction).
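A self-contained sketch of these four metrics under stated assumptions: liability motifs as simple regexes (NG, NS, DG, and the N-X-S/T sequon with X ≠ P), mean Kyte-Doolittle hydrophobicity, Henderson-Hasselbalch net charge at pH 7.4 with textbook side-chain pKa values, and F/W/Y fraction. The exact motif set and pKa table used in the analysis may differ.

```python
import re

KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}
PKA_POS = {"K": 10.5, "R": 12.5, "H": 6.0}            # basic side chains
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.5}   # acidic side chains

def liability_motifs(seq: str) -> int:
    """Count NG/NS/DG sites and N-glycosylation sequons (N-X-S/T, X != P)."""
    return sum(len(re.findall(p, seq)) for p in ("NG", "NS", "DG", "N[^P][ST]"))

def kyte_doolittle(seq: str) -> float:
    """Mean Kyte-Doolittle hydrophobicity."""
    return sum(KD[a] for a in seq) / len(seq)

def net_charge(seq: str, ph: float = 7.4) -> float:
    """Henderson-Hasselbalch net charge, including the two termini."""
    pos = sum(1.0 / (1.0 + 10 ** (ph - PKA_POS[a])) for a in seq if a in PKA_POS)
    neg = sum(1.0 / (1.0 + 10 ** (PKA_NEG[a] - ph)) for a in seq if a in PKA_NEG)
    pos += 1.0 / (1.0 + 10 ** (ph - 9.0))   # N-terminus
    neg += 1.0 / (1.0 + 10 ** (2.3 - ph))   # C-terminus
    return pos - neg

def aromatic_fraction(seq: str) -> float:
    """Fraction of F, W, Y residues."""
    return sum(a in "FWY" for a in seq) / len(seq)

vh = "EVQLVESGGGLVQPGGSLRLSCAAS"  # toy VH fragment
print(liability_motifs(vh), kyte_doolittle(vh), net_charge(vh), aromatic_fraction(vh))
```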
Mann-Whitney U tests for group comparisons. Cohen's d for effect sizes. Leave-one-out CV for immunogenicity models.
@article{flab_developability_2026,
  title={Emergent Developability in De Novo Antibody Design: A Computational Analysis of the FLAb Benchmark},
  author={FLAb Analysis Team},
  journal={bioRxiv},
  year={2026},
  note={Analysis code: github.com/inventcures/flab_gray-lab-jhu_ab_chars}
}