What sparked this analysis: Adil Yusuf's thought-provoking post asking "Does developability come for free?", Michael Chungyoun's announcement that FLAb is now available on AWS (500K datapoints) and GitHub (4M+ datapoints), and my ongoing exploration of the AI antibody design landscape.

Does Developability Come For Free?

Testing whether de novo antibody design models learn therapeutic-like properties without explicit training, using the FLAb benchmark.

📄 Read the scientific paper: Emergent Developability in De Novo Antibody Design (Feb 2026).

TL;DR: The Bottom Line

🎯 Yes, but it's not magic. De novo models produce antibodies with reasonable developability because their training data (SAbDab/PDB) is already biased toward well-behaved proteins.

⚠️ Not therapeutic-quality. De novo designs are better than natural antibodies but still don't match the optimization level of approved therapeutics. Traditional developability engineering remains valuable.

🤖 Zero-shot PLMs fail. ESM-2 without fine-tuning shows weak or negative correlations with experimental properties. Don't use it for fitness prediction out of the box.

🔬 Humanness ≠ low immunogenicity. Humanness alone predicts immunogenicity worse than random (AUROC 0.28). Multi-feature models achieve 0.74 AUROC, but the dataset is tiny (n=217).

What is "Developability" and why does it matter?

Developability refers to the likelihood that an antibody candidate will successfully progress through manufacturing and clinical development. It encompasses multiple properties:

  • Expression — Can it be produced at high yield in cell culture?
  • Stability — Does it remain folded and functional over time?
  • Aggregation — Does it clump together (bad for injection)?
  • Immunogenicity — Will patients develop antibodies against it?

A substantial fraction of clinical failures are attributable to developability issues. This is why the claim that AI models achieve "free developability" is so intriguing—and why we need to understand the mechanism.

Key Findings

Our analysis of the FLAb benchmark reveals how training data bias shapes de novo antibody design.

Training Data Bias Confirmed: d = -1.19

Cohen's d effect size for liability motifs. Structural databases (SAbDab) have significantly fewer problematic sequences than natural antibodies (OAS). This is the largest effect we observed.

Zero-Shot PLMs Fail: ρ = -0.27

Spearman correlation for ESM-2 expression prediction. The negative correlation means sequences ESM-2 calls "unnatural" actually express better, the opposite of what you'd want.

Immunogenicity Prediction: AUROC 0.74

AUROC for immunogenicity classification with multi-feature models. Humanness-only baselines achieve just 0.28 AUROC, worse than random guessing.

Three Hypotheses for "Free Developability"

Why might de novo antibody design models produce developable sequences without explicit optimization?

1. Weak Metrics Hypothesis (untested)

Current in silico developability metrics may be insufficiently stringent, allowing most sequences to pass regardless of true developability.

2. Training Data Bias Hypothesis (tested here)

Structural databases (PDB/SAbDab) contain an inherent bias toward developable antibodies: poorly behaved sequences rarely get crystallized and deposited.

3. Undisclosed Filtering Hypothesis (untested)

Reported results may reflect post-hoc filtering or framework selection not fully described in publications.

[Figure: SAbDab/PDB curated structures → train the de novo model, which learns their distribution → generated designs. Therapeutic-like? The developability bias is inherited through the training data.]

Figure 1: The training data bias hypothesis. Structural databases have inherent selection for well-behaved proteins.

Key insight: Antibodies in SAbDab have significantly better developability profiles than natural repertoires (OAS). De novo models trained on structural data inherit this bias—developability appears "for free" but is actually encoded in the training data.

Developability Profile Comparison

Comparing four antibody sources on key developability metrics.

Liability Motifs by Dataset
Fewer liability motifs = better developability. CST therapeutics set the gold standard. Mean liability motifs per sequence:

  • CST (clinical-stage therapeutics, n=137): 1.36
  • SAbDab (structural database, n=500): 2.03
  • De Novo* (simulated designs, n=500): 2.62
  • OAS (natural repertoire, n=500): 3.96

*Note: De novo sequences are simulated to match reported distributions, not actual model outputs. This is a limitation of our analysis. Liability motifs include NG, NS, DG deamidation/isomerization sites and N-glycosylation sequons.
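
Counting these motifs reduces to a regex scan over the antibody sequence. Below is a minimal sketch assuming the motif set named in the note above; the exact pattern list used in the analysis may differ.

```python
import re

# Motif set assembled from the note above (NG/NS deamidation, DG
# isomerization, N-X-S/T glycosylation sequon). This is an assumed
# list for illustration; the analysis may use a different set.
LIABILITY_PATTERNS = {
    "deamidation_NG": "NG",
    "deamidation_NS": "NS",
    "isomerization_DG": "DG",
    "n_glycosylation": "N[^P][ST]",  # N-X-S/T sequon, X != proline
}

def count_liability_motifs(seq: str) -> int:
    """Count liability motif hits in an amino-acid sequence (overlaps included)."""
    seq = seq.upper()
    # Zero-width lookahead so overlapping motifs all count.
    return sum(
        len(re.findall(f"(?={pattern})", seq))
        for pattern in LIABILITY_PATTERNS.values()
    )

print(count_liability_motifs("EVQLVESGGGLVQPGGSLRLSCAASNGTNSW"))  # example sequence
```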

Complete Developability Profile
Four key metrics across antibody sources, ordered from most to least favorable.

  Dataset                 Hydrophobicity   Net Charge   Liability Motifs   Aromatic %
  CST (Therapeutics)      -0.604           0.94         1.36               6.9%
  SAbDab                  -0.498           2.10         2.03               8.0%
  De Novo (Simulated)*    -0.452           2.50         2.62               7.8%
  OAS (Natural)           -0.327           3.14         3.96               8.8%
Interpretation: De novo designs show profiles closer to SAbDab than OAS across all metrics—supporting the training data bias hypothesis. However, they don't match CST-level optimization, suggesting explicit developability engineering remains valuable.

Data: FLAb CST panel (Jain et al. 2017), SAbDab (Oxford), OAS paired sequences. *De novo sequences are simulated, not actual model outputs. Hydrophobicity: Kyte-Doolittle scale (lower = better). Net charge at pH 7.4.

Effect Sizes: SAbDab vs. OAS
Cohen's d comparing structural database to natural repertoire. Negative values favor SAbDab (better developability).
  • Liability motifs: d = -1.19
  • Hydrophobicity: d = -0.43
  • Aromatic content: d = -0.31
  • Net charge: d = -0.25

Effect sizes are Cohen's d; group differences were assessed with Mann-Whitney U tests (all p < 0.001). Rule of thumb: |d| = 0.2 small, 0.5 medium, 0.8 large.
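
For readers who want to reproduce this style of comparison, here is a minimal sketch using NumPy and SciPy. The two groups below are simulated stand-ins, not the actual FLAb motif counts.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_sd = np.sqrt(
        ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
        / (len(a) + len(b) - 2)
    )
    return (a.mean() - b.mean()) / pooled_sd

# Illustrative stand-ins for per-sequence liability-motif counts.
rng = np.random.default_rng(0)
sabdab = rng.poisson(2.0, 500)
oas = rng.poisson(4.0, 500)

d = cohens_d(sabdab, oas)  # negative: SAbDab has fewer motifs
_, p = mannwhitneyu(sabdab, oas, alternative="two-sided")
print(f"d = {d:.2f}, p = {p:.1e}")
```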

Zero-Shot PLM Performance

Can ESM-2 embeddings predict antibody fitness without fine-tuning? Short answer: No.

ESM-2 Correlation with Experimental Properties
Spearman rank correlation (ρ) for n=300 sequences per property. Values near zero indicate no predictive power.
  • Expression: ρ = -0.27
  • Binding: ρ = +0.18
  • Thermostability: ρ = -0.14
  • Aggregation: ρ = -0.09
Key finding: Zero-shot ESM-2 shows weak or negative correlations with most fitness properties. Expression shows a moderate negative correlation (ρ = -0.27), meaning sequences ESM-2 considers "unnatural" may actually express better—possibly because therapeutic antibodies are engineered away from germline.

Method: ESM-2 (650M) pseudo-log-likelihood scoring. Data: FLAb benchmark. This confirms findings from the original FLAb paper (Chungyoun et al. 2024).
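
A minimal sketch of pseudo-log-likelihood scoring, assuming the HuggingFace port of ESM-2 (the original analysis may have used the fair-esm package instead; the sequence shown is an arbitrary example): mask each position in turn, run a forward pass, and sum the log-probability of the true residue.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_ID = "facebook/esm2_t33_650M_UR50D"  # ESM-2 650M checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def pseudo_log_likelihood(seq: str) -> float:
    """Mask each residue in turn; sum log-probabilities of the true tokens."""
    input_ids = tokenizer(seq, return_tensors="pt")["input_ids"]
    total = 0.0
    # Skip the <cls> and <eos> special tokens at the ends.
    for i in range(1, input_ids.shape[1] - 1):
        masked = input_ids.clone()
        masked[0, i] = tokenizer.mask_token_id
        log_probs = model(masked).logits[0, i].log_softmax(dim=-1)
        total += log_probs[input_ids[0, i]].item()
    return total

print(f"PLL = {pseudo_log_likelihood('EVQLVESGGGLVQPGGSLRLSCAAS'):.1f}")
```

Note this costs one forward pass per residue, so batching is needed at scale; scores across many sequences can then be correlated with measured properties via scipy.stats.spearmanr.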

Immunogenicity Prediction

Predicting which antibodies will trigger immune responses—and the critical data gap.

Model Performance for Immunogenicity Classification
AUROC for binary immunogenic vs. non-immunogenic classification (n=217 antibodies with clinical ADA data).
  • Logistic Regression: 0.737
  • Gradient Boosting: 0.727
  • Random Forest: 0.723
  • Humanness Only: 0.284
  • (Random baseline: 0.50)
Critical insight: Humanness alone (AUROC 0.284) is worse than random chance at predicting immunogenicity. This challenges the common assumption that "more human = less immunogenic." Multi-feature models achieve 0.74 AUROC, but the dataset is tiny (n=217).

Data: Immunogenicity data aggregated in FLAb (n=217 antibodies with clinical ADA data). Method: Leave-one-out cross-validation. Features: Sequence properties + humanness scores + ESM-2 embeddings.
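
A minimal sketch of the leave-one-out evaluation described above, using scikit-learn; the feature matrix and ADA labels here are random placeholders, not the FLAb data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder features (sequence properties + humanness + embeddings)
# and binary ADA labels; shapes mirror the n=217 dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(217, 32))
y = rng.integers(0, 2, size=217)

probs = np.empty(len(y))
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X[train_idx], y[train_idx])
    probs[test_idx] = clf.predict_proba(X[test_idx])[:, 1]

print(f"LOO AUROC = {roc_auc_score(y, probs):.3f}")  # ~0.5 on random data
```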

Important Limitations

  • Simulated de novo data: We used sequences sampled from reported distributions, not actual de novo design outputs. This is a proxy, not direct evidence.
  • Sequence-only features: Our analysis was limited to sequence-based metrics. Structure-based features may provide additional predictive value.
  • Small immunogenicity dataset: Only 217 samples—too small for reliable deep learning and may not generalize to new antibody formats.
  • No experimental validation: We did not experimentally validate predictions. Feature comparisons do not guarantee actual developability outcomes.
  • Exploratory statistics: No correction for multiple comparisons was applied. Results should be treated as hypothesis-generating.

Methods

How we analyzed the FLAb benchmark data.

Data Source

FLAb benchmark from Gray Lab (Johns Hopkins). 160 CSV files across thermostability, expression, binding, and aggregation properties. ~4M datapoints total.

Embeddings

ESM-2 (650M parameters) with mean pooling. GPU inference on Modal cloud infrastructure (~$15-20 total compute cost).
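
A minimal sketch of mean-pooled ESM-2 embeddings, again assuming the HuggingFace checkpoint; special tokens are dropped before pooling.

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def embed(seq: str) -> torch.Tensor:
    """Mean-pool the final hidden states over residue positions."""
    enc = tokenizer(seq, return_tensors="pt")
    hidden = model(**enc).last_hidden_state[0]  # (tokens, 1280)
    return hidden[1:-1].mean(dim=0)             # drop <cls>/<eos>, then pool

vec = embed("EVQLVESGGGLVQPGGSLRLSCAAS")  # arbitrary example sequence
print(vec.shape)  # torch.Size([1280])
```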

Developability

Sequence-based metrics: liability motif detection, Kyte-Doolittle hydrophobicity, net charge at pH 7.4, aromatic content (F, W, Y fraction).
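
Three of these four metrics are available off the shelf in Biopython's ProteinAnalysis; a minimal sketch is below (liability-motif counting is covered by the regex sketch earlier in the post). Note that Biopython's built-in pKa set may differ from the one used in the analysis.

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def developability_metrics(seq: str) -> dict:
    """Sequence-only metrics matching those listed above."""
    pa = ProteinAnalysis(seq)
    return {
        "hydrophobicity_kd": pa.gravy(),          # Kyte-Doolittle GRAVY
        "net_charge_ph7_4": pa.charge_at_pH(7.4), # net charge at pH 7.4
        "aromatic_fraction": pa.aromaticity(),    # F + W + Y fraction
    }

print(developability_metrics("EVQLVESGGGLVQPGGSLRLSCAAS"))
```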

Statistics

Mann-Whitney U tests for group comparisons. Cohen's d for effect sizes. Leave-one-out CV for immunogenicity models.

Citation

@article{flab_developability_2026,
  title={Emergent Developability in De Novo Antibody Design: 
         A Computational Analysis of the FLAb Benchmark},
  author={FLAb Analysis Team},
  journal={bioRxiv},
  year={2026},
  note={Analysis code: github.com/inventcures/flab_gray-lab-jhu_ab_chars}
}