Domain Deep-Dive · Spring 2026

Protein & Antibody Design

19 open problems — from Y2 statistics to graduate-level diffusion theory.
All grounded in real campaigns against cancer targets.

Interested? Email me at my Ashoka address with subject "IML: [Project ID]". I supervise ML/computational aspects; Abhimanyu Pant (BSc Math ’28) co-mentors on the mathematical directions. All levels welcome.

Domain Overview

Computational protein and antibody design is the problem of generating amino acid sequences that fold into structures with desired biochemical functions — binding a specific cancer antigen, avoiding off-target proteins, surviving in the bloodstream. Until 2021, this was the province of expensive wet-lab directed evolution. AlphaFold2 changed structure prediction; by 2026, a new stack has emerged for generative design. Protenix (ByteDance, 2025), a fully open-source reimplementation, surpasses AlphaFold3 on protein-ligand benchmarks. Boltz-2 jointly predicts structure and binding affinity in one forward pass, claiming a 1000× speedup over free-energy perturbation. RFAntibody applies diffusion-based backbone generation specialized for CDR loops. BindCraft won the Adaptyv EGFR competition through hallucination-based design. And Escalante Bio's mosaic — 180 lines of JAX performing gradient-based PSSM optimization — won the Adaptyv NiPaH de novo category with a 90% in vitro hit rate. The field moves fast; the mathematics lags behind.

Undergraduates can contribute now for three reasons. First, all the competitive tools are open-source (Protenix, Boltz-2, RFAntibody, BindCraft, mosaic, ProteinMPNN, RFdiffusion) and run on Colab-grade GPUs. Second, the Adaptyv Bio competition ecosystem provides monthly challenges with actual experimental validation — your design either binds or it doesn't. Third, the mathematical foundations are genuinely immature: RL post-training for generative models has no convergence theory, in silico confidence metrics showed zero correlation with wet-lab binding in the EGFR competition, and no framework exists for jointly optimizing binding and off-target avoidance. These are not toy problems.

This portfolio connects directly to Ashish's ongoing work: the rfab-harness tool has run 4,085 antibody designs across 10 cancer targets (B7-H3, GD2, EGFRvIII, HER2, mesothelin, CEACAM5, CD276, GPC3, CLDN18.2, L1CAM) with pass rates ranging from 0.3% to 19.8% — a 60-fold range that begs statistical explanation. The DADB benchmark systematically compares AI antibody platforms. Active Adaptyv competition entries provide real experimental feedback loops for pipeline projects.

Abhimanyu Pant (BSc Mathematics ’28) co-develops six mathematical directions that map onto specific projects here: D1 (multi-objective specificity optimization) feeds I1 and A1; D2 (metric geometry of CDR spaces) feeds B3, I2, A8; D3 (GRPO/RL theory for diffusion models) feeds B5, B8, I9, I10, A1; D4 (exchangeable arrays for PPI) feeds A2; D5 (Bayesian optimization for campaign allocation) feeds B8, I3; D6 (TDA of binding interfaces) feeds I4.

Ashoka Coursework Connection

| Course | Name | Projects Enabled |
| --- | --- | --- |
| MAT 1200 | Linear Algebra | B4 (pLDDT matrices, PCA), I1 (gradient computation), A8 (spectral theory) |
| MAT 2201 | Probability & Statistics | B3, B5, B8, B12, I3, I7, I9, I10, A1, A2, A7, A9, A13 |
| MAT 2301 | Real Analysis | I1, I7, I9, A1, A2, A7, A8, A9 |
| MAT 2401 | Metric & Topological Spaces | I2, I4 |
| MAT 2101 | Algebra I | I4 (quiver representations), A2, A8 (group representations) |
| MAT 3102 | Algebra II | A8 (Clebsch-Gordan, Schur’s lemma) |
| MAT 3211 | Statistical Inference I | I3 (GP regression), A13 (reward model learning) |
| CS 3410 / MAT 3211 | Introduction to ML | I8, I10, I15, A13 |
| CS 2101 | Data Structures & Algorithms | B3 (Levenshtein), I8, I10, I15 |
| MAT 4xxx | Measure Theory | A1 (variational RL objectives), A2 (Aldous-Hoover theorem) |

Projects tagged Juneja below are compatible with Prof. Juneja’s Intro to ML course (B5, B8, B12, I3, I9, I10). Every project assumes basic Python and comfort with an AI coding assistant.

B · Beginner — Y2, 1 semester
I · Intermediate — Y3, 1–2 semesters
A · Advanced — Y4/Grad, 2 semesters

All 19 Projects

Each entry lists its ID and title, a description, and a closing line of key prereqs and tags.
B3 Antibody Sequence Similarity Search with Edit Distance
Implement Hamming, Levenshtein, and BLOSUM-weighted distance metrics on CDR sequences; build a kNN classifier for binding prediction; partially reproduce the AbDist paper (ROC AUC 0.71–0.88). Lays groundwork for Abhimanyu D2.
Prereqs: DSA, P&S · Tags: B, Juneja
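The core of B3 can be sketched in a few lines — a two-row Levenshtein dynamic program plus a kNN majority vote. The CDR-H3-like sequences and binding labels below are invented purely for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution / match
        prev = curr
    return prev[-1]

def knn_predict(query: str, train: list[tuple[str, int]], k: int = 3) -> int:
    """Majority vote over the k nearest sequences (1 = binder, 0 = non-binder)."""
    nearest = sorted(train, key=lambda sl: levenshtein(query, sl[0]))[:k]
    return int(sum(label for _, label in nearest) > k / 2)

# Invented CDR-H3-like sequences with made-up binding labels
train = [("ARDYYGSSYFDY", 1), ("ARDYYGSSWFDY", 1), ("ARGGTFDY", 0),
         ("ARDLLGSSYFDY", 1), ("ARWWWTFDY", 0)]
print(knn_predict("ARDYYGSSYFDV", train))  # nearest neighbours here are binders
```

Swapping in BLOSUM-weighted substitution costs only changes the `(ca != cb)` term.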
B4 Visualizing Protein Structure Prediction Confidence
Run Protenix on Ashish’s 10 cancer targets; systematically visualize pLDDT, ipTM, and PAE heatmaps; correlate confidence scores with rfab-harness pass rates to build a "confidence atlas."
Prereqs: Linear Algebra · Tags: B
B5 Monte Carlo Estimation of Protein Design Hit Rates
Apply Monte Carlo methods, bootstrap resampling, and importance sampling to estimate confidence intervals for rfab-harness pass rates (0.3%–19.8% across targets). Feeds Abhimanyu D3.
Prereqs: P&S (MAT 2201) · Tags: B, Juneja
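A minimal percentile-bootstrap sketch for B5; the design counts are hypothetical, chosen only to echo the 0.3%–19.8% range quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(n: int, k: int, n_boot: int = 10_000, alpha: float = 0.05):
    """Percentile-bootstrap confidence interval for a binomial pass rate."""
    outcomes = np.zeros(n)
    outcomes[:k] = 1                      # k passes out of n designs
    idx = rng.integers(0, n, size=(n_boot, n))
    rates = outcomes[idx].mean(axis=1)    # resampled pass rates
    return np.quantile(rates, [alpha / 2, 1 - alpha / 2])

# Hypothetical campaign sizes — illustrative only
for target, (n, k) in {"low-rate target": (400, 1),
                       "high-rate target": (400, 79)}.items():
    lo, hi = bootstrap_ci(n, k)
    print(f"{target}: {k/n:.1%} (95% CI {lo:.1%}–{hi:.1%})")
```

The low-rate target's interval is dramatically wider in relative terms — the statistical point the project is built around.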
B8 Sequential Experimental Design for Antibody Campaigns
Model cancer target selection as a 10-armed bandit; implement Thompson sampling, UCB1, and epsilon-greedy; simulate regret over rfab-harness pass-rate data. Gateway to I3 and Abhimanyu D5.
Prereqs: P&S (MAT 2201) · Tags: B, Juneja
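B8's Thompson-sampling loop fits in a few lines; the per-target hit probabilities below are invented stand-ins spanning the observed pass-rate range:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-target hit probabilities (illustrative, spanning 0.3%–19.8%)
true_rates = np.array([0.003, 0.01, 0.02, 0.05, 0.198])
n_arms, horizon = len(true_rates), 5_000
alpha = np.ones(n_arms)                    # Beta(1, 1) priors on each arm
beta = np.ones(n_arms)

pulls = np.zeros(n_arms, dtype=int)
for _ in range(horizon):
    theta = rng.beta(alpha, beta)          # sample a plausible rate per arm
    arm = int(np.argmax(theta))            # play the arm with the best sample
    reward = rng.random() < true_rates[arm]
    alpha[arm] += reward                   # Bernoulli-Beta conjugate update
    beta[arm] += 1 - reward
    pulls[arm] += 1

print(pulls)  # pulls should concentrate on the highest-rate arm
```

UCB1 and epsilon-greedy drop into the same loop by replacing the `theta`/`argmax` step.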
B12 EDA of Antibody Design Campaign Results
Comprehensive exploratory analysis of 4,085 rfab-harness designs: sequence diversity (Shannon entropy), amino acid enrichment, CDR confidence ROC curves, and multi-target comparison. Generates hypotheses for upstream projects.
Prereqs: P&S, CS 1101 · Tags: B, Juneja
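For the entropy piece of B12, a sketch assuming the designed CDRs have already been aligned to equal length (the sequences are toy data, not harness output):

```python
import numpy as np
from collections import Counter

def positional_entropy(seqs):
    """Per-position Shannon entropy (bits) of an aligned batch of sequences."""
    ent = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        p = np.array(list(counts.values()), dtype=float) / len(seqs)
        ent.append(float(-(p * np.log2(p)).sum()))
    return ent

designs = ["ARDYYGSS", "ARDYWGSS", "ARDFYGSS", "ARDYYGTS"]  # toy aligned CDRs
print(positional_entropy(designs))  # 0.0 at conserved positions, >0 where designs vary
```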
I1 Multi-Objective Optimization for Antibody Specificity
Extend mosaic’s loss functional to jointly optimize on-target binding and off-target avoidance (28% of clinical antibodies bind off-targets). Characterize Pareto frontiers, prove scalarization properties. Direct implementation of Abhimanyu D1.
Prereqs: Real Analysis, P&S, Intro ML · Tags: I, Abhimanyu D1
I2 Metric Geometry of CDR Sequence Spaces
Prove theoretical properties of 6+ metrics on CDR sequences (metric axioms, ultrametricity, Bourgain embedding bounds, kernel construction); validate empirically on AB-Bind. Mathematical paper with proofs. Direct implementation of Abhimanyu D2.
Prereqs: Metric Spaces (MAT 2401) · Tags: I, Abhimanyu D2
I3 Bayesian Optimization for Campaign Resource Allocation
Build a GP surrogate over target features; implement Thompson sampling and GP-UCB for adaptive allocation across cancer targets; simulate vs uniform allocation. Direct implementation of Abhimanyu D5.
Prereqs: P&S, Statistical Inference I · Tags: I, Juneja, Abhimanyu D5
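A numpy-only sketch of I3's GP surrogate plus GP-UCB acquisition on a hypothetical one-dimensional target feature; the kernel, lengthscale, and data are illustrative, not the project's prescribed setup:

```python
import numpy as np

def rbf(X1, X2, ls=0.2):
    """Squared-exponential kernel between rows of feature matrices."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-3):
    """Exact GP posterior mean and standard deviation at query points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = np.diag(rbf(Xs, Xs) - Ks.T @ np.linalg.solve(K, Ks))
    return mu, np.sqrt(np.maximum(var, 0.0))

# Hypothetical 1-D "target feature" with observed pass rates at three targets
X = np.array([[0.1], [0.4], [0.9]])
y = np.array([0.02, 0.10, 0.01])
Xs = np.linspace(0, 1, 101)[:, None]
mu, sd = gp_posterior(X, y, Xs)
ucb = mu + 2.0 * sd                    # GP-UCB: mean plus exploration bonus
print(float(Xs[np.argmax(ucb)][0]))    # feature value of the next target to try
```

Thompson sampling replaces the `ucb` line with a draw from the posterior.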
I4 Persistent Homology of Protein Binding Interfaces
Compute Vietoris-Rips persistence diagrams (H0, H1, H2) on Ashish’s 10 cancer target surfaces; correlate topological features (Betti numbers, persistence entropy) with rfab-harness pass rates to predict target "designability." Abhimanyu D6.
Prereqs: Metric Spaces, Algebra I · Tags: I, Abhimanyu D6
I7 Wasserstein Distances Between Antibody Repertoires
Compute optimal transport distances between natural (OAS) and designed antibody distributions; implement Sinkhorn divergence; compare W1 vs MMD for detecting mode collapse in generative models.
Prereqs: Real Analysis, P&S · Tags: I
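The entropic OT cost at the heart of I7 reduces to a short fixed-point iteration (the full Sinkhorn divergence additionally subtracts two debiasing self-transport terms). The Gaussian point clouds below stand in for embedded repertoires; a real run would use embeddings of OAS versus designed sequences:

```python
import numpy as np

def sinkhorn_cost(a, b, C, eps=0.5, iters=300):
    """Entropy-regularized OT: Sinkhorn iterations, returns transport cost <P, C>."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)                 # alternate scaling updates
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]       # approximate optimal coupling
    return float((P * C).sum())

# Gaussian point clouds standing in for two embedded antibody repertoires
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (50, 2))         # "natural" repertoire
Y = rng.normal(1.0, 1.0, (60, 2))         # "designed" repertoire, shifted mean
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared-distance cost
a = np.full(50, 1 / 50)
b = np.full(60, 1 / 60)
cost = sinkhorn_cost(a, b, C)
print(cost)  # grows as the two repertoires drift apart
```

For small `eps` on unnormalized costs, the updates should be moved to log-space to avoid underflow.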
I8 GNN-Based Inverse Folding Analysis
Analyze ProteinMPNN’s expressivity through the Weisfeiler-Leman hierarchy; implement 1-WL and 2-WL tests on protein contact graphs; compare with higher-order GNN alternatives on sequence recovery benchmarks.
Prereqs: DSA, Discrete Math, Intro ML · Tags: I
I9 Information-Theoretic Scoring for Protein Design
Track Shannon entropy dynamics during mosaic’s simplex optimization; identify phase transitions in the annealing schedule; evaluate mutual information as an alternative binding predictor; fit Potts models to design ensembles.
Prereqs: P&S, Real Analysis · Tags: I, Juneja
I10 Monte Carlo Tree Search for Binder Design
Cast sequence design as a tree search (node = partial sequence, action = amino acid assignment, reward = ipTM); implement vanilla MCTS with UCT; benchmark against mosaic’s gradient-based approach on 3 cancer targets.
Prereqs: DSA, P&S, Intro ML · Tags: I, Juneja
I15 Protein Design Competition Entry Pipeline
Build an end-to-end pipeline (RFdiffusion backbone generation → ProteinMPNN sequence design → Protenix validation → multi-stage filtering) and enter an active Adaptyv Bio or BioML Society competition.
Prereqs: Intro ML, DSA · Tags: I, rfab-harness
A1 GRPO Theory for Protein Diffusion Models
Derive GRPO from a KL-regularized RL objective; prove convergence for policy gradient on diffusion models; analyze mode collapse (all-helix artifacts); design entropy-regularized GRPO with diversity guarantees. Full version of Abhimanyu D3.
Prereqs: Measure Theory, Stochastic Processes · Tags: A, Abhimanyu D3
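A hedged starting point for A1, in my own notation (the references name the ingredients but not these formulas): GRPO descends from a KL-regularized reward objective and replaces a learned value baseline with a group-normalized advantage over $G$ samples,

\[
\mathcal{J}(\theta) \;=\; \mathbb{E}_{x \sim \pi_\theta}\big[\, r(x) \,\big] \;-\; \beta\, D_{\mathrm{KL}}\big( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big),
\qquad
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\]

optimized through a PPO-style clipped surrogate

\[
\mathcal{L}(\theta) \;=\; \frac{1}{G}\sum_{i=1}^{G} \min\!\Big( \rho_i(\theta)\,\hat{A}_i,\; \operatorname{clip}\big(\rho_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i \Big),
\qquad
\rho_i(\theta) \;=\; \frac{\pi_\theta(x_i)}{\pi_{\theta_{\mathrm{old}}}(x_i)}.
\]

For diffusion models $\pi_\theta(x)$ is intractable and is typically replaced by a product of per-step denoising densities; what convergence guarantees survive that substitution is exactly the open question A1 targets.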
A2 Exchangeable Arrays for PPI Prediction
Apply the Aldous-Hoover theorem (as used by JURA Bio’s Vista transformer on 209M antibodies) to prove approximation theorems for cross-attention vs matrix factorization architectures; derive sample complexity bounds. Abhimanyu D4.
Prereqs: Measure Theory, P&S, Algebra I · Tags: A, Abhimanyu D4
A7 Diffusion Models on SE(3) for Backbone Generation
Derive convergence rates for the discretized reverse SDE on SO(3) (Wigner D-matrices, Laplace-Beltrami operator); prove optimal noise schedules on Riemannian manifolds; validate on protein backbone generation benchmarks.
Prereqs: Real Analysis, P&S, Differential Geometry · Tags: A
A8 Geometric DL Expressivity for Protein Graphs
Characterize the expressive power of SE(3)-equivariant GNNs (EGNN, MACE, Equiformer) via a geometric WL hierarchy; prove universality/non-universality results using Clebsch-Gordan decomposition; identify which protein properties are provably out of reach.
Prereqs: Algebra I & II, Linear Algebra · Tags: A
A13 RL with Experimental Feedback for Protein Design
Develop theory for RL with expensive ($10–100/datapoint), noisy, budget-constrained rewards; derive budget-optimal acquisition (when to run experiments vs optimize policy); connect to JURA Bio’s BODA active learning framework.
Prereqs: P&S, Intro ML, Statistical Inference I · Tags: A

Key References & Tools

  1. Protenix (ByteDance, 2025) — open-source AlphaFold3 reimplementation, surpasses AF3 on multi-chain benchmarks. github.com/bytedance/Protenix
  2. Boltz-2 (Wohlwend et al., 2026) — joint structure + binding affinity prediction, MIT license. github.com/jwohlwend/boltz
  3. RFdiffusion / RFAntibody (Watson et al., Nature 2023; Baker Lab 2025) — diffusion-based backbone generation. github.com/RosettaCommons/RFdiffusion
  4. BindCraft (Pacesa et al., bioRxiv 2024) — hallucination-based binder design, won Adaptyv EGFR competition. DOI: 10.1101/2024.09.30.615802
  5. mosaic (Escalante Bio, 2026) — 180-line JAX library for gradient-based PSSM optimization; 90% hit rate at Adaptyv NiPaH. github.com/escalante-bio/mosaic
  6. ProteinMPNN (Dauparas et al., Science 2022) — standard inverse folding model. github.com/dauparas/ProteinMPNN
  7. GUDHI — C++/Python library for persistent homology and TDA on point clouds. gudhi.inria.fr
  8. Adaptyv Bio competitions — monthly protein design challenges with experimental validation. adaptyv.bio
  9. AbDist (Hoffstedt et al., mAbs 2026) — simple kNN + Levenshtein distance matches transformer SOTA (ROC AUC 0.71–0.88). DOI: 10.1080/19420862.2026.2644655
  10. Weinstein et al., Nature Biotechnology 2026 (JURA Bio) — MESA dataset (209M antibodies × 100 pHLA targets), Vista transformer, Aldous-Hoover architecture motivation, BODA active learning.
  11. Adaptyv EGFR community paper (bioRxiv 2025.04.17.648362v2) — key negative result: zero correlation between ipTM/iPAE/ESM2 and wet-lab binding.
  12. Escalante Bio blog (Boyd & Guns, March 2026) — "Teaching generative models to hallucinate" — GRPO for BoltzGen, mode collapse, open theoretical questions.

Ready to start?

Email with subject "IML: [Project ID]" — include your year, relevant courses, and one sentence on why the project interests you.

View All 40 Open Problems →

Last updated March 2026. Full project specs available on request.