Protein & Antibody Design
19 open problems — from second-year statistics to graduate-level diffusion theory.
All grounded in real campaigns against cancer targets.
Interested? Email me at my Ashoka address with subject "IML: [Project ID]". I supervise ML/computational aspects; Abhimanyu Pant (BSc Math ’28) co-mentors on the mathematical directions. All levels welcome.
Domain Overview
Computational protein and antibody design is the problem of generating amino acid sequences that fold into structures with desired biochemical functions — binding a specific cancer antigen, avoiding off-target proteins, surviving in the bloodstream. Until 2021, this was the province of expensive wet-lab directed evolution. AlphaFold2 changed structure prediction; by 2026, a new stack has emerged for generative design. Protenix (ByteDance, 2025) surpasses AlphaFold3 on protein-ligand benchmarks as a fully open-source reimplementation. Boltz-2 jointly predicts structure and binding affinity in one forward pass, claiming 1000x speedup over free-energy perturbation. RFAntibody applies diffusion-based backbone generation specialized for CDR loops. BindCraft won the Adaptyv EGFR competition through hallucination-based design. And Escalante Bio's mosaic won the Adaptyv NiPaH de novo category in 180 lines of JAX — a 90% in vitro hit rate — using gradient-based PSSM optimization. The field moves fast; the mathematics lags behind.
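To give a flavor of that last approach, here is a toy sketch of gradient-based PSSM optimization (not mosaic's actual code): the design is a matrix of per-position logits over the 20 amino acids, and we ascend the gradient of a differentiable score. The random matrix `W` is a stand-in for a real structure/affinity model.

```python
import numpy as np

rng = np.random.default_rng(0)
L, A = 8, 20                      # toy loop length x amino-acid alphabet
W = rng.normal(size=(L, A))       # stand-in per-position scores; a real pipeline
                                  # would backprop through a folding/affinity model

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

logits = np.zeros((L, A))         # start from a uniform PSSM
lr = 0.5
for _ in range(200):
    P = softmax(logits)                               # current PSSM
    # analytic gradient of the linear score sum(P * W) through the softmax
    grad = P * (W - (P * W).sum(axis=1, keepdims=True))
    logits += lr * grad                               # gradient ascent

seq = softmax(logits).argmax(axis=1)                  # decode the argmax sequence
```

With a linear score each position decouples, so the optimizer concentrates each row of the PSSM on its best residue; the interesting behavior in real campaigns comes from the non-linear, coupled score model.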
Undergraduates can contribute now for three reasons. First, all the competitive tools are open-source (Protenix, Boltz-2, RFAntibody, BindCraft, mosaic, ProteinMPNN, RFdiffusion) and run on Colab-grade GPUs. Second, the Adaptyv Bio competition ecosystem provides monthly challenges with actual experimental validation — your design either binds or it doesn't. Third, the mathematical foundations are genuinely immature: RL posttraining for generative models has no convergence theory, in silico confidence metrics showed zero correlation with wet-lab binding in the EGFR competition, and no framework exists for jointly optimizing binding and off-target avoidance. These are not toy problems.
This portfolio connects directly to Ashish's ongoing work: the rfab-harness tool has run 4,085 antibody designs across 10 cancer targets (B7-H3, GD2, EGFRvIII, HER2, mesothelin, CEACAM5, CD276, GPC3, CLDN18.2, L1CAM) with pass rates ranging from 0.3% to 19.8% — a 66-fold range that demands statistical explanation. The DADB benchmark systematically compares AI antibody platforms. Active Adaptyv competition entries provide real experimental feedback loops for pipeline projects.
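At pass rates this low, point estimates are noisy and need uncertainty bars; a percentile bootstrap (the core of project B5) gives them in a few lines. The per-target design counts are not stated here, so the "2 hits in 400 designs" below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(n_designs, n_hits, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap confidence interval for a campaign pass rate."""
    outcomes = np.zeros(n_designs)
    outcomes[:n_hits] = 1.0
    idx = rng.integers(0, n_designs, size=(n_boot, n_designs))  # resample with replacement
    rates = outcomes[idx].mean(axis=1)                          # pass rate per resample
    return np.quantile(rates, [alpha / 2, 1 - alpha / 2])

# Illustrative numbers only: 2 hits in 400 designs, a 0.5% pass rate.
lo, hi = bootstrap_ci(400, 2)
print(f"95% CI for a 0.5% pass rate: [{lo:.2%}, {hi:.2%}]")
```

The interval is badly asymmetric (its lower end sits at 0%), which is exactly why comparing a 0.3% target to a 19.8% one needs more than raw rates.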
Abhimanyu Pant (BSc Mathematics ’28) co-develops six mathematical directions that map onto specific projects here: D1 (multi-objective specificity optimization) feeds I1 and A1; D2 (metric geometry of CDR spaces) feeds B3, I2, A8; D3 (GRPO/RL theory for diffusion models) feeds B5, B8, I9, I10, A1; D4 (exchangeable arrays for PPI) feeds A2; D5 (Bayesian optimization for campaign allocation) feeds B8, I3; D6 (TDA of binding interfaces) feeds I4.
Ashoka Coursework Connection
| Course | Name | Projects Enabled |
|---|---|---|
| MAT 1200 | Linear Algebra | B4 (pLDDT matrices, PCA), I1 (gradient computation), A8 (spectral theory) |
| MAT 2201 | Probability & Statistics | B3, B5, B8, B12, I3, I7, I9, I10, A1, A2, A7, A9, A13 |
| MAT 2301 | Real Analysis | I1, I7, I9, A1, A2, A7, A8, A9 |
| MAT 2401 | Metric & Topological Spaces | I2, I4 |
| MAT 2101 | Algebra I | I4 (quiver representations), A2, A8 (group representations) |
| MAT 3102 | Algebra II | A8 (Clebsch-Gordan, Schur’s lemma) |
| MAT 3211 | Statistical Inference I | I3 (GP regression), A13 (reward model learning) |
| CS 3410 / MAT 3211 | Introduction to ML | I8, I10, I15, A13 |
| CS 2101 | Data Structures & Algorithms | B3 (Levenshtein), I8, I10, I15 |
| MAT 4xxx | Measure Theory | A1 (variational RL objectives), A2 (Aldous-Hoover theorem) |
Projects tagged Juneja below are compatible with Prof. Juneja’s Intro to ML course (B3, B5, B8, B12, I3, I9, I10). Every project assumes basic Python and comfort with an AI coding assistant.
All 19 Projects
| ID | Title & Description | Key Prereqs | Tags |
|---|---|---|---|
| B3 | Antibody Sequence Similarity Search with Edit Distance Implement Hamming, Levenshtein, and BLOSUM-weighted distance metrics on CDR sequences; build a kNN classifier for binding prediction; partially reproduce the AbDist paper (ROC AUC 0.71–0.88). Lays groundwork for Abhimanyu D2. | DSA, P&S | B Juneja |
| B4 | Visualizing Protein Structure Prediction Confidence Run Protenix on Ashish’s 10 cancer targets; systematically visualize pLDDT, iPTM, and PAE heatmaps; correlate confidence scores with rfab-harness pass rates to build a "confidence atlas." | Linear Algebra | B |
| B5 | Monte Carlo Estimation of Protein Design Hit Rates Apply Monte Carlo methods, bootstrap resampling, and importance sampling to estimate confidence intervals for rfab-harness pass rates (0.3%–19.8% across targets). Feeds Abhimanyu D3. | P&S (MAT 2201) | B Juneja |
| B8 | Sequential Experimental Design for Antibody Campaigns Model cancer target selection as a 10-armed bandit; implement Thompson sampling, UCB1, and epsilon-greedy; simulate regret over rfab-harness pass-rate data. Gateway to I3 and Abhimanyu D5. | P&S (MAT 2201) | B Juneja |
| B12 | EDA of Antibody Design Campaign Results Comprehensive exploratory analysis of 4,085 rfab-harness designs: sequence diversity (Shannon entropy), amino acid enrichment, CDR confidence ROC curves, and multi-target comparison. Generates hypotheses for upstream projects. | P&S, CS 1101 | B Juneja |
| I1 | Multi-Objective Optimization for Antibody Specificity Extend mosaic’s loss functional to jointly optimize on-target binding and off-target avoidance (28% of clinical antibodies bind off-targets). Characterize Pareto frontiers, prove scalarization properties. Direct implementation of Abhimanyu D1. | Real Analysis, P&S, Intro ML | I Abhimanyu D1 |
| I2 | Metric Geometry of CDR Sequence Spaces Prove theoretical properties of 6+ metrics on CDR sequences (metric axioms, ultrametricity, Bourgain embedding bounds, kernel construction); validate empirically on AB-Bind. Mathematical paper with proofs. Direct implementation of Abhimanyu D2. | Metric Spaces (MAT 2401) | I Abhimanyu D2 |
| I3 | Bayesian Optimization for Campaign Resource Allocation Build a GP surrogate over target features; implement Thompson sampling and GP-UCB for adaptive allocation across cancer targets; simulate vs uniform allocation. Direct implementation of Abhimanyu D5. | P&S, Statistical Inference I | I Juneja Abhimanyu D5 |
| I4 | Persistent Homology of Protein Binding Interfaces Compute Vietoris-Rips persistence diagrams (H0, H1, H2) on Ashish’s 10 cancer target surfaces; correlate topological features (Betti numbers, persistence entropy) with rfab-harness pass rates to predict target "designability." Abhimanyu D6. | Metric Spaces, Algebra I | I Abhimanyu D6 |
| I7 | Wasserstein Distances Between Antibody Repertoires Compute optimal transport distances between natural (OAS) and designed antibody distributions; implement Sinkhorn divergence; compare W1 vs MMD for detecting mode collapse in generative models. | Real Analysis, P&S | I |
| I8 | GNN-Based Inverse Folding Analysis Analyze ProteinMPNN’s expressivity through the Weisfeiler-Leman hierarchy; implement 1-WL and 2-WL tests on protein contact graphs; compare with higher-order GNN alternatives on sequence recovery benchmarks. | DSA, Discrete Math, Intro ML | I |
| I9 | Information-Theoretic Scoring for Protein Design Track Shannon entropy dynamics during mosaic’s simplex optimization; identify phase transitions in the annealing schedule; evaluate mutual information as an alternative binding predictor; fit Potts models to design ensembles. | P&S, Real Analysis | I Juneja |
| I10 | Monte Carlo Tree Search for Binder Design Cast sequence design as a tree search (node = partial sequence, action = amino acid assignment, reward = ipTM); implement vanilla MCTS with UCT; benchmark against mosaic’s gradient-based approach on 3 cancer targets. | DSA, P&S, Intro ML | I Juneja |
| I15 | Protein Design Competition Entry Pipeline Build an end-to-end pipeline (RFdiffusion backbone generation → ProteinMPNN sequence design → Protenix validation → multi-stage filtering) and enter an active Adaptyv Bio or BioML Society competition. | Intro ML, DSA | I rfab-harness |
| A1 | GRPO Theory for Protein Diffusion Models Derive GRPO from a KL-regularized RL objective; prove convergence for policy gradient on diffusion models; analyze mode collapse (all-helix artifacts); design entropy-regularized GRPO with diversity guarantees. Full version of Abhimanyu D3. | Measure Theory, Stochastic Processes | A Abhimanyu D3 |
| A2 | Exchangeable Arrays for PPI Prediction Apply the Aldous-Hoover theorem (as used by JURA Bio’s Vista transformer on 209M antibodies) to prove approximation theorems for cross-attention vs matrix factorization architectures; derive sample complexity bounds. Abhimanyu D4. | Measure Theory, P&S, Algebra I | A Abhimanyu D4 |
| A7 | Diffusion Models on SE(3) for Backbone Generation Derive convergence rates for the discretized reverse SDE on SO(3) (Wigner D-matrices, Laplace-Beltrami operator); prove optimal noise schedules on Riemannian manifolds; validate on protein backbone generation benchmarks. | Real Analysis, P&S, Differential Geometry | A |
| A8 | Geometric DL Expressivity for Protein Graphs Characterize the expressive power of SE(3)-equivariant GNNs (EGNN, MACE, Equiformer) via a geometric WL hierarchy; prove universality/non-universality results using Clebsch-Gordan decomposition; identify which protein properties are provably out of reach. | Algebra I & II, Linear Algebra | A |
| A13 | RL with Experimental Feedback for Protein Design Develop theory for RL with expensive ($10–100/datapoint), noisy, budget-constrained rewards; derive budget-optimal acquisition (when to run experiments vs optimize policy); connect to JURA Bio’s BODA active learning framework. | P&S, Intro ML, Statistical Inference I | A |
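The bandit framing shared by B8 and I3 is easy to prototype. The sketch below runs Thompson sampling with Beta posteriors over 10 targets; the per-target hit rates are hypothetical stand-ins spanning the quoted 0.3%–19.8% range, not the actual rfab-harness values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-target hit rates spanning the quoted 0.3%-19.8% range.
true_rates = np.linspace(0.003, 0.198, 10)

alpha = np.ones(10)   # Beta(1, 1) prior on each target's pass rate
beta = np.ones(10)
for t in range(2000):
    arm = int(np.argmax(rng.beta(alpha, beta)))   # sample each posterior, design against the best
    hit = rng.random() < true_rates[arm]          # simulate one design's wet-lab outcome
    alpha[arm] += hit                             # conjugate Beta update
    beta[arm] += 1 - hit

pulls = alpha + beta - 2.0                        # designs allocated per target
print("most-sampled target:", int(np.argmax(pulls)))
```

After a few thousand simulated designs the allocation concentrates on the highest-rate targets, which is the regret-vs-uniform comparison B8 asks for; I3 replaces the independent Beta posteriors with a GP over target features.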
Key References & Tools
- Protenix (ByteDance, 2025) — open-source AlphaFold3 reimplementation, surpasses AF3 on multi-chain benchmarks. github.com/bytedance/Protenix
- Boltz-2 (Wohlwend et al., 2026) — joint structure + binding affinity prediction, MIT license. github.com/jwohlwend/boltz
- RFdiffusion / RFAntibody (Watson et al., Nature 2023; Baker Lab 2025) — diffusion-based backbone generation. github.com/RosettaCommons/RFdiffusion
- BindCraft (Pacesa et al., bioRxiv 2024) — hallucination-based binder design, won Adaptyv EGFR competition. DOI: 10.1101/2024.09.30.615802
- mosaic (Escalante Bio, 2026) — 180-line JAX library for gradient-based PSSM optimization; 90% hit rate at Adaptyv NiPaH. github.com/escalante-bio/mosaic
- ProteinMPNN (Dauparas et al., Science 2022) — standard inverse folding model. github.com/dauparas/ProteinMPNN
- GUDHI — C++/Python library for persistent homology and TDA on point clouds. gudhi.inria.fr
- Adaptyv Bio competitions — monthly protein design challenges with experimental validation. adaptyv.bio
- AbDist (Hoffstedt et al., mAbs 2026) — simple kNN + Levenshtein distance matches transformer SOTA (ROC AUC 0.71–0.88). DOI: 10.1080/19420862.2026.2644655
- Weinstein et al., Nature Biotechnology 2026 (JURA Bio) — MESA dataset (209M antibodies × 100 pHLA targets), Vista transformer, Aldous-Hoover architecture motivation, BODA active learning.
- Adaptyv EGFR community paper (bioRxiv 2025.04.17.648362v2) — key negative result: zero correlation between ipTM/iPAE/ESM2 and wet-lab binding.
- Escalante Bio blog (Boyd & Guns, March 2026) — "Teaching generative models to hallucinate" — GRPO for BoltzGen, mode collapse, open theoretical questions.
Ready to start?
Email with subject "IML: [Project ID]" — include your year, relevant courses, and one sentence on why the project interests you.
View All 40 Open Problems →

Last updated March 2026. Full project specs available on request.