Protein & Antibody Design
19 open problems — from second-year statistics to graduate-level diffusion theory.
All grounded in real campaigns against cancer targets.
Interested? Email me at my Ashoka address with subject "IML: [Project ID]". I supervise ML/computational aspects; Abhimanyu Pant (BSc Math ’28) co-mentors on the mathematical directions. All levels welcome.
Domain Overview
Computational protein and antibody design is the problem of generating amino acid sequences that fold into structures with desired biochemical functions — binding a specific cancer antigen, avoiding off-target proteins, surviving in the bloodstream. Until 2021, this was the province of expensive wet-lab directed evolution. AlphaFold2 changed structure prediction; by 2026, a new stack has emerged for generative design. Protenix (ByteDance, 2025) surpasses AlphaFold3 on protein-ligand benchmarks as a fully open-source reimplementation. Boltz-2 jointly predicts structure and binding affinity in one forward pass, claiming 1000x speedup over free-energy perturbation. RFAntibody applies diffusion-based backbone generation specialized for CDR loops. BindCraft won the Adaptyv EGFR competition through hallucination-based design. And Escalante Bio's mosaic won the Adaptyv NiPaH de novo category in 180 lines of JAX — a 90% in vitro hit rate — using gradient-based PSSM optimization. The field moves fast; the mathematics lags behind.
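To give a flavor of that last approach, here is a toy sketch of gradient-based PSSM optimization (not mosaic's actual code): the design is a matrix of per-position logits over the 20 amino acids, and we ascend the gradient of a differentiable score. The random matrix `W` is a stand-in for a real structure/affinity model.

```python
import numpy as np

rng = np.random.default_rng(0)
L, A = 8, 20                      # toy loop length x amino-acid alphabet
W = rng.normal(size=(L, A))       # stand-in per-position scores; a real pipeline
                                  # would backprop through a folding/affinity model

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

logits = np.zeros((L, A))         # start from a uniform PSSM
lr = 0.5
for _ in range(200):
    P = softmax(logits)                               # current PSSM
    # analytic gradient of the linear score sum(P * W) through the softmax
    grad = P * (W - (P * W).sum(axis=1, keepdims=True))
    logits += lr * grad                               # gradient ascent

seq = softmax(logits).argmax(axis=1)                  # decode the argmax sequence
```

With a linear score each position decouples, so the optimizer concentrates each row of the PSSM on its best residue; the interesting behavior in real campaigns comes from the non-linear, coupled score model.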
Undergraduates can contribute now for three reasons. First, all the competitive tools are open-source (Protenix, Boltz-2, RFAntibody, BindCraft, mosaic, ProteinMPNN, RFdiffusion) and run on Colab-grade GPUs. Second, the Adaptyv Bio competition ecosystem provides monthly challenges with actual experimental validation — your design either binds or it doesn't. Third, the mathematical foundations are genuinely immature: RL posttraining for generative models has no convergence theory, in silico confidence metrics showed zero correlation with wet-lab binding in the EGFR competition, and no framework exists for jointly optimizing binding and off-target avoidance. These are not toy problems.
This portfolio connects directly to Ashish's ongoing work: the rfab-harness tool has run 4,085 antibody designs across 10 cancer targets (B7-H3, GD2, EGFRvIII, HER2, mesothelin, CEACAM5, CD276, GPC3, CLDN18.2, L1CAM) with pass rates ranging from 0.3% to 19.8% — a 66-fold range that demands statistical explanation. The DADB benchmark systematically compares AI antibody platforms. Active Adaptyv competition entries provide real experimental feedback loops for pipeline projects.
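At pass rates this low, point estimates are noisy and need uncertainty bars; a percentile bootstrap (the core of project B5) gives them in a few lines. The per-target design counts are not stated here, so the "2 hits in 400 designs" below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(n_designs, n_hits, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap confidence interval for a campaign pass rate."""
    outcomes = np.zeros(n_designs)
    outcomes[:n_hits] = 1.0
    idx = rng.integers(0, n_designs, size=(n_boot, n_designs))  # resample with replacement
    rates = outcomes[idx].mean(axis=1)                          # pass rate per resample
    return np.quantile(rates, [alpha / 2, 1 - alpha / 2])

# Illustrative numbers only: 2 hits in 400 designs, a 0.5% pass rate.
lo, hi = bootstrap_ci(400, 2)
print(f"95% CI for a 0.5% pass rate: [{lo:.2%}, {hi:.2%}]")
```

The interval is badly asymmetric (its lower end sits at 0%), which is exactly why comparing a 0.3% target to a 19.8% one needs more than raw rates.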
Abhimanyu Pant (BSc Mathematics ’28) co-develops six mathematical directions that map onto specific projects here: D1 (multi-objective specificity optimization) feeds I1 and A1; D2 (metric geometry of CDR spaces) feeds B3, I2, A8; D3 (GRPO/RL theory for diffusion models) feeds B5, B8, I9, I10, A1; D4 (exchangeable arrays for PPI) feeds A2; D5 (Bayesian optimization for campaign allocation) feeds B8, I3; D6 (TDA of binding interfaces) feeds I4.
Ashoka Coursework Connection
| Course | Name | Projects Enabled |
|---|---|---|
| MAT 1200 | Linear Algebra | B4 (pLDDT matrices, PCA), I1 (gradient computation), A8 (spectral theory) |
| MAT 2201 | Probability & Statistics | B3, B5, B8, B12, I3, I7, I9, I10, A1, A2, A7, A9, A13 |
| MAT 2301 | Real Analysis | I1, I7, I9, A1, A2, A7, A8, A9 |
| MAT 2401 | Metric & Topological Spaces | I2, I4 |
| MAT 2101 | Algebra I | I4 (quiver representations), A2, A8 (group representations) |
| MAT 3102 | Algebra II | A8 (Clebsch-Gordan, Schur’s lemma) |
| MAT 3211 | Statistical Inference I | I3 (GP regression), A13 (reward model learning) |
| CS 3410 / MAT 3211 | Introduction to ML | I8, I10, I15, A13 |
| CS 2101 | Data Structures & Algorithms | B3 (Levenshtein), I8, I10, I15 |
| MAT 4xxx | Measure Theory | A1 (variational RL objectives), A2 (Aldous-Hoover theorem) |
Projects tagged Juneja below are compatible with Prof. Juneja’s Intro to ML course (B3, B5, B8, B12, I3, I9, I10). Every project assumes basic Python and comfort with an AI coding assistant.
All 19 Projects
| ID | Title & Description | Key Prereqs | Tags |
|---|---|---|---|
| B3 | Antibody Sequence Similarity Search with Edit Distance Implement Hamming, Levenshtein, and BLOSUM-weighted distance metrics on CDR sequences; build a kNN classifier for binding prediction; partially reproduce the AbDist paper (ROC AUC 0.71–0.88). Lays groundwork for Abhimanyu D2. | DSA, P&S | B Juneja |
| B4 | Visualizing Protein Structure Prediction Confidence Run Protenix on Ashish’s 10 cancer targets; systematically visualize pLDDT, iPTM, and PAE heatmaps; correlate confidence scores with rfab-harness pass rates to build a "confidence atlas." | Linear Algebra | B |
| B5 | Monte Carlo Estimation of Protein Design Hit Rates Apply Monte Carlo methods, bootstrap resampling, and importance sampling to estimate confidence intervals for rfab-harness pass rates (0.3%–19.8% across targets). Feeds Abhimanyu D3. | P&S (MAT 2201) | B Juneja |
| B8 | Sequential Experimental Design for Antibody Campaigns Model cancer target selection as a 10-armed bandit; implement Thompson sampling, UCB1, and epsilon-greedy; simulate regret over rfab-harness pass-rate data. Gateway to I3 and Abhimanyu D5. | P&S (MAT 2201) | B Juneja |
| B12 | EDA of Antibody Design Campaign Results Comprehensive exploratory analysis of 4,085 rfab-harness designs: sequence diversity (Shannon entropy), amino acid enrichment, CDR confidence ROC curves, and multi-target comparison. Generates hypotheses for upstream projects. | P&S, CS 1101 | B Juneja |
| I1 | Multi-Objective Optimization for Antibody Specificity Extend mosaic’s loss functional to jointly optimize on-target binding and off-target avoidance (28% of clinical antibodies bind off-targets). Characterize Pareto frontiers, prove scalarization properties. Direct implementation of Abhimanyu D1. | Real Analysis, P&S, Intro ML | I Abhimanyu D1 |
| I2 | Metric Geometry of CDR Sequence Spaces Prove theoretical properties of 6+ metrics on CDR sequences (metric axioms, ultrametricity, Bourgain embedding bounds, kernel construction); validate empirically on AB-Bind. Mathematical paper with proofs. Direct implementation of Abhimanyu D2. | Metric Spaces (MAT 2401) | I Abhimanyu D2 |
| I3 | Bayesian Optimization for Campaign Resource Allocation Build a GP surrogate over target features; implement Thompson sampling and GP-UCB for adaptive allocation across cancer targets; simulate vs uniform allocation. Direct implementation of Abhimanyu D5. | P&S, Statistical Inference I | I Juneja Abhimanyu D5 |
| I4 | Persistent Homology of Protein Binding Interfaces Compute Vietoris-Rips persistence diagrams (H0, H1, H2) on Ashish’s 10 cancer target surfaces; correlate topological features (Betti numbers, persistence entropy) with rfab-harness pass rates to predict target "designability." Abhimanyu D6. | Metric Spaces, Algebra I | I Abhimanyu D6 |
| I7 | Wasserstein Distances Between Antibody Repertoires Compute optimal transport distances between natural (OAS) and designed antibody distributions; implement Sinkhorn divergence; compare W1 vs MMD for detecting mode collapse in generative models. | Real Analysis, P&S | I |
| I8 | GNN-Based Inverse Folding Analysis Analyze ProteinMPNN’s expressivity through the Weisfeiler-Leman hierarchy; implement 1-WL and 2-WL tests on protein contact graphs; compare with higher-order GNN alternatives on sequence recovery benchmarks. | DSA, Discrete Math, Intro ML | I |
| I9 | Information-Theoretic Scoring for Protein Design Track Shannon entropy dynamics during mosaic’s simplex optimization; identify phase transitions in the annealing schedule; evaluate mutual information as an alternative binding predictor; fit Potts models to design ensembles. | P&S, Real Analysis | I Juneja |
| I10 | Monte Carlo Tree Search for Binder Design Cast sequence design as a tree search (node = partial sequence, action = amino acid assignment, reward = ipTM); implement vanilla MCTS with UCT; benchmark against mosaic’s gradient-based approach on 3 cancer targets. | DSA, P&S, Intro ML | I Juneja |
| I15 | Protein Design Competition Entry Pipeline Build an end-to-end pipeline (RFdiffusion backbone generation → ProteinMPNN sequence design → Protenix validation → multi-stage filtering) and enter an active Adaptyv Bio or BioML Society competition. | Intro ML, DSA | I rfab-harness |
| A1 | GRPO Theory for Protein Diffusion Models Derive GRPO from a KL-regularized RL objective; prove convergence for policy gradient on diffusion models; analyze mode collapse (all-helix artifacts); design entropy-regularized GRPO with diversity guarantees. Full version of Abhimanyu D3. | Measure Theory, Stochastic Processes | A Abhimanyu D3 |
| A2 | Exchangeable Arrays for PPI Prediction Apply the Aldous-Hoover theorem (as used by JURA Bio’s Vista transformer on 209M antibodies) to prove approximation theorems for cross-attention vs matrix factorization architectures; derive sample complexity bounds. Abhimanyu D4. | Measure Theory, P&S, Algebra I | A Abhimanyu D4 |
| A7 | Diffusion Models on SE(3) for Backbone Generation Derive convergence rates for the discretized reverse SDE on SO(3) (Wigner D-matrices, Laplace-Beltrami operator); prove optimal noise schedules on Riemannian manifolds; validate on protein backbone generation benchmarks. | Real Analysis, P&S, Differential Geometry | A |
| A8 | Geometric DL Expressivity for Protein Graphs Characterize the expressive power of SE(3)-equivariant GNNs (EGNN, MACE, Equiformer) via a geometric WL hierarchy; prove universality/non-universality results using Clebsch-Gordan decomposition; identify which protein properties are provably out of reach. | Algebra I & II, Linear Algebra | A |
| A13 | RL with Experimental Feedback for Protein Design Develop theory for RL with expensive ($10–100/datapoint), noisy, budget-constrained rewards; derive budget-optimal acquisition (when to run experiments vs optimize policy); connect to JURA Bio’s BODA active learning framework. | P&S, Intro ML, Statistical Inference I | A |
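The bandit framing shared by B8 and I3 is easy to prototype. The sketch below runs Thompson sampling with Beta posteriors over 10 targets; the per-target hit rates are hypothetical stand-ins spanning the quoted 0.3%–19.8% range, not the actual rfab-harness values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-target hit rates spanning the quoted 0.3%-19.8% range.
true_rates = np.linspace(0.003, 0.198, 10)

alpha = np.ones(10)   # Beta(1, 1) prior on each target's pass rate
beta = np.ones(10)
for t in range(2000):
    arm = int(np.argmax(rng.beta(alpha, beta)))   # sample each posterior, design against the best
    hit = rng.random() < true_rates[arm]          # simulate one design's wet-lab outcome
    alpha[arm] += hit                             # conjugate Beta update
    beta[arm] += 1 - hit

pulls = alpha + beta - 2.0                        # designs allocated per target
print("most-sampled target:", int(np.argmax(pulls)))
```

After a few thousand simulated designs the allocation concentrates on the highest-rate targets, which is the regret-vs-uniform comparison B8 asks for; I3 replaces the independent Beta posteriors with a GP over target features.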
Key References & Tools
- Protenix (ByteDance, 2025) — open-source AlphaFold3 reimplementation, surpasses AF3 on multi-chain benchmarks. github.com/bytedance/Protenix
- Boltz-2 (Wohlwend et al., 2026) — joint structure + binding affinity prediction, MIT license. github.com/jwohlwend/boltz
- RFdiffusion / RFAntibody (Watson et al., Nature 2023; Baker Lab 2025) — diffusion-based backbone generation. github.com/RosettaCommons/RFdiffusion
- BindCraft (Pacesa et al., bioRxiv 2024) — hallucination-based binder design, won Adaptyv EGFR competition. DOI: 10.1101/2024.09.30.615802
- mosaic (Escalante Bio, 2026) — 180-line JAX library for gradient-based PSSM optimization; 90% hit rate at Adaptyv NiPaH. github.com/escalante-bio/mosaic
- ProteinMPNN (Dauparas et al., Science 2022) — standard inverse folding model. github.com/dauparas/ProteinMPNN
- GUDHI — C++/Python library for persistent homology and TDA on point clouds. gudhi.inria.fr
- Adaptyv Bio competitions — monthly protein design challenges with experimental validation. adaptyv.bio
- AbDist (Hoffstedt et al., mAbs 2026) — simple kNN + Levenshtein distance matches transformer SOTA (ROC AUC 0.71–0.88). DOI: 10.1080/19420862.2026.2644655
- Weinstein et al., Nature Biotechnology 2026 (JURA Bio) — MESA dataset (209M antibodies × 100 pHLA targets), Vista transformer, Aldous-Hoover architecture motivation, BODA active learning.
- Adaptyv EGFR community paper (bioRxiv 2025.04.17.648362v2) — key negative result: zero correlation between ipTM/iPAE/ESM2 and wet-lab binding.
- Escalante Bio blog (Boyd & Guns, March 2026) — "Teaching generative models to hallucinate" — GRPO for BoltzGen, mode collapse, open theoretical questions.
Ready to start?
Email with subject "IML: [Project ID]" — include your year, relevant courses, and one sentence on why the project interests you.
View All 40 Open Problems →

Last updated March 2026. Full project specs available on request.