Mathematical Foundations: Open Problems
3 projects where pure mathematics meets protein design, drug resistance, and generative models. Genuine open questions — not textbook exercises.
Domain Overview
The theory-practice gap in ML for science is vast and largely unexplored. Geometric deep learning on SE(3), diffusion models on manifolds, topological data analysis, and optimal transport all have rich theoretical frameworks — yet convergence guarantees, expressivity bounds, and approximation results for the actual loss landscapes encountered in protein design or drug discovery are almost entirely missing. The algorithms that won the Adaptyv Nipah competition were hand-tuned; the tools that evaluate generative models lack metric properties; the topological descriptors for protein structures use only one filtration parameter when biology offers many.
The key mathematical tools in this domain are: topological data analysis (persistent homology, the stability theorem, multi-parameter persistence modules and their connection to quiver representations), optimal transport (Wasserstein distances, Sinkhorn's algorithm, gradient flows), optimization on the probability simplex (projected gradient, mirror descent with KL divergence, Frank-Wolfe), and information geometry (the Fisher information metric as the natural geometry on the simplex). These are not just computational tools — they are mathematical objects with open structural questions.
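As a rough illustration of the simplex optimization tools named above, the following sketch contrasts a projected-gradient step with an entropic mirror descent step (the KL-divergence variant, also known as exponentiated gradient). The quadratic objective, the target vector, the step size, and all function names are illustrative assumptions; none of them comes from an actual protein design loss.

```python
import math


def project_to_simplex(v):
    """Euclidean projection of v onto {x : x_i >= 0, sum(x) = 1} (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0:
            theta = candidate  # the last k passing this test fixes the shift
    return [max(x - theta, 0.0) for x in v]


def projected_gradient_step(x, grad, step):
    """Euclidean gradient step followed by projection back onto the simplex."""
    return project_to_simplex([xi - step * gi for xi, gi in zip(x, grad)])


def mirror_descent_step(x, grad, step):
    """Entropic mirror descent: multiplicative update, then renormalize."""
    g_min = min(grad)  # shifting the gradient leaves the normalized update unchanged
    w = [xi * math.exp(-step * (gi - g_min)) for xi, gi in zip(x, grad)]
    total = sum(w)
    return [wi / total for wi in w]


def minimize(step_fn, grad_fn, x0, step=0.5, iters=500):
    """Iterate a given update rule from x0 for a fixed number of steps."""
    x = list(x0)
    for _ in range(iters):
        x = step_fn(x, grad_fn(x), step)
    return x


# Toy objective (assumption): f(x) = 0.5 * ||x - target||^2 with target on the simplex.
target = [0.7, 0.2, 0.1]


def grad_f(x):
    return [xi - ti for xi, ti in zip(x, target)]


uniform = [1.0 / 3.0] * 3
x_pgd = minimize(projected_gradient_step, grad_f, uniform)
x_md = minimize(mirror_descent_step, grad_f, uniform)
```

On this convex toy problem both updates approach the same minimizer. One visible difference is that the mirror descent iterates stay strictly positive, while the Euclidean projection can set coordinates exactly to zero; how either behaves on non-convex design losses is the kind of question a convergence analysis would have to settle.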
Why math majors should care: these are genuine open mathematical problems, not "apply existing ML to biology". Project I12 asks for convergence proofs that do not exist in the literature for the protein design setting. Project A9 asks whether Wasserstein distances on protein sequence spaces metrize weak convergence — a functional analysis question. Project A11 asks for new results about the structure of persistence modules arising from protein data — a quiver representation theory question with biological stakes. The papers do not have the theorems yet; that is the point.
Project A12, while classified under Mathematical Biology, sits equally in this domain: evolutionary game theory of drug resistance requires differential equations, fixed-point theorems for Nash equilibrium existence, Pontryagin's maximum principle for optimal drug scheduling, and branching processes for stochastic tumor models. It is pure applied mathematics, motivated by paediatric cancer.
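To make the game-theoretic setting concrete, here is a minimal sketch of the standard two-type replicator equation, integrated with forward Euler, for the fraction of resistant cells in a tumor. The payoff values, time step, and function names are hypothetical choices for illustration; the sketch includes no drug dosing and is not a calibrated tumor model.

```python
def replicator_step(x, payoff, dt):
    """One forward-Euler step for the resistant fraction x in a two-type game.

    payoff[i][j] is the payoff to type i against type j,
    with index 0 = sensitive and index 1 = resistant.
    """
    fit_sensitive = payoff[0][0] * (1.0 - x) + payoff[0][1] * x
    fit_resistant = payoff[1][0] * (1.0 - x) + payoff[1][1] * x
    return x + dt * x * (1.0 - x) * (fit_resistant - fit_sensitive)


def simulate(x0, payoff, dt=0.01, t_end=200.0):
    """Integrate the replicator equation from x0 up to time t_end."""
    x = x0
    for _ in range(int(round(t_end / dt))):
        x = replicator_step(x, payoff, dt)
    return x


def interior_equilibrium(payoff):
    """Mixed state where both types have equal fitness, if one exists in (0, 1)."""
    gain_resistant_when_rare = payoff[1][0] - payoff[0][0]
    gain_sensitive_when_rare = payoff[0][1] - payoff[1][1]
    denom = gain_resistant_when_rare + gain_sensitive_when_rare
    if denom == 0:
        return None
    x_star = gain_resistant_when_rare / denom
    return x_star if 0.0 < x_star < 1.0 else None


# Hypothetical payoffs: each type does better when rare, so the two coexist.
PAYOFF = [[1.0, 0.8],
          [1.2, 0.5]]

x_star = interior_equilibrium(PAYOFF)
x_from_low = simulate(0.01, PAYOFF)
x_from_high = simulate(0.99, PAYOFF)
```

With these payoffs, trajectories started from a mostly sensitive and from a mostly resistant population both settle at the same mixed state, which is the numerical counterpart of the existence and stability statements the project sets out to prove.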
Ashoka Coursework Connection
Prerequisite Map
| Project | Hard Prerequisites | Relevant Ashoka Courses |
|---|---|---|
| I12: Simplex Optimization | Real Analysis, Linear Algebra | MAT 2003 Real Analysis: convergence proofs, Lipschitz continuity, convex analysis; MAT 1001 Linear Algebra: matrix norms, eigenvalue analysis, gradient computation |
| A9: Optimal Transport | Measure Theory, Functional Analysis | MAT 4xxx Measure Theory: weak convergence, probability measures on metric spaces; MAT 4xxx Functional Analysis: RKHS, operator theory, Banach space geometry |
| A11: Multi-Parameter Persistence | Algebra II, Algebraic Topology | MAT 2002 Algebra II: module theory, quiver representations, homological algebra; MAT 2026 Metric & Topological Spaces: simplicial complexes, filtrations, topological invariants |
| A12: Evolutionary Game Theory | Differential Equations, Real Analysis, Probability & Statistics | MAT 3013 Differential Equations: dynamical systems, stability analysis, phase portraits; MAT 2003 Real Analysis: fixed-point theorems, variational methods; MAT 2020 Probability & Statistics: branching processes, stochastic population models |
Python libraries for these projects — GUDHI, Ripser, POT, GPyTorch, SciPy — are all learnable during the project. No prior familiarity required.
Projects in this Domain
| ID | Project | Tier | Hard Prerequisites | Scope |
|---|---|---|---|---|
| I12 | Convergence Analysis of Simplex Optimization for Protein Design. Escalante Bio's mosaic library won the Adaptyv Nipah competition using a hand-tuned accelerated proximal gradient method on the probability simplex. This project proves convergence to critical points for non-convex protein design losses, compares mirror descent (KL divergence) against projected gradient, and bounds the rounding gap from continuous PSSM to discrete sequence. None of these results currently appears in the literature for this setting. | Intermediate | MAT 2003 Real Analysis; MAT 1001 Linear Algebra | 1–2 semesters |
| A9 | Optimal Transport for Generative Model Evaluation. Evaluating protein generative models is an open problem: existing metrics (MMD, BEAR tests, ipTM rankings) lack metric properties or theoretical grounding. This project develops Wasserstein-distance-based evaluation metrics for generated antibody ensembles, proves metrization properties and concentration inequalities for the empirical Wasserstein distance, formalizes directed evolution as a Wasserstein gradient flow, and benchmarks 5+ generative models (RFAntibody, ProteinMPNN, BoltzGen, RL-finetuned variants). | Advanced | MAT 4xxx Measure Theory; MAT 4xxx Functional Analysis; MAT 2020 Probability & Statistics | 2 semesters |
| A11 | Multi-Parameter Persistence for Protein Design Spaces. Single-parameter persistent homology uses one filtration (distance threshold), whereas proteins offer many at once: distance, hydrophobicity, charge, flexibility. Multi-parameter persistence is an active area in pure mathematics: unlike the single-parameter case, no barcode analog exists in general, and the classification problem connects to quiver representation theory. This project extends TDA from I4, constructs bi-filtrations on protein binding interfaces, computes fibered barcodes, and seeks new mathematical results about the structure of persistence modules arising from protein data, with potential submission to the Journal of Applied and Computational Topology. | Advanced | MAT 2002 Algebra II; MAT 2026 Metric & Topological Spaces; Algebraic Topology background | 2 semesters |
| A12 | Mathematical Oncology: Evolutionary Game Theory of Drug Resistance. Models tumor evolution as an evolutionary game between sensitive and resistant cell types. The project proves existence and stability of Nash equilibria via replicator dynamics, applies Pontryagin's maximum principle to derive optimal bang-bang drug scheduling, proves that adaptive therapy (treat-until-response) delays resistance compared to maximum tolerated dose, and extends to stochastic branching processes to analyze extinction probabilities for resistant clones. Clinical implications connect to paediatric GBM and neuroblastoma. | Advanced | MAT 3013 Differential Equations; MAT 2003 Real Analysis; MAT 2020 Probability & Statistics | 2 semesters |
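For the optimal transport project (A9) in the table above, the following is a minimal sketch of Sinkhorn's algorithm for the entropy-regularized transport cost between two small sequence ensembles. The Hamming distance as ground cost, the uniform weights, the regularization strength `eps`, and the toy sequences are all assumptions made for illustration; the project itself does not prescribe them.

```python
import math


def hamming(s, t):
    """Number of mismatched positions between two equal-length sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))


def sinkhorn_cost(xs, ys, eps=0.5, iters=500):
    """Entropic OT between uniform measures on xs and ys, Hamming ground cost.

    Returns (plan, cost) with cost = sum_ij plan[i][j] * C[i][j].
    """
    n, m = len(xs), len(ys)
    a, b = [1.0 / n] * n, [1.0 / m] * m
    C = [[float(hamming(x, y)) for y in ys] for x in xs]
    K = [[math.exp(-c / eps) for c in row] for row in C]  # Gibbs kernel
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    cost = sum(plan[i][j] * C[i][j] for i in range(n) for j in range(m))
    return plan, cost


# Hypothetical toy "ensembles" of length-4 sequences, for illustration only.
reference = ["ACDE", "ACDF", "GCDE"]
generated = ["ACDE", "WCDF", "GYDE"]

plan, cost_cross = sinkhorn_cost(reference, generated)
_, cost_self = sinkhorn_cost(reference, reference)
```

In this sketch the cost of an ensemble against itself comes out strictly positive, so the entropic transport cost on its own does not satisfy the identity axiom of a metric. That small observation is one reason the metrization questions in A9 need care when regularized quantities are used in place of the exact Wasserstein distance.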
Key References & Tools
- GUDHI (gudhi.inria.fr): TDA library with alpha complexes, Rips complexes, and multi-parameter persistence
- Ripser (github.com/Ripser/ripser): fast Vietoris-Rips persistent homology, typically much faster and lighter on memory than GUDHI's Rips pipeline on large point clouds
- POT, Python Optimal Transport (pythonot.github.io): Wasserstein distances, Sinkhorn's algorithm, OT barycenters
- GPyTorch (gpytorch.ai): scalable Gaussian process inference for Bayesian optimization
- Bronstein et al., Geometric Deep Learning (geometricdeeplearning.com): free textbook on SE(3)-equivariance, group representations, gauge theory for ML
- Boyd & Vandenberghe, Convex Optimization (web.stanford.edu/~boyd/cvxbook): free textbook on convex analysis, duality, and descent methods, the background needed for simplex projection, mirror descent, and Frank-Wolfe
- Peyré & Cuturi (2019), "Computational Optimal Transport." Foundations and Trends in Machine Learning, 11(5-6). arXiv:1803.00567. The standard reference for Wasserstein theory and Sinkhorn algorithms
- Botnan & Lesnick (2022), "An Introduction to Multiparameter Persistence." arXiv:2203.14289. Best entry point for A11's mathematical framework
- Altrock, Liu & Michor (2015), "The mathematics of cancer: integrating quantitative models." Nature Reviews Cancer, 15(12), 730-745. Foundational reference for A12
Strong math background? These problems are yours.
Real Analysis + Algebra II is enough to start I12 or A11. Email with subject "IML: [Project ID]".
Last updated March 2026. Full project details available on request.