Domain Deep-Dive · Spring 2026

Mathematical Foundations: Open Problems

3 projects where pure mathematics meets protein design, drug resistance, and generative models. Genuine open questions — not textbook exercises.

The theory-practice gap in ML for science is vast and largely unexplored. Geometric deep learning on SE(3), diffusion models on manifolds, topological data analysis, and optimal transport all have rich theoretical frameworks — yet convergence guarantees, expressivity bounds, and approximation results for the actual loss landscapes encountered in protein design or drug discovery are almost entirely missing. The algorithms that won the Adaptyv Nipah competition were hand-tuned; the tools that evaluate generative models lack metric properties; the topological descriptors for protein structures use only one filtration parameter when biology offers many.

The key mathematical tools in this domain are: topological data analysis (persistent homology, the stability theorem, multi-parameter persistence modules and their connection to quiver representations), optimal transport (Wasserstein distances, Sinkhorn's algorithm, gradient flows), optimization on the probability simplex (projected gradient, mirror descent with KL divergence, Frank-Wolfe), and information geometry (the Fisher information metric as the natural geometry on the simplex). These are not just computational tools — they are mathematical objects with open structural questions.
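As a minimal, dependency-free sketch of two of the simplex methods named above, the following compares entropic mirror descent (the KL-divergence mirror map, also called exponentiated gradient) with projected gradient descent. The quadratic toy objective, the step size, and all function names are assumptions chosen for illustration; none of it comes from an actual protein design loss.

```python
import math

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumsum = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumsum += u_k
        t = (cumsum - 1.0) / k
        if u_k - t > 0:
            theta = t  # the last k satisfying the test gives the shift
    return [max(x - theta, 0.0) for x in v]

def mirror_descent_step(x, grad, step):
    """Entropic mirror descent: multiplicative update, then renormalize."""
    w = [x_i * math.exp(-step * g_i) for x_i, g_i in zip(x, grad)]
    total = sum(w)
    return [w_i / total for w_i in w]

def projected_gradient_step(x, grad, step):
    """Plain gradient step followed by Euclidean projection."""
    return project_to_simplex([x_i - step * g_i for x_i, g_i in zip(x, grad)])

# Hypothetical toy objective: f(x) = 0.5 * ||x - TARGET||^2, TARGET on the simplex.
TARGET = [0.7, 0.2, 0.1]

def loss(x):
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, TARGET))

def gradient(x):
    return [a - b for a, b in zip(x, TARGET)]

def run(step_fn, iterations=200, step=0.5):
    x = [1.0 / 3.0] * 3  # start at the uniform distribution
    for _ in range(iterations):
        x = step_fn(x, gradient(x), step)
    return x

x_md = run(mirror_descent_step)
x_pg = run(projected_gradient_step)
print("mirror descent:    ", [round(v, 4) for v in x_md], loss(x_md))
print("projected gradient:", [round(v, 4) for v in x_pg], loss(x_pg))
```

This toy problem is convex, so both methods reach the same minimizer. The non-convex losses that I12 targets are exactly where that guarantee is missing.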

Why math majors should care: these are genuine open mathematical problems, not "apply existing ML to biology". Project I12 asks for convergence proofs that do not exist in the literature for the protein design setting. Project A9 asks whether Wasserstein distances on protein sequence spaces metrize weak convergence — a functional analysis question. Project A11 asks for new results about the structure of persistence modules arising from protein data — a quiver representation theory question with biological stakes. The papers do not have the theorems yet; that is the point.
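To make the A9 question concrete, here is a small sketch of an entropy-regularized optimal transport cost between two empirical distributions of sequences, computed with Sinkhorn iterations. The Hamming ground cost, the uniform weights, the regularization strength, and the toy sequences are all assumptions for illustration. In practice a library such as POT would be the natural choice; this version uses only the standard library.

```python
import math

def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(s, t))

def sinkhorn_cost(xs, ys, reg=0.2, iterations=500):
    """Transport cost of the entropic-OT plan between uniform empirical measures."""
    n, m = len(xs), len(ys)
    a = [1.0 / n] * n  # uniform weights on the first sample
    b = [1.0 / m] * m  # uniform weights on the second sample
    cost = [[hamming(x, y) for y in ys] for x in xs]
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # The plan is P[i][j] = u[i] * kernel[i][j] * v[j]; return sum of P * cost.
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j]
        for i in range(n)
        for j in range(m)
    )

generated = ["ACDE", "ACDF"]   # hypothetical model samples
reference = ["ACDE", "WWWW"]   # hypothetical reference set
print(sinkhorn_cost(generated, reference))
```

At convergence the regularized plan is a feasible coupling, so its cost is never below the exact Wasserstein-1 distance for the same ground cost. Entropic regularization also means the value is not exactly zero for identical samples. How such estimates behave on real sequence spaces is part of what A9 sets out to prove.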

Project A12, while classified under Mathematical Biology, sits equally in this domain: evolutionary game theory of drug resistance requires differential equations, fixed-point theorems for Nash equilibrium existence, Pontryagin's maximum principle for optimal drug scheduling, and branching processes for stochastic tumor models. It is pure applied mathematics, motivated by paediatric cancer.
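The replicator dynamics behind A12 can be sketched for the two-type case, with x the fraction of drug-sensitive cells. The payoff matrix below is hypothetical and chosen only so that an interior coexistence equilibrium exists. Drug dosing, the control variable in the Pontryagin formulation, is deliberately left out. Explicit Euler integration keeps the sketch dependency-free; SciPy's ODE solvers would be the more usual tool.

```python
def replicator_rhs(x, payoff):
    """dx/dt for the sensitive fraction x in a two-strategy replicator equation."""
    (a, b), (c, d) = payoff
    fitness_sensitive = a * x + b * (1.0 - x)
    fitness_resistant = c * x + d * (1.0 - x)
    return x * (1.0 - x) * (fitness_sensitive - fitness_resistant)

def simulate(x0, payoff, dt=0.01, steps=5000):
    """Explicit Euler integration; returns the full trajectory of x."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x += dt * replicator_rhs(x, payoff)
        trajectory.append(x)
    return trajectory

def interior_equilibrium(payoff):
    """Mixed equilibrium where both types have equal fitness (needs b > d, c > a)."""
    (a, b), (c, d) = payoff
    return (b - d) / ((b - d) + (c - a))

# Hypothetical payoffs: rows are (sensitive, resistant), columns the opponent type.
PAYOFF = ((1.0, 4.0), (2.0, 2.0))

trajectory = simulate(0.1, PAYOFF)
x_star = interior_equilibrium(PAYOFF)
print("final sensitive fraction:", trajectory[-1])
print("predicted equilibrium:   ", x_star)
```

With these payoffs each type does better when rare, so the interior equilibrium is stable and the simulated trajectory approaches it from any interior starting point.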

Prerequisite Map

I12: Simplex Optimization
  Hard prerequisites: Real Analysis, Linear Algebra
  Relevant Ashoka courses:
  • MAT 2003 Real Analysis: convergence proofs, Lipschitz continuity, convex analysis
  • MAT 1001 Linear Algebra: matrix norms, eigenvalue analysis, gradient computation

A9: Optimal Transport
  Hard prerequisites: Measure Theory, Functional Analysis
  Relevant Ashoka courses:
  • MAT 4xxx Measure Theory: weak convergence, probability measures on metric spaces
  • MAT 4xxx Functional Analysis: RKHS, operator theory, Banach space geometry

A11: Multi-Parameter Persistence
  Hard prerequisites: Algebra II, Algebraic Topology
  Relevant Ashoka courses:
  • MAT 2002 Algebra II: module theory, quiver representations, homological algebra
  • MAT 2026 Metric & Topological Spaces: simplicial complexes, filtrations, topological invariants

A12: Evolutionary Game Theory
  Hard prerequisites: Differential Equations, Real Analysis, Probability & Statistics
  Relevant Ashoka courses:
  • MAT 3013 Differential Equations: dynamical systems, stability analysis, phase portraits
  • MAT 2003 Real Analysis: fixed-point theorems, variational methods
  • MAT 2020 Probability & Statistics: branching processes, stochastic population models

Python libraries for these projects — GUDHI, Ripser, POT, GPyTorch, SciPy — are all learnable during the project. No prior familiarity required.

I12: Convergence Analysis of Simplex Optimization for Protein Design
  Tier: Intermediate
  Hard prerequisites: MAT 2003 Real Analysis; MAT 1001 Linear Algebra
  Scope: 1–2 semesters
  Escalante Bio's mosaic library won the Adaptyv Nipah competition using a hand-tuned accelerated proximal gradient method on the probability simplex. This project proves convergence to critical points for non-convex protein design losses, compares mirror descent (KL divergence) against projected gradient, and bounds the rounding gap from continuous PSSM to discrete sequence. All theorems are missing from the current literature for this setting.

A9: Optimal Transport for Generative Model Evaluation
  Tier: Advanced
  Hard prerequisites: MAT 4xxx Measure Theory; MAT 4xxx Functional Analysis; MAT 2020 Probability & Statistics
  Scope: 2 semesters
  Evaluating protein generative models is an open problem: existing metrics (MMD, BEAR tests, ipTM rankings) lack metric properties or theoretical grounding. This project develops Wasserstein-distance-based evaluation metrics for generated antibody ensembles, proves metrization properties and concentration inequalities for the empirical Wasserstein distance, formalizes directed evolution as a Wasserstein gradient flow, and benchmarks 5+ generative models (RFAntibody, ProteinMPNN, BoltzGen, RL-finetuned variants).

A11: Multi-Parameter Persistence for Protein Design Spaces
  Tier: Advanced
  Hard prerequisites: MAT 2002 Algebra II; MAT 2026 Metric & Topological Spaces; Algebraic Topology background
  Scope: 2 semesters
  Single-parameter persistent homology uses one filtration (distance threshold); proteins have many at once: distance, hydrophobicity, charge, flexibility. Multi-parameter persistence is an active area in pure mathematics: unlike the single-parameter case, no barcode analog exists in general, and the classification problem connects to quiver representation theory. This project extends TDA from I4, constructs bi-filtrations on protein binding interfaces, computes fibered barcodes, and seeks new mathematical results about the structure of persistence modules arising from protein data, with potential submission to Journal of Applied and Computational Topology.

A12: Mathematical Oncology: Evolutionary Game Theory of Drug Resistance
  Tier: Advanced
  Hard prerequisites: MAT 3013 Differential Equations; MAT 2003 Real Analysis; MAT 2020 Probability & Statistics
  Scope: 2 semesters
  Models tumor evolution as an evolutionary game between sensitive and resistant cell types. The project proves existence and stability of Nash equilibria via replicator dynamics, applies Pontryagin's maximum principle to derive optimal bang-bang drug scheduling, proves that adaptive therapy (treat-until-response) delays resistance compared to maximum tolerated dose, and extends to stochastic branching processes to analyze extinction probabilities for resistant clones. Clinical implications connect to paediatric GBM and neuroblastoma.
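For the single-parameter baseline that A11 generalizes, the degree-0 persistence of a Vietoris-Rips filtration can be computed with a union-find pass over edges sorted by length. The sketch below assumes the convention that an edge enters the filtration at its length. The toy point cloud and function name are illustrative. It does not attempt bi-filtrations or fibered barcodes, and in practice GUDHI or Ripser would be used.

```python
import math
from itertools import combinations

def h0_persistence(points):
    """Death times of the finite H0 bars of a Vietoris-Rips filtration.

    Every connected component is born at scale 0; a bar dies at the length
    of the edge that merges its component into another. The single infinite
    bar (the component that survives) is not returned.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # merging two components kills one bar
            deaths.append(length)
    return deaths

# Hypothetical toy cloud: two tight pairs separated by a gap of 3.
toy_points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (4.0, 1.0)]
bars = h0_persistence(toy_points)
print(bars)
```

Here the one long bar records the gap between the two clusters. With a second filtration parameter, such as hydrophobicity or charge, no comparable list of bars exists in general, which is the difficulty A11 addresses.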

Key References & Tools

  • GUDHI (gudhi.inria.fr): TDA library with alpha complexes, Rips complexes, and multi-parameter persistence
  • Ripser (github.com/Ripser/ripser): fast Vietoris-Rips persistent homology, orders of magnitude faster than GUDHI for large point clouds
  • POT, Python Optimal Transport (pythonot.github.io): Wasserstein distances, Sinkhorn's algorithm, OT barycenters
  • GPyTorch (gpytorch.ai): scalable Gaussian process inference for Bayesian optimization
  • Bronstein et al., Geometric Deep Learning (geometricdeeplearning.com): free textbook on SE(3)-equivariance, group representations, gauge theory for ML
  • Boyd & Vandenberghe, Convex Optimization (web.stanford.edu/~boyd/cvxbook): free textbook; the convex-analysis background for simplex projection, mirror descent, and Frank-Wolfe
  • Peyré & Cuturi (2019). "Computational Optimal Transport." Foundations and Trends in ML, 11(5-6). arXiv:1803.00567. The standard reference for Wasserstein theory and Sinkhorn algorithms
  • Botnan & Lesnick (2022). "An Introduction to Multiparameter Persistence." arXiv:2203.14289. Best entry point for A11's mathematical framework
  • Altrock, Liu & Michor (2015). "The mathematics of cancer: integrating quantitative models." Nature Reviews Cancer, 15(12), 730-745. Foundational reference for A12

Strong math background? These problems are yours.

Real Analysis + Algebra II is enough to start I12 or A11. Email with subject "IML: [Project ID]".


Last updated March 2026. Full project details available on request.