Domain · 4 Projects · Advanced

Formal Mathematics & Lean4

Four open problems at the intersection of interactive theorem proving, algorithmic information theory, and computational biology.

The formal theorem proving landscape in 2026 has shifted from curiosity to capability. Mathlib now contains over 210,000 machine-verified theorems spanning algebra, analysis, topology, and measure theory. AlphaProof (DeepMind, Nature 2025) achieved an IMO silver medal with Lean4 proofs; Aletheia (Google DeepMind, 2026) has produced two genuinely novel publishable mathematical results. Leanstral (Mistral, March 2026) democratised proof engineering further: a 6B-parameter open-weight model that is 92x cheaper than frontier models while matching their success rates on Mathlib-style benchmarks. The tooling — LeanDojo-v2, LEANHAMMER, TorchLean — makes AI-assisted formalization a realistic undergraduate project for the first time.

Protein design needs formalization urgently. The Adaptyv EGFR competition (2025) exposed a brutal fact: no correlation exists between any current computational score (ipTM, iPAE, ESM2 pLDDT) and actual wet-lab binding. Ramachandran angle constraints, steric clash thresholds, and binding interface complementarity are precisely specified in structural biology yet have never been expressed as Lean4 propositions. With clinical antibodies showing ~28% off-target reactivity rates (Dai et al., Structure 2026), the absence of a formal specificity framework is not merely an academic gap — it is a patient safety problem.

The "proteins-as-programs" idea (Ishaan Direction D2) offers a different angle. Proteins and Brainfuck share a structural parallel: minimal alphabets (20 amino acids / 8 instructions) producing emergent complexity. Treating protein sequences as programs in a formal language (ProtLang) makes Kolmogorov complexity directly applicable: the length of the shortest ProtLang program that generates a sequence provides information-theoretic lower bounds on protein designability, and may explain why some binding targets are systematically harder to design against than others in rfab-harness campaigns.

These two threads — formal verification of protein constraints and algorithmic information theory for sequences — connect directly to the three Ishaan Varior research directions: D1 (formal verification), D2 (proteins-as-programs), and D5 (AI-augmented mathematical discovery for biomedical conjectures). Project I12, which analyzes convergence of the simplex optimizer underlying all mosaic-based designs, is formally an optimization project but bridges here: rigorous convergence proofs for non-convex objectives on the probability simplex are precisely the kind of result that could eventually be formalized in Lean4 and contributed to Mathlib.

Ashoka Coursework Connection

CS 5310 / MAT 3216 — Symbolic Logic Directly prepares for A3, A4: formal languages, proof theory, Curry-Howard correspondence

CS 4101 — Theory of Computation Core prerequisite for A3, A4: Turing machines, decidability, Kolmogorov complexity, formal language theory

MAT 3201 / MAT 3102 — Algebra I & II Type theory connections for A3; algebraic structures for ProtLang semantics in A4; Lean4 Mathlib foundations for A6

MAT 2301 — Real Analysis Convergence proofs and Lipschitz analysis for I12; ODE foundations for A6 (Gompertz growth, replicator dynamics)

MAT 3302 — Differential Equations Dynamical systems theory for A6: tumor growth models, evolutionary game theory, optimal control

Lean4 (self-study) Learnable during project via "Mathematics in Lean" tutorial and Mathlib docs — all four projects ~4-6 weeks to working proficiency

Projects in This Domain

ID	Project	Description	Key Prereqs	Tier
A3	Formal Verification of Protein Designs via Lean4	Build ProteinSpec, a Lean4 library that formalizes Ramachandran constraints, steric clash thresholds, and binding interface geometry as machine-checkable propositions. Validate retrospectively on Adaptyv EGFR and NiPaH competition designs using Leanstral for automated tactic generation.	Theory of Computation, Algebra I/II, Programming Languages	Advanced
A4	Proteins-as-Programs: Kolmogorov Complexity and Minimal Languages	Define ProtLang, a 20-instruction formal language with one instruction per amino acid, prove Turing completeness, derive Kolmogorov complexity bounds for target folds, and test whether complexity correlates with evolutionary conservation and experimental designability across 100 natural proteins.	Theory of Computation, Discrete Mathematics, Algebra I	Advanced
A6	AI-Augmented Mathematical Discovery for Biomedical Conjectures	Formalize 5-10 mathematical oncology conjectures in Lean4 — Gompertzian growth dominance, evolutionary game theory equilibria for drug resistance, optimal combination therapy scheduling — using Leanstral and LEANHAMMER for AI-assisted proof engineering, with results submitted to Mathlib.	Real Analysis, Differential Equations, Algebra II	Advanced
I12	Convergence Analysis of Simplex Optimization for Protein Design	Prove convergence of simplex_APGM (the algorithm behind Escalante Bio's mosaic library) for non-convex protein design losses: O(1/k²) rate in the convex case, critical-point convergence under Lipschitz gradients, and a rounding gap bound from continuous PSSM to discrete sequence. Compare mirror descent vs projected gradient empirically on mosaic's 22 benchmark targets.	Real Analysis, Linear Algebra	Intermediate

Key References

LeanDojo — AI-assisted Lean4 theorem proving: data extraction, proof search, benchmarks (leandojo.org)
Mathlib4 — 210K+ formalized theorems, 1.9M lines, 500+ contributors (github.com/leanprover-community/mathlib4)
Lean4 — Theorem prover and functional programming language based on dependent type theory (lean-lang.org)
AlphaProof — DeepMind, Nature 2025. IMO silver medal: 4/6 problems solved with machine-verified Lean4 proofs.
Aletheia — Google DeepMind, 2026. Towards Autonomous Mathematics Research. 87–90% on Olympiad problems; two novel publishable results.
Leanstral — Mistral AI, March 2026. 6B-parameter open-weight Lean4 agent (Apache 2.0); 92x cheaper than Claude Opus.
Li, M. & Vitányi, P. (2008). An Introduction to Kolmogorov Complexity and Its Applications, 3rd ed. Springer.
Altrock, P. M., Liu, L. L., & Michor, F. (2015). The mathematics of cancer: integrating quantitative models. Nature Reviews Cancer, 15(12), 730–745.

Ready to formalize something that has never been proved?

Email with subject "IML: A3", "IML: A4", "IML: A6", or "IML: I12".
These are 2-semester thesis-track projects — Y4 students and motivated Y3s welcome.

← Back to all 40 projects · Last updated March 2026.