Drug Discovery
Open Problems
Target identification. Molecular property prediction. Drug repurposing. Knowledge graph reasoning. Eight projects across three difficulty tiers.
Interested? Email me at my Ashoka address with subject "IML: [Project ID]" — e.g. "IML: B1". All levels welcome. I supervise the ML/computational aspects and connect you with biology or domain co-mentors as needed.
Domain Overview
ML-driven drug discovery spans the full pipeline from target identification to clinical candidate selection. The core tasks include predicting drug-target interaction (DTI) affinities from molecular structure, molecular property prediction (solubility, toxicity, ADMET), drug repurposing by matching approved drugs to new targets, and biomarker discovery from high-dimensional transcriptomic or proteomic data. Knowledge graph (KG) reasoning ties these tasks together: by encoding drugs, targets, pathways, and diseases as nodes and their biological relationships as edges, graph neural networks and differentiable symbolic reasoners can propagate evidence across the graph to generate and validate hypotheses.
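To make "propagating evidence across the graph" concrete, here is a minimal sketch using random walk with restart, one simple propagation scheme among many (it is not the method of any specific project below). The node names and edges are hypothetical placeholders, not curated biology, and a real pipeline would use typed, weighted edges.

```python
from collections import defaultdict

# Hypothetical toy edges (illustrative only, not curated biology).
EDGES = [
    ("drug_A", "target_X"),
    ("drug_B", "target_Y"),
    ("target_X", "pathway_P"),
    ("target_Y", "pathway_P"),
    ("pathway_P", "disease_D"),
    ("drug_C", "target_Z"),
    ("target_Z", "pathway_Q"),
]

def build_adjacency(edges):
    """Build undirected adjacency sets from (node, node) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def propagate(adj, seed, restart=0.3, iterations=50):
    """Random walk with restart: scores measure graph proximity to `seed`."""
    nodes = sorted(adj)
    score = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            # Each node passes (1 - restart) of its mass evenly to its neighbours.
            share = (1.0 - restart) * score[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        # The remaining mass jumps back to the seed node.
        nxt[seed] += restart
        score = nxt
    return score

adj = build_adjacency(EDGES)
scores = propagate(adj, seed="disease_D")
drug_ranking = sorted(
    (n for n in scores if n.startswith("drug_")),
    key=lambda n: scores[n],
    reverse=True,
)
print(drug_ranking)
```

In this toy graph, `drug_C` has no path to `disease_D` and so receives no score, while the two drugs that reach the disease through a shared pathway are ranked above it. The score is only a proximity heuristic; turning it into a hypothesis still requires validation against held-out or experimental evidence.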
The state of the art is moving fast as of 2026. NVIDIA and Eli Lilly launched an AI co-innovation lab combining BioNeMo (GPU-accelerated molecular simulation and generative chemistry) with Lilly's medicinal chemistry expertise, targeting hit-to-lead optimization timelines measured in weeks rather than months. The open-source ecosystem has matured: DeepPurpose (Huang et al., 2020) unifies DTI prediction across multiple molecular and protein representations; TorchDrug (Zhu et al., 2022) provides standardized GNN benchmarks for property prediction, generation, and retrosynthesis; DeepChem covers the broadest ML-for-chemistry surface area. On the knowledge graph side, the KGT framework (GigaScience 2025) integrates heterogeneous biomedical KGs with LLMs for drug repositioning, while CancerKG.ORG (Moffitt Cancer Center) links genomic alterations to therapeutic strategies across cancer types. The AI drug discovery market reached $1.94B in 2025 with over 200 AI-designed molecules in clinical development.
The focus areas for these projects are deliberately chosen: NAFLD/NASH (non-alcoholic fatty liver disease / steatohepatitis) is a disease affecting ~25% of the global population with only one approved drug, yet ML-specific tooling for NASH biomarker identification and target-directed repurposing remains sparse compared to oncology. Cancer drug-drug interactions (DDIs) are acutely important for polypharmacy patients, and the public FAERS and DrugBank datasets provide a tractable entry point for knowledge graph construction. At the advanced end, evolutionary game theory of drug resistance and categorical deep learning for KG reasoning represent open problems where no published solutions exist.
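As a rough illustration of why DDI knowledge graphs are a tractable entry point, the sketch below flags drug pairs that share a metabolizing CYP450 enzyme. The drug names and the drug-to-enzyme assignments are hypothetical placeholders, not data from DrugBank or FAERS, and `shared_enzyme_pairs` is an illustrative helper rather than part of any library.

```python
from itertools import combinations

# Hypothetical drug -> CYP450 enzyme edges (placeholders, not curated pharmacology).
METABOLIZED_BY = {
    "drug_1": {"CYP3A4", "CYP2D6"},
    "drug_2": {"CYP3A4"},
    "drug_3": {"CYP2C9"},
    "drug_4": {"CYP2D6", "CYP2C9"},
}

def shared_enzyme_pairs(metabolized_by):
    """Return {(drug_a, drug_b): shared enzymes} for every pair with overlap."""
    flagged = {}
    for a, b in combinations(sorted(metabolized_by), 2):
        shared = metabolized_by[a] & metabolized_by[b]
        if shared:
            flagged[(a, b)] = shared
    return flagged

flags = shared_enzyme_pairs(METABOLIZED_BY)
for (a, b), enzymes in flags.items():
    print(f"{a} <-> {b}: shared {sorted(enzymes)}")
```

A shared enzyme is only a candidate flag, not a confirmed interaction. A fuller graph would distinguish substrates, inhibitors, and inducers, and would add adverse event co-occurrence as a second line of evidence.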
Connection to Ashish's portfolio: Onco-TTT (oncology hypothesis generation via knowledge graph diffusion and GLiNER2 NER) is the direct parent of projects I14 and A5. The NASH/NAFLD disease focus appears in B1, B9, and I11. The FDA adverse event analysis pipeline (FAERS) is used in B2. These projects are not self-contained exercises: they extend or interrogate real research infrastructure.
Ashoka Coursework Connection
All 8 Projects
| ID | Project | Description | Key Prereqs | Tags |
|---|---|---|---|---|
| B1 | Drug Repurposing for NAFLD/NASH Using DeepPurpose | Train drug-target interaction models on ChEMBL binding data for three validated NASH targets (PNPLA3, TM6SF2, HSD17B13), then screen ~1,500 FDA-approved drugs to rank repurposing candidates. ROC analysis and cross-validation throughout. | P&S, Intro CS | Juneja 1 sem |
| B2 | Cancer Drug Interaction Checker with Knowledge Graphs | Build a small knowledge graph from DrugBank and FAERS for 20-30 oncology drugs, then implement graph traversal queries to identify DDI risks via shared CYP450 enzymes and adverse event co-occurrence. Validate against gold-standard interaction databases. | Intro CS, Discrete Math | 1 sem |
| B9 | Molecular Property Prediction with TorchDrug | Compare GNN architectures (GCN, GIN, SchNet) on MoleculeNet benchmarks using scaffold splits, then transfer to a NASH-specific compound set from ChEMBL. Analyze which molecular features correlate with activity against NASH targets via GNN attribution. | P&S, Intro CS | Juneja 1 sem |
| I6 | Drug Discovery KG with Differentiable Reasoning (Scallop) | Build a CYP450 drug interaction knowledge graph and encode pharmacological rules as Scallop logic programs compiled into differentiable provenance semirings. Train a GNN link predictor end-to-end through the symbolic layer; compare against neural-only and symbolic-only baselines. | Intro ML, DSA, Discrete Math | Ishaan D4 1-2 sem |
| I11 | ML-Driven Biomarker Discovery for NASH/NAFLD | Apply ensemble feature selection (random forests, LASSO, stability selection) to public GEO transcriptomic datasets from NASH liver biopsies. Derive a consensus non-invasive biomarker panel, validate on independent cohorts, and compare against recent NAFLD-HCC biomarker publications. | Intro ML, P&S | Juneja 1-2 sem |
| I14 | Graph Diffusion for Oncology Hypothesis Generation | Reproduce the Onco-TTT pipeline (GLiNER2 NER → NetworkX KG → graph diffusion → 6-dimensional validation) for lung cancer, then extend to a new indication (cardiovascular disease, type 2 diabetes, or paediatric neuroblastoma). Compare automated hypotheses against expert assessment. | Intro ML, DSA, Discrete Math | Onco-TTT 1-2 sem |
| A5 | Categorical Deep Learning for Drug Discovery KGs | Model the drug discovery KG as a category (drugs/targets as objects, biological relationships as morphisms), drug interaction patterns as functors, and pharmacological constraints as natural transformations. Prove compositionality properties, integrate Scallop, replace Onco-TTT's heuristic diffusion with the categorical layer, and formalize core theorems in Lean4 as a stretch goal. | Algebra II, Category Theory, Intro ML | Ishaan D4 2 sem |
| A12 | Evolutionary Game Theory of Drug Resistance | Formalize drug resistance as an evolutionary game with replicator dynamics over drug-sensitive and resistant tumor cell populations. Prove existence and stability of Nash equilibria, derive optimal bang-bang control schedules via Pontryagin's maximum principle, extend to a stochastic branching process model, and connect findings to adaptive therapy protocols for glioblastoma (GBM) and neuroblastoma. | DiffEq, Real Analysis, P&S | 2 sem |
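
For readers new to the replicator dynamics named in A12, here is a minimal two-population sketch with no treatment term. It integrates dx/dt = x(1 - x)(f_S - f_R) for the sensitive fraction x using forward Euler. The payoff values are hypothetical and chosen only so that a stable interior equilibrium exists; they are not fitted to any tumor data.

```python
# Hypothetical payoffs (illustrative, not fitted to data).
# PAYOFF[i][j] = fitness of type i when interacting with type j,
# where index 0 is drug-sensitive and index 1 is resistant.
PAYOFF = [[1.0, 0.8],
          [1.2, 0.6]]

def growth_rate(x, payoff=PAYOFF):
    """Replicator right-hand side dx/dt for the sensitive fraction x."""
    f_s = payoff[0][0] * x + payoff[0][1] * (1.0 - x)
    f_r = payoff[1][0] * x + payoff[1][1] * (1.0 - x)
    return x * (1.0 - x) * (f_s - f_r)

def simulate(x0, dt=0.01, steps=20000):
    """Forward-Euler integration of the replicator equation from x0."""
    x = x0
    for _ in range(steps):
        x += dt * growth_rate(x)
    return x

def interior_equilibrium(payoff=PAYOFF):
    """Mixed equilibrium where f_S == f_R, or None if the payoffs admit none."""
    (a, b), (c, d) = payoff
    denom = a - b - c + d
    return (d - b) / denom if denom != 0 else None

x_star = interior_equilibrium()
x_from_high = simulate(0.9)
x_from_low = simulate(0.1)
print(x_star, x_from_high, x_from_low)
```

With these payoffs both starting points settle at the same mixed population, which is the coexistence regime that adaptive therapy arguments rely on. In the project itself, the drug would enter as a control that changes the payoffs over time, and that is where the optimal control analysis begins.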
Key References & Tools
- DeepPurpose — DTI prediction library: github.com/kexinhuang12345/DeepPurpose. Huang et al., Bioinformatics 2020. DOI: 10.1093/bioinformatics/btaa1005
- TorchDrug — ML platform for drug discovery: github.com/DeepGraphLearning/torchdrug. Zhu et al., arXiv:2202.08320
- DeepChem — Deep learning for chemistry: github.com/deepchem/deepchem
- Scallop — Differentiable symbolic reasoning: github.com/scallop-lang/scallop. Li et al., PLDI 2023
- DrugBank — Comprehensive drug database (free academic): go.drugbank.com. Wishart et al., Nucleic Acids Research 2018
- ChEMBL — Bioactive molecule database: ebi.ac.uk/chembl
- FAERS / openFDA — FDA adverse event reporting: open.fda.gov
- KGT framework — KG + LLM drug repositioning: Zhang et al., GigaScience 2025. DOI: 10.1093/gigascience/giae093
- Gavranovic et al. (2024). Categorical Deep Learning: An Algebraic Theory of Architectures. arXiv:2402.15332
- Altrock, Liu & Michor (2015). The mathematics of cancer: integrating quantitative models. Nature Reviews Cancer, 15, 730-745
Ready to work on one of these?
Email with subject "IML: [Project ID]" — include your year, major, and one sentence on why this problem interests you.
Last updated March 2026. Full project specs available on request.