Drug Discovery · 7 Projects · Spring 2026

Drug Discovery
Open Problems

Target identification. Molecular property prediction. Drug repurposing. Knowledge graph reasoning. Seven projects across three difficulty tiers.

Interested? Email me at my Ashoka address with subject "IML: [Project ID]" — e.g. "IML: B1". All levels welcome. I supervise the ML/computational aspects and connect you with biology or domain co-mentors as needed.

Domain Overview

ML-driven drug discovery spans the full pipeline from target identification to clinical candidate selection. The core tasks include predicting drug-target interaction (DTI) affinities from molecular structure, molecular property prediction (solubility, toxicity, ADMET), drug repurposing by matching approved drugs to new targets, and biomarker discovery from high-dimensional transcriptomic or proteomic data. Knowledge graph (KG) reasoning ties these tasks together: by encoding drugs, targets, pathways, and diseases as nodes and their biological relationships as edges, graph neural networks and differentiable symbolic reasoners can propagate evidence across the graph to generate and validate hypotheses.
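As a rough illustration of what "propagating evidence across the graph" can mean, here is a minimal sketch using random walk with restart on a toy graph. This is a simple stand-in for the graph neural network and differentiable-reasoning methods named above, not any project's actual pipeline. All node names, edges, and parameter values (`restart`, `iters`) are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical toy graph: node names and edges are placeholders, not curated biology.
EDGES = [
    ("drug_A", "target_X"),
    ("drug_B", "target_Y"),
    ("target_X", "pathway_P"),
    ("target_Y", "pathway_P"),
    ("pathway_P", "disease_D"),
    ("drug_C", "target_Z"),  # disconnected from disease_D on purpose
]

def build_adjacency(edges):
    """Build an undirected adjacency list from (node, node) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def random_walk_with_restart(adj, seed, restart=0.3, iters=100):
    """Spread evidence outward from `seed`; returns one score per node (scores sum to 1)."""
    nodes = sorted(adj)
    scores = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iters):
        # A fraction `restart` of the mass jumps back to the seed each round.
        nxt = {n: (restart if n == seed else 0.0) for n in nodes}
        for u in nodes:
            # The rest of each node's mass is split evenly among its neighbours.
            share = (1.0 - restart) * scores[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        scores = nxt
    return scores

adj = build_adjacency(EDGES)
scores = random_walk_with_restart(adj, seed="disease_D")
drug_ranking = sorted(
    (n for n in scores if n.startswith("drug_")),
    key=lambda n: scores[n],
    reverse=True,
)
print(drug_ranking)
```

Drugs connected to the seed disease through a target and a pathway receive a positive score, while the disconnected `drug_C` stays at zero. In a real KG the edges would be typed and weighted, and a learned model would replace the uniform split among neighbours.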

The 2026 SOTA is moving fast. NVIDIA and Eli Lilly launched an AI co-innovation lab combining BioNeMo (GPU-accelerated molecular simulation and generative chemistry) with Lilly's medicinal chemistry expertise, targeting hit-to-lead optimization timelines measured in weeks rather than months. The open-source ecosystem has matured: DeepPurpose (Huang et al., 2020) unifies DTI prediction across multiple molecular and protein representations; TorchDrug (Zhu et al., 2022) provides standardized GNN benchmarks for property prediction, generation, and retrosynthesis; DeepChem covers the broadest ML-for-chemistry surface area. On the knowledge graph side, the KGT framework (GigaScience 2025) integrates heterogeneous biomedical KGs with LLMs for drug repositioning, while CancerKG.ORG (Moffitt Cancer Center) links genomic alterations to therapeutic strategies across cancer types. The AI drug discovery market reached $1.94B in 2025 with over 200 AI-designed molecules in clinical development.

The focus areas for these projects are deliberately chosen. NAFLD/NASH (non-alcoholic fatty liver disease / steatohepatitis) affects ~25% of the global population and has only one approved drug, yet ML-specific tooling for NASH biomarker identification and target-directed repurposing remains sparse compared to oncology. Cancer drug-drug interactions (DDIs) are acutely important for polypharmacy patients, and the public FAERS and DrugBank datasets provide a tractable entry point for knowledge graph construction. At the advanced end, evolutionary game theory of drug resistance and categorical deep learning for KG reasoning represent open problems where no published solutions exist.

Connection to Ashish's portfolio: Onco-TTT (oncology hypothesis generation via knowledge graph diffusion and GLiNER2 NER) is the direct parent of projects I14 and A5. The NASH/NAFLD disease focus appears in B1, B9, and I11. The FDA adverse event analysis pipeline (FAERS) is touched in B2. These projects are not self-contained exercises — they extend or interrogate real research infrastructure.

Ashoka Coursework Connection

Course	Topics	Projects
MAT 2202 P&S	DTI model evaluation, ROC analysis, biomarker feature selection	B1, B9, I11
CS 1101 Intro CS	Python, data processing, knowledge graph construction	B2
MAT 2201 Discrete Math	Graph theory, logic, Datalog-style reasoning	B2, I6
CS 3410 Intro to ML	GNNs, ensemble methods, end-to-end neural training	I6, I11, I14
CS 2201 DSA	Graph algorithms, KG traversal, pipeline engineering	I6, I14
MAT 3201 Algebra I / Category Theory	Semiring theory (Scallop), functors, natural transformations	A5
MAT 3302 Differential Equations	Replicator dynamics, optimal control, stability analysis	A12
BIO (Cell & Mol Bio, Genetics)	Protein targets, gene expression data, pathway biology	B1, I11

All 7 Projects

ID	Project	Description	Key Prereqs	Tags
B1	Drug Repurposing for NAFLD/NASH Using DeepPurpose	Train drug-target interaction models on ChEMBL binding data for three validated NASH targets (PNPLA3, TM6SF2, HSD17B13), then screen ~1,500 FDA-approved drugs to rank repurposing candidates. ROC analysis and cross-validation throughout.	P&S, Intro CS	Juneja 1 sem
B2	Cancer Drug Interaction Checker with Knowledge Graphs	Build a small knowledge graph from DrugBank and FAERS for 20-30 oncology drugs, then implement graph traversal queries to identify DDI risks via shared CYP450 enzymes and adverse event co-occurrence. Validate against gold-standard interaction databases.	Intro CS, Discrete Math	1 sem
B9	Molecular Property Prediction with TorchDrug	Compare GNN architectures (GCN, GIN, SchNet) on MoleculeNet benchmarks using scaffold splits, then transfer to a NASH-specific compound set from ChEMBL. Analyze which molecular features correlate with activity against NASH targets via GNN attribution.	P&S, Intro CS	Juneja 1 sem
I6	Drug Discovery KG with Differentiable Reasoning (Scallop)	Build a CYP450 drug interaction knowledge graph and encode pharmacological rules as Scallop logic programs compiled into differentiable provenance semirings. Train a GNN link predictor end-to-end through the symbolic layer; compare against neural-only and symbolic-only baselines.	Intro ML, DSA, Discrete Math	Ishaan D4 1-2 sem
I11	ML-Driven Biomarker Discovery for NASH/NAFLD	Apply ensemble feature selection (random forests, LASSO, stability selection) to public GEO transcriptomic datasets from NASH liver biopsies. Derive a consensus non-invasive biomarker panel, validate on independent cohorts, and compare against recent NAFLD-HCC biomarker publications.	Intro ML, P&S	Juneja 1-2 sem
I14	Graph Diffusion for Oncology Hypothesis Generation	Reproduce the Onco-TTT pipeline (GLiNER2 NER → NetworkX KG → graph diffusion → 6-dimensional validation) for lung cancer, then extend to a new indication (cardiovascular disease, type 2 diabetes, or paediatric neuroblastoma). Compare automated hypotheses against expert assessment.	Intro ML, DSA, Discrete Math	Onco-TTT 1-2 sem
A5	Categorical Deep Learning for Drug Discovery KGs	Model the drug discovery KG as a category (drugs/targets as objects, biological relationships as morphisms), drug interaction patterns as functors, and pharmacological constraints as natural transformations. Prove compositionality properties, integrate Scallop, replace Onco-TTT's heuristic diffusion with the categorical layer, and formalize core theorems in Lean4 as a stretch goal.	Algebra II, Category Theory, Intro ML	Ishaan D4 2 sem
A12	Evolutionary Game Theory of Drug Resistance	Formalize drug resistance as an evolutionary game with replicator dynamics over drug-sensitive and resistant tumor cell populations. Prove existence and stability of Nash equilibria, derive optimal bang-bang control schedules via Pontryagin's maximum principle, extend to a stochastic branching process model, and connect findings to adaptive therapy protocols for GBM and neuroblastoma.	DiffEq, Real Analysis, P&S	2 sem
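
To give a feel for the kind of graph query B2 involves, the sketch below flags drug pairs that share a CYP450 enzyme. The specific rule used here (one drug inhibits an enzyme that metabolises the other) is an assumption made for illustration; the project description only says "shared CYP450 enzymes". The drug names and the triples are hypothetical placeholders, not DrugBank records.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (drug, enzyme, role) triples; placeholders, not DrugBank records.
METABOLISM = [
    ("drug_A", "CYP3A4", "substrate"),
    ("drug_B", "CYP3A4", "inhibitor"),
    ("drug_C", "CYP2D6", "substrate"),
    ("drug_D", "CYP2D6", "substrate"),
    ("drug_B", "CYP2D6", "substrate"),
]

def shared_enzyme_risks(triples):
    """Flag drug pairs where one drug inhibits an enzyme that metabolises the other."""
    # Index the graph by enzyme node: enzyme -> {drug: role}.
    by_enzyme = defaultdict(dict)
    for drug, enzyme, role in triples:
        by_enzyme[enzyme][drug] = role
    risks = []
    for enzyme, roles in sorted(by_enzyme.items()):
        # Every pair of drugs attached to the same enzyme is a two-hop path in the KG.
        for d1, d2 in combinations(sorted(roles), 2):
            if {roles[d1], roles[d2]} == {"substrate", "inhibitor"}:
                risks.append((d1, d2, enzyme))
    return risks

risks = shared_enzyme_risks(METABOLISM)
print(risks)
```

With this toy data, only the `drug_A` / `drug_B` pair on CYP3A4 is flagged; the three drugs that are all substrates of CYP2D6 are not, because the assumed rule requires an inhibitor. A fuller version would also pull in adverse event co-occurrence from FAERS.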

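For A12, the starting point is the two-strategy replicator equation. The following is a minimal sketch, assuming a fixed 2x2 payoff matrix and forward-Euler integration; the payoff values, step size, and function names are hypothetical and chosen only so that the sensitive and resistant populations coexist at an interior equilibrium.

```python
def replicator_step(x, payoff, dt):
    """One forward-Euler step for x, the fraction of drug-sensitive cells."""
    (a, b), (c, d) = payoff
    fit_sensitive = a * x + b * (1.0 - x)   # expected payoff of a sensitive cell
    fit_resistant = c * x + d * (1.0 - x)   # expected payoff of a resistant cell
    # Replicator equation for two strategies: dx/dt = x (1 - x) (f_S - f_R).
    return x + dt * x * (1.0 - x) * (fit_sensitive - fit_resistant)

def simulate(x0, payoff, dt=0.01, steps=5000):
    """Integrate the replicator equation from x0 and return the final fraction."""
    x = x0
    for _ in range(steps):
        x = replicator_step(x, payoff, dt)
    return x

# Hypothetical payoff matrix: rows are (sensitive, resistant), columns the opponent type.
PAYOFF = ((1.0, 3.0), (2.0, 1.0))
(a, b), (c, d) = PAYOFF

# Interior fixed point, obtained by setting f_S = f_R.
x_star = (d - b) / (a - b - c + d)

x_final = simulate(0.1, PAYOFF)
print(x_star, x_final)
```

With these payoffs each cell type does better when rare, so the interior fixed point is stable and the simulated trajectory settles near it. The project itself goes further: it adds a drug control term, proves stability rather than observing it numerically, and derives optimal schedules.
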
Key References & Tools

DeepPurpose — DTI prediction library: github.com/kexinhuang12345/DeepPurpose. Huang et al., Bioinformatics 2020. DOI: 10.1093/bioinformatics/btaa1005
TorchDrug — ML platform for drug discovery: github.com/DeepGraphLearning/torchdrug. Zhu et al., arXiv:2202.08320
DeepChem — Deep learning for chemistry: github.com/deepchem/deepchem
Scallop — Differentiable symbolic reasoning: github.com/scallop-lang/scallop. Li et al., PLDI 2023
DrugBank — Comprehensive drug database (free academic): go.drugbank.com. Wishart et al., Nucleic Acids Research 2018
ChEMBL — Bioactive molecule database: ebi.ac.uk/chembl
FAERS / openFDA — FDA adverse event reporting: open.fda.gov
KGT framework — KG + LLM drug repositioning: Zhang et al., GigaScience 2025. DOI: 10.1093/gigascience/giae093
Gavranovic et al. (2024). Categorical Deep Learning: An Algebraic Theory of Architectures. arXiv:2402.15332
Altrock, Liu & Michor (2015). The mathematics of cancer: integrating quantitative models. Nature Reviews Cancer, 15, 730-745

Ready to work on one of these?

Email with subject "IML: [Project ID]" — include your year, major, and one sentence on why this problem interests you.

Last updated March 2026. Full project specs available on request.