
Drug Discovery · 8 Projects · Spring 2026

Drug Discovery
Open Problems

Target identification. Molecular property prediction. Drug repurposing. Knowledge graph reasoning. Eight projects across three difficulty tiers.

Interested? Email me at my Ashoka address with subject "IML: [Project ID]" — e.g. "IML: B1". All levels welcome. I supervise the ML/computational aspects and connect you with biology or domain co-mentors as needed.

Domain Overview

ML-driven drug discovery spans the full pipeline from target identification to clinical candidate selection. The core tasks are drug-target interaction (DTI) affinity prediction from molecular structure, molecular property prediction (solubility, toxicity, ADMET), drug repurposing by matching approved drugs to new targets, and biomarker discovery from high-dimensional transcriptomic or proteomic data. Knowledge graph (KG) reasoning ties these tasks together: drugs, targets, pathways, and diseases are encoded as nodes and their biological relationships as edges, so that graph neural networks and differentiable symbolic reasoners can propagate evidence across the graph to generate and validate hypotheses.
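As a rough illustration of what "propagating evidence across the graph" can mean, the sketch below runs a random walk with restart over a toy KG, using only the Python standard library. Random walk with restart is swapped in here as a simple stand-in; it is not claimed to be the method used in any of the projects. All node names, relations, and the `restart` value are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical toy triples (head, relation, tail); placeholders, not real biology.
TRIPLES = [
    ("drug_A", "binds", "target_X"),
    ("drug_B", "binds", "target_Y"),
    ("target_X", "member_of", "pathway_P"),
    ("target_Y", "member_of", "pathway_P"),
    ("pathway_P", "implicated_in", "disease_D"),
    ("drug_C", "binds", "target_Z"),  # disconnected from disease_D
]


def build_adjacency(triples):
    """Treat every triple as an undirected edge between its two nodes."""
    adj = defaultdict(set)
    for head, _relation, tail in triples:
        adj[head].add(tail)
        adj[tail].add(head)
    return adj


def random_walk_with_restart(adj, seed, restart=0.3, n_iter=50):
    """Run a fixed number of power-iteration steps starting from the seed node."""
    scores = {node: 0.0 for node in adj}
    scores[seed] = 1.0
    for _ in range(n_iter):
        nxt = {node: 0.0 for node in adj}
        for node, mass in scores.items():
            # Spread the non-restarting share of this node's mass to its neighbours.
            share = (1.0 - restart) * mass / len(adj[node])
            for neighbour in adj[node]:
                nxt[neighbour] += share
        nxt[seed] += restart  # total mass is 1, so the restart mass is `restart`
        scores = nxt
    return scores


adj = build_adjacency(TRIPLES)
scores = random_walk_with_restart(adj, seed="disease_D")
# Rank drugs by how much evidence reaches them from the disease node.
drug_ranking = sorted((n for n in adj if n.startswith("drug_")), key=lambda n: -scores[n])
```

In this toy graph, `drug_A` and `drug_B` receive equal scores through the shared pathway, while `drug_C` receives none because no path connects it to `disease_D`. A learned model such as a GNN would replace the fixed propagation rule with trained, relation-aware message passing.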

The state of the art (SOTA) is moving fast in 2026. NVIDIA and Eli Lilly launched an AI co-innovation lab combining BioNeMo (GPU-accelerated molecular simulation and generative chemistry) with Lilly's medicinal chemistry expertise, targeting hit-to-lead optimization timelines measured in weeks rather than months. The open-source ecosystem has matured: DeepPurpose (Huang et al., 2020) unifies DTI prediction across multiple molecular and protein representations; TorchDrug (Zhu et al., 2022) provides standardized GNN benchmarks for property prediction, generation, and retrosynthesis; and DeepChem covers the broadest ML-for-chemistry surface area. On the knowledge graph side, the KGT framework (GigaScience 2025) integrates heterogeneous biomedical KGs with LLMs for drug repositioning, while CancerKG.ORG (Moffitt Cancer Center) links genomic alterations to therapeutic strategies across cancer types. The AI drug discovery market reached $1.94B in 2025 with over 200 AI-designed molecules in clinical development.

The focus areas for these projects are deliberately chosen. NAFLD/NASH (non-alcoholic fatty liver disease / steatohepatitis) affects ~25% of the global population and has only one approved drug, yet ML-specific tooling for NASH biomarker identification and target-directed repurposing remains sparse compared to oncology. Cancer drug-drug interactions (DDIs) are acutely important for polypharmacy patients, and the public FAERS and DrugBank datasets provide a tractable entry point for knowledge graph construction. At the advanced end, evolutionary game theory of drug resistance and categorical deep learning for KG reasoning represent open problems where no published solutions exist.
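To give a sense of how small a first DDI knowledge graph can be, here is a minimal sketch that flags drug pairs sharing a metabolizing enzyme. The edge list is a hypothetical placeholder rather than curated pharmacology, and the relation names (`substrate_of`, `inhibits`) are assumptions for illustration, not DrugBank or FAERS field names.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (drug, relation, enzyme) edges; placeholders, not curated data.
EDGES = [
    ("drug_A", "substrate_of", "CYP3A4"),
    ("drug_B", "inhibits", "CYP3A4"),
    ("drug_C", "substrate_of", "CYP2D6"),
    ("drug_D", "substrate_of", "CYP3A4"),
]


def flag_shared_enzyme_pairs(edges):
    """Return (drug_1, drug_2, enzyme, risk_label) for every pair sharing an enzyme."""
    by_enzyme = defaultdict(dict)  # enzyme -> {drug: relation}
    for drug, relation, enzyme in edges:
        by_enzyme[enzyme][drug] = relation

    flags = []
    for enzyme, drugs in by_enzyme.items():
        for d1, d2 in combinations(sorted(drugs), 2):
            relations = {drugs[d1], drugs[d2]}
            # An inhibitor paired with a substrate of the same enzyme is
            # labelled separately from two drugs that merely share the enzyme.
            if relations == {"inhibits", "substrate_of"}:
                risk = "inhibitor-substrate"
            else:
                risk = "shared-enzyme"
            flags.append((d1, d2, enzyme, risk))
    return flags


flags = flag_shared_enzyme_pairs(EDGES)
```

Here `drug_C` is never flagged because no other drug touches CYP2D6. A real pipeline would add adverse event co-occurrence as a second edge type and validate the flags against a reference interaction database.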

Connection to Ashish's portfolio: Onco-TTT (oncology hypothesis generation via knowledge graph diffusion and GLiNER2 NER) is the direct parent of projects I14 and A5. The NASH/NAFLD disease focus appears in B1, B9, and I11. The FDA adverse event analysis pipeline (FAERS) is touched in B2. These projects are not self-contained exercises — they extend or interrogate real research infrastructure.

Ashoka Coursework Connection

MAT 2202 P&S: DTI model evaluation, ROC analysis, biomarker feature selection → B1, B9, I11
CS 1101 Intro CS: Python, data processing, knowledge graph construction → B2
MAT 2201 Discrete Math: Graph theory, logic, Datalog-style reasoning → B2, I6
CS 3410 Intro to ML: GNNs, ensemble methods, end-to-end neural training → I6, I11, I14
CS 2201 DSA: Graph algorithms, KG traversal, pipeline engineering → I6, I14
MAT 3201 Algebra I / Category Theory: Semiring theory (Scallop), functors, natural transformations → A5
MAT 3302 Differential Equations: Replicator dynamics, optimal control, stability analysis → A12
BIO (Cell & Mol Bio, Genetics): Protein targets, gene expression data, pathway biology → B1, I11
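
The ROC analysis listed under P&S can be computed directly from its probabilistic definition: the AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below is a plain-Python version for small datasets; the function name and the example labels and scores are illustrative assumptions, and in practice a library implementation would normally be used.

```python
def roc_auc(labels, scores):
    """AUC as P(score of random positive > score of random negative); ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    # Compare every positive/negative pair (quadratic, fine for small sets).
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical binder (1) / non-binder (0) labels with model scores.
example_auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3])
```

In the example, three of the four positive/negative pairs are ordered correctly, so `example_auc` is 0.75.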

All 8 Projects

ID | Project | Description | Key Prereqs | Tags
B1 | Drug Repurposing for NAFLD/NASH Using DeepPurpose | Train drug-target interaction models on ChEMBL binding data for three validated NASH targets (PNPLA3, TM6SF2, HSD17B13), then screen ~1,500 FDA-approved drugs to rank repurposing candidates. ROC analysis and cross-validation throughout. | P&S, Intro CS | Juneja, 1 sem
B2 | Cancer Drug Interaction Checker with Knowledge Graphs | Build a small knowledge graph from DrugBank and FAERS for 20-30 oncology drugs, then implement graph traversal queries to identify DDI risks via shared CYP450 enzymes and adverse event co-occurrence. Validate against gold-standard interaction databases. | Intro CS, Discrete Math | 1 sem
B9 | Molecular Property Prediction with TorchDrug | Compare GNN architectures (GCN, GIN, SchNet) on MoleculeNet benchmarks using scaffold splits, then transfer to a NASH-specific compound set from ChEMBL. Analyze which molecular features correlate with activity against NASH targets via GNN attribution. | P&S, Intro CS | Juneja, 1 sem
I6 | Drug Discovery KG with Differentiable Reasoning (Scallop) | Build a CYP450 drug interaction knowledge graph and encode pharmacological rules as Scallop logic programs compiled into differentiable provenance semirings. Train a GNN link predictor end-to-end through the symbolic layer; compare against neural-only and symbolic-only baselines. | Intro ML, DSA, Discrete Math | Ishaan D4, 1-2 sem
I11 | ML-Driven Biomarker Discovery for NASH/NAFLD | Apply ensemble feature selection (random forests, LASSO, stability selection) to public GEO transcriptomic datasets from NASH liver biopsies. Derive a consensus non-invasive biomarker panel, validate on independent cohorts, and compare against recent NAFLD-HCC biomarker publications. | Intro ML, P&S | Juneja, 1-2 sem
I14 | Graph Diffusion for Oncology Hypothesis Generation | Reproduce the Onco-TTT pipeline (GLiNER2 NER → NetworkX KG → graph diffusion → 6-dimensional validation) for lung cancer, then extend to a new indication (cardiovascular disease, T2 diabetes, or paediatric neuroblastoma). Compare automated hypotheses against expert assessment. | Intro ML, DSA, Discrete Math | Onco-TTT, 1-2 sem
A5 | Categorical Deep Learning for Drug Discovery KGs | Model the drug discovery KG as a category (drugs/targets as objects, biological relationships as morphisms), drug interaction patterns as functors, and pharmacological constraints as natural transformations. Prove compositionality properties, integrate Scallop, replace Onco-TTT's heuristic diffusion with the categorical layer, and formalize core theorems in Lean4 as a stretch goal. | Algebra II, Category Theory, Intro ML | Ishaan D4, 2 sem
A12 | Evolutionary Game Theory of Drug Resistance | Formalize drug resistance as an evolutionary game with replicator dynamics over drug-sensitive and resistant tumor cell populations. Prove existence and stability of Nash equilibria, derive optimal bang-bang control schedules via Pontryagin's maximum principle, extend to a stochastic branching process model, and connect findings to adaptive therapy protocols for GBM and neuroblastoma. | DiffEq, Real Analysis, P&S | 2 sem
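
For readers new to the replicator dynamics mentioned in A12, the following is a minimal numerical sketch for two cell types, where x is the resistant fraction and dx/dt = x (1 - x) (f_R - f_S). The payoff matrix, the `drug_kill` penalty, and the on/off schedule are hypothetical values chosen for illustration. The intermittent schedule is a hand-picked example of an on/off (bang-bang style) policy, not an optimum derived from Pontryagin's maximum principle.

```python
def simulate_resistant_fraction(x0, payoff, drug_kill, schedule, dt=0.01):
    """Euler-integrate dx/dt = x (1 - x) (f_R - f_S) for the resistant fraction x."""
    # Rows: sensitive, resistant. Columns: opponent is sensitive, resistant.
    (a, b), (c, d) = payoff
    x = x0
    trajectory = [x]
    for drug_on in schedule:
        # Only sensitive cells pay the drug penalty while the drug is on.
        f_s = a * (1 - x) + b * x - (drug_kill if drug_on else 0.0)
        f_r = c * (1 - x) + d * x
        x += dt * x * (1 - x) * (f_r - f_s)
        x = min(max(x, 0.0), 1.0)  # keep the fraction in [0, 1]
        trajectory.append(x)
    return trajectory


# Hypothetical payoffs: resistance carries a constant fitness cost of 0.2.
PAYOFF = ((1.0, 1.0), (0.8, 0.8))
STEPS = 2000  # 20 time units at dt = 0.01

continuous = simulate_resistant_fraction(0.01, PAYOFF, 0.5, [True] * STEPS)
no_drug = simulate_resistant_fraction(0.01, PAYOFF, 0.5, [False] * STEPS)
# Alternate 200 steps on, 200 steps off.
on_off = [(k // 200) % 2 == 0 for k in range(STEPS)]
intermittent = simulate_resistant_fraction(0.01, PAYOFF, 0.5, on_off)
```

Under these assumed numbers, continuous dosing drives the resistant fraction up, no dosing lets the cost of resistance drive it down, and the intermittent schedule ends with a much smaller resistant fraction than continuous dosing. That qualitative trade-off is what adaptive therapy protocols try to exploit; the project itself asks for proofs and optimal schedules rather than simulation alone.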

Key References & Tools

  • DeepPurpose — DTI prediction library: github.com/kexinhuang12345/DeepPurpose. Huang et al., Bioinformatics 2020. DOI: 10.1093/bioinformatics/btaa1005
  • TorchDrug — ML platform for drug discovery: github.com/DeepGraphLearning/torchdrug. Zhu et al., arXiv:2202.08320
  • DeepChem — Deep learning for chemistry: github.com/deepchem/deepchem
  • Scallop — Differentiable symbolic reasoning: github.com/scallop-lang/scallop. Li et al., PLDI 2023
  • DrugBank — Comprehensive drug database (free academic): go.drugbank.com. Wishart et al., Nucleic Acids Research 2018
  • ChEMBL — Bioactive molecule database: ebi.ac.uk/chembl
  • FAERS / openFDA — FDA adverse event reporting: open.fda.gov
  • KGT framework — KG + LLM drug repositioning: Zhang et al., GigaScience 2025. DOI: 10.1093/gigascience/giae093
  • Gavranovic et al. (2024). Categorical Deep Learning: An Algebraic Theory of Architectures. arXiv:2402.15332
  • Altrock, Liu & Michor (2015). The mathematics of cancer: integrating quantitative models. Nature Reviews Cancer, 15, 730-745

Ready to work on one of these?

Email with subject "IML: [Project ID]" — include your year, major, and one sentence on why this problem interests you.


Last updated March 2026. Full project specs available on request.