This project aims to predict new therapeutic indications for small molecules. We will focus on repurposing drugs for well-studied complex human diseases, relying on recently available high-throughput data sources. The approach is integrative, seeking to combine multiple information domains through heterogeneous networks and modern machine learning techniques.


  1. Create an open resource for integrative drug repurposing.
    A slew of bioinformatics resources have recently come online. Yet these resources frequently rely on different vocabularies and require cleaning. We plan to release a network capturing a systems biology perspective of drug efficacy. Standardized vocabularies will form a network template and will be connected by the results of high-throughput experimentation. Structuring the resource as a network ensures the data is processed and reusable. The complete resource will be posted to our online portal for heterogeneous data integration under a CC-BY license and will provide a go-to dataset for researchers implementing systems/network/computational pharmacology analyses.

  2. Compare and identify influential mechanisms of drug efficacy.
    The mechanism of action is poorly understood for many prevalent small molecule therapies. Discovering mechanisms through target identification has proven difficult. High-throughput drug and disease signatures offer an alternative approach for investigating a compound's mechanism of action. We plan to identify influential network connections that underlie drug efficacy. Greater insight into how existing drugs work will help us understand current treatments and predict future treatments. More immediately, our results will compare the informativeness of data sources, providing a solid foundation for computational drug repurposing.

  3. Predict probabilities for each small molecule's efficacy in treating each complex disease.
    Serendipity was responsible for the discovery of many breakthrough small molecule therapies [1]. Additionally, small molecules frequently treat multiple distinct diseases [2]. Rational systems-based drug repurposing overcomes the inefficiency and unreliability of serendipity, while capturing the benefits of polypharmacology and network pharmacology. From the integrative network, we will predict the probability that a given small molecule will treat a given complex disease. Our predictions will provide pharmacologists with evidence-based drug leads, which could develop into novel approved uses for existing drugs.


Pharmaceutical companies seeking to bring a novel therapeutic compound to market face a single-digit success rate, a price tag in the billions, and a timeline spanning decades [3]. The trend in research efficiency is equally grim: the cost of developing a new drug has increased exponentially, doubling approximately every nine years since 1950 [4].

Since the 1990s, the prevailing model of drug discovery has focused on identifying compounds that target a single protein with maximum specificity. Through a molecular, reductionist approach to understanding disease, a plausible target is selected. Drugs are then designed to modulate the target, or small molecules with a strong target affinity are identified using high-throughput screens. However, overwhelming evidence suggests that the potential of the 'one drug, one target, one disease' approach is limited. Biological systems are characterized by phenotypic robustness: knockout experiments in model organisms reveal that fewer than one fifth of genes are essential for survival [5]. Similarly, pathology may represent a resilient homeostatic state, resistant to disruptions of a single protein. Approved small molecules affect on average 2.7 known targets, and when accounting for speculative targets that number jumps to 6.3 [6]. This promiscuity can play an important role in drug efficacy, as exemplified by clozapine, which remains the preeminent antipsychotic drug over compounds engineered to bind a subset of its dozen-plus targets [7].

Uncovering disease therapies that rely on multiple mechanisms, known as polypharmacology, requires escaping the limitations of the 'magic bullet' paradigm in favor of a 'magic 00 buckshot' understanding of drug efficacy. An approach called network pharmacology seeks to characterize the multitude of corruptions embodying a pathology. With that knowledge, drugs are selected to restore a normal state. Network pharmacology encompasses polypharmacology by evaluating drugs which intervene at multiple points to achieve healthy homeostasis.

Drug repurposing, the identification of novel uses for existing therapeutics, avoids many pitfalls and challenges of designing drugs from scratch. FDA-approved drugs have undergone extensive toxicology profiling during development and safety evaluation in phase III clinical trials. Given ample time on the market, post-marketing trials and adverse event reporting uncover potential flaws that could lead to withdrawal. The wealth of information surrounding approved drugs gives repurposed compounds a more favorable outlook than new molecular entities: time to approval is cut in half to as low as three years [3]; the success rate of advancing from phase II trials to approval increases from 10 to 25 percent [8]; and the average development cost for successful drugs plummets from 1.3 billion to as low as 8.4 million dollars [9].

Between 1999 and 2008, more first-in-class small-molecule drugs were discovered with phenotypic screening than with target-centric approaches, despite preferential investment in the latter [10]. The advent of omics technologies has enabled the quantification of several intermediate phenotypes between a disease's (or drug's) molecular basis and clinical manifestation. Intermediate phenotypes include transcriptional profiles, biological pathways, and genetic susceptibility markers. Traditional phenotypic approaches have focused on in vivo screening to identify compounds that alter a primary clinical indicator. In silico screening that instead relies on intermediate phenotypes offers a less costly and less time-consuming way forward. Such approaches readily accommodate repurposing, polypharmacology, and network pharmacology.

We propose an integrative method for repurposing approved small molecules to treat additional complex diseases. The approach relies on characterizing the effect of compounds and diseases using high-throughput resources — many of which provide intermediate-phenotypic profiles for compounds and diseases — and, from this information, calculating features that describe specific aspects of a compound-disease relationship. From these features, a machine learning approach identifies the influential mechanisms behind drug efficacy and predicts additional indications for existing drugs.

We chose to focus on complex diseases because they frequently exhibit:

  • poorly understood molecular bases
  • modest efficacy of approved therapies
  • multifactorial etiologies that highlight multiple modalities for intervention
  • good coverage in high-throughput bioinformatic resources

We chose to focus on small molecules because they exhibit:

  • known structures
  • greater data-availability than biologics
  • promiscuous binding, which enables polypharmacology
  • incomplete target knowledge, which can be overcome with phenotypic profiling

Research Plan

Part 1. Resource Construction

First, we will construct a resource that encodes a systems perspective of pathogenesis and pharmacology. We will structure the resource as a network where entities (nodes) are connected by their relationships (edges). Nodes and edges belong to predefined types — respectively called metanodes (Table 1) and metaedges (Table 2). The schematic view showing how types relate is called a metagraph (Figure 1).
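The typed-network idea above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class name `HetNet`, the edge kinds `"target"` and `"association"`, and the node identifiers are hypothetical and are not taken from the planned resource.

```python
from collections import defaultdict


class HetNet:
    """Minimal typed graph: every node has a metanode, every edge a metaedge."""

    def __init__(self, metaedges):
        # metaedges: (source metanode, target metanode, kind) tuples allowed by the metagraph
        self.metaedges = set(metaedges)
        self.node_type = {}                # node id -> metanode
        self.adjacency = defaultdict(set)  # (node id, kind) -> neighbor ids

    def add_node(self, node_id, metanode):
        self.node_type[node_id] = metanode

    def add_edge(self, source, target, kind):
        metaedge = (self.node_type[source], self.node_type[target], kind)
        if metaedge not in self.metaedges:
            raise ValueError(f"edge type not in metagraph: {metaedge}")
        # store both directions so paths can be traversed either way
        self.adjacency[(source, kind)].add(target)
        self.adjacency[(target, kind)].add(source)

    def neighbors(self, node_id, kind):
        return self.adjacency.get((node_id, kind), set())


# Hypothetical metagraph with two metaedges
metagraph = {("Compound", "Gene", "target"), ("Disease", "Gene", "association")}
net = HetNet(metagraph)
net.add_node("compound_1", "Compound")
net.add_node("gene_1", "Gene")
net.add_node("disease_1", "Disease")
net.add_edge("compound_1", "gene_1", "target")
net.add_edge("disease_1", "gene_1", "association")
```

Checking each edge against the metagraph at insertion time is one way to keep the network consistent with its template; other designs are equally plausible.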

Table 1. Metanodes

The network will consist of the following node types. Domain-specific vocabularies provide standardized terminologies for each node type.

Node type | Vocabulary | Reference
Disease | Disease Ontology | [12]
Gene | Entrez Gene | [13]
Cellular Component | Gene Ontology | [15, 16]
Molecular Function | Gene Ontology | [15, 16]
Biological Process | Gene Ontology | [15, 16]
Perturbation Gene Set | Molecular Signatures Database (MSigDB 5.0) | [17]
Pathway | Molecular Signatures Database (MSigDB 5.0) | [17]
Side Effect | UMLS Metathesaurus | [19]

Table 2. Metaedges

The network will consist of the following edge types. High-throughput bioinformatics resources provide the necessary information for connecting nodes.

Source | Target | Edge type | Resource | Reference
Compound | Disease | Indication | LabeledIn | [22, 23]
Compound | Compound | Similarity | Dice index of ECFPs | [26]
Compound | Side Effect | Causation | OFFSIDES | [28]
Compound | Side Effect | Causation | SIDER 4 | [29]
Disease | Gene | Variation | GWAS Catalog | [30]
Disease | Symptom | Causation | MEDLINE Cooccurrence |
Disease | Anatomy | Localization | MEDLINE Cooccurrence |
Disease | Disease | Similarity | MEDLINE Cooccurrence |
Gene | Gene | Interaction | Human Interactome Project | [35]
Gene | Gene | Interaction | The Incomplete Interactome | [36]
Gene | Gene | Evolution | Evolutionary Rate Covariation | [37]
Gene | Pathway | Membership | MSigDB 5.0 | [17]
Gene | Perturbation Gene Set | Regulation | MSigDB 5.0 | [17]
Gene | Biological Process | Membership | Gene Ontology annotations | [15, 16]
Gene | Molecular Function | Membership | Gene Ontology annotations | [15, 16]
Gene | Cellular Component | Membership | Gene Ontology annotations | [15, 16]

Each node type will be populated using a domain-specific vocabulary (Table 1). Controlled vocabularies provide a backbone for data integration, ensure entities are conceptually unique, and enable easy annotation for future users. Edges will be extracted from high-throughput bioinformatics resources (Table 2). We aim to incorporate resources that are high-throughput, high-quality, and publicly available. When possible, we will employ systematic resources that circumvent knowledge biases.
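As an illustration of one edge type in Table 2, compound-compound similarity is listed as the Dice index of ECFPs. The sketch below computes a Dice index on fingerprints represented as sets of on-bit indices. The fingerprints are made-up values, and the set representation and the handling of two empty fingerprints are assumptions made here; in practice a cheminformatics toolkit would generate the fingerprints.

```python
def dice_index(bits_a, bits_b):
    """Dice coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical fingerprints: two shared bits out of 4 + 3 on-bits
fingerprint_1 = {3, 17, 42, 101}
fingerprint_2 = {3, 42, 77}
similarity = dice_index(fingerprint_1, fingerprint_2)
print(round(similarity, 3))  # 0.571
```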

Figure 1. Metagraph of the heterogeneous network

A schematic view of the node and edge types composing the network.

We are currently exploring various resources to provide a high-throughput catalog of indications. Feedback here would be appreciated.

Part 2. Discovering Mechanisms of Drug Efficacy

Features describe the relationship between a compound and a disease. Each feature measures a certain aspect of a compound-disease relationship: for example, whether the compound targets a susceptibility gene of the disease or whether the compound downregulates genes that are overexpressed in the disease state. Features that distinguish therapeutic from non-therapeutic compound-disease pairs represent mechanisms of drug efficacy. We refer to the discriminatory power of each feature as its performance, which indicates the feature's pharmacological importance. By comparing performance across features, we can contrast the informativeness of orthogonal domains of information. Finally, features that describe the same general relationship but draw on different data sources can identify the most informative resource or technology among several.
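One way to quantify the discriminatory power of a single feature is the area under the ROC curve (AUROC). The text above does not name a metric, so AUROC is an assumption here, and the feature names and values are hypothetical. The sketch scores each feature by the probability that a therapeutic pair receives a higher value than a non-therapeutic pair.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 1 = therapeutic pair, 0 = non-therapeutic pair
labels = [1, 1, 0, 0, 0]
features = {
    "feature_a": [0.9, 0.4, 0.5, 0.1, 0.0],  # mostly separates the classes
    "feature_b": [0.2, 0.2, 0.2, 0.2, 0.2],  # carries no information
}
performance = {name: auroc(values, labels) for name, values in features.items()}
```

Under this metric an uninformative feature scores 0.5 and a perfectly discriminating one scores 1.0, which makes features from different data sources directly comparable.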

Features will be computed from the network. Each feature will measure the prevalence of a specific type of path between a compound and disease. This approach was initially developed for social network analysis [40], and later adapted by us for predicting disease-associated genes [41]. Briefly, the method identifies all paths between a source and target node that follow a specified type of path (metapath). The contribution of each path is weighted by its specificity: paths through high-degree nodes, which are likely to be less informative, are downweighted. The sum of the weighted paths results in a value of 0 or greater, where 0 indicates no connectivity. We plan to use the degree-weighted path count metric for computing features [41]. The interpretation of a specific feature depends on its corresponding metapath. Table 3 provides example metapaths and describes their pharmacological significance.
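The path-weighting idea described above can be sketched as follows. This is a simplified reading, not the reference implementation from [41]: it assumes each edge on a path is weighted by the product of its endpoints' degrees for that edge type, raised to a negative damping exponent. The default exponent of 0.5, the function names, and the toy network are all assumptions.

```python
from collections import defaultdict


def build_adjacency(edges):
    """edges: iterable of (node_u, node_v, metaedge). Returns metaedge -> node -> neighbors."""
    adjacency = defaultdict(lambda: defaultdict(set))
    for u, v, metaedge in edges:
        adjacency[metaedge][u].add(v)
        adjacency[metaedge][v].add(u)
    return adjacency


def dwpc(adjacency, source, target, metapath, damping=0.5):
    """Sum of degree-weighted paths from source to target along a list of metaedge names."""
    total = 0.0

    def walk(node, step, visited, weight):
        nonlocal total
        if step == len(metapath):
            if node == target:
                total += weight
            return
        metaedge = metapath[step]
        neighbors = adjacency[metaedge].get(node, set())
        for nxt in neighbors:
            if nxt in visited:
                continue  # count paths, not walks: no repeated nodes
            # metaedge-specific degrees of both endpoints of this edge
            degree_product = len(neighbors) * len(adjacency[metaedge][nxt])
            walk(nxt, step + 1, visited | {nxt}, weight * degree_product ** -damping)

    walk(source, 0, {source}, 1.0)
    return total


# Toy network with hypothetical identifiers and edge names
edges = [
    ("compound", "gene_1", "targets"),
    ("compound", "gene_2", "targets"),
    ("gene_1", "disease", "associates"),
    ("gene_2", "disease", "associates"),
    ("gene_2", "other_disease", "associates"),
]
adjacency = build_adjacency(edges)
feature = dwpc(adjacency, "compound", "disease", ["targets", "associates"])
path_count = dwpc(adjacency, "compound", "disease", ["targets", "associates"], damping=0.0)
```

With `damping=0.0` the function reduces to a plain path count (two paths here). With positive damping, the path through `gene_2`, which has the higher degree, contributes less than the path through `gene_1`.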

Table 3. The interpretation of features for select metapaths

Features measure the prevalence of a specific metapath between the source compound and target disease. Metapaths are abbreviated using the first letter of each metanode (uppercase) and metaedge (lowercase). Refer to Figure 1 for metanode and metaedge names.

Metapath | Measures the extent that ...
CuGdD | genes downregulated by the disease are upregulated by the compound
CtGaD | the compound targets genes associated with the disease
CtGiGaD | the compound targets genes that interact with genes associated with the disease
CcScCiD | the compound shares side effects with compounds indicated for the disease
CtGeTlD | the compound targets genes that are expressed in tissues affected by the disease
CiDmPmD | the compound is indicated for diseases with the same pathophysiology as the target disease
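The abbreviation scheme used in Table 3 (uppercase letters for metanodes, lowercase letters for metaedges) can be parsed mechanically. The helper below is a hypothetical sketch that assumes every metanode and metaedge is abbreviated by exactly one letter, as in the table. It does not map letters back to full names, since those are defined in Figure 1.

```python
def split_metapath(abbreviation):
    """Split e.g. 'CtGaD' into metanode letters and metaedge letters."""
    nodes = list(abbreviation[0::2])
    edges = list(abbreviation[1::2])
    valid = (
        len(abbreviation) % 2 == 1
        and all(ch.isupper() for ch in nodes)
        and all(ch.islower() for ch in edges)
    )
    if not valid:
        raise ValueError(f"not an alternating metapath abbreviation: {abbreviation!r}")
    return nodes, edges


nodes, edges = split_metapath("CtGiGaD")
print(nodes)  # ['C', 'G', 'G', 'D']
print(edges)  # ['t', 'i', 'a']
```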

Metapath-based approaches have several advantages for predictive data integration including [41]:

  • versatility (most biological phenomena are decomposable into entities connected by relationships)
  • scalability (no theoretical limit to metagraph complexity or graph size)
  • efficiency (low marginal cost to including an additional network component)

Harnessing these advantages, we hope to evaluate and compare a diverse and broad set of potential mechanisms of efficacy.

Part 3. Predicting Probabilities of Efficacy

We plan to predict probabilities of efficacy for compound-disease pairs using heterogeneous network edge prediction [40, 41]. This approach trains a model from the network-based features and can return a probability of efficacy for any compound-disease pair. Previously, our implementation relied on regularized logistic regression, but modern software packages will allow us to rigorously evaluate a broad range of machine learning algorithms.
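To make the prediction step concrete, the sketch below fits a regularized logistic regression on network-derived features and returns a probability for a new compound-disease pair. It is only an illustration: the L2 penalty, the plain gradient descent optimizer, the hyperparameter values, and the toy feature matrix are all assumptions, and a real analysis would rely on an established machine learning package.

```python
import math


def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, l2=0.1, learning_rate=0.5, epochs=2000):
    """Batch gradient descent on mean log loss plus an L2 penalty on the weights."""
    n, n_features = len(rows), len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += error
            for j, v in enumerate(x):
                grad_w[j] += error * v
        # the bias is left unpenalized
        weights = [w - learning_rate * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict_probability(weights, bias, x):
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Hypothetical feature vectors for six compound-disease pairs (1 = known indication)
rows = [[0.9, 0.7], [0.8, 0.1], [0.6, 0.5], [0.1, 0.2], [0.0, 0.4], [0.2, 0.0]]
labels = [1, 1, 1, 0, 0, 0]
weights, bias = fit_logistic(rows, labels)
likely = predict_probability(weights, bias, [0.85, 0.3])
unlikely = predict_probability(weights, bias, [0.05, 0.3])
```

The fitted weights also give a rough view of which features drive the predictions, which ties this step back to the mechanism comparison in Part 2.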

Open science

We are committed to a transparent, freely available, reusable, and reproducible scientific process and believe open science can revolutionize medicine [42]. To this end, we will release all project-related code on GitHub. Datasets that are too large for GitHub will be published on figshare. All original materials will be released under CC-BY (requires attribution) or CC-0 (public domain) licenses. Derivatives of restrictively licensed works will be released under the most permissive option available. Our analyses will be made available in real time using GitHub Pages to host R Markdown documents and IPython Notebooks. Finally, we plan to follow 10 proposed rules for reproducible research in computational biology [43].


Share your ideas by commenting on an existing discussion or by starting a new thread. Our goal in joining ThinkLab is to generate as much interaction as possible.

Team & Resources

Daniel Himmelstein is a PhD candidate in the Biological & Medical Informatics program at UCSF. Daniel works in the Sergio Baranzini Lab, whose mission is to apply cutting-edge bioinformatic approaches to genomic data, with a focus on multiple sclerosis. Dr. Baranzini has extensive experience with data integration, genomic profiling, and disease bioinformatics.

UCSF and the surrounding Bay Area are hotspots for drug development and data analytics. The team has access to QB3 resources, which include a computing cluster and small molecule discovery center.

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant Number 1144247. We would like to thank BrowserStack for providing cross-browser testing, which helps us share our research compatibly across browsers.


In silico drug repositioning – what we need to know
Zhichao Liu, Hong Fang, Kelly Reagan, Xiaowei Xu, Donna L. Mendrick, William Slikker, Weida Tong (2013) Drug Discovery Today. doi:10.1016/j.drudis.2012.08.005
Drug repositioning: identifying and developing new uses for existing drugs
Ted T. Ashburn, Karl B. Thor (2004) Nat Rev Drug Discov. doi:10.1038/nrd1468
Increasing R&D Spending per New Drug Approval
Daniel Himmelstein, Sergio Baranzini (2014) Figshare. doi:10.6084/m9.figshare.937004
Network pharmacology: the next paradigm in drug discovery
Andrew L Hopkins (2008) Nature Chemical Biology. doi:10.1038/nchembio.118
Data completeness—the Achilles heel of drug-target networks
Jordi Mestres, Elisabet Gregori-Puigjané, Sergi Valverde, Ricard V Solé (2008) Nat Biotechnol. doi:10.1038/nbt0908-983
Magic shotguns versus magic bullets: selectively non-selective drugs for mood disorders and schizophrenia
Bryan L. Roth, Douglas J. Sheffler, Wesley K. Kroeze (2004) Nat Rev Drug Discov. doi:10.1038/nrd1346
How were new medicines discovered?
David C. Swinney, Jason Anthony (2011) Nat Rev Drug Discov. doi:10.1038/nrd3480
DrugBank 4.0: shedding new light on drug metabolism
V. Law, C. Knox, Y. Djoumbou, T. Jewison, A. C. Guo, Y. Liu, A. Maciejewski, D. Arndt, M. Wilson, V. Neveu, A. Tang, G. Gabriel, C. Ly, S. Adamjee, Z. T. Dame, B. Han, Y. Zhou, D. S. Wishart (2013) Nucleic Acids Research. doi:10.1093/nar/gkt1068
Disease Ontology: a backbone for disease semantic integration
L. M. Schriml, C. Arze, S. Nadendla, Y.-W. W. Chang, M. Mazaitis, V. Felix, G. Feng, W. A. Kibbe (2011) Nucleic Acids Research. doi:10.1093/nar/gkr972
Entrez Gene: gene-centered information at NCBI
D. Maglott (2004) Nucleic Acids Research. doi:10.1093/nar/gki031
Uberon, an integrative multi-species anatomy ontology
Christopher J Mungall, Carlo Torniai, Georgios V Gkoutos, Suzanna E Lewis, Melissa A Haendel (2012) Genome Biol. doi:10.1186/gb-2012-13-1-r5
Gene Ontology: tool for the unification of biology
Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald M. Rubin, Gavin Sherlock (2000) Nat Genet. doi:10.1038/75556
Molecular signatures database (MSigDB) 3.0
A. Liberzon, A. Subramanian, R. Pinchback, H. Thorvaldsdottir, P. Tamayo, J. P. Mesirov (2011) Bioinformatics. doi:10.1093/bioinformatics/btr260
WikiPathways: Pathway Editing for the People
Alexander R. Pico, Thomas Kelder, Martijn P. van Iersel, Kristina Hanspers, Bruce R. Conklin, Chris Evelo (2008) PLoS Biol. doi:10.1371/journal.pbio.0060184
Development and evaluation of an ensemble resource linking medications to their indications
W.-Q. Wei, R. M. Cronin, H. Xu, T. A. Lasko, L. Bastarache, J. C. Denny (2013) Journal of the American Medical Informatics Association. doi:10.1136/amiajnl-2012-001431
LabeledIn: Cataloging labeled indications for human drugs
Ritu Khare, Jiao Li, Zhiyong Lu (2014) Journal of Biomedical Informatics. doi:10.1016/j.jbi.2014.08.004
Scaling drug indication curation through crowdsourcing
R. Khare, J. D. Burger, J. S. Aberdeen, D. W. Tresner-Kirsch, T. J. Corrales, L. Hirchman, Z. Lu (2015) Database. doi:10.1093/database/bav016
Development and evaluation of a crowdsourcing methodology for knowledge base construction: identifying relationships between clinical problems and medications
A. B. McCoy, A. Wright, A. Laxmisan, M. J. Ottosen, J. A. McCoy, D. Butten, D. F. Sittig (2012) Journal of the American Medical Informatics Association. doi:10.1136/amiajnl-2012-000852
Extended-Connectivity Fingerprints
David Rogers, Mathew Hahn (2010) Journal of Chemical Information and Modeling. doi:10.1021/ci100050t
PREDICT: a method for inferring novel drug indications with application to personalized medicine
A. Gottlieb, G. Y. Stein, E. Ruppin, R. Sharan (2011) Molecular Systems Biology. doi:10.1038/msb.2011.26
BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities
T. Liu, Y. Lin, X. Wen, R. N. Jorissen, M. K. Gilson (2007) Nucleic Acids Research. doi:10.1093/nar/gkl999
Data-Driven Prediction of Drug Effects and Interactions
N. P. Tatonetti, P. P. Ye, R. Daneshjou, R. B. Altman (2012) Science Translational Medicine. doi:10.1126/scitranslmed.3003377
A side effect resource to capture phenotypic effects of drugs
Michael Kuhn, Monica Campillos, Ivica Letunic, Lars Juhl Jensen, Peer Bork (2010) Mol Syst Biol. doi:10.1038/msb.2009.98
The NHGRI GWAS Catalog, a curated resource of SNP-trait associations
D. Welter, J. MacArthur, J. Morales, T. Burdett, P. Hall, H. Junkins, A. Klemm, P. Flicek, T. Manolio, L. Hindorff, H. Parkinson (2013) Nucleic Acids Research. doi:10.1093/nar/gkt1229
DISEASES: Text mining and data integration of disease–gene associations
Sune Pletscher-Frankild, Albert Pallejà, Kalliopi Tsafou, Janos X. Binder, Lars Juhl Jensen (2015) Methods. doi:10.1016/j.ymeth.2014.11.020
DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes
J. Pinero, N. Queralt-Rosinach, A. Bravo, J. Deu-Pons, A. Bauer-Mehren, M. Baron, F. Sanz, L. I. Furlong (2015) Database. doi:10.1093/database/bav028
A Framework for Annotating Human Genome in Disease Context
Wei Xu, Huisong Wang, Wenqing Cheng, Dong Fu, Tian Xia, Warren A. Kibbe, Simon M. Lin (2012) PLoS ONE. doi:10.1371/journal.pone.0049686
A Proteome-Scale Map of the Human Interactome Network
Thomas Rolland, Murat Taşan, Benoit Charloteaux, Samuel J. Pevzner, Quan Zhong, Nidhi Sahni, Song Yi, Irma Lemmens, Celia Fontanillo, Roberto Mosca, Atanas Kamburov, Susan D. Ghiassian, Xinping Yang, Lila Ghamsari, Dawit Balcha, Bridget E. Begg, Pascal Braun, Marc Brehme, Martin P. Broly, Anne-Ruxandra Carvunis, Dan Convery-Zupan, Roser Corominas, Jasmin Coulombe-Huntington, Elizabeth Dann, Matija Dreze, Amélie Dricot, Changyu Fan, Eric Franzosa, Fana Gebreab, Bryan J. Gutierrez, Madeleine F. Hardy, Mike Jin, Shuli Kang, Ruth Kiros, Guan Ning Lin, Katja Luck, Andrew MacWilliams, Jörg Menche, Ryan R. Murray, Alexandre Palagi, Matthew M. Poulin, Xavier Rambout, John Rasla, Patrick Reichert, Viviana Romero, Elien Ruyssinck, Julie M. Sahalie, Annemarie Scholz, Akash A. Shah, Amitabh Sharma, Yun Shen, Kerstin Spirohn, Stanley Tam, Alexander O. Tejeda, Shelly A. Trigg, Jean-Claude Twizere, Kerwin Vega, Jennifer Walsh, Michael E. Cusick, Yu Xia, Albert-László Barabási, Lilia M. Iakoucheva, Patrick Aloy, Javier De Las Rivas, Jan Tavernier, Michael A. Calderwood, David E. Hill, Tong Hao, Frederick P. Roth, Marc Vidal (2014) Cell. doi:10.1016/j.cell.2014.10.050
Uncovering disease-disease relationships through the incomplete interactome
J. Menche, A. Sharma, M. Kitsak, S. D. Ghiassian, M. Vidal, J. Loscalzo, A.-L. Barabasi (2015) Science. doi:10.1126/science.1257601
Bgee: Integrating and Comparing Heterogeneous Transcriptome Data Among Species
Frederic Bastian, Gilles Parmentier, Julien Roux, Sebastien Moretti, Vincent Laudet, Marc Robinson-Rechavi (2008) Data Integration in the Life Sciences. doi:10.1007/978-3-540-69828-9_12
Comprehensive comparison of large-scale tissue expression datasets
Alberto Santos, Kalliopi Tsafou, Christian Stolte, Sune Pletscher-Frankild, Seán I. O’Donoghue, Lars Juhl Jensen (2015) PeerJ. doi:10.7717/peerj.1054
Co-author Relationship Prediction in Heterogeneous Bibliographic Networks
Yizhou Sun, Rick Barber, Manish Gupta, Charu C. Aggarwal, Jiawei Han (2011) 2011 International Conference on Advances in Social Networks Analysis and Mining. doi:10.1109/ASONAM.2011.112
Ten Simple Rules for Reproducible Computational Research
Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, Eivind Hovig (2013) PLoS Computational Biology. doi:10.1371/journal.pcbi.1003285