Rephetio: Repurposing drugs on a hetnet [rephetio]

Creating a catalog of protein interactions

We would like to create a catalog of interactions between proteins (PPIs). I am currently leaning towards focusing on physical interactions, since other types of interactions will be captured by other metanodes and metaedges. If some non-physical interactions are included, that is acceptable but not the goal.

Some previous studies have compiled PPI catalogs:

  • the Incomplete Interactome (II) [1] — compiled protein interactions of seven types

  • the Human Interaction Database (HID) [2] — systematic experimental approach for identifying PPIs

  • our disease-gene association study (hetio) [3] — interactions from iRefIndex, which compiles records from primary databases, processed using ppiTrim [4].

Suggestions for other resources are welcome.

Methods for the Incomplete Interactome PPI catalog

Here, we reproduce the methods section from the supplement of the Incomplete Interactome publication [1] describing how their PPI catalog was constructed:

In building the interactome, we rely only on physical protein interactions with experimental support, hence we do not include interactions extracted from gene expression data or evolutionary considerations. In order to obtain an interactome as complete as currently feasible, we combine several databases with various kinds of physical interactions:

  1. Regulatory interactions: We use the TRANSFAC database [2] that lists interactions derived from the presence of a transcription factor binding site in the promoter region of a certain gene. The resulting network consists of 271 transcription factors regulating 564 genes via 1,335 interactions.
  2. Binary interactions: We combine several yeast-two-hybrid high-throughput datasets [3, 4, 5, 6, 7, 8] with binary interactions from IntAct [9] and MINT databases [10]. The sum of these data sources yields 28,653 interactions between 8,120 proteins. Note that IntAct and MINT provide interactions derived from both literature curation and direct submissions.
  3. Literature curated interactions: These interactions, typically obtained by low throughput experiments, are manually curated from the literature. We use IntAct [9], MINT [10], BioGRID [11] and HPRD [12], resulting in 88,349 interactions between 11,798 proteins.
  4. Metabolic enzyme-coupled interactions: Two enzymes are assumed to be coupled if they share adjacent reactions in the KEGG and BIGG databases. In total, we use 5,325 such metabolic links between 921 enzymes from [13].
  5. Protein complexes: Protein complexes are single molecular units that integrate multiple gene products. The CORUM database [14] is a collection of mammalian complexes derived from a variety of experimental tools, from co-immunoprecipitation to co-sedimentation and ion exchange chromatography. In total, CORUM yields 2,837 complexes with 2,069 proteins connected by 31,276 links.
  6. Kinase network (kinase-substrate pairs): Protein kinases are important regulators in different biological processes, such as signal transduction. PhosphositePlus [15] provides a network of peptides that can be bound by kinases, yielding in total 6,066 interactions between 1,843 proteins.
  7. Signaling interactions: The dataset from [16] provides 32,706 interactions between 6,339 proteins that integrate several sources, both high-throughput and literature curation, into a directed network in which cellular signals are transmitted by protein-protein interactions.

The union of all interactions obtained from (i)-(vii) yields a network of 13,460 proteins that are interconnected by 141,296 physical interactions. Note that we treat the interactome as an undirected network (see Section 2.3 for a discussion of the impact of directionality). The interactome is approximately scale-free (Figure S1a) and shows other typical characteristics as observed previously in many other biological networks [17, 18], such as high clustering and short pathlengths (Figure S1c).
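
The union step can be illustrated with a minimal sketch. The source names and identifiers below are made-up placeholders, not data from the Incomplete Interactome, and dropping self-interactions is an assumption of this sketch. The point is only that storing each pair as a frozenset makes A-B and B-A the same undirected edge.

```python
# Toy edge lists keyed by source category; all identifiers are hypothetical.
sources = {
    "regulatory": [("TF1", "GENE_A"), ("TF1", "GENE_B")],
    "binary": [("GENE_A", "GENE_B"), ("GENE_B", "GENE_A")],
    "signaling": [("GENE_B", "TF1"), ("GENE_C", "GENE_A")],
}


def undirected_union(edge_lists):
    """Return the set of unique undirected edges across all edge lists."""
    edges = set()
    for pairs in edge_lists.values():
        for a, b in pairs:
            if a != b:  # assumption: self-interactions are dropped in this sketch
                edges.add(frozenset((a, b)))
    return edges


interactome = undirected_union(sources)
proteins = set().union(*interactome)  # every node that appears in some edge
print(len(proteins), "proteins,", len(interactome), "interactions")
```

Because directed pairs such as (GENE_B, TF1) and (TF1, GENE_B) collapse into one edge, the union is smaller than the sum of the per-source counts.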

Human Interaction Database

The Human Interaction Database (HID) 2014 release [1] contained two PPI datasets:

  • HI-II-14 — 13,944 interactions — proteome-scale map of the human binary interactome network generated by systematically screening Space-II
  • Lit-BM-13 — 11,045 interactions — high-quality recurated literature binary interactions extracted from 7 public repositories in 2013

We compared interactions from HID to interactions from the Incomplete Interactome (II). We found that HI-II-14 was a subset of II_binary. However, only 78.2% of Lit-BM-13 was included in II_literature.
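
A comparison of this kind can be sketched with set operations. The edge lists below are toy placeholders rather than the real HID or II data, and the helper names are hypothetical.

```python
def as_edge_set(pairs):
    """Normalize gene pairs into a set of undirected edges."""
    return {frozenset(pair) for pair in pairs}


def percent_included(query, reference):
    """Percentage of query edges that also appear in reference."""
    if not query:
        return 0.0
    return 100 * len(query & reference) / len(query)


# Toy stand-ins for the four datasets (integers play the role of Entrez GeneIDs).
hi_ii_14 = as_edge_set([(1, 2), (2, 3)])
lit_bm_13 = as_edge_set([(1, 2), (3, 4), (4, 5), (5, 6)])
ii_binary = as_edge_set([(2, 1), (3, 2), (7, 8)])
ii_literature = as_edge_set([(2, 1), (4, 3), (5, 4)])

print(hi_ii_14 <= ii_binary)                       # subset test
print(percent_included(lit_bm_13, ii_literature))  # share of Lit-BM-13 found in II_literature
```

Both datasets must use the same identifier system before comparing, otherwise shared interactions will be missed.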

To better understand Lit-BM-13, we have added the relevant sections of the paper supplement below, omitting the "assignment of experimental method" sections [1]:

Literature datasets: We generated two datasets from literature-curated protein-protein interactions. A first dataset was generated in 2010 and used for all experiments, concomitantly with our mapping experiment, and a second dataset was extracted in 2013 to provide an updated version for all computational analyses.

Obtaining the Lit-2010 dataset: The Lit-2010 dataset extracts human protein-protein interactions (PPIs), annotated through December 2010, from seven primary source databases: BIND [2], BioGRID [3], DIP [4], HPRD [5], MINT [6], IntAct [7] and PDB [8]. For each reported PPI the interacting proteins were mapped to UniProt protein identifiers and then converted to NCBI Entrez gene ID pairs using an ID mapping table downloaded on January 12, 2012 from Information about the specific publications reporting each interaction was retained and reported interactions that did not have an associated PubMed ID (PMID) were not included in the Lit-2010 dataset.

Identification of binary interactions: We divided Lit-2010 into the PPIs reported by systematic high-throughput binary human interactome mapping efforts [9, 10, 11] and those detected in small- or medium-scale experiments. A small number of PPIs that had been detected in both systematic and other studies could appear in both datasets. Removing the PPIs only seen in systematic studies resulted in a dataset of 56,743 human PPIs.

Next we attempted to distinguish binary interactions (direct biophysical contact between two proteins) [12, 13] from indirect associations (associations between two proteins that are in the same complex, but may or may not directly interact) [14]. We evaluated each experimental interaction detection method in the PSI-MI 2.5 and classified them as binary, that is, primarily detects binary interactions, versus indirect, that is, primarily detects association of proteins within a complex (Table S1C). Where an experimental method could be viewed as either, depending on the specific experimental implementation then the method was conservatively classified as indirect. Fewer methods were classified as binary here than in previous [15, 16] or parallel [17] efforts to ensure the highest confidence binary Lit dataset possible.

After parsing all PPI data from the source databases we obtained a binary human dataset of 13,962 PPIs that contained at least one piece of binary evidence supporting each PPI (there could be other pieces of experimental evidence that were either binary or indirect) and a non-binary dataset containing 42,781 PPIs for which none of the experimental methods are binary (Lit-NB-10).

A paper curated independently by two or more different PPI databases is commonly annotated to different PSI-MI terms, generally to terms of different depth on the same branch of the PSI-MI ontology tree [18, 19]. If not corrected for, these annotations would count as two or more pieces of evidence for the PPI, when actually there is only one piece of supporting evidence. For example, a yeast two-hybrid experiment might be annotated to the deeper term “two hybrid prey pooling approach” (MI:1112) by one PPI database but to the parent term “two hybrid” (MI:0018) by another database; a coimmunoprecipitation (co-IP) experiment might be annotated to the deeper term “anti-tag coimmunoprecipitation” (MI:0007) by one database but to the parent term “affinity chromatography technology” (MI:0004) by another. To compensate for variability in the annotated methods, when the same paper with the same PMID had different MI terms in two databases, we reassigned the deeper term “up” to the corresponding parent binary or nonbinary term on the same PSI-MI branch. In the examples given, the two Y2H annotations collapse to the single ID MI:0018, while the two co-IP annotations collapse to the single ID MI:0004.
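
The collapsing step can be sketched as follows. This is a simplification: it maps every deeper term to the parent named in the text, whereas the supplement describes doing so only when two databases disagree for the same PMID. The mapping covers just the two examples above, and the PMIDs are placeholders.

```python
# Child-to-parent links for the two examples in the text only.
# The real PSI-MI ontology has many more terms and levels.
PARENT = {
    "MI:1112": "MI:0018",  # two hybrid prey pooling approach -> two hybrid
    "MI:0007": "MI:0004",  # anti-tag coimmunoprecipitation -> affinity chromatography technology
}


def collapse_evidence(records):
    """Return distinct (PMID, method) pairs after mapping deeper terms upward.

    records: iterable of (pmid, mi_term) annotations, possibly from several databases.
    """
    return {(pmid, PARENT.get(term, term)) for pmid, term in records}


annotations = [
    ("PMID_X", "MI:1112"),  # database 1 (hypothetical paper)
    ("PMID_X", "MI:0018"),  # database 2, same paper
    ("PMID_Y", "MI:0007"),
    ("PMID_Y", "MI:0004"),
]
evidence = collapse_evidence(annotations)
print(len(annotations), "annotations ->", len(evidence), "pieces of evidence")
```

After collapsing, each paper contributes one piece of evidence per method rather than one per database.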

The binary human dataset was next separated into “binary multiple” (Lit-BM-10) (Table S1A), containing all interactions supported by two or more pieces of experimental evidence, at least one of which was binary (4,906 PPIs); versus “binary single”, containing all interactions supported by exactly one piece of binary experimental evidence (Lit-BS-10) (9,056 PPIs).
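
My reading of the three Lit categories can be written as a small classifier. The function name and the choice of which methods count as binary are assumptions for illustration; the actual assignment is in Table S1C of the paper.

```python
BINARY_METHODS = {"MI:0018"}  # assumption for illustration: two hybrid counts as binary


def classify_ppi(evidence_methods, binary_methods=BINARY_METHODS):
    """Classify a PPI from its method terms, one entry per piece of evidence.

    Lit-NB: no binary evidence.
    Lit-BS: exactly one piece of evidence, and it is binary.
    Lit-BM: two or more pieces of evidence, at least one binary.
    """
    n_binary = sum(method in binary_methods for method in evidence_methods)
    if n_binary == 0:
        return "Lit-NB"
    return "Lit-BM" if len(evidence_methods) >= 2 else "Lit-BS"


examples = {
    "ppi_1": ["MI:0018", "MI:0004"],  # binary plus indirect evidence
    "ppi_2": ["MI:0018"],             # a single binary observation
    "ppi_3": ["MI:0004", "MI:0004"],  # indirect evidence only
}
labels = {ppi: classify_ppi(methods) for ppi, methods in examples.items()}
print(labels)

# The counts quoted from the supplement are internally consistent.
counts = {"Lit-BM-10": 4906, "Lit-BS-10": 9056, "Lit-NB-10": 42781}
assert counts["Lit-BM-10"] + counts["Lit-BS-10"] == 13962
assert sum(counts.values()) == 56743
```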

Updating the Lit dataset to 2013: To construct Lit-2013 (Figure S1B and Table S1B) we downloaded, on August 5, 2013, the updated curated PPI content of the same seven PPI databases used for Lit-2010.

hetio interactions

Previously, we used the following method to catalog protein interactions [1]:

Physical protein-protein interactions (S8 Data) were extracted from iRefIndex 12.0, a compilation of 15 primary interaction databases [2]. The iRefIndex was processed with ppiTrim to convert proteins to genes, remove protein complexes, and condense duplicated entries [3].
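
To make the protein-to-gene and deduplication ideas concrete, here is a rough sketch. It is not ppiTrim's algorithm and does not cover complex removal; the mapping, accessions, and the rules for unmapped or same-gene pairs are all hypothetical choices.

```python
# Hypothetical protein accession -> Entrez GeneID mapping (placeholder values).
protein_to_gene = {"P_A1": 101, "P_A2": 101, "P_B": 202, "P_C": 303}

protein_pairs = [
    ("P_A1", "P_B"),
    ("P_B", "P_A2"),   # same gene pair as above once mapped
    ("P_A1", "P_A2"),  # both proteins map to gene 101
    ("P_C", "P_X"),    # P_X has no mapping
]


def to_gene_pairs(pairs, mapping):
    """Map protein pairs to unique, order-independent gene pairs."""
    gene_pairs = set()
    for a, b in pairs:
        if a not in mapping or b not in mapping:
            continue  # assumption: drop pairs with an unmapped protein
        gene_a, gene_b = mapping[a], mapping[b]
        if gene_a != gene_b:  # assumption: drop pairs that collapse onto one gene
            gene_pairs.add(tuple(sorted((gene_a, gene_b))))
    return gene_pairs


gene_pairs = to_gene_pairs(protein_pairs, protein_to_gene)
print(gene_pairs)
```

Sorting each pair gives a canonical orientation, so duplicated records reported in either direction condense to one entry.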

The method contributed 97,938 interactions to our network of protein-coding genes. For this project, we converted these interactions to Entrez genes. Prior to filtering for coding genes, 98,119 interactions were in the hetio dataset.

The hetio interactions overlapped most with Lit-BM-13, II_literature, and II_signaling (notebook).

Signaling PPI

The incomplete interactome paper includes interactions from a signaling study [1]. We report the sources of this HPPI1 network from the original paper's supplement:

Table S3: Human PPI interaction data sets used to construct the HPPI1 network. A comprehensive HPPI1 network was created by unifying the data sets listed. Name of the data set, publication reference, number of proteins and number of interactions in the data set (after mapping to NCBI Entrez GeneID) are given.

PPI data sets                                                    Reference   Proteins  Interactions
Human Protein Reference Database V7.0                            [2]         9305      35021
Genome-wide Y2H screen and literature-derived PPI data           [3]         3024      6221
Y2H screen for inherited ataxia and literature-derived PPI data  [4]         2909      5440
Genome-wide Y2H screen                                           [5]         1699      3150
Y2H PPIs                                                         This study  1126      2626
Mouse signaling PPI data from AfCS
Network for Smad signaling                                       [6]         623       874
PPIs of proteins in MHC class III region and mRNA decay          [7, 8]      300       376
Network of nuclear receptors                                     [9]         134       288
Huntington’s disease PPI network                                 [10]        64        156
PPIs between KIAA proteins                                       [11]        94        84

Dexter Hadley Researcher  611 days

Have you seen the Human Interactome out of Harvard? I used it for my PPI work on CNVs in GWAS for autism, and it worked out quite well. They seem to have at least some of the ones you have already listed, although I think your list looks more extensive. You want to be careful about integrating genome-wide PPI (i.e. Y2H) vs targeted PPI datasets. It may complicate interpreting the statistics if the evidence for a PPI is biased rather than genome-wide.

@idrdex, we're referring to this data as the Human Interaction Database (HID) and found that the systematic datasets were wholly included in the Incomplete Interactome (II) data (comment, notebook).

You are right that once we introduce PPIs from targeted or curated analyses, we have introduced knowledge bias. While we prefer systematic data, we decided to incorporate biased knowledge. We will likely perform a parallel analysis on a network built from only unbiased data sources. For this analysis, we'll use the Y2H datasets used in your autism study [1].

Unbiased PPI datasets

Since we now include an edge attribute for bias in our network, we need to identify a subset of our PPIs that are derived from hypothesis-free, i.e. unbiased, experiments.

The Incomplete Interactome [1] describes their creation of an unbiased interactome:

Since our interactome includes data from literature curation, it is inherently biased towards much studied disease-associated proteins and their interactions. We, therefore, complement our analysis using only interactions from well controlled and completely unbiased high-throughput yeast two-hybrid (y2h) datasets [2, 3, 4, 5, 6].

Minus the last citation [6] which is prepublication, we can use the other four resources [2, 3, 4, 5], two of which were used by @idrdex in his study [7].

Completed PPI catalog

We have completed an initial version of our protein interaction catalog for this project [1], named hetio-ind. We defined an interaction as a pair of genes whose protein products physically interact. Physical associations from protein complexes were minimized.

Interactions were taken from the following sources:

  • Human Interactome Database (HID): specifically the HI-I-05 [2], Venkatesan-09 [3], Yu-11 [4], HI-II-14 [5], Lit-BM-13 [5] datasets.
  • Incomplete Interactome [6]: specifically the II_binary and II_literature subsets.
  • hetio-dag [7]: our previous project (details). We removed all interactions that were not physical associations (MI:0915). This step excluded genetic and colocalization interactions.
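
The filter applied to the hetio-dag interactions can be sketched as below. The rows and field names are hypothetical, and the sketch matches on the interaction type name rather than a PSI-MI identifier.

```python
# Hypothetical records; a real file would carry PSI-MI interaction type annotations.
rows = [
    {"gene_a": 1, "gene_b": 2, "interaction_type": "physical association"},
    {"gene_a": 1, "gene_b": 3, "interaction_type": "genetic interaction"},
    {"gene_a": 2, "gene_b": 3, "interaction_type": "colocalization"},
]

KEEP_TYPE = "physical association"

# Keep only records annotated as physical associations.
physical = [row for row in rows if row["interaction_type"] == KEEP_TYPE]
print(len(rows) - len(physical), "rows excluded")
```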

16,526 interactions reported by HI-I-05 [2], Venkatesan-09 [3], Yu-11 [4], or HI-II-14 [5] were considered unbiased. The 135,203 other interactions were considered biased.

In total our dataset contains 151,729 protein interactions (notebook, downloads).
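
The bias attribute can be derived from the source lists with a rule like the one below. The catalog entries are toy examples and the function name is hypothetical; the set of systematic sources is taken from the text.

```python
# Datasets treated as unbiased (systematic screens), per the text above.
UNBIASED_SOURCES = {"HI-I-05", "Venkatesan-09", "Yu-11", "HI-II-14"}


def is_unbiased(sources):
    """An interaction counts as unbiased if any systematic screen reported it."""
    return bool(UNBIASED_SOURCES & set(sources))


# Toy catalog: undirected gene pair -> names of the sources reporting it.
catalog = {
    frozenset((1, 2)): {"HI-II-14", "Lit-BM-13"},
    frozenset((2, 3)): {"II_literature"},
    frozenset((3, 4)): {"hetio-dag", "Yu-11"},
}
unbiased = {edge: is_unbiased(sources) for edge, sources in catalog.items()}
print(sum(unbiased.values()), "of", len(catalog), "toy interactions are unbiased")

# The reported counts add up: unbiased + biased = total.
assert 16_526 + 135_203 == 151_729
```

Note that an interaction found by both a systematic screen and literature curation is labeled unbiased under this rule.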

Have you considered the Mentha PPI database?

  • It combines protein-protein interaction data from most databases.
  • It assigns a reliability score to each interaction (like iRefIndex).
  • It uses the PSICQUIC protocol and MITAB 2.5 format, so it is compatible with all other protein interaction databases.
  • Over the last few years, it has been regularly updated.
  • It uses HGNC Gene IDs and UniProt accessions.

@ostrokach, thanks for letting us know about Mentha [1]. We've moved past the network construction stage on this project but will keep Mentha in mind for the future.

I downloaded the latest release. Here are a few select rows from the top:

Protein A  Gene A                                Taxon A  Protein B   Gene B                                Taxon B  Score  PMID
Q9LK77     Q9LK77                                3702     A0A088QD33  OEC103 {ECO:0000313|EMBL:AIN81148.1}  62715    0.236  25211078
F4J5N9     BZIP24 {ECO:0000313|EMBL:AEE78868.1}  3702     A0A088QD42  OEC112 {ECO:0000313|EMBL:AIN81158.1}  62715    0.236  25211078
Q9Y5Q8     GTF3C5                                9606     Q86U86      PBRM1                                 9606     0.155  22939629 26344197

Two features that caught my eye are the reliability scores and the PubMed IDs of the original source(s). It looks like the file format could be a little bit cleaner (see for example {ECO:0000313|EMBL:AEE78868.1}). Also I didn't see a license on the Mentha website. Going forward I plan to avoid resources without an open license — something we learned the importance of during this project.
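
If the evidence annotations in the gene columns get in the way, they could be stripped with a small helper like this. The function names are my own, and I am assuming the annotations always take the `{ECO:...}` form seen in the rows above.

```python
import re

# Matches a trailing evidence tag such as " {ECO:0000313|EMBL:AEE78868.1}".
ECO_ANNOTATION = re.compile(r"\s*\{ECO:[^}]*\}")


def clean_gene(name):
    """Remove {ECO:...} evidence annotations from a gene name."""
    return ECO_ANNOTATION.sub("", name).strip()


def split_pmids(field):
    """Split a whitespace-separated PMID field into a list of IDs."""
    return field.split()


print(clean_gene("BZIP24 {ECO:0000313|EMBL:AEE78868.1}"))
print(clean_gene("GTF3C5"))
print(split_pmids("22939629 26344197"))
```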

Status: Completed
Cite this as
Daniel Himmelstein, Dexter Hadley, Alexey Strokach (2015) Creating a catalog of protein interactions. Thinklab. doi:10.15363/thinklab.d85

Creative Commons License