Rephetio: Repurposing drugs on a hetnet [rephetio]

DISEASES: Text mining and data integration of disease–gene associations

Processing the DISEASES resource for disease–gene relationships

We are looking into DISEASES [1] as a resource for gene–disease relationships. This database is produced by @larsjuhljensen's group and follows protocols similar to those of TISSUES [2], which we have already processed.

DISEASES includes three types of evidence:

  • text mining: using named entity recognition to look for disease–protein cooccurrences in abstracts and sentences. @larsjuhljensen, which literature corpus was used?
  • knowledge: curated relationships from GHR and UniProtKB [3]
  • experiments: cancer mutation data from COSMIC [4, 5] and GWAS data from DistiLD [6]

We performed a preliminary processing of the integrated dataset, which yielded 81,499 gene–disease relationships for DO Slim diseases (notebook, download). Filtering for scores ≥ 3 left 2,441 relationships.
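The score filter described above can be sketched in a few lines. This is a minimal illustration, not the code from the notebook: the `Association` record, the `filter_by_score` helper, and the toy rows (with made-up identifiers and scores) are all assumptions introduced here.

```python
from typing import NamedTuple


class Association(NamedTuple):
    """One gene-disease row (hypothetical layout, not the DISEASES file format)."""
    disease: str  # disease identifier (placeholder values below)
    gene: str     # gene identifier (placeholder values below)
    score: float  # confidence score


def filter_by_score(associations, threshold=3.0):
    """Keep associations whose score is at least `threshold` (inclusive)."""
    return [a for a in associations if a.score >= threshold]


# Toy rows for illustration only; these are not real DISEASES records.
toy = [
    Association("disease_1", "gene_a", 4.2),
    Association("disease_1", "gene_b", 2.9),
    Association("disease_2", "gene_a", 3.0),
]

kept = filter_by_score(toy)
print(len(kept), "of", len(toy), "associations kept")
```

The comparison is inclusive so that a score of exactly 3 is retained, matching the "≥ 3" wording.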

@larsjuhljensen, are scores in DISEASES comparable between datasets? In other words, are confidence scores standardized to a common gold standard?

We may consider creating an integrated score excluding DistiLD, since we have a distinct GWAS edge.

Regarding the scores, they are designed to be as comparable as we could make them; however, it was not possible to do so purely through benchmarking, since a high-quality unbiased benchmark set does not exist.

If you already have GWAS data from another source, I would exclude DistiLD too. If you also already import mutation data from, e.g., COSMIC, I would exclude the experiments channel entirely. This also makes comparability of scores much less of an issue, since you're left with only automatically text-mined associations, which are scored the same way as tissue associations, and manually curated associations, which are inherently highly reliable.

Completed processing

We have completed an initial processing of DISEASES (notebook). The output is a tsv of gene–disease pairs (download) with scores for the following channels:

  • text mining
  • knowledge
  • cosmic — the COSMIC subset of the experiments channel
  • distild — the DistiLD subset of the experiments channel
  • integrated_no_distild — the integration of the text mining, knowledge, and cosmic scores, leaving out distild
  • integrated — the integrated score calculated by the DISEASES team, without any exclusions

Genes were converted to Entrez identifiers using the STRING 9.1 mapping (entrez_gene_id.vs.string.v9.05.28122012.txt). We also created a dataset with only DO Slim diseases (download). For this file, we propagated scores from subsumed diseases up to each DO Slim disease and reported the maximum score for each gene–disease pair.
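As a rough sketch of the propagation step, the snippet below assigns each score to every slim term that subsumes the disease (and to the slim term itself) and keeps the maximum per gene. The function name `propagate_to_slim`, the input layout, and the toy identifiers are assumptions made for illustration; the notebook may implement this differently.

```python
from collections import defaultdict


def propagate_to_slim(scores, slim_to_subsumed):
    """Propagate scores to slim terms, keeping the maximum per (slim, gene).

    scores: {(disease, gene): score}
    slim_to_subsumed: {slim_term: set of disease terms it subsumes}
    Diseases not covered by any slim term are dropped.
    """
    # Invert the mapping: each term points to the slim terms that cover it.
    term_to_slims = defaultdict(set)
    for slim, subsumed in slim_to_subsumed.items():
        term_to_slims[slim].add(slim)  # a slim term covers itself
        for term in subsumed:
            term_to_slims[term].add(slim)

    slim_scores = {}
    for (disease, gene), score in scores.items():
        for slim in term_to_slims.get(disease, ()):
            key = (slim, gene)
            if score > slim_scores.get(key, float("-inf")):
                slim_scores[key] = score
    return slim_scores


# Toy hierarchy and scores; identifiers and values are made up.
toy_hierarchy = {"slim_A": {"child_A1", "child_A2"}, "slim_B": {"child_B1"}}
toy_scores = {
    ("slim_A", "g1"): 1.0,
    ("child_A1", "g1"): 2.5,
    ("child_A2", "g2"): 0.7,
    ("child_B1", "g1"): 3.1,
    ("unmapped", "g1"): 4.0,
}

slim_result = propagate_to_slim(toy_scores, toy_hierarchy)
print(slim_result)
```

Taking the maximum means a slim disease inherits the strongest evidence found among its more specific terms, so a single high-scoring child is enough to give the parent a high score.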

Visualizing channel concordance

We visualized the relationships between scores on the full dataset. The off-diagonal plots show a 2D histogram, using hexagonal bins. The diagonal of the grid contains 1D histograms for the x-variable. Bin counts for all panels are log-transformed.
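A grid of this kind could be drawn roughly as follows. This is a sketch under assumptions, not the plotting code from the notebook: it uses matplotlib's `hexbin` and `hist`, the helper name `plot_channel_grid` is hypothetical, and the data are random numbers standing in for channel scores.

```python
import random

import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt


def plot_channel_grid(scores, gridsize=25, bins=30):
    """Draw an n-by-n grid for a {channel_name: list_of_scores} mapping.

    Off-diagonal panels: hexagonal 2D histograms with log-scaled counts.
    Diagonal panels: 1D histograms of the x-variable with a log count axis.
    """
    names = list(scores)
    n = len(names)
    fig, axes = plt.subplots(n, n, figsize=(3 * n, 3 * n), squeeze=False)
    for i, y_name in enumerate(names):
        for j, x_name in enumerate(names):
            ax = axes[i][j]
            if i == j:
                ax.hist(scores[x_name], bins=bins, log=True)
            else:
                ax.hexbin(scores[x_name], scores[y_name],
                          gridsize=gridsize, bins="log", mincnt=1)
            if i == n - 1:
                ax.set_xlabel(x_name)
            if j == 0:
                ax.set_ylabel(y_name)
    return fig


# Random placeholder data; real input would be the per-channel score columns.
rng = random.Random(0)
toy_channels = {
    name: [rng.uniform(0, 5) for _ in range(500)]
    for name in ("text_mining", "knowledge", "cosmic")
}

fig = plot_channel_grid(toy_channels)
fig.savefig("channel_grid.png", dpi=100)
```

Log-scaling the counts keeps sparsely populated high-score regions visible next to the dense low-score bins.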

Status: Completed
Cite this as
Daniel Himmelstein, Lars Juhl Jensen (2015) Processing the DISEASES resource for disease–gene relationships. Thinklab. doi:10.15363/thinklab.d106

Creative Commons License