## Processing the DISEASES resource for disease–gene relationships

We are looking into DISEASES [1] as a resource for gene–disease relationships. This database is produced by @larsjuhljensen's group and follows protocols similar to those used for TISSUES [2], which we have already processed.

DISEASES includes three types of evidence:

• text mining: named entity recognition is used to find disease–protein co-occurrences in abstracts and sentences. @larsjuhljensen, which literature corpus was used?
• knowledge: curated relationships from GHR and UniProtKB [3]
• experiments: cancer mutation data from COSMIC [4, 5] and GWAS data from DistiLD [6]

We performed a preliminary processing of the integrated dataset, which yielded 81,499 gene–disease relationships for DO Slim diseases (notebook, download). Filtering for scores ≥ 3 left 2,441 relationships.
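For anyone wanting to reproduce the threshold step, here is a minimal sketch of the score filter. It is not the code from the notebook: the column names (`gene_symbol`, `doid`, `score`) and the toy rows are assumptions made up for illustration, and only the inclusive cutoff of 3 comes from the text above.

```python
import csv
import io

# Toy stand-in for the integrated DISEASES file. The column names are
# hypothetical and the rows are invented for illustration only.
TOY_TSV = (
    "gene_symbol\tdoid\tscore\n"
    "GENE_A\tDOID:0000001\t4.2\n"
    "GENE_B\tDOID:0000001\t2.9\n"
    "GENE_C\tDOID:0000002\t3.0\n"
)


def filter_by_score(handle, min_score=3.0):
    """Return the rows whose score is at least min_score (inclusive)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["score"]) >= min_score]


kept = filter_by_score(io.StringIO(TOY_TSV))
print([row["gene_symbol"] for row in kept])
```

The comparison is inclusive, so a pair scoring exactly 3 is retained.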

@larsjuhljensen, are scores in DISEASES comparable between datasets? In other words, are confidence scores standardized to a common gold standard?

We may consider creating an integrated score excluding DistiLD, since we have a distinct GWAS edge.

Regarding the scores, they are designed to be as comparable as we could make them; however, it was not possible to do so purely through benchmarking, since a high-quality unbiased benchmark set does not exist.

If you already have GWAS from another source, I would exclude DistiLD too. If you also already import mutation data from e.g. COSMIC, I would exclude the experiments channel entirely. This also makes comparability of scores much less of an issue, since you're left with only automatically text-mined associations, which are scored the same way as tissue associations, and manually curated associations, which are inherently highly reliable.

Daniel Himmelstein Researcher

# Completed processing

We have completed an initial processing of DISEASES (notebook). The output is a tsv of gene–disease pairs (download) with scores for the following channels:

• text mining
• knowledge
• cosmic: the COSMIC subset of the experiments channel
• distild: the DistiLD subset of the experiments channel
• integrated_no_distild: the integration of the text mining, knowledge, and cosmic scores, leaving out distild
• integrated: the integrated score calculated by the DISEASES team, without any exclusions
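To make the table layout concrete, here is a rough sketch that pivots per-channel records into one row per gene–disease pair and then derives a score that skips a chosen channel. Note the assumptions: the thread does not say how channel scores are integrated, so `combine=max` is only a placeholder, and the function names, channel keys, and toy records are all hypothetical.

```python
from collections import defaultdict


def pivot_channels(records):
    """Turn (gene, disease, channel, score) records into one dict per pair."""
    table = defaultdict(dict)
    for gene, disease, channel, score in records:
        pair = (gene, disease)
        # Keep the highest score if a channel reports the same pair twice.
        if channel not in table[pair] or score > table[pair][channel]:
            table[pair][channel] = score
    return dict(table)


def integrate(channel_scores, exclude=(), combine=max):
    """Combine the scores of one pair, skipping any excluded channels.

    combine=max is a placeholder assumption, not the DISEASES formula.
    """
    kept = [s for name, s in channel_scores.items() if name not in exclude]
    return combine(kept) if kept else None


# Invented records: (Entrez gene id, disease id, channel, score).
records = [
    ("1", "DOID:0000001", "text_mining", 2.5),
    ("1", "DOID:0000001", "cosmic", 1.0),
    ("1", "DOID:0000001", "distild", 4.0),
    ("2", "DOID:0000001", "knowledge", 5.0),
]
wide = pivot_channels(records)
pair = ("1", "DOID:0000001")
with_distild = integrate(wide[pair])
without_distild = integrate(wide[pair], exclude=("distild",))
print(with_distild, without_distild)
```

Passing the combining rule as an argument keeps the exclusion logic separate from whichever integration formula is actually used.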

Genes were converted to Entrez identifiers using the STRING 9.1 mapping (entrez_gene_id.vs.string.v9.05.28122012.txt). We also created a dataset with only DO Slim diseases (download). For this file, we propagated scores from subsumed diseases up to each DO Slim disease and reported the maximum score per gene–disease pair.
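The propagation step can be sketched roughly as follows. This is an illustration rather than the notebook's code: it assumes the ontology is available as a child-to-parents dictionary, and the term identifiers, function names, and scores are invented.

```python
def ancestors(term, parents):
    """Return the term itself plus every term that subsumes it."""
    seen, stack = set(), [term]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def propagate_to_slim(scores, parents, slim_terms):
    """Assign each score to the slim terms above it, keeping the max per pair."""
    slim_scores = {}
    for (gene, disease), score in scores.items():
        for slim in ancestors(disease, parents) & slim_terms:
            key = (gene, slim)
            if key not in slim_scores or score > slim_scores[key]:
                slim_scores[key] = score
    return slim_scores


# Hypothetical ontology fragment: child term -> list of parent terms.
parents = {
    "DOID:child": ["DOID:slim"],
    "DOID:grandchild": ["DOID:child"],
    "DOID:other": ["DOID:root"],
}
slim_terms = {"DOID:slim"}
scores = {
    ("1", "DOID:slim"): 1.5,
    ("1", "DOID:grandchild"): 3.2,
    ("2", "DOID:child"): 2.0,
    ("2", "DOID:other"): 4.0,
}
slim_scores = propagate_to_slim(scores, parents, slim_terms)
print(slim_scores)
```

In this toy example gene 1 ends up with 3.2 for the slim term, because the score on the grandchild term exceeds the score annotated directly to the slim term. The pair annotated to `DOID:other` is dropped since no slim term subsumes it.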

## Visualizing channel concordance

We visualized the relationships between scores on the full dataset. The off-diagonal plots show a 2D histogram, using hexagonal bins. The diagonal of the grid contains 1D histograms for the x-variable. Bin counts for all panels are log-transformed.
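A grid of this kind could be drawn along the following lines. This is a sketch with matplotlib, not the plotting code from the notebook: the function name, the random toy data, and the choice of 20 bins are assumptions, and it presumes every channel has a score for every pair.

```python
import random

import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def plot_channel_grid(columns, gridsize=20, bins=20):
    """Hexbin panels off the diagonal, histograms on it, all with log counts."""
    names = list(columns)
    n = len(names)
    fig, axes = plt.subplots(n, n, figsize=(3 * n, 3 * n), squeeze=False)
    for i, y_name in enumerate(names):
        for j, x_name in enumerate(names):
            ax = axes[i][j]
            if i == j:
                # Diagonal: 1D histogram of the x-variable, log-scaled counts.
                ax.hist(columns[x_name], bins=bins, log=True)
            else:
                # Off-diagonal: hexagonal 2D histogram, log-scaled bin counts.
                ax.hexbin(columns[x_name], columns[y_name],
                          gridsize=gridsize, bins="log")
            if i == n - 1:
                ax.set_xlabel(x_name)
            if j == 0:
                ax.set_ylabel(y_name)
    return fig


# Invented scores for three channels, one value per gene-disease pair.
random.seed(0)
toy = {
    name: [random.uniform(0, 5) for _ in range(500)]
    for name in ("text_mining", "knowledge", "integrated")
}
fig = plot_channel_grid(toy)
fig.savefig("channel_concordance.png", dpi=100)
```

Log-scaling the counts keeps the sparse high-score bins visible next to the much denser low-score bins.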

Daniel Himmelstein, Lars Juhl Jensen (2015) Processing the DISEASES resource for disease–gene relationships. Thinklab. doi:10.15363/thinklab.d106