## Extracting disease-gene associations from the GWAS Catalog

The GWAS Catalog [1] compiles SNP associations from published genome-wide association studies. We converted the catalog's SNP associations into gene associations and classified each gene association as high or low confidence and as primary or secondary (based on whether the gene is assumed to drive the signal at a locus).

We extracted associations only for diseases in DO Slim, which should cover most diseases in the catalog while excluding traits. Genes were restricted to protein-coding genes.

## Method

The method for processing associations was taken from our previous work [2], which describes it as follows (our modifications are listed afterwards):

Disease-gene associations were extracted from the GWAS Catalog [1], a compilation of GWAS associations where $p < 10^{-5}$. First, associations were segregated by disease. GWAS Catalog phenotypes were converted to Experimental Factor Ontology (EFO) terms using mappings produced by the European Bioinformatics Institute. Associations mapping to multiple EFO terms were excluded to eliminate cross-phenotype studies. We manually mapped EFO to DO terms (now included in the DO as cross-references) and annotated each DO term with its associations.

Associations were classified as either high- or low-confidence, where exceeding two thresholds granted high-confidence status. First, $p \leq 5 \times 10^{-8}$, corresponding to $p \leq 0.05$ after Bonferroni adjustment for one million comparisons (an approximate upper bound for the number of independent SNPs evaluated by most GWAS). Second, a minimum sample size (counting both cases and controls) of 1,000 was required, since studies below this size are underpowered [3] (i.e., any discovered associations are more likely than not to be false) for the majority of true effect size distributions commonly assumed to underlie complex disease etiology [4].

Lead-SNPs were assigned windows (regions wherein the causal SNPs are assumed to lie) retrieved from the DAPPLE server [5]. Windows were calculated for each lead-SNP by finding the furthest upstream and downstream SNPs where $r^2 > 0.5$ and extending outwards to the next recombination hotspot. Associations were ordered by confidence, sorting on the following criteria: high/low confidence, p-value (low to high), and recency. In order of confidence, associations were overlapped by their windows into disease-specific loci. By organizing associations into loci, associations from multiple studies tagging the same underlying signal were condensed. A locus was classified as high-confidence if any of its composite associations were high-confidence and low-confidence otherwise.
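The confidence thresholds and the window-based grouping can be illustrated with a minimal Python sketch. It is not the code used for the analysis. The `Association` and `Locus` classes, their field names, and `build_loci` are hypothetical. The sketch also assumes that a locus keeps the window of its first (most confident) association rather than growing as associations are added, and that ties on recency are broken by publication year; the text does not specify either point.

```python
from dataclasses import dataclass, field

P_THRESHOLD = 5e-8   # 0.05 / 1,000,000 comparisons (Bonferroni adjustment)
MIN_SAMPLES = 1000   # cases plus controls


@dataclass
class Association:
    """One lead-SNP association for a single disease (hypothetical structure)."""
    snp: str
    chrom: str
    start: int    # window start, in base pairs
    end: int      # window end, in base pairs
    pvalue: float
    samples: int
    year: int

    @property
    def high_confidence(self) -> bool:
        # Both thresholds must be met for high-confidence status.
        return self.pvalue <= P_THRESHOLD and self.samples >= MIN_SAMPLES


@dataclass
class Locus:
    chrom: str
    start: int
    end: int
    associations: list = field(default_factory=list)

    @property
    def high_confidence(self) -> bool:
        # A locus is high-confidence if any member association is.
        return any(a.high_confidence for a in self.associations)


def build_loci(associations):
    """Group one disease's associations into loci by window overlap."""
    # Order: high before low confidence, then p-value (low to high), then recency.
    ordered = sorted(
        associations,
        key=lambda a: (not a.high_confidence, a.pvalue, -a.year),
    )
    loci = []
    for assoc in ordered:
        for locus in loci:
            overlaps = (
                locus.chrom == assoc.chrom
                and assoc.start <= locus.end
                and locus.start <= assoc.end
            )
            if overlaps:
                locus.associations.append(assoc)
                break
        else:
            # No existing locus overlaps, so this association seeds a new one.
            loci.append(Locus(assoc.chrom, assoc.start, assoc.end, [assoc]))
    return loci
```

Because associations are processed in order of confidence, the first association in each locus is always its most confident one.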

For each disease-specific locus, we attempted to identify a primary gene. The primary gene was resolved in the following order:

1. the mode author-reported gene
2. the containing gene for an intragenic lead-SNP
3. the mode author-reported gene for an intragenic lead-SNP (in the case of overlapping genes)
4. the mode author-reported gene of the most proximal upstream and downstream genes

Steps 2–4 were repeated on each association composing the locus, in order of confidence, until a single gene resolved as primary. Loci where ambiguity was unresolvable or where no genes were returned did not receive a primary gene. All non-primary genes (genes that were author-reported, overlapping the lead-SNP, or immediately upstream or downstream from the lead-SNP) were considered secondary.
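The resolution order above can be sketched in Python as follows. This is one possible reading of the steps, not the original implementation. The dictionary keys (`reported`, `containing`, `flanking`) and the function names are hypothetical. We assume that a tie for the mode counts as unresolved and that step 4 is applied only when the lead-SNP is intergenic.

```python
from collections import Counter


def unique_mode(genes):
    """Return the single most frequent gene, or None if empty or tied."""
    ranked = Counter(genes).most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie for the mode: ambiguous
    return ranked[0][0]


def resolve_primary(associations):
    """Pick a primary gene for one locus.

    `associations` is a list of dicts ordered by confidence, each with
    lists of gene symbols under 'reported' (author-reported genes),
    'containing' (genes overlapping the lead-SNP), and 'flanking'
    (nearest upstream and downstream genes).
    """
    # Step 1: mode author-reported gene across the locus.
    gene = unique_mode([g for a in associations for g in a["reported"]])
    if gene is not None:
        return gene

    # Steps 2-4, repeated per association in order of confidence.
    for assoc in associations:
        containing = assoc["containing"]
        if len(containing) == 1:
            return containing[0]                 # step 2
        if len(containing) > 1:
            candidates = containing              # step 3: overlapping genes
        else:
            candidates = assoc["flanking"]       # step 4: intergenic lead-SNP
        gene = unique_mode([g for g in assoc["reported"] if g in candidates])
        if gene is not None:
            return gene
    return None  # unresolved: the locus receives no primary gene


def secondary_genes(associations, primary):
    """All reported, containing, or flanking genes other than the primary."""
    genes = set()
    for assoc in associations:
        genes.update(assoc["reported"], assoc["containing"], assoc["flanking"])
    genes.discard(primary)
    return genes
```

For example, with placeholder symbols, a locus whose two associations each report `GENE_A` and `GENE_B` is tied at step 1. If the most confident lead-SNP is intergenic and flanked by `GENE_B` and `GENE_C`, step 4 resolves `GENE_B` as primary, leaving `GENE_A` and `GENE_C` as secondary.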

Accordingly, four categories of processed associations were created: high-confidence primary, high-confidence secondary, low-confidence primary, and low-confidence secondary. We assume that our primary gene annotation for each locus represents the single causal gene responsible for the association.

## Method modifications

We switched from HapMap LD data provided by DAPPLE to 1000 Genomes LD data provided by SNAP and removed the recombination hotspot extensions.
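Without the hotspot extension, a window reduces to the span of SNPs in sufficient LD with the lead-SNP. A minimal sketch, assuming the LD data have already been retrieved as (position, $r^2$) pairs; the function name `ld_window` and this input format are hypothetical and do not reflect the SNAP output format:

```python
def ld_window(lead_position, proxies, r2_threshold=0.5):
    """Return (start, end) spanning the lead-SNP and its linked SNPs.

    `proxies` is an iterable of (position, r2) pairs on the lead-SNP's
    chromosome. Only SNPs with r2 strictly above the threshold count.
    """
    linked = [pos for pos, r2 in proxies if r2 > r2_threshold]
    linked.append(lead_position)  # the window always contains the lead-SNP
    return min(linked), max(linked)
```

A lead-SNP with no linked SNPs above the threshold gets a window of zero width at its own position.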

Daniel Himmelstein Researcher

# GWAS associations for all DO diseases

We repeated the above analysis for all diseases, not just DO Slim diseases. The EFO terms added to the GWAS Catalog by the EBI are still converted to DO terms; therefore, associations whose EFO terms are not cross-referenced in the DO are omitted.
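The effect of the term conversion can be sketched as a filter. The function name, the `efo_terms` key, and the identifiers in the usage example are hypothetical placeholders, and the sketch assumes each EFO term is cross-referenced by at most one DO term:

```python
def annotate_with_do(associations, efo_to_do):
    """Keep associations whose single EFO term has a DO cross-reference.

    `associations` is a list of dicts with an 'efo_terms' list;
    `efo_to_do` maps EFO identifiers to DO identifiers.
    """
    kept = []
    for assoc in associations:
        efo_terms = assoc["efo_terms"]
        if len(efo_terms) != 1:
            continue  # multiple EFO terms (cross-phenotype study) or none
        do_term = efo_to_do.get(efo_terms[0])
        if do_term is None:
            continue  # EFO term is not cross-referenced in the DO
        kept.append({**assoc, "do_term": do_term})
    return kept


# Usage with placeholder identifiers (not real ontology terms):
example = annotate_with_do(
    [
        {"snp": "rs_a", "efo_terms": ["EFO:x1"]},
        {"snp": "rs_b", "efo_terms": ["EFO:x1", "EFO:x2"]},
        {"snp": "rs_c", "efo_terms": ["EFO:x3"]},
    ],
    {"EFO:x1": "DOID:y1"},
)
```

Here only `rs_a` survives: `rs_b` maps to two EFO terms and `rs_c` has no DO cross-reference.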

In total, the dataset contains 1,447 high-confidence primary gene-disease associations. Counting both confidence levels, associations exist for 124 diseases and 4,142 genes.