## Compiling Gene Ontology annotations into an easy-to-use format

We previously retrieved our human Gene Ontology (GO) [1] annotations from MSigDB. Because MSigDB was designed for gene set enrichment analyses, it excludes GO terms with fewer than 10 annotations. Additionally, MSigDB is infrequently updated and only contains human gene sets.

Therefore, we created a utility that provides GO annotations for a variety of species using the most recent annotation data. The resource relies on Entrez Gene as its main gene vocabulary. The utility's source code is online. In brief, annotations are retrieved from the Entrez gene2go.gz file, and the Python goatools package is used to parse go-basic.obo and propagate annotations.
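As a rough illustration of the retrieval step, the sketch below reads records in the gene2go layout and keeps those for one taxon. It assumes the tab-separated columns that NCBI uses for gene2go (`#tax_id`, `GeneID`, `GO_ID`, `Evidence`, `Qualifier`, and so on). The function name `read_gene2go` and the sample rows are made up for illustration; this is not the utility's actual code.

```python
import csv
import io


def read_gene2go(handle, tax_id):
    """Yield (gene_id, go_id, evidence, qualifier) tuples for one taxon."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # The header line starts with '#', so the first column is '#tax_id'.
        if row["#tax_id"] != str(tax_id):
            continue
        yield int(row["GeneID"]), row["GO_ID"], row["Evidence"], row["Qualifier"]


# Made-up sample rows in the gene2go column layout (not real annotations).
sample = (
    "#tax_id\tGeneID\tGO_ID\tEvidence\tQualifier\tGO_term\tPubMed\tCategory\n"
    "9606\t1\tGO:0000001\tIDA\t-\texample term\t-\tProcess\n"
    "10090\t2\tGO:0000002\tIEA\t-\texample term\t-\tFunction\n"
)
human = list(read_gene2go(io.StringIO(sample), tax_id=9606))
```

For the downloaded file, a handle from `gzip.open(path, "rt")` could be passed in place of the in-memory sample.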

Propagating annotations refers to transferring annotations from a more specific GO term to its broader parent terms. The theoretical justification is described in [2]:

When a gene is annotated to a term, associations between the gene and the terms' parents are implicitly inferred. Because GO annotations to a term inherit all the properties of the ancestors of those terms, every path from any term back to its root(s) must be biologically accurate or the ontology must be revised.
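To make the idea concrete, here is a minimal sketch of propagation on a toy ontology. It is not the goatools implementation: the `ancestors` and `propagate` helpers, the term names, and the gene names are all hypothetical, and every parent link is treated as an is_a edge.

```python
from collections import defaultdict


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct, parents):
    """Map each term to its direct genes plus genes inherited from descendants."""
    propagated = defaultdict(set)
    for term, genes in direct.items():
        for target in {term} | ancestors(term, parents):
            propagated[target] |= genes
    return dict(propagated)


# Hypothetical toy ontology, stored as child -> list of parents.
parents = {"specific": ["middle"], "middle": ["root"]}
direct = {"specific": {"GENE_A"}, "middle": {"GENE_B"}}
result = propagate(direct, parents)
```

In this toy case, `GENE_A` is directly annotated only to `specific` but ends up on `middle` and `root` as well, which is the implicit inference the quoted passage describes.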

We allow the user to choose propagated or unpropagated annotations, gene identifiers as Entrez IDs or symbols, and protein-coding or all genes. Since this resource is meant to be maximally useful, any suggestions or feature requests are welcome.

Daniel Himmelstein Researcher

## Background reading on Gene Ontology annotations

1. Gene Ontology Annotations and Resources [1]
2. Use and misuse of the gene ontology annotations [2]
3. Understanding how and why the Gene Ontology and its annotations evolve: the GO within UniProt [3]
4. Quality of Computationally Inferred Gene Ontology Annotations [4]

@dhimmel I'd add this as particularly important for GO as well: http://wiki.geneontology.org/index.php/Transitive_closure

People frequently overlook this.

Daniel Himmelstein Researcher

# Transitive Closure

@caseygreene, our resource has an option to propagate annotations to account for transitive closure. Briefly, transitive closure is defined through example as:

‘every kidney is located in some body’ follows from ‘every kidney is located in some abdomen’ and ‘every abdomen is located in some body’ [1]

Our current propagation method transfers annotations across is_a relationships between terms in the go-basic.obo ontology. We rely on the goatools Python package to process the Gene Ontology. This package appears to discard all non-is_a relationships. It sounds like our method would be classified as "the old way" according to your link.

Is there an easy way to retrieve a table of closure relationships that we should use for annotation propagation? The site mentions a "pre-computed closure tsv" but does not indicate whether it is currently available. If we do switch to a method that incorporates additional relationship types beyond is_a, which additional types do you recommend propagating on?

@dhimmel To understand what relationships you should compute closure on I recommend reading http://geneontology.org/page/ontology-relations

I would add part_of and maybe has_part first before exploring the other relationships.

Another option for reasoning that can take advantage of the relationships is https://github.com/owlcollab/owltools

Daniel Himmelstein Researcher

# Migrating to owltools

We are in the decade of the ontology: the pace of development and growth of this field is incredible. Therefore, I would like to outsource the ontology reasoning and inference to established software projects, namely owltools.

The first step will be to load GO (http://purl.obolibrary.org/obo/go.owl) with owltools. Beyond this step I am stuck and haven't found sufficient documentation. @chrismungall or @fbastian, perhaps you could help me with the tasks below or point me to a good tutorial:

1. Adding annotations: we would like to add human gene annotations to GO terms.
2. Propagating annotations: we would like to propagate annotations up is_a and part_of edges. Negative (NOT) annotations should short-circuit annotation propagation.
3. Filter overly broad terms: Remove the "do not annotate" terms for GO.
4. Output: Write the propagated annotations to a text file.

If it's not possible to perform all of these steps in one command, then we can work on a piecemeal approach.

You should definitely check with Chris; GO annotation/propagation is not really my area of expertise. And it is indeed possible to do a lot of things in a single command using owltools.

Daniel Himmelstein Researcher

# Migration cancelled

In the interest of time, we did not switch to owltools. The OWL ecosystem codebases rely on a completely different stack than our current Python workflow, and the documentation is often incomplete. We asked our usage questions on GitHub and will consider migrating in the future if clearer guidance becomes available.

# Updated annotations framework

We revamped the analysis behind our user-friendly GO annotation utility [1].

We made the following changes:

• an option to discard annotations without experimental evidence
• propagation along part_of (as well as is_a) relationships
• short-circuiting by direct annotations: a direct annotation stops the propagation of a conflicting annotation, which occurs only when negative (NOT) and positive annotations conflict
• exclusion of terms in the goantislim_grouping, gocheck_do_not_annotate, or gocheck_do_not_manually_annotate subsets
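The sketch below shows one way these rules could fit together. It is an interpretation, not the code behind the utility. It assumes that a gene's direct NOT annotation to a term blocks that gene's positive annotations from reaching or passing through the term, and that "experimental evidence" means the GO evidence codes EXP, IDA, IPI, IMP, IGI, and IEP. The helper names (`propagate_gene`, `annotate`), the toy terms, and the gene names are hypothetical.

```python
from collections import defaultdict

EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
EXCLUDED_SUBSETS = {
    "goantislim_grouping",
    "gocheck_do_not_annotate",
    "gocheck_do_not_manually_annotate",
}
RELATIONS = {"is_a", "part_of"}


def propagate_gene(positive, negative, edges):
    """Propagate one gene's positive terms upward, stopping at NOT terms.

    `edges` maps child -> list of (relation, parent) pairs.
    """
    reached = set()
    # Assumption: a direct NOT annotation wins over a direct positive one.
    stack = [term for term in positive if term not in negative]
    while stack:
        term = stack.pop()
        if term in reached:
            continue
        reached.add(term)
        for relation, parent in edges.get(term, ()):
            # Follow only is_a and part_of; a NOT term short-circuits the path.
            if relation in RELATIONS and parent not in negative:
                stack.append(parent)
    return reached


def annotate(records, edges, subsets, experimental_only=False):
    """Return gene -> set of propagated terms, minus excluded-subset terms."""
    positive, negative = defaultdict(set), defaultdict(set)
    for gene, term, evidence, qualifier in records:
        if experimental_only and evidence not in EXPERIMENTAL:
            continue
        target = negative if qualifier.startswith("NOT") else positive
        target[gene].add(term)
    result = {}
    for gene, terms in positive.items():
        reached = propagate_gene(terms, negative.get(gene, set()), edges)
        result[gene] = {
            term for term in reached
            if not (subsets.get(term, set()) & EXCLUDED_SUBSETS)
        }
    return result


# Hypothetical toy data: (gene, term, evidence, qualifier) records.
edges = {
    "a": [("is_a", "b")],
    "b": [("part_of", "c"), ("regulates", "x")],
    "c": [("is_a", "grouping")],
}
subsets = {"grouping": {"gocheck_do_not_annotate"}}
records = [
    ("G1", "a", "IDA", "-"),
    ("G2", "a", "IEA", "-"),
    ("G2", "b", "IMP", "NOT"),
]
all_evidence = annotate(records, edges, subsets)
experimental = annotate(records, edges, subsets, experimental_only=True)
```

Here `G1` reaches `b` and `c` but not `x` (a `regulates` edge) or `grouping` (an excluded subset), while the NOT annotation on `b` keeps `G2` at `a`. With `experimental_only=True`, the IEA-supported annotation for `G2` is dropped entirely.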

We removed the "protein-coding genes only" option and made a single download with gene identifiers and symbols.

Status: Completed
Cite this as
Daniel Himmelstein, Casey Greene, Venkat Malladi, Frederic Bastian (2015) Compiling Gene Ontology annotations into an easy-to-use format. Thinklab. doi:10.15363/thinklab.d39