Project:
Rephetio: Repurposing drugs on a hetnet [rephetio]

Computing consensus transcriptional profiles for LINCS L1000 perturbations


LINCS (Library of Integrated Cellular Signatures) provided "perturbational profiles across multiple cell and perturbation types, as well as read-outs, at a massive scale." We plan to compute transcriptional profiles, i.e. expression signatures — sets of up and down-regulated genes — for the compounds in our network. We will be following a workflow suggested to us by Ted Natoli during the online office hours.

For a given compound in our network, there may be multiple matched LINCS compounds. Additionally, each LINCS compound may have been assayed across multiple cell lines, dosages, and replicates. To calculate a single consensus transcriptional profile across multiple signatures we will

  1. calculate pairwise correlations between signatures
  2. calculate mean correlation with other signatures for each signature
  3. scale similarities to sum to 1
  4. multiply z-score signature vectors by their similarity weights
  5. sum weighted z-score signature vectors

The z-score signature vectors are retrieved from the /xchip/cogs/data/build/a2y13q1/modzs.gctx file on the C3 cloud.

The BD2K LINCS Data Science Research webinars serve as a general forum to engage data scientists within and outside of LINCS project to work on problems related to LINCS data analysis and integration.

The BD2K-LINCS DCIC engages the research community by delivering high quality educational materials through the web as well as through mentoring, seminars and symposia. Also, Center investigators actively engage in the education of a new generation of Big Data Scientists by developing a graduate-level Big Data Science MOOC that will be delivered to graduate students in Big Data Biostatistics and other Biomedical Informatics graduate programs.

See BD2K-LINCS DCIC Training and Outreach.

@leobrueggeman and I attended the online LINCS office hours today. We spoke primarily with Ted who provided some helpful insights and suggestions.

  1. Spearman's correlation is preferred to the Pearson's correlation when calculating correlations between transcriptional profiles. Spearman's correlation is rank based and therefore less prone to excessive influence of extremes. We will switch from Pearson's to Spearman's correlation in step 2 above.

  2. Gold signatures — Multiple replicates (often 3) are performed for each signature. Higher quality signatures are considered gold (is_gold) based on a heuristic that values reproducibility across replicates and distinctness of that signature. They require the aggregate correlation within replicates to exceed a threshold among other criteria. Around half of the dataset is gold. The informativeness of nongold signatures is dubious, thus Ted suggests restricting to gold signatures.

  3. Z-score threshold — when setting a DEG (deferentially expressed gene) threshold, the LINCS team uses > 2.0 or < -2.0 and is pleased with the results.

  4. Signature number bias in DEG counts — We found that some perturbagens had an extremely high number of DEGs. We speculate that this occurs for perturbagens with few signatures, since with many signatures across varying contexts, a large set of consistently dysregulated genes seems unlikely. Two possible corrections for this bias are: a) choose a constant number of signatures to include per perturbagen. However, this would lead to information loss, which is undesirable. b) develop an empirical method for adjusting for expected DEG numbers. We will explore option (b) in a future post.

  5. Best inferred gene set — (is_bing) refers to genes that are reliably imputed from the 1000 assayed genes (L1000). We plan to switch to only included the ~7,500 bing genes in our DEG analysis because we would like to avoid unreliable data. This may also help reduce the excess of DEG for certain perturbagens. Whether a gene is part of the bing subset can be retrieved through the geneinfo API query.

We talked to Dave and Ted at the LINCS office hours today. Here are the meeting notes:

Probes and genes

There have been two versions of landmark genes (the genes that are measured by the L1000 platform). In the first version (pr_pool_id = 'deltap'), there were 979 landmark genes. In the current version (pr_pool_id = 'epsilon'), there are 978.

The L1000 platform is designed to imitate the Affymetrix HG-U133A array [1], so L1000 output is in probe-space. The LINCS team performs their analyses in probe-space. For landmark genes, the probe-to-gene correspondence is one-to-one. However, other genes may consist of multiple probes. We plan to average z-scores across probes to convert our observations into gene-space.

Computing consensus signatures

When drugs have multiple signatures, we find the average correlation value for each signature. We have noticed that some signatures have negative correlations and were thus contributing their inverse signature to the consensus. We are uncomfortable with negative weights and therefore plan to set a minimum correlation threshold for each signature. A minimum of 0 would exclude all negatively correlated signatures. Another option is 0.1, which is used by the LINCS team when processing shRNA data.

Miscellaneous

  • The API returns -666 for missing values.
  • Instances refer to the replicates that compose a signature.
  • @leobrueggeman, will be presenting on LINCS at lab meeting today (presentation).
  • is_summly refers to whether a perturbagen has been profiled across a broad range of cell lines.

Combining z-scores across multiple signatures

To create consensus signatures for a compound, we have been taking a weighted average of z-scores (steps 4–5 above). It has occurred to us that this is an underpowered method, just as averaging p-values is a weak method of meta-analysis.

Instead we can use Stouffer's method to meta-analyze z-scores [1]. This method accepts weights (calculated in steps 2–3 above). The formula is below for weight vector w and z-score vector Z:

$$$ Z \sim \frac{\sum_{i=1}^k w_iZ_i}{\sqrt{\sum_{i=1}^k w_i^2}} $$$

Initial release of consensus signatures

We have computed consensus transcriptional signatures for LINCS L1000 perturbations. We have released datasets [1] for the following pertubation sets:

  • LINCS pert_ids: 38,327 consensi (download)
  • DrugBank compounds: 1,170 consensi (download)
  • Gene knockdowns: 4,363 consensi (download)
  • Gene over-expressions: 2,471 consensi (download)

The datasets are tsv-formatted with perturbations as rows and genes as columns. We only report expression values for the 978 assayed genes. Non-gold signatures were omitted. We set a minimum signature weight of 0.05 and combined z-scores using Stouffer's method.

Concensus signatures version 2.0

We've released v2.0 of our analysis of LINCS L1000. This release brings major updates including:

  • Inferred genes. Previously, dysregulation scores were only reported for the 978 landmark (directly measured) genes. Now we've expanded our analysis to include 6,489 imputed genes from the best inferred gene set (is_bing), which covers genes imputed with high accuracy. However, we've maintained backwards compatibility by only using landmark probes for signature weighting.
  • Significantly dysregulated genes. We now report significantly down or upregulated perturbagen–gene pairs.
  • Improved knockdown and overexpression pipeline. We now convert gene symbols to entrez genes at the earliest stage. Now two genetic perturbations with different symbols that map to the same entrez gene will benefit from z-score meta-analysis.

Consensus signature datasets

Our consensus signatures are available on GitHub [1] and figshare [2]. Each consensus signature measures the transcriptional response to a perturbation at 7,467 genes. Genes are identified by their Entrez GeneID. Consensi are produced for:

  • DrugBank compounds — 1,170 small molecule compounds identified by their DrugBank ID. L1000 compounds were mapped to DrugBank using atomic connectivity.
  • Gene knockdowns — 4,326 knocked-down genes identified by their Entrez GeneID.
  • Gene overexpressions — 2,413 overexpressed genes identified by their Entrez GeneID
  • All L1000 pertubations — 38,327 perturbagens identified by their L1000 pert_id.

Methodology

To recap our methodology, our objective is to compute a consensus signature from multiple input signatures. Each input signature measures the dysregulation caused by a specific perturbation and condition. The purpose of computing consensi is to combine signatures for the same perturbation under different conditions (for examples different dosages, cell types, or time points).

Our method, developed in consultation with the L1000 team, arrives at a consensus signature from a set of input signatures by:

A) Starting with a probe × signature matrix of dysregulation z-scores with the following filters:

  • Initial probe filter: include all landmark or is_bing probes with the following exclusions: a) inferred probes for genes with a landmark probe. b) probes with non-existent Entrez GeneIDs.
  • Initial signature filter: use only gold signatures to remove non-replicating or indistinct signatures

B) Then a gene-level consensus signature is computed by:

  1. Calculating an input signature weight. Each input signature gets a weight equal to its average Spearman's correlation with other input signatures. We set a minimum correlation value of 0.05 to ensure all signatures make at least a small contribution and to prevent negative weights. Weights are computed using only landmark probes.
  2. Meta-analyzing z-scores with Stouffer's method to compute a probe-level consensus signature.
  3. Condensing to a gene-level consensus by averaging probe z-scores for the same entrez gene.

C) Finally, significant Perturbagen–regulates–Gene relationships are extracted for a given perturbation by:

  1. Converting gene z-scores to p-values by via a normal distribution
  2. Correcting p-values for multiple testing using a Bonferroni adjustment. The correction is applied separately to measured genes and inferred genes. Hence, inferred genes are more heavily penalized for multiple testing.
  3. Filter genes for corrected p-value ≤ 0.05.
  4. Allowing at most 1000 significant inferred genes. In cases with more than 1000 significant inferred genes, filter to the 1000 smallest p-values.

The methods described above are executed in the consensi.ipynb and significance.ipynb notebooks.

 
Join to Reply
Status: Completed
Views
602
Topics
Referenced by
Cite this as
Daniel Himmelstein, Caty Chung (2015) Computing consensus transcriptional profiles for LINCS L1000 perturbations. Thinklab. doi:10.15363/thinklab.d43
License

Creative Commons License

Share