Project:
Rephetio: Repurposing drugs on a hetnet [rephetio]

Unifying drug vocabularies


We would like to integrate several drug resources that rely on different compound vocabularies. These resources include:

type                        resource           vocabulary
indication                  MEDI [1]           RxNorm [2]
indication                  LabeledIn [3, 4]   RxNorm [2]
transcriptional signatures  LINCS              LINCS & PubChem
target binding              ChEMBL [5]         ChEMBL
side effects                SIDER 2 [6]        STITCH [7]
side effects                OFFSIDES [8]       STITCH [7]

We are planning on using DrugBank [9] as the primary vocabulary for compounds. While the coverage of DrugBank is limited, DrugBank includes FDA-approved compounds and likely covers the majority of compounds that would be well-connected in the network. The main benefits of DrugBank are extensive information per compound and a level of granularity that matches our needs.

We plan to use UniChem [10] to map resources with available structures. Importantly, we will likely benefit from a permissive matching algorithm that ignores small structural variations [11]. UniChem has a connectivity mapping feature to perform fuzzy matching [12].

This paper [1] discusses the problem of mapping between medication vocabularies. They identify several major difficulties:

  1. the availability of up-to-date information to assess the suitability of a given terminological system for a particular use case, and to assess the quality and completeness of cross-terminology links
  2. the difficulty of correctly using complex, rapidly evolving, modern terminologies
  3. the time and effort required to complete and evaluate the mapping
  4. the need to address differences in granularity between the source and target terminologies
  5. the need to continuously update the mapping as terminological systems evolve

They provide a helpful diagram (manuscript Fig. 2) that illustrates the connections between terminologies.

Since most of our resources include structural information, we will likely face a slightly different and more computationally amenable set of mapping challenges. Nonetheless, the "differences in granularity between the source and target terminologies" will be an important consideration.

Integrating RxNorm ingredients

The RxNorm terminology [1] does not contain chemical structures for ingredients. However, the terminology does cross-reference NDF-RT and FDA-SRS. The FDA-SRS identifiers, called UNIIs (Unique Ingredient Identifiers), are included in UniChem. Therefore, to map RxNorm ingredients to other vocabularies, we will first convert RXCUIs to UNIIs.

We downloaded the RxNorm data release and loaded it into a MySQL database. We used the following query to produce a RXCUI–UNII mapping:

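-- Map RxNorm concepts (RXCUI) to FDA-SRS substance codes (UNII).
SELECT DISTINCT RXCUI, CODE
FROM rxnorm.RXNCONSO
-- Keep MTHSPL substance ('SU') rows and skip rows without an assigned code.
WHERE SAB = 'MTHSPL' AND TTY = 'SU' AND CODE != 'NOCODE';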
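
To illustrate which rows the query's filters keep, here is a minimal Python sketch that runs the same SELECT against an in-memory SQLite table. It is only an illustration: the rows are invented placeholders rather than real RxNorm content, the table holds just the four RXNCONSO columns the query touches, and the `rxnorm.` schema prefix is dropped because the toy database has no such schema.

```python
import sqlite3

# Toy stand-in for RXNCONSO with only the columns the query uses.
# Every value below is an invented placeholder, not a real RXCUI or UNII.
rows = [
    ("rxcui_1", "MTHSPL", "SU", "UNII_AAA"),  # kept: substance row with a code
    ("rxcui_1", "MTHSPL", "SU", "UNII_AAA"),  # duplicate, collapsed by DISTINCT
    ("rxcui_2", "MTHSPL", "SU", "NOCODE"),    # dropped: no code assigned
    ("rxcui_3", "MTHSPL", "DP", "code_x"),    # dropped: TTY is not 'SU'
    ("rxcui_4", "RXNORM", "IN", "rxcui_4"),   # dropped: different source (SAB)
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RXNCONSO (RXCUI TEXT, SAB TEXT, TTY TEXT, CODE TEXT)")
conn.executemany("INSERT INTO RXNCONSO VALUES (?, ?, ?, ?)", rows)

query = """
    SELECT DISTINCT RXCUI, CODE
    FROM RXNCONSO
    WHERE SAB = 'MTHSPL' AND TTY = 'SU' AND CODE != 'NOCODE'
"""
rxcui_to_unii = conn.execute(query).fetchall()
conn.close()

print(rxcui_to_unii)
```

Only the first placeholder concept survives the three filters, and DISTINCT collapses its duplicate row into a single RXCUI and code pair.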

The UniChem Connectivity Search

UniChem is a structure-centric search engine (based on InChI identifiers [1]) for compound unification [2]. UniChem includes a widesearch mode that matches compounds based on common connectivity [3]. We speculate that fuzzy matching will outperform strict structural identity matching [4] because:

  • we will retain more information by integrating greater percentages of external databases
  • small chemical variations may have a minimal pharmacodynamic impact
  • many resources and pharmacologists conceptualize compounds with less granularity than exact chemical structure

Mapping external resources to DrugBank

We would like to standardize all compound resources using DrugBank [5]. To accomplish this task, we first parsed the DrugBank XML download to extract basic compound information. Second, we mapped DrugBank compounds to each resource in UniChem using the connectivity search [docs]. Third, we assessed the mappings using a variety of metrics. For this third step, we concentrated only on approved small molecules as these will be the most essential and connected in our network.
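
As a rough illustration of the first step, the sketch below extracts basic compound fields from a tiny XML document with Python's standard library. The element names, the `type` attribute, and the namespace are assumptions about how the DrugBank download is laid out and should be checked against the actual file. The InChI value and the second record are invented placeholders.

```python
import xml.etree.ElementTree as ET

# Assumed namespace and element layout; verify against the real download.
NS = {"db": "http://www.drugbank.ca"}

TOY_XML = """<drugbank xmlns="http://www.drugbank.ca">
  <drug type="small molecule">
    <drugbank-id primary="true">DB00315</drugbank-id>
    <name>Zolmitriptan</name>
    <groups><group>approved</group></groups>
    <calculated-properties>
      <property><kind>InChI</kind><value>InChI=1S/PLACEHOLDER</value></property>
    </calculated-properties>
  </drug>
  <drug type="biotech">
    <drugbank-id primary="true">DB_TOY_2</drugbank-id>
    <name>Toy biologic</name>
    <groups><group>approved</group><group>investigational</group></groups>
  </drug>
</drugbank>"""


def parse_drug(drug):
    """Return a dict of basic fields for one <drug> element."""
    inchi = None
    for prop in drug.findall("db:calculated-properties/db:property", NS):
        if prop.findtext("db:kind", namespaces=NS) == "InChI":
            inchi = prop.findtext("db:value", namespaces=NS)
    return {
        "drugbank_id": drug.findtext("db:drugbank-id[@primary='true']", namespaces=NS),
        "name": drug.findtext("db:name", namespaces=NS),
        "type": drug.get("type"),
        "groups": [g.text for g in drug.findall("db:groups/db:group", NS)],
        "inchi": inchi,  # None when no structure is recorded
    }


root = ET.fromstring(TOY_XML)
drugs = [parse_drug(d) for d in root.findall("db:drug", NS)]
for record in drugs:
    print(record)
```

Records whose `inchi` field comes back as `None` correspond to compounds that lack structural information and therefore cannot be mapped by structure.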

The following findings were apparent:

  • DrugBank contained 1,600 approved small molecules, 51 of which were lacking structural information and could not be mapped.
  • 108 DrugBank compounds mapped to multiple DrugBank compounds, indicating that the granularity of our connectivity search is not equivalent to the granularity of the DrugBank inclusion criteria.
  • 93% of DrugBank compounds had at least one match in ChEMBL, 77% matched FDA SRS (UNII), 95% matched PubChem, and 57% matched LINCS.
  • Zolmitriptan (DB00315) matched 768 PubChem compounds.
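
Metrics of this kind can be computed from a flat mapping table. The sketch below is one possible way to do it, not the code that produced the numbers above. The row layout, the function names `coverage` and `match_counts`, and all identifiers in the toy data are hypothetical.

```python
from collections import Counter

# Hypothetical mapping rows: (drugbank_id, resource, matched_id). Invented data.
mapping_rows = [
    ("DB_A", "chembl", "C1"),
    ("DB_A", "pubchem", "P1"),
    ("DB_A", "pubchem", "P2"),
    ("DB_B", "chembl", "C2"),
    ("DB_C", "pubchem", "P3"),
]
# Compounds that were queried; DB_D has no match in any resource.
query_ids = ["DB_A", "DB_B", "DB_C", "DB_D"]


def coverage(rows, ids, resource):
    """Percent of queried compounds with at least one match in `resource`."""
    matched = {db_id for db_id, res, _ in rows if res == resource}
    return 100 * len(matched & set(ids)) / len(ids)


def match_counts(rows, resource):
    """Number of distinct matches per compound in `resource`."""
    return Counter(db_id for db_id, res, _ in set(rows) if res == resource)


for resource in ("chembl", "pubchem"):
    print(resource, coverage(mapping_rows, query_ids, resource))

# Compounds with unusually many matches stand out at the top of this list.
print(match_counts(mapping_rows, "pubchem").most_common(1))
```

Sorting compounds by match count is a quick way to surface outliers such as a single compound matching hundreds of records in one resource.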

Given these findings, we have the following questions for a chemist or cheminformaticist:

  1. Given our focus on pharmacodynamics rather than pharmacokinetics and our desire to avoid duplicate entities in the network, did we properly construct our UniChem query?
  2. Given that some DrugBank compounds matched multiple DrugBank compounds (see histograms), should we instead use the First InChIKey Hash Block (FIKHB) as the primary compound identifier?
  3. ~20% of approved small molecules in DrugBank did not match a FDA-SRS (UNII) compound, which is troubling. Many of these unmatched compounds would have matched by name. We would like an explanation for this discrepancy and will look into the issue further ourselves.
  4. Should we use a tiered matching system, where we take only exact matches when available and then expand to connectivity matches if necessary?
  5. Should we adopt an even more permissive (or alternative) mapping strategy for LINCS to annotate more compounds with transcriptomic profiles?
  6. Is the excessive number of PubChem matches for some DrugBank compounds indicative of a larger problem? The full mapping can be downloaded here.
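
The tiered system raised in question 4 could look roughly like the sketch below. It is a hedged illustration only: exact matching is approximated by equality of the full InChIKey, connectivity matching by equality of the first 14-character block, and salt or mixture components are ignored entirely. The function names and the InChIKey-shaped strings are made up for the example.

```python
def first_block(inchikey):
    """Return the first hash block (connectivity layer) of an InChIKey."""
    return inchikey.split("-")[0]


def tiered_match(query_key, candidates):
    """Match a query InChIKey against {external_id: InChIKey}.

    Returns (tier, ids): exact matches when any exist, otherwise
    connectivity matches, otherwise ("none", []).
    """
    exact = sorted(i for i, k in candidates.items() if k == query_key)
    if exact:
        return "exact", exact
    block = first_block(query_key)
    loose = sorted(i for i, k in candidates.items() if first_block(k) == block)
    if loose:
        return "connectivity", loose
    return "none", []


# Placeholder strings shaped like InChIKeys; they are not real compounds.
candidates = {
    "ext_1": "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "ext_2": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "ext_3": "CCCCCCCCCCCCCC-UHFFFAOYSA-N",
}

print(tiered_match("AAAAAAAAAAAAAA-UHFFFAOYSA-N", candidates))
print(tiered_match("AAAAAAAAAAAAAA-DDDDDDDDSA-N", candidates))
print(tiered_match("ZZZZZZZZZZZZZZ-UHFFFAOYSA-N", candidates))
```

Returning the tier alongside the identifiers keeps the provenance of each match, so downstream analyses can still separate exact from connectivity-based links.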

Compound Vocabulary

We have proceeded with a subset of 1,552 terms from DrugBank [1] as our compound vocabulary. Included compounds meet the following criteria:

  • DrugBank type is small molecule
  • DrugBank groups include approved
  • An InChI chemical structure is available

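
The three criteria amount to a simple filter over parsed DrugBank records. A minimal sketch follows, assuming each record is a dict with hypothetical keys `type`, `groups`, and `inchi`. The records themselves are invented.

```python
def in_vocabulary(record):
    """True when a record meets all three inclusion criteria."""
    return (
        record["type"] == "small molecule"
        and "approved" in record["groups"]
        and bool(record.get("inchi"))  # rejects missing or empty structures
    )


# Invented records using hypothetical field names.
records = [
    {"id": "DB_1", "type": "small molecule", "groups": ["approved"], "inchi": "InChI=1S/PLACEHOLDER"},
    {"id": "DB_2", "type": "small molecule", "groups": ["approved"], "inchi": None},
    {"id": "DB_3", "type": "biotech", "groups": ["approved"], "inchi": None},
    {"id": "DB_4", "type": "small molecule", "groups": ["experimental"], "inchi": "InChI=1S/PLACEHOLDER"},
]

vocabulary = [r["id"] for r in records if in_vocabulary(r)]
print(vocabulary)
```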
Other compound vocabularies are mapped to DrugBank with UniChem [2] using the most permissive matching scheme available (B = 0 and C = 4). B = 0 matches compounds using the FIKHB (First InChIKey Hash Block) which is based on atomic connectivity [3]. C = 4 matches compounds which share a component with a component of the DrugBank compound, in order to ignore differences based on salts and acids.
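
To show what matching on the FIKHB implies, the sketch below groups compounds by the first 14 characters of their InChIKey and reports the groups that contain more than one compound. It covers only the connectivity part (B = 0) and makes no attempt to reproduce UniChem's component matching (C = 4). The identifiers and InChIKey-shaped strings are placeholders.

```python
from collections import defaultdict


def fikhb(inchikey):
    """Return the First InChIKey Hash Block, checking its expected length."""
    block = inchikey.split("-")[0]
    if len(block) != 14:
        raise ValueError(f"unexpected first block in {inchikey!r}")
    return block


# Placeholder identifiers and InChIKey-shaped strings, not real compounds.
compounds = {
    "DB_X1": "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "DB_X2": "AAAAAAAAAAAAAA-CCCCCCCCSA-N",  # same connectivity, different second block
    "DB_Y": "BBBBBBBBBBBBBB-UHFFFAOYSA-N",
}

groups = defaultdict(list)
for db_id, key in compounds.items():
    groups[fikhb(key)].append(db_id)

# Blocks shared by several compounds are merged by connectivity matching.
collisions = {block: ids for block, ids in groups.items() if len(ids) > 1}
print(collisions)
```

Any group with more than one member is a set of compounds that connectivity matching treats as equivalent, which is the granularity trade-off this scheme accepts.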

 
Status: Completed
Labels
  data integration
Cite this as
Daniel Himmelstein (2015) Unifying drug vocabularies. Thinklab. doi:10.15363/thinklab.d40
License

Creative Commons License
