Processing LabeledIn to extract indications

The LabeledIn resource consists of expert-curated [1] and crowdsourced [2] components. Here we discuss parsing these resources to extract indications.

• Jesse Spaulding: For future reference it's probably better to wait until you have more to say here before you post this. The project's followers probably don't need to know what you are planning to post here (until you actually post it!) What do you think? And yes, 'draft mode' is coming soon!

• Daniel Himmelstein: Hey Jesse, I was making a stub because one project member may be interested in posting and I thought this would simplify the process.

• Jesse Spaulding: Oh alright, carry on then :)

Question: I was under the impression that the workers assessed individual indications rather than all indications within a specific label. Therefore each drug–disease (RxNORM–UMLS) pair should have its own majority vote. However, the data release appears to be listed in terms of labels rather than indications. Some labels have multiple UMLS diseases but only report the outcome of a single vote. Majority votes should be in terms of indications rather than labels, right?

Answer: You are right: each drug–disease (RxNORM–UMLS) pair should have its own majority vote, and majority votes should be in terms of indications rather than labels. The data is organized in exactly this manner (one entry = one drug-label/UMLS-CUI pair). Each entry in the text file corresponds to one indication candidate (i.e. one disease UMLS-CUI) in a given drug label. The disease CUI is specified in the third field of the file. As you have already noted, some entries have two CUIs in the third field. These correspond to composite mentions (e.g. "Moderate to severe pain"). Our disease NER module detects two concepts for this phrase ("moderate +pain" and "severe pain"), but we present the phrase as a single disease mention to the turkers, and hence a single majority vote was computed for both UMLS-CUIs.
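To make the composite-mention case concrete, here is a minimal Python sketch of how such entries might be expanded into one row per drug-label/CUI pair. The field order (label ID, drug identifier, disease CUIs, majority vote), the tab delimiter, and the function name `expand_entries` are assumptions for illustration only; the answer above states only that the disease CUI sits in the third field. The sketch relies on the standard UMLS CUI pattern (the letter C followed by seven digits) so that it does not need to guess how two CUIs are separated.

```python
import re

def expand_entries(lines, sep="\t"):
    """Yield one (label_id, drug_id, cui, vote) tuple per disease CUI.

    Assumed (hypothetical) layout: label ID, drug ID, disease CUI(s), vote.
    """
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        label_id, drug_id, cui_field, vote = fields[:4]
        # A composite mention lists several CUIs in the third field;
        # each CUI inherits the single majority vote of that mention.
        for cui in re.findall(r"C\d{7}", cui_field):
            yield label_id, drug_id, cui, vote
```

With this expansion, an entry with two CUIs becomes two indication candidates that share one vote, which matches the behavior described for phrases like "Moderate to severe pain".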

We are happy to answer more questions! - LabeledIn Team

Daniel Himmelstein Researcher

Thanks @ritukhare, we've processed your datasets and combined the expert [1] and crowdsourced [2] indications. The resulting .tsv file is available for download. We provide ingredient and disease names here only for convenience, since our simplistic lookup methodology left many identifiers unnamed.

Specifically, we extracted 1,335 indications from the expert data release and 1,516 indications from the crowdsourced data release. The two sets shared one indication, so merging the two resources resulted in 1,335 + 1,516 - 1 = 2,850 indications.
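The merged count is the usual size-of-a-union identity. A small Python sketch with made-up drug-disease pairs (not real LabeledIn identifiers; `merge_indications` is a hypothetical helper, not the code used for the analysis) shows the same bookkeeping:

```python
def merge_indications(expert, crowd):
    """Return the union of two indication sets and the size of their overlap."""
    shared = expert & crowd
    return expert | crowd, len(shared)

# Toy (made-up) drug-disease pairs, for illustration only.
expert = {("drugA", "disease1"), ("drugB", "disease2")}
crowd = {("drugB", "disease2"), ("drugC", "disease3")}
merged, n_shared = merge_indications(expert, crowd)

# |A union B| = |A| + |B| - |A intersect B|
assert len(merged) == len(expert) + len(crowd) - n_shared

# The counts reported in the post satisfy the same identity.
assert 1335 + 1516 - 1 == 2850
```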

We calculated the total number of labels reporting each indication. For this task, we assumed study_drug_label_ID was consistent across the expert and crowdsourced datasets. If this assumption is wrong, the effect would be minimal, since the two releases report almost entirely disjoint sets of indications.
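As a rough sketch of this counting step (an illustration under assumptions, not the actual processing code), the number of labels per indication can be computed by collecting the distinct label IDs seen for each drug-disease pair. The record layout `(label_id, drug, disease)` and the name `count_labels` are hypothetical.

```python
from collections import defaultdict

def count_labels(records):
    """Count distinct labels reporting each (drug, disease) indication.

    `records` is an iterable of (label_id, drug, disease) tuples pooled
    from both releases (hypothetical layout).
    """
    labels = defaultdict(set)
    for label_id, drug, disease in records:
        # A set ensures a label repeated for the same pair counts once.
        labels[(drug, disease)].add(label_id)
    return {pair: len(ids) for pair, ids in labels.items()}
```

Using a set of label IDs per pair means the count stays correct as long as the same label never appears under two different IDs across the releases, which is the assumption stated above.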

• Daniel Himmelstein: @ritukhare, if you have a readily-available and exhaustive mapping of ingredient and disease identifiers to names, I could update my analysis with those names.

This is great. Thanks @dhimmel. There should be no confusion with the study_drug_label_ID between the two datasets: in expert-LabeledIn, the values are numbers, and in crowd-LabeledIn, the values are a concatenation of the drug type and a number.

I don't have a readily available mapping of ingredient and disease identifiers to names. Please note that it would be more appropriate to use the title of the drug label (SPL) instead of the ingredient name, as the title also contains the dose form information of the drug (and we found that indications may be different between two drugs having the same ingredient but different dose forms). However, it's your decision.

Daniel Himmelstein Researcher

we found that indications may be different between two drugs having same ingredient but different dose form

@ritukhare, interesting to hear that examples of repurposing frequently relied on different dose forms (and perhaps dosage levels as well). I think we would like to ignore this complexity. In other words, our predicted indications will not include dosage or dose form recommendations. I am comfortable leaving these details for the end users to investigate.

Status: Completed
Cite this as
Daniel Himmelstein, Ritu Khare (2015) Processing LabeledIn to extract indications. Thinklab. doi:10.15363/thinklab.d46