The LabeledIn resource consists of an expert-curated component [1] and a crowdsourced component [2]. Here we discuss parsing these resources to extract indications.

Question: I was under the impression that the workers assessed individual indications rather than all indications within a specific label. Therefore each drug–disease (RxNORM–UMLS) pair should have its own majority vote. However, the data release appears to be listed in terms of labels rather than indications. Some labels have multiple UMLS diseases but only report the outcome of a single vote. Majority votes should be in terms of indications rather than labels, right?

Answer: You are right: each drug–disease (RxNORM–UMLS) pair should have its own majority vote, and majority votes should be in terms of indications rather than labels. The data is in fact organized this way (one entry = one drug-label/UMLS-CUI pair). Each entry in the text file corresponds to one indication candidate (i.e. one disease UMLS-CUI) in a given drug label. The disease CUI is specified in the third field of the file. As you have already noted, some entries have two CUIs in the third field. These correspond to composite mentions (e.g. "Moderate to severe pain"). Our disease NER module detects two concepts for this phrase ("moderate +pain" and "severe pain"), but we present the phrase as a single disease mention to the turkers, and hence a single majority vote was computed for both UMLS-CUIs.
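
For anyone parsing the file, here is a minimal sketch of how a composite entry could be expanded into one row per CUI. It assumes tab-delimited entries with the label ID in the first field (only the third-field position of the CUI is stated above), and it pulls CUIs out with a regular expression so that it does not depend on how two CUIs are separated within the field. The function name `expand_entry` and the sample line are made up for illustration.

```python
import re

# UMLS CUIs are the letter "C" followed by seven digits.
CUI_PATTERN = re.compile(r"C\d{7}")

def expand_entry(line, sep="\t"):
    """Return (label_id, cui) pairs, one per CUI found in the third field.

    Assumptions (not confirmed by the data release): fields are separated
    by `sep`, and the first field holds the drug label ID.
    """
    fields = line.rstrip("\n").split(sep)
    label_id = fields[0]
    cuis = CUI_PATTERN.findall(fields[2])
    return [(label_id, cui) for cui in cuis]

# Hypothetical composite entry with placeholder CUIs in the third field.
sample = "X1\t12345\tC0000001/C0000002\tyes"
pairs = expand_entry(sample)
```

Each pair produced from a composite entry inherits the same majority vote, since the turkers saw the phrase as a single mention.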

Thanks @ritukhare, we've processed your datasets and combined the expert [1] and crowdsourced [2] indications. The resulting .tsv file is available for download. We provide ingredient and disease names here only for convenience, since our simplistic lookup methodology left many identifiers unnamed.

Specifically, we extracted 1,335 indications from the expert data release and 1,516 indications from the crowdsourced data release. The two sets shared one indication, so merging the two resources resulted in 1,335 + 1,516 - 1 = 2,850 indications.
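
The merged count follows from the usual union rule, |A ∪ B| = |A| + |B| - |A ∩ B|. A small sketch of that arithmetic, using made-up toy sets rather than the actual data (the helper name `merged_count` is only for illustration):

```python
def merged_count(n_expert, n_crowd, n_shared):
    """Size of the union of two sets, given their sizes and their overlap."""
    return n_expert + n_crowd - n_shared

# Toy (made-up) indication sets; each item is a (drug, disease) pair.
expert = {("drug_a", "disease_x"), ("drug_b", "disease_y")}
crowd = {("drug_b", "disease_y"), ("drug_c", "disease_z")}
merged = expert | crowd  # set union drops the shared indication

# The union rule agrees with the actual set union on the toy data.
assert len(merged) == merged_count(len(expert), len(crowd), len(expert & crowd))

# Counts reported above: 1,335 expert, 1,516 crowdsourced, 1 shared.
print(merged_count(1335, 1516, 1))  # 2850
```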

We calculated the total number of labels reporting each indication. For this task, we assumed study_drug_label_ID was consistent across the expert and crowdsourced datasets. If this assumption is wrong, the effect would be minimal, since the two releases report almost entirely disjoint sets of indications.
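
To make the counting step concrete, here is a rough sketch of how the number of distinct labels per indication could be tallied. It is not the code we actually ran. The record layout `(label_id, drug, disease)` and the function name `count_labels` are assumptions for illustration, and the toy records are made up.

```python
from collections import defaultdict

def count_labels(records):
    """Map each (drug, disease) indication to its number of distinct labels.

    `records` is an iterable of (label_id, drug, disease) tuples. Label IDs
    are collected in a set, so a label that reports the same indication
    twice is counted once.
    """
    labels = defaultdict(set)
    for label_id, drug, disease in records:
        labels[(drug, disease)].add(label_id)
    return {indication: len(ids) for indication, ids in labels.items()}

# Toy (made-up) records; the last one repeats label "1" for the same indication.
records = [
    ("1", "drug_a", "disease_x"),
    ("2", "drug_a", "disease_x"),
    ("2", "drug_b", "disease_y"),
    ("1", "drug_a", "disease_x"),
]
counts = count_labels(records)
```

If the same study_drug_label_ID value meant different labels in the two releases, the set would merge them and undercount, which is why the consistency assumption matters.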

This is great. Thanks @dhimmel. There should be no confusion with the study_drug_label_ID between the two datasets: in expert-LabeledIn, the values are numbers, and in crowd-LabeledIn, the values are a concatenation of drug type and a number.
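
Given that distinction, the source of a label ID can be inferred from its format alone. A minimal sketch, assuming the IDs are read as strings. The function name `label_source` and the example ID "AB12" are made up, since the exact drug-type prefixes are not listed here.

```python
def label_source(label_id):
    """Guess which release a study_drug_label_ID comes from.

    Purely numeric IDs are treated as expert-LabeledIn; anything else
    (drug type concatenated with a number) as crowd-LabeledIn.
    """
    return "expert" if label_id.strip().isdigit() else "crowd"

# Hypothetical IDs for illustration only.
sources = [label_source("1024"), label_source("AB12")]
```

Because the two formats cannot collide, label counts can be pooled across the releases without relabeling the IDs.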

I don't have a readily available mapping of ingredient and disease identifiers to names. Please note that it would be more appropriate to use the title of the drug label (SPL) instead of the ingredient name, as the title also contains the dose form information of the drug (and we found that indications may be different between two drugs having the same ingredient but different dose forms). However, it's your decision.

we found that indications may be different between two drugs having the same ingredient but different dose forms

@ritukhare, interesting to hear that examples of repurposing frequently relied on different dose forms (and perhaps dosage levels as well). I think we would like to ignore this complexity. In other words, our predicted indications will not include dosage or dose form recommendations. I am comfortable leaving these details for the end users to investigate.

