## Background

The National Library of Medicine (NLM) produces a catalog of 23 million journal articles called PubMed. PubMed contains two subsets that are relevant for literature mining:

1. PubMed Central (PMC) — 3.4 million articles that include full texts, rather than just abstracts.
2. MEDLINE — 21 million articles that are manually annotated with their topics. Topics are chosen from the MeSH vocabulary. 5,594 journals are currently indexed.

MeSH, which stands for Medical Subject Headings, is a broad terminology of ~27 thousand terms structured hierarchically to form an ontology. Skilled subject analysts at the NLM typically assign 10–12 MeSH terms per article and denote a subset of these terms as major topics.

## Application

Text mining, as suggested to us by @b_good, is an intriguing technique because it is widely applicable and draws from a knowledge base of epic proportions [1].

We would like to infer relationships between nodes in our network based on MEDLINE cooccurrence. We will search for pairs of MeSH terms that are assigned to the same articles beyond what would be expected if the terms were unrelated. This approach has successfully identified disease symptoms [2] (browse results). The method is versatile and can be applied to any nodes which have been mapped to MeSH.

Daniel Himmelstein Researcher

# Proof of concept implementation

We implemented a topic cooccurrence calculator based on MEDLINE and used this method to identify disease-symptom relationships (notebook, API query script, tsv of results).

First we created a disease set of 119 MeSH terms that mapped to DO slim diseases (tsv of diseases). Next, we created a symptom set of 438 MeSH terms by finding all descendants of D012816 (Signs and Symptoms) (notebook, tsv of symptoms).
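The descendant lookup can be sketched as a simple traversal over a parent-to-children mapping. The snippet below is a minimal illustration with a made-up tree fragment, not the real MeSH hierarchy or the notebook's actual code:

```python
# Hypothetical sketch: collect all MeSH descendants of a root term
# (e.g. D012816, "Signs and Symptoms") from a parent -> children mapping.
# The `children` dict is toy data, not the real MeSH tree.

def get_descendants(root, children):
    """Return the set of all descendants of `root` (excluding root itself)."""
    descendants = set()
    stack = [root]
    while stack:
        term = stack.pop()
        for child in children.get(term, []):
            if child not in descendants:
                descendants.add(child)
                stack.append(child)
    return descendants

children = {
    "D012816": ["D005221", "D010146"],  # toy: root -> two child terms
    "D010146": ["D017699", "D059350"],  # toy: one child has two children
}

symptoms = get_descendants("D012816", children)
```

In the real analysis, the mapping would come from the MeSH tree-number hierarchy rather than a hand-written dict.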

For each disease, we identified the articles where that disease was a major topic. For each symptom, we identified the articles where that symptom was a topic. We then identified the articles that contained both a disease major topic and symptom topic. We based further analysis only on these 392,397 articles that contain at least one disease–symptom cooccurrence.

For each symptom–disease pair, we calculated:

• cooccurrence — the number of articles where the disease and symptom terms cooccurred.
• expected — the number of expected cooccurrences by chance based on each term's marginal frequency.
• enrichment — cooccurrence divided by expected.
• odds_ratio — the odds of cooccurrence divided by the odds of expected. This calculation appears to be slightly messed up due to non-integer expected counts.
• p_fisher — the p-value from Fisher's exact test evaluating whether the observed cooccurrence exceeded that expected by chance.
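The per-pair counts above can be sketched as set operations over article IDs. This is a hedged illustration under assumed names (`cooccurrence_metrics`, toy ID sets), not the project's actual implementation:

```python
# Minimal sketch of the per-pair metrics, assuming `disease_articles` and
# `symptom_articles` are sets of article IDs drawn from the N articles
# containing at least one disease-symptom cooccurrence.

def cooccurrence_metrics(disease_articles, symptom_articles, n_total):
    a = len(disease_articles & symptom_articles)   # observed cooccurrence
    # expected cooccurrences by chance, from each term's marginal frequency
    expected = len(disease_articles) * len(symptom_articles) / n_total
    enrichment = a / expected if expected else float("nan")
    return {"cooccurrence": a, "expected": expected, "enrichment": enrichment}

disease_ids = {1, 2, 3, 4, 5, 6}   # toy data
symptom_ids = {4, 5, 6, 7, 8}      # toy data
metrics = cooccurrence_metrics(disease_ids, symptom_ids, n_total=100)
# cooccurrence = 3, expected = 6*5/100 = 0.3, enrichment = 10.0
```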

@apankov, can you comment on the Fisher's exact test and whether there is a superior way to identify terms that significantly cooccur?

@b_good or others: do you know of better metrics for literature mining? One issue is that our approach may miss common symptoms that are not greatly enriched for any particular disease. The HSDN study [1] used a TF-IDF measure, but we require metrics that are comparable across diseases.

Alex Pankov Researcher

I think Fisher's exact test will be well received by reviewers, but Barnard's test could be a good alternative. Otherwise, if you can calculate a p-value based on permutation (or get a bootstrapped estimate of the variance of the expected cooccurrence count), that could be a straightforward approach.

Daniel Himmelstein Researcher

Thanks @apankov. I couldn't find a python implementation of Barnard's test [1, 2], so I think we'll stick with Fisher's exact test [3] for simplicity. The fidelity of p-values is not a major concern here.

However, it has occurred to me that in our above post, we incorrectly created the contingency table for the exact test. We now construct it similarly to Table 1 of this paper [4] so that the contingency table is:

$$\begin{bmatrix} a & b\\ c & d \end{bmatrix}$$

where

• a is the number of studies with both the disease and the symptom (cooccurrence)
• b is the number of studies with the disease and without the symptom
• c is the number of studies without the disease and with the symptom
• d is the number of studies without either the disease or symptom
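Assuming SciPy is available, constructing this table and running the exact test might look like the sketch below; the counts are invented for illustration and are not from our results:

```python
# Sketch of the corrected 2x2 contingency table and Fisher's exact test,
# following the a/b/c/d layout above. All counts are made up.
from scipy.stats import fisher_exact

n_total = 1000           # articles under consideration
n_disease = 60           # articles with the disease as a major topic
n_symptom = 50           # articles with the symptom as a topic
a = 30                   # articles with both the disease and the symptom

b = n_disease - a        # disease without symptom
c = n_symptom - a        # symptom without disease
d = n_total - a - b - c  # neither disease nor symptom

odds_ratio, p_value = fisher_exact([[a, b], [c, d]], alternative="greater")
```

Note that all four cells are integers here, which avoids the non-integer-count issue mentioned for the earlier odds_ratio calculation.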

The revised symptom–disease pair tsv file is available here.

Daniel Himmelstein Researcher

# Anatomy–Disease Relationships

The Uberon ontology [1] of anatomical structures includes MeSH cross-references. Thus, we performed our MEDLINE cooccurrence analysis described above to find relationships between diseases and anatomical structures (notebook, tsv download).

The ability of this method to capture disease localization was exceptional. For example, the top five terms by p-value for multiple sclerosis were:

| mesh_name | cooccurrence | expected | enrichment | odds_ratio | p_fisher |
|---|---|---|---|---|---|
| Central Nervous System | 881 | 38.6 | 22.8 | 34.3 | 0.000 |
| Spinal Cord | 1492 | 80.8 | 18.5 | 27.5 | 0.000 |
| Myelin Sheath | 1006 | 19.9 | 50.5 | 146.8 | 0.000 |
| Brain | 4777 | 778.3 | 6.1 | 11.5 | 0.000 |
| Optic Nerve | 372 | 36.5 | 10.2 | 11.9 | 0.000 |

One improvement would be to exclude Uberon terms that don't exist in humans, such as venom (UBERON:0007113). Additionally, some Uberon–MeSH mapping issues should be resolved soon, allowing us to update the analysis.

Daniel Himmelstein Researcher

# Disease–Disease Relationships

We computed disease similarities based on MEDLINE cooccurrences. Refer to this discussion for more information.

Daniel Himmelstein Researcher

# Noting MRCOC

Heard about this through a tweet by @b_good: it appears that the National Library of Medicine precomputes literature cooccurrences for MeSH terms. See the page MEDLINE Co-Occurrences (MRCOC) Files.

This could replace some or all of the functionality of dhimmel/medline. However, I haven't actually looked into whether it's a user-friendly substitute; just wanted to take note.

Cite this as
Daniel Himmelstein, Alex Pankov (2015) Mining knowledge from MEDLINE articles and their indexed MeSH terms. Thinklab. doi:10.15363/thinklab.d67