Five steps to mining scholarly knowledge about coronaviruses

The COVID-19 Open Research Dataset (CORD-19) is a public dataset with over 59,000 coronavirus-related scholarly articles. It was prepared by the White House and a coalition of leading research groups.

This post shows how Dcipher Analytics can be used to explore the CORD-19 dataset and extract useful information from it.

(Photo by CDC on Unsplash)

1. Getting an overview through article landscaping

When exploring a large body of texts, like the CORD-19 dataset, it helps to first get an overview of the different themes in the texts. In Dcipher Analytics, this can be done through an approach called document landscaping. It uses machine learning to transform the texts into vectors in a high-dimensional "semantic space" and then projects them onto a two-dimensional plane. The result is a landscape with peaks and valleys, where peaks correspond to areas with a high density of contextually similar texts, regardless of which keywords they use. This bottom-up technique is useful because it lets patterns emerge from the data without any prior hypotheses or search criteria. It's a great way of getting an overview of themes in the texts and spotting "unknown unknowns".

The landscape is generated by dragging the text field of interest to the Document landscaping drop zone in the Scatter plot workbench. This is what the landscape of articles in the CORD-19 dataset looks like after having been labeled:

Note that clusters are not formed around specific keywords, but rather around themes that can be expressed in many different ways. The machine learning model trained for this task "knows" that texts are semantically similar even if they don't use the same words.
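For readers who want a feel for the "vectors, then a two-dimensional projection" idea outside Dcipher, here is a minimal sketch in plain Python. It is not Dcipher's implementation. It swaps the machine-learned semantic embedding for simple word-count vectors (so, unlike the real thing, it does depend on shared keywords) and uses a principal component projection computed by power iteration. All function names and the toy texts are made up for illustration.

```python
import math
import random
from collections import Counter


def vectorize(texts):
    """Turn each text into a word-count vector over a shared vocabulary."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({w for words in tokenized for w in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([float(counts[w]) for w in vocab])
    return vectors


def center(rows):
    """Subtract the column means so the data is centered on the origin."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def leading_direction(rows, iterations=200, seed=0):
    """Power iteration on X^T X: returns the direction of greatest variance."""
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in rows[0]]
    for _ in range(iterations):
        scores = [dot(row, v) for row in rows]
        w = [sum(s * row[j] for s, row in zip(scores, rows))
             for j in range(len(v))]
        norm = math.sqrt(dot(w, w)) or 1.0
        v = [x / norm for x in w]
    return v


def remove_direction(rows, v):
    """Project out direction v so the next power iteration finds a new axis."""
    result = []
    for row in rows:
        p = dot(row, v)
        result.append([x - p * y for x, y in zip(row, v)])
    return result


def project_2d(texts):
    """Map each text to (x, y) coordinates on the first two principal axes."""
    rows = center(vectorize(texts))
    axis1 = leading_direction(rows, seed=0)
    axis2 = leading_direction(remove_direction(rows, axis1), seed=1)
    return [(dot(row, axis1), dot(row, axis2)) for row in rows]


# Invented toy passages: two about molecular biology, two about clinical care.
texts = [
    "virus spike protein binding receptor",
    "spike protein receptor binding domain",
    "hospital patients fever cough symptoms",
    "patients symptoms fever hospital admission",
]
points = project_2d(texts)
for text, (x, y) in zip(texts, points):
    print(f"{x:7.2f} {y:7.2f}  {text}")
```

Texts on the same theme land close together in the plane, and estimating the density of points over that plane is what would turn the scatter into peaks and valleys.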

This is a zoomed-in version where a sample of article passages are displayed as dots:

To interpret the peaks in the landscape, we use three different Dcipher Analytics workbenches to view different aspects of the data: (1) the landscape itself, (2) words characteristic of articles in a given region of the landscape, and (3) individual articles representative of a certain region. Simple drag-and-drops between workbenches are used to specify the input into each workbench.

The process is shown in the video below.

2. Detecting topics in the texts

Dcipher Analytics' built-in topic detection operation makes it possible to find specific topics in the article texts based on how words tend to co-occur in the texts. Since the user can specify the number of topics to be extracted, it can be used to find broad as well as highly granular topics.

The 20 topics extracted here include preparedness and response, epidemiological modelling, public health, clinical management, and transmission dynamics. They are interpreted through the words that are most characteristic of each topic.
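To illustrate what "based on how words co-occur" can mean in practice, here is a deliberately simple sketch. It is not the algorithm Dcipher uses and not a full topic model: it just counts how often word pairs appear in the same document and groups words that are linked by frequent pairs. The function names, the min_count threshold, and the toy documents are assumptions made for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for every word pair, how many documents contain both words."""
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        counts.update(combinations(words, 2))
    return counts


def word_groups(docs, min_count=2):
    """Group words connected by pairs seen in at least min_count documents."""
    neighbours = defaultdict(set)
    for (a, b), n in cooccurrence_counts(docs).items():
        if n >= min_count:
            neighbours[a].add(b)
            neighbours[b].add(a)

    groups, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first walk over the co-occurrence graph from this word.
        stack, group = [start], set()
        while stack:
            word = stack.pop()
            if word in group:
                continue
            group.add(word)
            stack.extend(neighbours[word] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Invented toy documents using topic names mentioned in this post.
docs = [
    "transmission dynamics model",
    "transmission dynamics estimate",
    "clinical management guidance",
    "clinical management protocol",
]
print(word_groups(docs))
```

In this sketch, granularity is controlled by the threshold rather than by a topic count: raising min_count splits the words into smaller, tighter groups.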

The video below shows the process of quantifying and interpreting the topics in Dcipher.

3. Digging deeper through entity extraction

After having identified a topic or theme of interest through document landscaping and topic detection – in this case we are particularly interested in the peak around drug therapies – it is time to drill down to get the specifics. Entity extraction is a great way of picking up relevant concepts from articles, such as names of health conditions and drugs. Dcipher Analytics integrates with third-party services for entity detection, including IBM Watson, which we use here.

By displaying the extracted entities as a network, where entities that tend to be mentioned in the same articles are clustered together, we can spot interesting clusters. For example, antiretroviral drugs used to treat and prevent HIV/AIDS, such as ritonavir, atazanavir, and efavirenz, form a distinct cluster. Another cluster relates to protease inhibitors, including Cystatin D. Scoring texts by these concepts, which is done through a simple drag-and-drop, reveals the passages where they are used, showing articles that investigate Cystatin D as an inhibitor of coronavirus replication.
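The scoring step can be pictured with a small sketch. This is only an approximation of the idea, not how Dcipher scores texts: it assumes a passage's score is simply the number of selected concepts it mentions, found by case-insensitive substring matching. The function name and the example passages are invented for illustration.

```python
def score_passages(passages, concepts):
    """Rank passages by how many of the given concepts they mention."""
    scored = []
    for passage in passages:
        text = passage.lower()
        hits = [c for c in concepts if c.lower() in text]
        if hits:
            scored.append((len(hits), passage, hits))
    # Highest-scoring passages first; ties keep their original order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Invented example passages, not quotes from CORD-19 articles.
passages = [
    "Ritonavir and atazanavir were evaluated in combination.",
    "The study describes hospital admission rates.",
    "Efavirenz was compared with ritonavir and atazanavir.",
]
concepts = ["ritonavir", "atazanavir", "efavirenz"]

for score, passage, hits in score_passages(passages, concepts):
    print(score, hits, passage)
```

Passages that mention none of the concepts drop out, which is what makes the remaining list quick to read through.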

The video below shows what the process looks like in Dcipher.

4. Identifying prominent researchers in areas of interest

We are interested not only in the content of coronavirus articles, but also in the researchers who are driving the research efforts within different areas of interest.

In this case, the document landscaping has helped us identify phylogenetic analysis as an area with a substantial number of articles. Scoring and filtering articles based on the relevant keywords gives the subset of articles of interest. We have already split the semicolon-separated authors field into a collection of individual authors. By dragging this collection to the Bar chart workbench, top authors by number of publications are displayed. Dragging it to the Bubble workbench enables us to view the co-authorship network of the authors, which is useful for locating important research groups and collaborations.
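The same three steps (splitting the authors field, counting publications per author, and building co-authorship links) can be sketched in a few lines of Python. This is a rough equivalent for illustration, not what happens inside Dcipher. The function names are made up, and the author names are placeholders rather than real CORD-19 authors.

```python
from collections import Counter
from itertools import combinations


def split_authors(field):
    """Split a semicolon-separated authors field into clean individual names."""
    return [name.strip() for name in field.split(";") if name.strip()]


def top_authors(author_fields, n=10):
    """Count publications per author and return the n most prolific."""
    counts = Counter()
    for field in author_fields:
        counts.update(set(split_authors(field)))
    return counts.most_common(n)


def coauthor_edges(author_fields):
    """Count how many articles each pair of authors has written together."""
    edges = Counter()
    for field in author_fields:
        authors = sorted(set(split_authors(field)))
        edges.update(combinations(authors, 2))
    return edges


# Placeholder author fields, one per article.
author_fields = [
    "Author A; Author B",
    "Author A; Author C",
    "Author A; Author B; Author C",
    "Author D",
]

print(top_authors(author_fields, n=2))
print(coauthor_edges(author_fields).most_common(3))
```

The publication counts correspond to what the bar chart displays, and the weighted pairs are the edges of the co-authorship network.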

The video below shows what this process looks like in Dcipher.

5. Mapping differences and similarities between concepts

The CORD-19 articles cover many types of coronaviruses, including SARS-CoV (the SARS coronavirus discovered in 2003) and SARS-CoV-2 (which causes COVID-19). To understand which research topics overlap between the two and which are specific to each, we use a technique called seeding.

Both concepts are selected as seeds, which means that they attract words in the articles in proportion to how frequently they co-occur. This leads to a kind of tug-of-war between the two, lining up associated words between them. While SARS-CoV-2 research appears to have a stronger focus on topics related to public health, research on cellular mechanisms and testing in mice seems to be more strongly associated with SARS-CoV.
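One way to picture the tug-of-war is to give every word a position between the two seeds based on how often it co-occurs with each. The sketch below is an assumption about how such a position could be computed, not Dcipher's actual seeding method: it counts the documents in which a word appears together with each seed and places the word at (b - a) / (a + b), so -1 means "only with the first seed" and +1 means "only with the second". The function name and toy documents are invented.

```python
from collections import Counter


def seed_positions(docs, seed_a, seed_b):
    """Place each word on a -1..+1 axis between two seed terms."""
    seed_a, seed_b = seed_a.lower(), seed_b.lower()
    with_a, with_b = Counter(), Counter()
    for doc in docs:
        words = set(doc.lower().split())
        others = words - {seed_a, seed_b}
        if seed_a in words:
            with_a.update(others)
        if seed_b in words:
            with_b.update(others)

    positions = {}
    for word in set(with_a) | set(with_b):
        a, b = with_a[word], with_b[word]
        # a + b is never zero here: the word co-occurred with a seed at least once.
        positions[word] = (b - a) / (a + b)
    return positions


# Invented toy documents, one per line.
docs = [
    "sars-cov mice replication",
    "sars-cov replication",
    "sars-cov-2 public health",
    "sars-cov-2 replication",
]

positions = seed_positions(docs, "SARS-CoV", "SARS-CoV-2")
for word, pos in sorted(positions.items(), key=lambda item: item[1]):
    print(f"{pos:+.2f}  {word}")
```

Words near zero are shared between the two seeds, while words near either end are specific to one of them.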

The video below shows the process in Dcipher.

To mine the CORD-19 dataset in Dcipher Analytics yourself – or analyze any other text data – sign up for a free trial. To better understand our solutions, read more here or check out our other blog posts. If you have questions, want to discuss your use case, or get a guided tour, don't hesitate to contact us.