COVID Explorer

In this period of rapid response, there is an urgent need to better understand the biology and clinical implications of the viral infection, COVID-19. As researchers engage with new and historical texts, the ability to discern what is relevant and important is paramount to successful collaboration and innovation.

What's it for?

COVID Explorer is a web application that uses automated machine learning and interactive visualizations to help researchers find the most relevant and urgent documents for key scientific research questions.

By applying machine learning to full texts, we can identify linguistic subtopics, generate text summaries, and rank how closely each text aligns with a research question, so that researchers can build context beyond the traditional search and sort found in databases like Web of Science and Google Scholar.

How it Works

Our process is driven by two questions:

  1. What are the underlying topics within the corpus?
  2. Which publications are most relevant and urgent around key scientific research questions?

To answer these questions, we brought together a suite of machine learning techniques, including deep learning and graph analytic methods, that identify natural “topics” within the corpus based on linguistic patterns in each document's title, abstract, and body text. We then enriched this information with automatically extracted p-values, paper claims, country mentions, and automated text summaries.

This information is then presented in an interactive dashboard that helps researchers critically evaluate textual relevance and importance through dynamic querying, filtering, and exploration of publication metadata, sub-topical categorization, scientific metrics (p-values / claims), and geographic affiliation (country mentions).

About the Dashboard

Most charts are interactive.

Start by choosing a driving question from the Research Question Dropdown. The document similarity score then filters the list down to relevant matches. From there, drill down by related topic, date range, p-value range, country of interest, or publication source.

Click on a record row in the datatable to see full metadata including full text and links, if available.

Have a question? Drop us a note. We'd love your feedback.

The web-app is built using D3.js and dc.js, two JavaScript libraries that in combination offer rapid data-driven visualization, aggregation, and interactivity.

What We Found

Here are a few easter eggs that we found within the dataset.

  1. You can use the p-value slider to select only documents that have p-values within a certain range. Simply click and drag in the p-value graph and only documents that mention p-values in that range will remain.
    1. We didn’t intentionally search for significant p-values.
    2. This means that even when you filter for p-values below 0.05, you’ll get researchers who report a claim as “p > 0.05.” We found this useful, since it’s an interesting way that researchers report negative results.
    3. You can see the claims associated with the p-values by clicking on the document itself in the center of the window.
  2. The rapid keyword / title / author filter is also an interesting place to look for things you care about. Surprisingly, this document set has a lot of ancillary information in it, including a whole set of papers on MRSA. You can see that using our unsupervised clustering (see below). If for some reason you care about MRSA, you can filter papers on that.
  3. Document view is useful when you want to dig into a document's text. Simply click on the document you want and you can see useful information including (i) claims in the document with their p-values and (ii) automatically generated keywords (which are also used to make our word cloud).
  4. Most biologists may not care about the Kaggle Questions, but they can help drive some investigations. Let’s say you care about interventions. Once you choose that question, the document list will update with matches to that question and the word cloud will reflect the new question. If you then adjust the document score so that only higher scoring documents are returned, then you can zoom in on research that might help answer your questions. Notice the p-value filter is still on.
  5. You can reset all publication filters at any time using the reset button on the right.
  6. If you care about a specific country, you can use the country selector to narrow the list down to documents that mention that country. For example, once Italy is selected, only papers mentioning Italy are shown. You can then use the keyword search to narrow down further to only those documents for which Italy is a keyword. As before, additional filters can be layered onto this search.
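
The filters in the list above stack: a document stays in the list only if it passes every active filter. The following is a minimal Python sketch of that behavior, not the dashboard's actual code (the dashboard itself is built with D3.js and dc.js). The `Paper` record, its field names, and the `apply_filters` function are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Paper:
    # Hypothetical record; the real dashboard's fields may differ.
    title: str
    score: float                                    # similarity to the chosen question
    p_values: list = field(default_factory=list)    # numbers extracted from the text
    countries: set = field(default_factory=set)     # countries mentioned in the text

def apply_filters(papers, min_score=0.0, p_range=None, country=None):
    """Keep papers that pass every active filter (logical AND)."""
    kept = []
    for paper in papers:
        if paper.score < min_score:
            continue
        if p_range is not None:
            low, high = p_range
            # A paper passes if ANY of its extracted p-values is in range.
            if not any(low <= p <= high for p in paper.p_values):
                continue
        if country is not None and country not in paper.countries:
            continue
        kept.append(paper)
    return kept

papers = [
    Paper("Trial A", 0.82, [0.03], {"Italy"}),
    Paper("Survey B", 0.40, [0.05], {"Italy", "China"}),
    Paper("Review C", 0.91, [], {"China"}),
]
selected = apply_filters(papers, min_score=0.5, p_range=(0.0, 0.05), country="Italy")
print([p.title for p in selected])
```

Passing `None` for a filter leaves it inactive, which mirrors resetting that control in the dashboard.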

About the Analysis

How it Works

There are many ways to perform the operations described below; we picked one. We are working on additional approaches using deep learning methods adapted to this document set.

Unsupervised Clustering

Documents were clustered into groups with similar topic contents by creating a document similarity graph and then using maximum modularity clustering to find “document communities” within the graph.
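
To make the idea concrete, here is a small Python sketch of modularity-based community detection on a toy document similarity graph. It uses a naive greedy merge, which is only one heuristic for maximizing modularity and is an assumption on our part; the text above does not say which optimizer was used. The document IDs and edge weights are invented.

```python
from itertools import combinations

def modularity(edges, communities):
    """Newman modularity Q of a partition of an undirected weighted graph."""
    total = sum(w for _, _, w in edges)            # total edge weight m
    label = {n: i for i, c in enumerate(communities) for n in c}
    inside = [0.0] * len(communities)              # edge weight inside each community
    degree = [0.0] * len(communities)              # summed node degree per community
    for u, v, w in edges:
        degree[label[u]] += w
        degree[label[v]] += w
        if label[u] == label[v]:
            inside[label[u]] += w
    return sum(inside[i] / total - (degree[i] / (2 * total)) ** 2
               for i in range(len(communities)))

def greedy_communities(edges):
    """Merge communities pairwise while a merge still increases modularity."""
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    communities = [{n} for n in nodes]
    best = modularity(edges, communities)
    while len(communities) > 1:
        candidates = []
        for a, b in combinations(range(len(communities)), 2):
            merged = [c for i, c in enumerate(communities) if i not in (a, b)]
            merged.append(communities[a] | communities[b])
            candidates.append((modularity(edges, merged), merged))
        top_q, top_partition = max(candidates, key=lambda item: item[0])
        if top_q <= best:
            break                                  # no merge improves Q any further
        best, communities = top_q, top_partition
    return communities, best

# Toy similarity graph: two tight groups joined by one weak edge (made-up weights).
edges = [
    ("d1", "d2", 1.0), ("d1", "d3", 1.0), ("d2", "d3", 1.0),
    ("d4", "d5", 1.0), ("d4", "d6", 1.0), ("d5", "d6", 1.0),
    ("d3", "d4", 0.2),
]
communities, q = greedy_communities(edges)
print(sorted(sorted(c) for c in communities), round(q, 3))
```

This brute-force version recomputes modularity for every candidate merge, so it is only suitable for tiny graphs; a corpus-scale graph would need an optimized implementation.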

Word Bags

Bags of words for each automatically generated document cluster were computed using a word assignment model that minimized mutual information between the bags of words (subject to some constraints).

Query Matching

The dimension of the vocabulary space was first reduced using a non-linear dimensionality reduction method. Specifically, we constructed a trimmed term-document matrix by removing common and non-key words. We then formed a term graph, in which an edge connects two words that are co-mentioned in a document, and used maximum modularity clustering to find an orthogonal topic basis. A semantic representation in topic space is then found by projecting relevant keywords from the Kaggle questions onto the orthogonal basis formed by the clusters. Matching is then performed using cosine similarity.
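
The projection and matching steps can be sketched as follows in Python. This is an illustration under simplifying assumptions, not the production code: each topic is a set of words, so the topics are disjoint and their indicator vectors are orthogonal, and projecting a list of keywords simply counts how many fall in each topic. The topic words, question keywords, and paper names are invented.

```python
import math

def project(words, topic_basis):
    """Represent a list of words as counts over the topic clusters."""
    return [sum(1 for w in words if w in cluster) for cluster in topic_basis]

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical topic basis: disjoint word clusters found on the term graph.
topic_basis = [
    {"vaccine", "antibody", "immunization"},
    {"transmission", "aerosol", "droplet"},
    {"ventilator", "icu", "oxygen"},
]

question = ["vaccine", "antibody", "transmission"]
documents = {
    "paper_1": ["vaccine", "immunization", "antibody", "droplet"],
    "paper_2": ["ventilator", "icu", "oxygen"],
}

q_vec = project(question, topic_basis)
scores = {name: cosine(q_vec, project(words, topic_basis))
          for name, words in documents.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Because the comparison happens in topic space, a paper can match a question without sharing its exact keywords, as long as both draw on the same topics.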

Country-Mentions

Entities were extracted using the built-in named entity recognition capability in spaCy, with the small English language model (en_core_web_sm). Only entities of type ‘GPE’ were used, and these outputs were cross-referenced against the country names included in geonamescache. GPU-enabled spaCy was used, giving an approximately 15x speedup over CPU-only processing.
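
A rough Python sketch of this step is shown below. The helper names `country_mentions` and `extract_countries` are ours, and details such as case-insensitive matching and de-duplication are assumptions rather than a description of the original pipeline. The spaCy and geonamescache imports are placed inside `extract_countries` so that the cross-referencing logic can be tried without either package installed.

```python
def country_mentions(entities, country_names):
    """Keep GPE entities whose text matches a known country name."""
    known = {name.lower(): name for name in country_names}
    found = []
    for text, label in entities:
        name = known.get(text.strip().lower())
        if label == "GPE" and name is not None and name not in found:
            found.append(name)
    return found

def extract_countries(text):
    """Run spaCy NER on text and cross-reference GPEs with geonamescache."""
    # Lazy imports: this function needs spaCy, en_core_web_sm, and geonamescache.
    import spacy
    import geonamescache

    spacy.prefer_gpu()                       # use a GPU if available, else the CPU
    nlp = spacy.load("en_core_web_sm")
    countries = geonamescache.GeonamesCache().get_countries()
    names = [info["name"] for info in countries.values()]
    doc = nlp(text)
    return country_mentions([(ent.text, ent.label_) for ent in doc.ents], names)

# Hand-written (text, label) pairs standing in for spaCy output.
entities = [("Italy", "GPE"), ("Milan", "GPE"), ("WHO", "ORG"), ("italy", "GPE")]
print(country_mentions(entities, ["Italy", "China"]))
```

In this sketch, cities such as Milan are tagged GPE but dropped because they are not country names, which is the purpose of the cross-reference.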

P-value / Claim Extraction

P-values and their surrounding claims were automatically extracted using a mixture of regular expression and MaxEnt methods.
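
As an illustration of the regular expression half only, the Python sketch below finds p-value mentions and returns the enclosing sentence as the claim. The pattern, the naive sentence splitter, and the `extract_claims` name are our own assumptions, not the patterns used in the project, and the MaxEnt component is not shown.

```python
import re

# Assumed pattern: "p", an optional "-value", a comparator, then a decimal number.
P_VALUE = re.compile(
    r"\bp\s*(?:-?\s*value)?\s*(<=|>=|<|>|=)\s*(\d*\.\d+)",
    re.IGNORECASE,
)

def extract_claims(text):
    """Return one record per p-value mention, with its sentence as the claim."""
    results = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for match in P_VALUE.finditer(sentence):
            results.append({
                "comparator": match.group(1),
                "value": float(match.group(2)),
                "claim": sentence.strip(),
            })
    return results

text = ("Treatment had no effect on recovery time (p > 0.05). "
        "Mortality was lower in the treated group (P = 0.003).")
claims = extract_claims(text)
for claim in claims:
    print(claim["comparator"], claim["value"], "|", claim["claim"])
```

Note that a record like this keeps the number separately from the comparator. If the dashboard filters on the number alone, that would be one plausible reason a "p > 0.05" claim still appears when filtering for values up to 0.05.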