In this period of rapid response, there is an urgent need to better understand the biology and clinical implications of COVID-19. As researchers engage with new and historical texts, the ability to discern what is relevant and important is paramount to successful collaboration and innovation.
COVID Explorer is a web application that uses automated machine learning and interactive visualizations to help researchers find the most relevant and urgent documents related to key scientific research questions.
By leveraging machine learning on full texts, we can identify linguistic subtopics, generate text summaries, and rank textual alignment to a research question, so that researchers can build context beyond the traditional search-and-sort found in databases like Web of Science and Google Scholar.
Our process is driven by two questions:
To address these goals, we brought together a suite of machine learning techniques including deep learning and graph analytic methods that identify natural “topics” within the corpus based on linguistic patterns within documents’ title, abstract, and body text. We then enhanced this information by folding in automatically extracted p-values, paper claims, country mentions, and automated text summaries.
This information is then captured in an interactive dashboard that helps researchers critically evaluate textual relevance and importance through dynamic querying, filtering, and exploration of publication metadata, sub-topical categorization, scientific metrics (p-values / claims), and geographic affiliation (country mentions).
Most charts are interactive.
Start by choosing a driving question from the Research Question Dropdown. The document similarity score filters relevant matches. Drill down into a related topic, date range, P-value range, country of interest or publication source.
Click on a record row in the data table to see full metadata, including full text and links if available.
Have a question? Drop us a note. We'd love your feedback.
Here are a few easter eggs that we found within the dataset.
How it Works
Note that there are many ways to perform the operations enumerated below; we picked one. We are also working on alternatives that use deep learning methods adapted to this document set.
Documents were clustered into groups with similar topic contents by creating a document similarity graph and then using maximum modularity clustering to find “document communities” within the graph.
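The clustering step can be sketched as follows. This is a minimal illustration, not the production pipeline: the mini-corpus, the bag-of-words similarity measure, and the edge threshold are all hypothetical, and networkx's greedy modularity maximization stands in for whatever maximum-modularity implementation was actually used.

```python
import math
from collections import Counter

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical mini-corpus with two loose themes (vaccines vs. transmission).
docs = [
    "vaccine antibody immune response trial",
    "vaccine trial antibody efficacy",
    "aerosol transmission droplet airborne spread",
    "airborne droplet transmission indoor spread",
]

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

bags = [Counter(d.split()) for d in docs]

# Build the document-similarity graph: an edge joins any pair of documents
# whose similarity clears a (hypothetical) threshold.
G = nx.Graph()
G.add_nodes_from(range(len(docs)))
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        sim = cosine(bags[i], bags[j])
        if sim > 0.2:
            G.add_edge(i, j, weight=sim)

# Greedy modularity maximization recovers the "document communities".
communities = [set(c) for c in greedy_modularity_communities(G, weight="weight")]
```

On this toy corpus the two themes separate cleanly into two communities.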
Bags of words for each automatically generated document cluster were computed using a word assignment model that minimized mutual information between the bags of words (subject to some constraints).
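As a rough proxy for that word-assignment model, one can assign each vocabulary word to the cluster where its pointwise mutual information is highest, which makes the per-cluster bags maximally distinctive. The actual objective (minimizing mutual information between the bags, subject to constraints) is not reproduced here, and the per-cluster counts below are hypothetical.

```python
import math
from collections import Counter

# Hypothetical word counts per document cluster (cluster id -> Counter).
cluster_counts = {
    0: Counter({"vaccine": 8, "trial": 5, "patient": 3}),
    1: Counter({"aerosol": 7, "droplet": 6, "patient": 3}),
}

total = sum(sum(c.values()) for c in cluster_counts.values())
cluster_mass = {k: sum(c.values()) / total for k, c in cluster_counts.items()}
word_mass = Counter()
for c in cluster_counts.values():
    for w, n in c.items():
        word_mass[w] += n / total

def pmi(word: str, cluster: int) -> float:
    """Pointwise mutual information between a word and a cluster."""
    joint = cluster_counts[cluster][word] / total
    if joint == 0:
        return float("-inf")
    return math.log(joint / (word_mass[word] * cluster_mass[cluster]))

# Assign each word to the cluster where its PMI is highest, yielding one
# distinctive bag of words per cluster.
bags = {k: [] for k in cluster_counts}
for w in word_mass:
    best = max(cluster_counts, key=lambda k: pmi(w, k))
    bags[best].append(w)
```

Shared words like "patient" carry little information about either cluster (PMI near zero in both), while theme words land unambiguously in their own cluster's bag.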
The dimension of the vocabulary space was first reduced using a non-linear dimensionality reduction method. Specifically, we constructed a trimmed term-document matrix by removing common and non-key words. We then formed a term graph, whose edges connect words co-mentioned in a document, and used maximum modularity clustering to find an orthogonal topic basis. A semantic representation in topic space is then found by projecting relevant keywords from the Kaggle questions onto the orthogonal basis formed by the clusters, and matching is performed using cosine similarity.
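The projection-and-matching step can be illustrated with a toy topic basis. Here each topic is represented as a keyword set, projection simply counts how many of a text's keywords fall in each topic, and cosine similarity compares those topic-space vectors; the topic sets, question, and documents are all hypothetical stand-ins for what the clustering above would produce.

```python
import math

# Hypothetical orthogonal topic basis: each topic is a set of keywords
# found by maximum-modularity clustering of the term graph.
topics = [
    {"vaccine", "antibody", "efficacy", "trial"},
    {"aerosol", "droplet", "airborne", "transmission"},
    {"mortality", "comorbidity", "risk", "age"},
]

def project(words):
    """Represent a word list in topic space: one coordinate per topic,
    counting how many of the words fall inside that topic cluster."""
    return [sum(w in t for w in words) for t in topics]

def cosine(u, v):
    """Cosine similarity between two topic-space vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Keywords from a (hypothetical) research question vs. two documents.
question = ["transmission", "airborne", "aerosol"]
doc_a = ["droplet", "aerosol", "spread", "transmission"]
doc_b = ["vaccine", "trial", "efficacy"]

score_a = cosine(project(question), project(doc_a))
score_b = cosine(project(question), project(doc_b))
```

The transmission-themed document scores far higher against the transmission question than the vaccine-themed one, which is exactly the ranking signal surfaced as the document similarity score in the dashboard.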
Entities were extracted using the built-in named entity recognition capability in SpaCy, against the small English language model (en_core_web_sm). Only entities of type ‘GPE’ were used, and these outputs were cross-referenced against country names included in geonamescache. GPU-enabled SpaCy was used for an approximately 15x speedup over CPU-only processing.
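The cross-referencing step amounts to a set-membership filter. In this sketch, the list of GPE entity strings stands in for the output of running en_core_web_sm over a document, and the small country set stands in for geonamescache's country-name lookup; neither library is invoked here.

```python
# Stand-in for spaCy output: the GPE-typed entity strings that
# en_core_web_sm might return for a document's text.
gpe_entities = ["China", "Wuhan", "Italy", "New York", "Italy"]

# Tiny stand-in for the country names provided by geonamescache.
country_names = {"China", "Italy", "United States", "Germany"}

# Cross-reference: keep only GPEs that are actual country names,
# de-duplicated while preserving first-mention order.
countries = list(dict.fromkeys(e for e in gpe_entities if e in country_names))
```

Cities and regions ("Wuhan", "New York") are GPEs too, which is why the cross-reference against a country gazetteer is needed before tagging a paper with country mentions.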
P-Value / Claim Extraction
P-values and the surrounding claims were automatically extracted using a mixture of regular-expression and MaxEnt methods.
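The regular-expression side of this can be sketched as below. The pattern is a hedged illustration, not the production regex, and the MaxEnt model used for classifying the surrounding claim text is not shown.

```python
import re

# Illustrative pattern for p-value mentions such as "p < 0.001" or "P = 0.38".
# The production system's actual regex is not reproduced here.
P_VALUE = re.compile(
    r"[Pp]\s*(?P<op>[<>=])\s*(?P<value>0?\.\d+|\d+(?:\.\d+)?)"
)

def extract_p_values(text):
    """Return (operator, numeric value) pairs for each p-value mention."""
    return [(m.group("op"), float(m.group("value"))) for m in P_VALUE.finditer(text)]

sample = "Treatment reduced mortality (p < 0.001); age was not significant (P = 0.38)."
results = extract_p_values(sample)
```

Each extracted pair feeds the dashboard's p-value range filter; the sentence containing the match is what the claim-extraction model then classifies.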