Dr. Anthony Fauci’s emails: what can we learn from a deep-dive with Natural Language Understanding?
In recent months, two news organizations, The Washington Post and BuzzFeed News, obtained thousands of Anthony Fauci’s emails. Fauci is the director of the U.S. National Institute of Allergy and Infectious Diseases. The emails, sent during the first half of 2020, were released in response to Freedom of Information Act requests. They attracted a lot of interest because, given his position, he was one of the most credible and responsible figures in the management of the coronavirus. The emails showed that many questionable claims about the coronavirus were circulating. To learn the context of these claims in detail, people wanted to read the emails, but there was a problem: they run to 3,234 pages! Inspecting all of that text without spending days and nights on it would be nearly impossible.
At this point, Dcipher Analytics comes in. It offers next-generation text analysis that can analyze thousands of text documents and draw insights from them without a programming language or complicated analytical tools, thanks to AI-based technology that uses Natural Language Processing (NLP). Using this text analysis platform, we identified the content of the emails in a few minutes, rather than the many hours or even days that reading them manually would take. We focus primarily on the emails sent to Dr. Fauci, to understand the main topics of people’s concerns and the matters on which they contacted him.
The output was fascinating and helpful in understanding the topics in the emails. There were discussions of the effects of Covid-19 on people, estimates of how to treat the virus, and findings regarding the effectiveness of vaccines. One controversial topic was mask usage, whose usefulness many questioned. How the virus spreads, and in what form, was also mentioned. Besides these, there were concerns about how the virus would affect minorities, and many more topics!
To find out more about the results and see how you can analyze emails with Dcipher in only a few minutes, read on!
Import the file
We begin our analysis by creating a new project on the platform. We choose a blank project and name it accordingly. After that, we need to import the data that we want to analyze. Dcipher Analytics can import data from different file types, such as PDF and JSON. This time we use a JSON file that contains Fauci’s emails and upload the dataset to the platform. After uploading, the dataset is visible in the Schema and Table View, where we can check the data in detail. Now we are ready for the next steps.
Prepare the data
Many raw files contain complex and unstructured data. It is therefore important to check the raw text and determine what improvements can be made to it for better analysis. For example, words split by spaces or hyphenation make it hard for the system to detect words and analyze their context. To get meaningful insight from the raw data, we need to prepare it by making these corrections. One way of doing this is to rewrite the unwanted formats, and Dcipher’s replace pattern operation is highly recommended for this purpose.
-
Alter Regular Expressions (Replace Pattern)
First, have a look at some of the texts. What kind of issues are there? For instance, you see things like “pan- demic.” This indicates that hyphenation has split words. You also see multiple spaces between words where there should be only one. Even though these issues might seem minor, they can seriously harm the analysis: the system can misinterpret or disregard many words because of them.
To fix the hyphenation, you can replace the regular expression pattern “(\w)- (\w)” with “\1\2”. After doing this, you can see that the hyphenated words are now correct. In other words, “pan- demic” has become “pandemic” in the new text field. Next, you can check whether there are multiple spaces where there should be a single space. In this case, we replace the pattern “ +” with “ ” to remove the unnecessary spaces.
The rationale is that the system can recognize these formatting problems as patterns. By defining regular expressions, we tell it what to change so that the words get their correct form. In the end, the words become detectable and meaningful, which results in better analysis and insights.
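For readers who want to try the same two replacements outside the platform, here is a minimal sketch using Python’s standard re module. The function name and the sample sentence are our own illustrations, not part of Dcipher; only the two patterns come from the steps above.

```python
import re

def fix_extraction_artifacts(text: str) -> str:
    """Apply the two replace-pattern steps described above (illustrative helper)."""
    # Rejoin words split by a hyphen followed by a space: "pan- demic" -> "pandemic".
    text = re.sub(r"(\w)- (\w)", r"\1\2", text)
    # Collapse every run of spaces into a single space.
    text = re.sub(r" +", " ", text)
    return text

sample = "The pan- demic   response was   discussed."
print(fix_extraction_artifacts(sample))
```

Note that the first pattern rejoins every “hyphen plus space” sequence, so a rare legitimate case such as “pre- and post-test” would be merged too. It is worth checking a sample of the results.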
-
Change date format
Your data can include dates, and they can be in any format. In our example, the date format of the emails contains days and hours, even seconds. For our analysis, we are interested only in months and years. With Dcipher Analytics, we can change the date format in just a few clicks.
We select the date field in the Schema and choose the change date format operation from the operations button. Then we select the desired date format and apply the operation. All of the dates are now in a consistent format, without much time spent.
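The same reduction to month and year can be sketched in Python with the standard datetime module. The input format below is an assumption made for illustration, since the actual timestamp format of the email dataset is not given here, and the function name is hypothetical.

```python
from datetime import datetime

def to_year_month(timestamp: str, source_format: str = "%Y-%m-%d %H:%M:%S") -> str:
    """Reduce a full timestamp to year and month.

    source_format is an assumed input format; adjust it to match the real data.
    """
    parsed = datetime.strptime(timestamp, source_format)
    return parsed.strftime("%Y-%m")

print(to_year_month("2020-03-14 09:26:53"))  # prints 2020-03
```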
-
Clean by preprocessing wizard
Preprocessing is a good way to clean noise from the text. For this, we open the preprocessing wizard from the wizards button on the platform. First, we choose the data we want to clean, which is the text in this example. Next, we remove duplicates. There are not many, because the data comes from emails and duplicates are unlikely. After this, since the conversations in the emails are in English, we skip the language selection step. In Dcipher, you can skip any step of the preprocessing wizard that you think is unnecessary. Then, we clean the text by choosing to remove any URLs and XML tags from the emails.
The crucial point in this cleaning step is how to split the data. Dcipher offers four options for segmenting the data: by sentence, by paragraph, by custom pattern, and by smart segmentation. In this analysis, we use smart segmentation, which considers the contextual similarity of adjacent sentences and the references between sentences to split the text into coherent segments.
With data preparation complete, we are ready to analyze the data and extract real insights.
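To make the cleaning steps concrete, here is a rough Python sketch of URL removal, tag removal, and duplicate removal, plus the simplest segmentation option (by sentence). The patterns and function names are our own simplifications, not Dcipher’s implementation, and smart segmentation is not reproduced here.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")   # simplified URL matcher (assumption)
TAG_PATTERN = re.compile(r"<[^>]+>")        # simplified XML/HTML tag matcher (assumption)

def clean_emails(texts):
    """Strip URLs and tags, normalize whitespace, and drop exact duplicates."""
    seen = set()
    cleaned = []
    for text in texts:
        text = URL_PATTERN.sub("", text)
        text = TAG_PATTERN.sub("", text)
        text = re.sub(r"\s+", " ", text).strip()
        # Duplicates are detected on the cleaned text and only the first copy is kept.
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

def split_sentences(text):
    """Naive sentence split: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

emails = ["See <b>this</b> https://example.com now", "See this now", "Thanks!"]
print(clean_emails(emails))
print(split_sentences("Masks work. Do they? Yes!"))
```

The sentence splitter will stumble on abbreviations such as “Dr.”, which is one reason a context-aware segmentation is preferable on real email text.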
Split the responses (Tokenize & Tag)
Tokenization is used in Natural Language Processing (NLP) to divide text into smaller pieces called tokens; it is a more intelligent method than splitting on whitespace. This operation also includes additional features. Lemmatization converts words into their base form ("cars" -> "car", for example). Phrase detection looks for word sequences that form phrases, which give a better sense of context than individual words. Part-of-speech tagging helps reduce noise later on by letting us retain only the most significant parts of speech, such as nouns or adjectives.
To apply it, we select the dataset and click the Add Operation button above. After choosing the operation, we set the preprocessing and tagging options. These steps are also shown in the video above. As a result, our data has been split into tokens and tagged appropriately.
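The ideas above can be illustrated with a small, standard-library-only Python sketch. It is a toy: the lemma table is hand-written for this example, and phrase detection is reduced to counting adjacent word pairs. A real pipeline would rely on a trained NLP model, and none of these names come from Dcipher.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*")  # words, keeping inner hyphens/apostrophes

def tokenize(text):
    """Lowercased tokens without surrounding punctuation."""
    return [t.lower() for t in TOKEN_PATTERN.findall(text)]

# Hand-written toy lookup table (assumption); real lemmatizers use dictionaries or models.
TOY_LEMMAS = {"cars": "car", "masks": "mask", "vaccines": "vaccine"}

def lemmatize(tokens):
    return [TOY_LEMMAS.get(t, t) for t in tokens]

def frequent_bigrams(token_lists, min_count=2):
    """Very rough phrase detection: adjacent pairs seen at least min_count times."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))
    return [pair for pair, n in counts.items() if n >= min_count]

text = "Masks, vaccines and cars."
print(text.split())               # whitespace split keeps punctuation attached
print(lemmatize(tokenize(text)))  # cleaner tokens, reduced to base forms

docs = [tokenize("public health advice"), tokenize("new public health rules")]
print(frequent_bigrams(docs))
```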
Topic Detection
-
Choose the relevant Part-of-Speech (POS) tags
Filtering tokens by their tags is helpful before topic detection, because deselecting unrelated part-of-speech (POS) tags produces more accurate results. However, we keep the filter broad in this analysis because we aim to understand the topics generally mentioned in the incoming emails. Narrowing the tags too strictly can reduce the number of findings and lose results we need.
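In code, POS filtering amounts to keeping only the tokens whose tag is in an allowed set. The sketch below assumes the tokens are already tagged; the tags are written by hand using Universal POS tag names, and the “broad” set of kept tags is our own choice for illustration.

```python
# A broad filter, in the spirit of the analysis above (the exact set is an assumption).
KEEP_TAGS = {"NOUN", "PROPN", "ADJ", "VERB"}

def filter_by_pos(tagged_tokens, keep=KEEP_TAGS):
    """Keep tokens whose part-of-speech tag is in the allowed set."""
    return [token for token, tag in tagged_tokens if tag in keep]

# Hand-tagged example sentence: "the vaccine is highly effective"
tagged = [
    ("the", "DET"),
    ("vaccine", "NOUN"),
    ("is", "AUX"),
    ("highly", "ADV"),
    ("effective", "ADJ"),
]
print(filter_by_pos(tagged))                 # ['vaccine', 'effective']
print(filter_by_pos(tagged, keep={"NOUN"}))  # a stricter filter keeps less
```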
-
Detect Topics
The Detect Topics function examines the text input and detects "topics", that is, groups of words that frequently appear together in the text. To apply it, we select the Detect Topics operation, which offers several settings. In this analysis, we used hierarchical density-based clustering (HDBSCAN) as the clustering method, which detects more topics by following a different machine-learning rationale from the other available methods.
For more information about semantic topic modeling, see this Help Center article.
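To give a feel for what “density-based” means, here is a small pure-Python sketch of plain DBSCAN, a simpler, non-hierarchical relative of HDBSCAN. It is not HDBSCAN itself and not Dcipher’s implementation. The 2-D points are made-up stand-ins for text embeddings, and eps and min_pts are illustrative parameters.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Plain DBSCAN. Returns one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbors(j)
            if len(nearby) >= min_pts:   # core point: keep growing the cluster from it
                queue.extend(nearby)
    return labels

# Two dense groups and one isolated point (toy data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Dense regions become clusters (topics), while isolated points are labeled as noise instead of being forced into a topic. HDBSCAN builds on the same idea but handles clusters of varying density without a single fixed eps.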
-
Picking the Topics
Before moving to data visualization as the last step, it helps to select the relevant outputs from the identified topics for better understanding and more valuable insights. The AI-based classification system creates various topics, but not every one of them is relevant to the output we want. Therefore, before visualizing, we can narrow down the topic labels to keep only the applicable ones.
Data Visualization
As the last step, we use workbenches to show the output of the analysis. You can drag and drop the required output onto a drop zone in any workbench area to trigger an operation. Different workbenches give different insights. In this analysis, we created a scatter plot for document landscaping and a bump chart.
-
Scatter Plot (Document Landscaping)
When exploring a vast body of text, it is helpful to first get a sense of the many themes it contains. In Dcipher Analytics, this is possible through a technique known as document landscaping, shown as a scatter plot. It employs machine learning to convert the texts into vectors in a high-dimensional "semantic space", which is then projected onto a two-dimensional plane. As a result, a landscape of hills and valleys emerges, with peaks corresponding to areas with a high density of contextually related texts, regardless of the keywords used. This bottom-up approach is advantageous because it allows patterns to emerge from the data without the need for hypotheses or search criteria. It is a great way to get a grasp of the topics in the texts.
The document landscape shows the topics raised in the emails addressed to Dr. Fauci. From the hills and the heat map, we can see that:
-
The effects of the virus and predictions about it are the most discussed topics, including estimates about treatment.
-
Clinical vaccine development, virus transmission, and identification of Covid effects are the other relevant topics mentioned in the incoming emails.
-
There are also discussions about masks and their usefulness.
-
There are many speaking and meeting requests because of his position.
-
Many collaboration offers are made for managing the crisis caused by the virus and for identifying the virus. Many individuals and institutions contacted Dr. Fauci in this way.
-
The incoming emails show that he has public support: people express gratitude, appreciation, and trust.
-
News coverage on various TV channels and in other media is another topic. These emails aim to inform Dr. Fauci about news concerning himself and the virus.
-
Bump Chart
A Bump Chart shows changes over time, such as in the volume of subjects or tokens. Compared to line charts, bump charts have the advantage of displaying both volume and rank throughout each time period, making it simple to understand how relative positions shift over time.
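The data behind a bump chart is simply a volume and a rank per topic for each time period. Here is a hedged Python sketch of that computation. The records are made-up toy data rather than results from the Fauci emails, and the function name is our own.

```python
from collections import Counter, defaultdict

def rank_topics_by_period(records):
    """records: iterable of (period, topic) pairs, one per document.

    Returns {period: [(rank, topic, volume), ...]} with rank 1 = highest volume.
    Ties are broken alphabetically by topic name.
    """
    volumes = defaultdict(Counter)
    for period, topic in records:
        volumes[period][topic] += 1
    ranking = {}
    for period in sorted(volumes):
        ordered = sorted(volumes[period].items(), key=lambda kv: (-kv[1], kv[0]))
        ranking[period] = [(rank, topic, n) for rank, (topic, n) in enumerate(ordered, start=1)]
    return ranking

# Toy data for illustration only.
records = [
    ("2020-02", "masks"), ("2020-02", "vaccines"), ("2020-02", "masks"),
    ("2020-03", "vaccines"), ("2020-03", "vaccines"), ("2020-03", "masks"),
    ("2020-03", "vaccines"),
]
for period, rows in rank_topics_by_period(records).items():
    print(period, rows)
```

Plotting rank on the vertical axis and period on the horizontal axis, with line thickness or labels carrying the volume, gives the bump chart.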
The Bump Chart output supports the previous findings from the document landscape. Additionally, we can see how the topics compare in volume in specific date intervals.
-
The spread of the virus in aerosol form became a growing concern in the later months. There are discussions and worries about this.
-
The effects of the virus on minorities were also an important topic at the beginning. People were anxious about how the virus would affect minorities and their behavior in society.
-
Over this period, offers of help and support for managing the virus crisis increased, as did expressions of appreciation and gratitude toward him.
Remarkably, the incoming emails cover many interesting topics. Let us know which topic or concern you found unexpected or surprising! And so that you don’t miss our future posts, make sure to follow us on our LinkedIn page as well.
Get started!
To access our text analytics toolbox and try out analyzing free-form text responses in Dcipher Analytics, sign up for a free trial.