Five steps to mining news articles for insights without programming
News media is a critical source of information for understanding how the contextual environment is changing. Yet most organizations are very traditional in how they extract insights and business value from news media. If they have a systematic approach at all, it tends to be centered around keyword-based monitoring. To understand the shortcomings of this approach, and the benefits of using new-generation AI tools for monitoring and making sense of news and other unstructured data, read our blog post on how AI powers the next generation information search.
In this article, we offer a step-by-step guide to the news media mining approach in Dcipher Analytics. It does not require any programming or advanced analytics skills. Instead of reading articles one by one, the new approach allows you to visually find patterns in large numbers of articles and drill down into article clusters of interest. This makes it easier to make unexpected discoveries and to find information that you would not otherwise have thought of looking for.
In this post, we look at the topic of education. We run the analysis on 196,000 articles in U.S. local news media that contain keywords related to education. The aim is to explore themes and topics in these articles to get an overview of the news reporting on education.
1. Getting an overview of the data
We are dealing with a flat dataset, so it makes sense to view it as a table, which can be done by dragging and dropping the dataset to Dcipher Analytics’ Document Summary workbench. This allows us to browse the data and identify columns of interest.
We drag-and-drop the Geography column to the Bubble workbench to get an overview of where the publishers of the news articles in the dataset are located. Doing so shows that news media in the states of Connecticut, California, Texas, and New Jersey are the prime sources of news about education, but that relevant articles are published across the country.
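For readers curious about what this summary amounts to, the bubble view is essentially a count of articles per column value. The following is a minimal sketch of that idea, not Dcipher Analytics' implementation; the sample rows, the field names, and the `count_by_column` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical sample rows; the real dataset holds about 196,000 articles.
articles = [
    {"headline": "School board approves budget", "geography": "Connecticut"},
    {"headline": "Teachers rally at capitol", "geography": "California"},
    {"headline": "District expands scholarships", "geography": "Connecticut"},
    {"headline": "New charter school opens", "geography": "Texas"},
]


def count_by_column(rows, column):
    """Count how many rows share each value of the given column."""
    return Counter(row[column] for row in rows)


geography_counts = count_by_column(articles, "geography")

# Each (value, count) pair corresponds to one bubble; the count sets its size.
for state, n in geography_counts.most_common():
    print(f"{state}: {n}")
```

The same counting applies to any categorical column, which is why the same workbench can be reused for other columns in the dataset.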
2. Identifying authors and their content
The Authors column contains the name of the journalist who wrote each piece. Displaying this information in the bubble view gives a quick overview of the writers who are most prolific in the area of education in U.S. media. Since most articles lack information about the author, the biggest bubble is empty. We remove it by selecting it and hitting backspace, which leaves more space for the other bubbles. This reveals the most active journalists covering education, including Collin Binkley, Al Sullivan May, and Jeff Amy. To display the articles of the latter, we drag-and-drop the bubble representing him to the Document Summary workbench. It appears that Mr. Amy covers a range of educational topics in Mississippi.
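In code terms, this step amounts to dropping blank author values before counting and then selecting the rows for one author. The sketch below illustrates that under assumed data; the placeholder author names, the field names, and both helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical rows with placeholder author names; many rows have no author.
articles = [
    {"headline": "Lawmakers debate school funding", "author": "Reporter A"},
    {"headline": "District names new superintendent", "author": ""},
    {"headline": "College enrollment rises", "author": "Reporter A"},
    {"headline": "Bus routes change this fall", "author": ""},
    {"headline": "Teacher pay bill advances", "author": "Reporter B"},
]


def author_counts(rows):
    """Count articles per author, skipping rows where the author is blank."""
    return Counter(row["author"] for row in rows if row["author"].strip())


def articles_by(rows, author):
    """Return the headlines written by one author."""
    return [row["headline"] for row in rows if row["author"] == author]


counts = author_counts(articles)
print(counts.most_common())
print(articles_by(articles, "Reporter A"))
```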
3. Drilling down into topics of interest
Dragging and dropping the Headline column, which we have identified as an interesting part of the data, to the ‘Tokens as bubbles’ drop zone in the Document Summary workbench strips the headlines of irrelevant words and symbols (“text cleaning”), splits them into individual words (“tokenization”), converts the words to their base form (“lemmatization”), and displays the resulting tokens as bubbles. This gives a quick overview of terms that are frequently used in the reporting.
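To make the three processing steps concrete, here is a rough sketch of cleaning, tokenization, and a very crude form of lemmatization, followed by a token count. It is an illustration only: the stop-word list is tiny, `naive_lemma` merely strips a plural "s" and is no substitute for a real lemmatizer, and the headlines are made up.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real tools use much larger ones.
STOP_WORDS = {"a", "an", "the", "of", "for", "to", "in", "on", "and", "at"}


def naive_lemma(token):
    """Crude stand-in for lemmatization: strip a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def headline_tokens(headline):
    """Clean, tokenize, and normalize one headline."""
    # Lowercasing and keeping only letter runs handles cleaning and tokenization.
    words = re.findall(r"[a-z]+", headline.lower())
    return [naive_lemma(w) for w in words if w not in STOP_WORDS]


# Hypothetical headlines.
headlines = [
    "Teachers strike over school budgets",
    "State budget cuts hit schools",
    "Scholarships awarded to local students",
]

token_counts = Counter(t for h in headlines for t in headline_tokens(h))
print(token_counts.most_common(5))
```

Without the normalization step, "school" and "schools" would be counted as separate terms and show up as two smaller bubbles instead of one.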
To take this overview to the next level, we use the Display as network drop zone in the Bubble workbench to cluster words based on the articles they were used in. The result is a topic map, where words that have tended to be used in the same headlines are clustered together. This reveals a number of topics, including those related to state budgets, school shootings, scholarships and rewards, and minority students.
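The idea behind such a topic map can be sketched in two parts: count how often pairs of words appear in the same headline, then group words that are linked by frequent pairs. The example below uses connected components as a simple stand-in for the clustering; the source does not say which algorithm Dcipher Analytics uses, and the tokenized headlines and the `min_count` threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical headlines, already cleaned and tokenized.
tokenized = [
    ["state", "budget", "school"],
    ["budget", "state", "cut"],
    ["scholarship", "student", "award"],
    ["student", "scholarship", "winner"],
]


def cooccurrence(docs):
    """Count how many documents each unordered word pair appears in."""
    counts = Counter()
    for tokens in docs:
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[(a, b)] += 1
    return counts


def word_clusters(edge_counts, min_count=2):
    """Group words connected by pairs seen at least min_count times."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), n in edge_counts.items():
        if n >= min_count:
            parent[find(a)] = find(b)

    groups = {}
    for word in list(parent):
        groups.setdefault(find(word), set()).add(word)
    return sorted(sorted(g) for g in groups.values())


edges = cooccurrence(tokenized)
print(word_clusters(edges))
```

Raising `min_count` yields fewer, tighter clusters; lowering it merges loosely related words into larger topics.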
To get a more focused dataset, we select a set of relevant keywords in the topic landscape and single out all articles containing these words, while eliminating the rest from the analysis.
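Expressed as code, this relevance filter keeps an article if its headline contains at least one of the selected keywords. The following is a minimal sketch under that assumption; the sample rows, the keyword choice, and the `filter_by_keywords` helper are hypothetical, and matching is on whole lowercase words only.

```python
import re

# Hypothetical sample rows.
articles = [
    {"headline": "State budget cuts hit schools"},
    {"headline": "Local bakery wins award"},
    {"headline": "Scholarship fund reaches new students"},
    {"headline": "Road closures planned downtown"},
]


def filter_by_keywords(rows, keywords, column="headline"):
    """Keep rows whose text in `column` contains any keyword as a whole word."""
    wanted = {k.lower() for k in keywords}
    kept = []
    for row in rows:
        words = set(re.findall(r"[a-z]+", row[column].lower()))
        if words & wanted:
            kept.append(row)
    return kept


relevant = filter_by_keywords(articles, ["budget", "scholarship", "students"])
print([row["headline"] for row in relevant])
```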
4. Creating a document landscape
After the relevance filtering described above, we are left with 45,289 articles. To explore their content, we drag-and-drop the Headline column to the Document landscaping drop zone in the Scatter plot workbench. The result is a landscape where documents have self-organized based on their content; articles about the same or similar things are found in the same part of the landscape.
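A landscape like this rests on a measure of how similar two documents are; the layout then places similar documents close together. The sketch below shows only the similarity part, using word-count vectors and cosine similarity. It is a simplified assumption rather than the method used by Dcipher Analytics: it omits the projection to two dimensions, and because it relies on shared words, relating differently worded headlines would need richer representations such as text embeddings.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Represent a text as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(index, texts):
    """Return the index of the text most similar to texts[index]."""
    vectors = [vectorize(t) for t in texts]
    scores = [
        (cosine(vectors[index], v), i)
        for i, v in enumerate(vectors)
        if i != index
    ]
    return max(scores)[1]


# Hypothetical headlines.
headlines = [
    "School board election draws record turnout",
    "Candidates file for school board election",
    "Student debt relief plan announced",
]

print(most_similar(0, headlines))
```

In a full landscape, the two school board headlines would end up near each other, while the student debt headline would sit in a different region.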
We can explore the landscape by zooming into parts of the landscape and hovering over a dot to read the headline represented by it. We are particularly interested in the dense areas of the landscape.
Among them we find one cluster about school board elections, a large “continent” about student debt, and an “island” containing reporting about the education secretary. Note that the articles in a cluster do not necessarily use the same keywords. Dcipher Analytics uses AI-powered text analytics techniques to understand that, for example, “education secretary”, “education commissioner”, and “Betsy DeVos” are related.
5. Further research and analysis
To quantify and further analyze the different themes in the landscape, select documents and tag them. To gain further qualitative insights, drag-and-drop documents to the Document Summary workbench, where they can be read in depth.
In our case, a number of topics of interest were discovered, including:
- A growing number of strikes as teachers find their working conditions and stagnating salaries increasingly intolerable.
- An expected shortage of 200,000 teachers in the U.S. by 2025.
- An effort to integrate various kinds of educational technologies into the classroom.
- Parenting values that are shifting toward a stronger emphasis on “soft skills” such as grit and resilience.
- Maturing business models for remote education.