Mining social media for insights without programming: a 6-step guide
Social media contain open and public discussions about every conceivable topic. These discussions can provide invaluable insights into the views and narratives among consumers, influencers, and businesses. But the information is unstructured, in the form of text and images, and spread out across a large number of social media platforms. Making sense of it has typically required programming and data science skills: the data needs to be collected, preprocessed, structured, and analyzed. This post presents a new approach, one which does not require any programming or expert text analytics skills.
What value can we get from social media? In relation to any given topic, analyzing posts from social media can generate:
- An unbiased overview of discussions and narratives
- Detailed insights into specific discussions
- Mapping of trends and spotting of emerging phenomena
- Identification of influencers and enthusiasts
- An understanding of hot content attracting online interest
In this post, we take a look at how to extract such rich insights by mining social media, using simple drag-and-drops in our text analytics tool Dcipher Analytics. In this particular case, we look into the online discussion around sustainability. Though Dcipher Analytics provides access to tens of thousands of sources – including Twitter, YouTube, and worldwide forums and blogs – in this post we focus on public posts on Facebook.
1. Importing data from social media
In Dcipher Analytics, importing data from social media is easy. Simply click the Import data button and follow the instructions. We use "sustainable" and "sustainability" as keywords and download the 10,000 posts with the highest engagement scores from the most recent month. This is what it looks like:
In this video, we cut out the three minutes it took to import the 10,000 posts.
2. Getting familiarized with the data
It's usually a good idea to get familiar with the dataset before starting the analysis. In Dcipher Analytics, we do this by dragging and dropping individual fields or the entire dataset to other views to get different perspectives on the data.
By displaying it in the Table View, we get an Excel-like overview of the different columns in the dataset. But while traditional spreadsheets require flat data, in Dcipher we can easily navigate between the different levels of this nested dataset.
The Bubble View helps us aggregate and display the data in a field. In this case, we look at which cities the posts originate from.
We use the Bar Chart View to aggregate and view hashtags, which gives a quick snapshot of the dominant topics in the posts.
3. Cleaning and preprocessing the texts
Social media is a notoriously noisy data source, so before moving on to the analysis we need to clean up the data. Follow the steps in Dcipher Analytics' preprocessing wizard to:
- Remove outliers. The longest post in this dataset contains almost 6000 words. It's often a good idea to filter out such long texts so that they don't dominate the analysis and slow down subsequent operations.
- Remove duplicates. It's common that the same texts get posted multiple times. Keeping duplicates does not serve any purpose in this analysis, so we opt to remove them.
- Filter by language. Even though we asked Facebook for posts in English, typically a few posts in other languages slip through. Using Dcipher Analytics' built-in language detection function, we can eliminate posts written in unwanted languages.
- Text cleaning. When looking for topics in the data, elements such as URLs and @ tags are simply noise. The text cleaning step helps us remove these and other unwanted entities from the texts.
- Extract relevant content. Though all the posts contain at least one of our keywords, sustainability might be discussed in a small part of a long post. So to get the content that is relevant for our analysis, we use the option to extract the parts of the texts that contain "sustainable" or "sustainability".
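For readers who want to see what these steps amount to under the hood, here is a minimal Python sketch. It is not Dcipher Analytics' implementation: the function names, the regular expressions, and the 1,000-word cutoff are all assumptions made for illustration, and language filtering is left out because it needs a language detection library.

```python
import re

# Assumed patterns and keywords for illustration only.
KEYWORDS = ("sustainable", "sustainability")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def clean_text(text):
    """Strip URLs and @ tags, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())


def extract_relevant(text, keywords=KEYWORDS):
    """Keep only the sentences that mention at least one keyword."""
    sentences = SENTENCE_SPLIT_RE.split(text)
    kept = [s for s in sentences if any(k in s.lower() for k in keywords)]
    return " ".join(kept)


def preprocess(posts, max_words=1000):
    """Drop overly long posts and duplicates, clean, and extract relevant parts."""
    seen = set()
    result = []
    for post in posts:
        if len(post.split()) > max_words:  # remove outliers (hypothetical cutoff)
            continue
        cleaned = clean_text(post)
        key = cleaned.lower()
        if key in seen:  # remove duplicates (case-insensitive, after cleaning)
            continue
        seen.add(key)
        relevant = extract_relevant(cleaned)
        if relevant:
            result.append(relevant)
    return result
```

Calling preprocess on a list of raw post texts returns the shortened, de-duplicated texts. The sentence splitter is deliberately naive and will stumble on abbreviations.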
This is what the process looks like in Dcipher:
4. Mapping and exploring topics
Now that the data has been cleaned, it's time to start mining the texts for information. The first thing we want to know is what the sustainability discussions on Facebook are about.
Dcipher offers several techniques for mapping topics and themes in the data: topic modelling, visual word networks, and document landscaping. In this example we use a combination of the latter two.
To split the cleaned texts into words and phrases, we use the tokenization operation. Lemmatization transforms words into their base form ("plants" -> "plant", etc.). Phrase detection looks for sequences of words that form phrases, which tend to give a better idea of the context than individual words. Part-of-speech tagging is useful later on for reducing noise by keeping only the most relevant parts of speech, such as nouns.
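As a rough illustration of tokenization and phrase detection, the sketch below splits texts into lowercase word tokens and merges adjacent word pairs that recur across the dataset. It is a simplification under stated assumptions: the function names and the min_count threshold are made up for this example, phrase detectors in practice usually rely on a statistical score rather than raw counts, and lemmatization and part-of-speech tagging are omitted because they require an NLP library.

```python
import re
from collections import Counter

# Lowercase words, optionally with an apostrophe part such as "don't".
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Split a text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


def detect_phrases(token_lists, min_count=2):
    """Return adjacent word pairs seen at least min_count times overall."""
    pair_counts = Counter()
    for tokens in token_lists:
        pair_counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in pair_counts.items() if n >= min_count}


def merge_phrases(tokens, phrases):
    """Join detected pairs into single tokens, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

With two posts that both contain "climate change", the pair is detected as a phrase and replaced by the single token climate_change.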
Once the texts have been split into tokens, we can use the Bubble View to view their frequencies and how they co-occur in the same posts. "Food", "farming", and "agriculture", for example, are strongly connected. Selecting and dragging words from this cluster to the list of posts displayed in the Table View finds the posts in which they appear the most.
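The idea behind the word network can be sketched in a few lines of Python: count in how many posts each word appears, count how often two words appear in the same post, and rank posts by how many of the selected words they contain. The function names are hypothetical and the tool may weight connections differently; this is only meant to make the concept concrete.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(token_lists):
    """Count posts per word and posts per word pair (each post counted once)."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in token_lists:
        unique = sorted(set(tokens))  # sorted so each pair has one canonical order
        word_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return word_counts, pair_counts


def rank_posts(token_lists, selected):
    """Return indices of posts containing selected words, most matches first."""
    selected = set(selected)
    scores = [
        (sum(token in selected for token in tokens), index)
        for index, tokens in enumerate(token_lists)
    ]
    ranked = sorted(scores, key=lambda item: (-item[0], item[1]))
    return [index for score, index in ranked if score > 0]
```

Pairs with high counts relative to the individual word counts are the strongly connected ones, such as "food" and "farming" in this dataset.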
This yields 855 posts, which is too many to read through manually. So we click the column header to select all the posts and drag them to the Document landscaping drop zone in the Scatter Plot View. After about half a minute of processing (cut out from the video below), we can see the texts as dots in a landscape, where similar posts are clustered together. By hovering over the dots we can read the texts that correspond to them and interpret the meaning of the different clusters.
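What "similar" means depends on the method. This post does not describe how document landscaping measures similarity, so the sketch below only shows one common baseline, cosine similarity over word counts, as an assumption for illustration. A landscape would additionally need a step that projects these similarities onto two dimensions, which is left out here.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, based on raw word counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Two posts that use the same words in the same proportions score close to 1, and posts with no words in common score 0.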
The whole process in Dcipher:
5. Finding influencers and enthusiasts
To find out who is most active and influential in relation to the topic, we use the Authors field, which contains information about the author of each post. By dragging it to the Group by field drop zone in the Table View, the number of posts published by each author is counted. We use the number of likes as an engagement metric to understand not just who is active, but also who is influential.
Sorting the authors by number of published posts gives the most prolific authors – let us call them the enthusiasts, those who are writing a lot about sustainability. Sorting by the engagement provides the most influential public accounts.
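Conceptually, this is a group-by followed by two different sorts. A minimal Python sketch, assuming each post is a dictionary with hypothetical "author" and "likes" keys (the actual field layout in the dataset may differ):

```python
from collections import defaultdict


def author_stats(posts):
    """Count posts and sum likes per author."""
    stats = defaultdict(lambda: {"posts": 0, "likes": 0})
    for post in posts:
        entry = stats[post["author"]]
        entry["posts"] += 1
        entry["likes"] += post["likes"]
    return dict(stats)


def top_authors(stats, key, n=10):
    """Return the n authors with the highest value for key ('posts' or 'likes')."""
    return sorted(stats, key=lambda author: stats[author][key], reverse=True)[:n]
```

Sorting by "posts" surfaces the enthusiasts, and sorting by "likes" surfaces the influential accounts.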
The Bubble View can be used to present this in a visual form. To quickly find out what an author has been writing about, we drag the author (represented by a bubble) to the list of posts in the Table View. This gives the author's posts.
In Dcipher Analytics:
6. Spotting hot content
People are posting links to content they find interesting, both on Facebook and in external sources. Mapping these links can give insights into what content people are talking about, linking to, and sharing. Information about links is available in the 'attachment' field.
Following the same procedure as with authors above, we quickly get a list of the top links to sustainability-related content. The 'title' and 'description' fields provide information that makes it fast to spot interesting content. This is a great way of picking up the latest things that have caught the interest of the sustainability enthusiasts and influencers, like an article about indigenous technologies for climate-resilient design and another about hemp plastic.
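In code, ranking links comes down to counting how often each link appears. The sketch below assumes that 'attachment' is a nested dictionary with a hypothetical "url" key next to 'title'; the real structure of the field may differ.

```python
from collections import Counter


def top_links(posts, n=10):
    """Return (url, title, share_count) for the n most frequently posted links."""
    counts = Counter()
    titles = {}
    for post in posts:
        attachment = post.get("attachment")
        if not attachment or not attachment.get("url"):
            continue  # skip posts without a link
        url = attachment["url"]
        counts[url] += 1
        titles.setdefault(url, attachment.get("title", ""))
    return [(url, titles[url], shares) for url, shares in counts.most_common(n)]
```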
To access our comprehensive social media archive and try out social media mining in Dcipher Analytics, sign up for a free trial. To better understand our social media mining solutions, read more here or check out our other blog posts. If you have questions, want to discuss your use case, or get a guided tour, contact us.