How to analyze open-ended survey responses

Traditionally, analysis of free-form text data from surveys has required coding, where researchers read through the answers and manually label them with fixed or emergent categories. To ensure accuracy and consistency, each response needs to be coded by at least two researchers. If the survey is conducted in multiple languages, native speakers of each language need to do the coding, and they need to coordinate with each other to make sure they interpret answers consistently.

In short, coding is a time-consuming and tedious task with a strong dependence on the skills of individual researchers, making a high degree of quality and consistency difficult to maintain. This is why open-ended questions are often avoided in survey design.

The algorithmic approach to analyzing free-form text data is an attractive alternative because it is fast, automated, and consistent. The quality of the insights depends, however, on the approach used. Below we outline five text analytics approaches that can help make sense of open-ended survey responses.

1. Content analysis and word clouds

The simplest and most common way of analyzing text data in surveys is through content analysis. The idea is simple: split sentences into words and count the frequency of each word. The words and their frequencies are commonly visualized in the form of a word cloud, where the size of each word represents the frequency of its use. While useful for giving an overview of the aggregate use of words, this method is too simplistic to replace human coding. Individual words are simply unable to capture the meaning and nuances that a human researcher can build into a code.
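As a minimal illustration, the frequency counting that underlies a word cloud takes only a few lines of Python (the sample responses here are hypothetical):

```python
from collections import Counter
import re

# Hypothetical sample of free-form survey responses
responses = [
    "The support team was helpful and quick to respond",
    "Helpful staff, but the response time could be quicker",
    "Slow response and unhelpful documentation",
]

# Tokenize each response into lowercase words and count frequencies
counts = Counter(
    word
    for response in responses
    for word in re.findall(r"[a-z']+", response.lower())
)

# This frequency table is the input a word-cloud library would visualize
for word, freq in counts.most_common(5):
    print(f"{word}: {freq}")
```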

2. Augmentation through natural language processing

A few simple preprocessing steps can be used to augment the content analysis approach. Natural language processing (NLP) is a branch of artificial intelligence that helps computers interpret and process human language. Simple NLP operations for getting more insight out of free-form text data include:

- Removing stop words: common words such as “the” and “is” that carry little meaning on their own
- Lemmatization: reducing words to their base form, so that for example “respond”, “responds”, and “responded” count as one term
- Part-of-speech filtering: keeping only the word classes of interest, such as nouns and adjectives
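A sketch of how such preprocessing might look, using the spaCy library (one of several NLP libraries that would do; it assumes the small English model has been installed):

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

response = "The agents were responding quickly to all of our questions"

# Drop stop words and non-alphabetic tokens, and reduce words to their
# lemmas, so that "responding"/"responded" count as the same term
terms = [
    token.lemma_.lower()
    for token in nlp(response)
    if not token.is_stop and token.is_alpha
]
print(terms)  # e.g. ['agent', 'respond', 'quickly', 'question']
```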

While making the overall picture clearer, this approach still does not code individual responses, nor does it convey the context in which words are used.

3. Relational word clouds

A traditional word cloud is simply a visual representation of a list of words and their frequencies. Most of the meaning that words convey, however, comes through their use in combination with other words. Traditional word clouds therefore leave us to speculate about how and why words were used. To address this, relational word clouds take into account not only words’ frequency of use, but also their contexts. By measuring how often different words are used together in the survey responses, we can construct a network where words are nodes and co-occurrence is represented by links. The result is a word cloud where not only the size of the words matters, but their positions as well. Two words that are located close to each other in the network are often used together in the responses.

There are several ways of measuring the closeness of words:

- Raw co-occurrence counts: the number of responses in which both words appear
- Pointwise mutual information (PMI): how much more often two words co-occur than would be expected by chance
- Similarity of word embeddings, where words used in similar contexts receive similar vectors

The first two measures are sketched in the example below.
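A minimal sketch of co-occurrence counting and PMI, over a handful of hypothetical tokenized responses:

```python
from collections import Counter
from itertools import combinations
from math import log

# Hypothetical tokenized responses (after preprocessing as above)
responses = [
    ["tax", "high", "service", "good"],
    ["tax", "high", "housing", "expensive"],
    ["service", "good", "staff", "friendly"],
]

word_counts = Counter()
pair_counts = Counter()
for tokens in responses:
    unique = set(tokens)
    word_counts.update(unique)
    # Count each unordered pair of words that co-occur in a response
    pair_counts.update(combinations(sorted(unique), 2))

n = len(responses)
# PMI: how much more often two words co-occur than expected if they
# appeared independently of each other
pmi = {
    (a, b): log((count / n) / ((word_counts[a] / n) * (word_counts[b] / n)))
    for (a, b), count in pair_counts.items()
}

# Pairs with high PMI would sit close together in the network layout
for pair, score in sorted(pmi.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(score, 2))
```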

The relational word cloud is a visual way of identifying topics and, by extension, concepts in the responses. There are other approaches too, such as topic modelling, a technique for discovering the abstract topics that occur in a collection of documents. The benefit of the visual approach, and the reason we favor it if we have to choose, is that it makes it fast and easy for the analyst to iterate. When first generated, a relational word cloud commonly contains themes that are not relevant, and sometimes these themes drown out more relevant ones. To address this, the analyst can remove responses containing the irrelevant themes until a clear structure of topics and themes emerges.
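For comparison, here is a minimal topic modelling sketch using scikit-learn’s LDA implementation (the responses are made up, and a real survey would have far more of them):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical responses; real surveys would have hundreds or thousands
responses = [
    "taxes are too high and housing is expensive",
    "beautiful nature and clean air",
    "high taxes but good public services",
    "forests lakes and unspoiled nature",
]

# Bag-of-words representation of the responses
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(responses)

# Fit a two-topic model; the number of topics is a tuning choice
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

# Show the top words per discovered topic
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-4:][::-1]]
    print(f"topic {i}: {top}")
```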

4. Tagging of responses

Once the concepts have been identified using a relational word cloud or topic modelling, the analyst can tag the responses related to each concept. There are different ways of carrying out the scoring:

- Counting how many of a concept’s key terms occur in a response
- Using the topic probabilities produced by a topic model
- Measuring the semantic similarity between a response and a description of the concept

Rather than a binary “yes” or “no”, the scoring provides a scale for the analyst to base tagging decisions on. Usually, responses above a certain score get the tag in question, but the appropriate cutoff needs to be determined on a case-by-case basis.
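As an illustration of the first scoring approach, this sketch tags responses whose share of concept terms exceeds a cutoff (both the term list and the cutoff are hypothetical and would be tuned per case):

```python
# Hypothetical concept identified in the word cloud, with its key terms
concept_terms = {"tax", "taxes", "expensive", "cost", "price"}
cutoff = 0.2  # case-by-case threshold, as discussed above

responses = [
    "taxes are too high and everything is expensive",
    "beautiful nature and friendly people",
]

for response in responses:
    words = response.lower().split()
    # Score = share of words in the response that belong to the concept
    score = sum(w in concept_terms for w in words) / len(words)
    tagged = score >= cutoff
    print(f"{score:.2f} tagged={tagged}: {response}")
```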

Once the responses are tagged, the free-form text data can be treated as structured. This means traditional methods can be applied to analyze and visualize it. It may, for example, be relevant to cross tabulate the tags against other structured data collected through the survey, such as background variables.
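With pandas, for instance, a tag can be cross tabulated against a background variable in a single call (the data below is made up):

```python
import pandas as pd

# Hypothetical tagged survey data: one row per response
df = pd.DataFrame({
    "age_group": ["18-29", "30-44", "18-29", "45-60", "30-44"],
    "mentions_taxes": [True, False, True, True, False],
})

# Cross tabulate the tag against a background variable,
# normalized so each row shows shares within that age group
table = pd.crosstab(df["age_group"], df["mentions_taxes"], normalize="index")
print(table)
```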

5. Further text enrichment using NLP

To further enrich the data, the analyst can apply additional NLP steps. Popular methods include:

- Sentiment analysis, which scores how positive or negative each response is
- Named entity recognition, which identifies the people, places, and organizations mentioned
- Emotion detection, which classifies responses by the emotions they express
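As one example, a sentiment score per response using NLTK’s VADER analyzer (one of many available sentiment tools; the responses are invented):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# VADER's lexicon must be downloaded once
nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()
responses = [
    "Sweden has handled the situation admirably",
    "A complete loss of control, very worrying",
]

# The compound score ranges from -1 (most negative) to +1 (most positive)
for response in responses:
    score = sia.polarity_scores(response)["compound"]
    print(f"{score:+.2f}  {response}")
```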

CASE

HOW THE SWEDISH INSTITUTE ANALYZED THE IMAGE OF SWEDEN IN SIX COUNTRIES IN THE WAKE OF THE MIGRANT CRISIS

The Swedish Institute wanted to understand the view in six countries of Sweden’s handling of the European migrant crisis of 2015-2016. The analysis was carried out by our value-added partner, Kairos Future, using Dcipher Analytics.

The Swedish Institute and Kairos Future opted for a survey with open-ended questions as a way to identify themes that were not already known, without influencing the respondents. Questions asked respondents to describe their view of Sweden and whether that view had changed in the last year, before priming them by mentioning the migrant crisis.

The survey was conducted in six different countries, none of them sharing the same language. Rather than translating responses into English, which would have risked removing important nuances in the process, the analysis was carried out in six different languages.

The natural language processing techniques outlined in this post were used to identify a total of six broad themes about Sweden’s handling of the migrant crisis. The themes ranged from very negative (“loss of control due to naivety”) to very positive (“a humanitarian role model”). Within each theme, the analysis revealed a number of more specific topics and nuances. Once the responses had been tagged with these topics and themes, the different languages were no longer an issue in the analysis (although the subsequent qualitative analysis still required researchers who mastered each language).

Researchers drilled down into each of the six themes by reading individual responses representing each theme. This provided a rich, qualitative understanding of each theme.

The analysis proceeded by cross tabulating the themes against various other variables, such as place of residence, age, and income range. The result was insights into the who, to complement the what, of the discovered themes. A country-level analysis showed how the image varied across the countries surveyed.

Overall, the study informed the Swedish Institute’s decisions on how and where to act in response to the image change that the country had undergone. It would not have been feasible without the free-form text analytics toolbox of Dcipher Analytics.

If you want to try out the methods described in this post yourself, sign up for a free trial. In-app video tutorials will guide you through the process step by step.

Book a demo