One of the biggest challenges for nontechnical and business users in producing social data visualization is deciding which visual should be used to represent the data accurately. Maybe it’s not so much the accuracy as it is the clarity. But why do it at all? As humans, we tend to think visually (at least a good number of us do). Sometimes concepts or ideas can best be described with pictures rather than just words or verbal discussions. Discussing the differences between 10 different data points is interesting, but if we can convey that thought with a simple picture, why not do that? The idea is that there might be a huge difference between point 1 and point 6 but not so much between point 1 and point 3. In this case, how much easier would it be to just see those differences?
Visualizing the information we’ve gathered allows the consumer of that information to potentially see things in our results that might have otherwise gone unnoticed. Any representation of raw data can convey information, but without a visualization, we could miss out on trends, behavior patterns, or dependencies. Visualizations give us answers faster. Looking at a graph and identifying a trend can take but an instant. However, imagine how much time it would take to scan rows of numbers and pick out that same trend.
A data analytics project isn’t complete until the results we’ve collected are packaged or presented in a way that truly helps the receiver of those insights make decisions or take action based on our work. That means we, as analysts, not only need to know what the data means but also need to be able to represent it in a way that can best convey the information we inferred from it so that our users can derive conclusions, which in turn can drive business outcomes.
The options for data visualization seem to be growing almost by the day. New technologies and techniques can turn number-intensive reports into bright, colorful 3D interactive graphics. But we must be careful. In our zeal to create the prettiest of charts or sexiest rendering, we may lose the message we are trying to convey. Don’t get us wrong: Compelling graphics are a wonderful tool in expressing findings, but overuse (or overindulgence) may push the audience from the realm of understanding into the world of confusion. The choice of how to deliver results will depend on clients’ needs.
Visualization should help tell the story, not drown it out. In this chapter, we discuss some of the different types of visualizations to consider when presenting the results of an analysis. There isn’t a single “right” answer to the question of what kind of a graph provides the clearest insight to a user. There are some best practices when it comes to selecting color or limiting information on a chart to maximize the impact of a message, but it’s ultimately about the insights or additional information that was discovered. Many analysts opt for the pretty charts or the snappy presentation, but we must always keep in mind that it’s the results that count, often more than the presentation itself.
Common Visualizations
In his Harvard Business Review article titled “The Three Elements of Successful Data Visualizations,” Jim Stikeleather [2] outlines three areas of concentration that visual designers should consider when creating graphic visualizations. These considerations are:
■ The design should understand the audience.
■ It should set up a clear framework.
■ And, probably most important, it should tell a story.
What these considerations really boil down to is this: clarity. Effective visualizations should not confuse the audience but should present a story (or an insight) in a clear, simple-to-understand way that helps the audience understand the conclusions that were drawn from the data. Or better yet, these visualizations should enable them to understand the data in such a way that they can discover new insights or relationships on their own.
Let’s look at some of the more simple types of graphics first, many of which we’ve used throughout this text. The graphs we create should aim to simplify the data in a visually appealing way. The main challenge that many people have with graphs is choosing which chart type to use. We all want a visually appealing chart to present to a customer, but remember, if a chart is visually appealing, we must be careful that we don’t spend more time talking about how “cool” the chart looks as compared to the message it is trying to convey.
Let’s look at a few chart types.
Pie Charts
Pie charts are best used to illustrate the breakdown of a single dimension as it relates to the whole. Basically, when we want to look at the value of a specific dimension in relation to other values in that same dimension, we could use a pie chart to easily visualize it. Pie charts help us see, with a quick glance, which attribute in a series of data is dominant or how any individual attribute or set of attributes relates to the others. Consider the graph in Figure 13.1 that depicts the number of social media posts over a 24-hour period by five specific users.
In other words, it is best to use pie charts when we want to show differences in a specific group based on one variable. In the example in Figure 13.1, we collected the number of times these users used the word “cloud” in their social media postings. It is important to remember that pie charts should be used only with a category or dimension that combines to make up a whole. In this case, we collected 151 tweets, and clearly user 4 was the dominant communicator, accounting for 60% of them.
What makes a pie chart useful is the quick visual comparisons that can be done. Again, without the percentages in the graph, we could quickly see that user 4 is more verbal than all of the other users combined. Or we can see that together percentages for users 5 and 1 are close to the amount of conversation initiated by user 2.
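To make the idea of a “whole” concrete, the following minimal Python sketch converts raw counts into the percentages a pie chart would display. The per-user counts are hypothetical, invented only so that they sum to the 151 tweets mentioned above; they are not the data behind Figure 13.1, and the function name `shares` is our own.

```python
# Hypothetical per-user tweet counts, invented for illustration only.
# They were chosen to sum to the 151 tweets mentioned in the text.
counts = {"user 1": 10, "user 2": 25, "user 3": 10, "user 4": 91, "user 5": 15}


def shares(counts):
    """Return each category's percentage of the whole."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("a pie chart needs a positive total")
    return {name: 100.0 * value / total for name, value in counts.items()}


pct = shares(counts)

# List the slices from largest to smallest.
for name, value in sorted(pct.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.1f}%")

# A slice above 50% outweighs all of the other slices combined.
dominant = [name for name, value in pct.items() if value > 50.0]
print("dominant:", dominant)
```

Because the percentages always sum to 100, the same check would flag data that does not form a whole (for example, overlapping categories), which is a hint that a pie chart is the wrong choice.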
Bar Charts
Bar charts, like pie charts, are useful for comparing classes or groups of data. In bar charts, a group can have a single category of data, or it can be broken down further into multiple categories for greater depth of analysis. A bar chart is built in such a way that the heights of the different bars are proportional to the size of the category they represent. Since the x-axis (the horizontal axis) represents the categories that were measured, it tends to have no scale. The y-axis (the vertical axis) does have a scale, and this indicates the units of measurement. Figure 13.2 looks at the same set of data we used previously.
With the bar chart, like the pie chart, we are able to easily see that the most prolific social media participant was user 4. What’s slightly more difficult to see is that the posts of all the other users combined still don’t equal those made by user 4. That fact was much easier to see in Figure 13.1. On the other hand, a bar chart lets us pick out the user with the smallest or the largest number of posts a bit more quickly. Pie slices, because they represent percentages and not actual numbers, sometimes make that more difficult to discern. It is also a bit easier to compare two bars (consider user 2 and user 3 in Figure 13.2) than the corresponding pie slices in Figure 13.1. A comparison can be made, but it takes the audience a bit more time to see the difference in a pie chart than in the bar chart.
One danger of using bar charts arises when comparing different graphs or charts. While we didn’t have to scale the data in Figure 13.2, sometimes the data points might be so varied that they need to be scaled to be represented on a graph. The bottom line: Watch out for inconsistent scales across multiple graphs. If two charts are put side by side, ensure that they are using the same scale. If they don’t have the same scale, be aware of the differences and how they might trick the eye.
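One way to guard against inconsistent scales is to compute a single axis maximum from all of the data before drawing any chart. The sketch below is only an illustration: the two data series are hypothetical, and `shared_axis_max` is a helper name we made up, not part of any charting library.

```python
import math


def shared_axis_max(*series, step=100):
    """Return one y-axis maximum, rounded up to a multiple of `step`,
    that is large enough for every series passed in."""
    if step <= 0:
        raise ValueError("step must be positive")
    largest = max(max(values) for values in series)
    return math.ceil(largest / step) * step


# Hypothetical values for two charts that will be shown side by side.
chart_a = [120, 340, 90]
chart_b = [15, 790, 42]

# Scaled independently, the two charts would top out at different values...
own_a = shared_axis_max(chart_a)
own_b = shared_axis_max(chart_b)
print(own_a, own_b)

# ...so we compute one maximum and apply it to both charts.
y_max = shared_axis_max(chart_a, chart_b)
print(y_max)
```

Applying `y_max` to both charts keeps a bar of a given height meaning the same thing in each of them.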
A perfect example of a scaling problem was discussed by Naomi Robbins in an online article in Forbes Magazine [3]. Consider the graphic (which we re-created) in Figure 13.3 showing the relative number of medals won, by country, in the Summer Olympics. While the graphic is interesting, it can be quite misleading. For example, if we look at Germany with 500 medals (the data reports 499, but it’s close enough), we might assume that each graphic of a medal is equivalent to 250 medals awarded. That makes sense. But if that were the case, then shouldn’t Russia show 4 medals? By that scale, Russia would appear to have been awarded around 1,250 medals, and the USA figure of 1,975 would really be 1,500 (250 × 6). Clearly, the scale doesn’t work for this graph, and while it was probably trying to show the relative number of medals by country, in the long run, it would probably cause more discussion and confusion when the audience tries to rationalize the numbers.
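The arithmetic behind this complaint can be checked in a few lines. In the sketch below, the medal totals are the ones quoted above, while the icon counts are inferred from the arithmetic in the text (500 / 250 for Germany and 250 × 6 for the USA) rather than read from the original graphic, so treat them as assumptions.

```python
# country -> (medals reported, medal icons drawn).
# Icon counts are inferred from the text's arithmetic, not from the graphic.
pictogram = {"Germany": (499, 2), "USA": (1975, 6)}


def medals_per_icon(pictogram):
    """Return how many medals each icon stands for, per country."""
    return {
        country: medals / icons
        for country, (medals, icons) in pictogram.items()
    }


implied = medals_per_icon(pictogram)
for country, value in implied.items():
    print(f"{country}: one icon = {value:.1f} medals")

# In a consistent pictogram every icon is worth the same amount,
# so this ratio would be 1.0 (or very close to it).
spread = max(implied.values()) / min(implied.values())
print(f"largest / smallest icon value: {spread:.2f}")
```

A ratio well above 1 means the icons do not share a common unit, which is exactly what leaves the audience trying to rationalize the numbers.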
Line Charts
Line charts are similar to bar charts and at times can seem interchangeable; however, a line chart works best for continuous data, whereas bar and column charts work best for data that is categorized. Think of continuous data of the same dimension that is changing over time (the number of posts made over the past 30 days, the number of mentions of a product in a 24-hour period, and so on). Of course, we can use bar charts to show the change of values for a particular entity over time as well; that may come down to style, but generally speaking, a line chart is much more useful in discerning trends and patterns in data.
Let’s look at an example of the number of mentions of a particular product over a 24-hour period. The data we use is from Table 13.1, which lists the hour (0–23 hours) and the number of mentions made in that hour. A quick look at the table reveals nothing out of the ordinary. At first glance, this looks like a US-based audience (perhaps the Northeast) because social media posts are made throughout the day and a noticeable drop to zero occurs around 2 a.m. to 6 a.m. (when we assume users are sleeping), but no real trend stands out.
A line chart, as shown in Figure 13.4, instantly shows an interesting trend.
It appears that around 13:00 hours and then again at 21:00 hours, discussion about the product or service is at its peak. Clearly, any kind of marketing plan or advertising should take place around these times. But these kinds of trends, while possible to see in Table 13.1, aren’t as obvious as when shown as a line chart.
Watching this data over time, perhaps over several days, could show a repeating, and hopefully predictable, pattern that could be invaluable to those wishing to engage with potential customers or prospects.
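The peaks that a line chart makes obvious can also be located programmatically, which is handy when comparing several days of data. The sketch below uses hypothetical hourly counts that we invented to mimic the shape described above (quiet from 2 a.m. to 6 a.m., busiest around 13:00 and 21:00); they are not the values from Table 13.1, and `local_peaks` is our own helper name.

```python
# Hypothetical mentions per hour (index 0 = midnight, 23 = 11 p.m.).
mentions = [3, 1, 0, 0, 0, 0, 0, 4, 9, 12, 15, 21,
            30, 42, 28, 19, 16, 18, 22, 27, 35, 44, 20, 8]


def local_peaks(values):
    """Return the indexes whose value is strictly higher than both neighbors.
    The first and last points are skipped because they have only one neighbor."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    ]


peak_hours = local_peaks(mentions)
quiet_hours = [hour for hour, count in enumerate(mentions) if count == 0]

print("peak hours:", peak_hours)
print("quiet hours:", quiet_hours)
```

Real data is noisier than this, so a simple neighbor comparison will usually report many small bumps; smoothing the series first, or requiring a minimum height, is a common refinement.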
Scatter Plots
While line charts provide a way to map independent and dependent variables that are both quantitative (that is, measurements), a scatter plot can be useful to depict a trend or the direction of the data. When both variables are quantitative on a graph, we can interpret a line that spans that data as a slope (or prediction) of future data or trends. Scatter plots are similar to line charts in that they start with mapping quantitative [...]
In the case of Figure 13.5, we can see that the general trend for tweets made over a 24-hour period is on the rise, or generally increasing over time. Obviously, there are peaks and valleys in the data, but the amount of chatter, or conversation, is increasing.
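The line that “spans” a scatter plot is commonly fitted with ordinary least squares; we assume that technique here, since the text does not say how the trend in Figure 13.5 was drawn. The tweet counts below are hypothetical, and `trend_line` is a name of our own choosing.

```python
def trend_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more points, with matching x and y")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical tweets per hour: peaks and valleys, but rising overall.
hours = list(range(24))
tweets = [4, 7, 3, 9, 6, 12, 8, 14, 11, 17, 13, 19,
          15, 22, 18, 24, 20, 27, 23, 29, 25, 32, 28, 34]

slope, intercept = trend_line(hours, tweets)
direction = "increasing" if slope > 0 else "flat or decreasing"
print(f"about {slope:.2f} more tweets each hour ({direction})")
```

The sign of the slope captures the general direction of the data, while the individual points still show the peaks and valleys.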
Common Pitfalls
When creating these graphs, we should consider a few things in an effort to keep messages clear and allow the audience to focus on the story or message being delivered, not on the charts and all the pretty colors.
Information Overload
One of the most common issues we’ve seen with charts or presentations that attempt to show the results of a study is that they often contain too much information. Consider a simple chart (such as that in Figure 13.2) where the amount of information is kept to a minimum: the label for the data points along the x-axis and the values on the y-axis. There isn’t much more needed. Often analysts like to augment their graphs with notes that indicate a peak or valley in the data and perhaps why it might have occurred. Is this information really necessary? Isn’t that what a picture is for, to visually show the peak or valley?
Other times we see a pie chart (similar to Figure 13.1) with two or three other graphs that attempt to show a similar concept or provide an alternative way to represent the same concept. If one graph doesn’t describe the concept well enough, we need to ask, “Is that graph really providing value?”
While redundancy is one issue, the side effect of adding too much information to a graph is that in order to fit the additional information, font sizes tend to get smaller, approaching unreadable.
The Unintended Consequences of Using 3D
When creating graphs to depict our data, often we feel that they are sort of dull or uninteresting (how interesting can you make a bar graph that depicts users and number of message postings, anyway?). But we all want our results to look visually appealing with the thought that the presentation could keep our audience’s attention. Often we’re tempted to take a standard graph and turn it into a three-dimensional rendering in an attempt to spice up the message. But this can have at least two unintended consequences.
The representation of the results may be so compelling that the audience misses the message and focuses on the pretty picture. Nothing could be worse! Not only do they not receive the intended message, but in the long run, if this metric is utilized later (say as a descriptive statistic), they may just not understand how it’s used. In essence, they may have focused more on the delivery rather than the message.
One of the considerations we mentioned earlier is that the goal in presenting results is to do it in such a way as to avoid confusion. Nothing can derail a discussion about the meaning of an insight or a metric more than a discussion about the validity of the data. Now let’s look at the charts in Figure 13.6.
In this figure, the same data is plotted in two dimensions (2D) versus three dimensions (3D). While the chart on the right (the 3D rendering) does look a little prettier, the audience may start to question the values on the chart. Look at user 7. In the 2D graph, the value looks to be 800, yet on the 3D graph, the value appears to be below the 800 line. Upon realizing this, viewers may turn the conversation from the value of seeing user 7 as the most prolific to “why is the value not represented correctly?” Inadvertently, this graph has now raised some doubt in the minds of the audience as to the validity of the data being presented.
Using Too Much Color
Another consideration is color. Often analysts go overboard with the use of colors (and font types for that matter) in their graphs. As with many aspects of visual perception, humans do not all perceive color in the same way. Said another way, every user’s perception of an object is influenced by the context (or color) in which it is presented. This doesn’t mean we should stay away from color, but it does mean that color should be used sparingly.
Consider the bar chart from Figure 13.2. If each of the bars were drawn in a different color (say blue, red, green, gray, and purple), does that provide any additional value to the chart? Or does it raise the question “What do the colors represent?” and simply add a layer of confusion to the message? Then again, we have to consider that an audience member who sees a red bar in a bar chart may assume that it marks a problem or an area to concentrate on (we tend to think of red as indicative of a problem area). Don’t use color to decorate the graph. Prettying up a graph might serve a purpose in attracting attention, but from an analytics perspective, it can only distract from what’s important: the data and the insights we are attempting to draw from the data.
Alternatively, if our graph were drawn in black and white and one of the bars was coded in a color (say red), it would stand out and perhaps draw attention to that specific point, which may be the intention. In that case, the use of color adds value to the graph in calling out a specific area that merits discussion.
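As a small illustration of using color to call out one point rather than to decorate, the sketch below builds a per-bar color list in which every bar is neutral except the one we want to discuss. The helper name `bar_colors` and the default color names are our own assumptions. Many plotting libraries accept such a list (matplotlib’s `bar`, for example, takes one through its `color` argument).

```python
def bar_colors(labels, highlight, base="gray", accent="red"):
    """Return one color per bar: `accent` for the highlighted label,
    `base` for every other bar."""
    if highlight not in labels:
        raise ValueError(f"{highlight!r} is not one of the bar labels")
    return [accent if label == highlight else base for label in labels]


labels = ["user 1", "user 2", "user 3", "user 4", "user 5"]
colors = bar_colors(labels, highlight="user 4")
print(colors)
```

Keeping the rule in one place also makes it harder for extra colors to creep in as a chart is revised.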
Visually Representing Unstructured Data
Probably the largest problem with social media analytics is figuring out how to compute and then show relationships in the data and also visually represent topics of conversation. One of the more frequently used techniques is that of word frequency.
The frequency with which particular words are used in a set of messages can potentially tell us something meaningful about that set of text. Of course, if all of the messages were produced by a single individual, the frequencies may tell us something about the author because the choice of words an individual uses is often not random but purposeful. In our case, we tend to look at the messages from a wide variety of people in the hopes of detecting common signals or messages in that body of data. A depiction of a word frequency report may be useful if we want to determine whether the most frequently used words of a given text represent a potentially meaningful pattern that may have been overlooked in a casual glance at the data.
However, when we look more closely at the graph, two things stand out:
■ There are far too many words to show along the horizontal axis. After about 10 values, the points get placed too close together, and therefore, it becomes difficult to read. This example shows just the top 60 words in a shortened version of what we’ve used previously; typically, we like to look at the top 100 to 150 words.
■ The scale quickly becomes confusing when there are words that have a very high frequency versus those that are average to low. Remember, average to low in the top 60 words is still significant. The point is that it’s difficult to discern the subtle differences between word usage.
For these reasons, we like to use the word cloud version of this analysis (see Figure 13.8).
The word cloud quickly shows the relationship between word counts by making the more frequently used words larger. Since these words are larger and more prominently displayed, they quickly catch the audience’s eyes. But more than that, the audience can draw their own conclusions about the relative use of one word versus another. This approach becomes even more useful when we look at the frequency of word phrases (two or more words used together).
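The sizing rule behind a word cloud can be sketched as a simple mapping from counts to font sizes. The version below scales linearly between a smallest and a largest point size; that is one plausible scheme, not necessarily the one used to produce Figure 13.8, and the word counts are hypothetical.

```python
def font_sizes(word_counts, min_pt=10, max_pt=48):
    """Map each word's count to a font size between min_pt and max_pt,
    scaling linearly so the most frequent word is drawn largest."""
    lowest = min(word_counts.values())
    highest = max(word_counts.values())
    if highest == lowest:
        # Every word is equally frequent, so draw them all the same size.
        return {word: max_pt for word in word_counts}
    return {
        word: round(min_pt + (count - lowest) * (max_pt - min_pt) / (highest - lowest))
        for word, count in word_counts.items()
    }


# Hypothetical word counts, invented for illustration.
word_counts = {"cloud": 120, "data": 65, "security": 10}
sizes = font_sizes(word_counts)
print(sizes)
```

With a linear scale, one very frequent word can squeeze all the others toward the minimum size; scaling by the logarithm or square root of the count is a common alternative when that happens.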
Another consideration when trying to show the relationship between words used is to remove the clutter in the data. This clutter is often referred to as stop words. For the English language, this includes words such as a, the, is, and so on. Imagine creating a word cloud of frequently used words. If we included these in our analysis, more often than not the word cloud would be dominated by the word the with a count so high that it would literally drown out any of the other words.
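The effect of stop words is easy to demonstrate. The sketch below counts words in three made-up messages, once with a tiny illustrative stop-word list and once without it. Both the messages and the list are assumptions for the example; a real analysis would use a much fuller list.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "and", "of", "to", "in"}


def word_frequencies(messages, stop_words=STOP_WORDS):
    """Count lowercase words across all messages, skipping stop words."""
    counts = Counter()
    for message in messages:
        words = re.findall(r"[a-z']+", message.lower())
        counts.update(word for word in words if word not in stop_words)
    return counts


# Hypothetical messages, invented for this example.
messages = [
    "The cloud is the future of the enterprise",
    "Moving to the cloud is a security question",
    "The security of the cloud",
]

with_stop_words = word_frequencies(messages, stop_words=set())
without_stop_words = word_frequencies(messages)

print(with_stop_words.most_common(3))
print(without_stop_words.most_common(3))
```

In the unfiltered count, “the” tops the list; once the stop words are removed, the words that actually carry the topic of conversation rise to the top.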
Another consideration (and temptation) with word clouds is trying to make them look visually appealing. Often this has the reverse effect. Remember, early in this chapter we said that the goal of analysts is to report the findings and provide facts. When we introduce fancy charts, say in 3D or with fancy fonts, we distract the audience and risk having them miss the finer points of the insights.
For example, Figure 13.9 shows the same data used in a word cloud that is generated by an online tool.
We really like this website and often recommend it to users, but with the wrong options selected, the message of the word cloud could get lost in admiration of the graphic that was produced. Remember, the point of creating the graphic is about facts and insights, not sizzle.