Up to this point in our discussions, we’ve looked at how to find data, where to look for it, how to clean it, and now how to process it. Much of what we have discussed thus far involves the process of building a model (or a definition) of what we want to look for and then running an analysis to see if our model fits or provides relevant results. If the data seems to fit the model, we take the output, calling it information, and build our insights or knowledge from there. If the data doesn’t fit our model, perhaps we modify the model and try again, or go look for different sources of data that may be more relevant than what we have already gathered.
Specifically, we use this type of social media analytics strategy when dealing with situations as they occur rather than ones that are repeated on a regular basis. As a result, we are often searching for an answer in the dataset at hand. This ad hoc method assumes we are looking for a number of different insights to emerge from our assembled data and that a model will be able to accurately, or at least somewhat accurately, represent the information contained within the dataset. Sometimes we just want to look at the data we've collected, determine the answer to a specific question or two, and perhaps gain some insightful information along the way. Or perhaps we just want to understand the composition of the data before we go off and build a more complex model and undertake a deeper analysis.
Ad Hoc Analysis
The term ad hoc is derived from Latin and is loosely translated as "for this" or "for this situation." We use it to describe something that has been formed or used for a special and immediate purpose, without previous planning. Ad hoc analysis is the discipline of analyzing data on an as-needed or requested basis. This analysis is based on the set of data currently available to the person doing the analysis; thus, the resulting analysis is only as good as the data on which it is based. Generally, we look to this process to answer a specific question.

Consider the diagram in Figure 9.1. The process of analytics is often iterative: we attempt an analysis, determine whether we have succeeded or failed, make repairs and adjustments, and then try again. Figure 9.1 presents the iterative set of steps recommended for understanding a dataset in an attempt to answer a question from the data. We start at the top by locating what we believe to be our relevant data sources. We attempt a first model, or perhaps execute a series of simple queries against the data, and analyze our results. If we begin to see a pattern emerge, perhaps we can perform a deeper analysis or simply begin to assemble our derived insights. If, on the other hand, our queries reveal no answers or patterns, we have to ask ourselves whether we have a sufficient amount of data to analyze or whether the model we've built is incorrect. At this point, we make adjustments, either by augmenting the model or by adding additional data to our dataset.
The purpose of this type of analysis may be to fill in the blanks left by a larger analysis or to serve as a precursor to a larger initiative. More importantly, ad hoc reporting allows analysts (or general users) to manipulate and explore the data they have on hand to build reports, often on the fly, to answer their questions. Think of dashboards answering the question "What is happening in social network data around my topic?" and ad hoc reporting answering the question "Why is it happening?"
With smaller collections, the verification of data is often a straightforward process of simply looking at what we have to see whether the data we are interested in falls into the correct columns or whether it contains unreadable characters that would render any analysis useless. Often, what we, as data scientists, want to do is perform some simple queries on the data to ensure that what we are about to analyze is indeed going to give us some meaningful results. This idea of performing simple queries is a perfect illustration of what we mean by the ad hoc query of data. In some cases, we want to perform a one-time (or specific) query to build up our knowledge of the information we are uncovering. At times, these simple queries produce throwaway results, but even those give us an indication of whether we should proceed down our current path.
An Example of Ad Hoc Analysis
We differentiate between two types of analytics: predictive and descriptive. One looks to describe the contents of the dataset under analysis (descriptive); the other attempts to extract information from existing datasets to determine a pattern of behavior or to predict future outcomes and trends (predictive). An ad hoc analysis lies somewhere between these two.
Consider a dataset of tweets that were collected around January 17, 2013. On that date, Lance Armstrong confessed to Oprah Winfrey on national television in the United States that he won all seven of his record Tour de France championships with the help of performance-enhancing drugs and that his attitude was one of “win at all costs.” We wanted to understand the public’s reaction to this topic, so in this case, we collected data that contained:
■ Mentions of Lance Armstrong
■ Mentions of Oprah Winfrey
We didn’t collect the data for any other purpose than to experiment with some analytics around the event.
With the raw data in hand, we wanted to understand some of the basics about what we had collected before performing some simple descriptive analytics.
The first question we wanted to answer was:
Is the data clean (in other words, is it usable)?

For simplicity, we didn't take the full JSON string of data that we got from a typical tweet, since much of it was not relevant to our analysis. This could actually be an important step for teams that have concerns about the capacity and performance of their analytics environment. A typical tweet can have a large number of name-value pairs that represent all of the various parts of the tweet (when it was posted, the person's name, the Twitter handle, hashtags, and so on). Most of the information that came from Twitter was either redundant or wasn't useful in this exercise. As a result, we took the JSON data and simply converted it to a comma-separated value (CSV) file.
This resulted in a data file that contained the following elements (a sketch of the conversion step follows the list):
■ Preferred user name
■ Display name (what gets displayed in public Twitter)
■ The status count (the number of tweets this person has made)
■ Language of the tweet
■ The number of times the tweet was retweeted
■ The posted time
■ The text of the tweet
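The text doesn't show the actual conversion code, but a minimal sketch of that trimming step in R, using the jsonlite package, might look like the following. The input file name and the JSON field names are assumptions that would need to match whatever feed delivered the tweets; the output columns match the list above.

# A sketch of the JSON-to-CSV trimming step (source field names are illustrative).
library(jsonlite)

# stream_in() reads newline-delimited JSON (one tweet object per line);
# a file containing a single JSON array would use fromJSON() instead.
raw <- stream_in(file("tweets.json"))

# Keep only the columns we care about.
trimmed <- data.frame(
  prefusername = raw$user$screen_name,
  displayname  = raw$user$name,
  statuscount  = raw$user$statuses_count,
  lang         = raw$lang,
  retweetcount = raw$retweet_count,
  time         = raw$created_at,
  body         = raw$text,
  stringsAsFactors = FALSE
)

write.csv(trimmed, "LAdata.csv", row.names = FALSE)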
The first thing we wanted to do was just browse the data before we did any kind of analytics on it to see if it was well formed. This was a preliminary but very important step to ensure that all of the data was in its proper columns and was a complete dataset.
This process is best shown with some examples. Up to this point in the book, we've made it a point not to show specific products or dive into how-to descriptions, so here we try to stay as true to that doctrine as possible, but we may have to stray just a bit.
For ad hoc querying, we like to use interactive tools such as IBM's BigSheets, which is part of the BigInsights suite of tools from IBM; custom code that runs over Hadoop clusters; or public domain tools such as R. For this example, we used R.
R is a powerful and widely used analytics tool. It includes virtually every data manipulation, statistical model, and chart that the modern data scientist could ever need. You can easily find, download, and use cutting-edge, community-reviewed methods in statistics and predictive modeling from leading researchers in data science, free of charge [2].
Figure 9.2 shows a screen capture of the R session we ran on some of this data. We started off with the trimmed-down version of our dataset as discussed earlier. (Remember, we chose to convert the JSON data into CSV for easier manipulation. The data doesn’t change, just the format.)
Ad hoc analysis is essentially the process of running a number of simple queries against a database to understand more about the data we may want to process with more complex tools.
In section 1 of Figure 9.2, the first step was to bring our data into R. We used the read.csv() function, which reads a comma-separated value file into a variable called LAdata. Note that in this case, the first line of the file contains the headers for each column. For clarity, the first line of the CSV file contains the following (each element representing a name of the column): prefusername,displayname,statuscount,lang,retweetcount,time,body
To access a specific column in a table, the syntax is:
table_name$column_name
To see a specific row, we would use:
table_name[index,]
Typically, one of our first queries is to look at size—just to understand how big of a dataset we’re working with (see section 2 in Figure 9.2). The nrow() function in R simply looks at the number of rows of data read into the table variable (in this case, LAdata) and returns a count. So we can quickly see that this dataset had about 360,000 tweets from the second day of the interview. Not a very useful analytic, but a descriptive feature we may want to have when summarizing our data source.
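A minimal sketch of this first look in R, assuming the trimmed file is named LAdata.csv (the actual file name isn't given in the text):

# Read the trimmed CSV; the first line of the file supplies the column names.
LAdata <- read.csv("LAdata.csv", header = TRUE, stringsAsFactors = FALSE)

nrow(LAdata)        # number of tweets in the sample (about 360,000 here)
LAdata[1, ]         # the first row, to eyeball the column alignment
head(LAdata$lang)   # the first few values of the lang column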
Because this was our initial look at the data, the first thought was to get a quick overview of what we had collected. From the previous step, we knew we had about 360,000 tweets, but where did they come from? What language were they in?
In section 3 of Figure 9.2, we used the unique() function in R to look at a column of data in our table and summarize it for us, showing the individual values (or just the unique values). So if French, for example, was used in a tweet, the column called lang would contain fr. The unique() function lists it only once, so it gives a good representation of the number of unique languages used in the sample.
Interestingly, when we ran the command unique(LAdata$lang), we saw that R believed there were 935 unique values in the lang column of our dataset! The output of the unique() command reported that the 933rd unique language was "1953"; something didn't seem right. Obviously, "1953" isn't a valid language, so using R's query capabilities, we looked at the offending line of data to understand the error (and to see whether other rows might have similar problems). To do that, we used an R query that looked like this:
LAdata[LAdata$lang == "1953", ]
This simple query says:
“Show the row of data in LAdata where the lang column has the value ‘1953.’” What we got back resembled this:
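The actual output isn't reproduced here, but based on the description that follows, the offending row would have looked something like this (the user handle and trailing fields are placeholders, not real data):

  prefusername   displayname   statuscount   lang   ...
  (handle)       Matt Ganis    PhD           1953   ...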
Notice that the statuscount has the value PhD and the lang field has the value "1953". No doubt what happened was that this person entered his name in a format like this: Matt Ganis, PhD.
So when we took the original data and moved it into a CSV format, this name field looked like two separate elements separated by a comma. Consequently, all of the fields ended up shifting to the right. Simply cleaning up the original data and re-creating the CSV file fixed the problem. The cleanup was to remove all nonessential characters from the user-created data; in this case, those characters included commas, linefeeds, newlines, and so on.

The important thing is the lesson learned: Don't trust any free-form text data. By "free-form," we mean any input provided by a user that doesn't have any given structure or format. Looking at Figure 9.3, we can see that data collected from a web dropdown menu (the example on the left) is an example of what we mean by "well formed." In this case, when the user selects an option, we know exactly what values are possible. However, data collected from an input field, where a user can (and will) type anything he or she wants, is considered "free form." We need to look more closely at this data than at a well-formed set of data because we have no way of knowing ahead of time what the user may or may not enter. The content in the search query field shown on the right side of Figure 9.3 is an example of free-form input.
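The cleansing code itself isn't shown in the text, but a sketch of that cleanup in R, applied to the free-form fields before re-creating the CSV file, might look like this (the variable names carry over from the conversion sketch above):

# Remove characters that break the CSV layout from user-supplied fields.
# gsub() replaces every match of the pattern; here commas, carriage
# returns, and newlines become plain spaces.
clean_freeform <- function(x) gsub("[,\r\n]+", " ", x)

trimmed$displayname <- clean_freeform(trimmed$displayname)
trimmed$body        <- clean_freeform(trimmed$body)

write.csv(trimmed, "LAdata.csv", row.names = FALSE)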
When we looked at the data in a comma-separated value format, our cleansing program didn’t realize (and neither did we) that someone might use a comma in the name field. Basically, our first ad hoc query helped us uncover an issue with our JSON-to-CSV conversion.
Once we corrected this issue, we were able to see that our data appeared to be more correct (see Figure 9.4). Again, using the unique() command from R to determine the number of unique languages represented in this dataset, we can see in Figure 9.4 that the number is now 32. This number is much more reasonable than the previous value of 935, as reported in the initial queries from Figure 9.2.
So now, with the data reloaded, we could look at the various languages represented in our sample. Note that the R command unique() looked at the column of data lang and reported all of the unique values. We can see that there are 32 languages, which seems a much more reasonable number. To get a better idea of how much of our data was represented in specific languages, we again issued a simple R command to represent the data in tabular form (table(LAdata$lang)). Note that, for clarity, we show only the first two lines of this output. What might be more useful is to look at the representation of the data graphically.
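A sketch of those two steps, the tabular view and a quick graphical view, might look like this:

# Count tweets per language and sort so the most common appear first.
lang_counts <- sort(table(LAdata$lang), decreasing = TRUE)
head(lang_counts, 10)

# A quick bar chart of the top ten languages.
barplot(head(lang_counts, 10),
        main = "Tweets by language",
        ylab = "Number of tweets")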
So, with a quick query, we can see that the English-speaking population was the most vocal, followed by French (fr), Spanish (es), Dutch (nl), and Portuguese (pt).
For the purpose of this example, we chose to focus on the predominant language used in this sample, English. One of the first things we wanted to understand was the general feeling around this topic and the kinds of things people were saying.
Remember, we view ad hoc analysis as a simple query, or a “peek” at a dataset. Some computation may be involved, but if we were to look at it from a classification perspective, these are the simple, lightweight analytics, whereas the deeper insights and relationships (that is, computationally heavy) come with longer-running processes. Don’t misunderstand; we’re not saying that these are not valuable pieces of information. The more information we have about a topic, the more accurate our insights or knowledge becomes.
There's another way to think about this scenario. On the far right of the graph in Figure 9.5, where we show "Deep Analysis," we have a rich set of insights and relationships between terms and concepts built up in our analysis. This deep analysis leverages all of the descriptive statistics we collect and the simple ad hoc queries we run, combined with a complex textual analysis. This is why the graph slopes upward: the effort needed to derive value increases. Creating a set of descriptive statistics is relatively low in effort compared with the larger effort needed to derive deeper insights.
Data Integrity
One of the topics that we want to concern ourselves with is the integrity of the data we use. This is an important topic because many times when performing ad hoc queries or manipulation of data, we could inadvertently compromise the integrity of the dataset. A simple change to a value could lead to erroneous results. For this reason, we like to say that we "augment" datasets, but we don't change them. That may seem like a subtle difference, but it's meant to draw a distinction between changing data and adding additional data to an existing dataset.

In our case, we wanted to understand the discussion around the topic of Lance Armstrong's confession, but we thought it might be interesting to look at it from a gender perspective, to understand whether males reacted differently to his confession than females. For example, if we examined the general sentiment toward Mr. Armstrong's confession, we might find an overwhelmingly negative reaction. One hypothesis could be that this makes sense if the audience was predominantly male: males may view cheating in an athletic competition as outrageous, while females may land more on the sympathetic side and view the confession more ambivalently, or perhaps even positively. While perhaps not as relevant in this experiment, the idea of understanding how males and females react to different events could become important for areas such as marketing or public relations.
Unfortunately, Twitter and many of the other social media data mining outlets don’t supply that information. Many allow users to enter it into a profile, but often that data is not readily available for analysis. To get around this limitation, we can compute the gender based on some of the fields we have in this sample (see Figure 9.6).
Remember, we currently have the following information:
■ Preferred user name
■ Display name (what gets displayed in public Twitter)
■ The status count (the number of tweets this person has made)
■ Language of the tweet
■ The number of times the tweet was retweeted
■ The posted time
■ The text of the tweet
From this, if we take the display name from the tweet, we can look up the name in a dictionary of common names and return a value of “male,” “female,” or “unknown” (if we can’t find the name or if there is ambiguity in the name, such as Kim or Chris).
In the profile in Figure 9.6, the display name is represented in the tweet as Karen Scilla Ganis. So for our ad hoc query, we simply took the first part of the display name (we assumed it's a first name) and looked it up in a dictionary of common names (in this case, a dictionary of common US names). We built a new column in our dataset, inserted the result of our query (in this case, "female"), and then proceeded to the next name. When we were done, we had an augmented dataset that still contained our original data, including the display name (should we need it again), plus a new field (called "gender") that contains our computation of the user's gender.
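The lookup code isn't shown in the text, but a sketch of this augmentation in R might look like the following. The dictionary file us_names.csv (with name and gender columns) is hypothetical, standing in for whatever dictionary of common US first names is used.

# A sketch of the gender augmentation (the dictionary file is hypothetical).
names_dict <- read.csv("us_names.csv", stringsAsFactors = FALSE)

# Take the first token of the display name and assume it is a first name.
first_name <- tolower(sapply(strsplit(LAdata$displayname, " "), `[`, 1))

# Look each name up; names not found stay "unknown". (A fuller dictionary
# would also mark ambiguous names such as Kim or Chris as unknown.)
idx <- match(first_name, tolower(names_dict$name))
LAdata$gender <- ifelse(is.na(idx), "unknown", names_dict$gender[idx])

table(LAdata$gender)   # the quick query: male/female/unknown counts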
With that, a quick query revealed the following new demographic for our sample:
Using our simple calculation, we were able to classify about 72% of our dataset into a male or female category. If we wanted to spend more time, we could probably do additional work on the undetermined users because many of the display names use forms such as “Mr. John Jones” or “Mrs. Kelly Jackson” (clearly, Mr. and Mrs. aren’t proper first names, but we could easily deduce that they would be male or female with a few additional tests). Likewise, we were not naïve enough to think that our numbers were perfect. We knew that there would be (and definitely are) some misclassifications based on gender-neutral or unisex names. But given the fairly large numbers of tweets, we assumed those are in the minority (but it would be useful to look for a measure of uncertainty if we were to make any conclusions based on the derived gender).
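Those additional tests could be as simple as checking for an honorific before falling back on the name dictionary; a small sketch, continuing from the previous one:

# Use honorifics to resolve names the dictionary left as unknown.
title <- tolower(sapply(strsplit(LAdata$displayname, "[. ]+"), `[`, 1))
LAdata$gender[which(LAdata$gender == "unknown" & title == "mr")]  <- "male"
LAdata$gender[which(LAdata$gender == "unknown" & title == "mrs")] <- "female"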
The next thing we might want to get a feel for is the general tone of the conversations. What are the hot buttons in the conversations? One way to do this quickly is to create a word cloud. We discussed them previously, but since we’ll be looking at a few here, let’s review the concept quickly.
Basically, a word cloud takes all of the relevant words in a set of text and displays them in such a way as to indicate use. Words appearing in a larger font are used more frequently, and words appearing in smaller fonts are used less frequently. Often, people use color to indicate levels of use or intensity. These conventions are useful, but only to the point of giving you, the analyst, a feeling for areas you may want to explore further in the dataset.

Many applications allow the user to remove words or text that may cause confusion. For example, in these word clouds, we removed all of the English "stop" words (the, this, is, was, and so on). The theory is this: If we were to count the uses of those stop words, they would far outpace the other words. We also eliminated punctuation and converted everything to lowercase to avoid case as a factor (when somebody uses IBM or ibm, they still mean the same thing). The other thing we removed was common words we don't need to show in a word cloud. In this case, since this was an interview of Lance Armstrong conducted by Oprah Winfrey, we simply removed any references to their names. If a word cloud shows a lot of mentions of these words, it doesn't really tell us any net new information.
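The text doesn't say which package produced these clouds; one common way to build them in R is with the tm and wordcloud packages. A sketch, applying the cleanup steps just described (stop words, punctuation, lowercase, and the participants' names removed):

library(tm)
library(wordcloud)

# Build a word cloud from a vector of tweet texts.
make_cloud <- function(texts, max_words = 150) {
  corpus <- VCorpus(VectorSource(texts))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords,
                   c(stopwords("english"), "lance", "armstrong", "oprah", "winfrey"))
  tdm   <- TermDocumentMatrix(corpus)
  # For a corpus this large, a sparse summation (e.g., slam::row_sums(tdm))
  # avoids building the dense matrix; the dense form is fine for a sketch.
  freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  wordcloud(names(freqs), freqs, max.words = max_words, random.order = FALSE)
}

# English tweets only, as in Figure 9.7.
make_cloud(LAdata$body[LAdata$lang == "en"])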
For example, Figure 9.7 shows the word cloud (limited to the top 150 words) derived from the 288,875 tweets that were marked as being in English.
Many of the terms are obvious, but the power of the word cloud is that it gives us a great starting point for deeper analysis. If we were starting blind (with just a set of tweets about the topic in front of us), we would probably start building a model with words such as drugs and cheating, which are shown in this word cloud. We might not have thought of using words like doping or doped.
One thing we found very interesting was the widespread reference to Manti Te'o. Many people drew comparisons to the story of how Manti Te'o revealed that his girlfriend, a supposed cancer victim, never existed. In other words, he lied just as Lance Armstrong did. This result was something we never would have predicted, but it could provide some valuable insights into the perception of Mr. Armstrong and his earlier denials of the allegations about his use of performance-enhancing drugs.
A look at the word clouds broken down by gender (see Figure 9.8) could be useful in revealing different feelings or emotions. In this exercise, we divided all of the content from the tweets in our database into those tweets authored by males and those authored by females. We then took these two subsets of words and converted them into word clouds, as shown in Figure 9.8.
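Using the derived gender column and the make_cloud() helper sketched earlier, the gender split is just a pair of subset calls:

# English tweets only, split by the computed gender column.
eng <- LAdata[LAdata$lang == "en", ]
make_cloud(eng$body[eng$gender == "male"])
make_cloud(eng$body[eng$gender == "female"])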
In the word cloud on the left, we see that males were focused on what Rick Reilly (a sports writer for ESPN) was saying, as well as many references, again, to Manti Te'o. The female side had similar results, although the magnitude of use for both of these references was lower. Interestingly, males referenced Sheryl Crow, whereas females used the word futurecrow. Both males and females commented on Lance Armstrong's apologies and the stripping of his awards. Both groups seemed to give similar emphasis to words such as lying and doping.