Social data analysis is the critical set of activities that assist in transforming raw data into insights, which in turn leads to a new base of knowledge and business value. A review of the current literature on big data social media analytics reveals a wide variety of use cases and types of analysis.
Many different types of analysis can be performed with social media data. In our own work, we have found it beneficial to categorize and classify these different types of analysis in a number of “buckets.” In this chapter, we present our recommended taxonomy schema for social media analytics as well as the types of insights that could be derived from that analysis. This taxonomy is mainly based on four dimensions:
- The depth of analysis (the complexity of the analysis)
- The machine capacity (the computational needs of the analysis)
- The domain of analysis (internal versus external social media)
- The velocity of data (the rate at which data arrives for analysis)
This proposed taxonomy is then examined through the use of several sample metrics derived from a number of engagements we have worked on in the past.
The Four Dimensions of Analysis Taxonomy
A taxonomy is a valuable construct for the categorization and organization of attributes used to describe similar entities. There have been many attempts to classify the type of analysis possible in the social networking analytics space [3] and, recently, an attempt to classify the types of data used [4]. In the case of social data analysis, various tools and techniques are used to aid analysts in drawing conclusions from these data sources. Understanding the type of analysis that is required is important when comparing tools or services that would be needed for any future analysis projects.
The taxonomy is mainly based on four dimensions:
- Depth of analysis—Simple descriptive statistics based on streaming data, ad hoc analysis on accumulated data, or deep analysis performed on accumulated data.
- Machine capacity—The amount of CPU needed to process datasets in a reasonable time period. Capacity numbers need to address not only the CPU needs but also the network capacity needed to retrieve data.
- Domain of analysis—The vast amount of social media content available can be broadly classified into internal social media (content shared by company employees with each other that typically stays inside a firewall) and external social media (content that is outside a company’s firewall).
- Velocity of data—Streaming data or data at rest. Streaming data, such as tweets being posted in real time about a conference, contrasts with data accumulated over the past five minutes, past day, past week, past month, or past year.
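One way to make the four dimensions concrete is to record where a given use case sits along each axis. The sketch below is our own illustration, not part of the taxonomy itself; all class and field names are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum

class Depth(Enum):
    SMALL = "small"    # simple descriptive statistics
    MEDIUM = "medium"  # ad hoc analysis on accumulated data
    LARGE = "large"    # deep analysis on accumulated data

class Domain(Enum):
    INTERNAL = "internal"  # inside the company firewall
    EXTERNAL = "external"  # outside the company firewall

class Velocity(Enum):
    STREAMING = "streaming"
    AT_REST = "at_rest"

@dataclass
class AnalysisProfile:
    """Position of one analytics use case along the four taxonomy dimensions."""
    depth: Depth
    cpu: str        # machine capacity: "low", "moderate", or "high"
    bandwidth: str  # network capacity needed to retrieve the data
    domain: Domain
    velocity: Velocity

# A real-time brand-monitoring use case would be profiled like this:
realtime_monitoring = AnalysisProfile(
    depth=Depth.SMALL, cpu="high", bandwidth="high",
    domain=Domain.EXTERNAL, velocity=Velocity.STREAMING)
```

Profiling use cases this way makes it straightforward to compare tools against the capacity each dimension demands.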
For each distinct type of analytics specified in the taxonomy, we describe the type of analysis that can be performed and what types of techniques can be utilized for the analysis.
Depth of Analysis
The depth of analysis dimension is really driven by the amount of time available to come up with the results of a project. This can be considered as a broad continuum, where the analysis time ranges from a few hours at one end to several months at the other end. For the sake of simplicity, we can consider this as three broad categories: small, medium, and large.
In cases in which the depth of analysis is small, we typically use a system called Simple Social Metrics (SSM). SSM allows us to look at a stream of data and come up with some simple and quick metrics that yield useful information. For example, if we are monitoring Twitter data on a given topic, say cloud computing, SSM will be able to answer the following questions at the end of the day:
- How many people mentioned IBM in their tweets?
- How many people mentioned the word Softlayer in their tweets?
- How many times were the words IBM, Microsoft, and Amazon mentioned during the day?
- Which author had the highest number of posts on cloud computing during the day?
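Metrics of this kind reduce to simple counting over the day's stream. A minimal sketch, using a few invented tweet records (the data and function names are ours, not SSM's):

```python
from collections import Counter

# Hypothetical day's worth of (author, text) tweet records.
tweets = [
    ("alice", "IBM and Softlayer lead in cloud computing"),
    ("bob",   "Comparing IBM, Microsoft, and Amazon clouds"),
    ("alice", "More thoughts on cloud computing and Microsoft"),
]

def mention_count(term):
    """How many tweets mention the term (case-insensitive)?"""
    return sum(term.lower() in text.lower() for _, text in tweets)

def top_author():
    """Author with the most posts in the stream."""
    return Counter(author for author, _ in tweets).most_common(1)[0][0]

print(mention_count("IBM"))        # 2
print(mention_count("Softlayer"))  # 1
print(top_author())                # alice
```

Because each question needs only a single pass over the stream, the CPU cost stays low even for large daily volumes.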
In cases that could be classified as having a medium depth of analysis, we can take the example of projects in which we have to do ad hoc analysis. Consider, for instance, the case in which our marketing team has been collecting information from social media channels, including Twitter, over the past three months. Now the team wants the analysts to answer the following questions:
- Which IBM competitor is gathering the most mentions in the context of social business?
- What is the trend of positive sentiment of IBM over the past three months in the context of mentions of the term social business?
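Both questions are aggregations over the accumulated three months of data. A minimal sketch of how such an ad hoc report might be computed, using invented records (the month labels, competitor names, and sentiment tags are hypothetical):

```python
from collections import defaultdict

# Hypothetical accumulated records: (month, competitor mentioned, sentiment)
posts = [
    ("2013-01", "Microsoft", "positive"),
    ("2013-01", "Amazon",    "negative"),
    ("2013-02", "Microsoft", "positive"),
    ("2013-03", "Amazon",    "positive"),
    ("2013-03", "Microsoft", "negative"),
]

# Which competitor gathers the most mentions?
mentions = defaultdict(int)
for _, competitor, _ in posts:
    mentions[competitor] += 1
leader = max(mentions, key=mentions.get)

# Monthly share of positive sentiment (the trend).
by_month = defaultdict(lambda: [0, 0])  # month -> [positive, total]
for month, _, sentiment in posts:
    by_month[month][1] += 1
    if sentiment == "positive":
        by_month[month][0] += 1
trend = {m: pos / total for m, (pos, total) in sorted(by_month.items())}
```

The output of such a run is the one-off report that characterizes ad hoc analysis: it answers these specific questions and is then done.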
And, lastly, cases that could be considered as having a large depth of analysis are varied and are really project specific. For example, a group within IBM that is responsible for releasing new features for a specific product wants to do an in-depth analysis of social media chatter about its product continuously over a period of one year. This group may do a baseline analysis for three months, where it just collects social media data, counts the number of mentions, and assesses sentiment. Then, in response to a specific new feature release, the group wants to see how the same metrics change over a period of three or six months. The group also wants to understand how the sentiment is influenced by other marketplace factors, such as the overall economy or competitors’ product releases.
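The core of such a study is comparing the same metrics across the baseline and post-release periods. A minimal sketch under invented data (the records and the [-1, 1] sentiment scores are hypothetical; a real deep analysis would of course involve far larger datasets and more attributes):

```python
def summarize(posts):
    """Mention count and mean sentiment score for one period."""
    scores = [score for _, score in posts]
    return len(posts), sum(scores) / len(scores)

# Hypothetical (text, sentiment score in [-1, 1]) records per period.
baseline = [("old UI is clunky", -0.4), ("solid product", 0.5), ("fine", 0.1)]
post_release = [("love the new feature", 0.8), ("much improved", 0.6)]

b_count, b_sent = summarize(baseline)
r_count, r_sent = summarize(post_release)
sentiment_shift = r_sent - b_sent  # a positive shift suggests the release helped
```

Correlating that shift with external factors such as the economy or competitor releases is what pushes the CPU requirement high: the same aggregation must be repeated across many attributes and time windows.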
Machine Capacity
The machine capacity dimension considers the network and CPU capacity of the machines that are either available or required for a given type of use case.
In subsequent sections of this chapter, we discuss aspects of network use and CPU use in the context of the four main dimensions that we have considered in the taxonomy. Table 6.1 shows a summary view of network and CPU requirements.
Real-time analysis in social media is an important tool when trying to understand the public’s perception of a certain topic as it is unfolding, allowing for a reaction or an immediate change in course. The amount of data to be processed can be very large in such cases. During one of the debates between President Barack Obama and Mitt Romney during the 2012 presidential election, there were about 20,000 related tweets per second. The need to collect, store, and analyze information at such velocities causes us to rate the bandwidth and the CPU requirements as high within our taxonomy.
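Tracking a per-second rate like the one above amounts to counting events in a sliding time window. A minimal sketch (the class name and timestamps are our own invention, not a real ingestion tool):

```python
from collections import deque

class RateMeter:
    """Count events (e.g. tweets) seen in the last `window` seconds."""
    def __init__(self, window=1.0):
        self.window = window
        self.times = deque()

    def record(self, t):
        self.times.append(t)
        # Drop timestamps that have fallen out of the window.
        while self.times and self.times[0] <= t - self.window:
            self.times.popleft()

    def rate(self):
        return len(self.times)

meter = RateMeter(window=1.0)
for t in [0.0, 0.2, 0.4, 0.9, 1.1]:
    meter.record(t)
print(meter.rate())  # 4: the event at t=0.0 has aged out of the window
```

At 20,000 events per second, even this trivial bookkeeping must run continuously, which is why both bandwidth and CPU land in the high bucket for real-time analysis.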
In the case of near real-time analysis, we assume that data is ingested into the tool at a rate that is less than real time. As a consequence, the bandwidth requirement is less than that of a real-time component, and the CPU requirement also becomes less.
An ad hoc analysis is a process designed to answer a single specific question. The product of ad hoc analysis is typically a report or data summary. An ad hoc analysis is typically used to analyze data at rest—that is, data that has previously been retrieved and ingested in a non-real-time manner. These types of systems are used to create a report or analysis that does not already exist, or drill deeper into a specific dataset to uncover details within the data. As a result, the CPU requirement can be moderate while the network bandwidth requirement would be relatively low.
A deep analysis implies an analysis that spans a long time and involves a large amount of data, which typically translates into a high CPU requirement.
Figure 6.1 presents a graphical view of the machine capacity requirements.