When you think of the term social data, what comes to mind?
At a very simplistic level, data is nothing more than a collection of facts, measurements, or observations that, when combined, form or create what we might call information. At its core, data is a raw, unorganized set of facts that need to be processed into something more meaningful. A set of numbers (or data points) such as those shown in Table 5.1 is relatively meaningless until we put them into context. The numbers 45.3, 39.1, 35.9, and so on mean nothing to us until we realize (from Table 5.2) that they are measurements of temperature. So, by combining data (values in a table) with the additional contextual data that the values represent temperature, we are able to extract “information.”
By labeling the column of data with the word Temperature, we at least have a bit of context. The numbers by themselves could represent literally anything: distances between a specific city and nearby towns or cities, average test scores, water levels in a reservoir, or any other measurable quantity. But even with the label, they’re still not useful, because several questions remain:
- Are those temperatures at a specific location?
- When were the measurements taken? (At night, during the sunlit day, under water?)
- How is the temperature measured? Is it degrees Fahrenheit or Celsius, or is it measured in kelvins?
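To make the point about context concrete, here is a minimal sketch in Python. The `Measurement` class, its field names, and the choice of Fahrenheit for the sample value are assumptions made for illustration; the source does not say which unit the table uses.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """A raw value plus the context needed to interpret it (hypothetical structure)."""
    value: float
    quantity: str  # what is being measured, e.g. "temperature"
    unit: str      # "F", "C", or "K"


def to_celsius(m: Measurement) -> float:
    """Convert a temperature measurement to degrees Celsius."""
    if m.quantity != "temperature":
        raise ValueError("not a temperature measurement")
    if m.unit == "C":
        return m.value
    if m.unit == "F":
        return (m.value - 32) * 5 / 9
    if m.unit == "K":
        return m.value - 273.15
    raise ValueError(f"unknown unit: {m.unit}")


# The bare number 45.3 cannot be converted or compared until a unit is attached.
# Fahrenheit is assumed here purely for the example.
reading = Measurement(value=45.3, quantity="temperature", unit="F")
print(round(to_celsius(reading), 1))
```

The same number tagged with a different unit would describe a very different temperature, which is why the questions above matter before any analysis begins.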
Structured Data Versus Unstructured Data
Even though data comes in a variety of shapes, sizes, and forms, for the purpose of our analysis, we can think of data as coming in essentially two categories: organized and unorganized. That may be a bit of an oversimplification. A data scientist would call them structured and unstructured. Think of structured data as data that contains a high degree of organization: We can define a data model for it and allow it to easily be placed into a database system. Once in a database, the data is readily searchable by simple, straightforward search engine mechanisms. Consider the example shown in Figure 5.1.
This example illustrates a table from a structured database. Each row in the table represents an observation (a specific employee), and each column is an attribute that describes the row of data (the person’s last name, position in the company, work location, and so on). We call it structured because each record or row has the same attributes. Each one can be queried because we know the type of data held in each attribute. For example, we know the country column has a well-known set of values, as does the work location code and position. This table can be stored in a database, so querying or evaluating the data is relatively straightforward using standard database querying tools that are available in the marketplace.
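As a rough sketch of what "readily searchable" means in practice, the following Python example loads a few employee rows into an in-memory SQLite database and queries them by attribute. The table name, column names, and sample rows are invented for illustration and are not taken from Figure 5.1.

```python
import sqlite3

# Hypothetical employee records; every value here is made up for the example.
EMPLOYEES = [
    ("Smith", "Engineer", "NYC", "US"),
    ("Jones", "Manager", "LON", "UK"),
    ("Garcia", "Engineer", "MAD", "ES"),
]


def build_employee_db() -> sqlite3.Connection:
    """Create an in-memory table in which every row has the same attributes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee ("
        "last_name TEXT, position TEXT, work_location TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", EMPLOYEES)
    return conn


def last_names_by_position(conn: sqlite3.Connection, position: str) -> list:
    """Return the last names of all employees holding the given position."""
    rows = conn.execute(
        "SELECT last_name FROM employee WHERE position = ? ORDER BY last_name",
        (position,),
    )
    return [row[0] for row in rows]


conn = build_employee_db()
print(last_names_by_position(conn, "Engineer"))
```

Because each column has a known meaning and type, a single declarative query is enough to answer the question; no parsing of free text is required.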
Unstructured data is essentially the opposite of structured. The lack of structure makes collecting this data a time- and energy-consuming task due to the nonuniformity of the data.
In the case of social media, the data takes the form of unstructured data, or data without a specific format. Unstructured data often includes text but more often than not can contain additional multimedia content. In the case of social media, this could include likes, URLs in messages, and pictures or references to other individuals. One apparent contradiction to consider is that different data sources may have a specific “application structure”; that is, data from one source can all look similar (as if it were structured). But because the data that these sources contain doesn’t fit neatly into a database or across multiple applications, we still refer to it as “unstructured.” Consider the Twitter example in Figure 5.2.
While this data looks similar to that of the previous example of structured content, the real value of the Twitter data (or any other social media content) comes from an analysis of the unstructured portion of the “record,” in this case, the tweet.
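The following Python sketch shows one simple way to pull some structure out of the free-text portion of such a record. The tweet, its field names, and the regular expressions are illustrative assumptions; they do not reflect the actual Twitter data format or the record shown in Figure 5.2.

```python
import re

# Hypothetical record: the user and timestamp fields are structured,
# while the "text" field is free-form and needs further processing.
tweet = {
    "user": "example_user",
    "created_at": "2014-01-15T10:30:00Z",
    "text": "Loving the new phone from @ExampleCo! "
            "Battery lasts all day #happy http://example.com/review",
}

# Deliberately simple patterns; real-world text needs more careful handling.
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")
URL_RE = re.compile(r"https?://\S+")


def extract_features(text: str) -> dict:
    """Pull mentions, hashtags, and URLs out of unstructured tweet text."""
    return {
        "mentions": MENTION_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
        "urls": URL_RE.findall(text),
    }


print(extract_features(tweet["text"]))
```

Even this small step only recovers surface features. Judging what the author actually thinks of the phone requires analysis of the remaining words, which is where the real effort lies.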
Big Data
According to an article by Gil Press of Forbes [2], the first documented use of the term big data appeared in a 1997 paper by scientists from NASA, in which they were describing the problem they had with the visualization of their datasets:
[A large data set] provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.
The term big data also appeared in a 2008 paper by Randal Bryant, Randy Katz, and Edward Lazowska that made the following bold statement:
Big-data computing is perhaps the biggest innovation in computing in the last decade. We have only begun to see its potential to collect, organize, and process data in all walks of life.
There are a number of definitions for big data:
“Big data is the derivation of value from traditional relational database-driven business decision making, augmented with new sources of unstructured data.” —Oracle Corporation
“Big data is the term increasingly used to describe the process of applying serious computing power—the latest in machine learning and artificial intelligence—to seriously massive and often highly complex sets of information.” —Microsoft [6]
The National Institute of Standards and Technology (NIST) argues that “Big data refers to digital data volume, velocity, and/or variety that exceed the storage capacity or analysis capability of current or conventional methods and systems” (in other words, the notion of “big” is relative to the current standard of computation) [7]. No matter how we slice it or describe it, big data is the latest craze in the Information Technology industry. It’s a term (or concept) that attempts to describe the voluminous amount of data available (most likely on the Internet) that can be used to mine for information.
Most pundits in the industry define big data through the Three Vs: the extremely large volume of data, the wide variety of data types and data sources, and the velocity at which the data is appearing (or being created) and therefore must be processed. Others include a fourth dimension, veracity (or the quality of the data), and still others attach value as an attribute (see Figure 5.3). No matter what definition we use to describe this phenomenon, it is clear that we are being inundated with data, and choosing the right data source can be crucial to a timely social media analytics strategy.
Social Media as Big Data
The term social media can be viewed as an umbrella term that can be used for several different venues where people connect with others directly on the Internet to communicate and exchange views and opinions or participate in any type of social commentary.
It is important to understand why people use these websites, as there is a variety of demographics represented on these sites. Some people use them for business purposes, to network, and to find new deals. Others use social networking sites for purely personal reasons and are totally oblivious to the fact that there is a business presence in the social networking environment.
We started with the concept of data, and through an explanation of big data and unstructured data, we have described what social media data is. Before we get into the basics of sentiment analysis in future posts, we want to explore why social media is such an important medium to focus on.
Social media has gained in acceptance over the past few years for a number of reasons. We can point to the growth of the Internet and the concept of information sharing and dissemination. Humans are social by nature and want to share what they know. Add to that the incredible growth of smartphones and mobile technologies, such that we, as a society, have truly reached an “always on” culture, and it’s no wonder we see a growth in social and community sites.
But what are people actually doing on these sites? Why do they bother? In looking at a report by Anita Whiting and David Williams [9], we can see a number of reasons that Internet users participate in social media, as covered in the following sections.
Social for Social’s Sake
In no particular order, the first reason for participating on these sites is purely social: meeting new acquaintances and staying in touch with older ones. In our opinion, this is the primary reason for the rapid growth of sites like Facebook. The ability to allow others to share in our daily lives and stay abreast of our every move has a certain appeal and perhaps gives us a sense of community. In the same vein, we tend to be curious by nature, and through the windows of social media, we are able to see into the lives of others, perhaps living vicariously through them or just building a bond as we stay “in” each other’s lives.
Social Media as Entertainment
Many people report that the use of social media is a way to unwind, to pass time, or just relax with others via the electronic highways of the Internet. Many of these social sites not only provide for the sharing of information in the form of pictures, short updates, and even longer blog-like postings, but also foster gaming, either with others in an immediate network or individually. Through the use of games, many social media providers hope to keep participants coming back for more, and they obviously do. Many people simply look for relaxation or ways to alleviate stress and escape from reality, even if just for a little while.
Social Media as Sharing
Apart from socializing with others, we also like to share what we know with others or seek the advice of others when or if we have questions of our own. Not only that, but we do like to share our opinions and experiences with others in the hope of influencing them or perhaps steering them away from bad experiences. Social media can be a valuable tool, put in the hands of individuals, in a quest to spread praise or criticism of products or services.
In the case of blogging, millions of people are making their voices heard. The Internet has drastically changed how we, as individuals, can reach out to others. Never before has it been so easy to be able to reach a global audience with so little effort. Today, bloggers have the opportunity of reaching hundreds or even thousands of people every day and spreading their stories, opinions, and values. For individuals, there is a benefit of building their personal brand.
Think of this as Consumer Reports for individuals. On its own website, Consumer Reports defines itself like this: “Consumer Reports has empowered consumers with the knowledge they need to make better and more informed choices and has battled in the public and private sectors for safer products and fair market practices.”
How do writers at Consumer Reports form their opinions? Through their own independent tests and evaluations, which then get reported to individuals. It’s not unlike posting a request for a local restaurant review on Facebook or Twitter and then making an informed decision based on what others have to say.
Now, to be fair, this is where the V for veracity comes in (and perhaps some pessimistic individuals). Some will ask: since we can’t ensure that a review is unbiased, how do we know we can “trust” the opinions or thoughts espoused on social media sites? Well, the short answer is: we don’t. But if we see enough positive reviews versus negative reviews, we’re likely to be swayed (obviously one way or the other).