Where to seek big data for social media analytics


The next question, then, is: Where do we look for big data for social media analytics? There are a large number of conversational sites on the Internet where an even larger variety of conversations is happening. Knowing where to look or what conversations to watch for can be a daunting task. Conversations can (and do) happen everywhere. They obviously occur on social media sites, but conversations and opinions are also found in comments on news stories, online retail stores, and obituary sites: virtually everywhere on the Internet. It’s easy to get lost in the sea of sites when trying to collate all of this information. Often, organizations merge enterprise data with social data, linking employees with their social commentary, or product schedules and announcements with public discussions around the products. This can only compound the issue of trying to find the “right” conversations to analyze. While we acknowledge this is being done and it makes “big data” even bigger, within the context of this book, we don’t address the additional issues of expanding the datasets’ size by augmenting with additional data. This brings us to the next question: Which data source is right for my project?

Paradox of Choice: Sifting Through Big Data

In his book The Paradox of Choice: Why More Is Less, American psychologist Barry Schwartz argues that reducing the number of consumer choices can greatly reduce the anxiety that many shoppers feel [11]. The idea is that having so many choices in life can actually cause us stress (stress over “Did I pick the right one?” or “Will this other option last longer?” and so on). According to The Paradox of Choice, when it comes to buying any kind of item, from a box of cereal to a new car, we have trouble making decisions. What’s interesting is that too much choice is actually harmful to our well-being. When there are too many options, we suffer, believing our choice is flawed or that we could have made a better one in the long run.

This suffering is the paradox of choice, and it describes how we become less satisfied the more choices there are. We bring up this topic not to discuss our buying habits, but to point out a similar situation in choosing where to look for the social media data that we would like to analyze.

Consider the graph in Figure 5.4. Here, we graph the number of users reported to be registered on some of the more popular social media sites on the Internet. We’ve taken just a snapshot of the more popular sites; it would be nearly impossible to show them all, partly because social media sites come and go, and partly because some are not identified as traditional “social media” sites, so we don’t know about them.


The question, however, remains: Which sites should we gather data from for our analysis?

The leader in social media content is clearly Facebook, which boasted over 1.5 billion registered users at the time we wrote this chapter [13]. So the question analysts have to ask themselves is: Can I use data from just Facebook and ignore the rest? The short answer is no. To obtain a well-rounded opinion, we want to have as much relevant data as possible. So again, looking at the graph in Figure 5.4, we probably should focus on sites from Facebook to Baidu Tieba (or maybe Tumblr). But why not all of them? Where do we (or should we) stop?

On one hand, when we hear about big data, we immediately think of sifting through tremendous amounts of data, distilling it down to its bare essence, revealing some golden nuggets of truth. In some cases, this may be true, but in practice, we have found that as the amount of data rises, so does the noise in the data. An electrical engineer refers to this as the signal-to-noise ratio—or how much “relevant” information there is in a sample versus how much “noise,” or nonrelevant information (see Figure 5.5).
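One rough way to put a number on this idea is to hand-label a small sample of posts from each source as relevant or not, and then compare the counts per source. The sketch below is only an illustration under that assumption: the function name `signal_to_noise`, the source names, and the sample labels are all hypothetical and are not taken from any analysis described here.

```python
from collections import Counter


def signal_to_noise(labeled_posts):
    """Return {source: relevant_count / irrelevant_count} for a labeled sample.

    labeled_posts is an iterable of (source, is_relevant) pairs.
    A source with no irrelevant posts gets float("inf").
    """
    relevant = Counter()
    noise = Counter()
    for source, is_relevant in labeled_posts:
        if is_relevant:
            relevant[source] += 1
        else:
            noise[source] += 1

    ratios = {}
    for source in set(relevant) | set(noise):
        if noise[source] == 0:
            ratios[source] = float("inf")
        else:
            ratios[source] = relevant[source] / noise[source]
    return ratios


# Hypothetical hand-labeled sample: (source, is_relevant)
sample = [
    ("tech-forum", True), ("tech-forum", True),
    ("tech-forum", True), ("tech-forum", False),
    ("gaming-site", True), ("gaming-site", False),
    ("gaming-site", False), ("gaming-site", False),
    ("gaming-site", False),
]
ratios = signal_to_noise(sample)
for source, ratio in sorted(ratios.items()):
    print(source, ratio)
```

A source that contributes many posts but has a low ratio is a candidate for closer inspection before its data is trusted.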

As a case in point, consider one analysis we were involved in during our early days here. We were approached by one of our internal divisions and asked for an analysis of IBM’s BladeCenter product against its competitors. We were asked to look at this from a number of different angles, including the sentiment toward or against IBM’s product, the sentiment around its competitors, the trends over time, the topics of conversation, and even where, within the social media space, the conversations about the product were occurring most.

For this analysis, we ingested as much “relevant” data as we could find into our analytics tool and began the analysis. In one instance, we discovered one particular venue (social media site) where the discussions about BladeCenter were far outpacing the others (see Figure 5.6).

The interesting thing about that site is that it was a gaming site where users discuss various ins and outs of gaming: running sites, playing games, deploying strategies, and so on. This led to the (false) conclusion that the BladeCenter product was used heavily as a gaming platform, so we continued to “drill” into the specifics from that site.

Well, as any good data scientist will tell you: Your results are only as good as your data.

As we looked closer, it became apparent that the discussion on that social media site was indeed about Blades, but it wasn’t a computer; it was a character in one particular game that was very popular. In other words, all of that conversation was completely irrelevant to our topic! In that particular case, we ensured that our data collection avoided gaming sites to make sure we had the correct data. Issues like these are usually picked up in the first few iterations of analysis, which is an essential part of the process. We highlight this point here for you as an issue to be aware of as you do a similar analysis on your own.

Of course, the obvious question is: What if there was substantial discussion on these gaming sites of the BladeCenter product? It would have meant a much more sophisticated data model that ensured any discussion of BladeCenter was in the context of computing (or perhaps cloud computing), but in this case, the easiest solution was to simply remove the data source.

In another case, we were looking at what kinds of issues or topics were being talked about during the SapphireNow conference (http://events.sap.com/sapphirenow/en/home) a few years ago. We got some really interesting information, but at first we started to see several conversations surrounding mixed drinks (cocktails). A careful inspection revealed we were seeing discussions (and advertisements) for Bombay Sapphire gin, completely unrelated to the IT conference we were interested in observing. These examples illustrate what we mean by “scoping.” The concept of scoping a data collection simply means setting a boundary within which we want to collect data. In the case of SapphireNow, we wanted to scope the collection of any mentions of the word sapphire to computer- or software-related issues rather than allow it to be wide open to anything.
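The idea of scoping can be sketched as a simple filter: keep a post only when it mentions the keyword and also contains at least one term from the context of interest. This is a minimal illustration, not the data model used in the analyses above; the function `in_scope`, the `CONTEXT_TERMS` list, and the sample posts are hypothetical, and a real project would need a much richer vocabulary or model.

```python
import re

# Hypothetical context vocabulary for "computer- or software-related" posts.
CONTEXT_TERMS = {"sap", "software", "conference", "erp", "cloud", "keynote"}


def in_scope(text, keyword="sapphire", context_terms=CONTEXT_TERMS):
    """Return True if text mentions keyword and at least one context term."""
    # Lowercase and split into alphanumeric tokens.
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    # Prefix match so that "sapphirenow" also counts as a mention.
    mentions_keyword = any(word.startswith(keyword) for word in words)
    return mentions_keyword and not words.isdisjoint(context_terms)


# Hypothetical sample posts.
posts = [
    "Great keynote at SapphireNow on cloud ERP",
    "Try a Bombay Sapphire gin and tonic tonight",
    "New software release announced this morning",
]
scoped = [post for post in posts if in_scope(post)]
print(scoped)
```

Only the first post survives the filter: the second mentions the keyword without any context term, and the third has context terms but never mentions the keyword. The trade-off is that a strict term list can also discard relevant posts that happen to use different wording.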

Identifying Data in Social Media Outlets

When we talk about social media sites, we often have to be clear what type of data we would gather. It’s important to understand how people interact on the site and how they exchange information with one another. In this section, we primarily discuss the sites that are most relevant in the United States. There are too many social media sites and venues to dive into the demographics of all of them. But it is useful to understand where we can go to find information from a specific region or demographic. The following tables provide a glimpse of the social media outlets for China (see Table 5.4), Europe (see Table 5.5), and India (see Table 5.6).

Professional Networking Sites

Professional networking sites are just that: social networking sites where business professionals can go to meet or find others with similar interests or just stay connected with business contacts, all with a goal of building up a professional network. They are rich in profiling information such as education level, job history, and current position within companies. These sites often allow the users to be “introduced to” and collaborate with other professionals in an effort to enhance or improve their professional stature. LinkedIn is the dominant professional social network. It has become the de facto system of record for online résumés for many professionals. Want to hear what business professionals are saying about a topic? Look to LinkedIn (or other professional networking sites).


LinkedIn

User population: 332 million users
  • North America: 119 million
  • Europe: 78 million
  • Asia: 52 million
  • South America: 46 million
  • Africa: 15 million
  • Middle East: 11 million
  • Oceania: 8 million

LinkedIn is an online social network that was designed for business professionals. Because of this distinction, it tends to be different from other social networking sites such as Facebook or RenRen. LinkedIn users are looking to build or enhance their professional network. They are looking for or posting job opportunities, discovering sales leads, or connecting with potential business partners rather than simply making friends or sharing content.

LinkedIn profiles are more like professional résumés with a focus on employment and education history. Like Facebook users, LinkedIn users are part of “invited networks” (in Facebook, users have “friends”; in LinkedIn, they are part of a network). Based on work history and education background (from a profile), LinkedIn can help identify others who (potentially) have similar interests or backgrounds to invite into your network. On LinkedIn, the people who are part of your network are called your “connections.” A connection implies that you know the person well or that the person is a trusted business contact.

Generally, user profiles are fully visible to all LinkedIn members who have signed into the site; all of this is configurable within LinkedIn. Typically, contact information such as email, phone number, and physical address is visible only to first-degree connections (those you have in your network). Users can control the visibility of their posts and other recent activity by adjusting their visibility settings, and most allow their comments and posts to be seen only within their networks.

Because LinkedIn is oriented toward professional networking, it’s not clear that sentiment analysis of its content would be all that valuable. If users are on LinkedIn to potentially connect with future employers (or employers are looking for new employees), most of these users tend to be careful about what they say.

Social Sites

Classifying a social media site as social seems a bit redundant, but in this instance, we’re referring to sites where people go to reconnect with old friends or meet new people with similar interests and hobbies. These sites enable users to create public profiles and form relationships with other users of the same website who access their profiles. These sites offer online discussion forums, chat rooms, status updates, as well as content sharing (such as pictures or videos). The king of social sites has to be Facebook with over 1.5 billion registered users!


Facebook

User population: 1.5 billion registered users
Demographics:
  • 171 million users from the United States
  • 66 million users from India
  • 60 million users from Brazil
  • 54 million users from Indonesia
  • 41 million users from Mexico
  • 35 million users from Turkey
  • 34 million users from the Philippines

Facebook is the king of social media sites, claiming the largest number of registered and active users of all such sites. At first blush, this would seem like a goldmine for gathering data for a social media analysis, but in actuality it’s not as useful, from a business perspective, as you would think.

Facebook is organized around a timeline, the things users say or post, and their friends. When users post something, typically the only ones who can see the content are those in that user’s list of friends. That capability is wonderful for sharing information within a community of friends, but from the perspective of trying to look at public opinion or thoughts, these posts or timelines aren’t available to the general public. While it is possible to set user controls to make all content publicly accessible, this setting is not generally used because most users want to keep their comments and conversations private, or within their own circles.

One alternative to the more private profile is a fan page. Businesses, organizations, celebrities, and political figures use fan pages to represent themselves to the public on Facebook. Unlike regular Facebook profiles and content, fan pages are visible to everybody on the Internet, which makes for a useful set of information to collect and analyze.

Facebook groups enable a group of people to communicate, share common interests, and express their opinions. Groups allow people to come together around a common topic to post and share content. When creating such a group, the owner can choose to make the group publicly available or private to members, but members must be Facebook users.


RenRen

User population: 219 million users in 2014
Demographics: Predominantly Chinese

RenRen is the leading (real-name) social networking Internet platform in China. RenRen, which means “everyone” in Chinese, enables users to connect and communicate with each other, share information, create user-generated content, play online games, watch videos, and enjoy a wide range of other features and services. Most people refer to RenRen as the Chinese Facebook. Like Facebook, RenRen does not allow visitors or search engine spiders to view profile pages without being logged in (in other words, without being a member). Of course, like Facebook, it does allow searching for public profile pages of brands and celebrities.

Information Sharing Sites

Most social media sites allow sharing of status messages or content such as pictures or videos. Some, however, like Instagram, are almost solely focused on sharing (in this case, images). Other examples include sites like YouTube and Tumblr; while they allow for commentary on the content posted, they are primarily content-sharing sites.


YouTube

User population: 1 billion users

YouTube is a website that was designed to enable the general public to share video content. Millions of users around the world have created accounts and uploaded videos that anyone can watch, anytime and virtually anywhere.

Many businesses today have come to realize that the use of YouTube videos can help to increase their brand exposure while creating a personal connection with their audience. More importantly, these short videos can be an effective way to deliver information that users may find useful, perhaps leading to increased brand loyalty.

What’s most important is the number of views that videos achieve from the YouTube site. Looking at the top 10 countries viewing videos, we see:
  • United States: 124 billion views
  • UK: 34 billion
  • India: 15 billion
  • Germany: 15 billion
  • Canada: 14 billion
  • France: 13 billion
  • South Korea: 12 billion
  • Russia: 11 billion
  • Japan: 11 billion
  • Brazil: 10 billion

Video files can be very large and are often too big to send to someone else by email. By posting a video on YouTube, users can share a video simply by sending the other person a URL link.

Microblogging Sites

Microblogs are short postings or brief updates sent online. Think of them like text messaging. Unlike traditional blogs, microblogs are typically limited in the amount of text that can be posted (Twitter’s limit is 140 characters). These updates often contain links to online resources, such as web pages, images, or videos, and more often than not, they refer to other users (called mentions). As is the case with most microblogging, when a message is posted, those updates are seen by all users who have chosen to “follow” the author who posted the message (submitter). In the case of Twitter, those posts are all public; you may not receive them if you don’t follow the submitter, but you can search for a keyword or topic and find someone who is talking about a specific subject (and then perhaps follow that person if he or she seems interesting). Microblogging should not be confused with text messaging (or texting) on mobile phones, which is private and not recorded anywhere. Texting is typically one-to-one (or in the hybrid case, group chats among a small number of people).


Twitter

User population: 289 million users
Demographics:
  • 180 million from the United States
  • 23 million from the UK
  • 16 million from Canada
  • 8 million from Australia
  • 6 million from Brazil
  • 4 million from India

Twitter is the prototypical microblogging site. Users tweet using short bursts of messages out to the Twitterverse with the hope that their messages will be useful or interesting to others. Messages on Twitter, by definition, are limited to 140 characters, so they tend to be statements as opposed to conversations. There are threads of conversation where users reply to other tweets, but more often than not, we tend to see more retweeting of messages. This behavior can be viewed as an implicit agreement with the originator’s view.

One of the draws of Twitter is the instantaneous delivery (and reception) of messages and information. Many people tweet about current events that are underway, during sporting events or talks at trade shows, for example. All of these messages can be immensely important when trying to understand conditions surrounding an event in real time. From a historical perspective, looking back at tweets (or sentiment) when an event occurred (understanding the event in hindsight) can be particularly useful in trying to predict reactions to future events.


A blog is nothing more than an online personal journal or diary. It provides a platform for people to express themselves and their opinions. It is a place to share thoughts and passions. The Internet makes this information dissemination that much easier. In earlier days, someone with a strong opinion would stand on a raised platform (typically a box meant for holding soap) in a public square and make an impromptu speech, often about politics, but it could be about anything. Hyde Park in London is known for its Sunday soapbox orators, who have assembled at Speakers’ Corner since 1872 to discuss any number of topics ranging from religion and politics to social themes. The modern form of this soapbox is a blog, which allows anyone, anywhere, to make a statement that can be heard by all (or ignored by many).

Blogs can range from personal experiences and observations to well-crafted marketing messages put out, seemingly, by individuals on behalf of corporations. Of course, this is true for any form of social media. And as with any media, it is important to determine who is conveying the opinions and message.


