Looking for social media data at the right places

Somewhere around 1964, George Fuechsel is thought to have coined the phrase “garbage in, garbage out.” This popular computer science expression comes from early programmer education. Fuechsel taught his classes that they must check and recheck their data and code to ensure that the results they achieved were valid. In this new era of computing, programmers were trained to test each step in their programs and cautioned not to expect a program to do the right thing when given imperfect input. The same premise holds today in data analytics. Simply stated, if you use the wrong data, the results will be wrong or, worse, misleading.

So while originally intended for programming, this expression is equally applicable to data analysis. In this post, we explore what is meant by data identification and describe how this concept fits into the overall landscape of social media data analytics.

Data identification is the process of identifying the subset of available data to focus on for an analysis. A key element of choosing the appropriate data source is taking the time to understand the outcomes we want the analysis to produce.

What Social Media Data Do We Mean?

Think of data as the raw material that is transformed into information and ultimately knowledge. Data by itself can be viewed as the lowest level of abstraction from which information (and then knowledge) can be derived. Unprocessed data refers to a collection of numbers, characters, and phrases (snippets of a blog, a tweet, and so on). Think of these as “observations” that are somewhat random and by themselves convey no meaningful information or knowledge. Many pieces of data, when combined, analyzed, and processed, produce that next level of abstraction: information.

Data by itself is fairly useless; however, as we process it (or begin to interpret it), it begins to become useful as it conveys some kind of message. At this point, we deem it “information.” Information is simply data that has been processed in such a way as to be meaningful to someone or something: it contains meaning, whereas data does not.

To make this concept easier to understand, let’s consider a highly simplified version of a real project that we executed. The different levels of the social media business value pyramid could be identified as follows:

Noisy Data - There is a lot of conversation happening in the marketplace about our new product, related products, and competitors’ products.

Filtered Data - There is a lot of conversation happening in the marketplace about our new product.

Information - The conversation about our product peaked during announcements or events but tapered off precipitously shortly after the events.

Knowledge - The majority of the conversation that is taking place in the marketplace is being generated by our own marketing messages. The marketplace isn’t picking up the message and engaging with it.

Wisdom - Our marketing campaign for this particular brand of machines is really not working.
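The first three levels above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the product names (WidgetPro, GadgetMax), the dates, and the post texts are all made up, and a simple keyword match stands in for whatever filtering a real project would use.

```python
from collections import Counter
from datetime import date

# Hypothetical observations: (date posted, text). All values are made up.
posts = [
    (date(2024, 3, 1), "Excited for the WidgetPro launch event today"),
    (date(2024, 3, 1), "WidgetPro announced, specs look solid"),
    (date(2024, 3, 1), "Rival GadgetMax is cheaper"),
    (date(2024, 3, 2), "Anyone tried WidgetPro yet?"),
    (date(2024, 3, 5), "GadgetMax battery life is great"),
]

def filter_posts(posts, keyword):
    """Noisy data -> filtered data: keep only posts that mention the keyword."""
    key = keyword.lower()
    return [(day, text) for day, text in posts if key in text.lower()]

def daily_volume(posts):
    """Filtered data -> information: count posts per day."""
    return Counter(day for day, _ in posts)

filtered = filter_posts(posts, "WidgetPro")
volume = daily_volume(filtered)
peak_day, peak_count = volume.most_common(1)[0]
print(f"{len(filtered)} relevant posts; peak of {peak_count} on {peak_day}")
```

Moving from information to knowledge and wisdom (who is driving the conversation, and what that says about the campaign) still requires interpretation that a count alone cannot supply.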

The end goal is to take all of the observations we can collect (our data), filter them so we look at only the relevant set of “puzzle pieces,” and, by applying some kind of processing to them, convert or organize that data into meaningful information. By “meaningful,” we mean that it expresses the data in a way that conveys a new message or insight. The figure below shows a simple diagram that illustrates this concept.

According to the online Merriam-Webster dictionary, knowledge is

the fact or condition of knowing something with familiarity gained through experience or association

acquaintance with or understanding of a science, art, or technique [2]

That familiarity allows us to make better decisions based on the facts presented to us. It then follows that wisdom is the ability to think and act using this derived knowledge as well as our set of experiences, common sense, and insight.

It all starts at the bottom of that pyramid: the data. But we need to ensure we have the right data. As Chief Engineer Montgomery Scott (of the Starship Enterprise) says: “How many times do I have to tell you? The right tool [or the right data source] for the right job!”

What Subset of Content Are We Interested In?

In the context of social media analysis, data identification simply means deciding what content we are interested in. In addition to the “text” of the content, we want to know:

Who wrote the text?

Where was it found (or in which social media venue did it appear)?

Are we interested in information from a specific locale?

When did someone say something in social media?
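One way to keep these questions attached to each piece of content is to store them as fields alongside the text. The sketch below is a minimal illustration; the class name `Observation`, its field names, and the sample values are assumptions made for this example, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Observation:
    """One piece of social media content plus the context we want to know."""
    text: str             # what was said
    author: str           # who wrote the text
    venue: str            # where it was found (news site, blog, Twitter, ...)
    locale: str           # which region it came from
    posted_at: datetime   # when it was said

# A made-up record, used only to show the shape of the data.
obs = Observation(
    text="Coverage of the attack dominated the evening bulletin.",
    author="example_news_desk",
    venue="news",
    locale="UK",
    posted_at=datetime(2012, 10, 9, 18, 30),
)
print(obs.venue, obs.locale, obs.posted_at.date())
```

With every record carrying these fields, later decisions about venue, region, author, or time window become simple comparisons against them.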

As an example, consider the following: In 2012, Malala Yousafzai, a young girl from Pakistan, made headlines as a result of the brutal attack against her by the Taliban. She had taken a public stand on women’s right to education in Pakistan. She has become widely known for her activism in Pakistan, where the Taliban had at times banned girls from attending school.

In 2009, while working with the BBC, Malala created her blog. In it, she wrote of life under Taliban rule in the Swat Valley, Pakistan, and of her strong support for a woman’s right to an education. The views of a 12-year-old girl on women’s education and the Taliban regime caused quite a sensation. Many newspapers worldwide gave prominent coverage to her blog and her views. The New York Times even made a documentary film about her life and her views of the social situation in her region of the world. This made her quite famous and her enemies quite upset. In October 2012, when Malala boarded her school bus, a gunman boarded the bus and fired three shots directly at her. She suffered major injuries to her face and was in critical condition for several days. She was later transferred to a hospital in England for rehabilitation. Even though many Islamic religious leaders came to her support after this incident, the Taliban was still intent on harming her and her family. This assassination attempt sparked an international outpouring of support for Malala, ultimately leading to her nomination for the 2013 Nobel Peace Prize.

So in this case, say we were trying to pose the question, “What is the reaction of the general population to Malala, the young girl from Pakistan who defied the Taliban, in the Western media?” Where should we look for relevant data to analyze?

The word relevant here is important. Remember the basic principle of garbage in, garbage out. We could pull data from “everywhere” in the social media space, but we want to be sure that the data we use in our analysis is relevant to the question we are trying to answer. The data identification process would go as follows:

We want to analyze content in the media, so we might choose to focus on popular news media and ignore blogs, bulletin boards, Twitter, and so on.

We want to analyze content in the Western media, so we would focus on the content emanating from regions considered “Western” and eliminate content from other regions.
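The two decisions above amount to a pair of filters, which can be sketched as follows. The venue labels, the region codes treated as “Western,” and the sample items are all assumptions chosen for illustration; a real project would have to define these sets for itself.

```python
# Assumed lookup sets: which venues count as "media" and which regions count
# as "Western" are project decisions, not fixed facts.
MEDIA_VENUES = {"news"}
WESTERN_REGIONS = {"US", "UK", "CA", "FR", "DE"}

# Made-up items standing in for collected content.
items = [
    {"venue": "news", "region": "US", "text": "Editorial on girls' education"},
    {"venue": "twitter", "region": "US", "text": "Short reaction post"},
    {"venue": "news", "region": "PK", "text": "Local report from Swat"},
    {"venue": "blog", "region": "UK", "text": "Personal commentary"},
    {"venue": "news", "region": "UK", "text": "Front-page coverage"},
]

def identify_relevant(items, venues, regions):
    """Keep only items from the chosen venues AND the chosen regions."""
    return [
        item for item in items
        if item["venue"] in venues and item["region"] in regions
    ]

relevant = identify_relevant(items, MEDIA_VENUES, WESTERN_REGIONS)
for item in relevant:
    print(item["region"], item["text"])
```

Keeping the venue and region sets as explicit parameters makes the identification decisions visible and easy to revisit if the question changes.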

The process of identifying relevant data can be accomplished in a single step, or it might take multiple steps, depending on the type of project we are working on. The process described in this section can be considered as follows:

Step 1: What content are we interested in?

In subsequent sections, we discuss additional possible steps as follows:

Step 2: Whose comments are we interested in?

Step 3: What window of time are we interested in?

Whose Comments Are We Interested In?

A second possible step in the data identification process gets into more details. Considering the previous example, we need to think about issues such as the following:

Should we eliminate content from sources that are known to be pro-Taliban because of negative bias?

Should we eliminate all content from young girls because of positive bias?

Bias is a prejudice in favor of or against one thing, person, or group compared with another, usually in a way that is considered to be unfair. Biased data may give us only one side of a story; thus, any conclusions we draw from it are bound to lean toward that side. If all of the data we collect reflects a preconceived opinion, any analysis done on it would be useless.

For example, is there a segment of the population in our target audience, say young girls, who are so impressed by the inherent heroism in the stance taken by another girl against an organization that their comments may not be objective enough for our study?
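One hedged way to approach that question is to measure how much a segment moves the overall result before deciding whether to exclude it. The sketch below assumes each comment already carries a segment label and a sentiment score between -1 and 1 from some upstream step; the labels, the scores, and the function name `mean_score` are all hypothetical.

```python
from statistics import mean

# Made-up (segment, sentiment score) pairs; scores range from -1 to 1.
comments = [
    ("young_girls", 0.9),
    ("young_girls", 0.9),
    ("other", 0.3),
    ("other", 0.1),
    ("other", 0.2),
]

def mean_score(comments, exclude=frozenset()):
    """Average sentiment, skipping any segments listed in `exclude`."""
    scores = [score for segment, score in comments if segment not in exclude]
    return mean(scores)

with_all = mean_score(comments)
without_segment = mean_score(comments, exclude={"young_girls"})
shift = with_all - without_segment
print(f"all: {with_all:.2f}, excluding segment: {without_segment:.2f}, shift: {shift:.2f}")
```

A large shift does not by itself prove bias, but it signals that the segment deserves a deliberate decision rather than being included by default.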

The Digital Media Strategy Blog
