Understanding the link between data science and big data analytics


Big Data analytics has become the engine for business analytics today. Companies use Big Data to analyze their business processes, formulate future business strategies, and make decisions, once they understand the link between data science and Big Data analytics. Companies such as Amazon and Netflix use Big Data analytics to understand the behavioral patterns and choices of customers, in order to tune their offerings for the individual. Credit card companies use Big Data analytics to estimate the risk of losing a customer; they analyze a customer’s spending and paying patterns and use such insights to change potential offerings in hopes of retaining that customer.


Big Data analytics has also been employed very successfully in scientific fields. For example, experiments at the Large Hadron Collider generate a tremendous amount of data, 30 petabytes per year (http://home.cern/topics/large-hadron-collider). Such huge amounts of data require processing in order to determine the behavior of subatomic particles. A lot of the generated data is processed using Big Data analytics tools (Warmbein 2015; http://home.cern/topics/large-hadron-collider).

Upcoming fields, such as the Internet of Things (IoT) (Gubbi et al. 2013), which envisions connecting a large number of smart everyday devices to the internet, are expected to exponentially increase the amount of data generated in the future. Some approximations put the increase at more than 10 times the current volume in the next four years. For example, Marr (2015) in Forbes approximated the accumulated data to grow from close to 4 zettabytes in 2016 to more than 40 zettabytes in 2020.

Almost every person in today’s world has had some interaction with Big Data analytics. Receiving personalized advertisements both online and in print is a result of analytics, which companies perform on their gathered customer purchasing behavior data. Similarly, movie suggestions for online video stores also have their roots in Big Data analytics. Social media platforms are the biggest users of Big Data analytics. From friend suggestions on Facebook to targeted news feeds, these are all enabled by Big Data analytics.

Such a pervasive and impactful field is still a black box for a majority of people. The aim of this post is to present different types of analytics constituting Big Data analytics and their application areas. We also present techniques and tools which drive such analytics and make them useful.

Big Data analytics (Zikopoulos et al. 2011, Chen et al. 2012) involves processing and analyzing huge amounts of data to gain insights for decision making. Analytics can be broadly divided into four categories, as illustrated in Figure below.

[Figure: Data science and big data analytics]

Data science and big data analytics: Types of Big Data Analytics

Data science and big data analytics: Descriptive Analytics

Descriptive analysis deals with the question “What has happened?” This form of analytics mainly deals with understanding the gathered data. It involves the use of tools and algorithms to understand the internal structure of the Big Data and find categorical or temporal patterns or trends in it. For example, the sales data gathered by a retail chain reflect the buying patterns of different categories of customers. Students, housewives, or small business owners all have different buying patterns, which can be found when overall sales data are analyzed. Moreover, a sudden spurt in sales of a category of items, say notebooks, can also be identified and be made available for analysis with other tools (Zikopoulos et al. 2011).

Diagnostic Analytics

Once the internal structure of data is identified, the next task is to seek reasons behind such a structure. For example, if sales data show a spurt in sales of notebooks, then seeking the reason behind such an increase falls in the domain of diagnostic analytics (Borne 2017).

Predictive Analytics

Given the current trends in data identified by the descriptive analytics tools, what might happen in the future is a crucial question. Businesses can fail if they are not able to tune to the future requirements of their customers. Predictive analytics tools provide insights into the possible future scenarios. For example, predicting the future rate of customer churn on the basis of sales patterns, complaints, and refund requests made by customers provides useful information (Soltanpoor and Sellis 2016).

Prescriptive Analytics

Today’s businesses not only want to predict the future but also want to be best prepared for it. Prescriptive analytics tools provide a “what if” kind of analysis capability. What are the different options available to business management and which among them is the best suited, given the predictions and other constraints? These questions fall under the domain of prescriptive analytics. For example, prescriptive analytics can be employed to provide a directed and personalized advertisement to customers to help in reducing customer churn (Soltanpoor and Sellis 2016, Borne 2017).

The figure below shows a relative comparison of different types of analytics. In terms of complexity of the algorithms and techniques involved, descriptive analytics is the simplest. The most complex is prescriptive analytics, as it involves automation of decision making. Moreover, prescriptive analytics encompasses all other analytics in one form or another. Prescriptive analytics also has the most impact on decision making, as it helps to identify the best course of action for the future. Prescriptive analytics is more of an optimization strategy. We present a detailed discussion about prescriptive analytics later in the post.

The figure also summarizes the very important differences in the natures of different analytics types. Descriptive and diagnostic analytics concentrate on the past and answer questions like “what has happened” and “why did that happen.” On the other hand, predictive and prescriptive analytics concentrate on the future and answer questions like “what can happen” and “what should we do.”

[Figure: Big data analytics]

Software analytics tools do not confine their functionalities to a specific analytics task. More often, the tools are capable of performing more than one kind of analytics, as we discuss below. The available tools have a wide variety in terms of kinds of businesses they can model, kinds of data they can process, and even the kinds of outputs they produce. Earlier, the tools were complex and companies were forced to hire data scientists to utilize them. Recently, there has been a shift in this approach, and many easy-to-use tools have appeared in the market. Using these tools, any person can perform a reasonable amount of analytics. We will discuss specific tools later in the post.

Analytics Use Case: Customer Churn Prevention

To enable readers to understand the real world uses of Big Data analytics, we present a use case where Big Data analytics is employed to prevent customer churn. We now discuss why customer churn is a problem and why companies are forced to use Big Data analytics to prevent it.

Today, thanks to the e-commerce retail space, customers are no longer constrained by choice, location, availability, or, more importantly, pricing. This makes the customer base highly volatile and difficult to retain. E-retailers like Amazon, Flipkart, and eBay, and their physical counterparts like Walmart and Target, face a real and challenging problem in the form of loss of customers, which is also known as customer churn. Because competition is stiff, customer churn is detrimental for a business, and retailers go to great lengths to retain customers and avoid their migration to competitors.

Identifying the customers who are likely to leave is a herculean task in itself. This is because of the size of the business: there are tens of millions of transactions a week, hundreds of thousands of items on sale at any point in time, millions of active customers, and millions of pieces of data regarding feedback and complaints. Companies employ Big Data analytics to handle these issues and analyze such large amounts of data to predict and/or diagnose the problem(s).

Customer churn prevention is the process of maintaining existing customers using methods like increasing product inventory based on current trends, personalized loyalty programs, and promotions, as well as identifying dissatisfaction among customers.

The business objective in our case is to retain the customers who are more likely to leave. One of the methods which we are going to discuss is to send them personalized coupons for targeted items. However, to achieve this objective, we must analyze the following.

  • Why are customers at risk of going away? Why have customers left?
  • What type of inventory items are liked by customers who have left or who are at risk of leaving?
  • What items would customers like to buy together (basket analysis)?

The above-mentioned analysis is required to decide what type of coupons the customer is more likely to accept. The coupons, however, should ensure customer satisfaction along with no loss of the business.
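As a rough illustration of the basket analysis question above, the sketch below counts how often pairs of items appear in the same transaction. The baskets and item names are made up for illustration; a real retailer would run this kind of count over millions of transactions with distributed tools rather than an in-memory loop.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set holds the items bought together.
baskets = [
    {"notebook", "pen", "backpack"},
    {"notebook", "pen"},
    {"pen", "stapler"},
    {"notebook", "pen", "stapler"},
]

# Count every unordered pair of items that shares a basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pair is a candidate for a bundled coupon.
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

Pairs that co-occur often are natural candidates for a coupon that discounts one item when the other is bought.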

Descriptive Analytics

Descriptive analytics is the first analytics stage; it makes raw collected data interpretable by humans. Retailers like Walmart, Amazon, eBay, and Flipkart collect different customer-related data into their data repositories. The data are sourced from social networks, IoT devices, click streams, search engines, customer call logs, transactions, etc. Collected data can be structured, semistructured, or unstructured, as shown in Figure below.

[Figure: Descriptive Analytics]

Structured data refers to data which have been organized using rules (e.g., relational databases). Unstructured data include images and audio-video streams. Between these two extremes lie semistructured data, where chunks of unstructured data are organized. Examples of semistructured data include XML files, emails, etc. Collected data are cleansed and categorized into customer transaction logs, customer reviews, feedback, etc., by using tools like Sqoop and Flume.

Descriptive analytics tools then process the collected raw data and provide insights, as shown in Figure below. The insights range from the internal structure of data, like categories or events which occurred, to a mere summary, like average profit per item sold.

[Figure: Descriptive analytics tools]

Application of Descriptive Analytics in Customer Churn Prevention

Descriptive analytics techniques can help in identifying the segment of customers at maximum risk of leaving by categorizing customers according to their recent buying patterns. A simple example is a table listing customers and the money each spent per week over the past four weeks. Although simple, the table shows how the raw data are processed and summarized by descriptive analytics tools. In the real world, there will be millions of customers and the data will span many more dimensions, including time spent in deciding, coupons applied, etc. Even creating such a simple table requires tools like MapReduce when the data size is big.

The table shows that customers 7 and 8 drastically reduced their spending at this particular store over the course of four weeks. These customers are high-risk customers and are probably going to leave or have already left. The table also shows the category of customers who are currently at low risk of churn but can become high-risk customers if preventive steps are not taken.
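The kind of summary described above can be sketched in a few lines. The spending figures and the risk thresholds below are hypothetical (they are not the values from the post's table); only the idea that customers 7 and 8 show a sharp drop in spending is taken from the text.

```python
# Hypothetical weekly spend per customer for the past four weeks.
weekly_spend = {
    1: [120, 115, 130, 125],
    2: [80, 85, 78, 90],
    7: [200, 150, 60, 10],
    8: [180, 90, 40, 0],
}

def spend_drop(weeks):
    """Fractional drop from the first week's spend to the last week's."""
    first, last = weeks[0], weeks[-1]
    if first == 0:
        return 0.0
    return (first - last) / first

def risk_label(weeks, high=0.5, low=0.2):
    """Label a customer by how sharply spending fell (thresholds are assumed)."""
    drop = spend_drop(weeks)
    if drop >= high:
        return "high"
    if drop >= low:
        return "medium"
    return "low"

summary = {cid: risk_label(weeks) for cid, weeks in weekly_spend.items()}
for cid in sorted(summary):
    print(cid, weekly_spend[cid], summary[cid])
```

With real data, the same per-customer aggregation would typically run as a distributed job, but the summarizing step itself stays this simple.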

There can be other tables created, e.g., one for refunds requested, listing customers who have returned items and are expecting refunds.

When to use descriptive analytics?

When an aggregate level of understanding of what is going on in a business is required. Descriptive analytics is also used to describe or summarize various characteristics of a business.

Techniques Used for Descriptive Analytics


Clustering

Clustering (Davidson and Ravi 2005, Balcan et al. 2014) refers to the process by which the data points are grouped together in such a way that two data points lying in the same cluster are more similar to each other than to data points in a different cluster. In Table, customers 7 and 8 both were categorized in a high-risk cluster. This is because they showed a similar pattern of diminished spending.

Clustering can be of many types, depending on the definition of similarity or affinity between two points. A common affinity measure is the Euclidean distance between the points. Two points which are closer to each other are considered more similar to each other than two points which are far apart. The clusters can thus be defined by a threshold on the distance between the points.


At the beginning, each data point is a cluster of its own. Depending on the interpoint distances, clusters are formed. In the next and later rounds, the clusters are merged with each other, depending on the distances between clusters. As clusters can contain more than one point, a representative point is chosen to calculate distances with other clusters. One such representative point is considered the centroid of the cluster.

Clustering algorithms keep merging clusters until no two clusters are close enough to be merged, as shown in Figure. The distance criteria are crucial for the quality and quantity of resultant clusters. Consider an example where a supermarket wants to find different categories of customers. Each cluster represents a category. Unchecked merging of clusters will result in only a single big cluster consisting of all the customers. Conversely, a threshold that is too strict, allowing only very close clusters to merge, will result in too many fine-grained categories, such as high school student, college student, graduate student, etc. Thus, a proper distance threshold is necessary for meaningful and useful clustering.
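The merging procedure described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance between cluster centroids and a single merge threshold; the sample points and the function names are hypothetical, and production libraries use far more efficient implementations.

```python
import math

def centroid(points):
    """Mean point of a cluster, used as its representative."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def agglomerative(points, threshold):
    """Merge the two closest clusters until no pair is within `threshold`."""
    clusters = [[p] for p in points]  # each point starts as its own cluster
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # no two clusters are close enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical customers described by two made-up features.
customers = [(1, 1), (1.5, 1), (8, 8), (8.5, 8), (20, 1)]
for threshold in (0.1, 2.0, 100.0):
    print(threshold, len(agglomerative(customers, threshold)))
```

Running it with a very small, a moderate, and a very large threshold shows the trade-off from the text: the number of clusters falls from one per customer to a single cluster as the threshold loosens.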

Decision Tree-Based Classification

Classification is the process of assigning an object to one or more predefined categories. Classification is employed for many problems, like email spam detection, categorizing credit card customers as high risk or low risk, and many more. In this section, we will present one of the classification techniques, called decision trees (https://en.wikipedia.org/wiki/Decision_tree).

Decision trees refer to a well-designed series of questions which are asked with regard to the input object to be classified. The figure below shows one such decision tree, an example of classifying light passenger vehicles. When a new vehicle is seen, the category to which it belongs can be determined by using the decision tree. The design of the decision tree relies on historical data; in this tree, the historical data are the characteristics of vehicles that are already available.

[Figure: Decision Tree Based Classification]
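To make the idea of a "series of questions" concrete, here is a small hand-written tree. The questions, feature names, and categories are invented for illustration and are not the ones from the post's figure; in practice the tree would be learned from historical data rather than written by hand.

```python
# A node is either a category (str) or a (question, yes_branch, no_branch) tuple.
# All questions and categories here are hypothetical.
TREE = (
    lambda v: v["open_cargo_bed"],
    "pickup truck",
    (
        lambda v: v["seats"] > 5,
        "minivan",
        (lambda v: v["doors"] == 2, "coupe", "sedan"),
    ),
)

def classify(vehicle, node=TREE):
    """Walk the tree, answering one question per level, until a leaf is reached."""
    while not isinstance(node, str):
        question, yes_branch, no_branch = node
        node = yes_branch if question(vehicle) else no_branch
    return node

new_vehicle = {"open_cargo_bed": False, "seats": 7, "doors": 4}
print(classify(new_vehicle))
```

Each new vehicle follows exactly one path from the root to a leaf, so classification costs only as many questions as the tree is deep.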

Diagnostic Analytics

Diagnostic analytics focuses on the reasons behind the observed patterns that are derived from descriptive analytics. For example, descriptive analytics can point out that the sale of an item has shot up or suddenly decreased at a supermarket, and then the diagnostic analytics tools can provide reasons behind such an observation.

Diagnostic analytics focuses on causal relationships and sequences embedded in the data. It helps answer questions like “Why?”.

For example, in the customer churn prevention use case, the diagnostic analytic tools can be used to find the probable reason behind the high-churn-risk customers. As mentioned above, descriptive analytics tools summarize the feedback and complaint logs and the refund requests by customers. Table below lists the number of complaints, refund requests, and price match requests, generated by the descriptive analytics.

Diagnostic analytics tools correlate the information between tables to answer why customers 7 and 8 are at a higher risk of leaving. For example, customer 7 may be someone who is unhappy with his or her complaints and refund requests not being handled. The diagnostic tool will not give a result for each and every customer. The results will be more like “Customers with a higher percentage of unresolved complaints are at high churn risk.” Customer 8, however, seems to be a different case. Observing such a high number of price match requests, it is highly probable that customer 8 has found another store which has lower prices for products than the current store. In the case of customer 7, urgent resolution of the pending complaints and refund cases can decrease the chances of their leaving. However, in the case of customer 8, providing promotional discounts and coupons on products in which they may be interested could be a good strategy for retention. The key constraint which should still be ensured is the profitability of business. Thus, finding coupons which may satisfy customer 8 and yet do not result in a loss for the store is a challenge for the store.

For ease of understanding, the table is oversimplified with customer IDs. In reality, such information regarding pending complaints and price match requests will also be in the form of clusters. To find the probable reason behind high-churn-risk customers, an intersection between the clusters in the first table and the clusters of complaints, refund requests, and price match requests will be found.

Thus, the answer to “why is there customer churn?” could range from bad delivery service or the quality of products sold to customers who prefer quality over price, to migration to a competitor based on price or service.
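The intersection step can be sketched with plain sets. The cluster memberships below are hypothetical, chosen only to mirror the narrative about customers 7 and 8; real clusters would hold millions of customer IDs.

```python
# Hypothetical cluster memberships (customer IDs).
high_churn_risk = {7, 8}
signals = {
    "unresolved complaints": {3, 7},
    "pending refund requests": {5, 7},
    "many price match requests": {8},
}

# Intersect the high-risk cluster with each diagnostic cluster.
probable_reasons = {
    name: sorted(high_churn_risk & members)
    for name, members in signals.items()
    if high_churn_risk & members
}

for name, customers in probable_reasons.items():
    print(name, customers)
```

A large overlap between the high-risk cluster and one of the signal clusters points to that signal as a probable reason for churn.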

When to use diagnostic analytics?

A. When the reason behind a certain observed phenomenon or characteristic needs to be determined.

[Figure: Diagnostic analytics]

Predictive Analytics

Predictive analytics uses the outcomes of descriptive and diagnostic analytics to create a model for the future. In other words, analyzing the what and why gives insights to prepare a model for questions like “What is possible in the future?” For example, when diagnostic analysis specifies a correlation between large customer churn and unresolved complaints, predictive analytics can model this relation to approximate the future customer churn rate on the basis of the fraction of unresolved complaints. Such a model is shown in the figure below. The red curve in the figure specifies the relation between customer churn rate (y axis) and unresolved complaints (x axis).

Predictive analytics are also utilized by businesses to estimate different kinds of risks, finding the next best offers for customers, etc. Use of predictive analytics helps businesses forecast future scenarios. For example, promotion offers and targeted discount coupons can be mailed to customers to avoid a scenario of customer churn when the unresolved complaint rate is high; this seems to be a good strategy to avoid customer churn. However, questions like “Is this the best possible strategy or do we have other options?” and “Will this strategy effectively reduce risk of customer churn?” and “Will it still be profitable for business?” are some of the questions which predictive analytics cannot answer. We need a more powerful tool, which can prescribe potential options and also predict the future impacts of the potential options. We discuss prescriptive analytics, a more capable analytics approach than predictive analytics, in the next section.

When to use predictive analytics?

A. When something about the future needs to be predicted or some missing information needs to be approximated.

Techniques Used for Predictive Analytics

Linear Regression Techniques

Linear regression (https://en.wikipedia.org/wiki/Linear_regression) is a technique used to analyze the relationship between an independent variable and a dependent variable. In the context of retail business, the independent variable can be the discount offered on a certain item, and the dependent variable can be the corresponding increase in sales. The greater the discount, the more sales go up. However, it is necessary to capture the exact relationship between the discount and the sales to make estimates.

[Figure: Linear Regression Techniques]

[Figure: Predictive Analytics]

The relationship between sales, the dependent variable, and the discount, the independent variable, is modeled as a linear equation. Historical data are then analyzed and plotted on a graph, as shown in the figure. Next, a straight line which closely follows the plotted points is drawn. The key is to estimate the slope of this line, which in turn gives the relationship between the dependent and independent variables. The goal is to maximize the accuracy of prediction and thus minimize the error.
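One common way to find the line that "closely follows" the points is ordinary least squares, which picks the slope and intercept that minimize the sum of squared errors. The sketch below assumes that criterion (the post does not name one) and uses made-up discount and sales figures.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: discount offered (%) and resulting sales increase (%).
discounts = [0, 5, 10, 15, 20]
sales_increase = [1, 11, 19, 32, 39]

slope, intercept = fit_line(discounts, sales_increase)

def predict(discount):
    """Estimated sales increase for a discount not seen in the history."""
    return slope * discount + intercept

print(round(slope, 2), round(intercept, 2), round(predict(12), 2))
```

The fitted slope is the quantity the text refers to: it estimates how much the dependent variable changes for each unit change in the independent variable.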

Time Series Models

Time series (Hamilton 1994), as the name suggests, is the arrangement of data points in temporal order. Generally, the time difference between the successive data points is the same. Examples of time series are stock market index levels (generated each day), the number of items sold each day in a supermarket, etc. Time series analysis has been extensively used for forecasting.

Time series analysis consists of methods to extract meaningful characteristics of data, like increasing or decreasing trends with time. The trend in the data is the key to predicting the future values of data. It is analogous to the slope of the line described for linear regression. In the figure below, the data show a decreasing trend.

[Figure: Time Series Models]

A simple way of calculating the trend is by taking moving averages. There are various ways of calculating moving averages.
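As one example, a simple (unweighted) moving average replaces each value with the mean of a fixed-size window of consecutive values, which smooths out day-to-day noise and exposes the trend. The sales figures and the window size below are hypothetical.

```python
def moving_average(series, window):
    """Simple moving average: mean of each run of `window` consecutive values."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical number of items sold per day.
daily_items_sold = [50, 48, 47, 45, 46, 42, 40]
trend = moving_average(daily_items_sold, 3)

# A smoothed series that keeps falling indicates a decreasing trend.
is_decreasing = all(a > b for a, b in zip(trend, trend[1:]))
print([round(v, 2) for v in trend], is_decreasing)
```

Note that the raw series rises briefly on one day, while the smoothed series falls throughout; other variants, such as weighted or exponential moving averages, give more weight to recent values.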

Prescriptive Analytics

Prescriptive analytics (Song et al. 2013, Gröger et al. 2014) is relatively new and complex, compared to other analytics approaches. The aim of prescriptive analytics is to give advice on possible outcomes. Prescriptive analytics tries to approximate an effect or possible outcome of a future decision even before the decision has been made. It provides a “What if” kind of analytics capability.

Prescriptive analytics helps to determine the best solution among a variety of choices, given the known parameters, and suggests options for how to take advantage of a future opportunity or mitigate a future risk. It can also illustrate the implications of each decision to improve decision making. Examples of prescriptive analytics for customer retention include next-best action and next-best offer analyses.

Prescriptive analytics internally employs all other analytic techniques to provide recommendations. It uses an iterative cycle of simulation, optimization, and validation to fine-tune future predictions.

When to use prescriptive analytics?

Use prescriptive analytics when advice is needed regarding what action to take for the best results.

Application of Prescriptive Analytics in the Customer Churn Prevention Use Case

Once a customer leaves, they no longer bring any value to the business, so retaining a customer who is about to leave is critical. We need schemes which can not only predict customer churn, but also suggest preventive measures to keep it from happening. Prescriptive analytics is such a scheme, which makes it much more powerful than predictive analytics.

Using prescriptive analytics, the business can explore different options available to it to prevent customer churn. Prescriptive analytics will provide inputs similar to the ones mentioned in figure below. (Note that the figure is used for the understanding of readers and uses hypothetical data. Real world tools have various ways of suggesting the options and best choices.)

[Figure: Prescriptive analytics]

[Figure: Using prescriptive analytics]

Prescriptive analytics suggests three options: targeted coupons, resolving complaints and refund requests, and storewide discounts. The x axis of the graph in the figure above shows the reduction in profits, and the y axis shows the percentage of high-risk customers who are retained. It can be seen that storewide discounts are not an effective way of retaining customers, as they result in a large reduction of profit but retain few customers.

The best option suggested by prescriptive analytics is to be in Region R, which amounts to resolving complaints and handling refund requests. This is because the number of customers retained is close to 60% at this point, with only a 30% reduction in profit.
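The comparison of options can be sketched as a small selection problem. Only the complaint-resolution figures (about 60% retained for a 30% profit reduction) come from the text; the numbers for the other two options and the scoring rule (customers retained per point of profit given up) are assumptions made for illustration, and real tools use richer simulation and optimization.

```python
# (profit_reduction_pct, retained_pct) per option. Only the first row is taken
# from the text; the other two rows are hypothetical placeholders.
options = {
    "resolve complaints and refunds": (30, 60),
    "targeted coupons": (20, 30),
    "storewide discounts": (70, 20),
}

def retention_per_profit_point(option):
    """Assumed score: % of high-risk customers retained per % of profit lost."""
    reduction, retained = options[option]
    return retained / reduction

best = max(options, key=retention_per_profit_point)
print(best, retention_per_profit_point(best))
```

A different business constraint, such as a hard cap on profit reduction, would change the scoring rule and possibly the recommended option.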


