Embarking on big data architecture framework


The figure below shows the overall Big Data analytics architecture framework. MapReduce and Spark provide the large-scale data processing capabilities for different types of analytics. For example, descriptive analytics uses MapReduce to filter and summarize large amounts of data. Similarly, predictive analytics techniques employ MapReduce to process data from data warehouses.
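
To make the filter-and-summarize idea concrete, here is a minimal sketch of the MapReduce pattern in plain Python. It runs on a single machine and is only an illustration of the map, shuffle, and reduce phases, not Hadoop or Spark code. The transaction records and the function names are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical transaction records: (customer_id, amount).
transactions = [
    ("c1", 20.0), ("c2", 5.0), ("c1", 30.0), ("c3", -1.0), ("c2", 15.0),
]

def map_phase(records):
    """Filter out invalid records and emit (key, value) pairs."""
    for customer, amount in records:
        if amount > 0:  # filter step: drop rows with non-positive amounts
            yield customer, amount

def shuffle_phase(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Summarize each group into a purchase count and a total amount."""
    return {
        key: {"count": len(values), "total": sum(values)}
        for key, values in groups.items()
    }

summary = reduce_phase(shuffle_phase(map_phase(transactions)))
print(summary)
```

In a real cluster the map and reduce functions run in parallel on many nodes and the framework handles the shuffle; the division of labor between the phases is the same.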

Before a data analytics process begins, the relevant data are collected from a variety of sources (stage 1). The sources generally depend on the kind of business that is employing the analytics. For e-retailers, the most important data sources are the transaction and customer logs. Similarly, for an electrical service company, the most important sources of data will be IoT devices and the smart meters at customer premises.

Raw data are collected in a repository known as a data lake. The raw data are then processed by using descriptive analytics tools and techniques to filter and summarize them (stage 2). Descriptive analytics makes the data usable for humans and other analytics tools. The processed and summarized data are stored in a data warehouse (stage 3).

Diagnostic analytics uses the processed data that have been stored in a data warehouse and derives causal relationships and correlations inherent in the data (stage 4). These findings are stored back in the data warehouse to be used for predictive and prescriptive analytics. Predictive analytics utilizes the summarized and cleaned data stored in the data warehouse along with the correlations and causal relationships provided by the diagnostic analytics. It then provides predictions and future estimations, which are again stored back in the data warehouse (stage 5).
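
The hand-off from diagnostic to predictive analytics can be sketched as follows. This is a toy illustration, assuming two made-up series (`ad_spend` and `sales`): the diagnostic step measures how strongly they are correlated, and the predictive step fits a least-squares line to estimate a future value. A correlation alone does not establish a causal relationship, so a real diagnostic stage would need more than this.

```python
from math import sqrt

def pearson(xs, ys):
    """Diagnostic step: Pearson correlation between two series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def fit_line(xs, ys):
    """Predictive step: ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x

# Hypothetical summarized data, as it might sit in the data warehouse.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

r = pearson(ad_spend, sales)               # finding stored back (stage 4)
slope, intercept = fit_line(ad_spend, sales)
forecast = slope * 5.0 + intercept         # estimate stored back (stage 5)
print(r, slope, intercept, forecast)
```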

Prescriptive analytics internally employs all other analytic techniques to provide recommendations. It employs an iterative approach, where the predictions are tuned and optimized. This iterative approach is termed a simulation–optimization–validation loop and is shown as a cycle in stage 6. Predictive analytics also provides feedback and updates data in the data warehouse in the course of fine-tuning the predictions. In the next section, we present some of the tools that employ these analytics.
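
One pass through a simulation–optimization–validation loop might look like the sketch below. It is a toy pricing example under stated assumptions: the demand model, the unit cost, and the candidate prices are all invented for illustration, and a real prescriptive system would repeat the loop as new data arrive.

```python
import random

UNIT_COST = 4.0  # hypothetical cost per unit

def simulate_profit(price, rng, runs=500):
    """Simulation: average profit at a price under noisy demand (toy model)."""
    total = 0.0
    for _ in range(runs):
        demand = max(0.0, 100.0 - 8.0 * price + rng.gauss(0.0, 5.0))
        total += (price - UNIT_COST) * demand
    return total / runs

def recommend_price(candidates, seed=0):
    """Return the recommended price and its validated expected profit."""
    rng = random.Random(seed)
    # Optimization: keep the candidate with the best simulated profit.
    best = max(candidates, key=lambda price: simulate_profit(price, rng))
    # Validation: re-simulate the winner on fresh random draws.
    validated = simulate_profit(best, random.Random(seed + 1), runs=2000)
    return best, validated

best_price, expected_profit = recommend_price([4.0, 6.0, 8.0, 10.0, 12.0])
print(best_price, round(expected_profit, 1))
```

The recommendation is the action (a price), not just a forecast, which is what separates the prescriptive stage from the predictive one.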

Big data architecture: Tools Used for Big Data Analytics

There are many tools available for performing Big Data analytics. Generally, the tools are not confined to a single type of analytics. There are some special-purpose and business- and data-specific tools too; however, discussion of such tools is beyond the scope of this post. Here, we present a list of the most widely used analytics tools. The table below provides a list of tools and the types of analytics that each tool supports. We also provide a brief discussion of each of the mentioned tools.

IBM InfoSphere

IBM’s InfoSphere (https://www-01.ibm.com/software/data/infosphere/) is a widely used data integration, warehousing, and information governance tool. It provides enterprise-scale performance and reliability in bringing diverse data sets together, creates visualizations, and aids in data life cycle management.

IBM SPSS

IBM’s SPSS software (http://www.ibm.com/analytics/us/en/technology/spss/) is a predictive analytics program that combines statistical reporting, visualization, and diagnostic analysis to build predictive models for data mining. It is a powerful tool for model evaluation and automation of advanced analytics in the Cloud.

Apache Mahout

The Apache Mahout project (http://mahout.apache.org/) is an open source project, used among researchers, for implementing scalable machine learning algorithms. It employs Apache Hadoop and the MapReduce processing framework to process data. The algorithms available within this project range from clustering and classification to collaborative filtering.

Azure Machine Learning Studio

The Azure Machine Learning Studio (https://azure.microsoft.com/en-us/) is a Cloud-based predictive analytics service for building and deploying models from the Azure data lake. It provides a large repository of state-of-the-art machine learning algorithms with R and Python support. Azure provides not only tools for predictive model development but also a fully dedicated platform on which to deploy these models via Web services. Features such as data collection and management and ready-to-plug-in sample predictive modeling modules, coupled with support from the Azure storage system, make the Azure Machine Learning Studio a one-stop solution for descriptive, diagnostic, and predictive analytics.

Halo

Halo is a forecasting tool (https://halobi.com/2016/01/halo-for-forecasting/) that combines statistical modeling with a business intelligence platform to provide forecasting and decision-making capabilities for large data sets. It is automated, customizable, and designed as a complete analytics platform, from data cleansing to prescriptive analytics. Halo specializes in prescriptive analytics modeling.

Tableau

Tableau (https://www.tableau.com/sites/default/files/media/whitepaper_bigdatahadoop_eng_0.pdf) provides real-time data visualization and is employed in analytics platform solutions spanning different segments, including business intelligence, Big Data analytics, sports, health care, and retail. Tableau is designed to facilitate real-time “conversations” between data across multiple platforms, such as relational databases, Cloud data stores, OLAP cubes, and spreadsheets. Tableau is a visualization and query tool that can be used in all stages of Big Data analytics.

SAP Infinite Insight

SAP’s Infinite Insight program (http://www.butleranalytics.com/enterprise-predictive-analytics-comparisons-2014/) addresses a definite set of predictive analytics problems. It harnesses in-database predictive scoring and also comes with R packages to support a large number of algorithms. Predictive models can be built using specific machine learning and data mining algorithms. Infinite Insight has a narrow market focus, and hence cannot be employed for general-purpose machine learning or data mining requirements.

@Risk

@Risk (https://www.palisade.com/risk/l) provides risk management strategies that combine simulations and genetic algorithms to optimize logs or spreadsheets with uncertain values. It performs simulations for mathematical computations, tracking and evaluating different future scenarios for risk analysis. It assigns an objective probability to each such scenario and forecasts the probability of the risks associated with it. @Risk can be used in the simulation, optimization, and validation phases of prescriptive analytics.

Oracle Advanced Analytics

The Oracle Advanced Analytics platform (https://www.oracle.com/database/advanced-analytics/index.html) uses in-database processing to provide data mining, statistical computation, visualization, and predictive analytics. It supports most of the data mining algorithms for predictive modeling. Oracle has also incorporated R for statistical analysis of data.

TIBCO SpotFire

The TIBCO SpotFire platform (http://spotfire.tibco.com/solutions/technology/predictive-analytics) is a predictive and prescriptive analytics tool for data exploration, discovery, and analytics; interactive visualizations help the user gain insights about the data. It integrates the R, S+, MATLAB, and SAS statistical tools. It uses predictive modeling techniques, such as linear and logistic regression and classification and regression trees, as well as optimization algorithms for decision-making capabilities.

R

R, a product of The R Project for Statistical Computing (https://www.r-project.org/), is an open-source statistical computing program that performs data mining and statistical analysis of large data sets. By design, R is flexible, and it has attracted a lot of industry attention. R is integrated with data processing frameworks like MapReduce and Spark. A popular project, SparkR, provides real-time statistical processing of streaming data. SparkR also supports filtering, aggregation, and integration of data, and gives access to the MLlib library for machine learning algorithms. Many tools listed here also utilize R as a component for providing additional capabilities.

Wolfram Mathematica

Wolfram’s Mathematica system (https://www.wolfram.com/mathematica/) is considered a computer algebra system, but it also has tools capable of supervised and unsupervised learning over artificial neural networks for processing images, sounds, and other forms of data.

Future Directions and Technologies

The size of data is growing at a tremendous pace, and with it the need for better and more powerful analytical tools. Several recent technological advancements look promising. In this section, we mention some of these advancements and the future directions that analytics can take.

From Batch Processing to Real-Time Analytics

Most implementations of MapReduce and related techniques, like Hadoop, employ batch processing. This limits data processing, because the data must first be collected before they can be processed. The collection process may take days, weeks, or even months. Such a delay is detrimental to business decisions.

For example, consider customer churn. If an e-retailer discovers a high-risk customer a week after the first signs appear in their data, then there is not much to be done. The customer will have already left by then. Such situations, where decision making is extremely time sensitive, require a real-time or stream processing paradigm rather than a batch processing one.
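
The difference from batch processing can be sketched with a small event-at-a-time monitor. This is a toy illustration in plain Python, not Spark or Storm code. The `ChurnMonitor` class, the window size, and the "two negative events out of the last three" rule are assumptions invented for the example.

```python
from collections import defaultdict, deque

class ChurnMonitor:
    """Flags a customer as soon as their recent activity looks risky."""

    def __init__(self, window=3, threshold=2):
        self.threshold = threshold
        # Keep only the last `window` events per customer in memory.
        self.recent = defaultdict(lambda: deque(maxlen=window))

    def on_event(self, customer, is_negative):
        """Process one event on arrival; return True if it triggers an alert."""
        events = self.recent[customer]
        events.append(is_negative)
        return sum(events) >= self.threshold

monitor = ChurnMonitor()
# Hypothetical stream of (customer_id, negative_signal) events.
stream = [("c1", False), ("c2", True), ("c1", True), ("c2", True), ("c1", False)]
alerts = [customer for customer, negative in stream
          if monitor.on_event(customer, negative)]
print(alerts)
```

Each event is evaluated the moment it arrives, so the alert for a risky customer is raised on the triggering event rather than after the next batch run.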

Apache Spark (Shanahan and Dai 2015) and Apache Storm (Ranjan 2014) provide real-time distributed data processing capabilities. Both Spark and Storm offer stream processing, but Spark is a more general-purpose distributed computing framework. Spark can run over existing Hadoop clusters, and thus it provides easy portability.

In-Memory Big Data Processing

Big Data analytics involves a lot of data movement to and from data repositories and warehouses. Such data movement is time-consuming, and it is often the bottleneck in overall processing time. This is because secondary storage is several orders of magnitude slower than the processor itself.

Apache Ignite (Anthony et al. 2016) is an in-memory implementation of Hadoop libraries. It provides much faster processing than the vanilla implementations of Hadoop.
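
The benefit of keeping data in memory can be illustrated with a simple read-through cache. This sketch is not Apache Ignite's API; the `ReadThroughCache` class and the `slow_storage` dictionary are hypothetical stand-ins that only show why repeated reads become cheap once the data are held in memory.

```python
class ReadThroughCache:
    """Serves repeated reads from memory, touching slow storage only once per key."""

    def __init__(self, load_from_storage):
        self._load = load_from_storage  # function standing in for a disk read
        self._memory = {}
        self.storage_reads = 0

    def get(self, key):
        if key not in self._memory:
            self.storage_reads += 1     # the expensive path: go to storage
            self._memory[key] = self._load(key)
        return self._memory[key]

# Hypothetical secondary storage, modeled as a plain dictionary.
slow_storage = {"sales_2016": [120, 135, 150]}
cache = ReadThroughCache(slow_storage.__getitem__)

first = cache.get("sales_2016")   # loaded from storage
second = cache.get("sales_2016")  # served from memory
print(cache.storage_reads)
```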

Prescriptive Analytics

Prescriptive analytics (Soltanpoor and Sellis 2016) is still not as widely utilized by companies as other types of analytics. This is partly because prescriptive analytics is a form of automated analytics. Companies and businesses are still skeptical of letting machines handle business analytics.

Another challenge in employing prescriptive analytics is that the data available are rarely without gaps. Analysts have adapted to this challenge and often work around the unavailable data. However, such data gaps are not suitable for automation-based analytics.

Data latency poses another challenge for prescriptive analytics. As automation makes analytics faster, fresh data are always required to provide accurate projections and future options. With current batch-processing-based analytics models, the freshness of data is often questionable. However, recent developments in real-time and stream-processing-based Big Data platforms make prescriptive analytics possible.

The Digital Media Strategy Blog
