Assessment of reliable big data analytical tools

A wide range of options can be selected for a Big Data analytics program. These options include vendor tool types and tool features, users’ techniques and methodologies, and team or organizational structures. The list covers many complex items, some of which we may not have considered seriously. Irrespective of which project stage you’re in within Big Data analytics, knowing what options are available is foundational to making good decisions about the approaches to take and the software or hardware products to evaluate.

To address these questions, the TDWI organization conducted a survey of the pros and cons of options for Big Data analytics. The listed options include recent innovations, such as Clouds, MapReduce, and complex event processing, which have existed for the past few years but are now being adopted worldwide. The list presents a catalog of available options for Big Data analytics, and responses to the survey questions indicate what combinations of analytic functions, platforms, and tools users are employing today in terms of usability and complexity. With these data, project planning can be done efficiently, and we can deduce priorities based on the digital marketing challenges foreseen.

The above figure portrays a slightly different view of option usage, as indicated by the pairs of bars on the right side of the figure. Per-option differences between responses for “Using Today” and “Using in Three Years” were calculated to give the potential growth bars; this delta tells us how much the usage of a Big Data analytics option will increase or decrease. An option’s commitment value is the percentage of survey respondents who are committed to using that option, whether today, in three years, or both. Note that none of the values in the figure comes to 100%, which indicates that no option will be used by all survey respondents in all time frames. In this post, we focus on learning about Big Data analytics best practices, from identifying business goals to selecting the best Big Data management tools for your organization’s needs (Wayer 2012).

What kinds of techniques and tool types is your organization using for advanced analytics and Big Data, both today and in three years? (Checking nothing on a row means you have no plans for that technique or tool.)

Potential Growth Versus Commitment for Big Data Analytics Tools Options

Potential Growth for big data analytics tools

The potential growth chart subtracts tools in use now from those anticipated to be in use in three years, and the delta provides a rough indicator for the growth or decline of usage of options for Big Data analytics over the next three years. The charted numbers are positive or negative. Note that a positive number indicates growth and that growth can be good or strong. A negative number indicates that the use of an option may decline or remain flat instead of increasing.

Commitment

The numbers in the commitment column represent the percentages of survey respondents (based on a total of approximately 325 respondents) who selected “Using Today” and/or “Using in Three Years” during the survey process. The measure of commitment here is cumulative, in that the commitment may be realized today, in the near future, or both. A survey respondent could leave a row unchecked if they had no plans for the option.
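The two metrics can be sketched in a few lines of Python. This is a minimal illustration, not the survey's actual method: the `Response` record layout and the sample answers are hypothetical, and the sketch assumes that growth is expressed in percentage points of all respondents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """One respondent's row for a single option (hypothetical layout)."""
    using_today: bool
    using_in_three_years: bool


def potential_growth(responses):
    """Percent 'Using in Three Years' minus percent 'Using Today'."""
    today = sum(r.using_today for r in responses)
    future = sum(r.using_in_three_years for r in responses)
    return 100.0 * (future - today) / len(responses)


def commitment(responses):
    """Percent of respondents who checked either box (cumulative measure)."""
    committed = sum(r.using_today or r.using_in_three_years for r in responses)
    return 100.0 * committed / len(responses)


# Made-up answers from five respondents for one option.
sample = [
    Response(True, True),
    Response(False, True),
    Response(False, True),
    Response(True, False),
    Response(False, False),  # nothing checked: no plans for this option
]

print(potential_growth(sample), commitment(sample))
```

A positive growth value means more respondents expect to use the option in three years than use it today; the commitment value counts a respondent once, even if both boxes are checked.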

Balance of Commitment and Potential Growth

To get a complete picture, it’s imperative to look at the metrics for both commitment and growth. For instance, some features or techniques may have significant growth rates but only within a weakly committed segment of the user community (Clouds, SaaS, or No-SQL databases). Others are strongly committed through common use today (analytic data marts, online analytic processing [OLAP] tools) but could have low growth rates. The options that see the greatest activity in the near future will most likely be those with high ratings for both growth and commitment. To visualize the balance of commitment and growth, the figure places the potential growth and commitment numbers on opposing axes of a single chart. Big Data analytics options are plotted by growing or declining usage (x axis) and narrow or broad commitment (y axis).

Trends for Big Data Analytics Tools Options

In the above figure, we showed that most Big Data analytics options will experience some level of growth in the near future. The figures also indicate which options will grow the most, as well as those that will stagnate or decline. In particular, four groups of options stand out based on combinations of growth and commitment. The groups are reflective of trends in advanced analytics and Big Data (Russom 2011, Wayer 2012).
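As a rough sketch of how such a grouping could be derived from the two metrics, the function below buckets an option by its growth and commitment values. The cutoffs (`strong_growth`, `weak_commitment`, `strong_commitment`) and the sample options are hypothetical; the survey does not state numeric thresholds, so treat this as an illustration of the idea only.

```python
def classify_option(growth, commitment,
                    strong_growth=15.0,
                    weak_commitment=30.0,
                    strong_commitment=50.0):
    """Assign an option to one of the four groups.

    All thresholds are assumed values for illustration, not survey figures.
    """
    if growth <= 0:
        # Flat or declining usage only stands out when commitment is strong.
        return "Group 4" if commitment >= strong_commitment else "unclassified"
    if commitment < weak_commitment:
        return "Group 3"  # weak commitment, good growth
    if growth >= strong_growth:
        return "Group 1"  # strong to moderate commitment, strong growth
    return "Group 2"      # moderate commitment, good growth


# (growth, commitment) pairs are made-up numbers.
options = {
    "option A": (25.0, 60.0),
    "option B": (8.0, 40.0),
    "option C": (10.0, 15.0),
    "option D": (-5.0, 70.0),
}
groups = {name: classify_option(g, c) for name, (g, c) in options.items()}
print(groups)
```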

Group 1: Strong to Moderate Commitment, Strong Potential Growth

The options most likely to change best practices in Big Data analytics are those with higher potential growth (as validated by the survey results) and moderate or strong organizational commitment. Group 1 meets both of these requirements and includes tool types and techniques that TDWI has seen adopted aggressively in recent years. Furthermore, today’s strongest trends in business intelligence (BI), data warehousing, and analytics are apparent in Group 1 and are summarized in the next five subsections.

Advanced Analytics

Advanced analytics is a collection of techniques and tool types, including tools for predictive analytics, data mining, statistical analysis, complex SQL, data visualization, artificial intelligence, natural language processing, and database methods that support analytics. Among all the options for Big Data analytics, advanced analytics has the highest commitment. In terms of commitment, the options most closely related to advanced analytics are predictive analytics, data mining, and statistical analysis. Corporate commitment to advanced analytics is undeniable, and it will no doubt be a growth area for users and customers in years to come.

Visualization

Advanced data visualization (ADV) shows the strongest growth potential among all the options for Big Data analytics. ADV can be seen as a natural fit for Big Data analytics. Unlike standard pie, bar, or line charts, ADV can scale to represent thousands to millions of data points. ADV can handle varied data types and present analytic data structures that aren’t easily flattened onto a computer screen. Many of today’s ADV tools and functions are compatible with all the leading data sources, so a business analyst can explore data widely, in real time, in search of just the right analytic data set. Today’s ADV tools have evolved into easy-to-use, self-service tools that a broad range of people can use. In addition, TDWI has seen many corporations adopt ADV and visual discovery, as both stand-alone analytic tools and general-purpose BI platforms, at both departmental and enterprise levels (KarmaSphere Solution Brief 2011, JasperSoft Quick Start Guide 2017).

Real Time

Operational BI is a business practice that measures and monitors the performance of business operations frequently. It is enabled by BI technologies, especially dashboard-style reports. Although the definition of “frequently” varies, most operational BI implementations fetch data in real time (or close to it) to refresh real-time management dashboards (which are poised for growth). Users’ aggressive adoption of operational BI in recent years has (among other things) pushed BI technologies into real-time operation, as seen in management dashboards. As users evolve operational BI to be more analytic (i.e., not merely reports based on metrics), analytics are likewise being pushed into real time. Visualization and advanced analytics are poised for aggressive adoption. Real time is the strongest BI trend, yet it hasn’t hit Big Data analytics much so far.

Rates of growth and commitment identified four groups of options for Big Data analytics. The third V in the three Vs of Big Data stands for velocity. As numerous examples in this post show, there are many real-world applications for analytics available today for streaming Big Data, and more applications are coming. However, real-time analytic applications are still new, and they are utilized today by relatively few organizations. Perhaps this explains why real-time and streaming data did not fare well in the usage survey. Even so, given that real-time applications are the strongest trend in BI today, these will no doubt transform analytics soon, just as they have transformed reporting.

In-Memory Databases

One of the best ways to get a quick, real-time response from a database is to manage the information in server memory, thereby eliminating disk I/O and its speed challenges. For several years now, TDWI has seen consistent adoption of in-memory databases among its members and other organizations. An in-memory database can serve many purposes, but in BI it usually supports real-time dashboards for operational BI, and the database usually stores metrics, key performance indicators (KPIs), and sometimes OLAP cubes. Similar growth is observed in the adoption of in-memory databases for advanced analytics, because accessing data in memory is faster than with traditional disk-based approaches. Leading vendors now offer data warehouse appliances with flash memory or solid-state drives, to which in-memory databases will soon move.
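To make the dashboard use case concrete, here is a small sketch that keeps KPIs in an in-memory database and reads them back for a dashboard refresh. It uses SQLite's `:memory:` mode from the Python standard library purely as a stand-in; it is not one of the analytic in-memory products discussed above, and the `kpi` table, its columns, and the values are hypothetical.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM, so reads never touch disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kpi (name TEXT PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO kpi VALUES (?, ?)",
    [("orders_per_hour", 120.0), ("avg_basket", 37.5)],
)


def refresh_dashboard(connection):
    """Return the current KPI values as a name -> value mapping."""
    rows = connection.execute("SELECT name, value FROM kpi ORDER BY name")
    return dict(rows.fetchall())


# An operational system updates a metric; the next refresh sees it at once.
conn.execute(
    "UPDATE kpi SET value = ? WHERE name = ?", (125.0, "orders_per_hour")
)
dashboard = refresh_dashboard(conn)
print(dashboard)
```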

Unstructured Data

We all pay lip service to the fact that there’s valuable, actionable information in natural language text and other unstructured data. In spite of this, organizations haven’t taken advantage of this information until recently. Tools for text mining and text analytics have slowly gained usage, because they can find facts about key business entities in text and turn those facts into usable, structured data. The resulting data can be applied to customer sentiment analyses and many other applications. For example, many insurance companies use text analytics to parse the mountains of text that result from the claims process, turn the text into structured records, then add that data to the samples studied via data mining or statistical tools for risk, fraud, and actuarial analyses (Russom 2011).
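The step from free text to structured records can be illustrated with a deliberately simple sketch. Real text analytics tools rely on much richer natural language processing; the regular expression, the note format, and the field names below are all assumptions made for the example.

```python
import re

# Hypothetical note format: "... claim <ID> ... $<amount> ..."
CLAIM_PATTERN = re.compile(
    r"claim\s+(?P<claim_id>[A-Z]\d+).*?\$(?P<amount>[\d,]+(?:\.\d{2})?)",
    re.IGNORECASE | re.DOTALL,
)


def extract_claim(note):
    """Turn one free-text note into a structured record, or None."""
    match = CLAIM_PATTERN.search(note)
    if match is None:
        return None
    return {
        "claim_id": match.group("claim_id").upper(),
        "amount": float(match.group("amount").replace(",", "")),
    }


notes = [
    "Adjuster note: claim C1042 filed after hail damage; "
    "repair estimate $2,350.00 approved.",
    "Customer called about claim b77, quoted $480 for windshield.",
    "No claim reference here.",
]
records = [r for r in map(extract_claim, notes) if r is not None]
print(records)
```

The resulting list of dictionaries is the kind of structured data that could then be loaded alongside other samples for data mining or statistical analysis.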

Group 2: Moderate Commitment, Good Potential Growth

Group 2 is dominated by different types of analytic database platforms. Recent innovations by vendor firms have provided more options for analytic database platforms, including dedicated analytic DBMSs, data warehouse appliances, columnar data stores, and sandboxes, in addition to older options. Owing to user adoption, the newer analytic database platforms have achieved moderate commitment and good potential growth. Most user organizations going through a new analytics program (or a revamp of an established one) face one determining question: Can the current or planned enterprise data warehouse (EDW) handle Big Data and advanced analytics without degrading the performance of other workloads for reporting and OLAP? A simpler question could be: How well can our EDW perform and scale with concurrent mixed workloads? The answer helps determine whether analytic data are managed and operated on the EDW proper or on a separate platform (which is usually integrated with the EDW).

EDWs can handle advanced analytic workloads, and in-database analytics has become very common. Yet hosting analytics on an EDW is not preferred by everyone. That’s because the management of Big Data and the processing workloads of advanced analytics make stringent demands on server resources, such that (depending on the EDW platform that has been assembled) they can rob server resources from other data warehouse workloads, resulting in slow queries and report refreshes. Some BI professionals prefer to isolate analytic workloads and Big Data on separate platforms outside the EDW to avoid performance degradation due to mixed workloads. Performance aside, separate analytic database platforms also make sense when analytics is funded or controlled by a department instead of the EDW sponsor. Moderate demand has been observed for analytic database platforms as permanent fixtures in data warehouse programs, although more than two-thirds of organizations tend toward analytics on a central EDW. The movement started in early 2003, when the first data warehouse appliances came to light. After this came new vendor-built databases with columnar data stores, which inherently accelerate column-oriented analytic queries. Most recently, vendors have brought out analytic platforms that use distributed file systems, MapReduce, and No-SQL indexing.

Group 3: Weak Commitment, Good Growth

The options in Group 3 show weak commitment, as they are relatively new. Potential growth is good within committed organizations, and we can expect these options to be in wider use soon.

Hadoop Distributed File System (HDFS)

Currently, interest in HDFS is extremely high, although it is rarely adopted (hence, the figure shows weak commitment). Interest is high because of the advent of Big Data, which is diverse in terms of data types. The complex data types that we normally associate with Big Data originate in files, such as Web logs and XML documents. It is quite troublesome to transform these files into standard forms via a traditional database management system (DBMS). Also, data transformation could potentially lose the data details and anomalies that fuel some forms of analytics. Some users would prefer to simply copy files into a file system without preparing the data much, as long as the Big Data strewn across a million or more files is accessible for analytics.

MapReduce

MapReduce is a new analytic option and, like Hadoop, it is attracting a lot of interest today. The two are related, in that MapReduce makes a distributed file system like HDFS addressable through analytic logic. In MapReduce, a user defines a data operation, such as a query or analysis, and the platform “maps” the operation across all relevant nodes for distributed parallel processing and data collection. Mapping and analytic processing span multiple distributed files, despite diverse data types. MapReduce can also work in a database management system with a relational store, as in the Aster Data database. The distributed processing of MapReduce is what makes analytics on Big Data possible.
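The map, shuffle, and reduce steps can be imitated on a single machine to show the shape of the computation. This is a toy sketch in plain Python, not Hadoop code: the two lists in `partitions` stand in for file blocks on different nodes, and the log format is made up.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (status_code, 1) pair for one Web log line."""
    status = line.rsplit(" ", 1)[-1]
    return [(status, 1)]


def shuffle(pairs):
    """Group all emitted values by key, as the platform would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse the values for one key into a single count."""
    return key, sum(values)


def map_reduce(parts):
    mapped = chain.from_iterable(
        map_phase(line) for part in parts for line in part
    )
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Each inner list stands in for the lines stored on one node.
partitions = [
    ["GET /index.html 200", "GET /missing 404"],
    ["GET /cart 200", "POST /checkout 500", "GET /index.html 200"],
]
status_counts = map_reduce(partitions)
print(status_counts)
```

On a real cluster the map calls would run in parallel on the nodes that hold the data, and only the small intermediate pairs would travel over the network.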

Complex Event Processing (CEP)

This option is relatively new, compared to others, yet it is experiencing rapid adoption. For example, a recent TDWI report discovered that 20% of survey respondents had incorporated some form of event processing into their data integration solutions; that is significant given the newness of this practice. Although it is not required, CEP often operates in a real-time scenario, and so its adoption is driven partially by the real-time trend. CEP can also be used in association with analytics, which is another driver. CEP technologies are evolving to handle streaming Big Data.
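A minimal sketch of the idea behind event processing is a rule evaluated over a sliding time window as events arrive. The rule below (three failed logins by the same user within 60 seconds) and the event tuples are hypothetical, and the sketch assumes events arrive in timestamp order; commercial CEP engines offer far richer pattern languages.

```python
from collections import defaultdict, deque


def detect_bursts(events, threshold=3, window_seconds=60):
    """Return (timestamp, user) alerts for bursts of failed logins.

    `events` is an iterable of (timestamp_seconds, user, kind) tuples,
    assumed to be sorted by timestamp.
    """
    recent = defaultdict(deque)  # user -> timestamps inside the window
    alerts = []
    for timestamp, user, kind in events:
        if kind != "login_failed":
            continue
        window = recent[user]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while window and timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) >= threshold:
            alerts.append((timestamp, user))
            window.clear()  # start fresh after raising an alert
    return alerts


events = [
    (0, "alice", "login_failed"),
    (10, "bob", "login_failed"),
    (20, "alice", "login_failed"),
    (30, "alice", "login_ok"),
    (45, "alice", "login_failed"),
    (200, "bob", "login_failed"),
    (210, "bob", "login_failed"),
]
alerts = detect_bursts(events)
print(alerts)
```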

SQL

Trends in BI sometimes cancel each other out. That’s currently the case with SQL, as some organizations have deepened their use of SQL while others have done the opposite. On one hand, many organizations rely heavily on SQL as the go-to approach for advanced analytics. The reason is that BI professionals know SQL, and it is compatible with almost every system. An experienced BI professional can create complex SQL programs (depicted as “Extreme SQL” in the figure above), and these work against Big Data that’s SQL addressable. Extreme SQL is typically applied to highly detailed source data, still in its original schema (or lightly transformed). The SQL is “extreme” because it creates multidimensional structures and other complex data models on the fly, without remodeling and transforming the data ahead of time. On the other hand, there is a small, innovative group of No-SQL enthusiasts. Going without SQL is feasible when the majority of data types are not relational and converting them to tabular structures would not make sense. No-SQL databases also tend to appeal to application developers, who don’t have the BI professional’s attachment to SQL.
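As a small-scale illustration of building a multidimensional view on the fly, the query below pivots detailed rows into a region-by-product summary without remodeling the source table first. SQLite stands in for the analytic database, and the `sales` table and its contents are hypothetical; real "extreme" SQL programs are typically far longer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (region TEXT, product TEXT, quarter TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("East", "widget", "Q1", 100.0),
        ("East", "widget", "Q2", 150.0),
        ("East", "gadget", "Q1", 80.0),
        ("West", "widget", "Q1", 60.0),
        ("West", "gadget", "Q2", 90.0),
    ],
)

# Pivot quarters into columns at query time; the detail rows stay untouched.
CUBE_SQL = """
SELECT region,
       product,
       SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0.0 END) AS q1,
       SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0.0 END) AS q2,
       SUM(amount) AS total
FROM sales
GROUP BY region, product
ORDER BY region, product
"""
cube = conn.execute(CUBE_SQL).fetchall()
for row in cube:
    print(row)
```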

Clouds in TDWI Technology Surveys

TDWI technology surveys show that BI professionals prefer private clouds over public ones, especially for BI, DW, and analytic purposes. This helps explain why the public cloud has the weakest commitment. The preference for private clouds is mostly due to the importance of data security and governance. Even so, some organizations experiment with analytic tools and databases on a public cloud and then move the information onto a private cloud once they decide analytics is mission critical. In a related scenario, software-as-a-service (SaaS) doesn’t necessarily require a cloud, but most SaaS-based analytic applications or analytic database platforms are on a tightly secured public cloud.

Group 4: Strong Commitment, Flat or Declining Growth

Group 4 includes essential options, such as centralized EDWs, data marts for analytics, hand-coded SQL, OLAP tools, and DBMSs built for OLTP. In fact, these are some of the most common options in use today for BI, analytics, and data warehousing. Why does the survey show them in decline, if these are so popular? There are mainly two reasons for this:

Users are maintaining mature investments while shifting new investments to more modern options. For instance, many organizations with a BI program have developed solutions for OLAP, but the current trend is to implement forms of advanced analytics, which are new to many organizations. OLAP won’t go away. In fact, OLAP is the most common form of analytics in today’s world, and it will remain so for the coming years. No doubt that users’ spending for OLAP will grow, albeit modestly compared to other analytic options. Databases designed for online transaction processing (OLTP) are in a similar situation. As we saw in the discussion of Group 2, many users have come to the conclusion that their organization would be better served by an analytic database platform built specifically for data analytics and warehousing. They will shift new investments to databases purpose-built for data analytics or warehousing while maintaining their investments in older relational databases (designed for OLTP, although also used for DW).
Users are correcting problems with their designs or best practices. Data marts are more problematic than ever due to recent requirements for data sharing and compliance. Although data marts regularly host analytic data sets, they are typically on older platforms that include an SMP hardware architecture and an OLTP database. Whether to get a better analytic database platform or to rein in proliferated marts, many user organizations are aggressively decommissioning analytic data marts. The natural option on which to base analytics is hand-coded SQL. The catch is that hand coding tends to be anticollaborative and nonproductive. Because SQL (as the leading language for data) is supported by almost every tool and platform in IT, and is in skill sets of most data management professionals, it cannot go away. In fact, analytics is driving up the use of hand-coded SQL. Most organizations should consider tools that generate SQL based on analytic applications developed in a user-friendly GUI, instead of hand-coding SQL. This needs to happen to make analytic tools more palatable to business people and mildly technical personnel as well as to make developers more productive.

Understanding Internet of Things Data

To get maximum business value from Big Data analytics efforts, users should look to incorporate a mix of structured and unstructured information; they should think of it as wide data, not merely Big Data. Big Data is a bit of a misnomer. Certainly, the volume of information coming from the Web, modern call centers, and other data sources can be enormous. But the main benefit of all that data isn’t in the size. It’s not even in the business insights you can get by analyzing individual data sets in search of interesting patterns and relationships. To get true BI from Big Data analytics applications, user organizations and BI and analytics vendors alike must focus on integrating and analyzing a broad mix of information, in short, wide data.

Future business success lies in ensuring that the data in both Big Data Business Intelligence streams and mainstream enterprise systems can be analyzed in a coherent and coordinated fashion. Numerous vendors are working on one possible means of doing so: products that provide SQL access to Hadoop repositories and NoSQL databases. The direction they are taking matters, particularly with SQL-on-Hadoop technologies, because far more people know SQL than know Hadoop.

Hadoop is a powerful technology for managing large amounts of unstructured data, but it’s not so great for quickly running analytics applications, especially ones combining structured and unstructured data. Conversely, SQL has a long and successful history of enabling heterogeneous data sources to be accessed with almost identical calls. And the business analysts who do most of the work to provide analytics to business managers and the CxO suite typically are well versed in using SQL.

In addition, most users want evolutionary advances in technology, not revolutionary ones. That means intelligently incorporating the latest technologies into existing IT ecosystems to gain new business value as quickly and as smoothly as possible. The result: information from Hadoop clusters, NoSQL systems, and other new data sources gets joined with data from relational databases and data warehouses to build a more complete picture of customers, market trends, and business operations. For example, customer sentiment data that can be gleaned from social networks and the Web is potentially valuable, but its full potential won’t be realized if it’s compartmentalized away from data on customer leads and other digital marketing strategic information.
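The joining of new and traditional sources can be sketched as follows. The example assumes sentiment scores arrive as JSON documents (as they might from a NoSQL store) and that customer leads live in a relational table; the table, field names, and scores are all hypothetical, and SQLite stands in for the relational database.

```python
import json
import sqlite3

# Relational side: customer leads in a conventional table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE leads (customer_id INTEGER PRIMARY KEY, name TEXT, stage TEXT)"
)
conn.executemany(
    "INSERT INTO leads VALUES (?, ?, ?)",
    [(1, "Acme", "proposal"), (2, "Globex", "qualified")],
)

# Document side: sentiment records as they might come from a NoSQL store.
sentiment_docs = [
    '{"customer_id": 1, "source": "web", "score": 0.75}',
    '{"customer_id": 1, "source": "social", "score": 0.25}',
    '{"customer_id": 3, "source": "social", "score": -0.5}',
]


def average_sentiment(docs):
    """Average the sentiment score per customer across all documents."""
    totals = {}
    for raw in docs:
        doc = json.loads(raw)
        total, count = totals.get(doc["customer_id"], (0.0, 0))
        totals[doc["customer_id"]] = (total + doc["score"], count + 1)
    return {cid: total / count for cid, (total, count) in totals.items()}


def combined_view(connection, docs):
    """Attach average sentiment to each lead; None when no documents exist."""
    sentiment = average_sentiment(docs)
    rows = connection.execute(
        "SELECT customer_id, name, stage FROM leads ORDER BY customer_id"
    )
    return [
        {"name": name, "stage": stage, "avg_sentiment": sentiment.get(cid)}
        for cid, name, stage in rows
    ]


view = combined_view(conn, sentiment_docs)
print(view)
```

Sentiment for customer 3 is dropped here because no matching lead exists, which is exactly the kind of gap that only becomes visible once the two sources are brought together.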

Challenges for Big Data Analytics Tools

One of the major Big Data challenges is what information to use, and what not to use. Businesses looking to get real value out of Big Data, while avoiding overwhelming their systems, need to be selective about what they analyze.

RichRelevance Inc. faces one of the prototypical Big Data challenges: lots of data and not a lot of time to analyze it. The marketing analytics services provider runs an online recommendation engine for Target, Sears, Neiman Marcus, Kohl’s, and other retailers. Its predictive models, running on a Hadoop cluster, must be able to deliver product recommendations to shoppers in 40 to 60 milliseconds, which is not a simple task for a company that has two petabytes of customer and product data in its systems, a total that grows as retailers update and expand their online product catalogs. “We go through a lot of data,” said Marc Hayem, vice president in charge of RichRelevance’s service-oriented architecture platform.

It would be easy to drown in all that data. Hayem said that managing it smartly is critical, both to ensure that the recommendations the San Francisco company generates are relevant to shoppers and to avoid spending too much time (and processing resources) analyzing unimportant data. The approach it adopted involves whittling down the data being analyzed to the essential elements needed to quickly produce recommendations for shoppers.

The full range of the historical data that RichRelevance stores on customers of its clients is used to define customer profiles, which help enable the recommendation engine to match up shoppers and products. But when the analytical algorithms in the predictive models are deciding in real time what specific products to recommend, they look at data on just four factors: the recent browsing history of shoppers, their demographic data, product availability on a retailer’s website, and special promotions currently being offered by the retailer. “With those four elements, we can decide what to do,” Hayem said, adding that data on things such as past purchases, how much customers typically spend, and other retailers where they also shop isn’t important at that point in the process.

In the age of Big Data, knowing what information is needed in analytics applications and what information isn’t has never been more important or, in many cases, more difficult. The sinking cost of data storage and the rise of the Hadoop data lake concept are making it more feasible for organizations to stash huge amounts of structured, unstructured, and semistructured data collected from both internal systems and external sources. But getting the questions of what to use, what to hold onto for the future, and what to jettison wrong can have both immediate and long-term consequences.

Even if a particular data set may seem unimportant now, it could have uses down the line. On the other hand, cluttering up Hadoop systems, data warehouses, and other repositories with useless data could pose unnecessary costs and make it hard to find the true gems of information amid all the clutter. And not thinking carefully, and intelligently, about the data that needs to be analyzed for particular applications could make it harder to get real business benefits from Big Data analytics programs.

Tools for Using Big Data

Because the scale of Big Data is very large, working with it is more complicated. The data are mostly spread over a number of servers, and the work of compiling the data must be coordinated among them. In the past, this work was usually assigned to the database software, which used its JOIN mechanism to compile tables and then add up the columns before handing off the rectangle of data to the reporting software that would validate it. This task is harder than it seems; database programmers can tell you about instances where complex JOIN commands hung up their database for hours as it tried to produce an urgent report.

Now, the scenario is completely different. Hadoop is a go-to tool for organizing the racks and racks of servers, and NoSQL databases are popular tools for storing data on these racks. These mechanisms can be much more powerful and efficient than the old single machine, but they are far from being as refined as the old database servers. Although SQL may be complicated, writing a JOIN query for a SQL database was often much simpler than collecting information from many machines and compiling it into one coherent answer, a process that is quite cumbersome to maintain. Hadoop jobs are written in Java, and they require another level of sophistication. The tools for using Big Data are just beginning to package this distributed computing power in a way that’s a bit easier to use.

NoSQL data stores are being used with many Big Data tools. They are more flexible than traditional relational databases, but this flexibility isn’t as much of a departure from the past as Hadoop is. NoSQL queries can be simpler to use, as the design discourages the complexities of SQL queries. The main concern is that software needs to anticipate the possibility that not every row will have data for every column.
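The missing-column concern is easy to show with plain Python dictionaries standing in for documents from a NoSQL store. The field names and the `to_rows` helper below are hypothetical; the point is only that the consuming code supplies a default instead of assuming every field is present.

```python
# Documents from a schemaless store: fields vary from record to record.
documents = [
    {"user": "ana", "page": "/home", "referrer": "search"},
    {"user": "ben", "page": "/pricing"},
    {"user": "cy"},
]

COLUMNS = ("user", "page", "referrer")


def to_rows(docs, columns=COLUMNS, default=None):
    """Flatten documents into fixed-width rows, filling gaps with `default`."""
    return [tuple(doc.get(col, default) for col in columns) for doc in docs]


rows = to_rows(documents)
for row in rows:
    print(row)
```

Indexing with `doc["referrer"]` instead of `doc.get(...)` would raise a `KeyError` on the second document, which is the kind of failure reporting code needs to anticipate.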

Here are some of the top tools for using Big Data, according to TechTarget.

Jaspersoft BI Suite

Jaspersoft features include the following:

Capabilities
Reporting
Dashboards
Analysis
Data integration
BI platform

Benefits

Jaspersoft’s BI platform provides key features that benefit both business and IT teams looking to enable self-service BI for their organization. Key features of the BI platform include the following (JasperSoft Quick Start Guide 2017):

Full-featured analytics, reporting, and dashboards that are easy to use
Flexible Web-based architecture that can be embedded in applications
Subscription model that enables more users at substantially reduced cost

The core of the Jaspersoft BI software suite is the JasperReports server. The end-to-end BI suite delivers shared services that include a repository for storing and structuring your resources, multiple levels of security, distribution of information, a semantic layer that greatly simplifies creating reports, a report scheduler, and many more features.

The Jaspersoft package is one of the open source leaders for producing reports from database columns. This software, which is up and running at many organizations, turns SQL tables into PDFs that can be scrutinized at meetings. The company is riding the Big Data train, which means adding a software layer to connect the places where Big Data gets stored to its report-generating software. The JasperReports server now offers software to pull in data from many of the major storage platforms, including Cassandra, MongoDB, Redis, CouchDB, Riak, and Neo4j. Hadoop is also well represented, with JasperReports providing a Hive connector to reach inside of HBase.

This effort feels like it is still starting up; the tools are not fully integrated, and many pages of the documentation wiki are blank. For example, the visual query designer doesn’t work yet with Cassandra’s CQL. You have to type these queries out by hand.

Once you get the data from these sources, Jaspersoft’s server will boil the information down to interactive tables and graphs. The reports can be reasonably sophisticated interactive tools that let you drill down into various corners. You can ask for more and more details if you need them (Splice Machine App overview 2016).

This is a well-developed corner of the software world, and Jaspersoft is expanding by making it easier to use these sophisticated reports with newer sources of data. Jaspersoft is not offering particularly new ways to look at the data; it just offers more sophisticated ways to access data stored in new locations. I found this unexpectedly useful. The aggregation of data was enough to make basic sense of when someone was going to the website and who was going there (Nunns 2015, JasperSoft 2017).

Pentaho Business Analytics

Pentaho is yet another software platform that began as a report-generating engine. Just like JasperSoft, it branched into Big Data by absorbing information from the new sources while making it easier to access. You can hook up Pentaho’s tool to many of the most popular NoSQL databases, such as Cassandra and MongoDB. You can drag and drop the columns into views and reports as if the information came from SQL databases, once the databases are connected.

Pentaho also provides software for drawing HBase data and HDFS file data from Hadoop clusters. The graphical programming interface, known as either Kettle or Pentaho Data Integration, is one of the more intriguing tools. It has a bunch of built-in modules that can be dragged and dropped onto a picture and then connected to one another. You can write your code and send it out to execute on the cluster, as Pentaho has thoroughly integrated Hadoop and the other sources into this (Nunns 2015, TechTarget).

Karmasphere Studio and Analyst

Many of the Big Data tools did not begin as reporting tools. For instance, Karmasphere Studio is a set of plug-ins built on top of Eclipse. It is a specialized IDE that makes it easier to create and run Hadoop jobs.

Karmasphere delivers the Big Data workspace for data professionals who want to take advantage of the opportunity to mine and analyze mobile, sensor, Web, and social media data in Hadoop and bring new value to their business. It provides a graphical environment on Cloudera’s distribution that includes Apache Hadoop (CDH), in which you can navigate through Big Data of any variety and spot patterns and trends in order to influence the strategies of a business. Once something meaningful is discovered, it provides the ability to integrate the insights into recurring business processes.

Direct Access to Big Data for Analysis

Karmasphere Analyst gives data analysts immediate access to structured and unstructured data on Cloudera CDH, through SQL and other familiar languages, so that they can make ad hoc queries, interact with the results, and run iterations without the aid of IT.

Operationalization of the Results

Karmasphere Studio gives developers who support analytic teams a graphical environment in which to develop custom algorithms, systematize the creation of the meaningful data sets those teams find, and feed the data sets into business processes and applications.

Flexibility and Independence

The Karmasphere Analytics engine is the foundation for all Karmasphere products. It provides easy access to Hadoop in data center and Cloud environments, transparency across the Hadoop ecosystem, prebuilt heuristics and algorithms, familiar language support, and collaboration facilities.

Configuring a Hadoop job with this developer tool brings a rare feeling of joy. There are any number of stages in the life of a Hadoop job, and Karmasphere’s tools walk you through each step, showing the partial results along the way. Debuggers have always made it possible to peer into a mechanism as it does its work, but Karmasphere Studio does something a bit better: as you set up the workflow, the tools display the state of the test data at each step. You see what the temporary data will look like as it is cut apart, analyzed, and then reduced.
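To make the idea of inspecting the data at each stage concrete, here is a minimal sketch in plain Python. It is not Karmasphere's implementation and does not use Hadoop; the function names and the sample log lines are assumptions made up for illustration. It runs a word count through map, shuffle, and reduce steps and keeps a snapshot of the intermediate data after each one.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def run_with_snapshots(lines):
    """Run all three stages and keep the intermediate data from each one."""
    mapped = map_phase(lines)
    shuffled = shuffle_phase(mapped)
    reduced = reduce_phase(shuffled)
    return {"map": mapped, "shuffle": shuffled, "reduce": reduced}


if __name__ == "__main__":
    # Hypothetical test data standing in for a small sample of a real log.
    sample = ["GET /index.html", "GET /about.html", "POST /index.html"]
    for stage, data in run_with_snapshots(sample).items():
        print(stage, data)
```

Seeing the output of each stage on a small sample is often enough to catch a wrong key or a bad split before the job is sent to a cluster.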

Karmasphere Analyst is another tool that Karmasphere distributes; it is designed to simplify the process of working through all of the data in a Hadoop cluster. It comes with many useful building blocks for programming a good Hadoop job, like subroutines for uncompressing zipped log files. It then strings them together and parameterizes the Hive calls to produce a table of output for perusing (Russom 2011).
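The two building blocks mentioned above, uncompressing log files and parameterizing a Hive call, can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the table name, column names, and the `dt` partition column are hypothetical, and the code builds the query text without sending it to Hive.

```python
import gzip
from datetime import date
from string import Template


def read_gzipped_log(path):
    """Return the lines of a gzip-compressed log file as text."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


# Hypothetical query shape; the table, column, and partition names are assumptions.
HIVE_QUERY = Template(
    "SELECT $column, COUNT(*) AS hits "
    "FROM $table WHERE dt = '$day' "
    "GROUP BY $column"
)


def build_query(table, column, day):
    """Fill in the query template after basic checks on every parameter."""
    for name in (table, column):
        # Allow only letters, digits, and underscores in identifiers.
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unsafe identifier: {name!r}")
    date.fromisoformat(day)  # raises ValueError if day is not YYYY-MM-DD
    return HIVE_QUERY.substitute(table=table, column=column, day=day)


if __name__ == "__main__":
    print(build_query("web_logs", "url", "2012-04-01"))
```

Validating the parameters before substitution keeps the sketch from pasting arbitrary text into the query, which matters as soon as the values come from a form or a config file.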

Talend Open Studio

While mostly invisible to users of BI platforms, ETL processes retrieve data from all operational systems and preprocess it for analysis and reporting tools.

Talend’s program offers connectors and components for the following:

Packaged applications (ERP, CRM, etc.), databases, mainframes, files, Web services, etc., to address the growing disparity of sources.
Data warehouses, data marts, and OLAP applications, for analysis, reporting, dashboarding, scorecarding, and so on.
Built-in advanced components for ETL, including string manipulations, slowly changing dimensions, automatic lookup handling, bulk loads support, and so on. Most connectors addressing each of the above needs are detailed in the Talend Open Studio Components Reference Guide 6.2.1 (2016).

Talend also offers an Eclipse-based IDE for stringing together data processing jobs with Hadoop. Its tools are designed to help with data management, data integration, and data quality, all with subroutines tuned to these jobs (Wayer 2012).

Talend Studio allows you to build up your jobs by dragging and dropping little icons onto a canvas. If you want to get an RSS feed, Talend’s component will fetch the RSS and add proxies if necessary. There are many components for accumulating information and many more for doing things like a “fuzzy match.” Then, you can generate the output results.
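For readers unfamiliar with the term, a fuzzy match pairs records whose text is similar but not identical. The sketch below shows the general idea using `difflib.SequenceMatcher` from the Python standard library. It is not Talend's component or algorithm; the function names, the sample company names, and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity score, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def fuzzy_match(record, candidates, threshold=0.8):
    """Return the most similar candidate, or None if nothing reaches the threshold."""
    best = max(candidates, key=lambda c: similarity(record, c), default=None)
    if best is not None and similarity(record, best) >= threshold:
        return best
    return None


if __name__ == "__main__":
    # Hypothetical reference list of customer names.
    known = ["ACME Corp", "Apex Ltd"]
    print(fuzzy_match("Acme Corp.", known))
    print(fuzzy_match("Zebra", known))
```

The threshold is the main tuning knob: set it too low and unrelated records are merged, too high and obvious variants are missed.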

After you get a feel for what the components actually do and don’t do, stringing together blocks visually can be simple. This became easier to figure out when I started looking at the source code being assembled behind the canvas. Talend lets you see this, and I think it’s an ideal compromise. Visual programming may seem like a lofty goal, but I’ve found that the icons can never represent the mechanisms with enough detail to make it possible to understand what’s going on. I need the source code.

Talend also maintains a collection of open source extensions that make it easier to work with the company’s products. These are known collectively as TalendForge. Most of the tools seem to be filters or libraries that link Talend’s software to other major products, such as SugarCRM and Salesforce.com. You can simplify the integration by bringing the information from these systems into your own projects.

Skytree Server

Skytree delivers a bundle that performs many of the more advanced machine learning algorithms. It is driven by typing the correct commands into a command line interface (CLI).

Skytree concentrates on the underlying logic rather than on a shiny graphical user interface. The Skytree server is optimized to run a number of classic machine learning algorithms on your data, using an implementation that the company claims can be 10,000 times faster than other packages. It looks for clusters of mathematically similar items while searching through your data, then inverts this information to identify outliers that may be opportunities, problems, or both. The algorithms can search through vast quantities of data looking for the entries that are a bit out of the ordinary, and they can be more precise than humans. Such an entry may be a fraudulent transaction, or it may be a particularly good customer who will spend and spend.
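The cluster-then-invert idea can be illustrated with a small density check in plain Python. This is not Skytree's algorithm; it is a simplified neighbor count, loosely in the spirit of how DBSCAN treats isolated points as noise. The radius, the neighbor threshold, and the sample points are assumptions chosen for the example.

```python
import math


def neighbors_within(points, index, radius):
    """Count the other points that lie within `radius` of points[index]."""
    center = points[index]
    return sum(
        1
        for j, other in enumerate(points)
        if j != index and math.dist(center, other) <= radius
    )


def find_outliers(points, radius=1.5, min_neighbors=2):
    """Return the points that have fewer than `min_neighbors` nearby points."""
    return [
        point
        for i, point in enumerate(points)
        if neighbors_within(points, i, radius) < min_neighbors
    ]


if __name__ == "__main__":
    # Two tight groups of made-up points plus one isolated point.
    data = [
        (1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
        (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
        (9.0, 0.0),
    ]
    print(find_outliers(data))
```

Points that sit inside a dense group are treated as ordinary; whatever is left over is the short list worth a closer look. The pairwise comparison here is quadratic in the number of points, so it is only suitable for small samples.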

The proprietary and free versions of the software offer the same algorithms, but the free version is limited to data sets of 100,000 rows. This should be sufficient to establish whether the software is a good match for your organization’s needs.

Tableau Desktop and Server

Tableau Desktop is a data visualization tool that makes it easier to look at your data in new ways, then apply actions to it and look at it in a different way. You can even combine the data with other data sets and examine it in yet another way. The tool is optimized to give you all the columns for the data and let you mix them before stuffing them into one of the many graphical templates or visualizations that are provided.

Tableau Software started supporting Hadoop several versions ago, and now you can treat Hadoop “just like you would with any data connection.” Tableau relies upon Hive to structure the queries, then tries its best to cache as much information as possible in memory so that the tool can stay interactive. While many of the other reporting tools are built on a tradition of generating reports offline, Tableau wants to offer an interactive mechanism so that you can slice and dice your data again and again. Caching helps deal with some of the latency of a Hadoop cluster. The software is well-polished and aesthetically pleasing.
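Why caching helps with cluster latency can be shown with a small sketch. This is not Tableau's implementation; the class name, the fake backend, and its made-up rows are assumptions for illustration. The first time a query is seen it goes to the slow backend, and every repeat is answered from memory.

```python
import time


class CachedSource:
    """Wrap a slow query function and remember results by query text."""

    def __init__(self, run_query):
        self._run_query = run_query
        self._cache = {}
        self.backend_calls = 0

    def query(self, sql):
        """Return cached rows if this exact query was seen before."""
        if sql not in self._cache:
            self.backend_calls += 1
            self._cache[sql] = self._run_query(sql)
        return self._cache[sql]


def fake_hive(sql):
    """Stand-in for a high-latency cluster query; the rows are invented."""
    time.sleep(0.01)
    return [("north", 120), ("south", 95)]


if __name__ == "__main__":
    source = CachedSource(fake_hive)
    source.query("SELECT region, COUNT(*) FROM sales GROUP BY region")
    source.query("SELECT region, COUNT(*) FROM sales GROUP BY region")
    print(source.backend_calls)
```

A real tool also has to decide when cached results are stale; this sketch never expires anything, which is only acceptable for data that does not change during a session.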

Splunk

Splunk is quite different from the other available Big Data tools. It is not exactly a collection of AI routines or a report-generating tool, although it achieves much of that along the way. It creates an index of your data as if your data were a block of text or a book. Though we all know that databases also build indices, the approach that Splunk uses is much closer to a text search process.

This indexing is surprisingly flexible. Splunk comes already tuned to particular applications, so it collects log files easily and makes sense of them. It is also sold in a number of different solution packages, including one for detecting Web attacks and another for monitoring a Microsoft Exchange server. The index helps associate the data in these and several other common server-side scenarios.

Splunk searches the index for the text strings you give it. You might type in the URLs of important articles or an IP address. Splunk finds the matching entries and packages them into a timeline built around the time stamps it discovers in the data. All other fields are associated, and you can click around to drill deeper and deeper into the data set. While it seems like a simple process, it is quite powerful if you are looking for the right kind of indicator in your data feed. If you know the right text string, Splunk will help you track it. Log files are a great application for it.
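The search-then-timeline behavior described above can be approximated in a few lines of Python. This is a sketch of the general technique (an inverted index over text tokens), not Splunk's engine or its query language. The log format, the `YYYY-MM-DD HH:MM:SS` time stamp pattern, and the sample lines are assumptions.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed time stamp format; real logs vary widely.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def build_index(lines):
    """Map every whitespace-separated token to the line numbers containing it."""
    index = defaultdict(set)
    for number, line in enumerate(lines):
        for token in line.split():
            index[token].add(number)
    return index


def timeline(lines, index, term):
    """Return (timestamp, line) pairs for lines containing `term`, oldest first."""
    hits = []
    for number in index.get(term, ()):
        match = TIMESTAMP.search(lines[number])
        if match:
            stamp = datetime.strptime(match.group(), "%Y-%m-%d %H:%M:%S")
            hits.append((stamp, lines[number]))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical web server log lines.
    logs = [
        "2012-04-01 10:05:00 10.0.0.7 GET /pricing",
        "2012-04-01 09:58:12 10.0.0.9 GET /index",
        "2012-04-01 09:30:45 10.0.0.7 GET /index",
    ]
    for stamp, line in timeline(logs, build_index(logs), "10.0.0.7"):
        print(stamp, line)
```

Because the index is keyed on raw text tokens rather than on a fixed schema, the same lookup works for an IP address, a URL, or any other string that appears in the logs.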

Shep, a new Splunk tool currently in private beta testing, promises bidirectional integration between Splunk and Hadoop, allowing you to query Splunk data from Hadoop and exchange data between the systems.

The Digital Media Strategy Blog
