
One of the key findings in our latest benchmark research into predictive analytics is that companies are incorporating predictive analytics into their operational systems more often than was the case three years ago. The research found that companies are less inclined to purchase stand-alone predictive analytics tools (29% vs 44% three years ago) and more inclined to purchase predictive analytics built into business intelligence systems (23% vs 20%), applications (12% vs 8%), databases (9% vs 7%) and middleware (9% vs 2%). This trend is not surprising since operationalizing predictive analytics – that is, building predictive analytics directly into business process workflows – improves companies’ ability to gain competitive advantage: those that deploy predictive analytics within business processes are more likely to say they gain competitive advantage and improve revenue through predictive analytics than those that don’t.

In order to understand the shift that is underway, it is important to understand how predictive analytics has historically been executed within organizations. The marketing organization provides a useful example since it is the functional area where organizations most often deploy predictive analytics today. In a typical organization, those doing statistical analysis will export data from various sources into a flat file. (Often IT is responsible for pulling the data from the relational databases and passing it over to the statistician in a flat file format.) Data is cleansed, transformed, and merged so that the analytic data set is in a normalized format. It then is modeled with stand-alone tools and the model is applied to records to yield probability scores. In the case of a churn model, such a probability score represents how likely someone is to defect. For a marketing campaign, a probability score tells the marketer how likely someone is to respond to an offer. These scores are produced for marketers on a periodic basis – usually monthly. Marketers then work on the campaigns informed by these static models and scores until the cycle repeats itself.

The challenge presented by this traditional model is that a lot can happen in a month and the heavy reliance on process and people can hinder the organization’s ability to respond quickly to opportunities and threats. This is particularly true in fast-moving consumer categories such as telecommunications or retail. For instance, if a person visits the company’s cancelation policy web page the instant before he or she picks up the phone to cancel the contract, this customer’s churn score will change dramatically and the action that the call center agent should take will need to change as well. Perhaps, for example, that score change should mean that the person is now routed directly to an agent trained to deal with possible defections. But such operational integration requires that the analytic software be integrated with the call agent software and web tracking software in near-real time.
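To make the call-routing scenario concrete, here is a minimal Python sketch. The event names, score adjustments, threshold and queue names are all hypothetical values chosen for illustration; a real deployment would take them from the churn model and from the call center and web tracking systems.

```python
# Hypothetical illustration: adjust a churn score when a web event arrives,
# then choose a call-center queue. All numbers and names are assumptions.

RETENTION_THRESHOLD = 0.6  # assumed cut-off for routing to a retention agent

# Assumed score adjustments per tracked web event (not from any real model).
EVENT_ADJUSTMENTS = {
    "viewed_cancellation_policy": 0.40,
    "viewed_upgrade_offers": -0.10,
}


def update_churn_score(base_score: float, events: list[str]) -> float:
    """Apply event adjustments to a batch score, clamped to the range [0, 1]."""
    score = base_score
    for event in events:
        score += EVENT_ADJUSTMENTS.get(event, 0.0)
    return max(0.0, min(1.0, score))


def route_call(score: float) -> str:
    """Pick a queue name based on the current churn score."""
    if score >= RETENTION_THRESHOLD:
        return "retention_specialist"
    return "general_queue"


monthly_score = 0.25  # score produced by the periodic batch process
live_score = update_churn_score(monthly_score, ["viewed_cancellation_policy"])

print(route_call(monthly_score))  # routing based on the stale monthly score
print(route_call(live_score))     # routing once the web event is considered
```

The point of the sketch is only the shape of the integration: the score has to be recomputed between the web event and the phone call, which a monthly batch cycle cannot do.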

Similarly, the models themselves need to be constantly updated to deal with the fast pace of change. For instance, if a telecommunications carrier competitor offers a large rebate to customers to switch service providers, an organization’s churn model can be rendered out of date and should be updated. Our research shows that organizations that constantly update their models gain competitive advantage more often than those that only update them periodically (86% vs 60% average), more often show significant improvement in organizational activities and processes (73% vs 44%), and are more often very satisfied with their predictive analytics (57% vs 23%).

Building predictive analytics into business processes is more easily discussed than done; complex business and technical challenges must be addressed. The skills gap that I recently wrote about is a significant barrier to implementing predictive analytics. Making predictive analytics operational requires not only statistical and business skills but technical skills as well. From a technical perspective, one of the biggest challenges for operationalizing predictive analytics is accessing and preparing data, which I wrote about. Four out of ten companies say that this is the part of the predictive analytics process where they spend the most time. Choosing the right software is another challenge that I wrote about. Making that choice includes identifying the specific integration points with business intelligence systems, applications, database systems, and middleware. These decisions will depend on how people use the various systems and what areas of the organization are looking to operationalize predictive analytics processes.

For those that are willing to take on the challenges of operationalizing predictive analytics, the rewards can be substantial, including significantly better competitive positioning and new revenue opportunities. Furthermore, once predictive analytics is initially deployed in the organization it snowballs, with more than nine in ten companies going on to increase their use of predictive analytics. Once companies reach that stage, one third of them (32%) say predictive analytics has had a transformational impact and another half (49%) say it provides a significant positive benefit.


Ventana Research

Our benchmark research into predictive analytics shows that lack of resources, including budget and skills, is the number-one business barrier to the effective deployment and use of predictive analytics; awareness – that is, an understanding of how to apply predictive analytics to business problems – is second. In order to secure resources and address awareness problems a business case needs to be created and communicated clearly wherever appropriate across the organization. A business case presents the reasoning for initiating a project or task. A compelling business case communicates the nature of the proposed project and the arguments, both quantified and unquantifiable, for its deployment.

The first steps in creating a business case for predictive analytics are to understand the audience and to communicate with the experts who will be involved in leading the project. Predictive analytics can be transformational in nature and therefore the audience potentially is broad, including many disciplines within the organization. Understand who should be involved in business case creation, a list that may include business users, analytics users and IT. Those most often primarily responsible for designing and deploying predictive analytics are data scientists (in 31% of organizations), the business intelligence and data warehouse team (27%), those working in general IT (16%) and line-of-business analysts (13%), so be sure to involve these groups. Understand the specific value and challenges for each of the constituencies so the business case can represent the interests of these key stakeholders. I have discussed elsewhere the aspects of the business where these groups will see predictive analytics adding the most value.

For the business case for a predictive analytics deployment to be persuasive, executives also must understand how specifically the deployment will impact their areas of responsibility and what the return on investment will be. For these stakeholders, the argument should be multifaceted. At a high level, the business case should explain why predictive analytics is important and how it fits with and enhances the organization’s overall business plan. Industry benchmark research and relevant case studies can be used to paint a picture of what predictive analytics can do for marketing (48%), operations (44%) and IT (40%), the functions where predictive analytics is used most.

A business case should show how predictive analytics relates to other relevant innovation and analytic initiatives in the company. For instance, companies have been spending money on big data, cloud and visualization initiatives where software returns can be more difficult to quantify. Our research into big data analytics and data and analytics in the cloud shows that the top benefits of these initiatives are communication and knowledge sharing. Fortunately, the business case for predictive analytics can cite the tangible business benefits our research identified, the most often identified of which are achieving competitive advantage (57%), creating new revenue opportunities (50%), and increasing profitability (46%). But the business case can be made even stronger by noting that predictive analytics can have added value when it is used to leverage other current technology investments. For instance, our big data analytics research shows that the most valuable type of analytics to be applied to big data is predictive analytics.

To craft the specifics of the business case, concisely define the business issue that will be addressed. Assess the current environment and offer a gap analysis to show the difference between the current environment and the future environment. Offer a recommended solution, but also offer alternatives. Detail the specific value propositions associated with the change. Create a financial analysis summarizing costs and benefits. Support the analysis with a timeline including roles and responsibilities. Finally, detail the major risk factors and opportunity costs associated with the project.

For complex initiatives, break the overall project into a series of shorter projects. If the business case is for a project that will involve substantial work, consider providing separate timelines and deliverables for each phase. Doing so will keep stakeholders both informed and engaged during the time it takes to complete the full project. For large predictive analytics projects, it is important to break out the due-diligence phase and try not to make any hard commitments until that phase is completed. After all, it is difficult to establish defensible budgets and timelines until one knows the complete scope of the project.

Ensure that the project timeline is realistic and addresses all the key components needed for a successful deployment. In particular with predictive analytics projects, make certain that it reflects a thoughtful approach to data access, data quality and data preparation. We note that four in 10 organizations say that the most time spent in the predictive analytics process is in data preparation and another 22 percent say that they spend the most time accessing data sources. If data issues have not been well thought through, it is next to impossible for the predictive analytics initiative to be successful. Read my recent piece on operationalizing predictive analytics for ways to show how predictive analytics will align with specific business processes.

If you are proposing the implementation of new predictive analytics software, highlight the multiple areas of return beyond competitive advantage and revenue benefits. Specifically, new software can have a lower total cost of ownership and generate direct cost savings from improved operating efficiencies. A software deployment also can yield benefits related to people (productivity, insight, fewer errors), management (creativity, speed of response), process (shorter time on task or time to complete) and information (easier access, more timely, accurate and consistent). Create a comprehensive list of the major benefits the software will provide compared to the existing approach, quantifying the impact wherever possible. Detail all major costs of ownership whether the implementation is on-premises or cloud-based: these will include licensing, maintenance, implementation consulting, internal deployment resources, training, hardware and other infrastructure costs. In other words, think broadly about both the costs and the sources of return in building the case for new technology. Also, read my recent piece on procuring predictive analytics software.

Understanding the audience, painting the vision, crafting the specific case, outlining areas of return, specifying software, noting risk factors, and being as comprehensive as possible are all part of a successful business plan process. Sometimes, the initial phase is really just a pitch for project funding and there won’t be any dollar allocation until people are convinced that the program will get them what they need. In such situations multiple documents may be required, including a short one- to two-page document that outlines the vision and makes a high-level argument for action from the organizational stakeholders. Once a cross-functional team and executive support are in place, a more formal assessment and design plan following the principles above will have to be built.

Predictive analytics offers significant returns for organizations willing to pursue it, but establishing a solid business case is the first step for any organization.


Ventana Research

Like every large technology corporation today, IBM faces an innovator’s dilemma in at least some of its business. That phrase comes from Clayton Christensen’s seminal work, The Innovator’s Dilemma, originally published in 1997, which documents the dynamics of disruptive markets and their impacts on organizations. Christensen makes the key point that an innovative company can succeed or fail depending on what it does with the cash generated by continuing operations. In the case of IBM, it puts around US$6 billion a year into research and development; in recent years much of this investment has gone into research on big data and analytics, two of the hottest areas in 21st century business technology. At the company’s recent Information On Demand (IOD) conference in Las Vegas, presenters showed off much of this innovative portfolio.

At the top of the list is Project Neo, which will go into beta release early in 2014. Its purpose is to fill the skills gap related to big data analytics, which our benchmark research into big data shows is held back most by lack of knowledgeable staff (79%) and lack of training (77%). The skills situation can be characterized as a three-legged stool of domain knowledge (that is, line-of-business knowledge), statistical knowledge and technological knowledge. With Project Neo, IBM aims to reduce the technological and statistical demands on the domain expert and empower that person to use big data analytics in service of a particular outcome, such as reducing customer churn or presenting the next best offer. In particular, Neo focuses on multiple areas of discovery, which my colleague Mark Smith outlined. Most of the industry discussion about simplifying analytics has revolved around visualization rather than data discovery, which applies analytics that go beyond visualization, or information discovery, which addresses how we find and access information in a highly distributed environment. These areas are the next logical steps after visualization for software vendors to address, and IBM takes them seriously with Neo.

At the heart of Neo are the same capabilities found in IBM’s SPSS Analytic Catalyst, which won the 2013 Ventana Research Innovation Award for analytics and which I wrote about. It also includes IBM’s BLU acceleration against the DB2 database, an in-memory optimization technique, which I have discussed as well, that provides access to the analysis of large data sets. The company’s Vivisimo acquisition, which is now called InfoSphere Data Explorer, adds information discovery capabilities. Finally, the Rapid Adaptive Visualization Engine (RAVE), which is IBM’s visualization approach across its portfolio, is layered on top for fast, extensible visualizations. Neo itself is a work in progress currently offered only over the cloud and back-ended by the DB2 database. However, following the acquisition earlier this year of SoftLayer, which provides a cloud infrastructure platform, I would expect IBM to extend Neo so that it can access more sources than just data loaded into IBM DB2.

IBM also recently started shipping SPSS Modeler 16.0. IBM bought SPSS in 2009 and has invested in Modeler heavily. Modeler (formerly SPSS Clementine) is an analytic workflow tool akin to others in the market such as SAS Enterprise Miner, Alteryx and more recent entries such as SAP Lumira. SPSS Modeler enables analysts at multiple levels to interact on analytics and do both data exploration and predictive analytics. Analysts can move data from multiple sources and integrate it into one analytic workflow. These are critical capabilities, as our predictive analytics benchmark research shows: the biggest challenges to predictive analytics are architectural integration (for 55% of organizations) and lack of access to necessary source data (35%).

IBM has made SPSS the centerpiece of its analytic portfolio and offers it at three levels: Professional, Premium and Gold. With the top-level Gold edition, Modeler 16.0 includes capabilities that are ahead of the market: run-time integration with InfoSphere Streams (IBM’s complex event processing product), IBM’s Analytics Decision Management (ADM) and the information optimization capabilities of G2, a skunkworks project led by Jeff Jonas, chief scientist of IBM’s Entity Analytics Group.

Integration with InfoSphere Streams, which won a Ventana Research Technology Innovation Award in 2013, enables event processing to occur in an analytic workflow within Modeler. This is a particularly compelling capability as the so-called “Internet of things” begins to evolve and the ability to correlate multiple events in real time becomes crucial. In such real-time environments, where latency is often measured in milliseconds, events cannot be pushed back into a database to wait for analysis.

Decision management is another part of SPSS Modeler. Once models are built, users need to deploy them, which often entails steps such as integrating with rules and optimizing parameters. In a next best offer situation in a retail banking environment, for instance, a potential customer may score highly on propensity to take out a mortgage and buy a house, but other information shows that the person would not qualify for the loan. In this case, the model itself would suggest telling the customer about mortgage offers, but the rules engine would override it and find another offer to discuss. In addition, there are times when optimization exercises such as Monte Carlo simulations are needed to help figure out parameters such as risk using “what-if” modelling. In many situations, to gain competitive advantage, all of these capabilities must be rolled into a production environment where individual records are scored in real time against the organization’s database and integrated with a front-end system such as a call center application. The net capability that IBM’s ADM brings is the ability to deploy analytical models into the business without consuming significant resources.
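The mortgage example can be sketched in a few lines of Python. This is only an illustration of a rules layer sitting on top of model scores; the offer names, propensity values and the eligibility rule are hypothetical, and nothing here reflects how IBM’s ADM is actually implemented.

```python
# Hypothetical sketch of a rules layer overriding model scores in a
# next-best-offer decision. Offer names, scores and the eligibility rule
# are illustrative assumptions, not IBM ADM behavior.

def is_eligible(offer: str, customer: dict) -> bool:
    """Business rules that can veto an offer regardless of its model score."""
    if offer == "mortgage":
        return customer["credit_score"] >= 640  # assumed qualification rule
    return True


def next_best_offer(propensities: dict, customer: dict):
    """Return the highest-scoring offer that also passes the business rules."""
    ranked = sorted(propensities, key=propensities.get, reverse=True)
    for offer in ranked:
        if is_eligible(offer, customer):
            return offer
    return None  # no offer survives the rules


customer = {"credit_score": 580}
propensities = {"mortgage": 0.82, "savings_account": 0.47, "credit_card": 0.31}

# The model alone would pick "mortgage"; the rule removes it from contention.
print(next_best_offer(propensities, customer))
```

Keeping the rules separate from the model means either can change without retraining or redeploying the other, which is one reason the two are usually managed as distinct components.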

G2 is a part of Modeler and was developed in IBM’s Entity Analytics Group. The group is garnering a lot of attention both internally and externally for its work around “entity analytics” – the idea that each information entity has characteristics that are revealed only in contextual information – charting innovative methods in the areas of data integration and privacy. In the context of Modeler this has important implications for bringing together disparate data sources that naturally link together but otherwise would be treated separately. A core example is an individual who has multiple email addresses in different databases, has moved, or has changed names, perhaps due to a new marital status. Through machine-learning processes and analysis of the surrounding data, G2 can match records and attach them with some certainty to one individual. The system also strips out personally identifiable information (PII) to meet privacy and compliance standards. Such capabilities are critical for business as our latest benchmark research on information optimization shows that two in five organizations have more than 10 different data sources that they need to integrate and that the ability to simplify access to these systems is important to virtually all organizations (97%).
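For readers unfamiliar with record matching, the following Python sketch shows the basic idea in its simplest form: records that share an email address or phone number are linked into one entity. This is deliberately naive deterministic matching on made-up records, not the machine-learning approach G2 uses, and every field name and value is an assumption for illustration.

```python
# Hypothetical sketch: link records that share an email or phone number.
# Simple deterministic matching for illustration only, not the G2 approach.

records = [
    {"id": "crm-1", "name": "Jane Smith", "email": "jane@example.com", "phone": "555-0100"},
    {"id": "web-7", "name": "Jane Doe", "email": "jdoe@example.com", "phone": "555-0100"},
    {"id": "bill-3", "name": "J. Doe", "email": "jdoe@example.com", "phone": None},
    {"id": "crm-9", "name": "Sam Lee", "email": "sam@example.com", "phone": "555-0199"},
]

# Union-find structure: each record starts as its own entity.
parent = {r["id"]: r["id"] for r in records}


def find(record_id: str) -> str:
    """Follow parent links to the representative record of an entity."""
    while parent[record_id] != record_id:
        parent[record_id] = parent[parent[record_id]]  # path compression
        record_id = parent[record_id]
    return record_id


def union(a: str, b: str) -> None:
    """Merge the entities containing records a and b."""
    parent[find(a)] = find(b)


# Link any two records that share the same non-empty email or phone value.
first_seen = {}
for record in records:
    for field in ("email", "phone"):
        value = record[field]
        if value is None:
            continue
        key = (field, value)
        if key in first_seen:
            union(record["id"], first_seen[key])
        else:
            first_seen[key] = record["id"]

# Collect record ids per resolved entity.
entities = {}
for record in records:
    entities.setdefault(find(record["id"]), []).append(record["id"])
groups = sorted(sorted(ids) for ids in entities.values())

print(groups)
```

Note that the first and third records share no field directly; they end up in the same entity only because the second record links them. Real entity resolution has to weigh such transitive evidence probabilistically rather than accept it outright.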

With the above capabilities, SPSS Modeler Gold edition achieves market differentiation, but IBM still needs to show the advantage of base editions such as Modeler Professional. The marketing issue for SPSS Modeler is that it is considered a luxury car in a market being infiltrated by compacts and kit cars. In the latter case there is the R programming language, which is open-source and ostensibly free, but the challenge here is that companies need R programmers to run everything. SPSS Modeler and other such visually oriented tools (many of which integrate with open source R) allow easier collaboration on analytics, and ultimately the path to value is shorter. Even at its base level Modeler is an easy-to-use and capable statistical analysis tool that allows for collaborative workgroups and is more mature than many others in the market.

Companies must consider predictive analytics capabilities or risk being left behind. Our research into predictive analytics shows that two-thirds of companies see predictive analytics as providing competitive advantage (68%) and particularly important in revenue-generating functions such as marketing (for 70%) and forecasting (72%). Companies currently looking into discovery analytics may want to try Neo, which will be available in beta in early 2014. Those interested in predictive analytics should consider the different levels of SPSS 16.0 as well as IBM’s flagship Signature Solutions, which I have covered. IBM has documented use cases that can give users guidance in terms of leading-edge deployment patterns and leveraging analytics for competitive advantage. If you have not taken a look at the depth of the analytic technology portfolio at IBM, I would make sure to do so; otherwise you might miss some fundamental advancements in the processing of data and analytics that provide the valuable insights required to operate effectively in the global marketplace.


Tony Cosentino

VP and Research Director

IBM’s SPSS Analytic Catalyst enables business users to conduct the kind of advanced analysis that has been reserved for expert users of statistical software. As analytic modeling becomes more important to businesses and models proliferate in organizations, the ability to give domain experts advanced analytic capabilities can condense the analytic process and make the results available sooner for business use. Benefiting from IBM’s research and development in natural-language processing and its statistical modeling expertise, IBM SPSS Analytic Catalyst can automatically choose an appropriate model, execute the model, test it and explain it in plain English.

Information about the skills gap in analytics and the need for more user-friendly tools indicates pent-up demand for this type of tool. Our benchmark research into big data shows that big data analytics is held back most by lack of knowledgeable staff (79%) and lack of training (77%).

In the case of SPSS Analytic Catalyst, the focus is on driver analysis. In its simplest form, a driver analysis aims to understand cause and effect among multiple variables. One challenge with driver analysis is to determine the method to use in each situation (choosing among, for example, linear or logistic regression, CART, CHAID or structural equation models). This is a complex decision which most organizations leave to the resident statistician or outsource to a professional analyst. Analytic Catalyst automates the task. It does not consider every method available, but that is not necessary. By examining the underlying data characteristics, it can address data sets, including what may be considered big data, with an appropriate algorithm. The benefit for nontechnical users is that Analytic Catalyst makes the decision on selecting the algorithm.

The tool condenses the analytic process into three steps: data upload, selection of the target variable (also called the dependent variable or outcome variable) and data exploration. Once the data is uploaded, the system selects target variables and automatically correlates and associates the data. Based on characteristics of the data, Analytic Catalyst chooses the appropriate method and returns summary data rather than statistical data. On the initial screen, it communicates so-called “top insights” in plain text and presents visuals, such as a decision tree in a churn analysis. Once the user has absorbed the top-level information, he or she can drill down into top key drivers. This enables users to see interactivity between attributes. Understanding this interactivity is an important part of driver analysis since causal variables often move together (a challenge known as multicollinearity) and it is sometimes hard to distinguish what is actually causing a particular outcome. For instance, analysis may blame the customer service department for a product defect and point to it as the primary driver of customer defection. Accepting this result, a company may mistakenly try to fix customer service when it is a product issue that needs to be addressed. This approach also overcomes the challenge of Simpson’s paradox, which is a hindrance for some visualization tools in the market. On subsequent navigations, Analytic Catalyst goes even further into how different independent variables move together, even if they do not directly explain the outcome variable.
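Simpson’s paradox is easier to grasp with numbers. The Python sketch below uses hypothetical retention counts for two support channels, split by customer risk segment; the counts are invented purely to show the effect and do not come from any research or product. Within each segment one channel retains customers better, yet the aggregate figures point the other way because the channels serve very different mixes of customers.

```python
# Hypothetical counts illustrating Simpson's paradox in a retention analysis.
# Each entry is (customers retained, customers handled). Values are invented.

data = {
    "phone": {"low_risk": (81, 87), "high_risk": (192, 263)},
    "chat": {"low_risk": (234, 270), "high_risk": (55, 80)},
}


def segment_rate(channel: str, segment: str) -> float:
    """Retention rate for one channel within one risk segment."""
    retained, handled = data[channel][segment]
    return retained / handled


def overall_rate(channel: str) -> float:
    """Retention rate for one channel with the segments pooled together."""
    retained = sum(r for r, _ in data[channel].values())
    handled = sum(h for _, h in data[channel].values())
    return retained / handled


for segment in ("low_risk", "high_risk"):
    print(segment,
          round(segment_rate("phone", segment), 3),
          round(segment_rate("chat", segment), 3))

print("overall",
      round(overall_rate("phone"), 3),
      round(overall_rate("chat"), 3))
```

An analysis that looks only at the pooled rates would favor chat, while one that conditions on the risk segment would favor phone in both groups. A driver analysis that examines how attributes interact is less likely to be misled by the pooled view.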

Beyond the ability to automate modeling and enable exploration of data, I like that this new tool is suitable for both statistically inclined users (who can use it to get r-scores, model parameters or other data) and business users (for whom visualizations and natural language explain what things mean). Thus it enables cross-functional conversations and allows the domain expert to own the overall analysis.

I also like the second column of the “top key driver” screen, through which users can drill down into different questions regarding the data. Having a complete question set, the analyst can simply back out of one question and dive into another. The iterative process aligns naturally with the concept of data exploration.

IBM seems to be positioning the tool to help with early-stage analysis. From the examples I’ve seen, however, I think Analytic Catalyst would work well also as a back-end tool for marketers trying to increase wallet share through specific campaigns or for efforts by operations personnel to reduce churn by creating predefined actions at the point of service for particular at-risk customer populations.

IBM will need to continue to work with Analytic Catalyst to get it integrated with other tools and ensure that it keeps the user experience in mind. Usability is the key buying criterion for nearly two-thirds (64%) of companies, according to our benchmark research into next-generation business intelligence.

It is important that the data models align with other models in the organization, such as customer value models, so that the right populations are targeted. Otherwise a marketer or operations person would likely need to figure this out in a different system, such as a BI tool. Also that user would have to put the analytical output into another system, such as a campaign management or business process tool, to make it actionable. Toward this end, I expect that IBM is working to integrate this product within its own portfolio and those of its partners.

SPSS Analytic Catalyst has leaped over the competition in putting sophisticated driver analytics into natural language that can guide almost any user through complex analytic scenarios. However, competitors are not standing still. Some are working on similar tools that apply natural language to sophisticated commodity modeling approaches, and many of the visual discovery vendors have similar but less optimized approaches. With the less sophisticated approaches, the question comes down to optimizing vs. satisfying. Other tools in the market satisfy the basic need for driver analysis (usually approached through simple correlation or one type of decision tree), but a more dynamic approach to driver analysis such as offered by IBM can reveal deeper understanding of the data. The answer will depend on an organization and its user group, but in fast-moving markets and scenarios where analytics is a key differentiator, this is a critical question to consider.


Tony Cosentino

VP and Research Director

The challenge with discussing big data analytics is in cutting through the ambiguity that surrounds the term. People often focus on the 3 Vs of big data – volume, variety and velocity – which provide a good lens for big data technology but get us only part of the way to understanding big data analytics, and provide even less guidance on how to take advantage of big data analytics to unlock business value.

Part of the challenge of defining big data analytics is a lack of clarity around the big data analytics value chain – from data sources, to analytic scalability, to analytic processes and access methods. Our recent research on big data finds that many capabilities are still not available, from predictive analytics (41%) to visualization (37%). Moreover, organizations are unclear on how best to initiate changes in the way they approach data analysis to take advantage of big data and what processes and technologies they ought to be using. The growth in use of appliances, Hadoop and in-memory databases and the growing footprints of RDBMSes all add up to pressure to have more intelligent analytics, but the most direct and cost-effective path from here to there is unclear. What is certain is that as business analytics and big data increasingly merge, the potential for increased value is building expectations.

To understand the organizational chasm that exists with respect to big data analytics, it’s important to understand two foundational analytic approaches that are used in organizations today. Former Census Bureau Director Robert Groves’s ideas around designed data and organic data give us a great jumping-off point for this discussion, especially as it relates to big data analytics.

In Groves’s estimation, the 20th century was about designed data, or what might be considered hypothesis-driven data. With designed data we engage in analytics by establishing a hypothesis and collecting data to prove or disprove it. Designed data is at the heart of confirmatory analytics, where we go out and collect data that are relevant to the assumptions we have already made. Designed data is often considered the domain of the statistician, but it is also at the heart of structured databases, since we assume that all of our data can fit into columns and rows and be modeled in a relational manner.

In contrast to the designed data approach of the 20th century, the 21st century is about organic data. Organic data is data that is not limited by a specific frame of reference that we apply to it, and because of this it grows without limits and without any structure other than that structure provided by randomness and probability. Organic data represents all data in the world, but for pragmatic reasons we may think of it as all the data we are able to instrument. RFID, GPS data, sensor data, sentiment data and various types of machine data are all organic data sources that may be characterized by context or by attributes such as data sparsity (also known as low-density data). Much like the interpretation of silence in a conversation, analyzing big data is as much about interpreting that which exists between the lines as it is about what we can put on the line itself.

These two types of data and the analytics associated with them reveal the chasm that exists within organizations and shed light on the skills gap that our predictive analytics benchmark research shows to be the primary challenge for analytics in organizations today. This research finds inadequate support in many areas, including product training (26%) and how to apply predictive analytics to business problems (23%).

On one side of the chasm are the business groups and the analysts who are aligned with Groves’s idea of designed data. These groups may encompass domain experts in areas such as finance or marketing, advanced Excel users, and even Ph.D.-level statisticians. These analysts serve organizational decision-makers and are tied closely to actionable insights that lead to specific business outcomes. The primary way they get work done is through a flat file environment, as was outlined in some detail last week by my colleague Mark Smith. In this environment, Excel is often the lowest common denominator.

On the other side of the chasm exist the IT and database professionals, where a different analytical culture and mindset exist. The priority challenge for this group is dealing with the three Vs of big data (volume, velocity and variety) and simply organizing data into a legitimate enterprise data set. This group is often more comfortable with large data sets and machine learning approaches that are the hallmark of the organic data of the 21st century. Their analytical environment is different from that of their business counterparts; rather than Excel, it is SQL that is often the lowest common denominator.

As I wrote in a recent blog post, database professionals and business analytics practitioners have long lived in parallel universes. In technology, practitioners deal with tables, joins and the ETL process. In business analysis, practitioners deal with datasets, merges and data preparation. When you think about it, these are the same things. The subtle difference is that database professionals have had a data mining mindset, or, as Groves calls it, an organic data mindset, while the business analyst has had a designed data or statistics-driven mindset. The bigger differences revolve around the cultural mindset and the tools that are used to carry out the analytical objectives. These differences represent the current conundrum for organizations.
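
To make the point about tables and joins versus datasets and merges concrete, here is a minimal sketch in Python. The customer and order records, the table and column names, and both function names are hypothetical and exist only for illustration. The first function does the work the database way with a SQL join; the second does it the analyst way by merging two in-memory datasets on a shared key.

```python
import sqlite3

# Hypothetical sample data: customers and their orders.
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(1, 250.0), (1, 100.0), (3, 75.0)]


def sql_join(customers, orders):
    """Database view: load two tables and join them on customer_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    rows = conn.execute(
        "SELECT c.customer_id, c.name, o.amount "
        "FROM customers c JOIN orders o ON c.customer_id = o.customer_id "
        "ORDER BY c.customer_id, o.amount"
    ).fetchall()
    conn.close()
    return rows


def dataset_merge(customers, orders):
    """Analyst view: merge two datasets on a shared key."""
    names = dict(customers)  # key -> name lookup
    merged = [(cid, names[cid], amount) for cid, amount in orders if cid in names]
    return sorted(merged, key=lambda row: (row[0], row[2]))


join_rows = sql_join(customers, orders)
merge_rows = dataset_merge(customers, orders)
```

Both routes yield the same three matched rows, which suggests that the gap between the two groups lies more in vocabulary and tooling than in the underlying operation.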

In a world of big data analytics, these two sides of the chasm are being pushed together in a shotgun wedding because the marriage of these groups is how competitive advantage is achieved. Both groups have critical contributions to make, but need to figure out how to work together before they can truly realize the benefits of big data analytics. The firms that understand that the merging of these different analytical cultures is the primary challenge facing the analytics organization, and that develop approaches that deal with this challenge, will take the lead in big data analytics. We already see this as a primary focus area for leading professional services organizations.

In my next analyst perspective on big data I will lay out some pragmatic approaches companies are using to address this big data analytics chasm; these also represent the focus of the benchmark research we’re currently designing to understand organizational best practices in big data analytics.


Tony Cosentino

VP and Research Director

Revolution Analytics is a commercial provider of software and services related to enterprise implementations of the open source language R. At its base level, R is a programming language built by statisticians for statistical analysis, data mining and predictive analytics. In a broader sense, it is data analysis software used by data scientists to access data, develop and perform statistical modeling and visualize data. The R community has a growing user base of more than two million worldwide, and more than 4,000 available applications cover specific problem domains across industries. Both the R Project and Revolution Analytics have significant momentum in the enterprise and in academia.

Revolution Analytics provides value by taking the most recent release from the R community and adding scalability and other functionality so that R can be implemented and seamlessly work in a commercial environment. Revolution R provides a development environment so that data scientists can write and debug R code more effectively, and web service APIs that integrate with other BI tools and dashboards so that R can work with business intelligence tools and visual discovery tools. In addition, Revolution Analytics makes money through professional and support services.

Companies are collecting enormous amounts of data, but few have active big data analytics strategies. Our big data benchmark research shows that more than 50 percent of companies in our sample maintain more than 10TB of data, but often they cannot analyze the data due to scale issues. Furthermore, our research into predictive analytics says that integrating into the current architecture is the biggest obstacle facing the implementation of predictive analytics.

Revolution Analytics helps address these challenges in a few ways. It can perform file-based analytics, where a single node orchestrates commands across a cluster of commodity servers and delivers the results back to the end user. This is an on-premises solution that runs on Linux clusters or Microsoft HPC clusters. A perhaps more exciting use case is alignment with the Hadoop MapReduce paradigm, where Revolution Analytics allows for direct manipulation of the HDFS file system, can submit a job directly to the Hadoop JobTracker, and can directly define and manipulate analytical data frames through HBase database tables. When front-ended with a visualization tool such as Tableau, this ability to work with data directly in Hadoop becomes a powerful tool for big data analytics. A third use case has to do with the parallelization of computations within the database itself. This in-database approach is gaining a lot of traction for big data analysis primarily because it is the most efficient way to do analytics on very large structured datasets without moving a lot of data. For example, IBM’s PureData System for Analytics (IBM’s new name for its MPP Netezza appliance) uses the in-database approach with an R instance running on each processing unit in the database, each of which is connected to an R server via ODBC. The analytics are invoked as the data is served up to the processor such that the algorithms run in parallel across all of the data.
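
The general idea behind running an algorithm in parallel where the data lives can be sketched in a few lines of Python. This is an illustration only, not how Revolution R or the IBM appliance implements it; the function names and the sample data are hypothetical. Each partition computes a handful of running totals locally, only those totals travel to the coordinating node, and the regression is solved from the combined totals.

```python
def partial_sums(rows):
    """Computed where one partition lives: five running totals over (x, y) rows."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def combine(parts):
    """Add the per-partition totals element-wise on the coordinating node."""
    return tuple(sum(values) for values in zip(*parts))


def fit_line(totals):
    """Solve least-squares y = intercept + slope * x from the combined totals."""
    n, sx, sy, sxx, sxy = totals
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical data following y = 2 + 3x, split across three partitions.
data = [(float(x), 2.0 + 3.0 * x) for x in range(12)]
partitions = [data[0:4], data[4:8], data[8:12]]

# Partitioned fit: only five numbers per partition are moved.
intercept, slope = fit_line(combine([partial_sums(p) for p in partitions]))

# Single-pass fit over all rows, for comparison.
full_intercept, full_slope = fit_line(partial_sums(data))
```

The partitioned fit matches the single-pass fit because the totals are additive, which is why only a few numbers per partition need to move rather than the rows themselves.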

In the big data analytics context, speed and scale are critical drivers of success, and Revolution R delivers on both. It is built with the Intel Math Kernel Library, so that processing is streamlined for multithreading at the processor level and it can leverage multiple cores simultaneously. In test cases on a single node, open source R was able to scale to only about 400,000 observations in a linear regression model, while Revolution R Enterprise was able to go into the millions. With respect to speed, Revolution R 6.1 was able to conduct a principal component analysis in about 11 seconds versus 142 seconds with open source R 2.14.2. As similar tests are performed across multiple processors in a parallelized fashion, the observed performance difference increases.

So why does all of this matter? Our benchmark research into predictive analytics shows that companies that are able to score and update their models more efficiently show higher maturity and gain greater competitive advantage. From an analytical perspective, we can now squeeze more value out of our large data sets. We can analyze all of the data and take a census approach instead of a sampling approach, which in turn allows us to better understand the errors that exist in our models and identify outliers and patterns that are not linear in nature. Along with prediction, the ability to identify outliers is probably the most important capability, since seeing the data anomalies often leads to the biggest insights and competitive advantage. Most importantly, from a business perspective, we can apply models to understand things such as individual customer behavior, a company’s risk profile or how to take advantage of market uncertainty through tactical advantage.
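
The difference between a census approach and a sampling approach for outliers can be illustrated with a small Python sketch. The readings, the position of the anomaly, the 1-in-10 sampling rule and the threshold are all hypothetical choices made for illustration; the outlier rule used here (distance from the median measured in median absolute deviations) is one common technique among many.

```python
import statistics


def flag_outliers(values, threshold=5.0):
    """Return positions whose distance from the median exceeds threshold * MAD."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    return [i for i, v in enumerate(values) if abs(v - median) > threshold * mad]


# Hypothetical readings: 1,000 routine values plus one anomaly at position 777.
readings = [100 + (i % 7) for i in range(1000)]
readings[777] = 500

# Sampling approach: look at every tenth record only.
sample = readings[::10]
sample_hits = flag_outliers(sample)

# Census approach: look at every record.
census_hits = flag_outliers(readings)
```

In this constructed case the sample never contains the anomalous record, so `sample_hits` is empty, while the full scan flags position 777. Real data will not be this tidy, but the mechanism is the same: a rare record can only be found if it is actually examined.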

I’ve heard that Revolution R isn’t always the easiest software to use and that the experience isn’t exactly seamless, but it can be argued that in the cutting-edge field of big data analytics a challenging environment is to be expected. If Revolution Analytics can address some of these usability challenges, it may find its pie growing even faster than it is now. Regardless, I anticipate that Revolution Analytics will continue its fast growth (already its headcount is doubling year-over-year). Furthermore, I anticipate that in-database analytics (an area where R really shines) will become the de facto approach to big data analytics and that companies that take full advantage of that trend will reap benefits.


Tony Cosentino

VP and Research Director

IBM acquired SPSS in late 2009 and has been investing steadily in the business as a key component of its overall business analytics portfolio. Today, IBM SPSS provides an integrated approach to predictive analytics through four distinct software packages: SPSS Data Collection, SPSS Statistics, SPSS Modeler and SPSS Decision Management. IBM SPSS is also integrated with Cognos Insight, IBM’s entry into the visual discovery arena.

Our benchmark research into predictive analytics shows that companies are struggling with two core issues: a skills shortage related to predictive analytics and integration of predictive analytics into their information architecture. A preliminary look at IBM’s SPSS software makes it obvious to me that IBM is putting its full weight behind addressing both of these issues.

Predictive analytics is a hot term in business today, but there is still some debate about what it means. My blog entry on predictive analytics discusses findings from our research and the idea that the lines between predictive and descriptive analytics are becoming blurred. IBM provides an interesting take on this conversation by discussing predictive analytics in the context of data mining and statistics. Data mining it sees as more bottom-up and exploratory in nature (though it can also be predictive) and statistics as more of a top-down hypothesis-driven approach (though it can also use descriptive techniques).

If you use SPSS Modeler you don’t have to be a data scientist to participate in predictive analytics discussions. Once data is loaded into the modeler and a few preliminary questions are answered about what you are trying to do, SPSS Modeler presents a choice of analytical techniques, such as CHAID, CART and others, and suggests the best approach based on multiple variables, such as number of fields or predictive power. This is a big deal because business managers often have some idea of clustering, regression and cause-and-effect-type functions, but they don’t necessarily know the intricacies of different techniques. With SPSS Modeler you don’t have to know the details of all of this, but can still participate in these important discussions. SPSS Modeler can bridge the gap between the statistician and the day-to-day line-of-business (LOB) analyst and decision-maker, and thus help bridge the analytics skills gap facing organizations today.

Another challenge for organizations is integrating multiple streams of data including attitudinal data. Built-in survey data collection in SPSS Data Collection can fill in these blanks for analysts. Sometimes behavioral data reveals knowledge gaps that can only be filled with direct perceptual feedback from stakeholders collected through a survey instrument. Trying to tell a story with only behavioral data can be like trying to tell the actual contents of a file based only on metadata descriptors such as file size, type, and when and how often the file was accessed. Similarly, social media data may provide some of the context, but it does not always give direct answers. We see LOB initiatives bringing together multiple streams of data, including attitudinal data such as brand perceptions or customer satisfaction. The data collection functionality allows managers, within the context of broader analytics initiatives, to bring such data directly into their models and even to do scoring for things such as customer or employee churn.

I have not yet discussed IBM SPSS integration with decision systems and the idea of moving from the “so what” of analytics to the “now what” of decision-making. This is a critical component of a company’s analytics agenda, since operationalizing analytics necessitates that the model outcomes be pushed out to organizations’ front lines and then updated in a closed-loop manner. Such analytics are more and more often seen as a competitive advantage in today’s marketplace – but this is a separate discussion that I will address in a future blog entry.

SPSS has a significant presence across multiple industries, but it is ubiquitous in academia and in the market research industry. The market research industry itself is a particularly interesting foothold for IBM as the market is estimated to be over $30 billion globally, according to the Council of American Survey Research Organizations. By leveraging IBM and SPSS, companies gain access to a new breed of market research to help merge forward-looking attitudinal data streams with behavioral data streams. The academic community’s loyalty to SPSS provides it an advantage similar to that of Apple when it dominated academic institutions with the Macintosh computer. As people graduate with familiarity with certain platforms, they carry this loyalty with them into the business world. As spreadsheets are phased out as the primary modeling tool due to their limitations, IBM can capitalize on the changes with continued investments in institutions of higher learning.

Companies looking to compete based on analytics should almost certainly consider IBM SPSS. This is especially true of companies that are looking to merge LOB expertise with custom analytical approaches, but that don’t necessarily want to write custom applications to accomplish these goals.


Tony Cosentino

VP and Research Director
