Revolutionary White papers

R is Hot

R has already won praise and plaudits from established media outlets such as the New York Times, Forbes, Intelligent Enterprise, InfoWorld and The Register. When you consider that R is a high-level computer programming language designed mostly for quants, the adoring media attention seems nothing short of astounding.

So it’s entirely fair to ask: Why all the hoopla? Why is an esoteric programming language created in the early 1990s by two academics in New Zealand suddenly all the rage? Why is R so hot?

This article examines the reasons behind the rising popularity of R, by looking at how R is used at organizations including Bank of America, National Institutes of Health, the New York Times and Deloitte Consulting.


Big Data Decision Trees with R

By Richard Calaway, Lee Edlefsen, and Lixin Gong

Growth in Data Volumes

Revolution Analytics’ RevoScaleR package provides full-featured, fast, scalable, distributable predictive data analytics. The included rxDTree function provides the ability to estimate decision trees efficiently on very large data sets. Decision trees (Breiman, Friedman, Olshen, & Stone, 1984) provide relatively easy-to-interpret models, and are widely used in a variety of disciplines. For example:

  • Predicting which patient characteristics are associated with high risk of, for example, heart attack.
  • Deciding whether or not to offer a loan to an individual based on individual characteristics.
  • Predicting the rate of return of various investment strategies.

The rxDTree function fits tree models using a binning-based recursive partitioning algorithm. The resulting model is similar to that produced by the recommended R package rpart (Therneau & Atkinson, 1997). Both classification-type trees and regression-type trees are supported.
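To make the idea of a binning-based split search concrete, here is a minimal Python sketch of one plausible reading of the term: a single pass over the data accumulates per-bin counts and sums, and candidate splits are then evaluated only at bin boundaries. This is an illustrative assumption, not the RevoScaleR implementation; the function name best_binned_split, the equal-width binning, and the squared-error criterion are all hypothetical choices made for the example.

```python
def best_binned_split(x, y, n_bins=8):
    """Find the best regression split on x using equal-width bins.

    Hypothetical sketch: returns (threshold, gain), where gain is the
    reduction in squared error from splitting at the threshold.
    """
    lo, hi = min(x), max(x)
    if hi == lo:
        return None, 0.0  # a constant predictor cannot be split
    width = (hi - lo) / n_bins

    # One pass over the data: accumulate count and sum of y per bin.
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    for xi, yi in zip(x, y):
        b = min(int((xi - lo) / width), n_bins - 1)
        counts[b] += 1
        sums[b] += yi

    total_n, total_sum = len(x), sum(sums)
    base = total_sum ** 2 / total_n

    # Scan bin boundaries only; no sorting of the raw data is needed.
    best_threshold, best_gain = None, 0.0
    left_n, left_sum = 0, 0.0
    for b in range(n_bins - 1):
        left_n += counts[b]
        left_sum += sums[b]
        right_n = total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        right_sum = total_sum - left_sum
        gain = left_sum ** 2 / left_n + right_sum ** 2 / right_n - base
        if gain > best_gain:
            best_threshold, best_gain = lo + (b + 1) * width, gain
    return best_threshold, best_gain


threshold, gain = best_binned_split(
    [1, 2, 3, 4, 5, 6, 7, 8], [0, 0, 0, 0, 10, 10, 10, 10]
)
print(threshold, gain)
```

Because only the per-bin counts and sums are kept, the memory needed for the split search depends on the number of bins rather than the number of rows, which is the general appeal of binning on very large data sets.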

Advanced ‘Big Data’ Analytics with R and Hadoop

Revolution Analytics delivers optimized statistical algorithms for the three primary data management paradigms that organizations use to handle the growing size and variety of their data: file-based, MapReduce (e.g. Hadoop), and in-database analytics.

The company is optimizing algorithms - even complex ones - to work well with Big Data. Open Source R was not built for Big Data Analytics because it is memory-bound. Depending on the type of statistical analysis required, Big Data also causes issues that we’ll call “Big Computations,” as some algorithms require a great deal of processing capacity on their own and may not lend themselves to running in every data management paradigm. For these Big Computations, parallelism (as we’ve deployed with IBM Netezza and ScaleR) is important to performance and to the accuracy of the statistical analysis. Coupled with an intuitive R Development Environment from Revolution Analytics, the degree of innovation exceeds that which may be achieved through packaged analytic applications.

This paper addresses specific integration between R and Hadoop that is supported by Revolution Analytics.

The RevoScaleR Data Step White Paper

This paper provides an introduction to working with large data sets with Revolution Analytics’ proprietary R package, RevoScaleR. Although the main focus is on the use and capabilities of the rxDataStep function, we take a broad view and describe the capabilities of the functions in the RevoScaleR package that may be useful for reading and manipulating large data sets, cleaning them, and preparing them for statistical analysis with R.

Fast, Powerful and Cost-Effective Analytics
Reshaping Business Competition Worldwide

‘Big Data’ is fast becoming a big part of business. Big Data is driving transformational change across markets and across the enterprise. The convergence of cloud, mobile and social computing has dramatically increased the sheer volume of available data and greatly accelerated the speed at which new data is created. Data is changing the world, and we need to keep up with it.

This executive white paper explores a new generation of predictive analytics solutions unlocking the value of big numbers—converting them from sprawling collections of data points into finely honed competitive weapons.

Emerging Open-Source Analytics Stack Reflects Dynamism and Diversity of New Global Economy

The emerging open-source analytics stack is a cost-effective and practical alternative to monolithic BI solutions offered by traditional vendors. Beyond its potential as a lower-cost option, the new stack represents a clear break with the past. Unlike traditional analytic solutions, the new stack is defined by its flexibility, extensibility and open-source foundations...

RevoScaleR Speed and Scalability

RevoScaleR, the Big Data predictive analytics library included with Revolution R Enterprise, is designed from the ground up to be fast and scalable. Consideration has been given to all of the components that are involved in performing large-scale statistical analysis. These include data storage, usage of a computing infrastructure’s resources (RAM, CPUs, cores, and computers) and the algorithms themselves. Its extreme speed and scalability are the result of careful, innovative engineering at every stage. This white paper describes the design and implementation considerations that are the foundation of the high-performance Big Data capabilities of Revolution R Enterprise.

Visualizing Huge Data Sets with R: An Old Wives Tale from the U.S. Census

This white paper provides an example of quickly analyzing and visualizing a huge data set in R using the new R package from Revolution Analytics, RevoScaleR. The example uses a census data set with over 14 million observations (5% Public Use Microdata Sample (PUMS) of the 2000 United States Census) and examines patterns related to the sex ratio by age. After first identifying an aberration in the aggregate data, we are able to quickly drill down and create plots conditioned on a variety of characteristics such as region, race, and marital status. In the process, errors in the data are graphically revealed; for example, 65-year-old men are more likely to have an “old wife” age 70 than a wife their own age. The power and flexibility of the R language and graphics combined with the speed of RevoScaleR make visualization an easy and critical first step to analysis of huge data sets.

Also included are scripts to recreate the analysis using Revolution R Enterprise 4, and a link to a video demonstration of the analysis.

Big Data Analysis with Revolution R Enterprise

The R language is well established as the language for doing statistics, data analysis, data-mining algorithm development, stock trading, credit risk scoring, market basket analysis and all manner of predictive analytics. However, given the deluge of data that must be processed and analyzed today, many organizations have been reluctant to deploy R beyond research into production applications.

RevoDeployR

R for Web Services with RevoDeployR

Three major trends are converging: the unprecedented growth of the mobile web[1], the acceptance of cloud computing as a viable model for business applications, and the demand for sophisticated predictive analytics[2]. Together they are driving demand for web applications backed by advanced analytical techniques and data visualizations that go well beyond the simple, canned analyses that characterized early business intelligence efforts. For example, an application to provide near real-time, web-accessible customer risk profiles to department managers located throughout an enterprise is not far-fetched in today’s environment. Developing such an application might require evaluating and then deploying classification algorithms only recently published in the statistical journals. The R language is particularly well suited for this fast-moving quantitative world. R is an interpreted language that is ideal for the rapid prototyping of new techniques. Furthermore, R can package these techniques in a very small footprint: R scripts containing only the code required for particular calculations fit nicely into web application development environments.

Deploying Advanced Analytics Using R & PMML

In the old days, you had to wait for the model to be converted, usually through a laborious or expensive proprietary process, into an application that could run on the execution system. With the advent of PMML, you can build a model in the morning and have it ready for integration testing, change management, and potential deployment in the afternoon.

In a very real sense, standard languages such as PMML are unlocking the latent value of big data, and ushering in a new era of real-time analytics. With PMML and its associated technologies, the observation that data is the “new oil” of the 21st century seems entirely plausible.

Revolution Analytics, February 2011

R Competition Brings Out the Best in Data Analysis

It’s often been said that competition brings out the best in us.  We are all attracted to contests; our passion for competing seems hardwired into our souls. Apparently, even predictive modelers find the siren song of competition irresistible.

That’s what a small Australian firm named Kaggle has discovered – when given the chance, data scientists love to duke it out, just like everyone else. Kaggle describes itself as “an innovative solution for statistical/analytics outsourcing.” That’s a very formal way of saying that Kaggle manages competitions among the world’s best data scientists.

Here’s how it works: Corporations, governments and research laboratories are confronted with complex statistical challenges. They describe the problems to Kaggle and provide datasets. Kaggle converts the problems and the data into contests that are posted on its web site. The contests feature cash prizes ranging in value from $100 to $3 million. Kaggle’s clients range in size from tiny startups to multinational corporations such as Ford Motor Company and government agencies like NASA.

The Rise of Big Data Spurs a Revolution in Big Analytics

By Norman H. Nie, CEO Revolution Analytics

The enormous growth in the amount of data that the global economy now generates has been well documented, but the magnitude of its potential to drive competitive advantage has not. It is my hope that this briefing urges all stakeholders (executives who must fund analytics initiatives, IT teams that support them, and data scientists who uncover and communicate meaningful insight) to go boldly in the direction of “Big Analytics.” This opportunity is enormous and without precedent.

Getting Started with Revolution R Enterprise

This guide is intended as an introduction to the visual productivity features in Revolution R Enterprise. For new users, these Windows tools include a number of usability aids including IntelliSense for automatic word completion, code snippets to simplify programming, and an Object Browser with editing and plotting capabilities. For experienced R programmers, this interface includes a full-featured integrated development environment (IDE) with a built-in visual debugger. This Getting Started Guide walks you through the most useful features of the new environment.

Tags: REvolution R Enterprise for Windows, debugger, script editing, loading packages, R console

REvolution Computing, February 2010

   

High-Performance Risk Analysis with Revolution R Enterprise

This white paper presents illustrated benchmarks of the significant performance gains possible with Revolution R Enterprise and the Intel® Xeon® processor 5500 series when computing risk metrics with the CreditMetrics algorithm. Revolution R Enterprise can substantially reduce computation time for the CreditMetrics analysis benchmark on multi-core workstations. The optimized numeric routines available in Revolution R Enterprise can transparently speed up a large number of compute-intensive tasks in R. Quantitative finance is just one analytics field where Revolution's high-performance and productivity tools can help statisticians deliver better results.

CreditMetrics Benchmark Code: Creditmetrics.rproj

Tags: Revolution R Enterprise, CreditMetrics, value at risk, portfolio performance benchmarks

REvolution Computing and Intel, 2009

 

Financial Applications with ParallelR

Statistical packages serve two functions in finance: econometric analysis of large volumes of data, and programming financial models. A popular package for these purposes is R. In this article, we will examine two canonical applications of parallel programming for option pricing. We use the ParallelR package developed by Revolution Analytics. We price options using trees and Monte Carlo simulation. Both these approaches are commonly used for option pricing and are amenable to parallelization and grid computing. In this paper, we demonstrate the application using the widely used mathematical/statistical R package.
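As a rough illustration of the two approaches named above, the following Python sketch prices the same European call on a Cox-Ross-Rubinstein binomial tree and by Monte Carlo simulation. It is not taken from the paper and does not use ParallelR; the function names, the parameter values, and the choice of a European call are assumptions made for the example. The Monte Carlo estimate is built from independently seeded chunks to show why the method parallelizes easily: each chunk could be handed to a separate worker and the results averaged.

```python
import math
import random


def crr_call(s0, k, r, sigma, t, steps):
    """European call price on a Cox-Ross-Rubinstein binomial tree."""
    dt = t / steps
    u = math.exp(sigma * math.sqrt(dt))   # up factor
    d = 1.0 / u                           # down factor
    p = (math.exp(r * dt) - d) / (u - d)  # risk-neutral up probability
    disc = math.exp(-r * dt)
    # Payoffs at the terminal nodes; j counts the number of up moves.
    values = [max(s0 * u ** j * d ** (steps - j) - k, 0.0)
              for j in range(steps + 1)]
    # Roll back through the tree one time step at a time.
    for _ in range(steps):
        values = [disc * (p * values[j + 1] + (1 - p) * values[j])
                  for j in range(len(values) - 1)]
    return values[0]


def mc_call_chunk(s0, k, r, sigma, t, n_paths, seed):
    """Monte Carlo estimate from one independent chunk of paths."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * t
    vol = sigma * math.sqrt(t)
    total = 0.0
    for _ in range(n_paths):
        s_t = s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        total += max(s_t - k, 0.0)
    return math.exp(-r * t) * total / n_paths


def mc_call(s0, k, r, sigma, t, n_chunks=4, paths_per_chunk=50_000):
    """Average equally sized chunk estimates; chunks share no state."""
    estimates = [mc_call_chunk(s0, k, r, sigma, t, paths_per_chunk, seed)
                 for seed in range(n_chunks)]
    return sum(estimates) / len(estimates)


tree_price = crr_call(100.0, 100.0, 0.05, 0.2, 1.0, steps=500)
mc_price = mc_call(100.0, 100.0, 0.05, 0.2, 1.0)
print(round(tree_price, 2), round(mc_price, 2))
```

The two estimates should agree to within the Monte Carlo sampling error, which shrinks as the number of simulated paths grows.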

Tags: Parallel Monte Carlo, option pricing on trees

Sanjiv R. Das and Brian Granger, Journal of Investment Management, 2009

   

Using the foreach Package

Much of parallel computing comes down to doing three things: splitting the problem into pieces, executing the pieces in parallel, and combining the results back together. Using the foreach package, the "iterator" object helps you to split the problem into pieces, the %dopar% operator executes the pieces in parallel, and the "combine" option puts the results back together. This whitepaper demonstrates how simple things can be done in parallel quite easily using the foreach package, and gives some ideas about how more complex problems can be solved.
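The split, execute, combine pattern described above is not specific to R. The sketch below shows the same three steps using only the Python standard library; it is an analogue for illustration, not the foreach API, and the helper names split and parallel_reduce are hypothetical. A thread pool is used to keep the example simple; for CPU-bound work in CPython a process pool would normally be the better choice.

```python
import operator
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split(items, n_pieces):
    """Split a list into n_pieces nearly equal contiguous chunks."""
    size, extra = divmod(len(items), n_pieces)
    chunks, start = [], 0
    for i in range(n_pieces):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The work applied to each piece independently."""
    return sum(x * x for x in chunk)


def parallel_reduce(items, work, combine, n_workers=4):
    """Split the input, run `work` on each piece, then combine results."""
    chunks = split(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(work, chunks))
    return reduce(combine, partials)


total = parallel_reduce(list(range(1, 101)), sum_of_squares, operator.add)
print(total)  # 338350
```

Here split plays the role of the iterator, pool.map the role of the parallel execution step, and the combine argument the role of the "combine" option.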

Tags: parallel computing, ParallelR, .combine function, iterators package, parallel random forest, parallel apply, list comprehensions

REvolution Computing, October 2009

   

Parallelized Backtesting with foreach

ParallelR and the foreach function provide a simple mechanism to speed up "embarrassingly parallel" problems, even on modest hardware like a dual-core laptop. In many cases, with just a simple conversion from the for syntax to the foreach syntax you can get significant speedups without having to worry about many of the housekeeping details of setting up worker R sessions. And for the really big problems, you just need to change one line of code to move your job onto a distributed cluster or grid.

Tags: automated trading rule, MACD oscillator, Sharpe Ratio

REvolution Computing, May 2009

   

A Benchmark Study of Large-Scale Chemical Classification using ParallelR

The world is awash with digital data online. Utilizing this data to yield knowledge is the big challenge. The raw data by itself is rather worthless. Modern data mining techniques have emerged as a potential solution, but they are sufficiently compute intensive for real world applications that conventional PCs and servers often cannot provide knowledge to us in a timely fashion. This is a major issue as CPU clock rates seem to have leveled off and data sets (and subsequent run times) are increasing exponentially.

In this paper, we will show that by utilizing an “off the shelf” parallel data mining R package called "caretNWS", knowledge workers can use quad-core processor-based systems to classify data with a minimum of effort and yet realize high performance. For the first time, knowledge workers can achieve scalable data mining without resorting to parallel programming.

Tags: High-performance computing, drug development, caretNWS package, datamining, random forest method

REvolution Computing, Pfizer Research Labs, and AMD, 2008

   

Using ParallelR for High Performance Monte Carlo Simulation on Multiprocessor Computers

Opportunities for utilizing high performance Financial Services systems are common throughout the financial world. Examples include: credit risk assessment, portfolio optimization, optimization of marketing strategies, and credit card fraud detection.

The specific example we consider is a simple, prototype portfolio optimization problem that is a model of the “efficient frontier” approach suggested by Markowitz. Our intention here is to illustrate the general ideas behind the use of multiprocessors to accelerate general portfolio optimization rather than to present the design and analysis of a production portfolio optimizer applied to real market data. We present benchmark data showing how well our parallel portfolio optimizer scales as a function of the number of cores.

Tags: credit risk, portfolio optimization, Sleigh function, multi-core processor

REvolution Computing and HP, 2008