Leading Research Center Speeds Up and Simplifies Complex Analysis on Very Large Data Sets


Background

The State University of New York (SUNY) at Buffalo is home to one of the leading multiple sclerosis (MS) research centers in the world.  MS is a devastating, chronic neurological disease that affects nearly a million people worldwide.  From the beginning, the genetics of MS were known to be complex and it was apparent that no single gene was likely causative for the disease. 
The SUNY team began to look at data obtained from scanning the genomes of MS patients to identify genes whose variations could contribute to the risk of developing MS.  Since gene products work by interacting with both other gene products and environmental factors, the team was interested in researching combinations of interacting genes. The researchers believed that multiple single nucleotide polymorphisms (SNPs) combined with environmental variables would better explain the risk of developing MS.  The team was also interested in identifying candidate environmental factors that could be used to prevent the disease from progressing in patients.

Challenge

The data sets used in this type of multi-variable research are very large, and the analysis is computationally demanding because the researchers are looking for significant interactions among thousands of genetic and environmental factors. There are two issues to overcome: crunching through the immense data set, and building analytic models that allow the team to look beyond first-order interactions. The researchers want to see not only which individual variables are significant, but also which pairs or triples of variables are significant. This requires the ability to quickly build models using a range of variable types and run them on huge data sets in a high-performance environment, and it also requires the ability to include an almost limitless variety of dependent variable types.

The computational challenge in gene-environment interaction analyses is due to a phenomenon called ‘combinatorial explosion.’ With thousands of SNPs to consider, the number of SNP combinations that have to be assessed to uncover potential interactions becomes incredibly large.
Before they could run the analysis using all of the variables presented in the data set being used, the researchers needed an analytic framework that would allow them to add and remove variables from the model quickly and easily, without having to write hundreds of lines of code.  

The sheer number of SNPs, combined with environmental variables and phenotype values, means that the number of computations necessary for data mining could run into the quintillions (a 1 followed by 18 zeros).
To identify how many interactions could exist based on the number of SNPs included, the SUNY researchers graphed a mathematical function. For example, if they were to mine through 1,000,000 SNPs looking for combinations of four SNPs, they would face sextillions of possible interactions (a sextillion is 10 to the 21st power).
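
The growth described above can be checked with a binomial coefficient. The following is a minimal sketch in Python, not the team's own R code, and it assumes that an "interaction" means an unordered combination of distinct SNPs. It ignores environmental variables and phenotype values, and the function name is hypothetical. For one million SNPs taken four at a time, the exact count lands in the tens of sextillions.

```python
import math


def interaction_count(n_snps: int, order: int) -> int:
    """Number of distinct unordered SNP combinations of the given size.

    Assumption: an 'interaction' is any unordered set of `order` SNPs.
    """
    return math.comb(n_snps, order)


# Scan size taken from the example in the text.
N_SNPS = 1_000_000

# First-order terms, pairs, triples and four-way combinations.
for order in range(1, 5):
    count = interaction_count(N_SNPS, order)
    print(f"order {order}: {count:.3e} combinations")
```

Each extra SNP in the combination multiplies the count by roughly the number of SNPs divided by the new order, which is why even three-way and four-way analyses outgrow commodity hardware.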

The researchers first attempted to run the algorithms on commodity hardware. With this approach, a simple run of the algorithm took almost a week. Extrapolating from this performance, they realized it would take many weeks to run the algorithm against the volume of data they wanted to analyze. Meanwhile, they knew that the results would lead to additional questions, algorithm adjustments, data changes and more, each requiring further runs, which would be untenable. They needed processing power that would speed this up by 100X in order to make meaningful discoveries and publish them.

Solution

SUNY Buffalo is using Revolution R Enterprise for IBM Netezza in conjunction with its Netezza Analytics Appliance, and has been able to simplify and speed up very complex analysis on very big data sets.

The researchers knew they needed the kind of processing available only in a high-performance computer (previously known as a ‘supercomputer’). High-performance computing (HPC) platforms typically run their processing in parallel, breaking up calculations to run simultaneously across many processors. They also often use field-programmable gate array (FPGA) architectures for additional speed. The researchers also needed capabilities found in relational databases, which are not often included in HPC platforms. The Netezza appliance promised the performance that the researchers wanted and much more.

From an analytics perspective, the researchers were able to write a version of their software tools in Revolution R Enterprise for IBM Netezza, so that all of their reporting and analysis is consolidated in one place. This avoids the need to move very large amounts of data in and out of Netezza, which would cause further delays. They were also able to use a wider variety of data sets.

Due to the nature of the research being done, there is immense value in being able to use a variety of data sets and include a wide range of dependent variables so that interactions among the variables may be studied. SUNY’s team adopted R, the core of Revolution R Enterprise for IBM Netezza, because it offers the flexibility to include new kinds of variables, such as discrete dependent, Poisson dependent or continuous normally-distributed variables, by simply adding a few lines of code. In the past, the SUNY team would have had to rewrite the entire algorithm, which would have required a great deal of time from a grad student or PhD candidate. Now, the scientists can change the algorithm themselves by adding a new entropy function and move on with the science, which is now more sophisticated because they can more easily increase the level of interaction analysis.
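
The idea of supporting a new dependent-variable type by adding one entropy function can be illustrated with a small sketch. This is not the SUNY team's algorithm or their R code. It is a hypothetical Python example in which a generic scoring routine takes the entropy function as a parameter. All function names, the information-gain score and the variance floor are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from statistics import pvariance
from typing import Callable, Hashable, Sequence

# Hypothetical type: any function mapping phenotype values to an entropy.
EntropyFn = Callable[[Sequence], float]


def discrete_entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (bits) of a discrete dependent variable."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def gaussian_entropy(values: Sequence[float]) -> float:
    """Differential entropy (nats) of a normal fit to a continuous variable.

    Assumption: a small variance floor avoids log(0) for constant groups.
    """
    variance = max(pvariance(values), 1e-12)
    return 0.5 * math.log(2 * math.pi * math.e * variance)


def information_gain(genotypes: Sequence[Hashable],
                     phenotype: Sequence,
                     entropy: EntropyFn) -> float:
    """Entropy of the phenotype minus its entropy within each genotype group."""
    groups = defaultdict(list)
    for genotype, value in zip(genotypes, phenotype):
        groups[genotype].append(value)
    total = len(phenotype)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(phenotype) - conditional


# Toy data: the phenotype depends on two SNPs jointly, not on either alone.
snp_pairs = [(0, 0), (0, 1), (1, 0), (1, 1)] * 2
first_snp_only = [pair[0] for pair in snp_pairs]
status = [a ^ b for a, b in snp_pairs]

print(information_gain(first_snp_only, status, discrete_entropy))
print(information_gain(snp_pairs, status, discrete_entropy))

# Switching to a continuous phenotype only requires a different entropy function.
severity = [1.0, 1.1, 5.0, 5.2]
print(information_gain([0, 0, 1, 1], severity, gaussian_entropy))
```

In this sketch the scoring routine never changes. A Poisson-distributed variable, for example, would need only one more function with the same signature.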

Results

Once Netezza was deployed as the research analytics infrastructure along with Revolution R Enterprise for IBM Netezza, the genetic data were assembled, the environmental and phenotype data were combined, and the algorithms were customized, the researchers were able to look for potential factors contributing to the risk of developing MS.

The SUNY researchers were able to:

  • Use the new algorithms and add multiple variables, which was nearly impossible before.
  • Reduce the time required to conduct analysis from 27.2 hours without Netezza to 11.7 minutes with it.
  • Carry out their research with little to no database administration (unlike other HPC platforms or databases available, Netezza was designed to require a minimum amount of maintenance).
  • Publish multiple articles in scientific journals, with more in process.
  • Proceed with studies based on ‘vector phenotypes’—a more complex variable that will further push the Netezza platform.
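
As a quick arithmetic check on the run times reported above, the two figures imply a speedup of roughly 140 times. The short Python sketch below uses only those two numbers; the variable names are illustrative.

```python
# Run times reported in the results list.
baseline_minutes = 27.2 * 60   # 27.2 hours without Netezza
appliance_minutes = 11.7       # 11.7 minutes with Netezza

speedup = baseline_minutes / appliance_minutes
print(f"speedup: about {speedup:.0f}x")
```

That ratio is above the 100X improvement the researchers said they needed.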

There are lots of good reasons to use Revolution Analytics in a super computer like Netezza. It’s faster and easier to program. It will speed up our computation. And, we have more data sets available to us because of how flexible and quick Revolution Analytics makes it to add and delete variables in our model.

Dr. Murali Ramanathan
Co-Director, Data Intensive Discovery Initiative (DI2)
State University of New York at Buffalo

About Revolution Analytics

Revolution Analytics was founded in 2007 to foster the R community, as well as support the growing needs of commercial users. Our name derives from combining the letter "R" with the word "evolution." It speaks to the ongoing development of the R language from an open-source academic research tool into commercial applications for industrial use.

Through our Revolution R products, we aim to make the power of predictive analytics accessible to every type of user and budget. We provide free and premium software and services that bring high performance, productivity and ease of use to R, enabling statisticians and scientists to derive greater meaning from large sets of critical data in record time.

We also offer our full-featured production-grade software to the academic community for FREE, in order to support the continued spread of R's popularity to the next generation of analysts. 

For customers such as Pfizer, Novartis, Yale Cancer Center, Bank of America and others, our flagship Revolution R Enterprise product stands for faster drug development, reduced time of data analysis, and more powerful and efficient financial models.
