
Using AWS for parallel processing with R

I want to take a shot at the Kaggle Dunnhumby challenge by building a model for each customer. I want to split the data into ten groups and use Amazon Web Services (AWS) to build models with R on the ten groups in parallel. Some relevant links I have come across are:

  • The segue package;
  • A presentation on parallel web-services using Amazon.

What I don't understand is:

  • How do I get the data into the ten nodes?
  • How do I send and execute the R functions on the nodes?

I would be very grateful if you could share suggestions and hints to point me in the right direction.

P.S. I am using the free-tier account on AWS, but it was very difficult to install R from source on the Amazon Linux AMIs (lots of errors due to missing headers, libraries, and other dependencies).

asked Aug 30 '11 by harshsinghal


1 Answer

You can build everything up manually on AWS by creating your own Amazon EC2 cluster with several instances. There is a helpful tutorial video on the Amazon website: http://www.youtube.com/watch?v=YfCgK1bmCjw

But expect it to take several hours to get everything running:

  • starting 11 EC2 instances (one instance per group, plus one head instance)
  • installing R and MPI on all machines (check for preinstalled images)
  • configuring MPI correctly (you will probably want to add a security layer)
  • ideally, a file server mounted on all nodes so they can share the data
  • with this infrastructure in place, the best solution is the snow or foreach package (with Rmpi)
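Once the cluster is up, the snow-based workflow above can be sketched roughly as follows. This is only a sketch: it assumes a working Rmpi/MPI setup across the nodes, and `customers` and `fit_model` are placeholder names for your data frame and per-group modeling function.

```r
# Sketch: distribute ten customer groups over an MPI-backed snow cluster.
# Assumes Rmpi is installed and MPI is configured on all nodes.
library(snow)

# fit_model and customers are hypothetical stand-ins for your own code/data.
fit_model <- function(df) lm(spend ~ visits, data = df)

# Split the data into ten roughly equal row groups.
groups <- split(customers, rep(1:10, length.out = nrow(customers)))

cl <- makeCluster(10, type = "MPI")   # one worker per group
clusterExport(cl, "fit_model")        # ship the function to every node
models <- clusterApply(cl, groups, fit_model)  # one model per group
stopCluster(cl)
```

`clusterApply` sends one list element to each worker, which answers both questions at once: the data reaches the nodes as function arguments, and the exported function is executed remotely.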

The segue package is nice, but you will almost certainly run into data-communication problems!
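For comparison, segue runs `lapply`-style jobs on Amazon Elastic MapReduce instead of a hand-built cluster. A minimal sketch, assuming segue's `createCluster`/`emrlapply` interface and using placeholder credentials and the same hypothetical `groups`/`fit_model` from above:

```r
# Sketch: the same ten-group job via segue on Elastic MapReduce.
# Credentials and object names are placeholders, not real values.
library(segue)

setCredentials("YOUR_AWS_KEY", "YOUR_AWS_SECRET")
emr_cl <- createCluster(numInstances = 10)   # launches EMR instances
models <- emrlapply(emr_cl, groups, fit_model)
stopCluster(emr_cl)
```

Note that segue serializes your data and ships it through S3 on every call, which is where the data-communication overhead mentioned above comes from.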

The simplest solution is cloudnumbers.com (http://www.cloudnumbers.com). This platform gives you easy access to computer clusters in the cloud, and you can test a small cluster for 5 hours free of charge. Check the slides from the useR! conference: http://cloudnumbers.com/hpc-news-from-the-user2011-conference

answered Sep 22 '22 by Markus Schmidberger