
How to write map reduce in R?

Tags: r, mapreduce

I am new to R. I know how to write MapReduce jobs in Java and want to try the same in R. Can anyone help with some sample code, and is there a fixed format for MapReduce in R?

Please point me to resources other than this one: https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial

Sample code would be especially helpful.

Manoj asked Jul 26 '12 05:07




1 Answer

To implement a MapReduce job on Hadoop in a language other than Java, you use a feature called Hadoop Streaming. Hadoop feeds the input to the mapper via STDIN (read in R with readLines()), and the mapper writes its key/value records back to Hadoop via STDOUT (cat()). Hadoop sorts those records by key and passes them to the reducer, again via STDIN (readLines()), and the reducer writes the final result to STDOUT (cat()).
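To make that data flow concrete without a cluster, here is a minimal in-memory sketch of the three stages (map, sort, reduce) as a plain word count. It is not taken from the article; the function names map_lines and reduce_lines and the sample input are made up for illustration, and sort() only stands in for the sorting Hadoop does between the two phases.

```r
# Mapper stand-in: turn each input line into "word<TAB>1" records.
map_lines <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  words <- words[nzchar(words)]
  paste(words, 1L, sep = "\t")
}

# Reducer stand-in: expects records sorted by key and sums the values
# of each run of identical keys, as a streaming reducer has to.
reduce_lines <- function(sorted_records) {
  out <- character(0)
  prev_key <- NULL
  total <- 0L
  for (record in sorted_records) {
    fields <- strsplit(record, "\t", fixed = TRUE)[[1]]
    # key changed: emit the finished key and start a new total
    if (!is.null(prev_key) && fields[1] != prev_key) {
      out <- c(out, paste(prev_key, total, sep = "\t"))
      total <- 0L
    }
    prev_key <- fields[1]
    total <- total + as.integer(fields[2])
  }
  # emit the last key, if there was any input at all
  if (!is.null(prev_key)) out <- c(out, paste(prev_key, total, sep = "\t"))
  out
}

input    <- c("the cat sat", "the dog sat down")  # made-up sample data
mapped   <- map_lines(input)
shuffled <- sort(mapped, method = "radix")        # stands in for Hadoop's sort phase
result   <- reduce_lines(shuffled)
cat(result, sep = "\n")
```

The reducer never sees all values of a key at once. It only notices that the key changed from one line to the next, which is why the sort between the two phases matters.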

The following code is taken from an article I wrote on writing a MapReduce job with R for Hadoop. It counts 2-grams, but it should be simple enough to show what is going on MapReduce-wise.

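# map.R
# Streaming mapper: reads tab-separated records from STDIN and emits
# one "language<TAB>2-gram<TAB>count" record per 2-gram.

library(stringdist, quietly=TRUE)

input <- file("stdin", "r")

while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
   # skip empty lines instead of aborting the whole input;
   # more sophisticated defensive code makes sense here
   if(nchar(line) == 0) next

   fields <- unlist(strsplit(line, "\t"))

   # extract 2-grams
   d <- qgrams(tolower(fields[4]), q=2)

   # seq_len() avoids the 1:0 trap when d has no columns
   for(i in seq_len(ncol(d))) {
     # language / 2-gram / count
     # sep="" keeps cat() from padding the tabs with spaces
     cat(fields[2], "\t", colnames(d)[i], "\t", d[1,i], "\n", sep="")
   }
}

close(input)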

-

# reduce.R

input <- file("stdin", "r")

# initialize variables that keep
# track of the state

is_first_line <- TRUE

while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
   # skip empty lines
   if(nchar(line) == 0) next

   line <- unlist(strsplit(line, "\t"))
   # current line belongs to previous
   # line's key pair
   if(!is_first_line &&
      prev_lang == line[1] &&
      prev_2gram == line[2]) {
        count <- count + as.integer(line[3])
   }
   # current line belongs either to a
   # new key pair or is first line
   else {
     # new key pair - so output the last
     # key pair's result
     if(!is_first_line) {
       # language / 2-gram / count
       cat(prev_lang, "\t", prev_2gram, "\t", count, "\n", sep="")
     }
     # initialize state trackers
     prev_lang <- line[1]
     prev_2gram <- line[2]
     count <- as.integer(line[3])
     is_first_line <- FALSE
   }
}

# the final record (only if there was any input at all)
if(!is_first_line) {
  cat(prev_lang, "\t", prev_2gram, "\t", count, "\n", sep="")
}

close(input)

http://www.joyofdata.de/blog/mapreduce-r-hadoop-amazon-emr/
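For readers who want to see what kind of records the mapper emits without installing stringdist, here is a rough base-R sketch of 2-gram counting on a single string. The helper count_2grams and the language code "en" are hypothetical and only mirror the language / 2-gram / count layout used above. This is not the stringdist implementation.

```r
# Hypothetical helper: count the overlapping 2-grams of one string
# using only base R.
count_2grams <- function(text) {
  text <- tolower(text)
  n <- nchar(text)
  # strings shorter than two characters contain no 2-gram
  if (n < 2) return(table(character(0)))
  starts <- seq_len(n - 1)
  table(substring(text, starts, starts + 1))
}

counts <- count_2grams("Banana")

# Build records in the "language<TAB>2-gram<TAB>count" layout;
# "en" is a made-up language code.
records <- paste("en", names(counts), as.integer(counts), sep = "\t")
cat(records, sep = "\n")
```

Each input line produces one such batch of records. The reducer then only has to add up the counts of identical language / 2-gram pairs across all batches.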

Raffael answered Oct 17 '22 08:10