I am new to R. I know how to write MapReduce in Java and I want to try the same in R. Can anyone share sample code, and is there a fixed format for MapReduce in R?
Please send any link other than this: https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial
Any sample code would be very helpful.
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. The model contains two important tasks, namely Map and Reduce: Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs), and Reduce then aggregates those tuples by key.
Take word counting as an example. First, the input is divided into splits (say, three of them), which distributes the work among the map nodes. Each mapper then tokenizes the words in its split and emits every token with a hardcoded value of 1; the reducer sums those values per word.
In practice you first need to get the data into the distributed file system, then execute the MapReduce job, and finally massage the output according to your needs. Here is the same word count operation using RHadoop.
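A minimal sketch of that word count using the rmr2 package from RHadoop could look like the following. This assumes a working Hadoop plus rmr2 installation; the sample input lines are made up for illustration.

# word count with RHadoop's rmr2 (sketch; assumes rmr2 and Hadoop are set up)
library(rmr2)

# push a small in-memory sample into the distributed file system
lines <- to.dfs(c("one fish two fish", "red fish blue fish"))

wordcount <- mapreduce(
  input = lines,
  map = function(k, v) {
    words <- unlist(strsplit(v, "\\s+"))
    keyval(words, 1)            # emit (word, 1) for every token
  },
  reduce = function(word, counts) {
    keyval(word, sum(counts))   # sum the 1s per word
  }
)

from.dfs(wordcount)             # pull the (word, count) pairs back into R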
When you want to implement MapReduce (with Hadoop) in a language other than Java, you use a feature called Hadoop Streaming. The data is fed to the mapper via STDIN (readLines()), written back to Hadoop via STDOUT (cat()), then passed to the reducer again through STDIN (readLines()) and finally emitted via STDOUT (cat()).
The following code is taken from an article I wrote on writing a MapReduce job for Hadoop with R. The code counts 2-grams, but I'd say it's simple enough to see what is going on MapReduce-wise.
# map.R
library(stringdist, quietly=TRUE)

input <- file("stdin", "r")

while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
  # in case of empty lines
  # more sophisticated defensive code makes sense here
  if(nchar(line) == 0) break
  fields <- unlist(strsplit(line, "\t"))
  # extract 2-grams
  d <- qgrams(tolower(fields[4]), q=2)
  for(i in 1:ncol(d)) {
    # language / 2-gram / count
    cat(fields[2], "\t", colnames(d)[i], "\t", d[1,i], "\n")
  }
}
close(input)
# reduce.R
input <- file("stdin", "r")

# initialize variables that keep
# track of the state
is_first_line <- TRUE

while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
  line <- unlist(strsplit(line, "\t"))
  # current line belongs to previous
  # line's key pair
  if(!is_first_line &&
     prev_lang == line[1] &&
     prev_2gram == line[2]) {
    sum <- sum + as.integer(line[3])
  }
  # current line belongs either to a
  # new key pair or is first line
  else {
    # new key pair - so output the last
    # key pair's result
    if(!is_first_line) {
      # language / 2-gram / count
      cat(prev_lang, "\t", prev_2gram, "\t", sum, "\n")
    }
    # initialize state trackers
    prev_lang <- line[1]
    prev_2gram <- line[2]
    sum <- as.integer(line[3])
    is_first_line <- FALSE
  }
}
# the final record
cat(prev_lang, "\t", prev_2gram, "\t", sum, "\n")
close(input)
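You can test the two scripts locally before going anywhere near a cluster by wiring them together in a shell pipeline, where sort stands in for Hadoop's shuffle phase. This is only a sketch: the file name input.tsv and the streaming-jar invocation in the comment are assumptions that depend on your data and installation.

# simulate the streaming pipeline locally (sort plays the role of the shuffle)
# assumes an input.tsv whose 2nd and 4th tab-separated columns are the
# language and the text, as map.R expects
system("cat input.tsv | Rscript map.R | sort | Rscript reduce.R")

# on a cluster the same scripts are handed to the Hadoop streaming jar,
# roughly like this (exact jar path and options depend on your installation):
# hadoop jar hadoop-streaming.jar \
#   -input /data/input.tsv -output /data/out \
#   -mapper "Rscript map.R" -reducer "Rscript reduce.R" \
#   -file map.R -file reduce.R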
http://www.joyofdata.de/blog/mapreduce-r-hadoop-amazon-emr/