speed up large result set processing using rmongodb

I'm using rmongodb to get every document in a particular collection. It works, but I'm working with millions of small documents, potentially 100M or more. I'm using the method suggested by the author on the website: cnub.org/rmongodb.ashx

library(rmongodb)   # assumes an open connection `mongo`, a namespace `ns`, and a `query` BSON

count <- mongo.count(mongo, ns, query)    # number of matching documents
cursor <- mongo.find(mongo, ns, query)    # cursor over the full result set

# pre-allocate one vector per field
name <- vector("character", count)
age <- vector("numeric", count)

i <- 1
while (mongo.cursor.next(cursor)) {
    b <- mongo.cursor.value(cursor)       # current document as BSON
    name[i] <- mongo.bson.value(b, "name")
    age[i] <- mongo.bson.value(b, "age")
    i <- i + 1
}
df <- as.data.frame(list(name=name, age=age))

This works fine for hundreds or thousands of results, but that while loop is VERY VERY slow. Is there some way to speed it up, maybe with multiprocessing? Any suggestions would be appreciated. I'm averaging 1M documents per hour, and at this rate I'll need a week just to build the data frame.

EDIT: I've noticed that the more vectors there are in the while loop, the slower it gets. I'm now trying to loop separately for each vector. It still feels like a hack, though; there must be a better way.
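If "loop separately for each vector" means one full cursor pass per field, each pass can project just that field via the fields argument of mongo.find. This is only a sketch of that idea, not code from the question; the projection built with mongo.bson.from.list is an assumption about how to restrict each pass to a single field.

# Sketch only: one cursor pass per field, projecting a single field each time.
count <- mongo.count(mongo, ns, query)
name <- vector("character", count)

fields <- mongo.bson.from.list(list(name = 1L))   # return only "name" (plus _id)
cursor <- mongo.find(mongo, ns, query, fields = fields)
i <- 1L
while (mongo.cursor.next(cursor)) {
    name[i] <- mongo.bson.value(mongo.cursor.value(cursor), "name")
    i <- i + 1L
}
# repeat the same pattern for "age" and any other fields, then combine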

Edit 2: I'm having some luck with data.table. It's still running, but it looks like it will finish the 12M documents (my current test set) in about 4 hours. That's progress, but far from ideal.

library(data.table)

# pre-allocate the full table, then fill rows in place with set()
dt <- data.table(uri=rep("NA",count),
                 time=rep(0,count),
                 action=rep("NA",count),
                 bytes=rep(0,count),
                 dur=rep(0,count))

i <- 1L
while (mongo.cursor.next(cursor)) {
  b <- mongo.cursor.value(cursor)
  set(dt, i, 1L,  mongo.bson.value(b, "cache"))
  set(dt, i, 2L,  mongo.bson.value(b, "path"))
  set(dt, i, 3L,  mongo.bson.value(b, "time"))
  set(dt, i, 4L,  mongo.bson.value(b, "bytes"))
  set(dt, i, 5L,  mongo.bson.value(b, "elaps"))
  i <- i + 1L
}

asked Dec 20 '12 by rjb101


1 Answer

You might want to try the mongo.find.exhaust option:

cursor <- mongo.find(mongo, ns, query, options=mongo.find.exhaust)

This would be the easiest fix, if it actually works for your use case.

However, the rmongodb driver seems to be missing some features that are available in other drivers. For example, the JavaScript driver has a Cursor.toArray method, which dumps all of the find results directly into an array. The R driver has a mongo.bson.to.list function, but a mongo.cursor.to.list is probably what you want. It's probably worth pinging the driver developer for advice.
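In the meantime, something along those lines can be hand-rolled. The sketch below is only an illustration, not part of rmongodb: it reuses the cursor API from the question plus data.table::rbindlist to bind the per-document lists together.

# Hand-rolled cursor-to-list helper; mongo.cursor.to.list does not exist
# in rmongodb, this is only a sketch of what it might look like.
library(rmongodb)
library(data.table)

cursor.to.list <- function(cursor) {
  results <- list()
  i <- 1L
  while (mongo.cursor.next(cursor)) {
    # mongo.bson.to.list converts the whole BSON document to an R list
    results[[i]] <- mongo.bson.to.list(mongo.cursor.value(cursor))
    i <- i + 1L
  }
  results
}

cursor <- mongo.find(mongo, ns, query)
docs <- cursor.to.list(cursor)
dt <- rbindlist(docs, fill = TRUE)   # one row per document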

A hacky workaround could be to create a new collection whose documents are "chunks" of 100,000 of the original documents each. Each of those chunk documents could then be read efficiently with mongo.bson.to.list. The chunked collection could be built using the MongoDB server's MapReduce functionality.
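Purely as an assumption about how such a chunked collection might be laid out (each chunk document holding its originals in an array field, here called "docs", under a made-up namespace "db.collection_chunks"), reading it back could look like this:

# Hypothetical layout: each chunk document stores the original documents
# in an array field "docs"; the namespace name is made up for illustration.
chunk_cursor <- mongo.find(mongo, "db.collection_chunks", mongo.bson.empty())
chunks <- list()
i <- 1L
while (mongo.cursor.next(chunk_cursor)) {
  chunk <- mongo.bson.to.list(mongo.cursor.value(chunk_cursor))
  chunks[[i]] <- rbindlist(chunk$docs, fill = TRUE)   # documents in this chunk
  i <- i + 1L
}
dt <- rbindlist(chunks, fill = TRUE)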

answered Sep 21 '22 by mjhm