I'm using rmongodb to get every document in a particular collection. It works, but I'm working with millions of small documents, potentially 100M or more. I'm using the method suggested by the author on the website: cnub.org/rmongodb.ashx
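# Count the matching documents so the result vectors can be preallocated
count <- mongo.count(mongo, ns, query)
# mongo.find takes the namespace as its second argument, like mongo.count
cursor <- mongo.find(mongo, ns, query)
name <- vector("character", count)
age <- vector("numeric", count)
i <- 1
# Walk the cursor and copy one field at a time into the vectors
while (mongo.cursor.next(cursor)) {
  b <- mongo.cursor.value(cursor)
  name[i] <- mongo.bson.value(b, "name")
  age[i] <- mongo.bson.value(b, "age")
  i <- i + 1
}
df <- as.data.frame(list(name=name, age=age))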
This works fine for hundreds or thousands of results, but that while loop is VERY VERY slow. Is there some way to speed this up? Maybe an opportunity for multiprocessing? Any suggestions would be appreciated. I'm averaging 1M documents per hour, and at this rate I'll need a week just to build the data frame.
EDIT: I've noticed that the more vectors there are in the while loop, the slower it gets. I'm now trying to loop separately for each vector. It still seems like a hack, though; there must be a better way.
Edit 2: I'm having some luck with data.table. It's still running, but it looks like it will finish the 12M documents (my current test set) in 4 hours. That's progress, but far from ideal.
# Preallocate the table, then fill it in place with set()
dt <- data.table(uri=rep("NA",count),
                 time=rep(0,count),
                 action=rep("NA",count),
                 bytes=rep(0,count),
                 dur=rep(0,count))
i <- 1L
while (mongo.cursor.next(cursor)) {
  b <- mongo.cursor.value(cursor)
  set(dt, i, 1L, mongo.bson.value(b, "cache"))
  set(dt, i, 2L, mongo.bson.value(b, "path"))
  set(dt, i, 3L, mongo.bson.value(b, "time"))
  set(dt, i, 4L, mongo.bson.value(b, "bytes"))
  set(dt, i, 5L, mongo.bson.value(b, "elaps"))
  i <- i + 1L  # advance to the next row; without this every write hits the same row
}
You might want to try the mongo.find.exhaust option:
cursor <- mongo.find(mongo, ns, query, options = mongo.find.exhaust)
This would be the easiest fix if it actually works for your use case.
However, the rmongodb driver seems to be missing some features available in other drivers. For example, the JavaScript driver has a Cursor.toArray method, which dumps all the find results directly into an array. The R driver has a mongo.bson.to.list function, but a mongo.cursor.to.list is probably what you want. It's probably worth pinging the driver developer for advice.
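In the meantime, something like a cursor-to-list helper can be approximated in plain R. The sketch below is only an illustration: cursor_to_df is a hypothetical name, not part of rmongodb, and it assumes every document contains all of the requested fields as scalar values. It takes the two cursor operations as functions so that it can be demonstrated here with a fake in-memory cursor instead of a live server. The idea is to convert each document once, keep the results in a preallocated list, and build the columns in a single pass at the end.

```r
# Hypothetical helper (not an rmongodb function): drain a cursor into a list
# of per-document R lists, then build each column in one vectorised pass.
cursor_to_df <- function(has_next, get_value, fields, n) {
  docs <- vector("list", n)            # preallocate one slot per document
  i <- 0L
  while (has_next()) {
    i <- i + 1L
    docs[[i]] <- get_value()           # one conversion per document
  }
  docs <- docs[seq_len(i)]             # drop unused slots if n was too large
  # Assumes each field is present and scalar in every document
  cols <- lapply(fields, function(f) {
    unlist(lapply(docs, `[[`, f), use.names = FALSE)
  })
  names(cols) <- fields
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Fake cursor standing in for a real one, for demonstration only
fake <- list(list(name = "a", age = 1), list(name = "b", age = 2))
pos <- 0L
has_next <- function() { pos <<- pos + 1L; pos <= length(fake) }
get_value <- function() fake[[pos]]

df <- cursor_to_df(has_next, get_value, c("name", "age"), length(fake))
print(df)
```

With rmongodb you would presumably pass function() mongo.cursor.next(cursor) as has_next and function() mongo.bson.to.list(mongo.cursor.value(cursor)) as get_value. I have not measured whether this beats the per-field loop on a large collection, so treat it as something to benchmark rather than a guaranteed fix.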
A hacky solution could be to create a new collection whose documents are data "chunks" of 100000 of the original documents each. Each of these could then be read efficiently with mongo.bson.to.list. The chunked collection could be constructed using the MongoDB server's MapReduce functionality.