
In MongoDB, strategy for maximizing performance of writes to daily log documents

Tags: io, mongodb, nosql

We have a collection of log data, where each document in the collection is identified by a MAC address and a calendar day. Basically:

{
  _id: <generated>,
  mac: <string>,
  day: <date>,
  data: [ "value1", "value2" ]
}

Every five minutes, we append a new log entry to the data array within the current day's document. The document rolls over at midnight UTC when we create a new document for each MAC.
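
The append looks roughly like this in the mongo shell (the collection name `logs`, the literal MAC address, and the value are placeholders for illustration):

// Append one reading to the current UTC day's document.
db.logs.update(
  { mac: "00:11:22:33:44:55", day: ISODate("2011-11-04T00:00:00Z") },
  { $push: { data: "value1" } },   // add the five-minute log entry
  { upsert: true }                 // first write after midnight creates the new document
);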

We've noticed that IO, as measured by bytes written, increases all day long, and then drops back down at midnight UTC. This shouldn't happen because the rate of log messages is constant. We believe that the unexpected behavior is due to Mongo moving documents, as opposed to updating their log arrays in place. For what it's worth, stats() shows that the paddingFactor is 1.0299999997858227.

Several questions:

  1. Is there a way to confirm whether Mongo is updating in place or moving? We see some moves in the slow query log, but this seems like anecdotal evidence. I know I can db.setProfilingLevel(2), then db.system.profile.find(), and finally look for "moved:true" (see the sketch after this list), but I'm not sure whether it's ok to do this on a busy production system.
  2. The size of each document is very predictable and regular. Assuming that Mongo is doing a lot of moves, what's the best way to figure out why Mongo isn't able to presize more accurately, or to make it presize more accurately? Assuming that the above description of the problem is right, tweaking the padding factor does not seem like it would do the trick.
  3. It should be easy enough for me to presize the document and remove any guesswork from Mongo. (I know the padding factor docs say that I shouldn't have to do this, but I just need to put this issue behind me.) What's the best way to presize a document? It seems simple to write a document with a garbage byte array field, and then immediately remove that field from the document (see the sketch after this list), but are there any gotchas that I should be aware of? For example, I can imagine having to wait on the server for the write operation (i.e. do a safe write) before removing the garbage field.
  4. I was concerned about preallocating all of a day's documents at around the same time because it seems like this would saturate the disk at that time. Is this a valid concern? Should I try to spread out the preallocation costs over the previous day?
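
To make questions 1 and 3 concrete, here are rough mongo shell sketches; the collection name `logs`, the MAC address, the `filler` field name, and the 10KB size are placeholders, not measured values:

// Question 1: profile briefly, then look for updates flagged as moves.
// (Level 2 records every operation, so keep the window short on a busy system,
// or use level 1 with a slowms threshold instead.)
db.setProfilingLevel(2);
// ...let some writes happen...
db.system.profile.find({ moved: true }).sort({ ts: -1 }).limit(10);
db.setProfilingLevel(0);

// Question 3: presize by inserting the next day's document with a throwaway
// filler field of roughly the final size, then $unset it.
db.logs.insert({
  mac: "00:11:22:33:44:55",
  day: ISODate("2011-11-05T00:00:00Z"),
  data: [],
  filler: new Array(10240).join("x")   // ~10KB stand-in for the garbage bytes
});
db.getLastError();                      // safe write: confirm the insert landed first
db.logs.update(
  { mac: "00:11:22:33:44:55", day: ISODate("2011-11-05T00:00:00Z") },
  { $unset: { filler: 1 } }             // the document keeps its allocated space
);
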
asked Nov 04 '11 by jtoberon


1 Answer

The following combination seems to cause write performance to fall off a cliff:

  1. Journaling is on.
  2. Writes append entries to an array that makes up the bulk of a larger document.

Presumably I/O becomes saturated. Changing either of these factors seems to prevent this from happening:

  1. Turn journaling off. Use more replicas instead.
  2. Use smaller documents. Note that document size here is measured in bytes, not in the length of any arrays in the documents.
  3. Journal on a separate filesystem.

In addition, here are some other tricks that improve write throughput. With the exception of sharding, these were incremental improvements, whereas we were trying to solve a "this doesn't work at all" kind of problem; I'm including them here in case you're looking for incremental gains. The 10Gen folks did some testing and got similar results:

  1. Shard.
  2. Break up long arrays into several arrays, so that your overall structure looks more like a nested tree. If you use hour of the day as the key, then the daily log document becomes (see the sketch after this list):
    {"0":[...], "1":[...],...,"23":[...]}.
  3. Try manual preallocation. (This didn't help us. Mongo's padding seems to work as advertised. My original question was misguided.)
  4. Try different --syncdelay values. (This didn't help us.)
  5. Try without safe writes. (We were already doing this for the log data, and it's not possible in many situations. Also, this seems like a bit of a cheat.)
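
For item 2, a rough sketch of the hourly split (same placeholder collection and values as above; the mac and day fields from the question would still sit alongside the hour keys):

// Push the reading onto the current hour's sub-array instead of one long array.
var hour = "" + new Date().getUTCHours();   // e.g. "14"
var push = {};
push[hour] = "value1";
db.logs.update(
  { mac: "00:11:22:33:44:55", day: ISODate("2011-11-04T00:00:00Z") },
  { $push: push },
  { upsert: true }
);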

You'll notice that I've copied some of the suggestions from 10Gen here, just for completeness. Hopefully I did so accurately! If they publish a cookbook example, then I'll post a link here.

answered Sep 18 '22 by jtoberon