created a collection in MongoDB consisting of 11446615 documents. Each document has the following form: <pre class="prettyprint"><code>{ "_id" : ObjectId("4e03dec7c3c365f574820835"), "httpReferer" : "http://www.somewebsite.pl/art.php?id=13321&b=1", "words" : ["SEX", "DRUGS", "ROCKNROLL", "WHATEVER"], "howMany" : 3 } </code></pre> httpReferer: just an url words: words parsed from the url above. Size of the list is between 15 and 90. I am planning to use this database to obtain list of webpages which have similar content. I 'll by querying this collection using words field so I created (or rather started creating) index on this field: <pre class="prettyprint"><code>db.my_coll.ensureIndex({words: 1}) </code></pre> Creating this collection takes very long time. I tried two approaches (tests below were done on my laptop): <ol> <li> Inserting and indexing Inserting took 5.5 hours mainly due to cpu intensive preprocessing of data. Indexing took 30 hours.</li> <li> Indexing before inserting It would take a few days to insert all data to collection. </li> </ol> My main focus it to decrease time of generating the collection. I don't need replication (at least for now). Querying also doesn't have to be light-fast. Now, time for a question: I have only one machine with one disk were I can run my app. Does it make sense to run more than one instance of the database and split my data between them?

Yes, it does make sense to shard on a single server. <ol> <li>At this time, MongoDB still uses a global lock per mongodb server. Creating multiple servers will release a server from one another's locks.</li> <li>If you run a multiple core machine with seperate NUMAs, this can also increase performance.</li> <li>If your load increases too much for your server, initial sharding makes for easier horizontal scaling in the future. You might as well do it now.</li> </ol> Machines vary. I suggest writing your own bulk insertion benchmark program and spin up a various number of MongoDB server shards. I have a 16 core RAIDed machine and I've found that 3-4 shards seem to be ideal for my heavy write database. I'm finding that my two NUMAs are my bottleneck.

MongoDB: Sharding on single machine. Does it make sense?

Tags:

mongodb

sharding

created a collection in MongoDB consisting of 11446615 documents.

Each document has the following form:

{ 
 "_id" : ObjectId("4e03dec7c3c365f574820835"), 
 "httpReferer" : "http://www.somewebsite.pl/art.php?id=13321&b=1", 
 "words" : ["SEX", "DRUGS", "ROCKNROLL", "WHATEVER"],     
 "howMany" : 3 
}

httpReferer: just an url

words: words parsed from the url above. Size of the list is between 15 and 90.

I am planning to use this database to obtain list of webpages which have similar content.

I 'll by querying this collection using words field so I created (or rather started creating) index on this field:

db.my_coll.ensureIndex({words: 1})

Creating this collection takes very long time. I tried two approaches (tests below were done on my laptop):

Inserting and indexing Inserting took 5.5 hours mainly due to cpu intensive preprocessing of data. Indexing took 30 hours.
Indexing before inserting It would take a few days to insert all data to collection.

My main focus it to decrease time of generating the collection. I don't need replication (at least for now). Querying also doesn't have to be light-fast.

Now, time for a question:

I have only one machine with one disk were I can run my app. Does it make sense to run more than one instance of the database and split my data between them?

687

asked Jun 25 '11 12:06

whysoserious

2 Answers

Yes, it does make sense to shard on a single server.

At this time, MongoDB still uses a global lock per mongodb server. Creating multiple servers will release a server from one another's locks.
If you run a multiple core machine with seperate NUMAs, this can also increase performance.
If your load increases too much for your server, initial sharding makes for easier horizontal scaling in the future. You might as well do it now.

Machines vary. I suggest writing your own bulk insertion benchmark program and spin up a various number of MongoDB server shards. I have a 16 core RAIDed machine and I've found that 3-4 shards seem to be ideal for my heavy write database. I'm finding that my two NUMAs are my bottleneck.

140

answered Dec 01 '22 15:12

EhevuTov

In modern day(2015) with mongodb v3.0.x there is collection-level locking with mmap, which increases write throughput slightly(assuming your writing to multiple collections), but if you use the wiredtiger engine there is document level locking, which has a much higher write throughput. This removes the need for sharding across a single machine. Though you can technically still increase the performance of mapReduce by sharding across a single machine, but in this case you'd be better off just using the aggregation framework which can exploit multiple cores. If you heavily rely on map reduce algorithms it might make most sense to just use something like Hadoop.

The only reason for sharding mongodb is to horizontally scale. So in the event that a single machine cannot house enough disk space, memory, or CPU power(rare), then sharding becomes beneficial. I think its really really seldom that someone has enough data that they need to shard, even a large business, especially since wiredtiger added compression support that can reduce disk usage to over 80% less. Its also infrequent that someone uses mongodb to perform really CPU heavy queries at a large scale, because there are much better technologies for this. In most cases IO is the most important factor in performance, not many queries are CPU intensive, unless you're running a lot of complex aggregations, even geo-spatial is indexed upon insertion.

Most likely reason you'd need to shard is if you have a lot of indexes that consume a large amount of RAM, wiredtiger reduces this, but its still the most common reason to shard. Where as sharding across a single machine is likely just going to cause undesired overhead, with very little or possible no benefits.

answered Dec 01 '22 15:12

tsturzl

Related questions
                            
                                How do I update Array Elements matching criteria in a MongoDB document?
                            
                                Can't get mongoid working with Rails 4
                            
                                Mongoengine: ConnectionError: You have not defined a default connection
                            
                                How can a store raw JSON in Mongo using Spring Boot
                            
                                Mongo throwing "Element name 'name' is not valid' exception
                            
                                Authentication failed when using flask_pymongo
                            
                                Using mongoose to add new field to existing document in mongodb through update method
                            
                                How to Ignore Duplicate Key Errors Safely Using insert_many
                            
                                Add new field to all documents in a nested array
                            
                                Mongoose - how to tap schema middleware into the 'init' event?
                            
                                mongodb geoNear vs near
                            
                                Google Cloud Platform - Can't connect to mongodb
                            
                                Pymongo $in Query Not Working
                            
                                How to get last inserted id using the JS api?
                            
                                How do I dump an entire MongoDB database as text/json?
                            
                                How does one properly increment many dates in mongoDB?
                            
                                Fastest way to import millions of JSON documents to MongoDB
                            
                                How do I know if MongoDB needs more CPU/RAM? [closed]
                            
                                Mongoose query where with missing (or undefined) fields
                            
                                Rails: storing encrypted data in database

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With