Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MongoDB: Sharding on single machine. Does it make sense?

created a collection in MongoDB consisting of 11446615 documents.

Each document has the following form:

{ 
 "_id" : ObjectId("4e03dec7c3c365f574820835"), 
 "httpReferer" : "http://www.somewebsite.pl/art.php?id=13321&b=1", 
 "words" : ["SEX", "DRUGS", "ROCKNROLL", "WHATEVER"],     
 "howMany" : 3 
}

httpReferer: just an url

words: words parsed from the url above. Size of the list is between 15 and 90.

I am planning to use this database to obtain list of webpages which have similar content.

I 'll by querying this collection using words field so I created (or rather started creating) index on this field:

db.my_coll.ensureIndex({words: 1})

Creating this collection takes very long time. I tried two approaches (tests below were done on my laptop):

  1. Inserting and indexing Inserting took 5.5 hours mainly due to cpu intensive preprocessing of data. Indexing took 30 hours.
  2. Indexing before inserting It would take a few days to insert all data to collection.

My main focus it to decrease time of generating the collection. I don't need replication (at least for now). Querying also doesn't have to be light-fast.

Now, time for a question:

I have only one machine with one disk were I can run my app. Does it make sense to run more than one instance of the database and split my data between them?

like image 687
whysoserious Avatar asked Jun 25 '11 12:06

whysoserious


People also ask

Does MongoDB Sharding improve performance?

Sharded clusters in MongoDB are another way to potentially improve performance. Like replication, sharding is a way to distribute large data sets across multiple servers. Using what's called a shard key, developers can copy pieces of data (or “shards”) across multiple servers.

When should you shard MongoDB?

Depending on the database size and on disk speed, a backup/restore process might take hours or even days! There is no hard number in Gigabytes to justify a cluster. But in general, you should engage when the database is more than 200GB the backup and restore processes might take a while to finish.

Does sharding improve write performance?

Advantages of shardingIncreased read/write throughput — By distributing the dataset across multiple shards, both read and write operation capacity is increased as long as read and write operations are confined to a single shard.

Is sharding better than replication?

Replication may help with horizontal scaling of reads if you are OK to read data that potentially isn't the latest. sharding allows for horizontal scaling of data writes by partitioning data across multiple servers using a shard key. It's important to choose a good shard key.


2 Answers

Yes, it does make sense to shard on a single server.

  1. At this time, MongoDB still uses a global lock per mongodb server. Creating multiple servers will release a server from one another's locks.

  2. If you run a multiple core machine with seperate NUMAs, this can also increase performance.

  3. If your load increases too much for your server, initial sharding makes for easier horizontal scaling in the future. You might as well do it now.

Machines vary. I suggest writing your own bulk insertion benchmark program and spin up a various number of MongoDB server shards. I have a 16 core RAIDed machine and I've found that 3-4 shards seem to be ideal for my heavy write database. I'm finding that my two NUMAs are my bottleneck.

like image 140
EhevuTov Avatar answered Dec 01 '22 15:12

EhevuTov


In modern day(2015) with mongodb v3.0.x there is collection-level locking with mmap, which increases write throughput slightly(assuming your writing to multiple collections), but if you use the wiredtiger engine there is document level locking, which has a much higher write throughput. This removes the need for sharding across a single machine. Though you can technically still increase the performance of mapReduce by sharding across a single machine, but in this case you'd be better off just using the aggregation framework which can exploit multiple cores. If you heavily rely on map reduce algorithms it might make most sense to just use something like Hadoop.

The only reason for sharding mongodb is to horizontally scale. So in the event that a single machine cannot house enough disk space, memory, or CPU power(rare), then sharding becomes beneficial. I think its really really seldom that someone has enough data that they need to shard, even a large business, especially since wiredtiger added compression support that can reduce disk usage to over 80% less. Its also infrequent that someone uses mongodb to perform really CPU heavy queries at a large scale, because there are much better technologies for this. In most cases IO is the most important factor in performance, not many queries are CPU intensive, unless you're running a lot of complex aggregations, even geo-spatial is indexed upon insertion.

Most likely reason you'd need to shard is if you have a lot of indexes that consume a large amount of RAM, wiredtiger reduces this, but its still the most common reason to shard. Where as sharding across a single machine is likely just going to cause undesired overhead, with very little or possible no benefits.

like image 25
tsturzl Avatar answered Dec 01 '22 15:12

tsturzl