Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MongoDB large index build very slow

I have a collection with 400 million documents. Each has 6 DateTime, 1 Boolean, 8 Double, 9 Integer, and 6 String fields. I am trying to build the following index:

db.MyCollection.ensureIndex( 
    { "String1" : 1, "String2" : 1, "String3" : 1, "DateTime1" : 1, "Integer1" : 1, "DateTime2" : 1 }, 
    {background: true} 
);

After running for 5 days it is only half done.

The server is running Windows Server Enterprise and has 4TB disk space and 256GB RAM. Very few other processes are running against the database. No sharding or other special configuration.

Is there any way to speed this up? (Without dropping the background = true qualifier, because I don't want it to completely shut me out of the database, which it does in that case.)

like image 642
davidpaulstevens Avatar asked May 22 '14 19:05

davidpaulstevens


1 Answers

Misconceptions

Speed

Even when not talking of a multi key index, here is what happens. There is a massive table scan going on. So mongoDB iterates over the documents, tries to find the field to be indexed, evaluates that field (to null if it does not exist in the current document) and writes it's findings to no less than 6 files as we are talking of 6 indices. Doing the math: 200.000.000 / 86400 * 5 tells us that mongoDB does this for roughly 460 documents per second or only needs 2.2 milliseconds per document. I would not call that slow. It may take long, but it is not slow.

{background:true}

Using this parameter does not lock you out of the databases. Quite the contrary, which is clearly stated in the docs, both on the Index Creation section and in the tutorial section on creating indices in the background. However, there is a sentence which can easily be misinterpreted:

Also, no operation that requires a read or write lock on all databases (e.g. listDatabases) can occur during a foreground index build.

What that means is that you can not do operations which apply to all databases and require a read or write lock.

Ways to improve (in the future)

Sharded Cluster

Use a shared cluster with replica set shards. It is easy to set up and has multiple advantages besides improved performance. One of them is easy scalability adding a shard (and thus adding space and computing power to a cluster) is very easy. Backups have less impact on the application. There is not single point of failure any more (when done right, this even applies to outages at the scale of a whole datacenter).

Use a different filesystem

Sorry, running a disk io performance dependent application on a Windows Server does not make sense to me - at all. ExtFS4 or XFS are between 25% and 40% faster than NTFS or ReFS, depending on the optimization. This makes a real difference on applications which are as disk IO dependent like your use case. We are talking of a matter of days (not even taking into account the more efficient memory mapping and the reduced memory consumption of the OS on Linux systems).

{background:true}

While this does not really improve performance (actually building indices in the background take longer than in foreground for obvious reasons), your application stays available during the time during which the index is build. So depending on your needs, this may be a viable option.

Side note: It is a Bad Idea™, to scale vertically when using mongoDB since it was explicitly designed to be scaled horizontally. This especially applies for large collections like yours as parallel processing would greatly improve the performance of your application.

like image 164
Markus W Mahlberg Avatar answered Nov 14 '22 04:11

Markus W Mahlberg