 

MongoDB insert speed slower after sharding

I have a MongoDB deployment consisting of a replica set with one primary and one secondary. As traffic grew, I decided to shard it to get more write throughput.

I performed hashed sharding on the "_id" field based on this tutorial and split the data into two shards. I then ran some benchmark tests and found that in some circumstances the sharded cluster is even slower than the unsharded one.
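
For reference, the setup was roughly equivalent to running the following admin commands through mongos. This is only a sketch using the Java driver: the mongos address is a placeholder, and the db/collection names are taken from the loader code below.

MongoClient mongos = new MongoClient("mongos-host", 27017); // placeholder mongos address
MongoDatabase admin = mongos.getDatabase("admin");
// Enable sharding for the database, then shard the collection on a hashed _id.
admin.runCommand(new Document("enableSharding", "my-db"));
admin.runCommand(new Document("shardCollection", "my-db.my-collection")
        .append("key", new Document("_id", "hashed")));
mongos.close();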

Here is the test result.

  1. Max throughput test: Use ten machines to run "mongoimport" at the same time to load data into the target db, in order to measure the db's maximum write speed (a sample invocation is sketched after this list).

    Result:

    The sharded cluster can insert 39500 documents/s.

    The unsharded cluster can insert 27400 documents/s.

  2. Single-instance mongoimport test: Use just one machine to run "mongoimport" to load data into the target db.

    Result:

    The sharded cluster can insert 14285 documents/s.

    The unsharded cluster can insert 14085 documents/s.

  3. Single-instance data loading with the MongoDB Java driver: Use just one machine to load data into the target db by calling the MongoDB Java driver's API.

    Result:

    The sharded cluster can insert 4630 documents/s.

    The unsharded cluster can insert 17544 documents/s.
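
For reference, the mongoimport runs in tests 1 and 2 looked roughly like the following; the host name is a placeholder, and the exact flags may have differed:

mongoimport --host mongos-host --port 27017 --db my-db --collection my-collection --file my-sample-dataset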

The first test's result makes perfect sense. You shard the db into a 2-shard cluster and throughput increases by roughly 44% (39,500 vs. 27,400 documents/s); everything is great, hooray!

The second test's result somewhat makes sense. The throughput is about the same, but maybe the bottleneck is on the data loader's side; after all, we are loading data from just one machine.

But the third test really bugs me. It makes no sense that the sharded cluster could be that much slower than the unsharded one. The unsharded db, on the other hand, shows amazing speed, even faster than loading the data with mongoimport.

The Java code used for loading the data is pasted below. I really cannot figure this out; thanks in advance for any answers.

// Imports needed for this snippet:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.Scanner;
import java.util.concurrent.ExecutionException;

import org.bson.Document;

import com.mongodb.MongoClient;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.InsertOneModel;
import com.mongodb.client.model.WriteModel;

public static void insert(String host, int port) throws FileNotFoundException,
        InterruptedException, ExecutionException {
    MongoClient mongoClient = new MongoClient(host, port);
    // Fire-and-forget writes: the client does not wait for a server response.
    mongoClient.setWriteConcern(WriteConcern.UNACKNOWLEDGED);
    MongoDatabase database = mongoClient.getDatabase("my-db");
    MongoCollection<Document> collection = database.getCollection("my-collection");
    Scanner scan = new Scanner(new File("my-sample-dataset"));

    // Pre-load the data into memory, so that the db load test won't be
    // affected by disk I/O time.
    Queue<List<String>> resource = new LinkedList<>();
    for (int i = 0; i < 100; i++) {
        List<String> strs = new ArrayList<>();
        for (int j = 0; j < 10000; j++)
            strs.add(scan.nextLine());
        resource.add(strs);
    }

    System.out.println("start");
    long startTime = System.currentTimeMillis();
    // Insert the documents in 100 bulk writes of 10,000 documents each.
    while (!resource.isEmpty()) {
        List<String> strs = resource.poll();
        List<WriteModel<Document>> list = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            list.add(new InsertOneModel<Document>(Document.parse(strs.get(i))));
        }
        collection.bulkWrite(list);
    }
    System.out.println("Finished loading. Time taken: "
            + (System.currentTimeMillis() - startTime) + "ms");
    scan.close();
    mongoClient.close();
}
Asked by Mohan Yang on Mar 01 '17


1 Answer

Here is the possible culprit: collection.bulkWrite(list);

In the case of bulk writes, mongos needs to break up your batch into smaller sub-batches that go to each shard.

Since you haven't specified anything about the insertion order of the documents in your batch, MongoDB must honor the default requirement that the inserts happen in the order in which they are specified. The consequence is that consecutive inserts can be batched together if and only if they go to the same shard.

mongos maintains the original document order, hence only consecutive inserts that belong to the same shard can be grouped into one batch.

For example, consider the case where "k" is the shard key and there are two shards, corresponding to the ranges

[MinKey, 10], (10, MaxKey]

Now suppose we batch insert the following documents:

[{k: 1}, {k: 25}, {k: 2}]

Doc1 -> Shard1, Doc2 -> Shard2, Doc3 -> Shard1

No two consecutive documents belong to the same shard, hence a call to getLastError is required after each document in this case.

In the case of hashed shard keys, documents are distributed more randomly among the shards, i.e. documents belonging to the same shard tend to be scattered throughout the batch and hence produce a larger number of sub-batches. The more random the distribution, the smaller each sub-batch, the greater their total number, and the higher the incurred getLastError cost, which effectively means poorer performance.

Fix: specify "ordered: false".

collection.bulkWrite(list, new BulkWriteOptions().ordered(false));

This tells the database that you do not care about strictly preserving the order in which the insertions take place. With "ordered: false", mongos will create a single batch per shard, obviating the extra getLastError calls. Each per-shard batch can then be executed concurrently, without waiting for the getLastError response of the previous batch.
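
Applied to the loading loop from the question, it is a one-line change (BulkWriteOptions comes from com.mongodb.client.model):

// Unordered bulk write: mongos can build one batch per shard and run them in parallel.
while (!resource.isEmpty()) {
    List<String> strs = resource.poll();
    List<WriteModel<Document>> list = new ArrayList<>();
    for (int i = 0; i < 10000; i++) {
        list.add(new InsertOneModel<Document>(Document.parse(strs.get(i))));
    }
    collection.bulkWrite(list, new BulkWriteOptions().ordered(false));
}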


Also,

MongoClient mongoClient = new MongoClient(host, port);

This creates a Mongo instance based on a single MongoDB node; it will not discover the other nodes in your replica set or sharded cluster.

In this case, all your write requests are routed to a single node, which is then responsible for all the additional bookkeeping work that a sharded cluster requires. What you should use instead is

MongoClient(final List<ServerAddress> seeds)

When there is more than one server to choose from, based on the type of request (read or write) and the read preference (for reads), the driver randomly selects a server to send the request to. This applies to both replica sets and sharded clusters.

Note: put as many servers in the list as you can, and the system will figure out the rest.
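
For example (the host names are placeholders for your own mongos routers):

import java.util.Arrays;
import com.mongodb.MongoClient;
import com.mongodb.ServerAddress;

// Seed the client with all mongos routers so the driver can spread
// requests across them instead of pinning everything to one node.
MongoClient mongoClient = new MongoClient(Arrays.asList(
        new ServerAddress("mongos-1.example.com", 27017),
        new ServerAddress("mongos-2.example.com", 27017)));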

Answered by Rahul