Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is full text search of MongoDB shards directly much faster than going through the cluster manager (mongos) instance?

I have been very unhappy with full text search performance in MongoDB so I have been looking for outside-the-box solutions. With a relatively small collection of 25 million documents sharded across 8 beefy machines (4 shards with redundancy) I see some queries taking 10 seconds. That is awful. On a lark, I tried a 10 second query to the shards directly, and it seems like the mongos is sending the queries to shards serially, rather than in parallel. Across the 4 shards I saw search times of 2.5 seconds on one shard and the other 3 shards under 2 seconds each. That is a total of less than 8.5 seconds, but it took 10 through mongos. Facepalm.

Can someone confirm these queries to shards are being run serially? Or offer some other explanation?

What are the pitfalls to querying the shards directly?

We are on 4.0 and the query looks like this:

db.items.aggregate(
[
   { "$match" : {
    "$text" : { "$search" : "search terms"}
      }
   }, 
   { "$project": { "type_id" : 1, "source_id": 1 } },
   { "$facet" : { "types" : [ { "$unwind" : "$type_id"} , { "$sortByCount" : "$type_id"}] , "sources" : [ { "$unwind" : "$source_id"} , { "$sortByCount" : "$source_id"}]}}
]
);

I made a mistake before, this is the query being sent that has the issue. And I talked to a MongoDB expert and was clued into a big part of what's going on (I think), but happy to see what others have to say so I can pay the bounty and make it official.

like image 445
Chris Seline Avatar asked Aug 31 '18 17:08

Chris Seline


People also ask

How does MongoDB increase sharding performance?

Sharding with MongoDB allows you to seamlessly scale the database as your applications grow beyond the hardware limits of a single server, and it does so without adding complexity to the application.

What are the benefits of sharding in MongoDB?

Advantages of Sharding MongoDB distributes the read and write workload across the shards in the sharded cluster, allowing each shard to process a subset of cluster operations. Both read and write workloads can be scaled horizontally across the cluster by adding more shards.

What is the main purpose of the mongos process?

The mongos process, shown in the center of figure 1, is a router that directs all reads, writes, and commands to the appropriate shard. In this way, mongos provides clients with a single point of contact with the cluster, which is what enables a sharded cluster to present the same interface as an unsharded one.

How does sharding work in MongoDB?

Sharding is the process of distributing data across multiple hosts. In MongoDB, sharding is achieved by splitting large data sets into small data sets across multiple MongoDB instances.


1 Answers

Can someone confirm these queries to shards are being run serially? Or offer some other explanation?

Without a shard key in the query, the query is sent to all shards and processed in parallel. However, the results from all shards will be merged at the primary shard, and thus it'll wait until the slowest shard returns.

What are the pitfalls to querying the shards directly?

You can potentially include orphaned documents. Query via mongos also checks orphaned documents to ensure data consistency. Therefore, querying via mongos has more overhead than querying directly from each shard.

Measured using Robo 3T's query time

Using Robo 3T doesn't measure the query time correctly. By default, Robo 3T returns first 50 documents. For driver implementations, if the number of returned documents is more than the default batch size, to retrieve the all docs, there will be getmore requests followed to database. Robo 3T only gives you the first batch, i.e. a subset of results.

To evaluate your query, add explain('executionStats') to your query. The performance hit is likely the data transfer between shards. Because the lacking of a shard key in the query, the results of all shards have to be sent to a shard before merging. The total time is not only the query time (locating the docs) from mongo engine, but also documents retrieval time.

Execute the command below and you'll see inputStages from each shard to better evaluate your query.

db.items.explain('executionStats').aggregate(
[
   { "$match" : {
    "$text" : { "$search" : "search terms"}
      }
   }, 
   { "$project": { "type_id" : 1, "source_id": 1 } },
   { "$facet" : { "types" : [ { "$unwind" : "$type_id"} , { "$sortByCount" : "$type_id"}] , "sources" : [ { "$unwind" : "$source_id"} , { "$sortByCount" : "$source_id"}]}}
]
);
like image 148
simagix Avatar answered Oct 24 '22 14:10

simagix