In MongoDB there are multiple types of indexes. For this question I'm interested in the ascending (or descending) index, which can be used for sorting, and the hashed index, which according to the documentation is "primarily used with sharded clusters to support hashed shard keys" (source), ensuring "a more even distribution of data" (source).
I know that you can't create an index like: db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } )
because you get an error:
{ "createdCollectionAutomatically" : true, "numIndexesBefore" : 1, "errmsg" : "exception: Currently only single field hashed index supported.", "code" : 16763, "ok" : 0 }
My question:
Between the indices:
db.test.ensureIndex( { "key": 1 } )
db.test.ensureIndex( { "key": "hashed" } )
For the query db.products.find( { key: "a" } ), which one is more performant? Is the hashed key lookup O(1)?
How I got to the question:
Before I knew that you could not include a hashed field in a compound index, I created an index of the form db.test.ensureIndex( { "key": 1, "sortOrder": 1 } ), and while creating it I wondered whether the hashed index was more performant than the ascending one (a hash lookup is usually O(1)). I left the index as it is now because (as I mentioned above) db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ) was not allowed. But the question of whether the hashed index is faster for searches by key stayed in my mind.
The situation in which I made the index was:
I had a collection that contained a sorted list of documents classified by keys.
e.g. {key: a, sortOrder: 1, ...}, {key: a, sortOrder: 2, ...}, {key: a, sortOrder: 3, ...}, {key: b, sortOrder: 1, ...}, {key: b, sortOrder: 2, ...}, ...
Since I used key to classify the documents and sortOrder for pagination, I always queried filtering on a single value of key and sorting on sortOrder.
That means that I had two possible queries:
db.products.find( { key: "a" } ).limit(10).sort({ "sortOrder": 1 })
db.products.find( { key: "a", sortOrder: { $gt: 10 } } ).limit(10).sort({ "sortOrder": 1 })
In this specific scenario, searching with O(1) for the key and O(log(n)) for the sortOrder would have been ideal, but that wasn't allowed.
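Since a hashed field can't be combined with sortOrder in one index, what remains is the compound ascending index. A minimal sketch (not from the original post, reusing the collection name from the examples above) showing that it serves both queries with a single index scan and no in-memory sort:

// Compound ascending index: O(log n) descent to locate key "a",
// then a sequential walk of entries already ordered by sortOrder.
db.test.ensureIndex( { "key": 1, "sortOrder": 1 } )   // ensureIndex is a legacy alias of createIndex

// Both queries can use the index for the filter and the sort:
db.test.find( { key: "a" } ).sort( { sortOrder: 1 } ).limit(10)
db.test.find( { key: "a", sortOrder: { $gt: 10 } } ).sort( { sortOrder: 1 } ).limit(10)

// explain() should report an IXSCAN on key_1_sortOrder_1 with no SORT stage:
db.test.find( { key: "a" } ).sort( { sortOrder: 1 } ).explain("executionStats")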
Hashed indexes use a hashing function to compute the hash of the value of the index field. [1] The hashing function collapses embedded documents and computes the hash for the entire value but does not support multi-key (i.e. arrays) indexes.
The index stores the value of a specific field or set of fields, ordered by the value of the field. The ordering of the index entries supports efficient equality matches and range-based query operations. In addition, MongoDB can return sorted results by using the ordering in the index.
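To make the difference described in the quoted docs concrete, here is a small sketch (not from the original answer): a hashed index supports only equality matches on the field, while an ascending index also supports range queries and sorted results.

// Hashed index: equality matches only.
db.test.ensureIndex( { "key": "hashed" } )
db.test.find( { key: "a" } )                 // can use the hashed index
db.test.find( { key: { $gte: "a" } } )       // cannot; a range scan needs { key: 1 } instead
db.test.find( { key: "a" } ).explain()       // winningPlan should show an IXSCAN on key_hashed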
For the query db.products.find( { key: "a" } ), which one is more performant?
Given that the field key is indexed in both cases, the complexity of the index search itself would be very similar, as the value "a" would be hashed and stored in the index tree.
If we're looking at the overall performance cost, the hashed version would incur an extra (negligible) cost of hashing the value of "a" before matching it in the index tree. See also mongo/db/index/hash_access_method.h
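For illustration, recent MongoDB shells expose the same 64-bit hash function used by hashed indexes; availability of the helper depends on the shell version, so treat this as a sketch:

// Shows the hash MongoDB would compute and store for the value "a":
convertShardKeyToHashed("a")    // returns a NumberLong; computing this is the only extra work per lookup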
Also, a hashed index would not be able to utilise index prefix compression (WiredTiger). Index prefix compression is especially effective for some data sets, such as those with low cardinality (e.g. country) or with repeating values such as phone numbers, social security codes, and geo-coordinates. It is especially effective for compound indexes, where the first field is repeated across all the unique values of the second field.
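One rough way to observe this, sketched here under the assumption that both indexes exist on the same data: compare their on-disk sizes. Hashed values are effectively random, so consecutive index entries share no common prefix for WiredTiger to compress.

// Compare on-disk index sizes (index names follow MongoDB's default naming):
db.test.ensureIndex( { "key": 1 } )
db.test.ensureIndex( { "key": "hashed" } )
db.test.stats().indexSizes
// e.g. { "_id_": ..., "key_1": <smaller>, "key_hashed": <typically larger> }
// Exact numbers depend on the data set.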
Any reason not to use hash in a non-ordered field?
Generally there is no reason to hash a non-range value. To choose a shard key, consider the cardinality, frequency, and rate of change of the value.
A hashed index is commonly used for a specific case of sharding. When the shard key is a monotonically increasing/decreasing value, writes would likely go to one shard only. This is where a hashed shard key can improve the distribution of writes. It's a minor trade-off that can greatly improve your sharded cluster. See also Hashed vs Ranged Sharding.
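As a sketch of that sharding use case (the database and collection names are hypothetical), sharding on a hashed _id spreads monotonically increasing ObjectIds across shards instead of always appending to the last chunk:

sh.enableSharding("mydb")
sh.shardCollection("mydb.test", { _id: "hashed" })   // for an empty collection MongoDB creates the required hashed index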
is it worth to insert a random hash or value with the document, and use that for sharding instead of a hash generated on the _id ?
Whether it's worth it depends on the use case. A custom hash value would mean that any query for the hash value would have to go through custom hashing code, i.e. in the application.
The advantage for utilising the built-in hash function is that MongoDB automatically computes the hashes when resolving queries using hashed indexes. Therefore, applications do not need to compute hashes.
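A small sketch contrasting the two approaches (the field name keyHash and the function myAppHash are hypothetical, for illustration only):

// Built-in hashed index: the application queries by the original value and
// MongoDB computes the hash internally when resolving the query.
db.test.ensureIndex( { "key": "hashed" } )
db.test.find( { key: "a" } )

// Custom hash field: the application must compute and store the hash itself
// and must hash every value it later queries for.
// db.test.insert( { key: "a", keyHash: myAppHash("a") } )
// db.test.find( { keyHash: myAppHash("a") } )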