Situation:
I have 1 million users and I want to search them by tags.
Here's the schema:
const mongoose = require('mongoose')

const accountSchema = new mongoose.Schema({
  tags: { type: [String], default: ['cats'] }
})
const Account = mongoose.model('Account', accountSchema)
Here's the first query:
const query = { tags: 'cats' }
const fields = { _id: 0 }
const options = { skip: 0, limit: 5 }
Account.find(query, fields, options)
After calling the find method, Mongoose starts searching and returns the first 5 entries it matches. Performance in this case is optimal.
Here's the second query:
const query = { tags: 'cats' }
const fields = { _id: 0 }
const options = { skip: 500000, limit: 5 }
Account.find(query, fields, options)
What goes on in this case is what interests me.
Does Mongoose match 500,000 entries first and then return the next 5?
Or does it somehow 'jump' to the 500,000th element without having to match them all beforehand?
I'm trying to understand how efficient this process is and whether I can improve it somehow. Do you have any advice?
The easiest way: create a tag table, create an entity table, and create a many-to-many link table like { tag_id, entity_id }; for every tag/entity pair, insert a row into the link table. The better way: do not store tag data in a relational DB at all; use MongoDB, ArangoDB, or another NoSQL DB for this.
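A minimal sketch of that link-collection idea in Mongoose (the Tag, Entity, and TagLink names and the entitiesByTag helper are illustrative, not from the original answer):

const mongoose = require('mongoose')

// Hypothetical schemas illustrating the tag / entity / link layout.
const Tag = mongoose.model('Tag', new mongoose.Schema({
  name: { type: String, unique: true }
}))

const Entity = mongoose.model('Entity', new mongoose.Schema({
  title: String
}))

// One document per (tag, entity) pair, i.e. the many-to-many link.
const TagLink = mongoose.model('TagLink', new mongoose.Schema({
  tag_id: { type: mongoose.Schema.Types.ObjectId, ref: 'Tag' },
  entity_id: { type: mongoose.Schema.Types.ObjectId, ref: 'Entity' }
}))

// Find entities tagged 'cats': look up the tag, then follow its links.
async function entitiesByTag(name) {
  const tag = await Tag.findOne({ name })
  if (!tag) return []
  const links = await TagLink.find({ tag_id: tag._id }).limit(5)
  return Entity.find({ _id: { $in: links.map(l => l.entity_id) } })
}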
In information systems, a tag is a keyword or term assigned to a piece of information (such as an Internet bookmark, multimedia, database record, or computer file). This kind of metadata helps describe an item and allows it to be found again by browsing or searching.
Interesting. My understanding is that Find here is very similar in query functionality to Select: it works inefficiently by first going through the 500,000 values before skipping them.
Hence, what I suggested is implementing something known as a multilevel index. Basically, create a nested indexing system which, although it takes up extra space, is very efficient, especially for sequential-search workloads where you do not wish to implement a tree outright. Create a set of outer indices that are more general, link each of them to a range of inner indices, and only then move on to the stored data blocks; see the sketch below.
Implementing this should not be too difficult and should drastically reduce the amount of traversal being done. Hope this helps.
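A minimal sketch of the two-level idea in plain JavaScript (the data layout, OUTER_FANOUT, and findBlock are illustrative assumptions, not from the answer): the outer index keeps every 100th inner key, so a lookup scans a small outer list and one small slice of the inner list instead of every block.

// Hypothetical two-level index over records sorted by key.
// Each block holds a sorted run of records; inner[] keeps each block's
// first key, and outer[] keeps every OUTER_FANOUT-th inner key.
const blocks = []
for (let k = 0; k < 100000; k += 10) {
  blocks.push([{ key: k }, { key: k + 5 }])   // toy data: 10,000 blocks
}
const inner = blocks.map(b => b[0].key)
const OUTER_FANOUT = 100
const outer = []
for (let i = 0; i < inner.length; i += OUTER_FANOUT) outer.push(inner[i])

function findBlock(key) {
  let o = 0                                    // scan the small outer level first
  while (o + 1 < outer.length && outer[o + 1] <= key) o++
  let i = o * OUTER_FANOUT                     // then only one slice of the inner level
  const end = Math.min(inner.length, i + OUTER_FANOUT)
  while (i + 1 < end && inner[i + 1] <= key) i++
  return blocks[i]                             // the block that may contain key
}

console.log(findBlock(54321))                  // [ { key: 54320 }, { key: 54325 } ]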
Yes, it is matching 500,000 entries first.
From the MongoDB Definitive Guide 2nd Edition:
$skip takes a number, n, and discards the first n documents from the result set. As with “normal” querying, it isn’t efficient for large skips, as it must find all of the matches that must be skipped and then discard them.
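For reference, $skip is the aggregation-pipeline form of the cursor's skip(); the question's query can be expressed as the pipeline below (a sketch; the accounts collection name is assumed from the Account model):

db.accounts.aggregate([
  { $match: { tags: 'cats' } },  // same filter as find()
  { $skip: 500000 },             // still scans and discards 500,000 matches
  { $limit: 5 }
])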
You can examine your query in detail from the Mongo shell using explain():
https://docs.mongodb.com/manual/reference/method/cursor.explain/
Ex:
// Create a dummy collection with random tags
var tagTypes = ['cats', 'dogs', 'chickens']
for (var i = 0; i < 1000; i++) {
  let tag = tagTypes[Math.floor(Math.random() * tagTypes.length)]
  db.people.insert({ "tags": [tag] })
}
Then explain it:
db.people.find({"tags" : "cats"}).skip(100).limit(10).explain("executionStats")
totalDocsExamined shows that it is matching everything it skips:
"executionStats" : {
...
"totalDocsExamined" : 420
If you were to create an index on tags:
db.people.createIndex({ "tags" : 1 })
Running it again you'd get:
"executionStats" : {
...
"totalDocsExamined" : 110