 

MongoDB queries optimisation

I want to retrieve several pieces of information from my User model, which looks like this:

var userSchema = new mongoose.Schema({
  email: { type: String, unique: true, lowercase: true },
  password: String,

  created_at: Date,
  updated_at: Date,

  genre : { type: String, enum: ['Teacher', 'Student', 'Guest'] },
  role : { type: String, enum: ['user', 'admin'], default: 'user' },
  active : { type: Boolean, default: false },

  profile: {
    name : { type: String, default: '' },
    headline : { type: String, default: '' },
    description : { type: String, default: '' },
    gender : { type: String, default: '' },
    ethnicity : { type: String, default: '' },
    age : { type: String, default: '' }
  },

  contacts : {
    email : { type: String, default: '' },
    phone : { type: String, default: '' },
    website : { type: String, default: '' }
  },

  location : {
    formattedAddress : { type: String, default: '' },
    country : { type: String, default: '' },
    countryCode : { type: String, default: '' },
    state : { type: String, default: '' },
    city : { type: String, default: '' },
    postcode : { type: String, default: '' },
    lat : { type: String, default: '' },
    lng : { type: String, default: '' }
  }
});

On the homepage I have a location filter where you can browse users by country or city.

Each entry also shows the number of users in it:

United Kingdom
  All Cities (300)
  London (150)
  Liverpool (80)
  Manchester (70)
France
  All Cities (50)
  Paris (30)
  Lille (20)
Netherlands
  All Cities (10)
  Amsterdam (10)
Etc...

That is for the homepage. I also have Students and Teachers pages, where I want the same information restricted to, for example, how many teachers there are in each of those countries and cities.

What I'm trying to do is retrieve all this information from MongoDB with a single query.

At the moment the query looks like this:

User.aggregate([
    { 
      $group: { 
        _id: { city: '$location.city', country: '$location.country', genre: '$genre' },
        count: { $sum: 1 }
      }
    },
    {
      $group: { 
        _id: '$_id.country',
        count: { $sum: '$count' },
        cities: { 
          $push: { 
            city: '$_id.city', 
            count: '$count'
          }
        },
        genres: {
          $push: {
            genre: '$_id.genre',
            count: '$count'
          }
        }
      }
    }
  ], function(err, results) {
    if (err) return next(err); // pass the error on instead of silently skipping it
    res.json({ 
        res: results
    });
  });

The problem is that I don't know how to get all the information I need.

  • I don't know how to get the total number of users in every Country.
  • I have the number of users for each Country.
  • I have the number of users for each city.
  • I don't know how to get the same counts for a specific genre.

Is it possible to get all this information with a single query in Mongo?

Otherwise, I could chain a few promises that make two or three separate requests to Mongo, like this:

getSomething
.then(getSomethingElse)
.then(getSomethingElseAgain)
.done

I'm sure it would be easier to store the specific data I need each time, but is that good for performance when there are more than 5,000 to 10,000 users in the DB?

Sorry, I'm still learning, and I think these things are crucial to understanding MongoDB performance and optimisation.

Thanks

Ayeye Brazo asked Feb 12 '23 05:02



1 Answer

What you want is a "faceted search" result, where you hold statistics about the matched terms in the current result set. While some products "appear" to do all the work in a single response, most generic storage engines need multiple operations to produce it.

With MongoDB you can use two queries: one to get the results themselves and another to get the facet information. This gives results similar to the faceted results available from dedicated search engine products like Solr or ElasticSearch.

But to do this efficiently, you want to store the facet data in your document in a form that is easy to query. A very effective form for what you want is an array of tokenized data:

 {
     "otherData": "something",
     "facets": [
         "country:UK",
         "city:London-UK",
         "genre:Student"
     ]
 }
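One way such an array could be populated is with a small helper that derives the tokens from the fields already in the question's schema. This is only a sketch: `buildFacets` is a hypothetical name, and it assumes `location.countryCode`, `location.city` and `genre` are filled in and that the schema gains a `facets: [String]` field to store the result.

```javascript
// Hypothetical helper: derive facet tokens from a user document.
// Assumes location.countryCode, location.city and genre are populated as in
// the question's schema; empty values are skipped.
function buildFacets(user) {
  const facets = [];
  const location = user.location || {};
  const country = location.countryCode;
  if (country) {
    facets.push('country:' + country);
    if (location.city) {
      // Suffix the city with its country so the pair can be split apart later.
      facets.push('city:' + location.city + '-' + country);
    }
  }
  if (user.genre) {
    facets.push('genre:' + user.genre);
  }
  return facets;
}

const facets = buildFacets({
  genre: 'Student',
  location: { countryCode: 'UK', city: 'London' }
});
// facets: ['country:UK', 'city:London-UK', 'genre:Student']
```

The tokens would need to be regenerated whenever the location or genre of a user changes, for example just before the document is saved.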

So "facets" is a single field in your document rather than data spread across multiple locations. This makes it very easy to index and query. You can then aggregate across your results and get the totals for each facet:

User.aggregate(
    [
        { "$unwind": "$facets" },
        { "$group": {
            "_id": "$facets",
            "count": { "$sum": 1 }
        }}
    ],
    function(err,results) {

    }
);

Or more ideally with some criteria in $match:

User.aggregate(
    [
        { "$match": { "facets": { "$in": ["genre:Student"] } } },
        { "$unwind": "$facets" },
        { "$group": {
            "_id": "$facets",
            "count": { "$sum": 1 }
        }}
    ],
    function(err,results) {

    }
);

Ultimately giving a response like:

{ "_id": "country:FR", "count": 50 },
{ "_id": "country:UK", "count": 300 },
{ "_id": "city:London-UK", "count": 150 },
{ "_id": "genre:Student", "count": 500 }

Such a structure is easy to traverse and inspect for things like the discrete "country" and the "city" that belongs to a "country", as that data is consistently separated by a hyphen "-".
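As a rough illustration of that traversal, the flat facet counts can be regrouped into a country and city structure on the client. The function name `groupFacets` and the output shape are assumptions made for this sketch; it relies only on the "type:value" and "city:Name-COUNTRY" token patterns shown above.

```javascript
// Sketch: turn flat facet counts into a country -> cities structure.
// Assumes tokens follow the "type:value" and "city:Name-COUNTRY" patterns.
function groupFacets(results) {
  const countries = {};
  const country = (code) =>
    countries[code] || (countries[code] = { count: 0, cities: {} });

  for (const { _id, count } of results) {
    const sep = _id.indexOf(':');
    const type = _id.slice(0, sep);
    const value = _id.slice(sep + 1);
    if (type === 'country') {
      country(value).count = count;
    } else if (type === 'city') {
      // Split on the last hyphen so city names may contain hyphens themselves.
      const dash = value.lastIndexOf('-');
      country(value.slice(dash + 1)).cities[value.slice(0, dash)] = count;
    }
  }
  return countries;
}

const grouped = groupFacets([
  { _id: 'country:FR', count: 50 },
  { _id: 'country:UK', count: 300 },
  { _id: 'city:London-UK', count: 150 },
  { _id: 'genre:Student', count: 500 }
]);
// grouped.UK -> { count: 300, cities: { London: 150 } }
// grouped.FR -> { count: 50, cities: {} }
```

Other facet types, such as "genre", are ignored here and could be collected into a separate map in the same loop.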

Trying to mash up documents within arrays is a bad idea. There is also a BSON document size limit of 16MB to respect, and mashing results together (especially if you are trying to keep document content) will almost certainly exceed it in the response.

For something as simple as getting the "overall count" of results from such a query, just sum the counts of a particular facet type. Or issue the same query arguments to a .count() operation:

User.count({ "facets": { "$in": ["genre:Student"] } },function(err,count) {

});
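The first option, summing the counts of one facet type from the aggregation output, might look like the sketch below. `sumFacetType` is a hypothetical helper, and the total only equals the number of matched documents if every document carries exactly one facet of that type.

```javascript
// Sketch: total the counts of one facet type, e.g. all "country:*" entries.
// This only equals the number of matched documents if each document carries
// exactly one facet of that type (an assumption about how facets are written).
function sumFacetType(results, type) {
  const prefix = type + ':';
  return results
    .filter((r) => r._id.startsWith(prefix))
    .reduce((total, r) => total + r.count, 0);
}

const total = sumFacetType(
  [
    { _id: 'country:FR', count: 50 },
    { _id: 'country:UK', count: 300 },
    { _id: 'city:London-UK', count: 150 }
  ],
  'country'
);
// total is 350
```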

As noted above, particularly when implementing "paging" of results, the roles of getting the "Result Count", the "Facet Counts" and the actual "Page of Results" are all delegated to "separate" queries to the server.

There is nothing wrong with submitting each of those queries to the server in parallel and then combining the results into a structure to feed to your template or application, looking much like the faceted search result from one of the search engine products that offer this kind of response.
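A minimal sketch of that parallel approach, using Promise.all, is shown here. The three functions passed in (`fetchPage`, `fetchFacets`, `fetchCount`) are hypothetical placeholders, as is the shape of the combined result; with Mongoose they might wrap a paged find, the facet aggregation and a count, each returning a promise.

```javascript
// Sketch: run the three independent queries in parallel and merge the results.
// fetchPage, fetchFacets and fetchCount are hypothetical functions that each
// return a promise; with Mongoose they might wrap something like
// User.find(filter).skip(n).limit(m).exec(), User.aggregate(pipeline).exec()
// and a count of the same filter.
async function loadFacetedPage({ fetchPage, fetchFacets, fetchCount }) {
  const [items, facets, total] = await Promise.all([
    fetchPage(),
    fetchFacets(),
    fetchCount()
  ]);
  return { total, facets, items };
}

// Usage with stubbed queries standing in for the real database calls.
loadFacetedPage({
  fetchPage: async () => [{ email: 'a@example.com' }],
  fetchFacets: async () => [{ _id: 'genre:Student', count: 1 }],
  fetchCount: async () => 1
}).then((result) => {
  console.log(result.total, result.facets.length, result.items.length);
});
```

Because the three queries do not depend on each other, the overall wait is roughly that of the slowest one rather than the sum of all three.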


Concluding

So put something in your document to mark the facets in a single place. An array of tokenized strings works well for this purpose. It also works well with query forms such as $in and $all for either "or" or "and" conditions on facet selection combinations.
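To make the "or" versus "and" distinction concrete, here is a small sketch that builds the filter object for either case. `facetFilter` and its `mode` argument are hypothetical names; only the `$in` and `$all` operators come from MongoDB itself.

```javascript
// Sketch: build a MongoDB filter on the "facets" array.
// mode 'any' uses $in (at least one token matches, an "or");
// mode 'all' uses $all (every token must be present, an "and").
function facetFilter(tokens, mode) {
  if (!tokens.length) return {};
  const operator = mode === 'all' ? '$all' : '$in';
  return { facets: { [operator]: tokens } };
}

const teachersInLondon = facetFilter(['genre:Teacher', 'city:London-UK'], 'all');
// { facets: { $all: ['genre:Teacher', 'city:London-UK'] } }
const ukOrFrance = facetFilter(['country:UK', 'country:FR'], 'any');
// { facets: { $in: ['country:UK', 'country:FR'] } }
```

The resulting object could be used as the condition of a $match stage or as the query passed to a find or count operation.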

Don't try and mash results or nest additions just to match some perceived hierarchical structure, but rather traverse the results received and use simple patterns in the tokens. It's very simple to

Run paged queries for the content separately from the queries for facets or overall counts. Trying to push all content into arrays and then limit it just to get counts does not make sense. The same applies to an RDBMS solution, where paging result counts and the current page are separate query operations.

There is more information written on the MongoDB Blog about Faceted Search with MongoDB that also explains some other options. There are also articles on integration with external search solutions using mongoconnector or other approaches.

Neil Lunn answered Feb 13 '23 21:02
