
MongoDB - Sort by computed field

Tags:

mongodb

I'm struggling to find a solution to a problem with MongoDB:

I need to run a query on a collection with a high write/read ratio. The query consists of sorting the documents by a field which is derived from other fields belonging to the same document. Moreover, one of those fields is the size of an array, which makes it even harder.

A simple example:

D1 - { _id: 1, field: 1, array_field: [a,b,c,d] } -> score = 1 + 4 = 5
D2 - { _id: 2, field: 2, array_field: [a,b] }     -> score = 2 + 2 = 4

Expected result:

D2 - { _id: 2, score: 4 }
D1 - { _id: 1, score: 5 }

(The score is not required in the result set.)

The solutions I've tried so far:

  1. Add the score as a field of the document, which is consistently updated whenever the other fields are updated. Problems:

    • It is not possible to parameterize the query (tuning) once the score has been computed
    • It is expensive because the index on the score has to be updated very frequently
  2. Create an aggregation pipeline, which makes things easy to develop and solves the parameterization problem. However, the performance drop is really high because MongoDB can't use indexes on computed fields, causing memory issues (the 100MB query error). A possible workaround is to enable the allowDiskUse flag (see the sketch below), but then the query becomes too slow.
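
A minimal sketch of that aggregation attempt with allowDiskUse enabled, based on the simple example above (the posts collection name is an assumption):

db.posts.aggregate(
    [
        { $project: { _id: 1, score: { $sum: [{ $size: "$array_field" }, "$field"] } } },
        { $sort: { score: 1 } }
    ],
    { allowDiskUse: true } // lifts the 100MB in-memory limit at the cost of spilling to disk
)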

Update: I'd like to point out that the query will be run about 10 times a second. Therefore, pre-computing and storing the score in a different document might not be a viable solution.

Practical use: since the problem is quite difficult, let me give you a bit more context. I have a collection of Posts (like Facebook posts) that I am currently sorting by creation date and last update. I'd like to be able to sort the posts by "hotness", which is defined by the score I was talking about. I thought that an interesting way to compute the score could be as follows:

score = a * likes - b * dislikes + c * num_comments + d * ( now - creation_date)

where a, b, c and d are parameters I can change to tune the algorithm. likes and dislikes are arrays of ObjectIDs referencing the users, while num_comments is simply the number of comments. The query is run to provide the response to a REST endpoint. No further operations: Request -> Query -> Response.
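
To make the formula concrete, here is a rough sketch of how it could be expressed in a $project stage (the posts collection name, the parameter values and the descending sort are illustrative):

var a = 1, b = 1, c = 2, d = 0.0001; // tuning parameters, values are placeholders
db.posts.aggregate([
    {
        $project: {
            _id: 1,
            score: {
                $add: [
                    { $multiply: [ a, { $size: "$likes" } ] },
                    { $multiply: [ -b, { $size: "$dislikes" } ] },
                    { $multiply: [ c, "$num_comments" ] },
                    // date - date yields the difference in milliseconds
                    { $multiply: [ d, { $subtract: [ new Date(), "$creation_date" ] } ] }
                ]
            }
        }
    },
    { $sort: { score: -1 } } // hottest posts first
])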

Have you had any experience with derived/aggregated fields? Thanks!

asked Jun 28 '16 by a.periz


1 Answer

It looks like a complex issue.

This query will do the job, but I'd like to hear from you about its performance:

db.perlz.aggregate([
    // { $match: { whatever is needed here } }
    {
        $project: {
            _id: 1,
            score: {
                $sum: [{ $size: "$array_field" }, "$field"]
            }
        }
    },
    {
        $sort: { score: 1 }
    }
])

As this is done on a busy server, I would consider a replica set setup and try to balance the load by issuing some queries against a secondary server.
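
A minimal sketch of routing the read to a secondary in the shell, assuming a replica set is already configured:

// route this session's reads to a secondary when one is available
db.getMongo().setReadPref("secondaryPreferred")
db.perlz.aggregate([ /* pipeline from above */ ])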

EDIT

As per your update, I'm wondering whether these steps could be applicable to this problem:

  1. Update the document structure to have two types of likes: processed and new. A processed like is one that has already been folded into the document's score by a worker (the same applies to the dislikes and numComments fields), so at query time we only need to calculate the delta/difference on top of the pre-computed score (see the document sketch after this list).

  2. Try to determine the smallest input set based on the previous point (the pre-computed score).

  3. Limit the output to a known number of documents (implement paging).
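
A hypothetical document shape for that split (field names are illustrative, not from your schema); a worker periodically folds the "new" buckets into the pre-computed score and empties them, so the query only has to add a small delta:

db.posts.insertOne({
    precomputedScore: 42,      // maintained by the background worker
    newLikes: [],              // ObjectIds not yet counted in precomputedScore
    newDislikes: [],
    newComments: 0,
    creation_date: new Date()
})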

Regarding the dynamic field value: there is not a huge amount of computation needed to get the score. What could be considered is to project only the fields used in the computation plus _id, then use $lookup as a last stage to match the parent document with the scored and sorted result.
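
A rough sketch of that last-stage $lookup (a same-collection self-join; the posts collection name and the $limit value are assumptions):

db.posts.aggregate([
    { $project: { _id: 1, score: { $sum: [{ $size: "$array_field" }, "$field"] } } },
    { $sort: { score: 1 } },
    { $limit: 20 },          // paging, as suggested above
    { $lookup: {             // pull the full parent document back in
        from: "posts",
        localField: "_id",
        foreignField: "_id",
        as: "post"
    } }
])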

Any comments welcome!

answered Nov 08 '22 by profesor79