I'm struggling to find a solution to a problem with MongoDB:
I need to run a query on a collection with a high write/read ratio. The query consists of sorting the documents by a field that is derived from other fields of the same document. Moreover, one of those fields is the size of an array, which makes it even harder.
A simple example:
D1 - { _id: 1, field: 1, array_field: [a,b,c,d] } -> score = 1 + 4 = 5
D2 - { _id: 2, field: 2, array_field: [a,b] } -> score = 2 + 2 = 4
Expected result (sorted by score, ascending):
{ _id: 2, score: 4 }
{ _id: 1, score: 5 }
(The score itself is not required in the result set.)
The solutions I've tried so far:
1. Add the score as a field of the document, kept consistently updated whenever the other fields are updated (sketched after this list). Problem: the score is parameterized and time-dependent (see the formula below), so every stored value has to be recomputed whenever the parameters change or time passes.
2. Create an aggregation pipeline, which makes things easy to develop and solves the parameterization problem. However, the performance drop is really high because MongoDB can't use indexes on computed fields, causing memory issues (the 100 MB aggregation memory limit error). A possible workaround is to enable the allowDiskUse flag; however, the query then becomes too slow.
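A minimal sketch of solution 1, assuming a posts collection and hypothetical postId, userId and like-weight A (none of these names are from the original): the stored score is bumped atomically in the same update that records the like.

var A = 1; // assumed weight of one like, must match the scoring formula

db.posts.updateOne(
    { _id: postId, likes: { $ne: userId } }, // skip users who already liked
    {
        $push: { likes: userId }, // record the like
        $inc: { score: A }        // keep the derived score in sync atomically
    }
);

An index on score would then support the sort directly; the trade-off is exactly the staleness problem described above.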
Update: I'd like to point out that the query will be run about 10 times a second. Therefore, pre-computing and storing the score in a different document might not be a viable solution.
Practical use: since the problem is very difficult, let me give you a bit more context. I have a collection of Posts (like Facebook posts) that I am currently sorting by creation date and last update. I'd like to be able to sort the posts by "hotness", which is defined by the score I was talking about. I thought that an interesting way to compute the score could be as follows:
score = a * likes - b * dislikes + c * num_comments + d * (now - creation_date)
where a, b, c and d are parameters I can change to tune the algorithm. likes and dislikes are arrays of ObjectIDs referencing the users, while num_comments is simply the number of comments.
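For reference, here is a sketch of that formula as a $project stage; the weight values are placeholders, and note that $subtract on two dates yields milliseconds, so d has to be scaled accordingly.

var a = 1, b = 1, c = 2, d = 0.0001; // hypothetical tuning values

var scoreStage = {
    $project: {
        score: {
            $add: [
                { $multiply: [a, { $size: "$likes" }] },
                { $multiply: [-b, { $size: "$dislikes" }] },
                { $multiply: [c, "$num_comments"] },
                { $multiply: [d, { $subtract: [new Date(), "$creation_date"] }] }
            ]
        }
    }
};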
The query is run to provide the response to a REST endpoint. No further operations: Request -> Query -> Response.
Have you had any experience with derived/aggregated fields? Thanks!
It looks like a complex issue.
This query will do the job, but I'd like to hear from you about performance.
db.perlz.aggregate([
    // {$match: {...}} // filter first if possible, so fewer documents get scored
    {
        $project: {
            _id: 1,
            // derived score: size of the array plus the scalar field
            score: {
                $sum: [{ $size: "$array_field" }, "$field"]
            }
        }
    },
    {
        // sort by the computed score, ascending as in the expected result
        $sort: {
            score: 1
        }
    }
])
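If this still hits the 100 MB memory limit mentioned in the question, the shell's aggregate accepts an options document (with the slowness trade-off already noted); here pipeline stands for the stages above:

// Same pipeline with spilling to disk enabled (slower, as discussed above).
db.perlz.aggregate(pipeline, { allowDiskUse: true });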
As this is done on a busy server, I would consider a replica set setup and try to balance the load by issuing some queries on a secondary.
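In the shell that could look like this (setReadPref routes subsequent reads from this connection; whether slightly stale results are acceptable here is an assumption):

// Route read-only queries from this connection to a secondary when available.
db.getMongo().setReadPref("secondaryPreferred");
db.perlz.aggregate([ /* pipeline from above */ ]);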
EDIT
As per your update, I'm wondering whether these steps could be applicable to this problem:
- Update the document structure to have two types of likes: processed and new. A processed like is one that a worker has already folded into the document's stored score (the worker reads the likes, dislikes and numComments fields and sets the score); at query time we then only need to calculate a small delta/difference value (see the sketch after this list).
- Try to determine the smallest input set based on the previous point (the pre-computed score).
- Limit the output to a known number of documents (implement paging).
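A rough sketch of that worker pass, with hypothetical field names (new_likes, new_dislikes, score) and assumed weights A and B:

// Fold pending likes/dislikes into the stored score, then move them into
// the processed arrays. Not concurrency-safe as written; a real worker
// would need to claim documents first (e.g. with findAndModify).
var A = 1, B = 1; // assumed weights

db.posts.find({
    $or: [
        { "new_likes.0": { $exists: true } },
        { "new_dislikes.0": { $exists: true } }
    ]
}).forEach(function (post) {
    var newLikes = post.new_likes || [];
    var newDislikes = post.new_dislikes || [];
    var delta = A * newLikes.length - B * newDislikes.length;
    db.posts.updateOne(
        { _id: post._id },
        {
            $inc: { score: delta },           // apply the delta to the stored score
            $push: {
                likes: { $each: newLikes },    // mark likes as processed
                dislikes: { $each: newDislikes }
            },
            $set: { new_likes: [], new_dislikes: [] }
        }
    );
});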
As for the dynamic field value: there is no huge amount of computation needed to get the score itself. What could be considered is to project only the fields used in the computation plus _id, then use $lookup as a last stage to match each scored, sorted result back to its parent document.
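A sketch of that shape, using the toy schema from the question (the page size of 20 is arbitrary, and $replaceRoot requires MongoDB 3.4+):

db.perlz.aggregate([
    // score on a slim projection only
    { $project: { score: { $sum: [{ $size: "$array_field" }, "$field"] } } },
    { $sort: { score: 1 } },
    { $limit: 20 }, // paging keeps the $lookup cheap
    // re-attach the full parent document as the last stage
    { $lookup: {
        from: "perlz",
        localField: "_id",
        foreignField: "_id",
        as: "doc"
    } },
    { $unwind: "$doc" },
    { $replaceRoot: { newRoot: "$doc" } } // MongoDB 3.4+
]);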
Any comments welcome!