I use MongoDB.
The problem: there are n parallel processes. Each of them fetches documents with the query {data_processed: {$exists: false}}, processes them, and then updates them with {data_processed: true}. When I run all n processes, sometimes the same document is picked up by two or more different processes.
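Roughly, each process currently does something like this (mongo shell sketch; the collection name col is just an example):

db.col.find({ data_processed: { $exists: false } }).forEach(function (doc) {
    // ... process doc ...
    db.col.updateOne({ _id: doc._id }, { $set: { data_processed: true } });
});

Since the find and the update are separate steps, two processes can read the same unprocessed document before either of them sets the flag.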
I think I can add something like this to the query to prevent collisions.
Each process has an id from 1 to n. The process with id i would fetch these documents:
{
    data_processed: {$exists: false},
    _id: {mod_n: i}
}
where mod_n is pseudocode for taking _id modulo n, so that each process only gets the documents whose _id mod n equals its own id.
I use the default BSON ObjectId as _id, so I think something like this should be possible.
How can I implement this query? Or can you suggest a better way to solve this problem?
It seems there's no easy way to convert an ObjectId to a long in order to perform a modulo operation. Alternatively, you can distribute your processing using a simple string comparison on the last character of _id, or on the last few characters if you need more processes.
For instance, if you want to run your processing with 4 processes, you can try the following queries:
db.col.aggregate([ { $match: { $expr: { $in: [ { $substr: [ { $toString: "$_id" }, 23, 1 ] }, [ "0", "1", "2", "3" ] ] } } } ])
...
db.col.aggregate([ { $match: { $expr: { $in: [ { $substr: [ { $toString: "$_id" }, 23, 1 ] }, [ "c", "d", "e", "f" ] ] } } } ])
This scales to a higher number of processes; if you need more than 16, just take the last two characters, e.g.:
db.col.aggregate([ { $match: { $expr: { $in: [ { $substr: [ { $toString: "$_id" }, 22, 2 ] }, [ "00", "01" ] ] } } } ])
Load should be distributed more or less evenly, since the last six hex characters of an ObjectId represent a 3-byte counter that starts from a random value.
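If you don't want to hard-code the character lists, here is a rough sketch (mongo shell JavaScript) that splits the 16 hex digits across n processes and combines the bucket filter with your original data_processed condition. The collection name col and the 0-based process id are assumptions, and $toString on an ObjectId requires MongoDB 4.0 or newer (as do the queries above):

function bucketsFor(i, n) {
    // split the 16 possible hex digits of the last _id character into n groups
    var hex = "0123456789abcdef".split("");
    return hex.filter(function (d, idx) { return idx % n === i; });
}

var n = 4, i = 0; // this process's id, in the range [0, n)
db.col.aggregate([
    { $match: {
        data_processed: { $exists: false },
        $expr: { $in: [ { $substr: [ { $toString: "$_id" }, 23, 1 ] }, bucketsFor(i, n) ] }
    } }
])

With n = 4 and i = 0 this matches the documents whose _id ends in "0", "4", "8" or "c"; the grouping differs from the contiguous ranges above but distributes the load the same way.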