We're working on an audit log solution for one of our projects and would appreciate your help.
We have a media entity which contains a media id (numeric), an action (string, e.g. PUBLISH) and an occurred-on date with time, representing when the action happened.
The requirement is to delete logs that are older than 90 days, but to keep the log with the last action on each media entity.
We are using MongoDB 3.2,
and we need help with how to organize our collection, since we need to support reads and writes on a collection that will contain 200-300 million documents.
We tried several approaches but couldn't figure out a simple way to do it.
First approach
We tried to solve it with a flat collection using the following document format:
{
    _id: ObjectId("570b3cf65eac4e48e92b4e20"),
    mediaId: 10000,
    action: "PUBLISH",
    occurredOn: ISODate("2016-04-04T12:42:07.000Z")
}
where inserting is easy, but we have a problem with the deletion of documents.
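For illustration (assuming the collection is called media), a naive removal of everything older than 90 days would also delete the last action of media that have not been touched in that period, which violates the requirement:

var outdated = new Date()
outdated.setDate(outdated.getDate() - 90)

// removes ALL logs older than 90 days, including the last action
// of media that have no newer log entry
db.media.remove({ occurredOn: { $lt: outdated } })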
Second approach
We also tried to solve it with documents that contain an array of actions and dates:
{
    _id: 10000,
    actions: [
        {
            action: "PUBLISH",
            occurredOn: ISODate("2016-04-04T12:42:07.000Z")
        },
        ...
    ]
}
where inserting is also easy, but again we have a problem with the deletion of documents.
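Again for illustration (collection name assumed to be media): a plain $pull of outdated entries would also strip the last action from media whose latest log is older than 90 days:

var outdated = new Date()
outdated.setDate(outdated.getDate() - 90)

// pulls every outdated entry from the actions array, including the
// last one for media that have no newer action
db.media.update(
    {},
    { $pull: { actions: { occurredOn: { $lt: outdated } } } },
    { multi: true }
)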
Any suggestions on how to organize schema for this scenario?
Actually, it is pretty simple, though it cannot be done in a single query.
Let us first look at a collection which stores the values for a single mediaId, as this makes the approach a bit clearer.
{ "_id" : ObjectId("570ba4f66931b8f21a8bf25f"), "a" : 3, "date" : ISODate("2016-01-01T00:00:00Z") }
{ "_id" : ObjectId("570ba50a6931b8f21a8bf260"), "a" : 4, "date" : ISODate("2015-01-01T00:00:00Z") }
{ "_id" : ObjectId("570baab36931b8f21a8bf261"), "a" : 5, "date" : ISODate("2014-01-01T00:00:00Z") }
Note that all dates are more than 90 days in the past as of the time of this writing.
First, we want to identify the _id
of the last entry, which we want to keep no matter what:
> db.date.find({},{_id:1}).sort({date:-1}).limit(1)
{ "_id" : ObjectId("570ba4f66931b8f21a8bf25f") }
Next, we want to delete all entries which are older than 90 days and are not the last entry:
var outdated = new Date()
outdated.setDate(outdated.getDate() - 90)

// We check first
db.date.find({
    _id: { $not: { $eq: ObjectId("570ba4f66931b8f21a8bf25f") }},
    date: { $lt: outdated }
})
Which gives us the correct documents:
{ "_id" : ObjectId("570ba50a6931b8f21a8bf260"), "a" : 4, "date" : ISODate("2015-01-01T00:00:00Z") }
{ "_id" : ObjectId("570baab36931b8f21a8bf261"), "a" : 5, "date" : ISODate("2014-01-01T00:00:00Z") }
So we can run
> db.date.remove({
_id:{ $not:{ $eq: ObjectId("570ba4f66931b8f21a8bf25f") }},
date:{ $lt: outdated }
})
WriteResult({ "nRemoved" : 2 })
> db.date.find()
{ "_id" : ObjectId("570ba4f66931b8f21a8bf25f"), "a" : 3, "date" : ISODate("2016-01-01T00:00:00Z") }
Simple logic: unless stated otherwise, query conditions are combined with a logical AND, meaning all conditions must be met for a document to match. So even if the last document is older than 90 days, it is excluded by the $not clause.
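As a side note, for a comparison against a single value the same filter can also be written with $ne, which behaves identically here; a minimal sketch:

// equivalent removal using $ne instead of $not + $eq
db.date.remove({
    _id: { $ne: ObjectId("570ba4f66931b8f21a8bf25f") },
    date: { $lt: outdated }
})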
Now, we get to the real stuff.
{ "_id" : ObjectId("570baf826931b8f21a8bf262"), "mediaId" : 1000, "date" : ISODate("2016-04-11T14:06:58.668Z") }
{ "_id" : ObjectId("570baf8e6931b8f21a8bf263"), "mediaId" : 1000, "date" : ISODate("2016-01-01T00:00:00Z") }
{ "_id" : ObjectId("570bafa06931b8f21a8bf264"), "mediaId" : 1001, "date" : ISODate("2016-01-01T00:00:00Z") }
{ "_id" : ObjectId("570bafa56931b8f21a8bf265"), "mediaId" : 1001, "date" : ISODate("2015-01-01T00:00:00Z") }
{ "_id" : ObjectId("570bafc96931b8f21a8bf266"), "mediaId" : 1002, "date" : ISODate("2014-01-01T00:00:00Z") }
{ "_id" : ObjectId("570bafcd6931b8f21a8bf267"), "mediaId" : 1002, "date" : ISODate("2013-01-01T00:00:00Z") }
We have to use an aggregation here:
db.media.aggregate([
    { $sort: { date: 1 }},
    { $group: {
        _id: "$mediaId",
        lastId: { $last: "$_id" }
    }},
    { $group: {
        _id: "lastIds",
        ids: { $push: "$lastId" }
    }},
    { $project: { _id: 0, ids: 1 }}
])
Let us dissect that:
{ $sort: { date: 1 }}

We want everything sorted ascending by date.

{ $group: { _id: "$mediaId", lastId: { $last: "$_id" }}}

We want the last ObjectId in sort order for each mediaId.

{ $group: { _id: "lastIds", ids: { $push: "$lastId" }}}

We want all those ObjectIds in a single array and use an artificial static _id to make sure every ObjectId is pushed into said array.

{ $project: { _id: 0, ids: 1 }}

And we want only said array to be returned.

The result is
{
"ids" : [
ObjectId("570baf826931b8f21a8bf262"),
ObjectId("570bafa06931b8f21a8bf264"),
ObjectId("570bafc96931b8f21a8bf266")
]
}
Those are the ObjectIds we definitely want to keep. You can use them with $in:
db.media.find({
    _id: { $not: { $in: [
        ObjectId("570baf826931b8f21a8bf262"),
        ObjectId("570bafa06931b8f21a8bf264"),
        ObjectId("570bafc96931b8f21a8bf266")
    ]}},
    date: { $lt: outdated }
})
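Instead of copying the ObjectIds by hand, you could also capture them in a shell variable first; a sketch (the variable name keepIds is my own, and $nin is the equivalent of the $not/$in combination above):

// capture the ids of the last log entry per mediaId
var keepIds = db.media.aggregate([
    { $sort: { date: 1 }},
    { $group: { _id: "$mediaId", lastId: { $last: "$_id" }}},
    { $group: { _id: "lastIds", ids: { $push: "$lastId" }}}
]).toArray()[0].ids

// remove everything outdated that is not in the keep list
db.media.remove({
    _id: { $nin: keepIds },
    date: { $lt: outdated }
})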
The problem here is that $in
does not perform particularly well with (admittedly rather large) arrays. It might still be sufficient for you, since the removal does not seem to be time critical. However, you can also remove the docs with bulk operations:
we adjust the identification of the last docs a bit and add a bulk remove operation for each of the documents we want to keep:
var bulk = db.media.initializeUnorderedBulkOp()

db.media.aggregate([
    { $sort: { date: 1 }},
    { $group: { _id: "$mediaId", lastId: { $last: "$_id" }}}
]).forEach(
    function(doc) {
        // for this mediaId, remove every outdated document except the last one
        bulk.find({
            _id: { $not: { $eq: doc.lastId }},
            date: { $lt: outdated }
        }).remove()
    }
)
bulk.execute()
This method is about as easy as it gets, imho, while retaining decent performance.
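For completeness, here is a sketch of how the whole cleanup could be wrapped into a single shell function and run periodically (e.g. from a cron job via the mongo shell); the function name cleanupAuditLog is just illustrative:

function cleanupAuditLog() {
    var outdated = new Date()
    outdated.setDate(outdated.getDate() - 90)

    var bulk = db.media.initializeUnorderedBulkOp()
    var queued = 0

    // find the last log entry per mediaId and queue a remove for
    // everything else of that mediaId that is older than 90 days
    db.media.aggregate([
        { $sort: { date: 1 }},
        { $group: { _id: "$mediaId", lastId: { $last: "$_id" }}}
    ]).forEach(function(doc) {
        bulk.find({
            _id: { $not: { $eq: doc.lastId }},
            date: { $lt: outdated }
        }).remove()
        queued++
    })

    // executing an empty bulk throws, so only run it if something was queued
    if (queued > 0) {
        bulk.execute()
    }
}

cleanupAuditLog()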