Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MongoDB - How to delete expired documents except last one

Tags:

mongodb

We're working on audit log solution for one of our projects and would appreciate your help.

We have media entity which contains media id (numeric), action (string, eg. PUBLISH) and occurred on date with time representing date and time when action happened.

Requirement is to delete logs that are older than 90 days but to keep log with last action on media entity.

We are using Mongo 3.2 and we need help how to organize our collection since we need to support reads and writes on collection that will contain 200-300 millions of documents.

We tried several approaches but couldn't figure out a simple way on how to do it.

First approach

We tried to solve it with flat collection with document format:

{
  _id: ObjectId("570b3cf65eac4e48e92b4e20"),
  mediaId: 10000,
  action: "PUBLISH",
  occurredOn: ISODate("2016-04-04T12:42:07.000Z")
}

where insert is easy but we have problem with deletion of documents.

Second approach

We also tried to solve it with documents that contain array of actions and dates:

{
  _id: 10000,
  actions: [
    {
      action:"PUBLISH", 
      occurredOn: ISODate("2016-04-04T12:42:07.000Z")
    }, 
    ...
  ]
}

where insert is also easy but again we have problem with deletion of documents.

Any suggestions on how to organize schema for this scenario?

like image 547
Josip Grggurica Avatar asked Apr 11 '16 11:04

Josip Grggurica


1 Answers

Actually, it is pretty simple though not to be done in a single query.

Non-matched data

Let us say we have a collection which stores the values for a single mediaId first, as this makes the approach a bit more clear.

Sample data

{ "_id" : ObjectId("570ba4f66931b8f21a8bf25f"), "a" : 3, "date" : ISODate("2016-01-01T00:00:00Z") }
{ "_id" : ObjectId("570ba50a6931b8f21a8bf260"), "a" : 4, "date" : ISODate("2015-01-01T00:00:00Z") }
{ "_id" : ObjectId("570baab36931b8f21a8bf261"), "a" : 5, "date" : ISODate("2014-01-01T00:00:00Z") }

Note that all dates are more than 90 days in the past as of the time of this writing.

The procedure

First, we want to identify the _id of the last entry, which we want to keep no matter what:

> db.date.find({},{_id:1}).sort({date:-1}).limit(1)
{ "_id" : ObjectId("570ba4f66931b8f21a8bf25f") }

Next, we want to delete all entries which are older than 90 days and are not the last entry:

var outdated = new Date()
outdated.setDate(outdated.getDate() - 90 )
// We check first
db.date.find({
    _id:{ $not:{ $eq: ObjectId("570ba4f66931b8f21a8bf25f") }},
    date:{ $lt: outdated }
})

Which gives us the correct documents:

{ "_id" : ObjectId("570ba50a6931b8f21a8bf260"), "a" : 4, "date" : ISODate("2015-01-01T00:00:00Z") }
{ "_id" : ObjectId("570baab36931b8f21a8bf261"), "a" : 5, "date" : ISODate("2014-01-01T00:00:00Z") }

So we can run

> db.date.remove({
    _id:{ $not:{ $eq: ObjectId("570ba4f66931b8f21a8bf25f") }},
    date:{ $lt: outdated }
})
WriteResult({ "nRemoved" : 2 })
> db.date.find()
{ "_id" : ObjectId("570ba4f66931b8f21a8bf25f"), "a" : 3, "date" : ISODate("2016-01-01T00:00:00Z") }

Why does this work?

Simple logic. Unless stated otherwise, the query conditions are conjuncted with a logical AND, meaning all conditions must be met for a document to match. So even if the last document is older than 90 days, it is excluded by the $not phrase.

Multiple mediaIds

Now, we get to the real stuff.

Sample data

{ "_id" : ObjectId("570baf826931b8f21a8bf262"), "mediaId" : 1000, "date" : ISODate("2016-04-11T14:06:58.668Z") }
{ "_id" : ObjectId("570baf8e6931b8f21a8bf263"), "mediaId" : 1000, "date" : ISODate("2016-01-01T00:00:00Z") }
{ "_id" : ObjectId("570bafa06931b8f21a8bf264"), "mediaId" : 1001, "date" : ISODate("2016-01-01T00:00:00Z") }
{ "_id" : ObjectId("570bafa56931b8f21a8bf265"), "mediaId" : 1001, "date" : ISODate("2015-01-01T00:00:00Z") }
{ "_id" : ObjectId("570bafc96931b8f21a8bf266"), "mediaId" : 1002, "date" : ISODate("2014-01-01T00:00:00Z") }
{ "_id" : ObjectId("570bafcd6931b8f21a8bf267"), "mediaId" : 1002, "date" : ISODate("2013-01-01T00:00:00Z") }

Identify last documents

We have to use an aggregation here:

db.media.aggregate(
    { $sort:{ date:1 }},
    { $group:{
         _id:"$mediaId",
         lastId:{ "$last":"$_id" },
    }},
    { $group:{
      _id:"lastIds",
      ids:{ $push:"$lastId" }
    }},
    { $project:{ _id: 0, ids: 1 } }
)

Let us dissect that:

  1. { $sort:{ date:1 }} We want to have all sorted ascending
  2. {$group:{_id:"$mediaId",lastId:{"$last":"$_id"}}} We want the last ObjectId in sort order for each mediaId
  3. { $group:{ _id:"lastIds", ids:{ $push:"$lastId" } }} we want to have all ObjectIds in an array and us an artificial static _id to make sure all ObjectIds are pushed to said array.
  4. { $project:{ _id: 0, ids: 1 } } And we want only said array to be returned

The result is

{
    "ids" : [
        ObjectId("570baf826931b8f21a8bf262"),
        ObjectId("570bafa06931b8f21a8bf264"),
        ObjectId("570bafc96931b8f21a8bf266")
     ]
 }

Those are the ObjectIds we definetly want to keep.

You use those for

Removing the old docs

Using $in

db.media.find({
    _id:{ $not:{ $in:[
        ObjectId("570baf826931b8f21a8bf262"),
        ObjectId("570bafa06931b8f21a8bf264"), 
        ObjectId("570bafc96931b8f21a8bf266")
    ]}},
    date:{$lt: outdated}
})

The problem here is that $in is not exactly good performing for (admittedly rather large) arrays. It might be sufficient for you, since the removal does not exactly seem to be time critical. However, you can also remove the docs by

Using bulk operations

We adjust the identification of the last docs a bit and add a bulk operation for each of the documents we want to keep:

var bulk = db.media.initializeUnorderedBulkOp()
db.media.aggregate(
    { $sort:{ date: 1 } },
    { $group:{  _id: "$mediaId", lastId:{ "$last": "$_id" }}}
).forEach(
    function(doc){
        bulk.find({
          _id:{ $not:{ $eq: doc.lastId }},
          date:{ $lt: outdated }
        }).remove()
    }
)
bulk.execute()

This method is as easy as it may get, imho, while retaining decent performance.

like image 195
Markus W Mahlberg Avatar answered Sep 18 '22 14:09

Markus W Mahlberg