Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Techniques for removing old data on Oracle databases

We have a mature Oracle database application (in production for over 10 years), and during that time, we have been using scripts of our own devising to remove old data that is no longer needed. They work by issuing delete statements against the appropriate tables, in a loop with frequent commits, in order to avoid overloading the system with i/o or using too much undo space.

They work fine, for the most part. They run daily, and it takes about an hour to remove the oldest days worth of data from the system. The main concerns I have are the effects on tables and indexes that all this deleting may have, and the fact that even though they don't overly load the system, deleting one day's worth of data in that short time does have the effect of blowing out the instances buffer cache, resulting in subsequent queries running slightly slower for the next few hours as the cache is gradually restored.

For years we've been considering better methods. In the past, I had heard that people used partitioned tables to manage old data reaping - one month per partition, for example, and dropping the oldest partition on a monthly basis. The main drawback to this approach is that our reaping rules go beyond "remove month X". Users are allowed to specify how long data must stay in the system, based on key values (e.g., in an invoice table, account foo can be removed after 3 months, but account bar may need to remain for 2 years).

There is also the issue of referential integrity; Oracle documentation talks about using partitions for purging data mostly in the context of data warehouses, where tables tend to be hypercubes. Ours is closer to the OLTP end of things, and it is common for data in month X to have relationships to data in month Y. Creating the right partitioning keys for these tables would be ticklish at best.

As for the cache blowouts, I have read a bit about setting up dedicated buffer caches, but it seems like it's more on a per-table basis, as opposed to a per-user or per-transaction basis. To preserve the cache, I'd really like the reaping job to only keep one transaction's worth of data in the cache at any time, since there is no need to keep the data around once deleted.

Are we stuck using deletes for the foreseeable future, or are there other, more clever ways to deal with reaping?

like image 457
Steve Broberg Avatar asked Jun 11 '09 15:06

Steve Broberg


1 Answers

For the most part I think that you're stuck doing deletes.

Your comments on the difficulty of using partitions in your case probably do prevent them being used effectively (different delete dates being used depending on the type of record) but it it possible that you could create a "delete date" column on the records that you could partition on? It would have the disadvantage of making updates quite expensive as a change in the delete date might cause row migration, so your update would really be implemented as a delete and insert.

It could be that even then you cannot use DDL partition operations to remove old data because of the referential integrity issues, but partitioning still might serve the purpose of physically clustering the rows to be deleted so that fewer blocks need to be modified in order to delete them, mitigating the impact on the buffer cache.

like image 191
David Aldridge Avatar answered Jan 03 '23 13:01

David Aldridge