
Is there a more elegant way to detect changes in a large SQL table without altering it? [duplicate]

Suppose you have a reasonably large (for local definitions of “large”), but relatively stable table.

Right now, I want to take a checksum of some kind (any kind) of the contents of the entire table.

The naïve approach might be to walk the entire table, taking a checksum (say, MD5) of the concatenation of every column on each row, then concatenating those per-row checksums and taking the MD5 sum of the result.

From the client side, that might be optimized a little by streaming each column's value into the MD5 routine as the rows are read, mutating the running digest instead of building the full concatenation in memory.
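
For illustration, the naïve approach can also be pushed entirely server-side in one query; a minimal sketch, assuming a table named mytable with a primary key id (both placeholder names):

    -- Hash each row's text form, concatenate the per-row hashes in a
    -- stable order, then hash the result. "mytable" and "id" are
    -- placeholder names for this sketch.
    SELECT md5(string_agg(md5(t::text), '' ORDER BY t.id))
    FROM mytable AS t;

string_agg has been available since PostgreSQL 9.0. Note that this still scans the whole table and accumulates all the per-row hashes on the server; it just avoids shipping every row to the client.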

The reason for this is that at some point in the future we want to re-connect to the database and verify that no other users have mutated the table: that includes INSERT, UPDATE, and DELETE.

Is there a nicer way to determine whether any changes have occurred to a particular table? Or a more efficient/faster way?

Update/clarification:

  • We are not able/permitted to make any alterations to the table itself (e.g. adding a “last-updated-at” column, triggers, and so forth)

(This is for Postgres, if it helps. I'd prefer to avoid poking transaction journals or anything like that, but if there's a way to do so, I'm not against the idea.)

asked Jun 18 '13 by BRPocock


1 Answer

Adding columns and triggers is really quite safe

I realise you've said this is a large table in a production DB that you can't modify, but I want to explain how you can make a very low-impact change.

In PostgreSQL, an ALTER TABLE ... ADD COLUMN of a nullable column takes only moments and doesn't require a table re-write. It does require an exclusive lock, but the main consequence of that is that the ALTER TABLE may wait a long time before it can actually proceed; it won't hold anything else up while it waits for a chance to get the lock.
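
For instance, a minimal sketch (the table name is a placeholder; lock_timeout, available from 9.3, bounds the wait):

    -- Adding a nullable column needs only a brief exclusive lock and no
    -- table rewrite. Bounding the wait keeps a busy table from stalling
    -- the session indefinitely (lock_timeout is available from 9.3).
    SET lock_timeout = '2s';
    ALTER TABLE mytable ADD COLUMN modified_at timestamptz;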

The same is true of creating a trigger on the table.

This means that it's quite safe to add a modified_at or created_at column, plus an associated trigger function to maintain it, to a live table that's in intensive real-world use. Rows added before the column was created will have it null, which makes perfect sense since you don't know when they were added/modified. Your trigger will set the modified_at field whenever a row changes, so it'll get progressively filled in.
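
A minimal sketch of such a trigger in PL/pgSQL (all names are placeholders):

    -- Stamp each inserted or updated row; rows from before the column
    -- existed simply stay NULL.
    CREATE OR REPLACE FUNCTION set_modified_at() RETURNS trigger AS $$
    BEGIN
        NEW.modified_at := current_timestamp;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER mytable_set_modified_at
    BEFORE INSERT OR UPDATE ON mytable
    FOR EACH ROW EXECUTE PROCEDURE set_modified_at();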

For your purposes it's probably more useful to have a trigger-maintained side-table that tracks the timestamp of the last change (insert/update/delete) anywhere in the table. That'll save you from storing a whole bunch of timestamps on disk and will let you discover when deletes have happened. A single-row side-table whose row you update on each change, using a FOR EACH STATEMENT trigger, will be quite low-cost. It's not a good idea for most tables because of contention: it essentially serializes all transactions that attempt to write to the table on the row update lock. In your case that might well be fine, since the table is large and rarely updated.
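
A sketch of that single-row side-table, again with placeholder names:

    -- One row recording when the table last changed in any way.
    CREATE TABLE mytable_last_change (last_changed timestamptz NOT NULL);
    INSERT INTO mytable_last_change VALUES (current_timestamp);

    CREATE OR REPLACE FUNCTION note_mytable_change() RETURNS trigger AS $$
    BEGIN
        UPDATE mytable_last_change SET last_changed = current_timestamp;
        RETURN NULL;  -- result is ignored for AFTER ... FOR EACH STATEMENT
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER mytable_note_change
    AFTER INSERT OR UPDATE OR DELETE ON mytable
    FOR EACH STATEMENT EXECUTE PROCEDURE note_mytable_change();

    -- On re-connect the client just checks:
    --   SELECT last_changed FROM mytable_last_change;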

A third alternative is to have the side table accumulate a running log of the timestamps of insert/update/delete statements, or even of the individual rows. This allows your client to read the change-log table instead of the main table and make small changes to its cached data rather than invalidating and re-reading the whole cache. The downside is that you need a way to periodically purge old and unwanted change-log records.
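
A sketch of the accumulating variant (placeholder names; TG_OP is the built-in variable naming the statement that fired the trigger):

    CREATE TABLE mytable_changelog (
        change_id  bigserial PRIMARY KEY,
        changed_at timestamptz NOT NULL DEFAULT current_timestamp,
        op         text NOT NULL  -- 'INSERT', 'UPDATE' or 'DELETE'
    );

    CREATE OR REPLACE FUNCTION log_mytable_change() RETURNS trigger AS $$
    BEGIN
        INSERT INTO mytable_changelog (op) VALUES (TG_OP);
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER mytable_log_change
    AFTER INSERT OR UPDATE OR DELETE ON mytable
    FOR EACH STATEMENT EXECUTE PROCEDURE log_mytable_change();

    -- The client remembers the highest change_id it has processed:
    --   SELECT * FROM mytable_changelog
    --   WHERE change_id > $last_seen ORDER BY change_id;
    -- and something must periodically purge old entries, e.g.:
    --   DELETE FROM mytable_changelog
    --   WHERE changed_at < now() - interval '7 days';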

So... there's really no operational reason why you can't change the table. There may well be business policy reasons that prevent you from doing so even though you know it's quite safe, though.

... but if you really, really, really can't:

Another option is to use the existing "md5agg" extension: http://llg.cubic.org/pg-mdagg/ . Or, if you built PostgreSQL from source, you could apply to your install the patch currently circulating on pgsql-hackers that adds an "md5_agg" aggregate to the next release.

Logical replication

The bi-directional replication project for PostgreSQL has produced functionality that allows you to listen for and replay logical changes (row inserts/updates/deletes) without requiring triggers on tables. The pg_receivellog tool would likely suit your purposes well when wrapped with a little scripting.

The downside is that you'd have to run a patched PostgreSQL 9.3, so I'm guessing that if you can't change a table, running a bunch of experimental code that's likely to change incompatibly in the future isn't going to be high on your priority list ;-). It's included in the stock 9.4 release though; see "changeset extraction".

Testing the relfilenode timestamp won't work

You might think you could look at the modified timestamp(s) of the file(s) that back the table on disk. This won't be very useful:

  • The table is split into extents, individual files that by default are 1GB each. So you'd have to find the most recent timestamp across them all.
  • Autovacuum activity will cause these timestamps to change, possibly quite a while after corresponding writes happened.
  • Autovacuum must periodically do an automatic 'freeze' of table contents to prevent transaction ID wrap-around. This involves progressively rewriting the table and will naturally change the timestamp. This happens even if nothing's been added for potentially quite a long time.
  • Hint-bit setting results in small writes during SELECT. These writes will also affect the file timestamps.
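
(If you do want to inspect those files anyway, the path of a relation's first segment file, relative to the data directory, can be found with the built-in pg_relation_filepath; the table name below is a placeholder.)

    -- Path of the relation's first segment file, relative to the data
    -- directory; further 1GB segments use .1, .2, ... suffixes.
    SELECT pg_relation_filepath('mytable');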

Examine the transaction logs

In theory you could attempt to decode the transaction logs with pg_xlogreader and find records that affect the table of interest. You'd have to try to exclude activity caused by vacuum, full page writes after hint bit setting, and of course the huge amount of activity from every other table in the entire database cluster.

The performance impact of this is likely to be huge, since every change to every database on the entire system must be examined.

All in all, adding a trigger on a table is trivial in comparison.

answered Nov 15 '22 by Craig Ringer