
Implementing a database record hash for keeping track of whether a record has changed or not

Tags: database, hash

I have a database schema for an integration project in which I need to be able to query for records that have changed, but only based on a given set of fields within that record.

So, for instance, here's an example table:

CUSTOMERS

  • ID
  • Name
  • Phone
  • Fax
  • Balance

I need to be able to query for records whose Name, Phone, or Fax fields have changed. However, other fields should not be taken into account, i.e. if just the Balance field changes, my query should not pull that record in (thus, a timestamp field that updates automatically whenever the record is modified does not work on its own).

Also, this has to run on a number of different databases and platforms, so TRIGGERS or something similar are not really an option unless they'll run on MySQL, PostgreSQL, SQL Server, and SQLite.

The fields are modified by a third-party application that I can't modify, so I can't just add a flag and have the third-party application set the flag to TRUE whenever it modifies a relevant field.

My initial solution to this is to calculate a HASH of the relevant fields and store it in a new field 'LastHash' or something. Then, I can calculate the hash of the relevant fields for the data currently in the record, and if it doesn't match the stored LastHash, I know it's changed.
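A minimal sketch of that idea in Python, using an in-memory SQLite table purely for illustration. The helper names (`row_hash`, `changed_ids`, `store_hashes`) and the length-prefixed encoding are assumptions made for this example, not part of any existing schema or library; SHA-1 is used only because it is mentioned further down.

```python
import hashlib
import sqlite3

TRACKED = ("Name", "Phone", "Fax")  # only changes to these fields matter


def row_hash(values):
    """Hash the tracked values. Each value is length-prefixed so that
    ('ab', 'c') and ('a', 'bc') hash differently, and NULL differs from ''."""
    h = hashlib.sha1()
    for v in values:
        data = b"" if v is None else str(v).encode("utf-8")
        marker = b"N" if v is None else b"V"
        h.update(marker + str(len(data)).encode("ascii") + b":" + data)
    return h.hexdigest()


def changed_ids(conn):
    """IDs whose tracked fields no longer match the stored LastHash."""
    cols = ", ".join(TRACKED)
    rows = conn.execute(f"SELECT ID, LastHash, {cols} FROM CUSTOMERS ORDER BY ID")
    return [r[0] for r in rows if row_hash(r[2:]) != r[1]]


def store_hashes(conn, ids):
    """Recompute and store LastHash for the given IDs (e.g. after syncing them)."""
    cols = ", ".join(TRACKED)
    for rid in ids:
        row = conn.execute(
            f"SELECT {cols} FROM CUSTOMERS WHERE ID = ?", (rid,)
        ).fetchone()
        conn.execute(
            "UPDATE CUSTOMERS SET LastHash = ? WHERE ID = ?", (row_hash(row), rid)
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUSTOMERS (ID INTEGER PRIMARY KEY, Name TEXT, Phone TEXT, "
    "Fax TEXT, Balance REAL, LastHash TEXT)"
)
conn.executemany(
    "INSERT INTO CUSTOMERS (ID, Name, Phone, Fax, Balance) VALUES (?, ?, ?, ?, ?)",
    [(1, "Acme", "555-0100", None, 10.0), (2, "Bolt", "555-0101", "555-0102", 0.0)],
)
store_hashes(conn, [1, 2])

# Simulate the third-party application touching the table.
conn.execute("UPDATE CUSTOMERS SET Balance = 99.0 WHERE ID = 1")      # untracked field
conn.execute("UPDATE CUSTOMERS SET Phone = '555-0199' WHERE ID = 2")  # tracked field

result = changed_ids(conn)  # only record 2 should be reported
```

Rows inserted by the other application have a NULL LastHash, so they are reported as changed the first time they are seen, which is probably the desired behaviour for an integration.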

That seems pretty messy... but it seems like it will work. Is there a better way? If not, is there a good way to implement that hash so it's efficient and not too time-consuming to extract those changed records?

EDIT

Some clarifications: Both my application and the other application update and insert into these tables. I can make my application calculate the initial hash. I can't make the other application calculate it though.

Timestamp columns that automatically update whenever a record changes are doable; those are easy enough to replicate in all of these database systems using different column types or very simple triggers.

ADDITIONAL QUESTION

If hashing is the way to go... is there any sort of efficient hash algorithm that won't take forever to calculate on all of these records? MD5 or SHA1 might work, but they seem like they'd be sllloowwww.
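Rather than guessing, the cost can be measured. The following is a rough benchmark sketch using only the Python standard library; the sample payloads are made up, and CRC32 is included only as an example of a cheaper non-cryptographic checksum (at 32 bits it has a far higher collision risk than MD5 or SHA-1). Timings will vary by machine, so no expected numbers are given.

```python
import hashlib
import timeit
import zlib

# Hypothetical sample: 10,000 short "Name|Phone|Fax" payloads.
payloads = [
    f"Customer {i}|555-{i:04d}|555-{i + 1:04d}".encode("utf-8")
    for i in range(10_000)
]


def time_hash(fn, repeat=3):
    """Best-of-N wall time to hash every payload once."""
    return min(
        timeit.repeat(lambda: [fn(p) for p in payloads], number=1, repeat=repeat)
    )


timings = {
    "md5": time_hash(lambda p: hashlib.md5(p).digest()),
    "sha1": time_hash(lambda p: hashlib.sha1(p).digest()),
    "crc32": time_hash(zlib.crc32),
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for {len(payloads)} records")
```

For short rows like these, reading the rows from the database is likely to cost more than hashing them, but that is worth confirming against the real data.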

Keith Palmer Jr. asked Sep 30 '09 14:09


1 Answer

That's a tough one. You're still going to have to table scan (or index scan), since YOU have to calculate the new hash and compare it to the old hash stored.

If triggers are not possible because of cross-platform concerns, you might be able to have the database engine calculate the current hash (i.e. a persisted computed column, which is effectively like a trigger). This is also a cross-platform problem, though! Then, if you index both the current hash and your stored hash, finding the mismatches becomes a much easier search.

Can you at least use the timestamp field to reduce the number of hashes you need to check?
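A sketch of that pre-filter, assuming a `LastModified` column that is bumped on every write and a `last_sync` value that your application remembers (both names are hypothetical). Only rows touched since the last sync get hashed; rows where just Balance changed pass the timestamp filter but are then rejected by the hash comparison.

```python
import hashlib
import sqlite3


def row_hash(values):
    # Unit-separator join; assumes tracked values never contain "\x1f".
    joined = "\x1f".join("" if v is None else str(v) for v in values)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def changed_since(conn, last_sync):
    """Hash only rows modified after last_sync and return the IDs that differ."""
    rows = conn.execute(
        "SELECT ID, Name, Phone, Fax, LastHash FROM CUSTOMERS "
        "WHERE LastModified > ? ORDER BY ID",
        (last_sync,),
    )
    return [r[0] for r in rows if row_hash(r[1:4]) != r[4]]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUSTOMERS (ID INTEGER PRIMARY KEY, Name TEXT, Phone TEXT, "
    "Fax TEXT, Balance REAL, LastModified INTEGER, LastHash TEXT)"
)
for rid, name, phone in [(1, "Acme", "555-0100"), (2, "Bolt", "555-0101"), (3, "Cog", "555-0102")]:
    conn.execute(
        "INSERT INTO CUSTOMERS VALUES (?, ?, ?, NULL, 0.0, 100, ?)",
        (rid, name, phone, row_hash((name, phone, None))),
    )

# In a real system the column type or a simple trigger would bump LastModified;
# here it is set by hand to keep the example self-contained.
conn.execute("UPDATE CUSTOMERS SET Balance = 5.0, LastModified = 200 WHERE ID = 2")
conn.execute("UPDATE CUSTOMERS SET Phone = '555-0999', LastModified = 210 WHERE ID = 3")

candidates = changed_since(conn, 150)  # rows 2 and 3 are hashed, only 3 differs
```

With an index on `LastModified`, the scan is limited to recently touched rows instead of the whole table.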

Another thing to remember is that there's no such thing as a perfect hash function, so you could potentially have false negatives (inadvertent hash collision leads to a change not being detected). Is that (astronomically small) risk worth taking?
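To put a rough number on that risk: a change goes undetected only if the new values of a row hash to the same digest as that row's old values, so this is a per-row comparison, not a birthday problem across the whole table. The sketch below assumes the hash behaves like a uniform random function on non-adversarial data and uses a simple union bound; `miss_probability` is a name invented for this example.

```python
from fractions import Fraction


def miss_probability(changes, hash_bits):
    """Upper bound on the chance that at least one of `changes` genuine
    modifications collides with its row's previous hash (union bound,
    assuming a uniformly distributed hash of `hash_bits` bits)."""
    return min(Fraction(1), Fraction(changes, 2 ** hash_bits))


# One billion tracked changes over the lifetime of the integration:
for name, bits in [("CRC32", 32), ("MD5", 128), ("SHA-1", 160)]:
    print(f"{name}: <= {float(miss_probability(10**9, bits)):.3g}")
```

Under these assumptions a 128-bit or 160-bit digest makes a missed change negligible for accidental collisions, while a 32-bit checksum does not.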

Cade Roux answered Oct 14 '22 00:10