detecting when data has changed

Question

Ok, so the story is like this:

-- I have lots of files (pretty big, around 25GB) in a particular format that need to be imported into a datastore

-- these files are continuously updated with data, sometimes new, sometimes the same data

-- I am trying to figure out an algorithm to detect whether a particular line in a file has changed, in order to minimize the time spent updating the database

-- the way it currently works is that I drop all the data in the database each time and then reimport it, but this won't work anymore since I'll need a timestamp for when an item has changed.

-- the files contain strings and numbers (titles, orders, prices etc.)

The only solutions I could think of are:

-- compute a hash for each row in the database and compare it against the hash of the corresponding row from the file; if they're different, update the database

-- keep 2 copies of the files, the previous ones and the current ones, and diff them (which is probably faster than updating the db), then update the db based on the diffs.
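
A rough sketch of the second idea, under assumptions that are not stated above: each row is a CSV line whose first field is a unique integer ID, both copies start with a header line, and both are sorted by ID in ascending order. The function name `diff_sorted_snapshots` is made up for illustration. Because the two copies are walked in lockstep, neither file has to fit in memory.

```python
import csv

def diff_sorted_snapshots(old_lines, new_lines):
    """Yield (action, row) pairs by merging two CSV snapshots sorted by integer ID."""
    old_rows = csv.reader(old_lines)
    new_rows = csv.reader(new_lines)
    next(old_rows, None)  # skip the header line of each snapshot
    next(new_rows, None)
    old = next(old_rows, None)
    new = next(new_rows, None)
    while new is not None:
        if old is None or int(new[0]) < int(old[0]):
            # ID appears only in the current copy
            yield ("insert", new)
            new = next(new_rows, None)
        elif int(new[0]) > int(old[0]):
            # ID appears only in the previous copy; deletions are ignored here
            old = next(old_rows, None)
        else:
            # same ID in both copies: report it only if some field differs
            if new != old:
                yield ("update", new)
            old = next(old_rows, None)
            new = next(new_rows, None)
```

The arguments can be any iterables of lines, for example two file objects opened with `newline=""`. If the files are not already sorted by ID, they would need an external sort first, which may cancel out much of the gain.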

Since the amount of data is very large, I am kind of out of options for now. In the long run, I'll get rid of the files and the data will be pushed straight into the database, but the problem still remains.

Any advice will be appreciated.

Manish Singh · Accepted Answer

The problem definition, as I understand it:

Let’s say your file contains

ID,Name,Age
1,Jim,20
2,Tim,30
3,Kim,40

As you stated, rows can be added or updated, so the file becomes

ID,Name,Age
1,Jim,20    -- to be discarded
2,Tim,35    -- to be updated
3,Kim,40    -- to be discarded
4,Zim,30    -- to be inserted

Now the requirement is to update the database by inserting or updating only the two changed records above (IDs 2 and 4), using either two SQL queries or one batch containing both SQL statements.
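
As an illustration of that requirement, here is a minimal sketch that applies one insert and one update in a single transaction and stamps each touched row with a change time. It uses Python's built-in `sqlite3` module purely as a stand-in for your datastore; the `people` table, the `changed_at` column, and the `apply_changes` function are assumptions made up for this example.

```python
import sqlite3
from datetime import datetime, timezone

def apply_changes(conn, inserts, updates):
    """Apply inserts and updates in one transaction, stamping changed_at."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back if any statement fails
        conn.executemany(
            "INSERT INTO people (id, name, age, changed_at) VALUES (?, ?, ?, ?)",
            [(i, name, age, now) for i, name, age in inserts],
        )
        conn.executemany(
            "UPDATE people SET name = ?, age = ?, changed_at = ? WHERE id = ?",
            [(name, age, now, i) for i, name, age in updates],
        )

# Hypothetical starting state matching the first version of the file
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, changed_at TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [(1, "Jim", 20, "initial"), (2, "Tim", 30, "initial"), (3, "Kim", 40, "initial")],
)
conn.commit()

# Only the two changed records are sent to the database
apply_changes(conn, inserts=[(4, "Zim", 30)], updates=[(2, "Tim", 35)])
```

Rows 1 and 3 are never touched, so their `changed_at` value stays as it was, which is what makes the timestamp meaningful.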

I am making the following assumptions here:

You cannot modify the existing process that creates the files.
You are using some kind of batch processing [reading from file, processing in memory, writing to the DB] to upload the data into the database.

Store the hash of each record's non-key fields [Name, Age] against its ID in an in-memory map, where the ID is the key and the hash is the value [if you require scalability, use Hazelcast].
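
A minimal sketch of such a map, using a plain Python dict in place of a distributed map. The `row_hash` helper, the choice of BLAKE2b with a 16-byte digest, and the unit-separator delimiter are all assumptions for illustration; any stable hash would do. Joining fields with a delimiter matters because otherwise ("ab", "c") and ("a", "bc") would hash identically.

```python
import hashlib

FIELD_SEP = "\x1f"  # ASCII unit separator; assumed never to occur inside a field

def row_hash(fields):
    """Return a 16-byte digest of the non-key fields of one record."""
    joined = FIELD_SEP.join(field.strip() for field in fields)
    return hashlib.blake2b(joined.encode("utf-8"), digest_size=16).digest()

# ID -> digest, built from the first version of the sample file
hashes = {}
for record_id, name, age in [("1", "Jim", "20"), ("2", "Tim", "30"), ("3", "Kim", "40")]:
    hashes[record_id] = row_hash([name, age])
```

Storing the raw 16-byte digest rather than a hex string roughly halves the memory used per entry, which adds up when the files are this large.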

Your batch framework that loads the data [again assuming it treats one line of the file as one record] needs to check each record's computed hash against the hash stored for its ID in the in-memory map. The first-time creation of the map can also be done by your batch framework while reading the files.

If (ID present in the in-memory map)
--- compare the hash
--- if it is the same, discard the record
--- if it is different, create an update SQL and store the new hash
Else (ID not present in the in-memory map)
--- create an insert SQL and insert the hash value
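
The steps above can be sketched roughly as follows. This is only an illustration: the `classify` function is a made-up name, a plain dict stands in for the in-memory map, and SHA-256 is an arbitrary choice of hash. The sketch refreshes the stored hash on an update as well as on an insert, so that the next run compares against the latest version of the record.

```python
import hashlib

def classify(records, known_hashes):
    """Split records into inserts and updates, refreshing known_hashes in place."""
    inserts, updates = [], []
    for record_id, *fields in records:
        digest = hashlib.sha256("\x1f".join(fields).encode("utf-8")).digest()
        previous = known_hashes.get(record_id)
        if previous == digest:
            continue  # same hash: discard the record
        if previous is None:
            inserts.append((record_id, *fields))  # ID not seen before
        else:
            updates.append((record_id, *fields))  # ID seen, content changed
        known_hashes[record_id] = digest
    return inserts, updates

known = {}
first = [("1", "Jim", "20"), ("2", "Tim", "30"), ("3", "Kim", "40")]
second = [("1", "Jim", "20"), ("2", "Tim", "35"), ("3", "Kim", "40"), ("4", "Zim", "30")]

classify(first, known)  # first load: every record is an insert
inserts, updates = classify(second, known)
```

On the second pass only the records with IDs 2 and 4 come back, so those are the only two statements that need to reach the database.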

You might go for parallel processing, chunk processing, and in-memory data partitioning using Spring Batch and Hazelcast.

http://www.hazelcast.com/

http://static.springframework.org/spring-batch/
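
To give an idea of what chunking and partitioning mean here, below is a plain-Python sketch that uses only the standard library. It does not use the Spring Batch or Hazelcast APIs; `partition_of`, `chunks`, and `route` are hypothetical helpers. The point is that every record with a given ID always lands in the same partition, so each partition can hold its own slice of the hash map and be processed independently.

```python
import zlib
from itertools import islice

def partition_of(record_id, partitions):
    """Map an ID to a partition with a hash that is stable across processes."""
    return zlib.crc32(record_id.encode("utf-8")) % partitions

def chunks(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while chunk := list(islice(iterator, size)):
        yield chunk

def route(chunk, partitions):
    """Group one chunk of (id, ...) records by the partition that owns each ID."""
    routed = {}
    for record in chunk:
        routed.setdefault(partition_of(record[0], partitions), []).append(record)
    return routed
```

`zlib.crc32` is used instead of the built-in `hash()` because Python randomizes string hashes per process, which would send the same ID to different partitions on different workers.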

Hope this helps.

Tags:

algorithm

database

scalability

hyperboreean
