Ok, so the story is like this:
-- I am having lots of files (pretty big, around 25GB) that are in a particular format and needs to be imported in a datastore
-- these files are continuously updated with data, sometimes new, sometimes the same data
-- I am trying to figure out an algorithm on how could I detect if something has changed for a particular line in a file, in order to minimize the time spent updating the database
-- the way it currently works now is that I'm dropping all the data in the database each time and then reimport it, but this won't work anymore since I'll need a timestamp for when an item has changed.
-- the files contains strings and numbers (titles, orders, prices etc.)
The only solutions I could think of are:
-- compute a hash for each row from the database, that it's compared against the hash of the row from the file and if they're different the update the database
-- keep 2 copies of the files, the previous ones and the current ones and make diffs on it (which probably are faster than updating the db) and based on those update the db.
Since the amount of data is very big to huge, I am kind of out of options for now. On the long run, I'll get rid of the files and data will be pushed straight into the database, but the problem still remains.
Any advice, will be appreciated.
Problem definition as understood.
Let’s say your file contains
ID,Name,Age
1,Jim,20
2,Tim,30
3,Kim,40
As you stated Row can be added / updated , hence the file becomes
ID,Name,Age
1,Jim,20 -- to be discarded
2,Tim,35 -- to be updated
3,Kim,40 -- to be discarded
4,Zim,30 -- to be inserted
Now the requirement is to update the database by inserting / updating only above 2 records in two sql queries or 1 batch query containing two sql statements.
I am making following assumptions here
Store the hash values of Record [Name,Age] against ID in an in-memory Map where ID is the key and Value is hash [If you require scalability use hazelcast ].
Your Batch Framework to load the data [Again assuming treats one line of file as one record], needs to check the computed hash value against the ID in in-memory Map.First time creation can also be done using your batch framework for reading files.
If (ID present)
--- compare hash
---found same then discard it
—found different create an update sql
In case ID not present in in-memory hash,create an insert sql and insert the hashvalue
You might go for parallel processing , chunk processing and in-memory data partitioning using spring-batch and hazelcast.
http://www.hazelcast.com/
http://static.springframework.org/spring-batch/
Hope this helps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With