I am working on a project and I am stuck on the following scenario.
I have a table: superMerge(id, name, salary)
and I have 2 other tables: table1 and table2
All three tables (table1, table2 and superMerge) have the same structure.
Now, my challenge is to insert into/update the superMerge table from table1 and table2. table1 is updated every 10 minutes and table2 every 20 minutes, so at time t = 20 minutes I have two jobs trying to update the same table (superMerge in this case).
I want to understand how I can achieve this parallel insert/update/merge into the superMerge table using Spark or any other Hadoop application.
The problem here is that the two jobs can't communicate with each other, so neither knows what the other is doing. A relatively easy solution would be to implement a basic file-based "locking" system:
Each job creates an empty file in a specific folder on HDFS to indicate that an update/insert is in progress, and removes that file once the job is done.
Each job then has to check whether such a file exists before starting its update/insert. If it exists, the job must wait until the file is gone.
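The steps above can be sketched roughly as follows. This is only a minimal illustration, assuming a local filesystem as a stand-in for HDFS; the name `file_lock`, the lock file name and the timeout values are made up for the example. Note that it does the "check" and the "create" as one atomic create-if-absent call: with a separate check followed by a create, both jobs could see no file and start at the same time. On HDFS the same idea would need an atomic create, for example `FileSystem.createNewFile` in the Hadoop Java API.

```python
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def file_lock(lock_path, poll_seconds=1.0, timeout_seconds=600.0):
    """Hold an exclusive lock represented by an empty file (hypothetical helper)."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so checking
            # for the lock and taking it happen in a single atomic step.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            break
        except FileExistsError:
            # Another job holds the lock: wait and retry until the deadline.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire lock {lock_path}")
            time.sleep(poll_seconds)
    try:
        yield
    finally:
        # Remove the lock file even if the update raised an exception.
        os.remove(lock_path)


if __name__ == "__main__":
    # Assumed lock location; a real job would use a shared, agreed-upon path.
    lock_file = os.path.join(tempfile.gettempdir(), "superMerge.lock")
    with file_lock(lock_file, poll_seconds=0.5, timeout_seconds=30.0):
        print("lock held: run the insert/update into superMerge here")
```

One thing to keep in mind with this approach: if a job is killed before it can remove the file, the lock stays in place and the other job will wait until its timeout, so some cleanup of stale lock files is usually needed.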