
Multiple inserts into a table using Apache Spark

I am working on a project and I am stuck on the following scenario.

I have a table: superMerge(id, name, salary)

and I have 2 other tables: table1 and table2

All the tables (table1, table2 and superMerge) have the same structure.

Now, my challenge is to insert/update the superMerge table from table1 and table2. table1 is updated every 10 minutes and table2 every 20 minutes, so at time t=20 minutes I have 2 jobs trying to update the same table (superMerge in this case).

I want to understand how I can achieve this parallel insert/update/merge into the superMerge table using Spark or any other Hadoop application.
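To make the intended merge semantics concrete, here is a minimal sketch using plain Python dicts keyed by id in place of the real tables. The column names (id, name, salary) come from the question above; the helper name `merge_into` and the sample rows are hypothetical, and this says nothing yet about the concurrency problem, only about what a single merge should do.

```python
def merge_into(super_merge, source):
    """Upsert: insert rows whose id is new, update rows whose id exists."""
    for row_id, (name, salary) in source.items():
        super_merge[row_id] = (name, salary)  # insert or overwrite by id
    return super_merge

# Hypothetical sample data illustrating one merge from table1.
super_merge = {1: ("alice", 100)}
table1 = {1: ("alice", 120), 2: ("bob", 90)}  # updates id 1, inserts id 2
merge_into(super_merge, table1)
print(super_merge)  # {1: ('alice', 120), 2: ('bob', 90)}
```

In Spark terms this is the classic upsert: a full outer join of superMerge with the source table on id, preferring the source row where both sides match.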

GKV asked Jan 29 '23 08:01


1 Answer

The problem here is that the two jobs can't communicate with each other; neither knows what the other is doing. A relatively easy solution would be to implement a basic file-based "locking" system:

  • Each job creates an (empty) file in a specific folder on HDFS indicating that the update/insert is in progress, and removes that file when the job is done.

  • Now, each job has to check whether such a file exists before starting the update/insert. If it exists, the job must wait until the file is gone.
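The two steps above can be sketched as follows. This is a minimal local-filesystem stand-in, not the HDFS implementation: the lock path, function names, and timeouts are all hypothetical. Opening with O_CREAT|O_EXCL makes the check-and-create atomic, so two jobs racing for the lock cannot both win; on HDFS the equivalent atomic create could be done with Hadoop's FileSystem.createNewFile.

```python
import os
import time

LOCK_PATH = "/tmp/superMerge.lock"  # hypothetical stand-in for an HDFS lock file

def acquire_lock(path, timeout=600, poll=5):
    """Try to create the lock file atomically; retry until timeout expires.
    O_CREAT|O_EXCL fails if the file already exists, so only one job wins."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True  # lock acquired
        except FileExistsError:
            time.sleep(poll)  # another job is mid-update; wait and retry
    return False  # gave up waiting

def release_lock(path):
    os.remove(path)  # signals waiting jobs that the update is finished

if acquire_lock(LOCK_PATH, timeout=60, poll=1):
    try:
        pass  # run the insert/update into superMerge here
    finally:
        release_lock(LOCK_PATH)  # always release, even if the job fails
```

The try/finally matters: if a job crashes between acquire and release without cleanup, the stale lock file blocks all future jobs, which is the main weakness of this simple scheme (a timestamp in the file plus a staleness check is one common mitigation).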

Raphael Roth answered Mar 06 '23 13:03