
Multiple inserts into a table using Apache Spark

I am working on a project and I am stuck on the following scenario.

I have a table: superMerge(id, name, salary)

and I have 2 other tables: table1 and table2

All the tables (table1, table2 and superMerge) have the same structure.

Now, my challenge is to insert/update the superMerge table from table1 and table2. table1 is updated every 10 minutes and table2 every 20 minutes, so at time t=20 minutes I have 2 jobs trying to update the same table (superMerge in this case).

I want to understand how I can achieve this parallel insert/update/merge into the superMerge table using Spark or any other Hadoop application.
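To make the intended merge semantics concrete, here is a minimal sketch using plain Python dicts keyed by id in place of the real tables. The column names (id, name, salary) come from the question above; the helper name `merge_into` and the sample rows are hypothetical, and this says nothing yet about the concurrency problem, only about what a single merge should do.

```python
def merge_into(super_merge, source):
    """Upsert: insert rows whose id is new, update rows whose id exists."""
    for row_id, (name, salary) in source.items():
        super_merge[row_id] = (name, salary)  # insert or overwrite by id
    return super_merge

# Hypothetical sample data illustrating one merge from table1.
super_merge = {1: ("alice", 100)}
table1 = {1: ("alice", 120), 2: ("bob", 90)}  # updates id 1, inserts id 2
merge_into(super_merge, table1)
print(super_merge)  # {1: ('alice', 120), 2: ('bob', 90)}
```

In Spark terms this is the classic upsert: a full outer join of superMerge with the source table on id, preferring the source row where both sides match.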

GKV asked Jan 29 '23 08:01


1 Answer

The problem here is that the two jobs can't communicate with each other; neither knows what the other is doing. A relatively easy solution would be to implement a basic file-based "locking" system:

  • Each job creates an (empty) file in a specific folder on HDFS indicating that the update/insert is in progress, and removes that file when the job is done.

  • Now, each job has to check whether such a file exists before starting the update/insert. If it exists, the job must wait until the file is gone.
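The two steps above can be sketched as follows. This is a minimal local-filesystem stand-in, not the HDFS implementation: the lock path, function names, and timeouts are all hypothetical. Opening with O_CREAT|O_EXCL makes the check-and-create atomic, so two jobs racing for the lock cannot both win; on HDFS the equivalent atomic create could be done with Hadoop's FileSystem.createNewFile.

```python
import os
import time

LOCK_PATH = "/tmp/superMerge.lock"  # hypothetical stand-in for an HDFS lock file

def acquire_lock(path, timeout=600, poll=5):
    """Try to create the lock file atomically; retry until timeout expires.
    O_CREAT|O_EXCL fails if the file already exists, so only one job wins."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True  # lock acquired
        except FileExistsError:
            time.sleep(poll)  # another job is mid-update; wait and retry
    return False  # gave up waiting

def release_lock(path):
    os.remove(path)  # signals waiting jobs that the update is finished

if acquire_lock(LOCK_PATH, timeout=60, poll=1):
    try:
        pass  # run the insert/update into superMerge here
    finally:
        release_lock(LOCK_PATH)  # always release, even if the job fails
```

The try/finally matters: if a job crashes between acquire and release without cleanup, the stale lock file blocks all future jobs, which is the main weakness of this simple scheme (a timestamp in the file plus a staleness check is one common mitigation).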

Raphael Roth answered Mar 06 '23 13:03