Loading data from RDBMS to Hadoop with multiple destinations

We have implemented a solution using Sqoop to load data from an RDBMS into our Hadoop cluster: append-only data goes to Hive, while dimension data goes to HBase.
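Roughly, the imports look like the sketch below; it is simplified, and the connection string, table names, and key columns are placeholders rather than our real values:

    # Append-only fact data -> Hive, imported incrementally
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl --password-file /user/etl/.dbpass \
      --table orders \
      --hive-import --hive-table orders \
      --incremental append --check-column order_id --last-value 0

    # Dimension data -> HBase
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl --password-file /user/etl/.dbpass \
      --table customers \
      --hbase-table customers --column-family d \
      --hbase-row-key customer_id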

Now we are setting up two identical Hadoop clusters that will serve as backups for each other, and we want to load the data from the RDBMS once and have it land on both clusters. Sqoop does not let us do this in a single job. We have seen streaming solutions such as StreamSets and NiFi that can pull data from one place and send it to multiple destinations in one go. We are also considering using Sqoop to load one cluster and then setting up a sync job that copies the data to the other cluster periodically; given the huge volume of data we have, this sounds more appropriate.

Can someone share some real-life experience with this?

asked May 08 '17 by Shengjie


1 Answer

I see two options to do this:

  • Set up two different Sqoop jobs, one copying into each cluster. This gives you two sets of active data rather than a backup, since both clusters are updated directly from the source. It also creates extra load on the relational database system, as roughly twice as many connections are opened to copy the data.

  • Use a single Sqoop job to load data into one cluster, then copy it to the other cluster using distcp -update (or distcp -append); a sketch follows this list. A few advantages of this method:

    • This should reduce the load on the relational database system.

    • You can leverage MapReduce (which distcp uses under the hood) for a faster copy of data between clusters.

    • You can schedule the backup frequency using Oozie.

    • You can work against either the active copy or the backup copy.
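As a rough sketch of the second option (the NameNode hosts, ports, and paths below are placeholders):

    # Sync the imported data from the active cluster (nn1) to the
    # backup cluster (nn2). -update copies only files that differ;
    # -append (used with -update) appends to target files that have
    # simply grown at the source, which suits append-only Hive data.
    hadoop distcp -update -append \
      hdfs://nn1:8020/user/hive/warehouse/orders \
      hdfs://nn2:8020/user/hive/warehouse/orders

This command can be wrapped in an Oozie distcp or shell action and scheduled with a coordinator to control the backup frequency.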

Let me know your thoughts, and if you have already finalized a solution, please do share it.

answered Nov 15 '22 by vmorusu