
Use Storm to join two tables from two different DBs

Tags:

apache-storm

I am a newbie to Storm. I'm wondering whether I can use Storm to merge/join two tables from two different DBs (of course, the two tables have some sort of foreign key relationship; they just happen to live in different DBs/systems). Any ideas on how I'd set up the topology? For example, having two separate spouts reading periodically from the two DBs and a bolt to do the join work?

Is this even a proper use case for Storm?

Any ideas are appreciated!

Shengjie asked Nov 21 '13 05:11


2 Answers

This may be a good use of Storm, but it really depends on your dataset. If you just have two tables in separate DBMSs that you want to join and store in some third place (DBMS or otherwise), Storm will only really make sense if this is a streaming join, i.e. the two tables are frequently written to and you want to join the rows that were recently written to each of them.
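To make "streaming join" concrete, here is a minimal sketch in plain Python (not Storm API). It assumes rows arrive with non-decreasing timestamps and that two rows should be joined only if they share a key and arrive within a fixed window of each other. The class name `WindowedJoin`, the `left`/`right` side labels, and the sample rows are all hypothetical.

```python
from collections import defaultdict


class WindowedJoin:
    """Joins rows from two streams that share a key and arrive within `window` seconds."""

    def __init__(self, window):
        self.window = window
        # side name -> join key -> list of (timestamp, row) still waiting for a match
        self.buffers = {"left": defaultdict(list), "right": defaultdict(list)}

    def add(self, side, key, row, ts):
        """Buffer one row and return the joined pairs it completes."""
        other = "right" if side == "left" else "left"
        # Keep only rows on the other side that are still inside the window.
        fresh = [(t, r) for (t, r) in self.buffers[other][key] if ts - t <= self.window]
        self.buffers[other][key] = fresh
        self.buffers[side][key].append((ts, row))
        # Always return pairs ordered as (left_row, right_row).
        if side == "left":
            return [(row, r) for (_, r) in fresh]
        return [(r, row) for (_, r) in fresh]


join = WindowedJoin(window=60)
join.add("left", 1, {"user": "a"}, ts=0)             # nothing to match yet
pairs = join.add("right", 1, {"order": 10}, ts=30)   # matches the buffered left row
```

A real join bolt would also need to evict old keys to bound memory; this sketch only expires rows lazily when the same key shows up again on the other side.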

Also, it almost goes without saying that you should only employ the complexity Storm will bring if this is for something relatively large and high volume.

If it's small, you will probably be better served with a traditional ETL tool, even if that's just some code you whip up to access the two databases and combine the data.
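For the small case, the "code you whip up" can be as simple as an in-memory hash join. The sketch below assumes both tables fit in memory and have already been fetched as lists of dicts (for example via two database cursors); the function name `hash_join` and the sample column names are hypothetical.

```python
def hash_join(left_rows, right_rows, left_key, right_key):
    """Inner-join two lists of dict rows on left_key == right_key."""
    # Build an index over the right-hand table: key value -> matching rows.
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)

    joined = []
    for row in left_rows:
        for match in index.get(row[left_key], []):
            # On a column name clash, the right-hand value wins.
            joined.append({**row, **match})
    return joined


users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
orders = [{"order_id": 10, "buyer_id": 1}, {"order_id": 11, "buyer_id": 1}]
result = hash_join(users, orders, "user_id", "buyer_id")
```

Run on a schedule, something like this covers the periodic-merge case without a cluster.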

If the data set is large and you need to do joins across more than a short timeframe, I'd consider doing this another way, such as using a map-reduce job that pulls data from the two DBs and spreads the join out over a cluster.

G Gordon Worley III answered Oct 15 '22 20:10


"like having two separate spouts reading periodically from two DBs and having a bolt to do the join work"

Yes, this is very much possible. A Storm topology can have multiple spouts, and a bolt can consume any number of input streams, do some processing, and possibly emit new streams. Typically it's better to have your spouts read from a queue such as Kafka or RabbitMQ (spout integrations exist for most queuing systems). In that case you feed the queue with the data from the DBs and let the spouts consume from it.
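One detail matters once the join bolt runs as several parallel tasks: rows from both sources that share a join key have to reach the same task, which in Storm is what a fields grouping on the key is for. The sketch below is plain Python, not Storm code, and only illustrates that routing idea; `task_for`, `route`, the `db_a`/`db_b` source labels, and the CRC-based hash are assumptions made for the example.

```python
import zlib


def task_for(key, num_tasks):
    """Pick a task index from the join key with a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_tasks


def route(tuples, num_tasks):
    """Distribute (source, key, row) tuples so equal keys share a task."""
    tasks = [[] for _ in range(num_tasks)]
    for source, key, row in tuples:
        tasks[task_for(key, num_tasks)].append((source, key, row))
    return tasks


# Tuples as they might arrive from two spouts, one per database.
stream = [
    ("db_a", 7, {"name": "ann"}),
    ("db_b", 7, {"order_id": 10}),
    ("db_a", 8, {"name": "bob"}),
]
tasks = route(stream, num_tasks=4)
```

Because the task is chosen from the key alone, each task can keep its own local join buffer without coordinating with the others.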

UPDATE:
Here is a very nice article about how Storm parallelism works.

user2720864 answered Oct 15 '22 19:10