
Kafka Storm HDFS/S3 data flow

It is unclear if you can do a fan-out (duplication) in Kafka like you can in Flume.

I'd like to have Kafka save data to HDFS or S3 and send a duplicate of that data to Storm for real-time processing. The output of the Storm aggregations/analysis will be stored in Cassandra. I have seen some implementations that flow all data from Kafka into Storm and then produce two outputs from Storm. However, I'd like to eliminate the dependency on Storm for the raw data storage.

Is this possible? Are you aware of any documentation/examples/implementations like this?

Also, does Kafka have good support for S3 storage?

I saw Camus for storing to HDFS. Do you just run this job via cron to continually load data from Kafka to HDFS? What happens if a second instance of the job starts before the previous one has finished? Finally, would Camus work with S3?

Thanks -- I appreciate it!

Roy asked May 01 '26 02:05


1 Answer

Regarding Camus: yes, a scheduler that launches the job should work. LinkedIn uses Azkaban for this, so you can look at that too.

If one run launches before the previous one finishes, some amount of data will be read twice, because the second job will start reading from the same offsets used by the first one.
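One way to avoid that overlap is to have the scheduled wrapper take an exclusive lock and skip the run if a previous one still holds it. The following is only a minimal sketch, assuming a Unix-like host: the lock path and the `job` callable are hypothetical placeholders, and how Camus itself is launched is not shown.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock path; any path writable by the scheduler's user would do.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "camus_job.lock")


@contextmanager
def single_instance(lock_path=LOCK_PATH):
    """Yield True if this run obtained the lock, False if another run holds it."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking exclusive lock: fails immediately if already held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        # Closing the descriptor releases the lock, also after a crash.
        os.close(fd)


def run_once(job, lock_path=LOCK_PATH):
    """Run job() only if no other run holds the lock; return whether it ran."""
    with single_instance(lock_path) as acquired:
        if acquired:
            job()
        return acquired
```

A cron entry would then call a script that wraps the actual job launch in `run_once(...)`. A run that starts while the previous one is still working returns `False` and exits without reading any offsets.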

Regarding Camus with S3, I don't think that is currently supported.

ggupta1612 answered May 05 '26 00:05


