Kafka Storm HDFS/S3 data flow

Question

It is unclear to me whether Kafka supports fan-out (duplicating a stream to multiple destinations) the way Flume does.

I'd like to have Kafka save data to HDFS or S3 and send a duplicate of that data to Storm for real-time processing. The output of the Storm aggregations/analysis will be stored in Cassandra. I have seen some implementations that flow all data from Kafka into Storm and then produce two outputs from Storm. However, I'd like to eliminate the dependency on Storm for the raw data storage.

Is this possible? Are you aware of any documentation/examples/implementations like this?

Also, does Kafka have good support for S3 storage?

I saw Camus for storing to HDFS -- do you just run this job via cron to continually load data from Kafka to HDFS? What happens if a second instance of the job starts before the previous has finished? Finally, would Camus work with S3?

Thanks -- I appreciate it!

ggupta1612 · Accepted Answer

Regarding Camus: yes, a scheduler that launches the job should work. LinkedIn uses Azkaban for this, so you can look at that too.
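If the scheduler is plain cron rather than Azkaban, one common way to keep a new run from starting while the previous one is still going is a non-blocking file lock. Below is a minimal sketch of that idea, not anything Camus itself provides. The name `run_exclusively`, the lock path, and the `job` callable are hypothetical; `job` stands in for whatever actually launches the load. It relies on `fcntl`, so it assumes a Unix-like host.

```python
import fcntl


def run_exclusively(lock_path, job):
    """Run job() only if no other run holds the lock.

    Returns True if job() ran, False if another run was still in progress.
    `job` is a hypothetical stand-in for the code that launches the load.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # previous run still holds the lock: skip this one
        try:
            job()
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

A cron-invoked wrapper script could call `run_exclusively("/tmp/camus.lock", launch)` and simply exit when it returns False; both the path and `launch` are placeholders.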

If one run launches before the previous one finishes, some amount of data will be read twice, since the second job starts reading from the same offsets the first one used.
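To make the overlap concrete, here is a toy model in plain Python. It is not Camus code: the list `log` stands in for a Kafka partition, and it assumes the starting offset is only updated when a run completes, which matches the behaviour described above. All names are made up for illustration.

```python
def read_from(log, start_offset):
    """Return the records a run would pull and the offset it would save on success."""
    return log[start_offset:], len(log)


log = ["m0", "m1", "m2", "m3", "m4", "m5"]  # stand-in for one partition
committed_offset = 2  # saved by the last run that completed

# Run A starts and reads, but has not saved its new offset yet.
a_records, a_end = read_from(log, committed_offset)

# Run B starts before A finishes, so it sees the same saved offset.
b_records, b_end = read_from(log, committed_offset)

committed_offset = a_end  # A finishes and saves its offset, too late for B

# Everything B pulled was also pulled by A.
duplicates = [r for r in b_records if r in a_records]
print(duplicates)  # ['m2', 'm3', 'm4', 'm5']
```

In this model every record B reads is a duplicate because nothing new arrived between the two starts; with a live topic B would also pick up whatever was appended in the meantime.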

Regarding Camus with S3: I don't think that is supported currently.

Tags:

apache-kafka

hdfs

apache-storm

Roy
