My company is considering using Flume for some fairly high-volume log processing. We believe that the log processing needs to be distributed, both for volume (scalability) and failover (reliability) reasons, and Flume seems the obvious choice.
However, we think we must be missing something obvious, because we don't see how Flume provides automatic scalability and failover.
I want to define a flow that says, for each log line: do thing A, then pass it along and do thing B, then pass it along and do thing C, and so on, which seems to match Flume well. However, I want to be able to define this flow in purely logical terms and then basically say, "Hey Flume, here are the servers, here is the flow definition, go to work!" Servers will die (and ops will restart them), we will add servers to the cluster and retire others, and Flume will just direct the work to whatever nodes have available capacity.
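Conceptually, I am imagining something like this (the syntax below is invented purely to illustrate what I mean by a logical flow definition; it is not real Flume configuration):

    flow logProcessing : doThingA -> doThingB -> doThingC ;
    cluster : host1, host2, host3 ;    # Flume decides which nodes run which steps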
That is essentially how Hadoop MapReduce implements scalability and failover, and I assumed Flume would be the same. However, the documentation seems to imply that I need to manually configure which physical servers each logical node runs on, and configure specific failover scenarios for each node.
Am I right that Flume does not serve our purpose, or did I miss something?
Thanks for your help.
Depending on whether you are using multiple masters, you can write your configuration to follow a failover pattern.
This is fairly detailed in the guide: http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html#_automatic_failover_chains
To answer your question bluntly: Flume does not yet have the ability to figure out a failover scheme automatically.
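For concreteness, the kind of setup that section of the guide describes looks roughly like the sketch below, submitted through the master. The node names, log path, and sink arguments here are illustrative, so treat this as an approximation of the guide's examples rather than a copy of them. Agents write to one of the auto chain sinks and collectors read from autoCollectorSource; the master then assigns each agent a failover chain across the collectors it knows about:

    agent1     : tail("/var/log/app/access.log") | autoDFOChain ;
    agent2     : tail("/var/log/app/access.log") | autoDFOChain ;
    collector1 : autoCollectorSource | collectorSink("hdfs://namenode/flume/", "access-") ;
    collector2 : autoCollectorSource | collectorSink("hdfs://namenode/flume/", "access-") ;

If one collector goes down, agents using an auto chain sink fail over to the next collector in the chain the master computed, but you still have to tell the master which physical machines are agents and which are collectors; it will not discover or rebalance them for you.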