Maybe this question has been asked before, but I think it is worth considering again today, given that these technologies have matured. We're looking to use Flume, Kafka, Scribe, or something similar to store streaming Facebook and Twitter profile information into HBase for doing analytics later on. We're leaning toward Flume, but I haven't worked with the other technologies enough to make an informed decision. Anyone who can shed some light on this would be great! Thanks a lot.
Kafka can support data streams for multiple applications, whereas Flume is specific to Hadoop and big-data analysis. Kafka can process and monitor data in distributed systems, whereas Flume gathers data from distributed systems and lands it on a centralized data store. Kafka runs as a cluster that handles high-volume incoming data streams in real time, while Flume is a tool for collecting log data from distributed web servers.
One of Kafka's best features is that it is highly available, resilient to node failures, and supports automatic recovery. Flume, on the other hand, is designed mainly for Hadoop and is part of the Hadoop ecosystem; it is used to collect data from different sources and transfer it to a centralized data store.
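To get a feel for the Kafka side, here is a minimal Java producer sketch that publishes one profile record to a topic. The broker address, topic name (`profiles`), record key, and JSON payload are all placeholder assumptions for illustration, not anything Kafka prescribes:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProfileProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Address of at least one broker in the Kafka cluster -- adjust for your setup.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical profile payload; in practice this would come from the Twitter/Facebook API.
            String profileJson = "{\"id\":\"12345\",\"name\":\"example\",\"followers\":42}";
            // Keying by profile id sends updates for the same profile to the same partition.
            producer.send(new ProducerRecord<>("profiles", "12345", profileJson));
        }
    }
}
```

Multiple downstream consumers (an HBase writer, a real-time dashboard, and so on) can then read the same topic independently, which is the "multiple applications" property mentioned above.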
Apache Flume is the right choice mainly when you have to collect and move huge volumes of log data generated by web servers into Hadoop HDFS, for example to feed analytics such as sentiment analysis.
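If you go the Flume route, the pipeline is usually wired up declaratively rather than in code. Below is a rough sketch of an agent configuration using Flume's bundled Twitter source and HBase sink; the agent name, credentials, table, and column family are placeholders, and a real deployment would likely swap the simple serializer shown here for a custom one that parses the event body into proper columns:

```
# Hypothetical Flume agent: Twitter source -> memory channel -> HBase sink
agent.sources = twitter
agent.channels = mem
agent.sinks = hbase

agent.sources.twitter.type = org.apache.flume.source.twitter.TwitterSource
agent.sources.twitter.consumerKey = YOUR_CONSUMER_KEY
agent.sources.twitter.consumerSecret = YOUR_CONSUMER_SECRET
agent.sources.twitter.accessToken = YOUR_ACCESS_TOKEN
agent.sources.twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET
agent.sources.twitter.channels = mem

agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

agent.sinks.hbase.type = hbase
agent.sinks.hbase.table = profiles
agent.sinks.hbase.columnFamily = raw
agent.sinks.hbase.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
agent.sinks.hbase.channel = mem
```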
MediaWiki (Wikipedia) went through this exact exercise and published a nice article on how they arrived at their choice (Kafka) over Scribe, Flume, and others.
http://www.mediawiki.org/wiki/Analytics/Kraken/Request_Logging
new link:
https://wikitech.wikimedia.org/wiki/Analytics/Archive/Hadoop_Logging_-_Solutions_Recommendation
summary for posterity:
"Our recommendation is Apache Kafka, a distributed pub-sub messaging system designed for throughput. We evaluated about a dozen[1] best-of-breed systems drawn from the domains of distributed log collection, CEP / stream processing, and real-time messaging systems. While these systems offer surprisingly similar features, they differ substantially in implementation, and each is specialized to a particular work profile (a more thorough technical discussion is available as an appendix).
"Kafka stands out because it is specialized for throughput and explicitly distributed in all tiers of its architecture. Interestingly, it is also concerned enough with resource conservation[2] to offer sensible tradeoffs that loosen guarantees in exchange for performance — something that may not strike Facebook or Google as an important feature in the systems they design. Constraints breed creativity.
"In addition, Kafka has several perks of particular interest to Operations readers. While it is written in Scala, it ships with a native C++ producer library that can be embedded in a module for our cache servers, obviating the need to run the JVM on those servers. Second, producers can be configured to batch requests to optimize network traffic, but do not create a persistent local log which would require additional maintenance. Kafka's I/O and memory usage is left up to the OS rather than the JVM[3].
"Kafka was written by LinkedIn and is now an Apache project. In production at LinkedIn, approximately 10,000 producers are handled by eight Kafka servers per datacenter. These clusters consolidate their streams into a single analytics datacenter, which Kafka supports out of the box via a simple mirroring configuration.
"These features are a very apt fit for our intended use cases; even those we don't intend to use — such as sharding and routing by "topic" categories — are interesting and might prove useful in the future as we expand our goals.
"The rest of this document dives into these topics in greater detail..."