
flume vs kafka vs others [closed]

Tags: scribe, flume

Maybe this question has been asked before, but I think it is worth considering again today given that these technologies have matured. We're looking to use one of Flume, Kafka, Scribe, or others to store streaming Facebook and Twitter profile information in HBase for later analytics. We're considering Flume for this purpose, but I have not worked with the other technologies and so cannot make an informed decision. If anyone can shed some light, that would be great! Thanks a lot.

pranavsharma asked Sep 24 '12 05:09



1 Answer

Wikimedia (the organization behind Wikipedia) went through this and published a nice article on how they arrived at their choice (Kafka) over Scribe, Flume and others.

http://www.mediawiki.org/wiki/Analytics/Kraken/Request_Logging

new link:
https://wikitech.wikimedia.org/wiki/Analytics/Archive/Hadoop_Logging_-_Solutions_Recommendation

summary for posterity:

"Our recommendation is Apache Kafka, a distributed pub-sub messaging system designed for throughput. We evaluated about a dozen[1] best-of-breed systems drawn from the domains of distributed log collection, CEP / stream processing, and real-time messaging systems. While these systems offer surprisingly similar features, they differ substantially in implementation, and each is specialized to a particular work profile (a more thorough technical discussion is available as an appendix).

"Kafka stands out because it is specialized for throughput and explicitly distributed in all tiers of its architecture. Interestingly, it is also concerned enough with resource conservation[2] to offer sensible tradeoffs that loosen guarantees in exchange for performance — something that may not strike Facebook or Google as an important feature in the systems they design. Constraints breed creativity.

"In addition, Kafka has several perks of particular interest to Operations readers. While it is written in Scala, it ships with a native C++ producer library that can be embedded in a module for our cache servers, obviating the need to run the JVM on those servers. Second, producers can be configured to batch requests to optimize network traffic, but do not create a persistent local log which would require additional maintenance. Kafka's I/O and memory usage is left up to the OS rather than the JVM[3].

"Kafka was written by LinkedIn and is now an Apache project. In production at LinkedIn, approximately 10,000 producers are handled by eight Kafka servers per datacenter. These clusters consolidate their streams into a single analytics datacenter, which Kafka supports out of the box via a simple mirroring configuration.

"These features are a very apt fit for our intended use cases; even those we don't intend to use — such as sharding and routing by "topic" categories — are interesting and might prove useful in the future as we expand our goals.

"The rest of this document dives into these topics in greater detail..."
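To make the batching point in the quote a bit more concrete, here is a rough sketch in plain Python. It is not the Kafka client API: `BatchingProducer`, the `send` callback and the topic names are all made up for illustration. It only shows the general idea of buffering messages in memory per topic and sending them as one request, without writing a local log.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class BatchingProducer:
    """Toy stand-in for a producer that batches messages per topic.

    Hypothetical class for illustration only; it is not the Kafka client API.
    """

    def __init__(self, send: Callable[[str, List[bytes]], None], batch_size: int = 3):
        self._send = send              # callback that performs one simulated network request
        self._batch_size = batch_size  # number of messages that triggers a flush
        self._buffers: Dict[str, List[bytes]] = defaultdict(list)  # held in memory only

    def produce(self, topic: str, message: bytes) -> None:
        """Buffer a message and flush the topic once the batch is full."""
        buf = self._buffers[topic]
        buf.append(message)
        if len(buf) >= self._batch_size:
            self.flush(topic)

    def flush(self, topic: str) -> None:
        """Send whatever is buffered for one topic as a single request."""
        batch = self._buffers.pop(topic, [])
        if batch:
            self._send(topic, batch)

    def flush_all(self) -> None:
        """Send the remaining partial batches for every topic."""
        for topic in list(self._buffers):
            self.flush(topic)


requests = []  # records every simulated network request as (topic, batch)

producer = BatchingProducer(lambda topic, batch: requests.append((topic, batch)), batch_size=3)
for i in range(7):
    producer.produce("twitter-profiles", f"profile-{i}".encode())  # hypothetical topic name
producer.produce("facebook-profiles", b"profile-a")                # hypothetical topic name
producer.flush_all()

print(len(requests))  # 8 messages go out in 4 requests
```

The trade-off is the one the quote hints at: anything still sitting in the in-memory buffer is lost if the process dies before a flush, which is the price of not keeping a persistent local log.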

Anentropic answered Oct 19 '22 14:10