
What is the most mature library for building a Data Analytics Pipeline in Java/Scala for Hadoop?

I have found many options recently, and I am interested in comparing them, primarily by maturity and stability.

  1. Crunch - https://github.com/cloudera/crunch
  2. Scrunch - https://github.com/cloudera/crunch/tree/master/scrunch
  3. Cascading - http://www.cascading.org/
  4. Scalding - https://github.com/twitter/scalding
  5. FlumeJava
  6. Scoobi - https://github.com/NICTA/scoobi/
asked Feb 24 '12 by yura

People also ask

Is Scala used in Hadoop?

Several of Hadoop's high-performance data frameworks are written in Scala or Java. The main reason for using Scala in these environments is its strong concurrency support, which is key to parallelizing the processing of large data sets.

What is the use of Scala in big data?

Scala is designed for constructing scalable solutions that digest and group large amounts of data in order to generate actionable insights. It lets you work with immutable data and higher-order functions, concepts that are also used frequently in functional-style Python.
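A minimal sketch of the two ideas mentioned above: the input is an immutable `List`, and the transformation is expressed with higher-order functions rather than mutation. The names and values here are purely illustrative.

```scala
object ImmutablePipeline {
  def main(args: Array[String]): Unit = {
    val readings = List(3, 7, 12, 5, 9)  // immutable data: never modified in place
    val total = readings
      .filter(_ > 4)                     // higher-order function: keep values above a threshold
      .map(_ * 2)                        // transform each element
      .sum                               // aggregate
    println(total)                       // 66
  }
}
```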

How do you make a scalable big data analytics pipeline?

The first step is data ingestion from your data source. The process then extends to enriching the data so that downstream systems can consume it in the most usable format. After that, you store the data in a data warehouse or data lake for analysis and reporting.
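The three steps above (ingest, enrich, store) can be sketched as follows. This is a hypothetical example using plain Scala collections as stand-ins for a real source and sink; the `Event`/`Enriched` types and all values are assumptions for illustration.

```scala
object PipelineSketch {
  case class Event(user: String, bytes: Int)           // raw ingested record
  case class Enriched(user: String, kilobytes: Double) // downstream-friendly form

  // Stand-in for reading from a real data source
  def ingest(): List[Event] =
    List(Event("a", 2048), Event("b", 4096))

  // Enrich: convert raw bytes into the unit downstream reports expect
  def enrich(events: List[Event]): List[Enriched] =
    events.map(e => Enriched(e.user, e.bytes / 1024.0))

  // Stand-in for writing to a data warehouse or data lake
  def store(rows: List[Enriched]): Unit =
    rows.foreach(println)

  def main(args: Array[String]): Unit =
    store(enrich(ingest()))
}
```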

What is analytics pipeline?

An analytics pipeline streamlines data flow to improve the speed and quality of insights. Similar to a continuous integration/continuous delivery (CI/CD) pipeline used by a DevOps team, the speed advantage of an analytics pipeline hinges on automating tasks.


2 Answers

As I'm a developer of Scoobi, don't expect an unbiased answer.

First of all, FlumeJava is an internal Google project that provides an (awesomely productive) abstraction on top of MapReduce (not Hadoop, though). Google released a paper about it, which is what projects like Scoobi and Crunch are based on.

If your only criterion is maturity -- I guess Cascading is your best bet.

However, if you're looking for the (imho superior) FlumeJava style abstraction, you'll want to pick between (S)crunch and Scoobi.

The biggest difference, superficial as it may be, is that Crunch is written in Java with Scala bindings (Scrunch), while Scoobi is written in Scala with Java bindings (scoobij). They're both really solid choices, and you won't go wrong whichever you choose. I'm sure there's quite a similar story with Crunch, but Scoobi is being used in real projects and is under continual development. We're very active in fixing bugs and implementing features.

Anyway, they're both great projects with great people behind them, and they were both released within days of each other. They provide the same abstraction (with a similar API), so switching between the two won't be an issue in the slightest. My recommendation is to give them both a try and see what works for you. There's no lock-in in either project, so you don't need to commit :)
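To give a feel for the FlumeJava-style abstraction, here is the classic word count sketched in Scoobi, adapted from the project's README of that era. Treat the exact method names and signatures as a sketch against an early Scoobi release, not a definitive reference; the input and output paths come from the command line.

```scala
import com.nicta.scoobi.Scoobi._

object WordCount extends ScoobiApp {
  def run() {
    val lines = fromTextFile(args(0))      // DList[String]: one element per input line
    val counts = lines
      .flatMap(_.split(" "))               // split lines into words
      .map(word => (word, 1))              // pair each word with a count of 1
      .groupByKey                          // shuffle: group counts by word
      .combine(_ + _)                      // sum the counts per word
    persist(toTextFile(counts, args(1)))   // nothing runs until persist is called
  }
}
```

Note that, as in FlumeJava, the `DList` operations build a lazy execution plan; `persist` is what actually compiles the plan down to MapReduce jobs.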

And if you have any feedback for either project, please be sure to provide it :)

answered Oct 18 '22 by Heptic


I'm a big Scoobi fan myself and I've used it in production. I like the way it allows you to write type-safe Hadoop programs in a very idiomatic Scala way. If that is not necessarily your thing and you like the Cascading model but are scared off by the huge amount of boilerplate code you'd have to write, Twitter has recently open sourced its own Scala abstraction layer on top of Cascading called Scalding.

  • Announcement: https://dev.twitter.com/blog/scalding
  • GitHub: https://github.com/twitter/scalding
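For comparison, the same word count in Scalding looks like this, adapted from the Scalding README of the time. Field names use Scalding's symbol-based Fields API, and the input/output paths are supplied as command-line arguments; consider the details a sketch against an early release.

```scala
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                                      // each line lands in the 'line field
    .flatMap('line -> 'word) { line: String =>
      line.split("\\s+")                                       // split each line into words
    }
    .groupBy('word) { _.size }                                 // count occurrences of each word
    .write(Tsv(args("output")))                                // tab-separated (word, count) pairs
}
```

The boilerplate reduction over raw Cascading is striking: the same job in plain Cascading requires explicitly wiring up pipes, taps, and operations.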

I guess it's all a matter of taste at this point since feature-wise most of the frameworks are very close to one another.

answered Oct 18 '22 by Age Mooij