
What is the most mature library for building a Data Analytics Pipeline in Java/Scala for Hadoop?

I have found many options recently, and I am interested in comparing them, primarily by maturity and stability.

  1. Crunch - https://github.com/cloudera/crunch
  2. Scrunch - https://github.com/cloudera/crunch/tree/master/scrunch
  3. Cascading - http://www.cascading.org/
  4. Scalding - https://github.com/twitter/scalding
  5. FlumeJava
  6. Scoobi - https://github.com/NICTA/scoobi/
asked Feb 24 '12 by yura

People also ask

Is Scala used in Hadoop?

Several of Hadoop's high-performance data frameworks are written in Scala or Java. The main reason for using Scala in these environments is its strong concurrency support, which is key to parallelizing the processing of large data sets.

What is the use of Scala in big data?

Scala is designed for constructing scalable solutions that digest and group large amounts of data in order to generate actionable insights. It lets you work with immutable data and higher-order functions, concepts that are also used frequently in functional-style Python.
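A minimal sketch of the two ideas mentioned above: the input is an immutable `List`, and the transformation is expressed with higher-order functions rather than mutation. The names and values here are purely illustrative.

```scala
object ImmutablePipeline {
  def main(args: Array[String]): Unit = {
    val readings = List(3, 7, 12, 5, 9)  // immutable data: never modified in place
    val total = readings
      .filter(_ > 4)                     // higher-order function: keep values above a threshold
      .map(_ * 2)                        // transform each element
      .sum                               // aggregate
    println(total)                       // 66
  }
}
```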

How do you make a scalable big data analytics pipeline?

The first step is data ingestion from your data source. The process then extends to enriching the data so that downstream systems can consume it in the most usable format. After that, you store the data in a data warehouse or data lake for analysis and reporting.
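The three steps above (ingest, enrich, store) can be sketched as follows. This is a hypothetical example using plain Scala collections as stand-ins for a real source and sink; the `Event`/`Enriched` types and all values are assumptions for illustration.

```scala
object PipelineSketch {
  case class Event(user: String, bytes: Int)           // raw ingested record
  case class Enriched(user: String, kilobytes: Double) // downstream-friendly form

  // Stand-in for reading from a real data source
  def ingest(): List[Event] =
    List(Event("a", 2048), Event("b", 4096))

  // Enrich: convert raw bytes into the unit downstream reports expect
  def enrich(events: List[Event]): List[Enriched] =
    events.map(e => Enriched(e.user, e.bytes / 1024.0))

  // Stand-in for writing to a data warehouse or data lake
  def store(rows: List[Enriched]): Unit =
    rows.foreach(println)

  def main(args: Array[String]): Unit =
    store(enrich(ingest()))
}
```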

What is analytics pipeline?

An analytics pipeline streamlines data flow to improve the speed and quality of insights. Similar to a continuous integration/continuous delivery (CI/CD) pipeline used by a DevOps team, the speed advantage of an analytics pipeline hinges on automating tasks.


2 Answers

As I'm a developer of Scoobi, don't expect an unbiased answer.

First of all, FlumeJava is an internal Google project that provides an (awesomely productive) abstraction on top of MapReduce (not Hadoop, though). Google released a paper about it, which is what projects like Scoobi and Crunch are based on.

If your only criterion is maturity -- I guess Cascading is your best bet.

However, if you're looking for the (imho superior) FlumeJava style abstraction, you'll want to pick between (S)crunch and Scoobi.

The biggest difference, superficial as it may be, is that Crunch is written in Java with Scala bindings (Scrunch), while Scoobi is written in Scala with Java bindings (scoobij). They're both really solid choices, and you won't go wrong whichever you choose. I'm sure there's quite a similar story with Crunch, but Scoobi is being used in real projects and is under continual development. We're very active in fixing bugs and implementing features.

Anyway, they're both great projects with great people behind them, and they were both released within days of each other. They provide the same abstraction (with a similar API), so switching between the two won't be an issue in the slightest. My recommendation is to give them both a try and see what works for you. There's no lock-in in either project, so you don't need to commit :)
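To give a feel for the FlumeJava-style abstraction, here is the classic word count sketched in Scoobi, adapted from the project's README of that era. Treat the exact method names and signatures as a sketch against an early Scoobi release, not a definitive reference; the input and output paths come from the command line.

```scala
import com.nicta.scoobi.Scoobi._

object WordCount extends ScoobiApp {
  def run() {
    val lines = fromTextFile(args(0))      // DList[String]: one element per input line
    val counts = lines
      .flatMap(_.split(" "))               // split lines into words
      .map(word => (word, 1))              // pair each word with a count of 1
      .groupByKey                          // shuffle: group counts by word
      .combine(_ + _)                      // sum the counts per word
    persist(toTextFile(counts, args(1)))   // nothing runs until persist is called
  }
}
```

Note that, as in FlumeJava, the `DList` operations build a lazy execution plan; `persist` is what actually compiles the plan down to MapReduce jobs.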

And if you have any feedback for either project, please be sure to provide it :)

answered Oct 18 '22 by Heptic


I'm a big Scoobi fan myself and I've used it in production. I like the way it allows you to write type-safe Hadoop programs in a very idiomatic Scala way. If that is not necessarily your thing and you like the Cascading model but are scared off by the huge amount of boilerplate code you'd have to write, Twitter has recently open sourced its own Scala abstraction layer on top of Cascading called Scalding.

  • Announcement: https://dev.twitter.com/blog/scalding
  • GitHub: https://github.com/twitter/scalding
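For comparison, the same word count in Scalding looks like this, adapted from the Scalding README of the time. Field names use Scalding's symbol-based Fields API, and the input/output paths are supplied as command-line arguments; consider the details a sketch against an early release.

```scala
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                                      // each line lands in the 'line field
    .flatMap('line -> 'word) { line: String =>
      line.split("\\s+")                                       // split each line into words
    }
    .groupBy('word) { _.size }                                 // count occurrences of each word
    .write(Tsv(args("output")))                                // tab-separated (word, count) pairs
}
```

The boilerplate reduction over raw Cascading is striking: the same job in plain Cascading requires explicitly wiring up pipes, taps, and operations.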

I guess it's all a matter of taste at this point since feature-wise most of the frameworks are very close to one another.

answered Oct 18 '22 by Age Mooij