
Hadoop Distribution Differences


Can somebody outline the differences between the Hadoop distributions available:

  • Cloudera - http://www.cloudera.com/hadoop
  • Yahoo - http://developer.yahoo.net/blogs/hadoop/

using the Apache Hadoop distro as a baseline.

Is there a good reason to use one of these distributions over the standard Apache Hadoop distro?

Jon asked Sep 11 '09 18:09



2 Answers

Disclaimer: I interned at Cloudera this summer (but some of my best friends are at Yahoo! :-))

The Yahoo distribution is a version of Hadoop 0.20 that they run (ran?) on some subset of their clusters. It includes a set of patches for stability, bug fixes, etc. It is a source release; it does not have admin-friendly features like RPM or Debian packages.

The Cloudera distribution is packaged as RPMs and debs (the source is also available). This means you can get updates via standard package-management methods. It also includes stability and bug fix patches. It is constantly maintained (not to say Yahoo's isn't; I suppose one could just go on GitHub and check when they last updated it). It also packages Pig and Hive.

Cloudera's distribution of Hadoop 0.20 is in beta, and 0.18 is considered stable (more on this on the Cloudera blog). The 0.18 version also includes packages for Hive and Pig; for 0.20, you have to build them yourself (there aren't official releases of Pig or Hive that support 0.20 yet, although patches exist). There may well be significant overlap between the Cloudera and Yahoo versions of 0.20; both provide manifests, so you can check. The latest documentation of Cloudera's distros is at http://archive.cloudera.com
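If you want to check that overlap yourself, one rough way is to pull the JIRA issue keys out of each manifest and compare the sets. The sketch below is only an illustration: it assumes each manifest is plain text that mentions patches by JIRA key (HADOOP-n, HDFS-n, MAPREDUCE-n), which may not match the real manifest formats, and the function names and sample strings are made up for the example.

```python
import re

# Assumption: patches are identified by JIRA keys from these three projects.
JIRA_KEY = re.compile(r"\b(?:HADOOP|HDFS|MAPREDUCE)-\d+\b")


def jira_keys(manifest_text):
    """Return the set of JIRA issue keys mentioned in a manifest's text."""
    return set(JIRA_KEY.findall(manifest_text))


def compare_manifests(a_text, b_text):
    """Split the patches into shared, only-in-a, and only-in-b sets."""
    a, b = jira_keys(a_text), jira_keys(b_text)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


if __name__ == "__main__":
    # Made-up manifest snippets, not real patch lists from either vendor.
    cloudera_sample = "HADOOP-1 example fix\nHDFS-2 example fix\n"
    yahoo_sample = "HDFS-2 example fix\nMAPREDUCE-3 example fix\n"

    result = compare_manifests(cloudera_sample, yahoo_sample)
    for label in ("shared", "only_a", "only_b"):
        print(label, sorted(result[label]))
```

To use it on real manifests you would read the two manifest files into strings first; if a manifest lists patches some other way (by file name, say), the regular expression would need adjusting.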

Yahoo does not provide support for their distribution; they provide their patched version as a service to the community, so the folks who are interested can build what Yahoo runs internally. Given the size of Yahoo clusters, that's a significant contribution, especially if you aren't a Hadoop developer who follows the JIRAs all the time. Cloudera supports their distribution commercially, as well as providing some community support via the Hadoop mailing lists and, for distro-specific issues, on their GetSatisfaction page.

Both are pretty different from the vanilla Apache distro, since they apply patches in between Apache releases (the Cloudera version of 0.20 has 60+ patches!).

SquareCog answered Oct 24 '22 05:10


Yahoo has discontinued its own distribution and is now focusing on Apache Hadoop.

http://developer.yahoo.com/blogs/hadoop/posts/2011/01/announcement-yahoo-focusing-on-apache-hadoop-discontinuing-the-yahoo-distribution-of-hadoop/

http://www.cloudera.com/blog/2011/02/some-news-related-to-the-apache-hadoop-project/

Recently, HortonWorks (www.hortonworks.com) was spun out of Yahoo. Unlike Yahoo, HortonWorks will also provide support.

http://www.hortonworks.com/about-us/our-manifesto/

Cloudera operates along the same lines as HortonWorks:

http://www.cloudera.com/products-services/

The main difference is that HortonWorks wants to make the Apache releases themselves stable and easy to install, while Cloudera has its own distribution, CDH*, based on Apache Hadoop.

Praveen Sripati answered Oct 24 '22 05:10