I am new to Hadoop. Could you please tell me which distributions are available for Hadoop?
I see standard Apache Hadoop and the Cloudera Distribution for Hadoop (CDH).
What is the difference between these two? Is CDH free or commercial?
What are Hadoop distributions?
Hadoop distributions provide scalable, distributed computing over data held in on-premises and cloud-based file stores. They consist of commercially packaged and supported editions of open-source Apache Hadoop-related projects.
Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
Vendors offering Big Data Hadoop solutions include Amazon Web Services (Elastic MapReduce Hadoop distribution), Microsoft, MapR, and IBM (InfoSphere BigInsights).
Hadoop performs distributed processing of huge data sets across a cluster of commodity servers, working on multiple machines simultaneously. To process data, the client submits the data and a program to Hadoop. HDFS stores the data, MapReduce processes it, and YARN allocates cluster resources and schedules the tasks.
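To make the MapReduce part more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps using the classic word count example. It is plain Python and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. On a real cluster the same steps would run in parallel on many machines, with HDFS holding the input and YARN scheduling the tasks.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence in a single process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input; a real job would read files from HDFS.
    sample = ["hadoop stores data", "hadoop processes data"]
    print(word_count(sample))
```

The point of the split is that each map call only needs one line and each reduce call only needs one key, which is what lets a framework like Hadoop spread the work over many nodes.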
Besides Apache Hadoop, it's more or less a three-horse race for Hadoop distributions between Hortonworks, Cloudera and MapR. Then there are Greenplum HD and IBM InfoSphere BigInsights.
Is CDH free or commercial?
CDH from Cloudera is free to use, but you need to pay for support and for the management tools built on top of CDH.
What is the difference between these two?
In Apache, all the projects (Pig, Hive, etc.) are independent. Cloudera makes sure these frameworks work properly with each other and packages them as CDH. CDH also has regular releases, which I haven't seen with Apache. Another difference is that it's difficult to get support for Apache Hadoop, while Cloudera and others provide commercial support for their own versions of Hadoop.