
Difference between Pig and Hive? Why have both? [closed]

My background: I am four weeks into the Hadoop world. I have dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM, and I have read Google's papers on MapReduce and GFS (PDF link).

I understand that:

  • Pig's language, Pig Latin, is a shift away from the SQL-like declarative style of programming (it suits the way programmers think), whereas Hive's query language closely resembles SQL.

  • Pig sits on top of Hadoop and in principle can also sit on top of Dryad. I might be wrong, but Hive seems to be closely coupled to Hadoop.

  • Both Pig Latin and Hive commands compile to Map and Reduce jobs.
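
To make the last point concrete, here is a minimal single-process Python sketch of the map, shuffle, and reduce phases that a grouped count (roughly a GROUP BY with COUNT in either language) could be planned into. The function names and sample data are made up for illustration; the real compilers emit Hadoop jobs, not Python, and their actual plans may differ.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (key, 1) pair per record; the key is the record's first field."""
    for record in records:
        yield record[0], 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_key(records):
    """Run the three phases in order on an in-memory list of records."""
    return reduce_phase(shuffle(map_phase(records)))


# Hypothetical sample data: (user, page) visit records.
visits = [("alice", "/home"), ("bob", "/cart"), ("alice", "/cart")]
print(count_by_key(visits))
```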

My question: what is the goal of having both when one (say Pig) could serve the purpose? Is it just because Pig is evangelized by Yahoo! and Hive by Facebook?

Arnkrishn asked Jul 28 '10 18:07



2 Answers

Check out this post from Alan Gates, Pig architect at Yahoo!, that compares when one would use a SQL-like language such as Hive rather than Pig. He makes a very convincing case for the usefulness of a procedural language like Pig (vs. declarative SQL) and its utility to dataflow designers.
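
As a rough analogy for the procedural vs. declarative distinction (this is neither Pig Latin nor HiveQL, and it is not taken from the linked post), the Python sketch below computes the same result twice: once as an explicit sequence of dataflow steps, and once as a single SQL statement run through the standard-library sqlite3 module. The table, column names, and sample rows are assumptions made up for the example.

```python
import sqlite3
from collections import Counter

# Hypothetical sample data: (name, age) rows.
rows = [("alice", 30), ("bob", 17), ("carol", 45), ("alice", 22)]


def dataflow_style(rows):
    """Procedural: the author spells out each step and its order."""
    adults = [row for row in rows if row[1] >= 18]   # step 1: filter
    counts = Counter(name for name, _ in adults)     # step 2: group and count
    return sorted(counts.items())                    # step 3: order by name


def declarative_style(rows):
    """Declarative: state the result wanted; the engine picks the steps."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT name, COUNT(*) FROM users "
        "WHERE age >= 18 GROUP BY name ORDER BY name"
    ).fetchall()
    conn.close()
    return result


print(dataflow_style(rows))
print(declarative_style(rows))
```

In the first function each intermediate result has a name and can be inspected or reused, which is the property dataflow designers tend to value; in the second, the ordering of the work is left to the query engine.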

Jakob Homan answered Sep 21 '22 12:09


Hive was designed to appeal to a community comfortable with SQL. Its philosophy was that we don't need yet another scripting language. Hive supports map and reduce transform scripts in the language of the user's choice (which can be embedded within SQL clauses). It is widely used at Facebook by analysts comfortable with SQL as well as by data miners programming in Python. SQL compatibility efforts in Pig have been abandoned, AFAIK, so the difference between the two projects is very clear.
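
As a sketch of what such a transform script might look like in Python: Hive streams the selected columns to the script as tab-separated lines on standard input and reads tab-separated lines back from standard output. The column layout (user_id, url), the domain-extraction logic, and the script name used below are all assumptions made up for illustration.

```python
import sys


def transform(lines):
    """Turn 'user_id<TAB>url' rows into 'user_id<TAB>domain' rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip rows that do not have exactly two columns
        user_id, url = fields
        # Drop the scheme (if any), then keep everything up to the first slash.
        domain = url.split("//")[-1].split("/")[0]
        yield f"{user_id}\t{domain}"


if __name__ == "__main__":
    # When invoked by Hive, rows arrive on stdin and results go to stdout.
    for out_line in transform(sys.stdin):
        print(out_line)
```

Assuming the file is saved as extract_domain.py and registered with ADD FILE, it could be called from a query along the lines of SELECT TRANSFORM(user_id, url) USING 'python extract_domain.py' AS (user_id, domain) FROM page_views, where page_views is a hypothetical table.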

Supporting SQL syntax also means that it's possible to integrate with existing BI tools like MicroStrategy. Hive has an ODBC/JDBC driver (a work in progress) that should allow this to happen in the near future. It's also beginning to add support for indexes, which should allow support for the drill-down queries common in such environments.

Finally (this is not directly pertinent to the question), Hive is a framework for performing analytic queries. While its dominant use is to query flat files, there's no reason why it cannot query other stores. Currently Hive can be used to query data stored in HBase (which is a key-value store like those found in the guts of most RDBMSes), and the HadoopDB project has used Hive to query a federated RDBMS tier.

Joydeep Sen Sarma answered Sep 21 '22 12:09