 

What is the difference between Apache Pig and Apache Hive?

What is the exact difference between Pig and Hive? I found that both have the same functional purpose, since they are used to do the same kind of work; the only difference seems to be the implementation. So when should each technology be used? Is there any specification that clearly shows the difference between the two in terms of applicability and performance?

Asked Apr 23 '12 by Ananda

People also ask

What is the difference between Apache Pig and SQL?

Pig Latin is a procedural language, whereas SQL is a declarative language. In Apache Pig a schema is optional: data can be stored and loaded without designing a schema, in which case fields are referenced by position ($0, $1, etc.), as in the short sketch after this list.

Why Pig is faster than Hive?

Pig tends to be faster than Hive for data-loading work, especially when you don't want to create a schema up front. It has many SQL-like functions, plus additional operators such as COGROUP, and it supports the Avro file format on Hadoop.

What is Apache Pig used for?

Apache Pig is a high-level tool, or platform, used for processing large data sets. Data analysis code is written in a high-level scripting language called Pig Latin: programmers write Pig Latin scripts, which Pig compiles into map and reduce tasks that process the data.
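
For illustration, here is a minimal Pig Latin sketch of the schema-optional point above; the file path, delimiter, and field positions are made up:

    -- Load without declaring a schema; fields are then addressed by position.
    raw      = LOAD '/data/clicks.log' USING PigStorage('\t');
    by_first = GROUP raw BY $0;                        -- group on the first field
    counts   = FOREACH by_first GENERATE group, COUNT(raw);
    DUMP counts;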


2 Answers

Apache Pig and Hive are two projects that layer on top of Hadoop, and provide a higher-level language for using Hadoop's MapReduce library. Apache Pig provides a scripting language for describing operations like reading, filtering, transforming, joining, and writing data -- exactly the operations that MapReduce was originally designed for. Rather than expressing these operations in thousands of lines of Java code that uses MapReduce directly, Pig lets users express them in a language not unlike a bash or perl script. Pig is excellent for prototyping and rapidly developing MapReduce-based jobs, as opposed to coding MapReduce jobs in Java itself.
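
As a rough illustration of that kind of pipeline, here is a minimal Pig Latin sketch; the file names, schemas, and filter condition are hypothetical:

    -- Read, filter, join and write data in a few procedural steps.
    users     = LOAD 'users.tsv'  USING PigStorage('\t') AS (id:int, country:chararray);
    events    = LOAD 'events.tsv' USING PigStorage('\t') AS (user_id:int, action:chararray);
    purchases = FILTER events BY action == 'purchase';
    joined    = JOIN purchases BY user_id, users BY id;
    STORE joined INTO 'purchases_by_user' USING PigStorage('\t');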

If Pig is "scripting for Hadoop", then Hive is "SQL queries for Hadoop". Apache Hive offers an even more specific and higher-level language, for querying data by running Hadoop jobs, rather than directly scripting step-by-step the operation of several MapReduce jobs on Hadoop. The language is, by design, extremely SQL-like. Hive is still intended as a tool for long-running batch-oriented queries over massive data; it's not "real-time" in any sense. Hive is an excellent tool for analysts and business development types who are accustomed to SQL-like queries and Business Intelligence systems; it will let them easily leverage your shiny new Hadoop cluster to perform ad-hoc queries or generate report data over the data stored in the cluster (e.g., in HDFS).

Answered by Sean Owen


From a purely engineering point of view, I find Pig both easier to write and maintain than SQL-like languages. It is procedural, so you apply a series of relations to your data one by one, and if something fails you can easily debug at intermediate steps; there is even a command called ILLUSTRATE, which uses a sampling algorithm to show some example data matching a given relation. I'd say that for jobs with complex logic this is definitely much more convenient than Hive, but for simple things the gain is probably minimal.
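
As a sketch of what that intermediate-step debugging looks like (the pipeline and field names are made up):

    logs   = LOAD 'access.log' USING PigStorage(' ') AS (ip:chararray, url:chararray, status:int);
    errors = FILTER logs BY status >= 500;
    DESCRIBE errors;      -- print the schema of the intermediate relation
    ILLUSTRATE errors;    -- sample a few inputs and show how they flow through each step
    DUMP errors;          -- execute the pipeline up to this relation and print the result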

Regarding interfacing, I find that Pig offers a lot of flexibility compared to Hive. There is no notion of a table in Pig, so you manipulate files directly, and loader UDFs make it very easy to read pretty much any format, without having to go through a table-loading stage before you can start your transformations. Recent versions of Pig also have a nice feature called dynamic invokers, which lets you call pretty much any Java method directly from your Pig script without having to write a UDF.
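
A small sketch of both ideas: the dynamic-invoker syntax and java.net.URLDecoder.decode are real, while the input file and the custom loader class com.example.MyLogLoader are hypothetical stand-ins for whatever loader UDF you use.

    -- Call a plain Java method without writing a UDF.
    DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');

    raw     = LOAD 'clicks.log' USING com.example.MyLogLoader() AS (url:chararray);
    decoded = FOREACH raw GENERATE UrlDecode(url, 'UTF-8') AS url;
    DUMP decoded;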

For performance and optimization, from what I've seen you can directly control in Pig which join and grouping algorithm to use (I believe there are three or four different algorithms for each). I've personally never needed it, but if you're writing demanding jobs it can be useful to decide this yourself instead of relying on the optimizer, as is the case in Hive. So I wouldn't say Pig necessarily performs better than Hive, but in cases where the optimizer makes the wrong decision, you have the option to choose which algorithm to use and more control over what happens.
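
For example, the USING clause on JOIN is how you pick the implementation yourself (the relations and fields here are hypothetical):

    small = LOAD 'countries.tsv' USING PigStorage('\t') AS (code:chararray, name:chararray);
    big   = LOAD 'events.tsv'    USING PigStorage('\t') AS (user:chararray, code:chararray);

    -- Fragment-replicate join: the small relation is shipped to every mapper.
    j1 = JOIN big BY code, small BY code USING 'replicated';

    -- Skewed join: handles keys with very unbalanced frequencies.
    j2 = JOIN big BY code, small BY code USING 'skewed';

    -- A 'merge' join is also available for pre-sorted inputs.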

One of the cool things I used recently is splits: you can split your execution flow and apply different relations to each branch. So you can take a non-linear dataset, split it based on a field, apply different processing to each part, and perhaps join the results back together at the end, all within the same script. I don't think you can do this in Hive; you would have to write a different query for each case, but I may be wrong.
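
A sketch of that pattern with SPLIT (the field names and conditions are made up):

    events = LOAD 'events.tsv' USING PigStorage('\t') AS (user:chararray, type:chararray, amount:double);

    SPLIT events INTO purchases IF type == 'purchase',
                      refunds   IF type == 'refund';

    -- Different processing per branch.
    p = FOREACH purchases GENERATE user, amount;
    r = FOREACH refunds   GENERATE user, amount * -1.0 AS amount;

    -- Bring the branches back together at the end.
    merged = UNION p, r;
    STORE merged INTO 'net_amounts' USING PigStorage('\t');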

One more thing to note is that you can increment Hadoop counters in Pig, although currently only from within Pig UDFs. I don't think you can use counters in Hive.

And there are some nice projects that let you interface Pig with Hive as well (like HCatalog), so you can read data from a Hive table, or write data to a Hive table (or both), simply by changing the loader in your script. It supports dynamic partitions as well.
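
A sketch of what that looks like with HCatalog's loader and storer; the table and column names are hypothetical, and the HCatalog jars need to be on the classpath (e.g. by starting with pig -useHCatalog):

    -- Read a Hive table, filter it, and write the result to another Hive table.
    page_views = LOAD 'web.page_views' USING org.apache.hive.hcatalog.pig.HCatLoader();
    us_views   = FILTER page_views BY country == 'US';

    -- With no partition spec, HCatStorer uses dynamic partitioning based on the
    -- partition-column values in the data (if the target table is partitioned).
    STORE us_views INTO 'web.us_page_views' USING org.apache.hive.hcatalog.pig.HCatStorer();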

Answered by Charles Menguy