I would like to know how to run Pig queries on data stored in Hive format. I have configured Hive to store compressed data (using this tutorial: http://wiki.apache.org/hadoop/Hive/CompressedStorage).
Before that I simply used Pig's normal load function with Hive's delimiter (^A). But now Hive stores its data in sequence files with compression. Which load function should I use?
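For reference, the old approach looked roughly like this (the path and schema are only placeholders; ^A is \u0001):
data = LOAD '/user/hive/warehouse/table' USING PigStorage('\u0001') AS (ts: int, user_id: int, url: chararray);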
Note that I don't need close integration like the one mentioned here: Using Hive with Pig. I just need to know which load function to use to read the compressed sequence files generated by Hive.
Thanks for all the answers.
We took some sample data, loaded it into Pig, and then moved it into the Hive table: enter Pig with the HCatalog option, load the data from its HDFS path into a Pig relation 'A', and append the data stored in 'A' to the non-partitioned Hive table emp_tab, as sketched below.
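A minimal sketch of that flow; the HDFS path, the comma delimiter, and the column schema are assumptions, and only the table name emp_tab comes from the description above:
pig -useHCatalog
-- load the sample data from HDFS into relation A
A = LOAD '/user/sample/emp_data' USING PigStorage(',') AS (id: int, name: chararray, salary: int);
-- append relation A to the non-partitioned Hive table emp_tab
STORE A INTO 'emp_tab' USING org.apache.hive.hcatalog.pig.HCatStorer();
On older Hive releases the storer class lives under org.apache.hcatalog.pig instead of org.apache.hive.hcatalog.pig.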
The Hadoop component related to Apache Pig is called the "Hadoop Pig Task". This component is almost the same as the Hadoop Hive Task, since it has the same properties and uses a WebHCat connection.
The user interacts with Hive through the user interface by submitting Hive queries. The driver passes the query to the compiler, the compiler generates an execution plan, and the execution engine runs that plan.
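You can inspect the plan the compiler generates for any query with EXPLAIN; for example (the query and table name are only an illustration):
EXPLAIN SELECT user_id, COUNT(*) FROM emp_tab GROUP BY user_id;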
Here's what I found out: using HiveColumnarLoader makes sense if you store data as an RCFile. To load a table with it you need to register some jars first:
register /srv/pigs/piggybank.jar
register /usr/lib/hive/lib/hive-exec-0.5.0.jar
register /usr/lib/hive/lib/hive-common-0.5.0.jar
a = LOAD '/user/hive/warehouse/table' USING org.apache.pig.piggybank.storage.HiveColumnarLoader('ts int, user_id int, url string');
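Once loaded, the relation behaves like any other; for example, assuming the schema above, you can project and inspect individual columns as usual:
b = FOREACH a GENERATE user_id, url;
DUMP b;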
To load data from a sequence file you also have to use PiggyBank (as in the previous example). The SequenceFile loader from PiggyBank should handle compressed files:
register /srv/pigs/piggybank.jar
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
a = LOAD '/user/hive/warehouse/table' USING SequenceFileLoader AS (key: int, val: int);
This doesn't work with Pig 0.7, because it's unable to read the BytesWritable type and cast it to a Pig type, and you get this exception:
2011-07-01 10:30:08,589 WARN org.apache.pig.piggybank.storage.SequenceFileLoader: Unable to translate key class org.apache.hadoop.io.BytesWritable to a Pig datatype
2011-07-01 10:30:08,625 WARN org.apache.hadoop.mapred.Child: Error running child
org.apache.pig.backend.BackendException: ERROR 0: Unable to translate class org.apache.hadoop.io.BytesWritable to a Pig datatype
at org.apache.pig.piggybank.storage.SequenceFileLoader.setKeyType(SequenceFileLoader.java:78)
at org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:132)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
at org.apache.hadoop.mapred.Child.main(Child.java:211)
How to compile PiggyBank is described here: "Unable to build piggybank -> /home/build/ivy/lib does not exist".