
Using a Hive table over Parquet in Pig

I am trying to create a Hive table with the schema (string, string, double) over a folder containing two Parquet files. The first file's schema is (string, string, double), but the second file's schema is (string, double, string).

CREATE EXTERNAL TABLE dynschema (
 trans_date string,
 currency string,
 rate double) 
STORED AS PARQUET
LOCATION '/user/impadmin/test/parquet/evolution/';

I am trying to read this Hive table from a Pig (0.14) script:

 A = LOAD 'dynschema' USING org.apache.hive.hcatalog.pig.HCatLoader();

DUMP A;

But I get this error:

java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.hive.serde2.io.DoubleWritable

I suspect this is because the second file's schema differs from the table schema: the first file's split is read successfully, and the exception occurs only while reading the second file's split.

I also looked into HCatRecordReader's code and found this snippet:

DefaultHCatRecord dr = new DefaultHCatRecord(outputSchema.size());
int i = 0;
for (String fieldName : outputSchema.getFieldNames()) {
  if (dataSchema.getPosition(fieldName) != null) {
    dr.set(i, r.get(fieldName, dataSchema));
  } else {
    dr.set(i, valuesNotInDataCols.get(fieldName));
  }
  i++;
}

Here I can see logic for converting from the data schema to the output schema, but while debugging I found no difference between the two schemas.

Please help me find out:

  1. Does Pig support reading from a Hive table created over multiple Parquet files with different schemas?

  2. If yes, how can this be done?

asked Jan 20 '16 by SaurabhG



1 Answer

If you have files with two different schemas, the following approach seems sensible:

  1. Split up the files based on which schema they have.
  2. Create a table for each group of files.
  3. If desirable, load the individual tables and store them into a single supertable (see the sketch below).
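For illustration, here is a minimal HiveQL sketch of that approach. It assumes the Parquet files have already been moved into separate folders by schema; the table names (dynschema_v1, dynschema_v2, rates_all) and the /v1/ and /v2/ sub-locations are hypothetical placeholders, not part of the original setup.

-- Hypothetical: one external table per physical column order, each pointing
-- at a folder that contains only files with that layout.
CREATE EXTERNAL TABLE dynschema_v1 (
  trans_date string,
  currency string,
  rate double)
STORED AS PARQUET
LOCATION '/user/impadmin/test/parquet/evolution/v1/';

CREATE EXTERNAL TABLE dynschema_v2 (
  trans_date string,
  rate double,
  currency string)
STORED AS PARQUET
LOCATION '/user/impadmin/test/parquet/evolution/v2/';

-- Optional supertable with the desired unified column order.
CREATE TABLE rates_all (
  trans_date string,
  currency string,
  rate double)
STORED AS PARQUET;

-- List the columns explicitly so both sources line up regardless of their
-- physical order in the files.
INSERT INTO TABLE rates_all SELECT trans_date, currency, rate FROM dynschema_v1;
INSERT INTO TABLE rates_all SELECT trans_date, currency, rate FROM dynschema_v2;

The Pig script can then load rates_all (or either per-schema table) through HCatLoader as before, since each table now matches the physical layout of the files underneath it.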
answered Oct 18 '22 by Dennis Jaheruddin