Using a Hive table over Parquet in Pig

I am trying to create a Hive table with the schema string, string, double over a folder containing two Parquet files. The first file's schema is string, string, double; the second file's schema is string, double, string.

CREATE EXTERNAL TABLE dynschema (
 trans_date string,
 currency string,
 rate double) 
STORED AS PARQUET
LOCATION '/user/impadmin/test/parquet/evolution/';

I am trying to use this Hive table in a Pig (0.14) script:

 A = LOAD 'dynschema' USING org.apache.hive.hcatalog.pig.HCatLoader();

DUMP A;

But I get this error:

java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.hive.serde2.io.DoubleWritable

I suspect this is because the second file's schema differs from the table schema: the first file's split is read successfully, and the exception occurs only while reading the second file's split.

I also looked into the HCatRecordReader code and found this snippet:

DefaultHCatRecord dr = new DefaultHCatRecord(outputSchema.size());
  int i = 0;
  for (String fieldName : outputSchema.getFieldNames()) {
    if (dataSchema.getPosition(fieldName) != null) {
      dr.set(i, r.get(fieldName, dataSchema));
    } else {
      dr.set(i, valuesNotInDataCols.get(fieldName));
    }
    i++;
  }

Here I can see logic that converts from the data schema to the output schema, but while debugging I found that the two schemas are identical.

Please help me find out:

Does Pig support reading data from a Hive table created over multiple Parquet files with different schemas?
If yes, how can this be done?

asked Jan 20 '16 01:01

SaurabhG

1 Answer

If you have files with two different schemas, the following approach seems sensible:

Split up the files according to which schema they have.
Create a separate table over each group of files.
If desired, load the individual tables and store them in a combined supertable.
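
The steps above could look roughly like this in HiveQL. This is only a sketch under assumptions: the sub-folders evolution_v1 and evolution_v2, the table names dynschema_v1, dynschema_v2 and dynschema_all, and the column order of the second file (the question gives only its types, string, double, string) are all hypothetical and need to be adapted to the real data.

```sql
-- Assumption: the two Parquet files were first moved into separate
-- (hypothetical) sub-folders, one folder per file schema.

-- Table over the file(s) with schema string, string, double
CREATE EXTERNAL TABLE dynschema_v1 (
  trans_date string,
  currency string,
  rate double)
STORED AS PARQUET
LOCATION '/user/impadmin/test/parquet/evolution_v1/';

-- Table over the file(s) with schema string, double, string
-- (column names and their order are assumed, not taken from the question)
CREATE EXTERNAL TABLE dynschema_v2 (
  trans_date string,
  rate double,
  currency string)
STORED AS PARQUET
LOCATION '/user/impadmin/test/parquet/evolution_v2/';

-- Combined "supertable" with one consistent column order
CREATE TABLE dynschema_all STORED AS PARQUET AS
SELECT u.trans_date, u.currency, u.rate
FROM (
  SELECT trans_date, currency, rate FROM dynschema_v1
  UNION ALL
  SELECT trans_date, currency, rate FROM dynschema_v2
) u;
```

The Pig script from the question could then load dynschema_all through HCatLoader instead of dynschema, since every file behind that table would share a single schema.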

answered Oct 18 '22 17:10

Dennis Jaheruddin


Tags: hadoop, hive, apache-pig, parquet, hcatalog
