Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

HIVE_CANNOT_OPEN_SPLIT: Schema mismatch when querying parquet files from Athena

I'm getting a schema mismatch error when querying parquet data from Athena.

The error is:

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://datalake/googleanalytics/version=0/eventDate=2017-06-11/part-00001-9c9312f7-f9a5-47c3-987e-9348b78aaebc-c000.snappy.parquet (offset=0, length=48653579): Schema mismatch, metastore schema for row column totals has 13 fields but parquet schema has 12 fields

In the AWS Glue Crawler I tried enabling Update all new and existing partitions with metadata from the table which I thought would resolve this issue, however I'm still getting the above error. I did this because of the similar question: How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')

The table schema for the totals column is:

struct<visits:bigint,hits:bigint,pageviews:bigint,timeOnSite:bigint,bounces:bigint,transactions:bigint,transactionRevenue:bigint,newVisits:bigint,screenviews:bigint,uniqueScreenviews:bigint,timeOnScreen:bigint,totalTransactionRevenue:bigint,sessionQualityDim:bigint>

The parquet file for partition eventDate=2017-06-11 is missing the last field "sessionQualityDim".

like image 445
AndrewSC Avatar asked Nov 06 '22 17:11

AndrewSC


1 Answers

You have parquet files with two different schema and the Athena table schema matches with the newer one. You can do one of the following :

1) Create two different tables in athena, one pointing to data till 2017 and other pointing to data post 2017.

2) In case the older data is no more valid for current use case, then you can simply archive that data and remove the 2017 and older partitions from your current table.

like image 126
Harsh Bafna Avatar answered Nov 27 '22 03:11

Harsh Bafna