I'm getting a schema mismatch error when querying parquet data from Athena.
The error is:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://datalake/googleanalytics/version=0/eventDate=2017-06-11/part-00001-9c9312f7-f9a5-47c3-987e-9348b78aaebc-c000.snappy.parquet (offset=0, length=48653579): Schema mismatch, metastore schema for row column totals has 13 fields but parquet schema has 12 fields
In the AWS Glue Crawler I tried enabling Update all new and existing partitions with metadata from the table which I thought would resolve this issue, however I'm still getting the above error. I did this because of the similar question: How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')
The table schema for the totals column is:
struct<visits:bigint,hits:bigint,pageviews:bigint,timeOnSite:bigint,bounces:bigint,transactions:bigint,transactionRevenue:bigint,newVisits:bigint,screenviews:bigint,uniqueScreenviews:bigint,timeOnScreen:bigint,totalTransactionRevenue:bigint,sessionQualityDim:bigint>
The parquet file for partition eventDate=2017-06-11 is missing the last field "sessionQualityDim".
You have parquet files with two different schema and the Athena table schema matches with the newer one. You can do one of the following :
1) Create two different tables in athena, one pointing to data till 2017 and other pointing to data post 2017.
2) In case the older data is no more valid for current use case, then you can simply archive that data and remove the 2017 and older partitions from your current table.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With