Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to solve this HIVE_PARTITION_SCHEMA_MISMATCH?

I have partitioned data in CSV files on S3:

  • s3://bucket/dataset/p=1/*.csv (partition #1)
  • ...
  • s3://bucket/dataset/p=100/*.csv (partition #100)

I run a classifier over s3://bucket/dataset/ and the result looks very much promising as it detects 150 columns (c1,...,c150) and assigns various data types.

Loading the resulting table in Athena and querying (select * from dataset limit 10) it though will yield the error message:

HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'c100' in table 'tests.dataset' is declared as type 'string', but partition 'AANtbd7L1ajIwMTkwOQ' declared column 'c100' as type 'boolean'.

First of all I have no idea how to make use of 'AANtbd7L1ajIwMTkwOQ' ... but I can tell from the list of partitions in Glue that some partitions have c100 classified as string and some as boolean. While the table schema lists it as string.

That also means if I restrict a query to a partition which classifies c100 as string agreeing with the table schema then the query will work. If I use a partition classifying c100 as boolean the query fails with above error message.

Now from having a look at some of the CSVs column c100 seems to contain three different values:

  • true
  • false
  • [empty] (like ...,,...)

Possibly some row contains a typo (maybe) and hence some partitions classify as string - but that is just a theory and a difficult to verify due to the number and size of the files.

I also tried MSCK REPAIR TABLE dataset to no avail.

Is there a quick solution to this? Maybe forcing all partition to use string? If I look at the list of partitions there is a deactivated "edit schema" button.

Or do I have to write a Glue job checking and discarding or repairing every row?

like image 272
Raffael Avatar asked Sep 11 '19 13:09

Raffael


2 Answers

If you are using crawler, you should select following option:

Update all new and existing partitions with metadata from the table

You may do it while creating table too. Check https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-schema-changes-prevent for more details.

This should solve issue. If it doesn't then check other options at https://github.com/awsdocs/amazon-athena-user-guide/blob/master/doc_source/glue-best-practices.md#schema-syncing

For understanding issue in athena, check https://docs.aws.amazon.com/athena/latest/ug/updates-and-partitions.html

like image 58
Sandeep Fatangare Avatar answered Oct 09 '22 21:10

Sandeep Fatangare


Update all new and existing partitions with metadata from the table don't always work for me, it seems the reason is usualy when I have different number of fields in different partitions. The error I get is something like:

There is a mismatch between the table and partition schemas,
The column 'a' in table 'tests.dataset' is declared as type 'string', but partition 'b' declared column 'c' as type 'boolean'

Where field names are different because some field is just missing in partition and Athena somehow ignores filed naming when compare them.

What is helping is to recreate the table using the crawler generated table and then update partitions with `MSCK REPAIR TABLE my_new_table_name; How to create new table

After that drop the table that crawler has generated and use the new one.

The above workaround is described here https://aws.amazon.com/premiumsupport/knowledge-center/athena-hive-invalid-metadata-duplicate/

like image 38
BLiN Avatar answered Oct 09 '22 20:10

BLiN