I have partitioned data in CSV files on S3:
I run a classifier over s3://bucket/dataset/ and the result looks very much promising as it detects 150 columns (c1,...,c150) and assigns various data types.
Loading the resulting table in Athena and querying (select * from dataset limit 10
) it though will yield the error message:
HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'c100' in table 'tests.dataset' is declared as type 'string', but partition 'AANtbd7L1ajIwMTkwOQ' declared column 'c100' as type 'boolean'.
First of all I have no idea how to make use of 'AANtbd7L1ajIwMTkwOQ' ... but I can tell from the list of partitions in Glue that some partitions have c100 classified as string and some as boolean. While the table schema lists it as string.
That also means if I restrict a query to a partition which classifies c100 as string agreeing with the table schema then the query will work. If I use a partition classifying c100 as boolean the query fails with above error message.
Now from having a look at some of the CSVs column c100 seems to contain three different values:
Possibly some row contains a typo (maybe) and hence some partitions classify as string - but that is just a theory and a difficult to verify due to the number and size of the files.
I also tried MSCK REPAIR TABLE dataset
to no avail.
Is there a quick solution to this? Maybe forcing all partition to use string? If I look at the list of partitions there is a deactivated "edit schema" button.
Or do I have to write a Glue job checking and discarding or repairing every row?
If you are using crawler, you should select following option:
Update all new and existing partitions with metadata from the table
You may do it while creating table too. Check https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-schema-changes-prevent for more details.
This should solve issue. If it doesn't then check other options at https://github.com/awsdocs/amazon-athena-user-guide/blob/master/doc_source/glue-best-practices.md#schema-syncing
For understanding issue in athena, check https://docs.aws.amazon.com/athena/latest/ug/updates-and-partitions.html
Update all new and existing partitions with metadata from the table
don't always work for me, it seems the reason is usualy when I have different number of fields in different partitions. The error I get is something like:
There is a mismatch between the table and partition schemas,
The column 'a' in table 'tests.dataset' is declared as type 'string', but partition 'b' declared column 'c' as type 'boolean'
Where field names are different because some field is just missing in partition and Athena somehow ignores filed naming when compare them.
What is helping is to recreate the table using the crawler generated table and then update partitions with `MSCK REPAIR TABLE my_new_table_name;
After that drop the table that crawler has generated and use the new one.
The above workaround is described here https://aws.amazon.com/premiumsupport/knowledge-center/athena-hive-invalid-metadata-duplicate/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With