
AWS Glue crawler - partition keys types

I am using Spark to write files to S3 in ORC format, and Athena to query this data.

I am using the following partition keys:

s3://bucket/company=1123/date=20190207

When I run the Glue crawler on the bucket, everything works as expected except for the types of the partition keys.

The crawler registers them in the catalog as type string instead of int.

Is there a configuration option to set the default type of the partition keys?

I know the types can be changed manually afterwards, with the crawler config set to "Add new columns only".

Alex Stanovsky asked Feb 07 '19 13:02



1 Answer

Glue crawlers always treat partition keys as type string, and unfortunately there is no configuration option to change this behavior.
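The manual workaround mentioned in the question can be scripted. What follows is only a rough sketch, assuming a boto3 Glue client is passed in as `glue`. The helper names (`retype_partition_keys`, `merge_new_columns_config`, `fix_partition_types`) are made up for illustration and are not part of any AWS library. It retypes the partition keys in the catalog, then sets the crawler configuration that corresponds to "Add new columns only" so that, as the question describes, later crawler runs should leave the edited types alone.

```python
import copy
import json

# Keys that Glue's update_table accepts in TableInput. get_table also returns
# read-only fields (DatabaseName, CreateTime, ...) that must be dropped first.
TABLE_INPUT_FIELDS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def retype_partition_keys(table, new_types):
    """Build a TableInput from a get_table result, retyping partition keys.

    `new_types` maps partition key name -> desired type, e.g. {"date": "int"}.
    The input dict is not modified.
    """
    table_input = {
        k: copy.deepcopy(v) for k, v in table.items() if k in TABLE_INPUT_FIELDS
    }
    for key in table_input.get("PartitionKeys", []):
        if key["Name"] in new_types:
            key["Type"] = new_types[key["Name"]]
    return table_input


def merge_new_columns_config():
    """Crawler configuration JSON for the "Add new columns only" behavior."""
    return json.dumps(
        {
            "Version": 1.0,
            "CrawlerOutput": {"Tables": {"AddOrUpdateBehavior": "MergeNewColumns"}},
        }
    )


def fix_partition_types(glue, database, table_name, crawler_name, new_types):
    """Apply both steps using a Glue client (assumed: boto3.client("glue"))."""
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(
        DatabaseName=database,
        TableInput=retype_partition_keys(table, new_types),
    )
    glue.update_crawler(Name=crawler_name, Configuration=merge_new_columns_config())
```

With placeholder names for the database, table, and crawler, a call might look like `fix_partition_types(boto3.client("glue"), "my_db", "my_table", "my_crawler", {"company": "int", "date": "int"})`.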

Yuriy Bondaruk answered Oct 10 '22 01:10