Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to change column names of autodetected partitions created by Glue Crawler?

I have bucket which is used as destination for a Kinesis Firehose stream.

Firehose automatically creates date-based prefixes on that bucket using the yyyy/mm/dd/HH format.

Then I created a crawler that will search for data into this bucket and configured it as follow:

Crawler configuration

After running the crawler, it creates a table with the following schema:

| #   | Column name   | Data type | Key           |
| --- | -----------   | --------- | ------------- |
| 1   | numberissues  | int       |               |
| 2   | group         | string    |               |
| 3   | createdat     | string    |               |
| 4   | companyunitid | string    |               |
| 5   | partition_0   | string    | Partition (0) |
| 6   | partition_1   | string    | Partition (1) |
| 7   | partition_2   | string    | Partition (2) |
| 8   | partition_3   | string    | Partition (3) |

If I rename the partition-* to their right counterparts year, month, day and hour, the table is ready for me to use.

However, if the crawler runs again, the schema revets the column names to the original partition-*.

I know this would work for Hive partition schemas year=2018/month=04..., but I want to know if it's possible to "hint" Glue about the partition field names.

Another alternative would be trying to change the Firehose prefixing, but I couldn't find anything that suggests this is even possible.

like image 962
Henrique Barcelos Avatar asked Apr 06 '18 19:04

Henrique Barcelos


People also ask

How does a glue crawler determine partitions?

When an AWS Glue crawler scans Amazon S3 and detects multiple folders in a bucket, it determines the root of a table in the folder structure and which folders are partitions of a table. The name of the table is based on the Amazon S3 prefix or folder name.

How does AWS Glue crawler work?

A crawler accesses your data store, extracts metadata, and creates table definitions in the AWS Glue Data Catalog. The Crawlers pane in the AWS Glue console lists all the crawlers that you create. The list displays status and metrics from the last run of your crawler.

What is a partition in AWS Glue?

AWS Glue partition indexes are an important configuration to reduce overall data transfers and processing, and reduce query processing time. In the AWS Glue Data Catalog, the GetPartitions API is used to fetch the partitions in the table. The API returns partitions that match the expression provided in the request.


1 Answers

In this case you can set the "Ignore the change and don't update the data catalog" option.

Then you can rename the columns. This will allow the crawler to detect new partitions on the next run but keep therenamed names.

like image 188
Ricardo Mayerhofer Avatar answered Sep 19 '22 18:09

Ricardo Mayerhofer