I am continuously adding Parquet data sets to an S3 folder with a structure like this:
s3://my-bucket/public/data/set1
s3://my-bucket/public/data/set2
s3://my-bucket/public/data/set3
At the beginning I only have set1, and my crawler is configured to run on the whole bucket s3://my-bucket. This leads to the creation of a partitioned table named my-bucket with partitions named public, data and set1. What I actually want is a table named set1 without any partitions.
I see why this happens, as explained under How Does a Crawler Determine When to Create Partitions?. But when a new data set is uploaded (e.g. set2), I don't want it to become another partition (because it is completely different data with a different schema).
How can I force the Glue crawler to NOT create partitions?
I know I could define the crawler path as s3://my-bucket/public/data/, but unfortunately I don't know where new data sets will be created (e.g. it could also be s3://my-bucket/other/folder/set2).
Any ideas how to solve this?
You can use the TableLevelConfiguration setting to specify at which folder level the crawler should look for tables. More information on that here.
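As a sketch of what that looks like in practice: the table level goes into the crawler's Configuration JSON under Grouping. With the bucket counted as level 1, a table folder at s3://my-bucket/public/data/set1 sits at level 4. The crawler name and boto3 call below are illustrative placeholders, not taken from your setup:

```python
import json

# Crawler Configuration JSON: Grouping.TableLevelConfiguration tells the
# crawler at which absolute folder depth the table folders live.
# Bucket = level 1, public = 2, data = 3, set1/set2/... = 4, so each
# setN folder is treated as its own table instead of a partition.
crawler_configuration = json.dumps({
    "Version": 1.0,
    "Grouping": {"TableLevelConfiguration": 4},
})

# Applying it to an existing crawler (hypothetical crawler name):
# import boto3
# glue = boto3.client("glue")
# glue.update_crawler(Name="my-crawler", Configuration=crawler_configuration)
print(crawler_configuration)
```

Note that this only works if all data sets appear at the same depth; since you mention they may also land under s3://my-bucket/other/folder/set2, that path happens to be at level 4 as well, but a data set at a different depth would still not be picked up as a separate table.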