
Force Glue Crawler to create separate tables

I am continuously adding Parquet data sets to an S3 folder with a structure like this:

s3:::my-bucket/public/data/set1
s3:::my-bucket/public/data/set2
s3:::my-bucket/public/data/set3

At the beginning I only have set1, and my crawler is configured to run on the whole bucket s3:::my-bucket. This leads to the creation of a partitioned table named my-bucket with partitions named public, data and set1. What I actually want is a table named set1 without any partitions. I understand why this happens, as it is explained under How Does a Crawler Determine When to Create Partitions?.

But when a new data set is uploaded (e.g. set2), I don't want it to become another partition, because it is completely different data with a different schema. How can I force the Glue crawler to NOT create partitions? I know I could define the crawler path as s3:::my-bucket/public/data/, but unfortunately I don't know where the new data sets will be created (e.g. it could also be s3:::my-bucket/other/folder/set2).

Any ideas how to solve this?

WolfgangM asked Sep 17 '25 03:09


1 Answer

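You can use the TableLevelConfiguration grouping option in the crawler configuration to specify at which folder level the crawler should create tables. Everything below that level is then treated as part of the table rather than as a separate table.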

More information on that here.
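As a rough sketch of what this could look like with boto3: the crawler's Configuration is a JSON string, and the table level goes under the Grouping key. The helper names below (table_level_for, crawler_configuration, apply_to_crawler) are made up for illustration, and the crawler name is whatever yours is called. The level counting assumes the bucket itself is level 1, so my-bucket/public/data/set1 would be level 4; please verify that against the Glue documentation for your setup.

```python
import json


def table_level_for(path: str) -> int:
    """Count folder levels in 'bucket/prefix/...', assuming the bucket is level 1."""
    return len([part for part in path.strip("/").split("/") if part])


def crawler_configuration(table_level: int) -> str:
    """Build the crawler Configuration JSON string with a fixed table level."""
    return json.dumps({
        "Version": 1.0,
        "Grouping": {"TableLevelConfiguration": table_level},
    })


def apply_to_crawler(crawler_name: str, table_level: int) -> None:
    """Update an existing crawler; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    glue = boto3.client("glue")
    glue.update_crawler(
        Name=crawler_name,
        Configuration=crawler_configuration(table_level),
    )


# Hypothetical example path taken from the question (bucket + 3 folders).
level = table_level_for("my-bucket/public/data/set1")
print(level, crawler_configuration(level))
```

Note that a single table level only fits if all data sets sit at the same depth, as in the two example paths from the question (public/data/set1 and other/folder/set2).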

Robert Kossendey answered Sep 19 '25 16:09