I am continuously adding Parquet data sets to an S3 folder with a structure like this:
s3://my-bucket/public/data/set1
s3://my-bucket/public/data/set2
s3://my-bucket/public/data/set3
At the beginning I only have set1, and my crawler is configured to run on the whole bucket s3://my-bucket. This leads to the creation of a partitioned table named my-bucket with partitions named public, data and set1. What I actually want is a table named set1 without any partitions.
I see why this happens, as explained under How Does a Crawler Determine When to Create Partitions?. But when a new data set is uploaded (e.g. set2), I don't want it to become another partition (because it is completely different data with a different schema).
How can I force the Glue crawler to NOT create partitions?
I know I could define the crawler path as s3://my-bucket/public/data/, but unfortunately I don't know where new data sets will be created (e.g. it could also be s3://my-bucket/other/folder/set2).
Any ideas how to solve this?
You can use the TableLevelConfiguration setting to specify at which folder level the crawler should look for tables. More information on that here.
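As a sketch of what that looks like in practice: the table level goes into the crawler's Configuration JSON under Grouping. With the bucket counted as level 1, a table folder at s3://my-bucket/public/data/set1 sits at level 4. The crawler name and boto3 call below are illustrative placeholders, not taken from your setup:

```python
import json

# Crawler Configuration JSON: Grouping.TableLevelConfiguration tells the
# crawler at which absolute folder depth the table folders live.
# Bucket = level 1, public = 2, data = 3, set1/set2/... = 4, so each
# setN folder is treated as its own table instead of a partition.
crawler_configuration = json.dumps({
    "Version": 1.0,
    "Grouping": {"TableLevelConfiguration": 4},
})

# Applying it to an existing crawler (hypothetical crawler name):
# import boto3
# glue = boto3.client("glue")
# glue.update_crawler(Name="my-crawler", Configuration=crawler_configuration)
print(crawler_configuration)
```

Note that this only works if all data sets appear at the same depth; since you mention they may also land under s3://my-bucket/other/folder/set2, that path happens to be at level 4 as well, but a data set at a different depth would still not be picked up as a separate table.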