I have several thousand files in an S3 bucket in this form:
├── bucket
│ ├── somedata
│ │ ├── year=2016
│ │ ├── year=2017
│ │ │ ├── month=11
│ │ | │ ├── sometype-2017-11-01.parquet
│ | | | ├── sometype-2017-11-02.parquet
│ | | | ├── ...
│ │ │ ├── month=12
│ │ | │ ├── sometype-2017-12-01.parquet
│ | | | ├── sometype-2017-12-02.parquet
│ | | | ├── ...
│ │ ├── year=2018
│ │ │ ├── month=01
│ │ | │ ├── sometype-2018-01-01.parquet
│ | | | ├── sometype-2018-01-02.parquet
│ | | | ├── ...
│ ├── moredata
│ │ ├── year=2017
│ │ │ ├── month=11
│ │ | │ ├── moretype-2017-11-01.parquet
│ | | | ├── moretype-2017-11-02.parquet
│ | | | ├── ...
│ │ ├── year=...
etc
Expected behavior: The AWS Glue Crawler creates one table for each of somedata, moredata, etc. It creates partitions for each table based on the childrens' path names.
Actual Behavior: The AWS Glue Crawler performs the behavior above, but ALSO creates a separate table for every partition of the data, resulting in several hundred extraneous tables (and more extraneous tables which every data add + new crawl).
I see no place to be able to set something or otherwise prevent this from happening... Does anyone have advice on the best way to prevent these unnecessary tables from being created?
Use the Create a Single Schema for Each Amazon S3 Include Path option to avoid the AWS Glue Crawler adding all these extra tables.
I had this problem and ended up with ~7k tables 😅 so wrote the following script to remove them. It requires jq.
#!/bin/sh
aws glue get-tables --region <YOUR AWS REGION> --database-name <YOUR AWS GLUE DATABASE> | jq '.TableList[] | .Name' | grep <A PATTERN THAT MATCHES YOUR TABLENAMEs> > /tmp/table-names.json
cd /tmp
mkdir table-names
cd table-names
split -l 50 ../table-names.json
for f in `ls`; cat $f | tr '\r\n' ' ' | xargs aws glue batch-delete-table --region <YOUR AWS REGION> --database-name <YOUR AWS GLUE DATABASE> --tables-to-delete;
Adding to the excludes
**_SUCCESS
**crc
worked for me (see aws page glue/add-crawler). Double stars match the files at all folder (ie partition) depths. I had an _SUCCESS living a few levels up.
Make sure you set up logging for glue, which quickly points out permission errors etc.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With