
Glue crawler exclude patterns



I have an S3 bucket that I'm trying to crawl and catalog. The format is something like this, where the SQL files are DDL queries (CREATE TABLE statements) that match the schema of the different data files (data1, data2, etc.).


I just want to catalog data1, so I am trying to use the exclude patterns in the Glue Crawler - see below - i.e. *.sql and data2/*.

Unfortunately the crawler is still classifying everything within the root path of s3://my-bucket/somedata/. I can live with having data2 cataloged; I'm most concerned/annoyed by the sql files.

Anyone have experience with exclude patterns or able to point out what is wrong here?

Kirk Broadhurst asked Feb 15 '18 16:02

2 Answers

The * in the exclude pattern does not cross directories, but the ** does span across directories.

To exclude all .sql files you can use: **.sql

The full path of your data2/* exclusion is s3://my-bucket/somedata/data2/*, but it's missing your date partition folders. This is remedied by adding */ in front.

To exclude the data2/ directories use: */data2/*
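To see why the original patterns miss, here is a rough Python sketch of the glob rule described above (`*` stops at a `/`, `**` does not). It is only an approximation for experimenting locally, not Glue's actual matcher, and the object keys are hypothetical since they assume a date partition folder sits directly under s3://my-bucket/somedata/. Keys are written relative to that include path.

```python
import re


def glob_to_regex(pattern):
    """Translate a simplified glob into a compiled regex.

    Only *, ** and ? are handled; everything else is matched literally.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # ** may cross directory boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # * stays inside one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")     # ? is one character, never a slash
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(relative_key, patterns):
    """Return True if any pattern matches the whole relative key."""
    return any(glob_to_regex(p).fullmatch(relative_key) for p in patterns)


# Hypothetical keys, relative to the include path.
sql_key = "2018-02-15/data1/create_table.sql"
data2_key = "2018-02-15/data2/part-0000.csv"

print(is_excluded(sql_key, ["*.sql"]))        # original pattern
print(is_excluded(sql_key, ["**.sql"]))       # suggested pattern
print(is_excluded(data2_key, ["data2/*"]))    # original pattern
print(is_excluded(data2_key, ["*/data2/*"]))  # suggested pattern
```

With these keys, the original patterns fail because a single `*` cannot absorb the slashes in the partition prefix, while `**.sql` and `*/data2/*` match.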

Jonathan Eckel answered Sep 16 '22 11:09

Also, to exclude folders by name pattern:
Exclude Pattern: folder_n**/** (excludes all folders starting with "folder_n")
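If you define the crawler from code rather than the console, exclude patterns go in the `Exclusions` list of each S3 target. Below is a minimal sketch using boto3's `create_crawler`; the crawler name, IAM role, and database name are placeholders I made up, and the patterns are assumed to be evaluated relative to the include path.

```python
def build_s3_target(include_path, exclusions):
    """Build one S3 target entry for a Glue crawler definition."""
    return {"Path": include_path, "Exclusions": list(exclusions)}


def create_crawler_with_exclusions(name, role, database, target):
    """Create the crawler. Needs boto3 and valid AWS credentials."""
    import boto3  # imported here so build_s3_target works without boto3

    glue = boto3.client("glue")
    return glue.create_crawler(
        Name=name,
        Role=role,
        DatabaseName=database,
        Targets={"S3Targets": [target]},
    )


target = build_s3_target(
    "s3://my-bucket/somedata/",
    ["**.sql", "folder_n**/**"],
)
print(target)

# Hypothetical names; uncomment to actually create the crawler.
# create_crawler_with_exclusions(
#     "my-crawler", "MyGlueServiceRole", "my_database", target
# )
```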

Gaurav Upadhyay answered Sep 19 '22 11:09