I'm writing a Glue Crawler as part of an ETL pipeline, and I have a very annoying problem: the S3 bucket I'm crawling contains many different JSON files, all with the same schema. When crawling the bucket, the crawler creates a new table for every empty file, plus one additional table for all the non-empty files.
When I manually delete the empty files and run the crawler, I get the expected behaviour: a single table is created from the non-empty files' data.
Is there a way to avoid this? I'm having trouble deleting the empty files before crawling.
Many thanks.
I do not know if this is still useful after two years. I stumbled upon the same issue recently, and after researching for a couple of evenings, this is where I landed.
TL;DR There is no good way to address this issue directly with AWS Glue Crawlers (sigh).
Instead, you should build your data pipeline so that it works well with your crawlers. Empty files and many small files also increase your crawlers' running time, which over time can become a serious issue.
The approaches I have come up with are mainly two, depending on whether or not you have control over the data load. Honestly, they are pretty obvious, but I could not find anything better.
If you can control how data is produced and stored, you can probably prevent empty files from reaching your storage in the first place, for example by filtering your JSON files before transferring them to your storage (see the sketch below).
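As a minimal sketch of that idea, assuming the files are staged locally in a `./staging` directory before upload, and that the bucket name and prefix shown here are placeholders:

```python
import pathlib

import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-bucket"  # placeholder bucket name
PREFIX = "raw/events/"     # placeholder prefix the crawler scans

for path in pathlib.Path("./staging").glob("*.json"):
    # Skip files that are empty or contain only whitespace,
    # so they never reach the crawled location.
    if not path.read_text().strip():
        continue
    s3.upload_file(str(path), BUCKET, PREFIX + path.name)
```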
If you don't have control over the data transfer, create an extra step in your data pipeline that filters and/or merges files (merging is preferred, since it leaves fewer files for the crawler to ingest), producing a new layer of data that is cleansed of empty files and better organized. This step can run event-based or in time batches, depending on your requirements. A sketch of the filtering part follows below.
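Here is a minimal sketch of that cleansing step, assuming raw files land under `raw/events/` and the crawler points at `cleansed/events/`; both prefixes and the bucket name are placeholders. It only filters out zero-byte objects; merging small files into larger ones would live in the same step.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-bucket"          # placeholder bucket name
RAW_PREFIX = "raw/events/"         # placeholder: where files arrive
CLEAN_PREFIX = "cleansed/events/"  # placeholder: what the crawler scans

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=RAW_PREFIX):
    for obj in page.get("Contents", []):
        # Zero-byte files are exactly what confuses the crawler; drop them.
        if obj["Size"] == 0:
            continue
        key = obj["Key"]
        s3.copy_object(
            Bucket=BUCKET,
            Key=CLEAN_PREFIX + key[len(RAW_PREFIX):],
            CopySource={"Bucket": BUCKET, "Key": key},
        )
```

This could be wired up as a scheduled job or triggered per-object (e.g. from S3 event notifications), whichever fits your pipeline.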
With that being said, I hope a better solution comes along soon. I think it would be great if crawlers could directly skip files with no data, or take some action based on file size.