 

AWS Glue does not detect partitions and creates 1000+ tables in catalog

I am using AWS Glue to create metadata tables.

AWS Glue Crawler data store path: s3://bucket-name/

Bucket structure in S3 is like

├── bucket-name        
│   ├── pt=2011-10-11-01     
│   │   ├── file1                    
│   │   ├── file2
│   ├── pt=2011-10-11-02               
│   │   ├── file1          
│   ├── pt=2011-10-10-01           
│   │   ├── file1           
│   ├── pt=2011-10-11-10              
│   │   ├── file1  

                       

For this structure, the AWS Glue crawler creates 4 tables, one per `pt=...` prefix.

My question is: why does the AWS Glue crawler not detect these prefixes as partitions of a single table?

iammehrabalam asked Feb 05 '23 02:02


1 Answer

To force Glue to merge multiple schemas into one table, make sure the option "Create a single schema for each S3 path" is checked when creating the crawler.

Screenshot of crawler creation step, with this setting enabled

Here's a detailed explanation, quoted directly from the AWS documentation:

By default, when a crawler defines tables for data stored in Amazon S3, it considers both data compatibility and schema similarity. Data compatibility factors taken into account include whether the data is of the same format (for example, JSON), the same compression type (for example, GZIP), the structure of the Amazon S3 path, and other data attributes. Schema similarity is a measure of how closely the schemas of separate Amazon S3 objects are similar.

You can configure a crawler to CombineCompatibleSchemas into a common table definition when possible. With this option, the crawler still considers data compatibility, but ignores the similarity of the specific schemas when evaluating Amazon S3 objects in the specified include path.

If you are configuring the crawler on the console, to combine schemas, select the crawler option Create a single schema for each S3 path.
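If you create the crawler from code rather than the console, the same option goes into the crawler's `Configuration` JSON as the `CombineCompatibleSchemas` grouping policy. Here is a minimal sketch with boto3; the helper names are made up for illustration, and the crawler name, IAM role, and database you pass in are assumptions about your setup, not values from the question.

```python
import json


def build_grouping_configuration():
    """Return the crawler Configuration JSON that enables CombineCompatibleSchemas."""
    return json.dumps(
        {
            "Version": 1.0,
            "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
        }
    )


def create_single_schema_crawler(glue_client, name, role_arn, database, s3_path):
    """Create a crawler that combines compatible schemas under s3_path.

    glue_client is expected to be a boto3 Glue client, i.e. boto3.client("glue").
    All other arguments are placeholders supplied by the caller.
    """
    return glue_client.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
        Configuration=build_grouping_configuration(),
    )
```

You would call it as `create_single_schema_crawler(boto3.client("glue"), "my-crawler", "<role ARN>", "my_db", "s3://bucket-name/")`. For a crawler that already exists, the same JSON string can be passed as `Configuration` to `update_crawler`.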

bhrd answered Feb 07 '23 08:02