I am using AWS Glue to create metadata tables.
AWS Glue Crawler data store path: s3://bucket-name/
Bucket structure in S3 is like
├── bucket-name
│ ├── pt=2011-10-11-01
│ │ ├── file1
| | ├── file2
│ ├── pt=2011-10-11-02
│ │ ├── file1
│ ├── pt=2011-10-10-01
│ │ ├── file1
│ ├── pt=2011-10-11-10
│ │ ├── file1
for this aws crawler create 4 tables.
My question is why aws glue crawler does not detect partition?
To force Glue to merge multiple schemas together, make sure this option is checked, when creating the crawler - Create a single schema for each S3 path.
Screenshot of crawler creation step, with this setting enabled
Here's a detailed explanation - quoting directly, from AWS documentation (reference)
By default, when a crawler defines tables for data stored in Amazon S3, it considers both data compatibility and schema similarity. Data compatibility factors taken into account include whether the data is of the same format (for example, JSON), the same compression type (for example, GZIP), the structure of the Amazon S3 path, and other data attributes. Schema similarity is a measure of how closely the schemas of separate Amazon S3 objects are similar.
You can configure a crawler to CombineCompatibleSchemas into a common table definition when possible. With this option, the crawler still considers data compatibility, but ignores the similarity of the specific schemas when evaluating Amazon S3 objects in the specified include path.
If you are configuring the crawler on the console, to combine schemas, select the crawler option Create a single schema for each S3 path.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With