I have a MySQL source from which I am creating a Glue DynamicFrame with a pushdown predicate, as follows:
datasource = glueContext.create_dynamic_frame_from_catalog(
    database = source_catalog_db,
    table_name = source_catalog_tbl,
    push_down_predicate = "id > 1531812324",
    transformation_ctx = "datasource")
I always get all the records in 'datasource', regardless of the condition I put in 'push_down_predicate'. What am I missing?
This is great! I was able to use it to obtain the last 30 days of data using my "dt" partition column:
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "my_db",
    table_name = "my_table",
    push_down_predicate = "to_date(dt) >= date_sub(current_date, 30)",
    transformation_ctx = "datasource0")
I'm using Glue 1.0 - Spark 2.4 - Python 2.
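If the partition values are plain yyyy-MM-dd strings, the same effect can be had by building the predicate string in Python before the call. A minimal sketch, assuming a hypothetical "dt" partition column whose values sort lexicographically in date order:

from datetime import date, timedelta

# Assumption: 'dt' partition values are yyyy-MM-dd strings, so
# string comparison matches chronological order.
cutoff = (date.today() - timedelta(days=30)).isoformat()
predicate = "dt >= '{}'".format(cutoff)

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "my_db",
    table_name = "my_table",
    push_down_predicate = predicate,
    transformation_ctx = "datasource0")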
Pushdown predicates work for partition columns only. In other words, your data files should be placed in hierarchically structured folders. For example, if the data is located in s3://bucket/dataset/ and partitioned by year, month, and day, then the structure should be as follows:
s3://bucket/dataset/year=2018/month=7/day=18/<data-files-here>
In that case the pushdown predicate would work for the columns year, month, and day only:
datasource = glueContext.create_dynamic_frame_from_catalog(
    database = source_catalog_db,
    table_name = source_catalog_tbl,
    push_down_predicate = "year = 2017 and month > 6 and day between 3 and 10",
    transformation_ctx = "datasource")
Besides that, you have to keep in mind that pushdown predicates work with S3 data sources only.
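Since the question's source is MySQL (JDBC), the condition has to be applied after the load instead. A hedged sketch using Glue's Filter transform, mirroring the question's condition; note that, unlike a pushdown predicate, this does not reduce the amount of data read from the source:

from awsglue.transforms import Filter

datasource = glueContext.create_dynamic_frame_from_catalog(
    database = source_catalog_db,
    table_name = source_catalog_tbl,
    transformation_ctx = "datasource")

# Filter row-by-row after loading the full table.
filtered = Filter.apply(
    frame = datasource,
    f = lambda row: row["id"] > 1531812324,
    transformation_ctx = "filtered")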
Here is a nice blog post written by the AWS Glue devs about data partitioning.