Having a very weird problem with Glue. I'm using it to run some ETL on data I'm moving from MySQL RDS to Redshift, with the same code I used on another table, where it worked fine and copied all the data as it should have.
However, on the second table it doesn't copy the data in the id column from MySQL for some reason. The id column on Redshift is completely blank.
query_df = spark.read.format("jdbc") \
    .option("url", args['RDSURL']) \
    .option("driver", args['RDSDRIVER']) \
    .option("dbtable", args['RDSQUERY']) \
    .option("user", args['RDSUSER']) \
    .option("password", args['RDSPASS']) \
    .load()

datasource0 = DynamicFrame.fromDF(query_df, glueContext, "datasource0")
logging.info(datasource0.show())
applymapping1 = ApplyMapping.apply(
    frame = datasource0,
    mappings = [("id", "int", "id", "int"), ...],
    transformation_ctx = "applymapping1")
logging.info(applymapping1.show())
From the logs I print above I can see that the DynamicFrame contains the id field even after ApplyMapping.
datasink2 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = applymapping1,
    catalog_connection = args['RSCLUSTER'],
    connection_options = {"dbtable": args['RSTABLE'], "database": args['RSDB']},
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink2")
The problem seems to be happening here, I think? After this the job completes, but on checking Redshift the id column is completely empty.
Very puzzled by this behaviour. The exact same code worked fine on another table; the only difference between the id columns in these two tables is that this table has id as int(11) unsigned while the table where the code worked had id as int(10) signed.
I've had exactly this behaviour when extracting from MySQL RDS using Glue. For anyone seeking the answer, the reason is as follows: AWS Glue has the concept of a 'choice' type, where the exact type of a crawled column can remain as a set of possibilities throughout the ETL job, because the crawler only samples a subset of a column's data to determine the probable type and never decides definitively. This is also why switching to an explicit schema rather than a crawler fixes the issue: it doesn't involve any choice types.
When the job runs (or you look at a preview), Spark attempts to process the entire column's data. At that point the column type may be resolved to a type that is incompatible with the data, i.e. the interpreter can't settle on the right choice, and this results in empty data for the column in question. I have seen this when transforming a number of tables from a MySQL DB, and I haven't been able to find a pattern for why some tables fail and some don't, although it must be related to the data in the source column.
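One quick way to check whether this is what's happening is to print the frame's schema before writing it out; an unresolved column shows up as a choice between two or more types. A minimal sketch, assuming the datasource0 DynamicFrame from the question:

# Print the inferred schema; a column with an unresolved type appears as a
# "choice" node (e.g. id could show as a choice between int and long).
datasource0.printSchema()

# For comparison, the schema Spark sees after converting to a DataFrame.
datasource0.toDF().printSchema()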
The solution is to add an explicit resolution of the choice to your script, casting the failing column to the desired target type with something like the following:
df.resolveChoice(specs = [('id', 'cast:int')])
Where df is the DynamicFrame. This will force the column to be interpreted as the intended type and should result in the expected output of data in this column. This has always worked for me.
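Wired into the script from the question, the call would sit between creating the DynamicFrame and the ApplyMapping step; a sketch along these lines (resolved0 is just an illustrative name):

from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import ApplyMapping

datasource0 = DynamicFrame.fromDF(query_df, glueContext, "datasource0")

# Force the ambiguous id column to a single concrete type before mapping.
resolved0 = datasource0.resolveChoice(specs = [('id', 'cast:int')])

applymapping1 = ApplyMapping.apply(
    frame = resolved0,
    mappings = [("id", "int", "id", "int")],  # plus the remaining mappings from the question
    transformation_ctx = "applymapping1")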
Note that for those using the Glue Studio visual editor it's now possible to add a 'Custom Transformation' step which contains code to perform this for you. In this instance the code for the transformation should look as follows:
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Take the single frame out of the incoming collection.
    df = dfc.select(list(dfc.keys())[0])
    # Resolve the ambiguous id column to a concrete int type.
    df_resolved = df.resolveChoice(specs = [('id', 'cast:int')])
    return DynamicFrameCollection({"CustomTransform0": df_resolved}, glueContext)
Note also that in this scenario it will be necessary to follow this Custom Transformation node with a 'Select From Collection' transformation, since the Custom Transformation returns a collection rather than a single frame.
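For reference, the script-level equivalent of that follow-up step is the SelectFromCollection transform, which picks the single frame back out of the collection. A minimal sketch, assuming the collection returned by the custom node above is bound to a variable named transform0 (name assumed, not from the original):

from awsglue.transforms import SelectFromCollection

# Pull the single DynamicFrame back out of the collection returned by MyTransform.
selected = SelectFromCollection.apply(
    dfc = transform0,
    key = list(transform0.keys())[0],
    transformation_ctx = "selected")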