I am using AWS Glue to join two tables. By default, it performs INNER JOIN. I want to do a LEFT OUTER JOIN. I referred the AWS Glue documentation but there is no way to pass the join type to the Join.apply() method. Is there a way to achieve this in AWS Glue?
## @type: Join
## @args: [keys1 = id, keys2 = "user_id"]
## @return: cUser
## @inputs: [frame1 = cUser0, frame2 = cUserLogins]
#cUser = Join.apply(frame1 = cUser0, frame2 = +, keys1 = "id", keys2 = "user_id", transformation_ctx = "<transformation_ctx>")
## @type: Join
## @args: [keys1 = id, keys2 = user_id]
## @return: datasource0
## @inputs: [frame1 = cUser, frame2 = cKKR]
datasource0 = Join.apply(frame1 = cUser0, frame2 = cKKR, keys1 = "id", keys2 = "user_id", transformation_ctx = "<transformation_ctx>")
## @type: Join
## @args: [keys1 = branch_id, keys2 = user_id]
## @return: datasource1
## @inputs: [frame1 = datasource0, frame2 = cBranch]
datasource1 = Join.apply(frame1 = datasource0, frame2 = cBranch, keys1 = "branch_id", keys2 = "user_id", transformation_ctx = "<transformation_ctx>")
AWS Glue Jobs can be configured with the arguments listed in this document. You can configure a Job through the console, on the Job details tab, under the Job Parameters heading. You can also configure a Job through the AWS CLI by setting DefaultArguments or NonOverridableArguments on a Job or Arguments on a Job Run.
AWS charges you an hourly rate based on the number of Data Processing Units (or DPUs) used to run your ETL job. A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. AWS bills for jobs and development endpoints in increments of 1 second, rounded up to the nearest second.
Currently, LEFT and RIGHT joins are not supported by AWS Glue. But, we can still achieve it by converting the DynamicFrame to the DataFrame and using join method.
Here the example:
cUser0 = glueContext.create_dynamic_frame.from_catalog(database = "captains", table_name = "cp_txn_winds_karyakarta_users", transformation_ctx = "cUser")
cUser0DF = cUser0.toDF()
cKKR = glueContext.create_dynamic_frame.from_catalog(database = "captains", table_name = "cp_txn_winds_karyakarta_karyakartas", redshift_tmp_dir = args["TempDir"], transformation_ctx = "cKKR")
cKKRDF = cKKR.toDF()
dataSource0 = cUser0DF.join(cKKRDF, cUser0DF.id == cKKRDF.user_id,how='left_outer')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With