Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to specify join types in AWS Glue?

I am using AWS Glue to join two tables. By default, it performs INNER JOIN. I want to do a LEFT OUTER JOIN. I referred the AWS Glue documentation but there is no way to pass the join type to the Join.apply() method. Is there a way to achieve this in AWS Glue?

## @type: Join
## @args: [keys1 = id, keys2 = "user_id"]
## @return: cUser
## @inputs: [frame1 = cUser0, frame2 = cUserLogins]
#cUser = Join.apply(frame1 = cUser0, frame2 = +, keys1 = "id", keys2 = "user_id", transformation_ctx = "<transformation_ctx>")


## @type: Join
## @args: [keys1 = id, keys2 = user_id]
## @return: datasource0
## @inputs: [frame1 = cUser, frame2 = cKKR]
datasource0 = Join.apply(frame1 = cUser0, frame2 = cKKR, keys1 = "id", keys2 = "user_id", transformation_ctx = "<transformation_ctx>")

## @type: Join
## @args: [keys1 = branch_id, keys2 = user_id]
## @return: datasource1
## @inputs: [frame1 = datasource0, frame2 = cBranch]
datasource1 = Join.apply(frame1 = datasource0, frame2 = cBranch, keys1 = "branch_id", keys2 = "user_id", transformation_ctx = "<transformation_ctx>")
like image 304
Vikas Roy Avatar asked Jan 21 '19 14:01

Vikas Roy


People also ask

How do you run a glue job with parameters?

AWS Glue Jobs can be configured with the arguments listed in this document. You can configure a Job through the console, on the Job details tab, under the Job Parameters heading. You can also configure a Job through the AWS CLI by setting DefaultArguments or NonOverridableArguments on a Job or Arguments on a Job Run.

What are DPUs in AWS Glue?

AWS charges you an hourly rate based on the number of Data Processing Units (or DPUs) used to run your ETL job. A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. AWS bills for jobs and development endpoints in increments of 1 second, rounded up to the nearest second.


1 Answers

Currently, LEFT and RIGHT joins are not supported by AWS Glue. But, we can still achieve it by converting the DynamicFrame to the DataFrame and using join method.

Here the example:

cUser0 = glueContext.create_dynamic_frame.from_catalog(database = "captains", table_name = "cp_txn_winds_karyakarta_users", transformation_ctx = "cUser")

cUser0DF = cUser0.toDF()

cKKR = glueContext.create_dynamic_frame.from_catalog(database = "captains", table_name = "cp_txn_winds_karyakarta_karyakartas", redshift_tmp_dir = args["TempDir"], transformation_ctx = "cKKR")

cKKRDF = cKKR.toDF()

dataSource0 = cUser0DF.join(cKKRDF, cUser0DF.id == cKKRDF.user_id,how='left_outer')
like image 119
Vikas Roy Avatar answered Sep 17 '22 22:09

Vikas Roy