
AWS Glue to Redshift: Is it possible to replace, update or delete data?

Here are some bullet points describing how I have things set up:

  • I have CSV files uploaded to S3 and a Glue crawler set up to create the table and schema.
  • I have a Glue job set up that writes the data from the Glue table to our Amazon Redshift database using a JDBC connection. The job is also in charge of mapping the columns and creating the Redshift table.

By re-running a job, I am getting duplicate rows in Redshift (as expected). However, is there a way to replace or delete rows before inserting the new data, using a key or the partitions set up in Glue?

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import SelectFields
from pyspark.sql.functions import lit

## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

columnMapping = [
    ("id", "int", "id", "int"),
    ("name", "string", "name", "string"),
]

## Read the source table from the Glue Data Catalog
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "db01", table_name = "table01", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource1, mappings = columnMapping, transformation_ctx = "applymapping1")
resolvechoice1 = ResolveChoice.apply(frame = applymapping1, choice = "make_cols", transformation_ctx = "resolvechoice1")
dropnullfields1 = DropNullFields.apply(frame = resolvechoice1, transformation_ctx = "dropnullfields1")

## Add a constant 'platform' column before writing
df1 = dropnullfields1.toDF()
data1 = df1.withColumn('platform', lit('test'))
data1 = DynamicFrame.fromDF(data1, glueContext, "data_tmp1")

## Write data to redshift
datasink1 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = data1, catalog_connection = "Test Connection", connection_options = {"dbtable": "table01", "database": "db01"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink1")

job.commit()
asked Sep 14 '17 by krchun

People also ask

Can we update data in Redshift?

While Amazon Redshift does not support a single merge, or upsert, command to update a table from a single data source, you can perform a merge operation by creating a staging table and then using one of the methods described in this section to update the target table from the staging table.
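
For illustration, here is a minimal sketch of that staging-table merge pattern, run against Redshift outside of Glue. The table names (table01, table01_stage), the key column id, and the connection details are all placeholders, and psycopg2 is just one way to issue the SQL, since Redshift speaks the PostgreSQL wire protocol.

import psycopg2

# Connect to the cluster (endpoint, database, and credentials are placeholders).
conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com", port=5439,
                        dbname="db01", user="admin", password="...")

with conn, conn.cursor() as cur:
    # Delete target rows that have a fresh version in the staging table,
    # then move the staged rows across and clear the staging table.
    cur.execute("DELETE FROM table01 USING table01_stage WHERE table01.id = table01_stage.id")
    cur.execute("INSERT INTO table01 SELECT * FROM table01_stage")
    cur.execute("TRUNCATE table01_stage")

conn.close()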

Can glue load data to Redshift?

Below are the steps you can follow to move data from AWS Glue to Redshift:

  1. Create temporary credentials and roles using AWS Glue.
  2. Specify the role in the AWS Glue script.
  3. Handle Dynamic Frames in the AWS Glue to Redshift integration.
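
As a rough sketch of steps 2 and 3, the write call below reuses the names from the question's script (glueContext, dropnullfields1, args) and adds an aws_iam_role connection option so Redshift can read the staged files from S3; the role ARN is hypothetical.

# The role ARN is a placeholder; connection and table names follow the question's script.
datasink = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = dropnullfields1,
    catalog_connection = "Test Connection",
    connection_options = {
        "dbtable": "table01",
        "database": "db01",
        "aws_iam_role": "arn:aws:iam::123456789012:role/GlueRedshiftRole",
    },
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink",
)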

How does AWS Glue handle updates?

Between job runs, AWS Glue sequences duplicate transactions to the same primary key (for example, insert, then update) by file name and order. It determines the last transaction and uses it to re-write the impacted object to S3.


2 Answers

Job bookmarks are the key. Just edit the job and enable "Job bookmarks", and it won't process already-processed data. Note that the job has to run once more before it detects that it does not need to reprocess the old data.

For more info see: http://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html

The name "bookmark" is a bit far fetched in my opinion. I would have never looked at it if I did not coincidentally stumble upon it during my search.

answered Sep 18 '22 by Matthijs


This was the solution I got from AWS Glue Support:

As you may know, although you can create primary keys, Redshift doesn't enforce uniqueness. Therefore, if you rerun Glue jobs, duplicate rows can get inserted. Some of the ways to maintain uniqueness are:

  1. Use a staging table to insert all rows and then perform an upsert/merge [1] into the main table; this has to be done outside of Glue.

  2. Add another column to your Redshift table [2], like an insert timestamp, to allow duplicates but know which row came first or last, and then delete the duplicates afterwards if you need to.

  3. Load the previously inserted data into a dataframe and then compare it with the data to be inserted, to avoid inserting duplicates [3] (see the sketch after this list).
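
A minimal sketch of option 3, reusing glueContext, args, data1, and DynamicFrame from the question's script; the JDBC URL, credentials, and the key column id are placeholders.

# Read the rows already in Redshift (URL and credentials are placeholders).
existing = glueContext.create_dynamic_frame.from_options(
    connection_type = "redshift",
    connection_options = {
        "url": "jdbc:redshift://my-cluster.example.redshift.amazonaws.com:5439/db01",
        "dbtable": "table01",
        "user": "admin",
        "password": "...",
        "redshiftTmpDir": args["TempDir"],
    },
)

# Keep only rows whose key is not already present, then write new_rows_dyf instead of data1.
new_rows = data1.toDF().join(existing.toDF().select("id"), on = "id", how = "left_anti")
new_rows_dyf = DynamicFrame.fromDF(new_rows, glueContext, "new_rows_dyf")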

[1] - http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html and http://www.silota.com/blog/amazon-redshift-upsert-support-staging-table-replace-rows/

[2] - https://github.com/databricks/spark-redshift/issues/238

[3] - https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html

answered Sep 20 '22 by krchun