
AWS Glue to Redshift: Is it possible to replace, update or delete data?

Here are some bullet points describing how I have things set up:

  • I have CSV files uploaded to S3 and a Glue crawler set up to create the table and schema.
  • I have a Glue job set up that writes the data from the Glue table to our Amazon Redshift database using a JDBC connection. The job is also in charge of mapping the columns and creating the Redshift table.

By re-running a job, I am getting duplicate rows in Redshift (as expected). However, is there a way to replace or delete rows before inserting the new data, using a key or the partitions set up in Glue?

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import SelectFields
from pyspark.sql.functions import lit

## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

columnMapping = [
    ("id", "int", "id", "int"),
    ("name", "string", "name", "string"),
]

## Read the source table from the Glue Data Catalog
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "db01", table_name = "table01", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource1, mappings = columnMapping, transformation_ctx = "applymapping1")
resolvechoice1 = ResolveChoice.apply(frame = applymapping1, choice = "make_cols", transformation_ctx = "resolvechoice1")
dropnullfields1 = DropNullFields.apply(frame = resolvechoice1, transformation_ctx = "dropnullfields1")

## Add a constant 'platform' column before writing
df1 = dropnullfields1.toDF()
data1 = df1.withColumn('platform', lit('test'))
data1 = DynamicFrame.fromDF(data1, glueContext, "data_tmp1")

## Write data to redshift
datasink1 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = data1, catalog_connection = "Test Connection", connection_options = {"dbtable": "table01", "database": "db01"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink1")

job.commit()
asked Sep 14 '17 by krchun

People also ask

Can we update data in Redshift?

While Amazon Redshift does not support a single merge, or upsert, command to update a table from a single data source, you can perform a merge operation by creating a staging table and then using one of the methods described in this section to update the target table from the staging table.
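
For illustration, here is a minimal sketch of that staging-table merge pattern, run against Redshift outside of Glue. The table names (table01, table01_stage), the key column id, and the connection details are all placeholders, and psycopg2 is just one way to issue the SQL, since Redshift speaks the PostgreSQL wire protocol.

import psycopg2

# Connect to the cluster (endpoint, database, and credentials are placeholders).
conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com", port=5439,
                        dbname="db01", user="admin", password="...")

with conn, conn.cursor() as cur:
    # Delete target rows that have a fresh version in the staging table,
    # then move the staged rows across and clear the staging table.
    cur.execute("DELETE FROM table01 USING table01_stage WHERE table01.id = table01_stage.id")
    cur.execute("INSERT INTO table01 SELECT * FROM table01_stage")
    cur.execute("TRUNCATE table01_stage")

conn.close()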

Can glue load data to Redshift?

Below are the steps you can follow to move data from AWS Glue to Redshift:

  1. Create temporary credentials and roles using AWS Glue.
  2. Specify the role in the AWS Glue script.
  3. Handle Dynamic Frames in the AWS Glue to Redshift integration.
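
As a rough sketch of steps 2 and 3, the write call below reuses the names from the question's script (glueContext, dropnullfields1, args) and adds an aws_iam_role connection option so Redshift can read the staged files from S3; the role ARN is hypothetical.

# The role ARN is a placeholder; connection and table names follow the question's script.
datasink = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = dropnullfields1,
    catalog_connection = "Test Connection",
    connection_options = {
        "dbtable": "table01",
        "database": "db01",
        "aws_iam_role": "arn:aws:iam::123456789012:role/GlueRedshiftRole",
    },
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink",
)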

How does AWS Glue handle updates?

Between job runs, AWS Glue sequences duplicate transactions to the same primary key (for example, insert, then update) by file name and order. It determines the last transaction and uses it to re-write the impacted object to S3.


2 Answers

Job bookmarks are the key. Just edit the job and enable "Job bookmarks", and it won't process already-processed data. Note that the job has to run once more before it detects that it does not need to reprocess the old data.

For more info see: http://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html

The name "bookmark" is a bit far fetched in my opinion. I would have never looked at it if I did not coincidentally stumble upon it during my search.

answered Sep 18 '22 by Matthijs


This was the solution I got from AWS Glue Support:

As you may know, although you can create primary keys, Redshift doesn't enforce uniqueness. Therefore, if you rerun Glue jobs, duplicate rows can get inserted. Some of the ways to maintain uniqueness are:

  1. Use a staging table to insert all rows and then perform an upsert/merge [1] into the main table; this has to be done outside of Glue.

  2. Add another column to your Redshift table [2], like an insert timestamp, to allow duplicates but know which row came first or last, and then delete the duplicates afterwards if you need to.

  3. Load the previously inserted data into a dataframe and then compare it with the data to be inserted, to avoid inserting duplicates [3] (see the sketch after this list).
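
A minimal sketch of option 3, reusing glueContext, args, data1, and DynamicFrame from the question's script; the JDBC URL, credentials, and the key column id are placeholders.

# Read the rows already in Redshift (URL and credentials are placeholders).
existing = glueContext.create_dynamic_frame.from_options(
    connection_type = "redshift",
    connection_options = {
        "url": "jdbc:redshift://my-cluster.example.redshift.amazonaws.com:5439/db01",
        "dbtable": "table01",
        "user": "admin",
        "password": "...",
        "redshiftTmpDir": args["TempDir"],
    },
)

# Keep only rows whose key is not already present, then write new_rows_dyf instead of data1.
new_rows = data1.toDF().join(existing.toDF().select("id"), on = "id", how = "left_anti")
new_rows_dyf = DynamicFrame.fromDF(new_rows, glueContext, "new_rows_dyf")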

[1] - http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html and http://www.silota.com/blog/amazon-redshift-upsert-support-staging-table-replace-rows/

[2] - https://github.com/databricks/spark-redshift/issues/238

[3] - https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html

answered Sep 20 '22 by krchun