
AWS Glue and update duplicating data

I'm using AWS Glue to move multiple files from S3 to an RDS instance. Each day I get a new file into S3 which may contain new data, but can also contain a record I have already saved, with some updated values. If I run the job multiple times I will of course get duplicate records in the database. Instead of multiple records being inserted, I want Glue to update the existing record if it notices a field has changed; each record has a unique id. Is this possible?

asked Nov 22 '18 by joshuahornby10

People also ask

What is transformation context in AWS Glue?

The transformation_ctx parameter is used to identify state information within a job bookmark for the given operator. Specifically, AWS Glue uses transformation_ctx to index the key to the bookmark state. For job bookmarks to work properly, enable the job bookmark parameter and set the transformation_ctx parameter.
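
For example, here is a minimal sketch of a job script that passes transformation_ctx so bookmarks can track the read (the database and table names are placeholders):

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# transformation_ctx names this operator so Glue can store and restore
# its bookmark state between runs (placeholder database/table names)
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table",
    transformation_ctx="datasource0",
)

job.commit()  # persists the bookmark state at the end of the run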

What determines the schema of the data in AWS Glue?

A classifier determines the schema of your data. AWS Glue provides classifiers for common file types, such as CSV, JSON, Avro, XML, and others. It also provides classifiers for common relational database management systems accessed through a JDBC connection.
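
If the built-in classifiers do not match your files, a custom one can be registered with the boto3 Glue client; the sketch below assumes a pipe-delimited CSV, and the classifier name, delimiter, and header setting are made up:

import boto3

glue = boto3.client("glue")

# register a custom CSV classifier for pipe-delimited files
# (classifier name, delimiter, and header setting are illustrative)
glue.create_classifier(
    CsvClassifier={
        "Name": "pipe-delimited-csv",
        "Delimiter": "|",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",
    }
)

Attach the classifier to a crawler and it is tried before the built-in classifiers when the crawler infers the schema.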

Why is AWS Glue so slow?

Some common reasons why your AWS Glue jobs take a long time to complete are the following: Large datasets. Non-uniform distribution of data in the datasets. Uneven distribution of tasks across the executors.

Is AWS Glue an ETL tool?

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. AWS Glue can run your extract, transform, and load (ETL) jobs as new data arrives.


2 Answers

I followed a similar approach to the second option suggested by Yuriy: load the existing data as well as the new data, do some processing to merge the two, and write the result with overwrite mode. The following code should give you an idea of how to solve this problem.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# get your source data
src_data = glueContext.create_dynamic_frame.from_catalog(database=src_db, table_name=src_tbl)
src_df = src_data.toDF()

# get your destination data
dst_data = glueContext.create_dynamic_frame.from_catalog(database=dst_db, table_name=dst_tbl)
dst_df = dst_data.toDF()

# merge the two data frames and drop duplicate ids
# ("id" is assumed to be the unique key; Spark does not guarantee which
# copy of a duplicated id survives, so dedupe deliberately if that matters)
merged_df = dst_df.union(src_df).dropDuplicates(["id"])

# finally save data to the destination with overwrite mode
# (the default JDBC overwrite drops and recreates the table)
merged_df.write.format("jdbc").options(
    url=dest_jdbc_url,
    user=dest_user_name,
    password=dest_password,
    dbtable=dest_tbl,
).mode("overwrite").save()

answered Sep 29 '22 by Tharsan Sivakumar

Unfortunately there is no elegant way to do it with Glue. If you were writing to Redshift, you could use postactions to implement a Redshift merge (upsert) operation. However, that's not possible for other JDBC sinks (as far as I know).
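
For reference, a rough sketch of that Redshift pattern is below; the connection name, table names, and temp directory are placeholders, and my_dynamic_frame stands for the DynamicFrame built earlier in the job:

merge_sql = """
    BEGIN;
    DELETE FROM public.target USING public.target_staging
        WHERE public.target.id = public.target_staging.id;
    INSERT INTO public.target SELECT * FROM public.target_staging;
    DROP TABLE public.target_staging;
    END;
"""

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=my_dynamic_frame,                       # DynamicFrame built earlier in the job
    catalog_connection="my-redshift-connection",  # placeholder Glue connection
    connection_options={
        "database": "dev",
        "dbtable": "public.target_staging",       # load into a staging table first
        "postactions": merge_sql,                 # merge runs after the load completes
    },
    redshift_tmp_dir="s3://my-temp-bucket/redshift/",
)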

Alternatively, in your ETL script you can load the existing data from the database and use it to filter out existing records before saving. However, if your DB table is big then the job may take a while to process it.
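
A hedged sketch of that idea, reusing src_df, dst_df, and the JDBC options from the first answer and assuming "id" is the unique key:

# keep only incoming rows whose id is not already in the destination,
# then append them; existing rows are left untouched
new_rows_df = src_df.join(dst_df, on="id", how="left_anti")

new_rows_df.write.format("jdbc").options(
    url=dest_jdbc_url,
    user=dest_user_name,
    password=dest_password,
    dbtable=dest_tbl,
).mode("append").save()

Note that this only inserts brand-new records; rows whose id already exists but whose values changed would still need a separate update step.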

Another approach is to write into a staging table with 'overwrite' mode first (replacing the existing staging data) and then make a call to the DB to copy only new records into the final table.
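
As an illustrative sketch for a MySQL RDS target (table, column, and connection details are made up), the copy step after the staging overwrite could be an upsert run outside of Glue:

import pymysql

conn = pymysql.connect(host="my-rds-host", user="admin",
                       password="secret", database="mydb")
try:
    with conn.cursor() as cur:
        # upsert: insert new ids, update rows whose id already exists
        cur.execute("""
            INSERT INTO orders (id, amount, updated_at)
            SELECT id, amount, updated_at FROM orders_staging
            ON DUPLICATE KEY UPDATE
                amount = VALUES(amount),
                updated_at = VALUES(updated_at)
        """)
    conn.commit()
finally:
    conn.close()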

answered Sep 29 '22 by Yuriy Bondaruk