
AWS Glue and update duplicating data

I'm using AWS Glue to move multiple files from S3 to an RDS instance. Each day I get a new file into S3 which may contain new data, but can also contain a record I have already saved, with some updated values. If I run the job multiple times I will of course get duplicate records in the database. Instead of multiple records being inserted, I want Glue to update the existing record if it notices a field has changed; each record has a unique id. Is this possible?

asked Nov 22 '18 by joshuahornby10

People also ask

What is transformation context in AWS Glue?

The transformation_ctx parameter is used to identify state information within a job bookmark for the given operator. Specifically, AWS Glue uses transformation_ctx to index the key to the bookmark state. For job bookmarks to work properly, enable the job bookmark parameter and set the transformation_ctx parameter.
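
For example, here is a minimal sketch of a job script that passes transformation_ctx so bookmarks can track the read (the database and table names are placeholders):

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# transformation_ctx names this operator so Glue can store and restore
# its bookmark state between runs (placeholder database/table names)
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table",
    transformation_ctx="datasource0",
)

job.commit()  # persists the bookmark state at the end of the run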

What determines the schema of the data in AWS Glue?

A classifier determines the schema of your data. AWS Glue provides classifiers for common file types, such as CSV, JSON, Avro, XML, and others. It also provides classifiers for common relational database management systems accessed through a JDBC connection.
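
If the built-in classifiers do not match your files, a custom one can be registered with the boto3 Glue client; the sketch below assumes a pipe-delimited CSV, and the classifier name, delimiter, and header setting are made up:

import boto3

glue = boto3.client("glue")

# register a custom CSV classifier for pipe-delimited files
# (classifier name, delimiter, and header setting are illustrative)
glue.create_classifier(
    CsvClassifier={
        "Name": "pipe-delimited-csv",
        "Delimiter": "|",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",
    }
)

Attach the classifier to a crawler and it is tried before the built-in classifiers when the crawler infers the schema.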

Why is AWS Glue so slow?

Some common reasons why your AWS Glue jobs take a long time to complete are the following: Large datasets. Non-uniform distribution of data in the datasets. Uneven distribution of tasks across the executors.

Is AWS Glue an ETL tool?

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. AWS Glue can run your extract, transform, and load (ETL) jobs as new data arrives.


2 Answers

I followed a similar approach to the second option suggested by Yuriy: load the existing data as well as the new data, do some processing to merge the two, and write the result with overwrite mode. The following code should give you an idea of how to solve this problem.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# get your source data
src_data = glueContext.create_dynamic_frame.from_catalog(database=src_db, table_name=src_tbl)
src_df = src_data.toDF()

# get your destination data
dst_data = glueContext.create_dynamic_frame.from_catalog(database=dst_db, table_name=dst_tbl)
dst_df = dst_data.toDF()

# merge the two data frames and drop duplicate ids
# ("id" is assumed to be the unique key; Spark does not guarantee which
# copy of a duplicated id survives, so dedupe deliberately if that matters)
merged_df = dst_df.union(src_df).dropDuplicates(["id"])

# finally save data to the destination with overwrite mode
# (the default JDBC overwrite drops and recreates the table)
merged_df.write.format("jdbc").options(
    url=dest_jdbc_url,
    user=dest_user_name,
    password=dest_password,
    dbtable=dest_tbl,
).mode("overwrite").save()

answered Sep 29 '22 by Tharsan Sivakumar

Unfortunately there is no elegant way to do it with Glue. If you were writing to Redshift, you could use postactions to implement a Redshift merge (upsert) operation. However, that's not possible for other JDBC sinks (as far as I know).
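
For reference, a rough sketch of that Redshift pattern is below; the connection name, table names, and temp directory are placeholders, and my_dynamic_frame stands for the DynamicFrame built earlier in the job:

merge_sql = """
    BEGIN;
    DELETE FROM public.target USING public.target_staging
        WHERE public.target.id = public.target_staging.id;
    INSERT INTO public.target SELECT * FROM public.target_staging;
    DROP TABLE public.target_staging;
    END;
"""

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=my_dynamic_frame,                       # DynamicFrame built earlier in the job
    catalog_connection="my-redshift-connection",  # placeholder Glue connection
    connection_options={
        "database": "dev",
        "dbtable": "public.target_staging",       # load into a staging table first
        "postactions": merge_sql,                 # merge runs after the load completes
    },
    redshift_tmp_dir="s3://my-temp-bucket/redshift/",
)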

Alternatively, in your ETL script you can load the existing data from the database and use it to filter out existing records before saving. However, if your DB table is big then the job may take a while to process it.
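
A hedged sketch of that idea, reusing src_df, dst_df, and the JDBC options from the first answer and assuming "id" is the unique key:

# keep only incoming rows whose id is not already in the destination,
# then append them; existing rows are left untouched
new_rows_df = src_df.join(dst_df, on="id", how="left_anti")

new_rows_df.write.format("jdbc").options(
    url=dest_jdbc_url,
    user=dest_user_name,
    password=dest_password,
    dbtable=dest_tbl,
).mode("append").save()

Note that this only inserts brand-new records; rows whose id already exists but whose values changed would still need a separate update step.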

Another approach is to write into a staging table with 'overwrite' mode first (replacing the existing staging data) and then make a call to the DB to copy only new records into the final table.
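
As an illustrative sketch for a MySQL RDS target (table, column, and connection details are made up), the copy step after the staging overwrite could be an upsert run outside of Glue:

import pymysql

conn = pymysql.connect(host="my-rds-host", user="admin",
                       password="secret", database="mydb")
try:
    with conn.cursor() as cur:
        # upsert: insert new ids, update rows whose id already exists
        cur.execute("""
            INSERT INTO orders (id, amount, updated_at)
            SELECT id, amount, updated_at FROM orders_staging
            ON DUPLICATE KEY UPDATE
                amount = VALUES(amount),
                updated_at = VALUES(updated_at)
        """)
    conn.commit()
finally:
    conn.close()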

answered Sep 29 '22 by Yuriy Bondaruk