My project is undergoing a transition to a new AWS account, and we are trying to find a way to persist our AWS Glue ETL bookmarks. We have a vast amount of processed data that we are replicating to the new account, and would like to avoid reprocessing.
It is my understanding that Glue bookmarks are just timestamps on the backend, and ideally we'd be able to get the old bookmark(s), and then manually set the bookmarks for the matching jobs in the new AWS account.
It looks like I could get my existing bookmarks via the AWS CLI using:
aws glue get-job-bookmark --job-name <value>
(Source)
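For reference, the same lookup from Python would presumably look something like the sketch below. It assumes boto3's Glue client and its `get_job_bookmark` call; the helper name `fetch_bookmark` and the job name in the usage comment are made up for illustration.

```python
def fetch_bookmark(glue_client, job_name):
    """Return the JobBookmarkEntry dict that Glue reports for job_name."""
    response = glue_client.get_job_bookmark(JobName=job_name)
    return response["JobBookmarkEntry"]


# Usage against a real account (not executed here):
#   import boto3
#   glue = boto3.client("glue")
#   entry = fetch_bookmark(glue, "my-etl-job")  # hypothetical job name
#   print(entry.get("JobBookmark"))
```

This only reads the bookmark from the old account; it does nothing toward writing one in the new account.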
However, I have been unable to find any way to set the bookmarks in the new account.
As far as workarounds, my best bets seem to be:
Really at a loss here and the AWS Glue forums are a ghost town and have not been helpful in the past.
I was not able to manually set a bookmark or get a bookmark to manually progress and skip data using the methods in the question above.
However, I was able to get the Glue ETL job to skip data and progress its bookmark using the following steps:
1. Ensure any Glue ETL schedule is disabled.
2. Add the files you'd like to skip to S3.
3. Crawl the S3 data.
4. Comment out the processing steps of your Glue ETL job's Spark code. I just commented out all of the dynamic_frame steps after the initial dynamic frame creation, up until job.commit():

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # GLUE_DATABASE_NAME and JOB_TABLE are assumed to be defined elsewhere in the job script
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    # Create dynamic frame from raw glue table
    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database=GLUE_DATABASE_NAME, table_name=JOB_TABLE,
        transformation_ctx="datasource0")

    # ~~ COMMENT OUT ADDITIONAL STEPS ~~ #

    job.commit()

5. Run the Glue ETL job with the bookmark enabled as usual.
6. Revert the Glue ETL Spark code back to normal.
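The parts of this that don't involve editing the script (disabling the schedule, crawling, and running the job) could be driven from boto3 instead of the console. The following is only a rough sketch: it assumes the schedule is a Glue trigger, that the crawler reports `RUNNING` as soon as it has been started, and that the caller passes in a boto3 Glue client. The function name and parameters are hypothetical.

```python
import time


def advance_bookmark_run(glue, trigger_name, crawler_name, job_name, poll_seconds=30):
    """Pause the schedule, crawl the new files, then run the job with bookmarks on."""
    # Disable the scheduled trigger so nothing else starts the job.
    glue.stop_trigger(Name=trigger_name)

    # Crawl the newly added S3 data and wait for the crawler to finish.
    glue.start_crawler(Name=crawler_name)
    while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)

    # Run the (temporarily stripped-down) job with bookmarks enabled.
    run = glue.start_job_run(
        JobName=job_name,
        Arguments={"--job-bookmark-option": "job-bookmark-enable"},
    )
    return run["JobRunId"]
```

The script still has to be edited before this runs and reverted afterwards, and the trigger re-enabled once everything is back to normal.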
Now the Glue ETL job's bookmark has advanced, and any data that would have been processed by the job run in step 5 has been skipped. The next time a file is added to S3 and crawled, the Glue ETL job will process it normally.
This can be useful if you know you will be getting some data that you don't want processed, or if you are transitioning to a new AWS account and are replicating all your old data over like I did. It would be nice if there were a way to manually set bookmark times in Glue so that this was not necessary.
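If you expect to do this more than once, commenting code in and out could probably be replaced by a job argument that switches the processing steps off. This is an untested variation on the steps above, not something I ran: the `--SKIP_PROCESSING` flag and the `skip_requested` helper are hypothetical, and it assumes Glue hands job arguments to the script as `--KEY value` pairs in `sys.argv`.

```python
def skip_requested(argv, flag="--SKIP_PROCESSING"):
    """Return True when argv holds `flag` followed by a truthy value."""
    if flag not in argv:
        return False
    position = argv.index(flag)
    value = argv[position + 1] if position + 1 < len(argv) else ""
    return value.lower() in ("1", "true", "yes")


# Inside the job script (not executed here):
#   datasource0 = glueContext.create_dynamic_frame.from_catalog(...)
#   if not skip_requested(sys.argv):
#       ...transform and write as usual...
#   job.commit()
```

The dynamic frame creation and job.commit() stay outside the condition, since those are the two pieces that were left in place when the bookmark advanced.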