Every Glue job script is supposed to end with job.commit(), but what exactly does this function do? Is it safe to execute any Python statement after job.commit() is called?
P.S. I have not found any description in PyGlue.zip with the AWS Python source code :(
To expand on @yspotts' answer: it is possible to execute more than one job.commit() in an AWS Glue job script, although the bookmark will be updated only once, as they mentioned. However, it is also safe to call job.init() more than once. In that case, the bookmarks are updated correctly with the S3 files processed since the previous commit.
In the init() function, there is an "initialised" marker that gets set to true. In the commit() function this marker is checked: if true, it performs the steps to commit the bookmark and resets the "initialised" marker; if false, it does nothing (see the sketch below).
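In other words, the init()/commit() pairing behaves roughly like the following sketch. This is only an illustration of the logic described above, not the actual awsglue.job.Job source:

# Illustrative sketch only -- not the real awsglue.job.Job implementation.
class Job:
    def __init__(self, glue_context):
        self.glue_context = glue_context
        self.initialised = False

    def init(self, job_name, args):
        # Sets the "initialised" marker so the next commit() takes effect.
        self.initialised = True

    def commit(self):
        if self.initialised:
            # Commit the bookmark state (paths/timestamps read so far)
            # and reset the marker, so further commits are no-ops
            # until init() is called again.
            self.initialised = False
        # If the marker is false, commit() does nothing.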
So, the only thing to change from @hoaxz's answer would be to call job.init() in every iteration of the for loop:
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)

# Init my job
job = Job(glue_context)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']

# Read each path individually, operate on it and commit
for s3_path in paths:
    job.init(args['JOB_NAME'], args)
    dynamic_frame = glue_context.create_dynamic_frame_from_options(
        connection_type='s3',
        connection_options={'paths': [s3_path]},
        format='json',
        transformation_ctx="path={}".format(s3_path))
    do_something(dynamic_frame)

    # Commit the files read so far to the Job Bookmark
    job.commit()
As of today, the only case where the Job object is useful is when using Job Bookmarks. When you read files from Amazon S3 (the only source supported for bookmarks so far) and call job.commit(), the timestamp and the paths read so far are stored internally, so that if for some reason you attempt to read that path again, you will only get back unread (new) files.
In this code sample, I try to read and process two different paths separately, committing after each path is processed. If for some reason I stop my job, the same files won't be processed again.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)

# Init my job
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']

# Read each path individually, operate on it and commit
for path in paths:
    try:
        dynamic_frame = glue_context.create_dynamic_frame_from_options(
            connection_type='s3',
            connection_options={'paths': [path]},
            format='json',
            transformation_ctx="path={}".format(path))
        do_something(dynamic_frame)

        # Commit the files read so far to the Job Bookmark
        job.commit()
    except Exception:
        # Something failed; this path's reads are not committed
        pass
Calling the commit method on a Job object only works if you have Job Bookmarks enabled, and the stored references are kept from JobRun to JobRun until you reset or pause your Job Bookmark. It is completely safe to execute more Python statements after a job.commit(), and as shown in the previous code sample, committing multiple times is also valid.
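For reference, Job Bookmarks can also be enabled per run and reset from outside the script, for example with boto3 (the job name below is just a placeholder; the bookmark option can likewise be set in the Glue console):

import boto3

glue = boto3.client('glue')

# Start a run with Job Bookmarks enabled for that run
glue.start_job_run(
    JobName='my-glue-job',
    Arguments={'--job-bookmark-option': 'job-bookmark-enable'})

# Reset the stored bookmark so previously committed paths are read again
glue.reset_job_bookmark(JobName='my-glue-job')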
Hope this helps
According to the AWS support team, commit should not be called more than once. Here is the exact response I got from them:
The method job.commit() can be called multiple times and it would not throw any error as well. However, if job.commit() would be called multiple times in a Glue script then job bookmark will be updated only once in a single job run that would be after the first time when job.commit() gets called and the other calls for job.commit() would be ignored by the bookmark. Hence, job bookmark may get stuck in a loop and would not able to work well with multiple job.commit(). Thus, I would recommend you to use job.commit() once in the Glue script.