
What actions does job.commit() perform in AWS Glue?

Every job script should end with job.commit(), but what exactly does this function do?

  1. Is it just a job end marker or not?
  2. Can it be called twice during one job (if yes, in what cases)?
  3. Is it safe to execute any Python statement after job.commit() is called?

P.S. I have not found any description in PyGlue.zip, which contains the AWS Python source code :(

Cherry asked Jan 14 '18 08:01


3 Answers

To expand on @yspotts's answer: it is possible to call job.commit() more than once in an AWS Glue job script, although, as they mentioned, the bookmark will be updated only once. However, it is also safe to call job.init() more than once. If you do, the bookmarks are updated correctly with the S3 files processed since the previous commit.

The init() function sets an "initialised" marker to true. The commit() function then checks this marker: if it is true, it performs the steps to commit the bookmark and resets the "initialised" marker. If it is false, it does nothing.
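The marker logic described above can be pictured with a small toy model. This is only a sketch of the behaviour as this answer describes it, not the actual awsglue source: the class name ToyJob and its bookmark_updates counter are made up for illustration.

```python
class ToyJob:
    """Toy stand-in for the init/commit marker logic (NOT the awsglue code)."""

    def __init__(self):
        self.initialised = False   # the "initialised" marker
        self.bookmark_updates = 0  # hypothetical counter, for illustration only

    def init(self, job_name):
        self.job_name = job_name
        self.initialised = True

    def commit(self):
        # Only commit when init() has been called since the last commit
        if not self.initialised:
            return False
        self.bookmark_updates += 1
        self.initialised = False   # reset the marker
        return True


job = ToyJob()
job.init("demo")
first = job.commit()    # marker was set, so the bookmark is updated
second = job.commit()   # marker was reset, so this call does nothing
job.init("demo")
third = job.commit()    # re-initialised, so the bookmark is updated again
```

In this model a second commit() without a fresh init() is a no-op, which matches the "updated only once" behaviour, while calling init() again makes the next commit() count.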

So, the only thing to change from @hoaxz's answer would be to call job.init() in every iteration of the for loop:

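import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)
# Create the job object; init() is called inside the loop
job = Job(glue_context)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']
# Read each path individually, operate on them and commit
for s3_path in paths:
    # Re-initialise before every commit so each bookmark update is applied
    job.init(args['JOB_NAME'], args)
    dynamic_frame = glue_context.create_dynamic_frame_from_options(
        connection_type='s3',
        connection_options={'paths': [s3_path]},
        format='json',
        transformation_ctx="path={}".format(s3_path))
    do_something(dynamic_frame)  # placeholder for your own processing
    # Commit file read to Job Bookmark
    job.commit()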

nanodgb answered Oct 08 '22 14:10


As of today, the only case where the Job object is useful is when using Job Bookmarks. When you read files from Amazon S3 (the only supported source for bookmarks so far) and call job.commit(), the time and the paths read so far are stored internally, so that if for some reason you attempt to read that path again, you will only get back unread (new) files.

In this code sample, I read and process two different paths separately, and commit after each path is processed. If for some reason I stop my job, the same files won't be processed again.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)
# Init my job
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']
# Read each path individually, operate on them and commit
for path in paths:
    try:
        dynamic_frame = glue_context.create_dynamic_frame_from_options(
            connection_type='s3',
            connection_options={'paths': [path]},
            format='json',
            transformation_ctx="path={}".format(path))
        do_something(dynamic_frame)  # placeholder for your own processing
        # Commit file read to Job Bookmark
        job.commit()
    except Exception as exc:
        # Something failed; report it and move on to the next path
        print("Failed to process {}: {}".format(path, exc))

Calling the commit method on a Job object only works if you have Job Bookmarks enabled, and the stored references are kept from JobRun to JobRun until you reset or pause your Job Bookmark. It is completely safe to execute more Python statements after a job.commit(), and as shown in the previous code sample, committing multiple times is also valid.
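To picture how stored references carry over from run to run, here is a rough toy model of a bookmark. It is a simplification under my own assumptions and not how Glue is implemented: it tracks file names only (no timestamps), and the class ToyBookmark, its methods, and the file names are all hypothetical.

```python
class ToyBookmark:
    """Toy model of a job bookmark (NOT the Glue implementation)."""

    def __init__(self):
        self.committed = set()  # references kept between runs
        self.pending = set()    # files read but not yet committed

    def read(self, available_files):
        # Return only the files that have not been committed yet
        new_files = sorted(set(available_files) - self.committed)
        self.pending.update(new_files)
        return new_files

    def commit(self):
        # Persist everything read since the last commit
        self.committed |= self.pending
        self.pending.clear()

    def reset(self):
        # Forget all stored references, like resetting the bookmark
        self.committed.clear()
        self.pending.clear()


bookmark = ToyBookmark()
run1 = bookmark.read(["a.json", "b.json"])
bookmark.commit()

# A new file arrives; only it is returned on the next read
run2 = bookmark.read(["a.json", "b.json", "c.json"])

# No commit happened after run2, so the same file comes back again
run3 = bookmark.read(["a.json", "b.json", "c.json"])
bookmark.commit()

# After a reset, everything is treated as unread
bookmark.reset()
run4 = bookmark.read(["a.json", "b.json", "c.json"])
```

The point of the sketch is only that a read without a following commit leaves the stored state untouched, so the same files are handed back the next time.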

Hope this helps

hoaxz answered Oct 08 '22 15:10


According to the AWS support team, commit should not be called more than once. Here is the exact response I got from them:

The method job.commit() can be called multiple times and it would not throw any error 
as well. However, if job.commit() would be called multiple times in a Glue script 
then job bookmark will be updated only once in a single job run that would be after 
the first time when job.commit() gets called and the other calls for job.commit() 
would be ignored by the bookmark. Hence, job bookmark may get stuck in a loop and 
would not able to work well with multiple job.commit(). Thus, I would recommend you 
to use job.commit() once in the Glue script.

yspotts answered Oct 08 '22 15:10