I am relatively new to AWS, and this may be a less technical question, but at present AWS Glue notes a maximum of 25 jobs permitted to be created. We are loading a series of tables, each with its own job that subsequently appends audit columns. Each job is very similar; only the source and target connection strings change.
Is there a way to parameterize these jobs for reuse and simply pass the proper connection strings to them? Or possibly loop through a set of connection strings in a master job that calls a child job, passing the varying connection strings through?
Any examples or documentation would be most appreciated.
You can configure a Job through the console, on the Job details tab, under the Job Parameters heading. You can also configure a Job through the AWS CLI by setting DefaultArguments on a Job or Arguments on a Job Run. Default Arguments and Job Parameters will stay with the Job through multiple runs.
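The same split between DefaultArguments and per-run Arguments can also be exercised from Python with boto3 rather than the CLI. A minimal sketch, assuming an existing job named example_job2 and connection-string parameter names of your own choosing (note the leading "--" on the keys, which Glue expects so the job script can resolve them):

import boto3

glue = boto3.client('glue')

# Set DefaultArguments on the job itself; these persist across runs.
# Role and Command are repeated here because configuration not specified
# in JobUpdate may be reset by the UpdateJob API.
glue.update_job(
    JobName='example_job2',
    JobUpdate={
        'Role': 'AWSGlueServiceDefaultRole',
        'Command': {'Name': 'glueetl', 'ScriptLocation': 's3://aws-glue-scripts/example_job'},
        'DefaultArguments': {'--SOURCE_CONN': 'source_default', '--TARGET_CONN': 'target_default'}
    }
)

# Arguments supplied to a run override the defaults for that run only
glue.start_job_run(JobName='example_job2',
                   Arguments={'--SOURCE_CONN': 'source_override', '--TARGET_CONN': 'target_override'})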
A Glue job definition is split into a few basic parts: the job name, the IAM role it runs under, the command (the job type and the script location), the default arguments, and the capacity settings for the job.
Maximum capacity is the number of AWS Glue data processing units (DPUs) that can be allocated when the job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. The Standard worker type has a 50 GB disk and 2 executors.
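These capacity settings are specified when the job is created rather than in the script. A brief sketch with boto3 (the job name, script location, and worker count below are illustrative assumptions):

import boto3

glue = boto3.client('glue')

# Pick a worker type and worker count for the job; on older Glue versions a
# single MaxCapacity value (in DPUs) can be supplied instead of these two fields
glue.create_job(
    Name='example_job_capacity',
    Role='AWSGlueServiceDefaultRole',
    Command={'Name': 'glueetl', 'ScriptLocation': 's3://aws-glue-scripts/example_job'},
    GlueVersion='3.0',
    WorkerType='G.1X',
    NumberOfWorkers=2
)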
In the example below I show how to use Glue job input parameters in the code. The code reads the input parameters and writes them to a flat file.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME, VAL1, VAL2, VAL3, DEST_FOLDER]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'VAL1', 'VAL2', 'VAL3', 'DEST_FOLDER'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Build a one-row DataFrame from the resolved parameter values
v_list = [{"VAL1": args['VAL1'], "VAL2": args['VAL2'], "VAL3": args['VAL3']}]
df = sc.parallelize(v_list).toDF()

# Write the parameters out as a single CSV file under the destination folder
df.repartition(1).write.mode('overwrite').format('csv').options(header=True, delimiter=';').save("s3://" + args['DEST_FOLDER'] + "/")

job.commit()
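The same parameters can also be supplied when the job is created and started programmatically. The snippet below sketches an AWS Lambda handler that creates the job with DefaultArguments and then starts a run whose Arguments override them.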
import boto3

def lambda_handler(event, context):
    glue = boto3.client('glue')
    # Note the leading "--" on the argument keys; Glue needs it so that
    # getResolvedOptions can pick the values up inside the job script
    myJob = glue.create_job(Name='example_job2', Role='AWSGlueServiceDefaultRole',
        Command={'Name': 'glueetl', 'ScriptLocation': 's3://aws-glue-scripts/example_job'},
        DefaultArguments={'--VAL1': 'value1', '--VAL2': 'value2', '--VAL3': 'value3'})
    # Arguments passed here override the job's DefaultArguments for this run only
    glue.start_job_run(JobName=myJob['Name'],
        Arguments={'--VAL1': 'value11', '--VAL2': 'value22', '--VAL3': 'value33'})
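To get at the looping approach mentioned in the question, the same start_job_run call can be driven from a list of connection strings, reusing one parameterized job for every table. A minimal sketch, in which the job name, argument keys, and connection values are assumptions for illustration:

import boto3

glue = boto3.client('glue')

# Hypothetical source/target connection pairs, one per table to load
connection_pairs = [
    {'--SOURCE_CONN': 'source_conn_table1', '--TARGET_CONN': 'target_conn_table1'},
    {'--SOURCE_CONN': 'source_conn_table2', '--TARGET_CONN': 'target_conn_table2'},
]

# Start one run of the same parameterized job per pair; inside the job script the
# values are read with getResolvedOptions(sys.argv, ['SOURCE_CONN', 'TARGET_CONN'])
for pair in connection_pairs:
    run = glue.start_job_run(JobName='example_job2', Arguments=pair)
    print(run['JobRunId'])

Keep in mind that a Glue job allows only one concurrent run by default; to run several of these in parallel, raise MaxConcurrentRuns in the job's ExecutionProperty, otherwise start the runs sequentially.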