I am relatively new to AWS, and this may be a less technical question, but at present AWS Glue notes a maximum of 25 jobs permitted to be created. We are loading a series of tables, each with its own job that subsequently appends audit columns. Each job is very similar; only the source and target connection strings change.
Is there a way to parameterize these jobs for reuse and simply pass the proper connection strings to them? Or possibly loop through a set of connection strings in a master job that calls a child job, passing the varying connection strings through?
Any examples or documentation would be most appreciated.
You can configure a Job through the console, on the Job details tab, under the Job Parameters heading. You can also configure a Job through the AWS CLI by setting DefaultArguments on a Job or Arguments on a Job Run. Default Arguments and Job Parameters will stay with the Job through multiple runs.
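The same split between DefaultArguments and per-run Arguments can also be exercised from Python with boto3 rather than the CLI. A minimal sketch, assuming an existing job named example_job2 and connection-string parameter names of your own choosing (note the leading "--" on the keys, which Glue expects so the job script can resolve them):

import boto3

glue = boto3.client('glue')

# Set DefaultArguments on the job itself; these persist across runs.
# Role and Command are repeated here because configuration not specified
# in JobUpdate may be reset by the UpdateJob API.
glue.update_job(
    JobName='example_job2',
    JobUpdate={
        'Role': 'AWSGlueServiceDefaultRole',
        'Command': {'Name': 'glueetl', 'ScriptLocation': 's3://aws-glue-scripts/example_job'},
        'DefaultArguments': {'--SOURCE_CONN': 'source_default', '--TARGET_CONN': 'target_default'}
    }
)

# Arguments supplied to a run override the defaults for that run only
glue.start_job_run(JobName='example_job2',
                   Arguments={'--SOURCE_CONN': 'source_override', '--TARGET_CONN': 'target_override'})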
A Glue job definition is split into a few basic parts: the job name, the IAM role it runs under, the command (the job type and the script location), the default arguments, and the capacity settings for the job.
Maximum capacity is the number of AWS Glue data processing units (DPUs) that can be allocated when the job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. The Standard worker type has a 50 GB disk and 2 executors.
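These capacity settings are specified when the job is created rather than in the script. A brief sketch with boto3 (the job name, script location, and worker count below are illustrative assumptions):

import boto3

glue = boto3.client('glue')

# Pick a worker type and worker count for the job; on older Glue versions a
# single MaxCapacity value (in DPUs) can be supplied instead of these two fields
glue.create_job(
    Name='example_job_capacity',
    Role='AWSGlueServiceDefaultRole',
    Command={'Name': 'glueetl', 'ScriptLocation': 's3://aws-glue-scripts/example_job'},
    GlueVersion='3.0',
    WorkerType='G.1X',
    NumberOfWorkers=2
)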
In the example below I show how to use Glue job input parameters in the code. The code reads the input parameters and writes them to a flat file.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME, VAL1, VAL2, VAL3, DEST_FOLDER]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'VAL1', 'VAL2', 'VAL3', 'DEST_FOLDER'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Build a one-row DataFrame from the resolved parameter values
v_list = [{"VAL1": args['VAL1'], "VAL2": args['VAL2'], "VAL3": args['VAL3']}]
df = sc.parallelize(v_list).toDF()

# Write the parameters out as a single CSV file under the destination folder
df.repartition(1).write.mode('overwrite').format('csv').options(header=True, delimiter=';').save("s3://" + args['DEST_FOLDER'] + "/")

job.commit()
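The same parameters can also be supplied when the job is created and started programmatically. The snippet below sketches an AWS Lambda handler that creates the job with DefaultArguments and then starts a run whose Arguments override them.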
import boto3

def lambda_handler(event, context):
    glue = boto3.client('glue')
    # Note the leading "--" on the argument keys; Glue needs it so that
    # getResolvedOptions can pick the values up inside the job script
    myJob = glue.create_job(Name='example_job2', Role='AWSGlueServiceDefaultRole',
        Command={'Name': 'glueetl', 'ScriptLocation': 's3://aws-glue-scripts/example_job'},
        DefaultArguments={'--VAL1': 'value1', '--VAL2': 'value2', '--VAL3': 'value3'})
    # Arguments passed here override the job's DefaultArguments for this run only
    glue.start_job_run(JobName=myJob['Name'],
        Arguments={'--VAL1': 'value11', '--VAL2': 'value22', '--VAL3': 'value33'})
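To get at the looping approach mentioned in the question, the same start_job_run call can be driven from a list of connection strings, reusing one parameterized job for every table. A minimal sketch, in which the job name, argument keys, and connection values are assumptions for illustration:

import boto3

glue = boto3.client('glue')

# Hypothetical source/target connection pairs, one per table to load
connection_pairs = [
    {'--SOURCE_CONN': 'source_conn_table1', '--TARGET_CONN': 'target_conn_table1'},
    {'--SOURCE_CONN': 'source_conn_table2', '--TARGET_CONN': 'target_conn_table2'},
]

# Start one run of the same parameterized job per pair; inside the job script the
# values are read with getResolvedOptions(sys.argv, ['SOURCE_CONN', 'TARGET_CONN'])
for pair in connection_pairs:
    run = glue.start_job_run(JobName='example_job2', Arguments=pair)
    print(run['JobRunId'])

Keep in mind that a Glue job allows only one concurrent run by default; to run several of these in parallel, raise MaxConcurrentRuns in the job's ExecutionProperty, otherwise start the runs sequentially.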