AWS Step Functions allow calling AWS Glue jobs, as described here: https://docs.aws.amazon.com/step-functions/latest/dg/connect-glue.html
I want to run the job and, after saving the results to S3, return some metadata produced during the job (such as the row count or the number of filtered rows) to the Step Functions workflow.
We can pass parameters from Step Functions to the Glue job like this:
"RunGlueJob": {
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Parameters": {
    "JobName": "MyJobName",
    "Arguments": {
      "--param1.$": "$.param1",
      "--param2.$": "$.param2"
    }
  },
  "Next": "NextState"
},
But how can the Glue job return output to the Step Functions workflow? I tried returning a String from the main() function inside the (Scala) Glue job, but it doesn't show up in the JSON returned to the workflow:
{
  "AllocatedCapacity": 3,
  "Arguments": {
    "--param1.$": "$.param1",
    "--param2.$": "$.param2"
  },
  "Attempt": 0,
  "CompletedOn": 1570114802442,
  "ExecutionTime": 39,
  "GlueVersion": "0.9",
  "Id": "jr_some_id",
  "JobName": "MyJobName",
  "JobRunState": "SUCCEEDED",
  "LastModifiedOn": 1570114802442,
  "LogGroupName": "/aws-glue/jobs",
  "MaxCapacity": 3,
  "PredecessorRuns": [],
  "StartedOn": 1570114746138,
  "Timeout": 2880
}
I cannot find any documentation on this, so it may simply not be possible. However, returning values from Lambda functions works fine, and they show up as expected inside the Step Functions workflow.
You can't return anything from a Glue job at this stage. AWS Glue is designed to work on large amounts of data, so its output is expected to be large as well, not a small value handed back to the caller.
Instead, write the result to DynamoDB, S3, or any other storage, and read it with a Lambda function in the next step of the Step Functions workflow.
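The workaround above could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the bucket name my-metadata-bucket, the glue-metadata/ key layout, and the helper names are all hypothetical, and the Glue job is assumed to know its own job name and run ID so that both sides agree on the key. The Lambda reads JobName and Id from its input, which matches the fields in the Glue task output shown in the question. A Scala Glue job would do the equivalent write with the AWS SDK for Java.

```python
import json


def metadata_key(job_name, run_id):
    # Hypothetical key layout: one small JSON object per job run.
    return f"glue-metadata/{job_name}/{run_id}.json"


def write_job_metadata(s3_client, bucket, job_name, run_id, metadata):
    # Glue side: call this at the end of the job, after the main results are saved.
    s3_client.put_object(
        Bucket=bucket,
        Key=metadata_key(job_name, run_id),
        Body=json.dumps(metadata).encode("utf-8"),
        ContentType="application/json",
    )


def read_job_metadata(s3_client, bucket, job_name, run_id):
    # Lambda side: fetch and parse the metadata object written by the job.
    response = s3_client.get_object(Bucket=bucket, Key=metadata_key(job_name, run_id))
    return json.loads(response["Body"].read())


def lambda_handler(event, context, s3_client=None, bucket="my-metadata-bucket"):
    # 'event' is assumed to be the output of the glue:startJobRun.sync task,
    # which contains "JobName" and "Id" (the job run ID).
    if s3_client is None:
        import boto3  # imported lazily so the helpers can be exercised without AWS

        s3_client = boto3.client("s3")
    metadata = read_job_metadata(s3_client, bucket, event["JobName"], event["Id"])
    # The returned dict becomes the state output available to later states.
    return {"JobName": event["JobName"], "JobRunId": event["Id"], "Metadata": metadata}
```

The Lambda task would be placed directly after RunGlueJob in the state machine, so that the metadata (for example row counts) ends up in the workflow state like any other Lambda return value.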