
How to include AWS Glue crawler in Step Function

This is my requirement: I have a crawler and a PySpark job in AWS Glue, and I have to set up the workflow using Step Functions.

Questions:

  1. How can I add the crawler as the first state? Which parameters do I need to provide (Resource, Type, etc.)?
  2. How can I make sure that the next state, the PySpark job, starts only after the crawler has run successfully?
  3. Is there any way to schedule the Step Functions state machine to run at a particular time?

References:

  • Manage AWS Glue Jobs with Step Functions
dragonachu asked Jan 29 '20 11:01



1 Answer

A few months late to answer this, but it can be done from within the Step Function itself. You can create the following states to achieve it:

  • TriggerCrawler: Task state: Triggers a Lambda function. Within this Lambda function you can write code that starts the AWS Glue crawler using any of the AWS SDKs.
  • PollCrawlerStatus: Task state: A Lambda function that polls for the crawler status and returns it as the Lambda response.
  • IsCrawlerRunSuccessful: Choice state: Checks the status returned by the previous state. Once the Glue crawler state is 'READY', it goes to the state that triggers your Glue job; otherwise it goes to the Wait state for a few seconds before you poll again.
  • RunGlueJob: Task state: A Lambda function that triggers the Glue job.
  • WaitForCrawler: Wait state: Waits for 'n' seconds before you poll for the status again.
  • Finish: Succeed state.
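The three Lambda-backed states above could be sketched roughly as follows in Python with boto3. This is only an illustration, not the code behind the answer: the handler names, the `crawler_name` and `job_name` event keys, and the shape of the returned dictionaries are all assumptions. The Glue calls used are `start_crawler`, `get_crawler`, and `start_job_run`.

```python
def start_crawler(glue, crawler_name):
    """Start the crawler; `glue` is a boto3 Glue client (or a test double)."""
    glue.start_crawler(Name=crawler_name)
    return {"crawler_name": crawler_name}


def crawler_status(glue, crawler_name):
    """Return the crawler's current state and the outcome of its last crawl."""
    crawler = glue.get_crawler(Name=crawler_name)["Crawler"]
    return {
        "crawler_name": crawler_name,
        # 'RUNNING', 'STOPPING', or 'READY' (READY means idle)
        "state": crawler["State"],
        # 'SUCCEEDED', 'CANCELLED', or 'FAILED'; None if it never ran
        "last_crawl_status": crawler.get("LastCrawl", {}).get("Status"),
    }


def start_job(glue, job_name):
    """Start the Glue job and return its run id."""
    run = glue.start_job_run(JobName=job_name)
    return {"job_name": job_name, "job_run_id": run["JobRunId"]}


def _glue_client():
    # Imported lazily so the helpers above can be exercised without AWS access.
    import boto3
    return boto3.client("glue")


# Hypothetical Lambda entry points, one per Task state.
# The event keys 'crawler_name' and 'job_name' are assumed state machine input.
def trigger_crawler_handler(event, context):
    return start_crawler(_glue_client(), event["crawler_name"])


def poll_crawler_status_handler(event, context):
    return crawler_status(_glue_client(), event["crawler_name"])


def run_glue_job_handler(event, context):
    return start_job(_glue_client(), event["job_name"])
```

One thing to keep in mind: a crawler state of 'READY' only says the crawler is idle again. If the job must run only after a successful crawl, the Choice state could additionally require that `last_crawl_status` equals 'SUCCEEDED'.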

Here is what this Step Function looks like:

[Image: Step Function workflow diagram]
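The states listed in this answer could be wired together with an Amazon States Language definition along these lines, built here as a Python dictionary and serialized to JSON. It is a sketch under assumptions: the Lambda ARNs are placeholders, the 30-second wait is arbitrary, and the Choice state assumes the polling Lambda returns an object with a `state` field, which is stored under `$.crawler`.

```python
import json


def build_definition(trigger_arn, poll_arn, job_arn, wait_seconds=30):
    """Build a state machine definition for crawler -> wait/poll -> Glue job."""
    return {
        "StartAt": "TriggerCrawler",
        "States": {
            "TriggerCrawler": {
                "Type": "Task",
                "Resource": trigger_arn,
                "ResultPath": None,  # JSON null: discard result, keep the input
                "Next": "PollCrawlerStatus",
            },
            "PollCrawlerStatus": {
                "Type": "Task",
                "Resource": poll_arn,
                "ResultPath": "$.crawler",  # attach poll result to the input
                "Next": "IsCrawlerRunSuccessful",
            },
            "IsCrawlerRunSuccessful": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.crawler.state",
                        "StringEquals": "READY",
                        "Next": "RunGlueJob",
                    }
                ],
                "Default": "WaitForCrawler",
            },
            "WaitForCrawler": {
                "Type": "Wait",
                "Seconds": wait_seconds,
                "Next": "PollCrawlerStatus",
            },
            "RunGlueJob": {
                "Type": "Task",
                "Resource": job_arn,
                "Next": "Finish",
            },
            "Finish": {"Type": "Succeed"},
        },
    }


if __name__ == "__main__":
    # Placeholder ARNs; replace REGION, ACCOUNT_ID, and the function names.
    prefix = "arn:aws:lambda:REGION:ACCOUNT_ID:function:"
    definition = build_definition(
        prefix + "trigger-crawler",
        prefix + "poll-crawler-status",
        prefix + "run-glue-job",
    )
    print(json.dumps(definition, indent=2))
```

The printed JSON can be pasted into the state machine definition in the Step Functions console. Keeping the poll result under `$.crawler` means the original input (for example the crawler and job names) is still available to the later states.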

frosty answered Oct 22 '22 08:10
