I have been searching for an example of how to set up Cloudformation for a glue workflow which includes triggers, jobs, and crawlers, but I haven't been able to find much information on it.
This is the only piece of information I am able to find from AWS
{
"Type" : "AWS::Glue::Workflow",
"Properties" : {
"DefaultRunProperties" : Json,
"Description" : String,
"Name" : String,
"Tags" : Json
}
}
AWS CloudFormation is designed to allow resource lifecycles to be managed repeatably, predictable, and safely, while allowing for automatic rollbacks, automated state management, and management of resources across accounts and regions.
What is an AWS CloudFormation template? A template is a declaration of the AWS resources that make up a stack. The template is stored as a text file whose format complies with the JavaScript Object Notation (JSON) or YAML standard.
CloudFormation creates a bucket for each region in which you upload a template file. The buckets are accessible to anyone with Amazon Simple Storage Service (Amazon S3) permissions in your AWS account. If a bucket created by CloudFormation is already present, the template is added to that bucket.
Here's an example of a workflow with one crawler and a job to be run after the crawler finishes.
It is defined through tagging the triggers with the WorkflowName.
I believe there can be only one SCHEDULED or ON_DEMAND trigger to start the workflow. All the other triggers in the workflow need to be CONDITIONAL on the jobs / crawlers. That's probably how CloudFormation knows how to build the DAG.
Also see how the workflow parameters are defined as a json in the DefaultRunProperties.
---
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
BaseBucket:
Description: Bucket used by my workflow jobs
Type: String
Resources:
MyWorkflow:
Type: AWS::Glue::Workflow
Properties:
DefaultRunProperties:
{
"workflowParameter1": "Foo",
"workflowParameter2": "Bar",
"bucket": { "Fn::Sub": "${BaseBucket}" }
}
Description: Workflow for orchestrating my jobs
Name: MyWorkflowName
WorkflowCrawler:
Type: AWS::Glue::Crawler
Properties:
Name: MyCrawler
Role: MyCrawlerRole
Description: A crawler to run as the first step in the workflow
DatabaseName: MyDatabase
Targets:
S3Targets:
- Path: !Sub "s3://${BaseBucket}/"
WorkflowJob:
Type: AWS::Glue::Job
Properties:
Description: Glue job to run after the crawler
Name: MyWorkflowJob
Role: MyJobRole
Command:
Name: pythonshell
PythonVersion: 3
ScriptLocation: !Sub "s3://${BaseBucket}/my_workflow_job_script.py"
WorkflowStartTrigger:
Type: AWS::Glue::Trigger
Properties:
Name: StartTrigger
Type: ON_DEMAND
Description: Trigger for starting the workflow
Actions:
- CrawlerName: !Ref WorkflowCrawler
WorkflowName: !Ref MyWorkflow
WorkflowJobTrigger:
Type: AWS::Glue::Trigger
Properties:
Name: CrawlerSuccessfulTrigger
Type: CONDITIONAL
StartOnCreation: True
Description: Trigger to start the glue job
Actions:
- JobName: !Ref WorkflowJob
Predicate:
Conditions:
- LogicalOperator: EQUALS
CrawlerName: !Ref WorkflowCrawler
CrawlState: SUCCEEDED
WorkflowName: !Ref MyWorkflow
Here is an example of a Glue workflow using triggers, crawlers and a job to convert JSON to Parquet:
JSONtoParquetWorkflow:
Type: AWS::Glue::Workflow
Properties:
Name: json-to-parquet-workflow
Description: Workflow for orchestrating JSON to Parquet conversion
RawJSONCrawlerTrigger:
Type: AWS::Glue::Trigger
Properties:
WorkflowName: !Ref JSONtoParquetWorkflow
Name: raw-json-crawler-trigger
Description: Start crawler for raw JSON data
Type: ON_DEMAND
Actions:
- CrawlerName: !Ref RawJSONCrawler
JSONToParquetETLJobTrigger:
Type: AWS::Glue::Trigger
Properties:
WorkflowName: !Ref JSONtoParquetWorkflow
Name: json-to-parquet-etl-trigger
Description: Start JSON to Parquet ETL job
Type: CONDITIONAL
StartOnCreation: True
Predicate:
Conditions:
- LogicalOperator: EQUALS
CrawlerName: !Ref RawJSONCrawler
CrawlState: SUCCEEDED
Actions:
- JobName: !Ref JSONToParquetETLJob
RawParquetCrawlerTrigger:
Type: AWS::Glue::Trigger
Properties:
WorkflowName: !Ref JSONtoParquetWorkflow
Name: raw-parquet-crawler-trigger
Description: Start crawler for raw Parquet data
Type: CONDITIONAL
StartOnCreation: True
Predicate:
Conditions:
- LogicalOperator: EQUALS
JobName: !Ref JSONToParquetETLJob
State: SUCCEEDED
Actions:
- CrawlerName: !Ref RawParquetCrawler
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With