Would someone be able to provide an example of what an AWS CloudFormation AWS::Glue::Workflow template would look like?

I have been searching for an example of how to set up CloudFormation for a Glue workflow that includes triggers, jobs, and crawlers, but I haven't been able to find much information on it.

This is the only information I have been able to find from AWS:

{
  "Type" : "AWS::Glue::Workflow",
  "Properties" : {
      "DefaultRunProperties" : Json,
      "Description" : String,
      "Name" : String,
      "Tags" : Json
    }
}
Travis Brannan asked Oct 08 '19 21:10





2 Answers

Here's an example of a workflow with one crawler and a job to be run after the crawler finishes.

The workflow is assembled by setting the WorkflowName property on each trigger.

I believe a workflow can have only one SCHEDULED or ON_DEMAND trigger to start it. All the other triggers in the workflow need to be CONDITIONAL on the jobs or crawlers that run before them. That's probably how the DAG gets built.

Also note how the workflow parameters are defined as JSON in DefaultRunProperties.

---
AWSTemplateFormatVersion: '2010-09-09'

Parameters:
  BaseBucket:
    Description: Bucket used by my workflow jobs
    Type: String

Resources:
  MyWorkflow:
    Type: AWS::Glue::Workflow
    Properties: 
      DefaultRunProperties:
        {
          "workflowParameter1": "Foo",
          "workflowParameter2": "Bar",
          "bucket": { "Fn::Sub": "${BaseBucket}" }
        }
      Description: Workflow for orchestrating my jobs
      Name: MyWorkflowName

  WorkflowCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: MyCrawler
      Role: MyCrawlerRole
      Description: A crawler to run as the first step in the workflow
      DatabaseName: MyDatabase
      Targets:
        S3Targets:
          - Path: !Sub "s3://${BaseBucket}/"

  WorkflowJob:
    Type: AWS::Glue::Job
    Properties:
      Description: Glue job to run after the crawler
      Name: MyWorkflowJob
      Role: MyJobRole
      Command:
        Name: pythonshell
        PythonVersion: "3"
        ScriptLocation: !Sub "s3://${BaseBucket}/my_workflow_job_script.py"

  WorkflowStartTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: StartTrigger
      Type: ON_DEMAND
      Description: Trigger for starting the workflow
      Actions:
        - CrawlerName: !Ref WorkflowCrawler
      WorkflowName: !Ref MyWorkflow

  WorkflowJobTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: CrawlerSuccessfulTrigger
      Type: CONDITIONAL
      StartOnCreation: True
      Description: Trigger to start the glue job
      Actions:
        - JobName: !Ref WorkflowJob
      Predicate:
        Conditions:
          - LogicalOperator: EQUALS
            CrawlerName: !Ref WorkflowCrawler
            CrawlState: SUCCEEDED
      WorkflowName: !Ref MyWorkflow
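
If the workflow should start on a schedule rather than on demand, the start trigger could be swapped for a SCHEDULED one. The fragment below is a minimal sketch meant to replace WorkflowStartTrigger under Resources in the template above. The trigger name and the cron expression are assumptions chosen for illustration.

```yaml
  WorkflowStartTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: ScheduledStartTrigger            # hypothetical name
      Type: SCHEDULED
      Schedule: "cron(0 2 * * ? *)"          # assumed schedule: every day at 02:00 UTC
      StartOnCreation: true                  # activate the trigger when it is created
      Description: Scheduled trigger for starting the workflow
      Actions:
        - CrawlerName: !Ref WorkflowCrawler  # same first step as the ON_DEMAND version
      WorkflowName: !Ref MyWorkflow
```

The CONDITIONAL trigger for the job would stay unchanged, since it depends only on the crawler's state.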
antti answered Sep 28 '22 17:09



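Here is an example of a Glue workflow that uses triggers, crawlers, and a job to convert JSON to Parquet. The resources go under Resources in the template; the crawlers and the job referenced with !Ref (RawJSONCrawler, JSONToParquetETLJob, RawParquetCrawler) are not shown: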

JSONtoParquetWorkflow:
  Type: AWS::Glue::Workflow
  Properties: 
    Name: json-to-parquet-workflow
    Description: Workflow for orchestrating JSON to Parquet conversion
RawJSONCrawlerTrigger:
  Type: AWS::Glue::Trigger
  Properties:
    WorkflowName: !Ref JSONtoParquetWorkflow
    Name: raw-json-crawler-trigger
    Description: Start crawler for raw JSON data
    Type: ON_DEMAND
    Actions:
      - CrawlerName: !Ref RawJSONCrawler
JSONToParquetETLJobTrigger:
  Type: AWS::Glue::Trigger
  Properties:
    WorkflowName: !Ref JSONtoParquetWorkflow
    Name: json-to-parquet-etl-trigger
    Description: Start JSON to Parquet ETL job
    Type: CONDITIONAL
    StartOnCreation: True
    Predicate:
      Conditions:
        - LogicalOperator: EQUALS
          CrawlerName: !Ref RawJSONCrawler
          CrawlState: SUCCEEDED
    Actions:
      - JobName: !Ref JSONToParquetETLJob
RawParquetCrawlerTrigger:
  Type: AWS::Glue::Trigger
  Properties:
    WorkflowName: !Ref JSONtoParquetWorkflow
    Name: raw-parquet-crawler-trigger
    Description: Start crawler for raw Parquet data
    Type: CONDITIONAL
    StartOnCreation: True
    Predicate:
      Conditions:
        - LogicalOperator: EQUALS
          JobName: !Ref JSONToParquetETLJob
          State: SUCCEEDED
    Actions:
      - CrawlerName: !Ref RawParquetCrawler
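
The fragment above references two crawlers and a job that it does not define. A minimal sketch of what those resources might look like follows. The resource names, role, database, bucket paths, and script location are all hypothetical placeholders, not values from the answer.

```yaml
RawJSONCrawler:
  Type: AWS::Glue::Crawler
  Properties:
    Name: raw-json-crawler                   # hypothetical name
    Role: MyGlueRole                         # hypothetical IAM role name or ARN
    DatabaseName: my_database                # hypothetical Data Catalog database
    Targets:
      S3Targets:
        - Path: s3://my-bucket/raw/json/     # hypothetical input location
JSONToParquetETLJob:
  Type: AWS::Glue::Job
  Properties:
    Name: json-to-parquet-etl-job            # hypothetical name
    Role: MyGlueRole
    Command:
      Name: glueetl                          # Spark ETL job
      PythonVersion: "3"
      ScriptLocation: s3://my-bucket/scripts/json_to_parquet.py   # hypothetical script
RawParquetCrawler:
  Type: AWS::Glue::Crawler
  Properties:
    Name: raw-parquet-crawler                # hypothetical name
    Role: MyGlueRole
    DatabaseName: my_database
    Targets:
      S3Targets:
        - Path: s3://my-bucket/raw/parquet/  # hypothetical output location of the ETL job
```

With these in place, !Ref on each crawler or job resolves to its name, which is what the CrawlerName and JobName fields in the triggers expect.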
abk answered Sep 28 '22 15:09
