Terraform AWS Athena to use Glue catalog as db

I'm confused about how to use Terraform to connect Athena to my Glue Catalog database.

I use

resource "aws_glue_catalog_database" "catalog_database" {
  name = "${var.glue_db_name}"
}

resource "aws_glue_crawler" "datalake_crawler" {
  database_name = "${var.glue_db_name}"
  name          = "${var.crawler_name}"
  role          = "${aws_iam_role.crawler_iam_role.name}"
  description   = "${var.crawler_description}"
  table_prefix  = "${var.table_prefix}"
  schedule      = "${var.schedule}"

  s3_target {
    path = "s3://${var.data_bucket_name[0]}"
  }

  s3_target {
    path = "s3://${var.data_bucket_name[1]}"
  }
}

to create a Glue database and a crawler that crawls S3 buckets (two in this example), but I don't know how to link the Athena query service to the Glue database. The Terraform documentation for Athena doesn't appear to offer a way to connect Athena to a Glue catalog, only to an S3 bucket. Clearly, however, Athena can be integrated with Glue.

How can I use Terraform to set up an Athena database that uses my Glue catalog as its data source rather than an S3 bucket?

asked Mar 12 '19 19:03

Steven

1 Answer

Our current basic setup for having Glue crawl one S3 bucket and create/update a table in a Glue DB, which can then be queried in Athena, looks like this:

Crawler role and role policy:

The assume_role_policy of the IAM role needs only the Glue service (glue.amazonaws.com) as principal.
The IAM role policy allows actions for Glue, S3, and CloudWatch Logs.
The Glue actions and resources can probably be narrowed down to the ones actually needed.
The S3 actions are limited to those needed by the crawler.

resource "aws_iam_role" "glue_crawler_role" {
  name = "analytics_glue_crawler_role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "glue.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

resource "aws_iam_role_policy" "glue_crawler_role_policy" {
  name = "analytics_glue_crawler_role_policy"
  role = "${aws_iam_role.glue_crawler_role.id}"
  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:*"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetBucketAcl",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::analytics-product-data",
        "arn:aws:s3:::analytics-product-data/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:*:*:/aws-glue/*"
      ]
    }
  ]
}
EOF
}
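
As a rough sketch of how the Glue statement might be narrowed, the wildcard could be replaced by a list of catalog actions scoped to the one database. The action list below is an assumption about what a crawler typically needs, not a verified minimum, and the names glue_crawler_catalog and glue_crawler_catalog_policy are hypothetical. It is written in Terraform 0.12+ syntax and refers to the role and database defined elsewhere in this answer; test it against your own crawler before relying on it.

```hcl
# Hypothetical, untested narrowing of the "glue:*" statement above.
data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "glue_crawler_catalog" {
  statement {
    effect = "Allow"

    # Assumed set of catalog actions for a crawler that creates,
    # updates, and deletes tables and partitions.
    actions = [
      "glue:GetDatabase",
      "glue:GetTable",
      "glue:GetTables",
      "glue:CreateTable",
      "glue:UpdateTable",
      "glue:DeleteTable",
      "glue:GetPartition",
      "glue:GetPartitions",
      "glue:BatchGetPartition",
      "glue:BatchCreatePartition",
      "glue:UpdatePartition",
      "glue:BatchDeletePartition",
    ]

    # Scope to the catalog, the one database, and its tables.
    resources = [
      "arn:aws:glue:*:${data.aws_caller_identity.current.account_id}:catalog",
      "arn:aws:glue:*:${data.aws_caller_identity.current.account_id}:database/${aws_glue_catalog_database.analytics_db.name}",
      "arn:aws:glue:*:${data.aws_caller_identity.current.account_id}:table/${aws_glue_catalog_database.analytics_db.name}/*",
    ]
  }
}

resource "aws_iam_role_policy" "glue_crawler_catalog_policy" {
  name   = "analytics_glue_crawler_catalog_policy"
  role   = aws_iam_role.glue_crawler_role.id
  policy = data.aws_iam_policy_document.glue_crawler_catalog.json
}
```

If the crawler then fails with an access-denied error, the crawler log in CloudWatch should name the missing action, which can be added to the list.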

S3 Bucket, Glue Database and Crawler:

resource "aws_s3_bucket" "product_bucket" {
  bucket = "analytics-product-data"
  acl = "private"
}

resource "aws_glue_catalog_database" "analytics_db" {
  name = "inventory-analytics-db"
}

resource "aws_glue_crawler" "product_crawler" {
  database_name = "${aws_glue_catalog_database.analytics_db.name}"
  name = "analytics-product-crawler"
  role = "${aws_iam_role.glue_crawler_role.arn}"

  schedule = "cron(0 0 * * ? *)"

  configuration = "{\"Version\": 1.0, \"CrawlerOutput\": { \"Partitions\": { \"AddOrUpdateBehavior\": \"InheritFromTable\" }, \"Tables\": {\"AddOrUpdateBehavior\": \"MergeNewColumns\" } } }"

  schema_change_policy {
    delete_behavior = "DELETE_FROM_DATABASE"
  }

  s3_target {
    path = "s3://${aws_s3_bucket.product_bucket.bucket}/products"
  }
}
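
No extra resource is needed for the link itself: Athena reads table metadata from the Glue Data Catalog, so the database above becomes queryable in Athena once the crawler has populated it. If you also want Athena-side pieces in Terraform, a minimal sketch could look like the following. The results bucket, workgroup, and query names are hypothetical, and the table name products is an assumption about what the crawler derives from the products prefix.

```hcl
# Hypothetical bucket for Athena query results (name is an assumption).
resource "aws_s3_bucket" "athena_results" {
  bucket = "analytics-athena-results"
}

# Workgroup that writes query results to the bucket above.
resource "aws_athena_workgroup" "analytics" {
  name = "analytics"

  configuration {
    result_configuration {
      output_location = "s3://${aws_s3_bucket.athena_results.bucket}/results/"
    }
  }
}

# Saved query that runs against the Glue database defined above.
resource "aws_athena_named_query" "sample_products" {
  name      = "sample-products"
  workgroup = aws_athena_workgroup.analytics.id
  database  = aws_glue_catalog_database.analytics_db.name

  # Assumes the crawler created a table named "products".
  query = "SELECT * FROM products LIMIT 10;"
}
```

As far as I can tell, the bucket argument on aws_athena_database is where query results are written, not a data source, which may be the cause of the confusion in the question.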

answered Sep 16 '22 14:09

Martin

Tags:

amazon-web-services

terraform

terraform-provider-aws

aws-glue

aws-glue-data-catalog
