Create AWS Athena view programmatically

Tags:

Can you create views in Amazon Athena? outlines how to create a view using the User Interface.

I'd like to create an AWS Athena View programatically, ideally using Terraform (which calls CloudFormation).

I followed the steps outlined here: https://ujjwalbhardwaj.me/post/create-virtual-views-with-aws-glue-and-query-them-using-athena, however I run into an issue with this in that the view goes stale quickly.

...._view' is stale; it must be re-created.

The terraform code looks like this:

Click to copy

resource "aws_glue_catalog_table" "adobe_session_view" {

  database_name = "${var.database_name}"
  name = "session_view"

  table_type = "VIRTUAL_VIEW"
  view_original_text = "/* Presto View: ${base64encode(data.template_file.query_file.rendered)} */"
  view_expanded_text = "/* Presto View */"

  parameters = {
    presto_view = "true"
    comment = "Presto View"
  }

  storage_descriptor {
    ser_de_info {
      name = "ParquetHiveSerDe"
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }

    columns { name = "first_column" type = "string" }
    columns { name = "second_column" type = "int" }
    ...
    columns { name = "nth_column" type = "string" }
}

An alternative I'd be happy to use is the AWS CLI, however aws athena [option] provides no option for this.

I've tried:

create-named-query which I have not been able to get working for a statement such as CREATE OR REPLACE VIEW as this doesn't seem to be the intended use case for this command.
start-query-execution which asks for an output location, which suggests that this is meant for querying the data and outputting the results, as opposed to making stateful changes/creations. It also seems to be paired with stop-query-execution.

597

asked May 24 '19 09:05

tjheslin1

3 Answers

Creating views programmatically in Athena is not documented, and unsupported, but possible. What happens behind the scenes when you create a view using StartQueryExecution is that Athena lets Presto create the view and then extracts Presto's internal representation and puts it in the Glue catalog.

The staleness problem usually comes from the columns in the Presto metadata and the Glue metadata being out of sync. An Athena view really contains three descriptions of the view: the view SQL, the columns and their types in Glue format, and the columns and types in Presto format. If either of these get out of sync you will get the "… is stale; it must be re-created." error.

These are the requirements on a Glue table to work as an Athena view:

TableType must be VIRTUAL_VIEW
Parameters must contain presto_view: true
TableInput.ViewOriginalText must contain an encoded Presto view (see below)
StorageDescriptor.SerdeInfo must be an empty map
StorageDescriptor.Columns must contain all the columns that the view defines, with their types

The tricky part is the encoded Presto view. That structure is created by this code: https://github.com/prestosql/presto/blob/27a1b0e304be841055b461e2c00490dae4e30a4e/presto-hive/src/main/java/io/prestosql/plugin/hive/HiveUtil.java#L597-L600, and this is more or less what it does:

Adds a prefix /* Presto View: (with a space after :)
Adds a base 64 encoded JSON string that contains the view SQL, the columns and their types, and some catalog metadata (see below)
Adds a suffix */ (with a space before *)

The JSON that describes the view looks like this:

A catalog property that must have the value awsdatacatalog.
A schema property that must be the name of the database where the view is created (i.e. it must match the DatabaseName property of the surrounding Glue structure.
A list of columns, each with a name and type
A originalSql property with the actual view SQL (not including CREATE VIEW …, it should start with SELECT … or WITH …)

Here's an example:

Click to copy

{
  "catalog": "awsdatacatalog",
  "schema": "some_database",
  "columns": [
    {"name": "col1", "type": "varchar"},
    {"name": "col2", "type": "bigint"}
  ],
  "originalSql": "SELECT col1, col2 FROM some_other_table"
}

One caveat here is that the types of the columns are almost, but not quite, the same as the names in Glue. If Athena/Glue would have string the value in this JSON must be varchar. If the Athena/Glue uses array<string> the value in this JSON must be array(varchar), and struct<foo:int> becomes row(foo int).

This is pretty messy, and putting it all together requires some fiddling and testing. The easiest way to get it working is to create a few views and decoding working the instructions above backwards to see how they look, and then try doing it yourself.

106

answered Sep 20 '22 14:09

Theo

As you suggested, it is definitely possible to create an Athena view programmatically via the AWS CLI using the start-query-execution. As you pointed out, this does require you to provide an S3 location for the results even though you won't need to check the file (Athena will put an empty txt file in the location for some reason).

Here is an example:

Click to copy

$ aws athena start-query-execution --query-string "create view my_view as select * from my_table" --result-configuration "OutputLocation=s3://my-bucket/tmp" --query-execution-context "Database=my_database"

{
    "QueryExecutionId": "1744ed2b-e111-4a91-80ea-bcb1eb1c9c25"
}

You can avoid having the client specify a bucket by creating a workgroup and setting the location there.

You can check whether your view creation was successful by using the get-query-execution command.

Click to copy

$ aws --region athena get-query-execution --query-execution-id bedf3eba-55b0-42de-9a7f-7c0ba71c6d9b
{
    "QueryExecution": {
        "QueryExecutionId": "1744ed2b-e111-4a91-80ea-bcb1eb1c9c25",
        "Query": "create view my_view as select * from my_table",
        "StatementType": "DDL",
        "ResultConfiguration": {
            "OutputLocation": "s3://my-bucket/tmp/1744ed2b-e111-4a91-80ea-bcb1eb1c9c25.txt"
        },
        "Status": {
            "State": "SUCCEEDED",
            "SubmissionDateTime": 1558744806.679,
            "CompletionDateTime": 1558744807.312
        },
        "Statistics": {
            "EngineExecutionTimeInMillis": 548,
            "DataScannedInBytes": 0
        },
        "WorkGroup": "primary"
    }
}

answered Sep 20 '22 14:09

JD D

Updating the above examples for Terraform 0.12+ syntax, and adding in reading the view queries from the filesystem:

Click to copy

resource "null_resource" "athena_views" {
  for_each = {
    for filename in fileset("${path.module}/athenaviews/", "**"):
           replace(filename,"/","_") => file("${path.module}/athenaviews/${filename}")
  }

  provisioner "local-exec" {
    command = <<EOF
    aws athena start-query-execution \
      --output json \
      --query-string CREATE OR REPLACE VIEW ${each.key} AS ${each.value} \
      --query-execution-context "Database=${var.athena_database}" \
      --result-configuration "OutputLocation=s3://${aws_s3_bucket.my-bucket.bucket}"
EOF
  }

  provisioner "local-exec" {
    when    = "destroy"
    command = <<EOF
    aws athena start-query-execution \
      --output json \
      --query-string DROP VIEW IF EXISTS ${each.key} \
      --query-execution-context "Database=${var.athena_database}" \
      --result-configuration "OutputLocation=s3://${aws_s3_bucket.my-bucket.bucket}"
EOF
  }
}

Note also then when= "destroy" block to ensure the views are dropped when your stack is torn down.

Place text files with a SELECT query below your module path under a directory (athenaview/ in this example), and it will pick them up and create views. This will create views named subfolder_filename, and destroy them if the files are removed.

answered Sep 18 '22 14:09

Joshua Samuel

Related questions
                            
                                Is there option to redirect http traffic to https in aws network load balancer
                            
                                How organize terraform modules for common load balancer
                            
                                RDS Serverless - Slow warm up?
                            
                                Update Angular app on AWS S3 + Cloudfront
                            
                                How to change the protocol from http to https in AWS ELB load balancer?
                            
                                How can I get the TaskId of a Fargate ecs Container
                            
                                Verify hostname of the server who invoked the API
                            
                                Using AWS API: Given IAM User, Get Current Effective IAM Policy Document
                            
                                How to use Cognito identity pool with UnAuthenticatd users in Amplify for Android
                            
                                aws glue `ImportError: cannot import name 'S3ArnParamHandler'`
                            
                                How do I specify another AWS account's Event Bus as a target of an EventBridge rule using CloudFormation or CDK?
                            
                                how to deploy in different environment (dev, uat ,prod) using cdk pipeline?
                            
                                what is the equivalent component of AWS Cognito in GCP for features like userpools, triggers and appclients?
                            
                                StartQueryExecution operation: Unable to verify/create output bucket
                            
                                Issue with EventBridge rule for aws.events
                            
                                For an Amazon S3 bucket deplolyent from Guithub how do I fix the error AccessControlListNotSupported: The bucket does not allow ACLs?
                            
                                Is it possible to have more than 5 read replicas in Amazon RDS?
                            
                                AWS ECS Fargate Container Healthcheck command
                            
                                Adding type-hinting to functions that return boto3 objects?
                            
                                AuthError - Error: Amplify has not been configured correctly

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Create AWS Athena view programmatically

Tags:

amazon-web-services

amazon-cloudformation

terraform

terraform-provider-aws

amazon-athena

tjheslin1

People also ask

3 Answers

Theo

JD D

Joshua Samuel

Recent Activity

Donate For Us