Can you create views in Amazon Athena? outlines how to create a view using the User Interface.
I'd like to create an AWS Athena View programatically, ideally using Terraform (which calls CloudFormation).
I followed the steps outlined here: https://ujjwalbhardwaj.me/post/create-virtual-views-with-aws-glue-and-query-them-using-athena, however I run into an issue with this in that the view goes stale quickly.
...._view' is stale; it must be re-created.
The terraform code looks like this:
resource "aws_glue_catalog_table" "adobe_session_view" {
database_name = "${var.database_name}"
name = "session_view"
table_type = "VIRTUAL_VIEW"
view_original_text = "/* Presto View: ${base64encode(data.template_file.query_file.rendered)} */"
view_expanded_text = "/* Presto View */"
parameters = {
presto_view = "true"
comment = "Presto View"
}
storage_descriptor {
ser_de_info {
name = "ParquetHiveSerDe"
serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
}
columns { name = "first_column" type = "string" }
columns { name = "second_column" type = "int" }
...
columns { name = "nth_column" type = "string" }
}
An alternative I'd be happy to use is the AWS CLI, however aws athena [option]
provides no option for this.
I've tried:
You can create a view in the Athena console by using a template or by running an existing query. In the Athena console, next to Tables and views, choose Create, and then choose Create view. This action places an editable view template into the query editor. Edit the view template according to your requirements.
Athena views aren't materialized views so they aren't so beneficial for you. Instead of that I suggest to transform your data by using AWS Glue job before loading to QS. You can consider also partitioning or compresion of your source data.
To create an Athena databaseOpen the Athena console at https://console.aws.amazon.com/athena/ . If this is your first time to visit the Athena console in your current AWS Region, choose Explore the query editor to open the query editor. Otherwise, Athena opens in the query editor.
You can by using the Athena JDBC driver. This approach circumvents the catalog, as only Athena (and not Glue as of 25-Jan-2019) can directly access views. Download the driver and store the jar to an S3 bucket. Specify the S3 path to the driver as a dependent jar in your job definition.
Creating views programmatically in Athena is not documented, and unsupported, but possible. What happens behind the scenes when you create a view using StartQueryExecution
is that Athena lets Presto create the view and then extracts Presto's internal representation and puts it in the Glue catalog.
The staleness problem usually comes from the columns in the Presto metadata and the Glue metadata being out of sync. An Athena view really contains three descriptions of the view: the view SQL, the columns and their types in Glue format, and the columns and types in Presto format. If either of these get out of sync you will get the "… is stale; it must be re-created." error.
These are the requirements on a Glue table to work as an Athena view:
TableType
must be VIRTUAL_VIEW
Parameters
must contain presto_view: true
TableInput.ViewOriginalText
must contain an encoded Presto view (see below)StorageDescriptor.SerdeInfo
must be an empty mapStorageDescriptor.Columns
must contain all the columns that the view defines, with their typesThe tricky part is the encoded Presto view. That structure is created by this code: https://github.com/prestosql/presto/blob/27a1b0e304be841055b461e2c00490dae4e30a4e/presto-hive/src/main/java/io/prestosql/plugin/hive/HiveUtil.java#L597-L600, and this is more or less what it does:
/* Presto View:
(with a space after :
)*/
(with a space before *
)The JSON that describes the view looks like this:
catalog
property that must have the value awsdatacatalog
.schema
property that must be the name of the database where the view is created (i.e. it must match the DatabaseName
property of the surrounding Glue structure.name
and type
originalSql
property with the actual view SQL (not including CREATE VIEW …
, it should start with SELECT …
or WITH …
)Here's an example:
{
"catalog": "awsdatacatalog",
"schema": "some_database",
"columns": [
{"name": "col1", "type": "varchar"},
{"name": "col2", "type": "bigint"}
],
"originalSql": "SELECT col1, col2 FROM some_other_table"
}
One caveat here is that the types of the columns are almost, but not quite, the same as the names in Glue. If Athena/Glue would have string
the value in this JSON must be varchar
. If the Athena/Glue uses array<string>
the value in this JSON must be array(varchar)
, and struct<foo:int>
becomes row(foo int)
.
This is pretty messy, and putting it all together requires some fiddling and testing. The easiest way to get it working is to create a few views and decoding working the instructions above backwards to see how they look, and then try doing it yourself.
As you suggested, it is definitely possible to create an Athena view programmatically via the AWS CLI using the start-query-execution
. As you pointed out, this does require you to provide an S3 location for the results even though you won't need to check the file (Athena will put an empty txt file in the location for some reason).
Here is an example:
$ aws athena start-query-execution --query-string "create view my_view as select * from my_table" --result-configuration "OutputLocation=s3://my-bucket/tmp" --query-execution-context "Database=my_database"
{
"QueryExecutionId": "1744ed2b-e111-4a91-80ea-bcb1eb1c9c25"
}
You can avoid having the client specify a bucket by creating a workgroup and setting the location there.
You can check whether your view creation was successful by using the get-query-execution
command.
$ aws --region athena get-query-execution --query-execution-id bedf3eba-55b0-42de-9a7f-7c0ba71c6d9b
{
"QueryExecution": {
"QueryExecutionId": "1744ed2b-e111-4a91-80ea-bcb1eb1c9c25",
"Query": "create view my_view as select * from my_table",
"StatementType": "DDL",
"ResultConfiguration": {
"OutputLocation": "s3://my-bucket/tmp/1744ed2b-e111-4a91-80ea-bcb1eb1c9c25.txt"
},
"Status": {
"State": "SUCCEEDED",
"SubmissionDateTime": 1558744806.679,
"CompletionDateTime": 1558744807.312
},
"Statistics": {
"EngineExecutionTimeInMillis": 548,
"DataScannedInBytes": 0
},
"WorkGroup": "primary"
}
}
Updating the above examples for Terraform 0.12+ syntax, and adding in reading the view queries from the filesystem:
resource "null_resource" "athena_views" {
for_each = {
for filename in fileset("${path.module}/athenaviews/", "**"):
replace(filename,"/","_") => file("${path.module}/athenaviews/${filename}")
}
provisioner "local-exec" {
command = <<EOF
aws athena start-query-execution \
--output json \
--query-string CREATE OR REPLACE VIEW ${each.key} AS ${each.value} \
--query-execution-context "Database=${var.athena_database}" \
--result-configuration "OutputLocation=s3://${aws_s3_bucket.my-bucket.bucket}"
EOF
}
provisioner "local-exec" {
when = "destroy"
command = <<EOF
aws athena start-query-execution \
--output json \
--query-string DROP VIEW IF EXISTS ${each.key} \
--query-execution-context "Database=${var.athena_database}" \
--result-configuration "OutputLocation=s3://${aws_s3_bucket.my-bucket.bucket}"
EOF
}
}
Note also then when= "destroy"
block to ensure the views are dropped when your stack is torn down.
Place text files with a SELECT query below your module path under a directory (athenaview/ in this example), and it will pick them up and create views.
This will create views named subfolder_filename
, and destroy them if the files are removed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With