Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Connecting Athena and S3 in same Cloudformation Stack

From the documentation, AWS::Athena::NamedQuery, it is unclear how to attach Athena to an S3 bucket specified in the same stack.

If I had to guess from the example, I would imagine that you can write a template like,

Resources:
  MyS3Bucket:
    Type: AWS::S3::Bucket
       ... other params ...

  AthenaNamedQuery:
    Type: AWS::Athena::NamedQuery
    Properties:
      Database: "db_name"
      Name: "MostExpensiveWorkflow"
      QueryString: >
                    CREATE EXTERNAL TABLE db_name.test_table 
                    (...) LOCATION s3://.../path/to/folder/

Would a template like the above work? Upon stack creation, will the table db_name.test_table be available to run queries on?

like image 523
ignorance Avatar asked Oct 05 '17 22:10

ignorance


People also ask

Does Athena work with S3?

Athena works directly with data stored in S3. Athena uses Presto, a distributed SQL engine to run queries. It also uses Apache Hive to create, drop, and alter tables and partitions.

Can Athena query S3 bucket?

Amazon Athena is defined as “an interactive query service that makes it easy to analyse data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.” So, it's another SQL query engine for large data sets stored in S3.

Is S3 SELECT the same as Athena?

So Whats the Difference Between S3 Select and Athena? S3 Select is a lightweight solution designed to let you use SQL to perform simple SELECT clauses on a maximum of one file. Amazon Athena is an analytics workhorse that allows you to perform SQL on extremely large datasets spanning many files with great performance.


1 Answers

Turns out the way you connect the S3 and Athena is to make a Glue table! How silly of me!! Of course Glue is how you connect things!

Sarcasm aside, this is a template that worked for me when using AWS::Glue::Table and AWS::Glue::Database,

Resources:
  MyS3Bucket:
    Type: AWS::S3::Bucket

  MyGlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      DatabaseInput:
        Name: my-glue-database
        Description: "Glue beats tape"
      CatalogId: !Ref AWS::AccountId

  MyGlueTable:
    Type: AWS::Glue::Table
    Properties:
      DatabaseName: !Ref MyGlueDatabase
      CatalogId: !Ref AWS::AccountId
      TableInput:
        Name: my-glue-table
        Parameters: { "classification" : "csv" }
        StorageDescriptor:
          Location:
            Fn::Sub: "s3://${MyS3Bucket}/"
          InputFormat: "org.apache.hadoop.mapred.TextInputFormat"
          OutputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
          SerdeInfo:
            Parameters: { "separatorChar" : "," }
            SerializationLibrary: "org.apache.hadoop.hive.serde2.OpenCSVSerde"
          StoredAsSubDirectories: false
          Columns:
            - Name: column0
              Type: string
            - Name: column1
              Type: string

After this, the database and table were in the AWS Athena Console!

like image 183
ignorance Avatar answered Sep 24 '22 02:09

ignorance