Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can I use Athena View as a source for a AWS Glue Job?

I'm trying to use an Athena View as a data source to my AWS Glue Job. The error message I'm getting while trying to run the Glue job is about the classification of the view. What can I define it as? Thank you

Error Message Appearing

like image 315
Nikitas Bompolias Avatar asked Nov 01 '18 13:11

Nikitas Bompolias


People also ask

What is an AWS glue job in Athena?

An AWS Glue job runs a script that extracts data from sources, transforms the data, and loads it into targets. For more information, see Authoring Jobs in Glue in the AWS Glue Developer Guide . Tables that you create in Athena must have a table property added to them called a classification, which identifies the format of the data.

How to optimize query performance in Athena with AWS glue?

AWS Glue jobs can help you transform data to a format that optimizes query performance in Athena. Data formats have a large impact on query performance and query costs in Athena. We recommend to use Parquet and ORC data formats.

What is the glue catalog in AWS?

The Glue catalog is used as a central hive-compatible metadata catalog for your data in AWS S3. It can be used across AWS services – Glue ETL, Athena, EMR, Lake formation, AI/ML etc. A key difference between Glue and Athena is that Athena is primarily used as a query tool for analytics and Glue is more of a transformation and data movement tool.

How does Athena connect to my data stored in Amazon S3?

Athena can connect to your data stored in Amazon S3 using the AWS Glue Data Catalog to store metadata such as table and column names. After the connection is made, your databases, tables, and views appear in Athena's query editor.


1 Answers

You can by using the Athena JDBC driver. This approach circumvents the catalog, as only Athena (and not Glue as of 25-Jan-2019) can directly access views.

  1. Download the driver and store the jar to an S3 bucket.
  2. Specify the S3 path to the driver as a dependent jar in your job definition.
  3. Load the data into a dynamic frame using the code below (using an IAM user with permission to run Athena queries).
from awsglue.dynamicframe import DynamicFrame
# ...
athena_view_dataframe = (
    glueContext.read.format("jdbc")
    .option("user", "[IAM user access key]")
    .option("password", "[IAM user secret access key]")
    .option("driver", "com.simba.athena.jdbc.Driver")
    .option("url", "jdbc:awsathena://athena.us-east-1.amazonaws.com:443")
    .option("dbtable", "my_database.my_athena_view")
    .option("S3OutputLocation","s3://bucket/temp/folder") # CSVs/metadata dumped here on load
    .load()
    )

athena_view_datasource = DynamicFrame.fromDF(athena_view_dataframe, glueContext, "athena_view_source")

The driver docs (pdf) provide alternatives to IAM user auth (e.g. SAML, custom provider).

The main side effect to this approach is that loading causes the query results to be dumped in CSV format to the bucket specified with the S3OutputLocation key.

I don't believe that you can create a Glue Connection to Athena via JDBC because you can't specify an S3 path to the driver location.

Attribution: AWS support totally helped me get this working.

like image 107
Alejandro C De Baca Avatar answered Sep 30 '22 04:09

Alejandro C De Baca