Athena create table from parquet schema

Is there a way to create a table in Amazon Athena directly from a Parquet file, based on its Avro schema? The schema is encoded in the file itself, so it seems unnecessary that I have to write the DDL myself.

I saw this question and another duplicate,

but they relate specifically to Hive and won't work for Athena. Ideally I am looking for a way to do it programmatically, without having to define the table in the console.

NetanelRabinowitz asked Mar 29 '17

People also ask

How do you create a table from Parquet file in Athena?

In IAM Role, choose Create an IAM Role and fill the suffix with something like 'athena-parquet'. Alternatively, you can opt to use a different IAM role with permissions for that S3 bucket. For Output, choose Add Database and create a database with the name 'athena-parquet'. Then choose Next.

Can Athena query Parquet files?

Athena allows you to use open source columnar formats such as Apache Parquet and Apache ORC. Converting your data to columnar formats not only helps you improve query performance, but also helps you save on costs.

Can you create tables in Athena?

You can create tables in Athena by using AWS Glue, the add table form, or by running a DDL statement in the Athena query editor.
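The DDL route can be scripted rather than typed into the query editor. The sketch below builds a CREATE EXTERNAL TABLE statement for Parquet data; the database, table, columns, and S3 path are all hypothetical, and in practice you would submit the resulting string through the Athena console or its StartQueryExecution API.

```python
# Build an Athena CREATE EXTERNAL TABLE statement for Parquet data.
# The table name, columns, and S3 location are hypothetical examples.
columns = {
    "event_id": "string",
    "event_time": "timestamp",
    "payload": "string",
}

column_defs = ",\n  ".join(f"{name} {typ}" for name, typ in columns.items())

ddl = f"""CREATE EXTERNAL TABLE IF NOT EXISTS my_db.events (
  {column_defs}
)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/events/'"""

print(ddl)
```

Note that you still have to know the column names and types up front; the question's point stands that Athena will not read them out of the Parquet footer for you during CREATE TABLE.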

How do you create a table in Parquet format?

To make the new table also use Parquet format, include the clause STORED AS PARQUET in the CREATE TABLE LIKE PARQUET statement. If the Parquet data file comes from an existing Impala table, currently, any TINYINT or SMALLINT columns are turned into INT columns in the new table.


2 Answers

This is now more-or-less possible using AWS Glue. Glue can crawl a bunch of different data sources, including Parquet files on S3. Discovered tables are added to the Glue data catalog and are queryable from Athena. Depending on your needs, you could schedule a Glue crawler to run periodically, or you could define and run a crawler using the Glue API.
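As a sketch of the programmatic route, the crawler definition passed to Glue's create_crawler call looks roughly like this. The crawler name, IAM role ARN, database, and S3 path are all hypothetical, and the boto3 calls themselves are commented out since they require AWS credentials:

```python
# Hypothetical Glue crawler definition for Parquet files on S3.
# The role, database, and path are placeholders, not real resources.
crawler_config = {
    "Name": "parquet-events-crawler",
    "Role": "arn:aws:iam::123456789012:role/athena-parquet",
    "DatabaseName": "athena_parquet",
    "Targets": {"S3Targets": [{"Path": "s3://my-example-bucket/events/"}]},
    # Run daily at 2am UTC; omit Schedule to run on demand only.
    "Schedule": "cron(0 2 * * ? *)",
}

# With credentials configured, this would register and start the crawler:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_config)
# glue.start_crawler(Name=crawler_config["Name"])
```

Once the crawler has run, the discovered table appears in the named Glue database and can be queried from Athena immediately.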

If you have many separate hunks of data that share a schema, you can also use a partitioned table to reduce the overhead of making new loads available to Athena. For example, I have some daily dumps that load into tables partitioned by date. As long as the schema doesn't change, all you then need to do is MSCK REPAIR TABLE.
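A minimal sketch of that pattern (table and bucket names are made up): declare the partition key with PARTITIONED BY, land each dump under a matching dt=... prefix, and run MSCK REPAIR TABLE after each load to register the new partition.

```python
# Sketch of a date-partitioned Parquet table; names and paths are
# hypothetical. New dumps are expected under
# s3://my-example-bucket/dumps/dt=YYYY-MM-DD/ prefixes.
ddl = """CREATE EXTERNAL TABLE IF NOT EXISTS my_db.daily_dumps (
  event_id string,
  payload string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/dumps/'"""

# After each load, this statement discovers any new dt=... partitions:
repair = "MSCK REPAIR TABLE my_db.daily_dumps"
```

As the answer says, this only works as long as the schema stays stable; a schema change means revisiting the table definition itself.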

Steve McKay answered Oct 18 '22


It doesn't seem to be possible with Athena as avro.schema.url is not a supported property.

table property 'avro.schema.url' is not supported. (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException...)

You can use avro.schema.literal (you would have to copy the Avro JSON schema into the query), but I still experienced problems querying the data afterwards.
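A sketch of that workaround (the schema, table, and S3 path are hypothetical): serialize the Avro schema to JSON and inline it under avro.schema.literal in TBLPROPERTIES, since Athena rejects avro.schema.url:

```python
import json

# Hypothetical Avro schema, inlined in the DDL because Athena
# does not support the avro.schema.url table property.
avro_schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "event_id", "type": "string"},
        {"name": "payload", "type": "string"},
    ],
}

# json.dumps emits double quotes, so the single-quoted property is safe.
ddl = f"""CREATE EXTERNAL TABLE IF NOT EXISTS my_db.avro_events (
  event_id string,
  payload string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://my-example-bucket/avro-events/'
TBLPROPERTIES ('avro.schema.literal'='{json.dumps(avro_schema)}')"""
```

Note the column list still has to be repeated in the DDL alongside the literal schema, which is part of why this feels redundant.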

Strange errors like: SYNTAX_ERROR: line 1:8: SELECT * not allowed in queries without FROM clause

andresp answered Oct 19 '22