Is there a way to create a table in Amazon Athena directly from parquet file based on avro schema? The schema is encoded into the file so its seems stupid that I need to actually create the DDL myself.
I saw this and also another duplication
but they are related directly to Hive, it wont work for Athena. Ideally I am looking for a way to do it programmatically without the need to define it at the console.
In IAM Role, choose Create an IAM Role and fill the suffix with something like 'athena-parquet'. Alternatively, you can opt to use a different IAM role with permissions for that S3 bucket. For Output, choose Add Database and create a database with the name 'athena-parquet'. Then choose Next.
Athena allows you to use open source columnar formats such as Apache Parquet and Apache ORC. Converting your data to columnar formats not only helps you improve query performance, but also save on costs.
You can create tables in Athena by using AWS Glue, the add table form, or by running a DDL statement in the Athena query editor.
To make the new table also use Parquet format, include the clause STORED AS PARQUET in the CREATE TABLE LIKE PARQUET statement. If the Parquet data file comes from an existing Impala table, currently, any TINYINT or SMALLINT columns are turned into INT columns in the new table.
This is now more-or-less possible using AWS Glue. Glue can crawl a bunch of different data sources, including Parquet files on S3. Discovered tables are added to the Glue data catalog and queryable from Athena. Depending on your needs, you could schedule a Glue crawler to run periodically, or you could define and run a crawler using the Glue API.
If you have many separate hunks of data that share a schema, you can also use a partitioned table to reduce the overhead of making new loads available to Athena. For example, I have some daily dumps that load into tables partitioned by date. As long as the schema doesn't change, all you then need to do is MSCK REPAIR TABLE
.
It doesn't seem to be possible with Athena as avro.schema.url
is not a supported property.
table property 'avro.schema.url' is not supported. (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException...)
You can use avro.schema.literal
(you would have to copy the avro json schema to the query) but I still experienced problems querying the data afterwards.
Strange errors like:
SYNTAX_ERROR: line 1:8: SELECT * not allowed in queries without FROM clause
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With