Athena create table from parquet schema

Is there a way to create a table in Amazon Athena directly from a Parquet file, based on its Avro schema? The schema is encoded in the file itself, so it seems unnecessary that I have to write the DDL myself.

I saw this question and another duplicate,

but they relate specifically to Hive and won't work for Athena. Ideally I am looking for a way to do it programmatically, without having to define the table in the console.

NetanelRabinowitz asked Mar 29 '17

People also ask

How do you create a table from Parquet file in Athena?

In IAM Role, choose Create an IAM Role and fill the suffix with something like 'athena-parquet'. Alternatively, you can opt to use a different IAM role with permissions for that S3 bucket. For Output, choose Add Database and create a database with the name 'athena-parquet'. Then choose Next.

Can Athena query Parquet files?

Athena allows you to use open source columnar formats such as Apache Parquet and Apache ORC. Converting your data to columnar formats not only helps you improve query performance, but also helps you save on costs.

Can you create tables in Athena?

You can create tables in Athena by using AWS Glue, the add table form, or by running a DDL statement in the Athena query editor.
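The DDL route can be scripted rather than typed into the query editor. The sketch below builds a CREATE EXTERNAL TABLE statement for Parquet data; the database, table, columns, and S3 path are all hypothetical, and in practice you would submit the resulting string through the Athena console or its StartQueryExecution API.

```python
# Build an Athena CREATE EXTERNAL TABLE statement for Parquet data.
# The table name, columns, and S3 location are hypothetical examples.
columns = {
    "event_id": "string",
    "event_time": "timestamp",
    "payload": "string",
}

column_defs = ",\n  ".join(f"{name} {typ}" for name, typ in columns.items())

ddl = f"""CREATE EXTERNAL TABLE IF NOT EXISTS my_db.events (
  {column_defs}
)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/events/'"""

print(ddl)
```

Note that you still have to know the column names and types up front; the question's point stands that Athena will not read them out of the Parquet footer for you during CREATE TABLE.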

How do you create a table in Parquet format?

To make the new table also use Parquet format, include the clause STORED AS PARQUET in the CREATE TABLE LIKE PARQUET statement. If the Parquet data file comes from an existing Impala table, currently, any TINYINT or SMALLINT columns are turned into INT columns in the new table.


2 Answers

This is now more-or-less possible using AWS Glue. Glue can crawl a bunch of different data sources, including Parquet files on S3. Discovered tables are added to the Glue data catalog and are queryable from Athena. Depending on your needs, you could schedule a Glue crawler to run periodically, or you could define and run a crawler using the Glue API.
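As a sketch of the programmatic route, the crawler definition passed to Glue's create_crawler call looks roughly like this. The crawler name, IAM role ARN, database, and S3 path are all hypothetical, and the boto3 calls themselves are commented out since they require AWS credentials:

```python
# Hypothetical Glue crawler definition for Parquet files on S3.
# The role, database, and path are placeholders, not real resources.
crawler_config = {
    "Name": "parquet-events-crawler",
    "Role": "arn:aws:iam::123456789012:role/athena-parquet",
    "DatabaseName": "athena_parquet",
    "Targets": {"S3Targets": [{"Path": "s3://my-example-bucket/events/"}]},
    # Run daily at 2am UTC; omit Schedule to run on demand only.
    "Schedule": "cron(0 2 * * ? *)",
}

# With credentials configured, this would register and start the crawler:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_config)
# glue.start_crawler(Name=crawler_config["Name"])
```

Once the crawler has run, the discovered table appears in the named Glue database and can be queried from Athena immediately.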

If you have many separate hunks of data that share a schema, you can also use a partitioned table to reduce the overhead of making new loads available to Athena. For example, I have some daily dumps that load into tables partitioned by date. As long as the schema doesn't change, all you then need to do is MSCK REPAIR TABLE.
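A minimal sketch of that pattern (table and bucket names are made up): declare the partition key with PARTITIONED BY, land each dump under a matching dt=... prefix, and run MSCK REPAIR TABLE after each load to register the new partition.

```python
# Sketch of a date-partitioned Parquet table; names and paths are
# hypothetical. New dumps are expected under
# s3://my-example-bucket/dumps/dt=YYYY-MM-DD/ prefixes.
ddl = """CREATE EXTERNAL TABLE IF NOT EXISTS my_db.daily_dumps (
  event_id string,
  payload string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/dumps/'"""

# After each load, this statement discovers any new dt=... partitions:
repair = "MSCK REPAIR TABLE my_db.daily_dumps"
```

As the answer says, this only works as long as the schema stays stable; a schema change means revisiting the table definition itself.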

Steve McKay answered Oct 18 '22


It doesn't seem to be possible with Athena as avro.schema.url is not a supported property.

table property 'avro.schema.url' is not supported. (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException...)

You can use avro.schema.literal (you would have to copy the Avro JSON schema into the query), but I still experienced problems querying the data afterwards.
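A sketch of that workaround (the schema, table, and S3 path are hypothetical): serialize the Avro schema to JSON and inline it under avro.schema.literal in TBLPROPERTIES, since Athena rejects avro.schema.url:

```python
import json

# Hypothetical Avro schema, inlined in the DDL because Athena
# does not support the avro.schema.url table property.
avro_schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "event_id", "type": "string"},
        {"name": "payload", "type": "string"},
    ],
}

# json.dumps emits double quotes, so the single-quoted property is safe.
ddl = f"""CREATE EXTERNAL TABLE IF NOT EXISTS my_db.avro_events (
  event_id string,
  payload string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://my-example-bucket/avro-events/'
TBLPROPERTIES ('avro.schema.literal'='{json.dumps(avro_schema)}')"""
```

Note the column list still has to be repeated in the DDL alongside the literal schema, which is part of why this feels redundant.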

Strange errors like: SYNTAX_ERROR: line 1:8: SELECT * not allowed in queries without FROM clause

andresp answered Oct 19 '22