Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Selecting specific files for athena

While creating a table in Athena, I am not able to create tables using specific files. Is there any way to select all the files starting with "year_2019" from a given bucket? For e.g. s3://bucketname/prefix/year_2019*.csv


The documentation is very clear about it and it is not allowed.

From: https://docs.aws.amazon.com/athena/latest/ug/tables-location-format.html

Athena reads all files in an Amazon S3 location you specify in the CREATE TABLE statement, and cannot ignore any files included in the prefix. When you create tables, include in the Amazon S3 path only the files you want Athena to read. Use AWS Lambda functions to scan files in the source location, remove any empty files, and move unneeded files to another location.

I will like to know if the community has found some work-around :)

like image 273
shantanuo Avatar asked Mar 13 '19 05:03

shantanuo


1 Answers

Unfortunately the filesystem abstraction that Athena uses for S3 doesn't support this. It requires table locations to look like directories, and Athena will add a slash to the end of the location when listing files.

There is a way to create tables that contain only a selection of files, but as far as I know it does not support wildcards, only explicit lists of files.

What you do is you create a table with

STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'

and then instead of pointing the LOCATION of the table to the actual files, you point it to a prefix with a single symlink.txt file (or point each partition to a prefix with a single symlink.txt). In the symlink.txt file you add the S3 URIs of the files to include in the table, one per line.

The only documentation that I know of for this feature is the S3 Inventory documentation for integrating with Athena.

You can also find a full example in this Stackoverflow response: https://stackoverflow.com/a/55069330/1109

like image 54
Theo Avatar answered Oct 21 '22 10:10

Theo