When creating a table in Athena, I am not able to restrict the table to specific files. Is there any way to select all the files starting with "year_2019" from a given bucket? For example: s3://bucketname/prefix/year_2019*.csv
The documentation is clear that this is not allowed.
From: https://docs.aws.amazon.com/athena/latest/ug/tables-location-format.html
Athena reads all files in an Amazon S3 location you specify in the CREATE TABLE statement, and cannot ignore any files included in the prefix. When you create tables, include in the Amazon S3 path only the files you want Athena to read. Use AWS Lambda functions to scan files in the source location, remove any empty files, and move unneeded files to another location.
I would like to know if the community has found a workaround :)
Unfortunately the filesystem abstraction that Athena uses for S3 doesn't support this. It requires table locations to look like directories, and Athena will add a slash to the end of the location when listing files.
There is a way to create tables that contain only a selection of files, but as far as I know it does not support wildcards, only explicit lists of files.
What you do is create a table with STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat', and then, instead of pointing the LOCATION of the table to the actual files, you point it to a prefix with a single symlink.txt file (or point each partition to a prefix with a single symlink.txt). In the symlink.txt file you add the S3 URIs of the files to include in the table, one per line.
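Since wildcards are not supported, the pattern has to be expanded into an explicit list before writing symlink.txt. Here is a minimal Python sketch of that step, not a tested Athena setup. The key listing, the column names, the table name, and the s3://bucketname/symlinks/year_2019/ prefix are all made up for illustration; in practice the keys would come from listing the bucket (for example with boto3), and the manifest would be uploaded as symlink.txt under the table's LOCATION. The OUTPUTFORMAT line follows the pattern used in the S3 Inventory documentation.

```python
from fnmatch import fnmatchcase


def build_symlink_manifest(bucket, keys, pattern):
    """Return symlink.txt content: one S3 URI per line for every key matching pattern."""
    matched = sorted(key for key in keys if fnmatchcase(key, pattern))
    return "".join(f"s3://{bucket}/{key}\n" for key in matched)


# Hypothetical object keys; a real script would list them from the bucket.
keys = [
    "prefix/year_2018_01.csv",
    "prefix/year_2019_01.csv",
    "prefix/year_2019_02.csv",
    "prefix/notes.txt",
]

# Expand the wildcard from the question into an explicit file list.
manifest = build_symlink_manifest("bucketname", keys, "prefix/year_2019*.csv")
print(manifest, end="")

# Sketch of the matching DDL. Columns, table name, and LOCATION are assumptions;
# LOCATION is the prefix that holds symlink.txt, not the data files themselves.
CREATE_TABLE_DDL = """
CREATE EXTERNAL TABLE year_2019 (
  id string,
  amount double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://bucketname/symlinks/year_2019/'
"""
```

Note that the manifest is static: files added to the bucket later are not picked up until symlink.txt is regenerated and uploaded again.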
The only documentation that I know of for this feature is the S3 Inventory documentation for integrating with Athena.
You can also find a full example in this Stack Overflow answer: https://stackoverflow.com/a/55069330/1109