If I define *.tsv files on Amazon S3 as the source for an Athena table and use OpenCSVSerde or LazySimpleSerDe as the deserializer, everything works correctly. But if I point the table at *.tar.gz files that contain *.tsv files, I see several strange rows in the table (e.g. a row that contains the tsv file name, and several empty rows). What is the right way to use tar.gz files in Athena?
Files in the tar.gz format are not supported. LZ4 – This member of the Lempel-Ziv 77 (LZ77) family also focuses on compression and decompression speed rather than maximum compression of data. LZ4 has the following framing formats: LZ4 Raw/Unframed – An unframed, standard implementation of the LZ4 block compression format.
Q: What data formats does Amazon Athena support? Amazon Athena supports a wide variety of data formats like CSV, TSV, JSON, or Textfiles and also supports open source columnar formats such as Apache ORC and Apache Parquet. Athena also supports compressed data in Snappy, Zlib, LZO, and GZIP formats.
Open the Athena console at https://console.aws.amazon.com/athena/ . If the console navigation pane is not visible, choose the expansion menu on the left. In the navigation pane, choose Data sources. From the list of data sources, choose the name of the data source that you want to view.
Tar GZ files are most commonly used for: Storing multiple files in one archive. Sending and receiving larger files in a compressed format. Compressing single files to store locally.
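The problem is tar: it adds extra rows. A tar archive puts a header block (containing the file name) in front of each file and pads the data with null bytes, and Athena reads those bytes as table rows. Athena can read *.gz files, but it cannot unpack tar archives. So in this case I have to use plain *.gz files instead of *.tar.gz.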
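For anyone who already has tar.gz archives, here is a minimal sketch of the conversion using only the Python standard library. The function name `tar_gz_to_gz_files`, its parameters, and the output naming are my own choices for illustration, not anything Athena requires. It assumes the archives are small enough to process locally and that the base names of the files inside an archive are unique.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def tar_gz_to_gz_files(archive_path, out_dir, suffix=".tsv"):
    """Re-compress each matching file inside a .tar.gz as its own .gz file.

    Hypothetical helper: the name and arguments are illustrative only.
    Returns the list of paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories, links, and files with a different extension.
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            source = archive.extractfile(member)
            if source is None:
                continue
            # Keep only the base name; assumes base names do not collide.
            target = out_dir / (Path(member.name).name + ".gz")
            with source, gzip.open(target, "wb") as compressed:
                shutil.copyfileobj(source, compressed)
            written.append(target)
    return written
```

The resulting `*.tsv.gz` files can then be uploaded to the S3 location that the table points at, in place of the tar.gz archives.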