
Using tar.gz file as a source for Amazon Athena

If I define *.tsv files on Amazon S3 as the source for an Athena table and use OpenCSVSerde or LazySimpleSerDe as the deserializer, everything works correctly. But if I point the table at *.tar.gz files that contain *.tsv files, I see several strange rows in the table (e.g. a row that contains the tsv file name, and several empty rows). What is the right way to use tar.gz files in Athena?

Alexander Ershov asked Sep 20 '17 12:09



1 Answer

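The problem is tar, not gzip. Athena can read *.gz files, but it does not unpack tar archives. After decompression it sees the raw tar stream, so the tar header blocks (which contain the tsv file name) and the padding blocks show up as extra rows. So in this case I have to gzip each *.tsv file on its own and use *.gz instead of *.tar.gz.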
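For anyone who already has tar.gz archives, the conversion can be sketched roughly as below. This is a minimal sketch using only the Python standard library, not an official Athena tool. The function name `repack_tar_gz` is hypothetical, and reading the archive from S3 and uploading the results back are left out.

```python
import gzip
import io
import tarfile


def repack_tar_gz(tar_gz_bytes: bytes) -> dict:
    """Turn one tar.gz archive into separate gzip files, one per member.

    Returns a dict mapping '<member name>.gz' to the gzip-compressed
    contents of that member. (Hypothetical helper, for illustration only.)
    """
    result = {}
    with tarfile.open(fileobj=io.BytesIO(tar_gz_bytes), mode="r:gz") as archive:
        for member in archive:
            # Skip directories, links and other non-regular entries.
            if not member.isfile():
                continue
            data = archive.extractfile(member).read()
            # Plain gzip with no tar wrapper, so no header or padding rows.
            result[member.name + ".gz"] = gzip.compress(data)
    return result
```

Each resulting *.gz object can then be placed under the table's S3 location in place of the original archive.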

Alexander Ershov answered Sep 18 '22 00:09