Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Specify compression type with Athena

I have S3 data which is has GZIP compression. I'm trying to create a table in Athena using this file, and my CREATE TABLE statement succeeds - but when I query the table all rows are empty.

create external table mydatabase.table1 (
   date date,
   week_begin_date date,
   week_end_date date,
   value float
)
row format delimited fields terminated by ','
stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'     
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://my-bucket/some/path/'

How can I insist that Athena read my files as GZIP?

like image 618
Kirk Broadhurst Avatar asked Feb 15 '18 22:02

Kirk Broadhurst


People also ask

What formats does AWS Athena support?

Amazon Athena supports a wide variety of data formats like CSV, TSV, JSON, or Textfiles and also supports open source columnar formats such as Apache ORC and Apache Parquet. Athena also supports compressed data in Snappy, Zlib, LZO, and GZIP formats.

Is ZSTD better than GZIP?

Experiment proved, ZSTD Level 9 and Level 19 are able to reduce Parquet file size by 8% and 12% compared to GZIP-based Parquet files, respectively. Moreover, both ZSTD Level 9 and Level 19 have decompression speeds faster than GZIP level 6.

How do you make Athena queries faster?

Compress and split files You can speed up your queries dramatically by compressing your data, provided that files are splittable or of an optimal size (optimal S3 file size is between 200MB-1GB). Smaller data sizes mean less network traffic between Amazon S3 to Athena.

What is Parquet compression?

Compression. File compression is the act of taking a file and making it smaller. In Parquet, compression is performed column by column and it is built to support flexible compression options and extendable encoding schemas per data type – e.g., different encoding can be used for compressing integer and string data.


1 Answers

While Athena supports TBLPROPERTIES metadata (we can set properties within a CREATE TABLE, ALTER TABLE to set these properties, and SHOW TBLPROPERTIES to display properties of any table), it does not respect the TBLPROPERTIES ('compressionType'='gzip') option.

There's no apparent way to force compression / decompression algorithm. Athena attempts to identify compression based on file extension. A GZIP file with a .gz suffix will be readable; a GZIP file without that suffix will not.

Similarly, an uncompressed file with a .gz suffix will fail. The reported error is

HIVE_CURSOR_ERROR: incorrect header check

Some investigation revealed the following:

  • The only known way to have Athena recognize a file as a GZIP is to name it with a .gz suffix.
  • Other similar suffixes that do not work include .gzip, .zip, [^.]gz
  • GZIP and uncompressed files can live happily side by side in an Athena table or partition - the compression detection is done at the file level, not at the table level.
like image 187
Kirk Broadhurst Avatar answered Sep 30 '22 12:09

Kirk Broadhurst