I have S3 data which is has GZIP
compression. I'm trying to create a table in Athena using this file, and my CREATE TABLE
statement succeeds - but when I query the table all rows are empty.
create external table mydatabase.table1 (
date date,
week_begin_date date,
week_end_date date,
value float
)
row format delimited fields terminated by ','
stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://my-bucket/some/path/'
How can I insist that Athena read my files as GZIP
?
Amazon Athena supports a wide variety of data formats like CSV, TSV, JSON, or Textfiles and also supports open source columnar formats such as Apache ORC and Apache Parquet. Athena also supports compressed data in Snappy, Zlib, LZO, and GZIP formats.
Experiment proved, ZSTD Level 9 and Level 19 are able to reduce Parquet file size by 8% and 12% compared to GZIP-based Parquet files, respectively. Moreover, both ZSTD Level 9 and Level 19 have decompression speeds faster than GZIP level 6.
Compress and split files You can speed up your queries dramatically by compressing your data, provided that files are splittable or of an optimal size (optimal S3 file size is between 200MB-1GB). Smaller data sizes mean less network traffic between Amazon S3 to Athena.
Compression. File compression is the act of taking a file and making it smaller. In Parquet, compression is performed column by column and it is built to support flexible compression options and extendable encoding schemas per data type – e.g., different encoding can be used for compressing integer and string data.
While Athena supports TBLPROPERTIES
metadata (we can set properties within a CREATE TABLE
, ALTER TABLE
to set these properties, and SHOW TBLPROPERTIES
to display properties of any table), it does not respect the TBLPROPERTIES ('compressionType'='gzip')
option.
There's no apparent way to force compression / decompression algorithm. Athena attempts to identify compression based on file extension. A GZIP
file with a .gz
suffix will be readable; a GZIP
file without that suffix will not.
Similarly, an uncompressed file with a .gz
suffix will fail. The reported error is
HIVE_CURSOR_ERROR: incorrect header check
Some investigation revealed the following:
GZIP
is to name it with a .gz
suffix..gzip
, .zip
, [^.]gz
GZIP
and uncompressed files can live happily side by side in an Athena table or partition - the compression detection is done at the file level, not at the table level.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With