 

How to query Parquet data from Amazon Athena?

Athena creates a table over data stored in S3 using the fields you specify. I have done this using JSON data. Could you help me with how to create a table using Parquet data?

I have tried the following:

  1. Converted sample JSON data to Parquet.
  2. Uploaded the Parquet data to S3.
  3. Created a table using the columns of the JSON data.

By doing this I am able to execute a query, but the result is empty.

Is this approach right, or is there another approach to follow for Parquet data?

Sample JSON data:

{"_id":"0899f824e118d390f57bc2f279bd38fe","_rev":"1-81cc25723e02f50cb6fef7ce0b0f4f38","deviceId":"BELT001","timestamp":"2016-12-21T13:04:10:066Z","orgid":"fedex","locationId":"LID001","UserId":"UID001","SuperviceId":"SID001"},
{"_id":"0899f824e118d390f57bc2f279bd38fe","_rev":"1-81cc25723e02f50cb6fef7ce0b0f4f38","deviceId":"BELT001","timestamp":"2016-12-21T13:04:10:066Z","orgid":"fedex","locationId":"LID001","UserId":"UID001","SuperviceId":"SID001"}
asked Mar 14 '17 by rajeswari


People also ask

Can Athena read Parquet?

Athena allows you to use open-source columnar formats such as Apache Parquet and Apache ORC. Converting your data to a columnar format not only improves query performance but also saves on costs.
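For example, one way to convert an existing table's data to Parquet is a CREATE TABLE AS SELECT (CTAS) statement. The table names and output location in this sketch are assumptions:

-- Hypothetical names and location; CTAS writes the query
-- results to S3 as Parquet files and registers a new table.
CREATE TABLE my_table_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/parquet-output/'
) AS
SELECT * FROM my_table_json;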

Can Athena query text files?

Unlike Apache Drill, however, Athena is limited to data stored in Amazon's own S3 storage service. It can query a variety of file formats, including, but not limited to, CSV, Parquet, and JSON.

Can Athena read snappy Parquet?

For example, Athena can successfully read the data in a table that uses the Parquet file format when some Parquet files are compressed with Snappy and other Parquet files are compressed with GZIP. The same principle applies to the ORC, text-file, and JSON storage formats.


3 Answers

Steps:
  1. Create your my_table_json.
  2. Insert data into my_table_json (verify that the created JSON files exist in the table's LOCATION).
  3. Create my_table_parquet: the same CREATE statement as my_table_json, except with a STORED AS PARQUET clause added.
  4. Run: INSERT INTO my_table_parquet SELECT * FROM my_table_json (a sketch of all four steps follows below).
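
A minimal sketch of those steps, assuming a subset of the columns from the question's sample data (the bucket paths are placeholders):

-- Step 1: table over the original JSON files (hypothetical location).
CREATE EXTERNAL TABLE my_table_json (
  `_id` string,
  deviceId string,
  `timestamp` string,
  orgid string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/json/';

-- Step 2 happens outside SQL: upload the JSON files to s3://my-bucket/json/.

-- Step 3: identical columns, but stored as Parquet (hypothetical location).
CREATE EXTERNAL TABLE my_table_parquet (
  `_id` string,
  deviceId string,
  `timestamp` string,
  orgid string
)
STORED AS PARQUET
LOCATION 's3://my-bucket/parquet/';

-- Step 4: Athena writes Parquet files into my_table_parquet's location.
INSERT INTO my_table_parquet SELECT * FROM my_table_json;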

answered Oct 20 '22 by belostoky


If your data has been successfully stored in Parquet format, you would then create a table definition that references those files.

Here is an example statement that uses Parquet files:

CREATE EXTERNAL TABLE IF NOT EXISTS elb_logs_pq (
  request_timestamp string,
  elb_name string,
  request_ip string,
  request_port int,
  ...
  ssl_protocol string )
PARTITIONED BY(year int, month int, day int) 
STORED AS PARQUET
LOCATION 's3://athena-examples/elb/parquet/'
TBLPROPERTIES ("parquet.compress"="SNAPPY");

This example was taken from the AWS blog post Analyzing Data in S3 using Amazon Athena, which does an excellent job of explaining the benefits of using compressed and partitioned data in Amazon Athena.
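Because the table is partitioned by year, month, and day, filtering on those columns lets Athena skip data outside the requested partitions. A hypothetical query against the table above:

-- Scans only the Parquet files under the matching partition.
SELECT request_ip, COUNT(*) AS request_count
FROM elb_logs_pq
WHERE year = 2015 AND month = 1 AND day = 1
GROUP BY request_ip
ORDER BY request_count DESC
LIMIT 10;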

answered Oct 20 '22 by John Rotenstein


If your table definition is valid but your queries return no rows, try this:

-- The MSCK REPAIR TABLE command will load all partitions into the table.
-- This command can take a while to run depending on the number of partitions to be loaded.
MSCK REPAIR TABLE {tablename}
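
For example, with the partitioned table from the earlier answer (the single-partition path in the ALTER TABLE alternative is a hypothetical S3 location):

-- Load every partition found under the table's LOCATION.
MSCK REPAIR TABLE elb_logs_pq;

-- Or register a single partition explicitly (hypothetical path).
ALTER TABLE elb_logs_pq ADD PARTITION (year = 2015, month = 1, day = 1)
LOCATION 's3://athena-examples/elb/parquet/year=2015/month=1/day=1/';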

answered Oct 20 '22 by Eric Linden