Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Partitions not returning any results in Amazon Athena

I have created a table in amazon athena. I have my data as CSV file(one large file having data from 2010-2015).It is not partitioned in s3. I am trying to create partitions in athena but partitions are not pulling any results.

step 1: creating table in athena

CREATE EXTERNAL TABLE IF NOT EXISTS DATABASE.table1 (
NULL string,
OrderID string,
CustomerID string,
EmailAddress string
)
PARTITIONED BY (OrderDate STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
 'serialization.format' = ',',
  'quoteChar' = '"',
  'field.delim' = ','
) LOCATION 's3://XYZ/Orders/'
TBLPROPERTIES ('has_encrypted_data'='false');

After this STEP 2: Loading partitions manually (my orderdate format is mm/dd/yy hrs min)

ALTER TABLE table1 ADD PARTITION (orderdate='01/01/2010 00:00') location 's3://xyz/Orders/'

Its says partition has been created successfully but when i query against this i dont get any results.So am I creating partitions in a wrong way? why it is not returning results?Any help is appreciated.

After trying out the comments below i am still not able to retrieve any results when i query data using order date.I uploaded my files again to athena(two separate files for two years) where the path of files looks like this

s3://xyz/Orders/year/orders+2010.csv
 s3://xyz/Orders/year/orders+2014.csv

After doing this I am still not able to the partitions and query data using date range.

like image 596
jsingh Avatar asked Nov 08 '17 23:11

jsingh


1 Answers

Partitions are sub-directories. When correctly defined Athena can skip over sub-directories that do not contain relevant data.

For example, if a table is partitioned by date (YYYY-MM-DD), there would be a directory called date=2017-11-08 that contains only data for that day.

Given that all your data is contained in a single CSV, you cannot use partitions. If you were to divide the file into individual daily files within an appropriate directory structure, Athena would run faster for you and would be cheaper to operate.

Athena cannot generate the partitions for you -- you must provide Athena with data that is already partitioned. The ADD PARTITION and MSCK REPAIR TABLE commands merely scan in and use the existing partitions.

See AWS Big Data blog: Analyzing Data in S3 using Amazon Athena

like image 144
John Rotenstein Avatar answered Oct 04 '22 18:10

John Rotenstein