Duplicate results in an AWS Athena (Presto) DISTINCT SQL Query?

Tags:

I have a bunch of files on S3 that contain just MD5s, one per line. I created an AWS Athena table to run a de-duplication query against the MD5s. In total there are hundreds of millions of MD5s in those files and in the table.

Athena Table Creation Query:

CREATE EXTERNAL TABLE IF NOT EXISTS database.md5s (
  `md5` string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://bucket/folder/';

Here are all the "dedup" queries I've tried (These should all be the same):

SELECT DISTINCT md5
FROM md5s;`

SELECT md5
FROM md5s
GROUP BY md5;

SELECT md5
FROM md5s
GROUP BY DISTINCT md5;

SELECT DISTINCT md5
FROM md5s
GROUP BY DISTINCT md5;

All results output .csvs from Athena still have repeated MD5s. What gives?

Is Athena Doing Partial Deduplication? - Even more peculiar, if I perform a COUNT(DISTINCT md5) in Athena, the count I get is different than the number of rows returned on export.

COUNT(DISTINCT md5) in Athena: 97,533,226
records in export of distinct MD5s: 97,581,616
there 14,790 duplicates in the results export, so both the COUNT(DISTINCT) counts are bad, and the results export are bad.

Is Athena CREATING Duplicates on Export? - The plot thickens. If I query my Athena Table for one of the MD5s that is duplicated in the Athena result export, I only get one result/row from the table. I tested this with a LIKE query to make sure whitespace wasn't causing the issue. This means Athena is ADDING duplicates to the export. There are never more than two of the same MD5 in the results.

select
  md5,
  to_utf8(md5)
from md5s
where md5 like '%0061c3d72c2957f454eef9d4b05775d7%';

Are Athena's Counts & Results File Both Wrong? - I deduped these same records using MySQL, and ended up with 97,531,010 unique MD5s. Athenas counts and results details are below.

COUNT(DISTINCT md5) in Athena: 97,533,226
records in export of distinct MD5s: 97,581,616
there 14,790 duplicates in the results export, so it seems that both the COUNT(DISTINCT) counts are bad, and the results export are bad.

I think this is an Athena bug - I've filed a ticket with AWS's dev team to get this fixed, and will update this post when it is.

Here is the related AWS Forum Post where other users are seeing the same issues. https://forums.aws.amazon.com/thread.jspa?messageID=764702

860

asked Feb 01 '17 01:02

T. Brian Jones

2 Answers

I have confirmed with the AWS team, that this was a known bug with AWS Athena at the time the question was asked. I'm not sure if this has been resolved.

answered Oct 23 '22 23:10

T. Brian Jones

When in doubt please use CTAS to remove any duplicates :

CREATE TABLE new_table
WITH (
      format = 'Parquet',
      parquet_compression = 'SNAPPY')
AS SELECT DISTINCT *
FROM old_table;

Reference: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html

answered Oct 23 '22 22:10

Avish Saha

Related questions
                            
                                Using hive table over parquet in Pig
                            
                                Hive: where + in does not use partition?
                            
                                TIMESTAMP format issue in HIVE
                            
                                Delete data from .Trash in hdfs
                            
                                Hive RegexSerDe Multiline Log matching
                            
                                Why hive_staging file is missing in AWS EMR
                            
                                UDF not working in Spark SQL
                            
                                ARRAY_CONTAINS muliple values in hive
                            
                                Is there a performance difference for these two Hive queries joining two tables and filtering on a partition key?
                            
                                how to map column names in a hive table and replace it with new values in hive table
                            
                                How to write subquery in select statement in hive
                            
                                Where does hive stores its table?
                            
                                How to use in-memory Derby database for testing with Hive (Scala)
                            
                                Simple oozie example of hive query?
                            
                                In Hive, how can I add a column only if that column does not exist?
                            
                                Multiple Spark applications with HiveContext
                            
                                Hive: Fast concatenate two tables into one?
                            
                                Spark SQL HiveContext - saveAsTable creates wrong schema
                            
                                Hive inserting values to an array complex type column
                            
                                How does the percentile function work in Hive?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Duplicate results in an AWS Athena (Presto) DISTINCT SQL Query?

Tags:

duplicates

hive

amazon-athena

presto

T. Brian Jones

People also ask

2 Answers

T. Brian Jones

Avish Saha

Recent Activity

Donate For Us