I have a database where each object property is stored in a separate row. The query below does not return distinct values in a Redshift database, but works as expected when tested against any MySQL-compatible database.
SELECT DISTINCT distinct_value
FROM (
    SELECT
        uri,
        ( SELECT DISTINCT value_string
          FROM `test_organization__app__testsegment` AS X
          WHERE X.uri = parent.uri
            AND name = 'hasTestString'
            AND parent.value_string IS NOT NULL ) AS distinct_value
    FROM `test_organization__app__testsegment` AS parent
    WHERE uri IN ( SELECT uri
                   FROM `test_organization__app__testsegment`
                   WHERE name = 'types' AND value_uri_multivalue = 'Document' )
) AS T
WHERE distinct_value IS NOT NULL
ORDER BY distinct_value ASC
LIMIT 10000 OFFSET 0
This is not a bug; the behavior is intentional, though not straightforward.
In Redshift, you can declare constraints on tables, but Redshift doesn't enforce them; it will happily store duplicate values if you insert them. The difference you are seeing comes from how the planner uses those declarations: when you run a SELECT DISTINCT query against a column that has no primary key declared, Redshift scans the whole column and filters the values down to a unique list. When you run the same query against a column covered by a primary key constraint, the planner assumes the values are already unique and returns the output without performing that filtering. That is how duplicate entries can end up in your result if you inserted them.
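Given that behavior, a minimal sketch of how duplicates slip through looks like this (the demo_dupes table and its data are hypothetical, not from the question):

-- Redshift accepts the PRIMARY KEY declaration but does not enforce it.
CREATE TABLE demo_dupes (
    id INTEGER PRIMARY KEY,
    label VARCHAR(20)
);

-- Both inserts succeed even though they violate the declared key.
INSERT INTO demo_dupes VALUES (1, 'a');
INSERT INTO demo_dupes VALUES (1, 'a');

-- Because the planner trusts the key, this can return (1, 'a') twice,
-- where a MySQL-compatible database would return a single row.
SELECT DISTINCT id, label FROM demo_dupes;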
Why is this done? Redshift is optimized for large datasets, and copying data is much faster when constraint validity does not have to be checked for every row that is copied or inserted. You can still declare a primary key constraint as part of your data model, but you then need to enforce it yourself, either by removing duplicates after loading or by designing your ETL so that duplicates never reach the table, as sketched below.
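One way to enforce the key yourself is to dedupe after loading, for example with ROW_NUMBER(); this is a sketch against the hypothetical demo_dupes table above, not the questioner's schema:

-- Keep exactly one row per key value.
CREATE TABLE demo_dupes_clean AS
SELECT id, label
FROM (
    SELECT id, label,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY label) AS rn
    FROM demo_dupes
) AS t
WHERE rn = 1;

-- Alternatively, dropping the unenforced key makes SELECT DISTINCT scan and
-- filter for unique values again (the constraint name below is hypothetical):
-- ALTER TABLE demo_dupes DROP CONSTRAINT demo_dupes_pkey;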
More information, with specific examples, is available in the Heap blog post Redshift Pitfalls And How To Avoid Them.