I have a query against a large number of big tables (rows and columns) with a number of joins, however one of tables has some duplicate rows of data causing issues for my query. Since this is a read only realtime feed from another department I can't fix that data, however I am trying to prevent issues in my query from it.
Given that, I need to add this crap data as a left join to my good query. The data set looks like:
IDNo FirstName LastName ... ------------------------------------------- uqx bob smith abc john willis ABC john willis aBc john willis WTF jeff bridges sss bill doe ere sally abby wtf jeff bridges ...
(about 2 dozen columns, and 100K rows)
My first instinct was to perform a distinct gave me about 80K rows:
SELECT DISTINCT P.IDNo FROM people P
But when I try the following, I get all the rows back:
SELECT DISTINCT P.* FROM people P
OR
SELECT DISTINCT(P.IDNo) AS IDNoUnq ,P.FirstName ,P.LastName ...etc. FROM people P
I then thought I would do a FIRST() aggregate function on all the columns, however that feels wrong too. Syntactically am I doing something wrong here?
Update: Just wanted to note: These records are duplicates based on a non-key / non-indexed field of ID listed above. The ID is a text field which although has the same value, it is a different case than the other data causing the issue.
Bookmark this question. Show activity on this post. LEFT JOIN is not equal to JOIN . The default JOIN is INNER .
SQL LEFT JOIN examples Each location belongs to one and only one country while each country can have zero or more locations. The relationship between the countries and locations tables is one-to-many.
distinct
is not a function. It always operates on all columns of the select list.
Your problem is a typical "greatest N per group" problem which can easily be solved using a window function:
select ... from ( select IDNo, FirstName, LastName, ...., row_number() over (partition by lower(idno) order by firstname) as rn from people ) t where rn = 1;
Using the order by
clause you can select which of the duplicates you want to pick.
The above can be used in a left join, see below:
select ... from x left join ( select IDNo, FirstName, LastName, ...., row_number() over (partition by lower(idno) order by firstname) as rn from people ) p on p.idno = x.idno and p.rn = 1 where ...
Add an identity column (PeopleID) and then use a correlated subquery to return the first value for each value.
SELECT * FROM People p WHERE PeopleID = ( SELECT MIN(PeopleID) FROM People WHERE IDNo = p.IDNo )
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With