SQL Left Join first match only

Tags: sql, join, sql-server, tsql, greatest-n-per-group

I have a query against a large number of big tables (many rows and columns) with a number of joins. One of the tables has some duplicate rows, which cause problems for my query. Since this is a read-only realtime feed from another department, I can't fix that data, but I am trying to keep it from causing issues in my query.

Given that, I need to add this crap data as a left join to my good query. The data set looks like:

    IDNo    FirstName   LastName    ...
    -------------------------------------------
    uqx     bob         smith
    abc     john        willis
    ABC     john        willis
    aBc     john        willis
    WTF     jeff        bridges
    sss     bill        doe
    ere     sally       abby
    wtf     jeff        bridges
    ...

(about 2 dozen columns, and 100K rows)

My first instinct was to perform a DISTINCT, which gave me about 80K rows:

    SELECT DISTINCT P.IDNo
    FROM people P

But when I try the following, I get all the rows back:

    SELECT DISTINCT P.*
    FROM people P

OR

    SELECT
        DISTINCT(P.IDNo) AS IDNoUnq
        ,P.FirstName
        ,P.LastName
        ...etc.
    FROM people P

I then thought I would do a FIRST() aggregate function on all the columns, however that feels wrong too. Syntactically am I doing something wrong here?

Update: Just wanted to note: these records are duplicates based on the non-key, non-indexed ID field listed above. The ID is a text field, and the duplicates hold the same value in a different letter case, which is what causes the issue.

513

asked Nov 06 '13 23:11

Dave

2 Answers

distinct is not a function. It always operates on all columns of the select list.
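To make that concrete, here is a minimal T-SQL sketch. The table variable `@people` and its three rows are made up for illustration (they are not the real feed). The point is only that the parentheses in `DISTINCT(IDNo)` do not narrow the comparison to one column.

```sql
-- Hypothetical sample data: same IDNo, but one row differs in FirstName.
DECLARE @people TABLE (IDNo varchar(10), FirstName varchar(50), LastName varchar(50));

INSERT INTO @people (IDNo, FirstName, LastName) VALUES
    ('abc', 'john',   'willis'),
    ('abc', 'john',   'willis'),
    ('abc', 'johnny', 'willis');

-- One column in the select list: the duplicates collapse to a single row.
SELECT DISTINCT IDNo FROM @people;

-- The parentheses around IDNo only group an expression. DISTINCT still
-- compares (IDNo, FirstName, LastName) as a whole, so two rows come back.
SELECT DISTINCT (IDNo) AS IDNoUnq, FirstName, LastName FROM @people;

-- Equivalent to the previous statement.
SELECT DISTINCT IDNo AS IDNoUnq, FirstName, LastName FROM @people;
```

This is probably why `SELECT DISTINCT P.*` returned every row: as soon as any other column differs between two rows, they are no longer duplicates as far as DISTINCT is concerned.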

Your problem is a typical "greatest N per group" problem which can easily be solved using a window function:

    select ...
    from (
      select IDNo,
             FirstName,
             LastName,
             ....,
             row_number() over (partition by lower(idno) order by firstname) as rn
      from people
    ) t
    where rn = 1;

Using the order by clause you can select which of the duplicates you want to pick.
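As a rough, self-contained illustration, the sketch below loads the sample rows from the question into a table variable (`@people` is a stand-in for the real table) and keeps one row per case-insensitive `IDNo`. The `COLLATE Latin1_General_BIN` tie-breaker is an assumption added here so that the surviving spelling is chosen deterministically; any column that distinguishes the duplicates could be used instead.

```sql
-- Sample rows copied from the question; @people stands in for the real feed.
DECLARE @people TABLE (IDNo varchar(10), FirstName varchar(50), LastName varchar(50));

INSERT INTO @people (IDNo, FirstName, LastName) VALUES
    ('uqx', 'bob',   'smith'),
    ('abc', 'john',  'willis'),
    ('ABC', 'john',  'willis'),
    ('aBc', 'john',  'willis'),
    ('WTF', 'jeff',  'bridges'),
    ('sss', 'bill',  'doe'),
    ('ere', 'sally', 'abby'),
    ('wtf', 'jeff',  'bridges');

SELECT IDNo, FirstName, LastName
FROM (
    SELECT IDNo, FirstName, LastName,
           ROW_NUMBER() OVER (
               PARTITION BY LOWER(IDNo)                  -- group spellings that differ only by case
               ORDER BY IDNo COLLATE Latin1_General_BIN  -- assumption: binary order as the tie-breaker
           ) AS rn
    FROM @people
) t
WHERE rn = 1;   -- five rows: one per case-insensitive IDNo
```

Ordering by `firstname` alone, as in the query above, leaves the choice among the `john willis` rows up to the engine because they tie. That is fine when any one of the duplicates will do.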

The above can be used in a left join, see below:

    select ...
    from x
      left join (
        select IDNo,
               FirstName,
               LastName,
               ....,
               row_number() over (partition by lower(idno) order by firstname) as rn
        from people
      ) p on p.idno = x.idno and p.rn = 1
    where ...
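A small worked version of that join, as a sketch only: `@x` and its `Amount` column are hypothetical stand-ins for the "good" side of the query, and `@people` holds a few of the sample rows from the question. The join condition here lower-cases both sides, which is an adjustment to the query above. Under a case-sensitive collation, `p.idno = x.idno` could miss when the surviving duplicate has a different spelling than the row in `x`.

```sql
-- Hypothetical "good" table; the names and values are made up.
DECLARE @x TABLE (IDNo varchar(10), Amount int);
INSERT INTO @x (IDNo, Amount) VALUES ('abc', 10), ('sss', 20), ('zzz', 30);

-- A few of the sample rows from the question.
DECLARE @people TABLE (IDNo varchar(10), FirstName varchar(50), LastName varchar(50));
INSERT INTO @people (IDNo, FirstName, LastName) VALUES
    ('abc', 'john', 'willis'),
    ('ABC', 'john', 'willis'),
    ('aBc', 'john', 'willis'),
    ('sss', 'bill', 'doe');

SELECT x.IDNo, x.Amount, p.FirstName, p.LastName
FROM @x x
LEFT JOIN (
    SELECT IDNo, FirstName, LastName,
           ROW_NUMBER() OVER (PARTITION BY LOWER(IDNo) ORDER BY FirstName) AS rn
    FROM @people
) p
    ON  LOWER(p.IDNo) = LOWER(x.IDNo)   -- compare case-insensitively on both sides
    AND p.rn = 1;                       -- keep only the first match per IDNo
-- Three rows come back, one per row of @x; 'zzz' has NULL names.
```

Keeping `p.rn = 1` in the `ON` clause rather than in `WHERE` matters: moving it to `WHERE` would discard the rows of `x` that have no match, turning the left join into an inner join.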

182

answered Sep 16 '22 11:09

a_horse_with_no_name

Add an identity column (PeopleID) and then use a correlated subquery to return the first row for each IDNo value.

    SELECT *
    FROM People p
    WHERE PeopleID = (
        SELECT MIN(PeopleID)
        FROM People
        WHERE IDNo = p.IDNo
    )
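A hedged sketch of how this might look end to end. Since the question describes the feed as read-only, the example assumes the rows are first copied into a table variable that carries the identity column. That copy step, and the `LOWER()` comparison that handles the case differences, are assumptions added here and are not part of the answer above.

```sql
-- Assumption: the feed is read-only, so the rows are copied into a
-- table variable that carries the extra identity column.
DECLARE @people TABLE (
    PeopleID  int IDENTITY(1, 1) PRIMARY KEY,
    IDNo      varchar(10),
    FirstName varchar(50),
    LastName  varchar(50)
);

-- Sample rows from the question.
INSERT INTO @people (IDNo, FirstName, LastName) VALUES
    ('uqx', 'bob',   'smith'),
    ('abc', 'john',  'willis'),
    ('ABC', 'john',  'willis'),
    ('aBc', 'john',  'willis'),
    ('WTF', 'jeff',  'bridges'),
    ('sss', 'bill',  'doe'),
    ('ere', 'sally', 'abby'),
    ('wtf', 'jeff',  'bridges');

-- Keep the row with the lowest PeopleID within each case-insensitive IDNo.
SELECT p.*
FROM @people p
WHERE p.PeopleID = (
    SELECT MIN(p2.PeopleID)
    FROM @people p2
    WHERE LOWER(p2.IDNo) = LOWER(p.IDNo)
);
-- Five rows come back, one per case-insensitive IDNo.
```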

answered Sep 19 '22 11:09

T8RB
