
Removing duplicates from SQL Join


The following is a hypothetical situation which is close to my real problem.

Table1

recid   firstname   lastname   company
1       A           B          AAA
2       D           E          DEF
3       G           H          IJK
4       A           B          ABC

I have a table2 that looks like this

recid   firstname   lastname   company
10      A           B          ABC
20      D           E          DEF
30      M           D          DIM
40      A           B          CCC

Now if I join the tables on recid, I get 0 results, and there can be no duplicates because recid is unique. But if I join on the firstname and lastname columns, which are not unique, the inner join returns duplicate rows. The more columns I add to the join, the worse it becomes (more duplicates are created).

In the above simple situation, how can I remove duplicates in the following query? I want to compare firstname and lastname; if they match, I return firstname, lastname and recid from table2.

SELECT DISTINCT *
FROM (SELECT recid, first, last FROM table1) a
INNER JOIN (SELECT recid, first, last FROM table2) b
    ON a.first = b.first

The script is here if anyone wants to play with it in future

CREATE TABLE table1 (recid INT NOT NULL PRIMARY KEY, first varchar(20), last varchar(20), company varchar(20))
CREATE TABLE table2 (recid INT NOT NULL PRIMARY KEY, first varchar(20), last varchar(20), company varchar(20))

INSERT INTO TABLE1 VALUES(1,'A','B','ABC')
INSERT INTO TABLE1 VALUES(2,'D','E','DEF')
INSERT INTO TABLE1 VALUES(3,'M','N','MNO')
INSERT INTO TABLE1 VALUES(4,'A','B','ABC')

INSERT INTO TABLE2 VALUES(10,'A','B','ABC')
INSERT INTO TABLE2 VALUES(20,'D','E','DEF')
INSERT INTO TABLE2 VALUES(30,'Q','R','QRS')
INSERT INTO TABLE2 VALUES(40,'A','B','ABC')


asked Aug 16 '11 20:08

Hammad Khan

1 Answer

You don't want to do a join per se, you're merely testing for existence/set inclusion.

I don't know what current flavor of SQL you're coding in, but this should work.

SELECT MAX(T2.recid), T2.firstname, T2.lastname
FROM table2 T2
WHERE EXISTS (SELECT *
              FROM table1 T1
              WHERE T1.firstname = T2.firstname
                AND T1.lastname = T2.lastname)
GROUP BY T2.lastname, T2.firstname
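As a quick check of the EXISTS query, here is a minimal sketch that runs it against the two sample tables shown in the question. It makes two assumptions that are not in the original post: SQLite through Python's built-in sqlite3 module stands in for the unspecified DBMS, and the columns are named firstname/lastname as in the displayed tables rather than first/last as in the setup script. The ORDER BY is added only to make the output order predictable.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real DBMS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (recid INT NOT NULL PRIMARY KEY, firstname VARCHAR(20),
                     lastname VARCHAR(20), company VARCHAR(20));
CREATE TABLE table2 (recid INT NOT NULL PRIMARY KEY, firstname VARCHAR(20),
                     lastname VARCHAR(20), company VARCHAR(20));
-- Sample rows copied from the tables displayed in the question.
INSERT INTO table1 VALUES (1,'A','B','AAA'), (2,'D','E','DEF'),
                          (3,'G','H','IJK'), (4,'A','B','ABC');
INSERT INTO table2 VALUES (10,'A','B','ABC'), (20,'D','E','DEF'),
                          (30,'M','D','DIM'), (40,'A','B','CCC');
""")

# Existence test: keep table2 rows whose name also appears in table1,
# then collapse each name to a single row carrying its highest recid.
rows = conn.execute("""
    SELECT MAX(T2.recid) AS recid, T2.firstname, T2.lastname
    FROM table2 T2
    WHERE EXISTS (SELECT *
                  FROM table1 T1
                  WHERE T1.firstname = T2.firstname
                    AND T1.lastname = T2.lastname)
    GROUP BY T2.lastname, T2.firstname
    ORDER BY T2.firstname, T2.lastname
""").fetchall()

print(rows)  # [(40, 'A', 'B'), (20, 'D', 'E')]
```

Each matching name comes back once. MAX(recid) is just one way to pick a single recid per name; MIN would work equally well if the lower id is preferred.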

If you want to implement it as a join, the code stays largely the same:

SELECT MAX(T2.recid), T2.firstname, T2.lastname
FROM Table2 T2
INNER JOIN Table1 T1
    ON T2.firstname = T1.firstname
   AND T2.lastname = T1.lastname
GROUP BY T2.firstname, T2.lastname

Depending on the DBMS, an inner join may be implemented differently from an EXISTS (join vs. semi-join), but the optimizer can sometimes figure it out anyway and choose the correct operator regardless of which way you write it.
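The join vs. semi-join distinction also shows up in the rows returned when the GROUP BY is left off. The sketch below compares the two forms on the sample tables from the question. As an assumption not in the original post, it uses SQLite through Python's built-in sqlite3 module with in-memory copies of the tables and the firstname/lastname column names from the displayed data; other systems may plan these queries differently, but the row counts follow from the SQL semantics.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real DBMS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (recid INT NOT NULL PRIMARY KEY, firstname VARCHAR(20),
                     lastname VARCHAR(20), company VARCHAR(20));
CREATE TABLE table2 (recid INT NOT NULL PRIMARY KEY, firstname VARCHAR(20),
                     lastname VARCHAR(20), company VARCHAR(20));
-- Sample rows copied from the tables displayed in the question.
INSERT INTO table1 VALUES (1,'A','B','AAA'), (2,'D','E','DEF'),
                          (3,'G','H','IJK'), (4,'A','B','ABC');
INSERT INTO table2 VALUES (10,'A','B','ABC'), (20,'D','E','DEF'),
                          (30,'M','D','DIM'), (40,'A','B','CCC');
""")

# Plain inner join: every table2 row is repeated once per matching table1 row.
join_ids = [r[0] for r in conn.execute("""
    SELECT T2.recid
    FROM table2 T2
    INNER JOIN table1 T1
        ON T2.firstname = T1.firstname
       AND T2.lastname = T1.lastname
    ORDER BY T2.recid
""")]

# EXISTS (semi-join): every table2 row appears at most once, however many
# table1 rows it matches.
exists_ids = [r[0] for r in conn.execute("""
    SELECT T2.recid
    FROM table2 T2
    WHERE EXISTS (SELECT *
                  FROM table1 T1
                  WHERE T1.firstname = T2.firstname
                    AND T1.lastname = T2.lastname)
    ORDER BY T2.recid
""")]

print(join_ids)    # [10, 10, 20, 40, 40]
print(exists_ids)  # [10, 20, 40]
```

The two (A, B) rows in table1 double recids 10 and 40 in the join result, which is the duplication the question describes. With EXISTS the GROUP BY is only needed to reduce the result to one row per name, not to undo row multiplication.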

answered Oct 11 '22 06:10

Code Magician
