Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove duplicates using only a MySQL query?

Tags:

sql

mysql

I have a table with the following columns:

URL_ID    
URL_ADDR    
URL_Time

I want to remove duplicates on the URL_ADDR column using a MySQL query.

Is it possible to do such a thing without using any programming?

like image 976
Jim Avatar asked Aug 01 '10 21:08

Jim


People also ask

How do I remove duplicate records from a select query?

The go to solution for removing duplicate rows from your result sets is to include the distinct keyword in your select statement. It tells the query engine to remove duplicates to produce a result set in which every row is unique.

Which query is used to remove duplicates?

We can use the SQL RANK function to remove the duplicate rows as well. SQL RANK function gives unique row ID for each row irrespective of the duplicate row. In the following query, we use a RANK function with the PARTITION BY clause.


3 Answers

Consider the following test case:

CREATE TABLE mytb (url_id int, url_addr varchar(100));

INSERT INTO mytb VALUES (1, 'www.google.com');
INSERT INTO mytb VALUES (2, 'www.microsoft.com');
INSERT INTO mytb VALUES (3, 'www.apple.com');
INSERT INTO mytb VALUES (4, 'www.google.com');
INSERT INTO mytb VALUES (5, 'www.cnn.com');
INSERT INTO mytb VALUES (6, 'www.apple.com');

Where our test table now contains:

SELECT * FROM mytb;
+--------+-------------------+
| url_id | url_addr          |
+--------+-------------------+
|      1 | www.google.com    |
|      2 | www.microsoft.com |
|      3 | www.apple.com     |
|      4 | www.google.com    |
|      5 | www.cnn.com       |
|      6 | www.apple.com     |
+--------+-------------------+
5 rows in set (0.00 sec)

Then we can use the multiple-table DELETE syntax as follows:

DELETE t2
FROM   mytb t1
JOIN   mytb t2 ON (t2.url_addr = t1.url_addr AND t2.url_id > t1.url_id);

... which will delete duplicate entries, leaving only the first url based on url_id:

SELECT * FROM mytb;
+--------+-------------------+
| url_id | url_addr          |
+--------+-------------------+
|      1 | www.google.com    |
|      2 | www.microsoft.com |
|      3 | www.apple.com     |
|      5 | www.cnn.com       |
+--------+-------------------+
3 rows in set (0.00 sec)

UPDATE - Further to new comments above:

If the duplicate URLs will not have the same format, you may want to apply the REPLACE() function to remove www. or http:// parts. For example:

DELETE t2
FROM   mytb t1
JOIN   mytb t2 ON (REPLACE(t2.url_addr, 'www.', '') = 
                   REPLACE(t1.url_addr, 'www.', '') AND 
                   t2.url_id > t1.url_id);
like image 200
Daniel Vassallo Avatar answered Oct 02 '22 20:10

Daniel Vassallo


You may want to try the method mentioned at http://labs.creativecommons.org/2010/01/12/removing-duplicate-rows-in-mysql/.

ALTER IGNORE TABLE your_table ADD UNIQUE INDEX `tmp_index` (URL_ADDR);
like image 31
Box Avatar answered Oct 02 '22 18:10

Box


This will leave the ones with the highest URL_ID for a particular URL_ADDR

DELETE FROM table
WHERE URL_ID NOT IN 
    (SELECT ID FROM 
       (SELECT MAX(URL_ID) AS ID 
        FROM table 
        WHERE URL_ID IS NOT NULL
        GROUP BY URL_ADDR ) X)   /*Sounds like you would need to GROUP BY a 
                                   calculated form - e.g. using REPLACE to 
                                  strip out www see Daniel's answer*/

(The derived table 'X' is to avoid the error "You can't specify target table 'tablename' for update in FROM clause")

like image 24
Martin Smith Avatar answered Oct 02 '22 18:10

Martin Smith