Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to remove duplicates in combination of three columns in python with postgresql

I would like to remove the duplicate data only if three columns (name, price and new price) matching with the same data. But in an other python script.

So the data can insert in to the database, but with an other python script, I want to delete this duplicate data by a cron job.

So in this case:

cur.execute("INSERT INTO cars VALUES(8,'Hummer',41400, 49747)")

cur.execute("INSERT INTO cars VALUES(9,'Volkswagen',21600, 36456)")

are duplicates. Example script with inserted data:

import psycopg2
import sys

con = None

try:
    con = psycopg2.connect(database='testdb', user='janbodnar')    
    cur = con.cursor()

    cur.execute("CREATE TABLE cars(id INT PRIMARY KEY, name VARCHAR(20), price INT, new price INT)")
    cur.execute("INSERT INTO cars VALUES(1,'Audi',52642, 98484)")
    cur.execute("INSERT INTO cars VALUES(2,'Mercedes',57127, 874897)")
    cur.execute("INSERT INTO cars VALUES(3,'Skoda',9000, 439788)")
    cur.execute("INSERT INTO cars VALUES(4,'Volvo',29000, 743878)")
    cur.execute("INSERT INTO cars VALUES(5,'Bentley',350000, 434684)")
    cur.execute("INSERT INTO cars VALUES(6,'Citroen',21000, 43874)")
    cur.execute("INSERT INTO cars VALUES(7,'Hummer',41400, 49747)")
    cur.execute("INSERT INTO cars VALUES(8,'Hummer',41400, 49747)")
    cur.execute("INSERT INTO cars VALUES(9,'Volkswagen',21600, 36456)")
    cur.execute("INSERT INTO cars VALUES(10,'Volkswagen',21600, 36456)")

    con.commit()

except psycopg2.DatabaseError, e:
    if con:
        con.rollback()

    print 'Error %s' % e    
    sys.exit(1

finally:    
    if con:
        con.close()
like image 261
patrick Avatar asked Jan 28 '26 11:01

patrick


1 Answers

You can do this in one statement without additional round-trips to the server.

DELETE FROM cars
USING (
    SELECT id, row_number() OVER (PARTITION BY name, price, new_price
                                  ORDER BY id) AS rn
    FROM   cars
    ) x
WHERE cars.id = x.id
AND   x.rn > 1;

Requires PostgreSQL 8.4 or later for the window function row_number().
Out of a set of dupes the smallest id survives.
Note that I changed "new price" to new_price.

Or use the EXISTS semi-join, that @wildplasser posted as comment to the same effect.


Or, to by special request of CTE-devotee @wildplasser, with a CTE instead of the subquery ... :)

WITH x AS (
    SELECT id, row_number() OVER (PARTITION BY name, price, new_price
                                  ORDER BY id) AS rn
    FROM   cars
    )
DELETE FROM cars
USING  x
WHERE  cars.id = x.id
AND    x.rn > 1;

Data modifying CTE requires Postgres 9.1 or later.
This form will perform about the same as the one with the subquery.

like image 171
Erwin Brandstetter Avatar answered Jan 31 '26 00:01

Erwin Brandstetter



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!