
Remove duplicate rows based on field in a select query with PostgreSQL?

Consider the table mdl_files, which contains the following fields: id, contenthash, timecreated, filesize.

This table stores attachment files.

We consider all rows with the same content hash to be duplicates, and I want to keep only the oldest row in each group (or the first one if the dates are equal). How can I do that?

The following query:

SELECT
  id,
  contenthash,
  filesize,
  to_timestamp(timecreated) :: DATE
FROM mdl_files
ORDER BY contenthash;

returns:

2480229 00002e87605311feb82b70473b61e81f0223c774    18178   2016-10-05
2997411 0000bfd20ef84948eee6811ce5bbac03de42ccb0    1293    2017-03-31
1304839 000280169fc78d704a2d4569bfb6f42ea4a1d5ae    8203    2015-11-10
1364656 000280169fc78d704a2d4569bfb6f42ea4a1d5ae    8203    2015-11-17
71568   0003c6aec5835964870902d697c06d21abf76bf7    139439  2013-04-19
2959945 000419c19d77df7285e669614075b47414e3ab2c    398    2017-03-20
3483049 00061dc0bc2452304107ddc75e7ee2908c729905    28618   2017-08-17
3483047 00061dc0bc2452304107ddc75e7ee2908c729905    28618   2017-08-17

I want to get this resultset:

2480229 00002e87605311feb82b70473b61e81f0223c774    18178   2016-10-05
2997411 0000bfd20ef84948eee6811ce5bbac03de42ccb0    1293    2017-03-31
1304839 000280169fc78d704a2d4569bfb6f42ea4a1d5ae    8203    2015-11-10
71568   0003c6aec5835964870902d697c06d21abf76bf7    139439  2013-04-19
2959945 000419c19d77df7285e669614075b47414e3ab2c    398    2017-03-20
3483049 00061dc0bc2452304107ddc75e7ee2908c729905    28618   2017-08-17

I want the following duplicated lines to be removed from the resultset:

1364656 000280169fc78d704a2d4569bfb6f42ea4a1d5ae    8203    2015-11-17
3483047 00061dc0bc2452304107ddc75e7ee2908c729905    28618   2017-08-17
ben.IT asked Jul 30 '18 11:07


2 Answers

Use DISTINCT ON:

SELECT DISTINCT ON (contenthash)
  id,
  contenthash,
  filesize,
  to_timestamp(timecreated) :: DATE
FROM mdl_files
ORDER BY contenthash, timecreated, id;

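DISTINCT ON is a Postgres extension that returns exactly one row for each unique combination of the keys in parentheses. The row kept is the first one in each group according to the ORDER BY clause, which must start with the DISTINCT ON keys. Here, ordering by timecreated and then id keeps the oldest row and breaks ties on the lowest id.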
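To see which row DISTINCT ON keeps, here is a minimal self-contained sketch. The table mdl_files_demo and every value in it are hypothetical, made up purely for illustration; only the shape of the query comes from the answer above.

```sql
-- Hypothetical demo table; names and values are illustrative, not real data.
CREATE TEMP TABLE mdl_files_demo (
  id          bigint PRIMARY KEY,
  contenthash text   NOT NULL,
  filesize    bigint NOT NULL,
  timecreated bigint NOT NULL  -- Unix epoch seconds, as in mdl_files
);

INSERT INTO mdl_files_demo (id, contenthash, filesize, timecreated) VALUES
  (1, 'aaa', 100, 1000),
  (2, 'aaa', 100, 2000),  -- later duplicate of id 1
  (4, 'bbb', 200, 3000),
  (3, 'bbb', 200, 3000),  -- same timestamp as id 4, so the id tie-break decides
  (5, 'ccc', 300, 4000);  -- no duplicate

-- One row per contenthash: the oldest, with the lowest id winning ties.
SELECT DISTINCT ON (contenthash)
  id,
  contenthash,
  filesize,
  timecreated
FROM mdl_files_demo
ORDER BY contenthash, timecreated, id;
-- Returns the rows with ids 1, 3 and 5.
```

Without the trailing id in the ORDER BY, either of the two 'bbb' rows could be returned, because their timecreated values are equal.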

Gordon Linoff answered Nov 05 '22 07:11


You can use the ROW_NUMBER() window function to number the rows within each group of duplicates, then keep only the first row of each group.

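SELECT t.*
FROM (
  SELECT
    id,
    contenthash,
    filesize,
    -- Number the rows in each duplicate group, oldest first; id breaks ties.
    ROW_NUMBER() OVER (PARTITION BY contenthash, filesize ORDER BY timecreated, id) AS rn
  FROM mdl_files
) t
WHERE t.rn = 1;  -- keep only the first row of each group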


If you want to delete the duplicate rows from the table, you can use EXISTS in the WHERE clause.

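DELETE
FROM mdl_files f
WHERE EXISTS (
  SELECT 1
  FROM (
    SELECT
      id,
      -- Number the rows in each duplicate group, oldest first; id breaks ties.
      ROW_NUMBER() OVER (PARTITION BY contenthash, filesize ORDER BY timecreated, id) AS rn
    FROM mdl_files
  ) t
  -- rn > 1 marks every row except the one being kept.
  WHERE t.rn > 1 AND t.id = f.id
);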
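Since a DELETE like this is destructive, one cautious way to run it is inside a transaction, with RETURNING listing what was removed. The following is only a sketch of that idea, not part of the original answer. It assumes duplicates are defined by contenthash alone, as the question states, and uses id as a tie-breaker.

```sql
BEGIN;

-- Rank the rows of each contenthash group, oldest first; id breaks ties.
WITH ranked AS (
  SELECT
    id,
    ROW_NUMBER() OVER (PARTITION BY contenthash ORDER BY timecreated, id) AS rn
  FROM mdl_files
)
DELETE FROM mdl_files f
USING ranked r
WHERE r.id = f.id
  AND r.rn > 1              -- every row except the first of its group
RETURNING f.id, f.contenthash;

-- Inspect the returned rows. Replace ROLLBACK with COMMIT once they look right.
ROLLBACK;
```

Ending with ROLLBACK makes the statement a dry run: the rows it would delete are shown, but the table is left unchanged.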


D-Shih answered Nov 05 '22 09:11