I need to find duplicates in a table. In MySQL I simply write:
SELECT *,count(id) count FROM `MY_TABLE`
GROUP BY SOME_COLUMN ORDER BY count DESC
This query nicely:
Similar query in Postgres greets me with an error:
column "MY_TABLE.SOME_COLUMN" must appear in the GROUP BY clause or be used in an aggregate function
What is the Postgres equivalent of this query?
PS: I know that MySQL behaviour deviates from SQL standards.
The PostgreSQL GROUP BY clause is used in collaboration with the SELECT statement to group together those rows in a table that have identical data. This is done to eliminate redundancy in the output and/or compute aggregates that apply to these groups.
The HAVING clause is used instead of WHERE with aggregate functions. While the GROUP BY Clause groups rows that have the same values into summary rows. The having clause is used with the where clause in order to find rows with certain conditions.
The MySQL GROUP BY StatementThe GROUP BY statement is often used with aggregate functions ( COUNT() , MAX() , MIN() , SUM() , AVG() ) to group the result-set by one or more columns.
Back-ticks are a non-standard MySQL thing. Use the canonical double quotes to quote identifiers (possible in MySQL, too). That is, if your table in fact is named "MY_TABLE"
(all upper case). If you (more wisely) named it my_table
(all lower case), then you can remove the double quotes or use lower case.
Also, I use ct
instead of count
as alias, because it is bad practice to use function names as identifiers.
This would work with PostgreSQL 9.1:
SELECT *, count(id) ct
FROM my_table
GROUP BY primary_key_column(s)
ORDER BY ct DESC;
It requires primary key column(s) in the GROUP BY
clause. The results are identical to a MySQL query, but ct
would always be 1 (or 0 if id IS NULL
) - useless to find duplicates.
If you want to group by other column(s), things get more complicated. This query mimics the behavior of your MySQL query - and you can use *
.
SELECT DISTINCT ON (1, some_column)
count(*) OVER (PARTITION BY some_column) AS ct
,*
FROM my_table
ORDER BY 1 DESC, some_column, id, col1;
This works because DISTINCT ON
(PostgreSQL specific), like DISTINCT
(SQL-Standard), are applied after the window function count(*) OVER (...)
. Window functions (with the OVER
clause) require PostgreSQL 8.4 or later and are not available in MySQL.
Works with any table, regardless of primary or unique constraints.
The 1
in DISTINCT ON
and ORDER BY
is just shorthand to refer to the ordinal number of the item in the SELECT
list.
SQL Fiddle to demonstrate both side by side.
More details in this closely related answer:
count(*)
vs. count(id)
If you are looking for duplicates, you are better off with count(*)
than with count(id)
. There is a subtle difference if id
can be NULL
, because NULL
values are not counted - while count(*)
counts all rows. If id
is defined NOT NULL
, results are the same, but count(*)
is generally more appropriate (and slightly faster, too).
Here's another approach, uses DISTINCT ON:
select
distinct on(ct, some_column)
*,
count(id) over(PARTITION BY some_column) as ct
from my_table x
order by ct desc, some_column, id
Data source:
CREATE TABLE my_table (some_column int, id int, col1 int);
INSERT INTO my_table VALUES
(1, 3, 4)
,(2, 4, 1)
,(2, 5, 1)
,(3, 6, 4)
,(3, 7, 3)
,(4, 8, 3)
,(4, 9, 4)
,(5, 10, 1)
,(5, 11, 2)
,(5, 11, 3);
Output:
SOME_COLUMN ID COL1 CT
5 10 1 3
2 4 1 2
3 6 4 2
4 8 3 2
1 3 4 1
Live test: http://www.sqlfiddle.com/#!1/e2509/1
DISTINCT ON documentation: http://www.postgresonline.com/journal/archives/4-Using-Distinct-ON-to-return-newest-order-for-each-customer.html
mysql allows group by
to omit non-aggregated selected columns from the group by
list, which it executes by returning the first row found for each unique combination of grouped by columns. This is non-standard SQL behaviour.
postgres on the other hand is SQL standard compliant.
There is no equivalent query in postgres.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With