Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find possible duplicates in two columns ignoring case and special characters

Query

SELECT COUNT(*), name, number
FROM   tbl
GROUP  BY name, number
HAVING COUNT(*) > 1

It sometimes fails to find duplicates between lower case and upper case.
E.g.: sunny and Sunny don't show up as a duplicates.
So how to find all possible duplicates in PostgreSQL for two columns.

like image 512
Ghostman Avatar asked Oct 19 '12 17:10

Ghostman


2 Answers

lower()/ upper()

Use one of these to fold characters to either lower or upper case. Special characters are not affected:

SELECT count(*), lower(name), number
FROM   tbl
GROUP  BY lower(name), number
HAVING count(*) > 1;

unaccent()

If you actually want to ignore diacritic signs, like your comments imply, install the additional module unaccent, which provides a text search dictionary that removes accents and also the general purpose function unaccent():

CREATE EXTENSION unaccent;

Makes it very simple:

SELECT lower(unaccent('Büßercafé'));

Result:

busercafe

This doesn't strip non-letters. Add regexp_replace() like @Craig mentioned for that:

SELECT lower(unaccent(regexp_replace('$s^o&f!t Büßercafé', '\W', '', 'g') ));

Result:

softbusercafe

You can even build a functional index on top of that:

  • Does PostgreSQL support "accent insensitive" collations?
like image 85
Erwin Brandstetter Avatar answered Dec 05 '22 00:12

Erwin Brandstetter


PostgreSQL by default is case sensitive. You can force it to be case-insensitive during searches by converting all values to a single case:

SELECT COUNT(*), lower(name), number FROM TABLE 
GROUP BY lower(name), number HAVING COUNT(*) > 1
  • NOTE: This has not been tested in Postgres
like image 27
Palpatim Avatar answered Dec 05 '22 00:12

Palpatim