Finding and dealing with duplicate users

Tags:

mysql

In a large user database with the following format and sample data, we are trying to identify duplicated people:

id   first_name    last_name   email
---------------------------------------------------
 1   chris         baker       
 2   chris         baker       [email protected]
 3   chris         baker       [email protected]
 4   chris         baker       [email protected]  
 5   carl          castle      [email protected]
 6   mike          rotch       [email protected]

I am using the following query:

SELECT 
    GROUP_CONCAT(id) AS "ids",
    CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
    COUNT(*) AS "duplicate_count" 
FROM 
    users 
GROUP BY 
    name 
HAVING 
    duplicate_count > 1

This works great; I get a list of duplicates with the id numbers of the involved rows.

We would re-assign any associated data tied to a duplicate to the actual person (set user_id = 2 where user_id = 3), then we delete the duplicating user row.

The trouble comes after we make this report the first time, as we clean up the list after manually verifying that they are indeed duplicates -- some ARE NOT duplicates. There are 2 Chris Bakers that are legitimate users.

We don't want to keep seeing Chris Baker in subsequent duplicate reports until the end of time, so I am looking for a way to flag that user id 1 and user id 4 are NOT duplicates of each other for future reports, but they could be duplicated by new users added later.

What I tried

I added a is_not_duplicate field to the user table, but then if a new duplicate "Chris Baker" gets added to the database, it will cause this situation to not show on the duplicate report; the is_not_duplicate improperly excludes one of the accounts. My HAVING statement would not meet the > 1 threshold until there are -two- duplicates of Chris Baker, plus the "real" one marked is_not_duplicate.

Question Summed Up

How can I build exceptions into the above query without looping results or multiple queries?

Sub-queries are fine, but the size of the dataset makes every query count and I'd like the solution to be as performant as possible.

770

asked Mar 02 '12 21:03

Chris Baker

1 Answers

Try to add the is_not_duplicate boolean field and modify your code as follows:

SELECT 
    GROUP_CONCAT(id) AS "ids",
    CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
    COUNT(*) AS "duplicate_count",
    SUM(is_not_duplicate) AS "real_count"
FROM 
    users 
GROUP BY 
    name 
HAVING 
    duplicate_count > 1
AND
    duplicate_count - real_count > 0

Newly added duplicates will have is_not_duplicate=0 so the real_count for that name will be less than duplicate_count and the row will be shown

answered Sep 22 '22 05:09

Hrant Khachatrian

Related questions
                            
                                Left join or select from multiple table using comma (,) [duplicate]
                            
                                How to successfully rewrite old mysql-php code with deprecated mysql_* functions?
                            
                                How to store very old dates in database?
                            
                                Check constraint for mutually exclusive columns in SQLAlchemy
                            
                                Can't Find any online documentation for MySQL c++ connector API
                            
                                What is a MySQL Server Instance?
                            
                                Select distinct combinations from two columns
                            
                                MYSQL Triggers: JSON_SEARCH an integer value in a json array of integers
                            
                                Attempted to read past end of the stream error in MySQL
                            
                                Mysql hamming distance of hexadecimal values
                            
                                How to connect to docker mysql container on remote machine
                            
                                When to use float vs decimal
                            
                                Table comment length in MySQL
                            
                                MySQL Index for Group By / Order By
                            
                                MySQL true row merge... not just a union
                            
                                Efficiently storing time series data: mySQL or flat files? Many tables (or files) or queries with WHERE condition?
                            
                                Mocking PDO with phpunit
                            
                                Pylons error - 'MySQL server has gone away'
                            
                                C# datatypes vs. MySql datatypes
                            
                                MySQL ORDER BY DESC is fast but ASC is very slow

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With