Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding and dealing with duplicate users

Tags:

sql

mysql

In a large user database with the following format and sample data, we are trying to identify duplicated people:

id   first_name    last_name   email
---------------------------------------------------
 1   chris         baker       
 2   chris         baker       [email protected]
 3   chris         baker       [email protected]
 4   chris         baker       [email protected]  
 5   carl          castle      [email protected]
 6   mike          rotch       [email protected]  

I am using the following query:

SELECT 
    GROUP_CONCAT(id) AS "ids",
    CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
    COUNT(*) AS "duplicate_count" 
FROM 
    users 
GROUP BY 
    name 
HAVING 
    duplicate_count > 1

This works great; I get a list of duplicates with the id numbers of the involved rows.

We would re-assign any associated data tied to a duplicate to the actual person (set user_id = 2 where user_id = 3), then we delete the duplicating user row.

The trouble comes after we make this report the first time, as we clean up the list after manually verifying that they are indeed duplicates -- some ARE NOT duplicates. There are 2 Chris Bakers that are legitimate users.

We don't want to keep seeing Chris Baker in subsequent duplicate reports until the end of time, so I am looking for a way to flag that user id 1 and user id 4 are NOT duplicates of each other for future reports, but they could be duplicated by new users added later.

What I tried

I added a is_not_duplicate field to the user table, but then if a new duplicate "Chris Baker" gets added to the database, it will cause this situation to not show on the duplicate report; the is_not_duplicate improperly excludes one of the accounts. My HAVING statement would not meet the > 1 threshold until there are -two- duplicates of Chris Baker, plus the "real" one marked is_not_duplicate.

Question Summed Up

How can I build exceptions into the above query without looping results or multiple queries?

Sub-queries are fine, but the size of the dataset makes every query count and I'd like the solution to be as performant as possible.

like image 770
Chris Baker Avatar asked Mar 02 '12 21:03

Chris Baker


People also ask

How do I find duplicate users in Jira?

Each object in the array contains the user directory ID, the directory name, and a boolean flag indicating whether the account is active. The same can be achieved through the browser: Login to Jira as system administrator. Once logged in change the url to "<BASE_URL>/rest/api/2/user/duplicated/list"


1 Answers

Try to add the is_not_duplicate boolean field and modify your code as follows:

SELECT 
    GROUP_CONCAT(id) AS "ids",
    CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
    COUNT(*) AS "duplicate_count",
    SUM(is_not_duplicate) AS "real_count"
FROM 
    users 
GROUP BY 
    name 
HAVING 
    duplicate_count > 1
AND
    duplicate_count - real_count > 0

Newly added duplicates will have is_not_duplicate=0 so the real_count for that name will be less than duplicate_count and the row will be shown

like image 55
Hrant Khachatrian Avatar answered Sep 22 '22 05:09

Hrant Khachatrian