In a large user database with the following format and sample data, we are trying to identify duplicated people:
id first_name last_name email
---------------------------------------------------
1 chris baker
2 chris baker [email protected]
3 chris baker [email protected]
4 chris baker [email protected]
5 carl castle [email protected]
6 mike rotch [email protected]
I am using the following query:
SELECT
GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
This works great; I get a list of duplicates with the id numbers of the involved rows.
We would re-assign any associated data tied to a duplicate to the actual person (set user_id = 2 where user_id = 3
), then we delete the duplicating user row.
The trouble comes after we make this report the first time, as we clean up the list after manually verifying that they are indeed duplicates -- some ARE NOT duplicates. There are 2 Chris Bakers that are legitimate users.
We don't want to keep seeing Chris Baker in subsequent duplicate reports until the end of time, so I am looking for a way to flag that user id 1 and user id 4 are NOT duplicates of each other for future reports, but they could be duplicated by new users added later.
What I tried
I added a is_not_duplicate
field to the user table, but then if a new duplicate "Chris Baker" gets added to the database, it will cause this situation to not show on the duplicate report; the is_not_duplicate
improperly excludes one of the accounts. My HAVING
statement would not meet the > 1
threshold until there are -two- duplicates of Chris Baker, plus the "real" one marked is_not_duplicate
.
Question Summed Up
How can I build exceptions into the above query without looping results or multiple queries?
Sub-queries are fine, but the size of the dataset makes every query count and I'd like the solution to be as performant as possible.
Each object in the array contains the user directory ID, the directory name, and a boolean flag indicating whether the account is active. The same can be achieved through the browser: Login to Jira as system administrator. Once logged in change the url to "<BASE_URL>/rest/api/2/user/duplicated/list"
Try to add the is_not_duplicate
boolean field and modify your code as follows:
SELECT
GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count",
SUM(is_not_duplicate) AS "real_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
AND
duplicate_count - real_count > 0
Newly added duplicates will have is_not_duplicate=0
so the real_count
for that name will be less than duplicate_count
and the row will be shown
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With