Remove duplicates, keeping the row with the fewest NULL values

Tags: sql, sql-server

I have an employee table with about 25 columns. It currently contains a lot of duplicate rows, and I would like to get rid of some of them.

First, I want to find the duplicates by looking for multiple records that share the same first name, last name, employee number, company number and status:

SELECT
    firstname,lastname,employeenumber, companynumber, statusflag
FROM
    employeemaster
GROUP BY
    firstname,lastname,employeenumber,companynumber, statusflag
HAVING 
    (COUNT(*) > 1)

This returns the duplicated groups, but my goal is to keep the single best record in each group and delete the others. The "best" record is the one with the fewest NULL values across all of the other columns. How can I do this?

I am using SQL Server 2012 with Management Studio.

EXAMPLE:

[Image: sample rows from the employee table, not available]

Red: DELETE Green: KEEP

NOTE: There are a lot more columns in the table than what this table shows.

asked Jan 13 '15 16:01 by user3788671



2 Answers

You can use the sys.columns catalog view to list the table's columns and build a dynamic query from them. The query below adds a KeepThese column to every row: within each group of duplicates, the row with the most non-NULL values gets KeepThese = 1, and that is the record to keep.

-- create and populate test data
create table EmployeeMaster
  (
    Record int identity(1,1),
    FirstName varchar(50),
    LastName varchar(50),
    EmployeeNumber int,
    CompanyNumber int,
    StatusFlag int,
    UserName varchar(50),
    Branch varchar(50)
  );
insert into EmployeeMaster
  (
    FirstName,
    LastName,
    EmployeeNumber,
    CompanyNumber,
    StatusFlag,
    UserName,
    Branch
  )
  values
    ('Jake','Jones',1234,1,1,'JJONES','PHX'),
    ('Jake','Jones',1234,1,1,NULL,'PHX'),
    ('Jake','Jones',1234,1,1,NULL,NULL),
    ('Jane','Jones',5678,1,1,'JJONES2',NULL);

-- rank each duplicate group by its number of non-null columns;
-- the column list is generated from sys.columns instead of typed by hand
declare @sql nvarchar(max);
select @sql = '
    select e.*,
        row_number() over(partition by
                            e.FirstName,
                            e.LastName,
                            e.EmployeeNumber,
                            e.CompanyNumber,
                            e.StatusFlag
                          order by n.NonNullCnt desc) as KeepThese
    from EmployeeMaster e
        cross apply (select count(n.value) as NonNullCnt from (select ' +
            replace((
                -- quotename() protects column names that need brackets
                select 'cast(e.' + quotename(c.name) + ' as varchar(50)) as value union all select '
                from sys.columns c
                where c.object_id = t.object_id
                for xml path('')
                ) + '#',' union all select #','') + ')n)n'
from sys.tables t
where t.name = 'EmployeeMaster';

-- KeepThese = 1 marks the row to keep in each group
exec(@sql);
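
To make the dynamic part easier to follow, here is a hand-written equivalent of the statement that gets generated for the 8-column test table above. It is only a sketch for that test schema (the aliases v and n are my own choice); a real 25-column table would produce one CAST line per column.

```sql
-- Hand-expanded equivalent of the generated query for the test table.
SELECT e.*,
       ROW_NUMBER() OVER (
           PARTITION BY e.FirstName, e.LastName, e.EmployeeNumber,
                        e.CompanyNumber, e.StatusFlag
           ORDER BY n.NonNullCnt DESC) AS KeepThese
FROM EmployeeMaster e
CROSS APPLY (
    -- COUNT(v.value) ignores NULLs, so it counts the non-NULL columns of row e
    SELECT COUNT(v.value) AS NonNullCnt
    FROM (
        SELECT CAST(e.Record AS varchar(50)) AS value
        UNION ALL SELECT CAST(e.FirstName AS varchar(50))
        UNION ALL SELECT CAST(e.LastName AS varchar(50))
        UNION ALL SELECT CAST(e.EmployeeNumber AS varchar(50))
        UNION ALL SELECT CAST(e.CompanyNumber AS varchar(50))
        UNION ALL SELECT CAST(e.StatusFlag AS varchar(50))
        UNION ALL SELECT CAST(e.UserName AS varchar(50))
        UNION ALL SELECT CAST(e.Branch AS varchar(50))
    ) v
) n;
```

With the test data, the three Jake Jones rows have 8, 7 and 6 non-NULL columns, so the first one gets KeepThese = 1. Rows with the same count are ranked in arbitrary order, so add a tie-breaker to the ORDER BY if that matters.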
answered Sep 24 '22 04:09 by Ron Smith


Try this. It numbers the rows in each duplicate group by how many NULLs they contain, fewest first, and keeps row number 1. A single pass over the table is enough; no self-join is needed.

;WITH cte
     AS (SELECT firstname,
                lastname,
                employeenumber,
                companynumber,
                statusflag,
                username,
                branch,
                Row_number()
                  OVER(
                    partition BY firstname, lastname, employeenumber, companynumber, statusflag
                    ORDER BY (CASE WHEN username IS NULL THEN 1 ELSE 0 END + CASE WHEN branch IS NULL THEN 1 ELSE 0 END) ) rn
                        -- add the remaining columns in case statement
         FROM   employeemaster)
SELECT *
FROM   cte
WHERE  rn = 1
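
The question asks to delete the losing rows, not just to select the winners. In SQL Server a CTE over a single table can be the target of a DELETE, so the same ranking can be reused for that. The following is a minimal sketch that assumes the EmployeeMaster test table from the other answer, where UserName and Branch are the only nullable columns; for the real table the CASE sum would need one term per remaining column.

```sql
-- Sketch: delete every row except the one with the fewest NULLs per group.
-- Assumes the EmployeeMaster test table; extend the CASE sum for real tables.
BEGIN TRANSACTION;

WITH ranked AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY FirstName, LastName, EmployeeNumber,
                            CompanyNumber, StatusFlag
               ORDER BY
                   CASE WHEN UserName IS NULL THEN 1 ELSE 0 END
                 + CASE WHEN Branch   IS NULL THEN 1 ELSE 0 END
           ) AS rn
    FROM EmployeeMaster
)
DELETE FROM ranked   -- removes the underlying EmployeeMaster rows
WHERE rn > 1;

-- Check what is left before making the change permanent.
SELECT * FROM EmployeeMaster;

-- Dry run: undo the delete. Replace with COMMIT TRANSACTION once verified.
ROLLBACK TRANSACTION;
```

On the test data this would leave one Jake Jones row (the one with both UserName and Branch filled in) and the single Jane Jones row. Running it inside a transaction that rolls back by default is a deliberate precaution, since rows with equal NULL counts are ranked in arbitrary order.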
answered Sep 25 '22 04:09 by Pரதீப்