Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use RANK to Group Matched Records

Tags:

Long Story short. I have data that I'm trying to identify duplicate records by address. The address can be matched on the [Address] or [Remit_Address] fields. I use a JOIN and UNION to get the records, but I need the matched records to appear with each other in the results.

I can't sort by any of the existing fields, so a typical 'ORDER BY' won't work. I looked into RANK as suggested by someone and it looks like it might work, but I don't know how to do the Partition, and I think the Order gives me the same issue with ORDER BY.

If RANK is not the best option I'm open to other ideas. The goal ultimately is to group the matched records someway.

  • SSMS 18
  • SQL Server 2019

Here is the setup:

-- Output Table
CREATE TABLE [dupecheck] (
  [id] int identity(1, 1), 
  [Data Area] varchar(255), 
  [Supplier_No] varchar(255), 
  [Name] varchar(255), 
  [Address] varchar(255), 
  [City] varchar(255), 
  [State] varchar(255), 
  [Zip] varchar(255), 
  [Country] varchar(255), 
  [Remit_Address] varchar(255), 
  [Remit_City] varchar(255), 
  [Remit_State] varchar(255), 
  [Remit_Zip] varchar(255), 
  [Remit_Country] varchar(255), 
)


CREATE TABLE [sample_data] (
    [Supplier_No]           varchar(255),
    [Name]                  varchar(255),
    [Address]               varchar(255),
    [City]                  varchar(255),
    [State]                 varchar(255),
    [Zip]                   varchar(255),
    [Country]               varchar(255),
    [Remit_Address]         varchar(255),
    [Remit_City]            varchar(255),
    [Remit_State]           varchar(255),
    [Remit_Zip]             varchar(255),
    [Remit_Country]         varchar(255),
    [cleanAddress]          varchar(255),
    [cleanRemit_Address]    varchar(255),
    CONSTRAINT [suppliers_pk] PRIMARY KEY ([Supplier_No])
)

INSERT INTO [sample_data] VALUES
    ('1039104','Geez Companies','100 Aero Hudson Rd','Streetsboro','OH','44241','','100 Aero Hudson Road','Streetsboro','OH','44241','USA','100 Aero Hudson Rd','100 Aero Hudson Rd'),
    ('1218409','SouthWestern Medical','100 West Balor Ave','Osceola','AR','72370','USA','SouthWestern Medical100 W Balor Ave','Osceola','AR','72370','USA','100 W Balor Ave','SouthWestern Medical100 W Balor Ave'),
    ('1243789','SouthWestern Medical','100 West Balor Ave','Osceola','AR','72370','USA','SouthWestern Medical100 West Balor Ave','Osceola','AR','72370','USA','100 W Balor Ave','SouthWestern Medical100 W Balor Ave'),
    ('1243636','SIRI SYSTEMS','15 BRAD ROAD','WEXFORD','PA','15090','','','','','','','15 BRAD RD',''),
    ('1152482','FLEETWOOD MACK','22 WINDSOCK CT','ADDISON','IL','60101','','PO BOX 951','CHICAGO','IL','60694-5124','','22 WINDSOCK CT','PO BOX 951'),
    ('1224483','Aerospace Junction','211500 Communicate Ave','Mingo Junction','OH','43939','USA','P O Box 99','Mingo Junction','OH','43939','USA','211500 Communicate Ave','PO Box 99'),
    ('1243397','Squeezy Felt','SCHREIBER DIST','NEW KENSINGTON','PA','15068','','','','','','','SCHREIBER DIST',''),
    ('1230895','NERO CO','28 North US State Highway 99','Osceola','AR','72370','USA','PO Box 204','Cape Girardeau','MO','63702-2045','USA','28 N US State Hwy 99','PO Box 204'),
    ('1243782','NERO CO','28 North US State Highway 99','Osceola','AR','72370','USA','PO Box 204','Cape Girardeau','MO','63702-2045','USA','28 N US State Hwy 99','PO Box 204'),
    ('1135880','RICHARD PRYOR SEMINARS','PO BOX 2194','KANSAS CITY','MO','64121-9468','USA','RICHARD PRYOR SEMINARS P O BOX 2194','KANSAS CITY','MO','64121-9468','USA','PO BOX 2194','RICHARD PRYOR SEMINARS PO BOX 2194'),
    ('1241328','INFINITY AND BEYOND','P.O. BOX 169','GASTONIA','NC','28053-0269','USA','','','','','','PO BOX 169',''),
    ('1259522','ZEEBO INC','PO BOX 169','GASSTONIA','NC','28053-0269','USA','','','','','','PO BOX 169',''),
    ('1255253','AT&T','PO Box 50221','Carol Stream','IL','60197','USA','','','','','','PO Box 50221',''),
    ('1135513','AT&T','PO Box 50221','Carol Stream','IL','60197-5080','USA','','','','','','PO Box 50221',''),
    ('1119161','Machine Co, Inc','3306 N Thorne Blvd','Chattanooga','TN','','','PO BOX 5301','CHATTANOOGA','TN','37406','USA','3306 N Thorne Blvd','PO BOX 5301'),
    ('1176587','Topsy Turvy','365 Welmington Road','Chicago','IL','60606','USA','','','','','','365 Welmington Rd',''),
    ('2156671','Topsy Turvvy, Inc.','P.O. Box 55217','Columbus','OH','43081','','365 Welmington Road','Chicago','IL','60606','USA','','365 Welmington Rd')


CREATE TABLE [dupe_addresses](
    [NewAdd] [varchar](255) NULL
)

INSERT INTO [dupe_addresses] VALUES
    ('100 W Balor Ave'),
    ('28 N US State Hwy 99'),
    ('365 Welmington Rd'),
    ('PO BOX 169'),
    ('PO Box 204'),
    ('PO Box 50221'),
    ('SouthWestern Medical100 W Balor Ave')

Existing Query:

INSERT INTO [dupecheck]
    SELECT * FROM (
    SELECT 
        'Address Match' AS [Reason], 
        pv.[Supplier_No], 
        pv.[Name], 
        pv.[Address], 
        pv.[City], 
        pv.[State], 
        pv.[Zip], 
        pv.[Country], 
        pv.[Remit_Address], 
        pv.[Remit_City], 
        pv.[Remit_State], 
        pv.[Remit_Zip], 
        pv.[Remit_Country]
         FROM [dupe_addresses] n 
      LEFT JOIN [sample_data] pv 
        ON 
        (n.[NewAdd] = pv.[cleanAddress] AND ( [Address] <> '' AND [Address] IS NOT NULL ) )
       WHERE ([Supplier_No] IS NOT NULL AND [Supplier_No] <> '') 

    UNION

    SELECT 
        'Address Match' AS [Reason], 
        pv.[Supplier_No], 
        pv.[Name], 
        pv.[Address], 
        pv.[City], 
        pv.[State], 
        pv.[Zip], 
        pv.[Country], 
        pv.[Remit_Address], 
        pv.[Remit_City], 
        pv.[Remit_State], 
        pv.[Remit_Zip], 
        pv.[Remit_Country]
         FROM [dupe_addresses] n 
      LEFT JOIN [sample_data] pv 
        ON 
        (n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL) )
       WHERE ([Supplier_No] IS NOT NULL AND [Supplier_No] <> '') 
       ) q1

Current Results:

Reason  Supplier_No Name    Address City    State   Zip Country Remit_Address   Remit_City  Remit_State Remit_Zip   Remit_Country
Address Match   1135513 AT&T    PO Box 50221    Carol Stream    IL  60197-5080  USA                 
Address Match   1176587 Topsy Turvy 365 Welmington Road Chicago IL  60606   USA                 
Address Match   1218409 SouthWestern Medical    100 West Balor Ave  Osceola AR  72370   USA SouthWestern Medical100 W Balor Ave Osceola AR  72370   USA
Address Match   1230895 NERO CO 28 North US State Highway 99    Osceola AR  72370   USA PO Box 204  Cape Girardeau  MO  63702-2045  USA
Address Match   1241328 INFINITY AND BEYOND P.O. BOX 169    GASTONIA    NC  28053-0269  USA                 
Address Match   1243782 NERO CO 28 North US State Highway 99    Osceola AR  72370   USA PO Box 204  Cape Girardeau  MO  63702-2045  USA
Address Match   1243789 SouthWestern Medical    100 West Balor Ave  Osceola AR  72370   USA SouthWestern Medical100 West Balor Ave  Osceola AR  72370   USA
Address Match   1255253 AT&T    PO Box 50221    Carol Stream    IL  60197   USA                 
Address Match   1259522 ZEEBO INC   PO BOX 169  GASSTONIA   NC  28053-0269  USA                 
Address Match   2156671 Topsy Turvvy, Inc.  P.O. Box 55217  Columbus    OH  43081       365 Welmington Road Chicago IL  60606   USA

Desired Results:

Reason  Supplier_No Name    Address City    State   Zip Country Remit_Address   Remit_City  Remit_State Remit_Zip   Remit_Country   rank
Address Match   1135513 AT&T    PO Box 50221    Carol Stream    IL  60197-5080  USA                     1
Address Match   1255253 AT&T    PO Box 50221    Carol Stream    IL  60197   USA                     1
Address Match   1241328 INFINITY AND BEYOND P.O. BOX 169    GASTONIA    NC  28053-0269  USA                     2
Address Match   1259522 ZEEBO INC   PO BOX 169  GASSTONIA   NC  28053-0269  USA                     2
Address Match   1243782 NERO CO 28 North US State Highway 99    Osceola AR  72370   USA PO Box 204  Cape Girardeau  MO  63702-2045  USA 3
Address Match   1230895 NERO CO 28 North US State Highway 99    Osceola AR  72370   USA PO Box 204  Cape Girardeau  MO  63702-2045  USA 3
Address Match   1218409 SouthWestern Medical    100 West Balor Ave  Osceola AR  72370   USA SouthWestern Medical100 W Balor Ave Osceola AR  72370   USA 4
Address Match   1243789 SouthWestern Medical    100 West Balor Ave  Osceola AR  72370   USA SouthWestern Medical100 West Balor Ave  Osceola AR  72370   USA 4
Address Match   2156671 Topsy Turvvy, Inc.  P.O. Box 55217  Columbus    OH  43081       365 Welmington Road Chicago IL  60606   USA 5
Address Match   1176587 Topsy Turvy 365 Welmington Road Chicago IL  60606   USA                     5
like image 914
Dizzy49 Avatar asked Apr 09 '21 07:04

Dizzy49


People also ask

Can rank be used with GROUP BY?

we can use rank function and group by in the same query set but all the columns should be contained in either aggregate function or the Group by clause.

How does rank over partition work?

To partition rows and rank them by their position within the partition, use the RANK() function with the PARTITION BY clause. SQL's RANK() function allows us to add a record's position within the result set or within each partition. In our example, we rank rows within a partition.

How can I get multiple records from a single record in SQL?

The common technique to do this kind of task is this - use the max or min function on a column that contains unique values to obtain a single record for each entity that we are interested in, such as customers (to get a single order for each customer).

How does rank and dense rank work?

DENSE_RANK computes the rank of a row in an ordered group of rows and returns the rank as a NUMBER . The ranks are consecutive integers beginning with 1. The largest rank value is the number of unique values returned by the query. Rank values are not skipped in the event of ties.


1 Answers

This query creates the desired result.

with cte as (
    select s2.NewAdd grp, s1.*
        , rank() over(partition by Supplier_No order by s2.NewAdd) rnk
    from sample_data s1
    inner join dupe_addresses s2 on  
        (s1.cleanAddress=s2.newAdd) or (s1.cleanRemit_Address=s2.newAdd)
)
select c1.*
from cte c1
where rnk = 1
order by c1.grp

removed the Union , and combine two join Conditions by OR instead. so a record with both conditions may be found.

The rank() is used to calculate a rank for each row within a partition of a result set.

partition by Supplier_No used to To identify duplicate records.

finally, use the where rnk = 1 to see the group of records without repetition.

like image 163
Reza Basereh Avatar answered Oct 12 '22 22:10

Reza Basereh