Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find matching records based on dynamic columns

I have a list of pets:
Pets

And I need to find a correct owner for each of the pet from Owner table

Owners

In order to correctly match each pet to an owner I need to use a special matching table that looks like this:

Matching

So, for pet with PetID=2 I need to find an owner that has a matched based on three fields:

    Pet.Zip = Owner.Zip 
    and Pet.OwnerName = Owner.Name 
    and Pet.Document = Owner.Document

In our example, it will work like this:

 select top 1 OwnerID from owners
         where Zip = 23456 
         and Name = 'Alex' 
         and Document = 'a.csv'

if OwnerID is not found I then need to match based on 2 fields (Not using field with the highest priority)

In our example:

 select top 1 OwnerID from owners where
             Name = 'Alex' 
             and Document = 'a.csv'

Since no record is found I we then need to match on less fields. In our example:

select top 1 OwnerID from owners where Document = 'a.csv'

Now, we found an owner with OwnerID = 6.

Now we need to update pet with ownerID = 6 and then we can process next pet.

The only way that I can do this right now involves a loop or a cursor + dynamic SQL.

Is it possible to achieve this without loops+dynamic sql? Maybe STUFF + Pivot somehow?

sql fiddle: http://sqlfiddle.com/#!18/10982/1/0

Sample data:

create table  temp_builder
(
    PetID int not null,
    Field varchar(30) not null,
    MatchTo varchar(30) not null,
    Priority int not null
)

insert into temp_builder values
(1,'Address', 'Addr',4),
(1,'Zip', 'Zip', 3),
(1,'Country', 'Country', 2),
(1,'OwnerName', 'Name',1),
(2,'Zip', 'Zip',3),
(2,'OwnerName','Name', 2),
(2,'Document', 'Document', 1),
(3,'Country', 'Country', 1)


create table temp_pets
(
    PetID int null,
    Address varchar(100) null,
    Zip int null,
    Country varchar(100) null,
    Document varchar(100) null,
    OwnerName varchar(100) null,
    OwnerID int null,
    Field1 bit null,
    Field2 bit null
)

insert into temp_pets values
(1, '123 5th st', 12345, 'US', 'test.csv', 'John', NULL, NULL, NULL),
(2, '234 6th st', 23456, 'US', 'a.csv', 'Alex', NULL, NULL, NULL),
(3, '345 7th st', 34567, 'US', 'b.csv', 'Mike', NULL, NULL, NULL)

create table temp_owners
(
    OwnerID int null,
    Addr varchar(100) null,
    Zip int null,
    Country varchar(100) null,
    Document varchar(100) null,
    Name varchar(100) null,
    OtherField bit null,
    OtherField2 bit null,
)

insert into temp_owners values
(1, '456 8th st',  45678, 'US', 'c.csv', 'Mike',  NULL, NULL),
(2, '678 9th st',  45678, 'US', 'b.csv', 'John',  NULL, NULL),
(3, '890 10th st', 45678, 'US', 'b.csv', 'Alex',  NULL, NULL),
(4, '901 11th st', 23456, 'US', 'b.csv', 'Alex',  NULL, NULL),
(5, '234 5th st',  12345, 'US', 'b.csv', 'John',  NULL, NULL),
(6, '123 5th st',  45678, 'US', 'a.csv', 'John',  NULL, NULL)

Edit: I'm overwhelmed by a number of great suggestions and responses. I've tested them and many worked well for me. Unfortunately, I can only award bounty to one solution.

like image 582
user194076 Avatar asked Mar 07 '18 06:03

user194076


2 Answers

The use of cursor, loops and dynamic SQL can be avoided by treating the fields used for comparison as an entry in a bit set for each pet. A bit set (FieldSetRank column) is calculated for each priority based on a bit entry (FieldRank rank column).

The Pets and Owner tables has to be unpivoted so that the fields and their associated values can be compared. Each of the fields and value that has been matched is assigned to a corresponding FieldRank. A new bit set is then calculated based on the matched values (MatchSetRank). Only records where the matched sets (MatchSetRank) are equal to the desired sets (FieldSetRank) are returned.

The query performs one final ranking to return records with the highest MatchSetRank (records that matched on the highest number of columns while maintaining priority criteria. The following T-SQL demonstrates the concept.

;WITH CTE_Builder
 AS
 (
     SELECT  [PetID]
            ,[Field]
            ,[Priority]
            ,[MatchTo]
            ,POWER(2, [Priority] - 1) AS [FieldRank] -- Define the field ranking as bit set numbered item.
            ,SUM(POWER(2, [Priority] - 1)) OVER (PARTITION BY [PetID] ORDER BY [Priority] ROWS UNBOUNDED PRECEDING) FieldSetRank -- Sum all the bit set IDs to define what constitutes a completed field set ordered by priority.
     FROM   temp_builder
 ),
CTE_PetsUnpivoted
AS
(   -- Unpivot pets table and assign Field Rank and Field Set Rank.
    SELECT   [PetsUnPivot].[PetID]
            ,[PetsUnPivot].[Field]
            ,[Builder].[MatchTo]
            ,[PetsUnPivot].[FieldValue]
            ,[Builder].[Priority]
            ,[Builder].[FieldRank]
            ,[Builder].[FieldSetRank]

    FROM 
       (
            SELECT [PetID], [Address], CAST([Zip] AS VARCHAR(100)) AS [Zip], [Country], [Document], [OwnerName]
            FROM temp_pets
        ) [Pets]
    UNPIVOT
       (FieldValue FOR Field IN 
          ([Address], [Zip], [Country], [Document], [OwnerName])
    ) AS [PetsUnPivot]
    INNER JOIN [CTE_Builder] [Builder] ON [PetsUnPivot].PetID = [Builder].PetID AND [PetsUnPivot].Field = [Builder].Field
),
CTE_Owners
AS
(
    -- Unpivot Owners table and join with unpivoted Pets table on field name and field value.  
    -- Next assign Pets field rank then calculated the field set rank (MatchSetRank) based on actual matches made.
    SELECT   [OwnersUnPivot].[OwnerID]
            ,[Pets].[PetID]
            ,[OwnersUnPivot].[Field]
            ,[Pets].Field AS [PetField]
            ,[Pets].FieldValue as PetFieldValue
            ,[OwnersUnPivot].[FieldValue]
            ,[Pets].[Priority]
            ,[Pets].[FieldRank]
            ,[Pets].[FieldSetRank]
            ,SUM([FieldRank]) OVER (PARTITION BY [Pets].[PetID], [OwnersUnPivot].[OwnerID] ORDER BY [Pets].[Priority] ROWS UNBOUNDED PRECEDING) MatchSetRank
    FROM 
       (
            SELECT [OwnerID], [Addr], CAST([Zip] AS VARCHAR(100)) AS [Zip], [Country], [Document], [Name]
            FROM temp_owners
        ) [Owners]
    UNPIVOT
       (FieldValue FOR Field IN 
          ([Addr], [Zip], [Country], [Document], [Name])
    ) AS [OwnersUnPivot]
    INNER JOIN [CTE_PetsUnpivoted] [Pets] ON [OwnersUnPivot].[Field] = [Pets].[MatchTo] AND [OwnersUnPivot].[FieldValue] = [Pets].[FieldValue]
),
CTE_FinalRanking
AS
(
    SELECT   [PetID]
            ,[OwnerID]
            -- -- Calculate final rank, if multiple matches have the same rank then multiple rows will be returned per pet. 
            -- Change the “RANK()” function to "ROW_NUMBER()" to only return on result per pet.
            ,RANK() OVER (PARTITION BY [PetID] ORDER BY [MatchSetRank] DESC) AS [FinalRank] 
    FROM    CTE_Owners
    WHERE   [FieldSetRank] = [MatchSetRank] -- Only return records where the field sets calculated based on 
                                            -- actual matches is equal to desired field set ranks. This will 
                                            -- eliminate matches where the number of fields that meets the 
                                            -- criteria is the same but does not meet priority requirements. 
)
SELECT   [PetID]
        ,[OwnerID]
FROM    CTE_FinalRanking
WHERE   [FinalRank] = 1
like image 122
Edmond Quinton Avatar answered Oct 22 '22 08:10

Edmond Quinton


I will say right away to save your time:

  • My solution uses dynamic SQL. Michał Turczyn correctly noted that you can't avoid it while names of compared column are stored in the DB.
  • My solution uses loops. And I strongly believe that you will not solve this problem with the pure SQL query which will work fast enough on the data size you declared (tables have > 1M records). The logic you described implies iterations by its nature - from bigger set of matching fields to the lower set. SQL as a query language was not designed to cover such tricky scenarios. You could try to solve your problem with pure SQL query, but even if you manage to build such query, it will be very tricky, complex and unclear. I'm not a fan of such solutions. That's why I have not even dig into this direction.
  • On the other hand my solution does not require creation of temporary tables which is an advantage.

Given this, my approach is fair straightforward:

  1. There is an outer loop that iterates from the biggest set of matchers (all matching fields) to the smallest set of matchers (one field). On the first iteration, when we don't know yet how many matchers are stored in the database for the pet, we read and use them all. On the next iterations, we decrease number of used matchers by 1 (removing the one with highest priority).

  2. The inner loop iterates over current set of matchers and builds the WHERE clause that compares fields between Pets and Owners table.

  3. Current query is executed and if some owner matches given criteria, we break from the outer loop.

Here is the code that implements this logic:

DECLARE @PetId INT = 2;

DECLARE @MatchersLimit INT;
DECLARE @OwnerID INT;

WHILE (@MatchersLimit IS NULL OR @MatchersLimit > 0) AND @OwnerID IS NULL
BEGIN

    DECLARE @CurrMatchFilter VARCHAR(max) = ''
    DECLARE @Field VARCHAR(30)
    DECLARE @MatchTo VARCHAR(30)
    DECLARE @CurrMatchersNumber INT = 0;

    DECLARE @GetMatchers CURSOR;
    IF @MatchersLimit IS NULL
        SET @GetMatchers = CURSOR FOR SELECT Field, MatchTo FROM temp_builder WHERE PetID = @PetId ORDER BY Priority ASC;
    ELSE
        SET @GetMatchers = CURSOR FOR SELECT TOP (@MatchersLimit) Field, MatchTo FROM temp_builder WHERE PetID = @PetId ORDER BY Priority ASC;

    OPEN @GetMatchers;
    FETCH NEXT FROM @GetMatchers INTO @Field, @MatchTo;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        IF @CurrMatchFilter <> '' SET @CurrMatchFilter = @CurrMatchFilter + ' AND ';
        SET @CurrMatchFilter = @CurrMatchFilter + ('temp_pets.' + @Field + ' = ' + 'temp_owners.' + @MatchTo);
        FETCH NEXT FROM @GetMatchers INTO @field, @matchTo;
        SET @CurrMatchersNumber = @CurrMatchersNumber + 1;
    END
    CLOSE @GetMatchers;
    DEALLOCATE @GetMatchers;

    IF @CurrMatchersNumber = 0 BREAK;

    DECLARE @CurrQuery nvarchar(max) = N'SELECT @id = temp_owners.OwnerID FROM temp_owners INNER JOIN temp_pets ON (' + CAST(@CurrMatchFilter AS NVARCHAR(MAX)) + N') WHERE temp_pets.PetID = ' + CAST(@PetId AS NVARCHAR(MAX));
    EXECUTE sp_executesql @CurrQuery, N'@id int OUTPUT', @id=@OwnerID OUTPUT;

    IF @MatchersLimit IS NULL
        SET @MatchersLimit = @CurrMatchersNumber - 1;
    ELSE
        SET @MatchersLimit = @MatchersLimit - 1;

END

SELECT @OwnerID AS OwnerID, @MatchersLimit + 1 AS Matched;

Performance considerations

There are basically 2 queries that are executed in this approach:

  1. SELECT Field, MatchTo FROM temp_builder WHERE PetID = @PetId;

    You should add an index on PetID field in temp_builder table and this query will be executed very fast.

  2. SELECT @id = temp_owners.OwnerID FROM temp_owners INNER JOIN temp_pets ON (temp_pets.Document = temp_owners.Document AND temp_pets.OwnerName = temp_owners.Name AND temp_pets.Zip = temp_owners.Zip AND ...) WHERE temp_pets.PetID = @PetId;

    This query looks scary because it joins two big tables - temp_owners and temp_pets. However temp_pets table is filtered by PetID column that should result in just one record. So if you have an index on temp_pets.PetID column (and you should as this column seems like a primary key), the query will result into scan of temp_owners table. Such scan will not take the ages even for table with over 1M rows. If the query is still too slow, you could consider adding indexes for columns of temp_owners table that are used in the matchers (Addr, Zip, etc.). Adding indexes has downsides, like bigger database and slower insert/update operations. So before adding the indexes on temp_owners columns, check the query speed on table without indexes.

like image 38
CodeFuller Avatar answered Oct 22 '22 08:10

CodeFuller