Are there any tools or methods that can be used for matching by a person's name between two different data sources?
The systems have no other common information and the names have been entered differently in many cases.
Examples of non-exact matches:
King Jr., Martin Luther = King, Martin (exclude suffix)
Erving, Dr. J. = Erving, J. (exclude prefix)
Obama, Barak Hussein = Obama, Barak (exclude middle name)
Pufnstuf, H.R. = Pufnstuf, Haibane Renmei (match abbreviations)
Tankengine, Thomas = Tankengine, Tom (match common nicknames)
Flair, Rick "the Natureboy" = Flair, Natureboy (match on nickname)
Name Matching Algorithms at a GlanceAssigns names a key or code based on their English pronunciation such that similar sounding names share the same key.
Fuzzy matching (FM), also known as fuzzy logic, approximate string matching, fuzzy name matching, or fuzzy string matching is an artificial intelligence and machine learning technology that identifies similar, but not identical elements in data table sets.
(INNER) JOIN : Returns records that have matching values in both tables. LEFT (OUTER) JOIN : Returns all records from the left table, and the matched records from the right table. RIGHT (OUTER) JOIN : Returns all records from the right table, and the matched records from the left table.
I had to use a variety of techniques suggested. Thanks pointing me in the right direction(s). Hopefully, the following will help someone else out with this type of problem to solve.
Removing excess characters
CREATE FUNCTION [dbo].[fn_StripCharacters]
(
@String NVARCHAR(MAX),
@MatchExpression VARCHAR(255)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
SET @MatchExpression = '%['+@MatchExpression+']%'
WHILE PatIndex(@MatchExpression, @String) > 0
SET @String = Stuff(@String, PatIndex(@MatchExpression, @String), 1, '')
RETURN @String
END
Usage:
--remove all non-alphanumeric and non-white space
dbo.fn_StripCharacters(@Value, , '^a-z^0-9 ')
Split name into parts
CREATE FUNCTION [dbo].[SplitTable] (@sep char(1), @sList StringList READONLY)
RETURNS @ResultList TABLE
(
[ID] VARCHAR(MAX),
[Val] VARCHAR(MAX)
)
AS
BEGIN
declare @OuterCursor cursor
declare @ID varchar(max)
declare @Val varchar(max)
set @OuterCursor = cursor fast_forward for (SELECT * FROM @sList) FOR READ ONLY
open @OuterCursor
fetch next from @OuterCursor into @ID, @Val
while (@@FETCH_STATUS=0)
begin
INSERT INTO @ResultList (ID, Val)
select @ID, split.s from dbo.Split(@sep, @Val) as split
where len(split.s) > 0
fetch next from @OuterCursor into @ID, @Val
end
close @OuterCursor
deallocate @OuterCursor
CREATE FUNCTION [dbo].[Split] (@sep char(1), @s varchar(8000))
RETURNS table
AS
RETURN (
WITH Pieces(pn, start, stop) AS (
SELECT 1, 1, CHARINDEX(@sep, @s)
UNION ALL
SELECT pn + 1, stop + 1, CHARINDEX(@sep, @s, stop + 1)
FROM Pieces
WHERE stop > 0
)
SELECT pn,
LTRIM(RTRIM(SUBSTRING(@s, start,
CASE WHEN stop > 0
THEN stop-start
ELSE 8000
END))) AS s
FROM Pieces
)
RETURN
Usage:
--create split name list
DECLARE @NameList StringList
INSERT INTO @NameList (ID, Val)
SELECT id, firstname FROM dbo.[User] u
WHERE PATINDEX('%[^a-z]%', u.FirstName) > 0
----remove split dups
select u.ID, COUNT(*)
from dbo.import_SplitTable(' ', @NameList) splitList
INNER JOIN dbo.[User] u
ON splitList.id = u.id
Common nicknames:
I created a table based on this list and used it to join on common name equivalents.
Usage:
SELECT u.id
, u.FirstName
, u_nickname_maybe.Name AS MaybeNickname
, u.LastName
, c.ID AS ContactID from
FROM dbo.[User] u
INNER JOIN nickname u_nickname_match
ON u.FirstName = u_nickname_match.Name
INNER JOIN nickname u_nickname_maybe
ON u_nickname_match.relatedid = u_nickname_maybe.id
LEFT OUTER JOIN
(
SELECT c.id, c.LastName, c.FirstName,
c_nickname_maybe.Name AS MaybeFirstName
FROM dbo.Contact c
INNER JOIN nickname c_nickname_match
ON c.FirstName = c_nickname_match.Name
INNER JOIN nickname c_nickname_maybe
ON c_nickname_match.relatedid = c_nickname_maybe.id
WHERE c_nickname_match.Name <> c_nickname_maybe.Name
) as c
ON c.AccountHolderID = ah.ID
AND u_nickname_maybe.Name = c.MaybeFirstName AND u.LastName = c.LastName
WHERE u_nickname_match.Name <> u_nickname_maybe.Name
Phonetic algorithms (Jaro Winkler):
The amazing article, Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, shows how to install and use the SimMetrics library into SQL Server. This library lets you find relative similarity between strings and includes numerous algorithms. I ended up mostly using Jaro Winkler to match the names.
Usage:
SELECT
u.id AS UserID
,c.id AS ContactID
,u.FirstName
,c.FirstName
,u.LastName
,c.LastName
,maxResult.CombinedScores
from
(
SELECT
u.ID
,
max(
dbo.JaroWinkler(lower(u.FirstName), lower(c.FirstName))
* dbo.JaroWinkler(LOWER(u.LastName), LOWER(c.LastName))
) AS CombinedScores
FROM dbo.[User] u, dbo.[Contact] c
WHERE u.ContactID IS NULL
GROUP BY u.id
) AS maxResult
INNER JOIN dbo.[User] u
ON maxResult.id = u.id
INNER JOIN dbo.[Contact] c
ON maxResult.CombinedScores =
dbo.JaroWinkler(lower(u.FirstName), lower(c.FirstName))
* dbo.JaroWinkler(LOWER(u.LastName), LOWER(c.LastName))
It's a very complex problem - and there are a lot of expensive tools to do it correctly.
If you ever wondered why you can't check in on a flight as Tom, Dick or Harry (or Bill)
Or why no-fly lists and terrorists watch lists don't work -consider:
(1) Muammar Qaddafi
(2) Mo'ammar Gadhafi
(3) Muammar Kaddafi
(4) Muammar Qadhafi
(5) Moammar El Kadhafi
(6) Muammar Gadafi
(7) Mu'ammar al-Qadafi
(8) Moamer El Kazzafi
(9) Moamar al-Gaddafi
(10) Mu'ammar Al Qathafi
(11) Muammar Al Qathafi
(12) Mo'ammar el-Gadhafi
(13) Moamar El Kadhafi
(14) Muammar al-Qadhafi
(15) Mu'ammar al-Qadhdhafi
(16) Mu'ammar Qadafi
(17) Moamar Gaddafi
(18) Mu'ammar Qadhdhafi
(19) Muammar Khaddafi
(20) Muammar al-Khaddafi
(21) Mu'amar al-Kadafi
(22) Muammar Ghaddafy
(23) Muammar Ghadafi
(24) Muammar Ghaddafi
(25) Muamar Kaddafi
(26) Muammar Quathafi
(27) Muammar Gheddafi
(28) Muamar Al-Kaddafi
(29) Moammar Khadafy
(30) Moammar Qudhafi
(31) Mu'ammar al-Qaddafi
(32) Mulazim Awwal Mu'ammar Muhammad Abu Minyar al-Qadhafi
And that's just official spellings - it doesn't include typos!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With