Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fuzzy grouping in SQL

Tags:

sql

sql-server

I need to modify a SQL table to group slightly mismatched names, and assign all elements in the group a standardized name.

For instance, if the initial table looks like this:

Name
--------
Jon Q
John Q
Jonn Q
Mary W
Marie W
Matt H

I would like to create a new table or add a field to the existing one like this:

Name     | StdName
--------------------
Jon Q    | Jon Q
John Q   | Jon Q
Jonn Q   | Jon Q
Mary W   | Mary W
Marie W  | Mary W
Matt H   | Matt H

In this case, I've chosen the first name to assign as the "standardized name," but I don't actually care which one is chosen -- ultimately the final "standardized name" will be hashed into a unique person ID. (I'm also open to alternative solutions that go directly to a numerical ID.) I will have birthdates to match on as well, so the accuracy of the name matching doesn't actually need to be all that precise in practice. I've looked into this a bit and will probably use the Jaro-Winkler algorithm (see e.g. here).

If I knew that the names were all in pairs, this would be a relatively easy query, but there can be an arbitrary number of the same name.

I can easily conceptualize how to do this query in a procedural language, but I'm not very familiar with SQL. Unfortunately I don't have direct access to the data -- it's sensitive data and so somebody else (a bureaucrat) has to run the actual query for me. The specific implementation will be SQL Server, but I'd prefer an implementation-agnostic solution.

EDIT:

In response to a comment, I had the following procedural approach in mind. It's in Python, and I replaced the Jaro-Winkler with simply matching on the first letter of the name, for the sake of having a working code example.

nameList = ['Jon Q', 'John Q', 'Jonn Q', 'Mary W', 'Marie W', 'Larry H']
stdList = nameList[:]

# loop over all names
for i1, name1 in enumerate(stdList):

  # loop over later names in list to find matches
  for i2, name2 in enumerate(stdList[i1+1:]):

    # If there's a match, replace latter with former.
    if (name1[0] == name2[0]):
      stdList[i1+1+i2] = name1

print stdList

The result is ['Jon Q', 'Jon Q', 'Jon Q', 'Mary W', 'Mary W', 'Larry H'].

like image 801
abeboparebop Avatar asked Dec 16 '22 11:12

abeboparebop


1 Answers

Just a thought, but you might be able to use the SOUNDEX() function. This will create a value for the names that are similar.

If you started with something like this:

select name, soundex(name) snd,
  row_number() over(partition by soundex(name)
                    order by soundex(name)) rn
from yt;

See SQL Fiddle with Demo. Which would give a result for each row that is similar along with a row_number() so you could return only the first value for each group. For example, the above query will return:

|    NAME |  SND | RN |
-----------------------
|   Jon Q | J500 |  1 |
|  John Q | J500 |  2 |
|  Jonn Q | J500 |  3 |
|  Matt H | M300 |  1 |
|  Mary W | M600 |  1 |
| Marie W | M600 |  2 |

Then you could select all of the rows from this result where the row_number() is equal to 1 and then join back to your main table on the soundex(name) value:

select t1.name,
  t2.Stdname
from yt t1
inner join
(
  select name as stdName, snd, rn
  from
  (
    select name, soundex(name) snd,
      row_number() over(partition by soundex(name)
                        order by soundex(name)) rn
    from yt
  ) d
  where rn = 1
) t2
  on soundex(t1.name) = t2.snd;

See SQL Fiddle with Demo. This gives a result:

|    NAME | STDNAME |
---------------------
|   Jon Q |   Jon Q |
|  John Q |   Jon Q |
|  Jonn Q |   Jon Q |
|  Mary W |  Mary W |
| Marie W |  Mary W |
|  Matt H |  Matt H |
like image 146
Taryn Avatar answered Dec 29 '22 13:12

Taryn