Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL Pattern for matching via a translation table

I have dirty data from 2 different sources. I'm looking for some best practices around matching them. Here are some examples of the data:

Source1.Name                   Source2.Name
Adda Clevenger Jr Prep School  Adda Clevenger Junior Preparatory School
Alice Fong Yu Alt School       Alice Fong Yu Alternative School
Convent Of Sacred Heart Es     Convent of Sacred Heart Elementary School
Rosa Parks Elementary School   Rosa Parks Elementary School

A human can see that these 4 examples should match up with an ideal fuzzy match. I have at my disposal excellent software for traditional fuzzy matching which will catch typos and other small variations. But in this data set I have about a dozen rules governing abbreviations like 'Preparatory' -> 'Prep'. I would like to capture all of those rules in the query. (Then I'll deal with the more traditional fuzziness separately.)

Is there a well-known SQL pattern for handling this requirement? It could be as simple as learning the magic keyword which will unlock examples in my searches. It's a sort of 'translation table' or 'abbreviation table', but I just made those terms up. I haven't found the widely accepted term yet.

Conceptually my goal is to start from this naive query:

/* This succeeds for 1 record and fails for 3 in the sample data set above. */
SELECT * FROM ...
WHERE Source1.Name = Source2.Name

Then modify it into something that gets all of the desired matches shown above. I expect that I could brute-force it with some nested REPLACE functions:

/* This works for the 4 samples given */
SELECT * FROM ...
WHERE 
  REPLACE( REPLACE( REPLACE( Source1.Name, 'Preparatory', 'Prep' ), 'Alternative', 'Alt' ), 'Elementary School', 'Es' )
  = REPLACE( REPLACE( REPLACE( Source2.Name, 'Preparatory', 'Prep' ), 'Alternative', 'Alt' ), 'Elementary School', 'Es' )

This does not feel elegant. It has increasing ugliness as I account for inconsistent abbreviations (e.g. 'International' is sometimes 'Intl' and sometimes 'Int''l'). And it's not particularly smooth for overlapping abbreviations (e.g. 'Elementary School' -> 'Es' but in other cases 'School' -> 'Sch').

How have others solved this?

Note: I'm using Oracle. I would likely use REGEXP_REPLACE rather than REPLACE. I would certainly use UPPER (or LOWER) to avoid case issues. But those details aren't core to the issue.

like image 798
mdahlman Avatar asked Nov 11 '22 20:11

mdahlman


1 Answers

If you have a set of known translations you can create a function capturing these. You can then create a virtual column on your tables which returns the result of this. You can then compare the virtual columns, simplifying your query:

create or replace function abbr_replace ( str varchar2 )  
  return varchar2 deterministic as
begin
  return replace( 
                   replace( 
                         replace( 
                             replace(
                               replace( lower( str ), 'preparatory', 'prep' ), 
                                 'junior', 'jr'),
                             'elementary school', 'es'), 
                         'alternative', 'alt' ), 
                     'elementary school', 'es' 
                 );
end abbr_replace;
/

create table source1 (
  name         varchar2(100),
  replace_name varchar2(100) as ( 
      cast ( abbr_replace ( name ) as varchar2(100) ) 
    )
);

create table source2 (
  name         varchar2(100),
  replace_name varchar2(100) as ( 
      cast ( abbr_replace ( name ) as varchar2(100) ) 
    )
);

insert into source1 (name) values ('Adda Clevenger Jr Prep School');
insert into source1 (name) values ('Alice Fong Yu Alt School');
insert into source1 (name) values ('Convent Of Sacred Heart Es');
insert into source1 (name) values ('Rosa Parks Elementary School');

insert into source2 (name) values ('Adda Clevenger Junior Preparatory School');
insert into source2 (name) values ('Alice Fong Yu Alternative School');
insert into source2 (name) values ('Convent of Sacred Heart Elementary School');
insert into source2 (name) values ('Rosa Parks Elementary School');

commit;

select s1.name, s2.name
from   source1 s1
join   source2 s2
on     s2.replace_name = s1.replace_name;

NAME                                               NAME                                             
-------------------------------------------------- --------------------------------------------------
Adda Clevenger Jr Prep School                      Adda Clevenger Junior Preparatory School          
Alice Fong Yu Alt School                           Alice Fong Yu Alternative School                  
Convent Of Sacred Heart Es                         Convent of Sacred Heart Elementary School         
Rosa Parks Elementary School                       Rosa Parks Elementary School

A couple of points to note:

  • You must declare the function as deterministic
  • You need to be on 11g or higher to use this

If you're looking for a more general fuzzy match, Oracle has implemented the Levenshtein Distance and Jaro-Winkler matching algorithms. These are in utl_match:

select s1.name, s2.name, utl_match.jaro_winkler(s1.name, s2.name) jw
from   source1 s1
join   source2 s2
on     utl_match.jaro_winkler(s1.name, s2.name) > .9;

NAME                                               NAME                                               JW
-------------------------------------------------- -------------------------------------------------- --
Adda Clevenger Jr Prep School                      Adda Clevenger Junior Preparatory School           0.904
Alice Fong Yu Alt School                           Alice Fong Yu Alternative School                   0.925
Convent Of Sacred Heart Es                         Convent of Sacred Heart Elementary School          0.902
Rosa Parks Elementary School                       Rosa Parks Elementary School                       1.000

Scripts also available on LiveSQL

like image 58
Chris Saxon Avatar answered Nov 14 '22 22:11

Chris Saxon