I have dirty data from 2 different sources. I'm looking for some best practices around matching them. Here are some examples of the data:
Source1.Name Source2.Name
Adda Clevenger Jr Prep School Adda Clevenger Junior Preparatory School
Alice Fong Yu Alt School Alice Fong Yu Alternative School
Convent Of Sacred Heart Es Convent of Sacred Heart Elementary School
Rosa Parks Elementary School Rosa Parks Elementary School
A human can see that these 4 examples should match up with an ideal fuzzy match. I have at my disposal excellent software for traditional fuzzy matching which will catch typos and other small variations. But in this data set I have about a dozen rules governing abbreviations like 'Preparatory' -> 'Prep'. I would like to capture all of those rules in the query. (Then I'll deal with the more traditional fuzziness separately.)
Is there a well-known SQL pattern for handling this requirement? It could be as simple as learning the magic keyword which will unlock examples in my searches. It's a sort of 'translation table' or 'abbreviation table', but I just made those terms up. I haven't found the widely accepted term yet.
Conceptually my goal is to start from this naive query:
/* This succeeds for 1 record and fails for 3 in the sample data set above. */
SELECT * FROM ...
WHERE Source1.Name = Source2.Name
Then modify it into something that gets all of the desired matches shown above. I expect that I could brute-force it with some nested REPLACE functions:
/* This works for the 4 samples given */
SELECT * FROM ...
WHERE
REPLACE( REPLACE( REPLACE( Source1.Name, 'Preparatory', 'Prep' ), 'Alternative', 'Alt' ), 'Elementary School', 'Es' )
= REPLACE( REPLACE( REPLACE( Source2.Name, 'Preparatory', 'Prep' ), 'Alternative', 'Alt' ), 'Elementary School', 'Es' )
This does not feel elegant. It has increasing ugliness as I account for inconsistent abbreviations (e.g. 'International' is sometimes 'Intl' and sometimes 'Int''l'). And it's not particularly smooth for overlapping abbreviations (e.g. 'Elementary School' -> 'Es' but in other cases 'School' -> 'Sch').
How have others solved this?
Note: I'm using Oracle. I would likely use REGEXP_REPLACE rather than REPLACE. I would certainly use UPPER (or LOWER) to avoid case issues. But those details aren't core to the issue.
If you have a set of known translations you can create a function capturing these. You can then create a virtual column on your tables which returns the result of this. You can then compare the virtual columns, simplifying your query:
create or replace function abbr_replace ( str varchar2 )
return varchar2 deterministic as
begin
return replace(
replace(
replace(
replace(
replace( lower( str ), 'preparatory', 'prep' ),
'junior', 'jr'),
'elementary school', 'es'),
'alternative', 'alt' ),
'elementary school', 'es'
);
end abbr_replace;
/
create table source1 (
name varchar2(100),
replace_name varchar2(100) as (
cast ( abbr_replace ( name ) as varchar2(100) )
)
);
create table source2 (
name varchar2(100),
replace_name varchar2(100) as (
cast ( abbr_replace ( name ) as varchar2(100) )
)
);
insert into source1 (name) values ('Adda Clevenger Jr Prep School');
insert into source1 (name) values ('Alice Fong Yu Alt School');
insert into source1 (name) values ('Convent Of Sacred Heart Es');
insert into source1 (name) values ('Rosa Parks Elementary School');
insert into source2 (name) values ('Adda Clevenger Junior Preparatory School');
insert into source2 (name) values ('Alice Fong Yu Alternative School');
insert into source2 (name) values ('Convent of Sacred Heart Elementary School');
insert into source2 (name) values ('Rosa Parks Elementary School');
commit;
select s1.name, s2.name
from source1 s1
join source2 s2
on s2.replace_name = s1.replace_name;
NAME NAME
-------------------------------------------------- --------------------------------------------------
Adda Clevenger Jr Prep School Adda Clevenger Junior Preparatory School
Alice Fong Yu Alt School Alice Fong Yu Alternative School
Convent Of Sacred Heart Es Convent of Sacred Heart Elementary School
Rosa Parks Elementary School Rosa Parks Elementary School
A couple of points to note:
deterministic
If you're looking for a more general fuzzy match, Oracle has implemented the Levenshtein Distance and Jaro-Winkler matching algorithms. These are in utl_match:
select s1.name, s2.name, utl_match.jaro_winkler(s1.name, s2.name) jw
from source1 s1
join source2 s2
on utl_match.jaro_winkler(s1.name, s2.name) > .9;
NAME NAME JW
-------------------------------------------------- -------------------------------------------------- --
Adda Clevenger Jr Prep School Adda Clevenger Junior Preparatory School 0.904
Alice Fong Yu Alt School Alice Fong Yu Alternative School 0.925
Convent Of Sacred Heart Es Convent of Sacred Heart Elementary School 0.902
Rosa Parks Elementary School Rosa Parks Elementary School 1.000
Scripts also available on LiveSQL
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With