Calculating context-sensitive text correlation

Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I guess I would like to calculate some kind of correlation between the text values and merge the records if this value is over a certain threshold.

Example: "West Lawnmower Drive 54 A" is probably the same as "W. Lawn Mower Dr. 54A" but different from "East Lawnmower Drive 54 A".

How would you approach this problem? Would it be necessary to have some kind of context-based dictionary that knows, in the address case, that "W", "W." and "West" are the same? What about misspellings ("mover" instead of "mower" etc)?

I think this is a tricky one - perhaps there are some well-known algorithms out there?

Anders Fjeldstad asked Dec 03 '09 14:12


3 Answers

A good baseline, though probably an impractical one because of its relatively high computational cost and, more importantly, the many false positives it produces, would be a generic string distance algorithm such as

  • Edit distance (aka Levenshtein distance)
  • Ratcliff/Obershelp
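
As a rough illustration of that baseline, here is a minimal Python sketch. The `levenshtein` function is a plain textbook dynamic-programming version written for this example, and `gestalt_ratio` leans on the standard library's `difflib.SequenceMatcher`, which the Python documentation describes as related to the Ratcliff/Obershelp "gestalt pattern matching" approach rather than an exact implementation of it. Both function names are made up for the sketch.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def gestalt_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] as reported by difflib's matcher."""
    return SequenceMatcher(None, a, b).ratio()


reference = "West Lawnmower Drive 54 A"
for candidate in ("W. Lawn Mower Dr. 54A", "East Lawnmower Drive 54 A"):
    print(candidate,
          levenshtein(reference, candidate),
          round(gestalt_ratio(reference, candidate), 2))
```

On the example from the question, the "East" address is only two substitutions away from the reference, which is closer than the abbreviated spelling of the same address. That is exactly the kind of false positive a raw string distance produces.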

Depending on the level of accuracy required (which, BTW, should be specified in terms of both recall and precision, i.e. generally expressing whether it is worse to miss a correlation or to falsely identify one), a home-grown process based on [some of] the following heuristics and ideas could do the trick:

  • tokenize the input, i.e. see the input as an array of words rather than a string
  • tokenization should also keep the line number info
  • normalize the input with the use of a short dictionary of common substitutions (such as "dr" at the end of a line = "drive", "Jack" = "John", "Bill" = "William"..., "W." at the beginning of a line = "West", etc.)
  • Identify (a bit like tagging, as in POS tagging) the nature of some entities (for example ZIP code, extended ZIP code, and also city)
  • Identify (look up) some of these entities (for example, a relatively short database table can include all the cities / towns in the targeted area)
  • Identify (look up) some domain-related entities (if all/many of the addresses deal with, say, folks in the legal profession, a lookup of law firm names or of federal buildings may be of help)
  • Generally, put more weight on tokens that come from the last line of the address
  • Put more (or less) weight on tokens with a particular entity type (e.g. "Drive", "Street", "Court" should weigh much less than the tokens which precede them)
  • Consider a modified SOUNDEX algorithm to help with normalization of
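
The first few heuristics (tokenizing with line numbers, then normalizing through a small substitution dictionary) could be sketched roughly as follows. The two dictionaries and the function names `tokenize` and `normalize` are hypothetical and deliberately tiny; a real table would be built for the locale and domain of the data. The positional rule is also a simplification: directions are expanded only at the start of a line, and street types anywhere else.

```python
import re

# Hypothetical substitution tables, for illustration only.
LEADING = {"w": "west", "e": "east", "n": "north", "s": "south"}
STREET_TYPES = {"dr": "drive", "st": "street", "ct": "court"}


def tokenize(address: str) -> list[tuple[int, str]]:
    """Split an address into lower-cased (line_number, token) pairs."""
    tokens = []
    for line_no, line in enumerate(address.splitlines()):
        # Letter runs and digit runs become separate tokens: "54A" -> "54", "a"
        for word in re.findall(r"[A-Za-z]+|\d+", line):
            tokens.append((line_no, word.lower()))
    return tokens


def normalize(tokens: list[tuple[int, str]]) -> list[tuple[int, str]]:
    """Expand known abbreviations, depending on position within the line."""
    result = []
    for index, (line_no, token) in enumerate(tokens):
        first_on_line = index == 0 or tokens[index - 1][0] != line_no
        if first_on_line and token in LEADING:
            token = LEADING[token]
        elif not first_on_line and token in STREET_TYPES:
            token = STREET_TYPES[token]
        result.append((line_no, token))
    return result


print(normalize(tokenize("W. Lawn Mower Dr. 54A")))
print(normalize(tokenize("West Lawnmower Drive 54 A")))
```

After this step the two spellings differ only in "lawn" + "mower" versus "lawnmower", which is a much easier gap for a later fuzzy comparison to close.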

With the above in mind, implement a rule-based evaluator. Tentatively, the rules could be implemented as visitors to a tree/array-like structure where the input is parsed initially (Visitor design pattern).
The advantage of the rule-based framework is that each heuristic lives in its own function and rules can be prioritized, i.e. some rules can be placed early in the chain so that strong heuristics abort the evaluation early (e.g. different city => correlation = 0, level of confidence = 95%, etc.).
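
A minimal sketch of such a prioritized rule chain is shown below. It uses a plain list of functions rather than a full Visitor implementation, and every name in it (`Record`, `Verdict`, `different_city`, `token_overlap`, `evaluate`) is invented for the example. The 0.95 confidence mirrors the figure mentioned above; the 0.5 is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Record:
    # Hypothetical pre-parsed address; field names are illustrative only.
    city: str
    street_tokens: tuple[str, ...]


@dataclass
class Verdict:
    correlation: float  # 0.0 = different, 1.0 = same
    confidence: float   # how much the rule trusts its own answer


# A rule returns a Verdict when it can decide, or None to defer to the next rule.
Rule = Callable[[Record, Record], Optional[Verdict]]


def different_city(a: Record, b: Record) -> Optional[Verdict]:
    """Strong heuristic: a different city ends the evaluation immediately."""
    if a.city != b.city:
        return Verdict(correlation=0.0, confidence=0.95)
    return None


def token_overlap(a: Record, b: Record) -> Optional[Verdict]:
    """Weak fallback: share of street tokens the two records have in common."""
    sa, sb = set(a.street_tokens), set(b.street_tokens)
    if not sa or not sb:
        return None
    return Verdict(correlation=len(sa & sb) / len(sa | sb), confidence=0.5)


RULES: list[Rule] = [different_city, token_overlap]  # order = priority


def evaluate(a: Record, b: Record, rules: list[Rule] = RULES) -> Verdict:
    for rule in rules:
        verdict = rule(a, b)
        if verdict is not None:
            return verdict  # early abort: later rules never run
    return Verdict(correlation=0.0, confidence=0.0)  # no rule could decide
```

For instance, `evaluate(Record("Oslo", ("west", "lawnmower")), Record("Bergen", ("west", "lawnmower")))` stops at the first rule and never looks at the street tokens.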

An important consideration when searching for correlations is the a priori need to compare every single item (here, an address) with every other item, which requires as many as 1/2 n^2 item-level comparisons. Because of this, it may be useful to store the reference items in pre-processed form (parsed, normalized...) and perhaps also to keep a digest/key of sorts that can be used as a [very rough] indicator of a possible correlation (for example, a key made of the 5-digit ZIP code followed by the SOUNDEX value of the "primary" name).
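
To make the digest/key idea concrete, here is a hedged sketch that groups records by "ZIP + Soundex" and only generates comparison pairs inside each group. The `soundex` function is a compact version of the classic American Soundex rules written for this example, and the record layout (a ZIP string plus one "primary" name) is an assumption.

```python
from collections import defaultdict
from itertools import combinations

_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word: str) -> str:
    """Four-character Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    last = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != last:
            code += digit
        if c not in "hw":  # h and w do not separate equal codes; vowels do
            last = digit
    return (code + "000")[:4]


def blocking_key(zip_code: str, primary_name: str) -> str:
    return f"{zip_code[:5]}-{soundex(primary_name)}"


def candidate_pairs(records: list[tuple[str, str]]) -> list[tuple[int, int]]:
    """Index pairs that share a blocking key; only these get compared in detail."""
    buckets = defaultdict(list)
    for index, (zip_code, name) in enumerate(records):
        buckets[blocking_key(zip_code, name)].append(index)
    pairs = []
    for members in buckets.values():
        pairs.extend(combinations(members, 2))
    return pairs


# Made-up sample data: (ZIP, primary name)
records = [("12345", "Lawnmower"), ("12345", "Lawnmowr"),
           ("54321", "Lawnmower"), ("12345", "Main")]
print(candidate_pairs(records))
```

With these four made-up records, only one pair survives instead of the six a full pairwise comparison would need. The price is recall: two records whose keys differ (a typo in the ZIP code, say) are never compared at all.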

mjv answered Nov 11 '22 05:11


I would look at producing a similarity comparison metric that, given two objects (strings perhaps), returns the "distance" between them.

It helps if your metric fulfils the following criteria:

  1. distance between an object and itself is zero. (reflexive)
  2. distance from a to b is the same in both directions (symmetric)
  3. distance from a to c is not more than distance from a to b plus distance from b to c. (triangle rule)

If your metric obeys these then you can arrange your objects in a metric space, which means you can run queries like:

  • Which other object is most like this one
  • Give me the 5 objects most like this one.
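
As a small sketch of that setup, the code below checks the three rules on a handful of samples and answers "most like this one" queries by brute force, with the distance function passed in so it can be swapped out. The Jaccard distance on character bigrams is only one possible choice, picked here because it is short to write; all function names are invented for the example. Passing the check on a sample set is evidence, not proof, that the rules hold in general.

```python
import heapq
from itertools import product


def bigrams(text: str) -> set[str]:
    """Set of adjacent character pairs, with a space padded on each side."""
    padded = f" {text.lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def jaccard_distance(a: str, b: str) -> float:
    """1 - |intersection| / |union| of the two bigram sets."""
    sa, sb = bigrams(a), bigrams(b)
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


def obeys_metric_rules(distance, samples, tol=1e-12) -> bool:
    """Check the three rules on every triple drawn from the samples."""
    for a, b, c in product(samples, repeat=3):
        if distance(a, a) > tol:                                  # rule 1
            return False
        if abs(distance(a, b) - distance(b, a)) > tol:            # rule 2
            return False
        if distance(a, c) > distance(a, b) + distance(b, c) + tol:  # rule 3
            return False
    return True


def most_like(query, items, distance, k=1):
    """The k items with the smallest distance to the query (brute force)."""
    return heapq.nsmallest(k, items, key=lambda item: distance(query, item))


addresses = ["main street 1", "west lawnmower drive 54 a",
             "east lawnmower drive 54 a", "w. lawn mower dr. 54a"]
print(obeys_metric_rules(jaccard_distance, addresses))
print(most_like("west lawnmover drive 54a", addresses, jaccard_distance, k=2))
```

A dedicated metric-space index would replace the brute-force scan in `most_like`, but the interface (a query, a collection, a pluggable distance) stays the same.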

There's a good book about it here. Once you've set up the infrastructure for hosting objects and running the queries you can simply plug in different comparison algorithms, compare their performance and then tune them.

I did this for geographic data at university and it was quite fun trying to tune the comparison algorithms.

I'm sure you could come up with something more advanced but you could start with something simple like reducing the address line to the digits and the first letter of each word and then compare the result of that using a longest common subsequence algorithm.
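
That starting point could look something like the following sketch. The names `signature`, `lcs_length` and `similarity` are made up for the example, and scaling the common subsequence length by the combined signature length is just one plausible way to turn it into a score.

```python
import re


def signature(address: str) -> str:
    """Keep every digit run and the first letter of each word, upper-cased."""
    tokens = re.findall(r"[A-Za-z]+|\d+", address)
    return "".join(tok if tok.isdigit() else tok[0].upper() for tok in tokens)


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def similarity(addr_a: str, addr_b: str) -> float:
    """Score in [0, 1]: twice the LCS length over the combined signature length."""
    sa, sb = signature(addr_a), signature(addr_b)
    if not sa and not sb:
        return 1.0
    return 2 * lcs_length(sa, sb) / (len(sa) + len(sb))


reference = "West Lawnmower Drive 54 A"
print(signature(reference))
print(similarity(reference, "W. Lawn Mower Dr. 54A"))
print(similarity(reference, "East Lawnmower Drive 54 A"))
```

For the addresses in the question the signatures are "WLD54A", "WLMD54A" and "ELD54A". The first two share a common subsequence of length 6, the first and third only 5, so the abbreviated spelling of the same address scores higher than the "East" one.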

Hope that helps in some way.

Tom Duckering answered Nov 11 '22 05:11


You can use Levenshtein edit distance to find strings that differ by only a few characters. BK Trees can help speed up the matching process.
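
For illustration, a BK-tree over Levenshtein distance might be sketched like this. The `BKTree` class and its method names are invented for the example, not taken from a library. The pruning step relies on the triangle inequality: a child stored at edge distance e can only contain matches if e lies within `max_distance` of the distance between the query and the current node.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


class BKTree:
    """Each node is (word, {edge_distance: child_node})."""

    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child  # descend along the edge labelled d

    def search(self, word: str, max_distance: int) -> list[tuple[int, str]]:
        """All stored words within max_distance, as sorted (distance, word) pairs."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_distance:
                results.append((d, candidate))
            for edge, child in children.items():
                # Triangle inequality: other subtrees cannot hold a match.
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)


tree = BKTree()
for w in ["mower", "mover", "lawn", "drive", "tower", "mow"]:
    tree.add(w)
print(tree.search("mower", 1))
```

The more tightly `max_distance` is set, the more subtrees are skipped, which is where the speed-up over comparing against every stored string comes from.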

Ken Bloom answered Nov 11 '22 04:11