Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Make Levenstein's Distance algorithm fit my needs

I am having some trouble with the Levenstein Distance-algorithm.

I am using Levensteins distance algorithm to compare a product name with a list of product names, to find the closest match. However, I need to tweak it a bit. I am using the example from dotnetperls.com.

Say I have a list A of 2000 product names from my own database. I sell all these products myself.

Then all of a sudden I get a list B from one of my suppliers with product names and a new price for each product. This might happen more than once a year, so I want to develop software to do the work manually.

The problem is that this supplier isn't very good at consistency. So he makes small changes in the names every now and then, which means that I cannot do a simple string comparison.

I have implemented distance algorithm, but it does not really fit my needs. - yet!

While running through my suppliers list, I came across a product called

American Crew Anti Dandruff Shampoo 250 ml

This product was successfully matched with my own product called

American Crew Anti-Dandruff 250 ml.

With a distance of 10.

Problem

I also came across a product called

American Crew 3-In-1 Shampoo 450 ml.

Which was mistakenly matched with

American Crew Daily Shampoo 450 ml.

instead of my

American Crew 3 in 1 450 ml.

And I can see why! But I am not sure how I should change the algorithm from here.

Any ideas?

By the way, I am not really good with algorithms, but I believe some kind of weighing would help me out here.

EDIT:

Computation time isn't really a concern. Even if it takes ten hours to complete, it's still a lot better than doing it manually :P

like image 608
Søren Lorentzen Avatar asked Nov 01 '12 14:11

Søren Lorentzen


1 Answers

One approach would be use multiple methods to search and apply a weight to each of them. I've assumed you have some class Item with at least a string Name property:

double levWeight = 1.0;   // adjust these weights as you see fit
double matchWeight = 1.0;

// add whatever to here you'd like
var splitOn = new[] { ' ', '!', '"', '#', '$', '%', '&', '\'', '(', ')', '-' };
Func<string, string[]> split = 
    s => s.Split(splitOn, StringSplitOptions.RemoveEmptyEntries);

var matches =
    from xx in mine
    let xp = split(xx.Name)
    select new {
        Item = xx,
        Matches =
           (from yy in theirs
            let yp = split(yy.Name)

            /* found how many name components match */
            let mm = xp.Intersect(yp, StringComparer.OrdinalIgnoreCase).Count()

            /* find the Levenshtein distance of the two names */
            let ld = LevDist(xx, yy)

            /* weight our two criteria */
            let ww = (matchWeight * mm) + (levWeight / ld)

            /* should match on at least ONE name component */
            where mm > 0
            orderby ww descending
            select yy)
    };

When run against the corpus of data you have I receive the following output:

  • American Crew Anti Dandruff Shampoo 250 ml
    1. American Crew Anti-Dandruff 250 ml
    2. American Crew Daily Shampoo 450 ml
    3. American Crew 3 in 1 450 ml
  • American Crew 3-In-1 Shampoo 450 ml
    1. American Crew 3 in 1 450 ml
    2. American Crew Daily Shampoo 450 ml
    3. American Crew Anti-Dandruff 250 ml

If you had more criteria you would simply add them to the query (inline with the other let clauses) and apply some weight to their result.

Other possible applications are "name components in the same order":

let or = xp.Zip(yp, (a,b) => String.Compare(a, b, true))
           .TakeWhile(c => c == 0)
           .Count()
like image 119
user7116 Avatar answered Oct 01 '22 06:10

user7116