I am having some trouble with the Levenstein Distance-algorithm.
I am using Levensteins distance algorithm to compare a product name with a list of product names, to find the closest match. However, I need to tweak it a bit. I am using the example from dotnetperls.com.
Say I have a list A of 2000 product names from my own database. I sell all these products myself.
Then all of a sudden I get a list B from one of my suppliers with product names and a new price for each product. This might happen more than once a year, so I want to develop software to do the work manually.
The problem is that this supplier isn't very good at consistency. So he makes small changes in the names every now and then, which means that I cannot do a simple string comparison.
I have implemented distance algorithm, but it does not really fit my needs. - yet!
While running through my suppliers list, I came across a product called
American Crew Anti Dandruff Shampoo 250 ml
This product was successfully matched with my own product called
American Crew Anti-Dandruff 250 ml.
With a distance of 10.
Problem
I also came across a product called
American Crew 3-In-1 Shampoo 450 ml.
Which was mistakenly matched with
American Crew Daily Shampoo 450 ml.
instead of my
American Crew 3 in 1 450 ml.
And I can see why! But I am not sure how I should change the algorithm from here.
Any ideas?
By the way, I am not really good with algorithms, but I believe some kind of weighing would help me out here.
EDIT:
Computation time isn't really a concern. Even if it takes ten hours to complete, it's still a lot better than doing it manually :P
One approach would be use multiple methods to search and apply a weight to each of them. I've assumed you have some class Item
with at least a string Name
property:
double levWeight = 1.0; // adjust these weights as you see fit
double matchWeight = 1.0;
// add whatever to here you'd like
var splitOn = new[] { ' ', '!', '"', '#', '$', '%', '&', '\'', '(', ')', '-' };
Func<string, string[]> split =
s => s.Split(splitOn, StringSplitOptions.RemoveEmptyEntries);
var matches =
from xx in mine
let xp = split(xx.Name)
select new {
Item = xx,
Matches =
(from yy in theirs
let yp = split(yy.Name)
/* found how many name components match */
let mm = xp.Intersect(yp, StringComparer.OrdinalIgnoreCase).Count()
/* find the Levenshtein distance of the two names */
let ld = LevDist(xx, yy)
/* weight our two criteria */
let ww = (matchWeight * mm) + (levWeight / ld)
/* should match on at least ONE name component */
where mm > 0
orderby ww descending
select yy)
};
When run against the corpus of data you have I receive the following output:
If you had more criteria you would simply add them to the query (inline with the other let
clauses) and apply some weight to their result.
Other possible applications are "name components in the same order":
let or = xp.Zip(yp, (a,b) => String.Compare(a, b, true))
.TakeWhile(c => c == 0)
.Count()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With