I am tasked with matching free form text to data in a database.
What I mean by freeform, is that it is a textbox, and someone can type something/anything. For the most part, these entries are valid. I would like to find a list of values from a table that resemble what was typed in.
Before you ask, I have no control of said textbox, nor the people that type into it.
I am looking for techniques, not specific technologies.
Things I have tried:
- Clearing out the common words from both the criteria as well as the list. ie (the, of, in, etc.)
-
SOUNDEX function in sql, it is very weak, and not quite helpfull.
- The Levenshtein Distance algorithm and am pretty happy with the results, but it still needs lots of polish.
For example I have this list:
- The Hobbit: An Unexpected Journey
- The Hobbit: The Desolation of Smaug
- The Hobbit: There and Back Again
- Iron Man 3
- Despicable Me 2
- Fast & Furious 6
- Monsters University
- The Hunger Games: Catching Fire
- Man of Steel
- Gravity
- Thor: The Dark World
- The Croods
- World War Z
The users input could be:
- hobit unexpected journ
- The word 'hobit' is not spelled right
- Expected result:
- The Hobbit: An Unexpected Journey
- The Hobbit: There and Back Again
- The Hobbit: The Desolation of Smaug
- hunger game
- Expected result:
- The Hunger Games: Catching Fire
What I guess I'm asking is what other methods can I use to calculate these results. My Stack is .Net 4.0 and MSSQL 2008 R2
I would try an algorithm like the following:
- common words from both the criteria as well as the list. (the, of, in, etc.)
- for each criteria word check if it's included in an entry of the list
- if it's included, assign some score/value for this criteria word
- if it's not included, check the Levenshtein Distance between the criteria word, and any of the word in the enrty of the list you are checking against
- then assign a score/value for the lowest Levenshtein Distance you have found (maybe it's better to ignore any Levenshtein Distance higher than 3/4)
- when you have checked all the criteria word respect the current entry of the list, check how many word of the current entry are not included in the criteria, and assign a negative score/value for each of these word
- sum up all the score/value: now you have a single score/value for these criteria against a single entry of your list
Repeat this for any entry in your list.
If the data you are effectively analysing are films title:
- you should add some modifier, like using a multiplying factor on the value/score for the most recent films.
- you can speed up things by having 2 lists to check against: one with the most searched/recent films, and a second list with all the other titles (and if you get enough hit by checking the firs list, you can skip the check against the second list)