Free form text again find in database

Question

I am tasked with matching free form text to data in a database. What I mean by freeform, is that it is a textbox, and someone can type something/anything. For the most part, these entries are valid. I would like to find a list of values from a table that resemble what was typed in. Before you ask, I have no control of said textbox, nor the people that type into it. I am looking for techniques, not specific technologies.

Things I have tried:

Clearing out the common words from both the criteria as well as the list. ie (the, of, in, etc.)
SOUNDEX function in sql, it is very weak, and not quite helpfull.
The Levenshtein Distance algorithm and am pretty happy with the results, but it still needs lots of polish.

For example I have this list:

The Hobbit: An Unexpected Journey
The Hobbit: The Desolation of Smaug
The Hobbit: There and Back Again
Iron Man 3
Despicable Me 2
Fast & Furious 6
Monsters University
The Hunger Games: Catching Fire
Man of Steel
Gravity
Thor: The Dark World
The Croods
World War Z

The users input could be:

hobit unexpected journ
- The word 'hobit' is not spelled right
- Expected result:
  - The Hobbit: An Unexpected Journey
  - The Hobbit: There and Back Again
  - The Hobbit: The Desolation of Smaug
hunger game
- Expected result:
  - The Hunger Games: Catching Fire

What I guess I'm asking is what other methods can I use to calculate these results. My Stack is .Net 4.0 and MSSQL 2008 R2

Max · Accepted Answer

I would try an algorithm like the following:

common words from both the criteria as well as the list. (the, of, in, etc.)
for each criteria word check if it's included in an entry of the list
- if it's included, assign some score/value for this criteria word
- if it's not included, check the Levenshtein Distance between the criteria word, and any of the word in the enrty of the list you are checking against
  - then assign a score/value for the lowest Levenshtein Distance you have found (maybe it's better to ignore any Levenshtein Distance higher than 3/4)
when you have checked all the criteria word respect the current entry of the list, check how many word of the current entry are not included in the criteria, and assign a negative score/value for each of these word
sum up all the score/value: now you have a single score/value for these criteria against a single entry of your list

Repeat this for any entry in your list.

If the data you are effectively analysing are films title:

you should add some modifier, like using a multiplying factor on the value/score for the most recent films.
you can speed up things by having 2 lists to check against: one with the most searched/recent films, and a second list with all the other titles (and if you get enough hit by checking the firs list, you can skip the check against the second list)

Free form text again find in database

Tags:

c#

algorithm

sql-server

tsql

jonypony3

1 Answers

Max

Recent Activity

Donate For Us

Free form text again find in database

Tags:

c#

algorithm

sql-server

tsql

jonypony3

1 Answers

Max

Related questions

Recent Activity

Donate For Us