I'm writing a predictive search that, for server performance reasons (everything is cached), must run in the client browser. The items are TV shows and movies, matched by title and by actor and director names. Once the search has been executed, it returns a list of matched items with two values for each result:
The number of matching words (n): the user may type 4 words but only 2 of them match an item. The more, the better.
The Levenshtein edit distance sum (ld): the user may type 3 words but 2 of them have a typo or some other small difference from the indexed ones. I use the edit distance to find the nearest indexed word. The sum of all the Levenshtein distances is returned as a proximity indicator. The less, the better. (A minimal sketch of this distance follows the list.)
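For reference, here is a minimal sketch of how each per-word distance could be computed. This is the textbook iterative (dynamic programming) Levenshtein algorithm, not the code from my actual implementation:

function levenshtein(a, b) {
    // Classic dynamic-programming edit distance: the minimal number of
    // insertions, deletions and substitutions needed to turn `a` into `b`.
    var prev = [];
    for (var j = 0; j <= b.length; j++) prev[j] = j;
    for (var i = 1; i <= a.length; i++) {
        var curr = [i];
        for (var j = 1; j <= b.length; j++) {
            var cost = a[i - 1] === b[j - 1] ? 0 : 1;
            curr[j] = Math.min(prev[j] + 1,         // deletion
                               curr[j - 1] + 1,     // insertion
                               prev[j - 1] + cost); // substitution
        }
        prev = curr;
    }
    return prev[b.length];
}

// levenshtein("malkovich", "malkovik") === 2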
Requirements
Client-side. No Sphinx, Lucene or any other server-side solutions.
Speed over accuracy. The algorithm runs on each keystroke, and we don't want to bore the user. Keep the big O not so big.
Non-recursive. The calculation of each item's relevance shouldn't depend on the other items' calculations. I don't want to beat Google, only to show the best result of a small set first.
Bounded from 0 to 1, 0 to 100 or something like that. It's not a requirement, but being able to show a "relevance percentage" is a plus.
Ideas over implementations. I'm looking for an algorithm/formula rather than a specific implementation.
My approach
Based on exponential decay (like radioactive half-life decay), I made up this formula:
lambda = (n / T) * e^(-ld / n)
Where:
T is the number of words the user provided.
n is the number of matching words.
ld is the Levenshtein distance sum for those matching words.
In pseudocode:
function lambda(n, ld) {
    // T is the number of words the user provided (defined in the enclosing scope)
    return (n / T) * Math.exp(-ld / n);
}
A little explanation:
-ld / n is the core of the relevance measure. If ld is low and n is big, it comes close to zero (from the negative side) and indicates that the result is more relevant.
n / T is an accuracy ratio: matched words vs. all words. It refines the previous relevance by taking the total user input into account.
The exponential function, for negative powers, bounds the result between 0 and 1.
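To make the behaviour concrete, here is a small hypothetical run of the formula (the figures below are made-up illustrations, not data from my index):

// Hypothetical example: the user typed 3 words (T = 3).
var T = 3;

function lambda(n, ld) {
    return (n / T) * Math.exp(-ld / n);
}

lambda(3, 0); // 1.0   -> all 3 words matched exactly: perfect relevance
lambda(3, 2); // ~0.51 -> all 3 words matched, but with 2 typos in total
lambda(2, 1); // ~0.40 -> only 2 words matched, one of them with a typo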
And, finally, the question
What I want is not to refine the search algorithm itself (it is based on this response, with an extra edit distance calculation), but to improve the relevance sort of the returned elements by assigning a relevance value to each one. If any parameter other than n and ld is needed and easily computed, it can be used. In my solution I added T, the number of words the user provided.
In general, I believe you should simplify your formula. In fact, most basic relevance formulas like tf-idf are quite simple and use just products or ratios of parameters with, possibly, "strengthening" or "weakening" functions. For example, tf-idf is just the term frequency multiplied by the inverse document frequency, weakened by a logarithm. First I'll make a quick analysis of your formula and then make some proposals.
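For concreteness, the standard tf-idf weighting mentioned above is usually written as follows, where tf(t, d) is how often term t occurs in document d, N is the total number of documents and df(t) is the number of documents containing t:

tf-idf(t, d) = tf(t, d) * log(N / df(t))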
Let's rewrite your formula:
relevance = (n / T) * (1 / e^(ld / n))
First of all, note that n / T is not normalized: there may be more matching words (n) than search terms (T). Consider this example: the user enters the query "John Malkovich", and there's the film Being John Malkovich in your database. The user entered 2 words, i.e. T = 2, but the film has these terms in both the title and the cast, so n = 2 * 2 = 4. Given that, the final relevance will be greater than 1. Lack of normalization is not a problem by itself, but in practice it may lead to a number of errors in the future.
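One hypothetical way to restore the 0..1 bound (my suggestion, not something the original formula does) is to count each query word at most once, no matter how many fields it matches:

// Count each distinct query word at most once, regardless of how many
// fields (title, cast, director) it appears in, so that n <= T.
// `matchesWord(item, word)` is a hypothetical predicate over your index.
function matchedWordCount(item, queryWords) {
    var n = 0;
    for (var i = 0; i < queryWords.length; i++) {
        if (matchesWord(item, queryWords[i])) n++;
    }
    return n; // guaranteed to be <= queryWords.length, i.e. n <= T
}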
Now let's look at the second part of your formula: 1 / e^(ld / n). Let's denote ld / n as x. In this case the second part of your formula looks like this:
f(x) = 1 / e^x
So for high values of x it weakens the final relevance a lot. Though I don't understand why it must be exponential, it still makes sense. But x is not an independent variable; it is itself a function of 2 variables: x = x(ld, n). Moreover, ld is also a function: ld = ld(MAX_LD, T), so x depends on 3 different independent variables/parameters: x = x(MAX_LD, T, n). In this case it is very, very hard to predict the behavior of x (and so of the final relevance) for all possible cases.
1. Simplify x(). If you want the second part of your formula to keep track only of the Levenshtein distance, let it depend on this distance alone, not on all 3 independent variables. For example, your formula may look like:
relevance = (n / T) * (1 / e^distance)
or even:
relevance = (n / T) * (1 / (1 + distance))
where distance is the actual Levenshtein edit distance, and not a function of T and MAX_LD.
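A minimal sketch of the first simplified variant, assuming a helper nearestDistance(item, word) that returns the smallest Levenshtein distance between a query word and the item's indexed words, or null when nothing is close enough (both the helper name and its contract are assumptions for this example):

// Simplified relevance: the accuracy ratio damped only by the raw
// summed edit distance, independent of T and MAX_LD.
function relevance(item, queryWords) {
    var n = 0, distance = 0;
    for (var i = 0; i < queryWords.length; i++) {
        var d = nearestDistance(item, queryWords[i]); // hypothetical helper
        if (d !== null) { // null means no acceptable match for this word
            n++;
            distance += d;
        }
    }
    if (n === 0) return 0;
    return (n / queryWords.length) * (1 / Math.exp(distance));
}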
2. Use stemming. I know, I know, you said you cannot use server-side programming. But are you sure it cannot be performed on the client side? Stemming is much easier than it seems to be. Most of stemming is just about truncating suffixes and endings like "-s", "-ing", "-ment", etc. It is not ideal, but I believe it will give much better results than Levenshtein distance alone. The only strong restriction here is: stemming must be used for both indexing and searching.
For a more precise stemming algorithm, see the PorterStemmer class in the Lucene sources.
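A deliberately naive illustration of the suffix-truncation idea (a toy, nowhere near the Porter stemmer; the suffix list is just an assumption for the example):

// Toy stemmer: strips a few common English suffixes.
// Apply it to words at index time AND to the user's query words.
var SUFFIXES = ["ments", "ment", "ings", "ing", "ies", "es", "s"];

function stem(word) {
    word = word.toLowerCase();
    for (var i = 0; i < SUFFIXES.length; i++) {
        var suf = SUFFIXES[i];
        // Keep at least 3 characters so that "sing" doesn't become "s".
        if (word.length > suf.length + 2 &&
            word.lastIndexOf(suf) === word.length - suf.length) {
            return word.slice(0, word.length - suf.length);
        }
    }
    return word;
}

// stem("directors") === "director"; stem("acting") === "act"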
3. Use inverse record frequency. Recall the example with the query "John Malkovich". There may be plenty of films with the term "John", but only several with "Malkovich". It is natural to assume that the second term must have a greater weight in the search results than the first one. tf-idf involves this fact in its idf (inverse document frequency) part. You can do the same thing by computing an inverse record frequency as:
irf = number-of-all-found-records / number-of-records-with-current-term
and adding a third part to your relevance formula:
relevance = (n / T) * (1 / e^distance) * log(irf)
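A sketch of the irf weighting following the definition above (the function and parameter names are assumptions for this example):

// Weight each matched query word by log(irf) so that rare terms
// ("malkovich") count more than frequent ones ("john").
function irfWeight(numAllFoundRecords, numRecordsWithTerm) {
    return Math.log(numAllFoundRecords / Math.max(1, numRecordsWithTerm));
}

// Example: 1000 matched records, "john" in 120 of them, "malkovich" in 3:
// irfWeight(1000, 120) ~= 2.12
// irfWeight(1000, 3)   ~= 5.81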
And, of course, remember: no formula is good until it is tested on real data.