
Improving search result using Levenshtein distance in Java

I have the following working Java code that searches for a word in a list of words. It works as expected:

    import java.util.HashSet;
    import java.util.Set;

    public class Levenshtein {
        private int[][] wordMartix;

        public Set<String> similarExists(String searchWord) {

            int maxDistance = searchWord.length();
            int curDistance;
            int sumCurMax;
            String checkWord;

            // preventing double words on returning list
            Set<String> fuzzyWordList = new HashSet<>();

            for (Object wordList : Searcher.wordList) {
                checkWord = String.valueOf(wordList);
                curDistance = calculateDistance(searchWord, checkWord);
                sumCurMax = maxDistance + curDistance;
                // keep the word when distance == checkWord.length() - searchWord.length()
                if (sumCurMax == checkWord.length()) {
                    fuzzyWordList.add(checkWord);
                }
            }
            return fuzzyWordList;
        }

        public int calculateDistance(String inputWord, String checkWord) {
            wordMartix = new int[inputWord.length() + 1][checkWord.length() + 1];

            // first column: cost of deleting i characters
            for (int i = 0; i <= inputWord.length(); i++) {
                wordMartix[i][0] = i;
            }

            // first row: cost of inserting j characters
            for (int j = 0; j <= checkWord.length(); j++) {
                wordMartix[0][j] = j;
            }

            for (int i = 1; i < wordMartix.length; i++) {
                for (int j = 1; j < wordMartix[i].length; j++) {
                    if (inputWord.charAt(i - 1) == checkWord.charAt(j - 1)) {
                        wordMartix[i][j] = wordMartix[i - 1][j - 1];
                    } else {
                        int minimum = Integer.MAX_VALUE;
                        if ((wordMartix[i - 1][j]) + 1 < minimum) {
                            minimum = (wordMartix[i - 1][j]) + 1;
                        }

                        if ((wordMartix[i][j - 1]) + 1 < minimum) {
                            minimum = (wordMartix[i][j - 1]) + 1;
                        }

                        if ((wordMartix[i - 1][j - 1]) + 1 < minimum) {
                            minimum = (wordMartix[i - 1][j - 1]) + 1;
                        }

                        wordMartix[i][j] = minimum;
                    }
                }
            }

            return wordMartix[inputWord.length()][checkWord.length()];
        }

    }

Right now, when I search for a word like job, it returns this list:

Output

joborienterede jobannoncer jobfunktioner perjacobsen jakobsen jobprofiler jacob jobtitler jobbet jobdatabaserne jobfunktion jakob jobs studenterjobber johannesburg jobmuligheder jobannoncerne jobbaser job joberfaringer 

As you can see, the output contains many related words but also unrelated ones such as jakob and jacob. That is correct according to the Levenshtein formula, but I would like to build on it and write a method that fine-tunes the search so that I get more relevant, related words.
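One way to read the filter in similarExists: the condition sumCurMax == checkWord.length() holds exactly when the distance equals the length difference between the two words. That happens only when the candidate can be built from the search word by insertions alone, so the search word is a subsequence of the candidate. The sketch below checks that property directly and suggests why jacob and johannesburg pass alongside jobs. The class and method names are illustrative and not part of the original code.

```java
final class SubsequenceFilterDemo {

    // Returns true when every character of `needle` appears in `candidate`
    // in the same order, though not necessarily adjacent.
    static boolean isSubsequence(String needle, String candidate) {
        int matched = 0;
        for (int k = 0; k < candidate.length() && matched < needle.length(); k++) {
            if (candidate.charAt(k) == needle.charAt(matched)) {
                matched++;
            }
        }
        return matched == needle.length();
    }

    public static void main(String[] args) {
        // A few words from the output above, plus one hypothetical non-match.
        String[] words = {"jobs", "jacob", "johannesburg", "java"};
        for (String word : words) {
            System.out.println(word + " -> " + isSubsequence("job", word));
        }
    }
}
```

Under this reading, any word that contains j, o and b in that order is accepted, however far apart the letters are.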

I have worked on it for a few hours and have run out of ideas.

My question: Is it possible to fine-tune the existing method so that it returns relevant/related words, or should I take another approach? Either way, I would appreciate input and inspiration on improving the search results.


UPDATE

I asked this question a long time ago and never really found a solution. I am coming back to it because I now need a useful answer. Java code samples are welcome, but what matters most is a detailed answer describing the available methods and approaches for ranking the best and most relevant search results while ignoring irrelevant words. I know this is an open and endless area, but I need some inspiration to get started somewhere.

Note: The oldest answer so far is based on one of the comments and is not helpful: it only sorts by distance, which does not improve the quality of the search results.

So I sorted by distance, and the results looked like this:

job jobs jacob jakob jobbet jakobsen jobbaser jobtitler jobannoncer jobfunktion jobprofiler perjacobsen johannesburg jobannoncerne joberfaringer jobfunktioner jobmuligheder jobdatabaserne joborienterede studenterjobber 

So the word jobbaser is relevant and jacob/jakob are not, yet the distance for jobbaser is larger than for jacob/jakob. So that did not really help.
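To make the numbers concrete, here is a minimal sketch that prints the plain Levenshtein distance from job to a few of the words above. DistanceRankingDemo and its distance helper are illustrative names: a two-row stand-in for calculateDistance, not the original code.

```java
final class DistanceRankingDemo {

    // Plain Levenshtein distance (insert, delete, substitute all cost 1),
    // computed with two rolling rows instead of a full matrix.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String[] words = {"jobs", "jacob", "jakob", "jobbaser"};
        for (String word : words) {
            System.out.println(word + " " + distance("job", word));
        }
    }
}
```

jacob and jakob each need only two insertions, while jobbaser needs five, so any ranking by raw distance alone puts the names ahead of the longer job compounds.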


General feedback regarding answers

  • @SergioMontoro: it almost solves the problem.
  • @uSeemSurprised: it solves the problem but needs continual manipulation.
  • @Gene: the concept is excellent, but it relies on an external URL.

Thanks: I would like to personally thank everyone who contributed to this question. I have received good answers and useful comments.

Special thanks for the answers from @SergioMontoro, @uSeemSurprised and @Gene, which are different but all valid and useful.

@D.Kovács points to an interesting solution.

I wish I could give the bounty to all of those answers. I chose one answer and gave it the bounty. That does not mean the other answers are not valid, only that the answer I chose was the most useful for me.

asked Nov 15 '15 at 16:11 by Maytham




2 Answers

Without understanding the meaning of the words, as @DrYap suggests, the next logical unit for comparing two words (if you are not looking for misspellings) is the syllable. It is very easy to modify Levenshtein to compare syllables instead of characters. The hard part is breaking the words into syllables. There is a Java implementation, TeXHyphenator-J, which can be used to split the words. Based on this hyphenation library, here is a modified version of the Levenshtein function written by Michael Gilleland & Chas Emerick. More about syllable detection here and here. Of course, you will probably want to avoid syllable comparison of two single-syllable words and handle that case with standard Levenshtein instead.

    import net.davidashen.text.Hyphenator;

    public class WordDistance {

        public static void main(String args[]) throws Exception {
            Hyphenator h = new Hyphenator();
            h.loadTable(WordDistance.class.getResourceAsStream("hyphen.tex"));
            // print the distance between the two words given on the command line
            System.out.println(getSyllableLevenshteinDistance(h, args[0], args[1]));
        }

        /**
         * <p>
         * Calculate Syllable Levenshtein distance between two words </p>
         * The Syllable Levenshtein distance is defined as the minimal number of
         * case-insensitive syllables you have to replace, insert or delete to transform word1 into word2.
         * @return int
         * @throws NullPointerException if either s or t is <b>null</b>
         */
        public static int getSyllableLevenshteinDistance(Hyphenator h, String s, String t) {
            if (s == null || t == null)
                throw new NullPointerException("Strings must not be null");

            // the hyphenator marks syllable breaks with soft hyphens (character 173)
            final String hyphen = Character.toString((char) 173);
            final String[] ss = h.hyphenate(s).split(hyphen);
            final String[] st = h.hyphenate(t).split(hyphen);

            final int n = ss.length;
            final int m = st.length;

            if (n == 0)
                return m;
            else if (m == 0)
                return n;

            int p[] = new int[n + 1]; // 'previous' cost array, horizontally
            int d[] = new int[n + 1]; // cost array, horizontally

            for (int i = 0; i <= n; i++)
                p[i] = i;

            for (int j = 1; j <= m; j++) {
                d[0] = j;
                for (int i = 1; i <= n; i++) {
                    int cost = ss[i - 1].equalsIgnoreCase(st[j - 1]) ? 0 : 1;
                    // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                    d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
                }
                // copy current distance counts to 'previous row' distance counts
                int[] _d = p;
                p = d;
                d = _d;
            }

            // our last action in the above loop was to switch d and p, so p now actually has the most recent cost counts
            return p[n];
        }

    }
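The single-syllable fallback mentioned above could be sketched roughly as follows. To stay self-contained, this version takes words that are already split into syllables instead of calling the hyphenation library. SyllableFallbackDemo, its method names and the hand-split syllables in main are all assumptions made for illustration, not output of TeXHyphenator-J.

```java
final class SyllableFallbackDemo {

    // Edit distance over two token sequences; tokens are compared case-insensitively.
    static int editDistance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] cur = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = a[i - 1].equalsIgnoreCase(b[j - 1]) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length];
    }

    // Splits a word into one-character tokens so editDistance can work per character.
    static String[] chars(String s) {
        String[] out = new String[s.length()];
        for (int k = 0; k < s.length(); k++) {
            out[k] = String.valueOf(s.charAt(k));
        }
        return out;
    }

    // Syllable-level distance, falling back to character-level Levenshtein
    // when both words consist of a single syllable.
    static int distance(String[] syllablesA, String[] syllablesB) {
        if (syllablesA.length == 1 && syllablesB.length == 1) {
            return editDistance(chars(syllablesA[0]), chars(syllablesB[0]));
        }
        return editDistance(syllablesA, syllablesB);
    }

    public static void main(String[] args) {
        // Hand-split syllables, assumed for illustration only.
        System.out.println(distance(new String[] {"job"}, new String[] {"jobs"}));
        System.out.println(distance(new String[] {"job", "ti", "tler"}, new String[] {"job", "ba", "ser"}));
    }
}
```

Note that the two branches return values in different units (characters versus syllables), so scores from the fallback are not directly comparable with syllable-level scores without some normalization.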
answered Sep 20 '22 at 12:09 by Serg M Ten


You can modify Levenshtein Distance by adjusting the scoring when consecutive characters match.

Whenever consecutive characters match, the score can be reduced, which makes the search more relevant.

E.g., say the factor by which we reduce the score is 10. If a word contains the substring "job", the score is reduced by 10 when we match "j", by a total of (10 + 20) once we have matched "jo", and by a total of (10 + 20 + 30) once we have matched "job".

I have written C++ code for this below:

    #include <bits/stdc++.h>

    #define INF (-10000000)  // sentinel marking an empty memo cell
    #define FACTOR 10        // bonus multiplier for consecutive matches

    using namespace std;

    double memo[100][100][100];

    // count = number of consecutive matching characters seen so far
    double Levenshtein(string inputWord, string checkWord, int i, int j, int count){
        if(i == inputWord.length() && j == checkWord.length()) return 0;
        if(i == inputWord.length()) return checkWord.length() - j;
        if(j == checkWord.length()) return inputWord.length() - i;
        if(memo[i][j][count] != INF) return memo[i][j][count];

        double ans1 = 0, ans2 = 0, ans3 = 0, ans = 0;
        if(inputWord[i] == checkWord[j]){
            // match: extend the run and subtract a bonus that grows with its length
            ans1 = Levenshtein(inputWord, checkWord, i+1, j+1, count+1) - (FACTOR*(count+1));
            ans2 = Levenshtein(inputWord, checkWord, i+1, j, 0) + 1;
            ans3 = Levenshtein(inputWord, checkWord, i, j+1, 0) + 1;
            ans = min(ans1, min(ans2, ans3));
        }else{
            // mismatch: skip a character in either word and reset the run
            ans1 = Levenshtein(inputWord, checkWord, i+1, j, 0) + 1;
            ans2 = Levenshtein(inputWord, checkWord, i, j+1, 0) + 1;
            ans = min(ans1, ans2);
        }
        return memo[i][j][count] = ans;
    }

    int main(void) {
        // your code goes here
        string word = "job";
        string wordList[40];
        vector< pair <double, string> > ans;
        for(int i = 0;i < 40;i++){
            cin >> wordList[i];
            // reset the memo table before scoring each word
            for(int j = 0;j < 100;j++) for(int k = 0;k < 100;k++){
                for(int m = 0;m < 100;m++) memo[j][k][m] = INF;
            }
            ans.push_back( make_pair(Levenshtein(word, wordList[i],
                0, 0, 0), wordList[i]) );
        }
        // lowest (best) score first
        sort(ans.begin(), ans.end());
        for(size_t i = 0;i < ans.size();i++){
            cout << ans[i].second << " " << ans[i].first << endl;
        }
        return 0;
    }

Link to demo : http://ideone.com/4UtCX3

Here FACTOR is set to 10. You can experiment with other words and choose an appropriate value.

Also note that the complexity of this modified Levenshtein distance has increased: it is now O(n^3) instead of O(n^2), because we also keep track of a counter for how many consecutive characters have matched.

You can tune the score further by increasing the penalty gradually when a mismatch follows a run of consecutive matches, instead of always adding the fixed cost of 1 as the current code does.

Also, in the above solution you can remove the strings with a score >= 0, as they are not relevant at all. You can also choose a different threshold to make the search more accurate.
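Since the question is about Java, here is a rough port of the same scoring idea as a sketch. It uses int scores and a per-call memo table sized to the two words instead of a fixed global array. The class name ConsecutiveMatchScore and the method score are illustrative and do not come from the C++ code above.

```java
final class ConsecutiveMatchScore {

    static final int FACTOR = 10; // bonus multiplier, same value as in the C++ version

    // Lower scores mean a better match; long runs of consecutive matches lower the score.
    static int score(String input, String check) {
        int maxRun = Math.min(input.length(), check.length());
        Integer[][][] memo = new Integer[input.length() + 1][check.length() + 1][maxRun + 1];
        return score(input, check, 0, 0, 0, memo);
    }

    // run = number of consecutive matching characters immediately before (i, j)
    private static int score(String input, String check, int i, int j, int run, Integer[][][] memo) {
        if (i == input.length()) {
            return check.length() - j; // remaining characters of check cost 1 each
        }
        if (j == check.length()) {
            return input.length() - i; // remaining characters of input cost 1 each
        }
        if (memo[i][j][run] != null) {
            return memo[i][j][run];
        }
        // Skip one character in either word; this costs 1 and resets the run.
        int best = Math.min(score(input, check, i + 1, j, 0, memo) + 1,
                            score(input, check, i, j + 1, 0, memo) + 1);
        if (input.charAt(i) == check.charAt(j)) {
            // Match: extend the run and subtract a bonus that grows with its length.
            best = Math.min(best,
                    score(input, check, i + 1, j + 1, run + 1, memo) - FACTOR * (run + 1));
        }
        memo[i][j][run] = best;
        return best;
    }

    public static void main(String[] args) {
        String[] words = {"job", "jobs", "jobbaser", "jacob"};
        for (String word : words) {
            System.out.println(word + " " + score("job", word));
        }
    }
}
```

With this scoring, a word containing the unbroken substring job, such as jobbaser, gets a lower (better) score than jacob, where the run is interrupted after the j.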

answered Sep 20 '22 at 12:09 by uSeemSurprised