Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Need an algorithm to create google like word search

Tags:

perl

tcl

I will explain the problem here.

Suppose i am having list of 1000 words. Say it is a dictionary. User will input some word and it will match with exact match if the word is correct or give the closest match. Just like Google search as we enter something and it gives the closest match.

Algorithm that i thought is

Read the word list one by one
split our input word string into characters
take the first word from the list and match character wise
similarly do it for other words in the list

I know this is the long way and it will take lot of time. Do anyone know how to implement better algorithm

like image 940
Sumit Avatar asked Dec 21 '22 00:12

Sumit


2 Answers

  1. Sort the words in an array
  2. When a word comes in => binary search (log(n)) (we are doing that because if you use a hash table it will be good for direct match but poor for adjacent)
  3. If perfect match return it
  4. Else compute a levensthein distance of the requested word with the adjacent words and their neighbors (to be defined) and add them to a list of return (if they are satisfying)
  5. Return the list of adjacent words selected

Quick and dirty implementation with /usr/share/dict/words (you still have to do the levensthein distance part and selection)

DISCLAIMER: Binary search code borrowed from http://www.perlmonks.org/?node_id=503154

open(FILE, "<", "/usr/share/dict/words");
my @lines = <FILE>;
my $word = $ARGV[0];

sub BinSearch
{
    my ($target, $cmp) = @_;
    my @array = @{$_[2]};

    my $posmin = 0;
    my $posmax = $#array;

    return -0.5 if &$cmp (0, \@array, $target) > 0;
    return $#array + 0.5 if &$cmp ($#array, \@array, $target) < 0;

    while (1)
    {
        my $mid = int (($posmin + $posmax) / 2);
        my $result = &$cmp ($mid, \@array, $target);


        if ($result < 0)
        {
            $posmin = $posmax, next if $mid == $posmin && $posmax != $posmin;
            if ($mid == $posmin){
                return "Not found, TODO find close match\n";
            }
            $posmin = $mid;
        }
        elsif ($result > 0)
        {
            $posmax = $posmin, next if $mid == $posmax && $posmax != $posmin;
            if ($mid == $posmax){
                return "Not found, TODO find close match\n"; 
            }
            $posmax = $mid;
        }
        else
        {
            return "Found: ".@array[$mid];
        }
    }
}
sub cmpFunc
{
    my ($index, $arrayRef, $target) = @_;
    my $item = $$arrayRef[$index];
    $item =lc($item);
    $target =lc($target);
    $a =  $item cmp $target;
    return $a;
}

print BinSearch($word."\n", \&cmpFunc, \@lines)."\n";

Usage (if the script is called find_words.pl):

perl find_words.pl word

Where word is the word you want to search for.

like image 187
lc2817 Avatar answered Jan 08 '23 15:01

lc2817


A common algorithm for this sort of "fuzzy" word search is Levenshtein distance. It doesn't really find similar words but calculates the similarity of words. This similarity score (or Levenshtein distance) can then be used by a sorting or filter function to select similar words.

How the distance is measured is simple: how many characters need to be changed from the target word to the matched word. For example, a distance of 3 means that the difference between the words are 3 edits (not necessarily characters since it also includes the act of adding and removing characters).

The Rosetta Code site has a listing of Levenshtein distance algorithms implemented in various languages including tcl and perl: http://rosettacode.org/wiki/Levenshtein_distance

There is a page on the tcler's wiki that discusses similarity algorithms which includes several implementations of Levenshtein distance: similarity

For perl, there's also a CPAN module that you can use: Text::Levenshtein

So in perl you can simply do:

use Text::Levenshtein;

my %word_distance;
@word_distance{@dictionary} = distance($word,@dictionary);

Then iterate through the word_distance hash to find the most similar words.

like image 36
slebetman Avatar answered Jan 08 '23 16:01

slebetman