I will explain the problem here.
Suppose i am having list of 1000 words. Say it is a dictionary. User will input some word and it will match with exact match if the word is correct or give the closest match. Just like Google search as we enter something and it gives the closest match.
Algorithm that i thought is
Read the word list one by one
split our input word string into characters
take the first word from the list and match character wise
similarly do it for other words in the list
I know this is the long way and it will take lot of time. Do anyone know how to implement better algorithm
Quick and dirty implementation with /usr/share/dict/words
(you still have to do the levensthein distance part and selection)
DISCLAIMER: Binary search code borrowed from http://www.perlmonks.org/?node_id=503154
open(FILE, "<", "/usr/share/dict/words");
my @lines = <FILE>;
my $word = $ARGV[0];
sub BinSearch
{
my ($target, $cmp) = @_;
my @array = @{$_[2]};
my $posmin = 0;
my $posmax = $#array;
return -0.5 if &$cmp (0, \@array, $target) > 0;
return $#array + 0.5 if &$cmp ($#array, \@array, $target) < 0;
while (1)
{
my $mid = int (($posmin + $posmax) / 2);
my $result = &$cmp ($mid, \@array, $target);
if ($result < 0)
{
$posmin = $posmax, next if $mid == $posmin && $posmax != $posmin;
if ($mid == $posmin){
return "Not found, TODO find close match\n";
}
$posmin = $mid;
}
elsif ($result > 0)
{
$posmax = $posmin, next if $mid == $posmax && $posmax != $posmin;
if ($mid == $posmax){
return "Not found, TODO find close match\n";
}
$posmax = $mid;
}
else
{
return "Found: ".@array[$mid];
}
}
}
sub cmpFunc
{
my ($index, $arrayRef, $target) = @_;
my $item = $$arrayRef[$index];
$item =lc($item);
$target =lc($target);
$a = $item cmp $target;
return $a;
}
print BinSearch($word."\n", \&cmpFunc, \@lines)."\n";
Usage (if the script is called find_words.pl
):
perl find_words.pl word
Where word is the word you want to search for.
A common algorithm for this sort of "fuzzy" word search is Levenshtein distance. It doesn't really find similar words but calculates the similarity of words. This similarity score (or Levenshtein distance) can then be used by a sorting or filter function to select similar words.
How the distance is measured is simple: how many characters need to be changed from the target word to the matched word. For example, a distance of 3 means that the difference between the words are 3 edits (not necessarily characters since it also includes the act of adding and removing characters).
The Rosetta Code site has a listing of Levenshtein distance algorithms implemented in various languages including tcl and perl: http://rosettacode.org/wiki/Levenshtein_distance
There is a page on the tcler's wiki that discusses similarity algorithms which includes several implementations of Levenshtein distance: similarity
For perl, there's also a CPAN module that you can use: Text::Levenshtein
So in perl you can simply do:
use Text::Levenshtein;
my %word_distance;
@word_distance{@dictionary} = distance($word,@dictionary);
Then iterate through the word_distance
hash to find the most similar words.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With