Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I generate a list of words from a group of letters using Perl?

I was looking for a module, regex, or anything else that might apply to this problem.

How can I programatically parse the string and create known English &| Spanish words given that I have a dictionary table against which I can check each permutation of the algorithm's randomization for a match?

Given a group of characters: EBLAIDL KDIOIDSI ADHFWB

The program should return: BLADE AID KID KIDS FIDDLE HOLA etc....

I also want to be able to define the minimum & maximum word length as well as the number of syllables

The input length doesn't matter, it must be only letters, and punctuation doesn't matter.

Thanks for any help

EDIT
Letters in the input string can be reused.

For example, if the input is: ABLED then the output may contain: BALL or BLEED

like image 994
CheeseConQueso Avatar asked Jan 18 '23 14:01

CheeseConQueso


2 Answers

You haven't specified, so I'm assuming each letter in the input can only be used once.

[You have since specified letters in the input can be used more than once, but I'm going to leave this post here in case someone finds it useful.]

The key to doing this efficiently is to sort the letters in the words.

abracadabra => AAAAABBCDRR
abroad      => AABDOR
drab        => ABDR

Then it becomes clear that "drab" is in "abracadabra".

abracadabra => AAAAABBCDRR
drab        => A    B  DR

And that "abroad" isn't.

abracadabra => AAAAABBCD RR
abroad      => AA   B  DOR

Let's call the sorted letter the "signature". Word "B" in is in word "A" if you can remove letters from the signature of "A" to get the signature of "B". That's easy to check using a regex pattern.

sig('drab') =~ /^A?A?A?A?A?B?B?C?D?R?R?\z/

Or if if we eliminate needless backtracking for efficiency, we get

sig('drab') =~ /^A?+A?+A?+A?+A?+B?+B?+C?+D?+R?+R?+\z/

Now that we know what pattern we want, it's just a matter of building it.

use strict;
use warnings;
use feature qw( say );

sub sig { join '', sort grep /^\pL\z/, split //, uc $_[0] }

my $key = shift(@ARGV);

my $pat = sig($key);
$pat =~ s/.\K/?+/sg;
my $re = qr/^(?:$pat)\z/s;

my $shortest = 9**9**9;
my $longest  = 0;
my $count    = 0;
while (my $word = <>) {
   chomp($word);
   next if !length($word);  # My dictionary starts with a blank line!! 
   next if sig($word) !~ /$re/;
   say $word;
   ++$count;
   $shortest = length($word) if length($word) < $shortest;
   $longest  = length($word) if length($word) > $longest;
}

say "Words:    $count";
if ($count) {
   say "Shortest: $shortest";
   say "Longest:  $longest";
}

Example:

$ perl script.pl EBLAIDL /usr/share/dict/words
A
Abe
Abel
Al
...
libel
lid
lie
lied

Words:    117
Shortest: 1
Longest:  6
like image 133
ikegami Avatar answered Jan 26 '23 00:01

ikegami


Well, the regexp is fairly easy... Then you just need to iterate through the words in the dictionary. EG, assuming a standard linux:

# perl -n -e 'print if (/^[EBLAIDL]+$/);' /usr/share/dict/words

Will quickly return all the words in that file containing those and only those letters.

A
AA
AAA
AAAA
AAAAAA
AAAL
AAE
AAEE
AAII
AB
...

As you can see, though, you need a dictionary file that is worth having. In particular, /usr/share/dict/words on my Fedora system contains a bunch of words with all As which may or may not be something you want. So pick your dictionary file carefully.

For min a max length, you can quickly get that as well:

$min = 9999;
$max = -1;
while(<>) {
   if (/[EBLAIDL]+$/) {
      print;
  chomp;
      if (length($_) > $max) {
     $max = length($_);
     $maxword = $_;
      }
      if (length($_) < $min) {
     $min = length($_);
     $minword = $_;
      }
   }
}

print "longest: $maxword\n";
print "shortest: $minword\n";

Will produce:

ZI
ZMRI
ZWEI
longest: TANSTAAFL
shortest: A

For breaking words into pieces and counting the syllables is very language specific, as has been mentioned in the comments above.

like image 23
Wes Hardaker Avatar answered Jan 25 '23 23:01

Wes Hardaker