I was looking for a module, regex, or anything else that might apply to this problem.
How can I programatically parse the string and create known English &| Spanish words given that I have a dictionary table against which I can check each permutation of the algorithm's randomization for a match?
Given a group of characters: EBLAIDL KDIOIDSI ADHFWB
The program should return: BLADE
AID
KID
KIDS
FIDDLE
HOLA
etc....
I also want to be able to define the minimum & maximum word length as well as the number of syllables
The input length doesn't matter, it must be only letters, and punctuation doesn't matter.
Thanks for any help
EDIT
Letters in the input string can be reused.
For example, if the input is: ABLED
then the output may contain: BALL
or BLEED
You haven't specified, so I'm assuming each letter in the input can only be used once.
[You have since specified letters in the input can be used more than once, but I'm going to leave this post here in case someone finds it useful.]
The key to doing this efficiently is to sort the letters in the words.
abracadabra => AAAAABBCDRR
abroad => AABDOR
drab => ABDR
Then it becomes clear that "drab" is in "abracadabra".
abracadabra => AAAAABBCDRR
drab => A B DR
And that "abroad" isn't.
abracadabra => AAAAABBCD RR
abroad => AA B DOR
Let's call the sorted letter the "signature". Word "B" in is in word "A" if you can remove letters from the signature of "A" to get the signature of "B". That's easy to check using a regex pattern.
sig('drab') =~ /^A?A?A?A?A?B?B?C?D?R?R?\z/
Or if if we eliminate needless backtracking for efficiency, we get
sig('drab') =~ /^A?+A?+A?+A?+A?+B?+B?+C?+D?+R?+R?+\z/
Now that we know what pattern we want, it's just a matter of building it.
use strict;
use warnings;
use feature qw( say );
sub sig { join '', sort grep /^\pL\z/, split //, uc $_[0] }
my $key = shift(@ARGV);
my $pat = sig($key);
$pat =~ s/.\K/?+/sg;
my $re = qr/^(?:$pat)\z/s;
my $shortest = 9**9**9;
my $longest = 0;
my $count = 0;
while (my $word = <>) {
chomp($word);
next if !length($word); # My dictionary starts with a blank line!!
next if sig($word) !~ /$re/;
say $word;
++$count;
$shortest = length($word) if length($word) < $shortest;
$longest = length($word) if length($word) > $longest;
}
say "Words: $count";
if ($count) {
say "Shortest: $shortest";
say "Longest: $longest";
}
Example:
$ perl script.pl EBLAIDL /usr/share/dict/words
A
Abe
Abel
Al
...
libel
lid
lie
lied
Words: 117
Shortest: 1
Longest: 6
Well, the regexp is fairly easy... Then you just need to iterate through the words in the dictionary. EG, assuming a standard linux:
# perl -n -e 'print if (/^[EBLAIDL]+$/);' /usr/share/dict/words
Will quickly return all the words in that file containing those and only those letters.
A
AA
AAA
AAAA
AAAAAA
AAAL
AAE
AAEE
AAII
AB
...
As you can see, though, you need a dictionary file that is worth having. In particular, /usr/share/dict/words on my Fedora system contains a bunch of words with all As which may or may not be something you want. So pick your dictionary file carefully.
For min a max length, you can quickly get that as well:
$min = 9999;
$max = -1;
while(<>) {
if (/[EBLAIDL]+$/) {
print;
chomp;
if (length($_) > $max) {
$max = length($_);
$maxword = $_;
}
if (length($_) < $min) {
$min = length($_);
$minword = $_;
}
}
}
print "longest: $maxword\n";
print "shortest: $minword\n";
Will produce:
ZI
ZMRI
ZWEI
longest: TANSTAAFL
shortest: A
For breaking words into pieces and counting the syllables is very language specific, as has been mentioned in the comments above.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With