I'm trying to create a regex to recognize English numerals, such as one, nineteen, twenty, one hundred and twenty two, et cetera, all the way to the millions. I want to reuse some parts of the regular expression, so the regex is being constructed by parts, like so:
// replace <TAG> with the content of the variable
ONE_DIGIT = (?:one|two|three|four|five|six|seven|eight|nine)
TEEN = (?:ten|eleven|twelve|(?:thir|four|fif|six|seven|eigh|nine)teen)
TWO_DIGITS = (?:(?:twen|thir|for|fif|six|seven|eigh|nine)ty(?:\s+<ONE_DIGIT>)?|<TEEN>)
// HUNDREDS, et cetera
I was wondering if anyone has already done the same (and would like to share), as these regexes are quite long and it's possible that they contain something they shouldn't, or that I'm missing something. Also, I want them to be as efficient as possible, so I'm looking forward to any optimization tips. I'm using the Java regex engine, but any regex flavour is acceptable.
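Since the target is the Java engine, here is a minimal sketch of the by-parts construction, assuming plain string concatenation stands in for the <TAG> substitution. The names UP_TO_99, HUNDREDS, NUMERAL and isNumeral are hypothetical, and the HUNDREDS rule is only a guess at the part the question leaves out.

```java
import java.util.regex.Pattern;

class NumeralRegex {
    // Fragments from the question; concatenation replaces the <TAG> placeholders.
    static final String ONE_DIGIT = "(?:one|two|three|four|five|six|seven|eight|nine)";
    // "fourteen" takes "four", whereas "forty" takes "for".
    static final String TEEN = "(?:ten|eleven|twelve|(?:thir|four|fif|six|seven|eigh|nine)teen)";
    static final String TWO_DIGITS =
        "(?:(?:twen|thir|for|fif|six|seven|eigh|nine)ty(?:\\s+" + ONE_DIGIT + ")?|" + TEEN + ")";

    // Hypothetical: TWO_DIGITS goes first so that find() does not stop at "nine" in "nineteen".
    static final String UP_TO_99 = "(?:" + TWO_DIGITS + "|" + ONE_DIGIT + ")";
    // Hypothetical guess at the elided HUNDREDS part: "one hundred (and) twenty two".
    static final String HUNDREDS =
        "(?:" + ONE_DIGIT + "\\s+hundred(?:\\s+(?:and\\s+)?" + UP_TO_99 + ")?)";

    static final Pattern NUMERAL = Pattern.compile(
        "\\b(?:" + HUNDREDS + "|" + UP_TO_99 + ")\\b", Pattern.CASE_INSENSITIVE);

    // True when the whole input is a numeral covered by the fragments above.
    static boolean isNumeral(String s) {
        return NUMERAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"nineteen", "twenty two", "one hundred and twenty two", "forteen"}) {
            System.out.println(s + " -> " + isNumeral(s));
        }
    }
}
```

Compiling the assembled string once into a static Pattern keeps the per-call cost down; thousands and millions would be layered on in the same way.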
See Perl's Lingua::EN::Words2Nums and Lingua::EN::FindNumber.
In particular, the source code for Lingua::EN::FindNumber contains:
# This is from Lingua::EN::Words2Nums, after being thrown through
# Regex::PreSuf
my $numbers =
qr/((?:b(?:akers?dozen|illi(?:ard|on))|centillion|d(?:ecilli(?:ard|on)|ozen|u(?:o(?:decilli(?:ard|on)|vigintillion)|vigintillion))|e(?:ight(?:een|ieth|[yh])?|leven(?:ty(?:first|one))?|s)|f(?:i(?:ft(?:een|ieth|[yh])|rst|ve)|o(?:rt(?:ieth|y)|ur(?:t(?:ieth|[yh]))?))|g(?:oogol(?:plex)?|ross)|hundred|mi(?:l(?:ion|li(?:ard|on))|nus)|n(?:aught|egative|in(?:et(?:ieth|y)|t(?:een|[yh])|e)|o(?:nilli(?:ard|on)|ught|vem(?:dec|vigint)illion))|o(?:ct(?:illi(?:ard|on)|o(?:dec|vigint)illion)|ne)|qu(?:a(?:drilli(?:ard|on)|ttuor(?:decilli(?:ard|on)|vigintillion))|in(?:decilli(?:ard|on)|tilli(?:ard|on)|vigintillion))|s(?:core|e(?:cond|pt(?:en(?:dec|vigint)illion|illi(?:ard|on))|ven(?:t(?:ieth|y))?|x(?:decillion|tilli(?:ard|on)|vigintillion))|ix(?:t(?:ieth|y))?)|t(?:ee?n|h(?:ir(?:t(?:een|ieth|y)|d)|ousand|ree)|r(?:e(?:decilli(?:ard|on)|vigintillion)|i(?:gintillion|lli(?:ard|on)))|w(?:e(?:l(?:fth|ve)|nt(?:ieth|y))|o)|h)|un(?:decilli(?:ard|on)|vigintillion)|vigintillion|zero|s))/i;
subject to Perl's Artistic License.
You can use Regex::PreSuf to automatically factor out common pre- and suffixes:
#!/usr/bin/perl
use strict;
use warnings;
use Regex::PreSuf;
my %singledigit = (
one => 1,
two => 2,
three => 3,
four => 4,
five => 5,
six => 6,
seven => 7,
eight => 8,
nine => 9,
);
my $singledigit = presuf(keys %singledigit);
print $singledigit, "\n";
my $text = "one two three four five six seven eight nine";
$text =~ s/($singledigit)/$singledigit{$1}/g;
print $text, "\n";
Output:
C:\Temp> cvb
(?:eight|f(?:ive|our)|nine|one|s(?:even|ix)|t(?:hree|wo))
1 2 3 4 5 6 7 8 9
I am afraid it gets harder after this ;-)
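For the Java side, a rough equivalent of the substitution step might look like the sketch below. It assumes a plain alternation of the map keys (no prefix/suffix factoring, so it is not a port of Regex::PreSuf), and the class and method names are made up for illustration. The \b anchors are an addition of mine so that words such as "someone" are left alone.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitWords {
    // Word -> value table, mirroring %singledigit in the Perl script.
    static final Map<String, Integer> SINGLE_DIGIT = new LinkedHashMap<>();
    static {
        String[] words = {"one", "two", "three", "four", "five", "six", "seven", "eight", "nine"};
        for (int i = 0; i < words.length; i++) {
            SINGLE_DIGIT.put(words[i], i + 1);
        }
    }

    // Unfactored alternation of the keys, wrapped in word boundaries (an assumption).
    static final Pattern SINGLE_DIGIT_PATTERN =
        Pattern.compile("\\b(?:" + String.join("|", SINGLE_DIGIT.keySet()) + ")\\b");

    // Replaces every matched digit word with its numeric value.
    static String replaceDigitWords(String text) {
        Matcher m = SINGLE_DIGIT_PATTERN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, String.valueOf(SINGLE_DIGIT.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "1 2 3 4 5 6 7 8 9"
        System.out.println(replaceDigitWords("one two three four five six seven eight nine"));
    }
}
```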
Perl has a number of modules that produce optimized regexes (which mostly only use standard features, so should be usable in Java) using different techniques. You can see examples of Regexp::Assemble, Regexp::List, Regexp::Optimizer, and Regex::PreSuf output in http://groups.google.com/group/perl.perl5.porters/msg/132877aee7542015. Starting in perl 5.10, perl itself usually optimizes lists of |'d exact strings into a trie.
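If you would rather generate the factored pattern in Java than run one of those modules, the trie idea can be sketched by hand. The following is an illustration only, not how any of the Perl modules or perl's internal optimization are implemented: it factors common prefixes (suffixes are left alone), and TrieRegex, add, toRegex and fromWords are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TrieRegex {
    // Children are kept sorted so the generated pattern is deterministic.
    private final Map<Character, TrieRegex> children = new TreeMap<>();
    private boolean terminal; // true when a word ends at this node

    void add(String word) {
        TrieRegex node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieRegex());
        }
        node.terminal = true;
    }

    // Letters and digits are emitted as-is; anything else gets a backslash.
    private static String escape(char c) {
        return Character.isLetterOrDigit(c) ? String.valueOf(c) : "\\" + c;
    }

    // Emits the regex for everything that can follow this node.
    String toRegex() {
        if (children.isEmpty()) {
            return "";
        }
        List<String> alts = new ArrayList<>();
        for (Map.Entry<Character, TrieRegex> e : children.entrySet()) {
            alts.add(escape(e.getKey()) + e.getValue().toRegex());
        }
        if (alts.size() == 1 && !terminal) {
            return alts.get(0); // a single continuation needs no group
        }
        String group = "(?:" + String.join("|", alts) + ")";
        // If a word also ends here, the continuation is optional (e.g. six(?:ty)?).
        return terminal ? group + "?" : group;
    }

    static String fromWords(String... words) {
        TrieRegex root = new TrieRegex();
        for (String w : words) {
            root.add(w);
        }
        return root.toRegex();
    }

    public static void main(String[] args) {
        System.out.println(fromWords(
            "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"));
    }
}
```

Because each branch starts with a distinct character, the engine can reject an alternative after a single comparison instead of rescanning shared prefixes, which is the main saving over a flat alternation.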