Scalable Regex for English Numerals

Question

I'm trying to create a regex to recognize English numerals, such as one, nineteen, twenty, one hundred and twenty two, et cetera, all the way to the millions. I want to reuse some parts of the regular expression, so the regex is being constructed by parts, like so:

// replace <TAG> with the content of the variable
ONE_DIGIT = (?:one|two|three|four|five|six|seven|eight|nine)
TEEN = (?:ten|eleven|twelve|(?:thir|four|fif|six|seven|eigh|nine)teen)
TWO_DIGITS = (?:(?:twen|thir|for|fif|six|seven|eigh|nine)ty(?:\s+<ONE_DIGIT>)?|<TEEN>)
// HUNDREDS, et cetera
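
In Java, this composition might be sketched as follows, with plain string concatenation standing in for the <TAG> substitution. The class name NumeralParts and the helper isTwoDigits are made up for illustration. Note that TEEN needs "four" (fourteen), while the tens need "for" (forty).

```java
import java.util.regex.Pattern;

public class NumeralParts {
    // Building blocks; each one is a non-capturing group so it can be
    // embedded in a larger expression without changing group numbering.
    static final String ONE_DIGIT =
        "(?:one|two|three|four|five|six|seven|eight|nine)";
    // "four" + "teen" gives "fourteen" (the tens below use "for" + "ty").
    static final String TEEN =
        "(?:ten|eleven|twelve|(?:thir|four|fif|six|seven|eigh|nine)teen)";
    // Concatenation plays the role of the <ONE_DIGIT> and <TEEN> tags.
    static final String TWO_DIGITS =
        "(?:(?:twen|thir|for|fif|six|seven|eigh|nine)ty(?:\\s+" + ONE_DIGIT + ")?|"
        + TEEN + ")";

    // Compile once and reuse; compiling on every call would be wasteful.
    static final Pattern TWO_DIGITS_PATTERN =
        Pattern.compile(TWO_DIGITS, Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the whole input is a numeral from ten to ninety nine.
    static boolean isTwoDigits(String s) {
        return TWO_DIGITS_PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isTwoDigits("twenty two"));
        System.out.println(isTwoDigits("fourteen"));
        System.out.println(isTwoDigits("one"));
    }
}
```

As written, TWO_DIGITS does not accept a bare ONE_DIGIT, so "one" is rejected; a HUNDREDS part would have to allow for that separately.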

I was wondering if anyone has already done the same (and would like to share), as these regexes are quite long and may well contain something they shouldn't, or be missing something. I also want them to be as efficient as possible, so I'm looking forward to any optimization tips. I'm using the Java regex engine, but any regex flavour is acceptable.

Sinan Ünür · Accepted Answer

See Perl's Lingua::EN::Words2Nums and Lingua::EN::FindNumber.

In particular, the source code for Lingua::EN::FindNumber contains:

# This is from Lingua::EN::Words2Nums, after being thrown through
# Regex::PreSuf
my $numbers =
    qr/((?:b(?:akers?dozen|illi(?:ard|on))|centillion|d(?:ecilli(?:ard|on)|ozen|u(?:o(?:decilli(?:ard|on)|vigintillion)|vigintillion))|e(?:ight(?:een|ieth|[yh])?|leven(?:ty(?:first|one))?|s)|f(?:i(?:ft(?:een|ieth|[yh])|rst|ve)|o(?:rt(?:ieth|y)|ur(?:t(?:ieth|[yh]))?))|g(?:oogol(?:plex)?|ross)|hundred|mi(?:l(?:ion|li(?:ard|on))|nus)|n(?:aught|egative|in(?:et(?:ieth|y)|t(?:een|[yh])|e)|o(?:nilli(?:ard|on)|ught|vem(?:dec|vigint)illion))|o(?:ct(?:illi(?:ard|on)|o(?:dec|vigint)illion)|ne)|qu(?:a(?:drilli(?:ard|on)|ttuor(?:decilli(?:ard|on)|vigintillion))|in(?:decilli(?:ard|on)|tilli(?:ard|on)|vigintillion))|s(?:core|e(?:cond|pt(?:en(?:dec|vigint)illion|illi(?:ard|on))|ven(?:t(?:ieth|y))?|x(?:decillion|tilli(?:ard|on)|vigintillion))|ix(?:t(?:ieth|y))?)|t(?:ee?n|h(?:ir(?:t(?:een|ieth|y)|d)|ousand|ree)|r(?:e(?:decilli(?:ard|on)|vigintillion)|i(?:gintillion|lli(?:ard|on)))|w(?:e(?:l(?:fth|ve)|nt(?:ieth|y))|o)|h)|un(?:decilli(?:ard|on)|vigintillion)|vigintillion|zero|s))/i;

subject to Perl's Artistic License.

You can use Regex::PreSuf to automatically factor out common pre- and suffixes:

#!/usr/bin/perl

use strict;
use warnings;

use Regex::PreSuf;

my %singledigit = (
    one    => 1,
    two    => 2,
    three  => 3,
    four   => 4,
    five   => 5,
    six    => 6,
    seven  => 7,
    eight  => 8,
    nine   => 9,
);

my $singledigit = presuf(keys %singledigit);

print $singledigit, "\n";

my $text = "one two three four five six seven eight nine";

$text =~ s/($singledigit)/$singledigit{$1}/g;

print $text, "\n";

Output:

C:\Temp> cvb
(?:eight|f(?:ive|our)|nine|one|s(?:even|ix)|t(?:hree|wo))
1 2 3 4 5 6 7 8 9

I am afraid it gets harder after this ;-)
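
Since the question targets the Java engine, here is a rough Java counterpart of the substitution above. It is only a sketch: the class name SingleDigitReplace and the method replaceDigits are invented for illustration, and it assumes Java 9 or later for Map.of and for Matcher.replaceAll with a function argument. The pattern is the factored string printed by the Perl script.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SingleDigitReplace {
    // Same word-to-value table as %singledigit in the Perl example.
    static final Map<String, Integer> SINGLE_DIGIT = Map.of(
        "one", 1, "two", 2, "three", 3, "four", 4, "five", 5,
        "six", 6, "seven", 7, "eight", 8, "nine", 9);

    // The factored alternation, copied from the Regex::PreSuf output.
    static final Pattern SINGLE_DIGIT_PATTERN = Pattern.compile(
        "(?:eight|f(?:ive|our)|nine|one|s(?:even|ix)|t(?:hree|wo))");

    // Replace every matched word with its numeric value (Java 9+).
    static String replaceDigits(String text) {
        Matcher m = SINGLE_DIGIT_PATTERN.matcher(text);
        return m.replaceAll(r -> String.valueOf(SINGLE_DIGIT.get(r.group())));
    }

    public static void main(String[] args) {
        System.out.println(replaceDigits("one two three four five six seven eight nine"));
    }
}
```

Like the Perl version, this has no word boundaries, so a word such as "someone" would also be rewritten; wrapping the pattern in \b...\b is one way to avoid that.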

ysth · Answer

Perl has a number of modules that produce optimized regexes (which mostly only use standard features, so should be usable in Java) using different techniques. You can see examples of Regexp::Assemble, Regexp::List, Regexp::Optimizer, and Regex::PreSuf output in http://groups.google.com/group/perl.perl5.porters/msg/132877aee7542015. Starting in perl 5.10, perl itself usually optimizes lists of |'d exact strings into a trie.
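
To give an idea of what such prefix factoring involves, here is a minimal Java sketch that inserts the words into a trie and prints the trie back out as a regex. It is not how any of the modules above (or perl itself) is implemented: it factors common prefixes only, not suffixes, and it assumes the words consist solely of letters, so nothing is escaped. The names TrieRegex, Node, build and emit are made up for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

public class TrieRegex {
    // One trie node: children sorted by character, plus an end-of-word flag.
    static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean terminal;
    }

    // Build a prefix-factored alternation from plain words (letters only, assumed).
    static String build(List<String> words) {
        Node root = new Node();
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.children.computeIfAbsent(c, k -> new Node());
            }
            n.terminal = true;
        }
        return emit(root);
    }

    // Render everything below a node as a regex fragment.
    static String emit(Node n) {
        if (n.children.isEmpty()) {
            return "";
        }
        StringJoiner alts = new StringJoiner("|");
        for (Map.Entry<Character, Node> e : n.children.entrySet()) {
            alts.add(e.getKey() + emit(e.getValue()));
        }
        // A single continuation that is not itself a word end needs no group.
        if (n.children.size() == 1 && !n.terminal) {
            return alts.toString();
        }
        String group = "(?:" + alts + ")";
        // If a word ends here, whatever follows is optional.
        return n.terminal ? group + "?" : group;
    }

    public static void main(String[] args) {
        System.out.println(build(List.of(
            "one", "two", "three", "four", "five", "six", "seven", "eight", "nine")));
        System.out.println(build(List.of("six", "sixty", "sixteen")));
    }
}
```

For the nine digit words, no suffix sharing is possible, so this sketch yields the same string as the Regex::PreSuf output in the accepted answer. For longer lists the Perl modules can do better because they also merge suffixes.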


Tags: java, regex, perl

Asked by João Silva
