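I have a series of text strings that contain mixed numbers (i.e., a whole part and a fractional part). The problem is that the text is full of human-coded sloppiness, such as inconsistent spacing around the slash and a stray hyphen between the whole part and the fraction.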
I need a regex that can parse these elements so that I can create a proper number out of this mess.
Here's a regex that will handle all of the data I can throw at it:
(\d++(?! */))? *-? *(?:(\d+) */ *(\d+))?.*$
This will put the digits into the following groups: group 1 holds the whole number, group 2 the numerator, and group 3 the denominator.
Also, here's the RegexBuddy explanation for the elements (which helped me immensely when constructing it):
Match the regular expression below and capture its match into backreference number 1 «(\d++(?! */))?»
   Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
   Match a single digit 0..9 «\d++»
      Between one and unlimited times, as many times as possible, without giving back (possessive) «++»
   Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?! */)»
      Match the character “ ” literally « *»
         Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
      Match the character “/” literally «/»
Match the character “ ” literally « *»
   Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “-” literally «-?»
   Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match the character “ ” literally « *»
   Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the regular expression below «(?:(\d+) */ *(\d+))?»
   Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
   Match the regular expression below and capture its match into backreference number 2 «(\d+)»
      Match a single digit 0..9 «\d+»
         Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Match the character “ ” literally « *»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
   Match the character “/” literally «/»
   Match the character “ ” literally « *»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
   Match the regular expression below and capture its match into backreference number 3 «(\d+)»
      Match a single digit 0..9 «\d+»
         Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match any single character that is not a line break character «.*»
   Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»
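To make the grouping concrete, here is a minimal sketch in Perl that applies the regex above and combines the three groups into a number. The helper name mixed_to_number is hypothetical, and two behaviours are my own assumptions rather than part of the original regex: input that captures no digits at all is rejected, and a zero denominator is treated as unparseable. Note that the pattern has no leading anchor, so it expects the number at the very start of the string.

```perl
use strict;
use warnings;
use 5.010;    # possessive quantifiers (\d++) need Perl 5.10 or later

# Hypothetical helper: applies the regex above and combines its groups.
sub mixed_to_number {
    my ($text) = @_;

    # m{...} delimiters avoid escaping "/"; spaces are literal (no /x flag).
    return unless $text =~ m{(\d++(?! */))? *-? *(?:(\d+) */ *(\d+))?.*$};

    my ($whole, $num, $den) = ($1, $2, $3);

    # Every part of the pattern is optional, so it can match without
    # capturing any digits; treat that case as "no number found".
    return unless defined $whole or defined $num;

    # Assumption: a zero denominator is treated as unparseable.
    return if defined $den and $den == 0;

    my $value = defined $whole ? $whole : 0;
    $value += $num / $den if defined $num;
    return $value;
}

for my $input ("10", "1/3", "1 / 3", "10 1/3", "10-1/3", "10 - 1/3") {
    printf "%-10s => %s\n", "'$input'", mixed_to_number($input);
}
```

The hyphen is consumed as a separator and never as a minus sign, so "10-1/3" comes out as ten and a third, not ten minus a third.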
I think it may be easier to tackle the different cases (full mixed, fraction only, number only) separately from each other. For example:
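sub parse_mixed {
    my ($mixed) = @_;
    if ($mixed =~ /^ *(\d+)[- ]+(\d+) *\/ *(\d+)(\D.*)?$/) {
        # full mixed number: whole part, separator, then a fraction
        return $1 + $2 / $3;
    } elsif ($mixed =~ /^ *(\d+) *\/ *(\d+)(\D.*)?$/) {
        # fraction only
        return $1 / $2;
    } elsif ($mixed =~ /^ *(\d+)(\D.*)?$/) {
        # whole number only
        return $1;
    }
    return;    # nothing recognizable
}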
print parse_mixed("10"), "\n";
print parse_mixed("1/3"), "\n";
print parse_mixed("1 / 3"), "\n";
print parse_mixed("10 1/3"), "\n";
print parse_mixed("10-1/3"), "\n";
print parse_mixed("10 - 1/3"), "\n";
If you are using Perl 5.10 or later, this is how I would write it.
m{
  ^
  \s*                  # skip leading spaces
  (?'whole'
    \d++
    (?! \s*[\/] )      # there should not be a slash immediately following a whole number
  )?                   # optional, so that a bare fraction such as 1/3 still matches
  \s*
  (?:                  # the rest should fail or succeed as a group
    -?                 # ignore possible neg sign
    \s*
    (?'numerator' \d+ )
    \s* [\/] \s*
    (?'denominator' \d+ )
  )?
}x
Then you can access the values from the %+ variable like this:
$+{whole};
$+{numerator};
$+{denominator};
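Putting the pieces together, a rough sketch of a complete function built on the named captures might look like the following. The name parse_named is hypothetical. In this sketch the whole group is marked optional (an assumption on my part, so that a bare fraction such as 1/3 matches), input with no digits is rejected, and a zero denominator is treated as unparseable.

```perl
use strict;
use warnings;
use 5.010;    # named captures and possessive quantifiers need Perl 5.10+

# Hypothetical helper built around the named captures.
sub parse_named {
    my ($text) = @_;

    return unless $text =~ m{
        ^ \s*
        (?'whole' \d++ (?! \s* / ) )?    # assumption: optional whole part
        \s*
        (?:                              # fraction matches or fails as a unit
            -? \s*                       # hyphen is only a separator
            (?'numerator'   \d+ ) \s* / \s*
            (?'denominator' \d+ )
        )?
    }x;

    # Groups that did not take part in the match are absent from %+,
    # so these lookups yield undef for them.
    my $whole = $+{whole};
    my $num   = $+{numerator};
    my $den   = $+{denominator};

    return unless defined $whole or defined $num;   # no digits found
    return if defined $den and $den == 0;           # assumption: reject x/0

    my $value = defined $whole ? $whole : 0;
    $value += $num / $den if defined $num;
    return $value;
}

for my $input ("10", "1/3", " 1 / 3", "10 1/3", "10-1/3", "10 - 1/3") {
    printf "%-10s => %s\n", "'$input'", parse_named($input);
}
```

Copying the values out of %+ right after the match matters, because the next successful match will overwrite them.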