I'll put it right out there: I'm terrible with regular expressions. I've tried to come up with one to solve my problem but I really don't know much about them. . .
Imagine some sentences along the following lines:
- Hello blah blah. It's around 11 1/2" x 32".
- The dimensions are 8 x 10-3/5!
- Probably somewhere in the region of 22" x 17".
- The roll is quite large: 42 1/2" x 60 yd.
- They are all 5.76 by 8 frames.
- Yeah, maybe it's around 84cm long.
- I think about 13/19".
- No, it's probably 86 cm actually.
I want to, as cleanly as possible, extract item dimensions from within these sentences. In a perfect world the regular expression would output the following:
- 11 1/2" x 32"
- 8 x 10-3/5
- 22" x 17"
- 42 1/2" x 60 yd
- 5.76 by 8
- 84cm
- 13/19"
- 86 cm
I imagine a world where the following rules apply:
- A valid unit is one of {cm, mm, yd, yards, ", ', feet}, though I'd prefer a solution that considers an arbitrary set of units rather than an explicit solution for the above units.
- A dimension may consist of a fraction on its own, e.g. 4/5".
- Fractions use / to separate the numerator and denominator, and one can assume there is no space between the parts (though if someone takes that into account that's great!).
- Two dimensions are separated by one of {x, by}. If a dimension is only one-dimensional it must have units from the set above, i.e., 22 cm is OK, .333 is not, nor is 4.33 oz.
To show you how useless I am with regular expressions (and to show I at least tried!), I got this far. . .
[1-9]+[/ ][x1-9]
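The preference above for an arbitrary set of units, rather than a hard-coded one, can be handled by building the unit alternation from a list. A minimal Python sketch follows; Python is used only for illustration. The names `UNITS`, `unit_pattern`, `NUMBER` and `MEASURE` are hypothetical, and the trailing lookahead that rejects a unit running into another letter is an added assumption, not one of the stated rules.

```python
import re

# Unit list copied from the question; swap in any other set of strings.
UNITS = ["cm", "mm", "yd", "yards", '"', "'", "feet"]


def unit_pattern(units):
    """Build a non-capturing alternation that matches any one of `units`."""
    # Longest units first, so a short unit cannot shadow a longer one
    # that starts with the same characters (for example "m" and "mm").
    ordered = sorted(units, key=len, reverse=True)
    alternation = "|".join(re.escape(u) for u in ordered)
    # Assumption: a unit must not run straight into another letter ("5 cmx").
    return "(?:" + alternation + ")(?![A-Za-z])"


# A bare fraction (13/19), or an integer/decimal optionally followed by
# a fraction (11 1/2, 10-3/5).
NUMBER = r"(?:\d+/\d+|\d+(?:\.\d+)?(?:[ -]\d+/\d+)?)"

# One measurement: a number, optional whitespace, then a unit.
MEASURE = re.compile(NUMBER + r"\s*" + unit_pattern(UNITS))

for text in ["Yeah, maybe it's around 84cm long.",
             "A volume shouldn't match 0.332 oz."]:
    m = MEASURE.search(text)
    print(m.group(0) if m else None)
```

This only covers a single measurement with a unit; pairs joined by x or by need an extra layer on top.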
Update (2)
You guys are very fast and efficient! I'm going to add a few extra test cases that haven't been covered by the regular expressions below:
- The last but one test case is 12 yd x.
- The last test case is 99 cm by.
- This sentence doesn't have dimensions in it: 342 / 5553 / 222.
- Three dimensions? 22" x 17" x 12 cm
- This is a product code: c720 with another number 83 x better.
- A number on its own 21.
- A volume shouldn't match 0.332 oz.
These should result in the following (# indicates nothing should match):
- 12 yd
- 99 cm
- #
- 22" x 17" x 12 cm
- #
- #
- #
I've adapted M42's answer below, to:
\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)(?:\s*x\s*|\s*by\s*)?(?:\d+(?:\.\d+)?[\s*-]*(?:\d+(?:\/\d+)?)?(?:cm|mm|yd|"|'|feet)?)?
But while that resolves some of the new test cases, it now fails on some of the others. It reports:
- 11 1/2" x 32" PASS
- (nothing) FAIL
- 22" x 17" PASS
- 42 1/2" x 60 yd PASS
- (nothing) FAIL
- 84cm PASS
- 13/19" PASS
- 86 cm PASS
- 22" PASS
- (nothing) FAIL
- (nothing) FAIL
- 12 yd x FAIL
- 99 cm by FAIL
- 22" x 17" [and also, but separately '12 cm'] FAIL
- (nothing) PASS
- (nothing) PASS
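The two kinds of failure in this report can be reproduced in isolation. The sketch below is in Python rather than Perl purely for illustration, and the names `FIRST`, `SEP`, `SECOND`, `ADAPTED`, `REGROUPED` and `first_match` are hypothetical. It shows that the adapted pattern requires a unit on the first number, so unit-less pairs never match, and that its separator is optional on its own, so a trailing x or by is captured. Tying the separator to a following measurement is one possible way to address the second problem; it does not address the first.

```python
import re

UNIT = r"""(?:cm|mm|yd|"|'|feet)"""
# First measurement of the adapted pattern: the unit is mandatory.
FIRST = r"\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:/\d+)?" + UNIT
SEP = r"(?:\s*x\s*|\s*by\s*)"
# Second measurement of the adapted pattern: the unit is optional.
SECOND = r"\d+(?:\.\d+)?[\s*-]*(?:\d+(?:/\d+)?)?" + UNIT + "?"

# The adapted pattern: separator and second measurement are each optional,
# independently of one another.
ADAPTED = re.compile(FIRST + SEP + "?" + "(?:" + SECOND + ")?")

# Regrouped: a separator is consumed only when a measurement follows it,
# and the pair may repeat to cover three or more dimensions.
REGROUPED = re.compile(FIRST + "(?:" + SEP + FIRST + ")*")


def first_match(pattern, text):
    """Return the first match of `pattern` in `text`, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


print(first_match(ADAPTED, "The dimensions are 8 x 10-3/5!"))            # None
print(first_match(ADAPTED, "The last but one test case is 12 yd x."))    # 12 yd x
print(first_match(REGROUPED, "The last but one test case is 12 yd x."))  # 12 yd
```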
New version, close to the target: 2 tests still fail.
#!/usr/local/bin/perl
use Modern::Perl;
use Test::More;
my $re1 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)/;
my $re2 = qr/(?:\s*x\s*|\s*by\s*)/;
my $re3 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet|frames)/;
my @out = (
'11 1/2" x 32"',
'8 x 10-3/5',
'22" x 17"',
'42 1/2" x 60 yd',
'5.76 by 8 frames',
'84cm',
'13/19"',
'86 cm',
'12 yd',
'99 cm',
'no match',
'22" x 17" x 12 cm',
'no match',
'no match',
'no match',
);
my $i = 0;
my $xx = '22" x 17"';
while (<DATA>) {
    chomp;
    if (/($re1(?:$re2$re3)?(?:$re2$re1)?)/) {
        ok($1 eq $out[$i], $1 . ' in ' . $_);
    } else {
        ok($out[$i] eq 'no match', ' got "no match" in ' . $_);
    }
    $i++;
}
done_testing;
__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.
The last but one test case is 12 yd x.
The last test case is 99 cm by.
This sentence doesn't have dimensions in it: 342 / 5553 / 222.
Three dimensions? 22" x 17" x 12 cm
This is a product code: c720 with another number 83 x better.
A number on its own 21.
A volume shouldn't match 0.332 oz.
output:
# Failed test ' got "no match" in The dimensions are 8 x 10-3/5!'
# at C:\tests\perl\test6.pl line 42.
# Failed test ' got "no match" in They are all 5.76 by 8 frames.'
# at C:\tests\perl\test6.pl line 42.
# Looks like you failed 2 tests of 15.
ok 1 - 11 1/2" x 32" in Hello blah blah. It's around 11 1/2" x 32".
not ok 2 - got "no match" in The dimensions are 8 x 10-3/5!
ok 3 - 22" x 17" in Probably somewhere in the region of 22" x 17".
ok 4 - 42 1/2" x 60 yd in The roll is quite large: 42 1/2" x 60 yd.
not ok 5 - got "no match" in They are all 5.76 by 8 frames.
ok 6 - 84cm in Yeah, maybe it's around 84cm long.
ok 7 - 13/19" in I think about 13/19".
ok 8 - 86 cm in No, it's probably 86 cm actually.
ok 9 - 12 yd in The last but one test case is 12 yd x.
ok 10 - 99 cm in The last test case is 99 cm by.
ok 11 - got "no match" in This sentence doesn't have dimensions in it: 342 / 5553 / 222.
ok 12 - 22" x 17" x 12 cm in Three dimensions? 22" x 17" x 12 cm
ok 13 - got "no match" in This is a product code: c720 with another number 83 x better.
ok 14 - got "no match" in A number on its own 21.
ok 15 - got "no match" in A volume shouldn't match 0.332 oz.
1..15
It seems difficult to match 5.76 by 8 frames but not 0.332 oz: sometimes you have to match numbers with a unit, and sometimes numbers without one.
I'm sorry, I'm not able to do better.
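One way to reconcile the two cases is to make the unit optional only when at least two numbers are joined by a separator, and mandatory for a number standing alone. The following is a Python sketch of that idea, not a drop-in fix for the Perl script above. `UNITS`, `UNIT`, `NUMBER`, `DIM`, `SEP`, `PATTERN` and `extract_dimension` are hypothetical names, and the lookahead after the unit is an added assumption.

```python
import re

UNITS = ["cm", "mm", "yd", "yards", '"', "'", "feet"]

# Any listed unit, not immediately followed by another letter (assumption).
UNIT = "(?:" + "|".join(re.escape(u) for u in UNITS) + ")(?![A-Za-z])"

# A bare fraction (13/19), or an integer/decimal optionally followed by
# a fraction (11 1/2, 10-3/5).
NUMBER = r"(?:\d+/\d+|\d+(?:\.\d+)?(?:[ -]\d+/\d+)?)"

# A number whose unit is optional; only valid as part of a pair or triple.
DIM = NUMBER + r"(?:\s*" + UNIT + ")?"
SEP = r"\s*(?:x|by)\s*"

PATTERN = re.compile(
    DIM + "(?:" + SEP + DIM + ")+"   # two or more dimensions, units optional
    + "|" + NUMBER + r"\s*" + UNIT   # a single dimension, unit required
)


def extract_dimension(sentence):
    """Return the first dimension found in `sentence`, or None."""
    m = PATTERN.search(sentence)
    return m.group(0) if m else None


print(extract_dimension("They are all 5.76 by 8 frames."))      # 5.76 by 8
print(extract_dimension("A volume shouldn't match 0.332 oz."))  # None
```

Note that this returns 5.76 by 8 for the frames sentence, as the question asks, whereas the expected list in the script above uses 5.76 by 8 frames.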
One of many possible solutions (should be nlp compatible as it uses only basic regex syntax):
foundMatch = Regex.IsMatch(SubjectString, @"\d+(?: |cm|\.|""|/)[\d/""x -]*(?:\b(?:by\s*\d+|cm|yd)\b)?");
Will get your results :)
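For comparison, a Python port of this pattern can return the matched text instead of a boolean (`Regex.IsMatch` only reports whether a match exists). This is a sketch: the doubled quotes of the C# verbatim string become a single `"` here, and `PORTED` and `first_match` are hypothetical names. On a quick check it handles the original sentences, but it also accepts some of the later negative cases, such as a bare number followed by a full stop.

```python
import re

# Same pattern as above; "" in the C# verbatim string is a single " here.
PORTED = re.compile(r'\d+(?: |cm|\.|"|/)[\d/"x -]*(?:\b(?:by\s*\d+|cm|yd)\b)?')


def first_match(text):
    """Return the first match of the ported pattern in `text`, or None."""
    m = PORTED.search(text)
    return m.group(0) if m else None


print(first_match("The dimensions are 8 x 10-3/5!"))  # 8 x 10-3/5
print(first_match("They are all 5.76 by 8 frames."))  # 5.76 by 8
print(first_match("A number on its own 21."))         # 21. (a false positive)
```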
Explanation:
"
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
\ # Match the character “ ” literally
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
cm # Match the characters “cm” literally
| # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
\. # Match the character “.” literally
| # Or match regular expression number 4 below (attempting the next alternative only if this one fails)
"" # Match the character “""” literally
| # Or match regular expression number 5 below (the entire group fails if this one fails to match)
/ # Match the character “/” literally
)
[\d/""x -] # Match a single character present in the list below
# A single digit 0..9
# One of the characters “/""x”
# The character “ ”
# The character “-”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(?: # Match the regular expression below
\b # Assert position at a word boundary
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
by # Match the characters “by” literally
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
cm # Match the characters “cm” literally
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
yd # Match the characters “yd” literally
)
\b # Assert position at a word boundary
)? # Between zero and one times, as many times as possible, giving back as needed (greedy)
"
This is all I can get with a regular expression in Perl. Try to adapt it to your regex flavour:
\d.*\d(?:\s+\S+|\S+)
Explanation:
\d # One digit.
.* # Any number of characters.
\d # One digit. Together, these three parts match everything from the first digit to the last digit.
\s+\S+ # Some whitespace followed by non-space characters. It tries to match a unit such as 'cm' or 'yd'.
| # Or. Select one of two expressions between parentheses.
\S+ # Any number of non-space characters. It tries to match double-quotes, or units joined to the
# last number.
My test:
Content of script.pl:
use warnings;
use strict;
while ( <DATA> ) {
print qq[$1\n] if m/(\d.*\d(\s+\S+|\S+))/
}
__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.
Running the script:
perl script.pl
Result:
11 1/2" x 32".
8 x 10-3/5!
22" x 17".
42 1/2" x 60 yd.
5.76 by 8 frames.
84cm
13/19".
86 cm
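The results above keep the sentence punctuation that follows the last number or unit. If that is unwanted, one option is to trim it after matching. Here is a Python sketch of the same pattern with that extra step; `LOOSE` and `extract_loose` are hypothetical names, and the set of punctuation to trim is an assumption. Note that the pattern needs at least two digits and accepts any text between them, so it does not reject sentences without dimensions.

```python
import re

# From the first digit to the last digit, plus whatever token follows it.
LOOSE = re.compile(r"\d.*\d(?:\s+\S+|\S+)")


def extract_loose(sentence, trim=".!?,;:"):
    """Return the loose match with trailing punctuation removed, or None."""
    m = LOOSE.search(sentence)
    return m.group(0).rstrip(trim) if m else None


print(extract_loose('Hello blah blah. It\'s around 11 1/2" x 32".'))  # 11 1/2" x 32"
print(extract_loose("The dimensions are 8 x 10-3/5!"))                # 8 x 10-3/5
print(extract_loose("They are all 5.76 by 8 frames."))                # 5.76 by 8 frames
```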