
Regular expression to match object dimensions

I'll put it right out there: I'm terrible with regular expressions. I've tried to come up with one to solve my problem but I really don't know much about them. . .

Imagine some sentences along the following lines:

  • Hello blah blah. It's around 11 1/2" x 32".
  • The dimensions are 8 x 10-3/5!
  • Probably somewhere in the region of 22" x 17".
  • The roll is quite large: 42 1/2" x 60 yd.
  • They are all 5.76 by 8 frames.
  • Yeah, maybe it's around 84cm long.
  • I think about 13/19".
  • No, it's probably 86 cm actually.

I want to, as cleanly as possible, extract item dimensions from within these sentences. In a perfect world the regular expression would output the following:

  • 11 1/2" x 32"
  • 8 x 10-3/5
  • 22" x 17"
  • 42 1/2" x 60 yd
  • 5.76 by 8
  • 84cm
  • 13/19"
  • 86 cm

I imagine a world where the following rules apply:

  • The following are valid units: {cm, mm, yd, yards, ", ', feet}, though I'd prefer a solution that considers an arbitrary set of units rather than an explicit solution for the above units.
  • A dimension is always described numerically, may or may not have units following it and may or may not have a fractional or decimal part. A dimension made up of a fractional part on its own is allowed, e.g., 4/5".
  • Fractional parts always have a / separating the numerator and denominator, and one can assume there is no space between the parts (though if someone takes that into account, that's great!).
  • Dimensions may be one-dimensional or two-dimensional, in which case one can assume the following are acceptable for separating two dimensions: {x, by}. If a dimension is only one-dimensional it must have units from the set above, i.e., 22 cm is OK, .333 is not, nor is 4.33 oz.
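The first three rules above can be sketched in Python as follows. This is only an illustration, not one of the posted answers: the names `unit_pattern`, `NUMBER` and `MEASUREMENT` are made up for this example, and the unit list is the one given in the question. It builds the unit alternation from an arbitrary list and matches a single number that is followed by a unit.

```python
import re

# Unit list taken from the question; any other list of strings should work.
UNITS = ["cm", "mm", "yd", "yards", '"', "'", "feet"]

def unit_pattern(units):
    """Build a regex alternation from an arbitrary list of unit strings."""
    # Longest units first so "yards" is tried before "yd"; re.escape guards
    # against units that contain regex metacharacters.
    ordered = sorted(units, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(u) for u in ordered) + ")"

# Mixed number ("11 1/2", "10-3/5"), bare fraction ("13/19"),
# or integer/decimal ("8", "5.76"). Assumes no spaces around the slash.
NUMBER = r"(?:\d+[ -]\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)"

# One measurement: a number, optional whitespace, then a required unit.
# (?!\w) stops a unit such as "cm" from matching the start of a longer word.
MEASUREMENT = re.compile(NUMBER + r"\s*" + unit_pattern(UNITS) + r"(?!\w)")

for text in ["around 84cm long", 'about 13/19".', 'roughly 42 1/2" wide', "just 0.332 oz"]:
    m = MEASUREMENT.search(text)
    print(repr(text), "->", m.group(0) if m else None)
```

Because the unit alternation is generated, supporting another unit should only require adding it to the list. Two-dimensional values and unitless numbers are deliberately left out of this sketch.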

To show you how useless I am with regular expressions (and to show I at least tried!), I got this far. . .

[1-9]+[/ ][x1-9]

Update (2)

You guys are very fast and efficient! I'm going to add a few extra test cases that haven't been covered by the regular expressions below:

  • The last but one test case is 12 yd x.
  • The last test case is 99 cm by.
  • This sentence doesn't have dimensions in it: 342 / 5553 / 222.
  • Three dimensions? 22" x 17" x 12 cm
  • This is a product code: c720 with another number 83 x better.
  • A number on its own 21.
  • A volume shouldn't match 0.332 oz.

These should result in the following (# indicates nothing should match):

  • 12 yd
  • 99 cm
  • #
  • 22" x 17" x 12 cm
  • #
  • #
  • #

I've adapted M42's answer below, to:

\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)(?:\s*x\s*|\s*by\s*)?(?:\d+(?:\.\d+)?[\s*-]*(?:\d+(?:\/\d+)?)?(?:cm|mm|yd|"|'|feet)?)?

While that resolves some of the new test cases, it now fails to match some of the others. It reports:

  • 11 1/2" x 32" PASS
  • (nothing) FAIL
  • 22" x 17" PASS
  • 42 1/2" x 60 yd PASS
  • (nothing) FAIL
  • 84cm PASS
  • 13/19" PASS
  • 86 cm PASS
  • 22" PASS
  • (nothing) FAIL
  • (nothing) FAIL
  • 12 yd x FAIL
  • 99 cm by FAIL
  • 22" x 17" [and also, but separately '12 cm'] FAIL
  • PASS
  • PASS

Edwardr asked Dec 08 '11 16:12


3 Answers

New version, close to the target; 2 tests still fail.

#!/usr/local/bin/perl 
use Modern::Perl;
use Test::More;

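# A dimension with a mandatory unit: number, optional fraction, then the unit.
my $re1 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)/;
# Separator between two dimensions: "x" or "by".
my $re2 = qr/(?:\s*x\s*|\s*by\s*)/;
# Second dimension: same as $re1, but also accepts "frames" as a unit.
my $re3 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet|frames)/;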
my @out = (
'11 1/2" x 32"',
'8 x 10-3/5',
'22" x 17"',
'42 1/2" x 60 yd',
'5.76 by 8 frames',
'84cm',
'13/19"',
'86 cm',
'12 yd',
'99 cm',
'no match',
'22" x 17" x 12 cm',
'no match',
'no match',
'no match',
);
my $i = 0;    # index of the expected result for the current DATA line
while(<DATA>) {
    chomp;
    if (/($re1(?:$re2$re3)?(?:$re2$re1)?)/) {
        ok($1 eq $out[$i], $1 . ' in ' . $_);
    } else {
        ok($out[$i] eq 'no match', ' got "no match" in '.$_);
    }
    $i++;
}
done_testing;


__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.
The last but one test case is 12 yd x.
The last test case is 99 cm by.
This sentence doesn't have dimensions in it: 342 / 5553 / 222.
Three dimensions? 22" x 17" x 12 cm
This is a product code: c720 with another number 83 x better.  
A number on its own 21.
A volume shouldn't match 0.332 oz.

output:

#   Failed test ' got "no match" in The dimensions are 8 x 10-3/5!'
#   at C:\tests\perl\test6.pl line 42.
#   Failed test ' got "no match" in They are all 5.76 by 8 frames.'
#   at C:\tests\perl\test6.pl line 42.
# Looks like you failed 2 tests of 15.
ok 1 - 11 1/2" x 32" in Hello blah blah. It's around 11 1/2" x 32".
not ok 2 -  got "no match" in The dimensions are 8 x 10-3/5!
ok 3 - 22" x 17" in Probably somewhere in the region of 22" x 17".
ok 4 - 42 1/2" x 60 yd in The roll is quite large: 42 1/2" x 60 yd.
not ok 5 -  got "no match" in They are all 5.76 by 8 frames.
ok 6 - 84cm in Yeah, maybe it's around 84cm long.
ok 7 - 13/19" in I think about 13/19".
ok 8 - 86 cm in No, it's probably 86 cm actually.
ok 9 - 12 yd in The last but one test case is 12 yd x.
ok 10 - 99 cm in The last test case is 99 cm by.
ok 11 -  got "no match" in This sentence doesn't have dimensions in it: 342 / 5553 / 222.
ok 12 - 22" x 17" x 12 cm in Three dimensions? 22" x 17" x 12 cm
ok 13 -  got "no match" in This is a product code: c720 with another number 83 x better.  
ok 14 -  got "no match" in A number on its own 21.
ok 15 -  got "no match" in A volume shouldn't match 0.332 oz.
1..15

It seems difficult to match 5.76 by 8 frames but not 0.332 oz: sometimes a number has to match with a unit, and sometimes without one.

I'm sorry, I'm not able to do better.
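One possible way around the with-unit/without-unit conflict described above is to use two branches: units are optional when at least two numbers are joined by x or by, and required for a number on its own. The Python sketch below is only an illustration under that assumption, not the regex from this answer. The names `UNIT`, `NUMBER`, `SEP`, `TERM`, `DIMENSION` and `extract` are made up for this example, and it targets the outputs listed in the question (so 5.76 by 8 rather than 5.76 by 8 frames).

```python
import re

UNITS = ["cm", "mm", "yd", "yards", '"', "'", "feet"]  # unit list from the question

# Unit alternation, longest first; (?!\w) keeps "cm" from matching inside a word.
UNIT = "(?:" + "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True)) + r")(?!\w)"
# Mixed number, bare fraction, or integer/decimal.
NUMBER = r"(?:\d+[ -]\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)"
SEP = r"\s*(?:x|by)\s*"
TERM = NUMBER + r"(?:\s*" + UNIT + ")?"  # a number whose unit is optional

DIMENSION = re.compile(
    r"(?<![\w.])(?:"                       # do not start inside a word or a decimal
    + TERM + "(?:" + SEP + TERM + ")+"     # two or more terms: units are optional
    + "|" + NUMBER + r"\s*" + UNIT         # a single term: the unit is required
    + ")"
)

def extract(text):
    """Return the first dimension found in text, or None if there is none."""
    m = DIMENSION.search(text)
    return m.group(0) if m else None

for sentence in [
    "The dimensions are 8 x 10-3/5!",
    "They are all 5.76 by 8 frames.",
    "The last but one test case is 12 yd x.",
    "A volume shouldn't match 0.332 oz.",
]:
    print(extract(sentence))
```

The lookbehind is what rejects the 720 in c720 and the 332 in 0.332. A dangling separator, as in 12 yd x., makes the first branch fail, so the match falls back to the single-term branch.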

Toto answered Sep 20 '22 22:09


One of many possible solutions (should be nlp compatible as it uses only basic regex syntax):

foundMatch = Regex.IsMatch(SubjectString, @"\d+(?: |cm|\.|""|/)[\d/""x -]*(?:\b(?:by\s*\d+|cm|yd)\b)?");

Will get your results :)

Explanation:

"
\d             # Match a single digit 0..9
   +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?:            # Match the regular expression below
                  # Match either the regular expression below (attempting the next alternative only if this one fails)
      \           # Match the character “ ” literally
   |              # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
      cm          # Match the characters “cm” literally
   |              # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
      \.          # Match the character “.” literally
   |              # Or match regular expression number 4 below (attempting the next alternative only if this one fails)
      ""          # Match the character “""” literally
   |              # Or match regular expression number 5 below (the entire group fails if this one fails to match)
      /           # Match the character “/” literally
)
[\d/""x -]        # Match a single character present in the list below
                  # A single digit 0..9
                  # One of the characters “/""x”
                  # The character “ ”
                  # The character “-”
   *              # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(?:               # Match the regular expression below
   \b             # Assert position at a word boundary
   (?:            # Match the regular expression below
                  # Match either the regular expression below (attempting the next alternative only if this one fails)
         by       # Match the characters “by” literally
         \s       # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
            *     # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
         \d       # Match a single digit 0..9
            +     # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      |           # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
         cm       # Match the characters “cm” literally
      |           # Or match regular expression number 3 below (the entire group fails if this one fails to match)
         yd       # Match the characters “yd” literally
   )
   \b             # Assert position at a word boundary
)?                # Between zero and one times, as many times as possible, giving back as needed (greedy)
"
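Note that Regex.IsMatch only reports whether a match exists; it does not return the matched text. As a rough illustration of extraction, here is the same pattern ported to Python's re module. The names `PATTERN` and `first_dimension` are made up for this sketch. Because the character class includes a space, a match can end in trailing whitespace (for example after 84cm), so the sketch strips it.

```python
import re

# Same pattern as in the answer, with the C# doubled quotes ("") written as a plain ".
PATTERN = re.compile(r'\d+(?: |cm|\.|"|/)[\d/"x -]*(?:\b(?:by\s*\d+|cm|yd)\b)?')

def first_dimension(text):
    """Return the first match with surrounding whitespace removed, or None."""
    m = PATTERN.search(text)
    return m.group(0).strip() if m else None

sentences = [
    "Hello blah blah. It's around 11 1/2\" x 32\".",
    "The dimensions are 8 x 10-3/5!",
    "They are all 5.76 by 8 frames.",
    "Yeah, maybe it's around 84cm long.",
]
for s in sentences:
    print(first_dimension(s))
```

This port has only been checked against the eight sentences from the original question. The pattern does not require a unit, so it may also match plain number sequences such as those in the later update.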
FailedDev answered Sep 17 '22 22:09


This is the best I can do with a regular expression in Perl. Try to adapt it to your regex flavour:

\d.*\d(?:\s+\S+|\S+)

Explanation:

\d        # One digit.
.*        # Any number of characters.
\d        # One digit. Together, these three match everything between the first and the last digit.
\s+\S+    # Non-space characters after some whitespace. It tries to match any unit like 'cm' or 'yd'.
|         # Or. Select one of the two expressions between parentheses.
\S+       # One or more non-space characters. It tries to match double quotes, or units attached to the 
          # last number.

My test:

Content of script.pl:

use warnings;
use strict;

while ( <DATA> ) {
        print qq[$1\n] if m/(\d.*\d(\s+\S+|\S+))/
}

__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.

Running the script:

perl script.pl

Result:

11 1/2" x 32".
8 x 10-3/5!
22" x 17".
42 1/2" x 60 yd.
5.76 by 8 frames.
84cm
13/19".
86 cm
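As the result shows, the match keeps any sentence punctuation that directly follows the dimension (for example the final . or !). A minimal Python sketch of one way to trim it is given below. The names `PATTERN` and `extract_trimmed` are made up for this example, and it assumes a dimension never legitimately ends in one of the stripped characters.

```python
import re

# Same pattern as above: everything from the first digit to the last digit,
# plus one following run of non-space characters (optionally after whitespace).
PATTERN = re.compile(r"\d.*\d(?:\s+\S+|\S+)")

def extract_trimmed(line):
    """Return the match without trailing sentence punctuation, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    # Assumption: a dimension never ends in ". , ; : ! ?", so these can be dropped.
    return m.group(0).rstrip(".,;:!?")

for line in [
    "Hello blah blah. It's around 11 1/2\" x 32\".",
    "The dimensions are 8 x 10-3/5!",
    "They are all 5.76 by 8 frames.",
]:
    print(extract_trimmed(line))
```

The word after the last number is still kept, so 5.76 by 8 frames is returned with frames included, as in the result above.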
Birei answered Sep 18 '22 22:09