I have the following regular expression:
REGEX = /^.+(\d+.+(?=AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY)[A-Z]{2}[, ]+\d{5}(?:-\d{4})?).+/
I have the following string:
str = "fdsfd 8126 E Bowen AVE Bensalem, PA 19020-1642 dfdf"
Notice that my capturing group begins with \d+, which should match one or more digits. Yet this is what I get:
str =~ REGEX
$1
 => "6 E Bowen AVE Bensalem, PA 19020-1642" 
Or
match = str.match(REGEX)
match[1]
=> "6 E Bowen AVE Bensalem, PA 19020-1642"
Why is the capture missing the first three digits (812)?
\d (digit) matches any single digit (same as [0-9]). Its uppercase counterpart \D (non-digit) matches any single character that is not a digit (same as [^0-9]). \s (space) matches any single whitespace character, such as a blank, tab, newline, carriage return, or form feed (same as [ \t\n\r\f]).
Perl's special character classes include the same shorthand: \d matches any digit character and, for ASCII text, is equivalent to [0-9], so the regex /\d/ matches a single digit. The d stands for "digit".
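As a small illustration of these shorthand classes in Ruby, here is a sketch using a made-up sample string (the string and variable names are only assumptions for the demo):

```ruby
# Hypothetical sample taken from the start of the address in the question.
sample = "8126 E Bowen"

digits     = sample.scan(/\d/)   # every single digit, one element each
digit_run  = sample[/\d+/]       # the first run of consecutive digits
non_digits = sample.scan(/\D+/)  # runs of characters that are not digits
spaces     = sample.scan(/\s/)   # every whitespace character

p digits      # ["8", "1", "2", "6"]
p digit_run   # "8126"
p non_digits  # [" E Bowen"]
p spaces      # [" ", " "]
```

Note that \d on its own matches exactly one digit; it is the + quantifier that lets it take a whole run.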
The regex below works properly, as you can see at Regex101:
REGEX = /^.+?(\d+.+(?=AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY)[A-Z]{2}[, ]+\d{5}(?:-\d{4})?).+/
Note the addition of the question mark near the beginning of the regex:
/^.+?(\d+...
    ^ 
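By default, your first .+ is greedy: it consumes as many characters as it can, including the digits 812, while still allowing the rest of the regex to match. Since \d+ only needs a single digit to succeed, the greedy .+ leaves it just the 6. Adding ? after the plus makes that quantifier lazy instead of greedy, so it stops at the earliest position where the rest of the pattern can match.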
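To see the difference side by side, here is a minimal sketch. It assumes a shortened pattern in which the state list is reduced to PA purely for readability; the variable names are mine, not from the question.

```ruby
str = "fdsfd 8126 E Bowen AVE Bensalem, PA 19020-1642 dfdf"

# Assumption: the long state alternation is shortened to PA for this demo.
greedy = /^.+(\d+.+(?=PA)[A-Z]{2}[, ]+\d{5}(?:-\d{4})?).+/
lazy   = /^.+?(\d+.+(?=PA)[A-Z]{2}[, ]+\d{5}(?:-\d{4})?).+/

# Greedy prefix: .+ backtracks only as far as it must, so it keeps "812".
greedy_capture = str.match(greedy)[1]

# Lazy prefix: .+? grows one character at a time and stops before "8126".
lazy_capture = str.match(lazy)[1]

puts greedy_capture  # 6 E Bowen AVE Bensalem, PA 19020-1642
puts lazy_capture    # 8126 E Bowen AVE Bensalem, PA 19020-1642
```

The only change between the two patterns is the ? after the first .+, which is what moves the start of the capture back to the 8.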
An alternative would be to keep the leading part of the pattern from matching digits at all, like this:
/^[^\d]+(\d+...
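[^\d]+ (which can also be written \D+) matches one or more characters that are not digits, so it stops at the first digit and leaves the whole number for the capturing group.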
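A quick sketch of this alternative, again assuming the state list is shortened to PA for readability and using variable names of my own choosing:

```ruby
str = "fdsfd 8126 E Bowen AVE Bensalem, PA 19020-1642 dfdf"

# Assumption: the state alternation is shortened to PA for this demo.
negated   = /^[^\d]+(\d+.+(?=PA)[A-Z]{2}[, ]+\d{5}(?:-\d{4})?).+/
shorthand = /^\D+(\d+.+(?=PA)[A-Z]{2}[, ]+\d{5}(?:-\d{4})?).+/

negated_capture   = str.match(negated)[1]
shorthand_capture = str.match(shorthand)[1]

puts negated_capture    # 8126 E Bowen AVE Bensalem, PA 19020-1642
puts shorthand_capture  # 8126 E Bowen AVE Bensalem, PA 19020-1642

# Caveat: the prefix needs at least one non-digit, so a string that
# starts directly with the house number does not match at all.
no_prefix = "8126 E Bowen AVE Bensalem, PA 19020-1642 dfdf".match(shorthand)
p no_prefix  # nil
```

If the input may start with the address itself, changing the prefix quantifier from + to * (for example ^\D*) would probably be the simplest adjustment.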