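I am attempting to remove/extract zip codes from a character string. The logic is that I am grabbing things that are exactly 5 consecutive digits, optionally followed by a dash or a space and exactly 4 more digits.
The zip portion of the string could start with a space but might not.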
Here's an MWE and what I've tried. The two attempted regexes are based on this question and this question:
text.var <- c("Mr. Bean bought 2 tickets 2-613-213-4567",
"43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567",
"Rat Race, XX, 12345",
"Ignore phone numbers(613)2134567",
"Grab zips with dashes 12345-6789 or no space before12345-6789",
"Grab zips with spaces 12345 6789 or no space before12345 6789",
"I like 1234567 dogs"
)
pattern1 <- "\\d{5}([- ]*\\d{4})?"
pattern2 <- "[0-9]{5}(-[0-9]{4})?(?!.*[0-9]{5}(-[0-9]{4})?)"
regmatches(text.var, gregexpr(pattern1, text.var, perl = TRUE))
regmatches(text.var, gregexpr(pattern2, text.var, perl = TRUE))
## [[1]]
## character(0)
##
## [[2]]
## character(0)
##
## [[3]]
## [1] "12345"
##
## [[4]]
## [1] "21345"
##
## [[5]]
## [1] "12345-6789"
##
## [[6]]
## [1] "12345"
##
## [[7]]
## [1] "12345"
Desired Output
## [[1]]
## character(0)
##
## [[2]]
## character(0)
##
## [[3]]
## [1] "12345"
##
## [[4]]
## character(0)
##
## [[5]]
## [1] "12345-6789" "12345-6789"
##
## [[6]]
## [1] "12345 6789" "12345 6789"
##
## [[7]]
## character(0)
Note: R's regular expressions are similar to other regex flavors but have R-specific details. This question is specific to R's regex, not a general regex question.
You can use a combination of a negative lookbehind and a word boundary \b here.
regmatches(text.var, gregexpr('(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b', text.var, perl = TRUE))
Explanation:
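The negative lookbehind stops a match from starting in the middle of a longer run of digits. The word boundary asserts that on one side there is a word character and on the other side there is not, so the match cannot end in the middle of a run of digits either.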
(?<! # look behind to see if there is not:
\d # digits (0-9)
) # end of look-behind
\d{5} # digits (0-9) (5 times)
(?: # group, but do not capture (optional):
[ -] # any character of: ' ', '-'
\d{4} # digits (0-9) (4 times)
)? # end of grouping
\b # the boundary between a word character (\w) and not a word character
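To see what each assertion contributes, here is a small sketch (my own illustration, not part of the code above) that runs the pattern on the 7-digit example with one piece removed at a time. The helper `extract_one` and the variable names are made up for this example.

```r
# Hypothetical helper: all matches of `pattern` in a single string
extract_one <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

x <- "I like 1234567 dogs"

# Without the lookbehind, a match may start inside the digit run
no_lookbehind <- extract_one("\\d{5}(?:[ -]\\d{4})?\\b", x)

# Without the word boundary, a match may stop inside the digit run
no_boundary <- extract_one("(?<!\\d)\\d{5}(?:[ -]\\d{4})?", x)

# With both assertions, the 7-digit number is rejected entirely
both <- extract_one("(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b", x)

print(no_lookbehind)  # [1] "34567"
print(no_boundary)    # [1] "12345"
print(both)           # character(0)
```

One caveat with \b: letters and underscores count as word characters too, so a zip immediately followed by a letter (say "12345abc") would not be matched. If that case matters for your data, a negative lookahead such as (?!\d) in place of \b is one possible substitute, though I have not tested it against every string in the question.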
You may consider using the stringi package, which is typically faster.
library(stringi)
stri_extract_all_regex(text.var, '(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b')
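One thing to check if you switch to stringi: by default `stri_extract_all_regex()` marks an element with no match as `NA` rather than returning `character(0)`, so the result will not look exactly like the desired output. Passing `omit_no_match = TRUE` should bring it in line with the `regmatches()` result. A minimal sketch, assuming stringi is installed (the `requireNamespace()` guard, the shortened input, and the variable names are only for illustration):

```r
zip_pattern <- "(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b"
x <- c("Rat Race, XX, 12345", "I like 1234567 dogs")

# Base R: elements without a match come back as character(0)
base_res <- regmatches(x, gregexpr(zip_pattern, x, perl = TRUE))
print(base_res)

if (requireNamespace("stringi", quietly = TRUE)) {
  # Default behaviour: a no-match element is a missing value
  stri_default <- stringi::stri_extract_all_regex(x, zip_pattern)
  print(stri_default)

  # omit_no_match = TRUE: a no-match element is an empty character vector
  stri_res <- stringi::stri_extract_all_regex(x, zip_pattern,
                                              omit_no_match = TRUE)
  print(stri_res)
}
```

If downstream code tests for matches with `length()`, the `NA` default can be easy to miss, since an `NA` element still has length 1.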