Remove US zip codes from a string: R regex

Question

I am trying to extract (or remove) US zip codes from a character vector. A match must be one of the following:

exactly 5 consecutive digits, OR
exactly 5 consecutive digits followed by a dash and then exactly 4 consecutive digits, OR
exactly 5 consecutive digits followed by a space and then exactly 4 consecutive digits

The zip code may or may not be preceded by a space.

Here is a minimal working example (MWE) and what I have tried. The two attempted regexes are based on this question and this question:

text.var <- c("Mr. Bean bought 2 tickets 2-613-213-4567",
  "43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567", 
  "Rat Race, XX, 12345",
  "Ignore phone numbers(613)2134567",
  "Grab zips with dashes 12345-6789 or no space before12345-6789",  
  "Grab zips with spaces 12345 6789 or no space before12345 6789",
  "I like 1234567 dogs"
)

pattern1 <- "\\d{5}([- ]*\\d{4})?"
pattern2 <- "[0-9]{5}(-[0-9]{4})?(?!.*[0-9]{5}(-[0-9]{4})?)"


regmatches(text.var, gregexpr(pattern1, text.var, perl = TRUE)) 
regmatches(text.var, gregexpr(pattern2, text.var, perl = TRUE)) 

## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "12345"
## 
## [[4]]
## [1] "21345"
## 
## [[5]]
## [1] "12345-6789"
## 
## [[6]]
## [1] "12345"
## 
## [[7]]
## [1] "12345"

Desired Output

## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "12345"
## 
## [[4]]
## character(0)
## 
## [[5]]
## [1] "12345-6789" "12345-6789"
## 
## [[6]]
## [1] "12345 6789" "12345 6789"
## 
## [[7]]
## character(0)

Note: R's regular expressions are similar to those of other languages but have R-specific behavior. This question is about R's regex, not regex in general.

hwnd · Accepted Answer

Lookaround assertion

You can use a combination of a negative lookbehind and a word boundary \b here. In an R string literal each backslash must be doubled:

regmatches(text.var, gregexpr('(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b', text.var, perl = TRUE))

Explanation:

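The negative lookbehind asserts that what precedes the match is not a digit, so a match cannot start in the middle of a longer run of digits.

The word boundary asserts that on one side there is a word character and on the other side there is not. Because digits are word characters, a run such as 1234567 fails here: after the first five digits the next character is another digit, so there is no boundary.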

(?<!        # look behind to see if there is not:
  \d        #   digits (0-9)
)           # end of look-behind
\d{5}       # digits (0-9) (5 times)
(?:         # group, but do not capture (optional):
  [ -]      #   any character of: ' ', '-'
  \d{4}     #   digits (0-9) (4 times)
)?          # end of grouping
\b          # the boundary between a word character (\w) and not a word character
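
To see what each assertion contributes, the pattern can be compared against variants with one assertion removed. This is only a sketch: the test strings and the names full, no_behind, no_bound and extract are made up for illustration and are not part of the original answer.

```r
# Hypothetical inputs chosen to isolate each assertion.
x <- c("call 2134567 now", "zip 12345-6789", "before12345")

full      <- "(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b"  # pattern from the answer
no_behind <- "\\d{5}(?:[ -]\\d{4})?\\b"           # lookbehind dropped
no_bound  <- "(?<!\\d)\\d{5}(?:[ -]\\d{4})?"      # word boundary dropped

# Illustrative helper: all matches per element, using PCRE (perl = TRUE).
extract <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

res_full      <- extract(full, x)       # nothing from "2134567"
res_no_behind <- extract(no_behind, x)  # picks up the tail "34567"
res_no_bound  <- extract(no_bound, x)   # picks up the head "21345"

print(res_full)
print(res_no_behind)
print(res_no_bound)
```

Without the lookbehind the match can start inside the 7-digit run, and without the word boundary it can stop inside it. With both in place the run is rejected, while "12345-6789" and the zip glued to "before" are still found.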

Additional options

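You could also use the stringi package, which is generally faster than the base R functions. Note that stri_extract_all_regex returns NA rather than character(0) for elements with no match, unless omit_no_match = TRUE is set.

> library(stringi)
> stri_extract_all_regex(text.var, '(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b')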
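
The title also mentions removing zip codes, which the calls above do not do. A minimal base R sketch, assuming the same pattern and a few strings taken from the question, is to pass it to gsub with perl = TRUE. The whitespace cleanup at the end is an added assumption about the desired result, not something the question specifies.

```r
# Same pattern as in the answer; the input is a hypothetical subset of text.var.
zip_pattern <- "(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b"

x <- c("Rat Race, XX, 12345",
       "Ignore phone numbers(613)2134567",
       "Grab zips with dashes 12345-6789 or no space before12345-6789")

# Delete every zip code match.
removed <- gsub(zip_pattern, "", x, perl = TRUE)

# Assumed cleanup: collapse runs of whitespace and trim the ends.
removed <- trimws(gsub("\\s+", " ", removed))

print(removed)
```

The phone number in the second string is left untouched, because the same lookbehind and word boundary that block it from being extracted also block it from being replaced.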
