Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex to extract US zip codes but not faux codes

Tags:

string

regex

r

Using the XML package and XPath to scrape addresses from websites, I sometimes can get only a string that has embedded in it the zip code I want. It is straightforward to extract the zip code, but sometimes there are other five-digit strings that show up.

Here are some variations on the problem in a df.

zips <- data.frame(id = seq(1, 5), address = c("Company, 18540 Main Ave., City, ST 12345", "Company 18540 Main Ave. City ST 12345-0000", "Company 18540 Main Ave. City State 12345", "Company, 18540 Main Ave., City, ST 12345 USA", "Company, One Main Ave Suite 18540, City, ST 12345")) 

The R statement to extract zip codes (both 5 digit and plus 4 digits) is below, but it is tricked by the faux zip codes of the street number and the suite number (and there may be other possibilities in other address strings).

regmatches(zips$address, gregexpr("\\d{5}([-]?\\d{4})?", zips$address, perl = TRUE))

An answer to a previous SO question suggested that a "regex will return the last consecutive five digit string. It uses a negative look-ahead to ensure the absence of 5-digit strings after the one being returned."
Extracting a zip code from an address string

\b\d{5}\b(?!.*\b\d{5}\b)

But that question and answer deals with PHP and offers an if loop with preg_matches()` I am not familiar with those languages and tools, but the idea might be right.

My question: what R code will find real zip codes and ignore false lookalikes?

like image 321
lawyeR Avatar asked Aug 07 '14 10:08

lawyeR


2 Answers

This is my first regex answer (I am still learning) so hopefully I don't say anything wrong to lead you in the wrong direction.

Basically, this regex looks for, as you hinted in your question, the last string that looks like a zip code which is not followed by a string that looks like a zip code

the basic syntax is pattern(?!.*pattern) which says to match pattern only if it is not followed (a negative look-ahead assertion, syntax: (?! )) by anything .* and pattern

so we can replace pattern with what you are interested in finding:

[0-9]{5}(-[0-9]{4})?

that is, a digit string [0-9] of exactly 5 characters {5} (which may optionally be followed ? by another group defined as a hyphen and another digit string of length four (-[0-9]{4})

put it all together with gregexpr to search for the matches and regmatches to interpret the results for me, I get:

zips <- data.frame(id = seq(1, 5), address = c("Company, 18540 Main Ave., City, ST 12345", "Company 18540 Main Ave. City ST 12345-0000", "Company 18540 Main Ave. City State 12345", "Company, 18540 Main Ave., City, ST 12345 USA", "Company, One Main Ave Suite 18540, City, ST 12345")) 
regmatches(zips$address,
           gregexpr('[0-9]{5}(-[0-9]{4})?(?!.*[0-9]{5}(-[0-9]{4})?)', zips$address, perl = TRUE))

# [[1]]
# [1] "12345"
# 
# [[2]]
# [1] "12345-0000"
# 
# [[3]]
# [1] "12345"
# 
# [[4]]
# [1] "12345"
# 
# [[5]]
# [1] "12345"
like image 186
rawr Avatar answered Oct 20 '22 12:10

rawr


The qdapRegex package has the rm_zip function for this:

zips <- data.frame(id = seq(1, 5), 
    address = c("Company, 18540 Main Ave., City, ST 12345", 
    "Company 18540 Main Ave. City ST 12345-0000", 
    "Company 18540 Main Ave. City State 12345", 
    "Company, 18540 Main Ave., City, ST 12345 USA", 
    "Company, One Main Ave Suite 18540, City, ST 12345")
)

lapply(rm_zip(zips$address, extract=TRUE), tail, 1)

## [[1]]
## [1] "12345"
## 
## [[2]]
## [1] "12345-0000"
## 
## [[3]]
## [1] "12345"
## 
## [[4]]
## [1] "12345"
## 
## [[5]]
## [1] "12345"

EDIT Per @lawyeR's comments:

I think that you want some regex that is more specific than the dictionary system used by qdapRegex. The current implementation of rm_zip allows for validation purposes and thus I wouldn't alter the regular expression it uses to be more flexible. I also wouldn't alter the function rm_zip to have additional parameters/arguments as qdapRegex attempts to have consistently operating functions.

That being said you could create your own function using the rm_ function and supply your own regular expression. I have done this using both of the parameters specified in your comment:

More complex data set:

zips <- data.frame(id = seq(1, 6), 
    address = c("Company, 18540 Main Ave., City, ST 12345", 
    "Company 18540 Main Ave. City ST 12345-0000", 
    "Company 18540 Main Ave. City State 12345", 
    "Company, 18540 Main Ave., City, ST 12345 USA", 
    "Company, One Main Ave Suite 18540m, City, ST 12345",
    "company 12345678")
)

Function to grab even if a character follows the zip

## paste together a more flexible regular expression    
pat <- pastex(
    "@rm_zip", 
    "(?<!\\d)\\d{5}(?!\\d)",
    "(?<!\\d)\\d{5}-\\d{4}(?!\\d)"
)
## Create your own function that extract is set to TRUE
rm_zip2 <- rm_(pattern=pat, extract=TRUE)
rm_zip2(zips$address)

## [[1]]
## [1] "18540" "12345"
## 
## [[2]]
## [1] "18540"      "12345-0000"
## 
## [[3]]
## [1] "18540" "12345"
## 
## [[4]]
## [1] "18540" "12345"
## 
## [[5]]
## [1] "18540" "12345"
## 
## [[6]]
## [1] NA

Function to extract just 5 digit zips

rm_zip3 <- rm_(pattern="(?<!\\d)\\d{5}(?!\\d)", extract=TRUE)
rm_zip3(zips$address)

## [[1]]
## [1] "18540" "12345"
## 
## [[2]]
## [1] "18540" "12345"
## 
## [[3]]
## [1] "18540" "12345"
## 
## [[4]]
## [1] "18540" "12345"
## 
## [[5]]
## [1] "18540" "12345"
## 
## [[6]]
## [1] NA
like image 23
Tyler Rinker Avatar answered Oct 20 '22 13:10

Tyler Rinker