
Convert long state names embedded with other text to two-letter state abbreviations

Tags: regex, r

My objective is to identify US states written out in a character vector that has other text and convert the states to abbreviated form. For example, "North Carolina" to "NC". It is simple if the vector only has long-form state names. However, my vector has other text in random places, as in the example "states".

states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")

From another post I found this:

state.abb[match(states, state.name)]

but it converts only the standalone Texas

> state.abb[match(states, state.name)]
[1] NA   NA   NA   NA   "TX" NA  

and not the New Jersey, Alabama and Iowa strings.

From "Fast grep with a vectored pattern or match, to return list of all matches" I tried:

sapply(states, grep(pattern = state.name, x = states, value = TRUE))

but that fails with:

Error in get(as.character(FUN), mode = "function", envir = envir) : 
  object 'Alabama 02138' of mode 'function' was not found
In addition: Warning message:
In grep(pattern = state.name, x = states, value = TRUE) :
  argument 'pattern' has length > 1 and only the first element will be used

Nor does this work:

sapply(states, function(x) state.abb[grep(state.name, states)])

This question did not help: "regular expression to convert state names to abbreviations"

How do I convert the embedded long names to the state abbreviation?

EDIT: I want to return the vector with the only change being that the long names of the states have been abbreviated, e.g., "Plano New Jersey" becomes "Plano NJ".

Thank you for correcting and/or educating me.

lawyeR asked Feb 13 '23 03:02


2 Answers

Here's another approach:

library(qdap)
mgsub(state.name, state.abb, states)

## [1] "Plano NJ"      "NC"            "xyz"           "AL 02138"
## [5] "TX"            "Town IA 99999"

If you are uncertain that the states will be capitalized you may want to use:

mgsub(state.name, state.abb, states, ignore.case=TRUE, fixed=FALSE)
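If you would rather not depend on qdap, a plain base R loop can do much the same job. This is only a minimal sketch, not qdap's implementation: `abbreviate_states` is a hypothetical helper name, and the choices to process the longest names first and to require word boundaries are my own assumptions about what you want.

```r
# Hypothetical helper (not part of qdap or base R): swap full state names
# for their two-letter abbreviations, leaving all other text untouched.
abbreviate_states <- function(x, ignore.case = FALSE) {
  # Assumption: handle longer names first so "West Virginia" is replaced
  # before the shorter "Virginia" gets a chance to match inside it.
  ord <- order(nchar(state.name), decreasing = TRUE)
  for (i in ord) {
    # \\b restricts the match to whole words
    pattern <- paste0("\\b", state.name[i], "\\b")
    x <- gsub(pattern, state.abb[i], x, ignore.case = ignore.case, perl = TRUE)
  }
  x
}

states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")
abbreviate_states(states)
```

The loop makes one `gsub` pass per state name, so it is not the fastest option on very long vectors, but it keeps the matching rules easy to see and adjust.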
Tyler Rinker answered Feb 14 '23 16:02


Try:

indx <- paste0(".*(", paste(state.name, collapse="|"), ").*")
v1 <- gsub(indx, "\\1", states)
ifelse( v1 %in% state.abb, v1, state.abb[match(v1, state.name)])
#[1] "NJ" "NC" NA   "AL" "TX" "IA"

If you want to replace only the state names with their abbreviations and keep the rest of the text, you could also do:

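indx1 <- paste(state.name, collapse="|")   # one alternation pattern covering every state name
indx2 <- state.abb[match(v1, state.name)]  # v1 is the vector built in the previous snippet

mapply(gsub, indx1, indx2, states, USE.NAMES=FALSE)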
#[1] "Plano NJ"      "NC"            "xyz"           "AL 02138"     
#[5] "TX"            "Town IA 99999"
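One caveat, as far as I can tell: the code above picks a single abbreviation per element, so a string that mentions two states would get the same abbreviation for both, and the greedy `.*` in `indx` can land on "Virginia" inside "West Virginia". If that matters for your data, a variant based on `gregexpr` and `regmatches<-` looks up each match separately. Treat it as a sketch under those assumptions; `abbr_each_match` and `lookup` are hypothetical names introduced here.

```r
# Hypothetical helper: replace every state name found in each string,
# looking up the abbreviation for each match individually.
abbr_each_match <- function(x) {
  # Named vector mapping full name -> abbreviation
  lookup <- setNames(state.abb, state.name)
  # Defensive ordering: longest names first within the alternation
  nms <- state.name[order(nchar(state.name), decreasing = TRUE)]
  pattern <- paste0("\\b(", paste(nms, collapse = "|"), ")\\b")
  m <- gregexpr(pattern, x, perl = TRUE)
  # For each element, map the matched names to abbreviations;
  # elements without a match yield character(0) and stay unchanged.
  regmatches(x, m) <- lapply(regmatches(x, m), function(hit) unname(lookup[hit]))
  x
}

states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")
abbr_each_match(states)
abbr_each_match("Texas and Iowa")
```

This version is case sensitive, because the matched text is used directly as the lookup key.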
akrun answered Feb 14 '23 16:02