In R, split a character vector by a specific character; save 3rd piece in new vector




I have a vector of data in the form ‘aaa_9999_1’ where the first part is an alpha-location code, the second is the four digit year, and the final is a unique point identifier. E.g., there are multiple sil_2007_X points, each with a different last digit. I need to split this field, using the “_” character and save only the unique ID number into a new vector. I tried:

oss$point <- unlist(strsplit(oss$id, split='_', fixed=TRUE))[3]

based on a response here: R remove part of string. I get a single response of “1”. If I just run

strsplit(oss$id, split='_', fixed=TRUE)

I can generate the split list:

> head(oss$point)
[[1]]
[1] "sil"  "2007" "1"

[[2]]
[1] "sil"  "2007" "2"

[[3]]
[1] "sil"  "2007" "3"

[[4]]
[1] "sil"  "2007" "4"

[[5]]
[1] "sil"  "2007" "5"

[[6]]
[1] "sil"  "2007" "6"

Adding the [3] at the end just gives me the [[3]] result: “sil” “2007” “3”. What I want is a vector of the 3rd part (the unique number) of all records. I feel like I’m close to understanding this, but it is taking too much time (like most of a day) on a deadline project. Thanks for any feedback.

strsplit creates a list, so I would try the following:

lapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a list
sapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a vector (even though a list is also a vector)

The `[` is R's extraction function, and the 3 is passed to it as the index, so each list element is reduced to its third piece. If you prefer a vector, replace lapply with sapply.
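To make the role of `[` more concrete, here is a minimal sketch showing that calling it as a function is the same as using bracket syntax, and that passing it to sapply is equivalent to writing an anonymous function. The IDs are made up for illustration.

```r
parts <- c("sil", "2007", "1")

# Calling `[` as an ordinary function is the same as bracket syntax.
`[`(parts, 3)
# [1] "1"
parts[3]
# [1] "1"

# sapply() forwards the extra argument 3 to `[`, so these two calls agree.
ids <- c("sil_2007_1", "sil_2007_2")   # made-up IDs for illustration
pieces <- strsplit(ids, "_", fixed = TRUE)
via_bracket <- sapply(pieces, `[`, 3)
via_lambda  <- sapply(pieces, function(p) p[3])
identical(via_bracket, via_lambda)
# [1] TRUE
```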

Here's an example:

mystring <- c("A_B_C", "D_E_F")

lapply(strsplit(mystring, "_"), `[`, 3)
# [[1]]
# [1] "C"
# [[2]]
# [1] "F"
sapply(strsplit(mystring, "_"), `[`, 3)
# [1] "C" "F"
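Applied to data shaped like the question's, the same idea might look like the sketch below. The data frame is a hypothetical stand-in for oss, and vapply is swapped in for sapply as a stricter variant that insists on exactly one string per ID. It also shows why the original unlist(...)[3] attempt produced a single "1".

```r
# Hypothetical stand-in for the asker's data frame.
oss <- data.frame(id = c("sil_2007_1", "sil_2007_2", "sil_2007_3"),
                  stringsAsFactors = FALSE)

pieces <- strsplit(oss$id, split = "_", fixed = TRUE)

# unlist() flattens all pieces into one long vector, so [3] picks a single
# value: the third piece of the first ID. Assigning it to a column would
# recycle that one value down every row.
flat_third <- unlist(pieces)[3]
flat_third
# [1] "1"

# vapply() takes the third piece of each ID and checks that every result
# is a length-one character value.
oss$point <- vapply(pieces, `[`, character(1), 3)
oss$point
# [1] "1" "2" "3"
```

If an ID has fewer than three pieces, its entry comes back as NA rather than raising an error.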

If there is an easily definable pattern, gsub might be a good option too, and avoids splitting. See the comments for improved (more robust) versions along the same lines from DWin and Josh O'Brien.

gsub(".*_.*_(.*)", "\\1", mystring)
# [1] "C" "F"
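One way such a pattern can be tightened (a sketch of my own, not necessarily what the comments propose) is to anchor it at the start and use [^_]* so that each field cannot run past an underscore. The difference only shows up when a string has more than three fields, as in the made-up "A_B_C_D" below.

```r
mystring <- c("A_B_C", "D_E_F")

# Drop the first two underscore-delimited fields; [^_]* cannot cross a "_".
anchored <- sub("^[^_]*_[^_]*_", "", mystring)
anchored
# [1] "C" "F"

# With a fourth field, the two patterns disagree.
x <- "A_B_C_D"                                 # made-up input
greedy_result   <- gsub(".*_.*_(.*)", "\\1", x)   # keeps only the last field
anchored_result <- sub("^[^_]*_[^_]*_", "", x)    # keeps everything after the 2nd "_"
greedy_result
# [1] "D"
anchored_result
# [1] "C_D"
```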

And, finally, just for fun, you can expand on the unlist approach to make it work by recycling a vector of TRUEs and FALSEs to extract every third item (since we know in advance that all the splits will result in an identical structure).

unlist(strsplit(mystring, "_"), use.names = FALSE)[c(FALSE, FALSE, TRUE)]
# [1] "C" "F"
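The recycling step can be unpacked as in the sketch below, which also adds a guard (my addition, not part of the answer) for the assumption that every string splits into exactly three pieces.

```r
mystring <- c("A_B_C", "D_E_F")
flat <- unlist(strsplit(mystring, "_"), use.names = FALSE)
flat
# [1] "A" "B" "C" "D" "E" "F"

# The 3-element logical index is recycled to length 6: F F T F F T.
by_recycling <- flat[c(FALSE, FALSE, TRUE)]

# The same selection written with explicit positions 3, 6, ...
by_position <- flat[seq(3, length(flat), by = 3)]
identical(by_recycling, by_position)
# [1] TRUE

# The trick is only valid if every string has exactly 3 pieces.
stopifnot(all(lengths(strsplit(mystring, "_")) == 3))
```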

If you're extracting not by numeric position, but just looking to extract the last value after a delimiter, you have a few different alternatives.

Use a greedy regex:

gsub(".*_(.*)", "\\1", mystring)
# [1] "C" "F"
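To see what greedy matching buys you here, this small sketch runs the same idea on made-up inputs with different numbers of fields and contrasts it with a non-greedy pattern. It uses sub with an empty replacement, which is an equivalent way of writing the call above.

```r
x <- c("sil_2007_1", "a_b_c_d_42")   # made-up inputs with differing field counts

# Greedy ".*_" swallows everything up to the LAST underscore.
last_field <- sub(".*_", "", x)
last_field
# [1] "1"  "42"

# For contrast, non-greedy ".*?_" (Perl syntax) stops at the FIRST underscore.
after_first <- sub(".*?_", "", x, perl = TRUE)
after_first
# [1] "2007_1"   "b_c_d_42"
```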

Use a convenience function like stri_extract* from the "stringi" package:

library(stringi)
stri_extract_last_regex(mystring, "[A-Z]+")
# [1] "C" "F"
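If you would rather not add a package dependency, a base R sketch along the same lines is to split and keep the final piece of each element. The third input is made up to show that the number of pieces may vary. It assumes no input is an empty string.

```r
mystring <- c("A_B_C", "D_E_F", "G_H")   # third entry is made up: only two pieces

pieces <- strsplit(mystring, "_", fixed = TRUE)

# p[length(p)] is the final piece, however many pieces there are.
last_piece <- vapply(pieces, function(p) p[length(p)], character(1))
last_piece
# [1] "C" "F" "H"
```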
