I have a vector of data in the form ‘aaa_9999_1’, where the first part is an alphabetic location code, the second is the four-digit year, and the last is a unique point identifier. For example, there are multiple sil_2007_X points, each with a different last digit. I need to split this field on the “_” character and save only the unique ID number into a new vector. I tried:
oss$point <- unlist(strsplit(oss$id, split='_', fixed=TRUE))[3]
based on a response here: R remove part of string. I get a single response of “1”. If I just run
strsplit(oss$id, split='_', fixed=TRUE)
I can generate the split list:
> head(oss$point)
[[1]]
[1] "sil" "2007" "1"
[[2]]
[1] "sil" "2007" "2"
[[3]]
[1] "sil" "2007" "3"
[[4]]
[1] "sil" "2007" "4"
[[5]]
[1] "sil" "2007" "5"
[[6]]
[1] "sil" "2007" "6"
Adding the [3] at the end just gives me the [[3]] result: “sil” “2007” “3”. What I want is a vector of the 3rd part (the unique number) of all records. I feel like I’m close to understanding this, but it is taking too much time (like most of a day) on a deadline project. Thanks for any feedback.
Note that splitting into single characters can be done via split = character(0) or split = ""; the two are equivalent.
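As a quick illustration of that note, here is a minimal sketch; the vector x is made up for the example.

```r
# Split each string into its individual characters.
x <- c("sil", "2007")                              # hypothetical input
chars_empty <- strsplit(x, split = "")
chars_zero  <- strsplit(x, split = character(0))

# Both forms of split give the same list of characters.
identical(chars_empty, chars_zero)
# [1] TRUE
chars_empty[[1]]
# [1] "s" "i" "l"
```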
To split a vector into groups of equal or unequal size, use split() together with rep(). The rep() call builds the grouping vector: one group label per element, repeated as many times as that group should have elements.
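A small sketch of the split() plus rep() idea, assuming a toy vector of six values and group sizes chosen only for illustration:

```r
x <- 1:6                                           # hypothetical data

# Equal groups: labels 1, 1, 2, 2, 3, 3 give three groups of two.
equal_groups <- split(x, rep(1:3, each = 2))

# Unequal groups: labels 1, 2, 2, 3, 3, 3 give groups of size 1, 2 and 3.
unequal_groups <- split(x, rep(1:3, times = c(1, 2, 3)))

lengths(equal_groups)
lengths(unequal_groups)
```

The grouping vector has to be as long as x; otherwise split() recycles it, which is rarely what you want here.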
split() can also cut a vector into consecutive chunks of a fixed length. Dividing seq_along(x) by the chunk length and passing the result through ceiling() gives each element its chunk number, which split() then uses as the grouping factor. Note that ceiling() itself takes a single numeric argument.
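The chunking idiom might look like the sketch below; x and chunk_length are placeholder names chosen for this example.

```r
x <- 1:7                                           # hypothetical data
chunk_length <- 3                                  # assumed chunk size

# seq_along(x) / chunk_length is 0.33, 0.67, 1, 1.33, ...;
# ceiling() turns that into chunk numbers 1, 1, 1, 2, 2, 2, 3.
chunk_id <- ceiling(seq_along(x) / chunk_length)
chunks <- split(x, chunk_id)
```

The last chunk is simply shorter when the length of x is not a multiple of the chunk length.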
strsplit creates a list, so I would try the following:
lapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a list
sapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a vector (even though a list is also a vector)
Here `[` is the subsetting function, and the 3 passed to it extracts the third element of each split. If you prefer a vector, replace lapply with sapply.
Here's an example:
mystring <- c("A_B_C", "D_E_F")
lapply(strsplit(mystring, "_"), `[`, 3)
# [[1]]
# [1] "C"
#
# [[2]]
# [1] "F"
sapply(strsplit(mystring, "_"), `[`, 3)
# [1] "C" "F"
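Applied to data shaped like the question's, the same idea could be sketched as follows. The oss data frame here is a made-up stand-in, and vapply is swapped in for sapply only because it guarantees a character vector; as.character() guards against id being stored as a factor.

```r
# Hypothetical stand-in for the asker's data frame.
oss <- data.frame(id = c("sil_2007_1", "sil_2007_2", "abc_2008_7"),
                  stringsAsFactors = FALSE)

# Split every id on "_" and keep the third piece of each split.
parts <- strsplit(as.character(oss$id), split = "_", fixed = TRUE)
oss$point <- vapply(parts, `[`, character(1), 3)
oss$point
# [1] "1" "2" "7"
```

If the identifiers are needed as numbers, as.integer(oss$point) converts them.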
If there is an easily definable pattern, gsub might be a good option too, and avoids splitting. See the comments for improved (more robust) versions along the same lines from DWin and Josh O'Brien.
gsub(".*_.*_(.*)", "\\1", mystring)
# [1] "C" "F"
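One way to tighten that pattern, offered as an assumption of mine and not necessarily what the commenters suggested, is to anchor it and match "anything but an underscore" for each field. With greedy `.*`, a string with extra fields returns the last field instead of the third.

```r
tricky <- c("A_B_C", "D_E_F_G")                    # hypothetical input

# Greedy version: the first .* swallows as much as it can.
greedy <- gsub(".*_.*_(.*)", "\\1", tricky)
greedy
# [1] "C" "G"

# Anchored version: each field is a run of non-underscore characters.
anchored <- sub("^[^_]*_[^_]*_([^_]*).*$", "\\1", tricky)
anchored
# [1] "C" "F"
```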
And, finally, just for fun, you can expand on the unlist approach to make it work by recycling a vector of TRUEs and FALSEs to extract every third item (since we know in advance that all the splits will result in an identical structure).
unlist(strsplit(mystring, "_"), use.names = FALSE)[c(FALSE, FALSE, TRUE)]
# [1] "C" "F"
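To see why the plain unlist(...)[3] attempt returns a single value while the logical index works, it may help to look at the flattened vector step by step:

```r
mystring <- c("A_B_C", "D_E_F")

# unlist() flattens the list of splits into one long vector.
flat <- unlist(strsplit(mystring, "_"), use.names = FALSE)
flat
# [1] "A" "B" "C" "D" "E" "F"

# A single index picks exactly one element of that long vector.
flat[3]
# [1] "C"

# The logical index is recycled, so it keeps positions 3, 6, 9, ...
every_third <- flat[c(FALSE, FALSE, TRUE)]
every_third
# [1] "C" "F"
```

This only lines up when every string splits into exactly three pieces.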
If you're extracting not by numeric position, but just looking to extract the last value after a delimiter, you have a few different alternatives.
Use a greedy regex:
gsub(".*_(.*)", "\\1", mystring)
# [1] "C" "F"
Use a convenience function like stri_extract* from the "stringi" package:
library(stringi)
stri_extract_last_regex(mystring, "[A-Z]+")
# [1] "C" "F"
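If you would rather stay in base R, a rough sketch of two more ways to take the last piece is shown below; the mixed vector is invented to show strings with differing numbers of fields.

```r
mixed <- c("A_B_C", "D_E", "F")                    # hypothetical input

# Split, then take the final element of each split.
last_by_split <- vapply(strsplit(mixed, "_", fixed = TRUE),
                        function(p) p[length(p)], character(1))

# Or delete everything up to and including the last underscore.
last_by_sub <- sub(".*_", "", mixed)

last_by_split
# [1] "C" "E" "F"
identical(last_by_split, last_by_sub)
# [1] TRUE
```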