
In R, split a character vector by a specific character; save 3rd piece in new vector

Tags: r, vector

I have a vector of data in the form ‘aaa_9999_1’, where the first part is an alphabetic location code, the second is the four-digit year, and the last is a unique point identifier. For example, there are multiple sil_2007_X points, each with a different last digit. I need to split this field on the “_” character and save only the unique ID number into a new vector. I tried:

oss$point <- unlist(strsplit(oss$id, split='_', fixed=TRUE))[3]

based on a response here: R remove part of string. I get a single response of “1”. If I just run

strsplit(oss$id, split='_', fixed=TRUE)

I can generate the split list:

> head(oss$point)
[[1]]
[1] "sil"  "2007" "1"   

[[2]]
[1] "sil"  "2007" "2"   

[[3]]
[1] "sil"  "2007" "3"   

[[4]]
[1] "sil"  "2007" "4"   

[[5]]
[1] "sil"  "2007" "5"   

[[6]]
[1] "sil"  "2007" "6"  

Adding the [3] at the end just gives me the [[3]] result: “sil” “2007” “3”. What I want is a vector of the 3rd part (the unique number) of all records. I feel like I’m close to understanding this, but it is taking too much time (like most of a day) on a deadline project. Thanks for any feedback.

asked Oct 16 '13 17:10 by A.Birdman




1 Answer

strsplit creates a list, so I would try the following:

lapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a list
sapply(strsplit(oss$id, split='_', fixed=TRUE), `[`, 3) ## Output a vector (even though a list is also a vector)

The `[` is R's subsetting function; passing 3 as its argument extracts the third element of each list item. If you prefer a vector, use sapply instead of lapply.

Here's an example:

mystring <- c("A_B_C", "D_E_F")

lapply(strsplit(mystring, "_"), `[`, 3)
# [[1]]
# [1] "C"
# 
# [[2]]
# [1] "F"
sapply(strsplit(mystring, "_"), `[`, 3)
# [1] "C" "F"
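Applied to the data in the question, the same idea might look like the sketch below. The oss data frame here is a made-up stand-in with three ids, and vapply is swapped in for sapply only because it checks that each result is a single string; that is my substitution, not part of the approach above. It also shows why the original attempt returned a single "1": unlist flattens every piece into one long vector before [3] is applied.

```r
# Hypothetical stand-in for the asker's data frame (only the id column matters)
oss <- data.frame(id = c("sil_2007_1", "sil_2007_2", "sil_2007_3"),
                  stringsAsFactors = FALSE)

parts <- strsplit(oss$id, split = "_", fixed = TRUE)

# unlist() flattens all pieces into one vector, so [3] picks a single element
flat_third <- unlist(parts)[3]

# Take the third piece of each element instead; vapply() enforces one string each
oss$point <- vapply(parts, `[`, FUN.VALUE = character(1), 3)
```

Assigning the single value from flat_third to oss$point would simply recycle it down the whole column, which matches the symptom described in the question.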

If there is an easily definable pattern, gsub might be a good option too, since it avoids splitting altogether. See the comments for improved (more robust) versions along the same lines from DWin and Josh O'Brien.

gsub(".*_.*_(.*)", "\\1", mystring)
# [1] "C" "F"
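One way to make the pattern a bit stricter is to anchor it and match "anything except an underscore" instead of ".*". This is only my own sketch of that idea, not necessarily what the commenters suggested. Keep in mind that sub and gsub return the input unchanged when the pattern does not match, so malformed ids pass through silently.

```r
mystring <- c("A_B_C", "D_E_F")

# Strip the first two underscore-delimited pieces, anchored at the start
third <- sub("^[^_]*_[^_]*_", "", mystring)

# An id with only one underscore does not match and comes back as-is
unchanged <- sub("^[^_]*_[^_]*_", "", "A_B")
```

If an id had a fourth piece, this version would return everything after the second underscore (for example "C_D"), so it assumes exactly three pieces.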

And, finally, just for fun, you can expand on the unlist approach to make it work by recycling a vector of TRUEs and FALSEs to extract every third item (since we know in advance that all the splits will result in an identical structure).

unlist(strsplit(mystring, "_"), use.names = FALSE)[c(FALSE, FALSE, TRUE)]
# [1] "C" "F"
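The caveat about identical structure matters in practice. As a rough illustration with made-up input where one id is missing its last piece, the recycled mask silently drifts out of alignment, while extracting by position returns NA for the short id. Checking lengths() first is one possible guard.

```r
ragged <- c("A_B_C", "D_E")   # hypothetical input: second id has only two pieces
parts <- strsplit(ragged, "_", fixed = TRUE)

# TRUE only if every split produced the same number of pieces
same_shape <- length(unique(lengths(parts))) == 1

# The mask is recycled over the flattened vector A, B, C, D, E
by_mask <- unlist(parts, use.names = FALSE)[c(FALSE, FALSE, TRUE)]

# Position-based extraction keeps one result per input element
by_position <- sapply(parts, `[`, 3)
```

Here by_mask has length 1 while the input has length 2, so assigning it back to a data frame column would either fail or recycle the wrong value.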

If you're extracting not by numeric position, but just looking to extract the last value after a delimiter, you have a few different alternatives.

Use a greedy regex:

gsub(".*_(.*)", "\\1", mystring)
# [1] "C" "F"

Use a convenience function like stri_extract* from the "stringi" package:

library(stringi)
stri_extract_last_regex(mystring, "[A-Z]+")
# [1] "C" "F"
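If you would rather stay in base R and still get the last piece regardless of how many pieces there are, something along these lines should work. The ids vector is made up for illustration.

```r
ids <- c("sil_2007_1", "abc_12", "solo")   # hypothetical ids with varying piece counts

# Split, then take the final piece of each element
last_by_split <- vapply(strsplit(ids, "_", fixed = TRUE),
                        function(p) p[length(p)],
                        FUN.VALUE = character(1))

# Same result here with a greedy regex: drop everything up to the last underscore
last_by_regex <- sub(".*_", "", ids)
```

The two can disagree on ids that end with the delimiter: strsplit drops a trailing empty piece, whereas the regex version returns an empty string.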
answered Nov 15 '22 04:11 by A5C1D2H2I1M1N2O1R2T1