
strsplit and keep part after first underscore

Tags:

string

split

r

I would like to keep the part after the FIRST underscore. Please see the example code.

colnames(df)
"EGAR00001341740_P32_1"    "EGAR00001341741_PN32"

My attempt is below, but it returns only P32 instead of P32_1, which is wrong.

sapply(strsplit(colnames(df), split='_', fixed=TRUE), function(x) (x[2]))

desired output: P32_1, PN32

user2300940 asked Dec 24 '22 08:12



2 Answers

This can be done with a regex: match zero or more characters that are not an underscore ([^_]*) from the start (^) of the string, followed by an underscore (_), and replace the match with an empty string ("").

colnames(df) <- sub("^[^_]*_", "", colnames(df))
colnames(df)
#[1] "P32_1" "PN32"
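
As a small sketch of how this pattern behaves, the snippet below applies it to the two names from the question plus one made-up name without an underscore (an assumption added purely for illustration). Since sub replaces only the first match and the pattern requires an underscore, a string with no underscore should come back unchanged.

```r
# names from the question, plus a hypothetical one without an underscore
x <- c("EGAR00001341740_P32_1", "EGAR00001341741_PN32", "noUnderscore")

# drop everything up to and including the first underscore;
# elements that contain no underscore do not match and are left as they are
res <- sub("^[^_]*_", "", x)
res
```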

strsplit splits wherever the split character occurs. One option is str_split from stringr, which has an argument n that sets the maximum number of pieces. With n = 2 we get at most 2 substrings, because the string is split only at the first _.

library(stringr)
sapply(str_split(colnames(df), "_",  n = 2), `[`, 2)
#[1] "P32_1" "PN32" 
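
If stringr is not available, a similar "split only at the first underscore" effect can be sketched in base R with regmatches and invert = TRUE. This is an alternative idiom, not part of the answer above; the variable names parts and after_first are illustrative.

```r
x <- c("EGAR00001341740_P32_1", "EGAR00001341741_PN32")

# regexpr() locates only the first "_" in each string;
# invert = TRUE returns the pieces before and after that single match
parts <- regmatches(x, regexpr("_", x), invert = TRUE)

# keep the second piece, i.e. everything after the first underscore
after_first <- sapply(parts, `[`, 2)
after_first
```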
akrun answered Jan 12 '23 02:01



Here are a few ways. The first fixes the code in the question and the remaining ones are alternatives. All use only base except (6). (4) and (7) assume that the first field is fixed length, which is the case in the question.

x <- c("EGAR00001341740_P32_1", "EGAR00001341741_PN32")

# 1 - using strsplit
sapply(strsplit(x, "_"), function(x) paste(x[-1], collapse = "_"))
## [1] "P32_1" "PN32"

# 2 - a bit easier using sub.  *? is a non-greedy match
sub(".*?_", "", x)
## [1] "P32_1" "PN32" 

# 3 - locate the first underscore and extract all after that
substring(x, regexpr("_", x) + 1)
## [1] "P32_1" "PN32" 

# 4 - if the first field is fixed length as in the example
substring(x, 17)
## [1] "P32_1" "PN32" 

# 5 - replace first _ with character that does not appear and remove all until it
sub(".*;", "", sub("_", ";", x))
## [1] "P32_1" "PN32" 

# 6 - extract everything after first _
library(gsubfn)
strapplyc(x, "_(.*)", simplify = TRUE)
## [1] "P32_1" "PN32" 

# 7 - like (4) assumes fixed length first field
read.fwf(textConnection(x), widths = c(16, 99), as.is = TRUE)$V2
## [1] "P32_1" "PN32" 
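
As a quick sanity check, the base-only approaches that do not rely on a fixed-length first field can be compared against each other on the example data. This is only a sketch; the names r1, r2, r3, r5 and all_same are illustrative, and approach (1) is written here joining the remaining pieces with an underscore.

```r
x <- c("EGAR00001341740_P32_1", "EGAR00001341741_PN32")

# (1) split at every underscore, drop the first piece, rejoin with "_"
r1 <- sapply(strsplit(x, "_", fixed = TRUE),
             function(p) paste(p[-1], collapse = "_"))
# (2) non-greedy match up to the first underscore
r2 <- sub(".*?_", "", x)
# (3) position of the first underscore, then take everything after it
r3 <- substring(x, regexpr("_", x) + 1)
# (5) mark the first underscore with ";" and strip up to the marker
r5 <- sub(".*;", "", sub("_", ";", x))

# TRUE when all four approaches give the same result on this input
all_same <- all(r1 == r2, r2 == r3, r3 == r5)
all_same
```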
G. Grothendieck answered Jan 12 '23 02:01
