R split a character string on the second underscore

Tags:

regex

split

r

I have character strings with two underscores, like these:

c54254_g4545_i5454
c434_g4_i455
c5454_g544_i3
.
.
etc

I need to split these strings at the second underscore, and I'm afraid I have no clue how to do that in R (or any other tool, for that matter). I'd be very happy if anyone could sort me out here. Thank you, SM

SigneMaten asked Sep 04 '15 12:09


2 Answers

One way would be to replace the second underscore with another delimiter (e.g. a space) using sub and then split on that delimiter.

Using sub, we match one or more characters that are not a _ from the beginning (^) of the string (^[^_]+), followed by the first underscore (_), followed by one or more characters that are not a _ ([^_]+). We capture all of that as the first group by wrapping it in parentheses ((....)). Then we match the second _, followed by zero or more characters up to the end of the string, which go into the second capture group ((.*)$). In the replacement, we separate the first (\\1) and second (\\2) groups with a space.

strsplit(sub('(^[^_]+_[^_]+)_(.*)$', '\\1 \\2', v1), ' ')
#[[1]]
#[1] "c54254_g4545" "i5454"       

#[[2]]
#[1] "c434_g4" "i455"   

#[[3]]
#[1] "c5454_g544" "i3" 

data

v1 <- c('c54254_g4545_i5454', 'c434_g4_i455', 'c5454_g544_i3')
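If a table-like result is more convenient than a list, the same sub() call can be wrapped in a small helper. This is only a sketch: the function name split_second_underscore and the column names prefix and suffix are made up for illustration, and it assumes every input string has at least two underscores and contains no spaces.

```r
# Sample data from the answer above
v1 <- c('c54254_g4545_i5454', 'c434_g4_i455', 'c5454_g544_i3')

# Hypothetical helper (not part of the original answer): split each string
# at its second underscore and return a two-column character matrix.
# Assumes at least two underscores per string and no spaces in the input.
split_second_underscore <- function(x) {
  # Replace the second underscore with a space, as in the sub() call above
  marked <- sub('^([^_]+_[^_]+)_(.*)$', '\\1 \\2', x)
  # strsplit() returns a list; bind the pieces into one row per input string
  parts <- do.call(rbind, strsplit(marked, ' ', fixed = TRUE))
  colnames(parts) <- c('prefix', 'suffix')
  parts
}

res <- split_second_underscore(v1)
res
```

Individual columns can then be pulled out with res[, 'prefix'] or res[, 'suffix'].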
akrun answered Oct 27 '22 09:10


# x is assumed to hold the same strings as v1 in the answer above
x <- v1
strsplit(sub("(_)(?=[^_]+$)", " ", x, perl = TRUE), " ")
#[[1]]
#[1] "c54254_g4545" "i5454"       
#
#[[2]]
#[1] "c434_g4" "i455"   
#
#[[3]]
#[1] "c5454_g544" "i3"

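With the pattern "(_)(?=[^_]+$)", we match an underscore that is followed only by one or more non-underscore characters up to the end of the string, i.e. the last underscore. Since the strings have exactly two underscores, the last one is the second one. The lookahead (?=...) checks the trailing characters without consuming them, so the replacement needs no backreferences, and the parentheses around the _ are optional.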
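To see which underscore each pattern picks, here is a small sketch comparing the lookahead pattern with the two-group pattern from the other answer. The string y with three underscores is a made-up example, not from the question; x repeats the sample strings.

```r
x <- c('c54254_g4545_i5454', 'c434_g4_i455', 'c5454_g544_i3')

# Lookahead pattern without the capture group: same result on the sample data
by_last <- strsplit(sub("_(?=[^_]+$)", " ", x, perl = TRUE), " ")

# Hypothetical string with three underscores, to show where the patterns differ
y <- "a_b_c_d"

# The lookahead targets the LAST underscore
last_marked <- sub("_(?=[^_]+$)", " ", y, perl = TRUE)

# The two-group pattern targets the SECOND underscore
second_marked <- sub('(^[^_]+_[^_]+)_(.*)$', '\\1 \\2', y)

print(last_marked)    # "a_b_c d"
print(second_marked)  # "a_b c_d"
```

For strings with exactly two underscores, as in the question, both patterns split at the same place.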

Pierre L answered Oct 27 '22 08:10