I have character strings with two underscores, like these:
c54254_g4545_i5454
c434_g4_i455
c5454_g544_i3
.
.
etc
I need to split these strings at the second underscore, and I am afraid I have no clue how to do that in R (or any other tool, for that matter). I'd be very happy if anyone could sort me out here. Thank you, SM
To split a string in R, use the strsplit() function. strsplit() is a built-in R function that splits each element of a character vector into substrings. It returns a list with one element per input string, each holding the substrings that input was split into.
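As a small illustration of that return shape, here is a sketch using the strings from the question. The variable names v1 and parts are only for illustration, and fixed = TRUE is an optional choice that makes "_" a literal string rather than a regular expression.

```r
# The example strings from the question
v1 <- c("c54254_g4545_i5454", "c434_g4_i455", "c5454_g544_i3")

# Split on every underscore; the result is a list with one element per input string
parts <- strsplit(v1, "_", fixed = TRUE)

parts[[2]]
# [1] "c434" "g4"   "i455"
```

Note that this splits on both underscores, which is why the answers further down first single out the second one.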
To split a string on a specific delimiter in Java, call the split() method on the String object and pass the delimiter as the argument (Java interprets it as a regular expression). The method returns a String array with the pieces as its elements.
Note that splitting into single characters can be done via split = character(0) or split = "" ; the two are equivalent.
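A quick sketch of that equivalence; the input string and the variable names are only examples.

```r
# Splitting into single characters: both forms of `split` give the same result
chars_empty <- strsplit("c434_g4", "")[[1]]
chars_zero  <- strsplit("c434_g4", character(0))[[1]]

chars_empty
# [1] "c" "4" "3" "4" "_" "g" "4"

identical(chars_empty, chars_zero)
# [1] TRUE
```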
To split a string by underscore in Python, call the string's split() method with the underscore character "_" as the delimiter. It returns a list of the strings that result from splitting the original string at each occurrence of "_".
One way would be to replace the second underscore with another delimiter (e.g. a space) using sub and then split on that delimiter.
Using sub, we match one or more characters that are not a _ from the beginning (^) of the string (^[^_]+), followed by the first underscore (_), followed by one or more characters that are not a _ ([^_]+). We capture that as a group by placing it inside parentheses ((....)). Then we match the second _, followed by whatever remains up to the end of the string, in the second capture group ((.*)$). In the replacement, we separate the first (\\1) and second (\\2) groups with a space.
v1 <- c('c54254_g4545_i5454', 'c434_g4_i455', 'c5454_g544_i3')
strsplit(sub('(^[^_]+_[^_]+)_(.*)$', '\\1 \\2', v1), ' ')
#[[1]]
#[1] "c54254_g4545" "i5454"
#
#[[2]]
#[1] "c434_g4" "i455"
#
#[[3]]
#[1] "c5454_g544" "i3"
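To make the two stages easier to follow, the same approach can be written as separate steps. This is only a sketch: the names step1 and step2 are illustrative, and it assumes the input strings contain no spaces of their own, since the space serves as the temporary delimiter.

```r
v1 <- c("c54254_g4545_i5454", "c434_g4_i455", "c5454_g544_i3")

# Step 1: replace only the second underscore with a space
step1 <- sub("(^[^_]+_[^_]+)_(.*)$", "\\1 \\2", v1)
# step1 now holds "c54254_g4545 i5454", "c434_g4 i455", "c5454_g544 i3"

# Step 2: split on that temporary delimiter
step2 <- strsplit(step1, " ", fixed = TRUE)

step2[[1]]
# [1] "c54254_g4545" "i5454"
```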
Alternatively, with a Perl-style lookahead:
strsplit(sub("(_)(?=[^_]+$)", " ", v1, perl=TRUE), " ")
#[[1]]
#[1] "c54254_g4545" "i5454"
#
#[[2]]
#[1] "c434_g4" "i455"
#
#[[3]]
#[1] "c5454_g544" "i3"
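With the pattern "(_)(?=[^_]+$)", we match an underscore that is followed only by non-underscore characters up to the end of the string, i.e. the last underscore, which is the second one in these strings. Because the lookahead does not consume those characters, the replacement needs no backreferences. That way we only need one capture group (and even that one is optional).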
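Since the lookahead pins down a single underscore, it is also possible to skip the sub step and pass the pattern straight to strsplit. The sketch below assumes, as in the question, that each string has exactly two underscores; strictly speaking the pattern selects the last underscore, so with more underscores the split point would differ. The name res is illustrative.

```r
v1 <- c("c54254_g4545_i5454", "c434_g4_i455", "c5454_g544_i3")

# Split directly at the underscore that is followed only by
# non-underscore characters up to the end of the string
res <- strsplit(v1, "_(?=[^_]+$)", perl = TRUE)

res[[1]]
# [1] "c54254_g4545" "i5454"
```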