
R split a character string on the second underscore

Tags: regex, split, r

I have character strings with two underscores, like these:

c54254_g4545_i5454
c434_g4_i455
c5454_g544_i3
.
.
etc

I need to split these strings at the second underscore, and I am afraid I have no clue how to do that in R (or any other tool, for that matter). I'd be very happy if anyone could help me out here. Thank you, SM


asked Sep 04 '15 12:09

SigneMaten

2 Answers

One way would be to replace the second underscore with another delimiter (e.g. a space) using sub and then split on that delimiter.

Using sub, we match one or more characters that are not a _ from the beginning (^) of the string (^[^_]+), followed by the first underscore (_), followed by one or more characters that are not a _ ([^_]+). We capture that as a group by placing it inside parentheses ((....)), then we match the second _ followed by whatever remains up to the end of the string in the second capture group ((.*)$). In the replacement, we separate the first (\\1) and second (\\2) groups with a space.

strsplit(sub('(^[^_]+_[^_]+)_(.*)$', '\\1 \\2', v1), ' ')
#[[1]]
#[1] "c54254_g4545" "i5454"       

#[[2]]
#[1] "c434_g4" "i455"   

#[[3]]
#[1] "c5454_g544" "i3"

data

v1 <- c('c54254_g4545_i5454', 'c434_g4_i455', 'c5454_g544_i3')
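
The sub-then-split approach above relies on the temporary delimiter (a space) never occurring in the input. If that cannot be guaranteed, one possible variation is to extract the two capture groups directly with regexec and regmatches from base R. This is only a sketch of an alternative, not part of the original answer; the names m and parts are illustrative.

```r
# Example data from the answer above.
v1 <- c('c54254_g4545_i5454', 'c434_g4_i455', 'c5454_g544_i3')

# regexec() records the position of the full match and of each capture group;
# regmatches() then extracts the corresponding substrings.
m <- regmatches(v1, regexec('^([^_]+_[^_]+)_(.*)$', v1))

# Each element is c(full match, group 1, group 2); drop the full match.
parts <- lapply(m, function(p) p[-1])
parts[[1]]
# [1] "c54254_g4545" "i5454"
```

Strings that do not match the pattern (for example, ones with fewer than two underscores) come back as empty character vectors, so they are easy to detect afterwards.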


answered Oct 27 '22 09:10

akrun

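# x is the character vector to split (same values as v1 in the other answer)
x <- c('c54254_g4545_i5454', 'c434_g4_i455', 'c5454_g544_i3')
strsplit(sub("(_)(?=[^_]+$)", " ", x, perl = TRUE), " ")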
#[[1]]
#[1] "c54254_g4545" "i5454"       
#
#[[2]]
#[1] "c434_g4" "i455"   
#
#[[3]]
#[1] "c5454_g544" "i3"

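With the pattern "(_)(?=[^_]+$)", we match an underscore that is followed by one or more non-underscore characters running to the end of the string, i.e. the last underscore. Because each string here has exactly two underscores, that is the second one. The lookahead ((?=...)) only asserts what follows without consuming it, so sub replaces just the underscore and the replacement needs no backreferences; the single capture group around _ is optional.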
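
Since the lookahead consumes nothing but the underscore, the same pattern can probably be passed straight to strsplit with perl = TRUE, skipping the intermediate sub call. The sketch below shows this, then compares the two answers on a hypothetical string with three underscores (y, not from the question), where "last underscore" and "second underscore" are no longer the same position.

```r
x <- c('c54254_g4545_i5454', 'c434_g4_i455', 'c5454_g544_i3')

# Split directly on the underscore that the lookahead identifies.
direct <- strsplit(x, "_(?=[^_]+$)", perl = TRUE)
direct[[2]]
# [1] "c434_g4" "i455"

# Hypothetical input with three underscores, to show where the answers differ.
y <- "a_b_c_d"

# Lookahead pattern: splits at the LAST underscore.
last_us <- strsplit(y, "_(?=[^_]+$)", perl = TRUE)[[1]]
last_us
# [1] "a_b_c" "d"

# Capture-group pattern from the other answer: splits at the SECOND underscore.
second_us <- strsplit(sub('(^[^_]+_[^_]+)_(.*)$', '\\1 \\2', y), ' ')[[1]]
second_us
# [1] "a_b" "c_d"
```

For the strings in the question both give the same result; the choice only matters if some inputs can contain more than two underscores.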

answered Oct 27 '22 08:10

Pierre L
