I have this vector <code>myvec</code>. I want to remove everything after second ':' and get the result. How do I remove the string after nth ':'? <pre class="prettyprint"><code>myvec<- c("chr2:213403244:213403244:G:T:snp","chr7:55240586:55240586:T:G:snp" ,"chr7:55241607:55241607:C:G:snp") result chr2:213403244 chr7:55240586 chr7:55241607 </code></pre>

Here are a few alternatives. We delete the kth colon and everything after it. The example in the question would correspond to k = 2. In the examples below we use k = 3. 1) read.table Read the data into a data.frame, pick out the columns desired and paste it back together again: <pre class="prettyprint"><code>k <- 3 # keep first 3 fields only do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":")) </code></pre> giving: <pre class="prettyprint"><code>[1] "chr2:213403244:213403244" "chr7:55240586:55240586" [3] "chr7:55241607:55241607" </code></pre> 2) sprintf/sub Construct the appropriate regular expression (in the case below of k equal to 3 it would be <code>^((.*?:){2}.*?):.*</code> ) and use it with <code>sub</code>: <pre class="prettyprint"><code>k <- 3 sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec) </code></pre> giving: <pre class="prettyprint"><code>[1] "chr2:213403244:213403244" "chr7:55240586:55240586" [3] "chr7:55241607:55241607" </code></pre> Note 1: For k=1 this can be further simplified to <code>sub(":.*", "", myvec)</code> and for k=n-1 it can be further simplified to <code>sub(":[^:]*$", "", myvec)</code> Note 2: Here is a visualization of the regular regular expression for <code>k</code> equal to 3: <pre class="prettyprint"><code>^((.*?:){2}.*?):.* </code></pre> <img src="https://www.debuggex.com/i/C8MSTUM9sPxE1-hj.png" alt="Regular expression visualization"> Debuggex Demo 3) iteratively delete last field We could remove the last field <code>n-k</code> times using the last regular expression in Note 1 above like this: <pre class="prettyprint"><code>n <- 6 # number of fields k < - 3 # number of fields to retain out <- myvec for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out) </code></pre> If we wanted to set n automatically we could optionally replace the hard coded line setting n above with this: <pre class="prettyprint"><code>n <- count.fields(textConnection(myvec[1]), sep = ":") </code></pre> 4) locate position of kth colon Locate the positions of the colons using <code>gregexpr</code> and then extract the location of the kth subtracting one from it since we don't want the trailing colon. Use <code>substr</code> to extract that many characters from the respective strings. <pre class="prettyprint"><code>k <- 3 substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1) </code></pre> giving: <pre class="prettyprint"><code>[1] "chr2:213403244:213403244" "chr7:55240586:55240586" [3] "chr7:55241607:55241607" </code></pre> Note 3: Suppose there are n fields. The question asked to delete everything after the kth delimiter so the solution should work for k = 1, 2, ..., n-1. It need not work for k = n since there are not n delimiters; however, if instead we define k as the number of fields to return then k = n makes sense and, in fact, (1) and (3) work in that case too. (2) and (4) do not work for this extension but we can easily get them to work by using <code>paste0(myvec, ":")</code> as the input instead of <code>myvec</code>. Note 4: We compare performance: <pre class="prettyprint"><code>library(rbenchmark) benchmark( .read.table = do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":")), .sprintf.sub = sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec), .for = { out <- myvec; for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out)}, .gregexpr = substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1), order = "elapsed", replications = 1000)[1:4] </code></pre> giving: <pre class="prettyprint"><code> test replications elapsed relative 2 .sprintf.sub 1000 0.11 1.000 4 .gregexpr 1000 0.14 1.273 3 .for 1000 0.15 1.364 1 .read.table 1000 2.16 19.636 </code></pre> The solution using sprintf and sub is the fastest although it does use a complex regular expression whereas the others use simpler or no regular expressions and might be preferred on grounds of simplicity. ADDED Added additional solutions and additional notes.

How to delete everything after nth delimiter in R?

Tags:

regex

r

I have this vector myvec. I want to remove everything after second ':' and get the result. How do I remove the string after nth ':'?

myvec<- c("chr2:213403244:213403244:G:T:snp","chr7:55240586:55240586:T:G:snp" ,"chr7:55241607:55241607:C:G:snp")

result
chr2:213403244   
chr7:55240586
chr7:55241607

974

asked Oct 11 '15 05:10

MAPK

2 Answers

We can use sub. We match one or more characters that are not : from the start of the string (^([^:]+) followed by a :, followed by one more characters not a : ([^:]+), place it in a capture group i.e. within the parentheses. We replace by the capture group (\\1) in the replacement.

sub('^([^:]+:[^:]+).*', '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586"  "chr7:55241607"

The above works for the example posted. For general cases to remove after the nth delimiter,

n <- 2
pat <- paste0('^([^:]+(?::[^:]+){',n-1,'}).*')
sub(pat, '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586"  "chr7:55241607"

Checking with a different 'n'

n <- 3

and repeating the same steps

sub(pat, '\\1', myvec)
#[1] "chr2:213403244:213403244" "chr7:55240586:55240586"  
#[3] "chr7:55241607:55241607"

Or another option would be to split by : and then paste the n number of components together.

n <- 2
vapply(strsplit(myvec, ':'), function(x)
            paste(x[seq.int(n)], collapse=':'), character(1L))
#[1] "chr2:213403244" "chr7:55240586"  "chr7:55241607"

157

answered Nov 10 '22 03:11

akrun

Here are a few alternatives. We delete the kth colon and everything after it. The example in the question would correspond to k = 2. In the examples below we use k = 3.

1) read.table Read the data into a data.frame, pick out the columns desired and paste it back together again:

k <- 3 # keep first 3 fields only
do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":"))

giving:

[1] "chr2:213403244:213403244" "chr7:55240586:55240586"  
[3] "chr7:55241607:55241607"

2) sprintf/sub Construct the appropriate regular expression (in the case below of k equal to 3 it would be ^((.*?:){2}.*?):.* ) and use it with sub:

k <- 3
sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec)

giving:

[1] "chr2:213403244:213403244" "chr7:55240586:55240586"  
[3] "chr7:55241607:55241607"

Note 1: For k=1 this can be further simplified to sub(":.*", "", myvec) and for k=n-1 it can be further simplified to sub(":[^:]*$", "", myvec)

Note 2: Here is a visualization of the regular regular expression for k equal to 3:

^((.*?:){2}.*?):.*

Regular expression visualization

Debuggex Demo

3) iteratively delete last field We could remove the last field n-k times using the last regular expression in Note 1 above like this:

n <- 6 # number of fields
k < - 3 # number of fields to retain
out <- myvec
for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out)

If we wanted to set n automatically we could optionally replace the hard coded line setting n above with this:

n <- count.fields(textConnection(myvec[1]), sep = ":")

4) locate position of kth colon Locate the positions of the colons using gregexpr and then extract the location of the kth subtracting one from it since we don't want the trailing colon. Use substr to extract that many characters from the respective strings.

k <- 3
substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1)

giving:

[1] "chr2:213403244:213403244" "chr7:55240586:55240586"  
[3] "chr7:55241607:55241607"

Note 3: Suppose there are n fields. The question asked to delete everything after the kth delimiter so the solution should work for k = 1, 2, ..., n-1. It need not work for k = n since there are not n delimiters; however, if instead we define k as the number of fields to return then k = n makes sense and, in fact, (1) and (3) work in that case too. (2) and (4) do not work for this extension but we can easily get them to work by using paste0(myvec, ":") as the input instead of myvec.

Note 4: We compare performance:

library(rbenchmark)
benchmark(
 .read.table = do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":")),
 .sprintf.sub = sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec),
 .for = { out <- myvec; for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out)},
 .gregexpr = substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1),
  order = "elapsed", replications = 1000)[1:4]

giving:

          test replications elapsed relative
2 .sprintf.sub         1000    0.11    1.000
4    .gregexpr         1000    0.14    1.273
3         .for         1000    0.15    1.364
1  .read.table         1000    2.16   19.636

The solution using sprintf and sub is the fastest although it does use a complex regular expression whereas the others use simpler or no regular expressions and might be preferred on grounds of simplicity.

ADDED Added additional solutions and additional notes.

answered Nov 10 '22 01:11

G. Grothendieck

Related questions
                            
                                How do I extract multiple character strings from one line using R
                            
                                Using Python to parse a 12GB CSV
                            
                                R X-axis Date Labels using plot()
                            
                                R column mean by factor
                            
                                How to convert list of dataframe to dataframe which has a new column show the name of list in R
                            
                                Separate "Name" into "FirstName" and "LastName" columns of data frame
                            
                                How to detect that a vector is subset of specific vector?
                            
                                Draw 3x3 square grid in R
                            
                                Change the year in a datetime object in R?
                            
                                R: parse string to a matrix
                            
                                Converting Factor Levels to Numbers
                            
                                FUN-error after running 'tolower' while making Twitter wordcloud
                            
                                column name with brackets or other punctuations for dplyr group_by
                            
                                R: Creating a vector with a specific amount of random numbers
                            
                                ggplot2 sourcing error: X11 library is missing
                            
                                Count values higher than a certain threshold by group
                            
                                r search along a vector and calculate the mean
                            
                                Proper R Markdown Code Organization
                            
                                Test if column name contains string in R
                            
                                Removing one tableGrob when applied to a box plot with a facet_wrap

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to delete everything after nth delimiter in R?

Tags:

regex

r

MAPK

People also ask

2 Answers

akrun

G. Grothendieck

Recent Activity

Donate For Us