
Create new column with dplyr mutate and substring of existing column

Tags: r, dplyr, strsplit

I have a dataframe with a column of strings and want to extract substrings of those into a new column.

Here is some sample code and data. I want to take the string after the final underscore character in the id column to create a new_id column. Each id entry always has two underscore characters, and it's always the final substring I want.

    df = data.frame( id = I(c("abcd_123_ABC","abc_5234_NHYK")), x = c(1.0,2.0) )

    require(dplyr)

    df = df %>% dplyr::mutate(new_id = strsplit(id, split="_")[[1]][3])

I was expecting strsplit to act on each row in turn.

However, the new_id column only contains ABC in each row, whereas I would like ABC in row 1 and NHYK in row 2. Do you know why this fails and how to achieve what I want?


asked Feb 01 '17 18:02 by PM.

2 Answers

You could use stringr::str_extract:

    library(stringr)

    df %>%
      dplyr::mutate(new_id = str_extract(id, "[^_]+$"))

    #>              id x new_id
    #> 1  abcd_123_ABC 1    ABC
    #> 2 abc_5234_NHYK 2   NHYK

The regex says, match one or more (+) of the characters that aren't _ (the negating [^ ]), followed by end of string ($).
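As a side note on why the original attempt returns ABC on every row, here is a minimal base R sketch. It is an illustration only, not part of the answer above; the names `pieces`, `last_piece` and `via_regex` are made up for this example, and the id vector is a plain character vector rather than the `I()`-wrapped column from the question.

```r
id <- c("abcd_123_ABC", "abc_5234_NHYK")

# strsplit() is already vectorised: it returns a list holding
# one character vector of pieces per input string
pieces <- strsplit(id, split = "_")

# [[1]] keeps only the pieces of the FIRST string, so [3] is a single value;
# inside mutate() that single value is recycled down the whole column
pieces[[1]][3]

# Taking the third piece of every list element gives one result per row
last_piece <- vapply(pieces, function(p) p[3], character(1))

# The same pattern as in the answer, using base R's regexpr()/regmatches()
via_regex <- regmatches(id, regexpr("[^_]+$", id))
```

Either `last_piece` or `via_regex` could be assigned inside `mutate()`, since both have one element per row.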


answered Sep 21 '22 16:09 by Sam Firke

An alternative without regex, keeping to the tidyverse style, is to use tidyr::separate(). Note that this removes the input column by default (pass remove = FALSE to keep it).

    library(dplyr)
    library(tidyr)

    ## using your example data
    df = data.frame( id = I(c("abcd_123_ABC","abc_5234_NHYK")), x = c(1.0,2.0) )

    ## separate knowing you will have three components
    df %>% separate(id, c("first", "second", "new_id"), sep = "_") %>% select(-first, -second)

    ## returns
      new_id x
    1    ABC 1
    2   NHYK 2
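If the original id column should be kept alongside new_id, the remove = FALSE option mentioned above might be used roughly as follows. This is a sketch under a few assumptions: dplyr and tidyr are installed, the id column is a plain character vector (without `I()`), and `out` is just an illustrative name.

```r
library(dplyr)
library(tidyr)

# Same sample values as in the question, with id as a plain character column
df <- data.frame(
  id = c("abcd_123_ABC", "abc_5234_NHYK"),
  x = c(1.0, 2.0),
  stringsAsFactors = FALSE
)

# remove = FALSE keeps the input column; the helper pieces are dropped afterwards
out <- df %>%
  separate(id, into = c("first", "second", "new_id"), sep = "_", remove = FALSE) %>%
  select(-first, -second)
```

Newer tidyr releases mark `separate()` as superseded in favour of `separate_wider_delim()`; the call above should still work, but it is worth checking against your installed version.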

answered Sep 19 '22 16:09 by vincentmajor
