Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create new column with dplyr mutate and substring of existing column

Tags:

r

dplyr

strsplit

I have a dataframe with a column of strings and want to extract substrings of those into a new column.

Here is some sample code and data showing I want to take the string after the final underscore character in the id column in order to create a new_id column. The id column entry always has 2 underscore characters and it's always the final substring I would like.

df = data.frame( id = I(c("abcd_123_ABC","abc_5234_NHYK")), x = c(1.0,2.0) )  require(dplyr)  df = df %>% dplyr::mutate(new_id = strsplit(id, split="_")[[1]][3]) 

I was expecting strsplit to act on each row in turn.

However, the new_id column only contains ABC in each row, whereas I would like ABC in row 1 and NHYK in row 2. Do you know why this fails and how to achieve what I want?

like image 396
PM. Avatar asked Feb 01 '17 18:02

PM.


People also ask

Which function of dplyr package helps in adding modifying a column of a data frame?

Add a column to a dataframe in R using dplyr. In my opinion, the best way to add a column to a dataframe in R is with the mutate() function from dplyr .

What does mutate in dplyr do?

mutate() adds new variables and preserves existing ones; transmute() adds new variables and drops existing ones. New variables overwrite existing variables of the same name. Variables can be removed by setting their value to NULL .


2 Answers

You could use stringr::str_extract:

library(stringr)   df %>%    dplyr::mutate(new_id = str_extract(id, "[^_]+$"))   #>              id x new_id #> 1  abcd_123_ABC 1    ABC #> 2 abc_5234_NHYK 2   NHYK 

The regex says, match one or more (+) of the characters that aren't _ (the negating [^ ]), followed by end of string ($).

like image 183
Sam Firke Avatar answered Sep 21 '22 16:09

Sam Firke


An alternative without regex and keeping in the tidyverse style is to use tidyr::separate(). Note, this does remove the input column by default (remove=FALSE to prevent it).

## using your example data df = data.frame( id = I(c("abcd_123_ABC","abc_5234_NHYK")), x = c(1.0,2.0) )  ## separate knowing you will have three components df %>% separate(id, c("first", "second", "new_id"), sep = "_") %>% select(-first, -second) ## returns   new_id x 1    ABC 1 2   NHYK 2 
like image 22
vincentmajor Avatar answered Sep 19 '22 16:09

vincentmajor