
Apply tidyr::separate over multiple columns

I would like to iterate over the columns of a data frame and split each one into multiple columns based on a separator. I am using tidyr::separate, which works when I do one column at a time.

For example:

library(tidyr)

df <- data.frame(a = c("5312,2020,1212"), b = c("345,982,284"))

df <- separate(data = df, col = "a", 
               into = paste("a", c("col1", "col2", "col3"), 
                            sep = "_"), sep = ",")


  a_col1 a_col2 a_col3           b
1   5312   2020   1212 345,982,284

When I try to execute the same operation over each column of df, R returns an error.

For example, I used this for loop:

for(col in names(df)){
    df <- separate(data = df, col = col, 
                   into = paste(col, c("col1", "col2", "col3"), 
                                sep = "_"), sep = ",")
}

I was expecting to get the following output:

  a_col1 a_col2 a_col3 b_col1 b_col2 b_col3
1   5312   2020   1212    345    982    284

However, R returns this error:

Error in if (!after) c(values, x) else if (after >= lengx) c(x, values) else c(x[1L:after],  : 
  argument is of length zero

Is there another way to apply tidyr::separate over multiple columns in a data frame?

spies006 asked Feb 26 '17 03:02


3 Answers

You could feed a customized separate_() call into Reduce().

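library(tidyr)  # separate_() is deprecated in current tidyr; separate() is the supported form

sep <- function(...) {
    dots <- list(...)
    # dots[[1]] is the data frame, dots[[2]] the column name: count the numeric pieces
    n <- stringr::str_count(dots[[1]][[dots[[2]]]], "\\d+")
    separate_(..., into = sprintf("%s_col%d", dots[[2]], 1:n))
}

df %>% Reduce(f = sep, x = c("a", "b"))
#   a_col1 a_col2 a_col3 b_col1 b_col2 b_col3
# 1   5312   2020   1212    345    982    284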

Otherwise, cSplit will do it too.

splitstackshape::cSplit(df, names(df))
#     a_1  a_2  a_3 b_1 b_2 b_3
# 1: 5312 2020 1212 345 982 284
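
The same Reduce() idea can also be sketched with the current separate() interface instead of separate_(). This is only an illustration under stated assumptions: split_col is a hypothetical helper name, the number of pieces is taken from the longest row, and tidyselect::all_of() is used so that the column name can be passed as a string.

```r
library(tidyr)

df <- data.frame(a = c("5312,2020,1212"), b = c("345,982,284"))

# Hypothetical helper: split one column `col` of `data` on commas.
split_col <- function(data, col) {
  # Number of new columns = largest number of comma-separated pieces in any row.
  n <- max(lengths(strsplit(as.character(data[[col]]), ",", fixed = TRUE)))
  separate(data,
           col = tidyselect::all_of(col),
           into = sprintf("%s_col%d", col, seq_len(n)),
           sep = ",",
           fill = "right")
}

# Reduce() starts from df and applies split_col() once per column name.
out <- Reduce(split_col, names(df), df)
```

Because Reduce() threads the data frame through each call, the columns are split one after another and the result keeps the a_col1 ... b_col3 naming asked for in the question.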

Rich Scriven answered Oct 25 '22 10:10

I had the same question while learning the tidyverse, so I worked through it as follows. N.B. I wanted a solution that does not break down, so it does not rely on knowing the column names in advance.


Create your input:

library(tidyverse)  # as_tibble(), read_csv(), separate()

dft <- as_tibble(data.frame(a = c("5312,2020,1212"), b = c("345,982,284")))
df <- as.data.frame(dft)

Create a blank tibble to collect output:

dft0 <- read_csv("a\na")  
dft0 <- dft0[,-1]
dft00 <- dft0

Specify the number of elements each column splits into (this could be computed inside the loop, but we know it from looking at dft). N.B. if you have a better way to name the new columns, use that:

leng <- 3
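
If the number of pieces is not known in advance, it could be derived from the data rather than typed in. A minimal base-R sketch, assuming a comma separator; example_df and pieces_per_col are names introduced here purely for illustration:

```r
example_df <- data.frame(a = c("5312,2020,1212"), b = c("345,982,284"))

# For each column, count the pieces in every row and keep the largest count.
pieces_per_col <- vapply(
  example_df,
  function(x) max(lengths(strsplit(as.character(x), ",", fixed = TRUE))),
  integer(1)
)

# When one value is shared by all columns, the widest column decides it.
leng <- max(pieces_per_col)
```

For this input both columns hold three pieces, so leng ends up with the same value as the hard-coded one.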

For-loop version:

for(x in 1:dim(df)[2]){
        dataCol <- dft[,x]
        newCols <- paste(colnames(dataCol)[1], paste("col", 1:leng, sep="") , sep="_")

        dft0 <- cbind(dft0,
                    separate(data = dataCol,
                             col = colnames(dataCol)[1],
                             into = newCols,
                             sep = ","))}

The messy sapply version:

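sapp <- sapply(colnames(df), function(ff){
        # Split one column at a time, naming the pieces as in the loop above
        separate(data = dft[, ff],
                 col = all_of(ff),
                 into = paste(ff, paste("col", 1:leng, sep=""), sep="_"),
                 sep = ",")},
        simplify = FALSE, USE.NAMES = FALSE)

# Bind the per-column results side by side
dft00 <- as_tibble(do.call(cbind, sapp))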

bruce.moran answered Oct 25 '22 10:10


This works for a variable number of separators per column, in a single expression. Demonstrated on a more elaborate example:


df<- data.frame(a = c("5312,2020,1212", "21,4534"), 
                b = c("345,982,284", "324,234,3425,654"),
                c = c('34,89,89', '87866675'))

#>                a                b        c
#> 1 5312,2020,1212      345,982,284 34,89,89
#> 2        21,4534 324,234,3425,654 87866675

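library(tidyverse)

# Walk over the column positions, splitting one column of the accumulated result per step
reduce(seq_along(df),
       .init = df, 
       ~ .x %>% separate(names(df)[.y], 
                         sep = ',', 
                         into = paste0(names(df)[.y], '_col_' , seq(1 + max(str_count(df[[.y]], ',')))),
                         fill = 'right'))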
#>   a_col_1 a_col_2 a_col_3 b_col_1 b_col_2 b_col_3 b_col_4  c_col_1 c_col_2
#> 1    5312    2020    1212     345     982     284    <NA>       34      89
#> 2      21    4534    <NA>     324     234    3425     654 87866675    <NA>
#>   c_col_3
#> 1      89
#> 2    <NA>

Created on 2021-07-19 by the reprex package (v2.0.0)
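
For readers on a recent tidyr, a possible alternative is separate_wider_delim(), which accepts several columns at once. This sketch assumes tidyr 1.3.0 or later, where that function is available, and reuses the example data; wide is just an illustrative name.

```r
library(tidyr)

df <- data.frame(a = c("5312,2020,1212", "21,4534"),
                 b = c("345,982,284", "324,234,3425,654"),
                 c = c("34,89,89", "87866675"))

# Split every column on commas in one call.
# names_sep builds names such as a_col_1; too_few = "align_start"
# pads rows that have fewer pieces with NA on the right.
wide <- separate_wider_delim(df,
                             cols = everything(),
                             delim = ",",
                             names_sep = "_col_",
                             too_few = "align_start")
```

Here the number of new columns per input column is worked out from the data, so no explicit count of separators is needed.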

AnilGoyal answered Oct 25 '22 10:10
