How to use tidyr::separate when the number of needed variables is unknown [duplicate]

I've got a dataset that consists of email communication. An example:

library(dplyr)
library(tidyr)

dat <- data_frame('date' = Sys.time(),
                  'from' = c("[email protected]", "[email protected]",
                             "[email protected]", "[email protected]"),
                  'to' = c("[email protected],[email protected]", "[email protected]",
                           "[email protected],[email protected],[email protected]", "[email protected]"))

In the above example it's simple enough to see how many variables I need, so I could just do the following:

dat %>% separate(to, into = paste0("to_", 1:3), sep = ",", extra = "merge", fill = "right")

#Source: local data frame [4 x 5]
#
#                 date              from              to_1              to_2              to_3
#               (time)             (chr)             (chr)             (chr)             (chr)
#1 2015-10-22 14:52:41 [email protected] [email protected] [email protected]                NA
#2 2015-10-22 14:52:41 [email protected] [email protected]                NA                NA
#3 2015-10-22 14:52:41 [email protected] [email protected] [email protected] [email protected]
#4 2015-10-22 14:52:41 [email protected] [email protected]                NA                NA

However, my dataset is 4,000 records long, and I'd rather not hunt for the row with the most elements just to work out how many variables I need to create. My current approach is to split the column myself, get the length of each split, and then take the max:

library(stringr)  # provides str_split()

n_vars <- dat$to %>%
  str_split(",") %>%
  lapply(function(z) length(z)) %>%
  unlist() %>%
  max()

But that seems inefficient. Is there a better way of doing this?
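For reference, the counting step described above can be condensed with base R alone. This is only a sketch, assuming the delimiter is always a literal comma; the data frame and its addresses are hypothetical placeholders, not the original data. `lengths()` returns the number of pieces per row directly, so the `lapply()`/`unlist()` pair is not needed.

```r
library(tidyr)

# Hypothetical stand-in data; the addresses are placeholders.
dat <- data.frame(
  from = c("a@example.com", "b@example.com", "c@example.com"),
  to   = c("x@example.com,y@example.com",
           "x@example.com",
           "x@example.com,y@example.com,z@example.com"),
  stringsAsFactors = FALSE
)

# Split on a literal comma, count the pieces per row, keep the largest count.
n_vars <- max(lengths(strsplit(dat$to, ",", fixed = TRUE)))

# Use that count to build the column names for separate().
wide <- separate(dat, to,
                 into = paste0("to_", seq_len(n_vars)),
                 sep = ",", fill = "right")
```

This still needs one pass over the column before `separate()` runs, but that pass is a single vectorised call.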

tblznbits asked Oct 22 '15 19:10


2 Answers

This is a good question - my usual response is to use strsplit, then unnest and spread, which is also not super efficient:

library(dplyr)
library(tidyr)

dat %>% mutate(to = strsplit(to, ",")) %>%
        unnest(to) %>%
        group_by(from) %>%
        mutate(row = row_number()) %>%
        spread(row, to)

#Source: local data frame [4 x 5]
#
#                 date              from                 1                 2                 3
#               (time)             (chr)             (chr)             (chr)             (chr)
#1 2015-10-22 15:03:17 [email protected] [email protected] [email protected]                NA
#2 2015-10-22 15:03:17 [email protected] [email protected]                NA                NA
#3 2015-10-22 15:03:17 [email protected] [email protected] [email protected] [email protected]
#4 2015-10-22 15:03:17 [email protected] [email protected]                NA                NA
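A different route that skips the long-then-wide round trip may be available in newer tidyr releases. The sketch below assumes tidyr 1.3.0 or later, where `separate_wider_delim()` exists. It is not part of the answer above, and the data frame is a hypothetical placeholder. Supplying `names_sep` instead of `names` lets the function generate as many columns as the longest row needs.

```r
library(tidyr)  # assumption: tidyr >= 1.3.0 for separate_wider_delim()

# Hypothetical stand-in data; the addresses are placeholders.
dat <- data.frame(
  from = c("a@example.com", "b@example.com", "c@example.com"),
  to   = c("x@example.com,y@example.com",
           "x@example.com",
           "x@example.com,y@example.com,z@example.com"),
  stringsAsFactors = FALSE
)

wide <- separate_wider_delim(
  dat, to,
  delim = ",",
  names_sep = "_",          # column names are generated as to_1, to_2, ...
  too_few = "align_start"   # rows with fewer pieces are padded with NA on the right
)
```

No column count is passed anywhere, so the same call should work unchanged if a later row holds more recipients.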
jeremycg answered Oct 04 '22 11:10


We could use cSplit from the splitstackshape package, which works out the number of new columns itself:

library(splitstackshape)

cSplit(dat, 'to', ',')
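To see what that call produces, here is a small sketch on hypothetical placeholder data (not the original data set). It assumes splitstackshape is installed and that the new columns are named after the source column with a numeric suffix (`to_1`, `to_2`, ...), which is how the output is expected to look.

```r
library(splitstackshape)

# Hypothetical stand-in data; the addresses are placeholders.
dat <- data.frame(
  from = c("a@example.com", "b@example.com", "c@example.com"),
  to   = c("x@example.com,y@example.com",
           "x@example.com",
           "x@example.com,y@example.com,z@example.com"),
  stringsAsFactors = FALSE
)

# No column count is supplied: cSplit() sizes the result to the longest row.
wide <- cSplit(dat, "to", sep = ",")

# Inspect the generated column names.
names(wide)
```

The result comes back as a data.table, so convert it with `as.data.frame()` or `dplyr::as_tibble()` if the rest of the pipeline expects one of those.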
akrun answered Oct 04 '22 10:10