How to use tidyr::separate when the number of needed variables is unknown [duplicate]

I've got a dataset that consists of email communication. An example:

    library(dplyr)
    library(tidyr)

    dat <- data_frame('date' = Sys.time(),
                      'from' = c("person1@gmail.com", "person2@yahoo.com",
                                 "person3@hotmail.com", "person4@msn.com"),
                      'to' = c("person2@yahoo.com,person3@hotmail.com", "person3@hotmail.com",
                               "person4@msn.com,person1@gmail.com,person2@yahoo.com", "person1@gmail.com"))

In the above example it's simple enough to see how many variables I need, so I could just do the following:

    dat %>% separate(to, into = paste0("to_", 1:3), sep = ",", extra = "merge", fill = "right")

    #Source: local data frame [4 x 5]
    #
    #                 date                from                to_1                to_2              to_3
    #               (time)               (chr)               (chr)               (chr)             (chr)
    #1 2015-10-22 14:52:41   person1@gmail.com   person2@yahoo.com person3@hotmail.com                NA
    #2 2015-10-22 14:52:41   person2@yahoo.com person3@hotmail.com                  NA                NA
    #3 2015-10-22 14:52:41 person3@hotmail.com     person4@msn.com   person1@gmail.com person2@yahoo.com
    #4 2015-10-22 14:52:41     person4@msn.com   person1@gmail.com                  NA                NA

However, my dataset is 4,000 records long and I'd rather not go through and find the row with the most number of elements in it so that I can determine how many variables I need to create. My approach to handling this is to first split the column myself and get the length of each split and then find the max:

    library(stringr)  # provides str_split()

    n_vars <- dat$to %>%
      str_split(",") %>%
      lapply(function(z) length(z)) %>%
      unlist() %>%
      max()

But that seems inefficient. Is there a better way of doing this?
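For illustration only, the counting step described above can be written more compactly with base R's `lengths()`. This is a sketch that assumes the delimiter is always a plain comma. `count_pieces` is a hypothetical helper name, not a tidyr function, and the fixed timestamp stands in for `Sys.time()` so the result is reproducible.

```r
library(tidyr)

# Example data mirroring the question (fixed timestamp instead of Sys.time()).
dat <- data.frame(
  date = as.POSIXct("2015-10-22 14:52:41", tz = "UTC"),
  from = c("person1@gmail.com", "person2@yahoo.com",
           "person3@hotmail.com", "person4@msn.com"),
  to   = c("person2@yahoo.com,person3@hotmail.com",
           "person3@hotmail.com",
           "person4@msn.com,person1@gmail.com,person2@yahoo.com",
           "person1@gmail.com"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: largest number of delimited pieces in a character vector.
count_pieces <- function(x, sep = ",") {
  max(lengths(strsplit(x, sep, fixed = TRUE)))
}

n_vars <- count_pieces(dat$to)

# Feed the computed count into separate(); shorter rows are padded with NA.
wide <- separate(dat, to, into = paste0("to_", seq_len(n_vars)),
                 sep = ",", fill = "right")
```

The split still happens twice here (once to count, once inside `separate()`), so this mainly tidies the code rather than removing the extra pass.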

asked Oct 22 '15 19:10

tblznbits

2 Answers

This is a good question - my usual response is to use strsplit, then unnest and spread, which is also not super efficient:

    library(dplyr)
    library(tidyr)

    dat %>% mutate(to = strsplit(to, ",")) %>%
            unnest(to) %>%
            group_by(from) %>%
            mutate(row = row_number()) %>%
            spread(row, to)

    # Source: local data frame [4 x 5]
    #
    #                  date                from                   1                   2                 3
    #                (time)               (chr)               (chr)               (chr)             (chr)
    # 1 2015-10-22 15:03:17   person1@gmail.com   person2@yahoo.com person3@hotmail.com                NA
    # 2 2015-10-22 15:03:17   person2@yahoo.com person3@hotmail.com                  NA                NA
    # 3 2015-10-22 15:03:17 person3@hotmail.com     person4@msn.com   person1@gmail.com person2@yahoo.com
    # 4 2015-10-22 15:03:17     person4@msn.com   person1@gmail.com                  NA                NA
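The same long-then-wide idea can probably be sketched with the newer tidyr verbs as well. The version below assumes tidyr 1.0.0 or later (where `pivot_wider()` is available) and swaps `strsplit()` plus `unnest()` for `separate_rows()`. The `to_` prefix and the fixed timestamp are illustrative choices, not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Example data mirroring the question (fixed timestamp instead of Sys.time()).
dat <- tibble(
  date = as.POSIXct("2015-10-22 14:52:41", tz = "UTC"),
  from = c("person1@gmail.com", "person2@yahoo.com",
           "person3@hotmail.com", "person4@msn.com"),
  to   = c("person2@yahoo.com,person3@hotmail.com",
           "person3@hotmail.com",
           "person4@msn.com,person1@gmail.com,person2@yahoo.com",
           "person1@gmail.com")
)

wide <- dat %>%
  separate_rows(to, sep = ",") %>%   # one row per recipient
  group_by(from) %>%
  mutate(row = row_number()) %>%     # position of each recipient within a sender
  ungroup() %>%
  pivot_wider(names_from = row, values_from = to, names_prefix = "to_")
```

Grouping by `from` only works cleanly when each sender appears in a single record, as in the example. With real data, numbering within a per-message id is likely the safer choice.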

answered Oct 04 '22 11:10

jeremycg

We could use cSplit

    library(splitstackshape)

    cSplit(dat, 'to', ',')
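For readers who would rather stay within tidyr, a roughly comparable sketch is shown below. It uses a different function from the one in this answer, `separate_wider_delim()`, and assumes tidyr 1.3.0 or later. When `names_sep` is given instead of `names`, the number of output columns is worked out from the data, so it does not have to be counted beforehand. The fixed timestamp is only there to make the example reproducible.

```r
library(tidyr)

# Example data mirroring the question (fixed timestamp instead of Sys.time()).
dat <- tibble(
  date = as.POSIXct("2015-10-22 14:52:41", tz = "UTC"),
  from = c("person1@gmail.com", "person2@yahoo.com",
           "person3@hotmail.com", "person4@msn.com"),
  to   = c("person2@yahoo.com,person3@hotmail.com",
           "person3@hotmail.com",
           "person4@msn.com,person1@gmail.com,person2@yahoo.com",
           "person1@gmail.com")
)

# Assumes tidyr >= 1.3.0, where separate_wider_delim() was introduced.
wide <- separate_wider_delim(
  dat, to,
  delim = ",",
  names_sep = "_",          # new columns are named to_1, to_2, ...
  too_few = "align_start"   # rows with fewer pieces are padded with NA on the right
)
```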

answered Oct 04 '22 10:10

akrun
