I've got a dataset that consists of email communication. An example:
    library(dplyr)
    library(tidyr)

    dat <- data_frame(
      'date' = Sys.time(),
      'from' = c("[email protected]", "[email protected]", "[email protected]", "[email protected]"),
      'to' = c("[email protected],[email protected]",
               "[email protected]",
               "[email protected],[email protected],[email protected]",
               "[email protected]")
    )
In the above example it's simple enough to see how many variables I need, so I could just do the following:
    dat %>%
      separate(to, into = paste0("to_", 1:3), sep = ",", extra = "merge", fill = "right")
    #Source: local data frame [4 x 5]
    #
    #                 date              from              to_1              to_2              to_3
    #               (time)             (chr)             (chr)             (chr)             (chr)
    #1 2015-10-22 14:52:41 [email protected] [email protected] [email protected]                NA
    #2 2015-10-22 14:52:41 [email protected] [email protected]                NA                NA
    #3 2015-10-22 14:52:41 [email protected] [email protected] [email protected] [email protected]
    #4 2015-10-22 14:52:41 [email protected] [email protected]                NA                NA
However, my dataset is 4,000 records long, and I'd rather not go through it to find the row with the most elements just to determine how many variables I need to create. My approach is to split the column myself, get the length of each split, and then take the maximum:
    library(stringr)  # provides str_split()

    n_vars <- dat$to %>%
      str_split(",") %>%
      lapply(function(z) length(z)) %>%
      unlist() %>%
      max()
But that seems inefficient. Is there a better way of doing this?
The split() function in R can be used to split data into groups based on factor levels. This function uses the following basic syntax: split(x, f, …)
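As a quick illustration of that syntax, here is a minimal base R sketch. The `scores` and `group` vectors are made-up toy data, not part of the email dataset above.

```r
# Toy data (hypothetical): four scores, each tagged with a group label
scores <- c(10, 20, 30, 40)
group  <- factor(c("a", "b", "a", "b"))

# split() returns a named list with one element per factor level
by_group <- split(scores, group)
print(by_group)
```

Note that split() divides rows into groups; it does not split one column into several, which is what separate() is for.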
To split a character column into multiple columns in R, use the separate() function from the tidyr package (not dplyr). separate() splits the column either on a regular expression or at numeric positions.
tidyr provides three main functions for tidying your messy data: gather() , separate() and spread() . Sometimes two variables are clumped together in one column. separate() allows you to tease them apart ( extract() works similarly but uses regexp groups instead of a splitting pattern or position).
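To make the difference between the two concrete, here is a small sketch on a made-up data frame (the `id` column and the `code`/`year` names are assumptions for illustration only). separate() is given a delimiter, while extract() is given a regular expression with one capture group per new column.

```r
library(tidyr)

# Hypothetical toy data: a code and a year stored together in one column
df <- data.frame(id = c("ab_2015", "cd_2016"), stringsAsFactors = FALSE)

# separate(): split the column on a delimiter
by_delim <- separate(df, id, into = c("code", "year"), sep = "_")

# extract(): pull out regex capture groups instead of splitting
by_regex <- tidyr::extract(df, id, into = c("code", "year"),
                           regex = "([a-z]+)_([0-9]+)")

print(by_delim)
print(by_regex)
```

Both calls should give the same two character columns here; extract() is called with the `tidyr::` prefix only to avoid a clash with magrittr's function of the same name.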
This is a good question. My usual response is to use strsplit, then unnest and spread, which is also not super efficient:
    library(dplyr)
    library(tidyr)

    dat %>%
      mutate(to = strsplit(to, ",")) %>%
      unnest(to) %>%
      group_by(from) %>%
      mutate(row = row_number()) %>%
      spread(row, to)
    #Source: local data frame [4 x 5]
    #                 date              from                 1                 2                 3
    #               (time)             (chr)             (chr)             (chr)             (chr)
    #1 2015-10-22 15:03:17 [email protected] [email protected] [email protected]                NA
    #2 2015-10-22 15:03:17 [email protected] [email protected]                NA                NA
    #3 2015-10-22 15:03:17 [email protected] [email protected] [email protected] [email protected]
    #4 2015-10-22 15:03:17 [email protected] [email protected]                NA                NA
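For what it's worth, the same split-then-reshape idea can be sketched with separate_rows() and pivot_wider() from newer tidyr releases (assuming tidyr 1.0 or later). The toy data and the `id` and `slot` columns below are hypothetical; grouping on a per-row `id` rather than on the sender keeps two emails from the same sender from being merged into one row.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the email table
toy <- tibble(
  id = 1:3,                          # one id per email (assumed column)
  to = c("a,b", "c", "d,e,f")
)

wide <- toy %>%
  separate_rows(to, sep = ",") %>%   # one row per recipient
  group_by(id) %>%
  mutate(slot = row_number()) %>%    # position of the recipient within its email
  ungroup() %>%
  pivot_wider(names_from = slot, values_from = to, names_prefix = "to_")

print(wide)
```

The number of `to_*` columns falls out of the data, so nothing has to be counted up front.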
We could use cSplit from the splitstackshape package, which works out the number of new columns on its own:

    library(splitstackshape)
    cSplit(dat, 'to', ',')
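If you'd rather stay with separate() as in the question, the column count can probably be computed more directly with base R's lengths() (available since R 3.2.0). This is only a sketch on made-up data; the `toy` data frame and its values are assumptions for illustration.

```r
library(tidyr)

# Hypothetical toy data standing in for the email table
toy <- data.frame(
  from = c("s1", "s2", "s3"),
  to   = c("a,b", "c", "d,e,f"),
  stringsAsFactors = FALSE
)

# Number of pieces in each row, then the largest of those
n_vars <- max(lengths(strsplit(toy$to, ",", fixed = TRUE)))

# Feed the count straight into separate(); short rows are padded with NA
wide <- separate(toy, to, into = paste0("to_", seq_len(n_vars)),
                 sep = ",", fill = "right")

print(wide)
```

lengths() replaces the lapply()/unlist() pair, and `fixed = TRUE` skips regex matching for the plain comma.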