
R: Split Variable Column into multiple (unbalanced) columns by comma

Tags: split, r

I have a dataset of 25 variables and over 2 million observations. One of my variables combines several "categories", and I want to split it so that each category gets its own column (similar to what split does in Stata). For example:

# Name      Age     Number               Events                      First 
# Karen      24        8         Triathlon/IM,Marathon,10k,5k         0
# Kurt       39        2         Half-Marathon,10k                    0 
# Leah       18        0                                              1

And I want it to look like:

# Name   Age  Number Events_1        Events_2     Events_3     Events_4      First
# Karen   24    8     Triathlon/IM    Marathon       10k         5k             0
# Kurt    39    2     Half-Marathon   10k            NA          NA             0 
# Leah    18    0     NA              NA             NA          NA             1

I have looked through Stack Overflow but have not found anything that works (everything gives me an error of some sort). Any suggestions would be greatly appreciated.

Note: This may not be important, but the largest number of categories any one person has is 19, so I would need to create Events_1 through Events_19.

Comment: Previous Stack Overflow answers have suggested the separate function, but it does not seem to work with my dataset. When I call it, the program runs, but when it finishes nothing has changed: there is no output and no error. When I tried other suggestions from other threads, I received error messages. However, I finally got it to work by using the cSplit function. Thanks for the help!

Kfruge asked Jul 23 '15 02:07




1 Answer

From Ananda's splitstackshape package:

library(splitstackshape)
cSplit(df, "Events", sep = ",")
#    Name Age Number First      Events_1 Events_2 Events_3 Events_4
#1: Karen  24      8     0  Triathlon/IM Marathon      10k       5k
#2:  Kurt  39      2     0 Half-Marathon      10k       NA       NA
#3: Leah   18      0     1            NA       NA       NA       NA

Or with tidyr:

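library(tidyr)
# separate() returns a new data frame; assign the result to keep it.
# For the full dataset, use 1:19 in place of 1:4.
separate(df, "Events", paste("Events", 1:4, sep = "_"), sep = ",", extra = "drop")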
#   Name Age Number               Events_1 Events_2 Events_3 Events_4 First
#1 Karen  24      8           Triathlon/IM Marathon      10k       5k     0
#2  Kurt  39      2          Half-Marathon      10k     <NA>     <NA>     0
#3 Leah   18      0                     NA     <NA>     <NA>     <NA>     1

With the data.table package:

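library(data.table)
# := adds the new columns by reference; the second [ ] drops the original column
setDT(df)[, paste0("Events_", 1:4) := tstrsplit(Events, ",")][, -"Events", with = FALSE]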
#    Name Age Number First               Events_1 Events_2 Events_3 Events_4
#1: Karen  24      8     0           Triathlon/IM Marathon      10k       5k
#2:  Kurt  39      2     0          Half-Marathon      10k       NA       NA
#3: Leah   18      0     1                     NA       NA       NA       NA
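
The calls above hardcode four output columns, while the question mentions up to 19 categories per person. As a rough sketch only, the column count can be derived from the data in base R. Here `split_events` is a hypothetical helper name (not from any package), the small `demo` data frame is made up to mirror the example, and the input is assumed to have at least one row:

```r
# Hypothetical helper (not part of any package): split a delimited column
# into as many new columns as the longest entry requires.
split_events <- function(data, col, sep = ",") {
  # as.character() guards against factor columns; fixed = TRUE treats sep literally
  pieces <- strsplit(as.character(data[[col]]), sep, fixed = TRUE)
  n_max <- max(lengths(pieces), 1L)
  # Indexing past the end of a vector yields NA, which pads short entries
  padded <- do.call(rbind, lapply(pieces, function(p) p[seq_len(n_max)]))
  wide <- as.data.frame(padded, stringsAsFactors = FALSE)
  names(wide) <- paste(col, seq_len(n_max), sep = "_")
  # Put the new columns where the original column used to be
  pos <- match(col, names(data))
  cbind(data[seq_len(pos - 1)], wide, data[-seq_len(pos)])
}

# Made-up data mirroring the question's example
demo <- data.frame(
  Name = c("Karen", "Kurt", "Leah"),
  Events = c("Triathlon/IM,Marathon,10k,5k", "Half-Marathon,10k", ""),
  First = c(0L, 0L, 1L),
  stringsAsFactors = FALSE
)
result <- split_events(demo, "Events")
print(result)
```

On 2 million rows this is likely slower than the data.table or splitstackshape versions, so treat it as an illustration of the idea rather than a replacement.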

Data

df <- structure(list(Name = structure(1:3, .Label = c("Karen", "Kurt", 
"Leah "), class = "factor"), Age = c(24L, 39L, 18L), Number = c(8L, 
2L, 0L), Events = structure(c(3L, 2L, 1L), .Label = c("               NA", 
"         Half-Marathon,10k", "     Triathlon/IM,Marathon,10k,5k"
), class = "factor"), First = c(0L, 0L, 1L)), .Names = c("Name", 
"Age", "Number", "Events", "First"), class = "data.frame", row.names = c(NA, 
-3L))
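
Note that this sample data stores Events as a factor whose levels are left-padded with spaces and whose missing entry is the literal text "NA". That appears to be why the tidyr output above shows a plain NA (a string) in Events_1 for Leah rather than <NA>. A small sketch of cleaning such a column before splitting, using a made-up `events` vector that mirrors those levels:

```r
# Made-up vector mirroring the padded factor levels in the sample data
events <- factor(c("     Triathlon/IM,Marathon,10k,5k",
                   "         Half-Marathon,10k",
                   "               NA"))

clean <- trimws(as.character(events))     # drop the padding (base R >= 3.2.0)
clean[clean %in% "NA"] <- NA_character_   # literal "NA" text becomes a real missing value
print(clean)
```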
Pierre L answered Oct 09 '22 17:10