I've got a rather large (around 100k observations) data set, similar to this:
data <- data.frame(
ID = seq(1, 5, 1),
Values = c("1,2,3", "4", " ", "4,1,6,5,1,1,6", "0,0"),
stringsAsFactors=F)
data
ID Values
1 1 1,2,3
2 2 4
3 3
4 4 4,1,6,5,1,1,6
5 5 0,0
I want to split the Values column by ","
with NA
for missed cells:
ID v1 v2 v3 v4 v5 v6 v7
1 1 2 3 NA NA NA NA
2 4 NA NA NA NA NA NA
3 NA NA NA NA NA NA NA
4 4 1 6 5 1 1 6
5 0 0 NA NA NA NA NA
...
Best attempt was strsplit
+ rbind
:
df <- data.frame(do.call(
"rbind",
strsplit(as.character(data$Values), split = "," , fixed = FALSE)
))
But rbind
function just recycles all 'short' rows instead to set an "NA".
Have found similar problem
Many thanks, Leo
To split a column into multiple columns in the R Language, we use the separator() function of the dplyr package library. The separate() function separates a character column into multiple columns with a regular expression or numeric locations.
split() Pandas provide a method to split string around a passed separator/delimiter. After that, the string can be stored as a list in a series or it can also be used to create multiple column data frames from a single separated string.
Strsplit(): An R Language function which is used to split the strings into substrings with split arguments. Where: X = input data file, vector or a stings. Split = Splits the strings into required formats.
I would suggest looking at my cSplit
function or approaching the problem manually.
The cSplit
approach would simply be:
cSplit(data, "Values", ",")
# ID Values_1 Values_2 Values_3 Values_4 Values_5 Values_6 Values_7
# 1: 1 1 2 3 NA NA NA NA
# 2: 2 4 NA NA NA NA NA NA
# 3: 3 NA NA NA NA NA NA
# 4: 4 4 1 6 5 1 1 6
# 5: 5 0 0 NA NA NA NA NA
Approaching the problem manually would look like:
## Split up the values
Split <- strsplit(data$Values, ",", fixed = TRUE)
## How long is each list element?
Ncol <- vapply(Split, length, 1L)
## Create an empty character matrix to store the results
M <- matrix(NA_character_, nrow = nrow(data),
ncol = max(Ncol),
dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
## Use matrix indexing to figure out where to put the results
M[cbind(rep(1:nrow(data), Ncol),
sequence(Ncol))] <- unlist(Split, use.names = FALSE)
## Bind the values back together, here as a "data.table" (faster)
data.table(ID = data$ID, M)
^^ That's pretty much what goes on in cSplit
, but the function has a few other options and some basic error checking and so on that might make it a little bit slower than a purely manual approach (or a function written to address your specific problem).
Both of these approaches would be faster than a "data.table" + "reshape2" approach. Also, since each row is treated individually, you shouldn't have any problems even if you have duplicated ID values--your output should have the same number of rows as your input.
I've done benchmarks on more rows and on data that would give "wider" results (since that's implied in your comments to David's answer).
Here is the sample data:
set.seed(1)
a <- sample(0:100, 100000, TRUE)
Values <- vapply(a, function(x)
paste(sample(0:100, x, TRUE), collapse = ","), character(1L))
Values[sample(length(Values), length(Values) * .15)] <- ""
ID <- c(1:80000, 1:20000)
data <- data.frame(ID, Values, stringsAsFactors = FALSE)
DT <- as.data.table(data)
Here are the functions to test:
fun1a <- function(inDT) {
data2 <- DT[, list(Values = unlist(
strsplit(Values, ","))), by = ID]
data2[, Var := paste0("v", seq_len(.N)), by = ID]
dcast.data.table(data2, ID ~ Var,
fill = NA_character_,
value.var = "Values")
}
fun1b <- function(inDT) {
data2 <- DT[, list(Values = unlist(
strsplit(Values, ",", fixed = TRUE),
use.names = FALSE)), by = ID]
data2[, Var := paste0("v", seq_len(.N)), by = ID]
dcast.data.table(data2, ID ~ Var,
fill = NA_character_,
value.var = "Values")
}
fun2 <- function(inDT) {
cSplit(DT, "Values", ",")
}
fun3 <- function(inDF) {
Split <- strsplit(inDF$Values, ",", fixed = TRUE)
Ncol <- vapply(Split, length, 1L)
M <- matrix(NA_character_, nrow = nrow(inDF),
ncol = max(Ncol),
dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
M[cbind(rep(1:nrow(inDF), Ncol),
sequence(Ncol))] <- unlist(Split, use.names = FALSE)
data.table(ID = inDF$ID, M)
}
Here are the results:
library(microbenchmark)
microbenchmark(fun2(DT), fun3(data), times = 20)
# Unit: seconds
# expr min lq median uq max neval
# fun2(DT) 4.810942 5.173103 5.498279 5.622279 6.003339 20
# fun3(data) 3.847228 3.929311 4.058728 4.160082 4.664568 20
## Didn't want to microbenchmark here...
system.time(fun1a(DT))
# user system elapsed
# 16.92 0.50 17.59
system.time(fun1b(DT)) # fixed = TRUE & use.names = FALSE
# user system elapsed
# 11.54 0.42 12.01
NOTE: The results of fun1a
and fun1b
would not be the same as those of fun2
and fun3
because of the duplicated IDs.
Here's a data.table
combined with reshape2
approach (should be very efficient)
library(data.table) # Loading `data.table` package
data2 <- setDT(data)[, list(Values = unlist(strsplit(Values, ","))), by = ID] # splitting the values by `,` for each `ID`
data2[, Var := paste0("v", seq_len(.N)), by = ID] # Adding the `Var` variable
library(reshape2) # Loading `reshape2` package
dcast.data.table(data2, ID ~ Var, fill = NA_character_, value.var = "Values") # decasting
# ID v1 v2 v3 v4 v5 v6 v7
# 1: 1 1 2 3 NA NA NA NA
# 2: 2 4 NA NA NA NA NA NA
# 3: 3 NA NA NA NA NA NA
# 4: 4 4 1 6 5 1 1 6
# 5: 5 0 0 NA NA NA NA NA
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With