
Speed up `strsplit` when possible outputs are known

I have a large data frame with a factor column that I need to divide into three factor columns by splitting up the factor names by a delimiter. Here is my current approach, which is very slow with a large data frame (sometimes several million rows):

data <- readRDS("data.rds")
data.df <- reshape2:::melt.array(data)
head(data.df)
##  Time Location    Class Replicate Population
##1    1        1 LIDE.1.S         1 0.03859605
##2    2        1 LIDE.1.S         1 0.03852957
##3    3        1 LIDE.1.S         1 0.03846853
##4    4        1 LIDE.1.S         1 0.03841260
##5    5        1 LIDE.1.S         1 0.03836147
##6    6        1 LIDE.1.S         1 0.03831485

Rprof("str.out")
cl <- which(names(data.df)=="Class")
Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
data.df <- cbind(data.df[,1:(cl-1)],Classes,data.df[(cl+1):(ncol(data.df))])
Rprof(NULL)

head(data.df)
##  Time Location Species SizeClass Infected Replicate Population
##1    1        1    LIDE         1        S         1 0.03859605
##2    2        1    LIDE         1        S         1 0.03852957
##3    3        1    LIDE         1        S         1 0.03846853
##4    4        1    LIDE         1        S         1 0.03841260
##5    5        1    LIDE         1        S         1 0.03836147
##6    6        1    LIDE         1        S         1 0.03831485

summaryRprof("str.out")

$by.self
                 self.time self.pct total.time total.pct
"strsplit"            1.34    50.00       1.34     50.00
"<Anonymous>"         1.16    43.28       1.16     43.28
"do.call"             0.04     1.49       2.54     94.78
"unique.default"      0.04     1.49       0.04      1.49
"data.frame"          0.02     0.75       0.12      4.48
"is.factor"           0.02     0.75       0.02      0.75
"match"               0.02     0.75       0.02      0.75
"structure"           0.02     0.75       0.02      0.75
"unlist"              0.02     0.75       0.02      0.75

$by.total
                       total.time total.pct self.time self.pct
"do.call"                    2.54     94.78      0.04     1.49
"strsplit"                   1.34     50.00      1.34    50.00
"<Anonymous>"                1.16     43.28      1.16    43.28
"cbind"                      0.14      5.22      0.00     0.00
"data.frame"                 0.12      4.48      0.02     0.75
"as.data.frame.matrix"       0.08      2.99      0.00     0.00
"as.data.frame"              0.08      2.99      0.00     0.00
"as.factor"                  0.08      2.99      0.00     0.00
"factor"                     0.06      2.24      0.00     0.00
"unique.default"             0.04      1.49      0.04     1.49
"unique"                     0.04      1.49      0.00     0.00
"is.factor"                  0.02      0.75      0.02     0.75
"match"                      0.02      0.75      0.02     0.75
"structure"                  0.02      0.75      0.02     0.75
"unlist"                     0.02      0.75      0.02     0.75
"[.data.frame"               0.02      0.75      0.00     0.00
"["                          0.02      0.75      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 2.68

Is there any way to speed up this operation? Each of the categories "Species", "SizeClass", and "Infected" has only a small number (<5) of distinct values, and I know what these are in advance.

Notes:

  • stringr::str_split_fixed performs this task, but not any faster
  • The data frame is initially generated by calling reshape2::melt on an array in which Class, with its associated levels, is one dimension. If there's a faster way to get from that array to the final data frame, great.
  • data.rds is available at http://dl.getdropbox.com/u/3356641/data.rds
Noam Ross asked May 20 '13 00:05


1 Answer

This should offer a substantial speedup:

library(data.table)
DT <- data.table(data.df)


# Class is a factor, so convert it to character before splitting
DT[, c("Species", "SizeClass", "Infected") 
      := as.list(strsplit(as.character(Class), "\\.")[[1]]), by=Class ]

The reasons for the speedup:

  1. data.table preallocates memory for columns.
  2. With a data.frame, each column assignment can copy the entire data set; data.table instead adds columns by reference.
  3. The by statement runs strsplit once per unique value of Class rather than once per row.
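The third point can also be illustrated in base R. This is a minimal sketch, assuming Class is a factor as in the question; the toy values and the names lev_parts and parts are made up for illustration. It splits each factor level once and then expands the result through the factor's integer codes:

```r
# Toy stand-in for the Class column (values are hypothetical)
Class <- factor(c("LIDE.1.S", "LIDE.1.I", "UMCA.2.S", "LIDE.1.S"))

# Split each distinct level once: one row per level, one column per part
lev_parts <- do.call(rbind, strsplit(levels(Class), ".", fixed = TRUE))
colnames(lev_parts) <- c("Species", "SizeClass", "Infected")

# Expand to full length by indexing rows with the factor's integer codes
parts <- as.data.frame(lev_parts[as.integer(Class), , drop = FALSE],
                       stringsAsFactors = TRUE)
```

The number of strsplit calls then depends on the number of levels, not the number of rows, which is the same idea as grouping with by=Class.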

Here is a quick method for the whole process:

# Save the new col names as a character vector 
newCols <- c("Species", "SizeClass", "Infected") 

# split the string once per unique Class, then convert the new columns to factors
DT[, c(newCols) := as.list(strsplit(as.character(Class), "\\.")[[1]]), by=Class ]
DT[, c(newCols) := lapply(.SD, factor), .SDcols=newCols]

# remove the old column. This is instantaneous. 
DT[, Class := NULL]

## Have a look: 
DT[, lapply(.SD, class)]
#       Time Location Replicate Population Species SizeClass Infected
# 1: integer  integer   integer    numeric  factor    factor   factor

DT
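
A variant sketch, assuming an installed data.table version that provides tstrsplit() and that Class is a factor: split the factor levels once, then build each new column by indexing with the integer codes. The toy data and the names lev and codes are hypothetical, and no timing is claimed for it here.

```r
library(data.table)

# Toy data standing in for the melted data (values are hypothetical)
DT <- data.table(
  Class = factor(c("LIDE.1.S", "LIDE.1.I", "UMCA.2.S", "LIDE.1.S")),
  Population = c(0.1, 0.2, 0.3, 0.4)
)
newCols <- c("Species", "SizeClass", "Infected")

# Split the distinct levels once; tstrsplit returns one vector per part
lev <- tstrsplit(levels(DT$Class), ".", fixed = TRUE)

# Integer codes map each row to its level
codes <- as.integer(DT$Class)

# Add the three factor columns by reference, then drop the old column
DT[, (newCols) := lapply(lev, function(p) factor(p[codes]))]
DT[, Class := NULL]
```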
Ricardo Saporta answered Sep 24 '22 00:09