Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast way to split string and convert to long format in data.table

I do the following

library(data.table)
library(stringr)        
dt <- data.table(string_column = paste(sample(c(letters, " "), 500000, replace = TRUE)
                                     , sample(c(letters, " "), 500000, replace = TRUE)
                                     , sample(1:500000)
                                 , sep = " "), key = "string_column") 

split_res <- dt[, list(name = unlist(str_split(string_column, '\\s+'))), by = string_column]

For real data, it takes approx. 1 hour to process dt (10M rows) and create split_res (18M rows) Out of curiosity - is there way to speed up process? Maybe unlist + str_split is not the right way of doing this?

like image 879
RInatM Avatar asked Mar 27 '14 04:03

RInatM


1 Answers

You'll get a big speedup if you just ditch using str_split() from "stringr" and just use strsplit().

fun1 <- function() dt[, list(name = unlist(str_split(string_column, '\\s+'))), by = string_column]
fun2 <- function() dt[, list(name = unlist(strsplit(string_column, '\\s+'))), by = string_column]

system.time(fun1())
#    user  system elapsed 
#  172.41    0.05  172.82 

system.time(fun2())
#    user  system elapsed 
#   11.22    0.01   11.23 

Whether this will make your processing time down from one hour to 4 minutes or not, I'm not sure. But at least you won't have to remember to put in those pesky underscores in your function names :-)


If you can split on a fixed search pattern, you can use the fixed = TRUE argument, which will give you another substantial speed boost.


Another thing to consider is to do the process manually:

x <- strsplit(dt$string_column, "\\s+")
DT <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
DT[, name := unlist(x, use.names = FALSE)]
DT

With your sample data:

fun4 <- function() {
  x <- strsplit(dt$string_column, "\\s+")
  DT <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
  DT[, name := unlist(x, use.names = FALSE)]
  DT
}
#    user  system elapsed 
#    1.79    0.01    1.82

However, the answer is not the same as what I get with fun2(), but that's because you have duplicated values in "string_column". If you add an "id" column and do the same, you will get the same results.

like image 135
A5C1D2H2I1M1N2O1R2T1 Avatar answered Sep 20 '22 10:09

A5C1D2H2I1M1N2O1R2T1