I do the following:
library(data.table)
library(stringr)
dt <- data.table(string_column = paste(sample(c(letters, " "), 500000, replace = TRUE)
, sample(c(letters, " "), 500000, replace = TRUE)
, sample(1:500000)
, sep = " "), key = "string_column")
split_res <- dt[, list(name = unlist(str_split(string_column, '\\s+'))), by = string_column]
For real data, it takes approx. 1 hour to process dt (10M rows) and create split_res (18M rows).
Out of curiosity: is there a way to speed up the process? Maybe unlist + str_split is not the right way of doing this?
You'll get a big speedup if you ditch str_split() from "stringr" and use base R's strsplit() instead.
fun1 <- function() dt[, list(name = unlist(str_split(string_column, '\\s+'))), by = string_column]
fun2 <- function() dt[, list(name = unlist(strsplit(string_column, '\\s+'))), by = string_column]
system.time(fun1())
# user system elapsed
# 172.41 0.05 172.82
system.time(fun2())
# user system elapsed
# 11.22 0.01 11.23
Whether this will bring your processing time down from one hour to about 4 minutes, I'm not sure. But at least you won't have to remember to put in those pesky underscores in your function names :-)
If you can split on a fixed search pattern, you can use the fixed = TRUE
argument, which will give you another substantial speed boost.
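For instance, if the columns in your sample data are always separated by exactly one space, a fixed-pattern variant could look like the sketch below (the name fun3 is mine, and note that fixed = TRUE matches a literal single space, not runs of whitespace like "\\s+" does):

```r
library(data.table)

# Hypothetical fun3: same split as fun2, but with a literal (fixed) pattern.
# Assumes fields are separated by exactly one space.
fun3 <- function() dt[, list(name = unlist(strsplit(string_column, " ", fixed = TRUE))),
                      by = string_column]
```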
Another thing to consider is to do the process manually:
x <- strsplit(dt$string_column, "\\s+")
DT <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
DT[, name := unlist(x, use.names = FALSE)]
DT
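The indexing trick here is that each original row gets repeated once per token it splits into, and unlist(x) then lines the tokens up with those repeated rows. A toy illustration (the variable names are mine):

```r
x <- strsplit(c("a b", "c d e"), " ", fixed = TRUE)
lens <- vapply(x, length, 1L)   # 2 3       -- tokens per original row
rep(sequence(2), lens)          # 1 1 2 2 2 -- row index to repeat for each token
unlist(x, use.names = FALSE)    # "a" "b" "c" "d" "e"
```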
With your sample data:
fun4 <- function() {
x <- strsplit(dt$string_column, "\\s+")
DT <- dt[rep(sequence(nrow(dt)), vapply(x, length, 1L))]
DT[, name := unlist(x, use.names = FALSE)]
DT
}
system.time(fun4())
# user system elapsed
# 1.79 0.01 1.82
However, the answer is not the same as what I get with fun2(), but that's because you have duplicated values in "string_column". If you add an "id" column and do the same, you will get the same results.
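A quick sketch of what that could look like (the id column is my own addition; grouping by it keeps duplicated strings as separate rows):

```r
library(data.table)

dt[, id := seq_len(nrow(dt))]  # unique per-row id so duplicates don't collapse
split_res <- dt[, list(name = unlist(strsplit(string_column, "\\s+"))),
                by = c("id", "string_column")]
```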