I have a very large data.table with two columns, and I wish to apply a custom function to a particular column. The code to generate the problem is as follows:
require(data.table)
# 1e5 rows here just for illustration; the real table has ~100M rows
X <- rep("This is just random text", 1e5)
data <- data.frame(1:1e5, replicate(1, X, simplify = FALSE), stringsAsFactors = FALSE)
colnames(data) <- paste("X", seq_len(ncol(data)), sep = "")
DT <- as.data.table(data)
Now we have a large data.table which looks like:
| X1 | X2 |
|----|-------------------------|
| 1 | This is just random text|
| 2 | This is just random text|
| 3 | This is just random text|
| 4 | This is just random text|
| .. | ... |
What if I want to do some vectorized operation on one of these columns, keeping in mind that this data.table will be very large (approx. ~100M rows)?
Let's take the X1 column as an example. Suppose I want to apply the following function to it:
Fun4X1 <- function(x) { return(x + x*2) }
And a very complex NLP function to the X2 column, which looks something like:
Fun4X2 <- function(x) {
  require(stringr)
  return(str_split(x, " ")[[1]][1])
}
How should I go about doing this for a large dataset? Please suggest the least time-consuming approach, as my function is itself very complex.
P.S. I have tried foreach, sapply, and of course a for loop, and all of them are very slow on a pretty good hardware system.
The approach should be no different from applying any other built-in (or package-loaded) function to a specific column in a data.table: use a list(fun(variable), otherfun(othervariable)) construct in j. You can also name the resulting columns if so desired; otherwise they will be named "V1", "V2", and so on. In other words, for your problem you can do:
DT[, list(X1 = Fun4X1(X1), X2 = Fun4X2(X2))]
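If you want to keep the results in the table rather than just return them, data.table's update-by-reference operator := avoids copying the whole object, which matters at ~100M rows. A minimal sketch (not part of the original answer; overwriting the existing columns here is just illustrative):
# := modifies DT in place instead of building a new table
DT[, c("X1", "X2") := list(Fun4X1(X1), Fun4X2(X2))]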
I suspect, however, that a lot of your slowdown might be due to the functions you are actually using. Compare the following slight refinements:
Fun4X2.old <- function(x) {
  require(stringr)
  return(str_split(x, " ")[[1]][1])
}

Fun4X2.new1 <- function(x) {
  # base strsplit, then vapply to take the first word of every element
  vapply(strsplit(x, " "),
         function(y) y[1], character(1))
}

Fun4X2.new2 <- function(x) {
  # fixed = TRUE skips regex interpretation of the split pattern
  vapply(strsplit(x, " ", fixed = TRUE),
         function(y) y[1], character(1))
}

# fully vectorized: keep only the text before the first space
Fun4X2.sub <- function(x) sub("(.+?) .*", "\\1", x)
X <- rep("This is just random text", 1e5)
system.time(out1 <- Fun4X2.old(X))
# user system elapsed
# 18.838 0.000 18.659
system.time(out2 <- Fun4X2.new1(X))
# user system elapsed
# 0.000 0.000 0.944
system.time(out3 <- Fun4X2.new2(X))
# user system elapsed
# 1.584 0.000 0.270
system.time(out4 <- Fun4X2.sub(X))
# user system elapsed
# 0.000 0.000 0.222
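As a quick sanity check (not part of the original timings), the vectorized variants should agree with one another, while the original function collapses everything to a single value:
length(out1)            # 1 -- Fun4X2.old() returns only the first word of the first element
identical(out2, out3)   # TRUE
identical(out3, out4)   # TRUE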
One last note, regarding your comment here:
@AnandaMahto I am looking for something similar to this but if I use your solution then the output on the text column is not vectorized and I get the same output even if I have different text in each row
Incidentally, your original Fun4X2() (renamed Fun4X2.old() above) exhibits the same behavior.
DT2 <- data.table(X1 = 1:4, X2 = c("a b c", "d e f", "g h i", "j k l"))
DT2[, list(Fun4X1(X1), Fun4X2.old(X2))]
# V1 V2
# 1: 3 a
# 2: 6 a
# 3: 9 a
# 4: 12 a
DT2[, list(Fun4X1(X1), Fun4X2.new1(X2))]
# V1 V2
# 1: 3 a
# 2: 6 d
# 3: 9 g
# 4: 12 j
Check out the snowfall package, http://cran.r-project.org/web/packages/snowfall/snowfall.pdf, for parallel computing. You can set up a local cluster and utilize all of your cores. I've found that using sfApply from this package has reduced most of my computing times by about 5x (I have an 8-core machine, so ideally it would be 8 times faster, but there are obviously the costs of loading the data onto the cluster and collecting the results at the end).
e.g.
install.packages('snowfall')
require(snowfall)
sfInit(parallel = TRUE, cpus = 4)
sfExport(list = c('DT', 'Fun4X1', 'Fun4X2'))
# apply() coerces each row to a character vector, so X1 has to be converted back to numeric
sfApply(DT, 1, function(X) return(c(Fun4X1(as.numeric(X[1])), Fun4X2(X[2]))))
sfStop()
With apply it takes 25.07 sec; with sfApply it takes 9.11 sec on my machine.
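If per-row dispatch turns out to be the bottleneck, an alternative sketch (not from the original answer, and assuming a vectorized variant of Fun4X2 such as the vapply- or sub()-based versions shown above) is to split the column into one chunk per worker and process each chunk in parallel:
sfInit(parallel = TRUE, cpus = 4)
sfExport('Fun4X2')
# one chunk per worker; the chunk count of 4 is illustrative
chunks <- split(DT$X2, cut(seq_len(nrow(DT)), 4, labels = FALSE))
res <- unlist(sfLapply(chunks, Fun4X2), use.names = FALSE)
sfStop()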
You can use the fast and vectorized function sub for the second problem:
Fun4X2 <- function(x) sub("(.+?) .*", "\\1", x)
head(Fun4X2(DT[,X2]))
# [1] "This" "This" "This" "This" "This" "This"