I have a very large data.table with two columns, and I want to apply a custom function to a particular column. The following code reproduces the problem:
require(data.table)
X <- rep("This is just random text", 1e5)
data <- data.frame(1:1e5, replicate(1, X, simplify=FALSE), stringsAsFactors=FALSE)
colnames(data) <- paste("X", seq_len(ncol(data)), sep="")
DT <- as.data.table(data)
Now we have a large data.table that looks like this:
| X1 |            X2           |
|----|-------------------------|
| 1  | This is just random text|
| 2  | This is just random text|
| 3  | This is just random text|
| 4  | This is just random text|
| .. |            ...          |
How should I perform a vectorized operation on one of these columns, keeping in mind that the real data.table will be very large (roughly 100M rows)?
Take the X1 column as an example. Suppose I want to apply the following function to it:
Fun4X1 <- function(x){return(x+x*2)}
And a very complex NLP function on the X2 column, which looks something like this:
Fun4X2 <- function(x){
             require(stringr)
             return(str_split(x, " ")[[1]][1])
          }
How should I go about this for a large dataset? Please suggest the least time-consuming approach, as my function is itself very complex.
P.S. I have tried foreach, sapply, and of course a for loop, and all are very slow even on fairly good hardware.
The approach should be no different from applying any other built-in (or package-loaded) function to a specific column in a data.table: use a list(fun(variable), otherfun(othervariable)) construct. You can also name the resulting columns if desired; otherwise they will be named "V1", "V2", and so on.
In other words, for your problem you can do:
DT[, list(X1 = Fun4X1(X1), X2 = Fun4X2(X2))]
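If the goal is to keep the results alongside the original columns, data.table's `:=` operator adds columns by reference instead of building a new table, which may matter at 100M rows. The following is only a sketch on a four-row stand-in table; the column names Y1 and Y2 are made up for illustration, and Fun4X2 here is a base R first-word extractor rather than the asker's real NLP function.

```r
library(data.table)

# Small stand-in for the large table in the question
DT <- data.table(X1 = 1:4, X2 = c("a b c", "d e f", "g h i", "j k l"))

Fun4X1 <- function(x) x + x * 2

# Vectorized first-word extractor using base R only (assumed stand-in)
Fun4X2 <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(y) y[1], character(1))
}

# Add both results as new columns in place; Y1 and Y2 are hypothetical names
DT[, `:=`(Y1 = Fun4X1(X1), Y2 = Fun4X2(X2))]
print(DT)
```

Whether this is faster than list() depends mostly on the functions themselves; the by-reference update only avoids copying the table.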
I suspect, however, that a lot of your slowdown might be due to the functions you are actually using. Compare the following slight refinements:
Fun4X2.old <- function(x){
  require(stringr)
  return(str_split(x, " ")[[1]][1])
}
Fun4X2.new1 <- function(x) {
  vapply(strsplit(x, " "), 
         function(y) y[1], character(1))
} 
Fun4X2.new2 <- function(x) {
  vapply(strsplit(x, " ", fixed=TRUE), 
         function(y) y[1], character(1))
} 
Fun4X2.sub <- function(x) sub("(.+?) .*", "\\1", x)
X <- rep("This is just random text", 1e5)    
system.time(out1 <- Fun4X2.old(X))
#    user  system elapsed 
#  18.838   0.000  18.659 
system.time(out2 <- Fun4X2.new1(X))
#    user  system elapsed 
#   0.000   0.000   0.944 
system.time(out3 <- Fun4X2.new2(X))
#    user  system elapsed 
#   1.584   0.000   0.270 
system.time(out4 <- Fun4X2.sub(X))
#    user  system elapsed 
#   0.000   0.000   0.222 
One last note, regarding your comment here:
@AnandaMahto I am looking for something similar to this but if I use your solution then the output on text column in not vectorized and I get same output even if I have different text in each row
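Incidentally, your original Fun4X2() (renamed Fun4X2.old() above) exhibits the same behavior: the [[1]][1] keeps only the first word of the first string, and that single value is then recycled down the whole column.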
DT2 <- data.table(X1 = 1:4, X2 = c("a b c", "d e f", "g h i", "j k l"))
DT2[, list(Fun4X1(X1), Fun4X2.old(X2))]
#    V1 V2
# 1:  3  a
# 2:  6  a
# 3:  9  a
# 4: 12  a
DT2[, list(Fun4X1(X1), Fun4X2.new1(X2))]
#    V1 V2
# 1:  3  a
# 2:  6  d
# 3:  9  g
# 4: 12  j
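One way to catch this kind of problem early is to check that a function returns one value per input element before running it on the full table. This is a rough sketch, not part of data.table: is_vectorized and first_only are hypothetical helpers, and first_only is a base R stand-in for Fun4X2.old so the example runs without stringr.

```r
# Hypothetical helper: TRUE if f returns one value per element of x
is_vectorized <- function(f, x) length(f(x)) == length(x)

# Base R stand-in for Fun4X2.old: looks only at the first string
first_only <- function(x) strsplit(x, " ")[[1]][1]

# Vectorized version, same as Fun4X2.new2 above
first_each <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(y) y[1], character(1))
}

sample_text <- c("a b c", "d e f", "g h i", "j k l")
old_ok <- is_vectorized(first_only, sample_text)
new_ok <- is_vectorized(first_each, sample_text)
print(c(old = old_ok, new = new_ok))
```

A length check like this would not catch every mistake, but it flags functions whose single result would be silently recycled.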
Check out the snowfall package (http://cran.r-project.org/web/packages/snowfall/snowfall.pdf) for parallel computing. You can set up a local cluster and use all of your cores. I've found that using sfApply from this package has cut most of my computing times by about 5x. (I have an 8-core machine, so ideally it would be 8 times faster, but there is the cost of loading the data into the cluster and collecting the results at the end.)
e.g.
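install.packages('snowfall')  # only needed once
require(snowfall)
sfInit(parallel=TRUE, cpus=4)  # start a local cluster with 4 workers
sfExport(list=c('DT','Fun4X1','Fun4X2'))  # copy the data and functions to the workers
# Row-wise apply coerces each row to character, so X[1] is converted back to numeric
sfApply(DT, 1, function(X) c(Fun4X1(as.numeric(X[1])), Fun4X2(X[2])))
sfStop()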
With apply this takes 25.07 sec; with sfApply it takes 9.11 sec on my machine.
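An alternative to row-wise parallelism is to split the column into a few large chunks and run a vectorized function on each chunk in a worker. The sketch below uses the base parallel package rather than snowfall; the worker count of 2 and the reduced vector length are assumptions chosen so the example runs quickly, and no speedup is claimed for any particular machine.

```r
library(parallel)

# Vectorized first-word extractor (same as Fun4X2.sub above)
Fun4X2 <- function(x) sub("(.+?) .*", "\\1", x)

X <- rep("This is just random text", 1e4)

n_workers <- 2  # assumption: adjust to the number of available cores
cl <- makeCluster(n_workers)

# Split the indices into contiguous chunks, one per worker
idx <- splitIndices(length(X), n_workers)
chunks <- lapply(idx, function(i) X[i])

# Each worker processes a whole chunk with one vectorized call
res <- unlist(parLapply(cl, chunks, Fun4X2), use.names = FALSE)

stopCluster(cl)
head(res)
```

Sending a handful of large chunks keeps communication overhead low compared with dispatching one row at a time.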
You can use the fast and vectorized function sub for the second problem:
Fun4X2 <- function(x) sub("(.+?) .*", "\\1", x)
head(Fun4X2(DT[,X2]))
# [1] "This" "This" "This" "This" "This" "This"