
Apply a custom function on an entire column of data.table?

I have a very large data.table with two columns, and I want to apply a custom function to a particular column. The code to reproduce the problem is as follows:

require(data.table)
X <- rep("This is just random text", 1e5)
data <- data.frame(1:1e5, replicate(1, X, simplify=FALSE), stringsAsFactors=FALSE)
colnames(data) <- paste("X", seq_len(ncol(data)), sep="")
DT <- as.data.table(data)

Now we have a large data.table that looks like this:

| X1 |            X2           |
|----|-------------------------|
| 1  | This is just random text|
| 2  | This is just random text|
| 3  | This is just random text|
| 4  | This is just random text|
| .. |            ...          |

How can I apply a vectorized operation to one of these columns, bearing in mind that the real data.table will be very large (approximately 100M rows)?

Take column X1 as an example. Suppose I want to apply the following function to it:

Fun4X1 <- function(x){return(x+x*2)}

And a very complex NLP function on column X2, which looks something like this:

Fun4X2 <- function(x){
             require(stringr)
             return(str_split(x, " ")[[1]][1])
          }

How should I go about doing this for a large dataset? Please suggest the least time-consuming approach, as my function is itself very complex.

P.S. I have tried foreach, sapply, and of course a for loop; all are very slow, even on pretty good hardware.

Ankit asked Jan 22 '14 15:01


3 Answers

The approach should be no different than applying any other in-built (or package-loaded) function to a specific column in a data.table: Use a list(fun(variable), otherfun(othervariable)) type of construct. You can also name the resulting columns if so desired, otherwise they will be named "V1", "V2" and so on.

In other words, for your problem you can do:

DT[, list(X1 = Fun4X1(X1), X2 = Fun4X2(X2))]
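For a table of this size it may also be worth updating by reference instead of building a new table. The sketch below contrasts the `list()` form with data.table's `:=` operator on a four-row toy table. `first_word` is a hypothetical vectorized stand-in for the text function, not the asker's real NLP code, and the new column names are made up for illustration.

```r
library(data.table)

Fun4X1 <- function(x) x + x * 2

# Hypothetical vectorized stand-in for the text function (assumption)
first_word <- function(x) sub(" .*", "", x)

DT2 <- data.table(X1 = 1:4, X2 = c("a b c", "d e f", "g h i", "j k l"))

# list() in j returns a new data.table and leaves DT2 untouched
res <- DT2[, list(X1 = Fun4X1(X1), X2 = first_word(X2))]

# := adds the columns to DT2 in place, without copying the whole table
DT2[, `:=`(X1_new = Fun4X1(X1), X2_first = first_word(X2))]
```

With `:=` the original columns are kept and the results sit next to them, which tends to matter more as the row count grows.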

I suspect, however, that a lot of your slowdown might be due to the functions you are actually using. Compare the following slight refinements:

Fun4X2.old <- function(x){
  require(stringr)
  return(str_split(x, " ")[[1]][1])
}

Fun4X2.new1 <- function(x) {
  vapply(strsplit(x, " "), 
         function(y) y[1], character(1))
} 

Fun4X2.new2 <- function(x) {
  vapply(strsplit(x, " ", fixed=TRUE), 
         function(y) y[1], character(1))
} 

Fun4X2.sub <- function(x) sub("(.+?) .*", "\\1", x)

X <- rep("This is just random text", 1e5)    

system.time(out1 <- Fun4X2.old(X))
#    user  system elapsed 
#  18.838   0.000  18.659 
system.time(out2 <- Fun4X2.new1(X))
#    user  system elapsed 
#   0.000   0.000   0.944 
system.time(out3 <- Fun4X2.new2(X))
#    user  system elapsed 
#   1.584   0.000   0.270 
system.time(out4 <- Fun4X2.sub(X))
#    user  system elapsed 
#   0.000   0.000   0.222 
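Before swapping one variant for another, it is worth confirming that they agree on your own data. A minimal sketch, with the two fastest functions repeated so the block runs on its own; the test strings are made up, and timings are deliberately not shown because they depend on the machine:

```r
Fun4X2.new2 <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE),
         function(y) y[1], character(1))
}
Fun4X2.sub <- function(x) sub("(.+?) .*", "\\1", x)

# Small, varied input, including a string that contains no space
x <- c("a b c", "single", "two words")
same <- identical(Fun4X2.new2(x), Fun4X2.sub(x))
same

# Time both on a larger vector; compare the numbers on your own hardware
big <- rep(x, 1e4)
t_new2 <- system.time(Fun4X2.new2(big))
t_sub  <- system.time(Fun4X2.sub(big))
```

Inputs such as empty strings or leading spaces are not covered here and should be checked separately if they can occur in your data.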

One last note, regarding your comment here:

@AnandaMahto I am looking for something similar to this, but if I use your solution then the output on the text column is not vectorized, and I get the same output even if I have different text in each row.

Incidentally, your original Fun4X2() (renamed Fun4X2.old() above) exhibits the same behavior.

DT2 <- data.table(X1 = 1:4, X2 = c("a b c", "d e f", "g h i", "j k l"))
DT2[, list(Fun4X1(X1), Fun4X2.old(X2))]
#    V1 V2
# 1:  3  a
# 2:  6  a
# 3:  9  a
# 4: 12  a

DT2[, list(Fun4X1(X1), Fun4X2.new1(X2))]
#    V1 V2
# 1:  3  a
# 2:  6  d
# 3:  9  g
# 4: 12  j
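The repeated "a" comes from the `[[1]][1]` indexing: splitting a character vector yields a list with one element per input string, and `[[1]]` keeps only the first of them. The resulting length-1 value is then recycled down the column. A small base-R sketch of the difference (the input strings are made up):

```r
x <- c("a b c", "d e f", "g h i")

# One list element (a character vector of words) per input string
pieces <- strsplit(x, " ", fixed = TRUE)
length(pieces)
# [1] 3

# Only looks at the first input string
pieces[[1]][1]
# [1] "a"

# Takes the first word of every input string
first_words <- vapply(pieces, function(p) p[1], character(1))
first_words
# [1] "a" "d" "g"
```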

A5C1D2H2I1M1N2O1R2T1 answered Oct 13 '22 12:10


Check out the snowfall package, http://cran.r-project.org/web/packages/snowfall/snowfall.pdf, for parallel computing. You can set up a local cluster and use all of your cores. I've found that using sfApply from this package has reduced most of my computing times by a factor of about 5.

(I have an 8-core machine, so ideally it would be 8 times faster, but there is obviously a cost to loading the data into the cluster and collecting the results at the end.)

e.g.

install.packages('snowfall')

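require(snowfall)
sfInit(parallel=TRUE, cpus=4)
sfExport(list=c('DT','Fun4X1','Fun4X2'))
# R is case-sensitive: call Fun4X1/Fun4X2, matching the exported names.
# Row-wise apply coerces each row to character, so X[1] is converted back to numeric.
sfApply(DT, 1, function(X) c(Fun4X1(as.numeric(X[1])), Fun4X2(X[2])))
sfStop()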

On my machine this takes 25.07 sec with apply and 9.11 sec with sfApply.
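If the function is already vectorized, an alternative to going row by row is to hand each worker one large chunk of the column. The sketch below uses the base `parallel` package rather than snowfall, so it is a swapped-in technique and not the code timed above. `first_word`, the two-worker cluster, and the toy data are all assumptions for illustration.

```r
library(parallel)

# Hypothetical vectorized stand-in for the text function (assumption)
first_word <- function(x) sub(" .*", "", x)

x <- rep(c("a b c", "d e f", "g h i", "j k l"), 2500)

n_workers <- 2                       # assumed; adjust to the cores available
cl <- makeCluster(n_workers)

# Split the vector into one contiguous chunk per worker
chunks <- lapply(splitIndices(length(x), n_workers), function(i) x[i])

# Each worker runs the vectorized function once on its whole chunk
res <- parLapply(cl, chunks, first_word)
stopCluster(cl)

# Chunks come back in order, so the result lines up with the input
out <- unlist(res, use.names = FALSE)
```

Sending a few large chunks keeps the communication overhead mentioned above low, since each worker is called once instead of once per row.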

James Tobin answered Oct 13 '22 11:10


You can use the fast and vectorized function sub for the second problem:

Fun4X2 <- function(x) sub("(.+?) .*", "\\1", x)

head(Fun4X2(DT[,X2]))
# [1] "This" "This" "This" "This" "This" "This"

Sven Hohenstein answered Oct 13 '22 13:10