
Can I programmatically update the type of a set of columns (to factors) in data.table?

I would like to modify a set of columns inside a data.table to be factors. If I knew the names of the columns in advance, I think this would be straightforward.

library(data.table)
dt1  <- data.table(a = 1:4, b = rep(c('a','b'), 2), c = rep(c(0,1), 2))
dt1[,class(b)]
dt1[,b:=factor(b)]
dt1[,class(b)]

But I don't, and instead have a vector of the variable names:

vars.factors  <- c('b','c')

I can apply the factor function to them without a problem ...

lapply(vars.factors, function(x) dt1[,class(get(x))])
lapply(vars.factors, function(x) dt1[,factor(get(x))])

But I don't know how to re-assign or update the original column in the data table.

This fails ...

  lapply(vars.factors, function(x) dt1[,x:=factor(get(x))])
  # Error in get(x) : invalid first argument 

As does this ...

  lapply(vars.factors, function(x) dt1[,get(x):=factor(get(x))])
  # Error in get(x) : object 'b' not found 

NB. I tried the answer proposed here without any luck.

drstevok asked Oct 10 '14 12:10


1 Answer

Yes, this is fairly straightforward:

dt1[, (vars.factors) := lapply(.SD, as.factor), .SDcols=vars.factors]

On the LHS (of := in j), we specify the names of the columns. If a column already exists, it is updated; otherwise, a new column is created. On the RHS, we loop over all the columns in .SD (which stands for Subset of Data), and the .SDcols argument specifies which columns .SD should contain.
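As a quick check of the line above, here is a minimal sketch that rebuilds the table from the question, applies the update, and inspects the resulting column classes. The `classes` variable is only an illustrative name, and `rep(..., 2)` is used so the recycling is explicit.

```r
library(data.table)

# Same toy table as in the question, with explicit recycling.
dt1 <- data.table(a = 1:4, b = rep(c("a", "b"), 2), c = rep(c(0, 1), 2))
vars.factors <- c("b", "c")

# Update the listed columns in place; .SD contains only the .SDcols columns.
dt1[, (vars.factors) := lapply(.SD, as.factor), .SDcols = vars.factors]

# Collect the class of every column after the update.
classes <- sapply(dt1, class)
print(classes)
```

Column a is not listed in vars.factors, so it should keep its original type, while b and c should now be factors.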

Following up on comment:

Note that we need to wrap the LHS in () so that it is evaluated and the column names stored in the vars.factors variable are fetched. This is because we allow the syntax

DT[, col := value]

when there's only one column to assign, by specifying the column name as a symbol (without quotes), purely for convenience. This creates a column named col and assigns value to it.

To tell these two cases apart, we need the (). Wrapping the LHS in () is enough to signal that we want the values stored in the variable, not a column named after the variable itself.
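The difference between the two forms can be seen side by side in a small sketch. The table `DT` and the variable `col` below are hypothetical and exist only for illustration.

```r
library(data.table)

DT <- data.table(x = 1:3)
col <- "y"  # hypothetical variable holding the name of the target column

# Bare symbol on the LHS: the column is literally named "col".
DT[, col := x * 2L]

# Parenthesised LHS: the variable is evaluated, so the column is named "y".
DT[, (col) := x * 10L]

print(names(DT))
```

After both assignments the table should contain a column called col as well as a column called y, which shows that the unparenthesised form never looks inside the variable.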

Arun answered Oct 23 '22 23:10