Converting multiple data.table columns to factors in R

Tags: r, data.table

I ran into an unexpected problem when trying to convert multiple columns of a data table into factor columns. I've reproduced it as follows:

library(data.table)
tst <- data.table('a' = c('b','b','c','c'))
class(tst[,a])
tst[,as.factor(a)]  #Returns expected result
tst[,as.factor('a'),with=FALSE] #Returns error

The latter command returns 'Error in Math.factor(j) : abs not meaningful for factors'. I found this when attempting tst[,lapply(cols, as.factor),with=FALSE], where cols was a collection of columns I wanted to convert to factors. Is there a solution or workaround for this?

asked Aug 30 '13 05:08 by tresbot


2 Answers

I found one solution:

library(data.table)
tst <- data.table('a' = c('b','b','c','c'))
class(tst[,a])
cols <- 'a'
tst[,(cols):=lapply(.SD, as.factor),.SDcols=cols]

Still, the earlier-mentioned behavior seems buggy.
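Since the question was about several columns at once, here is a minimal sketch of the same idiom with more than one column. The table dt and its column names are made up for illustration; only the := / .SD / .SDcols pattern comes from the answer above.

```r
library(data.table)

# Hypothetical example table; the column names are illustrative only
dt <- data.table(a = c("b", "b", "c", "c"),
                 b = c("x", "y", "x", "y"),
                 n = 1:4)

cols <- c("a", "b")  # names of the columns to convert

# The parentheses make := treat cols as a vector of column names
# rather than as a new column literally called "cols"
dt[, (cols) := lapply(.SD, as.factor), .SDcols = cols]

sapply(dt, class)  # a and b are now factors, n is left as integer
```

The assignment happens by reference, so dt itself is modified and columns not listed in cols are untouched.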

answered Sep 23 '22 08:09 by tresbot


This is now fixed in v1.8.11, but probably not in the way you'd hoped for. From NEWS:

FR #4867 is now implemented. DT[, as.factor('x'), with=FALSE] where x is a column in DT, is now equivalent to DT[, "x", with=FALSE] instead of ending up with an error. Thanks to tresbot for reporting on SO: Converting multiple data.table columns to factors in R


Some explanation: The difference, when with=FALSE is used, is that the columns of the data.table aren't seen as variables anymore. That is:

tst[, as.factor(a), with=FALSE] # would give "a" not found!

This fails because, with with=FALSE, a is no longer looked up as a column. What you did instead is:

tst[, as.factor('a'), with=FALSE]

You're in fact creating a factor "a" with level="a" and asking to subset that column. This doesn't really make much sense. Take the case of data.frames:

DF <- data.frame(x=1:5, y=6:10)
DF[, c("x", "y")] # gives back DF

DF[, factor(c("x", "y"))] # gives back DF again, not factor columns
DF[, factor(c("x", "x"))] # gives back two columns of "x", still integer, not factor!

So, basically, when you use with=FALSE, you're applying factor() not to the elements of that column but just to the column name. I hope I've managed to convey the difference well. Feel free to edit/comment if anything is unclear.
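To make the distinction concrete, here is a small sketch using the tst table from the question. The variable names sel and fac are only for illustration.

```r
library(data.table)

tst <- data.table(a = c("b", "b", "c", "c"))

# with = FALSE: j is interpreted as column names, so this only selects column "a"
sel <- tst[, "a", with = FALSE]
class(sel$a)   # still "character"; nothing was converted

# default (with = TRUE): a refers to the column's values, so as.factor() converts them
fac <- tst[, as.factor(a)]
class(fac)     # "factor"
levels(fac)    # "b" "c"
```

Neither call changes tst itself; to store the converted column, assign it with := as in the other answer.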

answered Sep 24 '22 08:09 by Arun