
How to factorize specific columns in a data.frame in R using apply

Tags: dataframe, r, apply

I have a data.frame called mydata and a vector ids containing the indices of the columns in the data.frame that I would like to convert to factors. The following code solves the problem:

for(i in ids) mydata[, i]<-as.factor(mydata[, i])

I wanted to clean this code up by using apply instead of an explicit for loop:

mydata[, ids]<-apply(mydata[, ids], 2, as.factor)

However, the last statement gives me a data.frame whose columns are character vectors instead of factors. I fail to see the distinction between these two lines of code. Why do they not produce the same result?

Kind regards, Michael

Dr. Mike asked Nov 02 '11 11:11


2 Answers

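The result of apply is a vector, an array, or a list of values (see ?apply), never a data.frame. When apply assembles the factors returned by as.factor into an array, the factor class is lost and you end up with a character matrix.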

For your problem, you should use lapply instead:

data(iris)
iris[, 2:3] <- lapply(iris[, 2:3], as.factor)
str(iris)

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : Factor w/ 23 levels "2","2.2","2.3",..: 15 10 12 11 16 19 14 14 9 11 ...
 $ Petal.Length: Factor w/ 43 levels "1","1.1","1.2",..: 5 5 4 6 5 8 5 6 5 6 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Notice that this is one place where lapply can be much faster than a for loop. In general a loop and lapply have similar performance, but the [<-.data.frame operation is very slow. By using lapply you avoid calling [<-.data.frame in each iteration and replace those calls with a single assignment.
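A rough way to check the performance claim on your own machine is sketched below. The data set, its size, and the names df_loop and df_lapply are made up for illustration. No timings are quoted because they depend on the R version and hardware, and the gap may be small for narrow data frames.

```r
# Hypothetical benchmark data: 200 integer columns with 1000 rows each.
set.seed(1)
n_col <- 200
df_loop <- as.data.frame(
  matrix(sample(1:5, 1000 * n_col, replace = TRUE), ncol = n_col)
)
df_lapply <- df_loop
ids <- seq_len(n_col)

# Loop version: one [<-.data.frame call per column.
t_loop <- system.time(
  for (i in ids) df_loop[, i] <- as.factor(df_loop[, i])
)

# lapply version: a single [<-.data.frame call for all columns.
t_lapply <- system.time(
  df_lapply[, ids] <- lapply(df_lapply[, ids], as.factor)
)

# Elapsed seconds for each approach.
print(c(loop = t_loop[["elapsed"]], lapply = t_lapply[["elapsed"]]))

# Both approaches should produce the same data frame.
print(all.equal(df_loop, df_lapply))
```

Whichever version is faster on your setup, the two should give the same result, so the choice is about speed and readability only.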

Andrie answered Nov 11 '22 11:11


That is because apply() works completely differently. It first coerces its input to a matrix, then carries out the function as.factor on each column in a local environment, collects the results, and tries to merge them into an array rather than a data.frame. In your case this array is a matrix. A matrix cannot hold factors, so R has no other way to combine them than to convert them to character first. That character matrix is then used to fill your data.frame.
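The difference can be seen by inspecting what each function returns before anything is assigned back. This is a minimal sketch on a small made-up data frame; the column names and values are only illustrative.

```r
# Small made-up data frame standing in for mydata from the question.
mydata <- data.frame(a = c(1, 2, 2), b = c(3, 3, 4), c = c(5, 6, 7))
ids <- 1:2

# apply() coerces its input with as.matrix() and puts the results
# back into an array, so the factor class does not survive.
res_apply <- apply(mydata[, ids], 2, as.factor)
is.matrix(res_apply)   # TRUE
typeof(res_apply)      # "character"

# lapply() returns a list with one factor per column.
res_lapply <- lapply(mydata[, ids], as.factor)
sapply(res_lapply, is.factor)

# Assigning the list stores each factor as its own column.
mydata[, ids] <- res_lapply
str(mydata)
```

Because the result of apply is already a character matrix, the assignment to mydata[, ids] can only ever store character columns.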

You can use lapply for that (see Andrie's answer) or colwise from the plyr package.

library(plyr)
# colwise() turns as.factor into a function that works column by column on a data.frame
mydata[, ids] <- colwise(as.factor)(mydata[, ids])

Joris Meys answered Nov 11 '22 12:11