
Strange recycling of factors in dplyr::mutate - bug or feature?

Tags:

r

dplyr

The function mutate from the R package 'dplyr' has a peculiar recycling behaviour for factors: it seems to return the factor's underlying integer codes, as if as.numeric had been applied. In the following example y becomes what you would expect, whereas z is c(1,1):

library(dplyr)
df <- data_frame(x=1:2)
glimpse(df %>% mutate(y="A", z=factor("B")))
# Variables:
# $ x (int) 1, 2
# $ y (chr) "A", "A"
# $ z (int) 1, 1

Is there any rationale behind this, or is it a bug?

(I am using R 3.1.1 and dplyr 0.3.0.1.)


EDIT:

After I posted this as an issue on GitHub, Romain Francois fixed it within hours! So if the above is a problem, use devtools::install_github to get the latest version:

library(devtools)
install_github("hadley/dplyr")

and then

library(dplyr)
df <- data_frame(x=1:2)
glimpse(df %>% mutate(y="A", z=factor("B")))
# Variables:
# $ x (int) 1, 2
# $ y (chr) "A", "A"
# $ z (fctr) B, B

Nice work Romain!

Henrik Renlund asked Oct 22 '14 06:10


1 Answer

dplyr uses C++ to perform the actual mutate operation. Following the rabbit hole and noting this is an ungrouped mutation, we can use our trusty debugger to notice the following.

debugonce(dplyr:::mutate_impl)
# Inside of mutate_impl we do:
class(dots[[2]]$expr) # which is a "call"!

So now we know the type of our lazy expression. We eval the call and notice it is a supported type (unfortunately, R's TYPEOF macro claims factors are integers - we would need Rf_isFactor to discriminate).
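The remark about TYPEOF can be checked from plain R, where typeof() reports the same underlying storage type. The following is a minimal sketch in base R (no dplyr needed); the variable name f is only for illustration.

```r
f <- factor("B")

typeof(f)       # "integer": the storage type, which is all TYPEOF sees
is.factor(f)    # TRUE: decided by the class attribute, not the storage type
unclass(f)      # the bare integer code together with a "levels" attribute
attributes(f)   # the levels and class that make the integer vector a factor
```

So a check based only on the storage type cannot tell a factor from an ordinary integer vector; the distinction lives entirely in the attributes.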

So what happens next? We return the result and we're done. If you have tried (df %>% mutate(y="A", z=factor(c("A","B"))))[[3]] already, you'll know that the issue is indeed the recycling.

Specifically, the C++ Gatherer object (which should really be checking Rf_isFactor in addition to its current date check on INTSXPs) uses C++ templating to force the creation of a Vector<INTSXP>. This happens implicitly through constructor initialization (notice the arity-2 call in ConstantGathererImpl), and nothing carries over the factor "label", that is, its levels and class attributes.
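The effect can be imitated in plain R. This is only an illustration of the idea, not dplyr's actual C++ code: recycling the bare integer codes loses the factor, while copying the levels and class attributes onto the recycled vector restores it. The names codes and fixed are made up for this example.

```r
f <- factor("B")

# Recycle only the underlying integer codes, as a type-only check would do.
codes <- rep(as.integer(f), length.out = 2)
is.factor(codes)    # FALSE: a plain integer vector, the attributes are gone

# Carry the factor attributes over to the recycled vector.
fixed <- structure(codes, levels = levels(f), class = class(f))
is.factor(fixed)    # TRUE: the recycled vector is a factor again
```

A fix along these lines only needs to detect the factor before recycling and reattach its attributes afterwards.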

TLDR: In R's C API, integers and factors have the same internal type according to the TYPEOF macro, so factors are a weird edge case.

Feel free to submit a pull request to dplyr; it's in active development and hadley and Romain are nice guys. You'll have to add an if statement here.

Robert Krzyzanowski answered Oct 04 '22 16:10