
Values of the wrong group are used when using plot() within a data.table() in RStudio

I want to generate a diagram divided into two panels. The upper panel should show the values of group a, and the lower panel the values of group b. I am using data.table() to do this. Here is the code I used to generate an example and set up the graphical output:

library(data.table)
set.seed(23)
Example <- data.table('group' = rep(c('a', 'b'), each = 5), 'value' = runif(10))
layout(1:2)
par('mai' = rep(.5, 4))

When I run the following lines in the usual R console, the correct values are used for plotting. When I run the same code in RStudio, the values of the second group are used for both diagrams:

Example[, plot(value, ylim = c(0, 1)), by = group] # Example 1
Example[, .SD[plot(value, ylim = c(0, 1))], by = group] # Example 2

When I add a comma inside .SD[] in Example 2 (so that plot() is evaluated as the j argument rather than the i argument), the correct output is generated in RStudio as well:

Example[, .SD[, plot(value, ylim = c(0, 1))], by = group] # Example 3

When I use barplot() rather than plot(), RStudio uses the correct values as well:

Example[, barplot(value, ylim = c(0, 1)), by = group] # Example 4

Did I overlook something or is this a bug?

System: Windows 7, RStudio Desktop v0.98.1091, R 3.1.2, data.table 1.9.4

Jonas asked Dec 16 '14 13:12


1 Answer

Nice catch (+1'd already)! In my case, Example 3 doesn't produce the right plot either (OS X 10.10.1, R 3.1.2, RStudio 0.98.1091).

The only difference between the R console/GUI and RStudio here is the plotting device. RStudio seems to be using its own graphics device, RStudioGD, whereas the R console/GUI uses Quartz.

By debugging graphics:::plot.default, I was able to narrow the issue down to the function plot.xy(). This function hands the drawing off to the active graphics device, which differs between the two environments (as noted above).

If you first open a different device, for example Quartz by calling quartz(), and then run your code, it works fine!
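quartz() only exists on OS X. As a rough cross-platform sketch of the same idea, dev.new() accepts a noRStudioGD argument that asks R to open the platform's default device rather than RStudio's built-in one. The variable dev_name is only there for illustration, and whether this sidesteps the problem on every setup is an assumption, not something verified here:

```r
library(data.table)

set.seed(23)
Example <- data.table(group = rep(c("a", "b"), each = 5), value = runif(10))

# Open the platform's default device (quartz, windows, X11, ...)
# instead of RStudio's built-in RStudioGD.
dev.new(noRStudioGD = TRUE)
dev_name <- names(dev.cur())  # name of the device that is now active

# Same layout and plotting call as in the question (Example 1).
layout(1:2)
par(mai = rep(.5, 4))
Example[, plot(value, ylim = c(0, 1)), by = group]
```

The plots then appear in a separate window rather than in RStudio's Plots pane.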

FWIW, this issue can be reproduced using dplyr as well:

require(dplyr)
df = as.data.frame(Example)
my_fun = function(x) {plot(x, ylim=c(0,1)); 1L }
df %>% group_by(group) %>% summarise(my_fun(value))

This results in the same wrong plot.

This is most likely due to the way the subgroups are handled in data.table (and I think dplyr should be doing it the same way as data.table), which you can see by:

Example[, print(sapply(.SD, address)), by=group]
#         value 
# "0x105bbf5b8" 
#         value 
# "0x105bbf5b8" 
# Empty data.table (0 rows) of 1 col: group

data.table allocates memory sized to the largest group for .SD and internally reuses this memory for each subgroup, so as to avoid repeated memory allocation and deallocation, for efficiency. Not sure (shooting in the dark here), but it seems like RStudioGD doesn't let go of the pointer linked with the subgroup, and as the data in the subgroup gets updated, the plot gets updated too. You can verify this by doing:
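To see the buffer reuse next to an explicit copy, here is a small sketch that records, per group, the address of value as seen inside j and the address of copy(value). The name addrs is made up for illustration. Whether the shared column shows the same address for every group depends on the data.table version, so no particular output is claimed:

```r
library(data.table)

set.seed(23)
Example <- data.table(group = rep(c("a", "b"), each = 5), value = runif(10))

# For each group, record where `value` lives and where a fresh copy lives.
addrs <- Example[, list(shared = address(value),
                        copied = address(copy(value))),
                 by = group]
print(addrs)
```

Within each row, copied differs from shared, because copy() allocates a new vector while the group buffer is still in use.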

# on RStudioGD
debug(graphics:::plot.default)
set.seed(23)
Example <- data.table('group' = rep(c('a', 'b'), each = 5), 'value' = runif(10))
layout(1:2)
par('mai' = rep(.5, 4))
Example[, plot(value, ylim = c(0, 1)), by = group] # Example 1
undebug(graphics:::plot.default)

Keep hitting Enter, and you'll see that the first plot is drawn correctly, but when the second plot is added, the first plot changes as well. This may be a consequence of changes in R v3.1+, which shallow-copies function arguments rather than deep-copying them (again, shooting in the dark here).

As a workaround, you can explicitly copy value:

Example[, plot(copy(value), ylim = c(0, 1)), by = group] # Example 1

This produces the right plot.
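Another possible workaround, not taken from the investigation above and only sketched here, is to split the values by group before plotting, so that each panel gets its own vector rather than the reused group buffer. It uses base R only; the names values, groups and by_group are illustrative:

```r
set.seed(23)
values <- runif(10)
groups <- rep(c("a", "b"), each = 5)

# split() returns a list with one separately allocated vector per group.
by_group <- split(values, groups)

layout(1:2)
par(mai = rep(.5, 4))
for (g in names(by_group)) {
  plot(by_group[[g]], ylim = c(0, 1), main = g)
}
```

This gives up the by = group syntax, but it does not depend on how the graphics device holds on to the data.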

Arun answered Nov 14 '22 22:11