
for loop with ggplots produces graphs with identical values but different headings

I have read lots of posts about using loops for ggplot to generate lots of graphs, but cannot find any that explain my problem...

I have a data frame and am trying to loop over 92 columns, creating a new graph for each column. I want to save each plot as a separate object. When I run my loop (code below) and print the graphs, all the graphs are correct. However, when I replace the print() call with assign(), the graphs are not correct. The titles change as they should, but the plotted values are all identical (they are all the values for the final graph). I found this out when I used plot_grid() to generate a figure of 10 plots: the graph titles and axis labels were all correct, but the values were identical!

My data set is large, so I have provided a small data set for illustration below.

Sample data frame:

library(ggplot2)
library(cowplot)
df <- as.data.frame(cbind(group=c(rep("A", 4), rep("B", 4)), a=sample(1:100, 8), b=sample(100:200, 8), c=sample(300:400, 8))) #make data frame
cols <- 2:4 #define columns for plots
for(i in 1:length(cols)){
  df[,cols[i]] <- as.numeric(as.character(df[,cols[i]]))
} #convert columns to numeric

Plots:

for (i in 1:length(cols)){
  g <- ggplot(df, aes(x=group, y=df[,cols[i]])) +
    geom_boxplot() +
    ggtitle(colnames(df)[cols[i]])
  print(g)
  assign(colnames(df)[cols[i]], g) #generate an object for each plot
}

plot_grid(a, b, c)

I suspect that when ggplot makes the plot, it only uses the data from the final value of i, or something like that. Is there a way around this?

I wish to do it like this, as there are a lot of graphs I wish to make and then I want to mix and match plots for figures.

Thanks!

Harry asked Apr 14 '16 15:04


2 Answers

I have cleaned up how you generated your sample data frame.

library(ggplot2)
library(cowplot)

df <- data.frame(group=c(rep("A", 4), rep("B", 4)),
                          a=sample(1:100, 8),
                          b=sample(100:200, 8),
                          c=sample(300:400, 8)) #make data frame

Using data.frame() directly is enough. It makes the code clearer and removes the need for the post-processing 'for loop' that converts the columns back to numeric. The conversion was only needed because cbind() on a mix of character and numeric vectors returns a character matrix, which as.data.frame() then turns into factor columns unless 'stringsAsFactors = FALSE' is set (the default before R 4.0.0). The numeric-to-character coercion can also be avoided by using cbind.data.frame() rather than cbind().
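To see why the original construction needed the conversion loop, here is a small sketch comparing cbind() with data.frame(). The values and the names m and d are made up for the illustration.

```r
# cbind() builds a matrix, and a matrix can hold only one type,
# so mixing character and numeric vectors coerces everything to character
m <- cbind(group = c("A", "B"), a = c(10, 20))
is.character(m)

# data.frame() keeps each column's own type
d <- data.frame(group = c("A", "B"), a = c(10, 20))
is.numeric(d$a)
```

Both checks return TRUE: the matrix has lost the numeric type, while the data frame column has kept it.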

I have also refactored the 'for loop' that generates your plots. You define an integer vector called 'cols' (cols <- 2:4) and then iterate over it to generate a plot from each column of data. This is unnecessary; we can create the range directly in the for statement, 'for (i in 2:ncol(df))', which iterates from 2 to 4 (the number of columns in your data frame). Starting from 2 skips column 1, which holds the grouping variable rather than values to plot. This is preferable because:

i) When reviewing your code the condition used is immediately apparent without searching through the rest of your code

ii) R has a number of functions/parameters similarly named to your variable 'cols' and it is best to avoid confusion.
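A related option, sketched here as an assumption rather than something taken from the original code, is to pick the columns by name, so the loop does not depend on the grouping column being first. The name value_cols is hypothetical.

```r
df <- data.frame(group = c(rep("A", 4), rep("B", 4)),
                 a = sample(1:100, 8),
                 b = sample(100:200, 8),
                 c = sample(300:400, 8))

# every column except the grouping column, whatever its position
value_cols <- setdiff(names(df), "group")

for (col in value_cols) {
  print(col)  # prints each value column name in turn
}
```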

With the code cleaned up we can now try to locate the cause of the bug:

library(ggplot2)
library(cowplot)

df <- data.frame(group=c(rep("A", 4), rep("B", 4)),
                          a=sample(1:100, 8),
                          b=sample(100:200, 8),
                          c=sample(300:400, 8)) #make data frame


for (i in 2:ncol(df)){

  g <- ggplot(df, aes(x=group, y=df[,i])) +
    geom_boxplot() +
    ggtitle(colnames(df)[i])

  print(g)
  assign(colnames(df)[i], g) #generate an object for each plot
}   

It's not immediately obvious why your code doesn't work. The suggestion by Imo has merit: saving your plots to a list would prevent your environment from getting cluttered with objects. However, it would not solve this bug. The cause is unintuitive: aes() stores the expression df[,i] without evaluating it, and the expression is only evaluated when the plot is drawn. print(g) inside the loop draws each plot while i still has the right value, but the objects stored with assign() are drawn later, after the loop has finished, so every one of them sees the last value of i. See the answer provided here by Konrad Rudolph.

The following should work and retains the style of your original code. As Konrad suggests in his answer, it might be more "R"-like to use lapply. Note that the body of the for loop now runs inside local(), which gives it its own scope, and that we re-define i locally so that each plot keeps its own copy. Note also the use of <<- to assign g to the global environment.

for (i in 2:ncol(df)) {
  local({
    i <- i # local copy of i, kept with this plot
    g <<- ggplot(df, aes(x=group, y=df[,i])) +
      geom_boxplot() +
      ggtitle(colnames(df)[i])
    print(i)
    print(g)
    assign(colnames(df)[i], g, pos = 1) #generate an object for each plot in the global environment
  })
}

plot_grid(a, b, c)
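The same effect can be reproduced without ggplot2. This is only an analogy, using plain closures in place of plots (the names late and fixed are made up for the illustration): a function created in a for loop looks up i when it is called, not when it is created, much as a stored plot looks up i when it is drawn.

```r
# without local(): every function looks up the same i, which ends at 3
late <- list()
for (i in 1:3) {
  late[[i]] <- function() i
}
sapply(late, function(f) f())   # 3 3 3

# with local(): each function gets its own copy of i
fixed <- list()
for (i in 1:3) {
  local({
    i <- i
    fixed[[i]] <<- function() i
  })
}
sapply(fixed, function(f) f())  # 1 2 3
```

The first list mirrors the plots that all show the final column; the second mirrors the local() version of the loop.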

You owe me a drink.

Graeme answered Nov 20 '22 12:11


There are two standard ways to deal with this problem:

1- Work with a long-format data.frame

2- Use aes_string to refer to variable names in the wide format data.frame

Here's an illustration of possible strategies.

library(ggplot2)
library(gridExtra)

# data from other answer
df <- data.frame(group=c(rep("A", 4), rep("B", 4)),
                 a=sample(1:100, 8),
                 b=sample(100:200, 8),
                 c=sample(300:400, 8))

## first method: long format
m <- reshape2::melt(df, id = "group")
p <- ggplot(m, aes(x=group, y=value)) +
    geom_boxplot() 

pl <- plyr::dlply(m, "variable", function(.d) p %+% .d + ggtitle(unique(.d$variable)))
grid.arrange(grobs=pl)

## second method: keep wide format
one_plot <- function(col = "a")  ggplot(df, aes_string(x="group", y=col)) +  geom_boxplot() + ggtitle(col)
pl <- plyr::llply(colnames(df)[-1], one_plot)
grid.arrange(grobs=pl)

## third method: more explicit looping

pl <- vector("list", length = ncol(df)-1)
for(ii in seq_along(pl)){
  .col <- colnames(df)[-1][ii]
  .p <- ggplot(df, aes_string(x="group", y=.col)) +  geom_boxplot() + ggtitle(.col)
  pl[[ii]] <- .p
}

grid.arrange(grobs=pl)

Sometimes, when wrapping a ggplot call inside a function/for loop one faces issues with local variables (not the case here, if aes_string is used). In such cases one can define a local environment.

Note that using a construct like aes(y=df[,i]) may appear to work, but can produce very wrong results. Consider a faceted plot: the data.frame is split into different groups for each panel, and this subsetting can fail miserably to group the right data if a vector of values is passed directly to aes() instead of a variable name.
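In more recent ggplot2 releases aes_string() is deprecated, and the documented way to refer to a column by a string is the .data pronoun. Here is a minimal sketch of the second method written that way, assuming ggplot2 3.0.0 or later; the names value_cols and one_plot2 are hypothetical.

```r
library(ggplot2)

df <- data.frame(group = c(rep("A", 4), rep("B", 4)),
                 a = sample(1:100, 8),
                 b = sample(100:200, 8),
                 c = sample(300:400, 8))

value_cols <- setdiff(names(df), "group")

# .data[[col]] looks the column up by name inside the plot's data,
# so nothing depends on a loop index when the plot is drawn
one_plot2 <- function(col) {
  ggplot(df, aes(x = group, y = .data[[col]])) +
    geom_boxplot() +
    ggtitle(col)
}

# a named list of plots: pl$a, pl$b, pl$c
pl <- lapply(value_cols, one_plot2)
names(pl) <- value_cols
```

The list can then be passed to grid.arrange(grobs = pl) as above, or to cowplot's plot_grid(plotlist = pl), and individual plots can be mixed and matched by name.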

baptiste answered Nov 20 '22 12:11