How to subset a data frame by a factor and repeat a plot for each subset?

Tags: split, r, ggplot2

I am new to R. Forgive me if this question has an obvious answer, but I have not been able to find a solution. I have experience with SAS and may just be thinking about this problem in the wrong way.

I have a dataset with repeated measures from hundreds of subjects with each subject having multiple measurements across different ages. Each subject is identified by an ID variable. I'd like to plot each measurement (let's say body WEIGHT) by AGE for each individual subject (ID).

I've used ggplot2 to do something like this:

ggplot(data = dataset, aes(x = AGE, y = WEIGHT )) + geom_line() + facet_wrap(~ID)

This works well for a small number of subjects but won't work for the entire dataset.

I've also tried something like this:

ggplot(data = dataset, aes(x = AGE, y = WEIGHT, group = ID, colour = ID)) + geom_line()

This also works for a small number of subjects but is unreadable with hundreds of subjects.

I've tried to subset using code like this:

temp <- split(dataset,dataset$ID)

but I'm not sure how to work with the resulting dataset. Or perhaps there is a way to simply adjust the facet_wrap so that individual plots are created?

Thanks!

asked Oct 02 '13 20:10 by Matt


2 Answers

Because you want to split up the dataset and make a plot for each level of a factor, I would approach this with one of the split-apply-combine tools from the plyr package.

Here is a toy example using the mtcars dataset. I first create the plot and name it p, then use dlply to split the dataset by a factor and return a plot for each level. I'm taking advantage of %+% from ggplot2 to replace the data.frame in a plot.

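library(ggplot2)
library(plyr)

# Base plot; the data are swapped out for each subset below
p = ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
    geom_line()

# Split mtcars by cyl and return one plot per level
dlply(mtcars, .(cyl), function(x) p %+% x)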

This returns all the plots, one after another. If you name the resulting list object you can also call one plot at a time.

plots = dlply(mtcars, .(cyl), function(x) p %+% x)
plots[[1]]   # or plots[["4"]] to select by factor level
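
The list returned by split() in the question can be used in much the same way with base R. This is a minimal sketch, assuming only ggplot2 is loaded and using mtcars as a stand-in for the real data. The helper name plot_one and the object names pieces and plots_base are made up for illustration.

```r
library(ggplot2)

# Split mtcars into a named list of data frames, one per level of cyl
pieces <- split(mtcars, mtcars$cyl)

# Hypothetical helper: build the plot for one subset
plot_one <- function(d) {
    ggplot(data = d, aes(x = wt, y = mpg)) +
        geom_line()
}

# One ggplot object per subset; list names are the cyl values ("4", "6", "8")
plots_base <- lapply(pieces, plot_one)

# Show a single plot by name
plots_base[["6"]]
```

With the data from the question, the same pattern would presumably be split(dataset, dataset$ID) followed by lapply over the pieces.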

Edit

I started thinking about putting a title on each plot based on the factor, which seems like it would be useful.

dlply(mtcars, .(cyl), function(x) p %+% x + facet_wrap(~cyl))
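
If a plain title is preferred over a facet strip, ggtitle() is one alternative. This is a sketch, assuming the same base plot p and assuming each subset x holds a single cyl value. The object name titled is made up for illustration.

```r
library(ggplot2)
library(plyr)

p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
    geom_line()

# Swap in each subset and label it with its (single) cyl value
titled <- dlply(mtcars, .(cyl), function(x) {
    p %+% x + ggtitle(paste("cyl =", unique(x$cyl)))
})

# Show the plot for the 4-cylinder subset
titled[["4"]]
```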

Edit 2

Here is one way to save the plots in a single document, one plot per page, working from the list named plots. I didn't change any of the pdf defaults (so the file is written as Rplots.pdf in the working directory), but you can certainly explore the options it offers.

pdf()
plots
dev.off()

Update: the same thing with the dplyr package instead of plyr. The work is done in do(), and the output has a named column that contains all the plots as a list.

library(dplyr)
plots = mtcars %>%
    group_by(cyl) %>%
    do(plots = p %+% . + facet_wrap(~cyl))


Source: local data frame [3 x 2]
Groups: <by row>

  cyl           plots
1   4 <S3:gg, ggplot>
2   6 <S3:gg, ggplot>
3   8 <S3:gg, ggplot>

To see the plots in R, just ask for the column that contains the plots.

plots$plots

And to save as a PDF:

pdf()
plots$plots
dev.off()
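
Typing the name of a list of plots at the console draws every plot because of auto-printing. Inside a function, a loop, or a file run with source(), nothing is printed automatically, so the PDF can come out empty. Here is a sketch of a more explicit version, assuming a plain list of ggplot objects. The names plot_list and out_file and the file name cyl_plots.pdf are arbitrary examples.

```r
library(ggplot2)

# Build a small list of plots to stand in for the list of plots above
plot_list <- lapply(split(mtcars, mtcars$cyl), function(d) {
    ggplot(d, aes(x = wt, y = mpg)) +
        geom_line() +
        facet_wrap(~cyl)
})

# "cyl_plots.pdf" is an arbitrary example name, written to a temp directory
out_file <- file.path(tempdir(), "cyl_plots.pdf")

pdf(out_file)
for (pl in plot_list) {
    print(pl)   # explicit print: each plot goes on its own page
}
dev.off()
```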

answered Sep 21 '22 04:09 by aosmith


A few years ago, I wanted to do something similar - plot individual trajectories for ~2500 participants with 1-7 measurements each. I did it like this, using plyr and ggplot2:

library(plyr)
library(ggplot2)

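# Make sure the output directory exists
dir.create("plots", showWarnings = FALSE)

d_ply(dat, .variables = "participant_id", .fun = function(x) {

    # Generate the desired plot
    p <- ggplot(x, aes(x = phase, y = result)) +
        geom_point() +
        geom_line()

    # Save it to a file named after the participant
    # Putting it in a subdirectory is prudent
    # x holds a single participant, so take the first ID value
    ggsave(file.path("plots", paste0(x$participant_id[1], ".png")), plot = p)

})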

A little slow, but it worked. If you want to get a sense of all participants' trajectories in one plot (like your second example - aka the spaghetti plot), you can tweak the transparency of the lines (forget coloring them, though):

ggplot(data = dat, aes(x = phase, y = result, group = participant_id)) + 
    geom_line(alpha = 0.3)
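
To try the spaghetti plot without the original data, here is a self-contained sketch with simulated trajectories. The data frame dat and its column names mirror the ones above, but the values are random and purely illustrative, and the number of participants and phases is an arbitrary choice.

```r
library(ggplot2)

set.seed(1)

# 300 simulated participants, 5 phases each
dat <- expand.grid(phase = 1:5, participant_id = 1:300)

# Random intercept per participant, a common upward trend, and noise
intercepts <- rnorm(300, mean = 50, sd = 10)
dat$result <- intercepts[dat$participant_id] + 2 * dat$phase + rnorm(nrow(dat))

# One semi-transparent line per participant
spaghetti <- ggplot(data = dat, aes(x = phase, y = result, group = participant_id)) +
    geom_line(alpha = 0.3)

spaghetti
```

Lower alpha values make dense regions stand out more clearly as the number of lines grows.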

answered Sep 22 '22 04:09 by Matt Parker