subset data.frame for ggplot2 bar chart

Tags: r, ggplot2

I have the following data:

    Splice.Pair  proportion
1         AA-AG 0.010909091
2         AA-GC 0.003636364
3         AA-TG 0.003636364
4         AA-TT 0.007272727
5         AC-AC 0.003636364
6         AC-AG 0.003636364
7         AC-GA 0.003636364
8         AC-GG 0.003636364
9         AC-TC 0.003636364
10        AC-TG 0.003636364
11        AC-TT 0.003636364
12        AG-AA 0.010909091
13        AG-AC 0.007272727
14        AG-AG 0.003636364
15        AG-AT 0.003636364
16        AG-CC 0.003636364
17        AG-CT 0.007272727
...       ...   ...

I want a bar chart visualising the proportion of each splice pair, but only for splice pairs that have a proportion over, say, 0.004. I tried the following:

nc.subset <- subset(nc.dat, proportion > 0.004)
qplot(Splice.Pair, proportion, data=nc.subset, geom="bar", xlab="Splice Pair", ylab="Proportion of total non-canonical splice sites") + coord_flip()

But this just gives me a bar chart with all splice pairs on the Y-axis, except that the splice pairs that were filtered out are missing bars.

I have no idea what is happening to allow all categories to still be present :s

asked Aug 12 '11 15:08 by MattLBeck


1 Answer

What's happening is that Splice.Pair is a factor. When you subset your data frame, the factor retains its levels attribute, which still has all of the original levels. You can avoid this kind of problem by simply wrapping your subsetting in droplevels:

nc.subset <- droplevels(subset(nc.dat, proportion > 0.004))
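To see the retained levels directly, here is a minimal sketch on a small made-up data frame. The `toy` data and the variable names are hypothetical stand-ins for nc.dat, and the sketch assumes Splice.Pair is stored as a factor:

```r
# Hypothetical toy data standing in for nc.dat
toy <- data.frame(
  Splice.Pair = factor(c("AA-AG", "AA-GC", "AA-TG", "AA-TT")),
  proportion  = c(0.010909091, 0.003636364, 0.003636364, 0.007272727)
)

kept <- subset(toy, proportion > 0.004)
nlevels(kept$Splice.Pair)          # unused levels are still attached

kept.dropped <- droplevels(kept)
levels(kept.dropped$Splice.Pair)   # only the levels present in the kept rows
```

Comparing `levels()` before and after `droplevels` is a quick way to check whether stale levels are behind the empty categories on an axis.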

More generally, if you dislike this kind of automatic retention of levels with factors, you can set R to store strings as character vectors rather than factors by default by setting:

options(stringsAsFactors = FALSE)

at the beginning of your R session (stringsAsFactors can also be passed as an argument to data.frame). Since R 4.0.0, FALSE is already the default.
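As a rough illustration of the per-call form, the sketch below builds a hypothetical data frame (`toy.chr` is made up, not the asker's data) with the argument set explicitly, so the result does not depend on the session default:

```r
# Hypothetical toy data; stringsAsFactors is passed explicitly to data.frame
toy.chr <- data.frame(
  Splice.Pair = c("AA-AG", "AA-GC", "AA-TT"),
  proportion  = c(0.010909091, 0.003636364, 0.007272727),
  stringsAsFactors = FALSE
)

is.character(toy.chr$Splice.Pair)   # the column is a plain character vector

kept.chr <- subset(toy.chr, proportion > 0.004)
unique(kept.chr$Splice.Pair)        # no levels attribute to carry over
```

With character columns there are no levels to go stale, though a plotting function may still convert the column to a factor internally, using only the values it is given.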

EDIT

Regarding the issue of running older versions of R that may lack droplevels, @rcs points out in a comment that the method for a single factor is very simple to implement on your own. The method for data frames is only slightly more complicated:

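# Data-frame method: re-create every factor column so unused levels are dropped
droplevels.data.frame <- function (x, except = NULL, ...) 
{
    # TRUE for each column that is a factor
    ix <- vapply(x, is.factor, NA)
    # Leave the columns listed in `except` untouched
    if (!is.null(except)) 
        ix[except] <- FALSE
    # factor() applied to a factor keeps only the levels that occur
    x[ix] <- lapply(x[ix], factor)
    x
}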

But of course, the best solution is still to upgrade to the latest version of R.
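For the single-factor case mentioned above, one possible sketch is shown here. The helper name `drop.unused` is hypothetical, and this is not necessarily the exact code from the comment; it relies only on the fact that calling factor() on a factor rebuilds the levels from the values present:

```r
# Hypothetical helper name; re-creating a factor keeps only the levels in use
drop.unused <- function(f) factor(f)

f <- factor(c("AA-AG", "AA-GC", "AA-TT"))
f.sub <- f[f != "AA-GC"]

levels(f.sub)                # still lists all three original levels
levels(drop.unused(f.sub))   # only the levels that remain in f.sub
```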

answered Oct 02 '22 23:10 by joran