
How do I filter a data.frame in R by categorical variable?

Just learning R.

Given a data.frame in R with two columns, one numeric and one categorical, how do I extract a portion of the data.frame for usage?

str(ex0331)
'data.frame':   36 obs. of  2 variables:
$ Iron      : num  0.71 1.66 2.01 2.16 2.42 ...
$ Supplement: Factor w/ 2 levels "Fe3","Fe4": 1 1 1 1 1 1 1 1 1 1 ...

Basically, I need to be able to operate on the two factor levels separately; i.e., I need to determine the length, mean, sd, etc. of the Iron retention rate for each Supplement type (Fe3 or Fe4).

What's the easiest way to accomplish this?

I'm aware of the by() command. For example, the following gets some of what I need:

by(ex0331, ex0331$Supplement, summary)
ex0331$Supplement: Fe3
     Iron       Supplement
Min.   :0.710   Fe3:18    
1st Qu.:2.420   Fe4: 0    
Median :3.475             
Mean   :3.699             
3rd Qu.:4.472             
Max.   :8.240             
------------------------------------------------------------ 
ex0331$Supplement: Fe4
     Iron        Supplement
Min.   : 2.200   Fe3: 0    
1st Qu.: 3.892   Fe4:18    
Median : 5.750             
Mean   : 5.937             
3rd Qu.: 6.970             
Max.   :12.450      

But I need more flexibility. I need to apply axis commands, for example, or log() functions by group. I'm sure there's an easy way to do this; I just don't see it. All of the data.frame manipulation documentation I've seen is for numerical rather than categorical variables.

Stephen O'Grady asked Feb 19 '11 18:02


2 Answers

You can get a subset of your data by indexing or using subset:

ex0331 <- data.frame( iron=rnorm(36), supplement=c("Fe3","Fe4"))

subset(ex0331, supplement=="Fe3")
subset(ex0331, supplement=="Fe4")

ex0331[ex0331$supplement=="Fe3",]
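Once you have a subset, the usual vector functions apply to its columns. A minimal sketch, assuming simulated data in place of the real ex0331 (the seed, the rexp() draws and the name fe3 are stand-ins for illustration, not part of the original data):

```r
# Simulated stand-in for ex0331: 18 positive values per supplement
set.seed(1)
ex0331 <- data.frame(iron = rexp(36),
                     supplement = rep(c("Fe3", "Fe4"), each = 18))

# Keep only the Fe3 rows, then treat the iron column as a plain vector
fe3 <- subset(ex0331, supplement == "Fe3")

length(fe3$iron)   # number of Fe3 observations
mean(fe3$iron)     # group mean
sd(fe3$iron)       # group standard deviation
log(fe3$iron)      # element-wise log for this group only
```

The same pattern works for Fe4; any function that takes a numeric vector can be applied to fe3$iron.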

Or at once with split, resulting in a list:

split(ex0331,ex0331$supplement)
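The list returned by split can be indexed by group name, or looped over to compute several statistics at once. A rough sketch, again on simulated stand-in data (the names groups and stats are hypothetical):

```r
# Simulated stand-in for ex0331: 18 positive values per supplement
set.seed(1)
ex0331 <- data.frame(iron = rexp(36),
                     supplement = rep(c("Fe3", "Fe4"), each = 18))

# One data frame per supplement, stored in a named list
groups <- split(ex0331, ex0331$supplement)
names(groups)      # "Fe3" "Fe4"
head(groups$Fe4)   # first rows of the Fe4 data frame

# Several statistics per group; sapply binds the results into a matrix
# with one column per supplement
stats <- sapply(groups, function(g) {
  c(n = length(g$iron), mean = mean(g$iron), sd = sd(g$iron))
})
stats
```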

Another thing you can do is use tapply to split by a factor and then perform a function:

tapply(ex0331$iron,ex0331$supplement,mean)
        Fe3         Fe4 
-0.15443861 -0.01308835 

The plyr package can also be used, which has loads of useful functions. For example:

library(plyr)
daply(ex0331, .(supplement), function(x) mean(x$iron))
        Fe3         Fe4 
-0.15443861 -0.01308835 

Edit

In response to edited question, you could get the log of iron per supplement with:

ex0331 <- data.frame( iron=abs(rnorm(36)), supplement=c("Fe3","Fe4"))

tapply(ex0331$iron,ex0331$supplement,log)

Or with plyr:

library(plyr)
dlply(ex0331,.(supplement),function(x)log(x$iron))

Both return a list. I'm sure there is an easier way than the wrapper function in the plyr example, though.
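One possible way to avoid the wrapper function, sketched in base R rather than plyr: split the iron vector itself and pass log directly to lapply. This assumes the same kind of simulated data as above; log_by_group is a hypothetical name.

```r
# Simulated data as in the edit above; the two supplement labels are
# recycled across the 36 rows, so the groups alternate row by row
set.seed(1)
ex0331 <- data.frame(iron = abs(rnorm(36)),
                     supplement = c("Fe3", "Fe4"))

# Split the numeric vector by supplement, then take the log of each piece
log_by_group <- lapply(split(ex0331$iron, ex0331$supplement), log)
str(log_by_group)
```

The result is a named list with one numeric vector per supplement, the same shape the tapply call produces.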

Sacha Epskamp answered Sep 19 '22 00:09


I'd recommend the ddply function from the plyr package; detailed documentation is available online:

> require(plyr)
> ddply( ex0331, .(Supplement), summarise, 
         mean = mean(Iron), 
         sd = sd(Iron), 
         len = length(Iron))

  Supplement       mean        sd len
1        Fe3 -0.3749169 0.2827360   4
2        Fe4  0.1953116 0.7128129   6

Update. To add a LogIron column where each entry is the log() of the Iron value, you would simply use transform:

> transform(ex0331, LogIron = log(Iron))

         Iron Supplement     LogIron
1  0.07185141        Fe3 -2.63315498
2  1.10367297        Fe3  0.09864368
3  0.48592428        Fe3 -0.72170246
4  0.20286918        Fe3 -1.59519393
5  0.80830682        Fe4 -0.21281357

Or, to create a summary that is the "mean of the log Iron values, per Supplement", you would do:

> ddply( ex0331, .(Supplement), summarise, meanLog = mean(log(Iron)))
  Supplement    meanLog
1        Fe3 -1.0062304
2        Fe4  0.2791507
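If plyr is not available, a similar per-group summary can be sketched with base R's aggregate. This is a swapped-in alternative, not the ddply call above, and it runs on simulated stand-in data (the seed, the rexp() draws and the name mean_log are assumptions for illustration):

```r
# Simulated stand-in for ex0331 with the question's column names
set.seed(1)
ex0331 <- data.frame(Iron = rexp(36),
                     Supplement = factor(rep(c("Fe3", "Fe4"), each = 18)))

# One row per Supplement holding the mean of log(Iron)
mean_log <- aggregate(Iron ~ Supplement, data = ex0331,
                      FUN = function(x) mean(log(x)))
names(mean_log)[2] <- "meanLog"   # rename the summary column
mean_log
```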
Prasad Chalasani answered Sep 17 '22 00:09