 

Count factors occurring in group in R

Tags:

dataframe

r

This is my data:

> head(Kandula_for_n)
                date      dist  date_only
1 2005-05-08 12:00:00  138.5861 2005-05-08
2 2005-05-08 16:00:00 1166.9265 2005-05-08
3 2005-05-08 20:00:00 1270.7149 2005-05-08
6 2005-05-09 08:00:00  233.1971 2005-05-09
7 2005-05-09 12:00:00 1899.9530 2005-05-09
8 2005-05-09 16:00:00  726.8363 2005-05-09

I would now like to add a column with the count (n) of the data entries (dist) per day. For 2005-05-08, this would be n=3, as there are 3 data entries at 12, 16 and 20 o'clock. I have applied the following code, which gave me the counts I wanted:

ndist <-tapply(1:NROW(Kandula_for_n), Kandula_for_n$date_only, function(x) length(unique(x)))

After ndist<-as.data.frame(ndist), I got this:

> head(ndist)
           ndist
2005-05-08     3
2005-05-09     4
2005-05-10     6
2005-05-11     4
2005-05-12     6
2005-05-13     6

The problem is that the count appears together with date_only in a single column called ndist. I need them in two separate columns, one with the count and one with date_only. How can this be done? I guess it's rather simple, but I just don't get it. I would appreciate any thoughts on this.

Thanks for your efforts.

Jan Blanke asked Oct 25 '11 19:10



1 Answer

Simply because I find tapply() hard to wrap my brain around, I like using plyr for these types of things:

## make up some data
## you get better/faster/more answers if you do this bit for us :)
dates <- seq(Sys.Date(), Sys.Date() + 5, by = 1)
Kandula_for_n <- data.frame(date_only = sample( dates + 5, 10, replace=TRUE ) , dist=rnorm(10) )

library(plyr)  # library() stops with an error if plyr is not installed
ddply(Kandula_for_n, "date_only", function(x) data.frame(x, ndist=nrow(x)) )

This will give you something like:

    date_only       dist ndist
1  2011-10-30  0.2434168     5
2  2011-10-30 -0.9361780     5
3  2011-10-30  1.4593197     5
4  2011-10-30 -0.1851402     5
5  2011-10-30  0.6652419     5
6  2011-10-31  0.8876420     1
7  2011-11-03  0.5087175     2
8  2011-11-03 -1.0065152     2
9  2011-11-04  0.4236352     2
10 2011-11-04  0.4535686     2

The ddply line:

ddply(Kandula_for_n, "date_only", function(x) data.frame(x, ndist=nrow(x)) )

takes the input data, groups it by the date_only field, and for every unique value applies the anonymous function to the data frame made up of only the records with that value of date_only. My anonymous function simply takes the data.frame x and appends a column named ndist, which is the number of rows in x.
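If you would rather not depend on plyr, the same split/apply/combine idea can be sketched in base R. This is only a rough illustration: the toy data below is made up to mirror the layout in the question, and the object names counts, with_n and with_count are my own, not from the original post.

```r
# Toy data in the same shape as the question (values are for illustration only)
Kandula_for_n <- data.frame(
  date_only = as.Date(c("2005-05-08", "2005-05-08", "2005-05-08",
                        "2005-05-09", "2005-05-09")),
  dist = c(138.5861, 1166.9265, 1270.7149, 233.1971, 1899.9530)
)

# 1) One row per day, with the date and the count in two separate columns.
#    table() counts the rows per date; as.data.frame() turns the table
#    names into a regular column instead of row names.
counts <- as.data.frame(table(date_only = Kandula_for_n$date_only),
                        responseName = "ndist")

# 2) The ddply() call spelled out by hand: split by date, append the
#    group size to each piece, then stack the pieces back together.
pieces <- split(Kandula_for_n, Kandula_for_n$date_only)
with_n <- do.call(rbind, lapply(pieces, function(x) {
  data.frame(x, ndist = nrow(x))
}))

# 3) A shortcut for the same per-row count: ave() applies length() within
#    each date group and returns a vector aligned with the original rows.
with_count <- Kandula_for_n
with_count$ndist <- ave(seq_len(nrow(with_count)),
                        with_count$date_only, FUN = length)
```

Note that date_only in counts comes back as a factor, so wrap it in as.Date(as.character(...)) if you need real dates. The first option is probably closest to what the question asks for; the other two give one count per original row, like the ddply output above.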

JD Long answered Oct 21 '22 12:10
