
Help with R and grouping/aggregate/*apply/data.table

I'm very new to R and am having trouble running functions to get the answers I need. I have example data, PCSTest,

http://pastebin.com/z9Ti3nHB

that looks something like this:

Date        Site            Word
--------------------------------------
9/1/2012    slashdot        javascript
9/1/2012    stackexchange   R
9/1/2012    reddit          R
9/1/2012    slashdot        javascript
9/1/2012    stackexchange   javascript
9/5/2012    reddit          R
9/8/2012    slashdot        javascript
9/8/2012    stackexchange   R
9/8/2012    reddit          R
9/8/2012    slashdot        javascript
9/18/2012   stackexchange   R
9/18/2012   reddit          R
9/18/2012   slashdot        javascript
9/18/2012   stackexchange   R
9/27/2012   reddit          R
9/27/2012   slashdot        R

My goal is to look for trends in the occurrences of different words as they relate to sites over time. I can count them:

library(plyr)   
PCSTest <- read.csv(file="c:/PCS/PCS Data - Test.csv", header=TRUE)
PCSTest$Date <- as.Date(PCSTest$Date, "%m/%d/%Y")
PCSTest$Date <- as.POSIXct(PCSTest$Date)
countTest <- count(PCSTest, c("Date", "Site", "Word"))

which gives this:

                  Date          Site       Word freq
1  2012-08-31 20:00:00        reddit          R    4
2  2012-08-31 20:00:00      slashdot javascript    7
3  2012-08-31 20:00:00 stackexchange javascript    1
4  2012-08-31 20:00:00 stackexchange          R    2
5  2012-09-01 20:00:00        reddit javascript    2
6  2012-09-01 20:00:00      slashdot          R    3
7  2012-09-04 20:00:00        reddit          R    1
8  2012-09-07 20:00:00        reddit          R    1
9  2012-09-07 20:00:00      slashdot javascript    2
10 2012-09-07 20:00:00 stackexchange          R    1
11 2012-09-09 20:00:00 stackexchange javascript    4
12 2012-09-10 20:00:00      slashdot          R    4
13 2012-09-14 20:00:00        reddit javascript    4
14 2012-09-17 20:00:00        reddit          R    4
15 2012-09-17 20:00:00      slashdot javascript    1
16 2012-09-17 20:00:00 stackexchange          R    2
17 2012-09-19 20:00:00        reddit javascript    2
18 2012-09-23 20:00:00 stackexchange javascript    2
19 2012-09-24 20:00:00        reddit javascript    3
20 2012-09-24 20:00:00 stackexchange javascript    1
21 2012-09-24 20:00:00 stackexchange          R    4
22 2012-09-25 20:00:00        reddit javascript    5
23 2012-09-25 20:00:00      slashdot javascript    3
24 2012-09-25 20:00:00 stackexchange          R    7
25 2012-09-26 20:00:00        reddit          R    1
26 2012-09-26 20:00:00      slashdot          R    5

or plot them all:

library(ggplot2)
ggplot(data=countTest, aes(x=Date, y=freq, group=interaction(Site, Word), colour=interaction(Site, Word), shape=Site)) + geom_line() + geom_point()

[Figure: my plot of frequency per day for words per site]

I need to do some calculations on the data now, so I've tried aggregate

aggregate(freq ~ Site + Word, data = countTest,  function(freq) cbind(mean(freq), max(freq)))[order(-agg$freq[,3]),]

which gives:

           Site       Word freq.1 freq.2
2      slashdot javascript   3.25   7.00
5      slashdot          R   4.00   5.00
1        reddit javascript   3.20   5.00
4        reddit          R   2.20   4.00
6 stackexchange          R   3.20   7.00
3 stackexchange javascript   2.00   4.00

What I would like in that last result is a column that has the average frequency per day, something like... sum(freq)/20 days, calculated from the data, maybe a moving average even. Also, I would like another column with the slope/linear regression. How would I calculate that in the aggregate function?

Or, how would I make any of this better/faster? I know that there are apply and data.table functions but I don't see how I would use them. Any help would be greatly appreciated!
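
One possible reading of the requests above, sketched with base R's split/lapply family rather than aggregate: the function passed to aggregate only sees freq, so it cannot regress on Date. The toy data frame `counts`, the helper `summarise_group`, and the choice of the inclusive calendar span as the number of days are all assumptions made for illustration; none of them come from the original data.

```r
# Toy stand-in for countTest (values are made up for illustration).
counts <- data.frame(
  Date = as.Date(c("2012-09-01", "2012-09-05", "2012-09-09",
                   "2012-09-01", "2012-09-03", "2012-09-09")),
  Site = c("reddit", "reddit", "reddit", "slashdot", "slashdot", "slashdot"),
  Word = c("R", "R", "R", "javascript", "javascript", "javascript"),
  freq = c(1, 2, 3, 4, 4, 4)
)

# Assumption: "days" means the calendar span of the whole data set, inclusive.
n_days <- as.numeric(diff(range(counts$Date))) + 1

# Hypothetical helper: summary statistics for one Site/Word group.
summarise_group <- function(g) {
  # as.numeric() on a Date counts days, so the slope is in counts per day.
  fit <- lm(freq ~ as.numeric(Date), data = g)
  data.frame(
    Site = g$Site[1],
    Word = g$Word[1],
    mean_freq = mean(g$freq),
    max_freq = max(g$freq),
    ave_per_day = sum(g$freq) / n_days,
    slope_per_day = unname(coef(fit)[2])
  )
}

# Split into one data frame per Site/Word pair, summarise each, and stack.
groups <- split(counts, list(counts$Site, counts$Word), drop = TRUE)
result <- do.call(rbind, lapply(groups, summarise_group))
rownames(result) <- NULL
print(result)
```

A group with a single observation would give an NA slope, so real data may need a guard for that case.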

Oleg asked Nov 11 '22 02:11


1 Answer

I'm not sure exactly what you want to do, but dplyr (or plyr) will help. Here are some examples. If you state explicitly what you want, you will get more help.

d <- read.csv("~/Downloads/r_data.txt")
d$Date <- as.POSIXct(as.Date(d$Date, "%m/%d/%Y"))

library(dplyr)
d.cnt <- d %>% group_by(Date, Site, Word) %>% summarise(cnt = n())

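# average per day: choose ONE definition of the number of days
# (a) calendar span between the first and last date: gives 26 days
date.range <- d$Date %>% range %>% diff %>% as.numeric(units = "days")
# (b) number of distinct dates that have data: gives 13 days (overwrites (a) if both lines are run)
date.range <- d$Date %>% unique %>% length
d.ave <- d.cnt %>% group_by(Site, Word) %>% summarize(ave_per_day = sum(cnt)/date.range)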

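# slope: one linear fit per Site/Word group
# Date is POSIXct here, so the slope is in counts per second (multiply by 86400 for counts per day)
d.reg <- d.cnt %>% group_by(Site, Word) %>% 
  do({fit = lm(cnt ~ Date, data = .); data.frame(int = coef(fit)[1], slope = coef(fit)[2])})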

# plot the slope value
library(ggplot2)
ggplot(d.reg, aes(Site, slope, fill = Word)) + geom_bar(stat = "identity", position = "dodge")
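
The question also mentions a moving average, which the examples above do not cover. Here is a minimal base R sketch, assuming a trailing 3-day window and assuming that days without any rows count as zero. The names `obs`, `daily`, and `k` and the toy counts are hypothetical.

```r
# Toy daily counts for one Site/Word pair (made-up values).
obs <- data.frame(
  Date = as.Date(c("2012-09-01", "2012-09-02", "2012-09-04", "2012-09-05")),
  cnt  = c(3, 6, 3, 0)
)

# Fill in missing days with zero so the window covers calendar days,
# not just the days that happen to have rows.
all_days <- data.frame(Date = seq(min(obs$Date), max(obs$Date), by = "day"))
daily <- merge(all_days, obs, all.x = TRUE)
daily$cnt[is.na(daily$cnt)] <- 0

# Trailing k-day moving average. The first k - 1 entries are NA because
# the window is not yet full. stats::filter is named explicitly so that
# dplyr::filter does not mask it when dplyr is loaded.
k <- 3
daily$ma <- as.numeric(stats::filter(daily$cnt, rep(1 / k, k), sides = 1))
print(daily)
```

For the full data, the same steps could be applied to each Site/Word group in turn; whether missing days should be zeros or left out is a modelling choice.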

kohske answered Nov 15 '22 06:11