
Using dplyr for frequency counts of interactions, must include zero counts

My question involves writing code using the dplyr package in R.

I have a relatively large dataframe (approx 5 million rows) with 2 columns: the first with an individual identifier (id), and a second with a date (date). At present, each row indicates the occurrence of an action (taken by the individual in the id column) on the date in the date column. There are about 300,000 unique individuals, and about 2600 unique dates. For example, the beginning of the data look like this:

    id         date
    John12     2006-08-03
    Tom2993    2008-10-11
    Lisa825    2009-07-03
    Tom2993    2008-06-12
    Andrew13   2007-09-11

I'd like to reshape the data so that I have a row for every possible id x date pair, with an additional column which counts the total number of events that occurred (perhaps taking the value 0) for the listed individual on the given date.

I've had some success with the dplyr package, which I've used to tabulate the id x date counts that are observed in the data.

Here's the code I've used so far (my dataframe is called df):

# note: %>% replaces the deprecated %.% chaining operator; n() counts rows per group
reduced <- df %>%
  group_by(id, date) %>%
  summarize(count = n())

My problem is that (as I said above) I'd like to have a dataset that also includes 0s for id x date pairs that don't have any associated actions. For example, if there's no observed action for John12 on 2007-10-10, I'd like the output to return a row for that id x date pair, with a count of 0.

I considered building the summary above and then merging it with a frame of every possible id x date pair (roughly as sketched below), but I'm convinced there must be a simpler solution. Any suggestions are much appreciated!
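
For concreteness, the merge approach I have in mind would look roughly like this (just an untested sketch, building on the reduced summary above and assuming its count column):

# build every possible id x date pair, then merge the observed counts back in
all_pairs <- expand.grid(id = unique(df$id),
                         date = unique(df$date),
                         stringsAsFactors = FALSE)

merged <- merge(all_pairs, reduced, by = c("id", "date"), all.x = TRUE)
merged$count[is.na(merged$count)] <- 0   # pairs with no observed actions get 0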

asked May 20 '14 by Mark T Patterson


2 Answers

Here's a simple option, using data.table instead:

library(data.table)

dt = as.data.table(your_df)

setkey(dt, id, date)

# in versions 1.9.3+
dt[CJ(unique(id), unique(date)), .N, by = .EACHI]
#          id       date N
# 1: Andrew13 2006-08-03 0
# 2: Andrew13 2007-09-11 1
# 3: Andrew13 2008-06-12 0
# 4: Andrew13 2008-10-11 0
# 5: Andrew13 2009-07-03 0
# 6:   John12 2006-08-03 1
# 7:   John12 2007-09-11 0
# 8:   John12 2008-06-12 0
# 9:   John12 2008-10-11 0
#10:   John12 2009-07-03 0
#11:  Lisa825 2006-08-03 0
#12:  Lisa825 2007-09-11 0
#13:  Lisa825 2008-06-12 0
#14:  Lisa825 2008-10-11 0
#15:  Lisa825 2009-07-03 1
#16:  Tom2993 2006-08-03 0
#17:  Tom2993 2007-09-11 0
#18:  Tom2993 2008-06-12 1
#19:  Tom2993 2008-10-11 1
#20:  Tom2993 2009-07-03 0

In versions 1.9.2 or earlier, the equivalent expression omits the explicit by = .EACHI:

dt[CJ(unique(id), unique(date)), .N]

The idea is to create all possible pairs of id and date (which is what the CJ part does), and then join that back to the data, counting the occurrences for each pair.
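
To see the cross-join step on its own (just a sketch; with the example data it is 4 unique ids times 5 unique dates, i.e. 20 rows):

# CJ() builds every combination of its arguments, here all id x date pairs;
# the join dt[CJ(...), .N, by = .EACHI] above then counts matching rows for each pair
CJ(id = unique(dt$id), date = unique(dt$date))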

answered by eddi


This is how you could do it, although I use dplyr only in part: to calculate the frequencies in your original df, and for the left_join. As you already suggested in your question, I created a new data.frame with all id x date combinations and merged it with the existing one. Doing this exclusively in dplyr would presumably require rbinding many rows along the way, and I assume this approach is faster.

require(dplyr)

original <- read.table(header=T,text="    id         date
John12     2006-08-03
Tom2993    2008-10-11
Lisa825    2009-07-03
Tom2993    2008-06-12
Andrew13   2007-09-11", stringsAsFactors=F)

original$date <- as.Date(original$date) #convert to date

#count the events per id/date pair, leaving one row per observed combination
original <- original %>%
  group_by(id, date) %>%
  summarize(count = n())            

#create a sequence of dates spanning the range you need
dates <- seq(as.Date("2006-01-01"), as.Date("2009-12-31"), 1)    

#create a new df with expand.grid to get all combinations of id/date
#(unique() avoids duplicated ids; stringsAsFactors=FALSE keeps id as character for the join)
newdf <- expand.grid(id = unique(original$id), date = dates,
                     stringsAsFactors = FALSE)

#remove dates
rm(dates)

#join original and newdf to have the frequency counts from original df
newdf <- left_join(newdf, original, by=c("id","date"))   

#replace all NA with 0 for rows which were not in original df
newdf$count[is.na(newdf$count)] <- 0          
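
If you prefer to keep everything in a pipe, the final NA replacement can also be written with dplyr::coalesce() (a sketch, assuming dplyr >= 0.5 and the same newdf as above):

newdf <- newdf %>%
  mutate(count = coalesce(count, 0L))   # unmatched id x date pairs had NA counts; set them to 0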

answered by talat