I have been unable to find a solution to my query on Stack Overflow. This post is similar, but my dataset is slightly - and importantly - different (in that I have multiple measures of 'time' within my grouping variable).
I have observations of organisms at various sites, over time. The sites are further aggregated into larger areas, so I want to eventually have a function I can call in ddply to summarize the dataset for each of the time periods within the geographical areas. However, I'm having trouble getting the function I need.
Question
How do I cycle through time periods and compare with the previous time period, calculating the intersection (i.e. number of 'sites' occurring in both time periods) and the sum of the number occurring in each period?
Toy dataset:
time = c(1,1,1,1,2,2,2,3,3,3,3,3)
site = c("A","B","C","D","A","B","C","A","B","C","D","E")
df <- as.data.frame(cbind(time,site))
df$time = as.numeric(df$time)
My function
dist2 <- function(df){
for(i in unique(df$time))
{
intersection <- length(which(df[df$time==i,"site"] %in% df[df$time==i- 1,"site"]))
both <- length(unique(df[df$time==i,"site"])) + length(unique(df[df$time==i-1,"site"]))
}
return(as.data.frame(cbind(time,intersection,both)))
}
dist2(df)
What I get:
dist2(df) time intersection both 1 1 3 8 2 1 3 8 3 1 3 8 4 1 3 8 5 2 3 8 6 2 3 8 7 2 3 8 8 3 3 8 9 3 3 8 10 3 3 8 11 3 3 8 12 3 3 8
What I expect (hoped!) to achieve:
time intersection both
1 1 NA 4
2 2 3 7
3 3 3 8
Once I have a working function, I want to use it with ddply on the whole data set to calculate these value for each area.
Many thanks for any pointers, tips, advice!
I am running:
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
The DATEDIFF function is one of the mainly used built-in functions of Tableau, which allows you to calculate the difference between the two given dates. The basic syntax of the DATEDIFF function is given below.
Window SUM in tableau performs a total of the previous value, current value, and future value. It will perform aggregation of the first level row in the tableau sheet. We always need to create a calculated field for the calculations to be performed in the tableau.
INDEX( ) Returns the index of the current row in the partition, without any sorting with regard to value. The first row index starts at 1. For example, the table below shows quarterly sales.
Right-click a measure in the view and select Add Table Calculation. Calculates the difference between the current value and the previous value in the partition. This is the default value.
You can determine the number of times each site appeared at each time with the table
function:
(tab <- table(df$time, df$site))
# A B C D E
# 1 1 1 1 1 0
# 2 1 1 1 0 0
# 3 1 1 1 1 1
With some simple manipulation, you can build same-sized table that contains the number of times a site appeared in the previous time period:
(prev.tab <- head(rbind(NA, tab), -1))
# A B C D E
# NA NA NA NA NA
# 1 1 1 1 1 0
# 2 1 1 1 0 0
Determining the number of sites in common with the previous iteration or the number of unique sites in the previous iteration plus the number of unique sites in the current iteration are now simple vectorized operations:
data.frame(time=unique(df$time),
intersection=rowSums(tab * (prev.tab >= 1)),
both=rowSums(tab >= 1) + rowSums(prev.tab >= 1, na.rm=TRUE))
# time intersection both
# 1 1 NA 4
# 2 2 3 7
# 3 3 3 8
Because this doesn't involve making a bunch of intersection
or unique
calls involving pairs of time values it should be more efficient than looping solutions:
# Slightly larger dataset with 100000 observations
set.seed(144)
df <- data.frame(time=sample(1:50, 100000, replace=TRUE),
site=sample(letters, 100000, replace=TRUE))
df <- df[order(df$time),]
josilber <- function(df) {
tab <- table(df$time, df$site)
prev.tab <- head(rbind(NA, tab), -1)
data.frame(time=unique(df$time),
intersection=rowSums(tab * (prev.tab >= 1)),
both=rowSums(tab >= 1) + rowSums(prev.tab >= 1, na.rm=TRUE))
}
# dist2 from @akrun's solution
microbenchmark(josilber(df), dist2(df))
# Unit: milliseconds
# expr min lq mean median uq max neval
# josilber(df) 28.74353 32.78146 52.73928 40.89203 62.04933 237.7774 100
# dist2(df) 540.78422 574.28319 829.04174 825.99418 1018.76561 1607.9460 100
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With