Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Function to calculate values comparing sequential time periods

I have been unable to find a solution to my query on Stack Overflow. This post is similar, but my dataset is slightly - and importantly - different (in that I have multiple measures of 'time' within my grouping variable).

I have observations of organisms at various sites, over time. The sites are further aggregated into larger areas, so I want to eventually have a function I can call in ddply to summarize the dataset for each of the time periods within the geographical areas. However, I'm having trouble getting the function I need.

Question

How do I cycle through time periods and compare with the previous time period, calculating the intersection (i.e. number of 'sites' occurring in both time periods) and the sum of the number occurring in each period?

Toy dataset:

time = c(1,1,1,1,2,2,2,3,3,3,3,3)
site = c("A","B","C","D","A","B","C","A","B","C","D","E")
df <- as.data.frame(cbind(time,site))
df$time = as.numeric(df$time) 

My function

dist2 <- function(df){
  for(i in unique(df$time))
  {
    intersection <- length(which(df[df$time==i,"site"] %in% df[df$time==i-   1,"site"]))
    both <- length(unique(df[df$time==i,"site"])) + length(unique(df[df$time==i-1,"site"]))
  }
  return(as.data.frame(cbind(time,intersection,both)))
  }

dist2(df)

What I get:

dist2(df)
   time intersection both
1     1            3    8
2     1            3    8
3     1            3    8
4     1            3    8
5     2            3    8
6     2            3    8
7     2            3    8
8     3            3    8
9     3            3    8
10    3            3    8
11    3            3    8
12    3            3    8

What I expect (hoped!) to achieve:

time intersection both
1    1           NA    4
2    2            3    7
3    3            3    8

Once I have a working function, I want to use it with ddply on the whole data set to calculate these value for each area.

Many thanks for any pointers, tips, advice!

I am running:

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
like image 710
gettingthere Avatar asked May 25 '15 16:05

gettingthere


People also ask

How do I calculate the difference in two values between two different dates in Tableau?

The DATEDIFF function is one of the mainly used built-in functions of Tableau, which allows you to calculate the difference between the two given dates. The basic syntax of the DATEDIFF function is given below.

What is window sum function in Tableau?

Window SUM in tableau performs a total of the previous value, current value, and future value. It will perform aggregation of the first level row in the tableau sheet. We always need to create a calculated field for the calculations to be performed in the tableau.

How does Index function work in Tableau?

INDEX( ) Returns the index of the current row in the partition, without any sorting with regard to value. The first row index starts at 1. For example, the table below shows quarterly sales.

How do you find the running difference in Tableau?

Right-click a measure in the view and select Add Table Calculation. Calculates the difference between the current value and the previous value in the partition. This is the default value.


1 Answers

You can determine the number of times each site appeared at each time with the table function:

(tab <- table(df$time, df$site))
#     A B C D E
#   1 1 1 1 1 0
#   2 1 1 1 0 0
#   3 1 1 1 1 1

With some simple manipulation, you can build same-sized table that contains the number of times a site appeared in the previous time period:

(prev.tab <- head(rbind(NA, tab), -1))
#    A  B  C  D  E
#   NA NA NA NA NA
# 1  1  1  1  1  0
# 2  1  1  1  0  0

Determining the number of sites in common with the previous iteration or the number of unique sites in the previous iteration plus the number of unique sites in the current iteration are now simple vectorized operations:

data.frame(time=unique(df$time),
           intersection=rowSums(tab * (prev.tab >= 1)),
           both=rowSums(tab >= 1) + rowSums(prev.tab >= 1, na.rm=TRUE))
#   time intersection both
# 1    1           NA    4
# 2    2            3    7
# 3    3            3    8

Because this doesn't involve making a bunch of intersection or unique calls involving pairs of time values it should be more efficient than looping solutions:

# Slightly larger dataset with 100000 observations
set.seed(144)
df <- data.frame(time=sample(1:50, 100000, replace=TRUE),
                 site=sample(letters, 100000, replace=TRUE))
df <- df[order(df$time),]
josilber <- function(df) {
  tab <- table(df$time, df$site)
  prev.tab <- head(rbind(NA, tab), -1)
  data.frame(time=unique(df$time),
             intersection=rowSums(tab * (prev.tab >= 1)),
             both=rowSums(tab >= 1) + rowSums(prev.tab >= 1, na.rm=TRUE))
}
# dist2 from @akrun's solution
microbenchmark(josilber(df), dist2(df))
# Unit: milliseconds
#          expr       min        lq      mean    median         uq       max neval
#  josilber(df)  28.74353  32.78146  52.73928  40.89203   62.04933  237.7774   100
#     dist2(df) 540.78422 574.28319 829.04174 825.99418 1018.76561 1607.9460   100
like image 88
josliber Avatar answered Oct 08 '22 03:10

josliber