I have been unable to find a solution to my query on Stack Overflow. This post is similar, but my dataset is slightly - and importantly - different (in that I have multiple measures of 'time' within my grouping variable). I have observations of organisms at various sites, over time. The sites are further aggregated into larger areas, so I want to eventually have a function I can call in ddply to summarize the dataset for each of the time periods within the geographical areas. However, I'm having trouble getting the function I need. Question How do I cycle through time periods and compare with the previous time period, calculating the intersection (i.e. number of 'sites' occurring in both time periods) and the sum of the number occurring in each period? Toy dataset: <pre class="prettyprint"><code>time = c(1,1,1,1,2,2,2,3,3,3,3,3) site = c("A","B","C","D","A","B","C","A","B","C","D","E") df <- as.data.frame(cbind(time,site)) df$time = as.numeric(df$time) </code></pre> My function <pre class="prettyprint"><code>dist2 <- function(df){ for(i in unique(df$time)) { intersection <- length(which(df[df$time==i,"site"] %in% df[df$time==i- 1,"site"])) both <- length(unique(df[df$time==i,"site"])) + length(unique(df[df$time==i-1,"site"])) } return(as.data.frame(cbind(time,intersection,both))) } dist2(df) </code></pre> What I get: <blockquote> <pre class="prettyprint"><code>dist2(df) time intersection both 1 1 3 8 2 1 3 8 3 1 3 8 4 1 3 8 5 2 3 8 6 2 3 8 7 2 3 8 8 3 3 8 9 3 3 8 10 3 3 8 11 3 3 8 12 3 3 8 </code></pre> </blockquote> What I expect (hoped!) to achieve: <pre class="prettyprint"><code>time intersection both 1 1 NA 4 2 2 3 7 3 3 3 8 </code></pre> Once I have a working function, I want to use it with ddply on the whole data set to calculate these value for each area. Many thanks for any pointers, tips, advice! I am running: <pre class="prettyprint"><code>R version 3.1.2 (2014-10-31) Platform: x86_64-apple-darwin13.4.0 (64-bit) </code></pre>

You can determine the number of times each site appeared at each time with the <code>table</code> function: <pre class="prettyprint"><code>(tab <- table(df$time, df$site)) # A B C D E # 1 1 1 1 1 0 # 2 1 1 1 0 0 # 3 1 1 1 1 1 </code></pre> With some simple manipulation, you can build same-sized table that contains the number of times a site appeared in the previous time period: <pre class="prettyprint"><code>(prev.tab <- head(rbind(NA, tab), -1)) # A B C D E # NA NA NA NA NA # 1 1 1 1 1 0 # 2 1 1 1 0 0 </code></pre> Determining the number of sites in common with the previous iteration or the number of unique sites in the previous iteration plus the number of unique sites in the current iteration are now simple vectorized operations: <pre class="prettyprint"><code>data.frame(time=unique(df$time), intersection=rowSums(tab * (prev.tab >= 1)), both=rowSums(tab >= 1) + rowSums(prev.tab >= 1, na.rm=TRUE)) # time intersection both # 1 1 NA 4 # 2 2 3 7 # 3 3 3 8 </code></pre> Because this doesn't involve making a bunch of <code>intersection</code> or <code>unique</code> calls involving pairs of time values it should be more efficient than looping solutions: <pre class="prettyprint"><code># Slightly larger dataset with 100000 observations set.seed(144) df <- data.frame(time=sample(1:50, 100000, replace=TRUE), site=sample(letters, 100000, replace=TRUE)) df <- df[order(df$time),] josilber <- function(df) { tab <- table(df$time, df$site) prev.tab <- head(rbind(NA, tab), -1) data.frame(time=unique(df$time), intersection=rowSums(tab * (prev.tab >= 1)), both=rowSums(tab >= 1) + rowSums(prev.tab >= 1, na.rm=TRUE)) } # dist2 from @akrun's solution microbenchmark(josilber(df), dist2(df)) # Unit: milliseconds # expr min lq mean median uq max neval # josilber(df) 28.74353 32.78146 52.73928 40.89203 62.04933 237.7774 100 # dist2(df) 540.78422 574.28319 829.04174 825.99418 1018.76561 1607.9460 100 </code></pre>

Function to calculate values comparing sequential time periods

Tags:

function

time

r

plyr

I have been unable to find a solution to my query on Stack Overflow. This post is similar, but my dataset is slightly - and importantly - different (in that I have multiple measures of 'time' within my grouping variable).

I have observations of organisms at various sites, over time. The sites are further aggregated into larger areas, so I want to eventually have a function I can call in ddply to summarize the dataset for each of the time periods within the geographical areas. However, I'm having trouble getting the function I need.

Question

How do I cycle through time periods and compare with the previous time period, calculating the intersection (i.e. number of 'sites' occurring in both time periods) and the sum of the number occurring in each period?

Toy dataset:

time = c(1,1,1,1,2,2,2,3,3,3,3,3)
site = c("A","B","C","D","A","B","C","A","B","C","D","E")
df <- as.data.frame(cbind(time,site))
df$time = as.numeric(df$time)

My function

dist2 <- function(df){
  for(i in unique(df$time))
  {
    intersection <- length(which(df[df$time==i,"site"] %in% df[df$time==i-   1,"site"]))
    both <- length(unique(df[df$time==i,"site"])) + length(unique(df[df$time==i-1,"site"]))
  }
  return(as.data.frame(cbind(time,intersection,both)))
  }

dist2(df)

What I get:

dist2(df)
   time intersection both
1     1            3    8
2     1            3    8
3     1            3    8
4     1            3    8
5     2            3    8
6     2            3    8
7     2            3    8
8     3            3    8
9     3            3    8
10    3            3    8
11    3            3    8
12    3            3    8

What I expect (hoped!) to achieve:

time intersection both
1    1           NA    4
2    2            3    7
3    3            3    8

Once I have a working function, I want to use it with ddply on the whole data set to calculate these value for each area.

Many thanks for any pointers, tips, advice!

I am running:

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

710

asked May 25 '15 16:05

gettingthere

1 Answers

You can determine the number of times each site appeared at each time with the table function:

(tab <- table(df$time, df$site))
#     A B C D E
#   1 1 1 1 1 0
#   2 1 1 1 0 0
#   3 1 1 1 1 1

With some simple manipulation, you can build same-sized table that contains the number of times a site appeared in the previous time period:

(prev.tab <- head(rbind(NA, tab), -1))
#    A  B  C  D  E
#   NA NA NA NA NA
# 1  1  1  1  1  0
# 2  1  1  1  0  0

Determining the number of sites in common with the previous iteration or the number of unique sites in the previous iteration plus the number of unique sites in the current iteration are now simple vectorized operations:

data.frame(time=unique(df$time),
           intersection=rowSums(tab * (prev.tab >= 1)),
           both=rowSums(tab >= 1) + rowSums(prev.tab >= 1, na.rm=TRUE))
#   time intersection both
# 1    1           NA    4
# 2    2            3    7
# 3    3            3    8

Because this doesn't involve making a bunch of intersection or unique calls involving pairs of time values it should be more efficient than looping solutions:

# Slightly larger dataset with 100000 observations
set.seed(144)
df <- data.frame(time=sample(1:50, 100000, replace=TRUE),
                 site=sample(letters, 100000, replace=TRUE))
df <- df[order(df$time),]
josilber <- function(df) {
  tab <- table(df$time, df$site)
  prev.tab <- head(rbind(NA, tab), -1)
  data.frame(time=unique(df$time),
             intersection=rowSums(tab * (prev.tab >= 1)),
             both=rowSums(tab >= 1) + rowSums(prev.tab >= 1, na.rm=TRUE))
}
# dist2 from @akrun's solution
microbenchmark(josilber(df), dist2(df))
# Unit: milliseconds
#          expr       min        lq      mean    median         uq       max neval
#  josilber(df)  28.74353  32.78146  52.73928  40.89203   62.04933  237.7774   100
#     dist2(df) 540.78422 574.28319 829.04174 825.99418 1018.76561 1607.9460   100

answered Oct 08 '22 03:10

josliber

Related questions
                            
                                Remove variable wrapped in function from model formula in R
                            
                                Creating Package Documentation with RStudio?
                            
                                mlogit: missing value where TRUE/FALSE needed
                            
                                How to convert numpy array to R matrix? [duplicate]
                            
                                Create an R package with dependencies
                            
                                R data.table conditional aggregation
                            
                                How plot bars on top of grid lines when using barplot?
                            
                                "more" like command in R console [duplicate]
                            
                                Markdown: Change default font size of code chunks in ioslides
                            
                                Apply function to all rows in the loop and put the results in new column
                            
                                How to add a label for a vertical line with legend in ggplot2
                            
                                How to use require(googlesheets) properly?
                            
                                Using aggregate() on data.frame objects
                            
                                How can I download multiple objects from S3 simultaneously?
                            
                                Embedding new fonts to plot in ggplot2, R (mac)
                            
                                How to add a horizontal line above a bar chart using ggplot?
                            
                                Creating origin-destination matrices with R
                            
                                Aggregate data to weekly level with every week starting from Monday
                            
                                R sum of rows for different group of columns that start with similar string
                            
                                Using table() to create 3 variable frequency table in R

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With