Summarize data within multiple groups of a time series

I have a series of observations of birds at different locations and times. The data frame looks like this:

birdID   site          ts
1       A          2013-04-15 09:29
1       A          2013-04-19 01:22
1       A          2013-04-20 23:13
1       A          2013-04-22 00:03
1       B          2013-04-22 14:02
1       B          2013-04-22 17:02
1       C          2013-04-22 14:04
1       C          2013-04-22 15:18
1       C          2013-04-23 00:54
1       A          2013-04-23 01:20
1       A          2013-04-24 23:07
1       A          2013-04-30 23:47
1       B          2013-04-30 03:51
1       B          2013-04-30 04:26
2       C          2013-04-30 04:29
2       C          2013-04-30 18:49
2       A          2013-05-01 01:03
2       A          2013-05-01 23:15
2       A          2013-05-02 00:09
2       C          2013-05-03 07:57
2       C          2013-05-04 07:21
2       C          2013-05-05 02:54
2       A          2013-05-05 03:27
2       A          2013-05-14 00:16
2       D          2013-05-14 10:00
2       D          2013-05-14 15:00
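
For anyone who wants to run the code in this thread, the table above can be rebuilt roughly as follows. This is only a sketch: the object name df matches the code below, but the UTC time zone and the column types (integer birdID, character site) are assumptions, since the question does not state them.

```r
# Rebuild the example data shown above, in the same row order.
# Assumptions: timestamps are UTC, birdID is integer, site is character.
df <- data.frame(
  birdID = rep(1:2, times = c(14, 12)),
  site = rep(c("A", "B", "C", "A", "B", "C", "A", "C", "A", "D"),
             times = c(4, 2, 3, 3, 2, 2, 3, 3, 2, 2)),
  ts = as.POSIXct(
    c("2013-04-15 09:29", "2013-04-19 01:22", "2013-04-20 23:13",
      "2013-04-22 00:03", "2013-04-22 14:02", "2013-04-22 17:02",
      "2013-04-22 14:04", "2013-04-22 15:18", "2013-04-23 00:54",
      "2013-04-23 01:20", "2013-04-24 23:07", "2013-04-30 23:47",
      "2013-04-30 03:51", "2013-04-30 04:26", "2013-04-30 04:29",
      "2013-04-30 18:49", "2013-05-01 01:03", "2013-05-01 23:15",
      "2013-05-02 00:09", "2013-05-03 07:57", "2013-05-04 07:21",
      "2013-05-05 02:54", "2013-05-05 03:27", "2013-05-14 00:16",
      "2013-05-14 10:00", "2013-05-14 15:00"),
    format = "%Y-%m-%d %H:%M", tz = "UTC"),
  stringsAsFactors = FALSE
)
```

Note that the rows are not strictly sorted by ts within each bird, so a "visit" here means a run of consecutive rows with the same site.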

I would like to summarize the data in a way that shows the first and last detection of each bird at each site, and the duration at each site, while preserving information about multiple visits to sites (i.e. if a bird went from site A > B > C > A > B, I would like to show each visit to sites A and B separately rather than lumping both visits together).

I am hoping to produce output like this, where the start (min_ts), end (max_ts), and duration (days) of each visit are preserved:

birdID  site      min_ts                max_ts          days
1      A      2013-04-15 09:29    2013-04-22 00:03  6.6
1      B      2013-04-22 14:02    2013-04-22 17:02  0.1
1      C      2013-04-22 14:04    2013-04-23 00:54  0.5
1      A      2013-04-23 01:20    2013-04-30 23:47  7.9
1      B      2013-04-30 03:51    2013-04-30 04:26  0.02
2      C      2013-04-30 04:29    2013-04-30 18:49  0.6
2      A      2013-05-01 01:03    2013-05-02 00:09  0.96
2      C      2013-05-03 07:57    2013-05-05 02:54  1.8
2      A      2013-05-05 03:27    2013-05-14 00:16  8.8
2      D      2013-05-14 10:00    2013-05-14 15:00  0.2

I have tried this code, which yields the correct variables but lumps all the information about a single site together, not preserving multiple visits:

df <- df %>%
  group_by(birdID, site) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  arrange(birdID, min_ts)

birdID  site    min_ts               max_ts            days
1   A   2013-04-15 09:29   2013-04-30 23:47    15.6
1   B   2013-04-22 14:02   2013-04-30 4:26     7.6
1   C   2013-04-22 14:04   2013-04-23 0:54     0.5
2   C   2013-04-30 04:29   2013-05-05 2:54     4.9
2   A   2013-05-01 01:03   2013-05-14 0:16     12.9
2   D   2013-05-14 10:00   2013-05-14 15:00    0.2

I realize grouping by site is a problem, but if I remove that as a grouping variable the data are summarised without site info. I have tried this. It doesn't run, but I feel it's close to the solution:

df <- df %>% 
   group_by(birdID) %>% 
   summarize(min_ts = if_else((birdID == lag(birdID) & site != lag(site)), min(ts), NA_real_), 
             max_ts = if_else((birdID == lag(birdID) & site != lag(site)), max(ts), NA_real_), 
            min_d = min(yday(ts)),
            max_d = max(yday(ts)),
            days = max_d - min_d)) 
asked Jun 23 '19 22:06 by sbliss


2 Answers

One possibility could be:

library(dplyr)

df %>%
 group_by(birdID, site, rleid = with(rle(site), rep(seq_along(lengths), lengths))) %>%
 summarise(min_ts = min(ts),
           max_ts = max(ts),
           days = difftime(max_ts, min_ts, units = "days")) %>%
 ungroup() %>%
 select(-rleid) %>%
 arrange(birdID, min_ts)

   birdID site  min_ts              max_ts              days           
    <int> <chr> <dttm>              <dttm>              <drtn>         
 1      1 A     2013-04-15 09:29:00 2013-04-22 00:03:00 6.60694444 days
 2      1 B     2013-04-22 14:02:00 2013-04-22 17:02:00 0.12500000 days
 3      1 C     2013-04-22 14:04:00 2013-04-23 00:54:00 0.45138889 days
 4      1 A     2013-04-23 01:20:00 2013-04-30 23:47:00 7.93541667 days
 5      1 B     2013-04-30 03:51:00 2013-04-30 04:26:00 0.02430556 days
 6      2 C     2013-04-30 04:29:00 2013-04-30 18:49:00 0.59722222 days
 7      2 A     2013-05-01 01:03:00 2013-05-02 00:09:00 0.96250000 days
 8      2 C     2013-05-03 07:57:00 2013-05-05 02:54:00 1.78958333 days
 9      2 A     2013-05-05 03:27:00 2013-05-14 00:16:00 8.86736111 days
10      2 D     2013-05-14 10:00:00 2013-05-14 15:00:00 0.20833333 days

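This builds a run-length ID, similar to data.table's rleid(), that increments every time site changes from one row to the next. Grouping by that ID keeps repeat visits to the same site separate, and the first and last timestamps and their difference are then calculated within each run.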
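
To see what that grouping variable looks like, here is a minimal sketch in base R on a made-up site vector (the values are illustrative and not taken from the question's data):

```r
# Toy site sequence with a repeat visit to "A" (made-up values).
site <- c("A", "A", "B", "C", "C", "A")

# rle() collapses consecutive duplicates into runs:
# runs$values holds the run labels, runs$lengths their sizes.
runs <- rle(site)

# Number the runs 1, 2, 3, ... and repeat each number over its run.
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id
# 1 1 2 3 3 4
```

The second visit to "A" gets its own ID (4) instead of being merged with the first (1), which is what keeps the visits apart in group_by().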

Or the same using rleid() from data.table explicitly:

library(dplyr)
library(data.table)  # provides rleid()

df %>%
 group_by(birdID, site, rleid = rleid(site)) %>%
 summarise(min_ts = min(ts),
           max_ts = max(ts),
           days = difftime(max_ts, min_ts, units = "days")) %>%
 ungroup() %>%
 select(-rleid) %>%
 arrange(birdID, min_ts)
answered Sep 30 '22 14:09 by tmfmnk


Another alternative is to use lag and cumsum to create a grouping variable.

library(dplyr)

df %>%
  group_by(birdID, group = cumsum(site != lag(site, default = first(site)))) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  ungroup() %>%
  select(-group)

# A tibble: 10 x 4
#   birdID min_ts              max_ts              days           
#    <int> <dttm>              <dttm>              <drtn>         
# 1      1 2013-04-15 09:29:00 2013-04-22 00:03:00 6.60694444 days
# 2      1 2013-04-22 14:02:00 2013-04-22 17:02:00 0.12500000 days
# 3      1 2013-04-22 14:04:00 2013-04-23 00:54:00 0.45138889 days
# 4      1 2013-04-23 01:20:00 2013-04-30 23:47:00 7.93541667 days
# 5      1 2013-04-30 03:51:00 2013-04-30 04:26:00 0.02430556 days
# 6      2 2013-04-30 04:29:00 2013-04-30 18:49:00 0.59722222 days
# 7      2 2013-05-01 01:03:00 2013-05-02 00:09:00 0.96250000 days
# 8      2 2013-05-03 07:57:00 2013-05-05 02:54:00 1.78958333 days
# 9      2 2013-05-05 03:27:00 2013-05-14 00:16:00 8.86736111 days
#10      2 2013-05-14 10:00:00 2013-05-14 15:00:00 0.20833333 days
answered Sep 30 '22 15:09 by Ronak Shah