I have a series of observations of birds at different locations and times. The data frame looks like this:
birdID site ts
1 A 2013-04-15 09:29
1 A 2013-04-19 01:22
1 A 2013-04-20 23:13
1 A 2013-04-22 00:03
1 B 2013-04-22 14:02
1 B 2013-04-22 17:02
1 C 2013-04-22 14:04
1 C 2013-04-22 15:18
1 C 2013-04-23 00:54
1 A 2013-04-23 01:20
1 A 2013-04-24 23:07
1 A 2013-04-30 23:47
1 B 2013-04-30 03:51
1 B 2013-04-30 04:26
2 C 2013-04-30 04:29
2 C 2013-04-30 18:49
2 A 2013-05-01 01:03
2 A 2013-05-01 23:15
2 A 2013-05-02 00:09
2 C 2013-05-03 07:57
2 C 2013-05-04 07:21
2 C 2013-05-05 02:54
2 A 2013-05-05 03:27
2 A 2013-05-14 00:16
2 D 2013-05-14 10:00
2 D 2013-05-14 15:00
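For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: I am assuming `ts` is stored as POSIXct and using UTC as the time zone, which the table itself does not state.

```r
# Rebuild the example data frame shown above (base R only).
# Assumption: timestamps are POSIXct in UTC; the real time zone is not stated.
df <- data.frame(
  birdID = rep(c(1L, 2L), times = c(14, 12)),
  site = c(rep(c("A", "B", "C", "A", "B"), times = c(4, 2, 3, 3, 2)),
           rep(c("C", "A", "C", "A", "D"), times = c(2, 3, 3, 2, 2))),
  ts = as.POSIXct(c(
    "2013-04-15 09:29", "2013-04-19 01:22", "2013-04-20 23:13",
    "2013-04-22 00:03", "2013-04-22 14:02", "2013-04-22 17:02",
    "2013-04-22 14:04", "2013-04-22 15:18", "2013-04-23 00:54",
    "2013-04-23 01:20", "2013-04-24 23:07", "2013-04-30 23:47",
    "2013-04-30 03:51", "2013-04-30 04:26", "2013-04-30 04:29",
    "2013-04-30 18:49", "2013-05-01 01:03", "2013-05-01 23:15",
    "2013-05-02 00:09", "2013-05-03 07:57", "2013-05-04 07:21",
    "2013-05-05 02:54", "2013-05-05 03:27", "2013-05-14 00:16",
    "2013-05-14 10:00", "2013-05-14 15:00"
  ), format = "%Y-%m-%d %H:%M", tz = "UTC"),
  stringsAsFactors = FALSE
)
```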
I would like to summarize the data in a way that shows the first and last detection of each bird at each site, and the duration at each site, while preserving information about multiple visits to sites (i.e. if a bird went from site A > B > C > A > B, I would like to show each visit to sites A and B independently, not lump both visits together).
I am hoping to produce output like this, where the start (min_ts), end (max_ts), and duration (days) of each visit are preserved:
birdID site min_ts max_ts days
1 A 2013-04-15 09:29 2013-04-22 00:03 6.6
1 B 2013-04-22 14:02 2013-04-22 17:02 0.1
1 C 2013-04-22 14:04 2013-04-23 00:54 0.5
1 A 2013-04-23 01:20 2013-04-30 23:47 7.9
1 B 2013-04-30 03:51 2013-04-30 04:26 0.02
2 C 2013-04-30 04:29 2013-04-30 18:49 0.6
2 A 2013-05-01 01:03 2013-05-02 00:09 0.96
2 C 2013-05-03 07:57 2013-05-05 02:54 1.8
2 A 2013-05-05 03:27 2013-05-14 00:16 8.8
2 D 2013-05-14 10:00 2013-05-14 15:00 0.2
I have tried this code, which yields the correct variables but lumps all the information about a single site together, not preserving multiple visits:
df <- df %>%
  group_by(birdID, site) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  arrange(birdID, min_ts)
birdID site min_ts max_ts days
1 A 2013-04-15 09:29 2013-04-30 23:47 15.6
1 B 2013-04-22 14:02 2013-04-30 04:26 7.6
1 C 2013-04-22 14:04 2013-04-23 00:54 0.5
2 C 2013-04-30 04:29 2013-05-05 02:54 4.9
2 A 2013-05-01 01:03 2013-05-14 00:16 12.9
2 D 2013-05-14 10:00 2013-05-14 15:00 0.2
I realize grouping by site is a problem, but if I remove that as a grouping variable the data are summarised without site info. I have tried this. It doesn't run, but I feel it's close to the solution:
df <- df %>%
  group_by(birdID) %>%
  summarize(min_ts = if_else((birdID == lag(birdID) & site != lag(site)), min(ts), NA_real_),
            max_ts = if_else((birdID == lag(birdID) & site != lag(site)), max(ts), NA_real_),
            min_d = min(yday(ts)),
            max_d = max(yday(ts)),
            days = max_d - min_d))
One possibility could be:
library(dplyr)

df %>%
  # run index from rle(): increases by one each time `site` changes between rows
  group_by(birdID, site, rleid = with(rle(site), rep(seq_along(lengths), lengths))) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  ungroup() %>%
  select(-rleid) %>%
  arrange(birdID, min_ts)
birdID site min_ts max_ts days
<int> <chr> <dttm> <dttm> <drtn>
1 1 A 2013-04-15 09:29:00 2013-04-22 00:03:00 6.60694444 days
2 1 B 2013-04-22 14:02:00 2013-04-22 17:02:00 0.12500000 days
3 1 C 2013-04-22 14:04:00 2013-04-23 00:54:00 0.45138889 days
4 1 A 2013-04-23 01:20:00 2013-04-30 23:47:00 7.93541667 days
5 1 B 2013-04-30 03:51:00 2013-04-30 04:26:00 0.02430556 days
6 2 C 2013-04-30 04:29:00 2013-04-30 18:49:00 0.59722222 days
7 2 A 2013-05-01 01:03:00 2013-05-02 00:09:00 0.96250000 days
8 2 C 2013-05-03 07:57:00 2013-05-05 02:54:00 1.78958333 days
9 2 A 2013-05-05 03:27:00 2013-05-14 00:16:00 8.86736111 days
10 2 D 2013-05-14 10:00:00 2013-05-14 15:00:00 0.20833333 days
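This creates an rleid()-like grouping variable, a run index that increases by one each time site changes between consecutive rows, and then calculates the first and last timestamp and their difference within each run.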
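To see what such a run index looks like, here is a small base R sketch on a made-up site vector (the vector and the name `run_id` are only for illustration, they are not part of the question's data):

```r
# Hypothetical site sequence for one bird: A, A, B, C, C, A
site <- c("A", "A", "B", "C", "C", "A")

# rle() collapses consecutive repeats into runs (values + lengths)
runs <- rle(site)

# Number the runs, then repeat each number for the length of its run
run_id <- rep(seq_along(runs$lengths), runs$lengths)

print(run_id)
# [1] 1 1 2 3 3 4
```

The two visits to A get different ids (1 and 4), which is why grouping by this index keeps them apart instead of lumping them together.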
Or the same using rleid() from data.table explicitly:
library(dplyr)
library(data.table)  # provides rleid()

df %>%
  group_by(birdID, site, rleid = rleid(site)) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  ungroup() %>%
  select(-rleid) %>%
  arrange(birdID, min_ts)
Another alternative is to use lag and cumsum to create a grouping variable.
library(dplyr)

df %>%
  group_by(birdID, group = cumsum(site != lag(site, default = first(site)))) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  ungroup() %>%
  select(-group)
# A tibble: 10 x 4
# birdID min_ts max_ts days
# <int> <dttm> <dttm> <drtn>
# 1 1 2013-04-15 09:29:00 2013-04-22 00:03:00 6.60694444 days
# 2 1 2013-04-22 14:02:00 2013-04-22 17:02:00 0.12500000 days
# 3 1 2013-04-22 14:04:00 2013-04-23 00:54:00 0.45138889 days
# 4 1 2013-04-23 01:20:00 2013-04-30 23:47:00 7.93541667 days
# 5 1 2013-04-30 03:51:00 2013-04-30 04:26:00 0.02430556 days
# 6 2 2013-04-30 04:29:00 2013-04-30 18:49:00 0.59722222 days
# 7 2 2013-05-01 01:03:00 2013-05-02 00:09:00 0.96250000 days
# 8 2 2013-05-03 07:57:00 2013-05-05 02:54:00 1.78958333 days
# 9 2 2013-05-05 03:27:00 2013-05-14 00:16:00 8.86736111 days
#10 2 2013-05-14 10:00:00 2013-05-14 15:00:00 0.20833333 days
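Note that the output above has no site column, because site is not one of the grouping variables. A minimal sketch of one way to keep it, shown on a small made-up data frame: compute the change counter within each bird, then include site in the grouping. The names `toy`, `visit` and `visits` are ones I am introducing here, and `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data, much smaller than the question's data frame
toy <- tibble(
  birdID = c(1, 1, 1, 1, 2, 2),
  site   = c("A", "A", "B", "A", "A", "A"),
  ts     = as.POSIXct(c("2013-04-15 09:29", "2013-04-19 01:22",
                        "2013-04-22 14:02", "2013-04-23 01:20",
                        "2013-04-30 04:29", "2013-04-30 18:49"),
                      format = "%Y-%m-%d %H:%M", tz = "UTC")
)

visits <- toy %>%
  group_by(birdID) %>%
  # visit counter: +1 every time the site differs from the previous row
  mutate(visit = cumsum(site != lag(site, default = first(site)))) %>%
  # site is constant within a visit, so adding it only carries the label along
  group_by(birdID, visit, site) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days"),
            .groups = "drop") %>%
  select(-visit)

print(visits)
```

Because site never changes inside a visit, grouping by it does not split any group; it just keeps the column in the summary.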