I've got a bit of code in R:
library(dplyr)
df_temp <- df %>%
  group_by(policy_number, policy_year) %>%
  summarise(term_start_date = last(term_start_date),
            term_end_date = last(term_end_date),
            on_cover_after = last(on_cover_after),
            termination_code = last(termination_code),
            termination_date = last(termination_date))
The main table df is about 700,000 rows by 130 columns. Grouped by policy_number and policy_year, there are about 300,000 (policy_number/policy_year) groupings. Four of the five columns I've referred to in last() are dates.
This query takes about 3 minutes to run, which is a nuisance because the rest of my code runs quite briskly. I'm hoping to speed it up. Is there anything I could try that might help please?
(Ideally I would supply a reprex, but I'm not sure how to do that here.)
Thank you.
Edit: since I'm always using the last record for a given (policy_number/policy_year) pair, is there some code I could write along the lines of:
df_temp <- df %>%
  group_by(policy_number, policy_year) %>%
  mutate(counter = 1:n()) %>%
  filter(counter == max(counter)) %>%
  select(term_start_date,
         term_end_date,
         on_cover_after,
         termination_code,
         termination_date)
?
There is a good source on this topic; the author makes several useful suggestions (see the comments section there as well). I would consider aggregating your data with data.table, or, if you stick with dplyr, consider setting a key. The source also includes relative benchmarks for these approaches.
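A minimal sketch of the data.table route, using toy data in place of your df (the column names are taken from your question; everything else is illustrative). data.table's sort is stable, so within each group the original row order, and hence the "last" record, is preserved after keying:

```r
library(data.table)

# Toy data standing in for the real 700k-row df (names from the question)
df <- data.frame(
  policy_number    = c(1, 1, 2),
  policy_year      = c(2020, 2020, 2021),
  term_start_date  = as.Date(c("2020-01-01", "2020-01-01", "2021-01-01")),
  term_end_date    = as.Date(c("2020-12-31", "2020-12-31", "2021-12-31")),
  on_cover_after   = c(TRUE, FALSE, TRUE),
  termination_code = c(NA, "L", NA),
  termination_date = as.Date(c(NA, "2020-06-30", NA))
)

dt <- as.data.table(df)
setkey(dt, policy_number, policy_year)  # keyed grouping avoids repeated hashing

cols <- c("term_start_date", "term_end_date", "on_cover_after",
          "termination_code", "termination_date")

# last() applied column-wise within each group, only over the named columns
df_temp <- dt[, lapply(.SD, last),
              by = .(policy_number, policy_year),
              .SDcols = cols]
```

Because data.table's last() is vectorised C code and .SDcols restricts the work to the five columns you need, this typically runs in seconds rather than minutes at your data size.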
Instead of summarise, use summarise_at:
library(dplyr)

df %>%
  group_by(policy_number, policy_year) %>%
  summarise_at(vars(term_start_date, term_end_date, on_cover_after,
                    termination_code, termination_date), last)
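If you are on dplyr 1.0 or later, summarise_at() has been superseded by across(); a sketch of the equivalent call, again with toy data standing in for your df (column names from the question, values invented):

```r
library(dplyr)

# Toy data standing in for df (names from the question, values illustrative)
df <- data.frame(
  policy_number    = c(1, 1, 2),
  policy_year      = c(2020, 2020, 2021),
  term_start_date  = as.Date(c("2020-01-01", "2020-01-01", "2021-01-01")),
  term_end_date    = as.Date(c("2020-12-31", "2020-12-31", "2021-12-31")),
  on_cover_after   = c(TRUE, FALSE, TRUE),
  termination_code = c(NA, "L", NA),
  termination_date = as.Date(c(NA, "2020-06-30", NA))
)

df_temp <- df %>%
  group_by(policy_number, policy_year) %>%
  summarise(across(c(term_start_date, term_end_date, on_cover_after,
                     termination_code, termination_date), last),
            .groups = "drop")  # drop grouping from the result
```

Either way, applying last() once per column inside a single summarise call avoids the per-column overhead of five separate named expressions.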