I have the following dataframe gathering the evolution of policies:
Df <- data.frame(Id_policy = c("A_001", "A_002", "A_003","B_001","B_002"),
date_new = c("20200101","20200115","20200304","20200110","20200215"),
date_end = c("20200503","20200608","20210101","20200403","20200503"),
expend = c("","A_001","A_002","",""))
which looks like this:
Id_policy  date_new  date_end  expend
A_001      20200101  20200503
A_002      20200115  20200608  A_001
A_003      20200304  20210101  A_002
B_001      20200110  20200403
B_002      20200215  20200503
"Id_policy" identifies a specific policy, "date_new" is the date the policy was issued, and "date_end" is the date it ends. Sometimes, however, a policy is extended. In that case a new policy is created, and the variable "expend" gives the Id of the previous policy that it extends.
The idea here is to flatten the dataset so that each chain of extensions collapses into a single row: the Id and start date of the original policy, with the end date of its last extension. The output would be something like this:
Id_policy  date_new  date_end  expend
A_001      20200101  20210101
B_001      20200110  20200403
B_002      20200215  20200503
Has anyone faced a similar problem?
One way is to treat this as a network problem and use igraph functions (see related posts, e.g. "Make a group_indices based on several columns" and "Fast way to group variables based on direct and indirect similarities in multiple columns").
Set each missing 'expend' value to the row's own 'Id_policy', so that a policy which extends nothing points to itself.
Use graph_from_data_frame to create a graph, where 'expend' and 'Id_policy' columns are treated as an edge list.
Use components to get connected components of the graph, i.e. which 'Id_policy' are connected, directly or indirectly.
Select the membership element to get "the cluster id to which each vertex belongs".
Join membership to original data.
Summarise by membership: keep the first 'Id_policy' and 'date_new' and the last 'date_end' in each group.
I use data.table for the data wrangling steps, but this can of course also be done in base or dplyr.
library(data.table)
library(igraph)
setDT(Df)
Df[expend == "", expend := Id_policy]
g = graph_from_data_frame(Df[ , .(expend, Id_policy)])
mem = components(g)$membership
Df[.(names(mem)), on = .(Id_policy), mem := mem]
Df[ , .(Id_policy = Id_policy[1],
        date_new = first(date_new),
        date_end = last(date_end)), by = mem]
# mem Id_policy date_new date_end
# 1: 1 A_001 20200101 20210101
# 2: 2 B_001 20200110 20200403
# 3: 3 B_002 20200215 20200503
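For comparison, here is a minimal sketch of the same steps using base R for the wrangling instead of data.table. It is an illustration under a few assumptions rather than a tested drop-in: the dates are taken to be yyyymmdd strings (so string order matches chronological order), the helper names from, d and res are made up for this example, and the summary sorts each group by 'date_new' and uses max() for the end date instead of relying on row order.

```r
library(igraph)

Df <- data.frame(
  Id_policy = c("A_001", "A_002", "A_003", "B_001", "B_002"),
  date_new  = c("20200101", "20200115", "20200304", "20200110", "20200215"),
  date_end  = c("20200503", "20200608", "20210101", "20200403", "20200503"),
  expend    = c("", "A_001", "A_002", "", ""),
  stringsAsFactors = FALSE
)

# A policy that extends nothing points to itself (a self-loop in the graph)
from <- ifelse(Df$expend == "", Df$Id_policy, Df$expend)

# Edge list: previous policy -> new policy; direction is irrelevant for grouping
g <- graph_from_data_frame(
  data.frame(from = from, to = Df$Id_policy, stringsAsFactors = FALSE),
  directed = FALSE
)

# Named vector: vertex name -> id of the connected component it belongs to
mem <- components(g)$membership
Df$mem <- unname(mem[Df$Id_policy])

# One row per chain: Id and start date of the earliest policy, latest end date
res <- do.call(rbind, lapply(split(Df, Df$mem), function(d) {
  d <- d[order(d$date_new), ]
  data.frame(Id_policy = d$Id_policy[1],
             date_new  = d$date_new[1],
             date_end  = max(d$date_end),
             stringsAsFactors = FALSE)
}))
res
```

Sorting inside each group makes the result independent of the original row order, which may matter if the policies are not already listed chronologically.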