Data management: flatten data with R

I have the following dataframe gathering the evolution of policies:

Df <- data.frame(Id_policy = c("A_001", "A_002", "A_003","B_001","B_002"),
                 date_new = c("20200101","20200115","20200304","20200110","20200215"),
                 date_end = c("20200503","20200608","20210101","20200403","20200503"),
                 expend = c("","A_001","A_002","",""))

which looks like this:

  Id_policy date_new date_end expend
     A_001 20200101 20200503       
     A_002 20200115 20200608  A_001
     A_003 20200304 20210101  A_002
     B_001 20200110 20200403       
     B_002 20200215 20200503       

"Id_policy" identifies a specific policy, "date_new" is the date the policy was issued, and "date_end" is the date the policy ends. However, a policy is sometimes extended. When that happens, a new policy is created and the variable "expend" gives the name of the previous policy it extends.

The idea here is to flatten the dataset so that each chain of extensions collapses into a single row: the Id and start date of the original policy, with the end date of its last extension. So, the output would be something like this:

  Id_policy date_new date_end expend
     A_001 20200101 20210101       
     B_001 20200110 20200403       
     B_002 20200215 20200503     

Has anyone faced a similar problem?

Jb_Eyd asked Jun 13 '26 04:06

1 Answer

One way is to treat this as a network problem and use igraph functions (see related posts, e.g. "Make a group_indices based on several columns" and "Fast way to group variables based on direct and indirect similarities in multiple columns").

  1. Set the missing 'expend' to 'Id_policy', so that a policy with no predecessor points to itself and still appears in the graph.

  2. Use graph_from_data_frame to create a graph, where 'expend' and 'Id_policy' columns are treated as an edge list.

  3. Use components to get connected components of the graph, i.e. which 'Id_policy' are connected, directly or indirectly.

  4. Select the membership element to get "the cluster id to which each vertex belongs".

  5. Join membership to original data.

  6. Grab relevant data grouped by membership.

I use data.table for the data wrangling steps, but this can of course also be done in base or dplyr.

library(data.table)
library(igraph)

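setDT(Df)
# Step 1: policies with no predecessor point to themselves
Df[expend ==  "", expend := Id_policy]

# Steps 2-4: build the graph from the edge list and get component membership
g = graph_from_data_frame(Df[ , .(expend, Id_policy)])
mem = components(g)$membership

# Step 5: join membership to the original data by policy id
Df[.(names(mem)), on = .(Id_policy), mem := mem]

# Step 6: one row per component (first/last rely on the existing row order)
Df[ , .(Id_policy = Id_policy[1],
        date_new = first(date_new),
        date_end = last(date_end)), by = mem]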
#    mem Id_policy date_new date_end
# 1:   1     A_001 20200101 20210101
# 2:   2     B_001 20200110 20200403
# 3:   3     B_002 20200215 20200503
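
For the base R route mentioned above, here is a minimal sketch that swaps igraph's connected components for a simple loop following 'expend' back to the first policy of each chain. It assumes each policy extends at most one earlier policy and that the data contain no cycles. find_root is a hypothetical helper name, not a function from any package, and min()/max() are used instead of first/last so the result does not depend on row order.

```r
# Example data from the question (dates kept as yyyymmdd strings).
Df <- data.frame(
  Id_policy = c("A_001", "A_002", "A_003", "B_001", "B_002"),
  date_new  = c("20200101", "20200115", "20200304", "20200110", "20200215"),
  date_end  = c("20200503", "20200608", "20210101", "20200403", "20200503"),
  expend    = c("", "A_001", "A_002", "", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: follow 'expend' back until a policy with no
# predecessor is reached. Assumes the chains contain no cycles.
find_root <- function(id, ids, prev) {
  repeat {
    p <- prev[match(id, ids)]
    if (is.na(p) || p == "") return(id)
    id <- p
  }
}

Df$root <- vapply(Df$Id_policy, find_root, character(1),
                  ids = Df$Id_policy, prev = Df$expend, USE.NAMES = FALSE)

# One row per root policy: earliest start and latest end of its chain.
# yyyymmdd strings sort chronologically, so min()/max() work on them.
flat <- do.call(rbind, lapply(split(Df, Df$root), function(g) {
  data.frame(Id_policy = g$root[1],
             date_new  = min(g$date_new),
             date_end  = max(g$date_end),
             stringsAsFactors = FALSE)
}))
rownames(flat) <- NULL
flat
#   Id_policy date_new date_end
# 1     A_001 20200101 20210101
# 2     B_001 20200110 20200403
# 3     B_002 20200215 20200503
```

If a policy could merge several earlier policies, the pointer-following assumption no longer holds and the igraph approach is the safer choice.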
Henrik answered Jun 15 '26 07:06
