Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R, Assign a group ID to combinations of values in another observation ID variable

Tags:

r

grouping

ids

Assume that I have several years of personal data, in which a Personal ID (indiv) identifies individuals across years, but the household ID (househ) only assigns household relation within a year.

Example:

libary(dplyr)
test_01 <- data.frame(indiv=c(1,2,3,4,5,6),househ=c(1,1,2,3,4,4),time=rep(1,6)) #1&2 form a household, 3 and 4 are single, 5&6 form a household
test_02 <- data.frame(indiv=c(2,3,4,5,6,7),househ=c(1,2,2,3,3,4),time=rep(2,6)) #1 exits, so 2 is now a new household, 3&4 now form a new household, 5&6 still do, 7 enters
test_03 <- data.frame(indiv=c(2,3,4,5,7,8,9,10),househ=c(1,2,2,3,4,5,5,6),time=rep(3,8)) #according to logic above
data_test_panel <- bind_rows(test_01,test_02,test_03)

The desired time-consistent household variable would be:

data_test_panel$true_household <- c(1,1,2,3,4,4,5,6,6,4,4,7,5,6,6,8,7,9,9,10)

So far I tried:

library(data.table)
setDT(data_test_panel)[,cons_household := .GRP,.(time,househ)] # where cons_household is the new household ID. However, this doesn't give the same household ID across time but assigns new values for every appearance of a household.

Thank you kindly for all your help! /Severin

like image 932
a123 Avatar asked Nov 01 '25 19:11

a123


2 Answers

Create a list column of unique individuals in a given household at each time period. We can then take the group number by that column.

dplyr approach

This is much nicer than the data.table approach:

library(dplyr)
data_test_panel |>
    mutate(
        indiv_in_household = list(indiv),
        .by = c(househ, time)
    ) |>
    mutate(
        cons_household = cur_group_id(),
        .by = indiv_in_household
    )

Note this uses per-operation grouping with the .by parameter which requires at least dplyr v1.1.0.

data.table approach

The syntax for this is much more complicated because data.table will automatically unroll your list column into a vector if you try to make it list(indiv) by group.

data_test_panel[, indiv_in_household := .SD[
    , .(split(
        rep(indiv, length(indiv)),
        rep(seq(length(indiv)), each = length(indiv))
    )),
    .(househ, time)
]$V1]

head(data_test_panel)
#    indiv househ  time true_household indiv_in_household
#    <num>  <num> <num>          <num>             <list>
# 1:     1      1     1              1                1,2
# 2:     2      1     1              1                1,2
# 3:     3      2     1              2                  3
# 4:     4      3     1              3                  4
# 5:     5      4     1              4                5,6
# 6:     6      4     1              4                5,6

Also data.table doesn't support grouping by list columns so we need to use toString(), which is OK in this case but in general is not ideal as you have to think through the format of the column to ensure it will not somehow cause ambiguity leading two distinct groups to be merged.

data_test_panel[, cons_household := .GRP, sapply(indiv_in_household, toString)]

data_test_panel[cons_household != true_household]
# Empty data.table (0 rows and 6 cols): indiv,househ,time,true_household,cons_household,indiv_in_household
like image 81
SamR Avatar answered Nov 04 '25 17:11

SamR


A data.table approach that is reasonably nice

As the previous answer shows, data.table's handling of list columns doesn't lend itself to the dplyr-style solution in which the list column of members per annual household is broadcast to all rows of the data, and then grouped-by to create the IDs.

However, we can use the following: (1) Collapse (instead of broadcasting) to a list column with a single row per household/year. (2) Create the ID as the index of each annual household in a list of unique households from all years. (3) Uncollapse the list column.

(data_test_panel
  [, .(hh_indivs = list(indiv)), by=.(househ, time)                # collapse
   ][, hh_id := match(hh_indivs, unique(hh_indivs))                # index
    ][, .(indiv = unlist(hh_indivs), hh_id), by=.(househ, time)])  # uncollapse

This avoids the complicated code to mimic the broadcasted list column, and the problematic use of toString() to enable a groupby.

like image 22
Tobo Avatar answered Nov 04 '25 17:11

Tobo