Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to calculate the elder generation’ maximum education year in one family

Suppose I have dataframe like this:

   family relationship meanings              edu
 1      1 A            respondent             12
 2      1 B            respondent's spouse    18
 3      1 C            A's father             10
 4      1 D            A's mother              9
 5      1 E1           A's first son          15
 6      1 F1           E1's spouse            14
 7      1 G11          E1's first son          3
 8      1 G12          E1's second son         1
 9      1 E2           A's second son         13
10      2 A            respondent             21
11      2 B            respondent's spouse    16
12      2 C            A's father             12
13      2 D            A's mother             16
14      2 E1           A's first son          18
15      2 F1           E1's spouse            15
16      2 E2           A's second son         17
17      2 E3           A's third son          16

family indicates family number. relationship indicates relationships in one family. meanings indicates the meanings of the second column, relationship.

relationship in the first family

I want to calculate the father generation’s maximum education year in one family. We do not need spouse's information.

The expected results are as follows:

   family id      edu fedu 
 1      1 A        12 10   
 2      1 C        10 NA   
 3      1 E1       15 18   
 4      1 E2       13 18   
 5      1 G11       3 15   
 6      1 G12       1 15   
 7      2 A        21 16   
 8      2 C        12 NA   
 9      2 E1       18 21   
10      2 E2       17 21   
11      2 E3       16 21

Here is the data:

 d = structure(list(family = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2), relationship = c("A", "B", "C", "D", "E1", "F1", "G11", "G12", "E2", "A", "B", "C", "D", "E1", "F1", "E2", "E3"), meanings = c("respondent", "respondent's spouse", "A's father","A's mother", "A's first son", "E1's spouse", "E1's first son","E1's second son", "A's second son", "respondent", "respondent's spouse","A's father", "A's mother", "A's first son", "E1's spouse", "A's second son","A's third son"), edu = c(12, 18, 10, 9, 15, 14, 3, 1, 13, 21,16, 12, 16, 18, 15, 17, 16)), row.names = c(NA, -17L), class = c("tbl_df", "tbl", "data.frame"))
like image 317
X.Jun Avatar asked Dec 24 '19 09:12

X.Jun


1 Answers

Here is what I tried. I thought it was necessary to create a generation variable. Seeing the sample image in your question, C and D are the 1st generation. A and B are the 2nd generation. E and F are the 3rd generation, and G is the 4th generation. The first mutate() with case_when() created the generation variable. Then, I defined groups by family and generation. For each group, I identified max education duration (i.e., max_ed_duration). Since you said that you do not need spouse's information, I dropped rows that contain mother or spouse in meanings. Then, I defined groups again using family. For each family, if generation is 1, assign NA to fedu. Otherwise, assign max_ed_duration value from the previous generation to fedu. Finally, I arranged the data by family and relationship.

library(dplyr)

mutate(mydf, generation = case_when(relationship %in% c("C", "D") ~ 1,
                                    relationship %in% c("A", "B") ~ 2,
                                    grepl(x = relationship, pattern = "^E|F") ~ 3,
                                    grepl(x = relationship, pattern = "^G") ~ 4)) %>% 
  group_by(family, generation) %>% 
  mutate(max_ed_duration = max(edu)) %>% 
  filter(!grepl(x = meanings, pattern = "mother|spouse")) %>% 
  group_by(family) %>%
  mutate(fedu = if_else(generation == 1,
                        NA_real_,
                        max_ed_duration[match(x = generation - 1, table = generation)])) %>% 
  arrange(family, relationship)

#   family relationship meanings          edu generation max_ed_duration  fedu
#    <dbl> <chr>        <chr>           <dbl>      <dbl>           <dbl> <dbl>
# 1      1 A            respondent         12          2              18    10
# 2      1 C            A's father         10          1              10    NA
# 3      1 E1           A's first son      15          3              15    18
# 4      1 E2           A's second son     13          3              15    18
# 5      1 G11          E1's first son      3          4               3    15
# 6      1 G12          E1's second son     1          4               3    15
# 7      2 A            respondent         21          2              21    16
# 8      2 C            A's father         12          1              16    NA
# 9      2 E1           A's first son      18          3              18    21
#10      2 E2           A's second son     17          3              18    21
#11      2 E3           A's third son      16          3              18    21

DATA

mydf <- structure(list(family = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 
2, 2, 2, 2, 2), relationship = c("A", "B", "C", "D", "E1", "F1", 
"G11", "G12", "E2", "A", "B", "C", "D", "E1", "F1", "E2", "E3"
), meanings = c("respondent", "respondent's spouse", "A's father", 
"A's mother", "A's first son", "E1's spouse", "E1's first son", 
"E1's second son", "A's second son", "respondent", "respondent's spouse", 
"A's father", "A's mother", "A's first son", "E1's spouse", "A's second son", 
"A's third son"), edu = c(12, 18, 10, 9, 15, 14, 3, 1, 13, 21, 
16, 12, 16, 18, 15, 17, 16)), class = "data.frame", row.names = c(NA, 
-17L))
like image 137
jazzurro Avatar answered Oct 25 '22 07:10

jazzurro