I am trying to remove duplicate observations from a data set based on my variable, id. However, I want the removal of observations to be based on the following rules. The variables below are id, the sex of household head (1-male, 2-female) and the age of the household head. The rules are as follows. If a household has both male and female household heads, remove the female household head observation. If a household as either two male or two female heads, remove the observation with the younger household head. An example data set is below.
id = c(1,2,2,3,4,5,5,6,7,8,8,9,10)
sex = c(1,1,2,1,2,2,2,1,1,1,1,2,1)
age = c(32,34,54,23,32,56,67,45,51,43,35,80,45)
data = data.frame(cbind(id,sex,age))
Approach: Take a Set. Insert all array element in the Set. Set does not allow duplicates and sets like LinkedHashSet maintains the order of insertion so it will remove duplicates and elements will be printed in the same order in which it is inserted.
You can do this by first ordering the data.frame
so the desired entry for each id
is first, and then remove the rows with duplicate id
s.
d <- with(data, data[order(id, sex, -age),])
# id sex age
# 1 1 1 32
# 2 2 1 34
# 3 2 2 54
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 6 5 2 56
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 11 8 1 35
# 12 9 2 80
# 13 10 1 45
d[!duplicated(d$id), ]
# id sex age
# 1 1 1 32
# 2 2 1 34
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 12 9 2 80
# 13 10 1 45
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With