
Remove duplicate observations based on set of rules

I am trying to remove duplicate observations from a data set based on my variable, id. However, I want the removal to follow specific rules. The variables below are id, the sex of the household head (1 = male, 2 = female), and the age of the household head. The rules are as follows. If a household has both a male and a female household head, remove the female household head observation. If a household has either two male or two female heads, remove the observation with the younger household head. An example data set is below.

id = c(1,2,2,3,4,5,5,6,7,8,8,9,10)
sex = c(1,1,2,1,2,2,2,1,1,1,1,2,1)
age = c(32,34,54,23,32,56,67,45,51,43,35,80,45)
data = data.frame(id, sex, age)
DBK asked Mar 22 '13 17:03


1 Answer

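You can do this by first ordering the data.frame so that the desired entry for each id comes first, and then removing the rows with duplicate ids. Sorting by sex in ascending order puts male heads (1) before female heads (2), and sorting by -age puts the older head first within each sex.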

d <- with(data, data[order(id, sex, -age),])
#    id sex age
# 1   1   1  32
# 2   2   1  34
# 3   2   2  54
# 4   3   1  23
# 5   4   2  32
# 7   5   2  67
# 6   5   2  56
# 8   6   1  45
# 9   7   1  51
# 10  8   1  43
# 11  8   1  35
# 12  9   2  80
# 13 10   1  45
d[!duplicated(d$id), ]
#    id sex age
# 1   1   1  32
# 2   2   1  34
# 4   3   1  23
# 5   4   2  32
# 7   5   2  67
# 8   6   1  45
# 9   7   1  51
# 10  8   1  43
# 12  9   2  80
# 13 10   1  45
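
The two steps above can also be wrapped in a small function and checked against the rules from the question. This is only a minimal sketch: the helper name keep_preferred_head is hypothetical (it does not appear in the original answer), and it assumes that sex is coded 1 = male, 2 = female and that age is numeric, so that negating it sorts in descending order.

```r
# Hypothetical helper (not from the original answer): keep one row per id,
# preferring male heads (sex == 1) and, within the same sex, the older head.
keep_preferred_head <- function(df) {
  # Sort so that the preferred row for each id comes first.
  ordered <- df[order(df$id, df$sex, -df$age), ]
  # duplicated() flags every row after the first occurrence of an id.
  ordered[!duplicated(ordered$id), ]
}

# Example data from the question.
data <- data.frame(
  id  = c(1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 10),
  sex = c(1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1),
  age = c(32, 34, 54, 23, 32, 56, 67, 45, 51, 43, 35, 80, 45)
)

result <- keep_preferred_head(data)

# Sanity checks against the rules in the question.
stopifnot(anyDuplicated(result$id) == 0)       # one row per household
stopifnot(result$sex[result$id == 2] == 1)     # male kept over female
stopifnot(result$age[result$id == 5] == 67)    # older of two female heads
stopifnot(result$age[result$id == 8] == 43)    # older of two male heads
```

If two heads in a household share both sex and age, this sketch simply keeps whichever row sorts first; the question does not specify a tie-breaking rule.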
Matthew Plourde answered Sep 20 '22 11:09
