
How to subset your dataframe to only keep the first duplicate? [duplicate]

I have a dataframe with multiple variables, and I am interested in how to subset it so that it only includes the first duplicate.

    >head(occurrence)
    userId        occurrence  profile.birthday profile.gender postDate count
    1 100469891698         6               47         Female 583 days     0
    2 100469891698         6               47         Female  55 days     0
    3 100469891698         6               47         Female 481 days     0
    4 100469891698         6               47         Female 583 days     0
    5 100469891698         6               47         Female 583 days     0
    6 100469891698         6               47         Female 583 days     0

Here you can see the dataframe. The 'occurrence' column counts how many times the same userId has occurred. I have tried the following code to remove duplicates:

    occurrence <- occurrence[!duplicated(occurrence$userId),]

However, this seems to remove "random" duplicates. I want to keep the row with the oldest postDate for each userId. For example, the first row should look something like this:

    userId        occurrence  profile.birthday profile.gender postDate count
    1 100469891698         6               47         Female 583 days     0

Thank you for your help!

eagerstudent asked Aug 27 '18 11:08


1 Answer

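Have you tried ordering the data frame first? `duplicated()` flags every occurrence of a userId after the first one, so sort the rows so that the one you want to keep comes first for each userId: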

    occurrence <- occurrence[order(occurrence$userId, occurrence$postDate, decreasing=TRUE),]
    occurrenceClean <- occurrence[!duplicated(occurrence$userId),]
    occurrenceClean
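
To see the idea on something reproducible, here is a minimal sketch on toy data. The values are made up for illustration, and it assumes postDate is a difftime measured in days where a larger value means an older post (which matches the 583 days row the question wants to keep). The names `ord` and `sorted` are only used for this example.

```r
# Toy data (hypothetical values). postDate is assumed to be a difftime in
# days, where a larger value means an older post.
occurrence <- data.frame(
  userId   = c(1, 1, 1, 2, 2),
  postDate = as.difftime(c(583, 55, 481, 10, 300), units = "days"),
  count    = c(0, 0, 0, 1, 1)
)

# Sort by userId, then by postDate descending, so the oldest post
# comes first within each user.
ord <- order(occurrence$userId, -as.numeric(occurrence$postDate))
sorted <- occurrence[ord, ]

# duplicated() marks every occurrence of a userId after the first one,
# so negating it keeps only the first (here: oldest) row per user.
occurrenceClean <- sorted[!duplicated(sorted$userId), ]
occurrenceClean
```

Here only postDate is sorted in descending order, while userId stays ascending. With `decreasing=TRUE` as in the snippet above, both keys are sorted descending, which selects the same row per user and only changes the order of the users in the result.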
Sandra Barão answered Oct 11 '22 09:10