I have a data frame with multiple variables, and I would like to subset it so that it keeps only one row for each duplicated userId.
> head(occurrence)
        userId occurrence profile.birthday profile.gender postDate count
1 100469891698          6               47         Female 583 days     0
2 100469891698          6               47         Female  55 days     0
3 100469891698          6               47         Female 481 days     0
4 100469891698          6               47         Female 583 days     0
5 100469891698          6               47         Female 583 days     0
6 100469891698          6               47         Female 583 days     0
Here you can see the dataframe. The 'occurrence' column counts how many times the same userId has occurred. I have tried the following code to remove duplicates:
occurrence <- occurrence[!duplicated(occurrence$userId),]
However, this removes seemingly "random" duplicates: it keeps whichever row happens to come first for each userId. I want to keep the row that is the oldest by postDate instead. For example, the first row should look something like this:
        userId occurrence profile.birthday profile.gender postDate count
1 100469891698          6               47         Female 583 days     0
Thank you for your help!
Did you try ordering the data frame first, like this?
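# Sort so that, within each userId, the largest postDate (the oldest post) comes first
occurrence <- occurrence[order(occurrence$userId, occurrence$postDate, decreasing=TRUE),]
# duplicated() marks every row after the first per userId, so this keeps only the first one
occurrenceClean <- occurrence[!duplicated(occurrence$userId),]
occurrenceClean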
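To check the idea on something small, here is a minimal sketch with made-up toy data. It assumes postDate is a difftime counting days since the post, so a larger value means an older post; the toy values and the name `sorted` are only for illustration. Instead of `decreasing=TRUE`, which reverses both sort keys, this variant negates postDate so that userId stays in ascending order:

```r
# Toy data; the values are invented for illustration only.
occurrence <- data.frame(
  userId   = c(1, 1, 1, 2, 2),
  postDate = as.difftime(c(583, 55, 481, 10, 300), units = "days"),
  count    = 0
)

# Sort by userId ascending, then by postDate with the largest (oldest) first.
# Assumption: a larger postDate means an older post.
ord <- order(occurrence$userId, -as.numeric(occurrence$postDate))
sorted <- occurrence[ord, ]

# duplicated() flags every row after the first one for each userId,
# so negating it keeps exactly the first (oldest) row per user.
occurrenceClean <- sorted[!duplicated(sorted$userId), ]
occurrenceClean
```

If postDate were an actual date rather than an age in days, the oldest post would be the smallest value, and the sort on postDate would need to be ascending instead.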