New to R, but learning to handle db data and hit a wall.
I want to remove duplicate rows/observations from a table based on two criteria: a user ID field and a date field indicating the last time the user record changed, keeping only the most recent dated row per user.
My truncated data set would look like the following:
UID | DateLastChange
1 | 01/01/2016
1 | 01/03/2016
2 | 01/14/2015
3 | 02/15/2014
3 | 03/15/2016
I would like to end up with:
UID | DateLastChange
1 | 01/03/2016
2 | 01/14/2015
3 | 03/15/2016
I have attempted to use duplicated or unique, but they don't seem to offer this kind of selectivity. I can imagine building a new table of unique UIDs, then left joining in some way to match only the most recent date.
Any advice would be much appreciated. Scott
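For anyone wanting to reproduce the answers below, the sample data can be built like this (df1 and dat are simply the names the answers happen to use; they are not defined in the question):

```r
# Build the example table from the question; dates start out as character strings
df1 <- data.frame(
  UID = c(1, 1, 2, 3, 3),
  DateLastChange = c("01/01/2016", "01/03/2016", "01/14/2015",
                     "02/15/2014", "03/15/2016"),
  stringsAsFactors = FALSE
)
dat <- df1  # the dplyr answer refers to the same data as `dat`
```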
We can use data.table
library(data.table)
setDT(df1)[order(UID, -as.IDate(DateLastChange, "%m/%d/%Y")), head(.SD, 1), by = UID]
# UID DateLastChange
#1: 1 01/03/2016
#2: 2 01/14/2015
#3: 3 03/15/2016
Or using duplicated
setDT(df1)[order(UID, -as.IDate(DateLastChange, "%m/%d/%Y"))][!duplicated(UID)]
Using dplyr - data can be in any order
require(dplyr)
dat$DateLastChange <- as.Date(dat$DateLastChange, "%m/%d/%Y")
dat %>% group_by(UID) %>% summarize(DateLastChange = max(DateLastChange))
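Note that if the real table has more columns than UID and DateLastChange, summarize() will drop them. A variant that keeps the whole row (a sketch assuming dplyr >= 1.0, which introduced slice_max) is:

```r
library(dplyr)

# Same example data as above, with the date column already converted
dat <- data.frame(
  UID = c(1, 1, 2, 3, 3),
  DateLastChange = as.Date(c("01/01/2016", "01/03/2016", "01/14/2015",
                             "02/15/2014", "03/15/2016"), "%m/%d/%Y")
)

# Keep the single most recent row per UID, retaining all other columns
result <- dat %>%
  group_by(UID) %>%
  slice_max(DateLastChange, n = 1, with_ties = FALSE) %>%
  ungroup()
```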