Remove duplicate rows in R data frame, based on a date field and another field

I'm new to R, learning to handle database data, and I've hit a wall.

I want to remove duplicate rows/observations from a table based on two fields: a user ID and a date indicating the last time the user's record changed, keeping only the most recently dated row per user.

My truncated data set would look like the following:

UID    | DateLastChange
1      |  01/01/2016
1      |  01/03/2016
2      |  01/14/2015
3      |  02/15/2014
3      |  03/15/2016

I would like to end up with:

UID    | DateLastChange
1      |  01/03/2016
2      |  01/14/2015
3      |  03/15/2016
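For reference, a minimal reproducible version of this sample (assuming the dates are stored as character strings; df1 is just an illustrative name):

df1 <- data.frame(
  UID = c(1, 1, 2, 3, 3),
  DateLastChange = c("01/01/2016", "01/03/2016", "01/14/2015",
                     "02/15/2014", "03/15/2016"),
  stringsAsFactors = FALSE  # keep the dates as character, not factor
)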

I have attempted to use duplicated and unique, but neither seems to let me choose which of the duplicate rows to keep. I can imagine building a new table of unique UIDs and then left joining in some way so that only the row with the most recent date matches, roughly as sketched below.
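Something along these lines is what I had in mind (just a sketch, untested; it relies on converting the dates to Date class first):

# convert the character dates so max() compares real dates
df1$DateLastChange <- as.Date(df1$DateLastChange, "%m/%d/%Y")
# one row per UID holding its most recent date
latest <- aggregate(DateLastChange ~ UID, data = df1, FUN = max)
# merge on both UID and DateLastChange, so only the latest row per UID survives
merge(df1, latest)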

Any advice would be much appreciated. Scott

2 Answers

We can use data.table

library(data.table)
# sort by UID and by descending date, then keep the first (most recent) row per UID
setDT(df1)[order(UID, -as.IDate(DateLastChange, "%m/%d/%Y")), head(.SD, 1), by = UID]
#     UID DateLastChange
#1:   1     01/03/2016
#2:   2     01/14/2015
#3:   3     03/15/2016

Or using duplicated

setDT(df1)[order(UID, -as.IDate(DateLastChange, "%m/%d/%Y"))][!duplicated(UID)]
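In both versions the rows are first ordered by UID and by descending date, so within each UID the first row is the most recent one; head(.SD, 1) and !duplicated(UID) each keep just that first row.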
akrun


Using dplyr (the data can be in any order):

library(dplyr)
# convert the character dates to Date class so max() picks the true latest date
dat$DateLastChange <- as.Date(dat$DateLastChange, "%m/%d/%Y")
dat %>% group_by(UID) %>% summarize(DateLastChange = max(DateLastChange))
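The summarised column is then of class Date (printed as 2016-01-03, etc.); if the original mm/dd/yyyy strings are wanted back, one way (a sketch) is to add a final formatting step:

dat %>%
  group_by(UID) %>%
  summarize(DateLastChange = max(DateLastChange)) %>%
  mutate(DateLastChange = format(DateLastChange, "%m/%d/%Y"))  # back to character like "01/03/2016"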
Andrew Lavers