
Removing the first duplicate row and keeping the rest?

Tags: r

For every value of column 'User' that appears more than once, I want to remove only the first row in which it appears and keep the rest. Users that appear only once should be left untouched.

DF:

User  No
A     1
B     1
A     2
A     3
A     4
C     1
B     2
D     1

Desired result (rows A 1 and B 1 removed):

User  No  
A     2
A     3
A     4
C     1
B     2
D     1
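
For anyone who wants to run the code, the example can be recreated roughly as follows. This is only a sketch: the column types (character 'User', numeric 'No') and the name `expected` are assumptions, since the post shows just the printed tables.

```r
# Example data as printed above; column types are assumed
DF <- data.frame(
  User = c("A", "B", "A", "A", "A", "C", "B", "D"),
  No   = c(1, 1, 2, 3, 4, 1, 2, 1),
  stringsAsFactors = FALSE
)

# Desired result: the first row of each repeated user (rows 1 and 2) removed
expected <- DF[-c(1, 2), ]
rownames(expected) <- NULL
print(expected)
```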

I've been unsuccessful using the duplicated function.

Any help would be appreciated! Thanks!

ant asked Apr 20 '15 05:04


2 Answers

If I understand correctly, this should work:

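library(dplyr)
# dd is the data frame. Within each 'User' group, keep the rows that repeat
# the 'User' value, plus the only row of any group that has a single row.
dd %>% group_by(User) %>% filter(duplicated(User) | n()==1)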
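
As a self-contained check, here is the same filter applied to the question's data. The construction of `dd` is an assumption (the answer does not show it), and `ungroup()` is added only so that the result is returned without grouping.

```r
library(dplyr)

# Assumed reconstruction of the question's data
dd <- data.frame(
  User = c("A", "B", "A", "A", "A", "C", "B", "D"),
  No   = c(1, 1, 2, 3, 4, 1, 2, 1),
  stringsAsFactors = FALSE
)

res <- dd %>%
  group_by(User) %>%
  # duplicated() is FALSE for the first row of each group, TRUE afterwards;
  # n() == 1 rescues the groups that only have one row
  filter(duplicated(User) | n() == 1) %>%
  ungroup()

print(res)
```

The rows come back in their original order, with the first A row and the first B row dropped.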
MrFlick answered Nov 05 '22 02:11


Here is an option using data.table. We convert the 'data.frame' to a 'data.table' (setDT(DF)). Grouped by the 'User' column, we select all rows except the first (tail(.SD, -1)), where .SD is the Subset of Data.table for the current group. On its own, this would also remove the row when a 'User' group has only a single row. We avoid that with an if/else condition: if the number of rows is greater than 1 (.N>1), we remove the first row; otherwise we return the group unchanged (.SD).

library(data.table)
setDT(DF)[, if(.N>1) tail(.SD,-1) else .SD , by = User]
#   User No
#1:    A  2
#2:    A  3
#3:    A  4
#4:    B  2
#5:    C  1
#6:    D  1
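
The same if/else idea can be sketched in base R, which may help to show why the single-row check is needed. The helper name `drop_first_if_many` and the data construction are assumptions made for illustration; this is not part of the data.table solution itself.

```r
# Assumed reconstruction of the question's data
DF <- data.frame(
  User = c("A", "B", "A", "A", "A", "C", "B", "D"),
  No   = c(1, 1, 2, 3, 4, 1, 2, 1),
  stringsAsFactors = FALSE
)

# tail(x, -1) drops the first row, so a one-row group would vanish entirely
only_c <- DF[DF$User == "C", ]
rows_left <- nrow(tail(only_c, -1))
print(rows_left)

# Hypothetical helper: drop the first row only when the group has several rows
drop_first_if_many <- function(g) if (nrow(g) > 1) tail(g, -1) else g

# Split by user, apply the helper to each piece, and bind the pieces together
res <- do.call(rbind, lapply(split(DF, DF$User), drop_first_if_many))
rownames(res) <- NULL
print(res)
```

As with the grouped data.table call, the result here is ordered by group rather than by original row position.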

A similar option to @MrFlick's dplyr code uses a logical condition with duplicated and .N (the number of rows in the group). We create a logical column 'N' that is TRUE for 'User' groups with a single observation (.N==1). In the next step, we keep the rows where 'N' is TRUE or where 'User' is duplicated, and then drop the helper column (N:=NULL). duplicated returns TRUE for repeated values and FALSE for the first occurrence.

setDT(DF)[DF[, N:=.N==1, by = User][, N|duplicated(User)]][,N:=NULL][]
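
The one-liner is fairly dense, so here is a step-by-step sketch of the same logic. The name `DT` and the data construction are assumptions; the steps are intended to mirror the chained call above rather than replace it.

```r
library(data.table)

# Assumed reconstruction of the question's data
DT <- data.table(
  User = c("A", "B", "A", "A", "A", "C", "B", "D"),
  No   = c(1, 1, 2, 3, 4, 1, 2, 1)
)

# Step 1: flag the groups that contain a single row
DT[, N := .N == 1, by = User]

# Step 2: keep rows from single-row groups or rows that repeat a 'User'
res <- DT[N | duplicated(User)]

# Step 3: drop the helper column from both tables
res[, N := NULL]
DT[, N := NULL]

print(res)
```

Because the subsetting is not done by group, the rows keep their original order.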

A base R option uses ave to build a logical index ('indx2') that is TRUE where the 'User' group has length 1. Combining it with duplicated, as described above, subsets the dataset.

indx2 <- with(DF, ave(seq_along(User), User, FUN=length)==1)
DF[duplicated(DF$User)|indx2,]
#   User No
#3    A  2
#4    A  3
#5    A  4
#6    C  1
#7    B  2
#8    D  1
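
To see how the two conditions combine, it can help to look at the intermediate vectors side by side. This is a small sketch; the names `dup`, `single`, and `keep` and the data construction are assumptions added for illustration.

```r
# Assumed reconstruction of the question's data
DF <- data.frame(
  User = c("A", "B", "A", "A", "A", "C", "B", "D"),
  No   = c(1, 1, 2, 3, 4, 1, 2, 1),
  stringsAsFactors = FALSE
)

# TRUE for every occurrence of a 'User' after the first one
dup <- duplicated(DF$User)

# TRUE where the 'User' group contains exactly one row
single <- with(DF, ave(seq_along(User), User, FUN = length) == 1)

# Using dup alone would drop C and D; OR-ing with single keeps them
print(data.frame(DF, dup, single, keep = dup | single))

res <- DF[dup | single, ]
print(res)
```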
akrun answered Nov 05 '22 03:11