I want to remove duplicates based on column 'User' but only the first instance where it appears.
DF:
User No
A 1
B 1
A 2
A 3
A 4
C 1
B 2
D 1
Result: (A1 and B1 removed)
User No
A 2
A 3
A 4
C 1
B 2
D 1
I've been unsuccessful with using the duplicated function.
Any help would be appreciated! Thanks!
If i understand correctly, this should work
library(dplyr)
dd %>% group_by(User) %>% filter(duplicated(User) | n()==1)
Here is an option using data.table
. We convert the 'data.frame' to 'data.table' (setDT(DF)
). Grouped by the 'User' column, we select all the rows except the first (tail(.SD, -1)
) where .SD
is Subset of Data.table
. But, this will also remove the row if there is only a single row for a 'User' group. We can avoid that by using an if/else
condition stating that if
the number of rows are greater than 1 (.N>1
), we remove the first row or else
return the row (.SD
).
library(data.table)
setDT(DF)[, if(.N>1) tail(.SD,-1) else .SD , by = User]
# User No
#1: A 2
#2: A 3
#3: A 4
#4: B 2
#5: C 1
#6: D 1
Or a similar option as in @MrFlick's dplyr code would be using a logical condition with duplicated
and .N
(number of rows). We create a column 'N' by checking 'User' groups that have a single observation (.N==1
), in the next step, we subset the rows that have are either TRUE for N or is duplicated
for the 'User'. The duplicated
returns TRUE
values for duplicate
rows leaving the first value as FALSE
.
setDT(DF)[DF[, N:=.N==1, by = User][, N|duplicated(User)]][,N:=NULL][]
Or a base R
option would be using ave
to get the logical index ('indx2') by checking if the length
for each 'User' group is 1 or not. We can use this along with the duplicated
as described above to subset the dataset.
indx2 <- with(DF, ave(seq_along(User), User, FUN=length)==1)
DF[duplicated(DF$User)|indx2,]
# User No
#3 A 2
#4 A 3
#5 A 4
#6 C 1
#7 B 2
#8 D 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With