 

remove duplicate row based only on previous row

Tags:

dataframe

r

I'm trying to remove duplicate rows from a data frame, based only on the previous row. The duplicated and unique functions remove all duplicates, leaving only the unique rows, which is not what I want (I show unique(xy) below for comparison).

I've illustrated the problem here with a loop. I need to vectorize this because my actual data set is much too large to use a loop on.

x <- c(1,1,1,1,3,3,3,4)
y <- c(1,1,1,1,3,3,3,4)
z <- c(1,2,1,1,3,2,2,4)
xy <- data.frame(x,y,z)

xy
  x y z
1 1 1 1
2 1 1 2
3 1 1 1
4 1 1 1 #this should be removed
5 3 3 3
6 3 3 2
7 3 3 2 #this should be removed
8 4 4 4
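
For comparison, unique() also drops row 3, which is only a non-consecutive duplicate of row 1, so it doesn't give the result I'm after:

unique(xy)
  x y z
1 1 1 1
2 1 1 2
5 3 3 3
6 3 3 2
8 4 4 4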

# loop that produces desired output
toRemove <- NULL
for (i in 2:nrow(xy)){
   test <- as.vector(xy[i,] == xy[i-1,])
   if (!(FALSE %in% test)){ 
      toRemove <- c(toRemove, i) #build a vector of rows to remove
   }
}
xy[-toRemove,] #exclude rows
  x y z
1 1 1 1
2 1 1 2
3 1 1 1
5 3 3 3
6 3 3 2
8 4 4 4

I've tried using dplyr's lag function, but it only works on single columns; when I try to run it over all three columns it doesn't work.

ifelse(xy[,1:3] == lag(xy[,1:3],1), NA, xy[,1:3])

Any advice on how to accomplish this?

asked Oct 18 '22 by Lloyd Christmas

1 Answer

It looks like we want to remove a row when it is the same as the row above:

# build a logical index: keep the first row, plus any row that differs
# from the previous row in at least one column
ix <- c(TRUE, rowSums(tail(xy, -1) == head(xy, -1)) != ncol(xy))

# keep only the flagged rows
xy[ix, ]
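
On the example data this keeps rows 1, 2, 3, 5, 6 and 8, matching the loop's output. If you'd rather stay in dplyr, as the question attempted with lag, a rough sketch along those lines could look like this (it assumes a recent dplyr where if_any() is available, and no NAs in the columns):

library(dplyr)

# keep the first row, plus any row where at least one column
# differs from its value in the previous row;
# on older dplyr you can spell the columns out instead:
#   x != lag(x) | y != lag(y) | z != lag(z)
xy %>%
  filter(row_number() == 1 | if_any(everything(), ~ .x != lag(.x)))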
answered Nov 03 '22 by zx8754