
R - delete consecutive (ONLY) duplicates

I need to eliminate rows from a data frame based on the repetition of values in a given column, but only those that are consecutive. For example, for the following data frame:

df = data.frame(x=c(1,1,1,2,2,4,2,2,1))
df$y <- c(10,11,30,12,49,13,12,49,30)
df$z <- c(1,2,3,4,5,6,7,8,9)

x  y z
1 10 1
1 11 2
1 30 3
2 12 4
2 49 5
4 13 6
2 12 7
2 49 8
1 30 9

I would need to eliminate rows with consecutive repeated values in the x column, keep the last repeated row, and maintain the structure of the data frame:

x  y z
1 30 3
2 49 5
4 13 6
2 49 8
1 30 9

Following directions from help and some other posts, I have tried using the duplicated function:

df[ !duplicated(x,fromLast=TRUE), ] # which gives me this:
      x  y  z
1     1 10  1
6     4 13  6
7     2 12  7
9     1 30  9
NA   NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
NA.6 NA NA NA
NA.7 NA NA NA
NA.8 NA NA NA

I am not sure why I get the NA rows at the end (this did not happen with a similar table I was testing), and the non-NA rows are only partially correct.

I have also tried using the data.table package as follows:

library(data.table)
dt <- as.data.table(df)           
setkey(dt, x)                    
dt[J(unique(x)), mult ='last'] 

Works great, but it eliminates ALL duplicates from the data frame, not just those that are consecutive, giving something like this:

x  y z
1 30 9
2 49 8
4 13 6

Please forgive me if this is a cross-post. I tried some of the suggestions from other posts, but none of them eliminated only the consecutive duplicates. I would appreciate any help.

Thanks

asked Mar 15 '18 18:03 by ebb



2 Answers

How about:

df[cumsum(rle(df$x)$lengths),]

Explanation:

rle(df$x)

gives you the run lengths and values of consecutive duplicates in the x variable. Then:

rle(df$x)$lengths

extracts the lengths. Finally:

cumsum(rle(df$x)$lengths)

gives the row index of the last row in each run, which you can select using [.
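To make the three steps concrete, here is a small sketch on the data frame from the question that prints each intermediate value. The variable names (runs, last_rows, first_rows) are introduced here for illustration only, and the keep-first variant at the end is an extra that the question did not ask for.

```r
# Sample data from the question
df <- data.frame(x = c(1, 1, 1, 2, 2, 4, 2, 2, 1),
                 y = c(10, 11, 30, 12, 49, 13, 12, 49, 30),
                 z = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

runs <- rle(df$x)   # run-length encoding of the x column
runs$values         # 1 2 4 2 1  (value of each run)
runs$lengths        # 3 2 1 2 1  (length of each run)

# The cumulative sum of the run lengths is the index of each run's last row
last_rows <- cumsum(runs$lengths)   # 3 5 6 8 9
result <- df[last_rows, ]

# Illustrative variant (not requested): index of each run's first row
first_rows <- last_rows - runs$lengths + 1   # 1 4 6 7 9
```

Because the selection is done by row index, the original row names (3, 5, 6, 8, 9) are kept in the result.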

EDIT: for fun, here's a microbenchmark of the answers given so far. rle is mine; consec is the answer given by @James, which I think is the most fundamentally direct and is the one I would "accept"; dp is the dplyr answer given by @Nik.

#> Unit: microseconds
#>    expr       min         lq       mean     median         uq        max
#>     rle   134.389   145.4220   162.6967   154.4180   172.8370    375.109
#>  consec   111.411   118.9235   136.1893   123.6285   145.5765    314.249
#>      dp 20478.898 20968.8010 23536.1306 21167.1200 22360.8605 179301.213

rle performs better than I thought it would.
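For anyone who wants to try a comparison like this, the following is a rough sketch of how it could be set up. It is not the exact script behind the numbers above. The wrapper names keep_last_rle and keep_last_consec are made up for this example, the dplyr variant is left out because its code is not shown here, and the timing step assumes the microbenchmark package is installed.

```r
df <- data.frame(x = c(1, 1, 1, 2, 2, 4, 2, 2, 1),
                 y = c(10, 11, 30, 12, 49, 13, 12, 49, 30),
                 z = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

# Hypothetical wrapper names, introduced only for this sketch
keep_last_rle    <- function(d) d[cumsum(rle(d$x)$lengths), ]
keep_last_consec <- function(d) d[c(d$x[-1] != d$x[-nrow(d)], TRUE), ]

# Both approaches should select the same rows
stopifnot(isTRUE(all.equal(keep_last_rle(df), keep_last_consec(df))))

# Time them only if the microbenchmark package is available
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    rle    = keep_last_rle(df),
    consec = keep_last_consec(df),
    times  = 100L
  ))
}
```

Timings will differ by machine and R version, and a 9-row data frame mostly measures function-call overhead, so a larger input would be needed to say anything about scaling.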

answered Sep 22 '22 12:09 by ngm


You just need to check that a value is not followed by a duplicate, i.e. x[i+1] != x[i], and note that the last value will always be kept.

df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
  x  y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9
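The one-liner above can be unpacked as follows. This is only a sketch on the question's data; the names differs_from_next, keep and keep_first are introduced here for illustration, and the keep-first variant is an extra that the question did not ask for.

```r
df <- data.frame(x = c(1, 1, 1, 2, 2, 4, 2, 2, 1),
                 y = c(10, 11, 30, 12, 49, 13, 12, 49, 30),
                 z = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

x <- df$x
n <- nrow(df)

# Compare each element with its successor: x[2..n] against x[1..(n-1)]
differs_from_next <- x[-1] != x[-n]
# FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE

# The last row has no successor, so it is always kept
keep <- c(differs_from_next, TRUE)
result <- df[keep, ]   # rows 3, 5, 6, 8, 9

# Illustrative variant (not requested): keep the first row of each run,
# where the first row has no predecessor and is always kept
keep_first <- c(TRUE, differs_from_next)
df[keep_first, ]       # rows 1, 4, 6, 7, 9
```

One caveat: if x contains NA, the comparison yields NA and the subset will include NA rows, so missing values would need separate handling.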
answered Sep 26 '22 12:09 by James