R - delete consecutive (ONLY) duplicates

I need to eliminate rows from a data frame based on the repetition of values in a given column, but only those that are consecutive. For example, for the following data frame:

df = data.frame(x=c(1,1,1,2,2,4,2,2,1))
df$y <- c(10,11,30,12,49,13,12,49,30)
df$z <- c(1,2,3,4,5,6,7,8,9)

x  y z
1 10 1
1 11 2
1 30 3
2 12 4
2 49 5
4 13 6
2 12 7
2 49 8
1 30 9

I need to eliminate rows with consecutive repeated values in the x column, keeping the last row of each repeated run and maintaining the structure of the data frame.

Following directions from help and some other posts, I have tried using the duplicated function:

df[ !duplicated(x,fromLast=TRUE), ] # which gives me this:
      x  y  z
1     1 10  1
6     4 13  6
7     2 12  7
9     1 30  9
NA   NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
NA.6 NA NA NA
NA.7 NA NA NA
NA.8 NA NA NA

I'm not sure why I get the NA rows at the end (this wasn't happening with a similar table I was testing), and it only partially works on the values.

I have also tried using the data.table package as follows:

library(data.table)
dt <- as.data.table(df)           
setkey(dt, x)                    
dt[J(unique(x)), mult ='last']

This works great, but it eliminates ALL duplicates from the data frame, not just those that are consecutive.

Please forgive me if this is a cross-post. I tried some of the suggestions, but none worked for eliminating only the consecutive duplicates. I would appreciate any help.

Thanks

asked Mar 15 '18 18:03

ebb

2 Answers

How about:

df[cumsum(rle(df$x)$lengths),]

Explanation:

rle(df$x)

gives you the run lengths and values of consecutive duplicates in the x variable. Then:

rle(df$x)$lengths

extracts the lengths. Finally:

cumsum(rle(df$x)$lengths)

gives the index of the last row in each run of consecutive values, which you can use to select rows with [.
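To make the three steps concrete, here is a small sketch that prints each intermediate value for the data frame from the question. The names runs, last_rows and first_rows are just illustrative; first_rows is not part of the answer above and is included only to show how the same run lengths could give the first row of each run instead.

```r
# Data frame from the question
df <- data.frame(x = c(1, 1, 1, 2, 2, 4, 2, 2, 1))
df$y <- c(10, 11, 30, 12, 49, 13, 12, 49, 30)
df$z <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

runs <- rle(df$x)
runs$values    # 1 2 4 2 1  (the value of each run)
runs$lengths   # 3 2 1 2 1  (how many rows each run spans)

# The running total of the lengths is the position of the LAST row of each run
last_rows <- cumsum(runs$lengths)            # 3 5 6 8 9

# Illustrative extra: position of the FIRST row of each run
first_rows <- last_rows - runs$lengths + 1   # 1 4 6 7 9

df[last_rows, ]
```

Selecting with last_rows keeps rows 3, 5, 6, 8 and 9, which is the "keep the last repeated row" behaviour asked for in the question.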

EDIT: For fun, here's a microbenchmark of the answers given so far. rle is mine; consec is the answer given by @James, which I think is the most fundamentally direct one and is the answer I would "accept"; dp is the dplyr answer given by @Nik.

#> Unit: microseconds
#>    expr       min         lq       mean     median         uq        max
#>     rle   134.389   145.4220   162.6967   154.4180   172.8370    375.109
#>  consec   111.411   118.9235   136.1893   123.6285   145.5765    314.249
#>      dp 20478.898 20968.8010 23536.1306 21167.1200 22360.8605 179301.213

rle performs better than I thought it would.
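The code behind those timings isn't shown, so the following is only a sketch of how a comparison like this could be set up, assuming the microbenchmark package is installed. The wrapper names rle_rows and consec_rows are made up for illustration, and the dp case is left out because the dplyr answer is not reproduced on this page. Timings will differ by machine and data size.

```r
# Data frame from the question
df <- data.frame(x = c(1, 1, 1, 2, 2, 4, 2, 2, 1))
df$y <- c(10, 11, 30, 12, 49, 13, 12, 49, 30)
df$z <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Hypothetical wrappers around the two base R answers
rle_rows    <- function(d) d[cumsum(rle(d$x)$lengths), ]
consec_rows <- function(d) d[c(d$x[-1] != d$x[-nrow(d)], TRUE), ]

# Only run the timing if the (assumed) microbenchmark package is available
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    rle    = rle_rows(df),
    consec = consec_rows(df),
    times  = 100L
  ))
}
```

On a nine-row data frame the differences are in microseconds, so it would be worth repeating this on data of a realistic size before drawing conclusions.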

answered Sep 22 '22 12:09

ngm

You just need to check that no duplicate follows a value, i.e. x[i+1] != x[i], and note that the last value will always be kept.

df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
  x  y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9
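If you want to reuse this, the comparison can be wrapped in a small helper. This is only a sketch: the name keep_last_of_runs is made up, and the handling of empty input and NA values is my own assumption about what is wanted (an NA is treated as the end of a run so that no NA rows are produced by the indexing), not something specified in the question.

```r
# Hypothetical helper: keep the last row of each run of consecutive
# equal values in column `col` of data frame `data`.
keep_last_of_runs <- function(data, col) {
  v <- data[[col]]
  n <- length(v)
  if (n == 0L) return(data)          # nothing to do for an empty data frame

  # TRUE where the next value differs from the current one
  changes <- v[-1] != v[-n]
  # Assumption: a comparison involving NA counts as a change
  changes[is.na(changes)] <- TRUE

  # The last row always ends a run; drop = FALSE keeps a data frame
  # even when `data` has a single column
  data[c(changes, TRUE), , drop = FALSE]
}

# Data frame from the question
df <- data.frame(x = c(1, 1, 1, 2, 2, 4, 2, 2, 1))
df$y <- c(10, 11, 30, 12, 49, 13, 12, 49, 30)
df$z <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

res <- keep_last_of_runs(df, "x")
res
```

For the example data this keeps the same rows as above (3, 5, 6, 8 and 9).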

answered Sep 26 '22 12:09

James

Tags: r, duplicates, repeat, delete-row