
Using R, how can I flag sequential duplicate values in a single column of a data frame?

Tags:

r

duplicates

This is my first post and I'm new to programming and R.

I'm trying to create a new column to mark or flag sequentially duplicated values in a separate column.

df <- c(2,2,2,2,3,4,3,4,3,4,2,3,7,7,7)

Using the duplicated function returns the following:

data.frame(value = df, flag = duplicated(df))

   value  flag  
1      2  FALSE  
2      2  TRUE  
3      2  TRUE  
4      2  TRUE  
5      3  FALSE  
6      4  FALSE  
7      3  TRUE  
8      4  TRUE  
9      3  TRUE  
10     4  TRUE  
11     2  TRUE  
12     3  TRUE  
13     7  FALSE  
14     7  TRUE  
15     7  TRUE   

What I'd like is:

   value  flag  
1      2  TRUE  
2      2  TRUE  
3      2  TRUE  
4      2  TRUE  
5      3  FALSE  
6      4  FALSE  
7      3  FALSE  
8      4  FALSE  
9      3  FALSE  
10     4  FALSE  
11     2  FALSE  
12     3  FALSE  
13     7  TRUE    
14     7  TRUE    
15     7  TRUE     

My data set has over 2 million observations, so ideally the solution would be efficient.

Thank you, John

John Bellettiere asked Jun 27 '13 20:06



2 Answers

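rle, in combination with rep, will get you what you are after. rle gives the length of each run of consecutive equal values, and rep expands a per-run flag (is the run longer than 1?) back to the full length of the vector.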

rl <- rle( df )
rep( rl$lengths != 1 , times = rl$lengths )
#  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
# [15]  TRUE

And I believe rle is fairly efficient.
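
To make the intermediate step visible, here is a small sketch on the 15 values from the question. The names rl and flag are just illustrative; the run lengths in the comments were worked out by hand from that vector.

```r
# The example vector from the question
df <- c(2,2,2,2,3,4,3,4,3,4,2,3,7,7,7)

# rle() collapses the vector into runs of consecutive equal values
rl <- rle(df)
rl$lengths  # 4 1 1 1 1 1 1 1 1 3
rl$values   # 2 3 4 3 4 3 4 2 3 7

# One logical per run (TRUE when the run has more than one element),
# repeated once for every element in that run
flag <- rep(rl$lengths > 1, times = rl$lengths)

# Put the flag next to the original values
result <- data.frame(value = df, flag = flag)
print(result)
```

Only the first four rows (the run of 2s) and the last three rows (the run of 7s) come out TRUE, which matches the output asked for in the question.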

Timing (MBP late 2008) on a 2e6 length vector:

system.time({
  rl <- rle( df )
  res <- rep( rl$lengths != 1 , times = rl$lengths )
})
#   user  system elapsed 
#  0.449   0.106   0.559
Simon O'Hanlon answered Nov 15 '22 07:11



Since you have more than 2 million observations, I really recommend switching to data.table. Here is my solution using rle, similar to @Simon's; I just wrote its data.table version. I believe this is not always obvious, especially for beginners (like me with data.table).

library(data.table)
set.seed(1234)
dd <- sample(1:20, 2e+06, replace = TRUE)
DT <- data.table(dd)
system.time(DT[, grp2 := {
  dd.rle <- rle(dd)  ## store the rle result so rle is not called twice
  rep(dd.rle$lengths > 1, times = dd.rle$lengths)
}])
##    user  system elapsed
##    1.17    0.06    1.28
## For comparison, the timing when rle is called twice:
##    user  system elapsed
##    1.69    0.11    1.86

##        dd  grp2
## 1e+00:  3 FALSE
## 2e+00: 13  TRUE
## 3e+00: 13  TRUE
## 4e+00: 13  TRUE
## 5e+00: 18 FALSE
##    ---         
## 2e+06:  6 FALSE
## 2e+06:  5 FALSE
## 2e+06:  4 FALSE
## 2e+06: 10 FALSE
## 2e+06: 13 FALSE
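
For a quick check, the same idea can be tried on the 15 values from the question. This is only a sketch: the column names value, flag and flag2 are made up for illustration, and the second variant swaps in data.table's rleid helper (assumed available, i.e. data.table 1.9.6 or later) instead of calling rle directly.

```r
library(data.table)

# The example vector from the question, as a one-column data.table
DT <- data.table(value = c(2,2,2,2,3,4,3,4,3,4,2,3,7,7,7))

# Same approach as above: run-length encode once, then expand the per-run flag
DT[, flag := {
  r <- rle(value)
  rep(r$lengths > 1, times = r$lengths)
}]

# Alternative (assumes rleid exists in the installed data.table):
# group by run id and flag groups that have more than one row
DT[, flag2 := .N > 1, by = rleid(value)]

# Both columns should agree
same <- identical(DT$flag, DT$flag2)
print(DT)
```

The rleid variant reads more naturally in data.table, but grouping creates one group per run, so on 2 million rows it is worth timing both before picking one.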
agstudy answered Nov 15 '22 08:11
