
Identifying sequences of repeated numbers in R

Tags: r

I have a long time series where I need to identify and flag sequences of repeated values. Here's some data:

   DATETIME WDIR
1  40360.04   22
2  40360.08   23
3  40360.12  126
4  40360.17  126
5  40360.21  126
6  40360.25  126
7  40360.29   25
8  40360.33   26
9  40360.38  132
10 40360.42  132
11 40360.46  132
12 40360.50   30
13 40360.54  132
14 40360.58   35

So if I need to flag values that are repeated three or more times in a row, I have a sequence of four 126s and a sequence of three 132s that need to be flagged.

I'm very new to R. I expect I use cbind to create a new column in this array with a "T" in the corresponding rows, but how to populate the column correctly is a mystery. Any pointers please? Thanks a bunch.

Elizabeth asked Sep 22 '11 03:09


2 Answers

As Ramnath says, you can use rle.

rle(dat$WDIR)
Run Length Encoding
  lengths: int [1:9] 1 1 4 1 1 3 1 1 1
  values : int [1:9] 22 23 126 25 26 132 30 132 35
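
To reproduce the output above, the sample data from the question can be rebuilt by hand. This is only a sketch: the name dat matches the code in this answer, the values are copied from the question's table, and storing WDIR as integers is an assumption based on the "int" shown in the rle output.

```r
# Rebuild the sample data from the question (values copied from its table).
dat <- data.frame(
  DATETIME = c(40360.04, 40360.08, 40360.12, 40360.17, 40360.21, 40360.25,
               40360.29, 40360.33, 40360.38, 40360.42, 40360.46, 40360.50,
               40360.54, 40360.58),
  WDIR = c(22L, 23L, 126L, 126L, 126L, 126L, 25L, 26L,
           132L, 132L, 132L, 30L, 132L, 35L)
)

r <- rle(dat$WDIR)
r$lengths  # run lengths, one entry per run
r$values   # the value that each run repeats

# inverse.rle() expands the encoding back into the original vector
identical(inverse.rle(r), dat$WDIR)
```

Note that the three consecutive 132s and the later lone 132 are separate runs, because rle only groups adjacent equal values.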

rle returns an object with two components, lengths and values. We can use the lengths component to build a new column that identifies which values are repeated three or more times in a row.

tmp <- rle(dat$WDIR)
rep(tmp$lengths >= 3,times = tmp$lengths)
[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE

This will be our new column.

newCol <- rep(tmp$lengths >= 3,times = tmp$lengths)
cbind(dat,newCol)
   DATETIME WDIR newCol
1  40360.04   22  FALSE
2  40360.08   23  FALSE
3  40360.12  126   TRUE
4  40360.17  126   TRUE
5  40360.21  126   TRUE
6  40360.25  126   TRUE
7  40360.29   25  FALSE
8  40360.33   26  FALSE
9  40360.38  132   TRUE
10 40360.42  132   TRUE
11 40360.46  132   TRUE
12 40360.50   30  FALSE
13 40360.54  132  FALSE
14 40360.58   35  FALSE
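
If the same flagging is needed for several columns or thresholds, the two steps can be wrapped in a small function. The helper below is a hypothetical convenience, not part of the original answer: the names flag_runs and min_run are made up for illustration.

```r
# Hypothetical helper (not from the original answer): flag every element
# that belongs to a run of at least `min_run` identical consecutive values.
flag_runs <- function(x, min_run = 3) {
  r <- rle(x)
  # One TRUE/FALSE per run, expanded back to one per element
  rep(r$lengths >= min_run, times = r$lengths)
}

wdir <- c(22, 23, 126, 126, 126, 126, 25, 26, 132, 132, 132, 30, 132, 35)
flag_runs(wdir)               # TRUE only inside the 126 and 132 runs
which(flag_runs(wdir))        # positions 3-6 and 9-11
flag_runs(wdir, min_run = 4)  # only the run of four 126s qualifies
```

With a helper like this, the column can also be added by plain assignment, for example dat$flag <- flag_runs(dat$WDIR), instead of going through cbind.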
joran answered Oct 19 '22 09:10


Use rle to do the job! It computes the lengths of runs of consecutive equal values in a vector. Here is some example code showing how you can use rle to pick out the miscreants in your data. It returns all rows of the data frame whose WDIR value occurs in at least one run of three or more. Note that this matches on value, so an isolated occurrence of such a value (row 13 in the sample data, a lone 132) is returned as well.

runs = rle(mydf$WDIR)
subset(mydf, WDIR %in% runs$values[runs$lengths >= 3])
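
The difference between matching on value and matching on position can be checked on the question's data. The following is a sketch, assuming mydf holds the WDIR column from the question. The names by_value, in_run and by_position are illustrative only.

```r
# Sample data from the question; `mydf` follows the naming in this answer.
mydf <- data.frame(
  WDIR = c(22, 23, 126, 126, 126, 126, 25, 26, 132, 132, 132, 30, 132, 35)
)

runs <- rle(mydf$WDIR)

# Value-based filter: keeps every row whose value occurs in some long run,
# so the isolated 132 in row 13 is included.
by_value <- subset(mydf, WDIR %in% runs$values[runs$lengths >= 3])

# Position-based filter: keeps only rows that sit inside a long run.
in_run <- rep(runs$lengths >= 3, times = runs$lengths)
by_position <- mydf[in_run, , drop = FALSE]

rownames(by_value)     # includes "13"
rownames(by_position)  # rows 3-6 and 9-11 only
```

Which of the two is appropriate depends on whether a stray repeat of a flagged value should count. The question as asked seems to want only the rows inside the runs.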
Ramnath answered Oct 19 '22 08:10