
Sequence length encoding using R

Tags: r, encoding

Is there a way to encode increasing integer sequences in R, analogous to encoding run lengths using run length encoding (rle)?

I'll illustrate with an example:

Analogy: Run length encoding

r <- c(rep(1, 4), 2, 3, 4, rep(5, 5))
rle(r)
Run Length Encoding
  lengths: int [1:5] 4 1 1 1 5
  values : num [1:5] 1 2 3 4 5

Desired: sequence length encoding

s <- c(1:4, rep(5, 4), 6:9)
s
[1] 1 2 3 4 5 5 5 5 6 7 8 9

somefunction(s)
Sequence lengths
  lengths: int [1:4] 5 1 1 5
  value1 : num [1:4] 1 5 5 5

Edit 1

Thus, somefunction(1:10) will give the result:

Sequence lengths
  lengths: int [1:1] 10
  value1 : num [1:1] 1 

This result means that there is an integer sequence of length 10 with a starting value of 1, i.e. seq(1, 10).

Note that my example result is not a mistake: the vector in fact ends with the sequence 5:9, not the 6:9 that was used to construct it.

My use case is that I am working with survey data in an SPSS export file. Each subquestion in a grid of questions has a name following the pattern paste("q", 1:5), but sometimes there is an "other" category, marked q_99, q_other or something else. I want to find a way of identifying the sequences.
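To make the use case concrete, here is a minimal sketch of how the numeric suffixes could be pulled out of such variable names and grouped into runs. The names in vars are invented for illustration, and the suffix is assumed to follow the last underscore; a real SPSS export may need a different pattern.

```r
# Hypothetical variable names, as they might appear in an SPSS export.
vars <- c("q_1", "q_2", "q_3", "q_4", "q_5", "q_99", "q_other")

# Drop everything up to the last underscore, leaving only the suffix.
suffix <- sub("^.*_", "", vars)

# Non-numeric suffixes such as "other" become NA; the warning is expected.
num <- suppressWarnings(as.integer(suffix))

# TRUE where an element does NOT continue its predecessor by +1.
# Comparisons involving NA are treated as breaks too.
breaks <- c(TRUE, diff(num) != 1)
breaks[is.na(breaks)] <- TRUE

# Number the runs; names that share a number form one consecutive sequence.
run_id <- cumsum(breaks)
groups <- split(vars, run_id)
groups
```

Here q_1 to q_5 end up in one group, while q_99 and q_other each form a group of their own.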

Edit 2

In a way, my desired function is the inverse of the base function sequence, with the start value, value1 in my example, added.

lengths <- c(5, 1, 1, 5)
value1 <- c(1, 5, 5, 5)

s
[1] 1 2 3 4 5 5 5 5 6 7 8 9
sequence(lengths) + rep(value1-1, lengths) 
[1] 1 2 3 4 5 5 5 5 6 7 8 9

Edit 3

I should have stated that, for my purposes, a sequence means a run of consecutive integers increasing by 1, as opposed to any monotonically increasing sequence: e.g. c(4,5,6,7) qualifies, but c(2,4,6,8) and c(5,4,3,2,1) do not. However, any other integer can appear between sequences.

This means a solution should be able to cope with this test case:

somefunction(c(2, 4, 1:4, 5, 5))
Sequence lengths
  lengths: int [1:4] 1 1 5 1
  value1 : num [1:4] 2 4 1 5

Ideally, the solution would also cope with the use case suggested originally, which includes characters in the vector, e.g.

somefunction(c(2, 4, 1:4, 5, 5, "other"))
Sequence lengths
  lengths: int [1:5] 1 1 5 1 1
  value1 : num [1:5] 2 4 1 5 "other"
Andrie asked Aug 16 '11 11:08


1 Answer

EDIT: added a check to handle character vectors as well.

Based on the idea behind rle, I came up with the following solution:

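somefunction <- function(x){

    # Coerce non-numeric input; strings that are not numbers become NA
    if(!is.numeric(x)) x <- as.numeric(x)
    n <- length(x)
    # TRUE where an element does not continue its predecessor by +1
    y <- x[-1L] != x[-n] + 1L
    # Index of the last element of each run; NA comparisons also end a run
    i <- c(which(y|is.na(y)),n)

    list(
      lengths = diff(c(0L,i)),            # run lengths
      values = x[head(c(0L,i)+1L,-1L)]    # first value of each run
    )

}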

> s <- c(2,4,1:4, rep(5, 4), 6:9,4,4,4)

> somefunction(s)
$lengths
[1] 1 1 5 1 1 5 1 1 1

$values
[1] 2 4 1 5 5 5 4 4 4

It works on every test case I tried and is vectorized throughout, with no ifelse calls, which should make it faster. It converts strings to NA, so the output stays numeric.
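To see why this works, it can help to run the steps by hand on a small input. The sketch below repeats the body of the function on the test vector c(2, 4, 1:4, 5, 5) from the question; the names run_lengths and starts are introduced here only for illustration.

```r
x <- c(2, 4, 1:4, 5, 5)   # 2 4 1 2 3 4 5 5
n <- length(x)

# Compare each element with its predecessor plus one.
# TRUE marks a position after which a new run starts.
y <- x[-1L] != x[-n] + 1L
y
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE

# Positions of the last element of each run; the final element always ends one.
i <- c(which(y | is.na(y)), n)
i
# [1] 1 2 7 8

# Differences between consecutive end positions are the run lengths.
run_lengths <- diff(c(0L, i))
run_lengths
# [1] 1 1 5 1

# Each run starts one position after the previous run ends.
starts <- head(c(0L, i) + 1L, -1L)
x[starts]
# [1] 2 4 1 5
```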

> S <- c(4,2,1:5,5, "other" , "other",4:6,2)

> somefunction(S)
$lengths
[1] 1 1 5 1 1 1 3 1

$values
[1]  4  2  1  5 NA NA  4  2

Warning message:
In somefunction(S) : NAs introduced by coercion
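If the original labels matter, as with the "other" case in the question, one possible variant is to compute the runs on a numeric copy and take the starting values from the untouched input. This is only a sketch: seqle_keep is a made-up name, and because the input is a character vector, the values come back as strings.

```r
# Hypothetical variant: same run detection, but values keep their labels.
seqle_keep <- function(x) {
  # Numeric copy used only for detecting runs; non-numbers become NA.
  num <- suppressWarnings(as.numeric(x))
  n <- length(num)
  # TRUE where an element does not continue its predecessor by +1.
  y <- num[-1L] != num[-n] + 1L
  # End position of each run; NA comparisons also close a run.
  i <- c(which(y | is.na(y)), n)
  # Start position of each run.
  starts <- head(c(0L, i) + 1L, -1L)
  list(
    lengths = diff(c(0L, i)),
    values  = x[starts]   # taken from the original input, not from num
  )
}

res <- seqle_keep(c(2, 4, 1:4, 5, 5, "other"))
res$lengths
# [1] 1 1 5 1 1
res$values
# [1] "2"     "4"     "1"     "5"     "other"
```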
Joris Meys answered Oct 12 '22 22:10
