Is there a way to encode increasing integer sequences in R, analogous to encoding run lengths using run length encoding (rle)?
I'll illustrate with an example:
Analogy: Run length encoding
r <- c(rep(1, 4), 2, 3, 4, rep(5, 5))
rle(r)
Run Length Encoding
lengths: int [1:5] 4 1 1 1 5
values : num [1:5] 1 2 3 4 5
Desired: sequence length encoding
s <- c(1:4, rep(5, 4), 6:9)
s
[1] 1 2 3 4 5 5 5 5 6 7 8 9
somefunction(s)
Sequence lengths
lengths: int [1:4] 5 1 1 5
value1 : num [1:4] 1 5 5 5
Edit 1
Thus, somefunction(1:10) will give the result:
Sequence lengths
lengths: int [1:1] 10
value1 : num [1:1] 1
This result means that there is an integer sequence of length 10 with a starting value of 1, i.e. seq(1, 10)
Note that there isn't a mistake in my example result. The vector in fact ends in the sequence 5:9, not 6:9 which was used to construct it.
My use case is that I am working with survey data in an SPSS export file. Each subquestion in a grid of questions will have a name of the pattern paste("q", 1:5), but sometimes there is an "other" category which will be marked q_99, q_other or something else. I wish to find a way of identifying the sequences.
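As a rough sketch of this use case, assuming the names really look like q_1 to q_5 plus q_99 and q_other (the underscore separator and the example names are assumptions, not taken from a real export), one could pull out the numeric suffix and group consecutive values:

```r
# Hypothetical variable names, as they might appear in an SPSS export.
vars <- c("q_1", "q_2", "q_3", "q_4", "q_5", "q_99", "q_other")

# Take the text after the last underscore; non-numeric suffixes become NA.
suffix <- suppressWarnings(as.integer(sub(".*_", "", vars)))

# A new run starts wherever the suffix is not the previous suffix + 1.
# Comparisons involving NA also start a new run.
step_ok <- diff(suffix) == 1L
new_run <- c(TRUE, is.na(step_ok) | !step_ok)
run_id  <- cumsum(new_run)

# Names grouped by run: q_1..q_5 together, q_99 and q_other on their own.
runs <- split(vars, run_id)
runs
```

This groups with cumsum rather than returning lengths and start values, so it is a different presentation of the same idea, not the encoding asked for above.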
Edit 2
In a way, my desired function is the inverse of the base function sequence, with the start value, value1 in my example, added.
lengths <- c(5, 1, 1, 5)
value1 <- c(1, 5, 5, 5)
s
[1] 1 2 3 4 5 5 5 5 6 7 8 9
sequence(lengths) + rep(value1-1, lengths)
[1] 1 2 3 4 5 5 5 5 6 7 8 9
Edit 3
I should have stated that for my purposes a sequence is a run of consecutive integers increasing in steps of 1, as opposed to any monotonically increasing sequence: c(4,5,6,7) qualifies, but c(2,4,6,8) and c(5,4,3,2,1) do not. However, any other integer can appear between sequences.
This means a solution should be able to cope with this test case:
somefunction(c(2, 4, 1:4, 5, 5))
Sequence lengths
lengths: int [1:4] 1 1 5 1
value1 : num [1:4] 2 4 1 5
In the ideal case, the solution can also cope with the use case suggested originally, which would include characters in the vector, e.g.
somefunction(c(2, 4, 1:4, 5, 5, "other"))
Sequence lengths
lengths: int [1:5] 1 1 5 1 1
value1 : num [1:5] 2 4 1 5 "other"
A sequence vector is created from a range of consecutive numbers, such as 1 to 15, 21 to 51, 101 to 150, or -5 to 10. The length of such a vector can be found with the length function.
The rle function is named for the acronym of “run length encoding”. What does the term “run length” mean? Imagine you flip a coin 10 times and record the outcome as “H” if the coin lands showing heads, and “T” if the coin lands showing tails. You want to know what the longest streak of heads is.
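To make the coin-flip picture concrete, here is a small sketch using base R's rle. The flip outcomes are made-up illustrative data.

```r
# Ten made-up coin flips (illustrative data only).
flips <- c("H", "H", "T", "H", "H", "H", "T", "T", "H", "T")

# rle() collapses each streak of identical values into a length and a value.
runs <- rle(flips)

# Longest streak of heads: the largest run length among runs whose value is "H".
longest_heads <- max(runs$lengths[runs$values == "H"])
longest_heads
```

Here the streaks are HH, T, HHH, TT, H, T, so the longest streak of heads has length 3.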
EDIT: added a check so that character vectors are handled as well.
Based on rle, I came up with the following solution:
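somefunction <- function(x){
  # Coerce non-numeric input; strings that are not numbers become NA (with a warning)
  if(!is.numeric(x)) x <- as.numeric(x)
  n <- length(x)
  # TRUE where an element is not its predecessor + 1, i.e. a new run starts there
  y <- x[-1L] != x[-n] + 1L
  # Index of the last element of each run; comparisons with NA also break a run
  i <- c(which(y|is.na(y)),n)
  list(
    lengths = diff(c(0L,i)),
    values = x[head(c(0L,i)+1L,-1L)]
  )
}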
> s <- c(2,4,1:4, rep(5, 4), 6:9,4,4,4)
> somefunction(s)
$lengths
[1] 1 1 5 1 1 5 1 1 1
$values
[1] 2 4 1 5 5 5 4 4 4
This one works on every test case I tried and uses vectorized operations without ifelse clauses, so it should run faster. It converts strings to NA, so the output stays numeric.
> S <- c(4,2,1:5,5, "other" , "other",4:6,2)
> somefunction(S)
$lengths
[1] 1 1 5 1 1 1 3 1
$values
[1] 4 2 1 5 NA NA 4 2
Warning message:
In somefunction(S) : NAs introduced by coercion
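Since the question describes the desired function as an inverse of sequence, a round trip makes a handy check. The sketch below repeats the function above so that it runs on its own, and adds a hypothetical helper, inverse_somefunction (the name is an assumption, by analogy with base R's inverse.rle), that rebuilds the vector with the sequence() expression from Edit 2.

```r
# The answer's function, repeated so this block runs on its own.
somefunction <- function(x) {
  if (!is.numeric(x)) x <- as.numeric(x)
  n <- length(x)
  y <- x[-1L] != x[-n] + 1L
  i <- c(which(y | is.na(y)), n)
  list(
    lengths = diff(c(0L, i)),
    values  = x[head(c(0L, i) + 1L, -1L)]
  )
}

# Hypothetical inverse: expand each (length, start value) pair back into
# start, start + 1, ..., start + length - 1.
inverse_somefunction <- function(enc) {
  sequence(enc$lengths) + rep(enc$values - 1, enc$lengths)
}

s <- c(2, 4, 1:4, rep(5, 4), 6:9, 4, 4, 4)
enc <- somefunction(s)
all.equal(inverse_somefunction(enc), s)
```

Runs that were coerced from strings come back as NA, so the round trip is only exact for purely numeric input.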