
Count length of consecutive sequences per group in R

I have a dataset with consecutive values and I would like to count how many times each sequence length occurs. More specifically, I want to find out how many ids have a sequence running from 1:2, from 1:3, from 1:4, etc. Only sequences starting from 1 are of interest.

In this example, id 1 has a "full" sequence running from 1:3 (as the number 4 is missing), id 2 has a sequence running from 1:5, id 3 has a sequence running from 1:6, id 4 is not counted since it does not start with a value of 1, and id 5 has a sequence running from 1:3.

So we end up with two sequences of length 3, one of length 5 and one of length 6.

Is there a clever way to calculate this, without resorting to inefficient loops?

Example data:

library(data.table)

data <- data.table( id    = c(1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5),
                    value = c(1,2,3,5,1,2,3,4,5,10,11,1,2,3,4,5,6,2,3,4,5,6,7,8,1,2,3,7))

 > data
    id value
 1:  1     1
 2:  1     2
 3:  1     3
 4:  1     5
 5:  2     1
 6:  2     2
 7:  2     3
 8:  2     4
 9:  2     5
10:  2    10
11:  2    11
12:  3     1
13:  3     2
14:  3     3
15:  3     4
16:  3     5
17:  3     6
18:  4     2
19:  4     3
20:  4     4
21:  4     5
22:  4     6
23:  4     7
24:  4     8
25:  5     1
26:  5     2
27:  5     3
28:  5     7
    id value
Inkling asked Dec 23 '22 16:12


2 Answers

# Note: := adds the len0 column to `data` by reference
out <- data[, len0 := rleid(c(TRUE, diff(value) == 1L)), by = .(id) ][
  , .(value1 = first(value), len = .N), by = .(id, len0) ]
out
#       id  len0 value1   len
#    <num> <int>  <num> <int>
# 1:     1     1      1     3
# 2:     1     2      5     1
# 3:     2     1      1     5
# 4:     2     2     10     1
# 5:     2     3     11     1
# 6:     3     1      1     6
# 7:     4     1      2     7
# 8:     5     1      1     3
# 9:     5     2      7     1

Walk-through:

  • within each id, len0 labels runs of rows: rleid() starts a new label whenever the "increased by exactly 1" flag flips
  • within each (id, len0) group, summarize with the first value (so you can keep only the runs starting at 1, see below) and the length of the run
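
To see what the run label does, it can help to apply the same expression to plain vectors outside of any grouping. The sketch below uses the values of id 1 and id 2 from the question. The `run_alt` line is a hypothetical variant that is not part of this answer, included only to show how a later run such as 10, 11 is treated differently.

```r
library(data.table)

v1 <- c(1, 2, 3, 5)               # values of id 1
v2 <- c(1, 2, 3, 4, 5, 10, 11)    # values of id 2

# TRUE marks "this value is exactly 1 more than the previous one";
# the first element has no predecessor, so it is set to TRUE by hand.
step1 <- c(TRUE, diff(v1) == 1)
step2 <- c(TRUE, diff(v2) == 1)

# rleid() starts a new label whenever the flag changes
run1 <- rleid(step1)   # 1 1 1 2
run2 <- rleid(step2)   # 1 1 1 1 1 2 3

# Hypothetical variant: start a new label at every break instead,
# which keeps 10 and 11 together
run_alt <- cumsum(c(TRUE, diff(v2) != 1))   # 1 1 1 1 1 2 2
```

Since only the first run of each id is of interest here, the fact that 10 and 11 end up in separate groups for id 2 does not change the result.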

If you only want the sequences that begin at 1, filter on value1:

out[ value1 == 1L, ]
#       id  len0 value1   len
#    <num> <int>  <num> <int>
# 1:     1     1      1     3
# 2:     2     1      1     5
# 3:     3     1      1     6
# 4:     5     1      1     3

(I think you only need id and len at this point.)
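
The question ultimately asks how many ids share each length, which the output above does not yet tabulate. One possible last step, sketched here under the assumption that `value` is sorted in increasing order within each `id`, is to count rows per `len`. The names `runs`, `counts` and `n_ids` are illustrative and not from the answer.

```r
library(data.table)

data <- data.table(
  id    = c(1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,5,5,5,5),
  value = c(1,2,3,5,1,2,3,4,5,10,11,1,2,3,4,5,6,2,3,4,5,6,7,8,1,2,3,7)
)

# Same run detection as above, written without modifying `data` in place
runs <- data[, .(value, len0 = rleid(c(TRUE, diff(value) == 1))), by = id][
  , .(value1 = first(value), len = .N), by = .(id, len0)]

# Keep runs that start at 1, then count how many ids have each length
counts <- runs[value1 == 1, .(n_ids = .N), keyby = len]
counts
# len 3 -> 2 ids, len 5 -> 1 id, len 6 -> 1 id
```

This matches the expected result stated in the question: two sequences of length 3, one of length 5 and one of length 6.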

r2evans answered Jan 13 '23 12:01


Here is another option:

data[rowid(id)==value, max(value), id]

output:

   id V1
1:  1  3
2:  2  5
3:  3  6
4:  5  3
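
This one-liner appears to work because `rowid(id)` numbers the rows 1, 2, 3, ... within each id, so a row's position equals its value only while the sequence starting at 1 is unbroken. A minimal sketch that makes the intermediate column visible, assuming values are distinct integers sorted in increasing order within each id (the column name `pos` and the reduced example data are illustrative):

```r
library(data.table)

# Reduced example: id 1 and a shortened id 4 from the question
dt <- data.table(
  id    = c(1, 1, 1, 1, 4, 4, 4),
  value = c(1, 2, 3, 5, 2, 3, 4)
)

# Position of each row within its id
dt[, pos := rowid(id)]

# id 1: pos 1,2,3,4 vs value 1,2,3,5 -> the first three rows match
# id 4: pos 1,2,3   vs value 2,3,4   -> no row matches, so id 4 drops out
res <- dt[pos == value, .(len = max(value)), by = id]
res
```

If the values are not already sorted within each id, ordering them first (for example with `setorder(dt, id, value)`) would be needed for the comparison to be meaningful.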
chinsoon12 answered Jan 13 '23 13:01