
Using Run-Length Encoding and Generating Sums

I have the following run-length encoding data.

df1 <- structure(list(lengths = c(2L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 3L, 1L), values = c(10, 9, NA, 5, 4, 3, NA, 2, NA, 1, 0, NA, 0)), row.names = c(NA, -13L), class = "data.frame")
df1
# > df1
#    lengths values
# 1        2     10
# 2        3      9
# 3        2     NA
# 4        1      5
# 5        1      4
# 6        1      3
# 7        1     NA
# 8        1      2
# 9        2     NA
# 10       1      1
# 11       1      0
# 12       3     NA
# 13       1      0

Using a particular threshold (0.01), I create a new variable in this data frame.

df1$Below_Threshold <- df1$values <= 0.01
df1
# > df1
#    lengths values Below_Threshold
# 1        2     10           FALSE
# 2        3      9           FALSE
# 3        2     NA              NA
# 4        1      5           FALSE
# 5        1      4           FALSE
# 6        1      3           FALSE
# 7        1     NA              NA
# 8        1      2           FALSE
# 9        2     NA              NA
# 10       1      1           FALSE
# 11       1      0            TRUE
# 12       3     NA              NA
# 13       1      0            TRUE

I now want to perform run-length encoding on this new variable, but instead of simply returning the number of occurrences, I want to return the sum of the lengths column from the first data frame. The result should look like the sum column in the df2 data frame in the following chunk of code.

df2 <- structure(list(values = c(FALSE, NA, FALSE, NA, FALSE, NA, FALSE, TRUE, NA, TRUE), sum = c(5, 2, 3, 1, 1, 2, 1, 1, 3, 1)), class = "data.frame", row.names = c(NA, -10L))
df2
# > df2
#    values sum
# 1   FALSE   5
# 2      NA   2
# 3   FALSE   3
# 4      NA   1
# 5   FALSE   1
# 6      NA   2
# 7   FALSE   1
# 8    TRUE   1
# 9      NA   3
# 10   TRUE   1

Is there a nice, efficient way of achieving this result? Base R solutions are preferred, but all are welcome.

David Moore asked Dec 22 '25 06:12



1 Answer

library(dplyr)  # consecutive_id() needs dplyr >= 1.1.0

df1 %>%
  group_by(grp = consecutive_id(values <= 0.01)) %>%
  summarise(values = first(values) <= 0.01, sum = sum(lengths))

# A tibble: 10 × 3
     grp values   sum
   <int> <lgl>  <int>
 1     1 FALSE      5
 2     2 NA         2
 3     3 FALSE      3
 4     4 NA         1
 5     5 FALSE      1
 6     6 NA         2
 7     7 FALSE      1
 8     8 TRUE       1
 9     9 NA         3
10    10 TRUE       1

If that feels repetitive, compute the logical column once before grouping:

df1 %>%
  mutate(values = values <= 0.01) %>%
  group_by(grp = consecutive_id(values)) %>%
  summarise(values = first(values), sum = sum(lengths))

# A tibble: 10 × 3
     grp values   sum
   <int> <lgl>  <int>
 1     1 FALSE      5
 2     2 NA         2
 3     3 FALSE      3
 4     4 NA         1
 5     5 FALSE      1
 6     6 NA         2
 7     7 FALSE      1
 8     8 TRUE       1
 9     9 NA         3
10    10 TRUE       1
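
Since base R was preferred, here is one possible sketch that avoids dplyr. It relies on rle(), which starts a new run at every NA, so the logical vector is first recoded with a sentinel. The names below, key, runs and grp, and the sentinel value -1L are illustrative assumptions, not anything required by rle() itself.

```r
df1 <- data.frame(
  lengths = c(2L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 3L, 1L),
  values  = c(10, 9, NA, 5, 4, 3, NA, 2, NA, 1, 0, NA, 0)
)

below <- df1$values <= 0.01  # logical; NA wherever values is NA

# Recode NA to a sentinel so that adjacent NA rows (none occur in this
# data) would be merged into a single run instead of split apart.
key  <- ifelse(is.na(below), -1L, as.integer(below))
runs <- rle(key)

# One group id per row, then sum the original lengths within each run.
grp <- rep(seq_along(runs$lengths), runs$lengths)
df2 <- data.frame(
  values = below[cumsum(runs$lengths)],  # value at the last row of each run
  sum    = as.vector(tapply(df1$lengths, grp, sum))
)
df2
```

For this df1 the sentinel step makes no difference, because no two NA rows are adjacent; it only matters if the input could contain consecutive NA rows that should be collapsed into one group, which is how consecutive_id() treats them.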
KU99 answered Dec 23 '25 22:12
