Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replace a sequence of values by group depending on preceeding values

I have a data table of this form (2000000+ rows, 1000+groups):

set.seed(1)    
dt <- data.table(id = rep(1:3, each = 5), values = sample(c("a", "b","c"), 15, TRUE))

> dt
    id values
 1:  1      a
 2:  1      c
 3:  1      a
 4:  1      b
 5:  1      a
 6:  2      c
 7:  2      c
 8:  2      b
 9:  2      b
10:  2      c
11:  3      c
12:  3      a
13:  3      a
14:  3      a
15:  3      b

I want to, within each ID group, replace the entire sequence of character "a", that precedes the character "b", and I want to replace them with "b". So the condition is that if "a" or a sequence of "a"s appear before "b", replace all the "a"s. (actually, in my real table, it's when "b" is preceded by "a","x", or"y", preceding character should be replaced, but I should be able to generalize)

In the example above,the value of "a" in row 3 should be replaced (easy to do with (shift) in data.table), as well as all the "a"s in rows 12-14 (not sure how to do). So, the desired output is this:

> dt
    id values
 1:  1      a
 2:  1      c
 3:  1      b
 4:  1      b
 5:  1      a
 6:  2      c
 7:  2      c
 8:  2      b
 9:  2      b
10:  2      c
11:  3      c
12:  3      b
13:  3      b
14:  3      b
15:  3      b

What comes to my mind is looping from the last index, but I am not exactly sure how to do that with if I have multiple groupings (say, ID and DATE), and anyway, this doesn't seem to be the fastest dt solution.

like image 902
Djpengo Avatar asked Jun 09 '20 16:06

Djpengo


4 Answers

Here's another data.table approach:

dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(id, x=x-1, values="a")], 
   on = .(id, x, values), 
   values := "b"
   ][, x := NULL]
  • create a new column "x" with the run length ids per value grouped by id
  • join on itself while modifying the run length ids (x) to be the preceeding value and values to be "a" (the specific value you want to change), then update values with "b"
  • delete column x afterwards

The result is:

dt
#     id values
#  1:  1      a
#  2:  1      c
#  3:  1      b
#  4:  1      b
#  5:  1      a
#  6:  2      c
#  7:  2      c
#  8:  2      b
#  9:  2      b
# 10:  2      c
# 11:  3      c
# 12:  3      b
# 13:  3      b
# 14:  3      b
# 15:  3      b

And here's a generalization to the case where you want to replace values "a", "x", or "y" followed by "b" with "b":

dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(values=c("a", "x", "y")), by = .(id, x=x-1)], 
   on = .(id, x, values), 
   values := "b"
   ][, x := NULL]
like image 67
talat Avatar answered Nov 18 '22 08:11

talat


Late to the party and several nice run length alternatives were already provided ;) So here I try nafill instead.

(1) Create a variable 'v2' which is NA when 'values' are "a". (2) Fill missing values by next observation carried backward. (3) When the original 'values' are "a" and the corresponding filled values in 'v2' are "b", update 'v' with 'v2'.

# 1
dt[values != "a" , v2 := values]

# 2
d1[, v2 := v2[nafill(replace(seq_len(.N), is.na(v2), NA), type = "nocb")], by = id]

# 3
dt[values == "a" & v2 == "b", values := v2]

# clean-up
dt[ , v2 := NULL]

Currently, nafill only works with numeric variables, hence replace step in chunk # 2 (modified from @chinsoon12 in the issue nafill, setnafill for character, factor and other types).

The NA replacement code may be slightly shortened by using zoo::nalocf:

dt[, v2 := zoo::na.locf(v2, fromLast = TRUE, na.rm = FALSE), by = id]

However, note that na.locf is slower.


When comparing the answers on larger data (data.table(id = rep(1:1e4, each = 1e4, replace = TRUE), values = sample(c("a", "b", "c"), 1e8, replace = TRUE)), it turns out that this alternative actually is faster than the others.

like image 4
Henrik Avatar answered Nov 18 '22 08:11

Henrik


This is not pretty but I think this is what you are after:

dt[, .N, by = .(id, values = paste0(values, rleid(values)))
   ][, values := sub("[0-9]+", "", values)
     ][, values := fifelse(values == "a" & shift(values, -1L) == "b" & !is.na(shift(values, -1L)), "b", values), by = id
       ][, .SD[rep(seq_len(.N), N)]
         ][, !"N"]

    id values
 1:  1      a
 2:  1      c
 3:  1      b
 4:  1      b
 5:  1      a
 6:  2      c
 7:  2      c
 8:  2      b
 9:  2      b
10:  2      c
11:  3      c
12:  3      b
13:  3      b
14:  3      b
15:  3      b
like image 2
sindri_baldur Avatar answered Nov 18 '22 07:11

sindri_baldur


You can use rle().

Note: To avoid ambiguity, I rename the "values" column to "var" because the rle() function also produces a list containing a vector named "values".

dt[, new := with(rle(var), rep(ifelse(values == "a" & c(values[-1], "") == "b", "b", values), lengths)), by = id]
dt

#     id var new
#  1:  1   a   a
#  2:  1   c   c
#  3:  1   a   b
#  4:  1   b   b
#  5:  1   a   a
#  6:  2   c   c
#  7:  2   c   c
#  8:  2   b   b
#  9:  2   b   b
# 10:  2   c   c
# 11:  3   c   c
# 12:  3   a   b
# 13:  3   a   b
# 14:  3   a   b
# 15:  3   b   b
like image 1
Darren Tsai Avatar answered Nov 18 '22 06:11

Darren Tsai