Replace a sequence of values by group depending on preceding values

Question

I have a data.table of this form (2,000,000+ rows, 1,000+ groups):

library(data.table)

set.seed(1)
dt <- data.table(id = rep(1:3, each = 5), values = sample(c("a", "b", "c"), 15, TRUE))

> dt
    id values
 1:  1      a
 2:  1      c
 3:  1      a
 4:  1      b
 5:  1      a
 6:  2      c
 7:  2      c
 8:  2      b
 9:  2      b
10:  2      c
11:  3      c
12:  3      a
13:  3      a
14:  3      a
15:  3      b

Within each id group, I want to replace every run of "a" that immediately precedes a "b" with "b". In other words, if an "a" or a sequence of "a"s appears directly before a "b", all of those "a"s should be replaced. (In my real table, the rule is that when "b" is preceded by "a", "x", or "y", the preceding character should be replaced, but I should be able to generalize.)

In the example above, the "a" in row 3 should be replaced (easy to do with shift() in data.table), as well as all the "a"s in rows 12-14 (which I am not sure how to do). So the desired output is this:

> dt
    id values
 1:  1      a
 2:  1      c
 3:  1      b
 4:  1      b
 5:  1      a
 6:  2      c
 7:  2      c
 8:  2      b
 9:  2      b
10:  2      c
11:  3      c
12:  3      b
13:  3      b
14:  3      b
15:  3      b

What comes to mind is looping from the last index, but I am not sure how to do that when I have multiple grouping columns (say, ID and DATE), and in any case this doesn't seem to be the fastest data.table solution.

talat · Accepted Answer

Here's another data.table approach:

dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(id, x=x-1, values="a")], 
   on = .(id, x, values), 
   values := "b"
   ][, x := NULL]

1. Create a new column "x" holding the run-length ids of values, computed within each id.
2. Join the table on itself, shifting the run-length id (x) of each "b" row back by one so that it points at the preceding run, and setting values to "a" (the specific value you want to change); then update values to "b" for the matching rows.
3. Delete column x afterwards.
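
To make the steps above easier to follow, here is a sketch that runs them one at a time and keeps the intermediate join table around for inspection. The name lookup is only an illustrative choice, the unique() call is an addition that drops duplicate keys when a run of "b" is longer than one row, and the data is the question's example written out literally rather than sampled.

```r
library(data.table)

# The question's example data, typed out so the sketch does not depend on
# the random number generator.
dt <- data.table(
  id     = rep(1:3, each = 5),
  values = c("a", "c", "a", "b", "a",
             "c", "c", "b", "b", "c",
             "c", "a", "a", "a", "b")
)

# Step 1: run-length id of values within each id group.
dt[, x := rleid(values), by = id]

# Step 2: for every "b" row, describe the run directly before it (x - 1)
# and require that run to consist of "a". `lookup` is an illustrative name.
lookup <- unique(dt[values == "b", .(id, x = x - 1L, values = "a")])
print(lookup)

# Update join: only rows of dt matching (id, x, values) in lookup change.
dt[lookup, on = .(id, x, values), values := "b"]

# Step 3: drop the helper column.
dt[, x := NULL]
print(dt)
```

Keys in lookup that match nothing in dt (for example a "c" run sitting before a "b") are simply ignored by the update join, which is why no extra filtering is needed.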

The result is:

dt
#     id values
#  1:  1      a
#  2:  1      c
#  3:  1      b
#  4:  1      b
#  5:  1      a
#  6:  2      c
#  7:  2      c
#  8:  2      b
#  9:  2      b
# 10:  2      c
# 11:  3      c
# 12:  3      b
# 13:  3      b
# 14:  3      b
# 15:  3      b

And here's a generalization to the case where you want to replace values "a", "x", or "y" followed by "b" with "b":

dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(values=c("a", "x", "y")), by = .(id, x=x-1)], 
   on = .(id, x, values), 
   values := "b"
   ][, x := NULL]
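
As a quick check of the generalization, the sketch below applies it to a small made-up table that actually contains "x" and "y" runs. The data, and the names dt2, targets and lookup2, are hypothetical and not taken from the question.

```r
library(data.table)

# Hypothetical data containing "x" and "y" runs (not from the question).
dt2 <- data.table(
  id     = rep(1:2, each = 5),
  values = c("x", "x", "b", "y", "c",
             "a", "c", "y", "y", "b")
)
targets <- c("a", "x", "y")  # values to overwrite when directly followed by "b"

dt2[, x := rleid(values), by = id]

# One candidate key per target value for the run directly before each "b".
lookup2 <- unique(dt2[values == "b", .(values = targets), by = .(id, x = x - 1L)])

dt2[lookup2, on = .(id, x, values), values := "b"]
dt2[, x := NULL]
print(dt2$values)
# "b" "b" "b" "y" "c" "a" "c" "b" "b" "b"
```

Only the run immediately before a "b" is changed: the "y" followed by "c" in id 1 and the "a" followed by "c" in id 2 are left alone. The same holds for a sequence such as "a", "x", "b", where only the "x" run would be replaced.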

Henrik · Answer

Late to the party and several nice run length alternatives were already provided ;) So here I try nafill instead.

(1) Create a variable 'v2' which is NA when 'values' are "a". (2) Fill missing values by next observation carried backward. (3) When the original 'values' are "a" and the corresponding filled values in 'v2' are "b", update 'values' with 'v2'.

# 1
dt[values != "a" , v2 := values]

# 2
dt[, v2 := v2[nafill(replace(seq_len(.N), is.na(v2), NA), type = "nocb")], by = id]

# 3
dt[values == "a" & v2 == "b", values := v2]

# clean-up
dt[ , v2 := NULL]

Currently, nafill only works with numeric variables, hence replace step in chunk # 2 (modified from @chinsoon12 in the issue nafill, setnafill for character, factor and other types).
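
The index trick in chunk # 2 may be easier to see on a plain vector. The following is a minimal sketch; nocb_chr is a hypothetical helper name, not a data.table function, and the input mimics 'v2' for the first id group of the example.

```r
library(data.table)

# Hypothetical helper (not part of data.table): next observation carried
# backward for a character vector, done by filling row positions instead.
nocb_chr <- function(v) {
  idx <- seq_along(v)
  idx[is.na(v)] <- NA            # integer positions, NA where v is missing
  v[nafill(idx, type = "nocb")]  # fill the positions, then index back into v
}

# 'v2' for id 1 of the example: values a, c, a, b, a with "a" set to NA.
v <- c(NA, "c", NA, "b", NA)
nocb_chr(v)
# "c" "c" "b" "b" NA
```

The trailing NA stays NA because there is no later observation to carry backward. In chunk # 3 that row is not updated, since data.table treats an NA condition in i as FALSE.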

The NA replacement code may be slightly shortened by using zoo::na.locf:

dt[, v2 := zoo::na.locf(v2, fromLast = TRUE, na.rm = FALSE), by = id]

However, note that na.locf is slower.

When comparing the answers on larger data (data.table(id = rep(1:1e4, each = 1e4, replace = TRUE), values = sample(c("a", "b", "c"), 1e8, replace = TRUE))), it turns out that this alternative is actually faster than the others.

sindri_baldur · Answer

This is not pretty but I think this is what you are after:

dt[, .N, by = .(id, values = paste0(values, rleid(values)))
   ][, values := sub("[0-9]+", "", values)
     ][, values := fifelse(values == "a" & shift(values, -1L) == "b" & !is.na(shift(values, -1L)), "b", values), by = id
       ][, .SD[rep(seq_len(.N), N)]
         ][, !"N"]

    id values
 1:  1      a
 2:  1      c
 3:  1      b
 4:  1      b
 5:  1      a
 6:  2      c
 7:  2      c
 8:  2      b
 9:  2      b
10:  2      c
11:  3      c
12:  3      b
13:  3      b
14:  3      b
15:  3      b

Darren Tsai · Answer

You can use rle().

Note: To avoid ambiguity, I rename the "values" column to "var" because the rle() function also produces a list containing a vector named "values".

dt[, new := with(rle(var), rep(ifelse(values == "a" & c(values[-1], "") == "b", "b", values), lengths)), by = id]
dt

#     id var new
#  1:  1   a   a
#  2:  1   c   c
#  3:  1   a   b
#  4:  1   b   b
#  5:  1   a   a
#  6:  2   c   c
#  7:  2   c   c
#  8:  2   b   b
#  9:  2   b   b
# 10:  2   c   c
# 11:  3   c   c
# 12:  3   a   b
# 13:  3   a   b
# 14:  3   a   b
# 15:  3   b   b
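
If renaming the column is inconvenient, the same rle() idea can be wrapped in a small function that reads the run values via r$values, which sidesteps the name clash. This is only a sketch: replace_runs and its from/to arguments are hypothetical names, and the %in% test is one possible way to cover the "a", "x", "y" case mentioned in the question.

```r
library(data.table)

# Hypothetical helper: overwrite every run of `from` values that is
# immediately followed by `to`.
replace_runs <- function(v, from = "a", to = "b") {
  r <- rle(v)
  nxt <- c(r$values[-1], NA)               # value of the following run
  hit <- r$values %in% from & nxt %in% to  # %in% treats NA as "no match"
  r$values[hit] <- to
  inverse.rle(r)                           # expand the runs back to full length
}

# The question's example data, typed out literally.
dt <- data.table(
  id     = rep(1:3, each = 5),
  values = c("a", "c", "a", "b", "a",
             "c", "c", "b", "b", "c",
             "c", "a", "a", "a", "b")
)

dt[, new := replace_runs(values, from = c("a", "x", "y")), by = id]
print(dt)
```

With several grouping columns, the call should stay the same apart from by, for example by = .(id, date) if the table had a date column (an assumed name).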

Tags: string, replace, r, data.table, sequence

Djpengo
