Feel free to edit this title to make it more understandable/generalizable...
I have a data.table object with three columns that form groups (id, id2, pol_loc). Within these groups are row observations, and each group will contain an asterisk at some row, or an NA. I'd like to efficiently make an indicator column, per group, marking each row's position relative to the asterisk (before = 1, after = 0). Here is what the data.table looks like:
id id2 pol_loc non_pol cluster_tag
1: 1 1 3 do NA
2: 1 1 3 you NA
3: 1 1 3 * NA
4: 1 1 3 it NA
-------------------------------------
5: 1 2 3 but 4
6: 1 2 3 i NA
7: 1 2 3 * NA
8: 1 2 3 really 2
9: 1 2 3 bad NA
-------------------------------------
10: 1 2 5 but 4
11: 1 2 5 i NA
12: 1 2 5 hate NA
13: 1 2 5 really 2
14: 1 2 5 * NA
15: 1 2 5 dogs NA
-------------------------------------
16: 2 1 4 i NA
17: 2 1 4 am NA
18: 2 1 4 the NA
19: 2 1 4 * NA
20: 2 1 4 friend NA
-------------------------------------
21: 3 1 4 do NA
22: 3 1 4 you NA
23: 3 1 4 really 2
24: 3 1 4 * NA
-------------------------------------
25: 3 2 NA NA NA
id id2 pol_loc non_pol cluster_tag
Here is the desired output:
id id2 pol_loc non_pol cluster_tag before
1: 1 1 3 do NA 1
2: 1 1 3 you NA 1
3: 1 1 3 * NA NA
4: 1 1 3 it NA 0
----------------------------------------------
5: 1 2 3 but 4 1
6: 1 2 3 i NA 1
7: 1 2 3 * NA NA
8: 1 2 3 really 2 0
9: 1 2 3 bad NA 0
----------------------------------------------
10: 1 2 5 but 4 1
11: 1 2 5 i NA 1
12: 1 2 5 hate NA 1
13: 1 2 5 really 2 1
14: 1 2 5 * NA NA
15: 1 2 5 dogs NA 0
----------------------------------------------
16: 2 1 4 i NA 1
17: 2 1 4 am NA 1
18: 2 1 4 the NA 1
19: 2 1 4 * NA NA
20: 2 1 4 friend NA 0
----------------------------------------------
21: 3 1 4 do NA 1
22: 3 1 4 you NA 1
23: 3 1 4 really 2 1
24: 3 1 4 * NA NA
----------------------------------------------
25: 3 2 NA NA NA NA
id id2 pol_loc non_pol cluster_tag before
MWE
dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
id2 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), pol_loc = c(3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, NA), non_pol = c("do", "you",
"*", "it", "but", "i", "*", "really", "bad", "but", "i",
"hate", "really", "*", "dogs", "i", "am", "the", "*", "friend",
"do", "you", "really", "*", NA), cluster_tag = c(NA, NA,
NA, NA, "4", NA, NA, "2", NA, "4", NA, NA, "2", NA, NA, NA,
NA, NA, NA, NA, NA, NA, "2", NA, NA)), row.names = c(NA,
-25L), class = "data.frame", .Names = c("id", "id2", "pol_loc",
"non_pol", "cluster_tag"))
library(data.table)
setDT(dat)
EDIT If it makes it easier or more efficient, the NAs can become 0 or 1. It makes no difference to me, and I'm guessing that's more efficient.
Try
dat[, before := 1 - cumsum(non_pol == "*"), by = .(id, id2, pol_loc)][non_pol == "*", before := NA]
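For what it's worth, here is the same idea unrolled into two steps with comments, plus a sketch of the 0/1 variant mentioned in the EDIT (this assumes, as in the MWE, that each group contains at most one asterisk):

library(data.table)
setDT(dat)

# cumsum(non_pol == "*") is 0 on rows before the asterisk and 1 from the
# asterisk onward, so 1 - cumsum(...) gives 1 = before, 0 = at/after.
dat[, before := 1 - cumsum(non_pol == "*"), by = .(id, id2, pol_loc)]

# Blank the asterisk rows themselves. The all-NA group (row 25) is already
# NA because cumsum(NA) is NA.
dat[non_pol == "*", before := NA]

# Variant for the EDIT: skip the second step and the asterisk row simply
# keeps 0, i.e. it is counted as "after".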