
data.table: Mark before/after occurrence of symbol within groups

Tags: r, data.table

Feel free to edit this title to make it more understandable/generalizable...

I have a data.table object with 3 columns that form groups (id, id2, pol_loc). Within each group are row observations, and each group contains either an asterisk at some row or an NA. I'd like to efficiently create an indicator column that marks each row's position relative to the asterisk within its group (1 = before, 0 = after, NA for the asterisk row itself). Here is what the data table looks like:

    id id2 pol_loc non_pol cluster_tag
 1:  1   1       3      do          NA
 2:  1   1       3     you          NA
 3:  1   1       3       *          NA
 4:  1   1       3      it          NA
 -------------------------------------
 5:  1   2       3     but           4
 6:  1   2       3       i          NA
 7:  1   2       3       *          NA
 8:  1   2       3  really           2
 9:  1   2       3     bad          NA
 -------------------------------------
10:  1   2       5     but           4
11:  1   2       5       i          NA
12:  1   2       5    hate          NA
13:  1   2       5  really           2
14:  1   2       5       *          NA
15:  1   2       5    dogs          NA
 -------------------------------------
16:  2   1       4       i          NA
17:  2   1       4      am          NA
18:  2   1       4     the          NA
19:  2   1       4       *          NA
20:  2   1       4  friend          NA
 -------------------------------------
21:  3   1       4      do          NA
22:  3   1       4     you          NA
23:  3   1       4  really           2
24:  3   1       4       *          NA
 -------------------------------------
25:  3   2      NA      NA          NA
    id id2 pol_loc non_pol cluster_tag

Here is the desired output:

    id id2 pol_loc non_pol cluster_tag   before
 1:  1   1       3      do          NA        1
 2:  1   1       3     you          NA        1
 3:  1   1       3       *          NA       NA
 4:  1   1       3      it          NA        0
 ----------------------------------------------
 5:  1   2       3     but           4        1
 6:  1   2       3       i          NA        1
 7:  1   2       3       *          NA       NA
 8:  1   2       3  really           2        0
 9:  1   2       3     bad          NA        0
 ----------------------------------------------
10:  1   2       5     but           4        1
11:  1   2       5       i          NA        1
12:  1   2       5    hate          NA        1
13:  1   2       5  really           2        1
14:  1   2       5       *          NA       NA
15:  1   2       5    dogs          NA        0
 ----------------------------------------------
16:  2   1       4       i          NA        1
17:  2   1       4      am          NA        1
18:  2   1       4     the          NA        1
19:  2   1       4       *          NA       NA
20:  2   1       4  friend          NA        0
 ----------------------------------------------
21:  3   1       4      do          NA        1
22:  3   1       4     you          NA        1
23:  3   1       4  really           2        1
24:  3   1       4       *          NA       NA
 ----------------------------------------------
25:  3   2      NA      NA          NA       NA
    id id2 pol_loc non_pol cluster_tag   before

MWE

dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), 
    id2 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), pol_loc = c(3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 4L, 
    4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, NA), non_pol = c("do", "you", 
    "*", "it", "but", "i", "*", "really", "bad", "but", "i", 
    "hate", "really", "*", "dogs", "i", "am", "the", "*", "friend", 
    "do", "you", "really", "*", NA), cluster_tag = c(NA, NA, 
    NA, NA, "4", NA, NA, "2", NA, "4", NA, NA, "2", NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, "2", NA, NA)), row.names = c(NA, 
-25L), class = "data.frame", .Names = c("id", "id2", "pol_loc", 
"non_pol", "cluster_tag"))

library(data.table)

setDT(dat)

EDIT: If it makes it easier or more efficient, the NAs can become 0 or 1. It makes no difference to me, and I'm guessing that's more efficient.

Tyler Rinker asked Aug 22 '15 15:08



1 Answer

Try

dat[, before := 1 - cumsum(non_pol == "*"), by = .(id, id2, pol_loc)][non_pol == "*", before := NA]
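To see why this works, here is a minimal sketch on a small made-up table (the names `toy`, `g` and `tok` are hypothetical stand-ins, not the asker's `dat`). It assumes each group has at most one "*". Within a group, `cumsum(tok == "*")` is 0 on every row before the marker and 1 from the marker onward, so subtracting it from 1 gives 1 before and 0 after.

```r
library(data.table)

# Toy data (hypothetical): two groups, one "*" marker in each
toy <- data.table(
  g   = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  tok = c("do", "you", "*", "it", "i", "*", "bad")
)

# cumsum() of the logical marker is 0 before "*" and 1 from "*" onward,
# so 1 - cumsum(...) is 1 before the marker and 0 at or after it
toy[, before := 1 - cumsum(tok == "*"), by = g]

# Blank out the marker row itself; NA_real_ matches the numeric column type
toy[tok == "*", before := NA_real_]

print(toy)
```

If the marker row may hold 0 instead of NA, as the EDIT in the question allows, the second step can simply be dropped. Note that a group with more than one "*" would produce negative values after the second marker, and a group whose only value is NA stays NA because `cumsum()` propagates it.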
Khashaa answered Sep 30 '22 15:09