Feel free to edit this title to make it more understandable/generalizable...
I have a data.table object with three columns that form groups (id, id2, pol_loc). Within these groups are row observations, and each group will contain an asterisk at some row, or an NA. I'd like to efficiently make an indicator column, per group, marking each row's position relative to the asterisk (before = 1, after = 0). Here is what the data.table looks like:
id id2 pol_loc non_pol cluster_tag
1: 1 1 3 do NA
2: 1 1 3 you NA
3: 1 1 3 * NA
4: 1 1 3 it NA
-------------------------------------
5: 1 2 3 but 4
6: 1 2 3 i NA
7: 1 2 3 * NA
8: 1 2 3 really 2
9: 1 2 3 bad NA
-------------------------------------
10: 1 2 5 but 4
11: 1 2 5 i NA
12: 1 2 5 hate NA
13: 1 2 5 really 2
14: 1 2 5 * NA
15: 1 2 5 dogs NA
-------------------------------------
16: 2 1 4 i NA
17: 2 1 4 am NA
18: 2 1 4 the NA
19: 2 1 4 * NA
20: 2 1 4 friend NA
-------------------------------------
21: 3 1 4 do NA
22: 3 1 4 you NA
23: 3 1 4 really 2
24: 3 1 4 * NA
-------------------------------------
25: 3 2 NA NA NA
id id2 pol_loc non_pol cluster_tag
Here is the desired output:
id id2 pol_loc non_pol cluster_tag before
1: 1 1 3 do NA 1
2: 1 1 3 you NA 1
3: 1 1 3 * NA NA
4: 1 1 3 it NA 0
----------------------------------------------
5: 1 2 3 but 4 1
6: 1 2 3 i NA 1
7: 1 2 3 * NA NA
8: 1 2 3 really 2 0
9: 1 2 3 bad NA 0
----------------------------------------------
10: 1 2 5 but 4 1
11: 1 2 5 i NA 1
12: 1 2 5 hate NA 1
13: 1 2 5 really 2 1
14: 1 2 5 * NA NA
15: 1 2 5 dogs NA 0
----------------------------------------------
16: 2 1 4 i NA 1
17: 2 1 4 am NA 1
18: 2 1 4 the NA 1
19: 2 1 4 * NA NA
20: 2 1 4 friend NA 0
----------------------------------------------
21: 3 1 4 do NA 1
22: 3 1 4 you NA 1
23: 3 1 4 really 2 1
24: 3 1 4 * NA NA
----------------------------------------------
25: 3 2 NA NA NA NA
id id2 pol_loc non_pol cluster_tag before
MWE
dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
id2 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), pol_loc = c(3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, NA), non_pol = c("do", "you",
"*", "it", "but", "i", "*", "really", "bad", "but", "i",
"hate", "really", "*", "dogs", "i", "am", "the", "*", "friend",
"do", "you", "really", "*", NA), cluster_tag = c(NA, NA,
NA, NA, "4", NA, NA, "2", NA, "4", NA, NA, "2", NA, NA, NA,
NA, NA, NA, NA, NA, NA, "2", NA, NA)), row.names = c(NA,
-25L), class = "data.frame", .Names = c("id", "id2", "pol_loc",
"non_pol", "cluster_tag"))
library(data.table)
setDT(dat)
EDIT If it makes it easier or more efficient, the NAs can become 0 or 1. It makes no difference to me, and I'm guessing that's more efficient.
Try
dat[, before := 1 - cumsum(non_pol == "*"), by = .(id, id2, pol_loc)][non_pol == "*", before := NA]
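For what it's worth, here is the same idea unrolled into two steps with comments, plus a sketch of the 0/1 variant mentioned in the EDIT (this assumes, as in the MWE, that each group contains at most one asterisk):

library(data.table)
setDT(dat)

# cumsum(non_pol == "*") is 0 on rows before the asterisk and 1 from the
# asterisk onward, so 1 - cumsum(...) gives 1 = before, 0 = at/after.
dat[, before := 1 - cumsum(non_pol == "*"), by = .(id, id2, pol_loc)]

# Blank the asterisk rows themselves. The all-NA group (row 25) is already
# NA because cumsum(NA) is NA.
dat[non_pol == "*", before := NA]

# Variant for the EDIT: skip the second step and the asterisk row simply
# keeps 0, i.e. it is counted as "after".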