Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: How to generate a column with row values based on the nearest N row's values

I'm looking for a way to code a column based information in the previous N rows to a given row. The dataset is sorted.

In short, I want to create a column called oneweeksince that returns TRUE if the victims column is greater than 0 (or !NA) for seven rows after.

Put another way, if, for row[i], row[i]$victims > 0 within any row from row[i - 7] to row[i], then row[i]$oneweeksince should be TRUE. The oneweeksince value should also be TRUE on rows in which victims > 0 or !is.na(victims)

How can I automate the creation of this column/feature? It is also possible to use the date column to compute the date distance. I'm trying to avoid creating a loop because of the slow perf in R.

The dataset should look like this:

      date           oneweeksince victims
1    2009-01-01         FALSE      NA
2    2009-01-02         FALSE      NA
3    2009-01-03         FALSE      NA
4    2009-01-04         FALSE      NA
5    2009-01-05         FALSE      NA
6    2009-01-06         FALSE      NA
7    2009-01-07         FALSE      NA
8    2009-01-08          TRUE       1
9    2009-01-09          TRUE      NA
10   2009-01-10          TRUE      NA
11   2009-01-11          TRUE      NA
12   2009-01-12          TRUE      NA
13   2009-01-13          TRUE      NA
14   2009-01-14          TRUE      NA
15   2009-01-15          TRUE      NA
16   2009-01-16         FALSE      NA
17   2009-01-17         FALSE      NA
18   2009-01-18         FALSE      NA
19   2009-01-19         FALSE      NA
20   2009-01-20         FALSE      NA

The dataset is many years long so I need an efficient way to do it.

like image 535
Union find Avatar asked Mar 04 '23 07:03

Union find


1 Answers

Solution from @G.Grothendieck

After some discussions, this is the most effective and efficient answer.

library(dplyr)
library(zoo)

dat2 <- dat %>%
  mutate(roll = rollapplyr(victims > 0, 8, any, na.rm = TRUE, fill = NA, partial = TRUE)) %>%
  mutate(oneweeksince = roll > 0) %>%
  select(-roll)

Solution from my previous attempt

A solution using rollapplyr from the zoo package. rollapplyr can apply a function with a rolling window. In this case, we can specify the rolling window to be 8 and apply the mean function. Notice that rollmean function is not suitable in this case becuas we cannot specify na.rm = TRUE in the rollmean function. The last step is to simply evaluate if the roll column is larger than 1.

library(dplyr)
library(zoo)

dat2 <- dat %>%
  mutate(roll = rollapplyr(victims, width = 8, FUN = function(x) mean(x, na.rm = TRUE), fill = NA)) %>%
  mutate(oneweeksince = roll > 0) %>%
  select(-roll)
# dat2
#          date victims oneweeksince
# 1  2009-01-01      NA           NA
# 2  2009-01-02      NA           NA
# 3  2009-01-03      NA           NA
# 4  2009-01-04      NA           NA
# 5  2009-01-05      NA           NA
# 6  2009-01-06      NA           NA
# 7  2009-01-07      NA           NA
# 8  2009-01-08       1         TRUE
# 9  2009-01-09      NA         TRUE
# 10 2009-01-10      NA         TRUE
# 11 2009-01-11      NA         TRUE
# 12 2009-01-12      NA         TRUE
# 13 2009-01-13      NA         TRUE
# 14 2009-01-14      NA         TRUE
# 15 2009-01-15      NA         TRUE
# 16 2009-01-16      NA           NA
# 17 2009-01-17      NA           NA
# 18 2009-01-18      NA           NA
# 19 2009-01-19      NA           NA

DATA

dat <- read.table(text = "      date           oneweeksince victims
1    '2009-01-01'         FALSE      NA
                  2    '2009-01-02'         FALSE      NA
                  3    '2009-01-03'         FALSE      NA
                  4    '2009-01-04'         FALSE      NA
                  5    '2009-01-05'         FALSE      NA
                  6    '2009-01-06'         FALSE      NA
                  7    '2009-01-07'         FALSE      NA
                  8    '2009-01-08'          TRUE       1
                  9    '2009-01-09'          TRUE      NA
                  10   '2009-01-10'          TRUE      NA
                  11   '2009-01-11'          TRUE      NA
                  12   '2009-01-12'          TRUE      NA
                  13   '2009-01-13'          TRUE      NA
                  14   '2009-01-14'          TRUE      NA
                  15   '2009-01-15'          TRUE      NA
                  16   '2009-01-16'         FALSE      NA
                  17   '2009-01-17'         FALSE      NA
                  18   '2009-01-18'         FALSE      NA
                  19   '2009-01-19'         FALSE      NA
                  20   '2009-01-20'         FALSE      NA",
                  header = TRUE, stringsAsFactors = FALSE)

dat$oneweeksince <- NULL

My second attempt

The OP pointed out that my solution will not work if there are entries in the first N rows where N is the window width. Here I provided a solution to address that. I am going to use the same example data frame except that I change the second row of victims to be 1. The new solution needs functions from purrr and tidyr, so I load the tidyverse package for this.

library(tidyverse)
library(zoo)

dat2 <- dat %>%
  mutate(roll = rollapplyr(victims, width = 8, FUN = function(x) mean(x, na.rm = TRUE), fill = NA)) %>%
  # Split the data frame for the first width - 1 rows and others
  mutate(GroupID = ifelse(row_number() <= 7, 1L, 2L)) %>%
  split(.$GroupID) %>%
  # Check if the GroupID is 1. If yes, change the roll column to be the same as victims
  # After that, use fill to fill in NA
  map_if(function(x) unique(x$GroupID) == 1L, 
         ~.x %>% mutate(roll = victims) %>% fill(roll)) %>%
  # Combine data frames
  bind_rows() %>%
  mutate(oneweeksince = roll > 0) %>%
  select(-roll)
# dat2
# date victims GroupID oneweeksince
# 1  2009-01-01      NA       1           NA
# 2  2009-01-02       1       1         TRUE
# 3  2009-01-03      NA       1         TRUE
# 4  2009-01-04      NA       1         TRUE
# 5  2009-01-05      NA       1         TRUE
# 6  2009-01-06      NA       1         TRUE
# 7  2009-01-07      NA       1         TRUE
# 8  2009-01-08       1       2         TRUE
# 9  2009-01-09      NA       2         TRUE
# 10 2009-01-10      NA       2         TRUE
# 11 2009-01-11      NA       2         TRUE
# 12 2009-01-12      NA       2         TRUE
# 13 2009-01-13      NA       2         TRUE
# 14 2009-01-14      NA       2         TRUE
# 15 2009-01-15      NA       2         TRUE
# 16 2009-01-16      NA       2           NA
# 17 2009-01-17      NA       2           NA
# 18 2009-01-18      NA       2           NA
# 19 2009-01-19      NA       2           NA
# 20 2009-01-20      NA       2           NA

DATA

dat <- read.table(text = "      date           oneweeksince victims
1    '2009-01-01'         FALSE      NA
                  2    '2009-01-02'         FALSE       1
                  3    '2009-01-03'         FALSE      NA
                  4    '2009-01-04'         FALSE      NA
                  5    '2009-01-05'         FALSE      NA
                  6    '2009-01-06'         FALSE      NA
                  7    '2009-01-07'         FALSE      NA
                  8    '2009-01-08'          TRUE       1
                  9    '2009-01-09'          TRUE      NA
                  10   '2009-01-10'          TRUE      NA
                  11   '2009-01-11'          TRUE      NA
                  12   '2009-01-12'          TRUE      NA
                  13   '2009-01-13'          TRUE      NA
                  14   '2009-01-14'          TRUE      NA
                  15   '2009-01-15'          TRUE      NA
                  16   '2009-01-16'         FALSE      NA
                  17   '2009-01-17'         FALSE      NA
                  18   '2009-01-18'         FALSE      NA
                  19   '2009-01-19'         FALSE      NA
                  20   '2009-01-20'         FALSE      NA",
                  header = TRUE, stringsAsFactors = FALSE)

dat$oneweeksince <- NULL
like image 156
5 revs Avatar answered Mar 07 '23 00:03

5 revs