Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

update numeric variables for each row of a group as the minimum of a subgroup

Tags:

r

data.table

I have a list of data grouped by 3 different variables, and one measurement variable as shown below. ID1 = factor, ID2 = int, ID3 = number, varX = number

ID1 ID2 ID3  varX
A   1   0.1  40.0
A   1   0.8  70.5
A   2   0.7  55.0
A   2   0.8  65.0
A   2   1.0  60.0
B   4   0.2  70.0
B   5   0.6  55.7
C   1   0.1  55.0
C   1   0.3  90.0
C   1   0.9  60.0
C   5   0.8  45.5
C   5   0.9  30.0

I want to update each value of varX to be the minimum value grouped by ID1 and ID2, but also subgrouped by ID3 where only rows with an ID3 value greater than or equal to the current row to be updated are considered when determining the minimum.

For example: for ID1 = A, ID2 = 2, ID3 = 0.7, varX would be the minimum of 55.0, 65.0 and 60.0. Whereas for ID1 = A, ID2 = 2, ID3 = 0.8, varX would be the minimum of 65.0 and 60.0.

The resulting table would look like:

ID1 ID2 ID3  varX
A   1   0.1  40.0
A   1   0.8  70.5
A   2   0.7  55.0
A   2   0.8  60.0
A   2   1.0  60.0
B   4   0.2  70.0
B   5   0.6  55.7
C   1   0.1  55.0
C   1   0.3  60.0
C   1   0.9  60.0
C   5   0.8  30.0
C   5   0.9  30.0

I have 36,000 rows of data in this format, so performance is relatively important

like image 769
Michael 96 Avatar asked Nov 07 '22 21:11

Michael 96


1 Answers

Here's a more verbose dplyr approach that might be fast enough (1 sec to process 1 million rows in your format).

library(dplyr)
df2 <- df %>%
  tibble::rowid_to_column() %>%   # to use later to put back in original order
  group_by(ID1, ID2) %>%
  arrange(-ID3) %>%   # starting with the largest ID3 within each group and working down...
  mutate(varX2 = cummin(varX)) %>%   # what's the min varX encountered so far?
  ungroup() %>%
  arrange(rowid)   # put back in original order

Here's the fake data I tested with:

n = 1000000
df <- data_frame(
  ID1 = sample(LETTERS[1:26], size = n, replace = T),
  ID2 = sample(1:100, size = n, replace = T),
  ID3 = sample(0.1*1:10, size = n, replace = T),
  varX = rnorm(n, 50, 30))
like image 134
Jon Spring Avatar answered Nov 14 '22 23:11

Jon Spring