Get values from a column where a threshold is crossed for the first time for each group in R

Tags:

r

I have a list of transactions for a large number of people. I want to find out when each person first crossed a particular threshold of total transaction value.

Here is an example of what I have already done. Example dataset:

library(data.table)  # setorderv() comes from data.table

df <- data.frame(name = rep(c("a", "b"), 4),
                 dates = seq(as.Date("2017-01-01"), by = "month", length.out = 8),
                 amt = 11:18)
setorderv(df, "name")  # sort the rows by name, in place

This gives me the following data frame

  name      dates amt
1    a 2017-01-01  11
3    a 2017-03-01  13
5    a 2017-05-01  15
7    a 2017-07-01  17
2    b 2017-02-01  12
4    b 2017-04-01  14
6    b 2017-06-01  16
8    b 2017-08-01  18

Then I wrote the following code to compute the cumulative sum for each person:

df$cumsum <- ave(df$amt, df$name, FUN = cumsum)  # running total within each name

This gives me the following data frame:

  name      dates amt cumsum
1    a 2017-01-01  11     11
3    a 2017-03-01  13     24
5    a 2017-05-01  15     39
7    a 2017-07-01  17     56
2    b 2017-02-01  12     12
4    b 2017-04-01  14     26
6    b 2017-06-01  16     42
8    b 2017-08-01  18     60

Now I want to know when each person crossed 20 and 40. I wrote the following code to find this out:

names <- unique(df$name)

# pre-allocate the result: one row per person
result_df <- data.frame(name = rep(NA_character_, length(names)),
                        crossed20 = rep(NA_character_, length(names)),
                        crossed40 = rep(NA_character_, length(names)),
                        stringsAsFactors = FALSE)

for (i in seq_along(names)) {
    # index of the first row where the running total reaches each threshold
    x1 <- Position(function(x) x >= 20, df$cumsum[df$name == names[i]])
    x2 <- Position(function(x) x >= 40, df$cumsum[df$name == names[i]])

    result_df[i, ] <- c(as.character(names[i]),
                        as.character(df[df$name == names[i], 2][x1]),
                        as.character(df[df$name == names[i], 2][x2]))
}

This code finds, for each person, the first row where each threshold is crossed and stores that row index in a variable. It then extracts the date from the second column at that index and stores it in another data frame.

The problem is, this code is really slow. I have over 200,000 people in my dataset and over 10 million rows. The code takes about 25 seconds to execute for the first 50 users, which extrapolates to 200,000 / 50 × 25 s ≈ 100,000 seconds, or about 30 hours for the entire dataset.

Is there a faster way to do this?

asked May 25 '18 by gouravkr

1 Answer

With dplyr you could group by person, filter the rows where cumsum exceeds 20 (or 40), and then use slice(1) to select the first such row per person. This should be much faster than the for loop.

library(dplyr)

df <- read.table(text = '
name      dates amt cumsum
a 2017-01-01  11     11
a 2017-03-01  13     24
a 2017-05-01  15     39
a 2017-07-01  17     56
b 2017-02-01  12     12
b 2017-04-01  14     26
b 2017-06-01  16     42
b 2017-08-01  18     60', header = TRUE)

df %>% 
  group_by(name) %>% 
  filter(cumsum > 20) %>% 
  slice(1)

  name     dates        amt cumsum
  <fctr>   <fctr>     <int>  <int>
1 a        2017-03-01    13     24
2 b        2017-04-01    14     26

df %>% 
  group_by(name) %>% 
  filter(cumsum > 40) %>% 
  slice(1)

  name     dates        amt cumsum
  <fctr>   <fctr>     <int>  <int>
1 a        2017-07-01    17     56
2 b        2017-06-01    16     42

Of course, you could subsequently rbind these data frames and arrange by person (a sketch of that step is below). Does this help?
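A minimal sketch of that combining step, assuming dplyr is loaded and df is the cumulative-sum data frame from above (the helper first_crossing and the threshold column are illustrative names, not part of the original answer):

# Hypothetical helper: first row per person whose running total reaches thr
first_crossing <- function(dat, thr) {
  dat %>%
    group_by(name) %>%
    filter(cumsum >= thr) %>%    # keep rows at or past the threshold
    slice(1) %>%                 # first such row per person
    mutate(threshold = thr) %>%  # record which threshold this row answers
    ungroup()
}

# one row per person per threshold, stacked and sorted by person
result <- bind_rows(lapply(c(20, 40), function(t) first_crossing(df, t))) %>%
  arrange(name, threshold)

Scanning each group once like this also avoids the repeated df$name == names[i] comparisons over the full 10-million-row vector, which is the main cost of the original loop.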

answered Nov 08 '22 by Lennyy