I have a list of transactions for many people, and I want to find out when each person's running total of transactions crossed a given threshold.
Here is what I have done so far. Example dataset:
df <- data.frame(name = rep(c("a","b"),4),
dates = seq(as.Date("2017-01-01"), by = "month", length.out = 8), amt = 11:18)
df <- df[order(df$name), ]  # base R; setorderv(df, "name") also works if data.table is loaded
This gives me the following data frame
name dates amt
1 a 2017-01-01 11
3 a 2017-03-01 13
5 a 2017-05-01 15
7 a 2017-07-01 17
2 b 2017-02-01 12
4 b 2017-04-01 14
6 b 2017-06-01 16
8 b 2017-08-01 18
Then I wrote the following code to find the cumulative sum per person:
df$cumsum <- ave(df$amt, df$name, FUN = cumsum)
This gives me the following data frame:
name dates amt cumsum
1 a 2017-01-01 11 11
3 a 2017-03-01 13 24
5 a 2017-05-01 15 39
7 a 2017-07-01 17 56
2 b 2017-02-01 12 12
4 b 2017-04-01 14 26
6 b 2017-06-01 16 42
8 b 2017-08-01 18 60
Now I want to know when each person crossed 20 and 40. I wrote the following code to find this out:
names <- unique(df$name)
result_df <- data.frame(name = names, date20 = as.Date(NA), date40 = as.Date(NA))
for (i in seq_along(names)) {
  x1 <- Position(function(x) x >= 20, df$cumsum[df$name == names[i]])
  x2 <- Position(function(x) x >= 40, df$cumsum[df$name == names[i]])
  result_df[i, ] <- c(names[i],
                      df[df$name == names[i], 2][x1],
                      df[df$name == names[i], 2][x2])
}
This code finds the row where each threshold was first crossed and stores that row index in a variable, then extracts the date from the second column of that row and stores it in another data frame.
The problem is, this code is really slow. I have over 200,000 people in my data set and over 10 million rows. This code takes about 25 seconds to execute for the first 50 users, which means it is likely to take about 30 hours for the entire dataset.
Is there a faster way to do this?
With dplyr you could group by person, filter the rows where cumsum exceeds 20 (or 40), and then use slice(1) to keep the first qualifying row per person. This should be much faster than the explicit loop.
df <- read.table(text = '
name dates amt cumsum
a 2017-01-01 11 11
a 2017-03-01 13 24
a 2017-05-01 15 39
a 2017-07-01 17 56
b 2017-02-01 12 12
b 2017-04-01 14 26
b 2017-06-01 16 42
b 2017-08-01 18 60', header = TRUE)
library(dplyr)

df %>%
  group_by(name) %>%
  filter(cumsum > 20) %>%
  slice(1)
name dates amt cumsum
<fctr> <fctr> <int> <int>
1 a 2017-03-01 13 24
2 b 2017-04-01 14 26
df %>%
  group_by(name) %>%
  filter(cumsum > 40) %>%
  slice(1)
name dates amt cumsum
<fctr> <fctr> <int> <int>
1 a 2017-07-01 17 56
2 b 2017-06-01 16 42
Of course you could subsequently rbind these data frames and arrange by person. Does this help?
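Alternatively, here is a sketch of how you might get both thresholds in a single grouped pass with summarise, avoiding the rbind step entirely. The column names crossed20/crossed40 are just illustrative; inside summarise, cumsum refers to your column, not the base function:

```r
library(dplyr)

df <- read.table(text = '
name dates amt cumsum
a 2017-01-01 11 11
a 2017-03-01 13 24
a 2017-05-01 15 39
a 2017-07-01 17 56
b 2017-02-01 12 12
b 2017-04-01 14 26
b 2017-06-01 16 42
b 2017-08-01 18 60', header = TRUE)

result <- df %>%
  group_by(name) %>%
  summarise(crossed20 = dates[which(cumsum > 20)[1]],   # date of first row past 20
            crossed40 = dates[which(cumsum > 40)[1]])   # date of first row past 40
result
```

Note that if a person never crosses a threshold, which(...)[1] is NA and the corresponding date comes back as NA rather than dropping the person, which the two filter/slice pipelines would do.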