Trying to convert the following R data.frame:
structure(list( Time=c("09:30:01" ,"09:30:29" ,"09:35:56", "09:37:17" ,"09:37:21" ,"09:37:28" ,"09:37:35" ,"09:37:51" ,"09:42:11" ,"10:00:31"),
Price=c(1,2,3,4,5,6,7,8,9,10),
Volume=c(100,200,300,100,200,300,100,200,600,100)),
.Names = c("Time", "Price", "Volume"),
row.names = c(NA,10L),
class = "data.frame")
Time Price Volume
1 09:30:01 1 100
2 09:30:29 2 200
3 09:35:56 3 300
4 09:37:17 4 100
5 09:37:21 5 200
6 09:37:28 6 300
7 09:37:35 7 100
8 09:37:51 8 200
9 09:42:11 9 600
10 10:00:31 10 100
into this
Time Price Volume Bin
1 09:30:01 1 100 1
2 09:30:29 2 200 1
3 09:35:56 3 200 1
4 09:35:56 3 100 2
5 09:37:17 4 100 2
6 09:37:21 5 200 2
7 09:37:28 6 100 2
8 09:37:28 6 200 3
9 09:37:35 7 100 3
10 09:37:51 8 200 3
11 09:42:11 9 500 4
12 09:42:11 9 100 5
13 10:00:31 10 100 5
Essentially, it is calculating cumulative sums on volume and binning each time 500 is breached. So, bin 1 is 100+200+200, with the volume at 09:35:56 split into 200/100, a new row inserted, and the bin counter incremented.
This is relatively straightforward with base R, but I was wondering whether there is a more elegant and hopefully faster way with dplyr.
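For concreteness, a possible base R sketch of that splitting logic (not necessarily the fastest, and assuming the data frame above is stored as df with a threshold of 500):
bin_split <- function(df, threshold = 500) {
  cs <- cumsum(df$Volume)
  pieces <- lapply(seq_len(nrow(df)), function(i) {
    lo <- if (i == 1) 0 else cs[i - 1]                    # cumulative volume before this row
    hi <- cs[i]                                           # cumulative volume after this row
    bins   <- (lo %/% threshold):((hi - 1) %/% threshold) # 0-based bins this row touches
    starts <- pmax(bins * threshold, lo)
    ends   <- pmin((bins + 1) * threshold, hi)
    data.frame(Time = df$Time[i], Price = df$Price[i],
               Volume = ends - starts, Bin = bins + 1)
  })
  do.call(rbind, pieces)
}
bin_split(df)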
Cheers
Update:
Thank you @Frank and @AntoniosK.
To address your question, the range of volume values is all positive integer values from 1 to 10k.
I microbenchmarked both approaches and dplyr was slightly faster but not much in it, on a dataset similar to above with ~200k rows.
Really appreciate the swift responses and assistance
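A sketch of how such a comparison can be set up with microbenchmark (hypothetical wrapper functions dplyr_bin() and dt_bin() standing in for the two answers below; not the exact code I ran):
library(microbenchmark)
# 'trades' is a ~200k-row data.frame shaped like the example above
microbenchmark(
  dplyr      = dplyr_bin(trades),
  data.table = dt_bin(trades),
  times = 10L
)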
Not sure if this is the best or fastest way, but it seems fast for those Volume values. The philosophy is simple: based on each row's Volume value, create that many rows with the same Time and Price and with Volume = 1. Then let cumsum add the numbers up and flag every time a new 500 batch starts. Use those flags to create your Bin values.
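A toy illustration of the flagging idea, with a threshold of 5 instead of 500 (just to show the mechanics, not part of the solution):
v    <- rep(1, 12)
cs   <- cumsum(v)                                 # 1 2 3 ... 12
flag <- ifelse(cs %in% seq(6, 12, by = 5), 1, 0)  # 1s at cs == 6 and cs == 11
bin  <- cumsum(flag) + 1                          # 1 1 1 1 1 2 2 2 2 2 3 3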
structure(list( Time=c("09:30:01" ,"09:30:29" ,"09:35:56", "09:37:17" ,"09:37:21" ,"09:37:28" ,"09:37:35" ,"09:37:51" ,"09:42:11" ,"10:00:31"),
Price=c(1,2,3,4,5,6,7,8,9,10),
Volume=c(100,200,300,100,200,300,100,200,600,100)),
.Names = c("Time", "Price", "Volume"),
row.names = c(NA,10L),
class = "data.frame") -> dt
library(dplyr)
dt %>%
  group_by(Time, Price) %>%                      ## for each Time and Price
  do(data.frame(Volume = rep(1, .$Volume))) %>%  ## create as many rows, with Volume = 1, as the value of Volume
  ungroup() %>%                                  ## forget about the grouping
  mutate(CumSum = cumsum(Volume),                ## cumulative sums
         flag_500 = ifelse(CumSum %in% seq(501, sum(dt$Volume), by = 500), 1, 0),  ## flag 500 batches (at 501, 1001, etc.)
         Bin = cumsum(flag_500) + 1) %>%         ## create Bin values
  group_by(Bin, Time, Price) %>%                 ## for each Bin, Time and Price
  summarise(Volume = sum(Volume)) %>%            ## get new Volume values
  select(Time, Price, Volume, Bin) %>%           ## use only if you want to re-arrange column order
  ungroup()                                      ## use if you want to forget the grouping
# Time Price Volume Bin
# (chr) (dbl) (dbl) (dbl)
# 1 09:30:01 1 100 1
# 2 09:30:29 2 200 1
# 3 09:35:56 3 200 1
# 4 09:35:56 3 100 2
# 5 09:37:17 4 100 2
# 6 09:37:21 5 200 2
# 7 09:37:28 6 100 2
# 8 09:37:28 6 200 3
# 9 09:37:35 7 100 3
# 10 09:37:51 8 200 3
# 11 09:42:11 9 500 4
# 12 09:42:11 9 100 5
# 13 10:00:31 10 100 5
This is hardly "straightforward."
With data.table, it's still many lines of code:
library(data.table)
setDT(DF)
DF[, c("cV","cVL") := shift(cumsum(Volume), 0:1, type="lag", fill=0) ]
DF[, end := ( cV %/% 500 ) - ( cV %% 500 == 0 ) ]
DF[, start := shift(end, type = "lag", fill = -1) + ( cVL %% 500 == 0 ) ]
badcols = c("Volume","cV","cVL","start","end")
DF[,{
V =
if (start==end) Volume
else c((start+1)*500-cVL, rep(500, max(end-start-2,0)), cV - end*500)
c(.SD[, !badcols, with=FALSE], list(Volume = V, Bin = 1+start:end))
}, by=.(r=seq(nrow(DF)))][,!"r",with=FALSE]
which gives
Time Price Volume Bin
1: 09:30:01 1 100 1
2: 09:30:29 2 200 1
3: 09:35:56 3 200 1
4: 09:35:56 3 100 2
5: 09:37:17 4 100 2
6: 09:37:21 5 200 2
7: 09:37:28 6 100 2
8: 09:37:28 6 200 3
9: 09:37:35 7 100 3
10: 09:37:51 8 200 3
11: 09:42:11 9 500 4
12: 09:42:11 9 100 5
13: 10:00:31 10 100 5
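As an aside, the shift() call above with n = 0:1 returns a list holding the vector itself and its one-step lag, which := assigns to the two columns at once. A minimal illustration on toy data:
library(data.table)
toy <- data.table(Volume = c(100, 200, 300))
toy[, c("cV", "cVL") := shift(cumsum(Volume), 0:1, type = "lag", fill = 0)]
toy
#    Volume  cV cVL
# 1:    100 100   0
# 2:    200 300 100
# 3:    300 600 300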
Here's one way using data.table and its rolling joins feature:
require(data.table) # v1.9.6+
setDT(df)[, csum := cumsum(Volume)]
ans = rbind(df, df[.(csum = 500 * seq_len(max(csum) %/% 500L)), roll = -Inf, on = "csum"])
setorder(ans, Price, csum)
ans = ans[, `:=`(Volume = c(csum[1L], diff(csum)),
                 id     = (csum - 1L) %/% 500L + 1L,
                 csum   = NULL)][Volume > 0L]
The first step adds a new column with the cumulative sum of Volume.
The second step is perhaps the most important. Let's look at its second part: for each multiple of 500 up to max(csum), it joins to the first value of df$csum that is >= that multiple. It's a rolling NOCB join (next observation carried backward). With that we'd get:
# Time Price Volume csum
# 1: 09:35:56 3 300 500
# 2: 09:37:28 6 300 1000
# 3: 09:37:51 8 200 1500
# 4: 09:42:11 9 600 2000
These are the break points that need to be added to your original data.table. That's what we do with rbind().
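If the roll = -Inf behaviour is unfamiliar, here is a toy illustration (made-up values, not part of the solution). Each lookup value without an exact match is matched to the next row with a larger value, i.e. NOCB:
library(data.table)
x <- data.table(v = c(3, 7, 12), label = c("a", "b", "c"))
x[.(v = c(5, 10)), on = "v", roll = -Inf]
#     v label
# 1:  5     b    # 5 rolled forward to the row with v = 7
# 2: 10     c    # 10 rolled forward to the row with v = 12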
Then, all we do is order by Price and csum, and regenerate the Volume column from the differences of consecutive csum values. From there, the id column can be generated from csum as shown.
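For example, c(csum[1L], diff(csum)) turns ordered cumulative sums back into per-row volumes, and the same csum values yield id directly:
csum <- c(100, 300, 500, 600)   # ordered cumulative sums after the rbind
c(csum[1L], diff(csum))         # per-row volumes: 100 200 200 100
(csum - 1L) %/% 500L + 1L       # id values:       1   1   1   2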