Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dplyr conditional windowing

Tags:

r

dplyr

Trying to convert the following R data.frame:

    structure(list( Time=c("09:30:01"  ,"09:30:29"  ,"09:35:56",  "09:37:17"  ,"09:37:21"  ,"09:37:28"  ,"09:37:35"  ,"09:37:51"  ,"09:42:11"  ,"10:00:31"),
            Price=c(1,2,3,4,5,6,7,8,9,10),
            Volume=c(100,200,300,100,200,300,100,200,600,100)),
      .Names = c("Time", "Price", "Volume"),
      row.names = c(NA,10L),
      class = "data.frame")

           Time Price Volume
    1  09:30:01     1    100
    2  09:30:29     2    200
    3  09:35:56     3    300
    4  09:37:17     4    100
    5  09:37:21     5    200
    6  09:37:28     6    300
    7  09:37:35     7    100
    8  09:37:51     8    200
    9  09:42:11     9    600
    10 10:00:31    10    100

into this

       Time Price  Volume Bin
1  09:30:01     1     100   1
2  09:30:29     2     200   1
3  09:35:56     3     200   1
4  09:35:56     3     100   2
5  09:37:17     4     100   2
6  09:37:21     5     200   2
7  09:37:28     6     100   2
8  09:37:28     6     200   3
9  09:37:35     7     100   3
10 09:37:51     8     200   3
11 09:42:11     9     500   4
12 09:42:11     9     100   5
13 10:00:31    10     100   5

Essentially, it is calculating cumulative sums on volume and binning eachtime 500 is breached. So, bin 1 is 100+200+200 with the volume at 09:35:56 split into 200/100 and a new row inserted and the bin counter incremented.

This is relatively straightforward with base R but I was wondering is there a more elegant and hopefully faster way with dplyr.

Cheers

Update:

Thank you @Frank and @AntoniosK.

To address your question the range of volume values is all positive integers values from 1 to 10k.

I microbenchmarked both approaches and dplyr was slightly faster but not much in it, on a dataset similar to above with ~200k rows.

Really appreciate the swift responses and assistance

like image 936
user5199932 Avatar asked Nov 04 '15 15:11

user5199932


3 Answers

Not sure if this is the best or fastest way, but seems fast for those Volume values. The philosophy is simple. Based on the Volume value you have create that many rows of Time and Price with Volume = 1. Then let cumsum add the numbers and flag every time you have a new 500 batch. Use those flags to create your Bin values.

structure(list( Time=c("09:30:01"  ,"09:30:29"  ,"09:35:56",  "09:37:17"  ,"09:37:21"  ,"09:37:28"  ,"09:37:35"  ,"09:37:51"  ,"09:42:11"  ,"10:00:31"),
                Price=c(1,2,3,4,5,6,7,8,9,10),
                Volume=c(100,200,300,100,200,300,100,200,600,100)),
          .Names = c("Time", "Price", "Volume"),
          row.names = c(NA,10L),
          class = "data.frame") -> dt

library(dplyr)

dt %>%
  group_by(Time, Price) %>%                     ## for each Time and Price
  do(data.frame(Volume = rep(1,.$Volume))) %>%  ## create as many rows, with Volume = 1, as the value of Volume
  ungroup() %>%                                 ## forget about the grouping
  mutate(CumSum = cumsum(Volume),               ## cumulative sums 
         flag_500 = ifelse(CumSum %in% seq(501,sum(dt$Volume),by=500),1,0),  ## flag 500 batches (at 501, 1001, etc.)
         Bin = cumsum(flag_500)+1) %>%          ## create Bin values 
  group_by(Bin, Time, Price) %>%                ## for each Bin, Time and Price
  summarise(Volume = sum(Volume)) %>%           ## get new Volume values
  select(Time, Price, Volume, Bin) %>%          ## use only if you want to re-arrange column order
  ungroup()                                     ## use if you want to forget the grouping

#        Time Price Volume   Bin
#       (chr) (dbl)  (dbl) (dbl)
# 1  09:30:01     1    100     1
# 2  09:30:29     2    200     1
# 3  09:35:56     3    200     1
# 4  09:35:56     3    100     2
# 5  09:37:17     4    100     2
# 6  09:37:21     5    200     2
# 7  09:37:28     6    100     2
# 8  09:37:28     6    200     3
# 9  09:37:35     7    100     3
# 10 09:37:51     8    200     3
# 11 09:42:11     9    500     4
# 12 09:42:11     9    100     5
# 13 10:00:31    10    100     5
like image 155
AntoniosK Avatar answered Nov 11 '22 04:11

AntoniosK


This is hardly "straightforward."

With data.table, it's still many lines of code:

library(data.table)
setDT(DF)
DF[, c("cV","cVL") := shift(cumsum(Volume), 0:1, type="lag", fill=0) ]
DF[, end   := ( cV %/% 500 ) - ( cV %% 500 == 0 ) ]
DF[, start := shift(end, type = "lag", fill = -1) + ( cVL %% 500 == 0 ) ]

badcols = c("Volume","cV","cVL","start","end")
DF[,{
  V = 
    if (start==end) Volume
    else c((start+1)*500-cVL, rep(500, max(end-start-2,0)), cV - end*500) 

  c(.SD[, !badcols, with=FALSE], list(Volume = V, Bin = 1+start:end))
}, by=.(r=seq(nrow(DF)))][,!"r",with=FALSE]

which gives

        Time Price Volume Bin
 1: 09:30:01     1    100   1
 2: 09:30:29     2    200   1
 3: 09:35:56     3    200   1
 4: 09:35:56     3    100   2
 5: 09:37:17     4    100   2
 6: 09:37:21     5    200   2
 7: 09:37:28     6    100   2
 8: 09:37:28     6    200   3
 9: 09:37:35     7    100   3
10: 09:37:51     8    200   3
11: 09:42:11     9    500   4
12: 09:42:11     9    100   5
13: 10:00:31    10    100   5
like image 22
Frank Avatar answered Nov 11 '22 06:11

Frank


Here's one way using data.table and it's rolling joins feature:

require(data.table) # v1.9.6+
setDT(df)[, csum := cumsum(Volume)]
ans = rbind(df, df[.(csum=500 * seq_len(max(csum)%/% 500L)), roll=-Inf, on="csum"])
setorder(ans, Price, csum)
ans = ans[, `:=`(Volume = c(csum[1L], diff(csum)), 
                 id     = (csum-1L) %/% 500L + 1L, 
                 csum   = NULL)][Volume > 0L]

The first step adds a new column with the cumulative sum of Volume.

The second step is perhaps the most important. Let's look at the second part. For each multiple of 500 until max(csum), it joins to the first value >= a multiple of 500 on df$csum. It's a rolling NOCB join (next observation carried backward). With that we'd get:

#        Time Price Volume csum
# 1: 09:35:56     3    300  500
# 2: 09:37:28     6    300 1000
# 3: 09:37:51     8    200 1500
# 4: 09:42:11     9    600 2000

These are the break points that needs to be added to your original data.table. That's what we do with rbind().

Then, all we do is to order by Price, csum, generate back Volume column. From there, generating id column can be done using csum column as shown.

like image 3
Arun Avatar answered Nov 11 '22 06:11

Arun