Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Transactions reshaping from long to wide, joining Buy and Sell dataframes

I have buy and sell transactions in long format and I want to transform it to wide format. Look at example:

enter image description here

For every BUY transaction of some ticker must exist SELL transaction of the same ticker which closes position. If SELL transaction don't exists or shares count becomes zero then put NA at sell price.

Explanation:

We bought 100 shares of AIG ticker at price 34.56. Next we must find exit (SELL) transaction for BUY transaction of the same ticker AIG. This transaction exists below with 600 shares. So we close our AIG BUY transaction with 100 shares, decreasing shares of SELL transaction from 600 to 500 and write this transaction in wide format with buy price and sell price.

Next transaction is GOOG. For this ticker we found two SELL transactiosn and write them all in wide format, but 100 shares was unsold, so we put this transaction as "unfinished" with NA in sell price.

If necessary, I can put algorithm in pseudocode later. But I hope, my explanation is clear.

My question is following: It's easy to do it in R with clean and vectorized code? This algorithm is pretty easy to program in imperative-paradigm languages, like C++. But with R I have troubles.

EDIT 1: Added input and output data frames for R:

inputDF1 <- data.frame(Ticker = c("AIG", "GOOG", rep("AIG", 3), rep("GOOG", 2), rep("NEM", 3)), Side = c(rep("BUY", 4), rep("SELL", 3), "BUY", rep("SELL", 2)), Shares = c(100, 400, 200, 400, 600, 200, 100, 100, 50, 50), Price = c(34.56, 457, 28.56, 24.65, 30.02, 460, 461, 45, 56, 78))
inputDF2 <- data.frame(Ticker = c(rep("AIG", 3), rep("GOOG", 3)), Side = c(rep("BUY", 2), "SELL", "BUY", rep("SELL", 2)), Shares = c(100, 100, 200, 300, 200, 100), Price = c(34, 35, 36, 457, 458, 459))
inputDF3 <- data.frame(Ticker = c(rep("AIG", 3), rep("GOOG", 3)), Side = c(rep("BUY", 2), "SELL", "BUY", rep("SELL", 2)), Shares = c(100, 100, 100, 300, 100, 100), Price = c(34, 35, 36, 457, 458, 459))

outputDF1 <- data.frame(Ticker = c("AIG", rep("GOOG", 3), rep("AIG", 3), rep("NEM", 2)), Side = rep("BUY", 9), Shares = c(100, 200, 100, 100, 200, 300, 100, 50, 50), BuyPrice = c(34.56, 457, 457, 457, 28.56, 24.65, 24.65, 45, 45), SellPrice = c(30.02, 460, 461, NA, 30.02, 30.02, NA, 56, 78))
outputDF2 <- data.frame(Ticker = c(rep("AIG", 2), rep("GOOG", 2)), Side = rep("BUY", 4), Shares = c(100, 100, 200, 100), BuyPrice = c(34, 35, 457, 457), SellPrice = c(36, 36, 458, 459))
outputDF3 <- data.frame(Ticker = c(rep("AIG", 2), rep("GOOG", 3)), Side = rep("BUY", 5), Shares = rep(100, 5), BuyPrice = c(34, 35, rep(457, 3)), SellPrice = c(36, NA, 458, 459, NA))

EDIT 2: Updated example and input/output data for R

like image 283
Eldar Agalarov Avatar asked Oct 18 '25 19:10

Eldar Agalarov


2 Answers

Original Answer (while the question was still in development and I wasn't paying close enough attention)

Use dcast from reshape2:

> t <- c("AIG", "GOOG", "AIG", "AIG", "AIG", "GOOG", "GOOG")
> sd <- c(rep("BUY", 4), rep("SELL", 3))
> sh <- c(100, 400, 200, 400, 600, 200, 100)
> pr <- c(34.56, 457, 28.56, 24.65, 30.02, 460, 461)
> df <- data.frame(Ticker = t, Side = sd, Shares = sh, Price = pr)
> 
> library(reshape2)
> df
  Ticker Side Shares  Price
1    AIG  BUY    100  34.56
2   GOOG  BUY    400 457.00
3    AIG  BUY    200  28.56
4    AIG  BUY    400  24.65
5    AIG SELL    600  30.02
6   GOOG SELL    200 460.00
7   GOOG SELL    100 461.00
> dcast(df, Ticker*Shares ~ Side, value.var="Price")
  Ticker Shares    BUY   SELL
1    AIG    100  34.56     NA
2    AIG    200  28.56     NA
3    AIG    400  24.65     NA
4    AIG    600     NA  30.02
5   GOOG    100     NA 461.00
6   GOOG    200     NA 460.00
7   GOOG    400 457.00     NA

New Answer

The key sticking point here is that "vector-based" in R is often tied to "functional" (e.g. the apply() family), but a purely functional approach doesn't quite work here, because you have to update the sell list for every (part of each) buy transaction. I really feel like you could do something magical with aggregate or by and a carefully designed function, but the best readable solution that came to me involves a simple for-loop.

Version with for

inputDF <- data.frame(Ticker = c("AIG", "GOOG", "AIG", "AIG", "AIG", "GOOG", "GOOG"), 
                      Side = c(rep("BUY", 4), rep("SELL", 3)), 
                      Shares = c(100, 400, 200, 400, 600, 200, 100), 
                      Price = c(34.56, 457, 28.56, 24.65, 30.02, 460, 461))
buys <- subset(inputDF,Side=="BUY")
sells <- subset(inputDF,Side=="SELL")
transactions <- NULL

# go through every buy operation
for(i in 1:nrow(buys)){
    ticker <- buys[i,"Ticker"]
    bp <- buys[i,"Price"]
    shares <- buys[i,"Shares"]

    # keep going as long as we can find sellers
    while(shares > 0 & sum(sells[sells$Ticker == ticker,"Shares"]) > 0){
        sp <- sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Price"]
        if(sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Shares"] > shares){
            shares.sold <- shares
        }else{
            shares.sold <- sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Shares"]
        }
        shares <- shares - shares.sold
        sells[sells$Shares >= shares & sells$Ticker == ticker,][1,"Shares"] <- sells[sells$Shares >= shares & sells$Ticker == ticker,][1,"Shares"] - shares.sold
        transactions <- rbind(transactions,data.frame("Ticker"=ticker
                                                      ,"Side"="BUY"
                                                      ,"Shares"=shares.sold
                                                      ,"BuyPrice"=bp
                                                   ,"SellPrice"=sp))
    }
    # not enough sellers
    if(shares > 0){ 
        transactions <- rbind(transactions,data.frame("Ticker"=ticker
                                                  ,"Side"="BUY"
                                                  ,"Shares"=shares
                                                  ,"BuyPrice"=bp
                                                  ,"SellPrice"="NA"))

    }

}

print(transactions)

Output:

  Ticker Side Shares BuyPrice SellPrice
1    AIG  BUY    100    34.56     30.02
2   GOOG  BUY    200   457.00       460
3   GOOG  BUY    100   457.00       461
4   GOOG  BUY    100   457.00        NA
5    AIG  BUY    200    28.56     30.02
6    AIG  BUY    300    24.65     30.02
7    AIG  BUY    100    24.65        NA

The updating becomes obvious if we try to use the foreach package to automagically parallelize the loop. It quickly becomes apparent that we have a race condition on the sell data frame.

Version with apply

There are a couple of inefficiencies in the code above that could be improved upon. The append operation via rbind() isn't terribly efficient and could probably be optimized a bit, either reducing the number of calls to rbind() or eliminating it all together. You could also pack everything into a function and convert it to a call to apply(), which does tend to be faster even for serial apply() because the looping is done at a more optimized level. (The same is true for CPython -- list comprehensions and str.join() are much faster than for loops because they're "more aware" of the total size of the operation and because they're written in optimized C.) Here's a first attempt -- note that we use do.call(rbind, list(...)) to simplify the list of small data frames we get back from the original call to apply. This isn't terribly efficient (rbindlist from data.table is significantly faster, see here), but it doesn't have any external dependencies. The list you get back from apply() is actually interesting in its own way -- every element is the list of transactions you needed to do complete one entire buy operation. If you added row names to the buys data frame, then you could call up each set of transactions by name.

inputDF <- data.frame(Ticker = c("AIG", "GOOG", "AIG", "AIG", "AIG", "GOOG", "GOOG"), 
                      Side = c(rep("BUY", 4), rep("SELL", 3)), 
                      Shares = c(100, 400, 200, 400, 600, 200, 100), 
                      Price = c(34.56, 457, 28.56, 24.65, 30.02, 460, 461))
buys <- subset(inputDF,Side=="BUY")
sells <- subset(inputDF,Side=="SELL")
transactions <- NULL

# go through every buy operation
buy.operation <- function(x){
    ticker <- x["Ticker"]
    # apply() converts to matix implicity, and all the elements of a matrix have
    # have the same data type, so everything gets converted to characters
    # thus, we need to convert back
    bp <- as.numeric(x["Price"])
    shares <- as.numeric(x["Shares"])

    # keep going as long as we can find sellers
    while(shares > 0 & sum(sells[sells$Ticker == ticker,"Shares"]) > 0){
        sp <- sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Price"]
        if(sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Shares"] > shares){
            shares.sold <- shares
        }else{
            shares.sold <- sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Shares"]
        }
        shares <- shares - shares.sold
        sells[sells$Shares >= shares & sells$Ticker == ticker,][1,"Shares"] <- sells[sells$Shares >= shares & sells$Ticker == ticker,][1,"Shares"] - shares.sold
        transactions <- rbind(transactions,data.frame("Ticker"=ticker
                                                      ,"Side"="BUY"
                                                      ,"Shares"=shares.sold
                                                      ,"BuyPrice"=bp
                                                      ,"SellPrice"=sp))
    }
    # not enough sellers
    if(shares > 0){ 
        transactions <- rbind(transactions,data.frame("Ticker"=ticker
                                                      ,"Side"="BUY"
                                                      ,"Shares"=shares
                                                      ,"BuyPrice"=bp
                                                      ,"SellPrice"="NA"))

    }

    transactions
}

transactions <- do.call(rbind, apply(buys,1,buy.operation) )
# get rid of weird row names
row.names(transactions) <- NULL
print(transactions)

Output:

  Ticker Side Shares BuyPrice SellPrice
1    AIG  BUY    100    34.56     30.02
2   GOOG  BUY    200   457.00       460
3   GOOG  BUY    100   457.00       461
4   GOOG  BUY    100   457.00        NA
5    AIG  BUY    200    28.56     30.02
6    AIG  BUY    400    24.65     30.02

Unfortunately, the final incomplete AIG transaction is missing. I haven't figured out quite yet how to fix that.

like image 59
Livius Avatar answered Oct 20 '25 08:10

Livius


Hm, I spent way too much time on this question! Here's my attempt (with data.table).

Since you don't mention anything about your real data dimensions, I've not been able to optimise it any further. It'd be nice if you could run this on your real dataset and write back your findings (reg. speed/scaling).

First we've to split the dataset by Side and perform a join. This is the most straightforward approach. I also see that @Mike.Gahan has attempted along this route as well.

require(data.table)
dt1 <- as.data.table(inputDF1)
d1 <- dt1[Side == "BUY"][, N := .N > 1L, by=Ticker]
d2 <- dt1[Side == "SELL"]
setkey(d2, Ticker)
ans = d2[d1, allow.cartesian=TRUE][, Side := NULL]

Note that allow.cartesian does not perform a cartesian join. It's used very loosely here. Read ?data.table for more info or check this post on what it's for. The join, basically, will be really fast and will scale really well. This is not a limiting step.

We now set the column order and names accordingly:

setcolorder(ans, c("Ticker", "Side.1", "Shares.1", "Shares", "Price.1", "Price", "N"))
setnames(ans, c("Ticker", "Side", "Shares", "tmp", "BuyPrice", "SellPrice", "N"))

We interchange Shares and tmp so that Shares reflects the actual output we expect, based on the value of N as follows:

ans[, c("Shares", "tmp") := if (!N[1L]) 
             { val = Shares[1L]; list(tmp, val) }, by = Ticker]

We'll need a couple of parameters to aggregate and get the final result:

ans[, `:=`(N2= rep(c(FALSE, TRUE), c(.N-1L, 1L)), 
           csum = sum(Shares)), by = Ticker][, N2 := !(N2 * (csum != tmp))]

Finally,

ans1 = ans[(N2)][, c("N", "N2", "tmp", "csum") := NULL]
ans2 = ans[!(N2)][, N := N * 1L]
if (nrow(ans2) > 0) {
    ans2 = ans2[,  list("BUY", if (N[1L]) c(Shares+tmp-csum, csum-tmp) 
             else c(Shares, tmp-csum), BuyPrice, c(SellPrice, NA)), by=Ticker]
}
ans  = rbindlist(list(ans1, ans2))

#    Ticker Side Shares BuyPrice SellPrice
# 1:    AIG  BUY    100    34.56     30.02
# 2:   GOOG  BUY    200   457.00    460.00
# 3:    AIG  BUY    200    28.56     30.02
# 4:    NEM  BUY     50    45.00     56.00
# 5:    NEM  BUY     50    45.00     78.00
# 6:   GOOG  BUY    100   457.00    461.00
# 7:   GOOG  BUY    100   457.00        NA
# 8:    AIG  BUY    300    24.65     30.02
# 9:    AIG  BUY    100    24.65        NA

My guess is that this should be plentiful fast. But, it might be possible to optimise this further. I'll leave that to you, should you choose to build up on this answer.

like image 39
Arun Avatar answered Oct 20 '25 09:10

Arun



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!