I have buy and sell transactions in long format and I want to transform it to wide format. Look at example:

For every BUY transaction of some ticker must exist SELL transaction of the same ticker which closes position. If SELL transaction don't exists or shares count becomes zero then put NA at sell price.
Explanation:
We bought 100 shares of AIG ticker at price 34.56. Next we must find exit (SELL) transaction for BUY transaction of the same ticker AIG. This transaction exists below with 600 shares. So we close our AIG BUY transaction with 100 shares, decreasing shares of SELL transaction from 600 to 500 and write this transaction in wide format with buy price and sell price.
Next transaction is GOOG. For this ticker we found two SELL transactiosn and write them all in wide format, but 100 shares was unsold, so we put this transaction as "unfinished" with NA in sell price.
If necessary, I can put algorithm in pseudocode later. But I hope, my explanation is clear.
My question is following: It's easy to do it in R with clean and vectorized code? This algorithm is pretty easy to program in imperative-paradigm languages, like C++. But with R I have troubles.
EDIT 1: Added input and output data frames for R:
inputDF1 <- data.frame(Ticker = c("AIG", "GOOG", rep("AIG", 3), rep("GOOG", 2), rep("NEM", 3)), Side = c(rep("BUY", 4), rep("SELL", 3), "BUY", rep("SELL", 2)), Shares = c(100, 400, 200, 400, 600, 200, 100, 100, 50, 50), Price = c(34.56, 457, 28.56, 24.65, 30.02, 460, 461, 45, 56, 78))
inputDF2 <- data.frame(Ticker = c(rep("AIG", 3), rep("GOOG", 3)), Side = c(rep("BUY", 2), "SELL", "BUY", rep("SELL", 2)), Shares = c(100, 100, 200, 300, 200, 100), Price = c(34, 35, 36, 457, 458, 459))
inputDF3 <- data.frame(Ticker = c(rep("AIG", 3), rep("GOOG", 3)), Side = c(rep("BUY", 2), "SELL", "BUY", rep("SELL", 2)), Shares = c(100, 100, 100, 300, 100, 100), Price = c(34, 35, 36, 457, 458, 459))
outputDF1 <- data.frame(Ticker = c("AIG", rep("GOOG", 3), rep("AIG", 3), rep("NEM", 2)), Side = rep("BUY", 9), Shares = c(100, 200, 100, 100, 200, 300, 100, 50, 50), BuyPrice = c(34.56, 457, 457, 457, 28.56, 24.65, 24.65, 45, 45), SellPrice = c(30.02, 460, 461, NA, 30.02, 30.02, NA, 56, 78))
outputDF2 <- data.frame(Ticker = c(rep("AIG", 2), rep("GOOG", 2)), Side = rep("BUY", 4), Shares = c(100, 100, 200, 100), BuyPrice = c(34, 35, 457, 457), SellPrice = c(36, 36, 458, 459))
outputDF3 <- data.frame(Ticker = c(rep("AIG", 2), rep("GOOG", 3)), Side = rep("BUY", 5), Shares = rep(100, 5), BuyPrice = c(34, 35, rep(457, 3)), SellPrice = c(36, NA, 458, 459, NA))
EDIT 2: Updated example and input/output data for R
Use dcast from reshape2:
> t <- c("AIG", "GOOG", "AIG", "AIG", "AIG", "GOOG", "GOOG")
> sd <- c(rep("BUY", 4), rep("SELL", 3))
> sh <- c(100, 400, 200, 400, 600, 200, 100)
> pr <- c(34.56, 457, 28.56, 24.65, 30.02, 460, 461)
> df <- data.frame(Ticker = t, Side = sd, Shares = sh, Price = pr)
>
> library(reshape2)
> df
Ticker Side Shares Price
1 AIG BUY 100 34.56
2 GOOG BUY 400 457.00
3 AIG BUY 200 28.56
4 AIG BUY 400 24.65
5 AIG SELL 600 30.02
6 GOOG SELL 200 460.00
7 GOOG SELL 100 461.00
> dcast(df, Ticker*Shares ~ Side, value.var="Price")
Ticker Shares BUY SELL
1 AIG 100 34.56 NA
2 AIG 200 28.56 NA
3 AIG 400 24.65 NA
4 AIG 600 NA 30.02
5 GOOG 100 NA 461.00
6 GOOG 200 NA 460.00
7 GOOG 400 457.00 NA
The key sticking point here is that "vector-based" in R is often tied to "functional" (e.g. the apply() family), but a purely functional approach doesn't quite work here, because you have to update the sell list for every (part of each) buy transaction. I really feel like you could do something magical with aggregate or by and a carefully designed function, but the best readable solution that came to me involves a simple for-loop.
forinputDF <- data.frame(Ticker = c("AIG", "GOOG", "AIG", "AIG", "AIG", "GOOG", "GOOG"),
Side = c(rep("BUY", 4), rep("SELL", 3)),
Shares = c(100, 400, 200, 400, 600, 200, 100),
Price = c(34.56, 457, 28.56, 24.65, 30.02, 460, 461))
buys <- subset(inputDF,Side=="BUY")
sells <- subset(inputDF,Side=="SELL")
transactions <- NULL
# go through every buy operation
for(i in 1:nrow(buys)){
ticker <- buys[i,"Ticker"]
bp <- buys[i,"Price"]
shares <- buys[i,"Shares"]
# keep going as long as we can find sellers
while(shares > 0 & sum(sells[sells$Ticker == ticker,"Shares"]) > 0){
sp <- sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Price"]
if(sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Shares"] > shares){
shares.sold <- shares
}else{
shares.sold <- sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Shares"]
}
shares <- shares - shares.sold
sells[sells$Shares >= shares & sells$Ticker == ticker,][1,"Shares"] <- sells[sells$Shares >= shares & sells$Ticker == ticker,][1,"Shares"] - shares.sold
transactions <- rbind(transactions,data.frame("Ticker"=ticker
,"Side"="BUY"
,"Shares"=shares.sold
,"BuyPrice"=bp
,"SellPrice"=sp))
}
# not enough sellers
if(shares > 0){
transactions <- rbind(transactions,data.frame("Ticker"=ticker
,"Side"="BUY"
,"Shares"=shares
,"BuyPrice"=bp
,"SellPrice"="NA"))
}
}
print(transactions)
Output:
Ticker Side Shares BuyPrice SellPrice
1 AIG BUY 100 34.56 30.02
2 GOOG BUY 200 457.00 460
3 GOOG BUY 100 457.00 461
4 GOOG BUY 100 457.00 NA
5 AIG BUY 200 28.56 30.02
6 AIG BUY 300 24.65 30.02
7 AIG BUY 100 24.65 NA
The updating becomes obvious if we try to use the foreach package to automagically parallelize the loop. It quickly becomes apparent that we have a race condition on the sell data frame.
applyThere are a couple of inefficiencies in the code above that could be improved upon. The append operation via rbind() isn't terribly efficient and could probably be optimized a bit, either reducing the number of calls to rbind() or eliminating it all together. You could also pack everything into a function and convert it to a call to apply(), which does tend to be faster even for serial apply() because the looping is done at a more optimized level. (The same is true for CPython -- list comprehensions and str.join() are much faster than for loops because they're "more aware" of the total size of the operation and because they're written in optimized C.) Here's a first attempt -- note that we use do.call(rbind, list(...)) to simplify the list of small data frames we get back from the original call to apply. This isn't terribly efficient (rbindlist from data.table is significantly faster, see here), but it doesn't have any external dependencies. The list you get back from apply() is actually interesting in its own way -- every element is the list of transactions you needed to do complete one entire buy operation. If you added row names to the buys data frame, then you could call up each set of transactions by name.
inputDF <- data.frame(Ticker = c("AIG", "GOOG", "AIG", "AIG", "AIG", "GOOG", "GOOG"),
Side = c(rep("BUY", 4), rep("SELL", 3)),
Shares = c(100, 400, 200, 400, 600, 200, 100),
Price = c(34.56, 457, 28.56, 24.65, 30.02, 460, 461))
buys <- subset(inputDF,Side=="BUY")
sells <- subset(inputDF,Side=="SELL")
transactions <- NULL
# go through every buy operation
buy.operation <- function(x){
ticker <- x["Ticker"]
# apply() converts to matix implicity, and all the elements of a matrix have
# have the same data type, so everything gets converted to characters
# thus, we need to convert back
bp <- as.numeric(x["Price"])
shares <- as.numeric(x["Shares"])
# keep going as long as we can find sellers
while(shares > 0 & sum(sells[sells$Ticker == ticker,"Shares"]) > 0){
sp <- sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Price"]
if(sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Shares"] > shares){
shares.sold <- shares
}else{
shares.sold <- sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Shares"]
}
shares <- shares - shares.sold
sells[sells$Shares >= shares & sells$Ticker == ticker,][1,"Shares"] <- sells[sells$Shares >= shares & sells$Ticker == ticker,][1,"Shares"] - shares.sold
transactions <- rbind(transactions,data.frame("Ticker"=ticker
,"Side"="BUY"
,"Shares"=shares.sold
,"BuyPrice"=bp
,"SellPrice"=sp))
}
# not enough sellers
if(shares > 0){
transactions <- rbind(transactions,data.frame("Ticker"=ticker
,"Side"="BUY"
,"Shares"=shares
,"BuyPrice"=bp
,"SellPrice"="NA"))
}
transactions
}
transactions <- do.call(rbind, apply(buys,1,buy.operation) )
# get rid of weird row names
row.names(transactions) <- NULL
print(transactions)
Output:
Ticker Side Shares BuyPrice SellPrice
1 AIG BUY 100 34.56 30.02
2 GOOG BUY 200 457.00 460
3 GOOG BUY 100 457.00 461
4 GOOG BUY 100 457.00 NA
5 AIG BUY 200 28.56 30.02
6 AIG BUY 400 24.65 30.02
Unfortunately, the final incomplete AIG transaction is missing. I haven't figured out quite yet how to fix that.
Hm, I spent way too much time on this question! Here's my attempt (with data.table).
Since you don't mention anything about your real data dimensions, I've not been able to optimise it any further. It'd be nice if you could run this on your real dataset and write back your findings (reg. speed/scaling).
First we've to split the dataset by Side and perform a join. This is the most straightforward approach. I also see that @Mike.Gahan has attempted along this route as well.
require(data.table)
dt1 <- as.data.table(inputDF1)
d1 <- dt1[Side == "BUY"][, N := .N > 1L, by=Ticker]
d2 <- dt1[Side == "SELL"]
setkey(d2, Ticker)
ans = d2[d1, allow.cartesian=TRUE][, Side := NULL]
Note that
allow.cartesiandoes not perform a cartesian join. It's used very loosely here. Read?data.tablefor more info or check this post on what it's for. The join, basically, will be really fast and will scale really well. This is not a limiting step.
We now set the column order and names accordingly:
setcolorder(ans, c("Ticker", "Side.1", "Shares.1", "Shares", "Price.1", "Price", "N"))
setnames(ans, c("Ticker", "Side", "Shares", "tmp", "BuyPrice", "SellPrice", "N"))
We interchange Shares and tmp so that Shares reflects the actual output we expect, based on the value of N as follows:
ans[, c("Shares", "tmp") := if (!N[1L])
{ val = Shares[1L]; list(tmp, val) }, by = Ticker]
We'll need a couple of parameters to aggregate and get the final result:
ans[, `:=`(N2= rep(c(FALSE, TRUE), c(.N-1L, 1L)),
csum = sum(Shares)), by = Ticker][, N2 := !(N2 * (csum != tmp))]
Finally,
ans1 = ans[(N2)][, c("N", "N2", "tmp", "csum") := NULL]
ans2 = ans[!(N2)][, N := N * 1L]
if (nrow(ans2) > 0) {
ans2 = ans2[, list("BUY", if (N[1L]) c(Shares+tmp-csum, csum-tmp)
else c(Shares, tmp-csum), BuyPrice, c(SellPrice, NA)), by=Ticker]
}
ans = rbindlist(list(ans1, ans2))
# Ticker Side Shares BuyPrice SellPrice
# 1: AIG BUY 100 34.56 30.02
# 2: GOOG BUY 200 457.00 460.00
# 3: AIG BUY 200 28.56 30.02
# 4: NEM BUY 50 45.00 56.00
# 5: NEM BUY 50 45.00 78.00
# 6: GOOG BUY 100 457.00 461.00
# 7: GOOG BUY 100 457.00 NA
# 8: AIG BUY 300 24.65 30.02
# 9: AIG BUY 100 24.65 NA
My guess is that this should be plentiful fast. But, it might be possible to optimise this further. I'll leave that to you, should you choose to build up on this answer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With