
Is there a better solution than a for loop when you have to keep track of a running balance?

Tags:

r

I have a large data frame with millions of rows. It is time series data. For example:

dates <- c(1,2,3)
purchase_price <- c(5,2,1)
income <- c(2,2,2)
df <- data.frame(dates=dates,price=purchase_price,income=income)

I want to create a new column that tells me how much I spent on each day, with some rule like "if I have enough money, then buy it. Otherwise, save the money."

I am currently looping through each row of the data frame and keeping track of a running total of money. However, this takes forever on the large dataset. As far as I can tell, I can't use a vectorized operation because each row depends on the running balance left over from the previous row.

Inside the for loop I am doing:

balance = balance + row$income
buy_amt = min(balance,row$price)
balance = balance - buy_amt

Is there any faster solution?

Thanks!

user2374133 asked Oct 27 '13 18:10


2 Answers

For problems that are easily expressed in terms of loops, I'm becoming increasingly convinced that Rcpp is the right solution. It's relatively easy to pick up and you can express loop-y algorithms very naturally.

Here's a solution to your problem using Rcpp:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List purchaseWhenPossible(NumericVector date, NumericVector income, 
                          NumericVector price, double init_balance = 0) {  
  int n = date.length();
  NumericVector balance(n);
  LogicalVector buy(n);

  for (int i = 0; i < n; ++i) {
    // Carry over yesterday's balance (or the initial one) and add today's income
    balance[i] = ((i == 0) ? init_balance : balance[i - 1]) + income[i];

    // Buy it if you can afford it
    if (balance[i] >= price[i]) {
      buy[i] = true;
      balance[i] -= price[i];
    } else {
      buy[i] = false;
    }

  }

  return List::create(_["buy"] = buy, _["balance"] = balance);
}

/*** R

# Copying input data from Ricardo
df <- data.frame(
  dates = 1:6,
  income = rep(2, 6),
  price = c(5, 2, 3, 5, 2, 1)
)

out <- purchaseWhenPossible(df$dates, df$income, df$price, 3)
df$balance <- out$balance
df$buy <- out$buy

*/

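To run it, save it into a file called purchase.cpp, then run Rcpp::sourceCpp("purchase.cpp"). The R code inside the /*** R block is executed automatically after compilation.

It should be fast, because the loop runs as compiled C++ rather than interpreted R, but I didn't do any formal benchmarking.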
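If you want to check the recurrence without going through R, the same loop can be sketched in plain C++ with std::vector. This is only an illustration under the all-or-nothing rule used above; the names simulatePurchases and PurchaseLog are made up for this sketch and are not part of Rcpp or any library. It also records the amount spent on each day, which is the column the question asks for.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch: one entry per day.
struct PurchaseLog {
  std::vector<double> balance;  // balance at the end of each day
  std::vector<double> spent;    // amount spent that day (0 if nothing bought)
};

// All-or-nothing rule: add the day's income, then buy only if the full
// price is affordable. Assumes income and price have the same length.
PurchaseLog simulatePurchases(const std::vector<double>& income,
                              const std::vector<double>& price,
                              double init_balance = 0.0) {
  const std::size_t n = income.size();
  PurchaseLog out{std::vector<double>(n), std::vector<double>(n, 0.0)};

  double running = init_balance;
  for (std::size_t i = 0; i < n; ++i) {
    running += income[i];
    if (running >= price[i]) {  // buy it if you can afford it
      out.spent[i] = price[i];
      running -= price[i];
    }
    out.balance[i] = running;
  }
  return out;
}
```

With the six-row example data (income 2 every day, prices 5, 2, 3, 5, 2, 1) and a starting balance of 3, this gives end-of-day balances 0, 0, 2, 4, 4, 5 and daily spend 5, 2, 0, 0, 2, 1. Keeping the running balance in a local variable rather than reading balance[i - 1] is just a design choice that avoids the special case for the first row.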

hadley answered Sep 19 '22 01:09


As Paul points out, some iteration is necessary. You have a dependency between one instance and a previous point.

However, the dependency only matters when a purchase is made (read: you only need to recalculate the balance when...). Between purchases, the balance is just a cumulative sum of income. Therefore, you can iterate in "batches".

The following does exactly that by identifying the next row where there is sufficient balance to make a purchase. It handles all the preceding rows in a single call, then proceeds from that point.

library(data.table)
DT <- as.data.table(df)

## Initial Balance
b.init <- 2

setattr(DT, "Starting Balance", b.init)

## Raw balance for the day, regardless of purchase
DT[, balance := b.init + cumsum(income)]
DT[, buying  := FALSE]

## Set N, to not have to call nrow(DT) several times
N   <- nrow(DT)

## Initialize
ind <- seq_len(N)

# Identify where the next purchase is
while(length(buys <- DT[ind, ind[which(price <= balance)]]) && min(ind) < N) {
  next.buy <- buys[[1L]] # only grab the first one
  if (next.buy > ind[[1L]]) {
    not.buys <- ind[1L]:(next.buy-1L)
    DT[not.buys, buying := FALSE]
  }
  DT[next.buy, `:=`(buying  = TRUE
                  , balance = (balance - price)
                  ) ]

  # If there are still subsequent rows after 'next.buy', recalculate the balance
  ind <- (next.buy+1) : N
#  if (N > ind[[1]]) {  ## So that
    DT[ind, balance := cumsum(income) + DT[["balance"]][[ ind[[1]]-1L]] ]
#  }
}
# Final row needs to be outside of while-loop, or else will buy that same item multiple times
if (DT[N, !buying && (balance >= price)])  # >= to match the 'price <= balance' test in the loop
  DT[N, `:=`(buying  = TRUE, balance = (balance - price)) ]

RESULTS:

## Show output
{
  print(DT)
  cat("Starting Balance was", attr(DT, "Starting Balance"), "\n")
}


## Starting with 3: 
   dates price income balance buying
1:     1     5      2       0   TRUE
2:     2     2      2       0   TRUE
3:     3     3      2       2  FALSE
4:     4     5      2       4  FALSE
5:     5     2      2       4   TRUE
6:     6     1      2       5   TRUE
Starting Balance was 3

## Starting with 2: 
   dates price income balance buying
1:     1     5      2       4  FALSE
2:     2     2      2       4   TRUE
3:     3     3      2       3   TRUE
4:     4     5      2       0   TRUE
5:     5     2      2       0   TRUE
6:     6     1      2       1   TRUE
Starting Balance was 2


# I modified your original data slightly, for testing
df <- rbind(df, df)
df$dates <- seq_along(df$dates)
df[["price"]][[3]] <- 3

Ricardo Saporta answered Sep 20 '22 01:09