How to use the ff package in a simple 'for' loop on a large dataset

I'm trying to do some basic calculations on a large table (~94 million rows, 3 columns), which requires a package like ff in R. However, I can't get ff to work as intended: R runs out of memory even though my computer should be more than capable of handling the data. My hardware/software specs are below, along with my code, which doesn't seem to use the ff package properly. I've spent over 100 hours reading every PDF, PPT, and website that mentions ff, and I haven't found anything that explains clearly how to use it (at least to an amateur like me). The logic below works when I run it on up to about 1.1 million rows, but beyond that it seems to go out of bounds. Any help with what I'm doing wrong would be greatly appreciated.

I have also tried breaking the 'for' loop into chunks, each 1/200 of the total size. On each pass I create new ff objects for the existing ShortPrice and LongPrice ff files, then call rm() and gc() at the end of the pass. However, when I create the ff files for each column with read.table.ffdf at the beginning and later try to open the existing TradePosition ff file as a new ff object with vmode = "quad", "integer" or "raw", I lose the TradePosition values for some reason.
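To make the chunking idea concrete, here is a minimal base-R sketch of how the row ranges could be laid out. `make_chunks` is a hypothetical helper written for illustration, not part of ff, and it assumes at least two rows. Each range starts on the last row of the previous one, so the comparison between a row and the next row is not lost at a chunk boundary.

```r
# Hypothetical helper (not an ff function): split rows 1..n into about
# n_chunks ranges. Every range after the first starts on the previous
# range's last row, so the boundary pair of rows is still compared.
# Assumes n >= 2.
make_chunks <- function(n, n_chunks) {
  size   <- ceiling(n / n_chunks)
  starts <- seq(1, n - 1, by = size)
  ends   <- pmin(starts + size, n)
  data.frame(start = starts, end = ends)
}

# Row ranges for the full table, split 200 ways as described above
chunks <- make_chunks(n = 94741221, n_chunks = 200)
head(chunks, 3)
```

Only the rows of one range would then need to be in RAM at a time; how those rows are pulled from the ff columns is left out here.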

Hardware/Software Specs:

  • June 2012 MacBook Pro with 16 GB RAM, quad-core i7 processor, 512 GB SSD
  • OS X 10.8.2
  • 32-bit build of R

Data/Tables:

  • Text file named "Trades.txt" has 94,741,221 rows, three columns
  • Column 1 named TradePosition ("factor" type, levels/values = "0", "Short" or "Long")
  • Column 2 named ShortPrice ("double" type, values represent EUR/USD currency prices to 5 decimal places)
  • Column 3 named LongPrice ("double" type, values represent EUR/USD currency prices to 5 decimal places)
  • Internal R variable "DatasetLength" = 94,741,221
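For scale, here is a rough estimate of how much RAM a single ordinary (non-ff) double vector with one element per row would need. It is back-of-the-envelope arithmetic that assumes 8 bytes per double and ignores R's per-object overhead and any temporary copies.

```r
# Rough in-RAM size of one double vector with one element per row.
DatasetLength    <- 94741221
bytes_per_double <- 8                      # assumption: 8-byte doubles
vector_bytes     <- DatasetLength * bytes_per_double
vector_mib       <- vector_bytes / 2^20    # bytes -> MiB
round(vector_mib, 1)                       # about 722.8 MiB
```

A 32-bit process can address at most 4 GB regardless of the 16 GB installed, so a few vectors of this size plus temporaries may be enough to hit the limit.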

Code:

library(ff)

# Total number of data rows in Trades.txt
DatasetLength <- 94741221

options(
    fftempdir    = "/Users/neil/Code/",
    ffbatchbytes = 20 * getOption("ffbatchbytes"),
    ffmaxbytes   = 8 * getOption("ffmaxbytes"),
    ffpagesize   = 1000 * 65536,
    ffcaching    = "mmnoflush"
)

# Read the semicolon-separated file into an ffdf
ffdfTrades <- read.table.ffdf(
    file         = "/Users/neil/Code/Trades.txt",
    nrows        = DatasetLength,
    FUN          = "read.table",
    header       = TRUE,
    sep          = ";",
    quote        = "",
    colClasses   = c("factor", "numeric", "numeric"),
    comment.char = ""
)

# Result vector held in RAM: one double per row
Transactions <- numeric(DatasetLength)
for (dataindex in seq_len(DatasetLength - 1)) {

    if (ffdfTrades$TradePosition[dataindex]!=ffdfTrades$TradePosition[dataindex+1]) {

        if (ffdfTrades$TradePosition[dataindex+1]=="Short") {

            if (ffdfTrades$TradePosition[dataindex]=="Long") {
                Transactions[dataindex+1] <- -2*ffdfTrades$ShortPrice[dataindex+1]
            }

            else {
                Transactions[dataindex+1] <- -1*ffdfTrades$ShortPrice[dataindex+1]
            }
        }

        else {

            if (ffdfTrades$TradePosition[dataindex+1]=="Long") {

                if (ffdfTrades$TradePosition[dataindex]=="Short") {
                    Transactions[dataindex+1] <- 2*ffdfTrades$LongPrice[dataindex+1]
                }

                else {
                    Transactions[dataindex+1] <- 1*ffdfTrades$LongPrice[dataindex+1]
                }
            }
        }
    }

    # The for loop advances dataindex itself, so no manual increment is needed
    message("Row ", dataindex, " done.")
}
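
For comparison, the same rules can be written without a row-by-row loop. This is a minimal sketch that works on ordinary in-memory vectors. `transactions_for` is a hypothetical helper, not an ff function, and the sample values are made up for illustration. My assumption is that it could be applied to one block of rows at a time, but I haven't tested that against the ff columns.

```r
# Hypothetical helper: same rules as the loop above, applied to a whole
# block of rows at once. Inputs are ordinary in-memory vectors.
transactions_for <- function(position, short_price, long_price) {
  position <- as.character(position)
  n   <- length(position)
  out <- numeric(n)                 # row 1 always stays 0, as in the loop
  if (n < 2) return(out)

  prev <- position[-n]              # rows 1..n-1
  cur  <- position[-1]              # rows 2..n
  changed  <- prev != cur
  to_short <- changed & cur == "Short"
  to_long  <- changed & cur == "Long"

  # Factor 2 when flipping directly between Long and Short, otherwise 1
  short_mult <- ifelse(prev == "Long", 2, 1)
  long_mult  <- ifelse(prev == "Short", 2, 1)

  idx <- 2:n                        # output row for each (prev, cur) pair
  out[idx[to_short]] <- -short_mult[to_short] * short_price[idx[to_short]]
  out[idx[to_long]]  <-  long_mult[to_long]  * long_price[idx[to_long]]
  out
}

# Made-up sample block
pos    <- factor(c("0", "Long", "Long", "Short", "0", "Short", "Long"))
short  <- c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7)
long   <- c(2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7)
result <- transactions_for(pos, short, long)
```

In the sample, the Long to Short flip on row 4 uses factor 2, and the move from Short to "0" on row 5 leaves the value at 0, matching the branches in the loop.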