 

Time series as `ts` column in data.table?

I have multiple sets of time series data and would like help figuring out the best way to get them into R and analyze them with R. I'm pretty familiar with data.table but not so familiar with R's ts class supporting time series analysis.

In particular, I want to know how to use ts in this situation, or whether there are limitations in ts (such as problems aggregating a set of ts objects) that make it inappropriate to use here.

The Data

There are a large number of stores. For each store, I have multiple data points for each day, such as sales volume in dollars, sales volume in number of transactions, and store traffic (number of people entering the store). (Actually what I have is a table with columns store ID, date, and the data for that store and date.)

What I've been doing is using a data.table with one row per store, aggregating the data by store into months and storing the values for each month in a separate named column (e.g. jan14_dollars, feb14_dollars...) but this is unwieldy for a lot of reasons, in particular when I want to look at weeks or quarters.
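Concretely, the wide format I've been building looks roughly like this (sketched with made-up data and data.table's dcast; the `Jan14`-style column names come from `format(date, "%b%y")`):

```r
require("data.table")

# made-up data: one row per store per day for Jan-Mar 2014
set.seed(1)
storeData <- CJ(store = LETTERS,
                date = seq(as.Date("2014-01-01"), as.Date("2014-03-31"), by = "day"))
storeData[, dollars := sample(100:100000, .N, replace = TRUE) / 100]

# one row per store, one named column per month (the unwieldy part)
wide <- dcast(storeData[, .(dollars = sum(dollars)),
                        by = .(store, mon = format(date, "%b%y"))],
              store ~ mon, value.var = "dollars")
```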

I was thinking the right way to handle this was to have columns of type ts, so each row would just be store, dollars_ts, transactions_ts, traffic_ts, but (a) how do I get the data into that format, and (b) can ts objects be combined the way integers can to give me the results I want? If you can only answer (a) or (b) but not both, please do answer what you can.

I cannot provide a realistic data set, but you can generate a random one to play with like this:

require("data.table")

storeData <- CJ(store = LETTERS, date = seq(as.Date('2012-01-01'), as.Date('2014-01-01'), by = "day"))
storeData$dollars <- sample(100:100000, nrow(storeData), replace = TRUE)/100
storeData$transactions <- sample(0:1000, nrow(storeData), replace = TRUE)
storeData$traffic <- storeData$transactions + sample(0:1000, nrow(storeData), replace = TRUE)

head(storeData)
   store       date  dollars transactions traffic
1:     A 2012-01-01   48.60          409     990
2:     A 2012-01-02  996.89           36     428
3:     A 2012-01-03   69.35          647    1103
4:     A 2012-01-04  334.56          953     973
5:     A 2012-01-05  692.99          958    1753
6:     A 2012-01-06  973.32          724    1086

The Analysis

I want to answer questions like "how many stores had positive dollar sales growth?" and "is there a relationship between change in dollars/transaction and change in traffic?" and to bin the data into time periods and compare the answers across time periods (e.g. Q1 this year versus Q1 last year).

Can these kinds of questions be answered using ts? If so, how do I get this data into an appropriate set of columns or is there some structure other than data.table I should be using?

Please show both how to organize the data and then how to use the data to answer the example questions "how many stores had positive dollar sales growth in January 2014 compared to January 2013?" and "what is the overall trend in dollars per transaction for the past 3 months?"

Old Pro asked Jul 25 '14

3 Answers

You're asking a lot of questions here. I recommend you spend time reading about all the things data.table can do involving joins and aggregating data. Here is an example of how you would get the year over year growth of each store in the first quarter.

#the month<- and day<- replacement functions below come from lubridate
require(lubridate)

#get the first day of the first month for your binning
minDate <- min(storeData$date); month(minDate) <- 1; day(minDate) <- 1

#get the first day of the last month for your binning
maxDate <- max(storeData$date); month(maxDate) <- 12; day(maxDate) <- 1

#Build some bins
yearly<-data.table(leftBound=seq.Date(minDate,maxDate,by="year"))
quarterly<-data.table(leftBound=seq.Date(minDate,maxDate,by="3 months"))
monthly<-data.table(leftBound=seq.Date(minDate,maxDate,by="month"))

#Example for quarterly data
quarterly[, rollDate:=leftBound]
storeData[, rollDate:=date]

setkey(quarterly,"rollDate")
setkey(storeData,"rollDate")

temp<-quarterly[storeData, roll=TRUE] #associate each (store, date) pair with a quarter

#create a "join table" containing each quarter for each store
jt<-CJ(leftBound=quarterly$leftBound, store=unique(storeData$store))
setkey(temp,"leftBound","store")

dt<-temp[jt, allow.cartesian=TRUE]
dt[, `:=`(year=year(leftBound), quarter=quarter(leftBound))]

qSummary<-dt[,list(dollars=sum(dollars, na.rm=TRUE), 
         transactions=sum(transactions, na.rm=TRUE), 
         traffic=sum(traffic, na.rm=TRUE)),
   by=list(year,quarter,store)] #Summarize the data by quarter

#Get year/year growth for Q1
qSummary[, list(dollarGrowth = dollars[year == 2014 & quarter == 1] /
                               dollars[year == 2013 & quarter == 1]),
         by = store]

#First five rows...
    store dollarGrowth
 1:     A    0.0134860
 2:     B    0.0137215
 3:     C    0.0188249
 4:     D    0.0163887
 5:     E    0.0037576
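The second example question (the overall trend in dollars per transaction over the past 3 months) can be answered with the same data.table machinery. A minimal sketch, where the 90-day cutoff, monthly binning, and a simple linear fit for the trend are my own choices, not anything from the question:

```r
require("data.table")

# toy data in the same shape as the question's storeData
set.seed(1)
storeData <- CJ(store = LETTERS,
                date = seq(as.Date("2012-01-01"), as.Date("2014-01-01"), by = "day"))
storeData[, dollars := sample(100:100000, .N, replace = TRUE) / 100]
storeData[, transactions := sample(1:1000, .N, replace = TRUE)]

# keep roughly the last 3 months, bin into months, compute dollars/transaction
lastDate <- max(storeData$date)
recent <- storeData[date > lastDate - 90]
mSummary <- recent[, .(dpt = sum(dollars) / sum(transactions)),
                   by = .(month = format(date, "%Y-%m"))]

# overall trend: slope of a linear fit of dollars/transaction vs. month index
trend <- coef(lm(dpt ~ seq_along(dpt), data = mSummary))[2]
```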
Ben answered Oct 23 '22


You may want to look into the zoo package. It provides the zoo class, which handles irregular time series very well. It also has a zooreg class, which is analogous to the ts class, but for this answer we'll stick with zoo.

Here is a quick solution:

Step 0: Load the package and data:

# install.packages("zoo")
library(zoo)

st.dates <- seq(as.Date('2012-01-01'),as.Date('2014-01-01'),by="day")

n <- length(st.dates)

storeData <- 
  data.frame(
    store = rep(1:26, n),
    dollars = sample(100:100000, n*26, replace = TRUE)/100,
    transactions = sample(0:1000, n*26, replace = TRUE),
    traffic = sample(0:2000, n*26, replace = TRUE)
  )

Note that I am using a data.frame instead of a data.table, and the construction of the data is slightly different. With your level of expertise, I am sure you can adapt it easily.

Step 1: Convert the data to time series

Now, in principle you can convert this to zoo or zooreg class right now:

zoo(storeData, order.by=rep(st.dates,26))

But note that each day will have multiple entries corresponding to different stores, which causes problems: a time series should not have multiple entries at the same time point.

Since in our case we won't have more than one entry per day per store, we keep the info for each store separate by splitting the data by store:

storeDataList <- split(storeData,factor(storeData$store))

Now we are ready to convert the data into time series:

storeDataZooList <- lapply(storeDataList,
                           function(storeData) zoo(storeData, order.by=st.dates))

This gives a list of time series data for all the stores.

Step 2: Analysis

Now that you have time series data for all the stores, aggregating is straightforward:

You can aggregate a specific field for any store by quarter or month:

aggregate(storeDataZooList[[1]]$dollars,as.yearqtr)
aggregate(storeDataZooList[[1]]$dollars,as.yearmon)

You can aggregate all the fields for any store by quarter or month:

aggregate(storeDataZooList[[1]],as.yearqtr)
aggregate(storeDataZooList[[1]],as.yearmon)

You can aggregate all the fields for all the stores by quarter or month:

lapply(storeDataZooList,aggregate, as.yearmon)
lapply(storeDataZooList,aggregate, as.yearqtr)
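One caveat: aggregate.zoo uses FUN = mean by default, so pass FUN = sum explicitly if you want monthly or quarterly totals rather than averages. Here is a minimal sketch of the first example question (January 2014 vs January 2013 dollar growth) for a single made-up series; note that January 2014 is only partial in the sample data, which ends on 2014-01-01:

```r
library(zoo)

set.seed(1)
st.dates <- seq(as.Date("2012-01-01"), as.Date("2014-01-01"), by = "day")
z <- zoo(sample(100:100000, length(st.dates), replace = TRUE) / 100,
         order.by = st.dates)

# aggregate.zoo defaults to FUN = mean, so pass sum explicitly for totals
monthly <- aggregate(z, as.yearmon, sum)

# year-over-year comparison: January 2014 (one day only) vs January 2013
jan14 <- coredata(monthly)[index(monthly) == as.yearmon(2014)]
jan13 <- coredata(monthly)[index(monthly) == as.yearmon(2013)]
growth <- jan14 / jan13
```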

The documentation for the package is fairly detailed, and I am sure it will help you tremendously.

Hope this helps!!

EDIT: Note that for simplicity I have kept the storeID as numeric. If you have an alphanumeric storeID, you will have to drop the storeID column at the time series conversion step for aggregation to work:

storeDataZooList <- lapply(storeDataList,
                           function(storeData) zoo(storeData[,-1], order.by=st.dates))
Shambho answered Oct 24 '22


Try the TimeProjection package to extract useful features from the date column and aggregate on those derived features. In other words, stick with data.frame or data.table unless you want to run algorithms that need the ts structure.
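The same idea can be sketched without TimeProjection, using data.table's own date helpers to derive the features (year, quarter) and aggregate on them; a rough sketch with made-up data:

```r
require("data.table")

# made-up data: one row per store per day
set.seed(1)
storeData <- CJ(store = LETTERS,
                date = seq(as.Date("2012-01-01"), as.Date("2014-01-01"), by = "day"))
storeData[, dollars := sample(100:100000, .N, replace = TRUE) / 100]

# derive period features from the date column, then aggregate on them
storeData[, `:=`(yr = year(date), qtr = quarter(date))]
qSummary <- storeData[, .(dollars = sum(dollars)), by = .(store, yr, qtr)]
```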

Karsten W. answered Oct 24 '22