Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way for filling-in missing dates for data.table

I am loading a data.table from CSV file that has date, orders, amount etc. fields.

The input file occasionally does not have data for all dates. For example, as shown below:

> NADayWiseOrders
           date orders  amount guests
  1: 2013-01-01     50 2272.55    149
  2: 2013-01-02      3   64.04      4
  3: 2013-01-04      1   18.81      0
  4: 2013-01-05      2   77.62      0
  5: 2013-01-07      2   35.82      2

In the above 03-Jan and 06-Jan do not have any entries.

Would like to fill the missing entries with default values (say, zero for orders, amount etc.), or carry the last vaue forward (e.g, 03-Jan will reuse 02-Jan values and 06-Jan will reuse the 05-Jan values etc..)

What is the best/optimal way to fill-in such gaps of missing dates data with such default values?

The answer here suggests using allow.cartesian = TRUE, and expand.grid for missing weekdays - it may work for weekdays (since they are just 7 weekdays) - but not sure if that would be the right way to go about dates as well, especially if we are dealing with multi-year data.

like image 221
Gopalakrishna Palem Avatar asked Apr 09 '14 08:04

Gopalakrishna Palem


People also ask

How do you handle missing date values?

- Missing Data: The missing data can be handled in multiple ways such as: Ignoring the data, filling the data with some constant value, filling the data with a corresponding measure of central tendency like mean/ median.


2 Answers

The idiomatic data.table way (using rolling joins) is this:

setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"), 
                   to = as.Date("2013-01-07"), 
                   by = "days")

NADayWiseOrders[J(all_dates), roll=Inf]
         date orders  amount guests
1: 2013-01-01     50 2272.55    149
2: 2013-01-02      3   64.04      4
3: 2013-01-03      3   64.04      4
4: 2013-01-04      1   18.81      0
5: 2013-01-05      2   77.62      0
6: 2013-01-06      2   77.62      0
7: 2013-01-07      2   35.82      2
like image 126
Arun Avatar answered Oct 15 '22 04:10

Arun


Not sure if it's the fastest, but it'll work if there are no NAs in the data:

# just in case these aren't Dates. 
NADayWiseOrders$date <- as.Date(NADayWiseOrders$date)
# all desired dates.
alldates <- data.table(date=seq.Date(min(NADayWiseOrders$date), max(NADayWiseOrders$date), by="day"))
# merge
dt <- merge(NADayWiseOrders, alldates, by="date", all=TRUE)
# now carry forward last observation (alternatively, set NA's to 0)
require(xts)
na.locf(dt)
like image 22
shadow Avatar answered Oct 15 '22 05:10

shadow