Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Find and add missing (/non existing) rows in time related data frame

Tags:

r

I'm struggling with the following.

If have a (big) data frame with the following:

  • several columns for which the combination of columns is a 'unique' combination, say ID
  • a time related column
  • a measure related column

I want to make sure that for each unique ID for each time interval a measure is available in the data frame. And if it is not, I want to add a 0 (or NA) measure for that time/ID.

To illustrate the problem, create the following test data frame:

test <- data.frame(
    YearWeek   =rep(c("2012-01","2012-02"),each=4),
    ProductID  =rep(c(1,2), times=4),
    CustomerID =rep(c("a","b"), each=2, times=2),
    Quantity   =5:12
)[1:7,]

  YearWeek ProductID CustomerID Quantity
1  2012-01         1          a        5
2  2012-01         2          a        6
3  2012-01         1          b        7
4  2012-01         2          b        8
5  2012-02         1          a        9
6  2012-02         2          a       10
7  2012-02         1          b       11

The 8th row is left out, on purpose. This way I simulate a 'missing value' (missing Quantity) for ID '2-b' (ProductID-CustomerID) for the time value "2012-02".

What I want to do is adjust the data.frame in such a way that for all time values (these are known, in this example just "2012-01" and "2012-02"), for all ID-combinations (these are not known upfront, but this is 'all unique ID combinations in the data frame', thus the unique set on the ID columns), a Quantity is available in the data frame.

This should result for this example (if we choose NA for the missing value, typically I want to have control on that):

  YearWeek ProductID CustomerID Quantity
1  2012-01         1          a        5
2  2012-01         2          a        6
3  2012-01         1          b        7
4  2012-01         2          b        8
5  2012-02         1          a        9
6  2012-02         2          a       10
7  2012-02         1          b       11
8  2012-02         2          b       NA

The ultimate goal is to create time series for these ID combinations and I therefore want to have Quantities for all time values. I need to do different aggregations (on time) and using different levels of ID's from a big dataset

I tried several things, for instance with melt and cast from the reshape package. But so far I didn't manage to do it. The next step is creating a function, with for-loops etc. but that is not really useful from a performance perspective.

Maybe there is an easier way to create time series instantly, giving a data.frame like test. Does anybody have an idea on this one??

Thanks in advance!

Note that in the actual problem there are more than two 'ID columns'.


EDIT:

I should describe the problem further. There is a difference between the 'time' column and the 'ID' columns. The first (and great!) answer on the question by joran, maybe didn't get a clear understanding from what I want (and the example I gave didn't made the difference clear). I said above:

for all ID-combinations (these are not known upfront, but this is 'all unique ID combinations in the data frame', thus the unique set on the ID columns)

So I do not want 'all possible ID combinations' but 'all ID combinations within the data'. For each of those combinations I want a value for every unique time-value.

Let me make it clear by expanding test to test2, as follows

> test2 <- rbind(test, c("2012-02", 3, "a", 13))
> test2
  YearWeek ProductID CustomerID Quantity
1  2012-01         1          a        5
2  2012-01         2          a        6
3  2012-01         1          b        7
4  2012-01         2          b        8
5  2012-02         1          a        9
6  2012-02         2          a       10
7  2012-02         1          b       11
8  2012-02         3          a       13

Which means I want in the resulting data frame no '3-b' ID combination, because this combination is not within test2. If I use the method of the first answer I will get the following:

> vals2 <- expand.grid(YearWeek = unique(test2$YearWeek),
                       ProductID = unique(test2$ProductID),
                       CustomerID = unique(test2$CustomerID))

> merge(vals2,test2,all = TRUE)
   YearWeek ProductID CustomerID Quantity
1   2012-01         1          a        5
2   2012-01         1          b        7
3   2012-01         2          a        6
4   2012-01         2          b        8
5   2012-01         3          a     <NA>
6   2012-01         3          b     <NA>
7   2012-02         1          a        9
8   2012-02         1          b       11
9   2012-02         2          a       10
10  2012-02         2          b     <NA>
11  2012-02         3          a       13
12  2012-02         3          b     <NA>

So I don't want the rows 6 and 12 to be here.

To overcome this problem I found a solution in the one below. In here I split the 'unique time column' and the 'unique ID combination'. The difference with above is thus the word 'combination' and not unique for every ID column.

> temp_merge <- merge(unique(test2["YearWeek"]),
                      unique(test2[c("ProductID", "CustomerID")]))

> merge(temp_merge,test2,all = TRUE)
   YearWeek ProductID CustomerID Quantity
1   2012-01         1          a        5
2   2012-01         1          b        7
3   2012-01         2          a        6
4   2012-01         2          b        8
5   2012-01         3          a     <NA>
6   2012-02         1          a        9
7   2012-02         1          b       11
8   2012-02         2          a       10
9   2012-02         2          b     <NA>
10  2012-02         3          a       13

What are the comments on this one?

Is this an elegant way, or are there better ways?

like image 568
FBE Avatar asked Apr 03 '12 15:04

FBE


1 Answers

Use expand.grid and merge:

vals <- expand.grid(YearWeek = unique(test$YearWeek),
                    ProductID = unique(test$ProductID),
                    CustomerID = unique(test$CustomerID))
> merge(vals,test,all = TRUE)
  YearWeek ProductID CustomerID Quantity
1  2012-01         1          a        5
2  2012-01         1          b        7
3  2012-01         2          a        6
4  2012-01         2          b        8
5  2012-02         1          a        9
6  2012-02         1          b       11
7  2012-02         2          a       10
8  2012-02         2          b       NA

The NAs can be replaced after the fact with whatever values you choose using subsetting and is.na.

like image 149
joran Avatar answered Sep 29 '22 01:09

joran