Filling "implied missing values" in a data frame that has varying observations per time unit

I have a large dataset with spatiotemporal data. Each set of coordinates is associated with an id (the player id in a computer game). Unfortunately, the coordinates for each id aren't logged at every time unit. If a reading is not available for a specific id at a given time stamp, that row is omitted from the dataset entirely rather than logged as NA.

I would like to have exactly as many observations per time unit as there are unique ids (i.e. insert the "implied missing" rows). For time units where an id is missing, it should be inserted as a new row with NA as its coordinates.

Here's a dummy dataset to illustrate:

time <- c(10,10,10,10,11,11,11,11,11,11,12,12,12,12,13,13,14,14,14,14,14,14,15,15,15)
id <- c(1,3,4,5,1,2,3,4,5,6,2,4,5,6,3,6,1,2,3,4,5,6,2,4,5)
x <- c(128,128,64,64,124,128,120,68,64,64,122,71,65,64,112,74,116,114,113,73,70,70,111,75,70)
y <- c(128,128,64,66,125,128,124,66,67,64,124,67,71,68,113,68,115,119,113,76,69,77,116,80,82)

spatiodf <- as.data.frame(cbind(time, id, x, y))


   time id   x   y
1    10  1 128 128
2    10  3 128 128
3    10  4  64  64
4    10  5  64  66
5    11  1 124 125
6    11  2 128 128
7    11  3 120 124
8    11  4  68  66
9    11  5  64  67
10   11  6  64  64
11   12  2 122 124
12   12  4  71  67
13   12  5  65  71
14   12  6  64  68
15   13  3 112 113
16   13  6  74  68
17   14  1 116 115
18   14  2 114 119
19   14  3 113 113
20   14  4  73  76
21   14  5  70  69
22   14  6  70  77
23   15  2 111 116
24   15  4  75  80
25   15  5  70  82

From the above, I would like to get to the output below, where the data frame has been recreated with the same number of observations for each time unit (here the NA values were inserted manually for the missing rows).

time <- rep(10:15, each = 6)
id <- rep(1:6, times = 6)
x <- c(128,NA,128,64,64,NA,124,128,120,68,64,64,NA,122,NA,71,65,64,NA,NA,112,NA,NA,74,116,114,113,73,70,70,NA,111,NA,75,70,NA)
y <- c(128,NA,128,64,66,NA,125,128,124,66,67,64,NA,124,NA,67,71,68,NA,NA,113,NA,NA,68,115,119,113,76,69,77,NA,116,NA,80,82,NA)

spatiodf_equal_obs <- as.data.frame(cbind(time, id, x, y))

library(dplyr)
spatiodf_equal_obs %>% 
  arrange(id)

   time id   x   y
1    10  1 128 128
2    11  1 124 125
3    12  1  NA  NA
4    13  1  NA  NA
5    14  1 116 115
6    15  1  NA  NA
7    10  2  NA  NA
8    11  2 128 128
9    12  2 122 124
10   13  2  NA  NA
11   14  2 114 119
12   15  2 111 116
13   10  3 128 128
14   11  3 120 124
15   12  3  NA  NA
16   13  3 112 113
17   14  3 113 113
18   15  3  NA  NA
19   10  4  64  64
20   11  4  68  66
21   12  4  71  67
22   13  4  NA  NA
23   14  4  73  76
24   15  4  75  80
25   10  5  64  66
26   11  5  64  67
27   12  5  65  71
28   13  5  NA  NA
29   14  5  70  69
30   15  5  70  82
31   10  6  NA  NA
32   11  6  64  64
33   12  6  64  68
34   13  6  74  68
35   14  6  70  77
36   15  6  NA  NA

The data needs to be in this format because I want to fill in the NA values with the nearest available previous or following entry from the same id. Once the data frame looks like the output above, that can be done using fill() from tidyr:

library(tidyr)
res <- spatiodf_equal_obs %>%
  group_by(id) %>%
  fill(x, y, .direction = "down") %>%
  fill(x, y, .direction = "up") 

I've tried a lot of combinations of spreading and gathering (and trickery with creating new data frames to merge(df1, df2, all=TRUE)), but I can't figure out how to get from the first data frame to the second one.

The final output should look like this:

   time id   x   y
1    10  1 128 128
2    11  1 124 125
3    12  1 124 125
4    13  1 124 125
5    14  1 116 115
6    15  1 116 115
7    10  2 128 128
8    11  2 128 128
9    12  2 122 124
10   13  2 122 124
11   14  2 114 119
12   15  2 111 116
13   10  3 128 128
14   11  3 120 124
15   12  3 120 124
16   13  3 112 113
17   14  3 113 113
18   15  3 113 113
19   10  4  64  64
20   11  4  68  66
21   12  4  71  67
22   13  4  71  67
23   14  4  73  76
24   15  4  75  80
25   10  5  64  66
26   11  5  64  67
27   12  5  65  71
28   13  5  65  71
29   14  5  70  69
30   15  5  70  82
31   10  6  64  64
32   11  6  64  64
33   12  6  64  68
34   13  6  74  68
35   14  6  70  77
36   15  6  70  77
Lauler asked Feb 10 '17 17:02


1 Answer

To fill in the gaps with values taken from the row nearest in time (within each id), you can use a rolling join:

library(data.table)
setDT(spatiodf)

resDT = spatiodf[
  CJ(id = id, time = min(time):max(time), unique = TRUE), on=.(id, time), roll="nearest"
]

# verify
fsetequal(data.table(res), resDT) # TRUE
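
If the intermediate NA-padded table from the question is also wanted, the same join without roll= should produce it, since id/time combinations with no match in the data come back as NA. A minimal sketch that rebuilds the question's dummy data (the names grid and paddedDT are only illustrative):

```r
library(data.table)

# Dummy data from the question (25 observed rows)
spatiodf <- data.table(
  time = c(10,10,10,10,11,11,11,11,11,11,12,12,12,12,13,13,14,14,14,14,14,14,15,15,15),
  id   = c(1,3,4,5,1,2,3,4,5,6,2,4,5,6,3,6,1,2,3,4,5,6,2,4,5),
  x    = c(128,128,64,64,124,128,120,68,64,64,122,71,65,64,112,74,116,114,113,73,70,70,111,75,70),
  y    = c(128,128,64,66,125,128,124,66,67,64,124,67,71,68,113,68,115,119,113,76,69,77,116,80,82)
)

# Every id/time combination; CJ() returns it sorted by id, then time
grid <- CJ(id = spatiodf$id, time = min(spatiodf$time):max(spatiodf$time), unique = TRUE)

# Plain join (no roll=): combinations absent from spatiodf get NA in x and y
paddedDT <- spatiodf[grid, on = .(id, time)]

nrow(paddedDT)          # 36 rows: 6 ids x 6 time units
sum(is.na(paddedDT$x))  # 11 rows were missing from the original 25
```

From there, the fill() step from the question should be applicable as before.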

How it works

  • setDT converts to a data.table in place, so no <- is needed.

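  • DT[i, on=, roll=] uses each row of i to look up a row in DT; when there is no exact match, the lookup "rolls" to another row of DT. The roll is applied to the last column in on= (here, time), while the other columns (here, id) must match exactly.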

  • CJ(a, b, unique = TRUE) returns all combos of a and b, like expand.grid in base.
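
To make the points above concrete, here is a small sketch on hypothetical toy data (obs, grid, nearest and locf are made-up names) that compares roll = "nearest" with roll = TRUE, which carries the last observation forward:

```r
library(data.table)

# Hypothetical toy data: one id observed at times 1, 2 and 5
obs <- data.table(id = 1L, time = c(1L, 2L, 5L), x = c(10, 20, 50))

# All id/time combinations for times 1..6, similar to expand.grid() but sorted
grid <- CJ(id = obs$id, time = 1:6, unique = TRUE)

# id must match exactly; the roll happens on time, the last column in on=
nearest <- obs[grid, on = .(id, time), roll = "nearest"]
locf    <- obs[grid, on = .(id, time), roll = TRUE]

nearest$x  # 10 20 20 50 50 50  (time 4 is closer to time 5 than to time 2)
locf$x     # 10 20 20 20 50 50  (time 4 takes the previous value)
```

Note that "nearest" can pick an observation that comes after the gap (time 4 here), whereas a down-then-up fill like the one in the question only looks forward when there is no earlier value, so the two approaches can differ on some rows.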

Frank answered Oct 12 '22 16:10