
Fill in missing rows with R data.table

I have a data.table in R that was fetched from a database that looks like this:

date,identifier,description,location,value1,value2
2014-03-01,1,foo,1,100,200
2014-03-01,1,foo,2,200,300
2014-04-01,1,foo,1,100,200
2014-04-01,1,foo,2,100,200
2014-05-01,1,foo,1,100,200
2014-05-01,1,foo,2,100,200
2014-03-01,2,bar,1,100,200
2014-04-01,2,bar,1,100,200
2014-05-01,2,bar,1,100,200
2014-03-01,3,baz,1,100,200
2014-03-01,3,baz,2,200,300
2014-04-01,3,baz,1,100,200
2014-04-01,3,baz,2,100,200
2014-05-01,3,baz,1,100,200
2014-05-01,3,baz,2,100,200
2014-05-01,4,quux,2,100,200
<SNIP>

In order to do some calculations on the data, I'd like to massage it so that every combination of date, identifier, description and location has a row in the table, with NA for value1 and value2 wherever no data exists. I know the range of date and all potential values for location.

I'm new to both R and data.table and my mind is mush at this point. The result I'd like to come up with for the above sample table is:

date,identifier,description,location,value1,value2
2014-03-01,1,foo,1,100,200
2014-03-01,1,foo,2,200,300
2014-04-01,1,foo,1,100,200
2014-04-01,1,foo,2,100,200
2014-05-01,1,foo,1,100,200
2014-05-01,1,foo,2,100,200
2014-03-01,2,bar,1,100,200
2014-03-01,2,bar,2,NA,NA
2014-04-01,2,bar,1,100,200
2014-04-01,2,bar,2,NA,NA
2014-05-01,2,bar,1,100,200
2014-05-01,2,bar,2,NA,NA
2014-03-01,3,baz,1,100,200
2014-03-01,3,baz,2,200,300
2014-04-01,3,baz,1,100,200
2014-04-01,3,baz,2,100,200
2014-05-01,3,baz,1,100,200
2014-05-01,3,baz,2,100,200
2014-03-01,4,quux,1,NA,NA
2014-03-01,4,quux,2,NA,NA
2014-04-01,4,quux,1,NA,NA
2014-04-01,4,quux,2,NA,NA
2014-05-01,4,quux,1,NA,NA
2014-05-01,4,quux,2,100,200

The data in the database is sparse: a given identifier/description/location combination could have any number of entries, or none at all, for each date. What I want is that, for a given date range (e.g., 2014-03-01 through 2014-05-01), each identifier/description and location has a row in the table.

This seems like something there'd be an interesting data.table trick to do, but I'm blanking.

Edit: I did this on a smaller scale for one identifier/description by merging in another data.table, but I'm not sure how to do this with the added complexity of multiple identifier/descriptions and locations.

Thanks very much for your responses.

Here is dput output of the original data that can be readily copied into R:

structure(list(date = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 2L, 3L, 1L, 1L, 2L, 2L, 3L, 3L, 3L), 
.Label = c("2014-03-01", "2014-04-01", "2014-05-01"), class = "factor"), 
identifier = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L),     
description = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 4L), 
.Label = c("bar", "baz", "foo", "quux"), class = "factor"), 
location = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L), 
value1 = c(100L, 200L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 200L, 100L, 100L, 100L, 100L, 100L), 
value2 = c(200L, 300L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 200L, 300L, 200L, 200L, 200L, 200L, 200L)), 
.Names = c("date", "identifier", "description", "location", "value1", "value2"), 
row.names = c(NA, -16L),
class = c("data.table", "data.frame"))
Jerry Smithwell asked May 13 '15 16:05




2 Answers

With help from @akrun and @eddi, here's the idiomatic (?) way:

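# columns whose unique values are crossed; identifier is filled in separately afterwards
mycols <- c("description", "date", "location")
setkeyv(DT0, mycols)
# CJ() builds every combination of the unique values; joining on the key adds the missing rows with NA values
DT1 <- DT0[J(do.call(CJ, lapply(mycols, function(x) unique(get(x)))))]
# alternately: DT1 <- DT0[DT0[, do.call(CJ, lapply(.SD, unique)), .SDcols = mycols]]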

The identifier column is missing for the new rows, but can be filled:

setkey(DT1,description)
DT1[unique(DT0[,c("description","identifier")]),identifier:=i.identifier]
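
Since the question says the date range and the possible locations are known up front, the grid could also be built from those known values rather than from whatever happens to appear in the data. The following is only a sketch under assumptions: the small DT0 is a hand-typed subset of the question's rows with date stored as Date, the names dates, locations, grid and lookup are made up for illustration, and the on= argument needs data.table 1.9.6 or later.

```r
library(data.table)

# Hand-typed subset of the question's data (an assumption, not the full table)
DT0 <- data.table(
  date        = as.Date(c("2014-03-01", "2014-04-01", "2014-05-01", "2014-05-01")),
  identifier  = c(2L, 2L, 2L, 4L),
  description = c("bar", "bar", "bar", "quux"),
  location    = c(1L, 1L, 1L, 2L),
  value1      = 100L,
  value2      = 200L
)

# Known ranges, independent of which values are present in DT0
dates     <- seq(as.Date("2014-03-01"), as.Date("2014-05-01"), by = "month")
locations <- 1:2

# CJ() is data.table's cross join: one row per identifier/date/location combination
grid <- CJ(identifier = unique(DT0$identifier), date = dates, location = locations)

# Join DT0 onto the grid; grid rows without a match get NA in the other columns
DT1 <- DT0[grid, on = c("identifier", "date", "location")]

# Fill description for the new rows from the identifier -> description pairs
lookup <- unique(DT0[, .(identifier, description)])
DT1[lookup, on = "identifier", description := i.description]
```

Crossing identifier (rather than description) keeps the grid tied to identifier/description pairs that actually exist, and a date that is in the known range but absent from the data would still get its rows.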
Frank answered Oct 16 '22 16:10


If I understand the question rightly - and using just base R, not any special data.table:

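# The fields for whose every combination we require a row
unique.fields <- c("date", "identifier", "description", "location")
# foo is assumed to be a plain data.frame; simplify=FALSE keeps a named list even if every field has equally many unique values
filler <- expand.grid(sapply(unique.fields, function(f) unique(foo[,f]), simplify=FALSE))
# note: identifier and description are crossed independently here, so pairs that never occur together also get rows
merge(filler, foo, by=unique.fields, all.x=TRUE)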
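
If each identifier should stay paired with its own description, as in the desired output in the question, one possible base R variation is to cross the observed identifier/description pairs with the dates and locations instead of crossing all four columns. This is a sketch only: foo here is a hand-typed subset of the question's data as a plain data.frame, and the names ids, filler and filled are illustrative.

```r
# Hand-typed subset of the question's data (an assumption, not the full table)
foo <- data.frame(
  date        = c("2014-03-01", "2014-04-01", "2014-05-01", "2014-05-01"),
  identifier  = c(2L, 2L, 2L, 4L),
  description = c("bar", "bar", "bar", "quux"),
  location    = c(1L, 1L, 1L, 2L),
  value1      = 100L,
  value2      = 200L,
  stringsAsFactors = FALSE
)

# identifier/description pairs that actually occur in the data
ids <- unique(foo[, c("identifier", "description")])

# merge() on data.frames with no shared column names returns the Cartesian product
filler <- merge(ids, data.frame(date = unique(foo$date), stringsAsFactors = FALSE))
filler <- merge(filler, data.frame(location = sort(unique(foo$location))))

# Left join: combinations absent from foo get NA in value1 and value2
filled <- merge(filler, foo,
                by = c("date", "identifier", "description", "location"),
                all.x = TRUE)
```

With two identifiers, three dates and two locations this gives 12 rows, of which the 4 original rows keep their values.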
Reign of Error answered Oct 16 '22 15:10