
Fill in sequential values in a dataframe

Tags: dataframe, r

I have an R data processing problem. I have a large set of data in a .csv file that I can load in using readr. The problem is that there are missing values that I need to add to the dataframe.

Example data:

X1    X2    Value
Coal  1     300
Coal  3     100
Coal  5     150
NatG  2     175
NatG  4     150

This file will load in correctly. However, I want to add processing so that the final dataframe contains an entry for every X1 type and the entire sequence 1:5 in X2, with the missing values set equal to zero. The final dataframe would look like this:

X1    X2    Value
Coal  1     300
Coal  2     0
Coal  3     100
Coal  4     0
Coal  5     150
NatG  1     0
NatG  2     175
NatG  3     0
NatG  4     150
NatG  5     0

I tried using readr to load in a file that had all entries equal to zero and then read in the real data to the same data frame, but it was a destructive overwrite and the missing rows were removed. I need to know how to add appropriate rows to the dataframe without knowing which values in the 1:5 sequence will be missing for each value under X1.

asked Dec 23 '17 22:12 by doriengard


2 Answers

We can use the complete function from the tidyr package. It expands the data frame to include every combination of the values in the columns provided. The fill argument specifies the value to use, instead of NA, in the rows that are created.

library(tidyr)

dt2 <- dt %>%
  complete(X1, X2, fill = list(Value = 0))
dt2
# # A tibble: 10 x 3
#       X1    X2 Value
#    <chr> <int> <dbl>
#  1  Coal     1   300
#  2  Coal     2     0
#  3  Coal     3   100
#  4  Coal     4     0
#  5  Coal     5   150
#  6  NatG     1     0
#  7  NatG     2   175
#  8  NatG     3     0
#  9  NatG     4   150
# 10  NatG     5     0
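For intuition, complete() can be thought of as roughly expand() followed by a join and replace_na(). The following is a minimal sketch of that idea, not necessarily how tidyr implements it internally. It assumes dplyr is also installed; dt is re-created so the snippet runs on its own, and dt_manual is just an illustrative name.

```r
library(tidyr)
library(dplyr)

# Same example data as in the question
dt <- data.frame(
  X1 = c("Coal", "Coal", "Coal", "NatG", "NatG"),
  X2 = c(1L, 3L, 5L, 2L, 4L),
  Value = c(300, 100, 150, 175, 150),
  stringsAsFactors = FALSE
)

dt_manual <- dt %>%
  expand(X1, X2) %>%                      # every X1 x X2 combination seen in the data
  left_join(dt, by = c("X1", "X2")) %>%   # bring Value back; new rows get NA
  replace_na(list(Value = 0))             # turn those NAs into zeros
```

Spelling the steps out this way can be handy when the join needs extra handling, but for the case in the question complete() is shorter.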

If some values never appear in the data at all, we can supply the values to combine explicitly. For example, the following makes sure that every value from 1 to 5 in column X2 is included.

dt2 <- dt %>%
  complete(X1, X2 = 1:5, fill = list(Value = 0))

Or use full_seq, which will automatically create the sequence based on the minimum and maximum values in a column.

dt2 <- dt %>%
  complete(X1, X2 = full_seq(X2, period = 1), fill = list(Value = 0))
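One thing to keep in mind with full_seq is that it only spans the observed range, so it would not add 1 or 5 if no row in the data contained them. A small illustration, using made-up input values (observed_only is an illustrative name):

```r
library(tidyr)

# The observed values run from 2 to 4, so 1 and 5 are not generated
observed_only <- full_seq(c(2, 4), period = 1)
print(observed_only)
```

If the endpoints must be present regardless of the data, passing the range explicitly (X2 = 1:5) is the safer choice.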

DATA

dt <- read.table(text = "X1    X2    Value
Coal  1     300
Coal  3     100
Coal  5     150
NatG  2     175
NatG  4     150",
                 header = TRUE, stringsAsFactors = FALSE)
answered Nov 20 '22 09:11 by www


Great solution from @www with tidyr. This is a way to do it in base R.

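# All combinations of the X2 levels and the X1 types
levels <- 1:5
type <- c("Coal", "NatG")
df2 <- expand.grid(levels, type, stringsAsFactors = FALSE)

# all.y = TRUE keeps every row of df2; unmatched combinations get NA in Value
df3 <- merge(dt, df2, by.x = c("X1", "X2"), by.y = c("Var2", "Var1"), all.y = TRUE)

# Replace the missing values with zero
df3[is.na(df3)] <- 0
df3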

     X1 X2 Value
1  Coal  1   300
2  Coal  2     0
3  Coal  3   100
4  Coal  4     0
5  Coal  5   150
6  NatG  1     0
7  NatG  2   175
8  NatG  3     0
9  NatG  4   150
10 NatG  5     0
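The types do not have to be typed in by hand. As a variation on the code above, the grid can be built from the values of X1 that are present in the data. This is a minimal sketch; dt is re-created so it runs on its own, and grid and filled are illustrative names.

```r
# Same example data as in the question
dt <- data.frame(
  X1 = c("Coal", "Coal", "Coal", "NatG", "NatG"),
  X2 = c(1L, 3L, 5L, 2L, 4L),
  Value = c(300, 100, 150, 175, 150),
  stringsAsFactors = FALSE
)

# Build the grid from the types actually present, crossed with the full 1:5 range
grid <- expand.grid(X2 = 1:5, X1 = unique(dt$X1), stringsAsFactors = FALSE)

# all.y = TRUE keeps every grid row; combinations absent from dt get NA in Value
filled <- merge(dt, grid, by = c("X1", "X2"), all.y = TRUE)

# Only the Value column can contain NA here, so target it directly
filled$Value[is.na(filled$Value)] <- 0
```

Naming the grid columns X1 and X2 up front means merge can use a single by argument instead of by.x and by.y.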
answered Nov 20 '22 10:11 by Mako212