I have an R data processing problem. I have a large set of data in a .csv file that I can load in using readr. The problem is that there are missing values that I need to add to the dataframe.
Example data:
X1 X2 Value
Coal 1 300
Coal 3 100
Coal 5 150
NatG 2 175
NatG 4 150
This file will load in correctly. However, I want to add processing so that the final dataframe contains an entry for every X1 type and the entire sequence 1:5 in X2, with the missing values set equal to zero. The final dataframe would look like this:
X1 X2 Value
Coal 1 300
Coal 2 0
Coal 3 100
Coal 4 0
Coal 5 150
NatG 1 0
NatG 2 175
NatG 3 0
NatG 4 150
NatG 5 0
I tried using readr to load in a file that had all entries equal to zero and then read in the real data to the same data frame, but it was a destructive overwrite and the missing rows were removed. I need to know how to add appropriate rows to the dataframe without knowing which values in the 1:5 sequence will be missing for each value under X1.
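We can use the complete function from the tidyr package. The complete function creates every combination of the values in the columns provided, adding rows for combinations that are absent from the data. The fill argument specifies the value to use in those new rows for each column.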
library(tidyr)
dt2 <- dt %>%
complete(X1, X2, fill = list(Value = 0))
dt2
# # A tibble: 10 x 3
# X1 X2 Value
# <chr> <int> <dbl>
# 1 Coal 1 300
# 2 Coal 2 0
# 3 Coal 3 100
# 4 Coal 4 0
# 5 Coal 5 150
# 6 NatG 1 0
# 7 NatG 2 175
# 8 NatG 3 0
# 9 NatG 4 150
# 10 NatG 5 0
We can also specify the values to use for a column explicitly, in case some of them never appear in the data. For example, we can do the following to make sure every value from 1 to 5 in column X2 is considered.
dt2 <- dt %>%
complete(X1, X2 = 1:5, fill = list(Value = 0))
Or use full_seq, which will automatically create the sequence based on the minimum and maximum values in a column.
dt2 <- dt %>%
complete(X1, X2 = full_seq(X2, period = 1), fill = list(Value = 0))
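One caveat worth illustrating: because full_seq spans only the observed minimum and maximum, it will not produce 1 or 5 if no row in the data contains them. The sketch below uses a made-up data frame (dt_narrow is hypothetical, not part of the question) to compare the two approaches.

```r
library(tidyr)

# Hypothetical data in which no row has X2 equal to 1 or 5.
dt_narrow <- data.frame(
  X1 = c("Coal", "Coal", "NatG"),
  X2 = c(2L, 4L, 3L),
  Value = c(300, 100, 175),
  stringsAsFactors = FALSE
)

# full_seq() spans only the observed range of X2 (here 2 to 4).
observed_range <- full_seq(dt_narrow$X2, period = 1)

# Completing with full_seq() therefore yields 3 rows per X1 value ...
by_range <- complete(dt_narrow, X1, X2 = full_seq(X2, period = 1),
                     fill = list(Value = 0))

# ... while an explicit 1:5 yields the full 5 rows per X1 value.
by_explicit <- complete(dt_narrow, X1, X2 = 1:5, fill = list(Value = 0))
```

If the required range is known in advance, as it is in the question, the explicit X2 = 1:5 form is probably the safer choice.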
DATA
dt <- read.table(text = "X1 X2 Value
Coal 1 300
Coal 3 100
Coal 5 150
NatG 2 175
NatG 4 150",
header = TRUE, stringsAsFactors = FALSE)
Great solution from @www with tidyr. This is a way to do it in base R.
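# Build every X1/X2 combination, then join the observed data onto it.
levels <- 1:5
type <- c("Coal", "NatG")
df2 <- expand.grid(levels, type)
# all.y = TRUE keeps every combination from df2, even those missing in dt.
df3 <- merge(dt, df2, by.x = c("X1", "X2"), by.y = c("Var2", "Var1"), all.y = TRUE)
# Combinations absent from dt have NA in Value; replace them with zero.
df3[is.na(df3)] <- 0
df3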
X1 X2 Value
1 Coal 1 300
2 Coal 2 0
3 Coal 3 100
4 Coal 4 0
5 Coal 5 150
6 NatG 1 0
7 NatG 2 175
8 NatG 3 0
9 NatG 4 150
10 NatG 5 0
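If the set of X1 types is not known ahead of time, the same base R idea can take them from the data instead of hard-coding them. This is a minimal sketch under that assumption; the names grid and filled are illustrative only, and dt is rebuilt here so the block runs on its own.

```r
# Sample data from the question.
dt <- read.table(text = "X1 X2 Value
Coal 1 300
Coal 3 100
Coal 5 150
NatG 2 175
NatG 4 150",
header = TRUE, stringsAsFactors = FALSE)

# Every observed X1 type crossed with the required sequence 1:5.
grid <- expand.grid(X1 = unique(dt$X1), X2 = 1:5, stringsAsFactors = FALSE)

# Keep every row of the grid; unmatched combinations get NA in Value.
filled <- merge(grid, dt, by = c("X1", "X2"), all.x = TRUE)

# Replace the NAs introduced by the join with zero.
filled$Value[is.na(filled$Value)] <- 0

# Order explicitly by type and sequence position.
filled <- filled[order(filled$X1, filled$X2), ]
```

Naming the grid columns X1 and X2 up front avoids the Var1/Var2 remapping, and restricting the NA replacement to Value leaves the other columns untouched.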