Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to fill NA's when you have enough information to calculate what the NA should be in R

I've got a dataset where there are some NA's but I can manually work out what the values should be as the df is a column for name and the rest of the columns are just numbers followed by a final column with total. Only one NA appears per row at most so I can work out what the value should be by using the total column and the sum of all the other columns. Just wondering what would be the best way to fill these NA's without having to hardcode one by one as the df I'm using is pretty big

example df:

df = structure(list(city = c("sydney", "new york", "london", "beijing", "paris", "madrid"), 
                    year = c(2005:2010), 
                    A = c(1, 4, 5 , NA, 2, 1), 
                    B = c(3, NA, 4 , 9, 0, 6),
                    C = c(3, 4 , 6, 1, 8, NA),
                    total = c(NA, 10, 15, 14, NA, 15)), 
               class = "data.frame", row.names = c(NA, -6L))

df
like image 984
P_S_13 Avatar asked Oct 24 '25 01:10

P_S_13


2 Answers

This solution can also help you.

library(purrr)
library(dplyr)

df %>%
  rowwise() %>%
  mutate(total = ifelse(is.na(total), sum(c_across(A:C)), total), 
         pmap_df(select(cur_data(), A:total), ~ {x <- c(...)[1:3]
         replace(x, is.na(x), c(...)[4] - sum(x, na.rm = TRUE))}))

# A tibble: 6 x 6
# Rowwise: 
  city      year     A     B     C total
  <chr>    <int> <dbl> <dbl> <dbl> <dbl>
1 sydney    2005     1     3     3     7
2 new york  2006     4     2     4    10
3 london    2007     5     4     6    15
4 beijing   2008     4     9     1    14
5 paris     2009     2     0     8    10
6 madrid    2010     1     6     8    15

A bit hardcoded but can be modified in that sense.

like image 80
Anoushiravan R Avatar answered Oct 26 '25 14:10

Anoushiravan R


You need to replace the NA in total column first and then you can simply calculate the rest. You can also make a function for the A, B, C columns so you don't repeat the code, but with only 3 columns that shouldn't be the problem.

df = structure(list(city = c("sydney", "new york", "london", "beijing", "paris", "madrid"), 
                    year = c(2005:2010), 
                    A = c(1, 4, 5 , NA, 2, 1), 
                    B = c(3, NA, 4 , 9, 0, 6),
                    C = c(3, 4 , 6, 1, 8, NA),
                    total = c(NA, 10, 15, 14, NA, 15)), 
               class = "data.frame", row.names = c(NA, -6L))

df
#>       city year  A  B  C total
#> 1   sydney 2005  1  3  3    NA
#> 2 new york 2006  4 NA  4    10
#> 3   london 2007  5  4  6    15
#> 4  beijing 2008 NA  9  1    14
#> 5    paris 2009  2  0  8    NA
#> 6   madrid 2010  1  6 NA    15

df$total <- ifelse(is.na(df$total), rowSums(df[, c("A", "B", "C")]), df$total)
df$A <- ifelse(is.na(df$A), df$total - rowSums(df[, c("A", "B", "C")], na.rm = TRUE), df$A)
df$B <- ifelse(is.na(df$B), df$total - rowSums(df[, c("A", "B", "C")], na.rm = TRUE), df$B)
df$C <- ifelse(is.na(df$C), df$total - rowSums(df[, c("A", "B", "C")], na.rm = TRUE), df$C)

df
#>       city year A B C total
#> 1   sydney 2005 1 3 3     7
#> 2 new york 2006 4 2 4    10
#> 3   london 2007 5 4 6    15
#> 4  beijing 2008 4 9 1    14
#> 5    paris 2009 2 0 8    10
#> 6   madrid 2010 1 6 8    15

Created on 2022-02-09 by the reprex package (v2.0.1)

UPDATE: After replacing NA in total column, you can use the na.aprox function from zoo package to interpolate the rest of values.

library(zoo)

df$total <- ifelse(is.na(df$total), rowSums(df[, c("A", "B", "C")]), df$total)   # first totals
df[, c("A", "B", "C")] <- na.approx(df[, c("A", "B", "C", "total")], rule = 2)   # then rest
df
      city year   A   B C total
1   sydney 2005 1.0 3.0 3     7
2 new york 2006 4.0 3.5 4    10
3   london 2007 5.0 4.0 6    15
4  beijing 2008 3.5 9.0 1    14
5    paris 2009 2.0 0.0 8    10
6   madrid 2010 1.0 6.0 8    15
like image 43
Claudiu Papasteri Avatar answered Oct 26 '25 13:10

Claudiu Papasteri