What would you recommend for making the distinction between missing value types for dataset users who might not read the codebook carefully?
In this toy example, q2 is asked only of people who answered "Yes" to q1. As a result, q2 has one missing value because the person did not respond, and two missing values because the question was never asked.
library(tidyverse)
df <- tibble(q1 = c("Yes", "Yes", "No", "No"),
             q2 = c("Yes", NA, NA, NA))
df
# A tibble: 4 x 2
  q1    q2
  <chr> <chr>
1 Yes   Yes
2 Yes   NA
3 No    NA
4 No    NA
df %>% group_by(q1, q2) %>% count()
# A tibble: 3 x 3
# Groups:   q1, q2 [3]
  q1    q2        n
  <chr> <chr> <int>
1 No    NA        2
2 Yes   Yes       1
3 Yes   NA        1
When I summarize by q2 there is no way in the dataset to make a distinction between missingness from non-response vs structural missingness.
df %>% group_by(q2) %>% count()
# A tibble: 2 x 2
# Groups:   q2 [2]
  q2        n
  <chr> <int>
1 Yes       1
2 NA        3
Using NA, Inf, -Inf and NaN, we can represent four categories of numeric missing values. The first two approaches below combine NA with Inf and NA with NaN. The third approach uses the naniar package.
1) Recode the q2 values Yes, No, structurally missing, and missing as 1, 0, Inf, and NA respectively. If you need to test for them, is.na(x) returns TRUE only for an actual NA, is.infinite(x) returns TRUE only for an Inf, and !is.finite(x) returns TRUE for either. Optionally, recode the output back to labels afterwards.
df %>%
  count(q2 = recode(q2, Yes = 1, No = 0, .missing = ifelse(q1 == "No", Inf, NA)))
giving:
# A tibble: 3 x 2
     q2     n
  <dbl> <int>
1     1     1
2   Inf     2
3    NA     1
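As a rough illustration of the tests mentioned in approach 1, here is a minimal base R sketch. It assumes q2 has already been recoded to numbers as above; the label strings "Not asked" and "No response" are hypothetical names chosen for this example, not something from the original data.

```r
# q2 recoded as in approach 1: 1 = Yes, NA = non-response, Inf = not asked
q2_num <- c(1, NA, Inf, Inf)

na_only   <- is.na(q2_num)        # TRUE only for the non-response NA
inf_only  <- is.infinite(q2_num)  # TRUE only for the structural Inf
any_miss  <- !is.finite(q2_num)   # TRUE for either kind of missing value

# Recode back to readable labels for reporting (label names are illustrative)
q2_label <- ifelse(inf_only, "Not asked",
            ifelse(na_only, "No response",
            ifelse(q2_num == 1, "Yes", "No")))

table(q2_label)
```

Keeping the numeric coding for analysis and converting to labels only for display means summaries such as mean(q2_num[is.finite(q2_num)]) can still be computed on the answered cases.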
2) A variation on this is to use NaN in place of Inf. In that case, tests can use is.na(x), is.nan(x), and !is.finite(x). Note that is.na(x) is TRUE for both NA and NaN, whereas is.nan(x) is TRUE only for NaN.
df %>%
  count(q2 = recode(q2, Yes = 1, No = 0, .missing = ifelse(q1 == "No", NaN, NA)))
giving:
# A tibble: 3 x 2
     q2     n
  <dbl> <int>
1     1     1
2    NA     1
3   NaN     2
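Because is.na() treats NaN as missing too, separating the two kinds of missingness in the NaN variant takes one extra step. The following base R sketch shows one way to do it, assuming the same numeric coding as approach 2 (1 = Yes, NA = non-response, NaN = not asked).

```r
# q2 recoded as in approach 2: 1 = Yes, NA = non-response, NaN = not asked
q2_num <- c(1, NA, NaN, NaN)

structural   <- is.nan(q2_num)                    # TRUE only for NaN
any_missing  <- is.na(q2_num)                     # TRUE for NA and for NaN
non_response <- is.na(q2_num) & !is.nan(q2_num)   # TRUE only for a plain NA

# Count each kind of missingness
c(structural = sum(structural), non_response = sum(non_response))
```

One practical advantage of this variant is that functions with an na.rm = TRUE argument drop both kinds of missing values, whereas Inf would be kept and would distort sums and means.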
3) The naniar package can create auxiliary columns that define the type of each NA using bind_shadow. We can then recode the auxiliary columns using recode_shadow and use them in our counting.
library(naniar)
df %>%
  bind_shadow %>%
  recode_shadow(q2 = .where(is.na(q2) & q1 == "No" ~ "struct")) %>%
  count(q2, q2_NA)
giving:
# A tibble: 3 x 3
  q2    q2_NA         n
  <chr> <fct>     <int>
1 Yes   !NA           1
2 <NA>  NA            1
3 <NA>  NA_struct     2