
Making the distinction between missing value types (non-response vs skip patterns)

Tags: dataframe, r

What would you recommend for making the distinction between missing value types for dataset users who might not read the codebook carefully?

In this toy example, q2 is asked only of people who said "Yes" to q1. This means q2 has one value that is missing because the person did not respond, and two values that are missing because the question was not asked.

library(tidyverse)
df <- tibble(q1 = c("Yes", "Yes", "No", "No"), 
             q2 = c("Yes", NA, NA, NA))

df

# A tibble: 4 x 2
  q1    q2   
  <chr> <chr>
1 Yes   Yes  
2 Yes   NA   
3 No    NA   
4 No    NA 

df %>% group_by(q1, q2) %>% count()

# A tibble: 3 x 3
# Groups:   q1, q2 [3]
  q1    q2        n
  <chr> <chr> <int>
1 No    NA        2
2 Yes   Yes       1
3 Yes   NA        1

When I summarize by q2, nothing in the dataset distinguishes missingness due to non-response from structural missingness.

df %>% group_by(q2) %>% count()

# A tibble: 2 x 2
# Groups:   q2 [2]
  q2        n
  <chr> <int>
1 Yes       1
2 NA        3

Eric Green asked Feb 10 '26 18:02


1 Answer

Using NA, Inf, -Inf and NaN we can represent up to four categories of missing values in a numeric column. Below we show NA combined with Inf and then NA combined with NaN. The third approach uses the naniar package.

1) Recode q2 values of Yes, No, structural missing and missing to 1, 0, Inf and NA respectively. For tests, note that is.na(x) returns TRUE only for an actual NA, is.infinite(x) returns TRUE only for Inf, and !is.finite(x) returns TRUE for either NA or Inf. Optionally recode the output back to labels afterwards.

df %>% 
  count(q2 = recode(q2, Yes = 1, No = 0, .missing = ifelse(q1 == "No", Inf, NA)))

giving:

# A tibble: 3 x 2
     q2     n
  <dbl> <int>
1     1     1
2   Inf     2
3    NA     1
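
The predicates mentioned in approach 1 can be checked on a small vector, and the sentinel codes can be turned back into labels for reporting. This is a minimal base R sketch, not part of the original answer: the vector `x` mirrors the recoded q2 column, and the label strings are illustrative assumptions.

```r
# Codes as in approach 1: 1 = Yes, 0 = No, Inf = not asked (structural), NA = no response
x <- c(1, NA, Inf, Inf)

is.na(x)        # TRUE only for the real NA
is.infinite(x)  # TRUE only for the Inf sentinel
!is.finite(x)   # TRUE for either kind of missing value

# Recode back to readable labels (the label text is hypothetical)
labels <- ifelse(is.infinite(x), "Not asked",
          ifelse(is.na(x), "No response",
          ifelse(x == 1, "Yes", "No")))
table(labels)
```

Keep in mind that Inf is an ordinary numeric value, so a summary such as mean(x, na.rm = TRUE) returns Inf unless the sentinel is filtered out first, for example with x[is.finite(x)].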

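2) A variation on this is to use NaN in place of Inf. Note that is.na(x) returns TRUE for both NA and NaN, is.nan(x) returns TRUE only for NaN, and !is.finite(x) returns TRUE for either, so use is.na(x) & !is.nan(x) to test for an actual NA.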

df %>% 
  count(q2 = recode(q2, Yes = 1, No = 0, .missing = ifelse(q1 == "No", NaN, NA)))

giving:

# A tibble: 3 x 2
     q2     n
  <dbl> <int>
1     1     1
2    NA     1
3   NaN     2
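
Because is.na() is TRUE for NaN as well as NA, separating the two categories in approach 2 takes a little care. The following base R sketch shows one way to do it; the vector `x` is an illustrative stand-in for the recoded q2 column and is not part of the original answer.

```r
# Codes as in approach 2: 1 = Yes, NaN = not asked (structural), NA = no response
x <- c(1, NA, NaN, NaN)

is.na(x)               # TRUE for NA and for NaN
is.nan(x)              # TRUE only for NaN (structural missing here)
is.na(x) & !is.nan(x)  # TRUE only for the real NA (non-response)

# na.rm = TRUE drops both NA and NaN, so no extra filtering is needed
mean(x, na.rm = TRUE)
```

One practical advantage over Inf is that functions with an na.rm argument ignore both kinds of missing value without further work.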

3) The naniar package can create auxiliary columns that record the type of each NA using bind_shadow. We can then recode those auxiliary columns using recode_shadow and use them in our counting.

library(naniar)
df %>%
  bind_shadow() %>%
  recode_shadow(q2 = .where(is.na(q2) & q1 == "No" ~ "struct")) %>%
  count(q2, q2_NA)

giving:

# A tibble: 3 x 3
  q2    q2_NA         n
  <chr> <fct>     <int>
1 Yes   !NA           1
2 <NA>  NA            1
3 <NA>  NA_struct     2
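
The idea behind the auxiliary column can also be sketched in base R without any package. This is only an analogue of the approach above, under the assumption that a plain character column is acceptable; `q2_status` and its labels are hypothetical names chosen for illustration and are not something naniar creates.

```r
# Same toy data as in the question, built with base R
df <- data.frame(q1 = c("Yes", "Yes", "No", "No"),
                 q2 = c("Yes", NA, NA, NA),
                 stringsAsFactors = FALSE)

# q2_status is a hypothetical helper column that records why q2 is missing
df$q2_status <- ifelse(!is.na(df$q2), "answered",
                ifelse(df$q1 == "No", "not asked", "no response"))

# Count each kind of response or missingness
table(df$q2_status)
```

Keeping the reason in a separate column leaves q2 itself untouched, so users who ignore the helper column still see ordinary NA values.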

G. Grothendieck answered Feb 12 '26 14:02


