I'd like to specify different types of missing values. My data contain several kinds of missing responses, and I am trying to code these values as missing in R, but I am looking for a solution where I can still distinguish between them.
Say I have some data that looks like this,
set.seed(667)
df <- data.frame(a = sample(c("Don't know/Not sure","Unknown","Refused","Blue", "Red", "Green"), 20, rep=TRUE), b = sample(c(1, 2, 3, 77, 88, 99), 10, rep=TRUE), f = round(rnorm(n=10, mean=.90, sd=.08), digits = 2), g = sample(c("C","M","Y","K"), 10, rep=TRUE) ); df
# a b f g
# 1 Unknown 2 0.78 M
# 2 Refused 2 0.87 M
# 3 Red 77 0.82 Y
# 4 Red 99 0.78 Y
# 5 Green 77 0.97 M
# 6 Green 3 0.99 K
# 7 Red 3 0.99 Y
# 8 Green 88 0.84 C
# 9 Unknown 99 1.08 M
# 10 Refused 99 0.81 C
# 11 Blue 2 0.78 M
# 12 Green 2 0.87 M
# 13 Blue 77 0.82 Y
# 14 Don't know/Not sure 99 0.78 Y
# 15 Unknown 77 0.97 M
# 16 Refused 3 0.99 K
# 17 Blue 3 0.99 Y
# 18 Green 88 0.84 C
# 19 Refused 99 1.08 M
# 20 Red 99 0.81 C
If I now make two tables, my missing values ("Don't know/Not sure", "Unknown", "Refused" and 77, 88, 99) are included as regular data,
table(df$a,df$g)
# C K M Y
# Blue 0 0 1 2
# Don't know/Not sure 0 0 0 1
# Green 2 1 2 0
# Red 1 0 0 3
# Refused 1 1 2 0
# Unknown 0 0 3 0
and
table(df$b,df$g)
# C K M Y
# 2 0 0 4 0
# 3 0 2 0 2
# 77 0 0 2 2
# 88 2 0 0 0
# 99 2 0 2 2
I now recode the three factor levels "Don't know/Not sure", "Unknown", "Refused" into <NA>
is.na(df$a) <- df$a %in% c("Don't know/Not sure", "Unknown", "Refused")
and remove the empty levels
df$a <- factor(df$a)
and the same is done with the numeric values 77, 88, and 99
is.na(df) <- df=="77"|df=="88"|df=="99"
table(df$a, df$g, useNA = "always")
# C K M Y <NA>
# Blue 0 0 1 2 0
# Green 2 1 2 0 0
# Red 1 0 0 3 0
# <NA> 1 1 5 1 0
table(df$b,df$g, useNA = "always")
# C K M Y <NA>
# 2 0 0 4 0 0
# 3 0 2 0 2 0
# <NA> 4 0 4 4 0
Now the missing categories are recoded into NA, but they are all lumped together. Is there a way in R to recode something as missing, but retain the original values? I want R to treat "Don't know/Not sure", "Unknown", "Refused" and 77, 88, 99 as missing, but I still want to have that information in the variable.
To my knowledge, base R doesn't have a built-in way to handle different NA types. (editor: It does: NA_integer_, NA_real_, NA_complex_, and NA_character_. See ?base::NA. Note that these differ by storage type, not by the reason a value is missing.)
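To make the point about base R concrete, here is a minimal sketch using only base functions. The vector `b` is a made-up example holding the survey codes from the question; it shows that the typed NA constants carry a storage type but no reason, so the three codes collapse into one once recoded.

```r
# The typed NA constants differ only in their storage type.
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# Illustrative vector (an assumption, not the question's full data).
b <- c(2, 77, 88, 99)

# Recode the three survey codes as missing.
b[b %in% c(77, 88, 99)] <- NA_real_

# All three are now the same NA; the original code is gone.
b
is.na(b)
```

This is why something beyond a plain NA is needed if the distinction between "Refused" and "Unknown" has to survive.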
One option is to use a package which does so, for instance "memisc". It's a little bit of extra work, but it seems to do what you're looking for.
Here's an example:
First, your data. I've made a copy since we will be making some pretty significant changes to the dataset, and it's always nice to have a backup.
set.seed(667)
df <- data.frame(a = sample(c("Don't know/Not sure", "Unknown",
                              "Refused", "Blue", "Red", "Green"),
                            20, replace = TRUE),
                 b = sample(c(1, 2, 3, 77, 88, 99), 10,
                            replace = TRUE),
                 f = round(rnorm(n = 10, mean = .90, sd = .08),
                           digits = 2),
                 g = sample(c("C", "M", "Y", "K"), 10,
                            replace = TRUE))
df2 <- df
Let's factor variable "a":
df2$a <- factor(df2$a,
                levels = c("Blue", "Red", "Green",
                           "Don't know/Not sure",
                           "Refused", "Unknown"),
                labels = c(1, 2, 3, 77, 88, 99))
Load the "memisc" library:
library(memisc)
Now, convert variables "a" and "b" to items in "memisc":
df2$a <- as.item(as.character(df2$a),
                 labels = structure(c(1, 2, 3, 77, 88, 99),
                                    names = c("Blue", "Red", "Green",
                                              "Don't know/Not sure",
                                              "Refused", "Unknown")),
                 missing.values = c(77, 88, 99))
df2$b <- as.item(df2$b,
                 labels = c(1, 2, 3, 77, 88, 99),
                 missing.values = c(77, 88, 99))
By doing this, we have a new data type. Compare the following:
as.factor(df2$a)
# [1] <NA> <NA> Red Red Green Green Red Green <NA> <NA> Blue
# [12] Green Blue <NA> <NA> <NA> Blue Green <NA> Red
# Levels: Blue Red Green
as.factor(include.missings(df2$a))
# [1] *Unknown *Refused Red
# [4] Red Green Green
# [7] Red Green *Unknown
# [10] *Refused Blue Green
# [13] Blue *Don't know/Not sure *Unknown
# [16] *Refused Blue Green
# [19] *Refused Red
# Levels: Blue Red Green *Don't know/Not sure *Refused *Unknown
We can use this information to create tables behaving the way you describe, while retaining all the original information.
table(as.factor(include.missings(df2$a)), df2$g)
#
# C K M Y
# Blue 0 0 1 2
# Red 1 0 0 3
# Green 2 1 2 0
# *Don't know/Not sure 0 0 0 1
# *Refused 1 1 2 0
# *Unknown 0 0 3 0
table(as.factor(df2$a), df2$g)
#
# C K M Y
# Blue 0 0 1 2
# Red 1 0 0 3
# Green 2 1 2 0
table(as.factor(df2$a), df2$g, useNA="always")
#
# C K M Y <NA>
# Blue 0 0 1 2 0
# Red 1 0 0 3 0
# Green 2 1 2 0 0
# <NA> 1 1 5 1 0
The tables for the numeric column with missing data behave the same way.
table(as.factor(include.missings(df2$b)), df2$g)
#
# C K M Y
# 1 0 0 0 0
# 2 0 0 4 0
# 3 0 2 0 2
# *77 0 0 2 2
# *88 2 0 0 0
# *99 2 0 2 2
table(as.factor(df2$b), df2$g, useNA="always")
#
# C K M Y <NA>
# 1 0 0 0 0 0
# 2 0 0 4 0 0
# 3 0 2 0 2 0
# <NA> 4 0 4 4 0
As a bonus, you get the facility to generate nice codebooks:
> codebook(df2$a)
========================================================================
df2$a
------------------------------------------------------------------------
Storage mode: character
Measurement: nominal
Missing values: 77, 88, 99
   Values and labels                 N    Percent
    1   'Blue'                       3   25.0  15.0
    2   'Red'                        4   33.3  20.0
    3   'Green'                      5   41.7  25.0
   77 M 'Don't know/Not sure'        1          5.0
   88 M 'Refused'                    4         20.0
   99 M 'Unknown'                    3         15.0
However, I do also suggest you read the comment from @Maxim.K about what really constitutes missing values.
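If adding a package isn't an option, the same idea (treat the codes as NA but keep the reason) can be roughly sketched in base R by storing the reason in a parallel column. This is not what memisc does internally; `split_missing` is a hypothetical helper written for this illustration, and `b` simply repeats the first ten values of column "b" from the question.

```r
# Hypothetical helper (not part of base R or memisc): split a vector into
# a cleaned value column and a column recording why a value is missing.
split_missing <- function(x, missing_codes) {
  is_miss <- x %in% missing_codes
  # Keep the original code as the reason, NA where the value is valid.
  reason <- ifelse(is_miss, as.character(x), NA_character_)
  value <- x
  value[is_miss] <- NA
  data.frame(value = value, reason = reason, stringsAsFactors = FALSE)
}

# First ten values of column "b" from the question.
b <- c(2, 2, 77, 99, 77, 3, 3, 88, 99, 99)
res <- split_missing(b, missing_codes = c(77, 88, 99))

# Analyses see ordinary NAs ...
table(res$value, useNA = "always")

# ... while the reasons remain available: 77 twice, 88 once, 99 three times.
table(res$reason)
```

The trade-off is that the two columns have to be kept in sync by hand, which is the bookkeeping that an item class like memisc's handles for you.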