
Specify different types of missing values (NAs)

I'd like to specify different types of missing values. My data contain several kinds of missing responses, and I am trying to code them as missing in R, but I am looking for a solution where I can still distinguish between them.

Say I have some data that looks like this,

set.seed(667) 
df <- data.frame(a = sample(c("Don't know/Not sure", "Unknown", "Refused",
                              "Blue", "Red", "Green"), 20, replace = TRUE),
                 b = sample(c(1, 2, 3, 77, 88, 99), 10, replace = TRUE),
                 f = round(rnorm(n = 10, mean = .90, sd = .08), digits = 2),
                 g = sample(c("C", "M", "Y", "K"), 10, replace = TRUE))
df
#                      a  b    f g
# 1              Unknown  2 0.78 M
# 2              Refused  2 0.87 M
# 3                  Red 77 0.82 Y
# 4                  Red 99 0.78 Y
# 5                Green 77 0.97 M
# 6                Green  3 0.99 K
# 7                  Red  3 0.99 Y
# 8                Green 88 0.84 C
# 9              Unknown 99 1.08 M
# 10             Refused 99 0.81 C
# 11                Blue  2 0.78 M
# 12               Green  2 0.87 M
# 13                Blue 77 0.82 Y
# 14 Don't know/Not sure 99 0.78 Y
# 15             Unknown 77 0.97 M
# 16             Refused  3 0.99 K
# 17                Blue  3 0.99 Y
# 18               Green 88 0.84 C
# 19             Refused 99 1.08 M
# 20                 Red 99 0.81 C

If I now make two tables my missing values ("Don't know/Not sure","Unknown","Refused" and 77, 88, 99) are included as regular data,

table(df$a,df$g)
#                     C K M Y
# Blue                0 0 1 2
# Don't know/Not sure 0 0 0 1
# Green               2 1 2 0
# Red                 1 0 0 3
# Refused             1 1 2 0
# Unknown             0 0 3 0

and

table(df$b,df$g)
#    C K M Y
# 2  0 0 4 0
# 3  0 2 0 2
# 77 0 0 2 2
# 88 2 0 0 0
# 99 2 0 2 2

I now recode the three factor levels "Don't know/Not sure","Unknown","Refused" into <NA>

is.na(df[,c("a")]) <- df[,c("a")]=="Don't know/Not sure"|df[,c("a")]=="Unknown"|df[,c("a")]=="Refused"

and remove the empty levels

df$a <- factor(df$a) 

and the same is done with the numeric values 77, 88, and 99

is.na(df) <- df=="77"|df=="88"|df=="99"

table(df$a, df$g, useNA = "always")       
#       C K M Y <NA>
# Blue  0 0 1 2    0
# Green 2 1 2 0    0
# Red   1 0 0 3    0
# <NA>  1 1 5 1    0

table(df$b,df$g, useNA = "always")
#      C K M Y <NA>
# 2    0 0 4 0    0
# 3    0 2 0 2    0
# <NA> 4 0 4 4    0
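As a side note, the recoding steps above can be written more compactly with `%in%`, and restricted to the columns they are meant for. Comparing the whole data frame against "77", "88" and "99" would also blank out a matching value in any other column. The following is only a sketch on a hypothetical toy data frame (`toy`, `na_labels` and `na_codes` are names invented here, not part of the question's data):

```r
# Hypothetical toy data, not the question's df
toy <- data.frame(a = c("Blue", "Refused", "Red", "Unknown"),
                  b = c(1, 77, 2, 99),
                  f = c(0.90, 0.77, 0.85, 0.99))

na_labels <- c("Don't know/Not sure", "Unknown", "Refused")
na_codes  <- c(77, 88, 99)

# Recode column by column so unrelated columns (here f) are never touched
toy$a[toy$a %in% na_labels] <- NA
toy$a <- factor(toy$a)             # (re)build the factor without unused levels
toy$b[toy$b %in% na_codes] <- NA

toy
```

This still lumps all missing values together; it only tidies the recoding itself.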

Now the missing categories are recoded into NA, but they are all lumped together. Is there a way in R to recode something as missing but retain the original values? I want R to treat "Don't know/Not sure", "Unknown", "Refused" and 77, 88, 99 as missing, but I still want to keep that information in the variable.

Asked by Eric Fail on Apr 18 '13 at 04:04



1 Answer

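To my knowledge, base R doesn't have a built-in way to handle different NA types. (Editor's note: base R does have type-specific NA constants, NA_integer_, NA_real_, NA_complex_ and NA_character_, but they differ only in storage type and do not record why a value is missing. See ?base::NA.)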
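To illustrate why the typed NA constants mentioned in the editor's note do not solve the problem, here is a minimal base R sketch (the vector `x` is made up for the example). Each constant has its own storage type, but once in a vector they are all just "missing":

```r
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# Combined into one vector, both NAs are coerced to double;
# nothing records which kind of missingness each one stood for
x <- c(1, NA_real_, NA_integer_)
typeof(x)              # "double"
is.na(x)               # FALSE TRUE TRUE
```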

One option is to use a package which does so, for instance "memisc". It's a little bit of extra work, but it seems to do what you're looking for.

Here's an example:

First, your data. I've made a copy since we will be making some pretty significant changes to the dataset, and it's always nice to have a backup.

set.seed(667) 
df <- data.frame(a = sample(c("Don't know/Not sure", "Unknown", 
                              "Refused", "Blue", "Red", "Green"),
                            20, replace = TRUE), 
                 b = sample(c(1, 2, 3, 77, 88, 99), 10, 
                            replace = TRUE), 
                 f = round(rnorm(n = 10, mean = .90, sd = .08), 
                           digits = 2), 
                 g = sample(c("C", "M", "Y", "K"), 10, 
                            replace = TRUE))
df2 <- df

Let's factor variable "a":

df2$a <- factor(df2$a, 
                levels = c("Blue", "Red", "Green", 
                           "Don't know/Not sure",
                           "Refused", "Unknown"),
                labels = c(1, 2, 3, 77, 88, 99))
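For readers unfamiliar with the `levels`/`labels` pair in `factor()`, here is a small sketch of what this step does, using a made-up three-element vector (`x` and `f` are illustrative names only). Each level is replaced by the label in the same position, and numeric labels are stored as character strings:

```r
x <- c("Red", "Unknown", "Blue")

# "Blue" -> "1", "Red" -> "2", "Unknown" -> "99"
f <- factor(x,
            levels = c("Blue", "Red", "Unknown"),
            labels = c(1, 2, 99))

levels(f)        # "1" "2" "99"
as.character(f)  # "2" "99" "1"
```

This is why the colours end up as the codes 1, 2, 3 and the missing categories as 77, 88, 99 before the conversion to "memisc" items.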

Load the "memisc" library:

library(memisc)

Now, convert variables "a" and "b" to items in "memisc":

df2$a <- as.item(as.character(df2$a), 
                  labels = structure(c(1, 2, 3, 77, 88, 99),
                                     names = c("Blue", "Red", "Green", 
                                               "Don't know/Not sure",
                                               "Refused", "Unknown")),
                  missing.values = c(77, 88, 99))
df2$b <- as.item(df2$b, 
                 labels = c(1, 2, 3, 77, 88, 99), 
                 missing.values = c(77, 88, 99))

By doing this, we have a new data type. Compare the following:

as.factor(df2$a)
#  [1] <NA>  <NA>  Red   Red   Green Green Red   Green <NA>  <NA>  Blue 
# [12] Green Blue  <NA>  <NA>  <NA>  Blue  Green <NA>  Red  
# Levels: Blue Red Green
as.factor(include.missings(df2$a))
#  [1] *Unknown             *Refused             Red                 
#  [4] Red                  Green                Green               
#  [7] Red                  Green                *Unknown            
# [10] *Refused             Blue                 Green               
# [13] Blue                 *Don't know/Not sure *Unknown            
# [16] *Refused             Blue                 Green               
# [19] *Refused             Red                 
# Levels: Blue Red Green *Don't know/Not sure *Refused *Unknown

We can use this information to create tables behaving the way you describe, while retaining all the original information.

table(as.factor(include.missings(df2$a)), df2$g)
#                       
#                        C K M Y
#   Blue                 0 0 1 2
#   Red                  1 0 0 3
#   Green                2 1 2 0
#   *Don't know/Not sure 0 0 0 1
#   *Refused             1 1 2 0
#   *Unknown             0 0 3 0
table(as.factor(df2$a), df2$g)
#        
#         C K M Y
#   Blue  0 0 1 2
#   Red   1 0 0 3
#   Green 2 1 2 0
table(as.factor(df2$a), df2$g, useNA="always")
#        
#         C K M Y <NA>
#   Blue  0 0 1 2    0
#   Red   1 0 0 3    0
#   Green 2 1 2 0    0
#   <NA>  1 1 5 1    0

The tables for the numeric column with missing data behave the same way.

table(as.factor(include.missings(df2$b)), df2$g)
#      
#       C K M Y
#   1   0 0 0 0
#   2   0 0 4 0
#   3   0 2 0 2
#   *77 0 0 2 2
#   *88 2 0 0 0
#   *99 2 0 2 2
table(as.factor(df2$b), df2$g, useNA="always")
#       
#        C K M Y <NA>
#   1    0 0 0 0    0
#   2    0 0 4 0    0
#   3    0 2 0 2    0
#   <NA> 4 0 4 4    0

As a bonus, you get the facility to generate nice codebooks:

> codebook(df2$a)
========================================================================

   df2$a

------------------------------------------------------------------------

   Storage mode: character
   Measurement: nominal
   Missing values: 77, 88, 99

            Values and labels    N    Percent 

    1   'Blue'                   3   25.0 15.0
    2   'Red'                    4   33.3 20.0
    3   'Green'                  5   41.7 25.0
   77 M 'Don't know/Not sure'    1         5.0
   88 M 'Refused'                4        20.0
   99 M 'Unknown'                3        15.0

However, I do also suggest you read the comment from @Maxim.K about what really constitutes missing values.
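If adding a package is not an option, one base R workaround (not part of the approach above, just a sketch) is to keep the reason for missingness in a companion variable next to the recoded one. The vector `b_raw`, the lookup `reasons` and the result names below are all hypothetical:

```r
# Hypothetical toy vector of survey codes; 77/88/99 are the missing codes
b_raw <- c(2, 77, 3, 99, 88, 2)

# Lookup from missing code to reason (labels assumed for illustration)
reasons <- c("77" = "Don't know/Not sure",
             "88" = "Refused",
             "99" = "Unknown")

# Analysis variable: missing codes become NA
b <- ifelse(b_raw %in% as.numeric(names(reasons)), NA, b_raw)

# Companion variable: the reason where b is NA, NA where b is valid
b_reason <- unname(reasons[as.character(b_raw)])

table(b, useNA = "always")
table(b_reason, useNA = "always")
```

The analysis variable then behaves like any vector with ordinary NAs, while the companion variable can be tabulated whenever the reasons matter. The cost is that the two have to be kept in sync by hand, which is what a package such as "memisc" takes care of.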

Answered by A5C1D2H2I1M1N2O1R2T1 on Oct 12 '22 at 08:10