I have a data frame with a bunch of categorical variables. Some of them contain NA's and I use the <code>addNA</code> function to convert them to an explicit factor level. My problem is that when I try to treat them as NA's, they don't seem to register. Here's my example data set and my attempts to 'find' the NA's:
<pre class="prettyprint"><code>df1 <- data.frame(id = 1:200, y = rbinom(200, 1, .5),
                  var1 = factor(rep(c('abc','def','ghi','jkl'),50)))
df1$var2 <- factor(rep(c('ab c','ghi','jkl','def'),50))
df1$var3 <- factor(rep(c('abc','ghi','nop','xyz'),50))

df1[df1$var1 == 'abc','var1'] <- NA

df1$var1 <- addNA(df1$var1)

df1$isNaCol <- ifelse(df1$var1 == NA, 1, 0);summary(df1$isNaCol)
df1$isNaCol <- ifelse(is.na(df1$var1), 1, 0);summary(df1$isNaCol)
df1$isNaCol <- ifelse(df1$var1 == 'NA', 1, 0);summary(df1$isNaCol)
df1$isNaCol <- ifelse(df1$var1 == '<NA>', 1, 0);summary(df1$isNaCol)
</code></pre>
Also, when I type <code>??addNA</code> I don't get any matches. Is this a gray-market function or something? Any suggestions would be appreciated.


Find NA values after using addNA()

Tags:

r

na

screechOwl

3 Answers

Testing equality to NA with the usual comparison operators always yields NA; you want is.na. Additionally, calling is.na on a factor tests each level index (not the value associated with that index), so you want to convert the factor to a character vector first.

df1$isNaCol <- ifelse(is.na(as.character(df1$var1)), 1, 0);summary(df1$isNaCol)
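The same point can be checked on a tiny factor. This is only a sketch: `x`, `f` and `na_flag` are illustrative names and the values are made up to stand in for `df1$var1`.

```r
# Tiny stand-in for df1$var1 (values are illustrative, not the OP's data)
x <- factor(c(NA, "def", "ghi", NA))
f <- addNA(x)                 # NA becomes an explicit level

f == NA                       # all NA: comparing anything with NA gives NA
f == "NA"                     # no element matches the string "NA"
is.na(f)                      # FALSE FALSE FALSE FALSE: the level indices are not missing
is.na(as.character(f))        # TRUE FALSE FALSE TRUE: the values are missing

# 1/0 indicator, as in the line above
na_flag <- as.integer(is.na(as.character(f)))
na_flag                       # 1 0 0 1
```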

133

answered Oct 06 '22 01:10

Matthew Plourde

Note that this is done with the OP's data before the call to addNA().

It is instructive to see what addNA() does with this data.

> head(df1$var1)
[1] <NA> def  ghi  jkl  <NA> def 
Levels: abc def ghi jkl
> levels(df1$var1)
[1] "abc" "def" "ghi" "jkl"
> head(addNA(df1$var1))
[1] <NA> def  ghi  jkl  <NA> def 
Levels: abc def ghi jkl <NA>
> levels(addNA(df1$var1))
[1] "abc" "def" "ghi" "jkl" NA

addNA alters the levels of the factor so that missingness (NA) becomes a level in its own right, whereas by default R does not treat NA as a level because what level the NA values take is, of course, missing. It also strips out the NA information: in a sense the value is no longer unknown but part of a category "missing".

To look at the help for addNA use ?addNA.

If we look at the definition of addNA we see that all it is doing is altering the levels of the factor, not changing the data at all:

> addNA
function (x, ifany = FALSE) 
{
    if (!is.factor(x)) 
        x <- factor(x)
    if (ifany & !any(is.na(x))) 
        return(x)
    ll <- levels(x)
    if (!any(is.na(ll))) 
        ll <- c(ll, NA)
    factor(x, levels = ll, exclude = NULL)
}

Note that it doesn't otherwise change the data: the NA entries are still there in the factor (they still print as <NA>). We can replicate most of the behaviour of addNA via:

with(df1, factor(var1, levels = c(levels(var1), NA), exclude = NULL))

> head(with(df1, factor(var1, levels = c(levels(var1), NA), exclude = NULL)))
[1] <NA> def  ghi  jkl  <NA> def 
Levels: abc def ghi jkl <NA>

However, because NA is now a level, those entries are no longer flagged as missing by is.na(). That explains why your second comparison (the one using is.na()) does not work.
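One way to see why is.na() stops reporting those entries is to look at the integer codes stored in the factor. A minimal sketch with a made-up three-element factor (`x` and `f` are illustrative names, not part of the OP's data):

```r
x <- factor(c(NA, "def", "ghi"))   # levels: "def", "ghi"
f <- addNA(x)                      # levels: "def", "ghi", NA

as.integer(x)              # NA 1 2 : the missing entry has no code
as.integer(f)              # 3 1 2  : it now points at the third level, NA

# The code is no longer missing, but the level it points at is
levels(f)[as.integer(f)]   # NA "def" "ghi"
```

is.na() on a factor looks at the codes, which is why it returns FALSE throughout for `f`.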

The only nicety you get from addNA is that it doesn't add NA as a level if it already exists as one. Also, via the ifany argument you can stop it adding NA as a level if there are no NAs in the data.

Where you are going wrong is attempting to compare an NA with something using the usual comparison methods (except your second example). If we don't know what value an NA observation takes, how can we compare it with something? Well, we can't, other than with the internal representation of NA. This is what is done by the is.na() function:
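Both niceties are easy to check on a factor with no missing values. The sketch below uses made-up names (`g`, `h`):

```r
g <- factor(c("a", "b"))           # no missing values

levels(addNA(g))                   # "a" "b" NA : level added regardless
levels(addNA(g, ifany = TRUE))     # "a" "b"    : nothing missing, so nothing added

# Applying addNA twice does not create a second NA level
h <- addNA(addNA(g))
sum(is.na(levels(h)))              # 1
```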

> with(df1, head(is.na(var1), 10))
 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE

Hence I would do the following (without using addNA at all):

df1 <- transform(df1, isNaCol = is.na(var1))

> head(df1)
  id y var1 var2 var3 isNaCol
1  1 1 <NA> ab c  abc    TRUE
2  2 0  def  ghi  ghi   FALSE
3  3 0  ghi  jkl  nop   FALSE
4  4 0  jkl  def  xyz   FALSE
5  5 0 <NA> ab c  abc    TRUE
6  6 1  def  ghi  ghi   FALSE

If you want that as a 1/0 variable, just wrap it in as.numeric(), as in:

df1 <- transform(df1, isNaCol = as.numeric(is.na(var1)))

Where I think you are really going wrong is in wanting to attach an NA level to the factor. I see addNA() as a convenience function for use in things like table(), and even table() has an argument (useNA) that removes the need for a prior call to addNA(), e.g.:

> with(df1, table(var1, useNA = "ifany"))
var1
 abc  def  ghi  jkl <NA> 
   0   50   50   50   50

answered Oct 06 '22 00:10

Gavin Simpson

Anything compared to NA is NA; this is why your first summary is all NA.

The addNA function changes any NA observations in your factor to a new level. This level is then given the label NA (of character mode). The underlying variable itself no longer has any NAs. This is why your second summary is all 0.

To see how many observations have the NA level, use what Matthew Plourde posted.
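As a rough illustration of that count, the sketch below rebuilds a column shaped like the question's var1 as a standalone vector (`v` and `n_na_level` are illustrative names) and counts the entries whose level is NA by looking each code up in the levels:

```r
# Same construction as var1 in the question, as a standalone vector
v <- factor(rep(c("abc", "def", "ghi", "jkl"), 50))
v[v == "abc"] <- NA
v <- addNA(v)

sum(is.na(v))                                      # 0  : no codes are missing
n_na_level <- sum(is.na(levels(v))[as.integer(v)]) # codes that point at the NA level
n_na_level                                         # 50
```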

answered Oct 06 '22 01:10

Hong Ooi
