I am using R to summarize a large amount of data for a report. I want to be able to use lapply() to generate a list of tables from the table() function, from which I can extract my desired statistics. There are a lot of these, so I've written a function to do it. My issue is that I am having difficulty returning the number of missing (NA) values even though I have that in each table, because I can't figure out how to tell R that I want the element from table() that holds the number of NA values. As far as I can tell, R is "naming" that element NA...and I can't call that.
I'm trying to avoid writing some complex statement where I say something like which(is.na(names(element[1]))) | names(element[1])=="var_I_want" because I feel like that's just really wordy. I was hoping there was some way to either tell R to label the NA variable in each table with a character name, or to tell it to pick the one labeled NA, but I haven't had much luck yet.
Minimal example:
example <- data.frame(ID=c(10,20,30,40,50),
                      V1=c("A","B","A",NA,"C"),
                      V2=c("Dog","Cat",NA,"Cat","Bunny"),
                      V3=c("Yes","No","No","Yes","No"),
                      V4=c("No",NA,"No","No","Yes"),
                      V5=c("No","Yes","Yes",NA,"No"))
varlist <- c("V1","V2","V3","V4","V5")
list_o_tables <- lapply(X=example[varlist],FUN=table,useNA="always")
list(V1=list_o_tables[["V1"]]["A"],
     V2=list_o_tables[["V2"]]["Cat"],
     V3=list_o_tables[["V3"]]["Yes"],
     V4=list_o_tables[["V4"]]["Yes"],
     V5=list_o_tables[["V5"]]["Yes"])
What I get:
$V1
A
2
$V2
Cat
2
$V3
Yes
2
$V4
Yes
1
$V5
Yes
2
What I'd like:
$V1
A <NA>
2 1
$V2
Cat <NA>
2 1
$V3
Yes <NA>
2 0
$V4
Yes <NA>
1 1
$V5
Yes <NA>
2 1
Counting NAs across either rows or columns can be achieved with the apply() function. Its three main arguments are X, the input matrix (or data frame); MARGIN, an integer giving the dimension to work over; and FUN, the function to apply to each row or column. MARGIN = 1 applies the function to each row and MARGIN = 2 to each column.
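As a minimal sketch of that idea, the snippet below counts missing values per row and per column of a small matrix. The matrix m and the helper count_na are made up for illustration and do not come from the question.

```r
# Hypothetical 2 x 3 matrix; matrix() fills it column by column.
m <- matrix(c(1, NA, 3, NA, NA, 6), nrow = 2)

# Helper (illustrative name): number of missing values in a vector.
count_na <- function(v) sum(is.na(v))

na_per_row <- apply(m, 1, count_na)  # MARGIN = 1: one count per row
na_per_col <- apply(m, 2, count_na)  # MARGIN = 2: one count per column

print(na_per_row)
print(na_per_col)
```

For this particular task, rowSums(is.na(m)) and colSums(is.na(m)) should give the same counts with less code.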
To see which values in a vector R recognizes as missing, we can use the is.na() function. It returns a TRUE/FALSE vector with as many elements as the vector we provide. In a character vector that contains both NA and "NA", R distinguishes between the two: NA is seen as a missing value, "NA" is not.
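A short illustration of that distinction, using a made-up character vector (the name x2 and its contents are assumptions for this sketch):

```r
# Hypothetical vector holding a real missing value and the literal string "NA".
x2 <- c("a", NA, "NA")

missing_flags <- is.na(x2)  # TRUE only for the real missing value
print(missing_flags)

# The literal string is an ordinary value; which() drops the NA comparison result.
literal_na_pos <- which(x2 == "NA")
print(literal_na_pos)
```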
To find the missing values in all columns, use apply() over the columns together with is.na(): wrap it in sum() to count the missing values, or in which() to get their positions.
In R, the easiest way to find columns that contain missing values is by combining the power of the functions is.na() and colSums(). First, you check and count the number of NA's per column. Then, you use a function such as names() or colnames() to return the names of the columns with at least one missing value.
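The two steps just described might look like the following sketch. The data frame df is invented for the example; only is.na(), colSums(), and names() are taken from the text above.

```r
# Hypothetical data frame with missing values in two of its three columns.
df <- data.frame(ID = c(10, 20, 30),
                 V1 = c("A", NA, "C"),
                 V2 = c("Yes", "No", NA))

# Step 1: count the NAs in each column.
na_counts <- colSums(is.na(df))
print(na_counts)

# Step 2: keep the names of the columns with at least one NA.
cols_with_na <- names(df)[na_counts > 0]
print(cols_with_na)
```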
This is ugly (IMHO) but it works:
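my_table <- function(x){
  # Tabulate with an NA slot, then rename the slots: the sorted non-missing values, followed by the string 'NA'
  setNames(table(x,useNA = "always"),c(sort(unique(x[!is.na(x)])),'NA'))
}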
So you'd lapply this instead, and then you'd have access to the NA column.
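A variant on the same idea, offered only as a sketch and not as this answer's method, is to overwrite just the missing name instead of rebuilding all of them. The function name label_na_table and the lookup vector wanted are hypothetical. The sketch assumes no column contains the literal string "NA" as a real value, since that would collide with the new label.

```r
example <- data.frame(ID = c(10, 20, 30, 40, 50),
                      V1 = c("A", "B", "A", NA, "C"),
                      V2 = c("Dog", "Cat", NA, "Cat", "Bunny"),
                      V3 = c("Yes", "No", "No", "Yes", "No"),
                      V4 = c("No", NA, "No", "No", "Yes"),
                      V5 = c("No", "Yes", "Yes", NA, "No"))
varlist <- c("V1", "V2", "V3", "V4", "V5")

# Hypothetical helper: tabulate x, then give the NA slot the character name "NA".
label_na_table <- function(x) {
  tab <- table(x, useNA = "always")
  names(tab)[is.na(names(tab))] <- "NA"
  tab
}

list_o_tables <- lapply(example[varlist], label_na_table)

# The level of interest for each variable (hypothetical lookup vector).
wanted <- c(V1 = "A", V2 = "Cat", V3 = "Yes", V4 = "Yes", V5 = "Yes")

# Pull the wanted level and the relabelled NA count out of each table.
result <- Map(function(tab, level) tab[c(level, "NA")], list_o_tables, wanted)
print(result)  # V3 has no missing values, so its "NA" count is 0
```

Because only the missing name is touched, the remaining labels stay exactly as table() produced them.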
Looking more closely, this is rooted in the behavior of factor:
levels(factor(c(1,NA,2),exclude = NULL))
[1] "1" "2" NA
My recollection is that the distinction between a factor level of NA versus "NA" has been at the very least a source of confusion in R in the past. I feel like I've seen some debates about the merits of this on r-devel, but I can't recall for sure at the moment.
So the issue is, if you have a factor with NA values, what do you call the levels? Technically, this is correct, one of the levels is "missing" not literally "NA". It would be nice (IMHO) if table didn't adhere to this quite so strictly, though.
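To make the point concrete, here is a small sketch showing that the last level is a real missing value, and that the NA slot of a table can still be selected with a logical index on its names. The variable names f, tab, and na_count are illustrative only.

```r
# The factor from above: its third level is a missing value, not the string "NA".
f <- factor(c(1, NA, 2), exclude = NULL)
level_is_missing <- is.na(levels(f))
print(level_is_missing)

# The same applies to the names of a table built with useNA = "always".
tab <- table(c("A", "B", "A", NA, "C"), useNA = "always")

# Select the NA slot by testing the names, without relabelling anything.
na_count <- tab[is.na(names(tab))]
print(na_count)
```

This keeps table()'s strict labelling and needs no renaming, at the cost of a slightly longer index expression.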