Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Identifying rows in data.frame with only NA values in R

Tags:

r

missing-data

I have a data.frame with 15,000 observations of 34 ordinal and NA variables. I am performing clustering for a market segmentation study and need the rows with only NAs removed. After taking out the userID I got an error message saying to omit 2099 rows with only NAs before clustering.

I found a link for removing rows with all NA values, but I need to identify which of the 2099 rows have all NA values. Here the link for the discussion removing rows with all NA values: Remove Rows with NAs in data.frame

Here's a sample of the first five observations from six variables:

> head(Store2df, n=5)
  RowNo      Age Gender HouseholdIncome MaritalStatus PresenceofChildren
1     1     <NA>   Male            <NA>          <NA>               <NA>
2     2    45-54 Female            <NA>          <NA>               <NA>
3     3     <NA>   <NA>            <NA>          <NA>               <NA>
4     4     <NA>   <NA>            <NA>          <NA>               <NA>
5     5    45-54 Female        75k-100k       Married                Yes
#Making a vector
> Vector1 <- Store2df$RowNo 
#Taking out RowNo column
> Store2df$RowNo <- NULL

EDIT: I put the results in a object, but found that the code made an extra column. Clicking in RStudio's environment, an extra column called row.names was created labeling each row with the original row name. A couple thousand rows were deleted and the new column labeled the new rows with the old row number. However when looking at the head of the new object, I did not see the row label. Why does the row.names label show in the environment, but not when I view the head?

#Remove all rows with only NA values
> Store2df <- Store2[!!rowSums(!is.na(Store2)),]
#View head of store2df
> head(Store2df)
    Age Gender HouseholdIncome MaritalStatus PresenceofChildren
1  <NA>   Male            <NA>          <NA>               <NA>
2 45-54 Female            <NA>          <NA>               <NA>
5 45-54 Female        75k-100k       Married                Yes
6 25-34   Male        75k-100k       Married                 No
7 35-44 Female       125k-150k       Married                Yes
8 55-64   Male        75k-100k       Married                 No

EDIT 2: I put in the row number/userID column to keep track of the number of users. To perform the operation for removing all NAs, I took out the first column. Now I need to keep track of the users I removed. I have a list of over 2000 rows that had all NA values, I don't want to create an index manually putting in each row.

Question: How do I remove the emails that the missing data corresponded to?

> #First six rows of the column RowNo
> head(Store2df$RowNo)
[1] 1 2 3 4 5 6

I want 2099 rows deleted in the Store2df data.frame with the RowNo included. Here's the script identifying which rows are all empty in the Store2df data.frame without RowNo.

> which(rowSums(is.na(Store2df))==ncol(Store2df))

Showing the first 6 rows, row number 3 and 4 are deleted.

> head(Store2df$RowNo)
[1] 1 2 5 6 7 8

There are 4 steps I want to complete:

1) Take out RowNo column in Store2df data.frame and save as separate vector

2) Delete rows with all NA values in Store2df data.frame

3) Delete same rows in Store2new1 vector as Store2df data.frame

4) Combine vector and data.frame with vector matching the data.frame

like image 822
Scott Davis Avatar asked Sep 01 '14 05:09

Scott Davis


1 Answers

 which(rowSums(is.na(Store2))==ncol(Store2))
 #3 4 
 #3 4 

Or

 which(Reduce(`&`,as.data.frame(is.na(Store2))))
 #[1] 3 4

Or

 which(!rowSums(!is.na(Store2)))  
 #3 4 
 #3 4 

data

 Store2 <- structure(list(Age = c(NA, "45-54", NA, NA, "45-54"), Gender = c("Male", 
 "Female", NA, NA, "Female"), HouseholdIncome = c(NA, NA, NA, 
  NA, "75k-100k"), MaritalStatus = c(NA, NA, NA, NA, "Married"), 
PresenceofChildren = c(NA, NA, NA, NA, "Yes"), HomeOwnerStatus = c(NA, 
NA, NA, NA, "Own"), HomeMarketValue = c(NA, NA, NA, NA, "150k-200k"
)), .Names = c("Age", "Gender", "HouseholdIncome", "MaritalStatus", 
"PresenceofChildren", "HomeOwnerStatus", "HomeMarketValue"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5"))

Update

To drop the rows with all NAs

  Store2[!!rowSums(!is.na(Store2)),]
  #   Age Gender HouseholdIncome MaritalStatus PresenceofChildren HomeOwnerStatus
  #1  <NA>   Male            <NA>          <NA>               <NA>            <NA>
  #2 45-54 Female            <NA>          <NA>               <NA>            <NA>
  #5 45-54 Female        75k-100k       Married                Yes             Own
   #HomeMarketValue
  #1            <NA>
  #2            <NA>
  #5       150k-200k
  • is.na(Store2) gives a logical index of elements that are missing or NA
  • ! will negate the logical index i.e. TRUE becomes FALSE and viceversa
  • rowSums of the above code gives the sum of elements that are not NA in each row

        rowSums(!is.na(Store2))
        #   1 2 3 4 5 
        #   1 2 0 0 7  # 3rd and 4th row have `0 non NA` values
    
  • ! Negate the above gives

        !rowSums(!is.na(Store2))
        # 1     2     3     4     5 
        #FALSE FALSE  TRUE  TRUE FALSE 
    
  • We wanted to drop those rows that are all NA's or 0 non NAs. So ! again

        !!rowSums(!is.na(Store2))
        #1     2     3     4     5 
        #TRUE  TRUE FALSE FALSE  TRUE 
    
  • Subset using the above logical index

Update2

If you have two rowNo, i.e. the one you stored separately before deleting the NA rows and the second after you deleted the NAs.

   RowNo1 <- 1:6
   RowNo2 <- c(1,2,5,6)
   RowNo1 %in% RowNo2
   #[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE
   RowNo1[RowNo1 %in% RowNo2]
   #[1] 1 2 5 6

Update3

With your new requests, let me try it again:

    Store2 <- structure(list(RowNo = 1:5, Age = c(NA, "45-54", NA, NA, "45-54"
    ), Gender = c("Male", "Female", NA, NA, "Female"), HouseholdIncome = c(NA, 
    NA, NA, NA, "75k-100k"), MaritalStatus = c(NA, NA, NA, NA, "Married"
   ), PresenceofChildren = c(NA, NA, NA, NA, "Yes")), .Names = c("RowNo", 
   "Age", "Gender", "HouseholdIncome", "MaritalStatus", "PresenceofChildren"
   ), class = "data.frame", row.names = c("1", "2", "3", "4", "5"
   ))

First step

Saving RowNo as separate vector (I am not sure why you need this)

  Store2new1 <- Store2$RowNo

Second step

Delete rows with all NA values in Store2 data.frame and store it as Store2df

   Store2df <- Store2[!!rowSums(!is.na(Store2[,-1])),] #Here you already get the new dataset with `RowNo` column

   Store2df
   #RowNo   Age Gender HouseholdIncome MaritalStatus PresenceofChildren
   #1     1  <NA>   Male            <NA>          <NA>               <NA>
   #2     2 45-54 Female            <NA>          <NA>               <NA>
   #5     5 45-54 Female        75k-100k       Married                Yes

Third step

Delete same rows in Store2new1 vector as Store2df data.frame

   Store2new2 <- Store2new1[Store2new1 %in% Store2df$RowNo]
   Store2new1[Store2new1 %in% Store2df$RowNo]
   #[1] 1 2 5

Fourth step

I don't really think the fourth step or third is required unless you want to delete more rows, which is not clear from the post.

like image 72
akrun Avatar answered Oct 19 '22 05:10

akrun