Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replace NA with previous occurrence

Tags:

r

na

This is a sample of my data from CSV. It contains ~10 columns.

    Product_id    Product_Weight    Product_Name    Shop_Name ...
[1]    A             10                xxxx            Walmart
[2]    B             12                yyyy            Target
[3]    C             11                zzzz            Target
[4]    A             NA                xxxx            Walmart
[5]    C             NA                zzzz            Target

I would like to fill NA's in row 4 and 5 with 10 and 11 respectively (since the product weight of A and C are already known from row 1 and 3). I want the final data frame to be like this

    Product_id    Product_Weight    Product_Name    Shop_Name ...
[1]    A             10                xxxx            Walmart
[2]    B             12                yyyy            Target
[3]    C             11                zzzz            Target
[4]    A             10                xxxx            Walmart
[5]    C             11                zzzz            Target 

What is the best way to do this in R?

like image 863
Avis Avatar asked Jun 09 '16 17:06

Avis


1 Answers

Although the question asked for the "previous occurrence" this would have the disadvantage that if the first Product_Weight in any Product_id were NA then it could not be filled in even if we knew the Product_Weight from a subsequent Product_id so instead of using the previous occurrence we take the mean of all non-NAs with the same Product_id. Since these should all be the same their mean is their common value.

If you really do want the previous occurrence use the Prev function where:

Prev <- function(x) na.locf(x, na.rm = FALSE)

in place of na.aggregate in (1) and (3) and do not use (2).

The following solutions have the advantages that they all:

  • preserve the order of the input

  • work even if the first Product_Weight in any Product_id is NA

  • do not modify the input

The first solution has the additional advantage of being only one line of code (plus a library statement) and the second solution has the additional advantage of not using any packages.

1) zoo::na.aggregate We use na.aggregate in the zoo package (which replaces all NAs with the average of the non-NAs) and we apply it to Product_Weight separately for each Product_id.

library(zoo)
transform(DF, Product_Weight = ave(Product_Weight, Product_id, FUN = na.aggregate))

giving:

  Product_id Product_Weight Product_Name Shop_Name
1          A             10         xxxx   Walmart
2          B             12         yyyy    Target
3          C             11         zzzz    Target
4          A             10         xxxx   Walmart
5          C             11         zzzz    Target

2) No packages Alternately use Mean in place of na.aggregate where Mean is defined as:

Mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

3) dplyr/zoo Add row numbers, group by the Product_id, fill in the NAs as in the prior solutions using either na.aggregate as shown below or Mean, arrange back to the original order and remove the row numbers:

library(dplyr)
library(zoo)

DF %>% 
   mutate(row = row_number()) %>% 
   group_by(Product_id) %>% 
   mutate(Product_Weight = na.aggregate(Product_Weight)) %>% 
   ungroup() %>% 
   arrange(row) %>% 
   select(-row)

Note: This was used for the input DF :

Lines <- "    Product_id    Product_Weight    Product_Name    Shop_Name
    A             10                xxxx            Walmart
    B             12                yyyy            Target
    C             11                zzzz            Target
    A             NA                xxxx            Walmart
    C             NA                zzzz            Target"
DF <- read.table(text = Lines, header = TRUE)
like image 110
G. Grothendieck Avatar answered Oct 06 '22 23:10

G. Grothendieck