Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filling missing levels

Tags:

r

missing-data

I have the following type of dataframe:

Country <- rep(c("USA", "AUS", "GRC"),2)
Year    <- 2001:2006
Level   <- c("rich","middle","poor",rep(NA,3))
df <- data.frame(Country, Year,Level)

df 
Country Year  Level
1     USA 2001   rich
2     AUS 2002 middle
3     GRC 2003   poor
4     USA 2004   <NA>
5     AUS 2005   <NA>
6     GRC 2006   <NA>

I want to fill the missing values with the correct level label in the last from the right column.

So the expected outcome should be like this:

Country Year  Level
1     USA 2001   rich
2     AUS 2002 middle
3     GRC 2003   poor
4     USA 2004   rich
5     AUS 2005 middle
6     GRC 2006   poor
like image 247
msh855 Avatar asked Dec 21 '17 18:12

msh855


People also ask

What is fill in R?

Source: R/fill.R. fill.Rd. Fills missing values in selected columns using the next or previous entry. This is useful in the common output format where values are not repeated, and are only recorded when they change.


2 Answers

In base R, you could use ave():

transform(df, Level = ave(Level, Country, FUN = na.omit))

#   Country Year  Level
# 1     USA 2001   rich
# 2     AUS 2002 middle
# 3     GRC 2003   poor
# 4     USA 2004   rich
# 5     AUS 2005 middle
# 6     GRC 2006   poor

Another, more accurate possibility is to use a join. Here we merge the Country column with the NA-omitted data. The outcome is the same, just in a different row order.

merge(df["Country"], na.omit(df))

#   Country Year  Level
# 1     AUS 2002 middle
# 2     AUS 2002 middle
# 3     GRC 2003   poor
# 4     GRC 2003   poor
# 5     USA 2001   rich
# 6     USA 2001   rich
like image 177
Rich Scriven Avatar answered Oct 11 '22 12:10

Rich Scriven


We can group by 'Country' and get the non-NA unique value

library(dplyr)
df %>%
    group_by(Country) %>% 
    dplyr::mutate(Level = Level[!is.na(Level)][1])
# A tibble: 6 x 3
# Groups:   Country [3]
#  Country  Year  Level
#   <fctr> <int> <fctr>
#1     USA  2001   rich
#2     AUS  2002 middle
#3     GRC  2003   poor
#4     USA  2004   rich
#5     AUS  2005 middle
#6     GRC  2006   poor

If we have loaded dplyr along with plyr, it is better to specify explicitly dplyr::mutate or dplyr::summarise so that it uses the function from dplyr. There are same functions in plyr and it could potentially mask the functions from dplyr when both are loaded creating different behavior.

like image 35
akrun Avatar answered Oct 11 '22 10:10

akrun