Filling missing levels

Tags:

r

missing-data

I have the following type of dataframe:

Country <- rep(c("USA", "AUS", "GRC"),2)
Year    <- 2001:2006
Level   <- c("rich","middle","poor",rep(NA,3))
df <- data.frame(Country, Year,Level)

df
  Country Year  Level
1     USA 2001   rich
2     AUS 2002 middle
3     GRC 2003   poor
4     USA 2004   <NA>
5     AUS 2005   <NA>
6     GRC 2006   <NA>

I want to fill the missing values in the rightmost column (Level) with the correct label for each country.

So the expected outcome should be like this:

  Country Year  Level
1     USA 2001   rich
2     AUS 2002 middle
3     GRC 2003   poor
4     USA 2004   rich
5     AUS 2005 middle
6     GRC 2006   poor

247

asked Dec 21 '17 18:12

msh855

2 Answers

In base R, you could use ave():

transform(df, Level = ave(Level, Country, FUN = na.omit))

#   Country Year  Level
# 1     USA 2001   rich
# 2     AUS 2002 middle
# 3     GRC 2003   poor
# 4     USA 2004   rich
# 5     AUS 2005 middle
# 6     GRC 2006   poor
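One caveat: FUN = na.omit relies on every group having exactly one non-NA value, which is then recycled across the group. A minimal sketch of a more defensive variant picks the first non-NA value explicitly. It assumes Level is a character column; the extra country BRA (with no known level) and the helper name first_known are hypothetical, added here only for illustration.

```r
# Same data as in the question, plus a hypothetical country with no known level
df2 <- data.frame(
  Country = c("USA", "AUS", "GRC", "USA", "AUS", "GRC", "BRA"),
  Year    = 2001:2007,
  Level   = c("rich", "middle", "poor", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Return the first non-NA value of x, or NA if the group has none
first_known <- function(x) x[!is.na(x)][1]

# Fill Level within each Country; groups without a known level stay NA
filled <- transform(df2, Level = ave(Level, Country, FUN = first_known))
filled
```

With this helper, a group that has no known level simply stays NA instead of depending on how a zero-length result is recycled.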

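Another possibility is to use a join. Here we merge the Country column with the NA-omitted data. Note that the outcome is not quite the same: besides the different row order, Year is taken from the non-NA rows, so the original years 2004 to 2006 are lost. To keep them, Year has to come from df rather than from the NA-omitted data.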

merge(df["Country"], na.omit(df))

#   Country Year  Level
# 1     AUS 2002 middle
# 2     AUS 2002 middle
# 3     GRC 2003   poor
# 4     GRC 2003   poor
# 5     USA 2001   rich
# 6     USA 2001   rich
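If the original Year values should be kept, one option (a sketch, not the answer's own code) is to take Country and Year from df and only Level from the NA-omitted rows. merge() sorts by the join key, so the rows are put back in Year order afterwards. The names known and out are chosen here for illustration, and the sketch assumes each country has exactly one row with a known level; with more than one, merge() would duplicate rows.

```r
Country <- rep(c("USA", "AUS", "GRC"), 2)
Year    <- 2001:2006
Level   <- c("rich", "middle", "poor", rep(NA, 3))
df <- data.frame(Country, Year, Level)

# One row per country with its known level
known <- na.omit(df)[c("Country", "Level")]

# Join on Country only, so Year comes from the original rows
out <- merge(df[c("Country", "Year")], known, by = "Country")

# Restore the original (chronological) row order and reset row names
out <- out[order(out$Year), ]
rownames(out) <- NULL
out
```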

177

answered Oct 11 '22 12:10

Rich Scriven

We can group by Country and take the first non-NA value of Level in each group:

library(dplyr)
df %>%
    group_by(Country) %>% 
    dplyr::mutate(Level = Level[!is.na(Level)][1])
# A tibble: 6 x 3
# Groups:   Country [3]
#  Country  Year  Level
#   <fctr> <int> <fctr>
#1     USA  2001   rich
#2     AUS  2002 middle
#3     GRC  2003   poor
#4     USA  2004   rich
#5     AUS  2005 middle
#6     GRC  2006   poor

If plyr is loaded alongside dplyr, it is safer to call dplyr::mutate or dplyr::summarise explicitly so that the dplyr version is used. plyr exports functions with the same names, and if plyr is attached after dplyr, its versions mask dplyr's. They do not respect group_by(), so the result can differ.

answered Oct 11 '22 10:10

akrun
