
Using R to insert a value for missing data with a value from another data frame

All,

I have a question that I fear might be too pedestrian to ask here, but searching for it elsewhere is leading me astray. I may not be using the right search terms.

I have a panel data frame (country-year) in R with some missing values on a given variable. I'm trying to impute them with the value from another vector in another data frame. Here's an illustration of what I am trying to do.

Assume Data is the data frame of interest, which has missing values on a given vector that I'm trying to impute from another donor data frame. It looks like this.

country    year      x
  70       1920    9.234
  70       1921    9.234
  70       1922    9.234
  70       1923    9.234
  70       1924    9.234
  80       1920      NA
  80       1921      NA
  80       1922      NA
  80       1923      NA
  80       1924      NA
  90       1920    7.562
  90       1921    7.562
  90       1922    7.562
  90       1923    7.562
  90       1924    7.562

This would be the Donor frame, which has a value for country == 80

country      x
  70       9.234
  80       1.523
  90       7.562

I'm trying to find a seamless way to automate this, beyond a command of Data$x[Data$country == 80] <- 1.523. There are a lot of countries with missingness on x.

It may be worth clarifying that a simple merge would be the easiest, but not necessarily appropriate for what I'm trying to do. Some countries will see variation on x over different years. Basically, what I'm trying to accomplish is a command that says that if the value of x is missing from Data for all years for a given country, take the corresponding value for the country from the Donor data and paste it over all country years as a "best guess" of sorts.

Thanks for any input. I suspect this is a rookie question, but I didn't know the right terms to search for it.

Reproducible code for the above data follows.

country <- c(70,70,70,70,70,80,80,80,80,80,90,90,90,90,90)
year <- c(1920,1921,1922,1923,1924,1920,1921,1922,1923,1924,1920,1921,1922,1923,1924)
x <- c(9.234,9.234,9.234,9.234,9.234,NA,NA,NA,NA,NA,7.562,7.562,7.562,7.562,7.562)

Data=data.frame(country=country,year=year,x=x)
summary(Data)

country <- c(70,80,90)
x <- c(9.234,1.523,7.562)
Donor=data.frame(country=country,x=x)
summary(Donor)
asked Dec 27 '22 01:12 by steve


1 Answer

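Using merge:

# Join the donor value onto every country-year row; the two x columns
# come back as x.Data and x.Donor.
r = merge(Data, Donor, by="country", suffixes=c(".Data", ".Donor"))
# Keep the original x where present, otherwise take the donor value.
# This assumes the rows of r line up with the rows of Data: merge() sorts
# its result by country by default, so Data must already be sorted that way.
Data$x = ifelse(is.na(r$x.Data), r$x.Donor, r$x.Data)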

If overwriting every value of x seems undesirable, use which() to overwrite only the NAs (with the same merge):

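r = merge(Data, Donor, by="country", suffixes=c(".Data", ".Donor"))
# Row positions in Data where x is missing
na.idx = which(is.na(Data$x))
# Fill only those rows (same assumption: r and Data are in the same row order)
Data[na.idx,"x"] = r[na.idx,"x.Donor"]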
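
Both snippets above fill any NA. The question asks to fill only when x is missing for all years of a country. A minimal sketch of that condition follows, using match() instead of merge() so that row order does not matter. The names all_missing and donor_x are introduced here for illustration, and the data are rebuilt from the question so the block runs on its own.

```r
# Rebuild the example data from the question
Data <- data.frame(
  country = rep(c(70, 80, 90), each = 5),
  year    = rep(1920:1924, times = 3),
  x       = c(rep(9.234, 5), rep(NA, 5), rep(7.562, 5))
)
Donor <- data.frame(country = c(70, 80, 90),
                    x       = c(9.234, 1.523, 7.562))

# TRUE on every row of a country whose x is NA in all of its years
all_missing <- as.logical(ave(is.na(Data$x), Data$country, FUN = all))

# Donor value for each row, looked up by country
# (NA if the country does not appear in Donor)
donor_x <- Donor$x[match(Data$country, Donor$country)]

# Overwrite only the rows of fully missing countries
Data$x[all_missing] <- donor_x[all_missing]
```

Countries with x observed in some years but not others are left untouched here, which may or may not be what is wanted for partially missing series.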
answered Jan 16 '23 20:01 by topchef