Filling missing values based on a relationship with another table

Tags: r, data.table

I have two data tables, city_pop and city_sub. city_pop lists cities with their average population (avg_pop), and some of those values are missing. For each city, city_sub gives two candidate substitute cities (sub_1 and sub_2) whose avg_pop can be used to fill an NA in city_pop. sub_1 should be preferred, with sub_2 used only if sub_1's value is also missing. Only the NA values in avg_pop should be replaced.

How can I do this without using for loops?

library(data.table)

city_id = c(1, 2, 3, 4, 5, 6)
avg_pop = c(100, NA, NA, 300, 400, NA)

city_pop = data.table(city_id, avg_pop)

   city_id avg_pop
1:       1     100
2:       2      NA
3:       3      NA
4:       4     300
5:       5     400
6:       6      NA

sub_1 = c(2, 1, 4, 3, 1, 3)
sub_2 = c(5, 5, 6, 6, 2, 4)

city_sub = data.table(city_id, sub_1, sub_2)

   city_id sub_1 sub_2
1:       1     2     5
2:       2     1     5
3:       3     4     6
4:       4     3     6
5:       5     1     2
6:       6     3     4

Expected output:

  city_id avg_pop
1       1     100
2       2     100
3       3     300
4       4     300
5       5     400
6       6     300
Bhagya asked Aug 11 '19 00:08


3 Answers

Here's a way with dplyr using coalesce(), which returns the first non-NA value at each position. I created a separate column, avg_pop2, because that seems safer in this case and makes the result easy to validate.

library(dplyr)

city_pop %>% 
  left_join(city_sub, by = "city_id") %>% 
  mutate(
    avg_pop2 = coalesce(
      avg_pop, avg_pop[match(sub_1, city_id)], avg_pop[match(sub_2, city_id)]
    )
  )

  city_id avg_pop sub_1 sub_2 avg_pop2
1       1     100     2     5      100
2       2      NA     1     5      100
3       3      NA     4     6      300
4       4     300     3     6      300
5       5     400     1     2      400
6       6      NA     3     4      300
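
For readers who would rather stay in data.table, the same idea can be sketched with fcoalesce(), data.table's counterpart to coalesce(). This is only a sketch and not part of the original answer: it assumes a reasonably recent data.table that provides fcoalesce(), and it rebuilds the question's sample data so it runs on its own.

```r
library(data.table)

city_pop <- data.table(city_id = 1:6, avg_pop = c(100, NA, NA, 300, 400, NA))
city_sub <- data.table(city_id = 1:6,
                       sub_1 = c(2, 1, 4, 3, 1, 3),
                       sub_2 = c(5, 5, 6, 6, 2, 4))

# Attach the substitute ids to each city.
filled <- merge(city_pop, city_sub, by = "city_id")

# Take the first non-NA of: the city's own value, sub_1's value, sub_2's value.
# match() finds the row of each substitute city so its avg_pop can be looked up.
filled[, avg_pop2 := fcoalesce(
  avg_pop,
  avg_pop[match(sub_1, city_id)],
  avg_pop[match(sub_2, city_id)]
)]

filled$avg_pop2
```

As in the dplyr version, every lookup reads the original avg_pop column, so a substitute that is itself missing is skipped rather than filled first.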
Shree answered Oct 17 '22 11:10


One way would be to look up sub_1, then look up its avg_pop; then do the same for sub_2:

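# first pass: for rows with a missing avg_pop, find sub_1 and take its avg_pop
city_pop[is.na(avg_pop), avg_pop :=  
  city_pop[.(city_sub[.SD, on=.(city_id), x.sub_1]), on=.(city_id), x.avg_pop]
]
# second pass: rows that are still missing fall back to sub_2
city_pop[is.na(avg_pop), avg_pop := 
  city_pop[.(city_sub[.SD, on=.(city_id), x.sub_2]), on=.(city_id), x.avg_pop]
]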
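
The logic of those two nested joins may be easier to see in base R. The following is a rough sketch, not the answer's own code: fill_from is a hypothetical helper name, and it assumes the substitute vectors are in the same row order as city_id, as they are in the question's data.

```r
city_id <- c(1, 2, 3, 4, 5, 6)
avg_pop <- c(100, NA, NA, 300, 400, NA)
sub_1 <- c(2, 1, 4, 3, 1, 3)
sub_2 <- c(5, 5, 6, 6, 2, 4)

# Hypothetical helper: for each missing entry, look up the substitute city's
# position with match() and copy that city's population (which may itself be NA).
fill_from <- function(pop, ids, sub) {
  missing <- is.na(pop)
  pop[missing] <- pop[match(sub[missing], ids)]
  pop
}

filled <- fill_from(avg_pop, city_id, sub_1)  # first preference
filled <- fill_from(filled, city_id, sub_2)   # fallback for what is still NA
filled
```

Like the data.table version, the second pass reads the values produced by the first pass, so a city filled via sub_1 can in turn serve as another city's sub_2.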

The nested join above is somewhat convoluted and would not work for more general examples. A graph theory approach might make more sense, e.g., if city_sub looks like this:

   city_id sub_1 
1:       1     5 
5:       5     3 

Suppose cities 1 and 5 both have missing data. We would expect 5 to be filled from 3, then 1 to be filled from 5, but this requires knowing the order in which to fill. With a directed graph, you could figure out the right order of traversal, I guess, though I haven't thought through the details.
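
One simple stand-in for a proper graph traversal is to repeat the fill until a pass changes nothing. The sketch below does that in base R for the chained case just described. It is an assumption-laden illustration only: fill_chain is a hypothetical name, the population of 250 for city 3 is made up, and only a single substitute column is handled.

```r
# Chained case: city 1 -> city 5 -> city 3; only city 3 has a value.
ids <- c(1, 3, 5)
pop <- c(NA, 250, NA)   # 250 is an invented value for illustration
sub <- c(5, NA, 3)      # city 3 has no substitute listed

# Hypothetical helper: keep filling from substitutes until no NA can be filled.
fill_chain <- function(pop, ids, sub) {
  repeat {
    candidate <- pop[match(sub, ids)]             # current value of each substitute
    fillable <- is.na(pop) & !is.na(candidate)    # missing, and substitute is known
    if (!any(fillable)) break                     # nothing left to fill: stop
    pop[fillable] <- candidate[fillable]
  }
  pop
}

fill_chain(pop, ids, sub)
# 250 250 250
```

Each pass fills at least one NA or stops, so the loop terminates; cities whose substitutes only point at each other in a cycle simply stay NA.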

Frank answered Oct 17 '22 12:10


Another possible approach is to convert city_sub into long format and encode the order of preference in the decimal place of city_id (e.g., 2.1 for city 2's sub_1 and 2.2 for its sub_2) before using a rolling join:

# convert into long format
newpop <- melt(city_sub, measure.vars=patterns("^sub_"), variable.factor=FALSE)[,
    # tweak the city_id slightly to show order of preference
    city_id := as.numeric(paste0(city_id, ".", substring(variable, nchar(variable))))][
    # look up average population
    city_pop, on=.(value=city_id), new_pop := i.avg_pop][
    # remove cities without population
    !is.na(new_pop)]
newpop
#   city_id variable value new_pop
#1:     2.1    sub_1     1     100
#2:     3.1    sub_1     4     300
#3:     5.1    sub_1     1     100
#4:     1.2    sub_2     5     400
#5:     2.2    sub_2     5     400
#6:     6.2    sub_2     4     300

#rolling join
city_pop[is.na(avg_pop), avg_pop :=
        newpop[copy(.SD), on=.(city_id), roll=-Inf, x.new_pop]]

output:

   city_id avg_pop
1:       1     100
2:       2     100
3:       3     300
4:       4     300
5:       5     400
6:       6     300
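
One thing to keep in mind with roll=-Inf is that it carries the next available value backward with no distance limit. The toy example below is a sketch with made-up ids and values (lookup and query are hypothetical names, not objects from the answer); it suggests that a city with no usable substitute of its own would pick up the next city's substitute.

```r
library(data.table)

# Hypothetical lookup table in the same "id.preference" encoding:
# city 2 has two usable substitutes, city 5 has one, city 3 has none.
lookup <- data.table(id = c(1.1, 2.1, 2.2, 5.1), val = c(5, 10, 20, 50))
query <- data.table(id = c(2, 3))

# Unbounded backward roll: each query id takes the next id found in lookup.
unbounded <- lookup[query, on = .(id), roll = -Inf, x.val]
unbounded
# 10 50   (city 3 borrows city 5's substitute)

# A finite negative roll caps how far back a value may be carried,
# which may be preferable when some cities can lack substitutes.
bounded <- lookup[query, on = .(id), roll = -0.5, x.val]
```

With the question's data every missing city has at least one usable substitute, so the unbounded roll gives the expected result there.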

data:

library(data.table)
city_pop = data.table(city_id=1:6, avg_pop=c(100, NA, NA, 300, 400, NA))
city_sub = data.table(city_id=1:6, sub_1=c(2,1,4,3,1,3), sub_2=c(5,5,6,6,2,4))
chinsoon12 answered Oct 17 '22 11:10