Define the value of a column in a dataframe based on 2 keys from a different dataframe

I have the following dataframe:

a <- seq(0, 5, by = 0.25)
b <- seq(0, 20, by = 1)
df <- data.frame(a, b)

and I'd like to create a new column "value", based on columns a and b, and the conversion table below:

a_min <- c(0, 2, 0, 2)
a_max <- c(2, 5, 2, 5)
b_min <- c(0, 0, 10, 10)
b_max <- c(10, 10, 30, 30)
output <- c(1, 2, 3, 4)

conv <- data.frame(a_min, a_max, b_min, b_max, output)

I've tried to do it using dplyr::mutate without much success...

require(dplyr)
mutate(df, value = calcula(conv, a, b))

longer object length is not a multiple of shorter object length

My expectation would be to obtain a dataframe like the 'df' above with the additional column value as per below:

df$value <- c(rep(1,8), rep(2,2), rep(4,11))
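For reference, a minimal sketch of what such a `calcula` helper could look like (`calcula` is the asker's hypothetical function, not an existing one; the interval boundaries overlap at a = 2 and b = 10, so this sketch treats both ends as inclusive and lets the later row of conv win ties, which reproduces the expected output above):

```r
# Data from the question:
df   <- data.frame(a = seq(0, 5, by = 0.25), b = seq(0, 20, by = 1))
conv <- data.frame(a_min = c(0, 2, 0, 2), a_max = c(2, 5, 2, 5),
                   b_min = c(0, 0, 10, 10), b_max = c(10, 10, 30, 30),
                   output = c(1, 2, 3, 4))

# Hypothetical `calcula`: for each (a, b) pair, find the conv rows whose
# intervals contain it, and return the output of the last matching row
# (so the later interval wins on the shared boundaries a = 2 and b = 10).
calcula <- function(conv, a, b) {
  mapply(function(x, y) {
    hit <- which(conv$a_min <= x & x <= conv$a_max &
                 conv$b_min <= y & y <= conv$b_max)
    conv$output[tail(hit, 1)]
  }, a, b)
}

df$value <- calcula(conv, df$a, df$b)
# df$value is now c(rep(1, 8), rep(2, 2), rep(4, 11))
```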
— asked Jan 17 '17 by ines vidal


4 Answers

A possible, relatively simple and very efficient data.table solution using binary non-equi joins:

library(data.table) # v1.10.0
setDT(conv)[setDT(df), output, on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b)]
## [1] 1 1 1 1 1 1 1 1 1 2 2 2 4 4 4 4 4 4 4 4 4 4 4

As a side note, if the output column is just the row index within conv, you could make this join even more efficient by asking only for the row indices, specifying which = TRUE:

setDT(conv)[setDT(df), on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b), which = TRUE]
## [1] 1 1 1 1 1 1 1 1 1 2 2 2 4 4 4 4 4 4 4 4 4 4 4
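If the goal is the asker's df with a single value per row, one possible variation (an assumption about the desired boundary behaviour, not part of the answer above) is to add mult = "last", so that each row of df gets exactly one match and the later conv interval wins where the inclusive boundaries overlap:

```r
library(data.table)

# Data from the question:
df   <- data.frame(a = seq(0, 5, by = 0.25), b = seq(0, 20, by = 1))
conv <- data.frame(a_min = c(0, 2, 0, 2), a_max = c(2, 5, 2, 5),
                   b_min = c(0, 0, 10, 10), b_max = c(10, 10, 30, 30),
                   output = c(1, 2, 3, 4))

# mult = "last" returns exactly one matching conv row per row of df,
# letting the later interval win on the shared boundaries a = 2, b = 10:
df$value <- setDT(conv)[setDT(df), output,
  on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b),
  mult = "last"]
```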
— answered Oct 16 '22 by David Arenburg


One more option, this time with matrices.

with(df, with(conv, output[max.col(
    outer(a, a_min, `>=`) + outer(a, a_max, `<=`) +
    outer(b, b_min, `>=`) + outer(b, b_max, `<=`))]))

## [1] 1 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4

outer compares each element of the vector from df with each element of the one from conv, producing a matrix of Booleans for each call. Since TRUE is 1, if you add all four matrices, the index you want is the column with the most TRUEs, which you can get with max.col. Subset output, and you've got your result.
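To see the mechanism on a toy case (a two-value, two-interval illustration, not taken from the answer itself):

```r
a     <- c(1, 3)        # two values to classify
a_min <- c(0, 2)        # two intervals: [0, 2] and [2, 5]
a_max <- c(2, 5)

outer(a, a_min, `>=`)   # rows = values, columns = intervals
#       [,1]  [,2]
# [1,]  TRUE FALSE
# [2,]  TRUE  TRUE

# Summing the lower- and upper-bound matrices, the column with the
# largest sum (most TRUEs) is the interval containing the value:
max.col(outer(a, a_min, `>=`) + outer(a, a_max, `<=`))
# [1] 1 2
```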

The benefit of working with matrices is that they're fast. Using @Phann's benchmarks on 1,000 rows:

Unit: microseconds
      expr       min         lq       mean     median         uq       max neval   cld
 alistaire   276.099   320.4565   349.1045   339.8375   357.2705   941.551   100 a    
      akr1   830.934   966.6705  1064.8433  1057.6610  1152.3565  1507.180   100 ab   
      akr2 11431.246 11731.3125 12835.5229 11947.5775 12408.4715 36767.488   100    d 
       Pha 11985.129 12403.1095 13330.1465 12660.4050 13044.9330 29653.842   100    d 
       Ron 71132.626 74300.3540 81136.9408 78034.2275 88952.8765 98950.061   100     e
      Dav1  2506.205  2765.4095  2971.6738  2948.6025  3082.4025  4065.368   100   c  
      Dav2  2104.481  2272.9180  2480.9570  2478.8775  2575.8740  3683.896   100  bc  

and on 100,000 rows:

Unit: milliseconds
      expr      min       lq     mean   median       uq       max neval cld
 alistaire 30.00677 36.49348 44.28828 39.43293 54.28207  64.36581   100 a  
      akr1 36.24467 40.04644 48.46986 41.59644 60.15175  77.34415   100 a  
      Dav1 51.74218 57.23488 67.70289 64.11002 68.86208 382.25182   100   c
      Dav2 48.48227 54.82818 60.25256 59.81041 64.92611  91.20212   100  b 
— answered Oct 17 '22 by alistaire


We can try Map with na.locf:

library(zoo)
f1 <- function(u, v, x, y, z) z * NA^!((with(df, a >= u & a < v) & (b >= x & b < y)))
na.locf(do.call(pmax, c(do.call(Map, c(f=f1, unname(conv))), na.rm = TRUE)))
#[1] 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4 4

Another way to write the Map solution is to pass the 'a' and 'b' columns as arguments, do the logical comparison against the columns of 'conv' to extract the matching 'output' value, and then unlist the list output:

unlist(Map(function(x, y) 
     with(conv, output[x >= a_min & a_max > x & y >= b_min & b_max > y]), 
                          df$a, df$b))
#[1] 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4

NOTE: The second solution should be slower, as it loops through the rows of the dataset, while the first solution loops through the rows of 'conv' (which we assume has far fewer rows).

— answered Oct 16 '22 by akrun


Another approach using apply (note that this relies on a_min, a_max, b_min, b_max and output still being available as standalone vectors in the workspace, not just as columns of conv):

df$value <- unlist(apply(df, 1, function(x){
    ifelse(length(OUT <- output[which(x[1] >= a_min & x[1] <= a_max & x[2] >= b_min & x[2] <= b_max)]) > 0, OUT, 0)
}))

EDIT:

Because there are several answers so far, I checked the time needed to process the data. I created a slightly bigger example (similar to the given one, with random numbers):

set.seed(23563)
a <- runif(1000, 0, 5)
b <- runif(1000, 0, 20)
df <- data.frame(a, b)


require(microbenchmark)
library(zoo)
require(data.table)

microbenchmark(
  akr1 = { #akrun 1
    f1 <- function(u, v, x, y, z) z * NA^!((with(df, a >= u & a <v) & (b >=x & b <y)))
    na.locf(do.call(pmax, c(do.call(Map, c(f=f1, unname(conv))), na.rm = TRUE)))
  },
  akr2 = { #akrun 2
    unlist(Map(function(x, y) 
      with(conv, output[x >= a_min & a_max > x & y >= b_min & b_max > y]), 
      df$a, df$b))
  },
  Pha = { #Phann
    df$value <- unlist(apply(df, 1, function(x){
      ifelse(length(OUT <- output[which(x[1] >= a_min & x[1] <= a_max & x[2] >= b_min & x[2] <= b_max)]) > 0, OUT, 0)
    }))
  }, 
  Ron = { #Ronak Shah
    unlist(mapply(function(x, y) 
      conv$output[x >= conv$a_min & conv$a_max > x & y >= conv$b_min & conv$b_max > y], 
      df$a, df$b))
  },
  Dav1 ={ #David Arenburg 1
    setDT(conv)[setDT(df), on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b)]$output
  },
  Dav2 = { #David Arenburg 2
    setDT(conv)[setDT(df), on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b), which = TRUE]
  },
  times = 100L
)

With 1000 random numbers:

# Unit: milliseconds
# expr        min         lq       mean     median         uq       max neval
# akr1   4.267206   4.749576   6.259695   5.351494   6.843077  54.39187   100
# akr2  33.437853  39.912785  49.932875  47.416888  57.070369  91.55602   100
# Pha   30.433779  36.939692  48.205592  46.393800  55.800204  83.91640   100
# Ron  174.765021 199.648315 227.493117 223.314661 240.579057 370.26929   100
# Dav1   6.944759   7.814469  10.685460   8.536694  11.974102  44.47915   100
# Dav2   6.106978   6.706424   8.961821   8.161707  10.376085  28.91255   100

With 10000 random numbers (same seed), I get:

# Unit: milliseconds
# expr        min         lq       mean     median         uq        max neval
# akr1   23.48180   24.03962   26.16747   24.46897   26.19565   41.83238   100
# akr2  357.38290  398.69965  434.92052  409.15385  440.98210  829.85113   100
# Pha   320.39285  347.66632  376.98118  361.76852  383.08231  681.28500   100
# Ron  1661.50669 1788.06228 1873.70929 1837.28187 1912.04123 2499.23235   100
# Dav1   20.91486   21.60953   23.12278   21.94707   22.42773   44.71900   100
# Dav2   19.69506   20.22077   21.63715   20.55793   21.27578   38.96819   100
— answered Oct 17 '22 by Phann