Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In R create new column with name of column closest in value to target

Tags:

r

dplyr

I'm analyzing donor data from an appeal. Let's say we have this data frame, which has an ID, the amount given, and the ask amounts from an appeal card:

dfgive <- data.frame(id = c(1, 2, 3, 4, 5),
   gift = c(20, 1000, 1500, 300, 500),
   ask1 = c(50, 5000, 5000, 100, 2500),
   ask2 = c(100, 3750, 3750, 250, 5000),
   ask3 = c(250, 2500, 2500, 500, 10000),
   gift_diff_1 = c(-30, -4000, -3500, 200, -2000),
   gift_diff_2 = c(-80, -2750, -2250, 50, -4500),
   gift_diff_3 = c(-230, -1500, -1000, -200, -9500),
   mindiff = c(-230, -4000, -3500, -200, -9500))

The gift difference from each ask amount and the minimum difference were created with the code below.

mutate(gift_diff_1 = gift - ask1) %>%
   mutate(gift_diff_2 = gift - ask2) %>%
   mutate(gift_diff_3 = gift - ask3) %>%
   mutate(mindiff=min(gift_diff_1, gift_diff_2, gift_diff_3))

What I'm hoping to do next is create a new column called gift_closest that would identify which of the ask amounts (ask1, ask2, or ask3) is closest in absolute value to the gift. We could use either the derived gift_diff columns or do it wholly within the equation. I can't get my head around how to best do it.

The value of the new column would be one of "ask1", "ask2", or "ask3". Ideally in a dplyr chain as that's where I'm creating the entire data set which will include a bunch of demographic fields.

thanks...

like image 850
greg dubrow Avatar asked Dec 18 '22 11:12

greg dubrow


2 Answers

You might consider reshaping the data if you have many different ask columns, to reduce duplication of code. The result below can be joined back onto the original dfgive if you want to keep other columns in that dataframe.

library(tidyverse)
dfgive <- data.frame(
  id = c(1, 2, 3, 4, 5),
  gift = c(20, 1000, 1500, 300, 500),
  ask1 = c(50, 5000, 5000, 100, 2500),
  ask2 = c(100, 3750, 3750, 250, 5000),
  ask3 = c(250, 2500, 2500, 500, 10000)
)

dfgive %>%
  pivot_longer(
    cols = starts_with("ask"),
    names_to = "ask_num",
    values_to = "ask"
  ) %>%
  mutate(gift_diff = gift - ask) %>%
  group_by(id) %>%
  summarise(
    mindiff = min(gift_diff),
    gift_closest = ask_num[which.min(abs(gift_diff))]
  )
#> # A tibble: 5 x 3
#>      id mindiff gift_closest
#>   <dbl>   <dbl> <chr>       
#> 1     1    -230 ask1        
#> 2     2   -4000 ask3        
#> 3     3   -3500 ask3        
#> 4     4    -200 ask2        
#> 5     5   -9500 ask1
like image 77
Calum You Avatar answered Dec 20 '22 00:12

Calum You


First of all: I guess your mindiff-part won't do what you expect it to do. Replace min with pmin.

library(dplyr)

dfgive %>%
  mutate(abs = pmin(abs(gift - ask1), abs(gift - ask2), abs(gift - ask3)),
         gift_close = case_when(abs(gift - ask1) == abs ~ "ask1",
                                abs(gift - ask2) == abs ~ "ask2",
                                abs(gift - ask3) == abs ~ "ask3",
                                TRUE ~ NA_character_))

returns

  id gift ask1 ask2  ask3 gift_diff_1 gift_diff_2 gift_diff_3 mindiff  abs gift_close
1  1   20   50  100   250         -30         -80        -230    -230   30       ask1
2  2 1000 5000 3750  2500       -4000       -2750       -1500   -4000 1500       ask3
3  3 1500 5000 3750  2500       -3500       -2250       -1000   -3500 1000       ask3
4  4  300  100  250   500         200          50        -200    -200   50       ask2
5  5  500 2500 5000 10000       -2000       -4500       -9500   -9500 2000       ask1
like image 26
Martin Gal Avatar answered Dec 20 '22 01:12

Martin Gal