Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R data table: use function on group except current row

Tags:

r

data.table

Suppose I have:

x = data.table( id=c(1,1,1,2,2,2), price=c(100,110,120,200,200,220) )
> x
   id price
1:  1   100
2:  1   110
3:  1   120
4:  2   200
5:  2   200
6:  2   220

and want to find for each row the cheapest price in the group (by=id) once the current row is omitted. so the results should look like:

> x
   id price   cheapest_in_this_id_omitting_current_row
1:  1   100   110       # if I take this row out the cheapest is the next row
2:  1   110   100       # row 1
3:  1   120   100       # row 1
4:  2   200   200       # row 5
5:  2   200   200       # row 4 (or 5)
6:  2   220   200       # row 4 (or 5)

So it is like using:

x[, cheapest_by_id := min(price), id]

but removing the current row for each calculation.

If I could have a variable that refers to the current row inside the group like a .row_nb, I would use:

x[, min(price[-.row_nb]), id]

but this .row_nb does not seem to exist...?

like image 299
Timothée HENRY Avatar asked Oct 18 '25 23:10

Timothée HENRY


2 Answers

We group by 'id', use combn on the sequence of row, specify the number of elements to choose i.e. 'm' as 1 less than the number of rows (.N-1), use the output from combn as numeric index to subset the 'price', get the min and assign (:=) the output as the new column.

 x[,  cheapest_in_this_id_omitting_current_row:= 
             combn(.N:1, .N-1, FUN=function(i) min(price[i])), by = id]
x
#   id price cheapest_in_this_id_omitting_current_row
#1:  1   100                                      110
#2:  1   110                                      100
#3:  1   120                                      100
#4:  2   200                                      200
#5:  2   200                                      200
#6:  2   220                                      200

Or instead of using combn, we can loop over the sequence, use that to index the 'price', get the mean. I guess this would be fast.

 x[,cheapest_in_this_id_omitting_current_row:=
          unlist(lapply(1:.N, function(i) min(price[-i]))) , id]
like image 99
akrun Avatar answered Oct 20 '25 14:10

akrun


Here's another way:

x[order(price), min_other_p := c(price[2], rep(price[1], .N-1)), by = id]
# or
x[order(price), min_other_p := replace( rep(price[1], .N), 1, price[2] ), by = id]


   id price min_other_p
1:  1   100         110
2:  1   110         100
3:  1   120         100
4:  2   200         200
5:  2   200         200
6:  2   220         200

The order in i is not necessary in the OP's example but needed in general.


How it works. We sort the price vector in increasing order with order, so that price[1] and price[2] are the lowest two prices observed in each group. In the result, we want price[1] -- the lowest price overall -- everywhere except in position 1, where we want the next lowest price.

To be even more explicit: Suppose we have sorted so that we have sorted so that i==1 is the row with the lowest price within a group; i==2, the second lowest and so on. Then price[1] is the 1st order statistic of the vector of prices in a group and price[2] is the second order statistic of the vector of prices. It is clear that

# pseudocode
min(price[-i]) == price[2] if i==1, since price[2] == min(price[2:.N])
min(price[-i]) == price[1] otherwise, since price[1] belongs to price[-i] and is smallest
like image 42
Frank Avatar answered Oct 20 '25 13:10

Frank



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!