 

Find nearest matches for each row and sum based on a condition

Consider the following data.table of events:

library(data.table)
breaks <- data.table(id = 1:8,
                     Channel = c("NP1", "NP1", "NP2", "NP2", "NP3", "NP3", "AT4", "AT4"),
                     Time = c(1000, 1100, 975, 1075, 1010, 1080, 1000, 1050),
                     Day = c(1, 1, 1, 1, 1, 1, 1, 1),
                     ZA = c(15, 12, 4, 2, 1, 2, 23, 18),
                     stringsAsFactors = F)

breaks
   id Channel Time Day ZA
1:  1     NP1 1000   1 15
2:  2     NP1 1100   1 12
3:  3     NP2  975   1  4
4:  4     NP2 1075   1  2
5:  5     NP3 1010   1  1
6:  6     NP3 1080   1  2
7:  7     AT4 1000   1 23
8:  8     AT4 1050   1 18

For each unique event in breaks I want to find the nearest event in each of the other channels (nearest in Time, restricted to the same Day) and then sum the ZA values of those events.

This is the result I want to achieve:

   id Channel Time Day ZA Sum
1:  1     NP1 1000   1 15  28
2:  2     NP1 1100   1 12  22
3:  3     NP2  975   1  4  39
4:  4     NP2 1075   1  2  32
5:  5     NP3 1010   1  1  42
6:  6     NP3 1080   1  2  32
7:  7     AT4 1000   1 23  20
8:  8     AT4 1050   1 18  19

So for the first row the channel is NP1. The closest events in all other channels to Time = 1000 are rows 3, 5 and 7, so the sum is 4 + 1 + 23 = 28.

I got this to work using data.table with the following code:

breaks[breaks[, c("Day", "Time", "Channel", "ZA")], on = "Day", allow.cartesian = TRUE][
  Channel != i.Channel][
    order(id)][
      , delta := abs(Time - i.Time)][
        , .SD[delta == min(delta)], by = .(Channel, Time, Day, i.Channel)][
          , unique(.SD, by = c("id", "i.Channel"))][
            , .(Sum = sum(i.ZA)), by = .(id, Channel, Time, Day, ZA)]

However, this creates a 64-row (8 × 8) dataset in the first step, and I need to do this on a dataset of more than a million rows.

Can anyone help me find a more efficient way of doing this?
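
For illustration, a per-channel rolling join with roll = "nearest" might avoid materialising the cartesian join. This is only a sketch (untested at scale, and its tie-breaking may differ from the min(delta) logic above):

library(data.table)

# For every other channel, roll each event onto that channel's events with
# roll = "nearest" (matching Day exactly, taking the closest Time), then sum
# the matched ZA values per id. Ties in Time distance are broken by
# data.table's rolling-join rules, which may not match min(delta) exactly.
nearest <- rbindlist(lapply(unique(breaks$Channel), function(ch) {
  other <- breaks[Channel == ch, .(Day, Time, ZA)]   # candidate matches in channel ch
  qry   <- breaks[Channel != ch, .(id, Day, Time)]   # events that need a match in ch
  other[qry, on = .(Day, Time), roll = "nearest"][, .(id, ZA)]
}))

# NA would appear here for any Day that has no event in a given channel
breaks[nearest[, .(Sum = sum(ZA)), keyby = id], on = "id"]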

Edit:

I tried out the solutions of G. Grothendieck (sqldf), eddi (data.table) and MarkusN (dplyr) once on the full dataset of 1.4 million rows with 39 different channels. The dataset was in-memory.

sqldf:      54 minutes
data.table: 11 hours
dplyr:      29 hours
asked Sep 20 '17 by Dahbid

1 Answer

In the inner select, self-join each row in breaks to the rows on the same day and in a different channel; then, among all rows joined to a particular original row, keep only the joined row having the minimum absolute time difference. In the outer select, sum the ZA values from the other channels within each id, giving the result.

Note that we are assuming the default SQLite backend to sqldf here and are using a feature specific to that database, namely, that if min is used in a select then the other values specified in that select will also be populated from the minimizing row.
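
For example, on a hypothetical toy table (not part of the data above, just to illustrate that SQLite behaviour), the bare val column is filled from the row that attains the minimum within each group:

library(sqldf)

# Toy data, purely illustrative: `val` is a bare (non-aggregated) column.
toy <- data.frame(grp = c("a", "a", "b", "b"),
                  x   = c(3, 1, 5, 2),
                  val = c(30, 10, 50, 20))

# SQLite fills the bare column from the row achieving min(x) in each group,
# so val comes out as 10 and 20 rather than an arbitrary row's value.
sqldf("select grp, val, min(x) as min_x from toy group by grp")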

By default it will use an in-memory database, which is best if the data fits, but if you specify dbname = tempfile() as an argument to sqldf it will use a file as an out-of-memory database instead. It would also be possible to add one or more indexes, which may or may not speed it up. See the sqldf github home page for more examples.
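
For instance, the query shown just below (stored here in a variable qry for brevity; the variable name is just for this illustration) can be pointed at a temporary database file like this:

library(sqldf)

# Same query as shown below, kept in a string so the call stays short.
qry <- "select id, Channel, Time, Day, ZA, sum(bZA) Sum
 from (
   select a.*, b.ZA bZA, min(abs(a.Time - b.Time))
   from breaks a join breaks b on a.Day = b.Day and a.Channel != b.Channel
   group by a.id, b.Channel)
 group by id"

# dbname = tempfile() makes sqldf work against an on-disk SQLite file
# instead of the default in-memory database.
sqldf(qry, dbname = tempfile())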

library(sqldf)

sqldf("select id, Channel, Time, Day, ZA, sum(bZA) Sum
 from (
   select a.*, b.ZA bZA, min(abs(a.Time - b.Time))
   from breaks a join breaks b on a.Day = b.Day and a.Channel != b.Channel
   group by a.id, b.Channel)
 group by id")

giving:

  id Channel Time Day ZA Sum
1  1     NP1 1000   1 15  28
2  2     NP1 1100   1 12  22
3  3     NP2  975   1  4  39
4  4     NP2 1075   1  2  32
5  5     NP3 1010   1  1  42
6  6     NP3 1080   1  2  32
7  7     AT4 1000   1 23  20
8  8     AT4 1050   1 18  19

This is slightly faster than the data.table code in the question on a problem of this size but for larger problems the comparison would have to be redone.

Also, it may be able to handle larger sizes because it does not have to materialize the intermediate results (depending on the query optimizer) and because it can work out of memory (if need be).

library(data.table)
library(dplyr)
library(sqldf)
library(rbenchmark)

benchmark(sqldf = 
sqldf("select id, Channel, Time, Day, ZA, sum(bZA) Sum
 from (
   select a.*, b.ZA bZA, min(abs(a.Time - b.Time))
   from breaks a join breaks b on a.Day = b.Day and a.Channel != b.Channel
   group by a.id, b.Channel)
 group by id"),

data.table = breaks[breaks[, c("Day", "Time", "Channel", "ZA")], on = "Day",
     allow.cartesian = TRUE][
  Channel != i.Channel][
    order(id)][
      , delta := abs(Time - i.Time)][
        , .SD[delta == min(delta)], by = .(Channel, Time, Day, i.Channel)][
          , unique(.SD, by = c("id", "i.Channel"))][
            , .(Sum = sum(i.ZA)), by = .(id, Channel, Time, Day, ZA)],

dplyr = { breaks %>% 
  inner_join(breaks, by=c("Day"), suffix=c("",".y")) %>%
  filter(Channel != Channel.y) %>%
  group_by(id, Channel, Time, Day, ZA, Channel.y) %>%
  arrange(abs(Time - Time.y)) %>%
  filter(row_number()==1) %>%
  group_by(id, Channel, Time, Day, ZA) %>%
  summarise(Sum=sum(ZA.y)) %>%                           
  ungroup() %>% 
  select(id:Sum) },

order = "elapsed")[1:4]

giving:

        test replications elapsed relative
1      sqldf          100    3.38    1.000
2 data.table          100    4.05    1.198
3      dplyr          100    9.23    2.731
answered Nov 09 '22 by G. Grothendieck