Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data.Table non-equi join with arithmetic operations

Tags:

r

data.table

I'm trying to do a complex self-join on a (for R) large data structure (tens to hundreds of millions of rows), so creating new columns just for this one operation is something I'd like to avoid as it would literally add gigs of memory pressure to my object, especially since I want to play with different join time parameters.

Structure of dt_sample:

str(dt_sample)
Classes ‘data.table’ and 'data.frame':  50 obs. of  6 variables:
 $ gateway_airport: chr  "BOS" "BOS" "BOS" "BOS" ...
 $ final_airport  : chr  "ORD" "BNA" "ORD" "RSW" ...
 $ dept_utc       : POSIXct, format: "2016-11-17 15:09:00" "2016-11-17 21:00:00" "2016-11-17 12:40:00" ...
 $ arriv_utc      : POSIXct, format: "2016-11-17 17:03:00" "2016-11-17 23:00:00" "2016-11-17 14:35:00" ...
 $ airlines_id    : chr  "UA" "B6" "UA" "B6" ...
 $ flight_number  : num  1472 1907 449 965 3839 ...

The idea is to do a self-join on where X's final_airport is Y's gateway_airport, and Y's dept_utc is within a range of X's arriv_UTC (in my example below between +30 minutes and +8 hours inclusive).

This works, but creates a large structure that gets filtered after the merge:

result <- merge(dt_sample, dt_sample, by.x=c("final_airport"),
                by.y=("gateway_airport"))[arriv_utc.x + 1800 <= dept_utc.y 
                                          &arriv_utc.x  +28800 >= dept_utc.y,]

I'd prefer to use the on = .() notation, but it does not seem to allow arithmetic operations on the join elements and the self-join seems to confuse it as well. Again, I'd prefer not to add new columns. Any ideas on how to use data.table effectively here?

The result of head(result) is a data.table with 3 airports per row, with the origin, intermediary stop and final airport (+ some extra info). The full result for the sample in the dput below is 19 rows long.

   final_airport gateway_airport          dept_utc.x         arriv_utc.x airlines_id.x flight_number.x final_airport          dept_utc.y         arriv_utc.y
1:           IAD             BOS 2016-11-17 14:35:00 2016-11-17 16:18:00            UA             525           JAX 2016-11-17 17:30:00 2016-11-17 19:37:00
2:           IAD             BOS 2016-11-17 14:35:00 2016-11-17 16:18:00            UA             525           SAV 2016-11-17 17:30:00 2016-11-17 19:16:00
3:           IAD             BOS 2016-11-17 14:35:00 2016-11-17 16:18:00            UA             525           TYS 2016-11-17 17:31:00 2016-11-17 19:10:00
4:           IAD             BOS 2016-11-17 14:35:00 2016-11-17 16:18:00            UA             525           DEN 2016-11-17 17:35:00 2016-11-17 19:36:00
5:           IAD             BOS 2016-11-17 14:35:00 2016-11-17 16:18:00            UA             525           GSO 2016-11-17 17:40:00 2016-11-17 19:09:00
6:           IAD             BOS 2016-11-17 14:35:00 2016-11-17 16:18:00            UA             525           LAX 2016-11-17 17:40:00 2016-11-17 20:25:00
   airlines_id.y flight_number.y
1:            AC            3891
2:            AC            2736
3:            AC            2567
4:            UA             735
5:            AC            2727
6:            UA             632

Click below to expand 50-row sample dput structure for reproducibility:

structure(list(gateway_airport = c("BOS", "BOS", "BOS", "BOS", 
"IAD", "IAD", "IAD", "BOS", "IAD", "BOS", "BOS", "BOS", "BOS", 
"IAD", "BOS", "BOS", "IAD", "BOS", "BOS", "BOS", "IAD", "BOS", 
"BOS", "BOS", "BOS", "IAD", "BOS", "IAD", "BOS", "IAD", "IAD", 
"IAD", "BOS", "IAD", "BOS", "BOS", "BOS", "IAD", "IAD", "BOS", 
"IAD", "BOS", "BOS", "BOS", "IAD", "BOS", "IAD", "BOS", "BOS", 
"IAD"), final_airport = c("ORD", "BNA", "ORD", "RSW", "ORF", 
"MCO", "DEN", "CLT", "DFW", "PHL", "ORD", "MIA", "ORD", "GSO", 
"JFK", "FLL", "ORD", "ORD", "LGA", "LGA", "LAX", "ORD", "IAD", 
"RIC", "DEN", "TYS", "SEA", "TPA", "SAV", "ROA", "SEA", "DFW", 
"PHL", "MIA", "IAH", "PHL", "LGA", "JFK", "JAX", "TPA", "TPA", 
"IAH", "DFW", "LAS", "SAV", "IAD", "LAX", "LGA", "SFO", "LAX"
), dept_utc = structure(c(1479395340, 1479416400, 1479386400, 
1479397800, 1479420600, 1479422700, 1479404100, 1479386100, 1479384840, 
1479387600, 1479378840, 1479386700, 1479402000, 1479404400, 1479403800, 
1479418500, 1479421500, 1479384000, 1479420900, 1479387600, 1479404400, 
1479412500, 1479411000, 1479381000, 1479412920, 1479403860, 1479395700, 
1479389100, 1479398400, 1479421500, 1479415200, 1479400140, 1479415440, 
1479380400, 1479406080, 1479382200, 1479413700, 1479394800, 1479403800, 
1479414300, 1479423000, 1479392520, 1479411600, 1479384000, 1479403800, 
1479393300, 1479391200, 1479400200, 1479397500, 1479420600), class = c("POSIXct", 
"POSIXt"), tzone = ""), arriv_utc = structure(c(1479402180, 1479423600, 
1479393300, 1479410880, 1479424620, 1479431160, 1479411360, 1479395520, 
1479393900, 1479393360, 1479386700, 1479400020, 1479408780, 1479409740, 
1479408240, 1479431340, 1479425280, 1479391860, 1479425640, 1479392100, 
1479414300, 1479419280, 1479417120, 1479387600, 1479422940, 1479409800, 
1479407460, 1479397800, 1479408180, 1479426180, 1479425580, 1479409500, 
1479421740, 1479390420, 1479418260, 1479387900, 1479418320, 1479399360, 
1479411420, 1479426420, 1479431940, 1479404880, 1479423900, 1479395340, 
1479410160, 1479399480, 1479401580, 1479404640, 1479411300, 1479430860
), class = c("POSIXct", "POSIXt"), tzone = ""), airlines_id = c("UA", 
"B6", "UA", "B6", "AC", "UA", "UA", "AA", "AA", "B6", "AA", "AA", 
"AA", "AC", "EI", "B6", "UA", "AA", "B6", "AA", "UA", "UA", "UA", 
"B6", "UA", "AC", "B6", "UA", "B6", "AC", "UA", "AA", "B6", "AA", 
"UA", "AA", "B6", "B6", "AC", "EI", "UA", "UA", "B6", "B6", "AC", 
"UA", "UA", "B6", "UA", "UA"), flight_number = c(1472, 1907, 
449, 965, 3839, 419, 735, 1735, 2569, 459, 1155, 1274, 1240, 
2727, 5021, 1969, 511, 1404, 1331, 2126, 632, 981, 511, 1481, 
448, 2567, 597, 2002, 49, 2726, 357, 1556, 1059, 1083, 1233, 
543, 1231, 1308, 3891, 5290, 360, 167, 1115, 1077, 2736, 525, 
470, 831, 477, 325)), .Names = c("gateway_airport", "final_airport", 
"dept_utc", "arriv_utc", "airlines_id", "flight_number"), class = c("data.table", 
"data.frame"), row.names = c(NA, -50L), .internal.selfref = <pointer: 0x2301358>)
like image 475
Serban Tanasa Avatar asked Nov 17 '16 15:11

Serban Tanasa


People also ask

What are non equi JOINs?

Non-Equi Join is also a type of INNER Join in which we need to retrieve data from multiple tables. Non-Equi Join matches the column values from different tables based on an inequality based on the operators like <,>,<=,>=,!= , BETWEEN, etc.

Can I use not equal in join?

Such joins are called non-equi JOINs, and they are also possible in SQL. When you join two tables using other conditional operators, beyond the equal sign, non-equi JOINs come into play. Comparison operators, like <, >, <=, >=, != , and <> and the BETWEEN operator work perfectly for joining tables in SQL.

Which join does not use any comparison operator?

The SQL NON EQUI JOIN uses comparison operator instead of the equal sign like >, <, >=, <= along with conditions.

What is non equi join in oracle?

What is Non-Equijoin in Oracle? The nonequijoins is such a join which match column values from different tables based on an inequality (instead of the equal sign like >, <, >=, <= ) expression. The value of the join column in each row in the source table is compared to the corresponding values in the target table.


1 Answers

As already explained in the comments, you can use non-equi join functionality to perform this kind of join. Because on the fly calculation in on argument is not yet supported you have to create those two columns before performing join. The good thing is that it doesn't cost any extra memory and is extremely fast.

setDT(dt_sample)

## OP: join, then subset
result <- merge(dt_sample, dt_sample, by.x=c("final_airport"),
                by.y=("gateway_airport"))[arriv_utc.x + 1800 <= dept_utc.y 
                                          &arriv_utc.x  +28800 >= dept_utc.y,]

## efficient: non-equi join
d = copy(dt_sample)
d[, `:=`(arriv_utc2=arriv_utc+1800, arriv_utc3=arriv_utc+28800)]
d[d, nomatch=NULL, on=.(
  final_airport == gateway_airport,
  arriv_utc2 <= dept_utc,
  arriv_utc3 >= dept_utc
  ), j=.(
    x.final_airport, x.gateway_airport, x.dept_utc, x.arriv_utc, x.airlines_id, x.flight_number, i.final_airport, i.dept_utc, i.arriv_utc, i.airlines_id, i.flight_number
  )] -> ans
setnames(result, names(ans))
all.equal(ans, result, check.attributes=FALSE, ignore.row.order=TRUE)
#[1] TRUE
like image 174
jangorecki Avatar answered Sep 17 '22 21:09

jangorecki