I'm trying to do a complex self-join on a (for R) large data structure (tens to hundreds of millions of rows), so creating new columns just for this one operation is something I'd like to avoid as it would literally add gigs of memory pressure to my object, especially since I want to play with different join time parameters.
Structure of dt_sample
:
str(dt_sample)
Classes ‘data.table’ and 'data.frame': 50 obs. of 6 variables:
$ gateway_airport: chr "BOS" "BOS" "BOS" "BOS" ...
$ final_airport : chr "ORD" "BNA" "ORD" "RSW" ...
$ dept_utc : POSIXct, format: "2016-11-17 15:09:00" "2016-11-17 21:00:00" "2016-11-17 12:40:00" ...
$ arriv_utc : POSIXct, format: "2016-11-17 17:03:00" "2016-11-17 23:00:00" "2016-11-17 14:35:00" ...
$ airlines_id : chr "UA" "B6" "UA" "B6" ...
$ flight_number : num 1472 1907 449 965 3839 ...
The idea is to do a self-join on where X's final_airport is Y's gateway_airport, and Y's dept_utc is within a range of X's arriv_UTC (in my example below between +30 minutes and +8 hours inclusive).
This works, but creates a large structure that gets filtered after the merge:
result <- merge(dt_sample, dt_sample, by.x=c("final_airport"),
by.y=("gateway_airport"))[arriv_utc.x + 1800 <= dept_utc.y
&arriv_utc.x +28800 >= dept_utc.y,]
I'd prefer to use the on = .()
notation, but it does not seem to allow arithmetic operations on the join elements and the self-join seems to confuse it as well. Again, I'd prefer not to add new columns. Any ideas on how to use data.table effectively here?
The result of head(result) is a data.table with 3 airports per row, with the origin, intermediary stop and final airport (+ some extra info). The full result for the sample in the dput below is 19 rows long.
final_airport gateway_airport dept_utc.x arriv_utc.x airlines_id.x flight_number.x final_airport dept_utc.y arriv_utc.y
1: IAD BOS 2016-11-17 14:35:00 2016-11-17 16:18:00 UA 525 JAX 2016-11-17 17:30:00 2016-11-17 19:37:00
2: IAD BOS 2016-11-17 14:35:00 2016-11-17 16:18:00 UA 525 SAV 2016-11-17 17:30:00 2016-11-17 19:16:00
3: IAD BOS 2016-11-17 14:35:00 2016-11-17 16:18:00 UA 525 TYS 2016-11-17 17:31:00 2016-11-17 19:10:00
4: IAD BOS 2016-11-17 14:35:00 2016-11-17 16:18:00 UA 525 DEN 2016-11-17 17:35:00 2016-11-17 19:36:00
5: IAD BOS 2016-11-17 14:35:00 2016-11-17 16:18:00 UA 525 GSO 2016-11-17 17:40:00 2016-11-17 19:09:00
6: IAD BOS 2016-11-17 14:35:00 2016-11-17 16:18:00 UA 525 LAX 2016-11-17 17:40:00 2016-11-17 20:25:00
airlines_id.y flight_number.y
1: AC 3891
2: AC 2736
3: AC 2567
4: UA 735
5: AC 2727
6: UA 632
Click below to expand 50-row sample dput structure for reproducibility:
structure(list(gateway_airport = c("BOS", "BOS", "BOS", "BOS",
"IAD", "IAD", "IAD", "BOS", "IAD", "BOS", "BOS", "BOS", "BOS",
"IAD", "BOS", "BOS", "IAD", "BOS", "BOS", "BOS", "IAD", "BOS",
"BOS", "BOS", "BOS", "IAD", "BOS", "IAD", "BOS", "IAD", "IAD",
"IAD", "BOS", "IAD", "BOS", "BOS", "BOS", "IAD", "IAD", "BOS",
"IAD", "BOS", "BOS", "BOS", "IAD", "BOS", "IAD", "BOS", "BOS",
"IAD"), final_airport = c("ORD", "BNA", "ORD", "RSW", "ORF",
"MCO", "DEN", "CLT", "DFW", "PHL", "ORD", "MIA", "ORD", "GSO",
"JFK", "FLL", "ORD", "ORD", "LGA", "LGA", "LAX", "ORD", "IAD",
"RIC", "DEN", "TYS", "SEA", "TPA", "SAV", "ROA", "SEA", "DFW",
"PHL", "MIA", "IAH", "PHL", "LGA", "JFK", "JAX", "TPA", "TPA",
"IAH", "DFW", "LAS", "SAV", "IAD", "LAX", "LGA", "SFO", "LAX"
), dept_utc = structure(c(1479395340, 1479416400, 1479386400,
1479397800, 1479420600, 1479422700, 1479404100, 1479386100, 1479384840,
1479387600, 1479378840, 1479386700, 1479402000, 1479404400, 1479403800,
1479418500, 1479421500, 1479384000, 1479420900, 1479387600, 1479404400,
1479412500, 1479411000, 1479381000, 1479412920, 1479403860, 1479395700,
1479389100, 1479398400, 1479421500, 1479415200, 1479400140, 1479415440,
1479380400, 1479406080, 1479382200, 1479413700, 1479394800, 1479403800,
1479414300, 1479423000, 1479392520, 1479411600, 1479384000, 1479403800,
1479393300, 1479391200, 1479400200, 1479397500, 1479420600), class = c("POSIXct",
"POSIXt"), tzone = ""), arriv_utc = structure(c(1479402180, 1479423600,
1479393300, 1479410880, 1479424620, 1479431160, 1479411360, 1479395520,
1479393900, 1479393360, 1479386700, 1479400020, 1479408780, 1479409740,
1479408240, 1479431340, 1479425280, 1479391860, 1479425640, 1479392100,
1479414300, 1479419280, 1479417120, 1479387600, 1479422940, 1479409800,
1479407460, 1479397800, 1479408180, 1479426180, 1479425580, 1479409500,
1479421740, 1479390420, 1479418260, 1479387900, 1479418320, 1479399360,
1479411420, 1479426420, 1479431940, 1479404880, 1479423900, 1479395340,
1479410160, 1479399480, 1479401580, 1479404640, 1479411300, 1479430860
), class = c("POSIXct", "POSIXt"), tzone = ""), airlines_id = c("UA",
"B6", "UA", "B6", "AC", "UA", "UA", "AA", "AA", "B6", "AA", "AA",
"AA", "AC", "EI", "B6", "UA", "AA", "B6", "AA", "UA", "UA", "UA",
"B6", "UA", "AC", "B6", "UA", "B6", "AC", "UA", "AA", "B6", "AA",
"UA", "AA", "B6", "B6", "AC", "EI", "UA", "UA", "B6", "B6", "AC",
"UA", "UA", "B6", "UA", "UA"), flight_number = c(1472, 1907,
449, 965, 3839, 419, 735, 1735, 2569, 459, 1155, 1274, 1240,
2727, 5021, 1969, 511, 1404, 1331, 2126, 632, 981, 511, 1481,
448, 2567, 597, 2002, 49, 2726, 357, 1556, 1059, 1083, 1233,
543, 1231, 1308, 3891, 5290, 360, 167, 1115, 1077, 2736, 525,
470, 831, 477, 325)), .Names = c("gateway_airport", "final_airport",
"dept_utc", "arriv_utc", "airlines_id", "flight_number"), class = c("data.table",
"data.frame"), row.names = c(NA, -50L), .internal.selfref = <pointer: 0x2301358>)
Non-Equi Join is also a type of INNER Join in which we need to retrieve data from multiple tables. Non-Equi Join matches the column values from different tables based on an inequality based on the operators like <,>,<=,>=,!= , BETWEEN, etc.
Such joins are called non-equi JOINs, and they are also possible in SQL. When you join two tables using other conditional operators, beyond the equal sign, non-equi JOINs come into play. Comparison operators, like <, >, <=, >=, != , and <> and the BETWEEN operator work perfectly for joining tables in SQL.
The SQL NON EQUI JOIN uses comparison operator instead of the equal sign like >, <, >=, <= along with conditions.
What is Non-Equijoin in Oracle? The nonequijoins is such a join which match column values from different tables based on an inequality (instead of the equal sign like >, <, >=, <= ) expression. The value of the join column in each row in the source table is compared to the corresponding values in the target table.
As already explained in the comments, you can use non-equi join functionality to perform this kind of join.
Because on the fly calculation in on
argument is not yet supported you have to create those two columns before performing join. The good thing is that it doesn't cost any extra memory and is extremely fast.
setDT(dt_sample)
## OP: join, then subset
result <- merge(dt_sample, dt_sample, by.x=c("final_airport"),
by.y=("gateway_airport"))[arriv_utc.x + 1800 <= dept_utc.y
&arriv_utc.x +28800 >= dept_utc.y,]
## efficient: non-equi join
d = copy(dt_sample)
d[, `:=`(arriv_utc2=arriv_utc+1800, arriv_utc3=arriv_utc+28800)]
d[d, nomatch=NULL, on=.(
final_airport == gateway_airport,
arriv_utc2 <= dept_utc,
arriv_utc3 >= dept_utc
), j=.(
x.final_airport, x.gateway_airport, x.dept_utc, x.arriv_utc, x.airlines_id, x.flight_number, i.final_airport, i.dept_utc, i.arriv_utc, i.airlines_id, i.flight_number
)] -> ans
setnames(result, names(ans))
all.equal(ans, result, check.attributes=FALSE, ignore.row.order=TRUE)
#[1] TRUE
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With