I'm trying to do a complex self-join on a (for R) large data structure (tens to hundreds of millions of rows), so creating new columns just for this one operation is something I'd like to avoid as it would literally add gigs of memory pressure to my object, especially since I want to play with different join time parameters. Structure of <code>dt_sample</code>: <pre class="prettyprint"><code>str(dt_sample) Classes ‘data.table’ and 'data.frame': 50 obs. of 6 variables: $ gateway_airport: chr "BOS" "BOS" "BOS" "BOS" ... $ final_airport : chr "ORD" "BNA" "ORD" "RSW" ... $ dept_utc : POSIXct, format: "2016-11-17 15:09:00" "2016-11-17 21:00:00" "2016-11-17 12:40:00" ... $ arriv_utc : POSIXct, format: "2016-11-17 17:03:00" "2016-11-17 23:00:00" "2016-11-17 14:35:00" ... $ airlines_id : chr "UA" "B6" "UA" "B6" ... $ flight_number : num 1472 1907 449 965 3839 ... </code></pre> The idea is to do a self-join on where X's final_airport is Y's gateway_airport, and Y's dept_utc is within a range of X's arriv_UTC (in my example below between +30 minutes and +8 hours inclusive). This works, but creates a large structure that gets filtered after the merge: <pre class="prettyprint"><code>result <- merge(dt_sample, dt_sample, by.x=c("final_airport"), by.y=("gateway_airport"))[arriv_utc.x + 1800 <= dept_utc.y &arriv_utc.x +28800 >= dept_utc.y,] </code></pre> I'd prefer to use the <code>on = .()</code> notation, but it does not seem to allow arithmetic operations on the join elements and the self-join seems to confuse it as well. Again, I'd prefer not to add new columns. Any ideas on how to use data.table effectively here? The result of head(result) is a data.table with 3 airports per row, with the origin, intermediary stop and final airport (+ some extra info). The full result for the sample in the dput below is 19 rows long. <pre class="prettyprint"><code> final_airport gateway_airport dept_utc.x arriv_utc.x airlines_id.x flight_number.x final_airport dept_utc.y arriv_utc.y 1: IAD BOS 2016-11-17 14:35:00 2016-11-17 16:18:00 UA 525 JAX 2016-11-17 17:30:00 2016-11-17 19:37:00 2: IAD BOS 2016-11-17 14:35:00 2016-11-17 16:18:00 UA 525 SAV 2016-11-17 17:30:00 2016-11-17 19:16:00 3: IAD BOS 2016-11-17 14:35:00 2016-11-17 16:18:00 UA 525 TYS 2016-11-17 17:31:00 2016-11-17 19:10:00 4: IAD BOS 2016-11-17 14:35:00 2016-11-17 16:18:00 UA 525 DEN 2016-11-17 17:35:00 2016-11-17 19:36:00 5: IAD BOS 2016-11-17 14:35:00 2016-11-17 16:18:00 UA 525 GSO 2016-11-17 17:40:00 2016-11-17 19:09:00 6: IAD BOS 2016-11-17 14:35:00 2016-11-17 16:18:00 UA 525 LAX 2016-11-17 17:40:00 2016-11-17 20:25:00 airlines_id.y flight_number.y 1: AC 3891 2: AC 2736 3: AC 2567 4: UA 735 5: AC 2727 6: UA 632 </code></pre> Click below to expand 50-row sample dput structure for reproducibility: <div class="snippet" data-lang="js" data-hide="true" data-console="false" data-babel="false"> <div class="snippet-code snippet-currently-hidden"> <pre class="prettyprint snippet-code-html lang-html prettyprint-override"><code>structure(list(gateway_airport = c("BOS", "BOS", "BOS", "BOS", "IAD", "IAD", "IAD", "BOS", "IAD", "BOS", "BOS", "BOS", "BOS", "IAD", "BOS", "BOS", "IAD", "BOS", "BOS", "BOS", "IAD", "BOS", "BOS", "BOS", "BOS", "IAD", "BOS", "IAD", "BOS", "IAD", "IAD", "IAD", "BOS", "IAD", "BOS", "BOS", "BOS", "IAD", "IAD", "BOS", "IAD", "BOS", "BOS", "BOS", "IAD", "BOS", "IAD", "BOS", "BOS", "IAD"), final_airport = c("ORD", "BNA", "ORD", "RSW", "ORF", "MCO", "DEN", "CLT", "DFW", "PHL", "ORD", "MIA", "ORD", "GSO", "JFK", "FLL", "ORD", "ORD", "LGA", "LGA", "LAX", "ORD", "IAD", "RIC", "DEN", "TYS", "SEA", "TPA", "SAV", "ROA", "SEA", "DFW", "PHL", "MIA", "IAH", "PHL", "LGA", "JFK", "JAX", "TPA", "TPA", "IAH", "DFW", "LAS", "SAV", "IAD", "LAX", "LGA", "SFO", "LAX" ), dept_utc = structure(c(1479395340, 1479416400, 1479386400, 1479397800, 1479420600, 1479422700, 1479404100, 1479386100, 1479384840, 1479387600, 1479378840, 1479386700, 1479402000, 1479404400, 1479403800, 1479418500, 1479421500, 1479384000, 1479420900, 1479387600, 1479404400, 1479412500, 1479411000, 1479381000, 1479412920, 1479403860, 1479395700, 1479389100, 1479398400, 1479421500, 1479415200, 1479400140, 1479415440, 1479380400, 1479406080, 1479382200, 1479413700, 1479394800, 1479403800, 1479414300, 1479423000, 1479392520, 1479411600, 1479384000, 1479403800, 1479393300, 1479391200, 1479400200, 1479397500, 1479420600), class = c("POSIXct", "POSIXt"), tzone = ""), arriv_utc = structure(c(1479402180, 1479423600, 1479393300, 1479410880, 1479424620, 1479431160, 1479411360, 1479395520, 1479393900, 1479393360, 1479386700, 1479400020, 1479408780, 1479409740, 1479408240, 1479431340, 1479425280, 1479391860, 1479425640, 1479392100, 1479414300, 1479419280, 1479417120, 1479387600, 1479422940, 1479409800, 1479407460, 1479397800, 1479408180, 1479426180, 1479425580, 1479409500, 1479421740, 1479390420, 1479418260, 1479387900, 1479418320, 1479399360, 1479411420, 1479426420, 1479431940, 1479404880, 1479423900, 1479395340, 1479410160, 1479399480, 1479401580, 1479404640, 1479411300, 1479430860 ), class = c("POSIXct", "POSIXt"), tzone = ""), airlines_id = c("UA", "B6", "UA", "B6", "AC", "UA", "UA", "AA", "AA", "B6", "AA", "AA", "AA", "AC", "EI", "B6", "UA", "AA", "B6", "AA", "UA", "UA", "UA", "B6", "UA", "AC", "B6", "UA", "B6", "AC", "UA", "AA", "B6", "AA", "UA", "AA", "B6", "B6", "AC", "EI", "UA", "UA", "B6", "B6", "AC", "UA", "UA", "B6", "UA", "UA"), flight_number = c(1472, 1907, 449, 965, 3839, 419, 735, 1735, 2569, 459, 1155, 1274, 1240, 2727, 5021, 1969, 511, 1404, 1331, 2126, 632, 981, 511, 1481, 448, 2567, 597, 2002, 49, 2726, 357, 1556, 1059, 1083, 1233, 543, 1231, 1308, 3891, 5290, 360, 167, 1115, 1077, 2736, 525, 470, 831, 477, 325)), .Names = c("gateway_airport", "final_airport", "dept_utc", "arriv_utc", "airlines_id", "flight_number"), class = c("data.table", "data.frame"), row.names = c(NA, -50L), .internal.selfref = <pointer: 0x2301358>)</code></pre> </div> </div>

As already explained in the comments, you can use non-equi join functionality to perform this kind of join. Because on the fly calculation in <code>on</code> argument is not yet supported you have to create those two columns before performing join. The good thing is that it doesn't cost any extra memory and is extremely fast. <pre class="prettyprint lang-r prettyprint-override"><code>setDT(dt_sample) ## OP: join, then subset result <- merge(dt_sample, dt_sample, by.x=c("final_airport"), by.y=("gateway_airport"))[arriv_utc.x + 1800 <= dept_utc.y &arriv_utc.x +28800 >= dept_utc.y,] ## efficient: non-equi join d = copy(dt_sample) d[, `:=`(arriv_utc2=arriv_utc+1800, arriv_utc3=arriv_utc+28800)] d[d, nomatch=NULL, on=.( final_airport == gateway_airport, arriv_utc2 <= dept_utc, arriv_utc3 >= dept_utc ), j=.( x.final_airport, x.gateway_airport, x.dept_utc, x.arriv_utc, x.airlines_id, x.flight_number, i.final_airport, i.dept_utc, i.arriv_utc, i.airlines_id, i.flight_number )] -> ans setnames(result, names(ans)) all.equal(ans, result, check.attributes=FALSE, ignore.row.order=TRUE) #[1] TRUE </code></pre>

Data.Table non-equi join with arithmetic operations

Q: What is non equi join in oracle?

What is Non-Equijoin in Oracle? The nonequijoins is such a join which match column values from different tables based on an inequality (instead of the equal sign like >, <, >=, <= ) expression. The value of the join column in each row in the source table is compared to the corresponding values in the target table.

Tags:

r

data.table

I'm trying to do a complex self-join on a (for R) large data structure (tens to hundreds of millions of rows), so creating new columns just for this one operation is something I'd like to avoid as it would literally add gigs of memory pressure to my object, especially since I want to play with different join time parameters.

Structure of dt_sample:

str(dt_sample)
Classes ‘data.table’ and 'data.frame':  50 obs. of  6 variables:
 $ gateway_airport: chr  "BOS" "BOS" "BOS" "BOS" ...
 $ final_airport  : chr  "ORD" "BNA" "ORD" "RSW" ...
 $ dept_utc       : POSIXct, format: "2016-11-17 15:09:00" "2016-11-17 21:00:00" "2016-11-17 12:40:00" ...
 $ arriv_utc      : POSIXct, format: "2016-11-17 17:03:00" "2016-11-17 23:00:00" "2016-11-17 14:35:00" ...
 $ airlines_id    : chr  "UA" "B6" "UA" "B6" ...
 $ flight_number  : num  1472 1907 449 965 3839 ...

The idea is to do a self-join on where X's final_airport is Y's gateway_airport, and Y's dept_utc is within a range of X's arriv_UTC (in my example below between +30 minutes and +8 hours inclusive).

This works, but creates a large structure that gets filtered after the merge:

result <- merge(dt_sample, dt_sample, by.x=c("final_airport"),
                by.y=("gateway_airport"))[arriv_utc.x + 1800 <= dept_utc.y 
                                          &arriv_utc.x  +28800 >= dept_utc.y,]

I'd prefer to use the on = .() notation, but it does not seem to allow arithmetic operations on the join elements and the self-join seems to confuse it as well. Again, I'd prefer not to add new columns. Any ideas on how to use data.table effectively here?

The result of head(result) is a data.table with 3 airports per row, with the origin, intermediary stop and final airport (+ some extra info). The full result for the sample in the dput below is 19 rows long.

   final_airport gateway_airport          dept_utc.x         arriv_utc.x airlines_id.x flight_number.x final_airport          dept_utc.y         arriv_utc.y
1:           IAD             BOS 2016-11-17 14:35:00 2016-11-17 16:18:00            UA             525           JAX 2016-11-17 17:30:00 2016-11-17 19:37:00
2:           IAD             BOS 2016-11-17 14:35:00 2016-11-17 16:18:00            UA             525           SAV 2016-11-17 17:30:00 2016-11-17 19:16:00
3:           IAD             BOS 2016-11-17 14:35:00 2016-11-17 16:18:00            UA             525           TYS 2016-11-17 17:31:00 2016-11-17 19:10:00
4:           IAD             BOS 2016-11-17 14:35:00 2016-11-17 16:18:00            UA             525           DEN 2016-11-17 17:35:00 2016-11-17 19:36:00
5:           IAD             BOS 2016-11-17 14:35:00 2016-11-17 16:18:00            UA             525           GSO 2016-11-17 17:40:00 2016-11-17 19:09:00
6:           IAD             BOS 2016-11-17 14:35:00 2016-11-17 16:18:00            UA             525           LAX 2016-11-17 17:40:00 2016-11-17 20:25:00
   airlines_id.y flight_number.y
1:            AC            3891
2:            AC            2736
3:            AC            2567
4:            UA             735
5:            AC            2727
6:            UA             632

Click below to expand 50-row sample dput structure for reproducibility:

structure(list(gateway_airport = c("BOS", "BOS", "BOS", "BOS", 
"IAD", "IAD", "IAD", "BOS", "IAD", "BOS", "BOS", "BOS", "BOS", 
"IAD", "BOS", "BOS", "IAD", "BOS", "BOS", "BOS", "IAD", "BOS", 
"BOS", "BOS", "BOS", "IAD", "BOS", "IAD", "BOS", "IAD", "IAD", 
"IAD", "BOS", "IAD", "BOS", "BOS", "BOS", "IAD", "IAD", "BOS", 
"IAD", "BOS", "BOS", "BOS", "IAD", "BOS", "IAD", "BOS", "BOS", 
"IAD"), final_airport = c("ORD", "BNA", "ORD", "RSW", "ORF", 
"MCO", "DEN", "CLT", "DFW", "PHL", "ORD", "MIA", "ORD", "GSO", 
"JFK", "FLL", "ORD", "ORD", "LGA", "LGA", "LAX", "ORD", "IAD", 
"RIC", "DEN", "TYS", "SEA", "TPA", "SAV", "ROA", "SEA", "DFW", 
"PHL", "MIA", "IAH", "PHL", "LGA", "JFK", "JAX", "TPA", "TPA", 
"IAH", "DFW", "LAS", "SAV", "IAD", "LAX", "LGA", "SFO", "LAX"
), dept_utc = structure(c(1479395340, 1479416400, 1479386400, 
1479397800, 1479420600, 1479422700, 1479404100, 1479386100, 1479384840, 
1479387600, 1479378840, 1479386700, 1479402000, 1479404400, 1479403800, 
1479418500, 1479421500, 1479384000, 1479420900, 1479387600, 1479404400, 
1479412500, 1479411000, 1479381000, 1479412920, 1479403860, 1479395700, 
1479389100, 1479398400, 1479421500, 1479415200, 1479400140, 1479415440, 
1479380400, 1479406080, 1479382200, 1479413700, 1479394800, 1479403800, 
1479414300, 1479423000, 1479392520, 1479411600, 1479384000, 1479403800, 
1479393300, 1479391200, 1479400200, 1479397500, 1479420600), class = c("POSIXct", 
"POSIXt"), tzone = ""), arriv_utc = structure(c(1479402180, 1479423600, 
1479393300, 1479410880, 1479424620, 1479431160, 1479411360, 1479395520, 
1479393900, 1479393360, 1479386700, 1479400020, 1479408780, 1479409740, 
1479408240, 1479431340, 1479425280, 1479391860, 1479425640, 1479392100, 
1479414300, 1479419280, 1479417120, 1479387600, 1479422940, 1479409800, 
1479407460, 1479397800, 1479408180, 1479426180, 1479425580, 1479409500, 
1479421740, 1479390420, 1479418260, 1479387900, 1479418320, 1479399360, 
1479411420, 1479426420, 1479431940, 1479404880, 1479423900, 1479395340, 
1479410160, 1479399480, 1479401580, 1479404640, 1479411300, 1479430860
), class = c("POSIXct", "POSIXt"), tzone = ""), airlines_id = c("UA", 
"B6", "UA", "B6", "AC", "UA", "UA", "AA", "AA", "B6", "AA", "AA", 
"AA", "AC", "EI", "B6", "UA", "AA", "B6", "AA", "UA", "UA", "UA", 
"B6", "UA", "AC", "B6", "UA", "B6", "AC", "UA", "AA", "B6", "AA", 
"UA", "AA", "B6", "B6", "AC", "EI", "UA", "UA", "B6", "B6", "AC", 
"UA", "UA", "B6", "UA", "UA"), flight_number = c(1472, 1907, 
449, 965, 3839, 419, 735, 1735, 2569, 459, 1155, 1274, 1240, 
2727, 5021, 1969, 511, 1404, 1331, 2126, 632, 981, 511, 1481, 
448, 2567, 597, 2002, 49, 2726, 357, 1556, 1059, 1083, 1233, 
543, 1231, 1308, 3891, 5290, 360, 167, 1115, 1077, 2736, 525, 
470, 831, 477, 325)), .Names = c("gateway_airport", "final_airport", 
"dept_utc", "arriv_utc", "airlines_id", "flight_number"), class = c("data.table", 
"data.frame"), row.names = c(NA, -50L), .internal.selfref = <pointer: 0x2301358>)

475

asked Nov 17 '16 15:11

Serban Tanasa

1 Answers

As already explained in the comments, you can use non-equi join functionality to perform this kind of join. Because on the fly calculation in on argument is not yet supported you have to create those two columns before performing join. The good thing is that it doesn't cost any extra memory and is extremely fast.

setDT(dt_sample)

## OP: join, then subset
result <- merge(dt_sample, dt_sample, by.x=c("final_airport"),
                by.y=("gateway_airport"))[arriv_utc.x + 1800 <= dept_utc.y 
                                          &arriv_utc.x  +28800 >= dept_utc.y,]

## efficient: non-equi join
d = copy(dt_sample)
d[, `:=`(arriv_utc2=arriv_utc+1800, arriv_utc3=arriv_utc+28800)]
d[d, nomatch=NULL, on=.(
  final_airport == gateway_airport,
  arriv_utc2 <= dept_utc,
  arriv_utc3 >= dept_utc
  ), j=.(
    x.final_airport, x.gateway_airport, x.dept_utc, x.arriv_utc, x.airlines_id, x.flight_number, i.final_airport, i.dept_utc, i.arriv_utc, i.airlines_id, i.flight_number
  )] -> ans
setnames(result, names(ans))
all.equal(ans, result, check.attributes=FALSE, ignore.row.order=TRUE)
#[1] TRUE

174

answered Sep 17 '22 21:09

jangorecki

Related questions
                            
                                Duplicated legends when faceting in ggplotly
                            
                                Vectorized equality testing
                            
                                Shiny DataTable: Disable row selection for certain rows
                            
                                How to use R's testthat to unit test individual files?
                            
                                How does ggplot2 density differ from the density function?
                            
                                imageOutput click within conditionalPanel
                            
                                Rmarkdown Chunk Name from Variable
                            
                                when is R's `ByteCompile` counter-productive?
                            
                                How can I set R.HOME() and/or R_HOME correctly?
                            
                                Rmarkdown of Stargazer: LaTeX Error if align is set to TRUE
                            
                                Efficiently construct GRanges/IRanges from Rle vector
                            
                                R function for position of sun giving unexpected results
                            
                                Combine data frame rows in R based on multiple columns
                            
                                Is there a reliable way to detect POSIXlt objects representing a time which does not exist due to DST?
                            
                                lme4::glmer.nb function produces "Error in family$family : $ operator not defined for this S4 class" depending on the order I run models
                            
                                R plotly version 4.5.2 scatterplot legend bubble size settings
                            
                                modelr: Fitting multiple models with resampled data
                            
                                How to change column names in stargazer when printing data frames?
                            
                                buffer areas around lines ggplot2
                            
                                How to show error location in tryCatch?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With