Occasionally, when connecting to my Oracle database through ROracle and dbplyr, I will run a dplyr::collect operation that fetches more data than expected, and more than R can handle. This can make R crash, and it is often a sign that I should have filtered or aggregated the data further before fetching. It would be great to be able to check the size of the result before choosing whether to fetch it (without running the query twice).

Let's call collect2 the variation of collect that would allow this.
Expected behavior:

small_t <- con %>% tbl("small_table") %>%
  filter_group_etc %>%
  collect2(n_max = 5e6) # works fine

big_t <- con %>% tbl("big_table") %>%
  filter_group_etc %>%
  collect2(n_max = 5e6) # Error: query returned 15.486.245 rows, n_max set to 5.000.000
Would this be possible? I'm also open to a solution using ROracle / DBI without dplyr, e.g.:

dbGetQuery2(con, my_big_sql_query, n_max = 5e6) # Error: query returned 15.486.245 rows, n_max set to 5.000.000
EDIT: See below a partial solution, posted as an answer. It is not optimal, because some time is wasted fetching data I have no use for.
This doesn't get around the problem you mention in the comments about spending the resources to run the query twice, but it does seem to work (at least against my MySQL database; I don't have an Oracle database to test it against):
library(dplyr)

collect2 <- function(query, limit = 20000) {
  # Count the rows in the database before deciding whether to collect
  query_nrows <- query %>%
    ungroup() %>%
    summarize(n = n()) %>%
    collect() %>%
    pull('n')

  if (query_nrows <= limit) {
    collect(query)
  } else {
    warning("Query has ", query_nrows, " rows; limit is ", limit, ". Data will not be collected.")
  }
}
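For example, on a lazy tbl (reusing the placeholder pipeline from the question; note that the argument here is limit rather than n_max):

big_t <- con %>%
  tbl("big_table") %>%
  filter_group_etc %>%    # placeholder for your filters/grouping
  collect2(limit = 5e6)   # warns instead of collecting if the count exceeds 5e6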
I don't see any way to test the number of rows in the result of a query without actually running the query. With this method, though, you always force the row count to be computed in the database first, and you refuse to collect if the result is over 20,000 rows (or whatever your row limit is).
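If you are willing to drop down to plain DBI, another option avoids the separate count query altogether: send the query once and fetch at most n_max + 1 rows, stopping if anything is left over. This is only a sketch under the question's hypothetical dbGetQuery2 name, and it assumes a DBI-compliant driver such as ROracle; it still spends time fetching up to n_max rows before giving up, but it never pulls a 15-million-row result into R.

library(DBI)

# Sketch only: dbGetQuery2 is the hypothetical function name from the question.
dbGetQuery2 <- function(con, sql, n_max = 5e6) {
  res <- dbSendQuery(con, sql)
  on.exit(dbClearResult(res), add = TRUE)

  # Fetch at most n_max + 1 rows: one extra row is enough to prove the
  # result is too big, without pulling the whole result set into R.
  out <- dbFetch(res, n = n_max + 1)

  if (nrow(out) > n_max) {
    stop("query returned more than ", n_max, " rows, n_max set to ", n_max)
  }
  out
}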