Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Union All of SQL temporary tables created using dplyr?

Tags:

r

dplyr

How does one rbind or bind_rows temporary tables created in SQL (tested and failed in Postgres and SQLite) by dplyr?

E.g.

library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")

copy_to(con, nycflights13::flights, "flights",
        temporary = FALSE, 
        indexes = list(
            c("year", "month", "day"), 
            "carrier", 
            "tailnum",
            "dest"
        )
)

copy_to(con, nycflights13::flights, "flights2",
        temporary = FALSE, 
        indexes = list(
            c("year", "month", "day"), 
            "carrier", 
            "tailnum",
            "dest"
        )
)



flights_db <- tbl(con, "flights")
flights_db_2 <- tbl(con, "flights2")

Calling bind_rows gives the following error:

> bind_rows(flights_db, flights_db_2)
Error in bind_rows_(x, .id) : 
  Argument 1 must be a data frame or a named atomic vector, not a tbl_dbi/tbl_sql/tbl_lazy/tbl
like image 654
Alex Avatar asked Nov 26 '25 09:11

Alex


2 Answers

With thanks to Akrun for pointing me to the union family, it is possible to somewhat replicate bind_rows with:

Reduce(union_all, list(flights_db, flights_db, flights_db))

As noted in the comments to, and in Akrun's answer, union produces unique records in the result, and union_all is the equivalent to SQL's UNION ALL.

like image 60
Alex Avatar answered Nov 27 '25 23:11

Alex


As database holds unique records, here both the objects 'flights', 'flights2' are the same. Otherwise, we need

union(flights_db, flights_db_2)

The above will only create the dimensions as in 'flights_db' because both the objects are the same. If we need to create double the number of rows, then create a unique identifier

flights1 <- nycflights13::flights %>%
                  mutate(id= 1)

flights2 <- nycflights13::flights %>%
                  mutate(id = 2)

copy_to(con, flights1, "flights",
        temporary = FALSE,
        overwrite = TRUE, 
        indexes = list(
            c("year", "month", "day"), 
            "carrier", 
            "tailnum",
            "dest"
        )
)

copy_to(con, flights2, "flights2",
        temporary = FALSE, 
        overwrite = TRUE,
        indexes = list(
            c("year", "month", "day"), 
            "carrier", 
            "tailnum",
            "dest"
        )
)

flights_db <- tbl(con, "flights")
flights_db_2 <- tbl(con, "flights2")

Now we do the union

union(flights_db, flights_db_2) %>% 
                  summarise(n = n())
# Source:   lazy query [?? x 1]
# Database: sqlite 3.19.3 []
#       n
#   <int>
#1 673552

dim(nycflights13::flights)
#[1] 336776     19

To demonstrate the uniqueness, we can select a small subset of disjointed rows for both the objects and then do the union

copy_to(con, nycflights13::flights[1:20,], "flights",
        temporary = FALSE,
        overwrite = TRUE, 
        indexes = list(
            c("year", "month", "day"), 
            "carrier", 
            "tailnum",
            "dest"
        )
)

copy_to(con, nycflights13::flights[21:30,], "flights2",
        temporary = FALSE, 
        overwrite = TRUE,
        indexes = list(
            c("year", "month", "day"), 
            "carrier", 
            "tailnum",
            "dest"
        )
)


flights_db <- tbl(con, "flights")
flights_db_2 <- tbl(con, "flights2")
union(flights_db, flights_db_2) %>%
          collect
# A tibble: 30 x 19
#    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin  dest air_time distance
#   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>   <chr>  <chr> <chr>    <dbl>    <dbl>
# 1  2013     1     1      517            515         2      830            819        11      UA   1545  N14228    EWR   IAH      227     1400
# 2  2013     1     1      533            529         4      850            830        20      UA   1714  N24211    LGA   IAH      227     1416
# 3  2013     1     1      542            540         2      923            850        33      AA   1141  N619AA    JFK   MIA      160     1089
# 4  2013     1     1      544            545        -1     1004           1022       -18      B6    725  N804JB    JFK   BQN      183     1576
# 5  2013     1     1      554            558        -4      740            728        12      UA   1696  N39463    EWR   ORD      150      719
# 6  2013     1     1      554            600        -6      812            837       -25      DL    461  N668DN    LGA   ATL      116      762
# 7  2013     1     1      555            600        -5      913            854        19      B6    507  N516JB    EWR   FLL      158     1065
# 8  2013     1     1      557            600        -3      709            723       -14      EV   5708  N829AS    LGA   IAD       53      229
# 9  2013     1     1      557            600        -3      838            846        -8      B6     79  N593JB    JFK   MCO      140      944
#10  2013     1     1      558            600        -2      753            745         8      AA    301  N3ALAA    LGA   ORD      138      733
# ... with 20 more rows, and 3 more variables: hour <dbl>, minute <dbl>, time_hour <dbl>
like image 45
akrun Avatar answered Nov 27 '25 23:11

akrun



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!