Count number of rows when using dplyr to access sql table/query

Question

What is an efficient way to count the number of rows when using dplyr to access an SQL table? The MWE below uses SQLite, but I use PostgreSQL and have the same issue. Basically, dim() does not behave consistently. I used

dim()

This works for a table referenced directly in the database (first case), but not when I create a tbl from an SQL query on the same table (second case). My tables have millions of rows, but I see this even with as few as 1,000 rows: I get NA or ??. Is there anything I am missing?

# MWE
library(dplyr)
library(nycflights13)
test_db <- src_sqlite("test_db.sqlite3", create = TRUE)
flights_sqlite <- copy_to(test_db, flights, temporary = FALSE, indexes = list(
  c("year", "month", "day"), "carrier", "tailnum"))

flights_postgres <- tbl(test_db, "flights")

First case (table from direct schema)

flights_postgres

 > flights_postgres
 Source: postgres 9.3.5 []
 From: flights [336,776 x 16]

   year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight    origin dest air_time distance hour minute
  1  2013     1   1      517         2      830        11      UA  N14228   1545    EWR  IAH      227     1400    5     17
  2  2013     1   1      533         4      850        20      UA  N24211   1714    LGA  IAH      227     1416    5     33

#using dim()
> dim(flights_postgres)
[1] 336776     16

The above works and gives the row count.

Second case (table from SQL query)

 ## same flights table as above; a query like this could also create other variables (e.g. lag, lead) at run time
 flight_postgres_2 <- tbl(test_db, sql("SELECT * FROM flights"))

  >flight_postgres_2
 Source: postgres 9.3.5 []
 From: <derived table> [?? x 16]

  year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight     origin dest air_time distance hour minute
   1  2013     1   1      517         2      830        11      UA  N14228   1545    EWR  IAH      227     1400    5     17
   2  2013     1   1      533         4      850        20      UA  N24211   1714    LGA  IAH      227     1416    5     33

> dim(flight_postgres_2)
[1] NA 16

As you can see, the row count is shown as either ?? or NA, which is not very helpful.

I got around this by either using collect() or converting the output to a data frame with as.data.frame() and then checking its dimensions. But neither method seems ideal, given how long it may take to pull a large number of rows into R.

Ian Gow · Accepted Answer

I think the answer is what @alistaire suggests: Do it in the database.

> flight_postgres_2 %>% summarize(n())
Source: sqlite 3.8.6 [test_db.sqlite3]
From: <derived table> [?? x 1]

      n()
    (int)
1  336776
..    ...

Asking dim to do this would be having your cake (lazy evaluation of SQL with dplyr, keeping data in the database) and eating it too (having full access to the data in R).
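If you need the count as a plain R number rather than as a lazy one-row table, something like the following should work. This is a minimal sketch, not the setup from the question: it assumes the DBI, RSQLite and dbplyr packages are installed, it connects through DBI instead of src_sqlite(), and it uses the built-in mtcars data as a stand-in for flights. The names con, cars, cars_q and n_rows are made up for illustration.

```r
library(dplyr)
library(DBI)

# Assumption: an in-memory SQLite database with a small stand-in table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "cars", mtcars)

# A tbl built from an SQL query, as in the second case of the question
cars_q <- tbl(con, sql("SELECT * FROM cars"))

# The count is computed by the database; only one row is brought into R,
# and pull() extracts the single value from it
n_rows <- cars_q %>%
  summarise(n = n()) %>%
  pull(n)

dbDisconnect(con)
```

The point of the design is that collect() on the full table is avoided, so the cost should be dominated by the database's own COUNT rather than by moving rows into R.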

Note that this is doing @alistaire's approach underneath:

> flight_postgres_2 %>% summarize(n()) %>% explain()
<SQL>
SELECT "n()"
FROM (SELECT COUNT() AS "n()"
FROM (SELECT * FROM flights) AS "zzz11") AS "zzz13"


<PLAN>
  selectid order from                                                         detail
1        0     0    0 SCAN TABLE flights USING COVERING INDEX flights_year_month_day
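To check the generated SQL without running the query or asking for a plan, dbplyr's sql_render() (or show_query(), which prints it) can be used. Again this is only a sketch under assumptions: an in-memory SQLite database via DBI and RSQLite, with mtcars standing in for flights; con, cars, count_query and count_sql are illustrative names. The exact SQL text depends on the dbplyr version, so it may not match the output above character for character.

```r
library(dplyr)
library(DBI)

# Assumption: an in-memory SQLite database with a small stand-in table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "cars", mtcars)

# Build the lazy count query; nothing is sent to the database yet
count_query <- tbl(con, sql("SELECT * FROM cars")) %>%
  summarise(n = n())

# Render the SQL that dplyr would send, as a character string
count_sql <- as.character(dbplyr::sql_render(count_query))
print(count_sql)

dbDisconnect(con)
```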
