Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to join tables from different SQL databases using R and dplyr?

I'm using dplyr (0.7.0), dbplyr (1.0.0), DBI 0.6-1, and odbc (1.0.1.9000). I would like to do something like the following:

db1 <- DBI::dbConnect(
  odbc::odbc(),
  Driver = "SQL Server",
  Server = "MyServer",
  Database = "DB1"
)
db2 <- DBI::dbConnect(
  odbc::odbc(),
  Driver = "SQL Server",
  Server = "MyServer",
  Database = "DB2"
)
x <- tbl(db1, "Table1") %>%
  dplyr::left_join(tbl(db2, "Table2"), by = "JoinColumn") 

but I keep getting an error that doesn't really seem to have any substance to it. When I use show_query it seems like the code is trying to create a SQL query that joins the two tables without taking the separate databases into account. Per the documentation for dplyr::left_join I've also tried:

x <- tbl(db1, "Table1") %>%
      dplyr::left_join(tbl(db2, "Table2"), by = "JoinColumn", copy = TRUE) 

But there is no change in the output or error message. Is there a different way to join tables from separate databases on the same server?

like image 581
stat_student Avatar asked Jun 22 '17 15:06

stat_student


People also ask

How do I join tables in dplyr?

To join by different variables on x and y , use a named vector. For example, by = c("a" = "b") will match x$a to y$b . To join by multiple variables, use a vector with length > 1. For example, by = c("a", "b") will match x$a to y$a and x$b to y$b .

Can you join tables from different databases in SQL?

SQL Server allows you to join tables from different databases as long as those databases are on the same server. The join syntax is the same; the only difference is that you must fully qualify table names.

Is dplyr better than SQL?

dplyr is a R package that provides a set of grammar based functions to transform data. Compared to using SQL, it's much easier to construct and much easier to read what's constructed.

Can you use SQL and R together?

Not only can you easily retrieve data from SQL Sources for analysis and visualisation in R, but you can also use SQL to create, clean, filter, query and otherwise manipulate datasets within R, using a wide choice of relational databases. There is no reason to abandon your hard-earned SQL skills!


2 Answers

I'm assuming from the code you provided that (a) you're interested in joining the two tbl objects via dplyr's syntax before you run collect() and pull the results into local memory and that (b) you want to refer directly to the database objects in the call to tbl().

These choices are important if you want to leverage dplyr to programmatically build your query logic while simultaneously leveraging the database server to INNER JOIN large volumes of data down to the set that you're interested in. (Or at least that's why I ended up here.)

The solution I found uses one connection without specifying the database, and spells out the database and schema information using in_schema() (I couldn't find this documented or vignetted anywhere):

conn <- DBI::dbConnect(
  odbc::odbc(),
  Driver = "SQL Server",
  Server = "MyServer"
)

x <- tbl(src_dbi(conn),
         in_schema("DB1.dbo", "Table1")) %>%
  dplyr::left_join(tbl(src_dbi(conn),
                       in_schema("DB1.dbo", "Table2")),
                   by = "JoinColumn")
like image 160
ishak Avatar answered Sep 29 '22 15:09

ishak


I faced the same problem and I wasn't able to solve it with dplyr::left_join.

At least I was able to do the job using the following workaround. I connected to SQL Server without declaring a default database, then I ran the query with sql().

con <- dbConnect(odbc::odbc(), dsn="DWH" ,  uid="", pwd= "" )

data_db <- tbl( con, sql("SELECT * 
                    FROM DB1..Table1 AS a
                    LEFT JOIN DB2..Table2 AS b ON a.JoinColumn = b.JoinColumn") ) 

data_db %>% ...

Hope it helps.

like image 41
zanocom Avatar answered Sep 29 '22 13:09

zanocom