 

How to use a subquery for dbtable option in jdbc data source?

I want to use Spark to process some data from a JDBC source. But to begin with, instead of reading the original tables over JDBC, I want to run some queries on the JDBC side to filter columns and join tables, and then load the query result as a table in Spark SQL.

The following syntax to load a raw JDBC table works for me:

df_table1 = sqlContext.read.format('jdbc').options(
    url="jdbc:mysql://foo.com:3306",
    dbtable="mydb.table1",
    user="me",
    password="******",
    driver="com.mysql.jdbc.Driver" # mysql JDBC driver 5.1.41
).load() 
df_table1.show() # succeeded

According to the Spark documentation (I'm using PySpark 1.6.3):

dbtable: The JDBC table that should be read. Note that anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses.

So just for experiment, I tried something simple like this:

df_table1 = sqlContext.read.format('jdbc').options(
    url="jdbc:mysql://foo.com:3306",
    dbtable="(SELECT * FROM mydb.table1) AS table1",
    user="me",
    password="******",
    driver="com.mysql.jdbc.Driver"
).load() # failed

It threw the following exception:

com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'table1 WHERE 1=0' at line 1

I also tried a few other variations of the syntax (adding/removing parentheses, removing the AS clause, switching case, etc.) without any luck. So what is the correct syntax? Where can I find more detailed documentation for it? Also, where does the odd "WHERE 1=0" in the error message come from? Thanks!

Asked Apr 02 '17 by Dichen

2 Answers

To read data from a JDBC source using a SQL query in Spark SQL, you can try something like this:

val df_table1 = sqlContext.read.format("jdbc").options(Map(
    "url" -> "jdbc:postgresql://localhost:5432/mydb",
    "dbtable" -> "(select * from table1) as table1",
    "user" -> "me",
    "password" -> "******",
    "driver" -> "org.postgresql.Driver"
)).load()

I tried this with PostgreSQL; you can adapt the URL and driver for MySQL.
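
For reference, the same pattern in PySpark against the MySQL source from the question would look roughly like this (a sketch only; the selected columns and the filter condition are made up for illustration, and it has not been verified against the asker's server):

df_filtered = sqlContext.read.format('jdbc').options(
    url="jdbc:mysql://foo.com:3306",
    dbtable="(SELECT id, name FROM mydb.table1 WHERE id > 100) AS t1",  # hypothetical columns/filter
    user="me",
    password="******",
    driver="com.mysql.jdbc.Driver"
).load()
df_filtered.show()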

Answered by himanshuIIITian


table = "(SELECT id, person, manager, CAST(tdate AS CHAR) AS tdate, CAST(start AS   CHAR) AS start, CAST(end AS CHAR) as end, CAST(duration AS CHAR) AS duration FROM EmployeeTimes) AS EmployeeTimes",

spark = get_spark_session()
df = spark.read.format("jdbc"). \
    options(url=ip,
            driver='com.mysql.jdbc.Driver',
            dbtable=table,
            user=username,
            password=password).load()
return df

I had heaps of trouble with Spark JDBC incompatibility with MySQL timestamps. The trick is to convert all your timestamp or duration values to strings before the JDBC driver touches them: simply cast the values as strings in the query and it will work.

Note: You will also have to use AS to give the query an alias for it to work.
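
If you need real timestamp types back on the Spark side, one option (a sketch, assuming Spark 2.2+ for to_timestamp and the column names from the query above) is to convert the string columns after loading:

from pyspark.sql.functions import to_timestamp

# Sketch: turn the CHAR-cast columns back into Spark timestamps after the read.
df = df.withColumn("tdate", to_timestamp("tdate")) \
       .withColumn("start", to_timestamp("start")) \
       .withColumn("end", to_timestamp("end"))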

Answered by Zack