Create Spark Dataframe from SQL Query

I'm sure this is a simple SQLContext question, but I can't find an answer in the Spark docs or on Stack Overflow.

I want to create a Spark Dataframe from a SQL query on MySQL.

For example, I have a complicated MySQL query like

SELECT a.X,b.Y,c.Z FROM FOO as a JOIN BAR as b ON ... JOIN ZOT as c ON ... WHERE ...

and I want a Dataframe with columns X, Y and Z.

I figured out how to load entire tables into Spark, and I could load them all, and then do the joining and selection there. However, that is very inefficient. I just want to load the table generated by my SQL query.

Here is my current attempt, which doesn't work. The JDBC data source has a "dbtable" option that can be used to load a whole table. I am hoping there is some way to specify a query instead.

  val df = sqlContext.format("jdbc").
    option("url", "jdbc:mysql://localhost:3306/local_content").
    option("driver", "com.mysql.jdbc.Driver").
    option("useUnicode", "true").
    option("continueBatchOnError","true").
    option("useSSL", "false").
    option("user", "root").
    option("password", "").
    sql(
"""
select dl.DialogLineID, dlwim.Sequence, wi.WordRootID from Dialog as d
join DialogLine as dl on dl.DialogID=d.DialogID
join DialogLineWordInstanceMatch as dlwim on dlwim.DialogLineID=dl.DialogLineID
join WordInstance as wi on wi.WordInstanceID=dlwim.WordInstanceID
join WordRoot as wr on wr.WordRootID=wi.WordRootID
where d.InSite=1 and dl.Active=1
limit 100
"""
    ).load()


asked Jul 14 '16 14:07

opus111

1 Answer

I found the answer in "Bulk data migration through Spark SQL".

The "dbtable" option can be any query wrapped in parentheses with an alias. So in my case, I need to do this:

val query = """
  (select dl.DialogLineID, dlwim.Sequence, wi.WordRootID from Dialog as d
    join DialogLine as dl on dl.DialogID=d.DialogID
    join DialogLineWordInstanceMatch as dlwim on dlwim.DialogLineID=dl.DialogLineID
    join WordInstance as wi on wi.WordInstanceID=dlwim.WordInstanceID
    join WordRoot as wr on wr.WordRootID=wi.WordRootID
    where d.InSite=1 and dl.Active=1
    limit 100) foo
"""

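// The reader is obtained through sqlContext.read; SQLContext itself has no format method.
val df = sqlContext.read.format("jdbc").
  option("url", "jdbc:mysql://localhost:3306/local_content").
  option("driver", "com.mysql.jdbc.Driver").
  option("useUnicode", "true").
  option("continueBatchOnError","true").
  option("useSSL", "false").
  option("user", "root").
  option("password", "").
  // Pass the parenthesized, aliased query where a table name would normally go.
  option("dbtable",query).
  load()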
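
The wrapping step above (parentheses plus an alias) can be factored into a small helper so that any SELECT statement can be passed as "dbtable". This is only a sketch: DerivedTable and asDerivedTable are hypothetical names, not part of Spark or the MySQL connector, and the helper does nothing more than trim the query, drop a trailing semicolon, and add the alias.

```scala
object DerivedTable {
  // Hypothetical helper (not a Spark API): turns a SELECT statement into the
  // "(query) alias" form that can be used as the value of "dbtable".
  def asDerivedTable(query: String, alias: String): String = {
    // Remove surrounding whitespace and a trailing semicolon, since a
    // semicolon inside the parenthesized subquery would be a syntax error.
    val body = query.trim.stripSuffix(";").trim
    s"($body) $alias"
  }
}
```

With this in place, the last option in the code above would become option("dbtable", DerivedTable.asDerivedTable(sql, "foo")), where sql holds the plain query text.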

As expected, loading each table as its own Dataframe and joining them in Spark was very inefficient.
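
As a possible alternative, and assuming Spark 2.4 or later, the JDBC data source also documents a "query" option that takes the SELECT statement directly; Spark then parenthesizes and aliases it itself. The sketch below only builds the option map, so it does not depend on Spark. JdbcQueryOptions and forQuery are hypothetical names, and the driver class is copied from the code above.

```scala
object JdbcQueryOptions {
  // Hypothetical helper: collects reader options for a query-based load.
  // "query" replaces "dbtable" here; the two must not be set together.
  def forQuery(url: String, user: String, password: String, query: String): Map[String, String] =
    Map(
      "url" -> url,
      "driver" -> "com.mysql.jdbc.Driver",
      "user" -> user,
      "password" -> password,
      // No parentheses or alias needed; only a trailing semicolon is removed.
      "query" -> query.trim.stripSuffix(";").trim
    )
}

// Usage sketch, assuming a SparkSession named spark (Spark 2.4+):
//   val opts = JdbcQueryOptions.forQuery(
//     "jdbc:mysql://localhost:3306/local_content", "root", "",
//     "select dl.DialogLineID from DialogLine as dl where dl.Active=1")
//   val df = spark.read.format("jdbc").options(opts).load()
```

On older Spark versions, such as the one in the question, the "dbtable" form with a parenthesized subquery remains the way to go.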

answered Sep 23 '22 03:09

opus111

Tags: sql, mysql, scala, apache-spark, mysql-connector
