In Apache Spark 2.0.0, is it possible to fetch a query from an external database (rather than grab the whole table)?

Using pyspark:

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession.
spark = SparkSession \
    .builder \
    .appName("spark play") \
    .getOrCreate()

# Read the whole table over JDBC ("port" is a placeholder for the MySQL port).
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://localhost:port") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .load()

Rather than fetch "schema.tablename", I would prefer to grab the result set of a query.

asked Aug 02 '16 by PBL

People also ask

Can Spark SQL read data from other databases?

Spark SQL also includes a data source that can read data from other databases using JDBC.

Can we use SQL queries directly in Spark?

Spark SQL lets you query structured data inside Spark programs, using either SQL or the familiar DataFrame API, and it is usable from Java, Scala, Python and R. The results of SQL queries are themselves DataFrames, so further functions can be applied to them directly.
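
As a minimal pyspark sketch of mixing the two (the view name "people" and its columns are illustrative, not from the original post):

# Register an existing DataFrame as a temporary view...
df.createOrReplaceTempView("people")

# ...then query it with plain SQL; the result is again a DataFrame.
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()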

Can Spark read from database?

Spark provides an API for reading from and writing to external databases as Spark DataFrames. It requires the JDBC driver class and its jar to be on the classpath, and all the connection properties to be specified, in order to load or unload data from the external source.
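
For example, the connector jar can be supplied when the session is built via the spark.jars configuration; this is a hedged sketch, and the jar path below is a placeholder, not from the original post:

from pyspark.sql import SparkSession

# The path to the MySQL JDBC connector jar is hypothetical; adjust to your setup.
spark = SparkSession \
    .builder \
    .appName("jdbc example") \
    .config("spark.jars", "/path/to/mysql-connector-java.jar") \
    .getOrCreate()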

Which option can be used in Spark SQL if you need to use an in memory columnar structure to cache tables?

Spark SQL can cache tables using an in-memory columnar format by calling spark.catalog.cacheTable("tableName") or dataFrame.cache().
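
A minimal illustration of both calls, assuming a DataFrame df that has been registered as a temporary view (names are illustrative):

# Cache by table name through the catalog...
df.createOrReplaceTempView("tablename")
spark.catalog.cacheTable("tablename")

# ...or cache the DataFrame directly.
df.cache()

# Release the cached data when it is no longer needed.
spark.catalog.uncacheTable("tablename")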


1 Answer

Same as in 1.x, you can pass a valid subquery as the dbtable argument, for example:

...
.option("dbtable", "(SELECT foo, bar FROM schema.tablename) AS tmp")
...
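
Put together with the question's reader, a complete sketch might look like this (the connection details are the placeholders from the question, not a tested configuration):

query = "(SELECT foo, bar FROM schema.tablename) AS tmp"

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://localhost:port") \
    .option("dbtable", query) \
    .option("user", "username") \
    .option("password", "password") \
    .load()

The parentheses and the alias (AS tmp) matter: Spark inlines the dbtable value into its own SELECT statement, and most databases, MySQL included, require a derived table to have an alias. (In Spark 2.4 and later there is also a separate query option that accepts the bare SELECT without the wrapper.)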
answered Nov 14 '22 by zero323