How to add JDBC drivers to the classpath when using PySpark?

Tags: apache-spark-sql, pyspark

How and where do I install the JDBC drivers for Spark SQL? I'm running the all-spark-notebook Docker image and am trying to pull some data directly from a SQL database into Spark.

From what I can tell, I need to include the drivers in my classpath; I'm just not sure how to do that from PySpark.

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "jdbc:postgresql:dbserver") \
    .load()

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-2-f3b08ff6d117> in <module>()
      2 spark = SparkSession     .builder     .master("local")     .appName("Python Spark SQL basic example")     .getOrCreate()
      3 
----> 4 jdbcDF = spark.read     .format("jdbc")     .option("url", "jdbc:postgresql:dbserver")     .option("dbtable", "jdbc:postgresql:dbserver")     .load()

/usr/local/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    163             return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
    164         else:
--> 165             return self._df(self._jreader.load())
    166 
    167     @since(1.4)

/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o36.load.
: java.sql.SQLException: No suitable driver
    at java.sql.DriverManager.getDriver(DriverManager.java:315)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

asked Oct 25 '17 06:10

Martinffx

1 Answer

To include the PostgreSQL driver, you can point spark.jars at the driver jar:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()  # create the configuration
conf.set("spark.jars", "/path/to/postgresql-connector-java-someversion-bin.jar")  # set the spark.jars

...
# feed the configuration to the session here
spark = SparkSession.builder \
        .config(conf=conf) \
        .master("local") \
        .appName("Python Spark SQL basic example") \
        .getOrCreate()
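Once the jar is visible to Spark, it can also help to name the driver class explicitly in the read options, so the lookup does not depend on automatic driver discovery. The following is only a sketch: it assumes the usual PostgreSQL class name `org.postgresql.Driver`, and the helper names and the table name `schema.tablename` are hypothetical placeholders, not something from the question.

```python
def postgres_jdbc_options(url, table):
    """Collect the options for a JDBC read, naming the driver class explicitly."""
    return {
        "url": url,
        "dbtable": table,
        # Assumption: the standard class name of the PostgreSQL JDBC driver.
        "driver": "org.postgresql.Driver",
    }


def read_table(spark, url, table):
    """Load a table over JDBC using an already created SparkSession."""
    options = postgres_jdbc_options(url, table)
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical table name; the URL is the one used in the question.
options = postgres_jdbc_options("jdbc:postgresql:dbserver", "schema.tablename")
print(options)
```

With a session built as above, `read_table(spark, "jdbc:postgresql:dbserver", "schema.tablename")` would then return the DataFrame.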

Now, since you are using Docker, I guess you will have to mount the folder that contains the driver jar into the container and refer to the path inside the container (see, e.g., How to mount host directory in docker container?).
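A quick way to confirm that the mount worked is to check, from inside the notebook, that the jar exists at the container path before handing it to spark.jars. This is a minimal sketch: the helper name `spark_jars_value` is hypothetical, and the demonstration uses a temporary stand-in file rather than a real driver jar. The only Spark-specific fact it relies on is that spark.jars takes a comma-separated list of paths.

```python
import os
import tempfile


def spark_jars_value(jar_paths):
    """Return a comma-separated value for spark.jars, failing early if a jar is missing."""
    missing = [path for path in jar_paths if not os.path.isfile(path)]
    if missing:
        raise FileNotFoundError("driver jar(s) not found: " + ", ".join(missing))
    return ",".join(jar_paths)


# Demonstration with an empty stand-in file in place of the mounted driver jar.
with tempfile.TemporaryDirectory() as mounted_dir:
    stand_in_jar = os.path.join(mounted_dir, "driver.jar")
    open(stand_in_jar, "w").close()
    jars_value = spark_jars_value([stand_in_jar])

print(jars_value)
```

In the notebook you would pass the real container path instead, e.g. `conf.set("spark.jars", spark_jars_value(["/path/to/postgresql-connector-java-someversion-bin.jar"]))`.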

Hope this helps, good luck!

Edit: A different way would be to give the --driver-class-path argument when using spark-submit, like this:

spark-submit --driver-class-path=path/to/postgresql-connector-java-someversion-bin.jar file_to_run.py

but I'm guessing this is not how you will run this.
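In a notebook, where you do not call spark-submit yourself, the same arguments can usually be passed through the PYSPARK_SUBMIT_ARGS environment variable. The sketch below assumes the variable is set before the first SparkSession or SparkContext is created in the kernel (once the JVM is running, it is too late) and keeps the placeholder jar path from above; adding --jars next to --driver-class-path is my own choice here, not a requirement.

```python
import os

# Placeholder path, as in the spark.jars example; adjust to the mounted location.
jar = "/path/to/postgresql-connector-java-someversion-bin.jar"

# Must run before any SparkSession/SparkContext exists in this Python process.
# The trailing "pyspark-shell" tells PySpark what to launch after the options.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-class-path {0} --jars {0} pyspark-shell".format(jar)
)

print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

After this cell, the SparkSession can be built as in the question, without the extra configuration object.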

answered Dec 27 '22 23:12

mkaran
