How to specify driver class path when using pyspark within a jupyter notebook?

Tags: python, jupyter-notebook, apache-spark, pyspark

I want to query a PostgreSQL database with pyspark from a Jupyter notebook. I have browsed many questions on StackOverflow, but none of the answers worked for me, mainly because they seemed outdated. Here's my minimal code:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)

Running this from a notebook raises the following error:

Py4JJavaError: An error occurred while calling o69.jdbc.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at scala.Option.getOrElse(Option.scala:121)...

The main tips I found are summed up in the question linked below, but unfortunately I can't get them to work in my notebook:

Pyspark connection to Postgres database in ipython notebook

Note: I am using Spark 2.3.1 and Python 3.6.3, and I am able to connect to the database from the pyspark shell if I specify the jar location:

pyspark --driver-class-path /home/.../postgresql.jar --jars /home/.../jars/postgresql.jar

Thanks to anyone who can help me on this one.

EDIT

The answers from How to load jar dependenices in IPython Notebook are already listed in the link I shared myself, and do not work for me. I already tried to configure the environment variable from the notebook:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql.jar --jars /path/to/postgresql.jar'

There's nothing wrong with the file path or the file itself, since the connection works fine when I pass the same jar while launching the pyspark shell.
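
For reference, here is a minimal sketch of how this environment-variable approach is commonly written. It rests on two assumptions not stated above: that the value ends with the `pyspark-shell` token, which PySpark's launcher appears to expect when the variable is set by hand, and that the variable is set before the first SparkSession is created in the kernel. `build_submit_args` is a hypothetical helper, and the jar path is a placeholder.

```python
import os


def build_submit_args(jar_path):
    """Build a PYSPARK_SUBMIT_ARGS value that adds one jar to the class path.

    Hypothetical helper; the trailing 'pyspark-shell' token is an assumption
    about what the PySpark launcher needs when this variable is set manually.
    """
    return "--driver-class-path {0} --jars {0} pyspark-shell".format(jar_path)


# Placeholder path, replace with the real location of the PostgreSQL JDBC jar.
JAR_PATH = "/path/to/postgresql.jar"

# Assumption: this runs before any SparkSession or SparkContext exists in the
# kernel, because the variable is only read when the JVM is launched.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(JAR_PATH)
```

If a session already exists in the notebook, changing the variable afterwards is unlikely to have any effect until the kernel is restarted.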

asked Aug 09 '18 16:08

kluu

1 Answer

Setting spark.driver.extraClassPath through the builder's config method worked for me:

from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', '/path/to/postgresql.jar').getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
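
A slightly more defensive variant of the same idea is sketched below. It is not part of what was tested above: it additionally sets `spark.jars` so the jar is also shipped to executors, and names the driver class explicitly through the JDBC `driver` property. The helper names (`jdbc_session_conf`, `jdbc_properties`, `make_session`) are hypothetical, and the jar path, host, and credentials remain placeholders.

```python
def jdbc_session_conf(jar_path):
    """Spark config entries that expose a JDBC driver jar (hypothetical helper)."""
    return {
        # Driver JVM class path, the setting used in the answer above.
        "spark.driver.extraClassPath": jar_path,
        # Assumption: also distribute the jar to executors.
        "spark.jars": jar_path,
    }


def jdbc_properties(user, password):
    """Connection properties with the PostgreSQL driver class named explicitly."""
    return {
        "user": user,
        "password": password,
        # Assumption: naming the class avoids relying on driver auto-discovery.
        "driver": "org.postgresql.Driver",
    }


def make_session(jar_path):
    """Create (or fetch) a SparkSession configured with the jar settings."""
    # Imported lazily so the helpers above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in jdbc_session_conf(jar_path).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Example usage with placeholder values (requires pyspark and a reachable database):
# spark = make_session('/path/to/postgresql.jar')
# df = spark.read.jdbc(url='jdbc:postgresql://host/dbname', table='tablename',
#                      properties=jdbc_properties('username', 'pwd'))
```

Note that getOrCreate returns any session already running in the kernel, in which case the class path setting probably will not take effect; restarting the kernel before running this is the safer route.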

answered Sep 22 '22 00:09

kluu
