How to specify driver class path when using pyspark within a jupyter notebook?

I want to query a PostgreSQL database with pyspark from within a Jupyter notebook. I have browsed many questions on Stack Overflow, but none of them worked for me, mainly because the answers seemed outdated. Here is my minimal code:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)

Running this from a notebook raises the following error:

Py4JJavaError: An error occurred while calling o69.jdbc.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at scala.Option.getOrElse(Option.scala:121)...

The main tips I have found are summed up in the link below, but unfortunately I can't get them to work in my notebook:

Pyspark connection to Postgres database in ipython notebook

Note: I am using Spark 2.3.1 and Python 3.6.3, and I am able to connect to the database from the pyspark shell if I specify the jar location:

pyspark --driver-class-path /home/.../postgresql.jar --jars /home/.../jars/postgresql.jar

Thanks to anyone who can help me on this one.

EDIT

The answers from How to load jar dependencies in IPython Notebook are already listed in the link I shared, and they do not work for me. I have already tried configuring the environment variable from the notebook:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql.jar --jars /path/to/postgresql.jar'

There's nothing wrong with the file path or the file itself, since it works fine when I specify it and run the pyspark shell.
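For what it's worth, PYSPARK_SUBMIT_ARGS is only read when the JVM gateway is launched, so it must be set before the first SparkSession is created in the notebook, and it is conventionally terminated with a pyspark-shell token. A minimal sketch of that form (the paths are placeholders, same as above):

import os

# Must be set before the first SparkSession is created; the trailing
# 'pyspark-shell' token is needed for pyspark to pick these args up.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--driver-class-path /path/to/postgresql.jar '
    '--jars /path/to/postgresql.jar pyspark-shell'
)

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()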

asked Aug 09 '18 by kluu

People also ask

Can we run spark in Jupyter notebook?

The power of Spark + Jupyter: PySpark allows users to interact with Apache Spark without having to learn a different language like Scala. The combination of Jupyter Notebooks with Spark provides developers with a powerful and familiar development environment while harnessing the power of Apache Spark.


1 Answer

Using the config method worked for me:

from pyspark.sql import SparkSession

# Put the PostgreSQL JDBC driver on the driver classpath before the session starts
spark = (
    SparkSession.builder
    .config('spark.driver.extraClassPath', '/path/to/postgresql.jar')
    .getOrCreate()
)
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
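One caveat: spark.driver.extraClassPath is read when the JVM starts, so this config only takes effect if no SparkSession exists yet; calling getOrCreate() against an already-running session silently ignores it, so restart the notebook kernel first. As a minimal sketch (using the same placeholder paths and credentials), spark.jars can also be set to ship the jar to the executors, and the same read can be expressed through the generic jdbc data source, which lets you name the driver class explicitly:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config('spark.driver.extraClassPath', '/path/to/postgresql.jar')
    .config('spark.jars', '/path/to/postgresql.jar')  # also distribute the jar to executors
    .getOrCreate()
)

# Equivalent read via the generic JDBC source, naming the driver class explicitly
df = (
    spark.read.format('jdbc')
    .option('url', 'jdbc:postgresql://host/dbname')
    .option('dbtable', 'tablename')
    .option('user', 'username')
    .option('password', 'pwd')
    .option('driver', 'org.postgresql.Driver')
    .load()
)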
answered Sep 22 '22 by kluu