I'm trying to read a table from a Postgres database with PySpark. I have set up the following code and verified that the SparkContext exists:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /tmp/jars/postgresql-42.0.0.jar --jars /tmp/jars/postgresql-42.0.0.jar pyspark-shell'
from pyspark import SparkContext, SparkConf
conf = SparkConf()
conf.setMaster("local[*]")
conf.setAppName('pyspark')
sc = SparkContext(conf=conf)
from pyspark.sql import SQLContext
properties = {
"driver": "org.postgresql.Driver"
}
url = 'jdbc:postgresql://tom:@localhost/gqp'
sqlContext = SQLContext(sc)
sqlContext.read \
.format("jdbc") \
.option("url", url) \
.option("driver", properties["driver"]) \
.option("dbtable", "specimen") \
.load()
I get the following error:
Py4JJavaError: An error occurred while calling o812.load. : java.lang.NullPointerException
The database is named gqp and the table is specimen. I have verified that the database is running on localhost using the Postgres.app macOS app.
The URL was the problem!
Originally it was: url = 'jdbc:postgresql://tom:@localhost/gqp'
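I removed the tom:@ part, and it worked. The URL must follow the pattern jdbc:postgresql://host:port/db_name (the port is optional), whereas mine was copied directly from a Flask project. With the Postgres JDBC driver, credentials go in the connection properties (user, password) rather than in front of the host.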
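For anyone else pasting a connection string from a Flask/SQLAlchemy config, here is a minimal sketch of splitting it into a credential-free JDBC URL plus a properties dict. The helper name `to_jdbc` is made up for illustration, and it assumes the simple user:password@host[:port]/db shape from the question. The Spark call is left as a comment so the snippet runs without a cluster or database.

```python
from urllib.parse import urlsplit

def to_jdbc(url_with_credentials):
    """Hypothetical helper: split a URL that embeds credentials into
    (credential-free JDBC URL, connection properties)."""
    raw = url_with_credentials
    # Drop a leading 'jdbc:' so urlsplit sees a plain scheme://netloc/path URL.
    if raw.startswith("jdbc:"):
        raw = raw[len("jdbc:"):]
    parts = urlsplit(raw)

    # Rebuild the host part without the 'user:password@' prefix.
    netloc = parts.hostname
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    jdbc_url = f"jdbc:postgresql://{netloc}{parts.path}"

    # Pass credentials as connection properties instead.
    props = {"driver": "org.postgresql.Driver"}
    if parts.username:
        props["user"] = parts.username
    if parts.password:
        props["password"] = parts.password
    return jdbc_url, props

jdbc_url, props = to_jdbc("jdbc:postgresql://tom:@localhost/gqp")

# With the sqlContext from the question, the read would then look like:
# df = sqlContext.read.jdbc(url=jdbc_url, table="specimen", properties=props)
```

Here the empty password is simply left out of the properties. If your database requires one, it would go in props["password"].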
If you're reading this, hope you didn't make this same mistake :)