Not able to connect to Postgres using JDBC in the PySpark shell

I am using a standalone cluster on my local Windows machine and trying to load data from one of our servers using the following code:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc", url="jdbc:postgresql://host/dbname", dbtable="schema.tablename")

I have set SPARK_CLASSPATH as follows:

os.environ['SPARK_CLASSPATH'] = r"C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\postgresql-9.2-1002.jdbc3.jar"

While executing sqlContext.load, it throws an error: "No suitable driver found for jdbc:postgresql". I have tried searching the web but have not been able to find a solution.
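(Side note for anyone hitting the same error: the JDBC data source also accepts an explicit driver option, which sidesteps DriverManager's automatic driver lookup. A minimal sketch against the same Spark 1.3 call, reusing the placeholder host and table from above, and assuming the pgJDBC jar really is on the driver's classpath:)

# Sketch only: host, database, and table names are placeholders.
# Naming the driver class explicitly can resolve "No suitable driver found"
# when the jar is present but DriverManager never registered the driver.
df = sqlContext.load(
    source="jdbc",
    url="jdbc:postgresql://host/dbname",
    dbtable="schema.tablename",
    driver="org.postgresql.Driver",
)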

Soni Shashank, asked Apr 16 '15

People also ask

Can JDBC connect to PostgreSQL?

The PostgreSQL JDBC Driver allows Java programs to connect to a PostgreSQL database using standard, database-independent Java code. pgJDBC is an open-source JDBC driver written in pure Java (Type 4) that communicates over PostgreSQL's native network protocol.


2 Answers

Maybe this will be helpful.

In my environment, SPARK_CLASSPATH contains the path to the PostgreSQL connector:

from pyspark import SparkContext, SparkConf
from pyspark.sql import DataFrameReader, SQLContext
import os

sparkClassPath = os.getenv('SPARK_CLASSPATH', '/path/to/connector/postgresql-42.1.4.jar')

# Populate configuration
conf = SparkConf()
conf.setAppName('application')
conf.set('spark.jars', 'file:%s' % sparkClassPath)
conf.set('spark.executor.extraClassPath', sparkClassPath)
conf.set('spark.driver.extraClassPath', sparkClassPath)
# Uncomment the line below and adjust the address if you need to use a cluster on a different IP address
#conf.set('spark.master', 'spark://127.0.0.1:7077')

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

url = 'postgresql://127.0.0.1:5432/postgresql'
properties = {'user':'username', 'password':'password'}

df = DataFrameReader(sqlContext).jdbc(url='jdbc:%s' % url, table='tablename', properties=properties)

df.printSchema()
df.show()

This piece of code lets you use PySpark wherever you need it. For example, I've used it in a Django project.
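(On Spark 1.4+ the same read can also be expressed through the reader attached to the SQLContext; a sketch reusing the url and properties defined above:)

# Equivalent call via the SQLContext's DataFrameReader (Spark 1.4+),
# reusing the url and properties from the snippet above.
df = sqlContext.read.jdbc(url='jdbc:%s' % url, table='tablename', properties=properties)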

avkghost, answered Sep 29 '22


I had the same problem with MySQL and was never able to get it to work with the SPARK_CLASSPATH approach. However, I did get it to work with extra command-line arguments; see the answer to this question.

To save you the click-through, here's what you have to do:

pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>
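For example, with a hypothetical local jar path and the standalone master address mentioned earlier in this thread:

pyspark --conf spark.executor.extraClassPath=/path/to/postgresql-42.1.4.jar \
  --driver-class-path /path/to/postgresql-42.1.4.jar \
  --jars /path/to/postgresql-42.1.4.jar \
  --master spark://127.0.0.1:7077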
8forty, answered Sep 29 '22