Not able to connect to Postgres using JDBC in the PySpark shell

I am using a standalone cluster on my local Windows machine and trying to load data from one of our servers using the following code:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc", url="jdbc:postgresql://host/dbname", dbtable="schema.tablename")

I have set SPARK_CLASSPATH as follows:

os.environ['SPARK_CLASSPATH'] = r"C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\postgresql-9.2-1002.jdbc3.jar"

While executing sqlContext.load, it throws an error: "No suitable driver found for jdbc:postgresql". I have tried searching the web but have not been able to find a solution.
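(Side note for anyone hitting the same error: the JDBC data source also accepts an explicit driver option, which sidesteps DriverManager's automatic driver lookup. A minimal sketch against the same Spark 1.3 call, reusing the placeholder host and table from above, and assuming the pgJDBC jar really is on the driver's classpath:)

# Sketch only: host, database, and table names are placeholders.
# Naming the driver class explicitly can resolve "No suitable driver found"
# when the jar is present but DriverManager never registered the driver.
df = sqlContext.load(
    source="jdbc",
    url="jdbc:postgresql://host/dbname",
    dbtable="schema.tablename",
    driver="org.postgresql.Driver",
)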

Soni Shashank, asked Apr 16 '15

People also ask

Can JDBC connect to PostgreSQL?

The PostgreSQL JDBC Driver allows Java programs to connect to a PostgreSQL database using standard, database-independent Java code. pgJDBC is an open-source JDBC driver written in pure Java (Type 4) that communicates over PostgreSQL's native network protocol.


2 Answers

Maybe this will be helpful.

In my environment, SPARK_CLASSPATH contains the path to the PostgreSQL connector:

from pyspark import SparkContext, SparkConf
from pyspark.sql import DataFrameReader, SQLContext
import os

sparkClassPath = os.getenv('SPARK_CLASSPATH', '/path/to/connector/postgresql-42.1.4.jar')

# Populate configuration
conf = SparkConf()
conf.setAppName('application')
conf.set('spark.jars', 'file:%s' % sparkClassPath)
conf.set('spark.executor.extraClassPath', sparkClassPath)
conf.set('spark.driver.extraClassPath', sparkClassPath)
# Uncomment the line below and adjust the address if you need to use a cluster on a different IP address
#conf.set('spark.master', 'spark://127.0.0.1:7077')

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

url = 'postgresql://127.0.0.1:5432/postgresql'
properties = {'user':'username', 'password':'password'}

df = DataFrameReader(sqlContext).jdbc(url='jdbc:%s' % url, table='tablename', properties=properties)

df.printSchema()
df.show()

This piece of code lets you use PySpark wherever you need it. For example, I've used it in a Django project.
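(On Spark 1.4+ the same read can also be expressed through the reader attached to the SQLContext; a sketch reusing the url and properties defined above:)

# Equivalent call via the SQLContext's DataFrameReader (Spark 1.4+),
# reusing the url and properties from the snippet above.
df = sqlContext.read.jdbc(url='jdbc:%s' % url, table='tablename', properties=properties)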

avkghost, answered Sep 29 '22


I had the same problem with MySQL and was never able to get it to work with the SPARK_CLASSPATH approach. However, I did get it to work with extra command-line arguments; see the answer to this question.

To save you the click-through, here's what you have to do:

pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>
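For example, with a hypothetical local jar path and the standalone master address mentioned earlier in this thread:

pyspark --conf spark.executor.extraClassPath=/path/to/postgresql-42.1.4.jar \
  --driver-class-path /path/to/postgresql-42.1.4.jar \
  --jars /path/to/postgresql-42.1.4.jar \
  --master spark://127.0.0.1:7077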
8forty, answered Sep 29 '22