Apache Spark : JDBC connection not working

Tags:

I have asked this question previously also but did not got any answer (Not able to connect to postgres using jdbc in pyspark shell).

I have successfully installed Spark 1.3.0 on my local windows and ran sample programs to test using pyspark shell.

Now, I want to run Correlations from Mllib on the data that is stored in Postgresql, but I am not able to connect to postgresql.

I have successfully added the required jar (tested this jar) in the classpath by running

pyspark --jars "C:\path\to\jar\postgresql-9.2-1002.jdbc3.jar"

I can see that jar is successfully added in environment UI.

When I run the following in pyspark shell-

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc",url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")

I get this ERROR -

>>> df = sqlContext.load(source="jdbc",url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\pyspark\sql\context.py", line 482, in load
    df = self._ssql_ctx.load(source, joptions)
  File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
  File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.load.
: java.sql.SQLException: No suitable driver found for     jdbc:postgresql://[host]/[dbname]
        at java.sql.DriverManager.getConnection(DriverManager.java:602)
        at java.sql.DriverManager.getConnection(DriverManager.java:207)
        at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:94)
        at org.apache.spark.sql.jdbc.JDBCRelation.<init>    (JDBCRelation.scala:125)
        at  org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:114)
        at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:290)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:667)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:619)

612

asked Apr 23 '15 11:04

Soni Shashank

2 Answers

I had this exact problem with mysql/mariadb, and got BIG clue from this question

So your pyspark command should be:

pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>

Also watch for errors when pyspark start like "Warning: Local jar ... does not exist, skipping." and "ERROR SparkContext: Jar not found at ...", these probably mean you spelled the path wrong.

answered Oct 12 '22 13:10

8forty

As jake256 suggested

"driver", "org.postgresql.Driver"

key-value pair was missing. In my case, I launched pyspark as :

pyspark --jars /path/to/postgresql-9.4.1210.jar

with following instructions :

  from pyspark.sql import DataFrameReader

  url = 'postgresql://192.168.2.4:5432/postgres'
  properties = {'user': 'myUser', 'password': 'myPasswd', 'driver': 'org.postgresql.Driver'}
  df = DataFrameReader(sqlContext).jdbc(
      url='jdbc:%s' % url, table='weather', properties=properties
  )
  df.show()

  +-------------+-------+-------+-----------+----------+
  |         city|temp_lo|temp_hi|       prcp|      date|
  +-------------+-------+-------+-----------+----------+
  |San Francisco|     46|     50|       0.25|1994-11-27|
  |San Francisco|     43|     57|        0.0|1994-11-29|
  |      Hayward|     54|     37|0.239999995|1994-11-29|
  +-------------+-------+-------+-----------+----------+

Tested on :

Ubuntu 16.04
PostgreSQL server version 9.5.
Postgresql driver used is postgresql-9.4.1210.jar
and Spark version is spark-2.0.0-bin-hadoop2.6
but I am also confident that it should also work on spark-2.0.0-bin-hadoop2.7.
Java JDK 1.8 64bits

other JDBC Drivers can be found on : https://www.petefreitag.com/articles/jdbc_urls/

tutorial I followed is on : https://developer.ibm.com/clouddataservices/2015/08/19/speed-your-sql-queries-with-spark-sql/

similar solution was suggested also on : pyspark mysql jdbc load An error occurred while calling o23.load No suitable driver

answered Oct 12 '22 11:10

aks

Related questions
                            
                                How to add a 'total' row in a grouped query (in Postgresql)?
                            
                                Django reset auto-increment pk/id field for production
                            
                                Add constraint to make column unique per group of rows
                            
                                How to connect to a cluster in Amazon Redshift using SQLAlchemy?
                            
                                Postgresql Check if query is still running
                            
                                Getting the following error: "SequelizeDatabaseError column "createdAt" does not exist"
                            
                                Restoring the data from pg_dump doesn't overwrite the data but it appends the data to the original database
                            
                                PostgreSQL ON CONFLICT with a WHERE clause
                            
                                How to implement full text search on complex nested JSONB in Postgresql
                            
                                How good is Rails and PostgreSQL support?
                            
                                Database performance: filtering on column vs. separate table
                            
                                Restore PostgreSQL db from backup without foreign key constraint issue
                            
                                Select only numeric part of string only if it starts with a numeric value
                            
                                Get error codes while using psql
                            
                                what is the equivalent of all_tables and all_tab_columns in postgresql
                            
                                Cast Const Integer to Bigint in Postgres
                            
                                Postgresql lowercase to compare data
                            
                                Unit of return value of ST_Distance
                            
                                How to delete rows using an outer join
                            
                                Heroku + Rails + PG: ActiveRecord::StatementInvalid (PG::ConnectionBad: PQconsumeInput() SSL connection has been closed unexpectedly

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Apache Spark : JDBC connection not working

Tags:

postgresql

jdbc

apache-spark

apache-spark-sql

Soni Shashank

People also ask

2 Answers

8forty

aks

Recent Activity

Donate For Us