 

Apache Spark : JDBC connection not working

I have asked this question before but did not get any answer (Not able to connect to postgres using jdbc in pyspark shell).

I have successfully installed Spark 1.3.0 on my local Windows machine and ran sample programs in the pyspark shell to test it.

Now, I want to run Correlations from MLlib on data stored in PostgreSQL, but I am not able to connect to PostgreSQL.

I have added the required jar (which I tested separately) to the classpath by running

pyspark --jars "C:\path\to\jar\postgresql-9.2-1002.jdbc3.jar"

I can see that the jar is listed in the Environment tab of the Spark UI.

When I run the following in the pyspark shell:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc",url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")  

I get this error:

>>> df = sqlContext.load(source="jdbc",url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\pyspark\sql\context.py", line 482, in load
    df = self._ssql_ctx.load(source, joptions)
  File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
  File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.load.
: java.sql.SQLException: No suitable driver found for jdbc:postgresql://[host]/[dbname]
        at java.sql.DriverManager.getConnection(DriverManager.java:602)
        at java.sql.DriverManager.getConnection(DriverManager.java:207)
        at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:94)
        at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:125)
        at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:114)
        at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:290)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:667)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:619)
asked Apr 23 '15 by Soni Shashank

People also ask

How does JDBC work in Spark?

Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD. This is because the results are returned as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources.

How does Spark connect to database?

To connect to any database you basically need a common set of properties: the database driver, the database URL, a username, and a password. Connecting from pyspark requires the same set of properties; url is the JDBC URL used to connect to the database.
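Those common properties can be sanity-checked before handing them to Spark. Below is a small illustrative helper; the helper name and the required-key list are my own, not part of any Spark API:

```python
# Illustrative helper (not a Spark API): list which of the usual JDBC
# connection properties are missing from an options dict.
REQUIRED_KEYS = ("url", "user", "password", "driver")

def check_jdbc_options(opts):
    """Return the required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not opts.get(k)]

# The question's call supplies only a url (plus dbtable), with no driver --
# exactly the combination that ends in "No suitable driver found":
missing = check_jdbc_options({"url": "jdbc:postgresql://[host]/[dbname]"})
```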


2 Answers

I had this exact problem with MySQL/MariaDB, and got a big clue from this question

So your pyspark command should be:

pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>

Also watch for errors when pyspark starts, such as "Warning: Local jar ... does not exist, skipping." and "ERROR SparkContext: Jar not found at ..."; these probably mean you spelled the path wrong.
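Besides the classpath flags, the driver class can also be passed explicitly to the load call itself. A sketch, keeping the question's placeholders; it is meant to run inside a pyspark shell where `sqlContext` exists:

```python
# Sketch: the Spark 1.3-era load call with the driver class spelled out,
# so DriverManager can resolve the Postgres driver. [host], [dbname] and
# [schema.table] are the question's placeholders, not real values.
options = {
    "url": "jdbc:postgresql://[host]/[dbname]",
    "dbtable": "[schema.table]",
    "driver": "org.postgresql.Driver",  # the key missing in the question
}

try:
    df = sqlContext.load(source="jdbc", **options)  # noqa: F821
except NameError:
    pass  # not inside a pyspark shell; options shown for reference
```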

answered Oct 12 '22 by 8forty


As jake256 suggested, the

"driver": "org.postgresql.Driver"

key-value pair was missing. In my case, I launched pyspark as:

pyspark --jars /path/to/postgresql-9.4.1210.jar

with the following code:

  from pyspark.sql import DataFrameReader

  url = 'postgresql://192.168.2.4:5432/postgres'
  properties = {'user': 'myUser', 'password': 'myPasswd', 'driver': 'org.postgresql.Driver'}
  df = DataFrameReader(sqlContext).jdbc(
      url='jdbc:%s' % url, table='weather', properties=properties
  )
  df.show()

  +-------------+-------+-------+-----------+----------+
  |         city|temp_lo|temp_hi|       prcp|      date|
  +-------------+-------+-------+-----------+----------+
  |San Francisco|     46|     50|       0.25|1994-11-27|
  |San Francisco|     43|     57|        0.0|1994-11-29|
  |      Hayward|     54|     37|0.239999995|1994-11-29|
  +-------------+-------+-------+-----------+----------+

Tested on:

  • Ubuntu 16.04

  • PostgreSQL server version 9.5

  • PostgreSQL JDBC driver postgresql-9.4.1210.jar

  • Spark spark-2.0.0-bin-hadoop2.6 (I am confident it should also work on spark-2.0.0-bin-hadoop2.7)

  • Java JDK 1.8, 64-bit
Other JDBC driver URLs can be found at: https://www.petefreitag.com/articles/jdbc_urls/

The tutorial I followed is at: https://developer.ibm.com/clouddataservices/2015/08/19/speed-your-sql-queries-with-spark-sql/

A similar solution was also suggested at: pyspark mysql jdbc load An error occurred while calling o23.load No suitable driver

answered Oct 12 '22 by aks