I've installed Spark on a Windows machine and want to use it via Spyder. After some troubleshooting, the basics seem to work:
import os
os.environ["SPARK_HOME"] = "D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6"
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
textFile = sc.textFile("D:\\Analytics\\Spark\\spark-1.4.0-bin-hadoop2.6\\README.md")
textFile.count()
textFile.filter(lambda line: "Spark" in line).count()
sc.stop()
This runs as expected. I now want to connect to a Postgres 9.3 database running on the same server. I have downloaded the JDBC driver from here and put it in the folder D:\Analytics\Spark\spark_jars. I then created a new file D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6\conf\spark-defaults.conf containing this line:
spark.driver.extraClassPath 'D:\\Analytics\\Spark\\spark_jars\\postgresql-9.3-1103.jdbc41.jar'
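I'm not certain spark-defaults.conf strips quotes, since as far as I know it is parsed as plain whitespace-separated key/value pairs, so the unquoted variant would look like this (forward slashes also work for Java paths on Windows):
spark.driver.extraClassPath D:/Analytics/Spark/spark_jars/postgresql-9.3-1103.jdbc41.jar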
I then ran the following code to test the connection:
import os
os.environ["SPARK_HOME"] = "D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6"
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
df = (sqlContext
      .load(source="jdbc",
            url="jdbc:postgresql://[hostname]/[database]?user=[username]&password=[password]",
            dbtable="pubs")
      )
sc.stop()
But I am getting the following error:
Py4JJavaError: An error occurred while calling o22.load.
: java.sql.SQLException: No suitable driver found for jdbc:postgresql://uklonana01/stonegate?user=analytics&password=pMOe8jyd
at java.sql.DriverManager.getConnection(Unknown Source)
at java.sql.DriverManager.getConnection(Unknown Source)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:128)
at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:113)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:265)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
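For completeness, here is the same load written against the DataFrameReader API introduced in Spark 1.4, with the driver class spelled out explicitly, which I've seen suggested for this error (placeholders as above):
df = (sqlContext.read
      .format("jdbc")
      .options(url="jdbc:postgresql://[hostname]/[database]?user=[username]&password=[password]",
               dbtable="pubs",
               driver="org.postgresql.Driver")  # name the driver class explicitly
      .load())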
How can I check whether I've downloaded the right .jar file, or figure out where else the error might come from?
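The closest thing to a check I know of is asking the driver JVM directly whether it can load the class, via PySpark's internal _jvm gateway (not a public API, and it must run while the SparkContext is still alive):
# Diagnostic sketch: sc._jvm is PySpark's internal Py4J gateway.
try:
    sc._jvm.java.lang.Class.forName("org.postgresql.Driver")
    print("PostgreSQL driver is visible to the driver JVM")
except Exception as e:
    print("driver class not found: %s" % e)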
I have tried the SPARK_CLASSPATH environment variable, but it doesn't work with Spark 1.6.
Other answers, on posts like the ones below, suggest adding pyspark command-line arguments, and that works:
Not able to connect to postgres using jdbc in pyspark shell
Apache Spark : JDBC connection not working
pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>
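If you need the same flags from inside Spyder rather than from a shell, my understanding is that the PYSPARK_SUBMIT_ARGS environment variable carries them; it must be set before the SparkContext is created and must end with pyspark-shell. A sketch, with the jar path as a placeholder:
import os

jar = "PATH\\TO\\postgresql-9.3-1103.jdbc41.jar"  # placeholder path
# Read when the JVM is launched, so set it before creating the SparkContext.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-class-path %s --jars %s pyspark-shell" % (jar, jar)
)

from pyspark import SparkContext, SparkConf
sc = SparkContext(conf=SparkConf().setMaster("local[8]"))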
Alternatively, for versions before 1.6, remove spark-defaults.conf and set SPARK_CLASSPATH in the environment from Python, before the SparkContext is created, like this:
os.environ["SPARK_CLASSPATH"] = 'PATH\\TO\\postgresql-9.3-1101.jdbc41.jar'