I'm working with Apache Spark 1.4.0 on Windows 7 x64 with Java 1.8.0_45 x64 and Python 2.7.10 x86 in IPython 3.2.0
I am attempting to write a DataFrame-based program in an IPython notebook that reads from and writes back to an SQL Server database.
So far I can read data from the database:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc", url="jdbc:sqlserver://serverURL",
                     dbtable="dbName.tableName",
                     driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
                     user="userName", password="password")
and convert the data to a pandas DataFrame and do whatever I want with it. (Getting this far was more than a little hassle, but it works after adding Microsoft's sqljdbc42.jar to spark.driver.extraClassPath in spark-defaults.conf.)
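For reference, the classpath setting in conf/spark-defaults.conf looks like the following (the jar path below is an assumption; substitute wherever you saved sqljdbc42.jar):

```
spark.driver.extraClassPath  C:\path\to\sqljdbc42.jar
```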
The current problem arises when I go to write the data back to SQL Server with the DataFrameWriter API:
df.write.jdbc("jdbc:sqlserver://serverURL", "dbName.SparkTestTable1", dict(driver="com.microsoft.sqlserver.jdbc.SQLServerDriver", user="userName", password="password"))
---------------------------------------------------------------------------
Py4JError Traceback (most recent call last)
<ipython-input-19-8502a3e85b1e> in <module>()
----> 1 df.write.jdbc("jdbc:sqlserver://jdbc:sqlserver", "dbName.SparkTestTable1", dict(driver="com.microsoft.sqlserver.jdbc.SQLServerDriver", user="userName", password="password"))
C:\Users\User\Downloads\spark-1.4.0-bin-hadoop2.6\python\pyspark\sql\readwriter.pyc in jdbc(self, url, table, mode, properties)
    394         for k in properties:
    395             jprop.setProperty(k, properties[k])
--> 396         self._jwrite.mode(mode).jdbc(url, table, jprop)
    397
    398
C:\Python27\lib\site-packages\py4j\java_gateway.pyc in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539
    540         for temp_arg in temp_args:
C:\Python27\lib\site-packages\py4j\protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
    302                 raise Py4JError(
    303                     'An error occurred while calling {0}{1}{2}. Trace:\n{3}\n'.
--> 304                     format(target_id, '.', name, value))
    305         else:
    306             raise Py4JError(
Py4JError: An error occurred while calling o49.mode. Trace:
py4j.Py4JException: Method mode([class java.util.HashMap]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
The problem seems to be that py4j cannot find the java.util.HashMap class when it goes to convert my connectionProperties dictionary into a JVM object. Adding rt.jar (with its full path) to spark.driver.extraClassPath does not resolve the issue. Removing the dictionary from the write call avoids this error, but of course the write then fails due to the lack of a driver and authentication.
Edit: The o49.mode part of the error changes from run to run.
Davies Liu on the Spark users mailing list found the problem. There is a subtle difference between the Scala and Python APIs that I missed: in the Python API you must pass a mode string (such as "overwrite") as the third parameter, whereas in Scala the mode is set by a separate .mode() call. Without it, the properties dictionary gets bound to the mode parameter, which is why the JVM complained that no mode method accepts a java.util.HashMap. Changing the statement as follows resolves the issue:
df.write.jdbc("jdbc:sqlserver://serverURL", "dbName.SparkTestTable1", "overwrite",
              dict(driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
                   user="userName", password="password"))
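The positional mix-up can be seen without Spark at all. The sketch below mimics the Python signature of DataFrameWriter.jdbc in 1.4, jdbc(url, table, mode='error', properties={}), with a stand-in function (the function body is my own illustration, not Spark's code):

```python
# Stand-in with the same parameter order as PySpark 1.4's
# DataFrameWriter.jdbc(url, table, mode='error', properties={}).
def jdbc(url, table, mode='error', properties={}):
    # Spark forwards `mode` to the JVM writer; if a dict arrives here,
    # py4j raises "Method mode([class java.util.HashMap]) does not exist".
    return type(mode).__name__

props = dict(driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
             user="userName", password="password")

# Omitting the mode string binds the properties dict to `mode`:
print(jdbc("jdbc:sqlserver://serverURL", "tbl", props))         # dict
# Passing "overwrite" third shifts the dict into `properties`:
print(jdbc("jdbc:sqlserver://serverURL", "tbl", "overwrite", props))  # str
```

Passing the dictionary as a keyword argument (properties=props) should also sidestep the positional trap, since only mode then takes its default.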