 

PySpark 1.5 & MSSQL jdbc

I am using PySpark on Spark 1.5 on Cloudera YARN, with Python 3.3 on CentOS 6 machines. The SQL Server instance is SQL Server Enterprise 64-bit. The driver is sqljdbc4.jar, which I have added to my .bashrc:

export SPARK_CLASSPATH="/var/lib/spark/sqljdbc4.jar"
export PYSPARK_SUBMIT_ARGS="--conf spark.executor.extraClassPath="/var/lib/spark/sqljdbc4.jar" --driver-class-path="/var/lib/spark/sqljdbc4.jar" --jars="/var/lib/spark/sqljdbc4.jar" --master yarn --deploy-mode client"

When I launch Spark, I can see confirmation that

SPARK_CLASSPATH was detected (set to '/var/lib/spark/sqljdbc4.jar')

I have a DataFrame with the following schema:

root
 |-- daytetime: timestamp (nullable = true)
 |-- ip: string (nullable = true)
 |-- tech: string (nullable = true)
 |-- th: string (nullable = true)
 |-- car: string (nullable = true)
 |-- min_dayte: timestamp (nullable = true)
 |-- max_dayte: timestamp (nullable = true)

I have already created an empty table in my SQL Server database called 'dbo.shaping', where the three timestamp columns will be datetime2(7) and the others nvarchar(50).

I try to export the DataFrame from PySpark using this:

properties = {"user": "<username>", "password": "<password>"} 

df.write.format('jdbc').options(
    url='<IP>:1433/<dbname>',
    dbtable='dbo.shaping',
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    properties=properties)

I get the following traceback:

Py4JError: An error occurred while calling o250.option. Trace:
py4j.Py4JException: Method option([class java.lang.String, class java.util.HashMap]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:744)

Is my approach at least correct? Or is this error related to writing these specific data types, i.e. an issue with the data itself rather than my code?

asked Feb 26 '16 by PR102012

1 Answer

You cannot use a dict as a value for options. The options method expects only str arguments (see the Scala docs and the PySpark annotations) and expands into separate calls to the Java option method.

In current Spark versions the value is automatically converted to a string, so your code would fail silently rather than raise, but that is not the case in 1.5.
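To make that expansion concrete, here is a minimal pure-Python sketch (FakeWriter is a hypothetical stand-in, not Spark code) of how options(**kwargs) degenerates into per-key option calls, and why a dict value triggers the Py4J error in 1.5:

```python
class FakeWriter:
    """Hypothetical stand-in for DataFrameWriter, for illustration only."""

    def __init__(self):
        self._opts = {}

    def option(self, key, value):
        # Spark 1.5 forwards each pair to a Java option(String, String);
        # Py4J maps a Python dict to java.util.HashMap, and no overload
        # with that signature exists -- hence the Py4JError.
        if not isinstance(value, str):
            raise TypeError(
                "Method option([class java.lang.String, class "
                "java.util.HashMap]) does not exist")
        self._opts[key] = value
        return self

    def options(self, **kwargs):
        # options() is just a loop over option() -- one call per key.
        for key, value in kwargs.items():
            self.option(key, value)
        return self


writer = FakeWriter()
writer.options(dbtable="dbo.shaping")          # str value: fine
try:
    writer.options(properties={"user": "u"})   # dict value: fails
except TypeError as exc:
    print(exc)
```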

Since the properties are specific to the JDBC driver anyway, you should use the jdbc method instead:

properties = {
    "user": "<username>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

df.write.jdbc(
    url='jdbc:sqlserver://<IP>:1433;databaseName=<dbname>',
    table='dbo.shaping',
    properties=properties)
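Note that for the Microsoft driver the url must be a complete JDBC URL of the form jdbc:sqlserver://<host>:<port>;databaseName=<db>, not a bare host and port. A small helper (mssql_jdbc_url is a hypothetical convenience function, not part of any library) shows the shape:

```python
def mssql_jdbc_url(host, port=1433, database=None):
    """Build a SQL Server JDBC URL:
    jdbc:sqlserver://<host>:<port>;databaseName=<db>"""
    url = "jdbc:sqlserver://{}:{}".format(host, port)
    if database is not None:
        url += ";databaseName={}".format(database)
    return url


print(mssql_jdbc_url("<IP>", database="<dbname>"))
# -> jdbc:sqlserver://<IP>:1433;databaseName=<dbname>
```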

though unpacking the properties should work as well:

.options(
    url='jdbc:sqlserver://<IP>:1433;databaseName=<dbname>',
    dbtable='dbo.shaping',
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    **properties)

In general, when you see:

py4j.Py4JException: Method ... does not exist

it usually signals a mismatch between the local Python types and the types expected by the JVM method being called.

See also: How to use JDBC source to write and read data in (Py)Spark?

answered Oct 12 '22 by zero323