 

PySpark 1.5 & MSSQL jdbc

I am using PySpark on Spark 1.5 on Cloudera YARN, with Python 3.3 on CentOS 6 machines. The SQL Server instance is SQL Server Enterprise 64-bit. The driver is sqljdbc4.jar, which I have added to my .bashrc:

export SPARK_CLASSPATH="/var/lib/spark/sqljdbc4.jar"
export PYSPARK_SUBMIT_ARGS="--conf spark.executor.extraClassPath="/var/lib/spark/sqljdbc4.jar" --driver-class-path="/var/lib/spark/sqljdbc4.jar" --jars="/var/lib/spark/sqljdbc4.jar" --master yarn --deploy-mode client"

When I launch Spark, I can see confirmation that

SPARK_CLASSPATH was detected (set to '/var/lib/spark/sqljdbc4.jar')

I have a DataFrame with the following schema:

root
 |-- daytetime: timestamp (nullable = true)
 |-- ip: string (nullable = true)
 |-- tech: string (nullable = true)
 |-- th: string (nullable = true)
 |-- car: string (nullable = true)
 |-- min_dayte: timestamp (nullable = true)
 |-- max_dayte: timestamp (nullable = true)

I have already created an empty table in my SQL Server database called 'dbo.shaping', where the three timestamp columns will be datetime2(7) and the others nvarchar(50).

I try to export the DataFrame from PySpark using this:

properties = {"user": "<username>", "password": "<password>"} 

df.write.format('jdbc').options(
    url='<IP>:1433/<dbname>',
    dbtable='dbo.shaping',
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    properties=properties)

I get the following traceback:

Py4JError: An error occurred while calling o250.option. Trace:
py4j.Py4JException: Method option([class java.lang.String, class java.util.HashMap]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:744)

Is my approach at least correct? Or is this error related to writing these specific data types, i.e. an issue with the data itself rather than my code?

asked Feb 26 '16 by PR102012

1 Answer

You cannot use a dict as a value for options. The options method expects only str arguments (see the Scala docs and the PySpark annotations) and expands into separate calls to the Java option method.

In current Spark versions the value is automatically converted to a string, so your code would fail silently rather than raise, but that is not the case in 1.5.
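To make that expansion concrete, here is a minimal pure-Python sketch (FakeWriter is a hypothetical stand-in, not Spark code) of how options(**kwargs) degenerates into per-key option calls, and why a dict value triggers the Py4J error in 1.5:

```python
class FakeWriter:
    """Hypothetical stand-in for DataFrameWriter, for illustration only."""

    def __init__(self):
        self._opts = {}

    def option(self, key, value):
        # Spark 1.5 forwards each pair to a Java option(String, String);
        # Py4J maps a Python dict to java.util.HashMap, and no overload
        # with that signature exists -- hence the Py4JError.
        if not isinstance(value, str):
            raise TypeError(
                "Method option([class java.lang.String, class "
                "java.util.HashMap]) does not exist")
        self._opts[key] = value
        return self

    def options(self, **kwargs):
        # options() is just a loop over option() -- one call per key.
        for key, value in kwargs.items():
            self.option(key, value)
        return self


writer = FakeWriter()
writer.options(dbtable="dbo.shaping")          # str value: fine
try:
    writer.options(properties={"user": "u"})   # dict value: fails
except TypeError as exc:
    print(exc)
```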

Since the properties are specific to the JDBC driver anyway, you should use the jdbc method instead:

properties = {
    "user": "<username>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

df.write.jdbc(
    url='jdbc:sqlserver://<IP>:1433;databaseName=<dbname>',
    table='dbo.shaping',
    properties=properties)
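Note that for the Microsoft driver the url must be a complete JDBC URL of the form jdbc:sqlserver://<host>:<port>;databaseName=<db>, not a bare host and port. A small helper (mssql_jdbc_url is a hypothetical convenience function, not part of any library) shows the shape:

```python
def mssql_jdbc_url(host, port=1433, database=None):
    """Build a SQL Server JDBC URL:
    jdbc:sqlserver://<host>:<port>;databaseName=<db>"""
    url = "jdbc:sqlserver://{}:{}".format(host, port)
    if database is not None:
        url += ";databaseName={}".format(database)
    return url


print(mssql_jdbc_url("<IP>", database="<dbname>"))
# -> jdbc:sqlserver://<IP>:1433;databaseName=<dbname>
```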

though unpacking the properties should work as well:

.options(
    url='jdbc:sqlserver://<IP>:1433;databaseName=<dbname>',
    dbtable='dbo.shaping',
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    **properties)

In general, when you see:

py4j.Py4JException: Method ... does not exist

it usually signals a mismatch between the local Python types and the types expected by the JVM method being called.

See also: How to use JDBC source to write and read data in (Py)Spark?

answered Oct 12 '22 by zero323