I've been searching for a while to find out whether there is any way to use a Scala class in Pyspark, and I haven't found any documentation or guide on this subject.
Let's say I create a simple class in Scala that uses some apache-spark libraries, something like:
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

class SimpleClass(sqlContext: SQLContext, df: DataFrame, column: String) {
  def exe(): DataFrame = {
    import sqlContext.implicits._

    df.select(col(column))
  }
}
Is there any possible way to use this class in Pyspark? Do I have to create a .py file? By the way, I also looked at the spark code and I felt a bit lost, and I was incapable of replicating its functionality for my own purpose.
Yes, it is possible, although it can be far from trivial. Typically you want a Java-friendly wrapper so you don't have to deal with Scala features that cannot be easily expressed in plain Java and, as a result, don't play well with the Py4J gateway.
Assuming your class is in the package com.example and you have a Python DataFrame called df:
df = ... # Python DataFrame
you'll have to:
Build a jar using your favorite build tool.
Include it in the driver classpath, for example using the --driver-class-path argument for the PySpark shell / spark-submit. Depending on the exact code, you may have to pass it using --jars as well.
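For instance, if the class were packaged into a jar named simple-class.jar (the jar name here is purely illustrative), the shell could be started roughly like this:

pyspark --driver-class-path simple-class.jar --jars simple-class.jar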
Extract the JVM instance from a Python SparkContext instance:
jvm = sc._jvm
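At this point jvm is a Py4J view onto the driver JVM, so any class on the driver classpath can be reached through plain attribute access. A quick, purely illustrative sanity check that the gateway is reachable:

jvm.java.lang.System.getProperty("java.version")  # returns the driver JVM's version string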
Extract the Scala SQLContext from the Python SQLContext instance:
ssqlContext = sqlContext._ssql_ctx
Extract the Java DataFrame from df:
jdf = df._jdf
Create a new instance of SimpleClass:
simpleObject = jvm.com.example.SimpleClass(ssqlContext, jdf, "v")
Call the exe method and wrap the result in a Python DataFrame:
from pyspark.sql import DataFrame

DataFrame(simpleObject.exe(), ssqlContext)
The result should be a valid PySpark DataFrame. You can of course combine all of these steps into a single call.
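For example, here is a minimal sketch of such a combined helper (the function name simple_select is just illustrative; it assumes the jar containing com.example.SimpleClass is already on the driver classpath, and it passes the Python-side sqlContext to the DataFrame constructor, which is how PySpark wraps Java DataFrames internally):

from pyspark.sql import DataFrame

def simple_select(sql_context, python_df, column):
    # Py4J view of the driver JVM, taken from the SparkContext behind the SQLContext
    jvm = sql_context._sc._jvm
    # Underlying JVM objects for the SQLContext and the DataFrame
    jsql_ctx = sql_context._ssql_ctx
    jdf = python_df._jdf
    # Instantiate the Scala class through the Py4J gateway and call its method
    simple_object = jvm.com.example.SimpleClass(jsql_ctx, jdf, column)
    # Wrap the returned Java DataFrame back into a PySpark DataFrame
    return DataFrame(simple_object.exe(), sql_context)

result = simple_select(sqlContext, df, "v")
result.show()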
Important: This approach is possible only if the Python code is executed solely on the driver. It cannot be used inside a Python action or transformation. See How to use Java/Scala function from an action or a transformation? for details.