I've been searching for a while to find out whether there is any way to use a Scala class from PySpark, and I haven't found any documentation or guide on this subject.
Let's say I create a simple class in Scala that uses some libraries of apache-spark, something like:
    class SimpleClass(sqlContext: SQLContext, df: DataFrame, column: String) {
      def exe(): DataFrame = {
        import sqlContext.implicits._

        df.select(col(column))
      }
    }

Is there any possible way to use this class in PySpark, e.g. from a .py file? By the way, I also looked at the Spark code and I felt a bit lost; I was incapable of replicating its functionality for my own purpose.
Yes, it is possible, although it can be far from trivial. Typically you want a Java-friendly wrapper so you don't have to deal with Scala features which cannot be easily expressed using plain Java and, as a result, don't play well with the Py4J gateway.
Assuming your class is in the package com.example and you have a Python DataFrame called df:

    df = ...  # Python DataFrame

you'll have to:
Build a jar using your favorite build tool.
Include it in the driver classpath, for example using the --driver-class-path argument for the PySpark shell / spark-submit. Depending on the exact code, you may have to pass it using --jars as well.
Extract the JVM instance from a Python SparkContext instance:

    jvm = sc._jvm

Extract the Scala SQLContext from a SQLContext instance:

    ssqlContext = sqlContext._ssql_ctx

Extract the Java DataFrame from the df:

    jdf = df._jdf

Create a new instance of SimpleClass:

    simpleObject = jvm.com.example.SimpleClass(ssqlContext, jdf, "v")

Call the exe method and wrap the result in a Python DataFrame:

    from pyspark.sql import DataFrame

    DataFrame(simpleObject.exe(), ssqlContext)

The result should be a valid PySpark DataFrame. You can, of course, combine all the steps into a single call.
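If you do want everything in one place, a small helper on the Python side could look roughly like this (a minimal sketch, assuming the jar is already on the driver classpath and that SimpleClass and its exe method are exactly as defined above; the helper name scala_select is made up for illustration):

    from pyspark.sql import DataFrame

    def scala_select(sqlContext, df, column):
        # Grab the underlying SparkContext and the Scala SQLContext,
        # instantiate the Scala class through the Py4J gateway, call exe,
        # and wrap the returned Java DataFrame back into a PySpark DataFrame.
        sc = sqlContext._sc
        ssqlContext = sqlContext._ssql_ctx
        simple_object = sc._jvm.com.example.SimpleClass(ssqlContext, df._jdf, column)
        return DataFrame(simple_object.exe(), ssqlContext)

    result = scala_select(sqlContext, df, "v")

Note that the helper still relies on private attributes (_sc, _ssql_ctx, _jdf, _jvm), so it is tied to the internals of the PySpark version you run it against.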
Important: This approach is possible only if the Python code is executed solely on the driver. It cannot be used inside a Python action or transformation. See How to use Java/Scala function from an action or a transformation? for details.
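To make that limitation concrete, here is a minimal sketch reusing the names from the steps above (the failing call is shown commented out): simpleObject is a Py4J proxy to an object living in the driver JVM, so it cannot be shipped to executors.

    # Fine: runs in the driver process, where the Py4J gateway lives.
    result = DataFrame(simpleObject.exe(), ssqlContext)

    # Not fine: the closure below would have to be pickled and sent to the
    # executors, but Py4J proxies cannot be serialized and there is no
    # gateway on the workers, so this typically fails with a serialization error.
    # df.rdd.map(lambda row: simpleObject.exe())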