 

How to use a Scala class inside Pyspark


I've been searching for a while for a way to use a Scala class in PySpark, and I haven't found any documentation or guide on this subject.

Let's say I create a simple class in Scala that uses some Apache Spark libraries, something like:

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.functions.col

    class SimpleClass(sqlContext: SQLContext, df: DataFrame, column: String) {
      def exe(): DataFrame = {
        import sqlContext.implicits._

        df.select(col(column))
      }
    }

  • Is there any possible way to use this class in Pyspark?
  • Is it too tough?
  • Do I have to create a .py file?
  • Is there any guide that shows how to do that?

By the way, I also looked at the Spark source code and felt a bit lost; I was unable to replicate its functionality for my own purposes.

asked Mar 15 '16 by Alberto Bonsanto



1 Answer

Yes, it is possible, although it can be far from trivial. Typically you want a Java-friendly wrapper so you don't have to deal with Scala features that cannot be easily expressed using plain Java and, as a result, don't play well with the Py4J gateway.

Assuming your class is in the package com.example and you have a Python DataFrame called df:

df = ... # Python DataFrame 
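For a quick experiment, df can be any DataFrame that has the column you later pass to the class. Here is a minimal sketch, assuming a Spark 1.x-style SQLContext as in the rest of this answer and a script run via spark-submit (in the PySpark shell, sc and sqlContext already exist; the sample data and the column name "v" are placeholders):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="scala-from-pyspark")
    sqlContext = SQLContext(sc)

    # Toy DataFrame with a column named "v", matching the name used in step 6 below
    df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["v", "label"])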

To use the Scala class from PySpark you'll then have to:

  1. Build a jar using your favorite build tool.

  2. Include it in the driver classpath, for example using the --driver-class-path argument for the PySpark shell / spark-submit. Depending on the exact code, you may have to pass it using --jars as well.

  3. Extract the JVM instance from the Python SparkContext instance:

    jvm = sc._jvm 
  4. Extract the Scala SQLContext from the Python SQLContext instance:

    ssqlContext = sqlContext._ssql_ctx 
  5. Extract the Java DataFrame from df:

    jdf = df._jdf 
  6. Create a new instance of SimpleClass:

    simpleObject = jvm.com.example.SimpleClass(ssqlContext, jdf, "v") 
  7. Call the exe method and wrap the result in a Python DataFrame:

    from pyspark.sql import DataFrame

    DataFrame(simpleObject.exe(), ssqlContext)

The result should be a valid PySpark DataFrame. You can of course combine all the steps into a single call.
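A minimal sketch of such a combined call, reusing the names from the steps above (the helper name simple_class_select is just an example, and sc, sqlContext, and the column name "v" come from this answer, not from any fixed API):

    from pyspark.sql import DataFrame

    def simple_class_select(df, column):
        """Call the Scala SimpleClass.exe through the Py4J gateway."""
        ssqlContext = sqlContext._ssql_ctx   # Scala SQLContext behind the Python wrapper
        jdf = df._jdf                        # Java DataFrame behind the Python wrapper
        simpleObject = sc._jvm.com.example.SimpleClass(ssqlContext, jdf, column)
        return DataFrame(simpleObject.exe(), ssqlContext)

    result = simple_class_select(df, "v")
    result.show()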

Important: This approach is possible only if the Python code is executed solely on the driver. It cannot be used inside a Python action or transformation. See How to use Java/Scala function from an action or a transformation? for details.
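To make the restriction concrete: the Py4J gateway (and therefore sc._jvm) exists only in the driver process, so something along these lines would fail on the executors:

    # Fine: runs on the driver
    jdf = sc._jvm.com.example.SimpleClass(sqlContext._ssql_ctx, df._jdf, "v").exe()

    # Not fine: the lambda is shipped to the executors, where sc._jvm is unavailable
    # df.rdd.map(lambda row: sc._jvm.com.example.SimpleClass(...))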

answered Sep 18 '22 by zero323