 

How to use a Scala class inside Pyspark


I've been searching for a while for a way to use a Scala class in PySpark, and I haven't found any documentation or guide on this subject.

Let's say I create a simple class in Scala that uses some Apache Spark libraries, something like:

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.functions.col

    class SimpleClass(sqlContext: SQLContext, df: DataFrame, column: String) {
      def exe(): DataFrame = {
        import sqlContext.implicits._

        df.select(col(column))
      }
    }

  • Is there any possible way to use this class in Pyspark?
  • Is it too tough?
  • Do I have to create a .py file?
  • Is there any guide that shows how to do that?

By the way, I also looked at the Spark source code and felt a bit lost; I was unable to replicate its functionality for my own purposes.

asked Mar 15 '16 by Alberto Bonsanto



1 Answer

Yes, it is possible, although it can be far from trivial. Typically you want a Java-friendly wrapper so you don't have to deal with Scala features that cannot be easily expressed using plain Java and, as a result, don't play well with the Py4J gateway.

Assuming your class is in the package com.example and you have a Python DataFrame called df:

df = ... # Python DataFrame 
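For a quick experiment, df can be any DataFrame that has the column you later pass to the class. Here is a minimal sketch, assuming a Spark 1.x-style SQLContext as in the rest of this answer and a script run via spark-submit (in the PySpark shell, sc and sqlContext already exist; the sample data and the column name "v" are placeholders):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="scala-from-pyspark")
    sqlContext = SQLContext(sc)

    # Toy DataFrame with a column named "v", matching the name used in step 6 below
    df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["v", "label"])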

To use the Scala class from PySpark you'll then have to:

  1. Build a jar using your favorite build tool.

  2. Include it in the driver classpath, for example using the --driver-class-path argument for the PySpark shell / spark-submit. Depending on the exact code, you may have to pass it using --jars as well.

  3. Extract the JVM instance from the Python SparkContext instance:

    jvm = sc._jvm 
  4. Extract the Scala SQLContext from the Python SQLContext instance:

    ssqlContext = sqlContext._ssql_ctx 
  5. Extract the Java DataFrame from df:

    jdf = df._jdf 
  6. Create a new instance of SimpleClass:

    simpleObject = jvm.com.example.SimpleClass(ssqlContext, jdf, "v") 
  7. Call the exe method and wrap the result in a Python DataFrame:

    from pyspark.sql import DataFrame

    DataFrame(simpleObject.exe(), ssqlContext)

The result should be a valid PySpark DataFrame. You can of course combine all the steps into a single call.
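A minimal sketch of such a combined call, reusing the names from the steps above (the helper name simple_class_select is just an example, and sc, sqlContext, and the column name "v" come from this answer, not from any fixed API):

    from pyspark.sql import DataFrame

    def simple_class_select(df, column):
        """Call the Scala SimpleClass.exe through the Py4J gateway."""
        ssqlContext = sqlContext._ssql_ctx   # Scala SQLContext behind the Python wrapper
        jdf = df._jdf                        # Java DataFrame behind the Python wrapper
        simpleObject = sc._jvm.com.example.SimpleClass(ssqlContext, jdf, column)
        return DataFrame(simpleObject.exe(), ssqlContext)

    result = simple_class_select(df, "v")
    result.show()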

Important: This approach is possible only if the Python code is executed solely on the driver. It cannot be used inside a Python action or transformation. See How to use Java/Scala function from an action or a transformation? for details.
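To make the restriction concrete: the Py4J gateway (and therefore sc._jvm) exists only in the driver process, so something along these lines would fail on the executors:

    # Fine: runs on the driver
    jdf = sc._jvm.com.example.SimpleClass(sqlContext._ssql_ctx, df._jdf, "v").exe()

    # Not fine: the lambda is shipped to the executors, where sc._jvm is unavailable
    # df.rdd.map(lambda row: sc._jvm.com.example.SimpleClass(...))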

answered Sep 18 '22 by zero323