I want to be able to use a Scala function as a UDF in PySpark
package com.test

import org.apache.spark.sql.functions.udf

object ScalaPySparkUDFs extends Serializable {
    def testFunction1(x: Int): Int = { x * 2 }
    def testUDFFunction1 = udf { x: Int => testFunction1(x) }
}
I can access testFunction1 in PySpark and have it return values:
functions = sc._jvm.com.test.ScalaPySparkUDFs
functions.testFunction1(10)
What I want to be able to do is use this function as a UDF, ideally in a withColumn call:
row = Row("Value")
numbers = sc.parallelize([1,2,3,4]).map(row).toDF()
numbers.withColumn("Result", testUDFFunction1(numbers['Value']))
I think a promising approach is as found here: Spark: How to map Python with Scala or Java User Defined Functions?
However, when making the changes to the code found there to use testUDFFunction1 instead:
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column, _to_seq

def udf_test(col):
    sc = SparkContext._active_spark_context
    _f = sc._jvm.com.test.ScalaPySparkUDFs.testUDFFunction1.apply
    return Column(_f(_to_seq(sc, [col], _to_java_column)))
I get:
AttributeError: 'JavaMember' object has no attribute 'apply'
I don't understand this because I believe testUDFFunction1 does have an apply method?
I do not want to use expressions of the type found here: Register UDF to SqlContext from Scala to use in PySpark
Any suggestions as to how to make this work would be appreciated!
A PySpark UDF is a User Defined Function used to create reusable logic in Spark. Once created, a UDF can be reused across multiple DataFrames and in SQL (after registering it). The default return type of udf() is StringType, and you need to handle nulls explicitly or you will see side effects.
In Spark, you create a UDF by writing a function in the language you prefer to use with Spark. For example, if you are using Spark with Scala, you write the function in Scala and either wrap it with the udf() function (to use it on a DataFrame) or register it as a UDF (to use it in SQL).
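For reference, the same two patterns (wrapping with udf() for DataFrame use, and registering for SQL) look roughly like this in plain PySpark. This is only a sketch; the double_udf and double_it names are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Wrap a plain Python function with udf() for use in DataFrame expressions,
# handling null explicitly.
double_udf = udf(lambda x: x * 2 if x is not None else None, IntegerType())

df = spark.createDataFrame([(1,), (2,), (None,)], ["Value"])
df.withColumn("Result", double_udf(df["Value"])).show()

# Or register it so it can also be called from SQL.
spark.udf.register("double_it", lambda x: x * 2 if x is not None else None, IntegerType())
df.createOrReplaceTempView("numbers")
spark.sql("SELECT Value, double_it(Value) AS Result FROM numbers").show()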
I agree with @user6910411: you have to call the apply method directly on the function. So your code would be:
UDF in Scala:
package com.test

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions._

object ScalaPySparkUDFs {
    def testFunction1(x: Int): Int = { x * 2 }
    def getFun(): UserDefinedFunction = udf(testFunction1 _)
}
PySpark code:
from pyspark.sql import Row
from pyspark.sql.column import Column, _to_java_column, _to_seq

def test_udf(col):
    sc = spark.sparkContext
    _test_udf = sc._jvm.com.test.ScalaPySparkUDFs.getFun()
    return Column(_test_udf.apply(_to_seq(sc, [col], _to_java_column)))
row = Row("Value")
numbers = sc.parallelize([1,2,3,4]).map(row).toDF()
numbers.withColumn("Result", test_udf(numbers['Value']))
The question you've linked is using a Scala object. A Scala object is a singleton, so you can use its apply method directly. Here you use a nullary function which returns an object of the UserDefinedFunction class, so you have to call the function first:
_f = sc._jvm.com.test.ScalaPySparkUDFs.testUDFFunction1() # Note () at the end
Column(_f.apply(_to_seq(sc, [col], _to_java_column)))
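Put together as a reusable helper, that looks roughly like the sketch below. It assumes the com.test.ScalaPySparkUDFs object from the question is on the driver classpath; scala_test_udf is just an illustrative name:

from pyspark import SparkContext
from pyspark.sql import Row
from pyspark.sql.column import Column, _to_java_column, _to_seq

def scala_test_udf(col):
    # Call the nullary Scala function first to obtain the UserDefinedFunction instance,
    # then invoke its apply method on the column wrapped as a Java column sequence.
    sc = SparkContext._active_spark_context
    _f = sc._jvm.com.test.ScalaPySparkUDFs.testUDFFunction1()
    return Column(_f.apply(_to_seq(sc, [col], _to_java_column)))

# Usage in a PySpark shell where sc is already defined:
row = Row("Value")
numbers = sc.parallelize([1, 2, 3, 4]).map(row).toDF()
numbers.withColumn("Result", scala_test_udf(numbers['Value'])).show()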