How do you reference a PySpark DataFrame during the execution of a UDF on another DataFrame?
Here's a dummy example. I am creating two dataframes, scores and lastnames, and each contains a column that is the same across the two dataframes. In the UDF applied to scores, I want to filter on lastnames and return the string found in last_name.
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sc = SparkContext("local")
sqlCtx = SQLContext(sc)
# Generate Random Data
import itertools
import random
student_ids = ['student1', 'student2', 'student3']
subjects = ['Math', 'Biology', 'Chemistry', 'Physics']
random.seed(1)
data = []
for (student_id, subject) in itertools.product(student_ids, subjects):
    data.append((student_id, subject, random.randint(0, 100)))
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("student_id", StringType(), nullable=False),
    StructField("subject", StringType(), nullable=False),
    StructField("score", IntegerType(), nullable=False)
])
# Create DataFrame
rdd = sc.parallelize(data)
scores = sqlCtx.createDataFrame(rdd, schema)
# create another dataframe
last_name = ["Granger", "Weasley", "Potter"]
data2 = []
for i in range(len(student_ids)):
    data2.append((student_ids[i], last_name[i]))
schema = StructType([
    StructField("student_id", StringType(), nullable=False),
    StructField("last_name", StringType(), nullable=False)
])
rdd = sc.parallelize(data2)
lastnames = sqlCtx.createDataFrame(rdd, schema)
scores.show()
lastnames.show()
from pyspark.sql.functions import udf
def getLastName(sid):
    tmp_df = lastnames.filter(lastnames.student_id == sid)
    return tmp_df.last_name
getLastName_udf = udf(getLastName, StringType())
scores.withColumn("last_name", getLastName_udf("student_id")).show(10)
And the following is the last part of the trace:
Py4JError: An error occurred while calling o114.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
A UDF can only work on records, which in the broadest case could be an entire DataFrame if the UDF is a user-defined aggregate function (UDAF). If you want to work on more than one DataFrame in a UDF, you have to join the DataFrames first so that the columns you want to use are available to the UDF.
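For the UDAF case, a grouped-aggregate pandas UDF can play that role in PySpark; a minimal sketch, assuming Spark 3.0+ with pandas and PyArrow installed:
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Grouped-aggregate pandas UDF: receives all scores of one group as a
# pandas Series and returns a single scalar per group.
@pandas_udf("double")
def mean_score(v: pd.Series) -> float:
    return float(v.mean())

scores.groupBy("student_id").agg(mean_score("score")).show()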
It is well known that the use of UDFs (User Defined Functions) in Apache Spark, and especially in the Python API, can compromise application performance. For this reason, at Damavis we try to avoid their use as much as possible in favour of native functions or SQL.
In these circumstances, a PySpark UDF is around 10 times more performant than a PySpark pandas UDF. We have also found that creating a Python wrapper to call a Scala UDF from PySpark code is around 15 times more performant than either type of PySpark UDF.
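A rough sketch of that wrapper pattern, assuming Spark 2.3+ with a SparkSession named spark, and a hypothetical Scala class com.example.udfs.GetLastName (implementing Spark's UDF1 interface) shipped in a JAR on the classpath:
from pyspark.sql.types import StringType

# Register the JVM-side UDF under a SQL-callable name; rows never
# round-trip through Python, which is where the speed-up comes from.
spark.udf.registerJavaFunction("get_last_name_scala",
                               "com.example.udfs.GetLastName",
                               StringType())

scores.createOrReplaceTempView("scores")
spark.sql("SELECT student_id, get_last_name_scala(student_id) AS last_name FROM scores").show()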
In Spark, you create a UDF by writing a function in whichever language you prefer to use with Spark. For example, if you are using Spark with Scala, you write the function in Scala and either wrap it with the udf() function to use it on DataFrames or register it as a UDF to use it in SQL.
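In PySpark the same pattern looks like this; a minimal sketch reusing the sqlCtx and scores objects from the question (the function and registration names are illustrative):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Wrap a plain Python function for use on DataFrames ...
upper_subject_udf = udf(lambda s: s.upper(), StringType())
scores.withColumn("subject_upper", upper_subject_udf("subject")).show()

# ... or register it to call from SQL.
sqlCtx.registerFunction("upper_subject", lambda s: s.upper(), StringType())
scores.registerTempTable("scores")
sqlCtx.sql("SELECT subject, upper_subject(subject) AS subject_upper FROM scores").show()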
You can't directly reference a dataframe (or an RDD) from inside a UDF. The DataFrame object is a handle on your driver that Spark uses to represent the data and actions that will happen out on the cluster. The code inside your UDFs will run out on the cluster at a time of Spark's choosing. Spark does this by serializing that code, making copies of any variables included in the closure, and sending them out to each worker.
What you want to do instead is use the constructs Spark provides in its API to join/combine the two DataFrames. If one of the data sets is small, you can manually send out the data in a broadcast variable and then access it from your UDF. Otherwise, you can just create the two dataframes as you did, then use the join operation to combine them. Something like this should work:
joined = scores.withColumnRenamed("student_id", "join_id")
joined = joined.join(lastnames, joined.join_id == lastnames.student_id)\
    .drop("join_id")
joined.show()
+---------+-----+----------+---------+
| subject|score|student_id|last_name|
+---------+-----+----------+---------+
| Math| 13| student1| Granger|
| Biology| 85| student1| Granger|
|Chemistry| 77| student1| Granger|
| Physics| 25| student1| Granger|
| Math| 50| student2| Weasley|
| Biology| 45| student2| Weasley|
|Chemistry| 65| student2| Weasley|
| Physics| 79| student2| Weasley|
| Math| 9| student3| Potter|
| Biology| 2| student3| Potter|
|Chemistry| 84| student3| Potter|
| Physics| 43| student3| Potter|
+---------+-----+----------+---------+
It's also worth noting that, under the hood, Spark DataFrames have an optimization where a DataFrame that is part of a join can be converted to a broadcast variable to avoid a shuffle if it is small enough. So if you use the join method listed above, you should get the best possible performance without sacrificing the ability to handle larger data sets.
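You can also request that behaviour explicitly with the broadcast hint; a minimal sketch using the DataFrames from the question:
from pyspark.sql.functions import broadcast

# Hint that lastnames is small enough to ship to every executor,
# turning the join into a broadcast join and avoiding a shuffle.
scores.join(broadcast(lastnames), "student_id").show()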
Change the list of pairs to a dictionary for easy lookup of names:
data2 = {}
for i in range(len(student_ids)):
    data2[student_ids[i]] = last_name[i]
Instead of creating an rdd and turning it into a df, create a broadcast variable:
# rdd = sc.parallelize(data2)
# lastnames = sqlCtx.createDataFrame(rdd, schema)
lastnames = sc.broadcast(data2)
Now access this in the UDF through the value attribute of the broadcast variable (lastnames).
from pyspark.sql.functions import udf
def getLastName(sid):
    return lastnames.value[sid]
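The UDF is then wrapped and applied exactly as before:
getLastName_udf = udf(getLastName, StringType())
scores.withColumn("last_name", getLastName_udf("student_id")).show(10)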