Question
I want to add the return values of a UDF to an existing dataframe in separate columns. How do I achieve this in an efficient way?
Here's an example of what I have so far.
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType
df = spark.createDataFrame([("Alive", 4)], ["Name", "Number"])
df.show(1)
+-----+------+
| Name|Number|
+-----+------+
|Alive|     4|
+-----+------+
def example(n):
    return [[n + 2], [n - 2]]
# schema = StructType([
#     StructField("Out1", ArrayType(IntegerType()), False),
#     StructField("Out2", ArrayType(IntegerType()), False)])
example_udf = udf(example)
Now I can add a column to the dataframe as follows:
newDF = df.withColumn("Output", example_udf(df["Number"]))
newDF.show(1)
+-----+------+----------+
| Name|Number|    Output|
+-----+------+----------+
|Alive|     4|[[6], [2]]|
+-----+------+----------+
However I don't want the two values to be in the same column but rather in separate ones.
Ideally I'd like to split the output column now, to avoid calling the example function twice (once for each return value) as explained here and here. However, in my situation I'm getting an array of arrays, and I can't see how a split would work there (note that each array will contain multiple values, separated with a ",").
How the result should look
What I ultimately want is this:
+-----+------+----+----+
| Name|Number|Out1|Out2|
+-----+------+----+----+
|Alive|     4|   6|   2|
+-----+------+----+----+
Note that the use of the StructType return type is optional and doesn't necessarily have to be part of the solution.
EDIT: I commented out the use of StructType (and edited the udf assignment) since it's not necessary for the return type of the example function. However, it would have to be used if the return value were something like
return [6,3,2],[4,3,1]
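For that case, a minimal sketch (reusing the Out1/Out2 field names from the commented-out schema above; the function body is illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType

def example_lists(n):
    # Each output is now a list, so the schema needs ArrayType fields
    return [n + 2, n, n - 2], [n + 1, n, n - 1]

list_schema = StructType([
    StructField("Out1", ArrayType(IntegerType()), False),
    StructField("Out2", ArrayType(IntegerType()), False)])

example_lists_udf = udf(example_lists, list_schema)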
Answer
A UDF can return only a single column at a time.
There is no withColumns method, so most PySpark users call withColumn multiple times when they need to add multiple columns to a DataFrame. An alternative is a single select("*", ...), where the * selects all of the existing DataFrame columns and the other columns are appended.
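As a minimal sketch of both patterns against the df above (the plus_two/minus_two column names are illustrative):

from pyspark.sql import functions as F

# Chained withColumn calls
df2 = df.withColumn("plus_two", F.col("Number") + 2) \
        .withColumn("minus_two", F.col("Number") - 2)

# Equivalent single select: "*" keeps the existing columns
df2 = df.select("*",
                (F.col("Number") + 2).alias("plus_two"),
                (F.col("Number") - 2).alias("minus_two"))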
You can use reduce, for loops, or list comprehensions to apply PySpark functions to multiple columns in a DataFrame. Using iterators to apply the same operation on multiple columns is vital for maintaining a DRY codebase.
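For example, a sketch using functools.reduce to apply the same cast to a list of columns (the numeric_cols list here is illustrative):

from functools import reduce
from pyspark.sql import functions as F

numeric_cols = ["Number"]  # illustrative list of column names

# Fold the column list into a chain of withColumn calls
df3 = reduce(lambda acc, c: acc.withColumn(c, F.col(c).cast("double")),
             numeric_cols,
             df)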
To add a column with a constant value, use the lit() function (from pyspark.sql.functions import lit). lit() takes the constant value you want to add and returns a Column type; to add a NULL / None value, use lit(None).
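A short sketch (the column names are illustrative):

from pyspark.sql.functions import lit

df4 = df.withColumn("Constant", lit(10))  # constant integer column
df4 = df4.withColumn("Empty", lit(None))  # NULL column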
To return a StructType, just use Row:
from pyspark.sql.types import StructType, StructField, IntegerType, Row
from pyspark.sql import functions as F
df = spark.createDataFrame([("Alive", 4)], ["Name", "Number"])
def example(n):
    return Row('Out1', 'Out2')(n + 2, n - 2)

schema = StructType([
    StructField("Out1", IntegerType(), False),
    StructField("Out2", IntegerType(), False)])
example_udf = F.udf(example, schema)
newDF = df.withColumn("Output", example_udf(df["Number"]))
newDF = newDF.select("Name", "Number", "Output.*")
newDF.show(truncate=False)
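This should produce exactly the desired result (show(truncate=False) left-aligns the cells):

+-----+------+----+----+
|Name |Number|Out1|Out2|
+-----+------+----+----+
|Alive|4     |6   |2   |
+-----+------+----+----+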