I want to apply splitUtlisation on each row of utilisationDataFarme and pass startTime and endTime as parameters. Since splitUtlisation will return multiple rows of data, I want to create a new DataFrame with (Id, Day, Hour, Minute).
from dateutil import rrule

def splitUtlisation(onDateTime, offDateTime):
    # yield the start, every whole-hour boundary in between, then the end
    yield onDateTime
    rule = rrule.rrule(rrule.HOURLY, byminute=0, bysecond=0, dtstart=onDateTime)
    for result in rule.between(onDateTime, offDateTime):
        yield result
    yield offDateTime
from pyspark.sql.functions import col

utilisationDataFarme = (
    sc.parallelize([
        (10001, "2017-02-12 12:01:40", "2017-02-12 12:56:32"),
        (10001, "2017-02-13 12:06:32", "2017-02-15 16:06:32"),
        (10001, "2017-02-16 21:45:56", "2017-02-21 21:45:56"),
        (10001, "2017-02-21 22:32:41", "2017-02-25 00:52:50"),
    ]).toDF(["id", "startTime", "endTime"])
    .withColumn("startTime", col("startTime").cast("timestamp"))
    .withColumn("endTime", col("endTime").cast("timestamp"))
)
In core Python I did it like this:
import datetime
from datetime import timedelta

dayList = ['SUN', 'MON', 'TUE', 'WED', 'THR', 'FRI', 'SAT']
for result in splitUtlisation(datetime.datetime.now(), datetime.datetime.now() + timedelta(hours=68)):
    print(dayList[result.weekday()], result.hour, 60 if result.minute == 0 else result.minute)
Result
THR 21 60
THR 22 60
THR 23 60
FRI 0 60
FRI 1 60
FRI 2 60
FRI 3 60
How can I do this in PySpark?
I tried to create a new schema and apply a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("Id", StringType(), False),
                     StructField("Day", StringType(), False),
                     StructField("Hour", StringType(), False),
                     StructField("Minute", StringType(), False)])

udf_splitUtlisation = udf(splitUtlisation, schema)
df = sqlContext.createDataFrame([], schema)
Still, I could not handle multiple rows in the response.
You can use PySpark's explode to unpack a single row containing multiple values into multiple rows, once you have your UDF defined correctly.
As far as I know you won't be able to use a generator with yield as a UDF. Instead, you need to return all values at once as an array (see return_type), which can then be exploded and expanded:
from pyspark.sql.functions import col, udf, explode
from pyspark.sql.types import ArrayType, StringType, MapType
import pandas as pd
# input data as given by OP
df = sc.parallelize([
    (10001, "2017-02-12 12:01:40", "2017-02-12 12:56:32"),
    (10001, "2017-02-13 12:06:32", "2017-02-15 16:06:32"),
    (10001, "2017-02-16 21:45:56", "2017-02-21 21:45:56"),
    (10001, "2017-02-21 22:32:41", "2017-02-25 00:52:50")])\
    .toDF(["id", "startTime", "endTime"])\
    .withColumn("startTime", col("startTime").cast("timestamp"))\
    .withColumn("endTime", col("endTime").cast("timestamp"))

return_type = ArrayType(MapType(StringType(), StringType()))

@udf(returnType=return_type)
def your_udf_func(start, end):
    """Insert your function to return whatever you like
    as a list of dictionaries.

    For example, I chose to return hourly values for
    day, hour and minute.
    """
    date_range = pd.date_range(start, end, freq="h")
    df = pd.DataFrame({"day": date_range.strftime("%a"),
                       "hour": date_range.hour,
                       "minute": date_range.minute})
    values = df.to_dict("index").values()
    return list(values)
extracted = your_udf_func("startTime", "endTime")
exploded = explode(extracted).alias("exploded")
expanded = [col("exploded").getItem(k).alias(k) for k in ["hour", "day", "minute"]]
result = df.select("id", exploded).select("id", *expanded)
And the result is:
result.show(5)
+-----+----+---+------+
| id|hour|day|minute|
+-----+----+---+------+
|10001| 12|Sun| 1|
|10001| 12|Mon| 6|
|10001| 13|Mon| 6|
|10001| 14|Mon| 6|
|10001| 15|Mon| 6|
+-----+----+---+------+
only showing top 5 rows
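If you want the exploded rows to keep the conventions from the core-Python loop in the question (a three-letter day label and 60 when the timestamp falls on a whole hour), and to keep Hour and Minute as integers instead of map values cast to strings, the UDF can return an array of structs rather than an array of maps. This is a minimal sketch under those assumptions; split_to_rows, row_type and result2 are illustrative names, not part of the original answer:
from pyspark.sql.functions import col, explode, udf
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType
import pandas as pd

# One struct per generated hour, with typed fields.
row_type = ArrayType(StructType([
    StructField("Day", StringType()),
    StructField("Hour", IntegerType()),
    StructField("Minute", IntegerType()),
]))

@udf(returnType=row_type)
def split_to_rows(start, end):
    # Monday-first labels to match Python's weekday() indexing;
    # adjust the list if you prefer the question's SUN-first ordering.
    day_list = ['MON', 'TUE', 'WED', 'THR', 'FRI', 'SAT', 'SUN']
    rng = pd.date_range(start, end, freq="h")
    return [(day_list[ts.weekday()], int(ts.hour),
             60 if ts.minute == 0 else int(ts.minute)) for ts in rng]

rows = df.select("id", explode(split_to_rows("startTime", "endTime")).alias("r"))
result2 = rows.select("id", col("r.Day"), col("r.Hour"), col("r.Minute"))
Because the struct fields are typed, Day stays a string while Hour and Minute stay integers, and the column order comes from the schema rather than from dictionary keys.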