PySpark: How to Append Dataframes in For Loop

I am performing a rolling median calculation on individual time series DataFrames, and then I want to concatenate/append the results into a single DataFrame.

import numpy as np
from pyspark.sql.functions import col, collect_list, udf
from pyspark.sql.types import FloatType

# UDF for rolling median
median_udf = udf(lambda x: float(np.median(x)), FloatType())

series_list = ['0620', '5914']
SeriesAppend = []

for item in series_list:
    # Filter for the selected item
    series = test_df.where(col("ID").isin([item]))
    # Sort the time series
    series_sorted = series.sort(series.ID, series.date).persist()
    # Calculate the rolling median over the window spec w (defined earlier)
    series_sorted = series_sorted.withColumn("list",
            collect_list("metric").over(w)) \
        .withColumn("rolling_median", median_udf("list"))

    SeriesAppend.append(series_sorted)

SeriesAppend

[DataFrame[ntwrk_genre_cd: string, date: date, mkt_cd: string, syscode: string, ntwrk_cd: string, syscode_ntwrk: string, metric: double, list: array, rolling_median: float], DataFrame[ntwrk_genre_cd: string, date: date, mkt_cd: string, syscode: string, ntwrk_cd: string, syscode_ntwrk: string, metric: double, list: array, rolling_median: float]]

When I attempt to .show():

Traceback (most recent call last):
AttributeError: 'list' object has no attribute 'show'

I realize this is saying the object is a list of dataframes. How do I convert to a single dataframe?

I know that the following solution works for an explicit number of dataframes, but I want my for-loop to be agnostic to the number of dataframes:

from functools import reduce
from pyspark.sql import DataFrame

dfs = [df1, df2, df3]
df = reduce(DataFrame.unionAll, dfs)
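
For what it's worth, reduce folds the list pairwise from the left, so with three DataFrames the call above is just shorthand for chaining the unions by hand:

# reduce(DataFrame.unionAll, [df1, df2, df3]) expands to:
df = df1.unionAll(df2).unionAll(df3)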

Is there a way to generalize this to non-explicit dataframe names?

asked May 29 '19 by mwhee



1 Answer

Thanks, everyone! To sum up: the solution uses reduce and unionAll:

from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, collect_list

# median_udf, w, test_df, and series_list are defined as in the question
SeriesAppend = []

for item in series_list:
    # Filter for the selected item
    series = test_df.where(col("ID").isin([item]))
    # Sort the time series
    series_sorted = series.sort(series.ID, series.date).persist()
    # Calculate the rolling median over the window spec w
    series_sorted = series_sorted.withColumn("list",
            collect_list("metric").over(w)) \
        .withColumn("rolling_median", median_udf("list"))

    SeriesAppend.append(series_sorted)

# Fold the list of DataFrames into a single DataFrame
df_series = reduce(DataFrame.unionAll, SeriesAppend)
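
One caveat: since Spark 2.0, unionAll is a deprecated alias for union, and both resolve columns by position. If the per-series DataFrames could ever end up with differing column order, a variant using unionByName (available since Spark 2.3) is safer; this is a sketch under that assumption:

# Alternative fold that matches columns by name instead of position
df_series = reduce(DataFrame.unionByName, SeriesAppend)

Either way, df_series is now a single DataFrame, so df_series.show() works as expected.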
answered Nov 15 '22 by mwhee