 

PySpark - how to backfill a DataFrame?

How can you do the equivalent of pandas' df.fillna(method='bfill') with a pyspark.sql.DataFrame?

The PySpark DataFrame has a pyspark.sql.DataFrame.fillna method; however, it does not support a method parameter.
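For contrast, a minimal sketch of what fillna does support in PySpark, assuming a Spark DataFrame named df with a nullable 'data' column: filling nulls with a constant value, optionally limited to specific columns, but nothing like bfill or ffill.

# Supported in PySpark: replace nulls with a constant value,
# optionally only in the listed columns; there is no method='bfill' equivalent.
df.fillna(0.0, subset=['data'])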


In pandas you can use the following to backfill a time series:

Create data

import pandas as pd

index = pd.date_range('2017-01-01', '2017-01-05')
data = [1, 2, 3, None, 5]

df = pd.DataFrame({'data': data}, index=index)

Giving

Out[1]:
            data
2017-01-01  1.0
2017-01-02  2.0
2017-01-03  3.0
2017-01-04  NaN
2017-01-05  5.0

Backfill the dataframe

df = df.fillna(method='bfill')

Produces the backfilled frame

Out[2]:
            data
2017-01-01  1.0
2017-01-02  2.0
2017-01-03  3.0
2017-01-04  5.0
2017-01-05  5.0

How can the same thing be done for a pyspark.sql.DataFrame?
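For reference, a minimal sketch of the equivalent Spark DataFrame, assuming an active SparkSession named spark; the pandas date index becomes an explicit 'date' column, and the answer below operates on a DataFrame like this one:

from datetime import date

sdf = spark.createDataFrame(
    [
        (date(2017, 1, 1), 1.0),
        (date(2017, 1, 2), 2.0),
        (date(2017, 1, 3), 3.0),
        (date(2017, 1, 4), None),
        (date(2017, 1, 5), 5.0),
    ],
    ['date', 'data'],
)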

Asked May 04 '17 03:05 by Adrian Torrie


1 Answer

The first and last functions, with their ignorenulls=True flag, can be combined with a rowsBetween window frame. To fill backwards, select the first non-null value between the current row and the end of the frame; to fill forwards, select the last non-null value between the beginning of the frame and the current row (see the sketch after the snippet below).

from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
import sys

# Backfill: for each row, take the first non-null 'data' value found
# between the current row and the end of the frame, ordered by 'date'.
df.withColumn(
    'data',
    F.first(F.col('data'), ignorenulls=True).over(
        W.orderBy('date').rowsBetween(0, sys.maxsize)
    )
)
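The forward-fill counterpart is a sketch under the same assumptions: swap first for last and reverse the frame so it spans from the start of the window up to the current row (W.unboundedPreceding could stand in for -sys.maxsize).

# Forward fill: for each row, take the last non-null 'data' value found
# between the start of the frame and the current row, ordered by 'date'.
df.withColumn(
    'data',
    F.last(F.col('data'), ignorenulls=True).over(
        W.orderBy('date').rowsBetween(-sys.maxsize, 0)
    )
)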

Source on filling gaps in Spark: https://towardsdatascience.com/end-to-end-time-series-interpolation-in-pyspark-filling-the-gap-5ccefc6b7fc9

Answered Oct 01 '22 18:10 by John Haberstroh