I am attempting to fill in missing values in my Spark dataframe with the previous non-null value (if it exists). I've done this type of thing in Python/Pandas, but my data is too big for Pandas (on a small cluster) and I'm a Spark noob. Is this something Spark can do? Can it do it for multiple columns? If so, how? If not, any suggestions for alternative approaches within the whole Hadoop suite of tools?
Thanks!
I've found a solution that works without additional coding by using a Window here. So Jeff was right, there is a solution. The full code is below; I'll briefly explain what it does, for more details just look at the blog.
from pyspark.sql import Window
from pyspark.sql.functions import last
import sys
# define the window
window = Window.orderBy('time')\
.rowsBetween(-sys.maxsize, 0)
# define the forward-filled column
filled_column_temperature = last(df6['temperature'], ignorenulls=True).over(window)
# do the fill
spark_df_filled = df6.withColumn('temperature_filled', filled_column_temperature)
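To see it end to end, here is a minimal self-contained sketch with a toy DataFrame (the rows and values are made up purely for illustration, and df6 stands in for your real data):

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import last
import sys

spark = SparkSession.builder.getOrCreate()

# toy data with gaps in 'temperature' (hypothetical values)
df6 = spark.createDataFrame(
    [(1, 20.0), (2, None), (3, None), (4, 25.0), (5, None)],
    ['time', 'temperature'])

# window from the first row up to the current row, ordered by time
window = Window.orderBy('time').rowsBetween(-sys.maxsize, 0)

# forward-fill: carry the last non-null temperature forward
spark_df_filled = df6.withColumn(
    'temperature_filled',
    last(df6['temperature'], ignorenulls=True).over(window))

spark_df_filled.show()
# temperature_filled comes out as 20.0, 20.0, 20.0, 25.0, 25.0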
So the idea is to define a sliding Window (more on sliding windows here) through the data which always contains the current row and ALL previous ones:
window = Window.orderBy('time')\
.rowsBetween(-sys.maxsize, 0)
Note that we sort by time, so the data is in the correct order. Also note that using "-sys.maxsize" ensures that the window always includes all previous data and grows continuously as it traverses the data top-down, but there might be more efficient solutions.
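If you are on Spark 2.1 or later, the same window can be written with the built-in boundary markers instead of "-sys.maxsize"; this is a sketch with identical semantics, arguably clearer:

window = Window.orderBy('time')\
               .rowsBetween(Window.unboundedPreceding, Window.currentRow)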
Using the "last" function, we are always addressing the last row in that window. By passing "ignorenulls=True" we define that if the current row is null, then the function will return the most recent (last) non-null value in the window. Otherwise the actual row's value is used.
Done.