 

Fill in null with previously known good value with pyspark

Is there a way to replace null values in a pyspark dataframe with the last valid value? There are additional timestamp and session columns if you think you need them for window partitioning and ordering. More specifically, I'd like to achieve the following conversion:

+---------+-----------+-----------+      +---------+-----------+-----------+
| session | timestamp |         id|      | session | timestamp |         id|
+---------+-----------+-----------+      +---------+-----------+-----------+
|        1|          1|       null|      |        1|          1|       null|
|        1|          2|        109|      |        1|          2|        109|
|        1|          3|       null|      |        1|          3|        109|
|        1|          4|       null|      |        1|          4|        109|
|        1|          5|        109| =>   |        1|          5|        109|
|        1|          6|       null|      |        1|          6|        109|
|        1|          7|        110|      |        1|          7|        110|
|        1|          8|       null|      |        1|          8|        110|
|        1|          9|       null|      |        1|          9|        110|
|        1|         10|       null|      |        1|         10|        110|
+---------+-----------+-----------+      +---------+-----------+-----------+
asked Mar 31 '16 by Oleksiy


People also ask

How do you populate NULL values in PySpark?

In PySpark, DataFrame.fillna() or DataFrameNaFunctions.fill() is used to replace NULL/None values in all or selected DataFrame columns with zero (0), an empty string, a space, or any constant literal value.
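For example, a minimal sketch (the toy id/label columns here are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(None, 'a'), (2, None)], ['id', 'label'])

df.fillna(0).show()                     # fills nulls in numeric columns only
df.fillna({'label': 'unknown'}).show()  # fills just the named column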

How do I change a value to NULL in Spark DataFrame?

The replacement of null values in PySpark DataFrames is one of the most common operations undertaken. This can be achieved using either the DataFrame.fillna() or DataFrameNaFunctions.fill() method.
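The snippet above covers filling nulls; for the literal question of changing an existing value to NULL, one common approach (not shown in the answer above, and using a hypothetical 'id' column) is when()/otherwise() with lit(None):

from pyspark.sql import functions as F

# Turn a sentinel value into a real NULL in a hypothetical 'id' column
df = df.withColumn('id', F.when(F.col('id') == -1, F.lit(None)).otherwise(F.col('id')))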

How do you assign a NULL to a column in PySpark?

In PySpark, to add a new column to a DataFrame, use the lit() function (from pyspark.sql.functions import lit). lit() takes a constant value and returns a Column type; if you want to add a NULL/None, use lit(None).
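A short sketch (the column name is arbitrary); note that lit(None) typically needs an explicit cast so the new column gets a concrete type:

from pyspark.sql.functions import lit

# Add a column that is NULL on every row, typed as string
df = df.withColumn('new_col', lit(None).cast('string'))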


2 Answers

I believe I have a much simpler solution than the accepted one. It also uses window functions, but relies on last() with ignorenulls=True so that null values are skipped.

Let's re-create something similar to the original data:

import sys
from pyspark.sql.window import Window
import pyspark.sql.functions as func

# Sample data: 'id' is missing (null) on most rows
d = [{'session': 1, 'ts': 1},
     {'session': 1, 'ts': 2, 'id': 109},
     {'session': 1, 'ts': 3},
     {'session': 1, 'ts': 4, 'id': 110},
     {'session': 1, 'ts': 5},
     {'session': 1, 'ts': 6}]
df = spark.createDataFrame(d)
df.show()

This prints:

+-------+---+----+
|session| ts|  id|
+-------+---+----+
|      1|  1|null|
|      1|  2| 109|
|      1|  3|null|
|      1|  4| 110|
|      1|  5|null|
|      1|  6|null|
+-------+---+----+

Now, if we use the window function last() with ignorenulls=True:

df.withColumn("id", func.last('id', True).over(Window.partitionBy('session').orderBy('ts').rowsBetween(-sys.maxsize, 0))).show() 

We just get:

+-------+---+----+
|session| ts|  id|
+-------+---+----+
|      1|  1|null|
|      1|  2| 109|
|      1|  3| 109|
|      1|  4| 110|
|      1|  5| 110|
|      1|  6| 110|
+-------+---+----+
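As a side note, on reasonably recent PySpark versions the same unbounded frame is usually written with the built-in window constants rather than -sys.maxsize; assuming a current version, this is equivalent:

w = (Window.partitionBy('session')
           .orderBy('ts')
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df.withColumn('id', func.last('id', ignorenulls=True).over(w)).show()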

Hope it helps!

answered Oct 04 '22 by elmosca


This seems to be doing the trick using Window functions:

import sys
from pyspark.sql.window import Window
import pyspark.sql.functions as func

def fill_nulls(df):
    # Use -1 as a sentinel for missing ids
    df_na = df.na.fill(-1)
    # Previous row's id within each session
    lag = df_na.withColumn('id_lag', func.lag('id', default=-1)
                           .over(Window.partitionBy('session')
                                 .orderBy('timestamp')))
    # Flag rows where a new non-null id appears
    switch = lag.withColumn('id_change',
                            ((lag['id'] != lag['id_lag']) &
                             (lag['id'] != -1)).cast('integer'))
    # Running sum of the flags splits each session into sub-sessions
    switch_sess = switch.withColumn(
        'sub_session',
        func.sum('id_change')
        .over(Window.partitionBy('session')
              .orderBy('timestamp')
              .rowsBetween(-sys.maxsize, 0)))
    # The first id of each sub-session is the value to propagate
    fid = switch_sess.withColumn('nn_id',
                                 func.first('id')
                                 .over(Window.partitionBy('session', 'sub_session')
                                       .orderBy('timestamp')))
    # Turn the -1 sentinel back into a real null
    fid_na = fid.replace(-1, None)
    ff = (fid_na.drop('id').drop('id_lag')
                .drop('id_change')
                .drop('sub_session')
                .withColumnRenamed('nn_id', 'id'))
    return ff
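A quick usage sketch, assuming a DataFrame with the question's session/timestamp/id schema:

d = [{'session': 1, 'timestamp': 1},
     {'session': 1, 'timestamp': 2, 'id': 109},
     {'session': 1, 'timestamp': 3},
     {'session': 1, 'timestamp': 4}]
df = spark.createDataFrame(d)

fill_nulls(df).show()  # id comes back as: null, 109, 109, 109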

Here is the full null_test.py.

answered Oct 04 '22 by Oleksiy