 

PySpark: forward fill with last observation for a DataFrame

Using Spark 1.5.1, I've been trying to forward fill null values with the last known observation for one column of my DataFrame.

It is also possible for the column to start with a null value; in that case I would like to backward fill it with the first known observation. However, if that complicates the code too much, this point can be skipped.

In this post, zero323 provided a Scala solution for a very similar problem.

But I don't know Scala, and I haven't managed to translate it into PySpark API code. Is it possible to do this with PySpark?

Thanks for your help.

Below is a simple sample input:

| cookie_ID | Time       | User_ID |
| --------- | ---------- | ------- |
| 1         | 2015-12-01 | null    |
| 1         | 2015-12-02 | U1      |
| 1         | 2015-12-03 | U1      |
| 1         | 2015-12-04 | null    |
| 1         | 2015-12-05 | null    |
| 1         | 2015-12-06 | U2      |
| 1         | 2015-12-07 | null    |
| 1         | 2015-12-08 | U1      |
| 1         | 2015-12-09 | null    |
| 2         | 2015-12-03 | null    |
| 2         | 2015-12-04 | U3      |
| 2         | 2015-12-05 | null    |
| 2         | 2015-12-06 | U4      |

And the expected output:

| cookie_ID | Time       | User_ID |
| --------- | ---------- | ------- |
| 1         | 2015-12-01 | U1      |
| 1         | 2015-12-02 | U1      |
| 1         | 2015-12-03 | U1      |
| 1         | 2015-12-04 | U1      |
| 1         | 2015-12-05 | U1      |
| 1         | 2015-12-06 | U2      |
| 1         | 2015-12-07 | U2      |
| 1         | 2015-12-08 | U1      |
| 1         | 2015-12-09 | U1      |
| 2         | 2015-12-03 | U3      |
| 2         | 2015-12-04 | U3      |
| 2         | 2015-12-05 | U3      |
| 2         | 2015-12-06 | U4      |
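
For reference, a minimal sketch of how this sample input could be built as a PySpark DataFrame. The SparkSession setup and the variable name df are illustrative assumptions, not part of the original question; note that SparkSession requires Spark 2.0+, so on Spark 1.5.1 you would go through SQLContext instead.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName('forward-fill-example').getOrCreate()

# Rows mirror the sample input table above; None stands in for the null User_ID values.
data = [
    (1, '2015-12-01', None), (1, '2015-12-02', 'U1'), (1, '2015-12-03', 'U1'),
    (1, '2015-12-04', None), (1, '2015-12-05', None), (1, '2015-12-06', 'U2'),
    (1, '2015-12-07', None), (1, '2015-12-08', 'U1'), (1, '2015-12-09', None),
    (2, '2015-12-03', None), (2, '2015-12-04', 'U3'), (2, '2015-12-05', None),
    (2, '2015-12-06', 'U4'),
]

schema = StructType([
    StructField('cookie_ID', IntegerType()),
    StructField('Time', StringType()),
    StructField('User_ID', StringType()),
])

df = spark.createDataFrame(data, schema)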
asked Mar 15 '16 at 18:03 by Villo


1 Answer

A workaround to get this working is to try something like this:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window over each cookie, ordered by time, spanning from the start of the
# partition up to and including the current row.
window = (
    Window
    .partitionBy('cookie_ID')
    .orderBy('Time')
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

# 'joined' is the input DataFrame; F.last with ignorenulls=True picks the most
# recent non-null User_ID within the window.
final = (
    joined
    .withColumn('UserIDFilled', F.last('User_ID', ignorenulls=True).over(window))
)

What this does: it constructs the window based on the partition key (cookie_ID) and the order column (Time), and tells the window to look back over every row from the start of the partition up to the current row. Then, for each row, F.last with ignorenulls=True returns the last non-null value in that window (which, by the window definition, includes the current row itself).
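
For completeness, here is a sketch of how the question's leading-null case could be handled as well, by combining the forward fill above with a backward fill based on a forward-looking window and F.first(..., ignorenulls=True). The DataFrame df and column names come from the sample-data sketch earlier in the question; this is an illustration of the idea, not part of the original answer, and the ignorenulls option may require a newer Spark release than 1.5.1.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Forward fill: last non-null User_ID from the start of the partition up to the current row.
ffill_window = (
    Window
    .partitionBy('cookie_ID')
    .orderBy('Time')
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

# Backward fill: first non-null User_ID from the current row to the end of the partition.
bfill_window = (
    Window
    .partitionBy('cookie_ID')
    .orderBy('Time')
    .rowsBetween(Window.currentRow, Window.unboundedFollowing)
)

filled = (
    df
    .withColumn('ffill', F.last('User_ID', ignorenulls=True).over(ffill_window))
    .withColumn('bfill', F.first('User_ID', ignorenulls=True).over(bfill_window))
    # Prefer the forward-filled value; fall back to the backward-filled one for leading nulls.
    .withColumn('UserIDFilled', F.coalesce(F.col('ffill'), F.col('bfill')))
    .drop('ffill', 'bfill')
)

filled.orderBy('cookie_ID', 'Time').show()

The coalesce only kicks in at the start of a partition, where no earlier observation exists, which matches the expected output shown in the question.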

answered Sep 20 '22 at 02:09 by BICube