For every row in a PySpark DataFrame I am trying to get a value from the most recent preceding row that satisfies a certain condition.
That is, if my dataframe looks like this:
X | Flag
1 | 1
2 | 0
3 | 0
4 | 0
5 | 1
6 | 0
7 | 0
8 | 0
9 | 1
10 | 0
I want output that looks like this:
X | Lag_X | Flag
1 | NULL | 1
2 | 1 | 0
3 | 1 | 0
4 | 1 | 0
5 | 1 | 1
6 | 5 | 0
7 | 5 | 0
8 | 5 | 0
9 | 5 | 1
10 | 9 | 0
I thought I could do this with the lag function and a WindowSpec; unfortunately, WindowSpec doesn't support .filter or .when, so this does not work:
conditional_window = Window().orderBy("X").filter(df["Flag"] == 1)
df = df.withColumn('lag_x', f.lag(df["X"], 1).over(conditional_window))
It seems like this should be simple, but I have been racking my brain trying to find a solution, so any help with this would be greatly appreciated.
The question is old, but I thought the answer might help others.
Here is a working solution using window and lag functions:
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from pyspark.sql import Window

# Get or create a SparkSession (createDataFrame is a SparkSession method, not a SparkContext method)
spark = SparkSession.builder.getOrCreate()

# Create the example DataFrame
a = spark.createDataFrame([(1, 1),
(2, 0),
(3, 0),
(4, 0),
(5, 1),
(6, 0),
(7, 0),
(8, 0),
(9, 1),
(10, 0)]
, ['X', 'Flag'])
# Use a window function
win = Window.orderBy("X")
# Condition: the value of "Flag" in the preceding row is not 0
condition = F.lag(F.col("Flag"), 1).over(win) != 0
# Add a new column: when the condition is true, take the previous row's "X" (X - 1 works here because X is consecutive)
a = a.withColumn("Flag_X", F.when(condition, F.col("X") - 1))
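# (Optional) print the intermediate result; this is the table shown below
a.show()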
Now, we obtain a DataFrame as shown below
+---+----+------+
| X|Flag|Flag_X|
+---+----+------+
| 1| 1| null|
| 2| 0| 1|
| 3| 0| null|
| 4| 0| null|
| 5| 1| null|
| 6| 0| 5|
| 7| 0| null|
| 8| 0| null|
| 9| 1| null|
| 10| 0| 9|
+---+----+------+
To fill the null values, forward-fill with F.last and ignorenulls=True (with an ordered window, the default frame runs from the start of the partition to the current row, so each row picks up the most recent non-null value):
a = a.withColumn("Flag_X",
F.last(F.col("Flag_X"), ignorenulls=True)\
.over(win))
So the final DataFrame is as required:
+---+----+------+
| X|Flag|Flag_X|
+---+----+------+
| 1| 1| null|
| 2| 0| 1|
| 3| 0| 1|
| 4| 0| 1|
| 5| 1| 1|
| 6| 0| 5|
| 7| 0| 5|
| 8| 0| 5|
| 9| 1| 5|
| 10| 0| 9|
+---+----+------+
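As a side note, the same result can probably be obtained in a single expression: keep X only on rows where Flag is 1 and carry the last non-null value forward over a frame that ends at the row before the current one. A minimal sketch, assuming the same DataFrame a as above (prev_win is just a local name used here):
# Sketch: forward-fill X from rows where Flag == 1, looking only at strictly preceding rows
prev_win = Window.orderBy("X").rowsBetween(Window.unboundedPreceding, -1)
a = a.withColumn(
    "Flag_X",
    F.last(F.when(F.col("Flag") == 1, F.col("X")), ignorenulls=True).over(prev_win)
)
This should give the same table as above without the intermediate X - 1 step.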