 

Remove rows from dataframe based on condition in pyspark

I have a dataframe with two columns:

+--------+-----+
|    col1| col2|
+--------+-----+
|22      | 12.2|
|1       |  2.1|
|5       | 52.1|
|2       | 62.9|
|77      | 33.3|
+--------+-----+

I would like to create a new dataframe that keeps only the rows where

"value of col1" > "value of col2"

Just as a note, col1 is of long type and col2 is of double type.

The result should look like this:

+--------+----+
|    col1|col2|
+--------+----+
|22      |12.2|
|77      |33.3|
+--------+----+
asked Sep 18 '18 by LDropl

People also ask

How do I remove rows based on conditions in PySpark?

In order to remove rows with NULL values in selected columns of a PySpark DataFrame, use drop(columns: Seq[String]) or drop(columns: Array[String]). Pass these functions the names of the columns you want to check for NULL values when deleting rows.
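
In PySpark the same thing is done with na.drop(subset=...); a minimal sketch, assuming a DataFrame named df and the column names from the question:

# Drop rows that contain a NULL in either col1 or col2
df_clean = df.na.drop(subset=["col1", "col2"])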

How do I delete a row in a DataFrame based on a condition?

Use the pandas.DataFrame.drop() method to delete/remove rows matching a condition.
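
A minimal pandas sketch (pdf and its values are hypothetical); drop() removes the rows whose index labels match the boolean condition:

import pandas as pd

# Hypothetical data; drop the rows where the condition col1 <= col2 holds
pdf = pd.DataFrame({"col1": [22, 1, 5], "col2": [12.2, 2.1, 52.1]})
pdf = pdf.drop(pdf[pdf.col1 <= pdf.col2].index)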

How do you delete a row in DataFrame PySpark?

Dropping rows with NA or missing values in PySpark is accomplished using the na.drop() function.
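
A short sketch of this, assuming a DataFrame named df (dropna() is an equivalent spelling of the same call):

# Drop rows containing a NULL in any column
df_no_nulls = df.na.drop()
df_no_nulls = df.dropna(how="any")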


1 Answer

The best way to keep rows based on a condition is to use filter, as mentioned by others.
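
For the question above that is a one-liner; a minimal sketch, assuming the DataFrame is named df:

# Keep only the rows where col1 > col2 (where() is an alias for filter())
new_df = df.filter(df.col1 > df.col2)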

To answer the question as stated in the title, one option to remove rows based on a condition is to use a left_anti join in PySpark. For example, to delete all rows with col1 > col2, use:

# Rows matching the condition, i.e. the rows to be removed
rows_to_delete = df.filter(df.col1 > df.col2)

# A left_anti join keeps only the rows of df that have no match in rows_to_delete
df_with_rows_deleted = df.join(rows_to_delete, on=[key_column], how='left_anti')
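
Here key_column stands for a column (or list of columns) that uniquely identifies each row. As a self-contained sketch, assuming a hypothetical id column added only for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "id" is a hypothetical unique key, added only for this illustration
df = spark.createDataFrame(
    [(1, 22, 12.2), (2, 1, 2.1), (3, 5, 52.1), (4, 2, 62.9), (5, 77, 33.3)],
    ["id", "col1", "col2"],
)

rows_to_delete = df.filter(df.col1 > df.col2)
df_with_rows_deleted = df.join(rows_to_delete, on=["id"], how="left_anti")
# Remaining rows: ids 2, 3 and 4, i.e. those where col1 <= col2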

answered Oct 14 '22 by MMizani