 

Remove rows from dataframe based on condition in pyspark

I have a dataframe with two columns:

+--------+-----+
|    col1| col2|
+--------+-----+
|22      | 12.2|
|1       |  2.1|
|5       | 52.1|
|2       | 62.9|
|77      | 33.3|
+--------+-----+

I would like to create a new dataframe that keeps only the rows where

"value of col1" > "value of col2"

Just as a note, col1 is of long type and col2 is of double type.

The result should look like this:

+--------+----+
|    col1|col2|
+--------+----+
|22      |12.2|
|77      |33.3|
+--------+----+
asked Sep 18 '18 by LDropl

People also ask

How do I remove rows based on conditions in PySpark?

In order to remove rows with NULL values in selected columns of a PySpark DataFrame, use drop(columns: Seq[String]) or drop(columns: Array[String]). Pass these functions the names of the columns you want to check for NULL values when deleting rows.
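
In PySpark the same thing is done with na.drop(subset=...); a minimal sketch, assuming a DataFrame named df and the column names from the question:

# Drop rows that contain a NULL in either col1 or col2
df_clean = df.na.drop(subset=["col1", "col2"])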

How do I delete a row in a DataFrame based on a condition?

Use the pandas.DataFrame.drop() method to delete/remove rows matching a condition.
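
A minimal pandas sketch (pdf and its values are hypothetical); drop() removes the rows whose index labels match the boolean condition:

import pandas as pd

# Hypothetical data; drop the rows where the condition col1 <= col2 holds
pdf = pd.DataFrame({"col1": [22, 1, 5], "col2": [12.2, 2.1, 52.1]})
pdf = pdf.drop(pdf[pdf.col1 <= pdf.col2].index)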

How do you delete a row in DataFrame PySpark?

Dropping rows with NA or missing values in PySpark is accomplished using the na.drop() function.
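
A short sketch of this, assuming a DataFrame named df (dropna() is an equivalent spelling of the same call):

# Drop rows containing a NULL in any column
df_no_nulls = df.na.drop()
df_no_nulls = df.dropna(how="any")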


1 Answer

The best way to keep rows based on a condition is to use filter, as mentioned by others.
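
For the question above that is a one-liner; a minimal sketch, assuming the DataFrame is named df:

# Keep only the rows where col1 > col2 (where() is an alias for filter())
new_df = df.filter(df.col1 > df.col2)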

To answer the question as stated in the title, one option to remove rows based on a condition is to use a left_anti join in PySpark. For example, to delete all rows with col1 > col2, use:

# Rows matching the condition, i.e. the rows to be removed
rows_to_delete = df.filter(df.col1 > df.col2)

# A left_anti join keeps only the rows of df that have no match in rows_to_delete
df_with_rows_deleted = df.join(rows_to_delete, on=[key_column], how='left_anti')
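
Here key_column stands for a column (or list of columns) that uniquely identifies each row. As a self-contained sketch, assuming a hypothetical id column added only for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "id" is a hypothetical unique key, added only for this illustration
df = spark.createDataFrame(
    [(1, 22, 12.2), (2, 1, 2.1), (3, 5, 52.1), (4, 2, 62.9), (5, 77, 33.3)],
    ["id", "col1", "col2"],
)

rows_to_delete = df.filter(df.col1 > df.col2)
df_with_rows_deleted = df.join(rows_to_delete, on=["id"], how="left_anti")
# Remaining rows: ids 2, 3 and 4, i.e. those where col1 <= col2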

answered Oct 14 '22 by MMizani