I have a dataframe with two columns:
+--------+-----+
|    col1| col2|
+--------+-----+
|      22| 12.2|
|       1|  2.1|
|       5| 52.1|
|       2| 62.9|
|      77| 33.3|
+--------+-----+
I would like to create a new dataframe that keeps only the rows where the value of col1 is greater than the value of col2.
Just as a note, col1 is of long type and col2 is of double type.
The result should look like this:
+--------+----+
|    col1|col2|
+--------+----+
|      22|12.2|
|      77|33.3|
+--------+----+
The best way to keep rows based on a condition is to use filter, as mentioned by others.
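As a minimal sketch of that approach, assuming a local SparkSession and the sample data from the question (the spark variable and the way the DataFrame is built here are assumptions, not part of the original post):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; Python ints become long, floats become double.
df = spark.createDataFrame(
    [(22, 12.2), (1, 2.1), (5, 52.1), (2, 62.9), (77, 33.3)],
    ["col1", "col2"],
)

# Keep only the rows where col1 > col2; the long column is promoted to
# double for the comparison, so the mixed types are not a problem.
kept = df.filter(F.col("col1") > F.col("col2"))
kept.show()
# +----+----+
# |col1|col2|
# +----+----+
# |  22|12.2|
# |  77|33.3|
# +----+----+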
To answer the question as stated in the title, one option to remove rows based on a condition is to use a left_anti join in PySpark. For example, to delete all rows where col1 > col2 (key_column stands for a column that uniquely identifies each row):
rows_to_delete = df.filter(df.col1 > df.col2)
df_with_rows_deleted = df.join(rows_to_delete, on=[key_column], how='left_anti')
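A runnable illustration of that anti-join, reusing the spark session above and assuming a hypothetical "id" column as the unique key (the id values are invented for this example):

# Hypothetical unique key "id" added so the anti-join can match rows exactly.
df = spark.createDataFrame(
    [(1, 22, 12.2), (2, 1, 2.1), (3, 5, 52.1), (4, 2, 62.9), (5, 77, 33.3)],
    ["id", "col1", "col2"],
)

# Rows that satisfy the condition...
rows_to_delete = df.filter(df.col1 > df.col2)

# ...and their complement: left_anti keeps only the rows of df that have
# no matching "id" in rows_to_delete, i.e. the rows where col1 <= col2.
df_with_rows_deleted = df.join(rows_to_delete, on=["id"], how="left_anti")
df_with_rows_deleted.show()

The plain filter is simpler when the condition lives entirely inside one DataFrame; the left_anti pattern is mainly useful when the rows to drop are defined by a separate DataFrame.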