
PySpark DataFrame: how to drop rows with nulls in all columns?

Given a DataFrame that looks like this:

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|null|null|
|null|   B|  X1|
+----+----+----+

After dropping the rows where every column is null, I'd like:

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+

I'd prefer a general method that still works when df.columns is very long. Thanks!

kww asked Jan 12 '18

People also ask

How do you drop a row with NULL values?

Drop all rows having at least one null value: DataFrame.dropna() is your friend. Calling dropna() on the whole DataFrame without any arguments (i.e. the default behaviour) drops every row that contains at least one missing value.
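Note that this answer describes the pandas API rather than PySpark. A minimal pandas sketch of the two behaviours (default vs. how="all"), assuming pandas is installed:

```python
import pandas as pd

# Three rows: complete, all-missing, partially missing
df = pd.DataFrame({
    "ID":   [1,    None, 2],
    "TYPE": ["B",  None, None],
    "CODE": ["X1", None, "X1"],
})

# Default: drop rows with at least one missing value
print(len(df.dropna()))            # only the complete row survives

# how="all": drop only rows where every value is missing
print(len(df.dropna(how="all")))   # the partially missing row survives too
```

The how="all" variant is the pandas counterpart of the PySpark df.na.drop(how="all") shown in the accepted answer below.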

How do you replace NULL values with some other value or discard the rows with NULL values in Spark?

The fillna() function was introduced in Spark 1.3.1 and is used to replace null values with another specified value. It accepts two parameters, value and subset: value is the replacement for nulls, and subset optionally restricts the replacement to the listed columns.
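The value/subset semantics can be sketched in plain Python over hypothetical row dictionaries (this stands in for the DataFrame, it is not Spark itself):

```python
# Hypothetical rows standing in for DataFrame records
rows = [
    {"ID": 1,    "TYPE": "B",  "CODE": "X1"},
    {"ID": None, "TYPE": None, "CODE": None},
]

def fillna(rows, value, subset=None):
    """Replace None with `value`, optionally only in `subset` columns."""
    cols = subset or rows[0].keys()
    return [
        {k: (value if v is None and k in cols else v) for k, v in r.items()}
        for r in rows
    ]

filled = fillna(rows, "N/A", subset=["TYPE", "CODE"])
# ID stays None; TYPE/CODE nulls become "N/A"
```

In real PySpark the equivalent call would be df.fillna("N/A", subset=["TYPE", "CODE"]).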

How do I remove all rows from a PySpark DataFrame?

We can use where or filter function to 'remove' or 'delete' rows from a DataFrame.
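Since filter/where keep the rows matching a predicate, "deleting" rows means keeping their complement. A plain-Python sketch of that logic for the all-null case (hypothetical row data, not Spark):

```python
rows = [
    {"ID": 1,    "TYPE": "B",  "CODE": "X1"},
    {"ID": None, "TYPE": None, "CODE": None},
    {"ID": None, "TYPE": "B",  "CODE": "X1"},
]

# Keep a row unless every value is None -- this predicate mirrors
# what df.filter(~<all columns null>) expresses in Spark.
kept = [r for r in rows if not all(v is None for v in r.values())]
print(len(kept))  # 2
```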




2 Answers

Providing a how strategy to na.drop is all you need:

df = spark.createDataFrame([
    (1, "B", "X1"), (None, None, None), (None, "B", "X1"), (None, "C", None)],
    ("ID", "TYPE", "CODE")
)

df.na.drop(how="all").show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+  
|   1|   B|  X1|
|null|   B|  X1|
|null|   C|null|
+----+----+----+

An alternative formulation uses thresh (the minimum number of non-null values a row must have to be kept):

df.na.drop(thresh=1).show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
|null|   C|null|
+----+----+----+
zero323 answered Oct 01 '22


One option is to use functools.reduce to construct the conditions:

from functools import reduce
df.filter(~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])).show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+

where reduce produces a condition equivalent to:

~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])
# Column<b'(NOT (((ID IS NULL) AND (TYPE IS NULL)) AND (CODE IS NULL)))'>
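The folding that reduce performs can be seen with plain strings instead of Column objects; it builds the same nested AND as the expression above:

```python
from functools import reduce

conds = ["(ID IS NULL)", "(TYPE IS NULL)", "(CODE IS NULL)"]
# Fold left-to-right, wrapping each pair in parentheses
combined = reduce(lambda x, y: f"({x} AND {y})", conds)
print(combined)
# (((ID IS NULL) AND (TYPE IS NULL)) AND (CODE IS NULL))
```

With PySpark Columns, & plays the role of AND and ~ negates the whole condition.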
Psidom answered Oct 01 '22