
PySpark DataFrame: how to drop rows with nulls in all columns?

Given a DataFrame that looks like this:

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|null|null|
|null|   B|  X1|
+----+----+----+

After dropping the rows where every column is null, I'd like:

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+

I'd prefer a general method that still works when df.columns is very long. Thanks!

kww asked Jan 12 '18

People also ask

How do you drop a row with NULL values?

Drop all rows having at least one null value: DataFrame.dropna() is your friend. Calling dropna() on the whole DataFrame without any arguments (i.e. the default behaviour) drops every row that contains at least one missing value.
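Note that this answer describes the pandas API rather than PySpark. A minimal pandas sketch of the two behaviours (default vs. how="all"), assuming pandas is installed:

```python
import pandas as pd

# Three rows: complete, all-missing, partially missing
df = pd.DataFrame({
    "ID":   [1,    None, 2],
    "TYPE": ["B",  None, None],
    "CODE": ["X1", None, "X1"],
})

# Default: drop rows with at least one missing value
print(len(df.dropna()))            # only the complete row survives

# how="all": drop only rows where every value is missing
print(len(df.dropna(how="all")))   # the partially missing row survives too
```

The how="all" variant is the pandas counterpart of the PySpark df.na.drop(how="all") shown in the accepted answer below.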

How do you replace NULL values with some other value or discard the rows with NULL values in Spark?

The fillna() function was introduced in Spark 1.3.1 and is used to replace null values with another specified value. It accepts two parameters, value and subset: value is the replacement for nulls, and subset optionally restricts the replacement to the listed columns.
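The value/subset semantics can be sketched in plain Python over hypothetical row dictionaries (this stands in for the DataFrame, it is not Spark itself):

```python
# Hypothetical rows standing in for DataFrame records
rows = [
    {"ID": 1,    "TYPE": "B",  "CODE": "X1"},
    {"ID": None, "TYPE": None, "CODE": None},
]

def fillna(rows, value, subset=None):
    """Replace None with `value`, optionally only in `subset` columns."""
    cols = subset or rows[0].keys()
    return [
        {k: (value if v is None and k in cols else v) for k, v in r.items()}
        for r in rows
    ]

filled = fillna(rows, "N/A", subset=["TYPE", "CODE"])
# ID stays None; TYPE/CODE nulls become "N/A"
```

In real PySpark the equivalent call would be df.fillna("N/A", subset=["TYPE", "CODE"]).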

How do I remove all rows from a PySpark DataFrame?

We can use where or filter function to 'remove' or 'delete' rows from a DataFrame.
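Since filter/where keep the rows matching a predicate, "deleting" rows means keeping their complement. A plain-Python sketch of that logic for the all-null case (hypothetical row data, not Spark):

```python
rows = [
    {"ID": 1,    "TYPE": "B",  "CODE": "X1"},
    {"ID": None, "TYPE": None, "CODE": None},
    {"ID": None, "TYPE": "B",  "CODE": "X1"},
]

# Keep a row unless every value is None -- this predicate mirrors
# what df.filter(~<all columns null>) expresses in Spark.
kept = [r for r in rows if not all(v is None for v in r.values())]
print(len(kept))  # 2
```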




2 Answers

Providing a how strategy to na.drop is all you need:

df = spark.createDataFrame([
    (1, "B", "X1"), (None, None, None), (None, "B", "X1"), (None, "C", None)],
    ("ID", "TYPE", "CODE")
)

df.na.drop(how="all").show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+  
|   1|   B|  X1|
|null|   B|  X1|
|null|   C|null|
+----+----+----+

An alternative formulation uses thresh (the minimum number of non-null values a row must have to be kept):

df.na.drop(thresh=1).show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
|null|   C|null|
+----+----+----+
zero323 answered Oct 01 '22


One option is to use functools.reduce to construct the conditions:

from functools import reduce
df.filter(~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])).show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+

where reduce produces a condition equivalent to:

~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])
# Column<b'(NOT (((ID IS NULL) AND (TYPE IS NULL)) AND (CODE IS NULL)))'>
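The folding that reduce performs can be seen with plain strings instead of Column objects; it builds the same nested AND as the expression above:

```python
from functools import reduce

conds = ["(ID IS NULL)", "(TYPE IS NULL)", "(CODE IS NULL)"]
# Fold left-to-right, wrapping each pair in parentheses
combined = reduce(lambda x, y: f"({x} AND {y})", conds)
print(combined)
# (((ID IS NULL) AND (TYPE IS NULL)) AND (CODE IS NULL))
```

With PySpark Columns, & plays the role of AND and ~ negates the whole condition.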
Psidom answered Oct 01 '22