Using PySpark, how can I select/keep all columns of a DataFrame that contain at least one non-null value; or, equivalently, remove all columns that contain no data?
Edit: as per Suresh's request, here is what I tried:
for column in media.columns:
    # assumes a column with a single distinct value contains only nulls
    if media.select(media[column]).distinct().count() == 1:
        media = media.drop(column)
Here I assumed that if the distinct count is one, the column must contain only NaN/null values. But I wanted to check explicitly that the value really is NaN/null, and if there is another built-in Spark function for this, please let me know.
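For example, one way to make the check explicit could be to count the non-null values of each column (a minimal sketch, assuming "NaN" here really means SQL null; actual float NaN values would additionally need F.isnan):

for column in media.columns:
    # counting rows where the column is not null: zero means the column holds no data
    if media.where(media[column].isNotNull()).count() == 0:
        media = media.drop(column)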
I tried it my way. Say I have a dataframe as below:
>>> from pyspark.sql import functions as F
>>> df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 2|null|
|null| 3|null|
| 5|null|null|
+----+----+----+
>>> df1 = df.agg(*[F.count(c).alias(c) for c in df.columns])
>>> df1.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 2| 2| 0|
+----+----+----+
>>> nonNull_cols = [c for c in df1.columns if df1[[c]].first()[c] > 0]
>>> df = df.select(*nonNull_cols)
>>> df.show()
+----+----+
|col1|col2|
+----+----+
| 1| 2|
|null| 3|
| 5|null|
+----+----+
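This works because F.count(c) counts only the non-null values in column c, so any column whose aggregated count is 0 contains no data. If you want this as a reusable helper, a compact sketch could look like the following (the name drop_all_null_columns is just for illustration):

from pyspark.sql import DataFrame, functions as F

def drop_all_null_columns(df: DataFrame) -> DataFrame:
    # Single pass over the data: count the non-null values of every column at once
    counts = df.agg(*[F.count(c).alias(c) for c in df.columns]).first()
    # Keep only the columns that have at least one non-null value
    return df.select(*[c for c in df.columns if counts[c] > 0])

df = drop_all_null_columns(df)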