 

Drop columns of a Spark DataFrame where all entries are null

Using PySpark, how can I select/keep all columns of a DataFrame which contain at least one non-null value; or, equivalently, how can I remove all columns that contain no data at all?

Edit: as per Suresh's request, here is what I tried:

for column in media.columns:
    # a single distinct value is assumed to mean the column holds only null
    if media.select(column).distinct().count() == 1:
        media = media.drop(column)

Here I assumed that if the distinct count is one, the single value must be null. But I want to actually check that it is null, not just assume it. If there is any other inbuilt Spark function for this, let me know.
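For example, an explicit check along these lines is what I have in mind (my own sketch, not an inbuilt function), again on the media DataFrame:

for column in media.columns:
    # a column holding nothing but null has exactly one distinct row: Row(column=None)
    distinct_vals = media.select(column).distinct().collect()
    if len(distinct_vals) == 1 and distinct_vals[0][0] is None:
        media = media.drop(column)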

asked Aug 11 '17 by Naveen Honest Raj K
1 Answer

I tried it my way. Say I have a DataFrame as below:

from pyspark.sql import functions as F

>>> df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|null|
|null|   3|null|
|   5|null|null|
+----+----+----+

>>> df1 = df.agg(*[F.count(c).alias(c) for c in df.columns])
>>> df1.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   2|   2|   0|
+----+----+----+
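Note that F.count counts only the non-null values of a column, which is why col3 comes out as 0 here.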

>>> nonNull_cols = [c for c in df1.columns if df1[[c]].first()[c] > 0]
>>> df = df.select(*nonNull_cols)
>>> df.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|null|   3|
|   5|null|
+----+----+
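The same idea also works as a single pass (a compact variant of the above, assuming the same df and F import): collect the non-null counts into a dict with Row.asDict and keep only the columns with a positive count.

>>> counts = df.agg(*[F.count(c).alias(c) for c in df.columns]).first().asDict()
>>> df = df.select(*[c for c in df.columns if counts[c] > 0])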
answered Nov 07 '22 by Suresh