 

pyspark: drop columns that have same values in all rows

Tags:

pyspark

Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?

So I have a pyspark dataframe, and I want to drop the columns in which every row has the same value, while keeping the other columns intact.

However, the answers to the question above are only for pandas. Is there a solution for a pyspark dataframe?
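For illustration, something like this (the column names and values here are just made up by me):

df.show()
# +---+---+---+
# |  a|  b|  c|
# +---+---+---+
# |  1|  x| 10|
# |  2|  x| 20|
# |  3|  x| 30|
# +---+---+---+
# desired result: column 'b' is dropped, 'a' and 'c' are untouched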

Thanks

asked Jan 27 '23 by mommomonthewind

1 Answer

You can apply the countDistinct() aggregation function to each column to get the number of distinct values per column. A column whose distinct count is 1 holds the same value in every row.

from pyspark.sql.functions import col, countDistinct

# apply countDistinct on each column
col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()

# collect the names of the columns that have only one distinct value
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]

# drop the selected columns
df.drop(*cols_to_drop).show()
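As a quick check, here is a minimal, self-contained sketch of the same approach on a made-up dataframe (the SparkSession setup and the sample data are my own assumptions, not part of the original answer):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, countDistinct

spark = SparkSession.builder.getOrCreate()

# toy dataframe: column 'b' holds the same value in every row
df = spark.createDataFrame(
    [(1, "x", 10), (2, "x", 20), (3, "x", 30)],
    ["a", "b", "c"],
)

col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]  # ['b']

df.drop(*cols_to_drop).show()  # only 'a' and 'c' remain

One thing to keep in mind: countDistinct() ignores nulls, so a column with a single non-null value plus some nulls will also be dropped, while an all-null column (distinct count 0) will not.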
answered Jun 12 '23 by Manoj Singh