 

pyspark: drop columns that have same values in all rows

Tags:

pyspark

Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?

So I have a pyspark dataframe, and I want to drop the columns in which every row has the same value, while keeping the other columns intact.

However, the answers to the question above are only for pandas. Is there a solution for a pyspark dataframe?
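For illustration, something like this (the column names and values here are just made up by me):

df.show()
# +---+---+---+
# |  a|  b|  c|
# +---+---+---+
# |  1|  x| 10|
# |  2|  x| 20|
# |  3|  x| 30|
# +---+---+---+
# desired result: column 'b' is dropped, 'a' and 'c' are untouched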

Thanks

asked Jan 27 '23 by mommomonthewind

1 Answer

You can apply the countDistinct() aggregation function to each column to get the number of distinct values per column. A column whose distinct count is 1 holds the same value in every row.

from pyspark.sql.functions import col, countDistinct

# apply countDistinct on each column
col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()

# collect the names of the columns that have only one distinct value
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]

# drop the selected columns
df.drop(*cols_to_drop).show()
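As a quick check, here is a minimal, self-contained sketch of the same approach on a made-up dataframe (the SparkSession setup and the sample data are my own assumptions, not part of the original answer):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, countDistinct

spark = SparkSession.builder.getOrCreate()

# toy dataframe: column 'b' holds the same value in every row
df = spark.createDataFrame(
    [(1, "x", 10), (2, "x", 20), (3, "x", 30)],
    ["a", "b", "c"],
)

col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]  # ['b']

df.drop(*cols_to_drop).show()  # only 'a' and 'c' remain

One thing to keep in mind: countDistinct() ignores nulls, so a column with a single non-null value plus some nulls will also be dropped, while an all-null column (distinct count 0) will not.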
answered Jun 12 '23 by Manoj Singh