I have a Spark dataframe created from a pandas dataframe as shown below

import pandas as pd
import numpy as np

df = pd.DataFrame({
'subject_id':[1,1,1,1,2,2,2,2,3,3,4,4,4,4,4],
'readings' : ['READ_1','READ_2','READ_1','READ_3',np.nan,'READ_5',np.nan,'READ_8','READ_10','READ_12','READ_11','READ_14','READ_09','READ_08','READ_07'],
'val' :[5,np.nan,7,np.nan,np.nan,7,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,46],
})
from pyspark.sql.types import *
from pyspark.sql.functions import isnan, when, count, col
mySchema = StructType([
    StructField("subject_id", LongType(), True),
    StructField("readings", StringType(), True),
    StructField("val", FloatType(), True),
])
spark_df = spark.createDataFrame(df, schema=mySchema)
What I would like to do is drop columns that have more than 80% NaN, NULL, or 0 values.
I tried something like the following, but it doesn't work
spark_df = spark_df.dropna(axis = 'columns',how=any,thresh=12)
The above is possible in pandas, but it doesn't work here. I get the error below, which isn't surprising
TypeError: dropna() got an unexpected keyword argument 'axis'
Please note that my real dataframe has 40 million rows and 3k columns. I referred to this post, but it doesn't have an answer yet.
Is there anything equivalent to this in pyspark?
I expect my output to be as shown below, with just 2 columns.
You can use the subset parameter of the dropna method to specify which columns to check for null values.
To remove all columns with more than 80% null values:
columns_to_drop = []
count_before = spark_df.count()

for column_name in spark_df.columns:
    # Drop the rows where this column is null/NaN and see how many rows remain
    temp_spark_df = spark_df.dropna(subset=[column_name], how='any')
    count_after = temp_spark_df.count()
    if (count_before - count_after) / count_before > 0.8:
        columns_to_drop.append(column_name)

spark_df = spark_df.drop(*columns_to_drop)
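
Note that dropna only considers null/NaN values, so the loop above won't pick up the 0 values mentioned in the question, and it triggers one full count per column, which adds up with 3k columns. As an alternative, you could count NULL, NaN, and 0 per column in a single aggregation pass, roughly like the sketch below (the 0.8 threshold and the variable names are just illustrative):

from pyspark.sql.functions import col, count, isnan, when

total = spark_df.count()
dtypes = dict(spark_df.dtypes)            # e.g. {'subject_id': 'bigint', 'readings': 'string', ...}
numeric = {'tinyint', 'smallint', 'int', 'bigint', 'float', 'double'}

# Build one count(...) expression per column so everything is computed in a single pass
exprs = []
for c in spark_df.columns:
    cond = col(c).isNull()
    if dtypes[c] in numeric:              # only compare numeric columns against 0
        cond = cond | (col(c) == 0)
    if dtypes[c] in ('float', 'double'):  # isnan() is only valid for float/double columns
        cond = cond | isnan(col(c))
    exprs.append(count(when(cond, c)).alias(c))

bad_counts = spark_df.select(exprs).first().asDict()

# Drop every column where more than 80% of the rows are NULL, NaN, or 0
to_drop = [c for c, n in bad_counts.items() if n / total > 0.8]
spark_df = spark_df.drop(*to_drop)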