I have a Spark dataframe created from a pandas dataframe as shown below

import pandas as pd
import numpy as np

df = pd.DataFrame({
'subject_id':[1,1,1,1,2,2,2,2,3,3,4,4,4,4,4],
'readings' : ['READ_1','READ_2','READ_1','READ_3',np.nan,'READ_5',np.nan,'READ_8','READ_10','READ_12','READ_11','READ_14','READ_09','READ_08','READ_07'],
'val' :[5,np.nan,7,np.nan,np.nan,7,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,46],
})
from pyspark.sql.types import *
from pyspark.sql.functions import isnan, when, count, col
mySchema = StructType([
    StructField("subject_id", LongType(), True),
    StructField("readings", StringType(), True),
    StructField("val", FloatType(), True),
])
spark_df = spark.createDataFrame(df, schema=mySchema)
What I would like to do is drop columns that have more than 80% NaN, NULL, or 0 values.
I tried something like the following, but it doesn't work
spark_df = spark_df.dropna(axis = 'columns',how=any,thresh=12)
The above is possible in pandas, but it doesn't work here. I get the error below, which isn't surprising
TypeError: dropna() got an unexpected keyword argument 'axis'
Please note that my real dataframe has 40 million rows and 3k columns. I referred to this post, but it doesn't have an answer yet.
Is there anything equivalent to this in pyspark?
I expect my output to be as shown below, with just 2 columns.
You can use the subset parameter of the dropna method to specify which columns to check for null values.
To remove all columns with more than 80% null values:
columns_to_drop = []
count_before = spark_df.count()

for column_name in spark_df.columns:
    # Drop the rows where this column is null/NaN and see how many rows remain
    temp_spark_df = spark_df.dropna(subset=[column_name], how='any')
    count_after = temp_spark_df.count()
    if (count_before - count_after) / count_before > 0.8:
        columns_to_drop.append(column_name)

spark_df = spark_df.drop(*columns_to_drop)
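
Note that dropna only considers null/NaN values, so the loop above won't pick up the 0 values mentioned in the question, and it triggers one full count per column, which adds up with 3k columns. As an alternative, you could count NULL, NaN, and 0 per column in a single aggregation pass, roughly like the sketch below (the 0.8 threshold and the variable names are just illustrative):

from pyspark.sql.functions import col, count, isnan, when

total = spark_df.count()
dtypes = dict(spark_df.dtypes)            # e.g. {'subject_id': 'bigint', 'readings': 'string', ...}
numeric = {'tinyint', 'smallint', 'int', 'bigint', 'float', 'double'}

# Build one count(...) expression per column so everything is computed in a single pass
exprs = []
for c in spark_df.columns:
    cond = col(c).isNull()
    if dtypes[c] in numeric:              # only compare numeric columns against 0
        cond = cond | (col(c) == 0)
    if dtypes[c] in ('float', 'double'):  # isnan() is only valid for float/double columns
        cond = cond | isnan(col(c))
    exprs.append(count(when(cond, c)).alias(c))

bad_counts = spark_df.select(exprs).first().asDict()

# Drop every column where more than 80% of the rows are NULL, NaN, or 0
to_drop = [c for c, n in bad_counts.items() if n / total > 0.8]
spark_df = spark_df.drop(*to_drop)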