
Drop rows containing specific value in PySpark dataframe

I have a pyspark dataframe like:

A    B    C
1    NA   9
4    2    5
6    4    2
5    1    NA

I want to delete the rows that contain the value "NA", in this case the first and the last row. How can I implement this using Python and Spark?


Update based on a comment: I am looking for a solution that removes rows that have the string "NA" in any of the many columns.

jason_1093 asked Feb 23 '19 15:02





2 Answers

Just use a dataframe filter expression:

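from pyspark.sql import SparkSession

# Reuse the active session (e.g. in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()

l = [('1', 'NA', '9'),
     ('4', '2', '5'),
     ('6', '4', '2'),
     ('5', 'NA', '1')]
df = spark.createDataFrame(l, ['A', 'B', 'C'])
# The following filter requires that the checked columns are strings!
df = df.filter((df.A != 'NA') & (df.B != 'NA') & (df.C != 'NA'))
df.show()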

+---+---+---+ 
|  A|  B|  C| 
+---+---+---+ 
|  4|  2|  5| 
|  6|  4|  2| 
+---+---+---+

@bluephantom: If you have hundreds of columns, just generate a string expression with a comprehension over the column names:

# In my example, all columns need to be checked
listOfRelevantStringColumns = df.columns
expr = ' and '.join('(%s != "NA")' % col_name for col_name in listOfRelevantStringColumns)
df.filter(expr).show()
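A minimal sketch of the same idea wrapped in a helper, in case some column names contain spaces or dots. The function name `build_not_equal_filter` and the column "my col" are made up for illustration; the only Spark-specific assumption is that Spark SQL accepts backtick-quoted identifiers and single-quoted string literals.

```python
def build_not_equal_filter(columns, value="NA"):
    """Return a SQL predicate that is true only if no listed column equals `value`."""
    if not columns:
        raise ValueError("at least one column name is required")
    # Assumption for this sketch: the value needs no escaping inside '...'
    if "'" in value or "\\" in value:
        raise ValueError("value must not contain quotes or backslashes")
    clauses = []
    for name in columns:
        # Backticks allow names with spaces or dots; a literal backtick
        # inside a name is escaped by doubling it.
        quoted = "`" + name.replace("`", "``") + "`"
        clauses.append("(%s != '%s')" % (quoted, value))
    return " AND ".join(clauses)


filter_expr = build_not_equal_filter(["A", "B", "my col"])
print(filter_expr)
```

The resulting string could then be passed to the filter in the same way as above, e.g. `df.filter(filter_expr)`, again assuming the checked columns are strings.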
cronoik answered Oct 08 '22 07:10



If you want to remove the rows that contain "NA" in column A or column B:

df = df.filter((df.A != 'NA') & (df.B != 'NA'))
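A note on the operator, since it is easy to mix up: to drop a row when any checked column is "NA", the row should be kept only when every checked column differs from "NA", which corresponds to combining the conditions with & rather than |. The plain-Python sketch below mimics both combinations on the rows from the question; it does not use Spark, and the helper names are made up for illustration.

```python
# Rows from the question, with every cell stored as a string
rows = [("1", "NA", "9"), ("4", "2", "5"), ("6", "4", "2"), ("5", "1", "NA")]


def keep_if_all_differ(row, value="NA"):
    # Mirrors (A != 'NA') & (B != 'NA') & (C != 'NA')
    return all(cell != value for cell in row)


def keep_if_any_differs(row, value="NA"):
    # Mirrors (A != 'NA') | (B != 'NA') | (C != 'NA')
    return any(cell != value for cell in row)


kept_with_and = [row for row in rows if keep_if_all_differ(row)]
kept_with_or = [row for row in rows if keep_if_any_differs(row)]

print(kept_with_and)  # only the rows without any "NA"
print(kept_with_or)   # every row, because each has at least one non-"NA" cell
```

With the | combination a row is only dropped when all checked columns are "NA" at once, which is usually not what is wanted here.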

But sometimes we need to replace the value instead, with the mean (for a numeric column) or the most frequent value (for a categorical column). To do that, add a column with the same name, which replaces the original column, e.g. "A":

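from pyspark.sql.functions import mean, col, when
# mean() is an aggregate, so compute it first, skipping the "NA" rows
mean_a = df.select(mean(when(col("A") != "NA", col("A").cast("double")))).first()[0]
df = df.withColumn("A", when(col("A") == "NA", mean_a).otherwise(col("A").cast("double")))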
Ghias Ali answered Oct 08 '22 05:10
