With a DataFrame like this:
rdd_2 = sc.parallelize([
    (0, 10, 223, "201601"), (0, 10, 83, "2016032"),
    (1, 20, None, "201602"), (1, 20, 3003, "201601"), (1, 20, None, "201603"),
    (2, 40, 2321, "201601"), (2, 30, 10, "201602"), (2, 61, None, "201601")
])
df_data = sqlContext.createDataFrame(rdd_2, ["id", "type", "cost", "date"])
df_data.show()
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|null| 201602|
| 1| 20|3003| 201601|
| 1| 20|null| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|null| 201601|
+---+----+----+-------+
I need to fill the null values with the average of the existing values, with the expected result being:
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|1128| 201602|
| 1| 20|3003| 201601|
| 1| 20|1128| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|1128| 201601|
+---+----+----+-------+
where 1128 is the average of the existing values. I need to do that for several columns.
My current approach is to use na.fill:
# one aggregation job (plus a full collect) per column, just to get a single mean
fill_values = {column: df_data.agg({column: "mean"}).flatMap(list).collect()[0]
               for column in df_data.columns if column not in ['date', 'id']}
df_data = df_data.na.fill(fill_values)
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|1128| 201602|
| 1| 20|3003| 201601|
| 1| 20|1128| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|1128| 201601|
+---+----+----+-------+
But this is very cumbersome. Any ideas?
Well, one way or another you have to:
- compute the statistics
- fill the blanks

That pretty much limits what you can really improve here. Still, you can:
- replace flatMap(list).collect()[0] with first()[0] or structure unpacking
- compute all the statistics with a single action
- use the built-in Row methods to extract a dictionary

The final result could look like this:
from pyspark.sql.functions import avg

def fill_with_mean(df, exclude=set()):
    # compute the mean of every non-excluded column in a single action
    stats = df.agg(*(
        avg(c).alias(c) for c in df.columns if c not in exclude
    ))
    # first() returns one Row; asDict() gives the {column: mean}
    # mapping that na.fill expects
    return df.na.fill(stats.first().asDict())

fill_with_mean(df_data, ["id", "date"])
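For reference, the first bullet on its own looks like this; a minimal sketch against the cost column, using mean from pyspark.sql.functions:

from pyspark.sql.functions import mean

# agg returns a one-row DataFrame; first()[0] extracts the single value,
# replacing the flatMap(list).collect()[0] round trip
cost_mean = df_data.agg(mean("cost")).first()[0]

# the same value via structure unpacking of the returned Row
(cost_mean,) = df_data.agg(mean("cost")).first()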
In Spark 2.2 or later you can also use Imputer. See Replace missing values with mean - Spark Dataframe.
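For completeness, a minimal Imputer sketch, assuming Spark 2.2+; Imputer expects float or double input columns, so the integer cost column is cast first, and the output column name cost_imputed is just an illustrative choice:

from pyspark.ml.feature import Imputer
from pyspark.sql.functions import col

# Imputer works on float/double columns, so cast the integer column first
df_double = df_data.withColumn("cost", col("cost").cast("double"))

# strategy="mean" is the default; "median" is also supported
imputer = Imputer(inputCols=["cost"], outputCols=["cost_imputed"], strategy="mean")
df_imputed = imputer.fit(df_double).transform(df_double)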