Let's say there is a DataFrame that looks like this:
+----+--------------+-----+
| age|best_guess_age| name|
+----+--------------+-----+
|  23|            23|Alice|
|null|            18|  Bob|
|  34|            32|  Tom|
|null|            40|Linda|
+----+--------------+-----+
We want to fill the age column with the value from the best_guess_age column whenever age is null.
The fillna command requires an actual value to replace the nulls; we can't simply pass in a column.
How can this be done?
In PySpark, DataFrame.fillna() or DataFrameNaFunctions.fill() is used to replace NULL/None values in all or selected DataFrame columns with zero (0), an empty string, a space, or any other constant literal value.
fillna() finds the null values in a PySpark DataFrame and replaces them with the value passed as an argument. DataFrame.fillna() and DataFrameNaFunctions.fill() (reached through df.na.fill()) are aliases of each other.
One way to swap two pairs of columns is to copy [o, o_type] into temporary columns [o_temp, o_type_temp], then copy the values of [s, s_type] into [o, o_type], and finally copy [o_temp, o_type_temp] into [s, s_type].
You can replace column values of a PySpark DataFrame by using the SQL string functions regexp_replace(), translate(), and overlay().
You can use the coalesce function. coalesce('age', 'best_guess_age') takes the value from the age column if it is not null, and otherwise from the best_guess_age column:
from pyspark.sql.functions import coalesce
df.withColumn('age', coalesce('age', 'best_guess_age')).show()
+---+--------------+-----+
|age|best_guess_age| name|
+---+--------------+-----+
| 23|            23|Alice|
| 18|            18|  Bob|
| 34|            32|  Tom|
| 40|            40|Linda|
+---+--------------+-----+