In pyspark, is it possible to fillna with another column?

Tags:

apache-spark

pyspark

Let's say there is a DataFrame that looks like this:

+----+--------------+-----+
| age|best_guess_age| name|
+----+--------------+-----+
|  23|            23|Alice|
|null|            18|  Bob|
|  34|            32|  Tom|
|null|            40|Linda|
+----+--------------+-----+

We want to fill the age column with the value from the best_guess_age column wherever age is null.

The fillna method requires a literal value to replace the nulls; we can't simply pass in a column.

How can this be done?

asked Aug 21 '18 15:08

foobar

1 Answer

You can use the coalesce function. coalesce('age', 'best_guess_age') takes the value from the age column when it is not null, and otherwise from the best_guess_age column:

from pyspark.sql.functions import coalesce
df.withColumn('age', coalesce('age', 'best_guess_age')).show()
+---+--------------+-----+
|age|best_guess_age| name|
+---+--------------+-----+
| 23|            23|Alice|
| 18|            18|  Bob|
| 34|            32|  Tom|
| 40|            40|Linda|
+---+--------------+-----+

answered Nov 15 '22 04:11

Psidom
