PySpark: How to fillna values in dataframe for specific columns?

Tags:

apache-spark

pyspark

spark-dataframe

I have the following sample DataFrame:

a    | b    | c
1    | 2    | 4
0    | null | null
null | 3    | 4

And I want to replace null values only in the first 2 columns - Column "a" and "b":

a    | b    | c
1    | 2    | 4
0    | 0    | null
0    | 3    | 4

Here is the code to create sample dataframe:

rdd = sc.parallelize([(1, 2, 4), (0, None, None), (None, 3, 4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])

I know how to replace all null values using:

df2 = df2.fillna(0)

And when I try this, I lose the third column:

df2 = df2.select(df2.columns[0:1]).fillna(0)

asked Jul 12 '17 19:07

Rakesh Adhikesavan

2 Answers

df.fillna(0, subset=['a', 'b'])

fillna takes a subset parameter that limits the fill to the listed columns. It is available as long as your Spark version is not lower than 1.3.1.

answered Oct 07 '22 07:10

Zhang Tong

Use a dictionary to fill values of certain columns:

df.fillna({'a': 0, 'b': 0})

answered Oct 07 '22 08:10

scottlittle
