Spark fillNa not replacing the null value

I have the following dataset and it contains some null values. I need to replace the nulls using fillna in Spark.

DataFrame:

df = spark.read.format("com.databricks.spark.csv").option("header","true").load("/sample.csv")

>>> df.printSchema();
root
 |-- Age: string (nullable = true)
 |-- Height: string (nullable = true)
 |-- Name: string (nullable = true)

>>> df.show()
+---+------+-----+
|Age|Height| Name|
+---+------+-----+
| 10|    80|Alice|
|  5|  null|  Bob|
| 50|  null|  Tom|
| 50|  null| null|
+---+------+-----+

>>> df.na.fill(10).show()

When I pass the fill value, nothing changes; the same DataFrame appears again.

+---+------+-----+
|Age|Height| Name|
+---+------+-----+
| 10|    80|Alice|
|  5|  null|  Bob|
| 50|  null|  Tom|
| 50|  null| null|
+---+------+-----+

I also tried creating a new DataFrame and storing the filled result in it, but the output is still unchanged.

>>> df2 = df.na.fill(10)

How can I replace the null values? Please show me the possible ways using fillna. Thanks in advance.

asked Nov 03 '16 by Churchill vins

People also ask

How do I change the null value in Spark?

The replacement of null values in PySpark DataFrames is one of the most common operations undertaken. This can be achieved by using either the DataFrame.fillna() or DataFrameNaFunctions.fill() methods.
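
For example, here is a minimal sketch (assuming a local SparkSession and a DataFrame whose columns were all read as strings, as in the question) showing that the two calls behave the same:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data mirroring the question's schema (every column is a string)
df = spark.createDataFrame(
    [("10", "80", "Alice"), ("5", None, "Bob"), ("50", None, "Tom"), ("50", None, None)],
    ["Age", "Height", "Name"],
)

# DataFrame.fillna() and DataFrameNaFunctions.fill() are equivalent:
# both replace nulls in every string column with "10"
df.fillna("10").show()
df.na.fill("10").show()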

How do you replace nulls in PySpark?

In PySpark, DataFrame.fillna() or DataFrameNaFunctions.fill() is used to replace NULL/None values on all or selected DataFrame columns with zero (0), an empty string, a space, or any constant literal value.
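
As a rough illustration (reusing the hypothetical df built in the sketch above), the subset argument restricts the fill to selected columns, and a dict gives each column its own constant:

# fill only the Name column, leaving Height's nulls untouched
df.na.fill("unknown", subset=["Name"]).show()

# or a different constant per column
df.fillna({"Height": "0", "Name": "unknown"}).show()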

How does Spark ignore null values?

In order to remove rows with NULL values on selected columns of a Spark DataFrame, use drop(columns: Seq[String]) or drop(columns: Array[String]). Pass to these functions the names of the columns you want to check for NULL values, as in the PySpark sketch below.
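
The signatures above are the Scala ones; a rough PySpark equivalent (again using the hypothetical df from the first sketch) relies on the subset argument of na.drop()/dropna():

# drop rows whose Height is null
df.na.drop(subset=["Height"]).show()

# dropna() is the DataFrame-level alias with the same behaviour
df.dropna(subset=["Height", "Name"]).show()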


2 Answers

It seems that your Height column is not numeric. When you call df.na.fill(10), Spark replaces nulls only in columns whose type matches that of 10, i.e. numeric columns.

If the Height column needs to stay a string, you can try df.na.fill('10').show(); otherwise casting it to IntegerType() is necessary.
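
A minimal sketch of both options, assuming df is the DataFrame from the question (df_numeric is just an illustrative name):

from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

# option 1: cast Height to an integer column, after which the numeric fill(10) applies to it
df_numeric = df.withColumn("Height", col("Height").cast(IntegerType()))
df_numeric.na.fill(10).show()

# option 2: keep Height as a string and fill it with the string literal '10'
df.na.fill("10", subset=["Height"]).show()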

answered Oct 18 '22 by Mariusz

You can also provide a specific default value for each column if you prefer.

df.na.fill({'Height': '10', 'Name': 'Bob'})
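
With the sample data above this would leave Age untouched, fill the missing Heights with the string '10', and replace the missing Name with 'Bob'. Note that each dict value has to match the column's type, which is why '10' is passed as a string here (Height was read as a string); if you cast Height to IntegerType() first, you would pass 10 instead.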
answered Oct 18 '22 by beljul