Spark fillNa not replacing the null value

I have the following dataset and it contains some null values. I need to replace the nulls using fillna in Spark.

DataFrame:

df = spark.read.format("com.databricks.spark.csv").option("header","true").load("/sample.csv")

>>> df.printSchema();
root
 |-- Age: string (nullable = true)
 |-- Height: string (nullable = true)
 |-- Name: string (nullable = true)

>>> df.show()
+---+------+-----+
|Age|Height| Name|
+---+------+-----+
| 10|    80|Alice|
|  5|  null|  Bob|
| 50|  null|  Tom|
| 50|  null| null|
+---+------+-----+

>>> df.na.fill(10).show()

When I pass the fill value, nothing changes; the same DataFrame appears again.

+---+------+-----+
|Age|Height| Name|
+---+------+-----+
| 10|    80|Alice|
|  5|  null|  Bob|
| 50|  null|  Tom|
| 50|  null| null|
+---+------+-----+

I also tried creating a new DataFrame and storing the filled result in it, but the output is still unchanged.

>>> df2 = df.na.fill(10)

How can I replace the null values? Please show me the possible ways using fillna. Thanks in advance.

asked Nov 03 '16 by Churchill vins

People also ask

How do I change the null value in Spark?

The replacement of null values in PySpark DataFrames is one of the most common operations undertaken. This can be achieved by using either the DataFrame.fillna() or DataFrameNaFunctions.fill() methods.
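
For example, here is a minimal sketch (assuming a local SparkSession and a DataFrame whose columns were all read as strings, as in the question) showing that the two calls behave the same:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data mirroring the question's schema (every column is a string)
df = spark.createDataFrame(
    [("10", "80", "Alice"), ("5", None, "Bob"), ("50", None, "Tom"), ("50", None, None)],
    ["Age", "Height", "Name"],
)

# DataFrame.fillna() and DataFrameNaFunctions.fill() are equivalent:
# both replace nulls in every string column with "10"
df.fillna("10").show()
df.na.fill("10").show()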

How do you replace nulls in PySpark?

In PySpark, DataFrame.fillna() or DataFrameNaFunctions.fill() is used to replace NULL/None values on all or selected DataFrame columns with zero (0), an empty string, a space, or any constant literal value.
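
As a rough illustration (reusing the hypothetical df built in the sketch above), the subset argument restricts the fill to selected columns, and a dict gives each column its own constant:

# fill only the Name column, leaving Height's nulls untouched
df.na.fill("unknown", subset=["Name"]).show()

# or a different constant per column
df.fillna({"Height": "0", "Name": "unknown"}).show()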

How does Spark ignore null values?

In order to remove rows with NULL values on selected columns of a Spark DataFrame, use drop(columns: Seq[String]) or drop(columns: Array[String]). Pass to these functions the names of the columns you want to check for NULL values, as in the PySpark sketch below.
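
The signatures above are the Scala ones; a rough PySpark equivalent (again using the hypothetical df from the first sketch) relies on the subset argument of na.drop()/dropna():

# drop rows whose Height is null
df.na.drop(subset=["Height"]).show()

# dropna() is the DataFrame-level alias with the same behaviour
df.dropna(subset=["Height", "Name"]).show()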


2 Answers

It seems that your Height column is not numeric. When you call df.na.fill(10), Spark replaces nulls only in columns whose type matches that of 10, i.e. numeric columns.

If the Height column needs to stay a string, you can try df.na.fill('10').show(); otherwise casting it to IntegerType() is necessary.
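
A minimal sketch of both options, assuming df is the DataFrame from the question (df_numeric is just an illustrative name):

from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

# option 1: cast Height to an integer column, after which the numeric fill(10) applies to it
df_numeric = df.withColumn("Height", col("Height").cast(IntegerType()))
df_numeric.na.fill(10).show()

# option 2: keep Height as a string and fill it with the string literal '10'
df.na.fill("10", subset=["Height"]).show()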

answered Oct 18 '22 by Mariusz

You can also provide a specific default value for each column if you prefer.

df.na.fill({'Height': '10', 'Name': 'Bob'})
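
With the sample data above this would leave Age untouched, fill the missing Heights with the string '10', and replace the missing Name with 'Bob'. Note that each dict value has to match the column's type, which is why '10' is passed as a string here (Height was read as a string); if you cast Height to IntegerType() first, you would pass 10 instead.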
answered Oct 18 '22 by beljul