
How to replace null values with a specific value in Dataframe using spark in Java?

I am trying to improve the accuracy of a logistic regression model implemented in Spark using Java. To do this, I am trying to replace null or invalid values in a column with the most frequent value of that column. For example:

Name|Place
a   |a1
a   |a2
a   |a2
    |d1
b   |a2
c   |a2
c   |
    |
d   |c1

In this case I would replace all the null values in column "Name" with 'a' and in column "Place" with 'a2'. So far I am only able to extract the most frequent value in a particular column. Can you please help me with the second step: how to replace the null or invalid values with the most frequent value of that column?

PirateJack asked Jun 21 '17 09:06



1 Answer

You can use the .na.fill function (it is defined in org.apache.spark.sql.DataFrameNaFunctions).

Basically the function you need is: def fill(value: String, cols: Seq[String]): DataFrame

You pass the replacement value and the list of columns in which null (or NaN) values should be replaced.

In your case it will be something like:

    val df2 = df.na.fill("a", Seq("Name"))
                .na.fill("a2", Seq("Place"))
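
The snippet above is Scala, while the question asks about Java. From Java, the same call can be written as df.na().fill("a", new String[]{"Name"}), using the overload of fill that takes a String array. The sketch below is not Spark code: it is a plain-Java illustration of the two steps the question describes (find the most frequent value, then substitute it for missing entries), run on in-memory lists that mirror the example table. The class ModeFill and its methods are hypothetical names introduced here, and treating blank strings as "invalid" is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Spark: shows the impute-with-mode logic only.
final class ModeFill {

    // Assumption: null and blank strings both count as missing/invalid.
    static boolean isMissing(String v) {
        return v == null || v.trim().isEmpty();
    }

    // Returns the most frequent non-missing value, or null if all values are missing.
    // Ties go to the value seen first, because LinkedHashMap keeps insertion order.
    static String mostFrequent(List<String> column) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String v : column) {
            if (!isMissing(v)) {
                counts.merge(v, 1, Integer::sum);
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Returns a copy of the column with every missing entry replaced.
    static List<String> fillMissing(List<String> column, String replacement) {
        List<String> out = new ArrayList<>(column.size());
        for (String v : column) {
            out.add(isMissing(v) ? replacement : v);
        }
        return out;
    }

    public static void main(String[] args) {
        // Columns from the example table; null marks an empty cell.
        List<String> name = Arrays.asList("a", "a", "a", null, "b", "c", "c", null, "d");
        List<String> place = Arrays.asList("a1", "a2", "a2", "d1", "a2", "a2", null, null, "c1");

        System.out.println(fillMissing(name, mostFrequent(name)));
        System.out.println(fillMissing(place, mostFrequent(place)));
    }
}
```

In Spark itself, the value returned by a step like mostFrequent would typically come from a groupBy/count on the column, and the result would then be passed to na().fill for that column.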
Rami answered Sep 25 '22 06:09