I am trying to improve the accuracy of a logistic regression algorithm implemented in Spark using Java. To do this, I am trying to replace the null or invalid values present in a column with the most frequent value of that column. For example:
Name | Place
-----|------
a    | a1
a    | a2
a    | a2
     | d1
b    | a2
c    | a2
c    |
     |
d    | c1
In this case I want to replace all the NULL values in column "Name" with 'a' and in column "Place" with 'a2'. So far I am only able to extract the most frequent value of a particular column. Can you please help me with the second step: how to replace the null or invalid values with the most frequent value of that column?
Replacing null values is one of the most common DataFrame operations in Spark. In PySpark it is done with DataFrame.fillna() or DataFrameNaFunctions.fill(), which replace NULL/None values in all or selected columns with a constant such as zero, an empty string, or any other literal value. Invalid (non-null) values can additionally be rewritten with the SQL string functions regexp_replace(), translate(), and overlay().
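Since the question mentions invalid values as well as nulls, one option is to normalize the invalid entries to real nulls first, so a single fill pass can handle both. Below is a minimal Scala sketch of that idea; it assumes a DataFrame named df holding the example data above, and it assumes "invalid" means an empty or whitespace-only string (as shown below, this is not the only possible definition):

    import org.apache.spark.sql.functions.{col, lit, trim, when}

    // Sketch under the stated assumptions: convert "invalid" entries
    // (here, empty or blank strings) into real nulls, so that a later
    // na.fill pass replaces them along with the pre-existing nulls.
    val normalized = df.columns.foldLeft(df) { (d, c) =>
      d.withColumn(c, when(trim(col(c)) === "", lit(null)).otherwise(col(c)))
    }

If "invalid" means something else in your data (for example a sentinel string), swap the when() condition accordingly.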
You can use the .na.fill function (it lives in org.apache.spark.sql.DataFrameNaFunctions).
Basically the function you need is: def fill(value: String, cols: Seq[String]): DataFrame
You choose the columns, and you choose the value you want to use to replace the null or NaN entries.
In your case it will be something like:
val df2 = df.na.fill("a", Seq("Name"))
            .na.fill("a2", Seq("Place"))
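If you would rather not hardcode "a" and "a2", here is a minimal sketch that puts both steps together. It assumes the DataFrame df with the string-typed "Name" and "Place" columns from the question; it computes each column's most frequent non-null value (the mode) and feeds it to na.fill:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, count, desc}

    // Sketch: for each column, find its most frequent non-null value
    // and replace that column's nulls with it.
    def fillWithMode(df: DataFrame, cols: Seq[String]): DataFrame =
      cols.foldLeft(df) { (d, c) =>
        val mode = d.filter(col(c).isNotNull)
          .groupBy(c)
          .agg(count(col(c)).as("cnt"))
          .orderBy(desc("cnt"))
          .first()
          .getString(0)           // the group key, i.e. the mode of column c
        d.na.fill(mode, Seq(c))   // step 2: replace nulls in c with its mode
      }

    val df2 = fillWithMode(df, Seq("Name", "Place"))

Note this runs one aggregation per column, which is fine for a few columns; for very wide tables you would want to compute all the modes in a single pass.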