Spark 1.6.1, Scala api.
For a DataFrame, I need to replace all null values in a certain column with 0. I see two ways to do this:
1.
myDF.withColumn("pipConfidence", when($"mycol".isNull, 0).otherwise($"mycol"))
2.
myDF.na.fill(0, Seq("mycol"))
Are they essentially the same, or is one of them preferred?
Thank you!
They are not the same, but performance should be similar. na.fill uses coalesce, but it replaces both NaN and NULL, not only NULL.
import org.apache.spark.sql.functions.{lit, when}
import sqlContext.implicits._

// Build a test column: x = 0 -> 0.0, x = 1 -> NULL, x = 2 -> NaN
val y = when($"x" === 0, $"x".cast("double")).when($"x" === 1, lit(null)).otherwise(lit("NaN").cast("double"))
val df = sqlContext.range(0, 3).toDF("x").withColumn("y", y)

df.withColumn("y", when($"y".isNull, 0.0).otherwise($"y")).show() // only NULL replaced; NaN survives
df.na.fill(0.0, Seq("y")).show() // both NULL and NaN replaced with 0.0
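If you want the when/otherwise version to behave exactly like na.fill, you can test for NaN explicitly with Column.isNaN. A minimal self-contained sketch (assuming Spark 1.6's SQLContext and a local master; the column setup mirrors the example above):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{lit, when}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("nafill-demo"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Same test column as above: x = 0 -> 0.0, x = 1 -> NULL, x = 2 -> NaN
val y = when($"x" === 0, $"x".cast("double")).when($"x" === 1, lit(null)).otherwise(lit("NaN").cast("double"))
val df = sqlContext.range(0, 3).toDF("x").withColumn("y", y)

// Checking isNaN alongside isNull makes when/otherwise match na.fill
val fixed = df.withColumn("y", when($"y".isNull || $"y".isNaN, 0.0).otherwise($"y"))
fixed.show()
```

With both predicates, the NULL row and the NaN row are replaced, so the result agrees with df.na.fill(0.0, Seq("y")).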