 

Spark "replacing null with 0" performance comparison

Spark 1.6.1, Scala API.

For a DataFrame, I need to replace all null values in a certain column with 0. I have two ways to do this:

1.

myDF.withColumn("mycol", when($"mycol".isNull, 0).otherwise($"mycol"))

2.

myDF.na.fill(0, Seq("mycol"))

Are they essentially the same, or is one way preferred?

Thank you!

user2628641 asked Oct 25 '16 18:10



1 Answer

They are not the same, but performance should be similar. na.fill uses coalesce internally, but for float and double columns it replaces NaN values as well as NULLs, not only NULLs.

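import org.apache.spark.sql.functions.{lit, when}
import spark.implicits._

// y is 0.0 for x = 0, null for x = 1, and NaN for x = 2
val y = when($"x" === 0, $"x".cast("double")).when($"x" === 1, lit(null)).otherwise(lit("NaN").cast("double"))
val df = spark.range(0, 3).toDF("x").withColumn("y", y)

df.withColumn("y", when($"y".isNull, 0.0).otherwise($"y")).show()  // replaces only the null; NaN is kept
df.na.fill(0.0, Seq("y")).show()  // replaces both the null and the NaN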
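To make the difference explicit, here is a rough sketch of what the two approaches amount to when written out by hand with coalesce and nanvl. It assumes Spark 2.x or later (SparkSession; on 1.6 you would go through sqlContext instead), and the names filledLikeNaFill and filledNullOnly are made up for illustration. It approximates what na.fill does for a double column and is not a copy of Spark's internals.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, lit, nanvl}

// Local session for illustration only; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("na-fill-sketch").getOrCreate()
import spark.implicits._

// Three rows: y = 0.0, y = null, y = NaN
val df = Seq[(Long, Option[Double])](
  (0L, Some(0.0)),
  (1L, None),
  (2L, Some(Double.NaN))
).toDF("x", "y")

// Approximation of na.fill(0.0, Seq("y")) on a double column:
// nanvl turns NaN into null first, then coalesce replaces null with the fill value.
val filledLikeNaFill = df.withColumn("y", coalesce(nanvl($"y", lit(null)), lit(0.0)))

// Null-only replacement, same effect as the when(...isNull...).otherwise(...) form.
val filledNullOnly = df.withColumn("y", coalesce($"y", lit(0.0)))
```

With these three rows, filledLikeNaFill has 0.0 in every row, while filledNullOnly still holds NaN in the row where x = 2. If you want to check the performance claim yourself, calling explain() on each result lets you compare the plans; both are a single projection over the input.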
5 revs, 3 users 57% user6022341 answered Oct 25 '22 20:10
