When I work with DataFrames in Spark, I sometimes have to edit only the values of a particular column in that DataFrame. For example, if I have a `count` field in my DataFrame and I would like to add 1 to every value of `count`, then I could either write a custom udf to get the job done using the `withColumn` method of DataFrames, or I could do a `map` on the DataFrame and then extract another DataFrame from the resultant RDD.
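For concreteness, here is a minimal sketch of both approaches, assuming a DataFrame `df` with an integer `count` column and a `SQLContext` in scope as `sqlContext` (both names are assumptions for illustration):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Approach 1: a custom udf applied through withColumn
val addOne = udf((c: Int) => c + 1)
val viaUdf = df.withColumn("count", addOne(df("count")))

// Approach 2: map over the underlying RDD, rebuild each Row,
// then reapply the original schema to get a DataFrame back
val countIdx = df.schema.fieldIndex("count")
val viaMap = sqlContext.createDataFrame(
  df.rdd.map { row =>
    Row.fromSeq(row.toSeq.updated(countIdx, row.getInt(countIdx) + 1))
  },
  df.schema
)
```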
What I would like to know is how a udf actually works under the hood. Could you compare using a `map` versus a `udf` in this case? What is the performance difference?
Thanks!
Simply, `map` is more flexible than `udf`. With `map`, there is no restriction on the number of columns you can manipulate within a row. Say you want to derive the values for 5 columns of the data and delete 3 columns. You would need to do `withColumn`/`udf` 5 times, then a `select`. With 1 `map` function, you could do all of this, as in the sketch below.
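To illustrate, here is a rough sketch under the same assumptions as above (hypothetical column names `a` and `b`; the dropped columns simply never appear in the output Rows), deriving 5 values in a single pass:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Target schema: only the 5 derived columns survive.
val outSchema = StructType(Seq(
  StructField("d1", IntegerType),
  StructField("d2", IntegerType),
  StructField("d3", IntegerType),
  StructField("d4", IntegerType),
  StructField("d5", IntegerType)
))

val transformed = sqlContext.createDataFrame(
  df.rdd.map { row =>
    val a = row.getInt(row.fieldIndex("a"))
    val b = row.getInt(row.fieldIndex("b"))
    // One map computes all five derived values; the equivalent
    // withColumn/udf version would need five separate calls plus a select.
    Row(a + b, a - b, a * b, a + 1, b + 1)
  },
  outSchema
)
```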