I have an input table (I) with 100 columns and 10 million records. I want to produce an output table (O) with 50 columns, each derived from columns of I, i.e. there are 50 functions that map column(s) of I to the 50 columns of O, e.g. o1 = f1(i1), o2 = f2(i2, i3), ..., o50 = f50(i50, i60, i70).
In Spark SQL I can do this in two ways: by writing 50 UDFs and applying them to I, or by expressing the 50 mappings directly with built-in Spark SQL functions/expressions.
I want to know which of the above two is more efficient (more distributed and parallel processing) and why, or whether they are equally fast/performant, given that I am processing the entire input table I and producing an entirely new output table O, i.e. it is bulk data processing.
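For illustration, here is a minimal sketch of the two approaches for a couple of the 50 mappings; the column names i1, i2, i3 and the doubling/addition logic are placeholders, not the real functions:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.table("I")  # the 100-column, 10-million-row input table

# Option 1: wrap each mapping in a UDF (opaque to the optimizer)
f1 = F.udf(lambda i1: i1 * 2.0, DoubleType())
f2 = F.udf(lambda i2, i3: i2 + i3, DoubleType())
out_udf = df.select(f1("i1").alias("o1"), f2("i2", "i3").alias("o2"))

# Option 2: write the same mappings as built-in column expressions
out_native = df.select(
    (F.col("i1") * 2.0).alias("o1"),
    (F.col("i2") + F.col("i3")).alias("o2"),
)
```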
It is well known that the use of UDFs (User Defined Functions) in Apache Spark, especially with the Python API, can compromise application performance. For this reason, at Damavis we try to avoid them as much as possible in favour of native functions or SQL.
In these circumstances, the PySpark UDF is around 10 times more performant than the PySpark Pandas UDF. We have also found that creating a Python wrapper to call a Scala UDF from PySpark code is around 15 times more performant than either type of PySpark UDF.
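For reference, a minimal sketch of the two PySpark UDF flavours being compared; the column name i1 and the doubling logic are placeholders, and the timing figures above come from the benchmark, not from this snippet:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.table("I")

# Plain PySpark UDF: one Python call per row
plain_udf = F.udf(lambda x: x * 2.0, DoubleType())

# PySpark Pandas (vectorised) UDF: one Python call per Arrow batch (needs pyarrow)
@F.pandas_udf(DoubleType())
def pandas_double(x: pd.Series) -> pd.Series:
    return x * 2.0

result = df.select(
    plain_udf(F.col("i1")).alias("plain"),
    pandas_double(F.col("i1")).alias("vectorised"),
)
```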
Spark SQL supports integration of Hive UDFs, UDAFs and UDTFs. Similar to Spark UDFs and UDAFs, Hive UDFs work on a single row as input and generate a single row as output, while Hive UDAFs operate on multiple rows and return a single aggregated row as a result.
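As a hedged sketch of that integration (the class name and jar path are hypothetical, and a Hive-enabled SparkSession is assumed), a Hive UDF can be registered and then called from SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Register a Hive UDF packaged in a jar (class name and path are examples only)
spark.sql("""
    CREATE TEMPORARY FUNCTION my_hive_upper
    AS 'com.example.hive.UpperUDF'
    USING JAR 'hdfs:///jars/hive-udfs.jar'
""")

# Use it like any other SQL function
spark.sql("SELECT my_hive_upper(i1) AS o1 FROM I").show()
```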
Why do we need a Spark UDF? UDFs are used to extend the functionality of the framework and to reuse the same function across several DataFrames.
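A small sketch of that reuse, assuming illustrative table and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def clean_name(s):
    # Normalise a free-text name; None-safe
    return s.strip().title() if s is not None else None

# Register once; the returned handle works in the DataFrame API,
# and the alias "clean_name" works in SQL
clean_name_udf = spark.udf.register("clean_name", clean_name, StringType())

df1 = spark.table("customers").withColumn("name", clean_name_udf("raw_name"))
df2 = spark.table("suppliers").withColumn("name", clean_name_udf("raw_name"))
df3 = spark.sql("SELECT clean_name(raw_name) AS name FROM customers")
```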
I was going to write this whole thing about the Catalyst optimizer, but it is simpler just to note what Jacek Laskowski says in his book Mastering Apache Spark 2:
"Use the higher-level standard Column-based functions with Dataset operators whenever possible before reverting to using your own custom UDF functions since UDFs are a blackbox for Spark and so it does not even try to optimize them."
Jacek also notes a comment from someone on the Spark development team:
"There are simple cases in which we can analyze the UDFs byte code and infer what it is doing, but it is pretty difficult to do in general."
This is why Spark UDFs should never be your first option.
That same sentiment is echoed in this Cloudera post, where the author states "...using Apache Spark’s built-in SQL query functions will often lead to the best performance and should be the first approach considered whenever introducing a UDF can be avoided."
However, the author also correctly notes that this may change in the future as Spark gets smarter, and in the meantime you can use Expression.genCode, as described in Chris Fregly’s talk, if you don't mind tightly coupling to the Catalyst optimizer.
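To make the "built-in first" advice concrete, here is a hedged sketch of the same derived column written both ways; the column name and threshold are illustrative:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.table("I")

# UDF version: a black box that Catalyst cannot look inside
label_udf = F.udf(
    lambda x: "high" if x is not None and x > 100 else "low", StringType()
)
with_udf = df.withColumn("o1", label_udf("i1"))

# Built-in version: when/otherwise is fully visible to the optimizer
with_native = df.withColumn(
    "o1", F.when(F.col("i1") > 100, F.lit("high")).otherwise("low")
)
```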
User-defined functions (custom functions) can be defined and registered as UDFs in Spark SQL with an associated alias that is made available to SQL queries.
UDFs can have a major performance impact on Apache Spark SQL because they are opaque to Spark SQL’s Catalyst Optimizer: Spark has no optimization rules for arbitrary UDF code, so developers must apply their own due diligence.
Python UDFs: never use them if you can avoid it. It is impossible to compensate for the cost of repeated serialization, deserialization and data movement between the Python interpreter and the JVM. Python UDFs result in data being serialized between the executor JVM and the Python interpreter running the UDF logic, which significantly reduces performance compared to UDF implementations in Java or Scala.
Java and Scala UDF implementations are accessible directly by the executor JVM, so Java/Scala UDF performance is better than Python UDF performance.
Spark SQL functions operate directly on the JVM and are optimized by both Catalyst and Tungsten. This means they can be optimized in the execution plan and most of the time can benefit from codegen and other Tungsten optimizations. Moreover, they can operate on data in its "native" internal representation, since Spark SQL works together with the Catalyst query optimizer. Its capabilities are expanding with every release and can often provide dramatic performance improvements to Spark SQL queries.
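One way to see this in practice is to compare the physical plans; the exact operator names (e.g. a Python-evaluation step for a Python UDF, whole-stage code generation for native expressions) can vary between Spark versions:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.table("I")

py_udf = F.udf(lambda x: x * 2.0, DoubleType())

# Python UDF: the plan contains a separate step that ships rows to a Python worker
df.select(py_udf("i1").alias("o1")).explain()

# Native expression: the projection is folded into generated JVM code
df.select((F.col("i1") * 2.0).alias("o1")).explain()
```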
Conclusion: UDF implementation code may not be well understood by Catalyst, so using Apache Spark’s built-in SQL query functions will often lead to the best performance and should be the first approach considered whenever introducing a UDF can be avoided.