 

spark sql - whether to use row transformation or UDF


I have an input table (I) with 100 columns and 10 million records. I want to produce an output table (O) with 50 columns, each derived from columns of I, i.e. there are 50 functions that map column(s) of I to the 50 columns of O: o1 = f(i1), o2 = f(i2, i3), ..., o50 = f(i50, i60, i70).

In Spark SQL I can do this in two ways:

  1. Row transformation, where each entire row of I is parsed (e.g. with a map function) to produce a row of O.
  2. UDFs, which I guess work at column level, i.e. take existing column(s) of I as input and produce the corresponding column of O, so 50 UDFs in total.

I want to know which of the above two is more efficient (better distributed and parallel processing) and why, or whether they are equally fast/performant, given that I am processing the entire input table I and producing an entirely new output table O, i.e. it is bulk data processing.
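For concreteness, here is a minimal Scala sketch of the two approaches on a toy three-column table (the column names and functions are made up; my real table has 100 columns and 50 derived outputs):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.appName("I-to-O").getOrCreate()
import spark.implicits._

// Toy stand-in for table I (the real one has 100 columns).
val I = Seq((1, 2, 3), (4, 5, 6)).toDF("i1", "i2", "i3")

// Option 1: row transformation -- take the whole row and emit a new one.
val O1 = I.map { row =>
  (row.getInt(0) * 2,                 // o1 = f(i1)
   row.getInt(1) + row.getInt(2))     // o2 = f(i2, i3)
}.toDF("o1", "o2")

// Option 2: one UDF per derived column, applied column-wise.
val f1 = udf((i1: Int) => i1 * 2)
val f2 = udf((i2: Int, i3: Int) => i2 + i3)
val O2 = I.select(f1($"i1").as("o1"), f2($"i2", $"i3").as("o2"))
```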

sunillp asked Apr 14 '17 12:04


People also ask

Why we should not use UDF in Spark?

It is well known that the use of UDFs (User Defined Functions) in Apache Spark, especially when using the Python API, can compromise application performance. For this reason, at Damavis we try to avoid their use as much as possible in favour of native functions or SQL.

Is Spark UDF faster?

In these circumstances, a PySpark UDF is around 10 times more performant than a PySpark Pandas UDF. We have also found that creating a Python wrapper to call a Scala UDF from PySpark code is around 15 times more performant than either type of PySpark UDF.

What is difference between UDF and UDAF in Spark SQL?

Spark SQL supports integration of Hive UDFs, UDAFs and UDTFs. Similar to Spark UDFs and UDAFs, Hive UDFs work on a single row as input and generate a single row as output, while Hive UDAFs operate on multiple rows and return a single aggregated row as a result.

Why do we need UDF?

Why do we need a Spark UDF? UDFs are used to extend the functions of the framework and to re-use the same function on several DataFrames.


2 Answers

I was going to write this whole thing about the Catalyst optimizer, but it is simpler just to note what Jacek Laskowski says in his book Mastering Apache Spark 2:

"Use the higher-level standard Column-based functions with Dataset operators whenever possible before reverting to using your own custom UDF functions since UDFs are a blackbox for Spark and so it does not even try to optimize them."

Jacek also notes a comment from someone on the Spark development team:

"There are simple cases in which we can analyze the UDFs byte code and infer what it is doing, but it is pretty difficult to do in general."

This is why Spark UDFs should never be your first option.

That same sentiment is echoed in this Cloudera post, where the author states "...using Apache Spark’s built-in SQL query functions will often lead to the best performance and should be the first approach considered whenever introducing a UDF can be avoided."

However, the author also correctly notes that this may change in the future as Spark gets smarter, and in the meantime, you can use Expression.genCode, as described in Chris Fregly’s talk, if you don't mind tightly coupling to the Catalyst optimizer.
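As a rough illustration (toy columns and functions of my own, not from the question), you can compare the plans Spark produces for built-in Column functions versus equivalent UDFs:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper, concat_ws, udf}

val spark = SparkSession.builder.appName("builtin-vs-udf").getOrCreate()
import spark.implicits._

// Hypothetical string-typed input.
val I = Seq(("ab", "c", "d"), ("ef", "g", "h")).toDF("i1", "i2", "i3")

// Built-in Column functions: Catalyst can see inside these expressions,
// so they participate in plan optimization and whole-stage codegen.
val withBuiltins = I.select(
  upper(col("i1")).as("o1"),
  concat_ws("-", col("i2"), col("i3")).as("o2"))

// The same logic as UDFs: a black box Catalyst will not look into.
val upperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)
val joinUdf  = udf((a: String, b: String) => s"$a-$b")
val withUdfs = I.select(
  upperUdf(col("i1")).as("o1"),
  joinUdf(col("i2"), col("i3")).as("o2"))

// Compare the physical plans to see what the optimizer did with each.
withBuiltins.explain()
withUdfs.explain()
```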

Vidya answered Oct 22 '22 12:10


User-defined functions (custom functions) can be defined and registered as UDFs in Spark SQL with an associated alias that is made available to SQL queries.
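For example, a minimal sketch (the function name and table here are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("register-udf").getOrCreate()
import spark.implicits._

// Toy table exposed to SQL as a temporary view.
Seq(1, 2, 3).toDF("i1").createOrReplaceTempView("I")

// Register a custom function under an alias; the alias is then
// usable directly inside SQL query text.
spark.udf.register("double_it", (x: Int) => x * 2)

spark.sql("SELECT double_it(i1) AS o1 FROM I").show()
```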

UDFs have a major performance impact in Apache Spark SQL because they are opaque to Spark SQL's Catalyst optimizer: Spark has no rules for optimizing what happens inside a UDF, so the developer has to rely on his/her own due diligence.

Never use Python UDFs if you can avoid it. It is impossible to compensate for the cost of repeated serialization, deserialization and data movement between the Python interpreter and the JVM: Python UDFs result in data being serialized between the executor JVM and the Python interpreter running the UDF logic, which significantly reduces performance compared to UDF implementations in Java or Scala.

Java and Scala UDF implementations run directly in the executor JVM, so Java/Scala UDF performance is better than Python UDF performance.

Spark SQL's built-in functions operate directly on the JVM and are optimized by both Catalyst and Tungsten. This means they can be optimized in the execution plan and most of the time benefit from codegen and other Tungsten optimizations. Moreover, they can operate on data in its "native" internal representation, since Spark SQL works with the Catalyst query optimizer, whose capabilities are expanding with every release and can often provide dramatic performance improvements to Spark SQL queries.
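As a rough sketch of the difference (a toy example, not a benchmark): a predicate written with built-in expressions is something Catalyst can reason about, while the same predicate hidden inside a UDF is opaque to it, and explain() shows the difference in the plans:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder.appName("catalyst-vs-udf").getOrCreate()

val df = spark.range(0, 1000000).toDF("id")

// Built-in expression: Catalyst understands the predicate and can fold it,
// combine it with other filters, or (for file sources) push it down.
df.filter(col("id") > 100).explain()

// Same predicate inside a UDF: Catalyst only sees an opaque function call.
val gt100 = udf((id: Long) => id > 100)
df.filter(gt100(col("id"))).explain()
```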

Conclusion: UDF implementation code may not be well understood by Catalyst, so using Apache Spark’s built-in SQL query functions will often lead to the best performance and should be the first approach considered whenever introducing a UDF can be avoided.

vaquar khan answered Oct 22 '22 13:10