I'd like to use a specific UDF with Spark. Here's the plan: I have a table A (10 million rows) and a table B (15 million rows), and I'd like to use a UDF to compare one element of table A with one element of table B. Is it possible?
Here's a sample of my code. At some point I also need to specify that the result of my compare UDF must be greater than 0.9:
DataFrame dfr = df
    .select("name", "firstname", "adress1", "city1", "compare(adress1,adress2)")
    .join(dfa, df.col("adress1").equalTo(dfa.col("adress2"))
        .and(df.col("city1").equalTo(dfa.col("city2"))))
    ...;
Is it possible?
1) When we use UDFs we end up losing the optimizations Spark applies to our DataFrame/Dataset: to Spark's optimizer, a UDF is a black box. Consider a common optimization when reading data from a database or from columnar file formats such as Parquet: predicate pushdown. A filter expressed as a native column expression can be pushed down to the data source, but a filter wrapped in a UDF cannot.
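A minimal sketch of the difference (the Parquet path people.parquet and the age column are just placeholders); comparing the two plans shows the native predicate listed under PushedFilters, while the UDF version is not pushed down:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("pushdown-demo").master("local[*]").getOrCreate()
val people = spark.read.parquet("people.parquet")

// Native predicate: appears under PushedFilters in the Parquet scan node.
people.filter(col("age") > 30).explain()

// Same predicate wrapped in a UDF: opaque to the optimizer, nothing is pushed down.
val isAdult = udf((age: Int) => age > 30)
people.filter(isAdult(col("age"))).explain()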
In Spark, you create a UDF by writing a function in whichever language you use with Spark. For example, if you use Spark with Scala, you write the function in Scala and either wrap it with the udf() function to use it on a DataFrame, or register it as a UDF to use it in SQL.
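As a small sketch of both styles, assuming a DataFrame df that has both address columns and an active SparkSession spark (the dummy function body and the view name addresses are placeholders):

import org.apache.spark.sql.functions.{col, udf}

// DataFrame style: wrap the Scala function with udf()
val compare = udf((a: String, b: String) => if (a == b) 1.0 else 0.0)
df.select(col("name"), compare(col("adress1"), col("adress2")).as("score"))

// SQL style: register the same function, then call it in a query
spark.udf.register("compare", (a: String, b: String) => if (a == b) 1.0 else 0.0)
df.createOrReplaceTempView("addresses")
spark.sql("SELECT name, compare(adress1, adress2) AS score FROM addresses")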
Join is supposed to be a transformation, not an action.
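To make that concrete, a tiny sketch (the names left and right are just examples): the join only builds a lazy plan, and nothing runs until an action such as show() or count() is called.

import spark.implicits._

val left = Seq(1, 2, 3, 4).toDF("x")
val right = Seq(1, 3, 7, 11).toDF("q")

val joined = left.join(right, left("x") === right("q"))  // transformation: lazy, no job runs yet
joined.show()                                            // action: this is what actually executes the join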
Yes, you can. However, it will be slower than normal operators, as Spark will not be able to do predicate pushdown.
Example:
val udf = org.apache.spark.sql.functions.udf((x: String, y: String) => { /* compute the similarity score here; 0.0 is just a placeholder */ 0.0 })
val df3 = df1.join(df2, udf(df1("field1"), df2("field1")) > 0.9)
For example:
import spark.implicits._  // needed for toDF (already in scope in spark-shell)
val df1 = Seq(1, 2, 3, 4).toDF("x")
val df2 = Seq(1, 3, 7, 11).toDF("q")
val udf = org.apache.spark.sql.functions.udf((x: Int, q: Int) => Math.abs(x - q))
val df3 = df1.join(df2, udf(df1("x"), df2("q")) > 1)
You can also directly return a Boolean from the user-defined function.
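For instance, a sketch of that variant, reusing df1 and df2 from the example above (the name close is just an illustration):

val close = org.apache.spark.sql.functions.udf((x: Int, q: Int) => Math.abs(x - q) > 1)
val df4 = df1.join(df2, close(df1("x"), df2("q")))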