I'd like to use a specific UDF with Spark. Here's the plan: I have a table A (10 million rows) and a table B (15 million rows), and I'd like to use a UDF to compare one element of table A with one element of table B. Is it possible?
Here's a sample of my code. At some point I also need to specify that the result of my compare UDF must be greater than 0.9:
DataFrame dfr = df
    .select("name", "firstname", "adress1", "city1", "compare(adress1,adress2)")
    .join(dfa, df.col("adress1").equalTo(dfa.col("adress2"))
        .and(df.col("city1").equalTo(dfa.col("city2"))))
    ...;
Is it possible?
1) When we use UDFs we end up losing the optimizations Spark applies to our DataFrame/Dataset: to Spark's optimizer, a UDF is a black box. Consider a common optimization when reading data from a database or from columnar file formats such as Parquet: predicate pushdown. A filter expressed as a native column expression can be pushed down to the data source, but a filter wrapped in a UDF cannot.
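A minimal sketch of the difference (the Parquet path people.parquet and the age column are just placeholders); comparing the two plans shows the native predicate listed under PushedFilters, while the UDF version is not pushed down:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("pushdown-demo").master("local[*]").getOrCreate()
val people = spark.read.parquet("people.parquet")

// Native predicate: appears under PushedFilters in the Parquet scan node.
people.filter(col("age") > 30).explain()

// Same predicate wrapped in a UDF: opaque to the optimizer, nothing is pushed down.
val isAdult = udf((age: Int) => age > 30)
people.filter(isAdult(col("age"))).explain()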
In Spark, you create a UDF by writing a function in whichever language you use with Spark. For example, if you use Spark with Scala, you write the function in Scala and either wrap it with the udf() function to use it on a DataFrame, or register it as a UDF to use it in SQL.
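As a small sketch of both styles, assuming a DataFrame df that has both address columns and an active SparkSession spark (the dummy function body and the view name addresses are placeholders):

import org.apache.spark.sql.functions.{col, udf}

// DataFrame style: wrap the Scala function with udf()
val compare = udf((a: String, b: String) => if (a == b) 1.0 else 0.0)
df.select(col("name"), compare(col("adress1"), col("adress2")).as("score"))

// SQL style: register the same function, then call it in a query
spark.udf.register("compare", (a: String, b: String) => if (a == b) 1.0 else 0.0)
df.createOrReplaceTempView("addresses")
spark.sql("SELECT name, compare(adress1, adress2) AS score FROM addresses")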
Join is supposed to be a transformation, not an action.
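To make that concrete, a tiny sketch (the names left and right are just examples): the join only builds a lazy plan, and nothing runs until an action such as show() or count() is called.

import spark.implicits._

val left = Seq(1, 2, 3, 4).toDF("x")
val right = Seq(1, 3, 7, 11).toDF("q")

val joined = left.join(right, left("x") === right("q"))  // transformation: lazy, no job runs yet
joined.show()                                            // action: this is what actually executes the join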
Yes, you can. However, it will be slower than normal operators, as Spark will not be able to do predicate pushdown.
Example:
val udf = org.apache.spark.sql.functions.udf((x: String, y: String) => { /* compute the similarity score here; 0.0 is just a placeholder */ 0.0 })
val df3 = df1.join(df2, udf(df1("field1"), df2("field1")) > 0.9)
For example:
import spark.implicits._  // needed for toDF (already in scope in spark-shell)
val df1 = Seq(1, 2, 3, 4).toDF("x")
val df2 = Seq(1, 3, 7, 11).toDF("q")
val udf = org.apache.spark.sql.functions.udf((x: Int, q: Int) => Math.abs(x - q))
val df3 = df1.join(df2, udf(df1("x"), df2("q")) > 1)
You can also directly return a Boolean from the user-defined function.
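For instance, a sketch of that variant, reusing df1 and df2 from the example above (the name close is just an illustration):

val close = org.apache.spark.sql.functions.udf((x: Int, q: Int) => Math.abs(x - q) > 1)
val df4 = df1.join(df2, close(df1("x"), df2("q")))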