I'm trying to create a custom join for two dataframes (df1 and df2) in PySpark (similar to this), with code that looks like this:
my_join_udf = udf(lambda x, y: isJoin(x, y), BooleanType())
my_join_df = df1.join(df2, my_join_udf(df1.col_a, df2.col_b))
The error message I'm getting is:
java.lang.RuntimeException: Invalid PythonUDF PythonUDF#<lambda>(col_a#17,col_b#0), requires attributes from more than one child
Is there a way to write a PySpark UDF that can process columns from two separate dataframes?
Spark 2.2+
You have to use crossJoin
or enable cross joins in the configuration:
df1.crossJoin(df2).where(my_join_udf(df1.col_a, df2.col_b))
Spark 2.0, 2.1
Method shown below doesn't work anymore in Spark 2.x. See SPARK-19728.
Spark 1.x
Theoretically you can join and filter:
df1.join(df2).where(my_join_udf(df1.col_a, df2.col_b))
but in general you shouldn't to it all. Any type of join
which is not based on equality requires a full Cartesian product (same as the answer) which is rarely acceptable (see also Why using a UDF in a SQL query leads to cartesian product?).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With