Is there any nice-looking way to perform a case-insensitive join in PySpark? Something like:
df3 = df1.join(df2,
               ["col1", "col2", "col3"],
               "left_outer",
               "case-insensitive")
Or what are your working solutions for this?
When spark.sql.caseSensitive is set to false, Spark does case-insensitive column name resolution between the Hive metastore schema and the Parquet schema, so even when column names are in different letter cases, Spark returns the corresponding column values.
The select transformation not only applies case-insensitive column matching but can also rename the column header in the new DataFrame. Case sensitivity is set to false by default; we can enable it by turning it on in the SparkSession.
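For reference, a minimal sketch of toggling that setting on an existing SparkSession (the spark variable here is assumed); note that it only affects column-name resolution, not the string values being compared in a join:

# Assumes an existing SparkSession bound to `spark`
spark.conf.set("spark.sql.caseSensitive", "true")   # case-sensitive column name resolution
spark.conf.set("spark.sql.caseSensitive", "false")  # default: case-insensitive names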
PySpark can join two DataFrames in a couple of ways. The first join syntax takes the right dataset, joinExprs, and joinType as arguments, where joinExprs provides the join condition. The second join syntax takes just the right dataset and joinExprs, and it defaults to an inner join.
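As a quick illustration of those two syntaxes (df1, df2, and the id column here are placeholders):

# First syntax: right dataset, join expression, and join type
df3 = df1.join(df2, df1.id == df2.id, "left_outer")

# Second syntax: right dataset and join expression only; defaults to an inner join
df3 = df1.join(df2, df1.id == df2.id)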
Spark can be case sensitive, but it is case insensitive by default.
It's not exactly elegant, but you could create new lower-case versions of those columns purely for joining.
import pyspark.sql.functions as F

# Add lower-cased copies of the join columns to each DataFrame
df1_l = df1 \
    .withColumn("col1_l", F.lower(df1.col1)) \
    .withColumn("col2_l", F.lower(df1.col2)) \
    .withColumn("col3_l", F.lower(df1.col3))

df2_l = df2 \
    .withColumn("col1_l", F.lower(df2.col1)) \
    .withColumn("col2_l", F.lower(df2.col2)) \
    .withColumn("col3_l", F.lower(df2.col3))

# Join on the lower-cased helper columns
df3 = df1_l.join(df2_l,
                 ["col1_l", "col2_l", "col3_l"],
                 "left_outer")
You could also do the same transformation directly in the join predicate, e.g.:
df3 = df1.join(df2,
               (F.lower(df1.col1) == F.lower(df2.col1))
               & (F.lower(df1.col2) == F.lower(df2.col2))
               & (F.lower(df1.col3) == F.lower(df2.col3)),
               "left_outer")