 

PySpark - how to do case-insensitive dataframe joins?

Is there any nice-looking way to perform a case-insensitive join in PySpark? Something like:

df3 = df1.join(df2, 
               ["col1", "col2", "col3"],
               "left_outer",
               "case-insensitive")

Or what are your working solutions for this?

asked Oct 25 '16 by Babu

People also ask

Are Spark joins case sensitive?

When spark.sql.caseSensitive is set to false, Spark does case-insensitive column name resolution between the Hive metastore schema and the Parquet schema, so even if column names are in different letter cases, Spark returns the corresponding column values.
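
For reference, a minimal sketch of setting and checking that property (the config key spark.sql.caseSensitive is real; the session setup is just boilerplate):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Column-name resolution is case insensitive by default ("false");
# set to "true" to make Spark treat "Col1" and "col1" as different columns.
spark.conf.set("spark.sql.caseSensitive", "false")
print(spark.conf.get("spark.sql.caseSensitive"))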

How do you ignore a case in PySpark?

The select transformation can apply case-insensitive filtering and also rename the column header in the resulting dataframe. Case sensitivity is set to false by default; it can be enabled via the Spark session configuration.
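
As an illustration, a case-insensitive filter can be built with lower() (df and the name column here are hypothetical):

import pyspark.sql.functions as F

# Lower-case the column before comparing, so "Alice", "ALICE"
# and "alice" all match the literal "alice".
matches = df.filter(F.lower(df.name) == "alice")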

How do you join with conditions in PySpark?

PySpark Join Two DataFrames: the first join syntax takes the right dataset, joinExprs, and joinType as arguments, and uses joinExprs to provide the join condition. The second join syntax takes just the right dataset and joinExprs, and defaults to an inner join.
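
A short sketch of both syntaxes, assuming two hypothetical dataframes that share an id column:

# First syntax: right dataset, join expression, and join type.
joined = df1.join(df2, df1.id == df2.id, "left_outer")

# Second syntax: only the right dataset and the join expression;
# the join type defaults to inner.
joined_inner = df1.join(df2, df1.id == df2.id)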

Is PySpark DataFrame case sensitive?

Spark can be case sensitive, but it is case insensitive by default.


1 Answer

It's not exactly elegant, but you could create new lower-case versions of those columns purely for joining.

import pyspark.sql.functions as F

# Add lower-cased copies of the join keys to each dataframe.
df1_l = df1 \
    .withColumn("col1_l", F.lower(df1.col1)) \
    .withColumn("col2_l", F.lower(df1.col2)) \
    .withColumn("col3_l", F.lower(df1.col3))

df2_l = df2 \
    .withColumn("col1_l", F.lower(df2.col1)) \
    .withColumn("col2_l", F.lower(df2.col2)) \
    .withColumn("col3_l", F.lower(df2.col3))

df3 = df1_l.join(df2_l, 
           ["col1_l", "col2_l", "col3_l"],
           "left_outer")

And you could also try doing this same transformation in a join predicate, e.g.:

# Lower-case both sides inside the join condition itself.
df3 = df1.join(df2,
           (F.lower(df1.col1) == F.lower(df2.col1))
            & (F.lower(df1.col2) == F.lower(df2.col2))
            & (F.lower(df1.col3) == F.lower(df2.col3)),
           "left_outer")

answered Sep 20 '22 by Chris Dove