Is there any nice-looking way to perform a case-insensitive join in PySpark? Something like:
df3 = df1.join(df2,
               ["col1", "col2", "col3"],
               "left_outer",
               "case-insensitive")
Or what are your working solutions for this?
When spark.sql.caseSensitive is set to false, Spark does case-insensitive column name resolution between the Hive metastore schema and the Parquet schema, so even when column names are in different letter cases, Spark returns the corresponding column values.
The select transformation not only applies case-insensitive column matching but can also rename the column header in the new DataFrame. Case sensitivity is set to false by default; we can enable it by turning it on in the SparkSession.
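For reference, a minimal sketch of toggling that setting on an existing SparkSession (the spark variable here is assumed); note that it only affects column-name resolution, not the string values being compared in a join:

# Assumes an existing SparkSession bound to `spark`
spark.conf.set("spark.sql.caseSensitive", "true")   # case-sensitive column name resolution
spark.conf.set("spark.sql.caseSensitive", "false")  # default: case-insensitive names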
PySpark can join two DataFrames in a couple of ways. The first join syntax takes the right dataset, joinExprs, and joinType as arguments, where joinExprs provides the join condition. The second join syntax takes just the right dataset and joinExprs, and it defaults to an inner join.
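As a quick illustration of those two syntaxes (df1, df2, and the id column here are placeholders):

# First syntax: right dataset, join expression, and join type
df3 = df1.join(df2, df1.id == df2.id, "left_outer")

# Second syntax: right dataset and join expression only; defaults to an inner join
df3 = df1.join(df2, df1.id == df2.id)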
Spark can be case sensitive, but it is case insensitive by default.
It's not exactly elegant, but you could create new lower-case versions of those columns purely for joining.
import pyspark.sql.functions as F

# Add lower-cased copies of the join columns to each DataFrame
df1_l = df1 \
    .withColumn("col1_l", F.lower(df1.col1)) \
    .withColumn("col2_l", F.lower(df1.col2)) \
    .withColumn("col3_l", F.lower(df1.col3))

df2_l = df2 \
    .withColumn("col1_l", F.lower(df2.col1)) \
    .withColumn("col2_l", F.lower(df2.col2)) \
    .withColumn("col3_l", F.lower(df2.col3))

# Join on the lower-cased helper columns
df3 = df1_l.join(df2_l,
                 ["col1_l", "col2_l", "col3_l"],
                 "left_outer")
You could also do the same transformation directly in the join predicate, e.g.:
df3 = df1.join(df2,
               (F.lower(df1.col1) == F.lower(df2.col1))
               & (F.lower(df1.col2) == F.lower(df2.col2))
               & (F.lower(df1.col3) == F.lower(df2.col3)),
               "left_outer")