How can I use a Spark SQL filter as a case-insensitive filter?
For example:
dataFrame.filter(dataFrame.col("vendor").equalTo("fortinet"));
only returns rows whose 'vendor' column is equal to 'fortinet', but I want rows whose 'vendor' column equals 'fortinet', 'Fortinet', 'foRtinet', and so on.
When spark.sql.caseSensitive is set to false, Spark does case-insensitive column name resolution between the Hive metastore schema and the Parquet schema, so even if column names are in different letter cases, Spark returns the corresponding column values.
The contains() method checks whether a DataFrame column string contains the string specified as an argument (it matches on part of the string), returning true if the string exists and false if not.
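As a sketch, assuming a DataFrame df with a string column v (the same shape as the sample built in the Scala answer further down): contains() by itself is case sensitive, so pair it with lower() for a case-insensitive partial match:
import org.apache.spark.sql.functions.lower

// contains() matches a substring; lowering the column makes the match case-insensitive
df.where(lower($"v").contains("fortinet")).show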
Try sqlContext.sql("set spark.sql.caseSensitive=true") in your Python code, which worked for me.
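The same switch can be flipped from Scala (a minimal sketch, assuming a SparkSession named spark); note that spark.sql.caseSensitive only controls how column names are resolved, not how string values compare:
// toggles case-sensitive resolution of column *names*; it does not make
// value comparisons such as === "fortinet" case-insensitive
spark.conf.set("spark.sql.caseSensitive", "true")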
Column names that differ only by case are considered duplicates. Delta Lake is case preserving, but case insensitive, when storing a schema. Parquet is case sensitive when storing and returning column information. Spark can be case sensitive, but it is case insensitive by default.
The CASE WHEN and OTHERWISE function or statement tests whether any of a sequence of expressions is true, and returns the corresponding result for the first true expression. You can write the CASE statement on DataFrame column values, or write your own expression to test conditions.
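For instance, a minimal sketch of the DataFrame-side equivalent with when/otherwise, flagging vendors case-insensitively in a hypothetical is_fortinet column (assuming a df with a string column v as in the answer below):
import org.apache.spark.sql.functions.{when, lower}

// CASE WHEN lower(v) = 'fortinet' THEN true ELSE false END, expressed on the DataFrame
df.withColumn("is_fortinet",
  when(lower($"v") === "fortinet", true).otherwise(false))
  .show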
To filter() rows on a Spark DataFrame based on multiple conditions using AND (&&), OR (||), and NOT (!), you can use either a Column with a condition or a SQL expression as explained above. Below is just a simple example; you can extend it with AND (&&), OR (||), and NOT (!) conditional expressions as needed.
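A minimal sketch of such a combined filter, using the sample df (columns k and v) from the answer below:
import org.apache.spark.sql.functions.lower

// AND (&&), OR (||) and NOT (!) compose Column conditions;
// here: vendor is fortinet in any casing AND k is not 3
df.where(lower($"v") === "fortinet" && !($"k" === 3L)).show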
Case-insensitive SQL SELECT: use the upper or lower functions. The SQL-standard way to perform case-insensitive queries is to use the SQL upper or lower functions, like this: select * from users where upper(first_name) = 'FRED'; or this: select * from users where lower(first_name) = 'fred';
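The same pattern works through Spark SQL once the DataFrame is registered as a temp view (a sketch; the view name vendors is hypothetical):
df.createOrReplaceTempView("vendors")
spark.sql("SELECT * FROM vendors WHERE lower(v) = 'fortinet'").show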
You can either use a case-insensitive regex:
// sample data ($ and toDF need spark.implicits._ outside of spark-shell)
val df = sc.parallelize(Seq(
  (1L, "Fortinet"), (2L, "foRtinet"), (3L, "foo")
)).toDF("k", "v")

// (?i) makes the pattern case-insensitive; ^ and $ anchor a full match
df.where($"v".rlike("(?i)^fortinet$")).show
// +---+--------+
// | k| v|
// +---+--------+
// | 1|Fortinet|
// | 2|foRtinet|
// +---+--------+
or simple equality with lower / upper:
import org.apache.spark.sql.functions.{lower, upper}
df.where(lower($"v") === "fortinet").show
// +---+--------+
// | k| v|
// +---+--------+
// | 1|Fortinet|
// | 2|foRtinet|
// +---+--------+
df.where(upper($"v") === "FORTINET").show
// +---+--------+
// | k| v|
// +---+--------+
// | 1|Fortinet|
// | 2|foRtinet|
// +---+--------+
For simple filters I would prefer rlike, although performance should be similar; for join conditions equality is a much better choice. See How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion? for details.
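As a sketch of why equality is preferable in joins: lowering both join keys keeps the join an equi-join, which Spark can execute as a hash or sort-merge join, whereas an rlike condition degrades to a nested-loop plan (the second DataFrame here is hypothetical):
import org.apache.spark.sql.functions.lower

// hypothetical lookup table of vendor names in arbitrary casing
val vendors2 = Seq("FORTINET", "Foo").toDF("name")
df.join(vendors2, lower($"v") === lower($"name")).show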
Try to use the lower/upper string functions (in the Java API they are static methods on org.apache.spark.sql.functions):
import static org.apache.spark.sql.functions.lower;
import static org.apache.spark.sql.functions.upper;

dataFrame.filter(lower(dataFrame.col("vendor")).equalTo("fortinet"))
or
dataFrame.filter(upper(dataFrame.col("vendor")).equalTo("FORTINET"))