I've read several posts on using the "like" operator to filter a spark dataframe by the condition of containing a string/expression, but was wondering if the following is a "best-practice" on using %s in the desired condition as follows:
input_path = <s3_location_str> my_expr = "Arizona.*hot" # a regex expression dx = sqlContext.read.parquet(input_path) # "keyword" is a field in dx # is the following correct? substr = "'%%%s%%'" %my_keyword # escape % via %% to get "%" dk = dx.filter("keyword like %s" %substr) # dk should contain rows with keyword values such as "Arizona is hot."
Note
I'm trying to get all rows in dx that contain the expression my_keyword. Otherwise, for exact matches we wouldn't need surrounding percent signs '%'.
In Spark & PySpark, contains() function is used to match a column value contains in a literal string (matches on part of the string), this is mostly used to filter rows on DataFrame.
Filter on an Array column When you want to filter rows from DataFrame based on value present in an array collection column, you can use the first syntax. The below example uses array_contains() from Pyspark SQL functions which checks if a value contains in an array if present it returns true otherwise false.
Similar to SQL regexp_like() function Spark & PySpark also supports Regex (Regular expression matching) by using rlike() function, This function is available in org. apache. spark.
In this article, we are going to see where filter in PySpark Dataframe. Where() is a method used to filter the rows from DataFrame based on the given condition. The where() method is an alias for the filter() method. Both these methods operate exactly the same.
From neeraj's hint, it seems like the correct way to do this in pyspark is:
expr = "Arizona.*hot" dk = dx.filter(dx["keyword"].rlike(expr))
Note that dx.filter($"keyword" ...)
did not work since (my version) of pyspark didn't seem to support the $
nomenclature out of the box.
Try rlike function as mentioned below.
df.filter(<column_name> rlike "<regex_pattern>")
for example.
dk = dx.filter($"keyword" rlike "<pattern>")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With