Pyspark: filter dataframe by regex with string formatting?

I've read several posts on using the "like" operator to filter a Spark DataFrame by the condition of containing a string/expression, but I was wondering whether the following is a "best practice" for using %s string formatting in the desired condition:

input_path = <s3_location_str>
my_keyword = "Arizona.*hot"  # a regex expression
dx = sqlContext.read.parquet(input_path)  # "keyword" is a field in dx

# is the following correct?
substr = "'%%%s%%'" % my_keyword  # escape % via %% to get a literal "%"
dk = dx.filter("keyword like %s" % substr)
# dk should contain rows with keyword values such as "Arizona is hot."

Note

I'm trying to get all rows in dx that contain the expression my_keyword. For exact matches, the surrounding percent signs '%' wouldn't be needed.

asked Aug 09 '17 by Quetzalcoatl

People also ask

How do you filter strings in PySpark?

In Spark and PySpark, the contains() function matches when a column value contains a given literal string (i.e., it matches on part of the string); it is mostly used to filter rows of a DataFrame.
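
A minimal runnable sketch of contains() (the SparkSession setup and sample rows are assumptions for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data with a "keyword" column
df = spark.createDataFrame(
    [("Arizona is hot.",), ("Nevada is dry.",)], ["keyword"]
)

# contains() matches a literal substring, not a regex
df.filter(F.col("keyword").contains("Arizona")).show()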

How do you filter a PySpark DataFrame based on a column value?

Filter on an array column: when you want to filter rows based on a value present in an array-typed collection column, use array_contains() from pyspark.sql.functions, which returns true if the value is present in the array and false otherwise.
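
A short sketch of array_contains() (the column names and values are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical rows with an array-typed "tags" column
df = spark.createDataFrame(
    [(1, ["hot", "dry"]), (2, ["cold"])], ["id", "tags"]
)

# keep rows whose tags array contains the literal value "hot"
df.filter(F.array_contains(F.col("tags"), "hot")).show()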

What regex does PySpark use?

Similar to the SQL regexp_like() function, Spark and PySpark also support regex (regular expression) matching via the rlike() function, available on Column in org.apache.spark.sql; see the answers below for runnable examples.

Which method is used for filter the DataFrame value in PySpark?

where() is a method used to filter rows from a DataFrame based on a given condition. The where() method is an alias for the filter() method; both operate exactly the same.
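
For example, both calls below should return the same rows (the sample data is an assumption):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# where() is an alias for filter(); the two are interchangeable
df.filter(F.col("value") > 1).show()
df.where(F.col("value") > 1).show()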


2 Answers

From Neeraj's hint, it seems the correct way to do this in PySpark is:

expr = "Arizona.*hot" dk = dx.filter(dx["keyword"].rlike(expr)) 

Note that dx.filter($"keyword" ...) did not work, since (my version of) PySpark didn't seem to support the $ nomenclature out of the box.
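
For reference, a self-contained version of this answer (the SparkSession setup and sample rows are assumptions standing in for the parquet input):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical rows standing in for sqlContext.read.parquet(input_path)
dx = spark.createDataFrame(
    [("Arizona is very hot.",), ("Vermont is cold.",)], ["keyword"]
)

expr = "Arizona.*hot"
dk = dx.filter(dx["keyword"].rlike(expr))
dk.show()  # keeps only the "Arizona is very hot." row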

answered Sep 22 '22 by Quetzalcoatl

Try the rlike function, as shown below.

df.filter(<column_name> rlike "<regex_pattern>") 

For example:

dk = dx.filter($"keyword" rlike "<pattern>") 
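
Note that the $"keyword" column syntax above is Scala's; in PySpark, one equivalent (assuming the dx DataFrame from the question) is to pass a SQL expression string, which filter() also accepts:

# PySpark alternative: filter() accepts a SQL expression string,
# and rlike is also available as a SQL operator
dk = dx.filter("keyword rlike 'Arizona.*hot'")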
answered Sep 22 '22 by Neeraj Bhadani