Pyspark: filter dataframe by regex with string formatting?

I've read several posts on using the "like" operator to filter a Spark DataFrame by the condition of containing a string/expression, but I was wondering whether the following is a "best practice" for using %s string formatting in the desired condition:

input_path = <s3_location_str>
my_keyword = "Arizona.*hot"  # a regex expression
dx = sqlContext.read.parquet(input_path)  # "keyword" is a field in dx

# is the following correct?
substr = "'%%%s%%'" % my_keyword  # escape % via %% to get a literal "%"
dk = dx.filter("keyword like %s" % substr)
# dk should contain rows with keyword values such as "Arizona is hot."

Note

I'm trying to get all rows in dx that contain the expression my_keyword. For exact matches, the surrounding percent signs '%' wouldn't be needed.

asked Aug 09 '17 by Quetzalcoatl

People also ask

How do you filter strings in PySpark?

In Spark and PySpark, the contains() function matches when a column value contains a given literal string (i.e., it matches on part of the string); it is mostly used to filter rows of a DataFrame.
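
A minimal runnable sketch of contains() (the SparkSession setup and sample rows are assumptions for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data with a "keyword" column
df = spark.createDataFrame(
    [("Arizona is hot.",), ("Nevada is dry.",)], ["keyword"]
)

# contains() matches a literal substring, not a regex
df.filter(F.col("keyword").contains("Arizona")).show()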

How do you filter a PySpark DataFrame based on a column value?

Filter on an array column: when you want to filter rows based on a value present in an array-typed collection column, use array_contains() from pyspark.sql.functions, which returns true if the value is present in the array and false otherwise.
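
A short sketch of array_contains() (the column names and values are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical rows with an array-typed "tags" column
df = spark.createDataFrame(
    [(1, ["hot", "dry"]), (2, ["cold"])], ["id", "tags"]
)

# keep rows whose tags array contains the literal value "hot"
df.filter(F.array_contains(F.col("tags"), "hot")).show()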

What regex does PySpark use?

Similar to the SQL regexp_like() function, Spark and PySpark also support regex (regular expression) matching via the rlike() function, available on Column in org.apache.spark.sql; see the answers below for runnable examples.

Which method is used for filter the DataFrame value in PySpark?

where() is a method used to filter rows from a DataFrame based on a given condition. The where() method is an alias for the filter() method; both operate exactly the same.
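
For example, both calls below should return the same rows (the sample data is an assumption):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# where() is an alias for filter(); the two are interchangeable
df.filter(F.col("value") > 1).show()
df.where(F.col("value") > 1).show()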


2 Answers

From Neeraj's hint, it seems the correct way to do this in PySpark is:

expr = "Arizona.*hot" dk = dx.filter(dx["keyword"].rlike(expr)) 

Note that dx.filter($"keyword" ...) did not work, since (my version of) PySpark didn't seem to support the $ nomenclature out of the box.
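
For reference, a self-contained version of this answer (the SparkSession setup and sample rows are assumptions standing in for the parquet input):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical rows standing in for sqlContext.read.parquet(input_path)
dx = spark.createDataFrame(
    [("Arizona is very hot.",), ("Vermont is cold.",)], ["keyword"]
)

expr = "Arizona.*hot"
dk = dx.filter(dx["keyword"].rlike(expr))
dk.show()  # keeps only the "Arizona is very hot." row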

answered Sep 22 '22 by Quetzalcoatl

Try the rlike function, as shown below.

df.filter(<column_name> rlike "<regex_pattern>") 

For example:

dk = dx.filter($"keyword" rlike "<pattern>") 
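
Note that the $"keyword" column syntax above is Scala's; in PySpark, one equivalent (assuming the dx DataFrame from the question) is to pass a SQL expression string, which filter() also accepts:

# PySpark alternative: filter() accepts a SQL expression string,
# and rlike is also available as a SQL operator
dk = dx.filter("keyword rlike 'Arizona.*hot'")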
answered Sep 22 '22 by Neeraj Bhadani