The Python FuzzyWuzzy library includes the following regex:
regex = re.compile(r"(?ui)\W")
return regex.sub(u" ", a_string)
(https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/string_processing.py#L17)
This replaces every non-word character in a_string (anything that is not a letter, digit or underscore) with a space.
What does the (?ui) bit do though? It seems to work fine without it.
Thanks
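(?ui) sets two inline flags: u is the Unicode flag (re.UNICODE) and i is the ignore-case flag (re.IGNORECASE).
The Unicode flag makes \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database. For example, in Python 2, where this flag is off by default: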
>>> re.findall(r'\d+', u'The answer is \u0664\u0662') # No flag
[]
>>> re.findall(r'(?u)\d+', u'The answer is \u0664\u0662') # With unicode flag
[u'\u0664\u0662']
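The session above is Python 2. In Python 3, str patterns are Unicode-aware by default, which is probably why the pattern seems to work fine without the flag. The following is a minimal sketch of that behaviour, assuming Python 3; the variable names are only illustrative. It uses re.ASCII to get back the old ASCII-only matching:

```python
import re

# Two Arabic-Indic digits (U+0664, U+0662) after some ASCII text.
text = "The answer is \u0664\u0662"

# Python 3: str patterns use Unicode matching by default.
unicode_digits = re.findall(r"\d+", text)

# (?u) is still accepted for str patterns, but it is redundant.
explicit_flag = re.findall(r"(?u)\d+", text)

# re.ASCII restricts \d to [0-9], like Python 2 without the u flag.
ascii_digits = re.findall(r"\d+", text, re.ASCII)

print(unicode_digits, explicit_flag, ascii_digits)
```

Under these assumptions the first two calls find the digit run and the re.ASCII call finds nothing.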
The ignore case flag performs case-insensitive matching. Expressions like [A-Z] will match lowercase letters as well. This is not affected by the current locale. For example:
>>> re.findall(r'[a-z]+', 'HELLO world') # No flag
['world']
>>> re.findall(r'(?i)[a-z]+', 'HELLO world') # With ignore case flag
['HELLO', 'world']
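Applied to the pattern from the question, the i flag should make no practical difference, because \W carries no case information. The u flag only matters where Unicode matching is not already the default. Here is a rough sketch assuming Python 3; replace_non_word is a hypothetical helper written for this example, not FuzzyWuzzy's own function:

```python
import re


def replace_non_word(a_string, pattern, flags=0):
    """Hypothetical helper: replace every match of pattern with a space."""
    return re.sub(pattern, " ", a_string, flags=flags)


# Accented letters are written as escapes to keep the source ASCII-only.
sample = "Cr\u00e8me-Br\u00fbl\u00e9e, \u00c9CLAIR!"

# The pattern from the question, with both inline flags.
with_flags = replace_non_word(sample, r"(?ui)\W")

# The same pattern without flags: identical on Python 3 str input.
without_flags = replace_non_word(sample, r"\W")

# ASCII-only matching treats accented letters as non-word characters.
ascii_only = replace_non_word(sample, r"\W", flags=re.ASCII)

print(repr(with_flags))
print(repr(without_flags))
print(repr(ascii_only))
```

On Python 3 the first two results are identical and keep the accented letters, while the re.ASCII variant replaces them with spaces. That is the difference the u flag makes on Python 2.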