Input: s = "test1 this is a sample subscript o₁"
I've tried: re.compile(r'\b[^\W\d_]{2,}\b').findall(s)
It finds words of at least 2 characters that contain no digits:
'this', 'is', 'sample', 'subscript', 'o₁'
but the result still includes the word with a subscript number.
Is there a way to exclude words that contain a subscript?
Desired output: 'this', 'is', 'sample', 'subscript'
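The point is that the Unicode-aware \d in Python 3 regex matches only decimal digits (Unicode category Nd). It does not match characters in the No (Number, other) category, such as the subscript ₁, while \w does match them, so [^\W\d_] still lets ₁ through.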
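A quick way to check this claim is to inspect the subscript character directly. This is only a small diagnostic sketch; the variable names are illustrative and not part of the original question.

```python
import re
import unicodedata

sub_one = "\u2081"  # SUBSCRIPT ONE, the "₁" in "o₁"

# Unicode general category of the character
category = unicodedata.category(sub_one)
# Does the Unicode-aware \d match it? Does \w?
matches_d = re.fullmatch(r"\d", sub_one) is not None
matches_w = re.fullmatch(r"\w", sub_one) is not None

print(category, matches_d, matches_w)
```

Because \w accepts the character and \d rejects it, subtracting \d from \w is not enough to exclude it.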
If you only need words made of ASCII letters, use
r'\b[a-zA-Z]{2,}\b'
Or make the shorthand classes ASCII-only with the re.A / re.ASCII flag:
re.compile(r'\b[^\W\d_]{2,}\b', re.A)
See this Python 3 demo.
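One caveat worth checking: the two ASCII variants are not fully equivalent, because re.A also makes \b ASCII-only. The sketch below uses a made-up input containing "ab₁" (an assumption for illustration, not from the question) to show the difference.

```python
import re

# Hypothetical input: "ab₁" is two ASCII letters followed by a subscript
sample = "test1 this is ab\u2081 o\u2081"

# With re.A, "₁" is a non-word character, so \b matches right after "ab"
ascii_flag = re.findall(r"\b[^\W\d_]{2,}\b", sample, re.A)

# Without re.A, \b stays Unicode-aware: "₁" is a word character,
# so there is no boundary after "ab" and the whole token is skipped
ascii_class = re.findall(r"\b[a-zA-Z]{2,}\b", sample)

print(ascii_flag)
print(ascii_class)
```

If tokens with a subscript should be dropped entirely rather than truncated, the explicit [a-zA-Z] class without re.A appears to be the safer choice.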
If you need to work with any Unicode letters, you can fix it either by adding all the No characters to the negated character class (which can be tedious), or by adding a programmatic check after each match to see whether it contains any character from the No category.
See this Python 3 demo:
import re, unicodedata
s = "test1 this is a sample subscript o₁"
# Find candidate words, then drop any that contain a character from the No category
words = re.findall(r'\b[^\W\d_]{2,}\b', s)
print([w for w in words if not any(unicodedata.category(c) == 'No' for c in w)])
# => ['this', 'is', 'sample', 'subscript']
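The other option mentioned above, adding the No characters to the negated character class, does not have to be done by hand. A minimal sketch, assuming it is acceptable to build the class once at startup from the running interpreter's Unicode database (the names no_chars and pattern are illustrative):

```python
import re
import sys
import unicodedata

s = "test1 this is a sample subscript o\u2081"

# Collect every code point whose general category is No (Number, other)
no_chars = "".join(
    chr(i) for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)) == "No"
)

# Splice them into the negated class so they can never be part of a match
pattern = re.compile(r"\b[^\W\d_" + re.escape(no_chars) + r"]{2,}\b")

words = pattern.findall(s)
print(words)
```

This keeps the filtering inside the regex at the cost of a much larger character class, so the post-match check may still be easier to read.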
Make sure you are using a recent Python version so that the latest Unicode standard is supported, or rely on the PyPI regex module:
import regex
p = regex.compile(r"\b\p{L}{2,}\b")
print(p.findall(s))