I am trying to remove words of length less than 4 from a string.
I use this regex:
re.sub(' \w{1,3} ', ' ', c)
This removes some words, but it fails when two or three words shorter than 4 characters appear next to each other. For example:
I am in a bank.
It gives me:
I in bank.
How can I resolve this?
Don't include the spaces; use \b word boundary anchors instead:
re.sub(r'\b\w{1,3}\b', '', c)
This removes words of up to 3 characters entirely:
>>> import re
>>> re.sub(r'\b\w{1,3}\b', '', 'The quick brown fox jumps over the lazy dog')
' quick brown  jumps over  lazy '
>>> re.sub(r'\b\w{1,3}\b', '', 'I am in a bank.')
'    bank.'
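Because \b consumes no characters, the spaces that surrounded the removed words stay in the string. If that matters, one possible follow-up is to collapse the leftover whitespace afterwards. This is only a sketch: drop_short_words is a hypothetical helper name, not something from the question, and it assumes leading and trailing whitespace can be discarded.

```python
import re

# Matches whole words of 1 to 3 word characters, as in the answer above.
SHORT_WORD = re.compile(r'\b\w{1,3}\b')

def drop_short_words(text):
    """Hypothetical helper: remove short words, then tidy the whitespace."""
    stripped = SHORT_WORD.sub('', text)
    # split() with no arguments drops empty pieces, so runs of spaces collapse.
    return ' '.join(stripped.split())

print(drop_short_words('I am in a bank.'))
print(drop_short_words('The quick brown fox jumps over the lazy dog'))
```

Note that \w does not match apostrophes, so a word like "don't" is seen as two short words ("don" and "t") and only the apostrophe survives. Adjust the pattern if your text contains contractions.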
If you want an alternative to regex:
new_string = ' '.join([w for w in old_string.split() if len(w)>3])
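One thing to watch with the split() approach is that punctuation counts toward the length, so a token like "cat." has 4 characters and is kept. A possible variant, assuming you want to measure the word without its surrounding punctuation, is sketched below. keep_long_words and min_len are made-up names for illustration.

```python
import string

def keep_long_words(text, min_len=4):
    """Hypothetical helper: keep tokens whose punctuation-stripped length is >= min_len."""
    kept = []
    for token in text.split():
        # Strip leading/trailing punctuation only for the length check;
        # the original token (with its punctuation) is what gets kept.
        core = token.strip(string.punctuation)
        if len(core) >= min_len:
            kept.append(token)
    return ' '.join(kept)

print(keep_long_words('I am in a bank.'))
print(keep_long_words('I saw a cat.'))
```

Here "bank." is kept because "bank" has 4 characters, while "cat." is dropped because "cat" has only 3.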
Answered by Martijn, but I just wanted to explain why your regex doesn't work. The regex string ' \w{1,3} ' matches a space, followed by 1-3 word characters, followed by another space. The "I" doesn't get matched because it doesn't have a space in front of it. The "am" gets replaced, and then the regex engine resumes at the next unconsumed character: the "i" in "in". It can't match the space before "in", because that space was already consumed as the trailing space of the "am" match; the space you see there in the result comes from the replacement string. So the next match it finds is "a", which produces your output string.
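To see this behaviour directly, you can list what the original pattern actually matches. The second half of this sketch is my own illustration rather than part of the answers above: it uses lookarounds, which check for the neighbouring spaces without consuming them.

```python
import re

text = 'I am in a bank.'

# Matches cannot overlap: ' am ' consumes the space that ' in ' would need.
consumed = [m.group() for m in re.finditer(r' \w{1,3} ', text)]
print(consumed)  # [' am ', ' a ']

# Lookbehind/lookahead assert the spaces but leave them in place,
# so adjacent short words can all match.
fixed = re.sub(r'(?<= )\w{1,3}(?= )', '', text)
print(repr(fixed))  # 'I    bank.'
```

The lookaround version still keeps the leading "I", since nothing precedes it, which is one reason the \b version is the more robust choice.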