
Remove words of length less than 4 from string [duplicate]

Tags: python, regex

I am trying to remove words of length less than 4 from a string.

I use this regex:

 re.sub(' \w{1,3} ', ' ', c)

This removes some words, but it fails when two or three words shorter than 4 characters appear in a row. For example:

 I am in a bank.

It gives me:

 I in bank. 

How can I resolve this?

blackmamba asked Jun 20 '14 16:06


3 Answers

Don't include the spaces; use \b word boundary anchors instead:

re.sub(r'\b\w{1,3}\b', '', c)

This removes words of up to 3 characters entirely:

>>> import re
>>> re.sub(r'\b\w{1,3}\b', '', 'The quick brown fox jumps over the lazy dog')
' quick brown  jumps over  lazy '
>>> re.sub(r'\b\w{1,3}\b', '', 'I am in a bank.')
'    bank.'
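
As the outputs above show, the substitution leaves the surrounding spaces behind. Here is a minimal sketch of one way to tidy that up, assuming it is acceptable to collapse every run of whitespace into a single space. The helper name remove_short_words and its min_len parameter are hypothetical, not part of the original answer:

```python
import re

def remove_short_words(text, min_len=4):
    """Drop words shorter than min_len, then collapse leftover whitespace.

    Hypothetical helper for illustration; assumes min_len >= 2.
    """
    # \b matches at word edges without consuming the spaces around the word.
    pattern = r'\b\w{1,%d}\b' % (min_len - 1)
    stripped = re.sub(pattern, '', text)
    # split() with no arguments splits on any whitespace run and drops empties,
    # so joining with a single space normalises the gaps left by the removal.
    return ' '.join(stripped.split())

print(remove_short_words('I am in a bank.'))  # bank.
```

Note that \w does not match apostrophes, so a word such as "don't" would be treated as two short words; whether that matters depends on your input.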
Martijn Pieters answered Oct 22 '22 19:10


If you want an alternative to regex:

new_string = ' '.join([w for w in old_string.split() if len(w)>3])
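
One thing to watch with the split-based approach: attached punctuation counts toward the length, so 'dog.' has length 4 and is kept. If that is not what you want, here is a rough sketch that measures each word with leading and trailing punctuation stripped. The function name keep_long_words is made up for this example, and using string.punctuation as the set of characters to ignore is an assumption:

```python
import string

def keep_long_words(text, min_len=4):
    """Keep whitespace-separated words whose core is at least min_len long.

    Hypothetical helper; the 'core' is the word with leading and trailing
    ASCII punctuation removed, used only for measuring the length.
    """
    kept = []
    for word in text.split():
        core = word.strip(string.punctuation)
        if len(core) >= min_len:
            # Keep the original token, punctuation included.
            kept.append(word)
    return ' '.join(kept)

print(keep_long_words('The dog. I am in a bank.'))  # bank.
```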
Vidhya G answered Oct 22 '22 19:10


Martijn has already answered, but I wanted to explain why your regex doesn't work. The pattern ' \w{1,3} ' matches a space, followed by 1-3 word characters, followed by another space. The I doesn't get matched because it has no space in front of it. The ' am ' (with both spaces) gets replaced, and the regex engine then resumes at the first character after that match: the i in in. It can't use the space before in, because that space was already consumed as the trailing space of the ' am ' match, and matches never overlap. So the next match it finds is ' a ', which produces your output string.
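
To see this for yourself, you can list the spans that each pattern matches with re.finditer. This is just an illustrative sketch; the matches helper is a made-up name:

```python
import re

s = 'I am in a bank.'

def matches(pattern, text):
    """Return (span, matched text) for every non-overlapping match."""
    return [(m.span(), m.group()) for m in re.finditer(pattern, text)]

# The original pattern consumes the spaces, so ' in ' can never match:
# its leading space (index 4) already belongs to the ' am ' match.
print(matches(r' \w{1,3} ', s))
# [((1, 5), ' am '), ((7, 10), ' a ')]

# \b is zero-width, so no spaces are consumed and every short word is found.
print(matches(r'\b\w{1,3}\b', s))
# [((0, 1), 'I'), ((2, 4), 'am'), ((5, 7), 'in'), ((8, 9), 'a')]
```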

Sizik answered Oct 22 '22 19:10