
How can I remove numbers, and words with length below 2, from a sentence?

Tags:

python

regex

I am trying to remove words that are shorter than 2 characters, as well as any word that consists only of digits. For example:

 s = " This is a test 1212 test2"

The desired output is:

" This is test test2"

I tried \w{2,}, which gets rid of all the words whose length is below 2. When I added \D+, it also removed all the digits, but I don't want to lose the 2 in test2.

Sam asked Oct 14 '20 18:10



2 Answers

You may use:

s = re.sub(r'\b(?:\d+|\w)\b\s*', '', s)

Pattern Details:

  • \b: Match word boundary
  • (?:\d+|\w): Non-capturing group that matches 1+ digits or a single word character
  • \b: Match word boundary
  • \s*: Match 0 or more whitespaces
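For reference, here is a minimal runnable sketch of the pattern above applied to the string from the question. The helper name strip_short_and_numeric is hypothetical and used only for illustration. Note that \w also matches the underscore, and that only whitespace after a removed word is consumed, so a removed word at the very end of the string would leave a trailing space.

```python
import re

# Pattern from the answer above: an all-digit word or a one-character word,
# plus any whitespace that follows it.
PATTERN = re.compile(r'\b(?:\d+|\w)\b\s*')


def strip_short_and_numeric(text):
    """Hypothetical helper: remove all-digit words and one-character words."""
    return PATTERN.sub('', text)


s = " This is a test 1212 test2"
result = strip_short_and_numeric(s)
print(repr(result))  # ' This is test test2'
```

Because the trailing \s* swallows the space after each removed word, no double spaces are left behind in this example.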
anubhava answered Sep 30 '22 07:09


You can make use of word boundaries '\b' and remove anything between two boundaries that is exactly 1 character long, whether it is a digit or a letter. Also remove anything between boundaries that consists only of digits:

import re

s = " This is a test 1212 test2"

print(re.sub(r"\b([^ ]|\d+)\b", "", s))

Output:

 This is  test  test2

Explanation:

\b(           word boundary followed by a group
   [^ ]       any single character that is not a space
   |          or
   \d+        one or more digits
)\b           end of the group, followed by another word boundary

Each match is replaced with "" by re.sub(pattern, replaceBy, source).
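The substitution leaves the spaces around each removed word in place, which is why the output contains double spaces. Below is a small sketch of one way to tidy that up. The second re.sub call is an assumed extra step, not part of the answer. The last lines illustrate a side effect of [^ ]: a lone non-space character such as a hyphen between two letters also sits between two word boundaries, so it gets removed as well. Swapping in \w (also an assumption, not the answer's pattern) avoids this.

```python
import re

s = " This is a test 1212 test2"

# Same substitution as in the answer; neighbouring spaces are kept.
stripped = re.sub(r"\b([^ ]|\d+)\b", "", s)

# Assumed extra step: squeeze runs of two or more spaces down to one.
collapsed = re.sub(r" {2,}", " ", stripped)

print(repr(stripped))   # ' This is  test  test2'
print(repr(collapsed))  # ' This is test test2'

# Side effect of [^ ]: the hyphen in "well-known" is matched too.
hyphen_lost = re.sub(r"\b([^ ]|\d+)\b", "", "a well-known fix")
# Variation (assumption): \w restricts single-character matches to word characters.
hyphen_kept = re.sub(r"\b(\w|\d+)\b", "", "a well-known fix")

print(repr(hyphen_lost))  # ' wellknown fix'
print(repr(hyphen_kept))  # ' well-known fix'
```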

Patrick Artner answered Sep 30 '22 08:09