
How to get all words of a minimum length that don't contain numbers?

Tags:

python

regex

I have an input (including unicode):

s = "Question1: a12 is the number of a, b1 is the number of cầu thủ"

I want to get all words that contain no digits and have at least 2 chars. Desired output:

['is', 'the', 'number', 'of', 'is', 'the', 'number', 'of', 'cầu', 'thủ'].

I've tried

re.compile('[\w]{2,}').findall(s)

and got

['Question1', 'a12', 'is', 'the', 'number', 'of', 'b1', 'is', 'the', 'number', 'of', 'cầu', 'thủ']

Is there any way to get only the words that have no digits in them?

asked May 13 '19 08:05 by Ha Bom


1 Answer

You may use

import re
s = "Question1: a12 is the number of a, b1 is the number of cầu thủ"
print(re.compile(r'\b[^\W\d_]{2,}\b').findall(s))
# => ['is', 'the', 'number', 'of', 'is', 'the', 'number', 'of', 'cầu', 'thủ']

Or, if you only want to match words made of ASCII letters, with a minimum of 2 letters:

print(re.compile(r'\b[a-zA-Z]{2,}\b').findall(s))

See the Python demo
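
As a rough check of how the two patterns differ on the sample input, the sketch below runs both and lists the words that only the Unicode-aware pattern returns. The variable names are illustrative, and the accented letters are written as escapes on the assumption that the input uses precomposed (NFC) characters.

```python
import re

# Sample input from the question; \u1ea7 and \u1ee7 are the precomposed
# accented letters in the last two words (assumes NFC-normalized input)
s = "Question1: a12 is the number of a, b1 is the number of c\u1ea7u th\u1ee7"

# Unicode-aware: any letter, two or more, as a whole word
unicode_words = re.findall(r'\b[^\W\d_]{2,}\b', s)
# ASCII-only: a-z / A-Z, two or more, as a whole word
ascii_words = re.findall(r'\b[a-zA-Z]{2,}\b', s)

# Words the ASCII-only pattern misses
dropped = [w for w in unicode_words if w not in ascii_words]

print(ascii_words)  # ['is', 'the', 'number', 'of', 'is', 'the', 'number', 'of']
print(dropped)      # the two accented words at the end of s
```

Note that the ASCII-only pattern drops the accented words as a whole instead of returning fragments of them: in Python 3, \b is Unicode-aware for str patterns by default, so there is no word boundary between an ASCII letter and an adjacent accented letter.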

Details

  • To match only letters, you need to use [^\W\d_], i.e. any word char except digits and the underscore (or [a-zA-Z] for the ASCII-only variation)
  • To match whole words, you need word boundaries, \b
  • To make sure you are defining word boundaries and not backspace chars in the regex pattern, use a raw string literal, r'...'.
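
To illustrate the raw string point from the list above, here is a small sketch comparing the same pattern written with and without the r prefix. The sample text and variable names are made up for the illustration.

```python
import re

# In a normal string literal, '\b' is a single backspace character (U+0008)
backspace_len = len('\b')
# In a raw string literal, r'\b' stays as backslash + 'b', which the
# regex engine interprets as a word boundary
raw_len = len(r'\b')

text = "this is it"
# Without r'...': searches for backspace, "is", backspace
no_raw = re.findall('\bis\b', text)
# With r'...': searches for "is" as a whole word
with_raw = re.findall(r'\bis\b', text)

print(backspace_len, raw_len)  # 1 2
print(no_raw)                  # []
print(with_raw)                # ['is']
```

The "is" inside "this" is not returned by the raw-string version because there is no word boundary between "h" and "i".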

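So, r'\b[^\W\d_]{2,}\b' defines a regex that matches a word boundary, then two or more letters, and then asserts that there is no word char right after those letters. Since digits and underscores are word chars too, tokens such as a12 or Question1 are skipped entirely rather than matched in part.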
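
A minimal sketch of how this could be wrapped for reuse, assuming a helper named words_without_digits with a min_len parameter (both are hypothetical names, not part of the re module):

```python
import re

def words_without_digits(text, min_len=2):
    """Return whole words of at least min_len letters (hypothetical helper)."""
    # [^\W\d_] reads as "not a non-word char, not a digit, not an underscore",
    # which is essentially a letter. The \b on each side forces the whole
    # token to consist of such characters.
    pattern = re.compile(r'\b[^\W\d_]{%d,}\b' % min_len)
    return pattern.findall(text)

sample = "foo_bar ab1 ok x yes"
print(words_without_digits(sample))             # ['ok', 'yes']
print(words_without_digits(sample, min_len=3))  # ['yes']
```

Here foo_bar and ab1 are rejected as a whole because the underscore and the digit are word chars, so no word boundary exists next to them, and x is rejected for being shorter than min_len.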

answered Sep 30 '22 12:09 by Wiktor Stribiżew