Regex to match words both starting and ending with underscore with Python 3

I have the following sample code, in which I am trying to match all words that both start and end with an underscore (either a single or a double one).

import re
test = ['abc text_ abc',
'abc _text abc',
'abc text_textUnderscored abc',
'abc :_text abc', 
'abc _text_ abc', 
'abc __text__ abc',
'abc _text_: abc',
'abc (-_-) abc']
test_str = ' '.join(test)
print(re.compile('(_\\w+\\b)').split(test_str))

I have already tried the regex above, but it matches too much (it should match only _text_ and __text__).

Output: ['abc text_ abc abc ', '_text', ' abc abc text', '_textUnderscored', ' abc abc :', '_text', ' abc abc ', '_text_', ' abc abc ', '__text__', ' abc abc ', '_text_', ': abc abc (-_-) abc']

Can you suggest a better approach (preferably with a single regex pattern and the re.split method)?

azawalich Avatar asked Mar 05 '19 20:03

1 Answer

If you mean to match any chunk of word chars (letters, digits and underscores) that starts and ends with an underscore, is neither preceded nor followed by another word char, and may be of any length (even the single-char word _), you may use

r'\b_(?:\w*_)?\b'

with re.findall. See the regex demo.

If you do not want to match single-char words (i.e. _), remove the optional non-capturing group and use r'\b_\w*_\b'.

If you need to match words of at least 3 chars, also replace * (zero or more repetitions) with + (one or more repetitions): r'\b_\w+_\b'.
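
As a quick check of the three \b-based variants above, the sketch below runs each one against a few hand-picked words with re.fullmatch. The sample words and the variable names are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Hypothetical sample words, chosen to show the minimum length each variant accepts
words = ['_', '__', '_a_', '__text__', '_text']

variants = {
    'optional group': r'\b_(?:\w*_)?\b',  # accepts words of 1+ chars
    'star': r'\b_\w*_\b',                 # accepts words of 2+ chars
    'plus': r'\b_\w+_\b',                 # accepts words of 3+ chars
}

# For every variant, collect the sample words it matches in full
accepted = {
    name: [w for w in words if re.fullmatch(pattern, w)]
    for name, pattern in variants.items()
}

for name, matched in accepted.items():
    print(name, matched)
```

Only the variant with the optional group accepts the lone _, and only the + variant rejects __; none of them accepts _text, which has no closing underscore.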

If you consider words as whole words only when they are at the start/end of the string or are preceded/followed by whitespace, replace \b...\b with (?<!\S)...(?!\S):

r'(?<!\S)_\w*_(?!\S)'

See another regex demo.

Details

  • \b - a word boundary, there must be start of string or a non-word char right before
  • _ - an underscore
  • (?:\w*_)? - an optional non-capturing group matching 1 or 0 occurrences of
    • \w* - 0+ word chars (letters, digits, _s) (thanks to this optional group, even _ word will be found)
    • _ - an underscore
  • \b - a word boundary, there must be end of string or a non-word char right after
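  • (?<!\S) - left whitespace boundary, there must be start of string or a whitespace char right before
  • (?!\S) - right whitespace boundary, there must be end of string or a whitespace char right after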
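
To see how the two kinds of boundary differ in practice, here is a small sketch comparing \b...\b with (?<!\S)...(?!\S) on one line of text. The sample string is made up for this illustration.

```python
import re

# Hypothetical sample: one standalone word, one followed by ':', one inside parentheses
sample = 'abc _text_ abc _text_: abc (__x__) abc'

# Word boundaries: any non-word char (or a string edge) may sit next to the match
word_bounded = re.findall(r'\b_\w*_\b', sample)

# Whitespace boundaries: only whitespace (or a string edge) may sit next to the match
space_bounded = re.findall(r'(?<!\S)_\w*_(?!\S)', sample)

print(word_bounded)
print(space_bounded)
```

Word boundaries accept punctuation as a neighbour, so _text_: and (__x__) still yield matches; the whitespace lookarounds keep only the standalone _text_.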

See the Python demo:

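# test_str is the string built in the question
# The whitespace-boundary variant returns only the two standalone words
rx = re.compile(r'(?<!\S)_\w*_(?!\S)')
print(rx.findall(test_str))
# => ['_text_', '__text__']
# With r'\b_(?:\w*_)?\b' the result is ['_text_', '__text__', '_text_', '_']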
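
Since the question asks about re.split, here is a minimal sketch of the same idea with that method. Wrapping the pattern in a capturing group makes re.split keep the matched words in the result, at the odd indices. The shortened sample string and the variable names are assumptions for illustration.

```python
import re

# Shortened sample in the spirit of the question's test data
sample = 'abc _text_ abc __text__ abc _text_: abc'

# The capturing group makes re.split return the delimiters (the matched words) too
splitter = re.compile(r'((?<!\S)_\w*_(?!\S))')
parts = splitter.split(sample)

print(parts)
print(parts[1::2])  # the matched words only
```

Here _text_: stays inside the last chunk because the colon is not whitespace; swapping in the \b-based pattern would split on it as well.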
Wiktor Stribiżew Avatar answered Oct 03 '22 13:10
