I have the following sample code, in which I am trying to match all words that start and end with an underscore (either a single or a double one).
import re
test = ['abc text_ abc',
'abc _text abc',
'abc text_textUnderscored abc',
'abc :_text abc',
'abc _text_ abc',
'abc __text__ abc',
'abc _text_: abc',
'abc (-_-) abc']
test_str = ' '.join(test)
print(re.compile('(_\\w+\\b)').split(test_str))
I have already tried the regex above, but it matches too much (it should match only _text_ and __text__).
Output: ['abc text_ abc abc ', '_text', ' abc abc text', '_textUnderscored', ' abc abc :', '_text', ' abc abc ', '_text_', ' abc abc ', '__text__', ' abc abc ', '_text_', ': abc abc (-_-) abc']
Can you suggest a better approach (preferably with a single regex pattern and the re.split method)?
The _ (underscore) character in the regular expression means that the zone name must have an underscore immediately following the alphanumeric string matched by the preceding brackets. The . (period) matches any character (a wildcard).
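Inside a character range, \b represents the backspace character, for compatibility with Python's string literals. Outside a character range, \b matches the empty string at the beginning or end of a word, while \B matches the empty string only when it is not at the beginning or end of a word.
Regex does not treat the underscore as a special character: it is an ordinary word character, matched by \w.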
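The points above can be checked with a small sketch. The sample strings and variable names here are made up for illustration and are not part of the original question.

```python
import re

# \w covers letters, digits and the underscore, so "_" is a word character.
words = re.findall(r'\w+', 'a_b c-d')

# Outside a character class, \b is a zero-width word boundary.
boundary_hits = re.findall(r'\b_\b', 'x _ y')

# Inside a character class, [\b] matches a literal backspace (\x08).
backspaces = re.findall(r'[\b]', 'a\x08b')

print(words)          # ['a_b', 'c', 'd']
print(boundary_hits)  # ['_']
print(backspaces)     # ['\x08']
```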
If you mean to match any chunks of word chars (letters, digits and underscores) that start and end with an underscore, are neither preceded nor followed by other word chars, and may be of any length (even 1, i.e. a lone _), you may use
r'\b_(?:\w*_)?\b'
with re.findall. See the regex demo.
If you do not want to match single-char words (i.e. _), you need to remove the optional non-capturing group and use r'\b_\w*_\b'.
If you need to match words of at least 3 chars, also replace * (zero or more repetitions) with + (one or more repetitions).
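A minimal sketch comparing the three variants side by side. The sample string is an assumption chosen to show a 1-char, 2-char and 3-char word; it is not taken from the question.

```python
import re

# Hypothetical sample: a lone "_", a bare "__", a 3-char word and a longer one.
sample = '_ __ _a_ __text__'

any_length = re.findall(r'\b_(?:\w*_)?\b', sample)   # optional group: "_" allowed
two_or_more = re.findall(r'\b_\w*_\b', sample)       # needs two underscores
three_or_more = re.findall(r'\b_\w+_\b', sample)     # needs a char in between

print(any_length)     # ['_', '__', '_a_', '__text__']
print(two_or_more)    # ['__', '_a_', '__text__']
print(three_or_more)  # ['_a_', '__text__']
```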
If you consider words to be whole words only when they are at the start/end of the string or are preceded/followed by whitespace, replace \b...\b with (?<!\S)...(?!\S):
r'(?<!\S)_\w*_(?!\S)'
See another regex demo
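Since the question prefers re.split, here is one way the whitespace-boundary pattern might be used with it. This is a sketch, not the answer's own code: the capturing group is added so that re.split keeps the matched words, and the sample string is a shortened, made-up version of the test data.

```python
import re

# Shortened sample (an assumption, not the full test list from the question).
sample = 'abc _text_ abc __text__ abc _text_: abc (-_-) abc'

# The capturing group makes re.split keep the matched words in the result.
ws_bounded = re.compile(r'(?<!\S)(_\w*_)(?!\S)')

parts = ws_bounded.split(sample)
print(parts)
# ['abc ', '_text_', ' abc ', '__text__', ' abc _text_: abc (-_-) abc']
```

With these boundaries, _text_: and the _ inside (-_-) are left alone because they touch non-whitespace characters.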
Details
\b - a word boundary, there must be start of string or a non-word char right before
_ - an underscore
(?:\w*_)? - an optional non-capturing group matching 1 or 0 occurrences of
  \w* - 0+ word chars (letters, digits, _s)
  _ - an underscore
  (thanks to this optional group, even the _ word will be found)
\b - a word boundary, there must be end of string or a non-word char right after
(?<!\S) - left whitespace boundary
(?!\S) - right whitespace boundary
See the Python demo:
rx = re.compile(r'\b_(?:\w*_)?\b')
print(rx.findall(test_str))
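# => ['_text_', '__text__', '_text_', '_']
# (the second '_text_' comes from '_text_:' and the lone '_' from '(-_-)';
# use the (?<!\S)...(?!\S) variant to get only ['_text_', '__text__'])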
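The breakdown in the Details section can also be written into the pattern itself with re.VERBOSE, which ignores whitespace and allows comments inside the regex. This is only an illustrative rewrite of the same pattern; the name rx_verbose and the sample string are made up.

```python
import re

rx_verbose = re.compile(r"""
    \b            # word boundary: start of string or a non-word char before
    _             # an underscore
    (?: \w* _ )?  # optional group: 0+ word chars, then an underscore
    \b            # word boundary: end of string or a non-word char after
""", re.VERBOSE)

# Hypothetical sample: two full matches, a lone "_" and a non-matching "_text".
found = rx_verbose.findall('abc _text_ abc __text__ abc _ abc _text abc')
print(found)  # ['_text_', '__text__', '_']
```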