I have a regex ([-@.\/,':\w]*[\w])* that matches all words within a text (including punctuated words like I.B.M), but I want it to exclude underscores and I can't figure out how to do it. I tried adding ^[_] (e.g. (^[_][-@.\/,':\w]*[\w])*), but that just breaks all the words up into letters. I want to preserve the word matching, but I don't want words with underscores in them, nor words made up entirely of underscores.
What's the proper way to do this?
Update: here is an example:
"I.B.M should be parsed as one word w_o_r_d! Russian should work too: мплекс исторических событий."
The matches should be:
I.B.M
should
be
parsed
as
one
word
Russian
should
work
too
мплекс
исторических
событий
Note that w_o_r_d should not get matched.
A domain name may include lowercase and uppercase letters, numbers, period signs and dashes, but no underscores. \w matches the letters and digits, plus the underscore, but not periods or dashes.
Regex does not treat the underscore as a special character; in a pattern it simply matches a literal underscore.
\W matches any character that's not a letter, digit, or underscore. It prevents the regex from matching characters before or after the phrase.
The _ (underscore) character in the regular expression means that the zone name must have an underscore immediately following the alphanumeric string matched by the preceding brackets. The . (period) matches any character (a wildcard).
Try this instead:
([-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*
The \w class is essentially [\p{L}\p{Nd}\p{Pc}] when you're performing Unicode matching (or simply [a-zA-Z0-9_] if you're doing non-Unicode matching).
It's the \p{Pc} Unicode category (punctuation, connector) that causes the problem by matching underscores, so we explicitly match against the other categories without including that one.
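As a rough illustration of swapping \w for an underscore-free class, here is a sketch in Python. Python's built-in re module has no \p{L} or \p{Nd}, so the sketch assumes [^\W_] ("a word character that is not an underscore") as a close stand-in rather than an exact equivalent; the names PATTERN, sample and matches are made up for the example.

```python
import re

# Assumption: [^\W_] stands in for [\p{L}\p{Nd}], which Python's re lacks.
# The optional prefix may contain punctuation, but a match must end on a
# letter or digit, as in the regex above.
PATTERN = re.compile(r"(?:[-@./,':]|[^\W_])*[^\W_]")

sample = ("I.B.M should be parsed as one word w_o_r_d! "
          "Russian should work too: мплекс исторических событий.")

matches = PATTERN.findall(sample)
print(matches)
```

On this sample, I.B.M and the Russian words come out whole. Note that w_o_r_d is not skipped: because the underscore now acts as a separator, it is split into w, o, r and d.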
(For further information, see "Character Classes: Word Character" and "Character Classes: Supported Unicode General Categories".)
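If tokens containing underscores should be dropped entirely, as the question's example asks, one possible approach is to keep \w in the pattern so that such tokens are matched in one piece, and then filter them out afterwards. This is a swapped-in technique, not what the regex above does. A minimal Python sketch, where TOKEN and words_without_underscores are hypothetical names:

```python
import re

# The question's original pattern, minus the outer (...)* wrapper, since
# findall already scans for repeated matches. \w still includes "_" here,
# so a token such as "w_o_r_d" is captured as a single match.
TOKEN = re.compile(r"[-@./,':\w]*\w")

def words_without_underscores(text):
    """Return the tokens matched by TOKEN, dropping any that contain '_'."""
    return [tok for tok in TOKEN.findall(text) if "_" not in tok]

sample = ("I.B.M should be parsed as one word w_o_r_d! "
          "Russian should work too: мплекс исторических событий.")
print(words_without_underscores(sample))
```

Filtering after matching keeps the pattern itself simple; the trade-off is that the exclusion rule lives in code rather than in the regex.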