
Regex ignore underscores

I have a regex ([-@.\/,':\w]*[\w])* that matches all words within a text (including punctuated words like I.B.M), but I want it to exclude underscores and I can't figure out how to do it. I tried adding ^[_] (e.g. (^[_][-@.\/,':\w]*[\w])*), but that just breaks all the words up into letters. I want to preserve the word matching, but I don't want to match words that contain underscores, or words made up entirely of underscores.

What's the proper way to do this?

P.S.

  • My app is written in C# (if that makes any difference).
  • I can't use A-Za-z0-9 because I have to match words regardless of the language (could be Chinese, Russian, Japanese, German, English).

Update
Here is an example:

"I.B.M should be parsed as one word w_o_r_d! Russian should work too: мплекс исторических событий."

The matches should be:

I.B.M  
should  
be  
parsed  
as  
one  
word  
Russian  
should  
work  
too  
мплекс  
исторических  
событий  

Note that w_o_r_d should not get matched.

Kiril asked Mar 30 '11 23:03





1 Answer

Try this instead:

([-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*

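In .NET, the \w class is equivalent to [\p{L}\p{Mn}\p{Nd}\p{Pc}] when you're performing Unicode matching (the default), or simply [a-zA-Z_0-9] when RegexOptions.ECMAScript is specified. Note that the pattern above also leaves out \p{Mn} (nonspacing combining marks); add it to both character classes if your text contains combining accents.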

It's the \p{Pc} Unicode category (Punctuation, Connector) that causes the problem by matching underscores, so we explicitly match the other categories and leave that one out.

(For further information, see "Character Classes: Word Character" and "Character Classes: Supported Unicode General Categories" in the .NET regular expression documentation.)
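As a rough illustration, here is a minimal C# sketch that runs the pattern from this answer against the example sentence in the question. The class name WordMatcher, the method FindWords, and the constant names are hypothetical, made up for this example. Two caveats it tries to make visible: the outer * lets the pattern produce empty matches, so those are filtered out, and the answer's pattern on its own still matches the individual letters of w_o_r_d. The StrictPattern variant with lookarounds is an assumption on my part (it is not in the original answer) and is only one possible way to reject candidates that touch an underscore.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordMatcher
{
    // Pattern from the answer: \w replaced by letters (\p{L}) and decimal digits (\p{Nd}).
    public const string AnswerPattern = @"([-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*";

    // Assumption, not part of the answer: the lookbehind and lookahead reject a
    // candidate that is directly preceded or followed by a letter, digit, or underscore,
    // so fragments of an underscored word are not returned.
    public const string StrictPattern =
        @"(?<![\p{L}\p{Nd}_])[-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}](?![\p{L}\p{Nd}_])";

    // Returns every non-empty match of the pattern in the text, in order.
    public static string[] FindWords(string pattern, string text)
    {
        return Regex.Matches(text, pattern)
                    .Cast<Match>()
                    .Where(m => m.Length > 0) // a pattern ending in * also yields empty matches
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

With the example sentence from the question, FindWords(WordMatcher.AnswerPattern, text) returns the expected words but also w, o, r and d as four separate one-letter matches, while FindWords(WordMatcher.StrictPattern, text) returns only the 14 words listed in the question (with I.B.M as a single match).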

LukeH answered Oct 03 '22 06:10
