regex: \w EXCEPT underscore (add to class and then exclude from class)

This question applies to Python 3 regular expressions. I think it might apply to other languages as well.

The question could easily be misunderstood so I'll be careful in describing it.

As background, \w means "a word character." For bytes patterns, or when the re.ASCII flag is set, Python 3 treats this as just [a-zA-Z0-9_], but if the regular expression is a str, it is Unicode-aware by default, so \w means "any Unicode word character." This is generally a good thing, since people use different languages and it would be hard to construct a range like [a-zA-Z0-9_] that covers all languages at once. I think \w is therefore most useful in a multilingual setting.
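To make that background concrete, here is a minimal sketch using only the standard re module. The sample string is made up for illustration.

```python
import re

sample = "straße_日本語 42"  # hypothetical sample text

# str pattern: \w is Unicode-aware by default in Python 3
unicode_words = re.findall(r"\w+", sample)

# re.ASCII restricts \w to [a-zA-Z0-9_]
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print(unicode_words)  # ['straße_日本語', '42']
print(ascii_words)    # ['stra', 'e_', '42']
```

Note that the underscore is matched in both modes, which is the behavior I want to get rid of.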

But there is a problem: What if you don't want to match underscores because you don't think they're really a word character (for your particular application)?

If you're only focused on English applications, the best solution is probably to skip \w entirely and just use [a-zA-Z0-9]. But if you're focused on global applications and you don't want underscores, it seems like you might be in a really unfortunate situation. I haven't done it, but I assume it would be really tough to write a range that represents 100 languages at once just so you can avoid that underscore.

So my question is: Is there any way to use \w to match any Unicode word character, but somehow also exclude underscores (or some other undesirable character) from the class? I don't think I've seen anything like this described, but it would be highly useful. Something like [\w^_]. Of course that won't actually work, but what I mean is "use a character class that starts with everything represented by \w, but then go ahead and remove underscores from that class."
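For what it's worth, here is a quick check of why the naive [\w^_] does not do the job, as far as I understand it: a caret inside a class only negates when it is the first character, so here it is just a literal. The variable names are only for illustration.

```python
import re

# Inside a class, ^ only negates when it is the first character;
# anywhere else it is a literal caret, so nothing is subtracted.
naive = re.compile(r"[\w^_]")

underscore_matches = naive.fullmatch("_") is not None
caret_matches = naive.fullmatch("^") is not None

print(underscore_matches, caret_matches)  # True True
```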

Thoughts?

Stephen asked Mar 04 '23 13:03



1 Answer

I have two options.

  1. [^\W_]

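    This does exactly what you want and works with the stock re module. The negated class matches any character that is neither a non-word character (\W) nor an underscore, which is the same as \w minus _. It's also straightforward.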

  2. With the regex module: [\w--_] (set difference; [[\w]--[_]] is the same thing written with nested sets). Note that you need the "V1" flag set, so you need either

    import regex
    r = regex.compile(r"(?V1)[\w--_]")
    

    or

    r = regex.compile(r"[\w--_]", flags=regex.V1)
    

    IMO this is more readable if you're familiar with Matthew Barnett's regex module, a third-party package (pip install regex) that is more powerful than Python's stock re.
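A minimal sketch of option 1 on mixed-script text, using only the standard re module. The sample string and variable names are made up for illustration.

```python
import re

# [^\W_] reads as "not (a non-word character or an underscore)",
# i.e. every Unicode word character except "_".
word_no_underscore = re.compile(r"[^\W_]+")

sample = "snake_case straße_日本語 x_1"  # hypothetical sample text
tokens = word_no_underscore.findall(sample)

print(tokens)  # ['snake', 'case', 'straße', '日本語', 'x', '1']
```

The underscore now acts as a separator while non-ASCII letters and digits still match. With the V1 flag, option 2 should give the same tokens for this input.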

iBug answered Apr 09 '23 05:04
