
Can I mix character classes in Python RegEx?

Tags: python, regex

Special sequences (character classes) in Python RegEx are escapes like \w or \d that match a set of characters.

In my case, I need to be able to match all alpha-numerical characters except numbers.

That is, \w minus \d.

I need to use the special sequence \w because I'm dealing with non-ASCII characters and need to match symbols like "Æ" and "Ø".

One would think I could use the expression [\w^\d], but it doesn't seem to match anything, and I'm not sure why.

So in short, how can I mix (add/subtract) special sequences in Python Regular Expressions?


EDIT: I accidentally used [\W^\d] instead of [\w^\d]. The latter does indeed match something, including parentheses and commas which are not alpha-numerical characters as far as I'm concerned.

Hubro asked Sep 10 '12 09:09




2 Answers

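You can use r"[^\W\d]", i.e., negate the union of non-word characters and digits. What remains is everything in \w that is not in \d. Note that \w also includes the underscore, so this class still matches "_".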
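A minimal sketch of how this behaves, assuming Python 3 (where \w is Unicode-aware for str patterns by default). The sample strings are made up for illustration. It also shows why [\w^\d] does not subtract anything: inside a class, ^ only negates when it is the first character, and anywhere else it is a literal caret.

```python
import re

# Hypothetical sample input containing non-ASCII letters, digits and punctuation.
text = "Æble_1 (Øl, 42)"

# Here ^ is not first in the class, so it is a literal caret:
# the class means "a word character, a caret, or a digit".
mixed = re.findall(r"[\w^\d]", "a1^(")

# Negated class: "not (non-word or digit)" is the same as "word and not digit".
letters = re.findall(r"[^\W\d]+", text)

print(mixed)    # ['a', '1', '^']
print(letters)  # ['Æble_', 'Øl']
```

The underscore survives in the second result because it belongs to \w. If it should be excluded as well, adding it to the negated class, as in r"[^\W\d_]", is one option.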

Janne Karila answered Oct 18 '22 02:10


You cannot subtract character classes with the standard re module, no.

Your best bet is to use the third-party regex module (available on PyPI), which was written as a prospective replacement for the re module in Python. It supports character classes based on Unicode properties:

\p{IsAlphabetic}

This will match any character that the Unicode specification states is an alphabetic character.

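Even better, regex does support character class subtraction: it treats such classes as sets and lets you take a difference with the -- operator. Set operations are part of the module's VERSION1 behaviour, so the pattern needs the regex.V1 flag or an inline (?V1):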

[\w--\d]

matches everything in \w except anything that also matches \d.
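As a rough sketch of the set difference in use, assuming the third-party regex package is installed (pip install regex). The helper name word_chars_without_digits and the sample string are made up for illustration, and the import is guarded so the snippet does not fail where the package is missing.

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None


def word_chars_without_digits(text):
    """Return runs of word characters that are not digits (hypothetical helper)."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is not installed")
    # (?V1) selects VERSION1 behaviour, which enables set operations such as --.
    return regex.findall(r"(?V1)[\w--\d]+", text)


if regex is not None:
    print(word_chars_without_digits("Æble 123 Øl42"))  # ['Æble', 'Øl']
```

As with the negated class in re, the underscore is still part of \w here.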

Martijn Pieters answered Oct 18 '22 01:10