
What does "(?u)" do in a regex?

I looked into how tokenization is implemented in scikit-learn and found this regex (source):

token_pattern = r"(?u)\b\w\w+\b" 

The regex is pretty straightforward, but I have never seen the (?u) part before. Can someone explain to me what this part does?

fwind asked Jan 27 '16 16:01



1 Answer

It switches on the re.U (re.UNICODE) flag for this expression.
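As a rough illustration, the sketch below compiles the pattern from the question once with the inline (?u) and once with re.UNICODE passed as an argument, then compares the tokens. The sample text and variable names are made up for this example. Note that in Python 3, str patterns already match Unicode by default, so the flag mostly matters for Python 2 code; the re.ASCII variant is included only to show what \w does without Unicode matching.

```python
import re

# The scikit-learn token pattern quoted in the question.
token_pattern = r"(?u)\b\w\w+\b"

# Same pattern, with the flag given inline versus as an argument.
inline = re.compile(token_pattern)
explicit = re.compile(r"\b\w\w+\b", re.UNICODE)

# Contrast: ASCII-only matching, where \w is limited to [a-zA-Z0-9_].
ascii_only = re.compile(r"\b\w\w+\b", re.ASCII)

# Hypothetical sample text: "naïve café a über", written with escapes
# so the code points are unambiguous.
text = "na\u00efve caf\u00e9 a \u00fcber"

tokens_inline = inline.findall(text)
tokens_explicit = explicit.findall(text)
tokens_ascii = ascii_only.findall(text)

print(tokens_inline)    # whole words, accented letters included
print(tokens_explicit)  # same result as the inline form
print(tokens_ascii)     # words are split at the non-ASCII letters
```

The single-letter word "a" is dropped in every case because the pattern requires at least two word characters.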

From the module documentation:

(?iLmsux)

(One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.
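The equivalence described in the quoted documentation can be sketched with two other flags. This is a minimal example, assuming a made-up pattern and sample string; it combines "i" and "s" inline and compares the result with passing re.I | re.S to re.compile().

```python
import re

# Inline form: the flags are written at the very start of the pattern.
inline = re.compile(r"(?is)hello.world")

# Argument form: the same flags passed to re.compile().
explicit = re.compile(r"hello.world", re.I | re.S)

# No flags: case-sensitive, and "." does not match a newline.
plain = re.compile(r"hello.world")

# Hypothetical sample with different case and an embedded newline.
sample = "HELLO\nWORLD"

inline_match = inline.search(sample)
explicit_match = explicit.search(sample)
plain_match = plain.search(sample)

print(inline_match is not None)    # True
print(explicit_match is not None)  # True
print(plain_match is not None)     # False
```

Keeping the flags inside the pattern is handy when only a pattern string can be supplied, as with the token_pattern parameter in the question.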

Martijn Pieters answered Jan 12 '23 00:01