What does "(?u)" do in a regex?

I looked into how tokenization is implemented in scikit-learn and found this regex (source):

token_pattern = r"(?u)\b\w\w+\b"

The regex is pretty straightforward, but I have never seen the (?u) part before. Can someone explain what this part is doing?

asked Jan 27 '16 16:01

fwind

1 Answer

It switches on the re.U (re.UNICODE) flag for this expression.

From the module documentation:

(?iLmsux)

(One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.
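
To see the flag in action, here is a small sketch, assuming Python 3 (the passage quoted above appears to come from the older Python 2 documentation). In Python 3, str patterns are already Unicode-aware by default, so (?u) is effectively redundant there and kept for compatibility; re.ASCII is used below only as a contrast to show how \w and \b behave without Unicode matching. The sample text and variable names are made up for illustration.

```python
import re

# Pattern from the question; (?u) is the inline form of re.UNICODE.
token_pattern = r"(?u)\b\w\w+\b"

# Hypothetical sample text ("naive" and "cafe" with accented letters),
# written with escapes so each accented letter is a single code point.
text = "na\u00efve caf\u00e9 tokens a b"

# Same pattern three ways: inline flag, explicit flag, ASCII-only contrast.
inline = re.compile(token_pattern)
explicit = re.compile(r"\b\w\w+\b", re.UNICODE)
ascii_only = re.compile(r"\b\w\w+\b", re.ASCII)

unicode_tokens = inline.findall(text)
explicit_tokens = explicit.findall(text)
ascii_tokens = ascii_only.findall(text)

print(unicode_tokens)                     # accented words stay whole
print(explicit_tokens == unicode_tokens)  # inline and explicit flags agree
print(ascii_tokens)                       # accented letters split the words
```

With Unicode matching, the accented letters count as word characters, so each word is one token. With re.ASCII they are treated as non-word characters and break the words apart. In both cases the one-letter tokens are dropped, because \w\w+ requires at least two word characters.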

answered Jan 12 '23 00:01

Martijn Pieters
