I looked into how tokenization is implemented in scikit-learn and found this regex (source):
token_pattern = r"(?u)\b\w\w+\b"
The regex is pretty straightforward, but I have never seen the (?u) part before. Can someone explain to me what this part is doing?
It switches on the re.U (re.UNICODE) flag for this expression.
From the module documentation:
(?iLmsux)
(One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.
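To see the effect, here is a minimal sketch that compares the inline (?u) flag with passing re.UNICODE explicitly, and with re.ASCII as a contrast. The sample sentence is made up for illustration. Note that the documentation quoted above is from Python 2; in Python 3, str patterns are Unicode-aware by default, so (?u) is effectively redundant there and is kept mostly for backward compatibility.

```python
import re

# The pattern from the question: tokens of two or more word characters.
token_pattern = r"(?u)\b\w\w+\b"

# Made-up sample text containing non-ASCII letters (i-diaeresis, e-acute).
text = "na\u00efve caf\u00e9 is ok"

# Inline flag: (?u) switches on re.UNICODE for the whole expression.
inline = re.findall(token_pattern, text)

# Same pattern without the inline flag, passing the flag as an argument.
explicit = re.findall(r"\b\w\w+\b", text, flags=re.UNICODE)

# For contrast: with re.ASCII, \w and \b only treat [a-zA-Z0-9_] as word
# characters, so the accented letters split the tokens.
ascii_only = re.findall(r"\b\w\w+\b", text, flags=re.ASCII)

print(inline)
print(explicit)
print(ascii_only)
```

With Unicode matching the accented words stay whole, while the ASCII variant cuts them at the accented letters, which is why a tokenizer meant for arbitrary text wants the Unicode behaviour.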