I have the following text:
I don't like to eat Cici's food (it is true)
I need to tokenize it to
['i', 'don't', 'like', 'to', 'eat', 'Cici's', 'food', '(', 'it', 'is', 'true', ')']
I have found that the regex (['()\w]+|\.)
splits it like this:
['i', 'don't', 'like', 'to', 'eat', 'Cici's', 'food', '(it', 'is', 'true)']
How do I get the parentheses out of those tokens and make each one its own token?
Thanks for ideas.
NLTK's RegexpTokenizer splits a string into substrings using a regular expression; for example, a tokenizer can form tokens out of alphabetic sequences, money expressions, and any other non-whitespace sequences. You can call regexp_tokenize(string, pattern) with your string and a candidate pattern as arguments to experiment for yourself and see which tokenizer works best.
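A minimal sketch of the NLTK route, assuming the nltk package is installed (the pattern is the word-or-punctuation one used in the answer):

```python
# Sketch: tokenizing with NLTK's RegexpTokenizer (requires the nltk package).
from nltk.tokenize import RegexpTokenizer

s = "I don't like to eat Cici's food (it is true)"
# Match a word with an optional 'xxx suffix, or any single
# character that is neither a word character nor whitespace.
tokenizer = RegexpTokenizer(r"\w+(?:'\w+)?|[^\w\s]")
print(tokenizer.tokenize(s))
```

RegexpTokenizer with gaps left at its default (False) behaves like re.findall, so the parentheses come out as their own tokens.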
When you want to tokenize a string with regex with special restrictions on context, you may use a matching approach that usually yields cleaner output (especially when it comes to empty elements in the resulting list).
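The difference is easy to see on a small string; a minimal sketch contrasting re.split with re.findall (the short string and patterns here are just illustrative):

```python
import re

s = "a (b) c"

# Splitting on delimiters keeps the captured separators, but it can also
# leave empty strings wherever two delimiters are adjacent:
print(re.split(r"([()\s])", s))
# ['a', ' ', '', '(', 'b', ')', '', ' ', 'c']

# Matching the tokens you want instead yields a clean list:
print(re.findall(r"[^()\s]+|[()]", s))
# ['a', '(', 'b', ')', 'c']
```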
Any word character is matched with \w and any non-word character with \W. If you wanted to tokenize the string into word and non-word chunks, you could use the \w+|\W+ regex. However, in your case, you want to match chunks of word characters that are optionally followed by ' and one or more further word characters, plus any other single character that is not whitespace.
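For illustration, \w+|\W+ chops the whole string into alternating word and non-word runs, which is usually not what you want for apostrophes:

```python
import re

# \w+|\W+ covers every character: alternating word and non-word runs,
# so the apostrophe is split off into its own token.
print(re.findall(r"\w+|\W+", "Cici's food"))
# ['Cici', "'", 's', ' ', 'food']
```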
Use
re.findall(r"\w+(?:'\w+)?|[^\w\s]", s)
Here, \w+(?:'\w+)? matches words like people or people's, and [^\w\s] matches a single character that is neither a word character nor whitespace.
Python demo:
import re
rx = r"\w+(?:'\w+)?|[^\w\s]"
s = "I don't like to eat Cici's food (it is true)"
print(re.findall(rx, s))
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']
Another pattern that will tokenize using ( and ) as separate tokens:
[^()\s]+|[()]
Here, [^()\s]+ matches 1 or more characters other than (, ) and whitespace, and [()] matches either ( or ).
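A quick check of this second pattern on the question's own string:

```python
import re

s = "I don't like to eat Cici's food (it is true)"
# Keep runs of non-paren, non-space characters; emit each paren on its own.
print(re.findall(r"[^()\s]+|[()]", s))
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']
```

Note this variant keeps the apostrophes inside the words without matching them explicitly, since ' is neither a parenthesis nor whitespace.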