I have the following text:
I don't like to eat Cici's food (it is true)
I need to tokenize it to
['i', 'don't', 'like', 'to', 'eat', 'Cici's', 'food', '(', 'it', 'is', 'true', ')']
I have found that the regex (['()\w]+|\.)
splits it like this:
['i', 'don't', 'like', 'to', 'eat', 'Cici's', 'food', '(it', 'is', 'true)']
How do I get the parentheses out of those tokens and make each one its own token?
Thanks for ideas.
NLTK's RegexpTokenizer splits a string into substrings using a regular expression; for example, a tokenizer can form tokens out of alphabetic sequences, money expressions, and any other non-whitespace sequences. You can call regexp_tokenize(string, pattern) with your string and a candidate pattern as arguments to experiment for yourself and see which tokenizer works best.
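A minimal sketch of the NLTK route, assuming the nltk package is installed (the pattern is the word-or-punctuation one used in the answer):

```python
# Sketch: tokenizing with NLTK's RegexpTokenizer (requires the nltk package).
from nltk.tokenize import RegexpTokenizer

s = "I don't like to eat Cici's food (it is true)"
# Match a word with an optional 'xxx suffix, or any single
# character that is neither a word character nor whitespace.
tokenizer = RegexpTokenizer(r"\w+(?:'\w+)?|[^\w\s]")
print(tokenizer.tokenize(s))
```

RegexpTokenizer with gaps left at its default (False) behaves like re.findall, so the parentheses come out as their own tokens.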
When you want to tokenize a string with regex with special restrictions on context, you may use a matching approach that usually yields cleaner output (especially when it comes to empty elements in the resulting list).
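The difference is easy to see on a small string; a minimal sketch contrasting re.split with re.findall (the short string and patterns here are just illustrative):

```python
import re

s = "a (b) c"

# Splitting on delimiters keeps the captured separators, but it can also
# leave empty strings wherever two delimiters are adjacent:
print(re.split(r"([()\s])", s))
# ['a', ' ', '', '(', 'b', ')', '', ' ', 'c']

# Matching the tokens you want instead yields a clean list:
print(re.findall(r"[^()\s]+|[()]", s))
# ['a', '(', 'b', ')', 'c']
```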
Any word character is matched with \w and any non-word character with \W. If you wanted to tokenize the string into word and non-word chunks, you could use the \w+|\W+ regex. However, in your case, you want to match chunks of word characters that are optionally followed by ' and one or more further word characters, plus any other single character that is not whitespace.
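For illustration, \w+|\W+ chops the whole string into alternating word and non-word runs, which is usually not what you want for apostrophes:

```python
import re

# \w+|\W+ covers every character: alternating word and non-word runs,
# so the apostrophe is split off into its own token.
print(re.findall(r"\w+|\W+", "Cici's food"))
# ['Cici', "'", 's', ' ', 'food']
```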
Use
re.findall(r"\w+(?:'\w+)?|[^\w\s]", s)
Here, \w+(?:'\w+)? matches words like people or people's, and [^\w\s] matches a single character that is neither a word character nor whitespace.
Python demo:
import re
rx = r"\w+(?:'\w+)?|[^\w\s]"
s = "I don't like to eat Cici's food (it is true)"
print(re.findall(rx, s))
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']
Another pattern that will tokenize using ( and ) as separate tokens:
[^()\s]+|[()]
Here, [^()\s]+ matches 1 or more characters other than (, ) and whitespace, and [()] matches either ( or ).
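A quick check of this second pattern on the question's own string:

```python
import re

s = "I don't like to eat Cici's food (it is true)"
# Keep runs of non-paren, non-space characters; emit each paren on its own.
print(re.findall(r"[^()\s]+|[()]", s))
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']
```

Note this variant keeps the apostrophes inside the words without matching them explicitly, since ' is neither a parenthesis nor whitespace.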