Tokenize by using regular expressions (parenthesis)

I have the following text:

I don't like to eat Cici's food (it is true)

I need to tokenize it to

['i', 'don't', 'like', 'to', 'eat', 'Cici's', 'food', '(', 'it', 'is', 'true', ')']

I have found that the following regex, (['()\w]+|\.), splits it like this:

['i', 'don't', 'like', 'to', 'eat', 'Cici's', 'food', '(it', 'is', 'true)']

How do I separate the parentheses from the adjacent words so that each parenthesis becomes its own token?

Thanks for ideas.

asked Mar 29 '17 by Jürgen K.


1 Answer

When you want to tokenize a string with special restrictions on context, a matching approach (re.findall) usually yields cleaner output than splitting, especially when it comes to empty elements in the resulting list.

Any word character is matched with \w and any non-word character with \W. If you wanted to tokenize the string into word and non-word chunks, you could use the regex \w+|\W+. In your case, however, you want to match chunks of word characters optionally followed by ' and one or more word characters, plus any other single non-whitespace character.
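As a quick illustration of why plain \w+|\W+ is not enough here: it turns every run of non-word characters, including whitespace, into a token of its own, and it splits don't at the apostrophe.

```python
import re

s = "I don't like to eat Cici's food (it is true)"
# \w+|\W+ alternates between word-character runs and non-word runs;
# whitespace and apostrophes end up as separate tokens.
print(re.findall(r"\w+|\W+", s))
```

That output contains tokens like ' ' and "'", which is exactly the noise the pattern below avoids.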

Use

re.findall(r"\w+(?:'\w+)?|[^\w\s]", s)

Here, \w+(?:'\w+)? matches the words like people or people's, and [^\w\s] matches a single character other than word and whitespace character.


Python demo:

import re
rx = r"\w+(?:'\w+)?|[^\w\s]"
s = "I don't like to eat Cici's food (it is true)"
print(re.findall(rx, s))
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']

Here is another pattern that tokenizes only on ( and ):

[^()\s]+|[()]


Here, [^()\s]+ matches one or more characters other than (, ), and whitespace, and [()] matches either ( or ).
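A quick check of this second pattern against the same sample sentence (note that, unlike the first pattern, it keeps any punctuation other than parentheses attached to the word):

```python
import re

# Tokenize on parentheses only, keeping each parenthesis as its own token
rx = r"[^()\s]+|[()]"
s = "I don't like to eat Cici's food (it is true)"
print(re.findall(rx, s))
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']
```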

answered Oct 04 '22 by Wiktor Stribiżew