 

Why "is" and "to" are removed by my regular expression in NLTK RegexpTokenizer()?

I want to tokenize

s = ("mary went to garden. where is mary? "
     "mary is carrying apple and milk. "
     "what mary is carrying? apple,milk")

into

['mary', 'went', 'to', 'garden', '.', 
 'where', 'is', 'mary', '?', 
 'mary', 'is', 'carrying', 'apple', 'and', 'milk', '.', 
 'what', 'mary', 'is', 'carrying', '?', 'apple,milk']

Please note that I want to keep 'apple,milk' as one word.

My code is:

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+[\]|\w+[\,]\w+|\.|\?')
s = "mary went to garden. where is mary? mary is carrying apple and milk. what mary is carrying? apple,milk"
tokenizer.tokenize(s)

the result is:

['mary', 'went', 'garden', '.', 
 'where', 'mary', '?', 
 'mary', 'carrying', 'apple', 'and', 'milk', '.', 
 'what', 'mary', 'carrying', '?', 'apple,milk']

However, 'is' and 'to' are missing. How can I keep them?

asked Dec 05 '25 by KoalaJ

1 Answer

Your regex pattern simply does not capture the missing words: because of the broken character class, the first alternative effectively requires a word of at least three characters, so two-letter words like 'is' and 'to' never match.

You can see this with a regex tool, or by calling RegexpTokenizer(r'\w+[\]|\w+[\,]\w+|\.|\?', gaps=True), where the gaps parameter makes the tokenizer return the text between matches instead of the matches themselves (doc).
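As a sketch of what gaps=True means, using only the standard library rather than NLTK: a gaps tokenizer treats the pattern as a separator, so the tokens are the pieces between matches, much like re.split.

```python
import re

# Illustration of the gaps idea with a simple whitespace pattern:
# the pattern marks the separators, and the text *between* the
# matches becomes the tokens (this mirrors gaps=True in NLTK).
parts = re.split(r'\s+', "mary went to garden.")
print(parts)  # ['mary', 'went', 'to', 'garden.']
```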

Update:
Here is a pattern that finds all the tokens as specified by you:

\w+[\,]\w+|\w+|\.|\?

Remarks: When a regex uses alternation, the order of the alternatives matters, because the first alternative that matches wins; it is usually best to sort them from longest to shortest so that a longer match is not pre-empted by a shorter one. Also, the [\] fragment in your original pattern does not make sense to me and is not syntactically correct.
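The ordering remark can be demonstrated with plain re.findall (an illustrative sketch, not NLTK itself):

```python
import re

s = "apple,milk"

# Shorter alternative first: \w+ matches "apple" before the
# comma form ever gets a chance, so the pair is split in two.
wrong = re.findall(r'\w+|\w+,\w+', s)
print(wrong)  # ['apple', 'milk']

# Longer alternative first: "apple,milk" is kept as one token.
right = re.findall(r'\w+,\w+|\w+', s)
print(right)  # ['apple,milk']
```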

Online demo
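Putting it together: RegexpTokenizer with the default gaps=False essentially runs re.findall with the given pattern, so the corrected pattern can be sanity-checked with the standard library alone (a sketch; the real code would use nltk.tokenize.RegexpTokenizer):

```python
import re

# The corrected pattern: word,word first, then bare words,
# then the punctuation alternatives.
pattern = r'\w+[\,]\w+|\w+|\.|\?'
s = ("mary went to garden. where is mary? "
     "mary is carrying apple and milk. "
     "what mary is carrying? apple,milk")

# This produces the token list from the question, with 'is' and
# 'to' kept and 'apple,milk' as one token.
print(re.findall(pattern, s))
```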

answered Dec 08 '25 by wp78de