I want to tokenize
s = ("mary went to garden. where is mary? "
"mary is carrying apple and milk. "
"what mary is carrying? apple,milk")
into
['mary', 'went', 'to', 'garden', '.',
'where', 'is', 'mary', '?',
'mary', 'is', 'carrying', 'apple', 'and', 'milk', '.',
'what', 'mary', 'is', 'carrying', '?', 'apple,milk']
Please note that I want to keep 'apple,milk' as one word.
My code is:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer('\w+[\]|\w+[\,]\w+|\.|\?')
s = "mary went to garden. where is mary? mary is carrying apple and milk. what mary is carrying? apple,milk"
tokenizer.tokenize(s)
the result is:
['mary', 'went', 'garden', '.',
'where', 'mary', '?',
'mary', 'carrying', 'apple', 'and', 'milk', '.',
'what', 'mary', 'carrying', '?', 'apple,milk']
However, 'is' and 'to' are missing. How can I keep them?
Your regex pattern simply does not capture the missing words.
You can see this with a regex testing tool, or by passing gaps=True to the tokenizer, RegexpTokenizer('\w+[\]|\w+[\,]\w+|\.|\?', gaps=True), which returns the text between the matches instead of the matches themselves (see the RegexpTokenizer documentation).
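For example, a minimal sketch of that diagnostic (it assumes s is the string from your question; the r'' prefix just keeps Python from interpreting the backslashes):

from nltk.tokenize import RegexpTokenizer
# gaps=True: return the substrings *between* pattern matches instead of the matches
gap_tokenizer = RegexpTokenizer(r'\w+[\]|\w+[\,]\w+|\.|\?', gaps=True)
print(gap_tokenizer.tokenize(s))
# the skipped words show up in the gaps, e.g. ' is ' and ' to '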
Update:
Here is a pattern that finds all the tokens as specified by you:
\w+[\,]\w+|\w+|\.|\?
Remarks: When using regex alternation, order matters because the engine tries the alternatives left to right, so put the longest (most specific) alternative first; here \w+[\,]\w+ must come before \w+, or \w+ would consume 'apple' and stop at the comma. The [\] in your pattern does not do what you intended: \] escapes the closing bracket, so the character class silently runs on until the ] of [\,] and ends up containing word characters among others. Your first alternative therefore requires at least three characters (\w+, one character from the class, \w+), which is exactly why the two-letter words 'is' and 'to' are dropped.
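Putting it together, a minimal sketch with the corrected pattern:

from nltk.tokenize import RegexpTokenizer
# longest alternative first: word,word must be tried before a plain word
tokenizer = RegexpTokenizer(r'\w+[\,]\w+|\w+|\.|\?')
s = ("mary went to garden. where is mary? "
     "mary is carrying apple and milk. "
     "what mary is carrying? apple,milk")
print(tokenizer.tokenize(s))
# ['mary', 'went', 'to', 'garden', '.', 'where', 'is', 'mary', '?', ...,
#  'what', 'mary', 'is', 'carrying', '?', 'apple,milk']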
Online demo