I'm looking at NLTK for Python, but it splits (tokenizes) "won't" as ['wo', "n't"]. Are there libraries that do this more robustly?
I know I can build a regex of some sort to solve this problem, but I'm looking for a library/tool because that would be a more direct approach. For example, after trying a basic regex that handled periods and commas, I realized that words like 'Mr.' would break it.
(@artsiom)
If the sentence was "you won't?", split() will give me ["you", "won't?"], so there's an extra '?' that I have to deal with. I'm looking for a tried and tested method that does away with kinks like the one above, as well as the many other exceptions that I'm sure exist. Of course, I'll resort to split() with a regex if I don't find anything.
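For reference, this is the sort of naive regex fallback I mean (just a sketch; the pattern and sample text are illustrative):

>>> import re
>>> # keep word characters and apostrophes together, split off any other punctuation
>>> re.findall(r"[\w']+|[^\w\s]", "you won't? Ask Mr. Smith.")
['you', "won't", '?', 'Ask', 'Mr', '.', 'Smith', '.']

As expected, the period in 'Mr.' gets split off just like sentence-ending punctuation, which is exactly the kind of kink I'd like a library to handle for me.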
The Natural Language Toolkit (NLTK) is probably what you need.
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("'Hello. This is a test. It works!")
["'Hello", '.', 'This', 'is', 'a', 'test', '.', 'It', 'works', '!']
>>> word_tokenize("I won't fix your computer")
['I', 'wo', "n't", 'fix', 'your', 'computer']
nltk.tokenize.word_tokenize by default uses the TreebankWordTokenizer, a word tokenizer that tokenizes sentences according to the Penn Treebank conventions. Note that this tokenizer assumes the text has already been segmented into sentences.
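For raw text, you can run sent_tokenize first and then word_tokenize on each sentence. A minimal sketch, assuming the punkt models have been downloaded via nltk.download('punkt'); exact output may vary slightly between NLTK versions:

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> text = "Mr. Smith won't go. Will you?"
>>> [word_tokenize(sentence) for sentence in sent_tokenize(text)]
[['Mr.', 'Smith', 'wo', "n't", 'go', '.'], ['Will', 'you', '?']]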
You can also test some of the other tokenizers provided by NLTK (e.g. WordPunctTokenizer, WhitespaceTokenizer, ...) to see which behaviour suits your needs.
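A rough comparison of those two on the same input (a sketch only; both are simple regex/whitespace tokenizers, so the splits below follow directly from their rules):

>>> from nltk.tokenize import WordPunctTokenizer, WhitespaceTokenizer
>>> # WordPunctTokenizer splits on every punctuation boundary, including the apostrophe
>>> WordPunctTokenizer().tokenize("I won't fix your computer!")
['I', 'won', "'", 't', 'fix', 'your', 'computer', '!']
>>> # WhitespaceTokenizer only splits on whitespace, so punctuation stays attached
>>> WhitespaceTokenizer().tokenize("I won't fix your computer!")
['I', "won't", 'fix', 'your', 'computer!']

Neither handles contractions the way the Treebank tokenizer does, so pick whichever split best fits your downstream processing.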