I'm looking at NLTK for Python, but it splits (tokenizes) "won't" as ['wo', "n't"]. Are there libraries that do this more robustly?
I know I can build a regex of some sort to solve this problem, but I'm looking for a library/tool because that would be a more direct approach. For example, after trying a basic regex that handled periods and commas, I realized that words like 'Mr.' would break it.
(@artsiom)
If the sentence was "you won't?", split() will give me ["you", "won't?"], so there's an extra '?' that I have to deal with. I'm looking for a tried and tested method that does away with kinks like the one above, as well as the many other exceptions that I'm sure exist. Of course, I'll resort to split() with a regex if I don't find anything.
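For reference, this is the sort of naive regex fallback I mean (just a sketch; the pattern and sample text are illustrative):

>>> import re
>>> # keep word characters and apostrophes together, split off any other punctuation
>>> re.findall(r"[\w']+|[^\w\s]", "you won't? Ask Mr. Smith.")
['you', "won't", '?', 'Ask', 'Mr', '.', 'Smith', '.']

As expected, the period in 'Mr.' gets split off just like sentence-ending punctuation, which is exactly the kind of kink I'd like a library to handle for me.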
The Natural Language Toolkit (NLTK) is probably what you need.
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("'Hello. This is a test. It works!")
["'Hello", '.', 'This', 'is', 'a', 'test', '.', 'It', 'works', '!']
>>> word_tokenize("I won't fix your computer")
['I', 'wo', "n't", 'fix', 'your', 'computer']
nltk.tokenize.word_tokenize by default uses the TreebankWordTokenizer, a word tokenizer that tokenizes sentences according to the Penn Treebank conventions. Note that this tokenizer assumes the text has already been segmented into sentences.
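For raw text, you can run sent_tokenize first and then word_tokenize on each sentence. A minimal sketch, assuming the punkt models have been downloaded via nltk.download('punkt'); exact output may vary slightly between NLTK versions:

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> text = "Mr. Smith won't go. Will you?"
>>> [word_tokenize(sentence) for sentence in sent_tokenize(text)]
[['Mr.', 'Smith', 'wo', "n't", 'go', '.'], ['Will', 'you', '?']]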
You can also test some of the other tokenizers provided by NLTK (e.g. WordPunctTokenizer, WhitespaceTokenizer, ...) to see which behaviour suits your needs.
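A rough comparison of those two on the same input (a sketch only; both are simple regex/whitespace tokenizers, so the splits below follow directly from their rules):

>>> from nltk.tokenize import WordPunctTokenizer, WhitespaceTokenizer
>>> # WordPunctTokenizer splits on every punctuation boundary, including the apostrophe
>>> WordPunctTokenizer().tokenize("I won't fix your computer!")
['I', 'won', "'", 't', 'fix', 'your', 'computer', '!']
>>> # WhitespaceTokenizer only splits on whitespace, so punctuation stays attached
>>> WhitespaceTokenizer().tokenize("I won't fix your computer!")
['I', "won't", 'fix', 'your', 'computer!']

Neither handles contractions the way the Treebank tokenizer does, so pick whichever split best fits your downstream processing.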