 

Is there a library for splitting a sentence into a list of the words in it?

Tags: python, regex, nlp

I'm looking at NLTK for Python, but it splits (tokenizes) won't as ['wo', "n't"]. Are there libraries that handle this more robustly?

I know I could build a regex of some sort to solve this problem, but I'm looking for a library/tool because it would be a more direct approach. For example, after trying a basic regex with periods and commas, I realized that words like 'Mr. ' would break it.

(@artsiom)

If the sentence is "you won't?", split() will give me ["you", "won't?"], so there's an extra '?' that I have to deal with. I'm looking for a tried and tested method that does away with kinks like the one above, as well as the many other exceptions that I'm sure exist. Of course, I'll resort to a split(regex) if I don't find anything.
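To make the problem concrete, here is a minimal sketch of what the naive approaches do (plain str.split() leaves punctuation attached, while a simple \w+ regex tears contractions apart):

>>> sentence = "you won't?"
>>> sentence.split()
['you', "won't?"]
>>> import re
>>> re.findall(r"\w+", sentence)
['you', 'won', 't']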

Karthick asked Dec 02 '22 01:12

1 Answer

The Natural Language Toolkit (NLTK) is probably what you need.

>>> import nltk
>>> nltk.download('punkt')  # one-time download of the tokenizer models
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("'Hello. This is a test.  It works!")
["'Hello", '.', 'This', 'is', 'a', 'test', '.', 'It', 'works', '!']
>>> word_tokenize("I won't fix your computer")
['I', 'wo', "n't", 'fix', 'your', 'computer']

nltk.tokenize.word_tokenize uses the TreebankWordTokenizer by default, a word tokenizer that follows the Penn Treebank conventions.
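You can also instantiate that tokenizer class directly if you want the same behaviour without the wrapper (a small sketch; TreebankWordTokenizer is importable from nltk.tokenize):

>>> from nltk.tokenize import TreebankWordTokenizer
>>> TreebankWordTokenizer().tokenize("I won't fix your computer")
['I', 'wo', "n't", 'fix', 'your', 'computer']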

Note that this tokenizer assumes that the text has already been segmented into sentences.
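If you are starting from running text, a common pattern is to segment it with sent_tokenize first and then tokenize each sentence; this also handles abbreviations like 'Mr.' from the question (a sketch, and the exact output can vary slightly between NLTK versions):

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> text = "Mr. Smith won't go. He stayed home."
>>> [word_tokenize(s) for s in sent_tokenize(text)]
[['Mr.', 'Smith', 'wo', "n't", 'go', '.'], ['He', 'stayed', 'home', '.']]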

You can also experiment with the various other tokenizers that NLTK provides (e.g. WordPunctTokenizer, WhitespaceTokenizer, ...) to see which conventions suit your data.
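For comparison, here is roughly how two of those alternatives treat the same contraction (both classes live in nltk.tokenize; WordPunctTokenizer splits on every punctuation run, while WhitespaceTokenizer only splits on whitespace):

>>> from nltk.tokenize import WordPunctTokenizer, WhitespaceTokenizer
>>> WordPunctTokenizer().tokenize("I won't fix your computer")
['I', 'won', "'", 't', 'fix', 'your', 'computer']
>>> WhitespaceTokenizer().tokenize("I won't fix your computer")
['I', "won't", 'fix', 'your', 'computer']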

Paolo Moretti answered Dec 18 '22 15:12