nltk tokenization and contractions

Tags: python, nlp, nltk

I'm tokenizing text with nltk, just sentences fed to wordpunct_tokenize. This splits contractions (e.g. "don't" becomes 'don', "'", 't'), but I want to keep them as one word. I'm refining my methods for more measured and precise tokenization of text, so I need to dig deeper into the nltk tokenization module beyond simple tokenization.
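For illustration, the splitting described above can be reproduced with the standard `re` module alone. This is a minimal sketch, not NLTK itself: the first pattern mirrors the regular expression behind NLTK's word-punct tokenizer, while the second pattern, the sample sentence, and all variable names are assumptions made up for the example.

```python
import re

# Same pattern NLTK's WordPunctTokenizer is built on:
# runs of word characters, or runs of non-space punctuation.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

# Hypothetical alternative (an assumption, not an NLTK default):
# allow apostrophes inside a word so contractions stay whole.
KEEP_CONTRACTIONS = re.compile(r"\w+(?:'\w+)*|[^\w\s]+")

text = "I don't think it's broken."

split_tokens = WORDPUNCT.findall(text)         # contractions broken into 3 tokens
kept_tokens = KEEP_CONTRACTIONS.findall(text)  # contractions kept as 1 token

print(split_tokens)
print(kept_tokens)
```

The second pattern only recognizes the straight ASCII apostrophe; text with typographic apostrophes would need the character class extended.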

I'm guessing this is common, and I'd like feedback from others who may have dealt with this particular issue before.

edit:

Yeah, this is a general, scattershot question, I know.

Also, as a novice to nlp, do I need to worry about contractions at all?

EDIT:

The SExprTokenizer or TreebankWordTokenizer seems to do what I'm looking for, for now.

blueblank asked Jul 05 '12 19:07


2 Answers

Which tokenizer you use really depends on what you want to do next. As inspectorG4dget said, some part-of-speech taggers handle split contractions, and in that case the splitting is a good thing. But maybe that's not what you want. To decide which tokenizer is best, consider what you need for the next step, and then submit your text to http://text-processing.com/demo/tokenize/ to see how each NLTK tokenizer behaves.
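The same kind of side-by-side comparison can be run locally. The sketch below assumes NLTK is installed; the sample sentence and the custom `RegexpTokenizer` pattern are illustrative assumptions, not something recommended by the answer. None of these three tokenizers needs extra NLTK data downloads.

```python
from nltk.tokenize import RegexpTokenizer, TreebankWordTokenizer, wordpunct_tokenize

text = "I don't know."

# Map a label to a callable that turns a string into a list of tokens.
tokenizers = {
    "wordpunct": wordpunct_tokenize,
    "treebank": TreebankWordTokenizer().tokenize,
    # Assumed custom pattern: keep word-internal apostrophes together.
    # Non-capturing group, so the tokenizer returns whole matches.
    "regexp": RegexpTokenizer(r"\w+(?:'\w+)*|[^\w\s]+").tokenize,
}

results = {name: tokenize(text) for name, tokenize in tokenizers.items()}

for name, tokens in results.items():
    print(f"{name:10} {tokens}")
```

On this input the word-punct tokenizer breaks the contraction into three tokens, the Treebank tokenizer splits it into "do" and "n't" (the convention many POS taggers expect), and the custom pattern keeps "don't" whole. Which of these is right depends on the next processing step.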

Jacob answered Oct 08 '22 15:10


I've worked with NLTK before on this project. When I did, I found that contractions were useful to consider.

However, I did not write a custom tokenizer; I simply handled contractions after POS tagging.

I suspect this is not the answer that you are looking for, but I hope it helps somewhat.
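The answer does not say how the contractions were handled afterwards. As one possible post-processing step (not necessarily the approach used here), split tokens can be rejoined once tokenization is done. The function name `merge_contractions` and the suffix list are hypothetical, chosen for this sketch.

```python
# Assumed set of common English contraction suffixes (illustrative, not exhaustive).
CONTRACTION_SUFFIXES = {"t", "s", "re", "ve", "ll", "d", "m"}

def merge_contractions(tokens):
    """Rejoin word, apostrophe, suffix triples such as ['don', "'", 't']."""
    merged = []
    i = 0
    while i < len(tokens):
        is_contraction = (
            i + 2 < len(tokens)
            and tokens[i].isalpha()
            and tokens[i + 1] == "'"
            and tokens[i + 2].lower() in CONTRACTION_SUFFIXES
        )
        if is_contraction:
            # Collapse the three pieces back into a single token.
            merged.append(tokens[i] + "'" + tokens[i + 2])
            i += 3
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["I", "don", "'", "t", "know", "."]
merged_tokens = merge_contractions(tokens)
print(merged_tokens)
```

Because the token list no longer records where whitespace was, this heuristic can misfire on single-quoted text (for example a quote that opens right before the word "s"), so it is best treated as a starting point.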

inspectorG4dget answered Oct 08 '22 14:10