Which tokenizer is better to use with NLTK?

I have started learning NLTK and am following this tutorial. First we use the built-in tokenizer via sent_tokenize, and later we use PunktSentenceTokenizer. The tutorial mentions that PunktSentenceTokenizer is capable of unsupervised machine learning.

So does that mean it is better than the default one? Or what is the standard of comparison among various tokenizers?
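For reference, the pattern from the tutorial looks roughly like this (the texts below are placeholders, not the tutorial's own):

```python
from nltk.tokenize import sent_tokenize, PunktSentenceTokenizer
# assumes nltk.download('punkt') has already been run

sample_text = "Hello Mr. Smith, how are you doing today? The weather is great."
train_text = "Some longer text to train on. It should resemble the target domain."

# Built-in helper: uses the pre-trained English Punkt model shipped with NLTK.
print(sent_tokenize(sample_text))

# Explicit Punkt tokenizer, trained (unsupervised) on your own text.
custom_tokenizer = PunktSentenceTokenizer(train_text)
print(custom_tokenizer.tokenize(sample_text))
```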

asked Jun 22 '16 by Riken Shah


1 Answer

Looking at the source code for sent_tokenize() reveals that this method currently uses the pre-trained Punkt tokenizer, so it is equivalent to PunktSentenceTokenizer. Whether or not you will need to retrain your tokenizer depends on the nature of the text you are working with. If it is nothing too exotic, like newspaper articles, then you will likely find the pre-trained tokenizer to be sufficient.

Tokenizing boils down to a categorization task, so different tokenizers can be compared using the typical metrics such as precision, recall, and F-score on labelled data.
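As a rough sketch of what such a comparison could look like (the gold sentences here are hand-made, and scoring exact sentence matches is a simplification; real evaluations often score boundary offsets instead):

```python
from nltk.tokenize import sent_tokenize  # assumes nltk.download('punkt') has been run

def precision_recall_f1(predicted, gold):
    # Treat each exactly-matching sentence as a true positive.
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

text = "Dr. Watson arrived at 221B Baker St. He knocked twice."
gold = ["Dr. Watson arrived at 221B Baker St.", "He knocked twice."]  # hand-labelled

print(precision_recall_f1(sent_tokenize(text), gold))
```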

The punkt tokenizer is based on the work published in the following paper:

http://www.mitpressjournals.org/doi/abs/10.1162/coli.2006.32.4.485#.V2ouLXUrLeQ

It is fundamentally a heuristic-based approach geared toward disambiguating sentence boundaries from abbreviations, which are the bane of sentence tokenization. Calling it a heuristic approach is not meant to be disparaging. I have used the built-in sentence tokenizer before and it worked fine for what I was doing; of course, my task did not really depend on accurate sentence tokenizing. Or rather, I was able to throw enough data at it that it did not really matter.
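To see the ambiguity Punkt has to resolve, a quick check like the following (the example sentence is made up) prints whichever boundaries the pre-trained model decides on; ideally the period after the abbreviation is not treated as a sentence boundary, while the next one is:

```python
from nltk.tokenize import sent_tokenize  # assumes nltk.download('punkt') has been run

# The first period follows an abbreviation, the second ends a sentence;
# Punkt's heuristics have to tell those two cases apart.
text = "The firm was sold to Acme Inc. in 1999. It is now defunct."
for sentence in sent_tokenize(text):
    print(sentence)
```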

Here is an example of a question on SO where a user found the pre-trained tokenizer lacking, and needed to train a new one:

How to tweak the NLTK sentence tokenizer

The text in question was Moby Dick, and the odd sentence structure was tripping up the tokenizer. Some examples of where you might need to train your own tokenizer are social media (e.g. Twitter) or technical literature with lots of strange abbreviations not encountered by the pre-trained tokenizer.
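A rough sketch of that retraining, using NLTK's PunktTrainer (the training text below is a tiny stand-in; a real domain corpus would need to be far larger for the learned statistics to be meaningful):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

domain_text = (
    "Dosage was 5 mg/kg b.w. per day in the first arm. "
    "The control arm received saline i.v. twice daily. "
    "Adverse events were graded acc. to CTCAE v4.0 criteria."
)

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True          # also learn frequent collocations
trainer.train(domain_text, finalize=False)  # can be called repeatedly on more text
trainer.finalize_training()

# Build a tokenizer from the learned parameters and try it on unseen text.
tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize("Patients received 5 mg/kg b.w. per day. No deaths occurred."))
```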

answered Sep 19 '22 by juanpa.arrivillaga