I am learning Natural Language Processing using NLTK. I came across some code that uses PunktSentenceTokenizer, whose actual purpose in the given code I cannot understand. The code is given below:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)  #A
tokenized = custom_sent_tokenizer.tokenize(sample_text)     #B

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
So, why do we use PunktSentenceTokenizer, and what is going on in the lines marked A and B? I mean, there is a training text and a sample text, but why do we need two data sets to get the part-of-speech tagging? The lines marked A and B are the ones I am not able to understand.

PS: I did try to look in the NLTK book, but I could not understand the real use of PunktSentenceTokenizer.
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.
word_tokenize() is a function that splits a given sentence into words using the NLTK library. In Python, this kind of tokenization is done with the help of the Natural Language Toolkit (NLTK).
NLTK's tokenize module provides two commonly used helpers: word tokenization, via the word_tokenize() method, which splits a sentence into tokens or words, and sentence tokenization, via the sent_tokenize() method, which splits a document or paragraph into sentences.
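For illustration, here is a small hedged sketch of both helpers on a made-up sentence; the outputs in the comments are approximate and assume the standard pre-trained English Punkt model has been downloaded:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download('punkt')  # run once if the pre-trained Punkt model is not installed yet

text = "Hello Mr. Smith. How are you today?"

print(sent_tokenize(text))
# expected: ['Hello Mr. Smith.', 'How are you today?']

print(word_tokenize(text))
# expected: ['Hello', 'Mr.', 'Smith', '.', 'How', 'are', 'you', 'today', '?']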
sent_tokenize() doesn't require any training text and can tokenize straight away. But it is not much help if the text to be tokenized is very different or uncommon, because the tokenization may not be perfect. In that case, training a PunktSentenceTokenizer on your own text is a better option.
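To make that contrast concrete, here is a hedged sketch using the same State of the Union files as the question; the pre-trained model is used implicitly by sent_tokenize(), while PunktSentenceTokenizer learns its parameters from train_text (this assumes the punkt model and the state_union corpus have been downloaded):

from nltk.corpus import state_union
from nltk.tokenize import sent_tokenize, PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

# Option 1: the pre-trained English Punkt model, no training step needed
pretrained_sents = sent_tokenize(sample_text)

# Option 2: a Punkt model trained (unsupervised) on your own plain text
custom_tokenizer = PunktSentenceTokenizer(train_text)
custom_sents = custom_tokenizer.tokenize(sample_text)

# On ordinary prose the two usually agree; they diverge mainly on
# domain-specific abbreviations the pre-trained model has never seen.
print(len(pretrained_sents), len(custom_sents))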
PunktSentenceTokenizer is the abstract class for the default sentence tokenizer, i.e. sent_tokenize(), provided in NLTK. It is an implementation of Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005). See https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L79
Given a paragraph with multiple sentences, e.g.:
>>> from nltk.corpus import state_union
>>> train_text = state_union.raw("2005-GWBush.txt").split('\n')
>>> train_text[11]
'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. '
you can use sent_tokenize():
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(train_text[11])
['Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.', 'This evening I will set forth policies to advance that ideal at home and around the world. ']
>>> for sent in sent_tokenize(train_text[11]):
...     print(sent)
...     print('--------')
...
Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.
--------
This evening I will set forth policies to advance that ideal at home and around the world.
--------
sent_tokenize() uses a pre-trained model from nltk_data/tokenizers/punkt/english.pickle. You can also specify other languages; the languages with pre-trained models available in NLTK are:
alvas@ubi:~/nltk_data/tokenizers/punkt$ ls
czech.pickle     finnish.pickle  norwegian.pickle   slovene.pickle
danish.pickle    french.pickle   polish.pickle      spanish.pickle
dutch.pickle     german.pickle   portuguese.pickle  swedish.pickle
english.pickle   greek.pickle    PY3                turkish.pickle
estonian.pickle  italian.pickle  README
Given a text in another language, do this:
>>> german_text = "Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter. Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. "
>>> for sent in sent_tokenize(german_text, language='german'):
...     print(sent)
...     print('---------')
...
Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter.
---------
Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten.
---------
To train your own punkt model, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py and training data format for nltk punkt
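As a rough sketch (one way of doing it, not the only one), the training can also be driven explicitly through PunktTrainer from nltk.tokenize.punkt; the corpus file name below is just a placeholder for any large plain-text file in your domain:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# "my_domain_corpus.txt" is a hypothetical file; substitute your own raw text
with open("my_domain_corpus.txt", encoding="utf-8") as fh:
    corpus = fh.read()

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True   # optional: learn collocations more aggressively
trainer.train(corpus)

# Hand the learned parameters to a tokenizer instance
tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize("Dr. Foo met Prof. Bar. They discussed the results."))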
PunktSentenceTokenizer is a sentence boundary detection algorithm that must be trained before it can be used [1]. NLTK already includes a pre-trained version of the PunktSentenceTokenizer.

So if you initialize the tokenizer without any arguments, it will default to the pre-trained version:
In [1]: import nltk

In [2]: tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

In [3]: txt = """ This is one sentence. This is another sentence."""

In [4]: tokenizer.tokenize(txt)
Out[4]: [' This is one sentence.', 'This is another sentence.']
You can also provide your own training data to train the tokenizer before using it. The Punkt tokenizer uses an unsupervised algorithm, meaning you just train it on regular text:
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
For most cases, it is totally fine to use the pre-trained version, so you can simply initialize the tokenizer without providing any arguments.
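So, assuming the pre-trained model suits your text, lines A and B from the question could be reduced to something like the following sketch, which loads the same english.pickle model mentioned above instead of training on 2005-GWBush.txt:

import nltk
from nltk.corpus import state_union

sample_text = state_union.raw("2006-GWBush.txt")

# Load the pre-trained English Punkt model that sent_tokenize() uses internally
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
tokenized = tokenizer.tokenize(sample_text)   # same role as line B in the question

print(tokenized[:2])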
So what does all this have to do with POS tagging? The NLTK POS tagger works on tokenized sentences, so you need to break your text down into sentences and word tokens before you can POS-tag it.
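Here is a minimal end-to-end sketch of that pipeline (sentence split, then word tokenize, then tag), mirroring the question's process_content(); the example sentence and the tags shown in the comment are illustrative:

import nltk
from nltk.tokenize import sent_tokenize

# nltk.download('averaged_perceptron_tagger')  # needed once for nltk.pos_tag

text = "The President spoke on Tuesday. He outlined several new policies."

for sent in sent_tokenize(text):        # 1. split the text into sentences
    words = nltk.word_tokenize(sent)    # 2. split each sentence into word tokens
    tagged = nltk.pos_tag(words)        # 3. tag each token with its part of speech
    print(tagged)
# e.g. [('The', 'DT'), ('President', 'NNP'), ('spoke', 'VBD'), ('on', 'IN'), ('Tuesday', 'NNP'), ('.', '.')]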
See NLTK's documentation for more details.
[1] Kiss and Strunk, "Unsupervised Multilingual Sentence Boundary Detection".