Can NLTK recognise initials followed by a dot?

Tags: python, nlp, nltk

I am trying to use NLTK to split Russian text into sentences, but it does not handle abbreviations and initials such as А. И. Манташева and Я. Вышинский.

Instead, it breaks the text after each initial, like this:

организовывал забастовки и демонстрации, поднимал рабочих на бакинских предприятиях А.

И.

Манташева.

It did the same when I used russian.pickle from https://github.com/mhq/train_punkt. Is this a general NLTK limitation, or is it language-specific?
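For reference, a call that reproduces the output above would look roughly like this (the exact invocation is not shown in the question, so loading the downloaded russian.pickle directly with pickle is an assumption):

import pickle

# Load the downloaded Punkt model (e.g. russian.pickle from train_punkt).
with open('russian.pickle', 'rb') as f:
    sent_detector = pickle.load(f)

text = ("организовывал забастовки и демонстрации, поднимал рабочих "
        "на бакинских предприятиях А. И. Манташева.")
print('\n'.join(sent_detector.tokenize(text)))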

asked Dec 30 '12 by user1870840

2 Answers

As some of the comments hinted, what you want to use is the Punkt sentence segmenter/tokenizer.

NLTK or language-specific?

Neither. As you have realized, you cannot simply split on every period. NLTK comes with several Punkt segmenters trained on different languages. However, if you're having issues, your best bet is to use a larger training corpus for the Punkt tokenizer to learn from.
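For example, loading one of the pre-trained models shipped in NLTK's punkt data looks like this (a minimal sketch; which language pickles are available depends on your NLTK data version, and a Russian model was not part of the standard data at the time of this question):

import nltk

nltk.download('punkt')  # fetch the pre-trained Punkt models if they are not installed yet

# Bundled models follow the tokenizers/punkt/<language>.pickle naming convention.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
print(sent_detector.tokenize("Dr. Smith arrived at 5 p.m. He was late."))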

Documentation Links

  • https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
  • https://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt.PunktSentenceTokenizer-class.html

Sample Implementation

Below is part of the code to point you in the right direction. You should be able to do the same for yourself by supplying Russian text files. One source for those could be the Russian version of a Wikipedia database dump, but I leave that as a secondary problem for you.

import logging
try:
    import cPickle as pickle
except ImportError:
    import pickle
import nltk


def create_punkt_sent_detector(fnames, punkt_fname, progress_count=None):
    """Makes a pass through the corpus to train a Punkt sentence segmenter.

    Args:
        fnames: List of filenames to be used for training.
        punkt_fname: Filename to save the trained Punkt sentence segmenter.
        progress_count: Display a progress count every integer number of pages.
    """
    logger = logging.getLogger('create_punkt_sent_detector')

    punkt = nltk.tokenize.punkt.PunktTrainer()

    logger.info("Training punkt sentence detector")

    doc_count = 0
    try:
        for fname in fnames:
            # Punkt trains on decoded text, so read the file as UTF-8 rather than raw bytes.
            with open(fname, mode='r', encoding='utf-8') as f:
                punkt.train(f.read(), finalize=False, verbose=False)
                doc_count += 1
                if progress_count and doc_count % progress_count == 0:
                    logger.debug('Pages processed: %i', doc_count)
    except KeyboardInterrupt:
        print('KeyboardInterrupt: Stopping the reading of the dump early!')

    logger.info('Now finalizing Punkt training.')

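    # finalize_training() computes the final abbreviation and collocation statistics
    # from the accumulated counts; get_params() returns the learned PunktParameters,
    # which are then used to construct the actual sentence tokenizer.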
    punkt.finalize_training(verbose=True)
    learned = punkt.get_params()
    sbd = nltk.tokenize.punkt.PunktSentenceTokenizer(learned)
    with open(punkt_fname, mode='wb') as f:
        pickle.dump(sbd, f, protocol=pickle.HIGHEST_PROTOCOL)

    return sbd


if __name__ == '__main__':
    punkt_fname = 'punkt_russian.pickle'
    try:
        with open(punkt_fname, mode='rb') as f:
            sent_detector = pickle.load(f)
    except (IOError, pickle.UnpicklingError):
        sent_detector = None

    if sent_detector is None:
        corpora = ['russian-1.txt', 'russian-2.txt']
        sent_detector = create_punkt_sent_detector(fnames=corpora,
                                                   punkt_fname=punkt_fname)

    tokenized_text = sent_detector.tokenize("some russian text.",
                                            realign_boundaries=True)
    print('\n'.join(tokenized_text))
answered Sep 25 '22 by Wesley Baugh


You can take the trained Russian sentence tokenizer from https://github.com/Mottl/ru_punkt, which can deal with initials in Russian names as well as abbreviations.

text = ("организовывал забастовки и демонстрации, ",
        "поднимал рабочих на бакинских предприятиях А.И. Манташева.")
print(tokenizer.tokenize(text))

Output:

['организовывал забастовки и демонстрации, поднимал рабочих на бакинских предприятиях А.И. Манташева.']
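Newer NLTK data releases also appear to include a Russian Punkt model, so with an up-to-date punkt download the same behaviour may be available without installing anything extra (a sketch, assuming your punkt data contains russian.pickle):

import nltk

nltk.download('punkt')  # recent punkt data packages include a Russian model
text = "организовывал забастовки и демонстрации, поднимал рабочих на бакинских предприятиях А.И. Манташева."
print(nltk.sent_tokenize(text, language="russian"))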
answered Sep 21 '22 by Dmitry Mottl