Can NLTK recognise initials followed by a dot?

Tags: python, nlp, nltk

I am trying to use NLTK to split Russian text into sentences, but it does not handle abbreviations and initials such as А. И. Манташева and Я. Вышинский.

Instead, it breaks the text after each initial, like this:

организовывал забастовки и демонстрации, поднимал рабочих на бакинских предприятиях А.

И.

Манташева.

It did the same when I used russian.pickle from https://github.com/mhq/train_punkt. Is this a general NLTK limitation, or is it language-specific?
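For reference, a call that reproduces the output above would look roughly like this (the exact invocation is not shown in the question, so loading the downloaded russian.pickle directly with pickle is an assumption):

import pickle

# Load the downloaded Punkt model (e.g. russian.pickle from train_punkt).
with open('russian.pickle', 'rb') as f:
    sent_detector = pickle.load(f)

text = ("организовывал забастовки и демонстрации, поднимал рабочих "
        "на бакинских предприятиях А. И. Манташева.")
print('\n'.join(sent_detector.tokenize(text)))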

asked Dec 30 '12 by user1870840

2 Answers

As some of the comments hinted, what you want to use is the Punkt sentence segmenter/tokenizer.

NLTK or language-specific?

Neither. As you have realized, you cannot simply split on every period. NLTK comes with several Punkt segmenters trained on different languages. However, if you're having issues, your best bet is to use a larger training corpus for the Punkt tokenizer to learn from.
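For example, loading one of the pre-trained models shipped in NLTK's punkt data looks like this (a minimal sketch; which language pickles are available depends on your NLTK data version, and a Russian model was not part of the standard data at the time of this question):

import nltk

nltk.download('punkt')  # fetch the pre-trained Punkt models if they are not installed yet

# Bundled models follow the tokenizers/punkt/<language>.pickle naming convention.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
print(sent_detector.tokenize("Dr. Smith arrived at 5 p.m. He was late."))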

Documentation Links

  • https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
  • https://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt.PunktSentenceTokenizer-class.html

Sample Implementation

Below is part of the code to point you in the right direction. You should be able to do the same for yourself by supplying Russian text files. One source for those could be the Russian version of a Wikipedia database dump, but I leave that as a secondary problem for you.

import logging
try:
    import cPickle as pickle
except ImportError:
    import pickle
import nltk


def create_punkt_sent_detector(fnames, punkt_fname, progress_count=None):
    """Makes a pass through the corpus to train a Punkt sentence segmenter.

    Args:
        fnames: List of filenames to be used for training.
        punkt_fname: Filename to save the trained Punkt sentence segmenter.
        progress_count: Display a progress count every integer number of pages.
    """
    logger = logging.getLogger('create_punkt_sent_detector')

    punkt = nltk.tokenize.punkt.PunktTrainer()

    logger.info("Training punkt sentence detector")

    doc_count = 0
    try:
        for fname in fnames:
            # Punkt trains on decoded text, so read the file as UTF-8 rather than raw bytes.
            with open(fname, mode='r', encoding='utf-8') as f:
                punkt.train(f.read(), finalize=False, verbose=False)
                doc_count += 1
                if progress_count and doc_count % progress_count == 0:
                    logger.debug('Pages processed: %i', doc_count)
    except KeyboardInterrupt:
        print('KeyboardInterrupt: Stopping the reading of the dump early!')

    logger.info('Now finalizing Punkt training.')

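    # finalize_training() computes the final abbreviation and collocation statistics
    # from the accumulated counts; get_params() returns the learned PunktParameters,
    # which are then used to construct the actual sentence tokenizer.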
    punkt.finalize_training(verbose=True)
    learned = punkt.get_params()
    sbd = nltk.tokenize.punkt.PunktSentenceTokenizer(learned)
    with open(punkt_fname, mode='wb') as f:
        pickle.dump(sbd, f, protocol=pickle.HIGHEST_PROTOCOL)

    return sbd


if __name__ == '__main__':
    punkt_fname = 'punkt_russian.pickle'
    try:
        with open(punkt_fname, mode='rb') as f:
            sent_detector = pickle.load(f)
    except (IOError, pickle.UnpicklingError):
        sent_detector = None

    if sent_detector is None:
        corpora = ['russian-1.txt', 'russian-2.txt']
        sent_detector = create_punkt_sent_detector(fnames=corpora,
                                                   punkt_fname=punkt_fname)

    tokenized_text = sent_detector.tokenize("some russian text.",
                                            realign_boundaries=True)
    print('\n'.join(tokenized_text))
answered Sep 25 '22 by Wesley Baugh


You can take the trained Russian sentence tokenizer from https://github.com/Mottl/ru_punkt, which can deal with initials in Russian names as well as abbreviations.

text = ("организовывал забастовки и демонстрации, ",
        "поднимал рабочих на бакинских предприятиях А.И. Манташева.")
print(tokenizer.tokenize(text))

Output:

['организовывал забастовки и демонстрации, поднимал рабочих на бакинских предприятиях А.И. Манташева.']
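Newer NLTK data releases also appear to include a Russian Punkt model, so with an up-to-date punkt download the same behaviour may be available without installing anything extra (a sketch, assuming your punkt data contains russian.pickle):

import nltk

nltk.download('punkt')  # recent punkt data packages include a Russian model
text = "организовывал забастовки и демонстрации, поднимал рабочих на бакинских предприятиях А.И. Манташева."
print(nltk.sent_tokenize(text, language="russian"))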
answered Sep 21 '22 by Dmitry Mottl