
Calling NLTK's concordance - how to get text before/after a word that was used?

Tags:

python

nltk

I would like to find out what text comes after each instance that concordance returns. For instance, in the example given in the 'Searching Text' section, they get the concordance of the word 'monstrous'. How would you get the words that come right after each instance of 'monstrous'?

dev.e.loper asked Jan 17 '12 16:01



1 Answer

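import nltk
import nltk.book as book  # loads the example texts; requires the NLTK book data

text1 = book.text1
# Index every token by its lowercased form so the lookup is case-insensitive.
c = nltk.ConcordanceIndex(text1.tokens, key=lambda s: s.lower())
# For each occurrence of 'monstrous', take the token immediately after it.
print([text1.tokens[offset + 1] for offset in c.offsets('monstrous')])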

yields

['size', 'bulk', 'clubs', 'cannibal', 'and', 'fable', 'Pictures', 'pictures', 'stories', 'cabinet', 'size']

I found this by looking up how the concordance method is defined.

IPython's introspection shows that text1.concordance is defined in /usr/lib/python2.7/dist-packages/nltk/text.py:

In [107]: text1.concordance?
Type:       instancemethod
Base Class: <type 'instancemethod'>
String Form:    <bound method Text.concordance of <Text: Moby Dick by Herman Melville 1851>>
Namespace:  Interactive
File:       /usr/lib/python2.7/dist-packages/nltk/text.py

In that file you'll find

def concordance(self, word, width=79, lines=25):
    ... 
        self._concordance_index = ConcordanceIndex(self.tokens,
                                                   key=lambda s:s.lower())
    ...            
    self._concordance_index.print_concordance(word, width, lines)

This shows how to instantiate ConcordanceIndex objects.

And in the same file you'll also find:

class ConcordanceIndex(object):
    def __init__(self, tokens, key=lambda x:x):
        ...
    def print_concordance(self, word, width=75, lines=25):
        ...
        offsets = self.offsets(word)
        ...
        right = ' '.join(self._tokens[i+1:i+context])

Experimenting in the IPython interpreter shows that self.offsets('monstrous') returns a list of numbers (offsets) at which the word monstrous occurs. You can access the actual words with self._tokens[offset], which is the same as text1.tokens[offset].
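To make the idea of offsets concrete, here is a minimal pure-Python sketch of an index that maps each normalized token to the positions where it occurs. It is a simplified stand-in written for illustration, not NLTK's actual code; build_offsets and the sample sentence are hypothetical.

```python
from collections import defaultdict

def build_offsets(tokens, key=lambda s: s):
    """Map key(token) -> list of positions where it occurs (hypothetical helper)."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[key(token)].append(position)
    return index

# Made-up sample tokens, not taken from Moby Dick.
tokens = "A monstrous size and a Monstrous bulk".split()

# Lowercasing the key makes the lookup case-insensitive,
# mirroring the key=lambda s: s.lower() argument used above.
index = build_offsets(tokens, key=lambda s: s.lower())
offsets = index["monstrous"]

print(offsets)                       # positions of the matches
print([tokens[i] for i in offsets])  # the original tokens at those positions
```

Because the key function only affects the lookup, the tokens you read back at each offset keep their original capitalization.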

So the next word after monstrous is given by text1.tokens[offset+1].
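The same indexing also gives the words before a match, and a window of several words on either side. Below is a small sketch, assuming you already have a token list and a list of offsets (for example from c.offsets('monstrous')). The context helper and the sample tokens are hypothetical; here the offsets are computed with a plain list comprehension so the example runs without NLTK data.

```python
def context(tokens, offsets, before=1, after=1):
    """Return (words_before, word, words_after) for each offset.

    Windows are clipped at the start and end of the token list, so a match
    at the very first or last position does not raise IndexError.
    """
    results = []
    for i in offsets:
        left = tokens[max(0, i - before):i]
        right = tokens[i + 1:i + 1 + after]
        results.append((left, tokens[i], right))
    return results

# Made-up sample tokens; with NLTK you would pass text1.tokens instead.
tokens = "that monstrous size and a most monstrous".split()
# Stand-in for c.offsets('monstrous'): positions of case-insensitive matches.
offsets = [i for i, t in enumerate(tokens) if t.lower() == "monstrous"]

for left, word, right in context(tokens, offsets, before=2, after=2):
    print(left, word, right)
```

Slicing is used instead of tokens[offset+1] because a slice past the end of the list returns an empty list, whereas direct indexing would fail if the word were the last token in the text.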

unutbu answered Sep 28 '22 00:09