NLTK: Find contexts of size 2k for a word

I have a corpus and I have a word. For each occurrence of the word in the corpus I want to get a list containing the k words before and the k words after it. I have a working implementation (see below), but I wondered whether NLTK provides some functionality for this that I missed?

def sized_context(word_index, window_radius, corpus):
    """Return a list of the window_radius words to the left and to the
    right of word_index, not including the word at word_index.
    """
    max_length = len(corpus)

    # Clamp the window to the corpus boundaries.
    left_border = max(word_index - window_radius, 0)
    right_border = min(word_index + 1 + window_radius, max_length)

    return corpus[left_border:word_index] + corpus[word_index + 1:right_border]
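For example, applying the function to a small hand-made token list (the tokens are placeholders; the function is repeated here only so the snippet runs on its own):

```python
def sized_context(word_index, window_radius, corpus):
    """Return the window_radius words on each side of word_index,
    excluding the word at word_index itself."""
    max_length = len(corpus)
    left_border = max(word_index - window_radius, 0)
    right_border = min(word_index + 1 + window_radius, max_length)
    return corpus[left_border:word_index] + corpus[word_index + 1:right_border]

corpus = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

# Radius 2 around 'd' (index 3): two tokens on each side.
print(sized_context(3, 2, corpus))   # ['b', 'c', 'e', 'f']

# Near the start of the corpus the left side is simply truncated.
print(sized_context(1, 2, corpus))   # ['a', 'c', 'd']
```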
Asked Sep 16 '25 by Zakum
1 Answer

If you want to use NLTK's functionality, you can use NLTK's ConcordanceIndex. To base the width of the display on the number of words instead of the number of characters (the latter being the default for ConcordanceIndex.print_concordance), you can simply subclass ConcordanceIndex with something like this:

from nltk import ConcordanceIndex

class ConcordanceIndex2(ConcordanceIndex):
    def create_concordance(self, word, token_width=13):
        "Return a list of contexts for `word`, each at most `token_width` tokens wide."
        half_width = token_width // 2
        contexts = []
        for i, token in enumerate(self._tokens):
            if token == word:
                start = i - half_width if i >= half_width else 0
                context = self._tokens[start:i + half_width + 1]
                contexts.append(context)
        return contexts

Then you can obtain your results like this:

>>> from nltk.tokenize import wordpunct_tokenize
>>> my_corpus = 'The gerenuk fled frantically across the vast valley, whereas the giraffe merely turned indignantly and clumsily loped away from the valley into the nearby ravine.'  # my corpus
>>> tokens = wordpunct_tokenize(my_corpus)
>>> c = ConcordanceIndex2(tokens)
>>> c.create_concordance('valley')  # returns a list of lists, since words may occur more than once in a corpus
[['gerenuk', 'fled', 'frantically', 'across', 'the', 'vast', 'valley', ',', 'whereas', 'the', 'giraffe', 'merely', 'turned'], ['and', 'clumsily', 'loped', 'away', 'from', 'the', 'valley', 'into', 'the', 'nearby', 'ravine', '.']]
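Note that create_concordance interprets token_width as the total window size and includes the matched word itself. If you want exactly k tokens on each side with the match excluded, as in the question, a plain-list variant (a sketch with a hypothetical helper name, not part of NLTK) could look like this:

```python
def contexts_of_size_2k(tokens, word, k):
    """For each occurrence of word in tokens, return the k tokens before
    and the k tokens after it, excluding the occurrence itself.
    (Hypothetical helper, not an NLTK API.)"""
    results = []
    for i, token in enumerate(tokens):
        if token == word:
            start = max(i - k, 0)  # clamp at the left edge of the corpus
            results.append(tokens[start:i] + tokens[i + 1:i + 1 + k])
    return results

print(contexts_of_size_2k(['a', 'b', 'c', 'b', 'd'], 'b', 1))
# [['a', 'c'], ['c', 'd']]
```

This handles repeated occurrences the same way as create_concordance, returning one context list per match.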

The create_concordance method I created above is based on NLTK's ConcordanceIndex.print_concordance method, which works like this:

>>> c = ConcordanceIndex(tokens)
>>> c.print_concordance('valley')
Displaying 2 of 2 matches:
                                  valley , whereas the giraffe merely turn
 and clumsily loped away from the valley into the nearby ravine .
Answered Sep 18 '25 by Justin O Barber