Is it possible to get concordance for a phrase in NLTK?
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_loc = "c://temp//text//"
files = ".*\.txt"
read_corpus = PlaintextCorpusReader(corpus_loc, files)
corpus = nltk.Text(read_corpus.words())
test = nltk.TextCollection(corpus_loc)
corpus.concordance("claim")
For example, the above returns:
on okay okay okay i can give you the claim number and my information and
decide on the shop okay okay so the claim number is xxxx - xx - xxxx got
Now, if I try corpus.concordance("claim number"), it does not work. I do have code that does this using the .partition() method and some further processing (roughly sketched below), but I'm wondering whether the same can be done with concordance().
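The .partition() approach I mean is roughly along these lines (a hypothetical sketch, not my exact code; it only reports the first match, and the 50-character context windows are arbitrary):
# Rough sketch of the .partition() workaround: join the corpus tokens back
# into one string and split it around the phrase (first occurrence only).
raw = " ".join(read_corpus.words())
before, match, after = raw.partition("claim number")
if match:
    print(before[-50:] + match + after[:50])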
According to this issue it is not (yet) possible to search for multiple words with the concordance() function.
If you read the discussion under the very issue that @b3000 dug up, you'll see that, strangely enough, multi-word concordance is in fact available, but only in the graphical concordance tool, which you can start up like this:
>>> from nltk.app import concordance
>>> concordance()
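As an aside, if you are on a more recent NLTK release (3.4 or later, if I recall correctly), concordance() itself appears to accept a list of tokens as a phrase query, so something like the following may work directly on the corpus object from the question (untested sketch, assuming such a version):
>>> corpus.concordance(["claim", "number"])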
I munged together this solution...
def n_concordance_tokenised(text, phrase, left_margin=5, right_margin=5):
    # concordance replication via https://simplypython.wordpress.com/2014/03/14/saving-output-of-nltk-text-concordance/
    phraseList = phrase.split(' ')
    c = nltk.ConcordanceIndex(text.tokens, key=lambda s: s.lower())
    # Find the offsets for each token in the phrase
    offsets = [c.offsets(x) for x in phraseList]
    offsets_norm = []
    # For each token in the phrase, rebase its offsets to the start of the phrase
    for i in range(len(phraseList)):
        offsets_norm.append([x - i for x in offsets[i]])
    # We have found an occurrence of the phrase wherever the rebased offsets intersect
    # (set.intersection takes an arbitrary number of arguments: http://stackoverflow.com/a/3852792/454773)
    intersects = set(offsets_norm[0]).intersection(*offsets_norm[1:])
    # Slice out each match together with its left/right context window,
    # clamping the left edge at the start of the text
    concordance_txt = [text.tokens[max(offset - left_margin, 0):offset + len(phraseList) + right_margin]
                       for offset in intersects]
    outputs = [' '.join(con_sub) for con_sub in concordance_txt]
    return outputs
def n_concordance(txt, phrase, left_margin=5, right_margin=5):
    tokens = nltk.word_tokenize(txt)
    text = nltk.Text(tokens)
    return n_concordance_tokenised(text, phrase, left_margin=left_margin, right_margin=right_margin)
Here text1 is the Moby Dick sample text loaded by from nltk.book import *:
>>> n_concordance_tokenised(text1, 'monstrous size')
[u'one was of a most monstrous size . ... This came towards ',
 u'; for Whales of a monstrous size are oftentimes cast up dead ']
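The n_concordance wrapper works the same way on a raw string; a quick hypothetical usage (the sentence is made up for illustration, and nltk.word_tokenize assumes the punkt tokenizer data has been downloaded):
raw = ("It was a most monstrous size . This came towards us . "
       "Whales of a monstrous size are oftentimes cast up dead .")
for line in n_concordance(raw, 'monstrous size', left_margin=4, right_margin=4):
    print(line)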