
Extracting a word plus 20 more from a section (python)

Yep, still playing around with Python.

I decided to try out Gensim, a tool for finding the topics associated with a chosen word and its context.

So I wondered how to find a word in a section of text and extract 20 words together with it (as in the 10 words before that specific word and the 10 words after it), then save it together with other such extractions so Gensim can be run on them.

The hard part for me is extracting the 10 words before and after once the chosen word is found. I played with nltk before, and by just tokenizing the text into words or sentences it was easy to get hold of the sentences. Still, I can't figure out how to get the words around a specific word, or the sentences before and after a specific sentence.

For those who are confused (it's 1am here so I may be confusing) I'll show it with an example:

As soon as it had finished, all her blood rushed to her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will I make something which shall destroy her completely." Thus saying, she made a poisoned comb by arts which she understood, and then, disguising herself, she took the form of an old widow. She went over the seven hills to the house of the seven Dwarfs, and[15] knocking at the door, called out, "Good wares to sell to-day!"

If we say the word is Snow-White then I'd want to get this part extracted:

her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will

10 words before and after Snow-White.

It would also be fine to instead get the sentence before and after the sentence Snow-White appears in, if this can be done in nltk and is easier.

I mean, whatever works best. I'd be happy with either of the two solutions if someone could help me.

If this can be done with Gensim too...and that is easier, then I shall be happy with that too. So any of the 3 ways will be fine...I just want to try and see how this can be done because atm my head is blank.

N00programmer asked Dec 07 '22 14:12



2 Answers

The process is called Keyword in Context (KWIC).

The first step is to split your input into words. There are many ways to do that using the regular expressions module; see re.split or re.findall for example.
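As a rough illustration of that first step, here is a minimal sketch using re.findall. The pattern (letters, apostrophes, hyphens) and the helper name `tokenize` are assumptions chosen for this example; adjust the character class if your text has digits or accented letters.

```python
import re

# Assumed word pattern: runs of letters, apostrophes and hyphens,
# so that "Snow-White" stays a single token.
WORD_RE = re.compile(r"[A-Za-z'\-]+")

def tokenize(text):
    """Return the word-like tokens in text, with punctuation dropped."""
    return WORD_RE.findall(text)

print(tokenize("hear that Snow-White was yet living."))
# ['hear', 'that', 'Snow-White', 'was', 'yet', 'living']
```

Note that this throws the punctuation away. If you want the context back as original text, keeping character positions (as re.finditer does) is one way around that.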

Having located a particular word, you use slicing to find the ten words before and the ten words after.
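The slicing step might look roughly like this, assuming the text has already been split into a list of words. `context_window` is a hypothetical helper name made up for this sketch, not a library function.

```python
def context_window(words, target, width=10):
    """Return one list of words for every occurrence of target."""
    windows = []
    for i, word in enumerate(words):
        if word == target:
            # Clamp at 0: a negative start would wrap around to the
            # end of the list and give an empty or wrong slice.
            start = max(0, i - width)
            windows.append(words[start:i + width + 1])
    return windows

words = "a b c d e f g".split()
print(context_window(words, "d", width=2))
# [['b', 'c', 'd', 'e', 'f']]
```

Slicing past the end of a list is safe in Python, so only the start needs clamping.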

To build an index for all words, a deque with a maxlen is convenient for implementing a sliding window.
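A sliding window along those lines could be sketched as follows. The names `sliding_windows` and `build_index` are hypothetical, and padding with None is just one way to handle words near the edges.

```python
from collections import deque
from itertools import chain, repeat

def sliding_windows(words, width=10):
    """Yield (center_word, context_words) for every word in words."""
    size = 2 * width + 1
    # Pad with None so the first and last words also reach the center slot.
    padded = chain(repeat(None, width), words, repeat(None, width))
    window = deque(maxlen=size)  # old items fall off the left automatically
    for item in padded:
        window.append(item)
        if len(window) == size:
            center = window[width]
            yield center, [w for w in window if w is not None]

def build_index(words, width=10):
    """Map each word to the list of contexts it appears in."""
    index = {}
    for center, context in sliding_windows(words, width):
        index.setdefault(center, []).append(context)
    return index

index = build_index("a b c d".split(), width=1)
print(index["c"])
# [['b', 'c', 'd']]
```

This makes a single pass over the words and never holds more than 2*width + 1 of them in the window at once.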

Here's one way to do it efficiently using itertools:

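from re import finditer
from itertools import tee, islice, chain, repeat

def kwic(text, tgtword, width=10):
    'Find all occurrences of tgtword and show the surrounding context'
    # Character spans (start, stop) of every word-like token in the text
    matches = (mo.span() for mo in finditer(r"[A-Za-z\'\-]+", text))
    # Pad both ends so words near the edges still get a (shorter) window
    end = len(text)
    padded = chain(repeat((0, 0), width), matches, repeat((end, end), width))
    # Three staggered iterators: window start, current word, window end
    t1, t2, t3 = tee(padded, 3)
    t2 = islice(t2, width, None)
    t3 = islice(t3, 2*width, None)
    for (start, _), (i, j), (_, stop) in zip(t1, t2, t3):
        if text[i: j] == tgtword:
            context = text[start: stop]
            yield context

# text is the passage being searched, e.g. the excerpt from the question
print(list(kwic(text, 'Snow-White')))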
Raymond Hettinger answered Dec 28 '22 08:12



text = """
As soon as it had finished, all her blood rushed to her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will I make something which shall destroy her completely." Thus saying, she made a poisoned comb by arts which she understood, and then, disguising herself, she took the form of an old widow. She went over the seven hills to the house of the seven Dwarfs, and[15] knocking at the door, called out, "Good wares to sell to-day!"
"""
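spl = text.split()

def ans(word, width=10):
    # Compare each token with its surrounding punctuation stripped
    for ind, x in enumerate(spl):
        if x.strip(",'\".!") == word:
            # Clamp the start so a negative index cannot wrap around
            return " ".join(spl[max(0, ind - width):ind + width + 1])
    return None  # word not found


>>> print(ans('Snow-White'))
her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will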
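For the sentence-level variant mentioned in the question (the sentence before and after the one containing the word), a rough sketch could look like this. The regex splitter is a naive assumption that breaks after ., ! or ? (optionally followed by a closing quote); nltk's sent_tokenize should handle abbreviations and similar cases better. `sentence_context` is a made-up name for this example.

```python
import re

# Naive sentence boundary: whitespace that follows ., ! or ?,
# or one of those followed by a closing quote.
SENT_SPLIT = re.compile(r"""(?:(?<=[.!?])|(?<=[.!?]["']))\s+""")

def sentence_context(text, word, before=1, after=1):
    """Return each sentence containing word, joined with its neighbours."""
    sentences = SENT_SPLIT.split(text.strip())
    results = []
    for i, sent in enumerate(sentences):
        if word in sent:  # plain substring match, not whole-word
            start = max(0, i - before)
            results.append(" ".join(sentences[start:i + after + 1]))
    return results

sample = 'One here. Two has Snow-White in it. "Three is quoted." Four ends.'
print(sentence_context(sample, "Snow-White"))
# ['One here. Two has Snow-White in it. "Three is quoted."']
```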
Ashwini Chaudhary answered Dec 28 '22 08:12
