Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding the surrounding sentence of a char/word in a string

Tags:

python

regex

nltk

I am trying to get sentences from a string that contain a given substring using python.

I have access to the string (an academic abstract) and a list of highlights with start and end indexes. For example:

{
  abstract: "...long abstract here..."
  highlights: [
    {
      concept: 'a word',
      start: 1,
      end: 10
    }
    {
      concept: 'cancer',
      start: 123,
      end: 135
    }
  ]
}

I am looping over each highlight, locating it's start index in the abstract (the end doesn't really matter as I just need to get a location within a sentence), and then somehow need to identify the sentence that index occurs in.

I am able to tokenize the abstract into sentences using nltk.tonenize.sent_tokenize, but by doing that I render the index location useless.

How should I go about solving this problem? I suppose regexes are an option but the nltk tokenizer seems such a nice way of doing it that it would be a shame not to make use of it.. Or somehow reset the start index by finding the number of chars since the previous full stop/exclamation mark/question mark?

like image 893
Elise Avatar asked Mar 20 '13 17:03

Elise


2 Answers

You are right, the NLTK tokenizer is really what you should be using in this situation since it is robust enough to handle delimiting mostly all sentences including ending a sentence with a "quotation." You can do something like this (paragraph from a random generator):

Start with,

from nltk.tokenize import sent_tokenize

paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []

Most intuitive way:

for sentence in sent_tokenize(paragraph):
    for highlight in highlights:
        if highlight in sentence:
            sentencesWithHighlights.append(sentence)
            break

But using this method we actually have what is effectively a 3x nested for loop. This is because we first check each sentence, then each highlight, then each subsequence in the sentence for the highlight.

We can get better performance since we know the start index for each highlight:

highlightIndices = [100,169]
subtractFromIndex = 0
for sentence in sent_tokenize(paragraph):
    for index in highlightIndices:
        if 0 < index - subtractFromIndex < len(sentence):
            sentencesWithHighlights.append(sentence)
            break
    subtractFromIndex += len(sentence)

In either case we get:

sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']
like image 167
Jared Avatar answered Sep 28 '22 13:09

Jared


I assume that all your sentences end with one of these three characters: !?.

What about looping over the list of highlights, creating a regexp group:

(?:list|of|your highlights)

Then matching your whole abstract against this regexp:

/(?:[\.!\?]|^)\s*([^\.!\?]*(?:list|of|your highlights)[^\.!\?]*?)(?=\s*[\.!\?])/ig

This way you would get the sentence containing at least one of your highlights in the first subgroup of each match (RegExr).

like image 23
speakr Avatar answered Sep 28 '22 13:09

speakr