Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Custom sentence segmentation in Spacy

Tags:

python

nlp

spacy

I want spaCy to use the sentence segmentation boundaries as I provide instead of its own processing.

For example:

get_sentences("Bob meets Alice. @SentBoundary@ They play together.")
# => ["Bob meets Alice.", "They play together."]  # two sents

get_sentences("Bob meets Alice. They play together.")
# => ["Bob meets Alice. They play together."]  # ONE sent

get_sentences("Bob meets Alice, @SentBoundary@ they play together.")
# => ["Bob meets Alice,", "they play together."] # two sents

This is what I have so far (borrowing things from documentation here):

import spacy
nlp = spacy.load('en_core_web_sm')

def mark_sentence_boundaries(doc):
    for i, token in enumerate(doc):
        if token.text == '@SentBoundary@':
            doc[i+1].sent_start = True
    return doc

nlp.add_pipe(mark_sentence_boundaries, before='parser')

def get_sentences(text):
    doc = nlp(text)
    return (list(doc.sents))

But the results I get are as follows:

# Ex1
get_sentences("Bob meets Alice. @SentBoundary@ They play together.")
#=> ["Bob meets Alice.", "@SentBoundary@", "They play together."]

# Ex2
get_sentences("Bob meets Alice. They play together.")
#=> ["Bob meets Alice.", "They play together."]

# Ex3
get_sentences("Bob meets Alice, @SentBoundary@ they play together.")
#=> ["Bob meets Alice, @SentBoundary@", "they play together."]

Following are main problems I am facing:

  1. When sentence break is found, how to get rid of @SentBoundary@ token.
  2. How to disallow spaCy from splitting if @SentBoundary@ is not present.
like image 333
Harsh Trivedi Avatar asked Sep 22 '18 16:09

Harsh Trivedi


People also ask

How do you split sentences with spaCy?

In step 1, we import the spaCy package and in step 2, we load the spacy engine. In step 3, we set the sentence variable and in step 4, we process it using the spacy engine. In step 5, we print out the dependency parse information. It will help us determine how to split the sentence into clauses.

What is sentence segmentation?

Sentence segmentation is the process of determining the longer processing units consisting of one or more words. This task involves identifying sentence boundaries between words in different sentences.

How do you use Tokenize sentences with spaCy?

In the example below, we are tokenizing the text using spacy. First, we imported the Spacy library and then loaded the English language model of spacy and then iterate over the tokens of doc objects to print them in the output. [Out] : You only live once , but if you do it right , once is enough .

What is Propn in spaCy?

Login to get full access to this book. [43] In the scheme used by spaCy, prepositions are referred to as “adposition” and use a tag ADP. Words like “Friday” or “Obama” are tagged with PROPN, which stands for “proper nouns” reserved for names of known individuals, places, time references, organizations, events and such.


1 Answers

The following code works:

import spacy
nlp = spacy.load('en_core_web_sm')

def split_on_breaks(doc):
    start = 0
    seen_break = False
    for word in doc:
        if seen_break:
            yield doc[start:word.i-1]
            start = word.i
            seen_break = False
        elif word.text == '@SentBoundary@':
            seen_break = True
    if start < len(doc):
        yield doc[start:len(doc)]

sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_breaks)
nlp.add_pipe(sbd, first=True)

def get_sentences(text):
    doc = nlp(text)
    return (list(doc.sents)) # convert to string if required.

# Ex1
get_sentences("Bob meets Alice. @SentBoundary@ They play together.")
# => ["Bob meets Alice.", "They play together."]  # two sentences

# Ex2
get_sentences("Bob meets Alice. They play together.")
# => ["Bob meets Alice. They play together."]  # ONE sentence

# Ex3
get_sentences("Bob meets Alice, @SentBoundary@ they play together.")
# => ["Bob meets Alice,", "they play together."] # two sentences

Right thing was to check for SentenceSegmenter than manual boundary setting (examples here). This github issue was also helpful.

like image 173
Harsh Trivedi Avatar answered Oct 11 '22 21:10

Harsh Trivedi