Custom sentence segmentation in spaCy

Tags: python, nlp, spacy

I want spaCy to use the sentence boundaries I provide instead of computing its own.

For example:

get_sentences("Bob meets Alice. @SentBoundary@ They play together.")
# => ["Bob meets Alice.", "They play together."]  # two sents

get_sentences("Bob meets Alice. They play together.")
# => ["Bob meets Alice. They play together."]  # ONE sent

get_sentences("Bob meets Alice, @SentBoundary@ they play together.")
# => ["Bob meets Alice,", "they play together."] # two sents

This is what I have so far (borrowing things from documentation here):

import spacy
nlp = spacy.load('en_core_web_sm')

def mark_sentence_boundaries(doc):
    for i, token in enumerate(doc):
        if token.text == '@SentBoundary@':
            doc[i+1].sent_start = True
    return doc

nlp.add_pipe(mark_sentence_boundaries, before='parser')

def get_sentences(text):
    doc = nlp(text)
    return (list(doc.sents))

But the results I get are as follows:

# Ex1
get_sentences("Bob meets Alice. @SentBoundary@ They play together.")
#=> ["Bob meets Alice.", "@SentBoundary@", "They play together."]

# Ex2
get_sentences("Bob meets Alice. They play together.")
#=> ["Bob meets Alice.", "They play together."]

# Ex3
get_sentences("Bob meets Alice, @SentBoundary@ they play together.")
#=> ["Bob meets Alice, @SentBoundary@", "they play together."]

These are the two main problems I am facing:

1. When a sentence break is found, how do I get rid of the @SentBoundary@ token?
2. How do I stop spaCy from splitting where @SentBoundary@ is not present?

asked Sep 22 '18 16:09

Harsh Trivedi

1 Answer

The following code works:

import spacy
from spacy.pipeline import SentenceSegmenter  # spaCy 2.x hook used below
nlp = spacy.load('en_core_web_sm')

def split_on_breaks(doc):
    start = 0
    seen_break = False
    for word in doc:
        if seen_break:
            yield doc[start:word.i-1]
            start = word.i
            seen_break = False
        elif word.text == '@SentBoundary@':
            seen_break = True
    if start < len(doc):
        yield doc[start:len(doc)]

sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_breaks)
nlp.add_pipe(sbd, first=True)

def get_sentences(text):
    doc = nlp(text)
    return (list(doc.sents)) # convert to string if required.

# Ex1
get_sentences("Bob meets Alice. @SentBoundary@ They play together.")
# => ["Bob meets Alice.", "They play together."]  # two sentences

# Ex2
get_sentences("Bob meets Alice. They play together.")
# => ["Bob meets Alice. They play together."]  # ONE sentence

# Ex3
get_sentences("Bob meets Alice, @SentBoundary@ they play together.")
# => ["Bob meets Alice,", "they play together."] # two sentences

The right approach was to use SentenceSegmenter rather than setting boundaries manually (examples here). This GitHub issue was also helpful.
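
To see the splitting logic without loading a spaCy model, here is a minimal pure-Python sketch of the same idea. It is not the spaCy API. It assumes whitespace tokenization stands in for spaCy's tokenizer, and the names split_tokens_on_marker and get_sentences_plain are hypothetical helpers made up for this illustration. Unlike the generator above, it also skips leading, trailing, and repeated markers, so they never end up inside a sentence.

```python
from typing import Iterator, List

MARKER = "@SentBoundary@"  # same marker string as in the question


def split_tokens_on_marker(tokens: List[str], marker: str = MARKER) -> Iterator[List[str]]:
    """Yield runs of tokens separated by `marker`, dropping the marker itself."""
    current: List[str] = []
    for token in tokens:
        if token == marker:
            # A marker closes the current run. Empty runs are skipped, so
            # leading, trailing, or repeated markers yield no empty sentences.
            if current:
                yield current
            current = []
        else:
            current.append(token)
    # Emit whatever follows the last marker (or the whole input if there was none).
    if current:
        yield current


def get_sentences_plain(text: str) -> List[str]:
    """Whitespace-tokenize `text` (an assumption) and join each run into a string."""
    return [" ".join(run) for run in split_tokens_on_marker(text.split())]


if __name__ == "__main__":
    for example in (
        "Bob meets Alice. @SentBoundary@ They play together.",
        "Bob meets Alice. They play together.",
        "Bob meets Alice, @SentBoundary@ they play together.",
    ):
        print(get_sentences_plain(example))
```

Text without a marker comes back as a single sentence, which is the behaviour asked for in the question. With spaCy itself, the same loop would run over the Doc's tokens and yield spans instead of lists of strings.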

answered Oct 11 '22 21:10

Harsh Trivedi
