Extracting a part of a spaCy document as a new document

I have a rather long text parsed by spaCy into a Doc instance:

import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp(content)

doc here becomes a Doc class instance.

Now, since the text is huge, I would like to process, experiment with, and visualize only one part of the document in a Jupyter notebook, for instance the first 100 sentences.

How can I slice and create a new Doc instance from a part of the existing document?

asked Nov 30 '17 18:11 by alecxe


2 Answers

A rather ugly way to achieve your purpose is to construct a list of sentences and build a new document from a subset of sentences.

# Span.string was removed in spaCy v3; .text is the replacement
sentences = [sent.text.strip() for sent in doc.sents][:100]
minidoc = nlp(' '.join(sentences))

It feels like there should be a better solution, but I guess this at least works.

answered Oct 18 '22 14:10 by Uvar


There's a nicer solution using as_doc() on a Span object (https://spacy.io/api/span#as_doc):

import spacy

nlp = spacy.load('en_core_web_lg')
content = "This is my sentence. And here's another one."
doc = nlp(content)
for i, sent in enumerate(doc.sents):
    print(i, "a", sent, type(sent))
    doc_sent = sent.as_doc()
    print(i, "b", doc_sent, type(doc_sent))

Gives output:

0 a This is my sentence. <class 'spacy.tokens.span.Span'>   
0 b This is my sentence.  <class 'spacy.tokens.doc.Doc'>   
1 a And here's another one.  <class 'spacy.tokens.span.Span'>   
1 b And here's another one.  <class 'spacy.tokens.doc.Doc'>

(The code snippet is written out in full for clarity; it can of course be shortened.)
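
To get the first N sentences as a single new Doc, as the question asks, one option is to slice the original Doc from the start of the first sentence to the end of the N-th and call as_doc() on the resulting Span. The following is only a minimal sketch under some assumptions: it targets spaCy v3, it uses a blank English pipeline with the rule-based sentencizer in place of en_core_web_lg so that it runs without a model download, and first_sentences_as_doc is a hypothetical helper name, not part of spaCy.

```python
from itertools import islice

import spacy

# Assumption: spaCy v3. A blank pipeline plus the rule-based sentencizer
# stands in for en_core_web_lg; any pipeline that sets sentence
# boundaries should work the same way.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

content = "This is my sentence. And here's another one. A third one follows."
doc = nlp(content)


def first_sentences_as_doc(doc, n):
    """Return a new Doc covering the first n sentences of doc.

    Hypothetical helper for illustration only.
    """
    # islice stops after n sentences instead of materializing all of them.
    sents = list(islice(doc.sents, n))
    if not sents:
        raise ValueError("document has no sentences")
    # Token-level slice spanning the first through the last selected sentence.
    span = doc[sents[0].start:sents[-1].end]
    return span.as_doc()


minidoc = first_sentences_as_doc(doc, 2)
print(type(minidoc).__name__)
print(minidoc.text)
```

Unlike joining sentence strings and calling nlp() again, this does not re-run the pipeline, so the tokens and annotations in the new Doc are copied from the original rather than recomputed.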

answered Oct 18 '22 15:10 by Sofie VL