I have a rather long text parsed by spaCy into a Doc instance:
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(content)
doc here becomes a Doc class instance.
Now, since the text is huge, I would like to process, experiment and visualize in a Jupyter notebook using only one part of the document, for instance the first 100 sentences.
How can I slice the existing document and create a new Doc instance from that part?
The DocBin class lets you efficiently serialize the information from a collection of Doc objects. You can control which information is serialized by passing a list of attribute IDs, and optionally also specify whether the user data is serialized.
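As a rough illustration of the DocBin description above, here is a minimal sketch. It assumes spaCy v3, and it uses a blank English pipeline (spacy.blank) instead of en_core_web_lg so that it runs without downloading a model; the attribute list and the sample texts are placeholders chosen for the example.

```python
import spacy
from spacy.tokens import DocBin

# Assumption: a blank pipeline (tokenizer only) stands in for a full model.
nlp = spacy.blank("en")
texts = ["This is my sentence.", "And here's another one."]

# Serialize only the listed attributes, plus any user data on the docs.
doc_bin = DocBin(attrs=["ORTH", "LEMMA"], store_user_data=True)
for text in texts:
    doc_bin.add(nlp(text))

# Round-trip through bytes and rebuild the Doc objects against the same vocab.
data = doc_bin.to_bytes()
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
```

Token texts and whitespace are always stored, so the restored docs reproduce the original texts even when few attributes are listed.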
In the scheme used by spaCy, prepositions are referred to as “adpositions” and use the tag ADP. Words like “Friday” or “Obama” are tagged with PROPN, which stands for “proper noun” and is reserved for names of known individuals, places, time references, organizations, events and such.
orth is simply an integer ID for the token's verbatim text; spaCy keeps the mapping between that ID and the string in its vocabulary's string store.
A rather ugly way to achieve your purpose is to construct a list of sentences and build a new document from a subset of sentences.
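# Span.string was removed in spaCy v3; Span.text works in both v2 and v3.
sentences = [sent.text.strip() for sent in doc.sents][:100]
# Re-parse the joined text; note that this runs the whole pipeline again.
minidoc = nlp(' '.join(sentences))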
It feels like there should be a better solution, but I guess this at least works.
There's a nicer solution using as_doc() on a Span object (https://spacy.io/api/span#as_doc):
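import spacy

nlp = spacy.load('en_core_web_lg')
content = "This is my sentence. And here's another one."
doc = nlp(content)
for i, sent in enumerate(doc.sents):
    # Each sentence is a Span, i.e. a view into the parent Doc.
    print(i, "a", sent, type(sent))
    # as_doc() copies the Span's data into a new, standalone Doc.
    doc_sent = sent.as_doc()
    print(i, "b", doc_sent, type(doc_sent))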
Gives output:
0 a This is my sentence. <class 'spacy.tokens.span.Span'>
0 b This is my sentence. <class 'spacy.tokens.doc.Doc'>
1 a And here's another one. <class 'spacy.tokens.span.Span'>
1 b And here's another one. <class 'spacy.tokens.doc.Doc'>
(Code snippet written out in full for clarity; it can of course be shortened.)
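To get the first N sentences as a single Doc rather than one Doc per sentence, the same as_doc() idea can be applied to one larger Span. The following is only a sketch under some assumptions: it targets spaCy v3, first_sentences is a hypothetical helper name, and a blank pipeline with the rule-based sentencizer replaces en_core_web_lg so the example runs without a model download.

```python
from itertools import islice

import spacy

# Assumption: blank pipeline + sentencizer instead of en_core_web_lg.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is my sentence. And here's another one. A third one follows.")


def first_sentences(doc, n):
    """Return a new Doc covering the first n sentences of doc (hypothetical helper)."""
    sents = list(islice(doc.sents, n))  # stop iterating after n sentences
    if not sents:
        raise ValueError("need n >= 1 and a document with at least one sentence")
    end = sents[-1].end  # token index just past the last kept sentence
    return doc[:end].as_doc()  # slice to a Span, then copy it into a new Doc


minidoc = first_sentences(doc, 2)
```

Unlike re-parsing the joined sentence strings, this copies the existing tokens instead of running the pipeline again. With the original setup, first_sentences(doc, 100) would be the equivalent call.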