Extracting a part of a Spacy document as a new document

Tags: python, nlp, document, spacy

I have a rather long text parsed by Spacy into a Doc instance:

import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp(content)

doc here becomes a Doc class instance.

Now, since the text is huge, I would like to process, experiment with, and visualize only one part of the document in a Jupyter notebook - for instance, the first 100 sentences.

How can I slice and create a new Doc instance from a part of the existing document?


asked Nov 30 '17 18:11

alecxe

2 Answers

A rather ugly way to achieve your purpose is to construct a list of sentences and build a new document from a subset of sentences.

# Span.text replaces Span.string, which was removed in spaCy v3
sentences = [sent.text.strip() for sent in doc.sents][:100]
minidoc = nlp(' '.join(sentences))

It feels like there should be a better solution, but I guess this at least works.

answered Oct 18 '22 14:10

Uvar

There's a nicer solution using as_doc() on a Span object (https://spacy.io/api/span#as_doc):

import spacy

nlp = spacy.load('en_core_web_lg')
content = "This is my sentence. And here's another one."
doc = nlp(content)
for i, sent in enumerate(doc.sents):
    print(i, "a", sent, type(sent))
    doc_sent = sent.as_doc()
    print(i, "b", doc_sent, type(doc_sent))

Gives output:

0 a This is my sentence. <class 'spacy.tokens.span.Span'>
0 b This is my sentence. <class 'spacy.tokens.doc.Doc'>
1 a And here's another one. <class 'spacy.tokens.span.Span'>
1 b And here's another one. <class 'spacy.tokens.doc.Doc'>

(The code snippet is written out in full for clarity; it can of course be shortened.)
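
Applied to the original question (the first 100 sentences rather than one sentence at a time), the same idea might look like the sketch below. It assumes spaCy v3 and, so that it runs without downloading a model, swaps 'en_core_web_lg' for a blank English pipeline with the rule-based sentencizer. The helper name first_n_sentences is made up for illustration and is not part of spaCy.

```python
from itertools import islice

import spacy
from spacy.tokens import Doc

# Assumption: spaCy v3. A blank pipeline with the rule-based sentencizer
# stands in for 'en_core_web_lg' so the sketch runs without a model download.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")


def first_n_sentences(doc: Doc, n: int) -> Doc:
    """Hypothetical helper: copy the first n sentences of doc into a new Doc."""
    sents = list(islice(doc.sents, n))  # stops iterating after n sentences
    if not sents:
        raise ValueError("need n >= 1 and a document with at least one sentence")
    # Sentences are contiguous, so a single span covers all of them.
    return doc[sents[0].start:sents[-1].end].as_doc()


doc = nlp("This is my sentence. And here's another one. A third one follows.")
minidoc = first_n_sentences(doc, 2)  # use n=100 for the case in the question
print(type(minidoc).__name__, "|", minidoc.text)
```

Unlike joining the sentence texts and calling nlp() again, this should carry over the existing token annotations instead of re-running the whole pipeline on the excerpt.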

answered Oct 18 '22 15:10

Sofie VL
