I'm working with very large collections of short texts that I need to annotate and save to disk. Ideally I'd like to save/load them as spaCy Doc objects. Obviously I don't want to save the Language or Vocab objects more than once (but I'm happy to save/load them once for a collection of Docs).
The Doc object has a to_disk method and a to_bytes method, but it's not immediately obvious to me how to save a bunch of documents to the same file. Is there a preferred way of doing this? I'm looking for something as space-efficient as possible.
Currently I'm doing this, which I'm not very happy with:
import codecs

def serialize_docs(docs):
    """
    Writes spaCy Doc objects to a newline-delimited string that can be used to load
    them later, given the same Vocab object that was used to create them.
    """
    # Hex-encode each Doc's bytes so the results can be joined on newlines
    # (hex output never contains a newline character).
    return '\n'.join([codecs.encode(doc.to_bytes(), 'hex').decode('ascii') for doc in docs])

def write_docs(filename, docs):
    """
    Writes spaCy Doc objects to a file.
    """
    serialized_docs = serialize_docs(docs)
    with open(filename, 'w') as f:
        f.write(serialized_docs)
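For completeness, the matching loader for this scheme splits on newlines and hex-decodes each line back into bytes; the scheme works because hex encoding never emits a newline. The helper names below (`split_serialized`, `deserialize_docs`) are hypothetical, and rebuilding the Docs still requires the same Vocab that produced them:

```python
def split_serialized(serialized):
    """Split a newline-delimited hex string back into raw byte strings.
    Safe because hex encoding never produces newline characters."""
    return [bytes.fromhex(line) for line in serialized.split('\n')]

def deserialize_docs(serialized, vocab):
    """Rebuild Doc objects from the serialized string; requires the same
    Vocab object that was used when serializing."""
    from spacy.tokens import Doc  # imported lazily; only needed for this step
    return [Doc(vocab).from_bytes(b) for b in split_serialized(serialized)]
```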
As of spaCy 2.2, the correct answer is to use DocBin.
As the spaCy docs now say:
If you’re working with lots of data, you’ll probably need to pass analyses between machines, either to use something like Dask or Spark, or even just to save out work to disk. Often it’s sufficient to use the Doc.to_array functionality for this, and just serialize the numpy arrays – but other times you want a more general way to save and restore Doc objects.
The DocBin class makes it easy to serialize and deserialize a collection of Doc objects together, and is much more efficient than calling Doc.to_bytes on each individual Doc object. You can also control what data gets saved, and you can merge pallets together for easy map/reduce-style processing.
Example
import spacy
from spacy.tokens import DocBin

doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
texts = ["Some text", "Lots of texts...", "..."]
nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
bytes_data = doc_bin.to_bytes()

# Deserialize later, e.g. in a new process
nlp = spacy.blank("en")
doc_bin = DocBin().from_bytes(bytes_data)
docs = list(doc_bin.get_docs(nlp.vocab))
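Since bytes_data is an ordinary bytes object, getting it to and from disk is plain binary file I/O (newer spaCy versions also provide DocBin.to_disk/from_disk for this). A minimal sketch, with a hypothetical file path and helper names:

```python
def save_bytes(path, data):
    """Write serialized DocBin bytes (e.g. doc_bin.to_bytes()) to disk."""
    with open(path, 'wb') as f:
        f.write(data)

def load_bytes(path):
    """Read the bytes back, ready for DocBin().from_bytes(...)."""
    with open(path, 'rb') as f:
        return f.read()
```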