I'm working with very large collections of short texts that I need to annotate and save to disk. Ideally I'd like to save/load them as spaCy Doc objects. Obviously I don't want to save the Language or Vocab objects more than once (but I'm happy to save/load them once for a collection of Docs).
The Doc object has a to_disk method and a to_bytes method, but it's not immediately obvious to me how to save a bunch of documents to the same file. Is there a preferred way of doing this? I'm looking for something as space-efficient as possible.
Currently I'm doing this, which I'm not very happy with:
import codecs

def serialize_docs(docs):
    """
    Writes spaCy Doc objects to a newline-delimited string that can be used to load
    them later, given the same Vocab object that was used to create them.
    """
    # Hex-encode each Doc's bytes so the results can be joined on newlines
    # (hex output never contains a newline character).
    return '\n'.join([codecs.encode(doc.to_bytes(), 'hex').decode('ascii') for doc in docs])

def write_docs(filename, docs):
    """
    Writes spaCy Doc objects to a file.
    """
    serialized_docs = serialize_docs(docs)
    with open(filename, 'w') as f:
        f.write(serialized_docs)
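For completeness, the matching loader for this scheme splits on newlines and hex-decodes each line back into bytes; the scheme works because hex encoding never emits a newline. The helper names below (`split_serialized`, `deserialize_docs`) are hypothetical, and rebuilding the Docs still requires the same Vocab that produced them:

```python
def split_serialized(serialized):
    """Split a newline-delimited hex string back into raw byte strings.
    Safe because hex encoding never produces newline characters."""
    return [bytes.fromhex(line) for line in serialized.split('\n')]

def deserialize_docs(serialized, vocab):
    """Rebuild Doc objects from the serialized string; requires the same
    Vocab object that was used when serializing."""
    from spacy.tokens import Doc  # imported lazily; only needed for this step
    return [Doc(vocab).from_bytes(b) for b in split_serialized(serialized)]
```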
As of spaCy 2.2, the correct answer is to use DocBin.
As the spaCy docs now say:
If you’re working with lots of data, you’ll probably need to pass analyses between machines, either to use something like Dask or Spark, or even just to save out work to disk. Often it’s sufficient to use the Doc.to_array functionality for this, and just serialize the numpy arrays – but other times you want a more general way to save and restore Doc objects.
The DocBin class makes it easy to serialize and deserialize a collection of Doc objects together, and is much more efficient than calling Doc.to_bytes on each individual Doc object. You can also control what data gets saved, and you can merge pallets together for easy map/reduce-style processing.
Example
import spacy
from spacy.tokens import DocBin

doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
texts = ["Some text", "Lots of texts...", "..."]
nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
bytes_data = doc_bin.to_bytes()

# Deserialize later, e.g. in a new process
nlp = spacy.blank("en")
doc_bin = DocBin().from_bytes(bytes_data)
docs = list(doc_bin.get_docs(nlp.vocab))
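Since bytes_data is an ordinary bytes object, getting it to and from disk is plain binary file I/O (newer spaCy versions also provide DocBin.to_disk/from_disk for this). A minimal sketch, with a hypothetical file path and helper names:

```python
def save_bytes(path, data):
    """Write serialized DocBin bytes (e.g. doc_bin.to_bytes()) to disk."""
    with open(path, 'wb') as f:
        f.write(data)

def load_bytes(path):
    """Read the bytes back, ready for DocBin().from_bytes(...)."""
    with open(path, 'rb') as f:
        return f.read()
```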