Given a document string s and a language mask l of the same length, I would like to process each part (span?) of the document with the corresponding spaCy language model.
For example, say:
s = 'As one would say in German: Wie man auf englisch zu sagen pflegt'
l = ['en'] * 27 + ['de'] * 37
I would like to construct a document out of
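import spacy
# The 'de'/'en' shortcut names only resolve on older spaCy releases (v1/v2);
# on spaCy v3+ load a full package name such as 'de_core_news_sm' instead.
nlp_de = spacy.load('de')
nlp_en = spacy.load('en')
# Keep only the characters tagged with each language and parse them separately.
d_de = nlp_de("".join(c for c, lang in zip(s, l) if lang == "de"))
d_en = nlp_en("".join(c for c, lang in zip(s, l) if lang == "en"))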
And now I would somehow have to glue those two parts together. Unfortunately, a document in spaCy holds a reference to the vocabulary of the model that created it, so a single merged document would be ambiguous about which vocabulary it belongs to.
How should I model my multi-language documents with spaCy?
Two thoughts regarding this:
If most of your text looks like your example, I would try to separate the text by language (for your example, that yields two sentences) and process each part on its own.
If it is the other case, where languages are mixed within a sentence, I'm not sure spaCy has built-in support for code-switching. If it doesn't, you'll need to build your own models (or try to combine spaCy's existing ones), depending on your actual task.
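The first suggestion can be sketched roughly as follows. This is only a minimal illustration, not a spaCy feature: `split_by_language` and `process_spans` are hypothetical helper names, and `pipelines` is assumed to be a plain dict mapping each language code to a callable (for instance a loaded spaCy pipeline). The sketch cuts the text into contiguous runs of the same language tag and keeps the character offsets, so each result can be mapped back to the original string instead of being glued into one document.

```python
from itertools import groupby


def split_by_language(text, mask):
    """Yield (lang, start, end, chunk) for each run of identical language tags."""
    if len(text) != len(mask):
        raise ValueError("text and mask must have the same length")
    pos = 0
    for lang, run in groupby(mask):
        length = sum(1 for _ in run)
        yield lang, pos, pos + length, text[pos:pos + length]
        pos += length


def process_spans(text, mask, pipelines):
    """Run each contiguous span through the callable registered for its language.

    `pipelines` is an assumption of this sketch: a dict such as
    {'en': nlp_en, 'de': nlp_de}. Each result stays a separate object.
    """
    return [
        (lang, start, end, pipelines[lang](chunk))
        for lang, start, end, chunk in split_by_language(text, mask)
    ]


s = 'As one would say in German: Wie man auf englisch zu sagen pflegt'
l = ['en'] * 27 + ['de'] * 37
spans = list(split_by_language(s, l))
```

With real models you would pass something like `{'en': nlp_en, 'de': nlp_de}` as `pipelines`. Each span then yields its own Doc tied to its own vocabulary, and adding a span's `start` offset to a token's `token.idx` should give that token's position in the original string.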