I am trying to create a new language model (Luxembourgish) in spaCy, but I am confused about how to do this.
I followed the instructions on their website and did something similar to this post. But what I do not understand is how to add data like a vocab or word vectors, i.e. how to "fill" the language template.
I get that there are some dev tools for some of these operations, but their usage is poorly documented, so I do not understand how to install and use them properly, especially as they seem to be written for Python 2.7, which clashes with my spaCy installation, which uses Python 3.
As of now I have a corpus.txt (from a Wikipedia dump) on which I want to train, and a language template with the defaults like stop_words.py, tokenizer_exceptions.py etc. that I created and filled by hand.
Anyone ever done this properly and could help me here?
There are three main components of a "language model" in spaCy:

- the "static" language-specific data shipped in Python (tokenizer exceptions, stop words, rules for mapping fine-grained to coarse-grained part-of-speech tags)
- the statistical model trained to predict part-of-speech tags, dependencies and named entities (trained on a large labelled corpus and included as binary weights)
- optional word vectors that can be converted and added before or after training

You can also train your own vectors on your raw text using a library like Gensim and then add them to spaCy.
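For the vectors, here is a minimal sketch of training them on your raw corpus with Gensim and adding them to a blank pipeline. It assumes Gensim 3.x (where the dimension parameter is called size; in Gensim 4.x it is vector_size), that corpus.txt contains one whitespace-tokenizable sentence per line, and that your Luxembourgish language class is registered under "lb" – otherwise construct your custom Language subclass directly:

import spacy
from gensim.models import Word2Vec

# One whitespace-tokenized sentence per line (assumption about corpus.txt).
sentences = [line.split() for line in open("corpus.txt", encoding="utf8")]

# Train 300-dimensional vectors on the raw text.
w2v = Word2Vec(sentences, size=300, min_count=5)

nlp = spacy.blank("lb")  # assumes "lb" is available; swap in your own class if not
for word in w2v.wv.vocab:
    nlp.vocab.set_vector(word, w2v.wv[word])

nlp.to_disk("lb_vectors_model")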
spaCy v2.x allows you to train all pipeline components independently or in one go, so you can train the tagger, parser and entity recognizer on your data. All of this requires labelled data. If you're training a new language from scratch, you normally use an existing treebank. Here's an example using the Universal Dependencies corpus for Spanish (which is also the one that was used to train spaCy's Spanish model). You can then convert the data to spaCy's JSON format and use the spacy train command to train a model. For example:
git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora
mkdir ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-train.conllu ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-dev.conllu ancora-json
mkdir models
python -m spacy train es models ancora-json/es_ancora-ud-train.json ancora-json/es_ancora-ud-dev.json
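Once training finishes, the output directory can be loaded like any other model. As a quick sanity check (assuming the train command wrote its final model to models/model-final, which is spaCy v2's usual output name):

import spacy

# Load the freshly trained Spanish model and inspect its predictions.
nlp = spacy.load("models/model-final")
doc = nlp("El gato duerme en la cocina.")
for token in doc:
    print(token.text, token.pos_, token.dep_)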
I don't know what's in your corpus.txt and whether it's fully labelled or only raw text. (I also don't know of any existing resources for Luxembourgish – sounds like that's potentially quite hard to find!) If your data is labelled, you can convert it to spaCy's format using one of the built-in converters or your own little script. If your corpus consists of only raw text, you need to label it first and see if it's suitable to train a general language model. Ultimately, this comes down to experimenting – one way to start is to hand-label a small set of sentences and update a blank pipeline directly, as sketched below.
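Here's a minimal sketch of that approach using spaCy v2.x's training API. The example sentence and tags are made up for illustration, and spacy.blank("lb") again assumes your Luxembourgish language class is registered:

import random
import spacy
from spacy.symbols import POS

# Hand-labelled examples: text plus coarse part-of-speech tags (illustrative only).
TRAIN_DATA = [
    ("Ech si midd", {"tags": ["PRON", "AUX", "ADJ"]}),
]

nlp = spacy.blank("lb")  # or construct your custom Language subclass directly
tagger = nlp.create_pipe("tagger")
for _, annotations in TRAIN_DATA:
    for tag in annotations["tags"]:
        tagger.add_label(tag, {POS: tag})  # map each tag to a coarse POS value
nlp.add_pipe(tagger)

optimizer = nlp.begin_training()
for epoch in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)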
Remember that you always need evaluation data, too (also referred to as "development data" in the docs). This is usually a random portion of your labelled data that you hold back during training and use to determine whether your model is improving.
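For example, a common approach is to shuffle your labelled examples and hold back roughly 10% (the placeholder data here stands in for however you load your real examples):

import random

# Placeholder for your labelled examples; substitute your real data.
examples = [("sentence %d" % i, {"tags": []}) for i in range(100)]

random.shuffle(examples)
split = int(len(examples) * 0.9)  # hold ~10% back as development data
train_data, dev_data = examples[:split], examples[split:]
print(len(train_data), "training /", len(dev_data), "development examples")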