spaCy: create new language model with data from corpus

I am trying to create a new language model (Luxembourgish) in spaCy, but I am confused about how to do this.

I followed the instructions on their website and did something similar to this post. But what I do not understand is how to add data like a vocab or word vectors (i.e. how to "fill" the language template).

I get that there are some dev tools for some of these operations, but they are poorly documented, so I do not understand how to install and use them properly, especially as they seem to be written in Python 2.7, which clashes with my spaCy installation that uses Python 3.

For now I have a corpus.txt (from a Wikipedia dump) that I want to train on, and a language template with the defaults like stop_words.py, tokenizer_exceptions.py etc. that I created and filled by hand.

Has anyone ever done this properly and could help me here?

asked May 03 '18 by pjominet


1 Answer

There are three main components of a "language model" in spaCy: the "static" language-specific data shipped in Python (tokenizer exceptions, stop words, rules for mapping fine-grained to coarse-grained part-of-speech tags); the statistical model trained to predict part-of-speech tags, dependencies and named entities (trained on a large labelled corpus and included as binary weights); and optional word vectors, which can be converted and added before or after training. You can also train your own vectors on your raw text using a library like Gensim and then add them to spaCy.
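For the vectors part, a minimal sketch of the Gensim route could look like the following. It assumes spaCy v2.x and Gensim 3.x, that corpus.txt is your raw-text file, and that a Luxembourgish Language class is registered as "lb" (the blank pipeline here stands in for the template you built):

import spacy
from gensim.models import Word2Vec

# Train vectors on the raw corpus (one sentence per line; naive whitespace
# tokenisation for brevity; a real pipeline should tokenise properly first)
with open("corpus.txt", encoding="utf8") as f:
    sentences = [line.split() for line in f]
w2v = Word2Vec(sentences, size=300, min_count=5)  # size= in Gensim 3.x (vector_size= in 4.x)

# Add the vectors to a blank pipeline and save it to disk
nlp = spacy.blank("lb")
for word in w2v.wv.vocab:
    nlp.vocab.set_vector(word, w2v.wv[word])
nlp.to_disk("lb_model_with_vectors")

set_vector should grow the vectors table as needed; tokens without an entry simply keep a zero vector.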

spaCy v2.x allows you to train all pipeline components independently or all in one go, so you can train the tagger, parser and entity recognizer on your data. All of this requires labelled data. If you're training a model for a new language from scratch, you normally use an existing treebank. Here's an example using the Universal Dependencies corpus for Spanish (which is also the one that was used to train spaCy's Spanish model). You can then convert the data to spaCy's JSON format and use the spacy train command to train a model. For example:

git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora
mkdir ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-train.conllu ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-dev.conllu ancora-json
mkdir models
python -m spacy train es models ancora-json/es_ancora-ud-train.json ancora-json/es_ancora-ud-dev.json
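For reference, the .json files that spacy convert writes (and that spacy train reads) use spaCy v2's JSON training format. Here is a trimmed, hand-written illustration of its shape as a Python literal; the toy sentence and tags are made up, not real converter output:

# Trimmed illustration of spaCy v2's JSON training format (toy values).
# "head" is a relative offset to the syntactic head: 0 means the token
# is its own head (the root).
doc = {
    "id": 0,
    "paragraphs": [{
        "sentences": [{
            "tokens": [
                {"id": 0, "orth": "Moien", "tag": "INTJ", "head": 0, "dep": "ROOT"},
                {"id": 1, "orth": "!", "tag": "PUNCT", "head": -1, "dep": "punct"},
            ],
        }],
    }],
}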

I don't know what's in your corpus.txt and whether it's fully labelled or only raw text. (I also don't know of any existing resources for Luxembourgish – sounds like that's potentially quite hard to find!) If your data is labelled, you can convert it to spaCy's format using one of the built-in converters or your own little script. If your corpus consists of only raw text, you need to label it first and see if it's suitable to train a general language model. Ultimately, this comes down to experimenting – but here are some strategies:

  • Label your entire corpus manually for each component – e.g. part-of-speech tags if you want to train the tagger, dependency labels if you want to train the parser, and entity spans if you want to train the entity recognizer. You'll need a lot of data though – ideally, a corpus of a similar size to the Universal Dependencies ones.
  • Experiment with teaching an existing pre-trained model Luxembourgish – for example the German model. This might sound strange, but it's not an uncommon strategy. Instead of training from scratch, you post-train the existing model with examples of Luxembourgish (ideally until its predictions on your Luxembourgish text are good enough). You can also create more training data by running the German model over your Luxembourgish text and extracting and correcting its mistakes (see here for details). A minimal sketch of this post-training loop follows this list.
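
Here is a rough sketch of what that post-training idea can look like with the spaCy v2.x training API. The German model name, the single toy example and the iteration count are all illustrative; real post-training needs a lot of corrected examples:

import random
import spacy

nlp = spacy.load("de_core_news_sm")  # an existing German model

# Toy example: "I live in Luxembourg." with "Lëtzebuerg" (characters 14-24)
# labelled as a LOC entity. Real data would contain many such examples.
TRAIN_DATA = [
    ("Ech wunnen zu Lëtzebuerg.", {"entities": [(14, 24, "LOC")]}),
]

# Only update the entity recognizer; leave the other components untouched
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()  # keep the pre-trained weights
    for epoch in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.35, losses=losses)
        print(epoch, losses)

nlp.to_disk("de_post_trained_lb")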

Remember that you always need evaluation data, too (also referred to as "development data" in the docs). This is usually a random portion of your labelled data that you hold back during training and use to determine whether your model is improving.
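For illustration, holding back a development set can be as simple as shuffling and slicing your labelled examples (examples below is a placeholder for your own data):

import random

examples = [...]  # placeholder: your full list of labelled examples
random.seed(0)    # make the split reproducible
random.shuffle(examples)
cutoff = int(len(examples) * 0.9)  # e.g. a 90/10 train/dev split
train_data, dev_data = examples[:cutoff], examples[cutoff:]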

answered Oct 05 '22 by Ines Montani